A custom data class

STAVE works via a single class (an R6 object) that acts as the main data container. This class allows users to efficiently import, store, and manipulate genetic data via specialized member functions.

For example, a new object can be created and data read in like this:

# create new object
s <- STAVE_object$new()

# append data using a member function
s$append_data(studies_dataframe = df_studies,
              surveys_dataframe = df_surveys,
              counts_dataframe = df_counts)

Notice that the function is attached to the object, accessed via the $ symbol.

Using a custom class offers several key advantages. Once loaded, all data remain consolidated within a single object, avoiding fragmentation. The class structure also ensures the data are encapsulated, meaning they cannot be directly edited by the user. This built-in protection minimizes the risk of accidental data corruption.

STAVE performs close to 100 rigorous checks on the data during the import process. These checks cover a wide range of validations, from ensuring proper character formatting to verifying that prevalences do not exceed 100%. If the data pass all checks, they are successfully loaded into the object. If not, the import is rejected, and an informative error message is provided.

A core principle of STAVE is that it does not modify the input data during import. For instance, if column headers in your dataset contain capital letters when only lowercase is allowed, STAVE will not automatically convert them for you, even though it could! Instead, it will reject the import. This strict approach ensures that users are fully aware of the exact structure of their input data, meaning there is only a single format used universally by anyone using STAVE. The downside is that you must conform to this structure in order to use the package.

The remainder of this page specifies the formatting requirements for each of the three input tables.


The Studies table

This table captures information about the origin of the data. An example of a correctly formatted Studies table is given below:

| study_id | study_name | study_type | authors | publication_year | url |
|---|---|---|---|---|---|
| Dama_2017 | Reduced ex vivo susceptibility of Plasmodium falciparum after oral artemether-lumefantrine treatment in Mali | peer_reviewed | Dama et al | 2017 | https://doi.org/10.1186/s12936-017-1700-8 |
| Asua_2019 | Changing Molecular Markers of Antimalarial Drug Sensitivity across Uganda | peer_reviewed | Asua et al | 2019 | https://doi.org/10.1128/aac.01818-18 |

The only mandatory fields are study_id and url. Please ensure that URLs are accurate and permanent, as they serve as the sole external reference for verifying the origin of the data.

All other fields are optional, meaning cells can be left blank, although column headings must still be included. These optional fields have minimal formatting requirements and are primarily intended for storing descriptive information to help you quickly identify a study. The exception is study_type, which must adhere to a predefined set of options (see below).

Study IDs

Study IDs must be “valid identifiers”, meaning they must:

  • Contain only English letters (uppercase or lowercase), numbers (0-9), or underscores (_).
  • Not begin with a number or an underscore.

Beyond these restrictions, any naming convention can be used. However, it is recommended to adopt a systematic approach to avoid potential conflicts. For instance, using generic IDs like “study1” is not a good idea, as such IDs could overlap with those from other datasets, causing issues when combining data. A better approach is to use a concise, descriptive format, such as the first author’s surname and the year of publication, e.g., Bloggs_2024.
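The two rules above can be pre-checked with a simple regular expression. The helper below is a sketch for validating your own IDs before import; it is not STAVE's internal implementation:

```r
# TRUE if x is a valid identifier: starts with a letter, then
# contains only letters, digits, or underscores
is_valid_identifier <- function(x) {
  grepl("^[A-Za-z][A-Za-z0-9_]*$", x)
}

is_valid_identifier("Bloggs_2024")  # TRUE
is_valid_identifier("2024_Bloggs")  # FALSE (begins with a number)
is_valid_identifier("_study")       # FALSE (begins with an underscore)
```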

Data types

Each column has its own rules about data type:

| Column | Compulsory | Type |
|---|---|---|
| study_id | Y | Valid identifier (see above) |
| study_name | N | Character string |
| study_type | N | One of {'peer_reviewed', 'preprint', 'other', 'private'} |
| authors | N | Character string |
| publication_year | N | Positive integer |
| url | Y | Character string |
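Putting these rules together, a minimal Studies data frame might be constructed as follows before passing it to append_data(). This is a sketch with placeholder values (the study ID, authors, and URL are invented for illustration):

```r
# a minimal Studies table: optional fields may be left blank,
# but every column heading must be present
df_studies <- data.frame(
  study_id         = "Bloggs_2024",                       # valid identifier
  study_name       = "",                                  # optional
  study_type       = "peer_reviewed",                     # one of the allowed options
  authors          = "Bloggs et al",
  publication_year = 2024,
  url              = "https://doi.org/10.xxxx/example",   # placeholder URL
  stringsAsFactors = FALSE
)
```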

If any entries have a study_type of 'private', a warning message will be printed when the data are imported. This warning does not prevent the data from being loaded, but flags to the user that private data are present, in case this was not intentional.


The Surveys table

The Surveys table captures information about the context within which data were collected. We can think of a survey here as a single instance of data collection. An example of a correctly formatted Surveys table is given below:

| study_key | survey_id | country_name | site_name | latitude | longitude | spatial_notes | collection_start | collection_end | collection_day | time_notes |
|---|---|---|---|---|---|---|---|---|---|---|
| Dama_2017 | Bamako_2014 | Mali | Koulikoro | 12.612900 | -8.13560 | WWARN lat and long | 2014-01-01 | 2014-12-31 | 2014-07-02 | automated midpoint |
| Asua_2019 | Agago_2017 | Uganda | Agago | 2.984722 | 33.33055 | WWARN lat and long | 2017-01-01 | 2017-12-31 | 2017-07-02 | automated midpoint |
| Asua_2019 | Arua_2017 | Uganda | Arua | 3.030000 | 30.91000 | WWARN lat and long | 2017-01-01 | 2017-12-31 | 2017-07-02 | automated midpoint |
| Asua_2019 | Kole_2017 | Uganda | Kole | 2.428611 | 32.80111 | WWARN lat and long | 2017-01-01 | 2017-12-31 | 2017-07-02 | automated midpoint |
| Asua_2019 | Lamwo_2017 | Uganda | Lamwo | 3.533333 | 32.80000 | WWARN lat and long | 2017-01-01 | 2017-12-31 | 2017-07-02 | automated midpoint |
| Asua_2019 | Mubende_2017 | Uganda | Mubende | 0.557500 | 31.39500 | WWARN lat and long | 2017-01-01 | 2017-12-31 | 2017-07-02 | automated midpoint |

Notice that the study_key field links each survey back to a study_id in the Studies table. This table must include the fields latitude, longitude, and collection_day. In some cases, this information may not be directly available in the raw data. For example, locations may only be reported at a regional level, or collection periods might span an entire season. Nonetheless, STAVE strictly enforces the requirement that data be provided as a single point in space and time. There are two main reasons for this strict requirement:

  1. Support for spatial modeling methods: Many spatial methods, such as those modeling prevalence as a continuous surface in space and time, rely on precise point-level data. These methods struggle to accommodate areal data, such as prevalence reported only at the province level.
  2. Avoidance of ambiguity in location reporting: Using spatial coordinates eliminates the ambiguities associated with place names. For instance, “Côte d’Ivoire” could also appear as “Cote d’Ivoire” (without accents), “Republic of Côte d’Ivoire,” “Ivory Coast,” or many other variations. This issue is even more pronounced for site names, where interpretations may vary — for example, the name of a health facility versus the name of a nearby village. Even standardized identifiers like ISO 3166 country codes can pose challenges, as countries and their political boundaries may change over time. Latitude and longitude are inherently stable and precise, making them the most reliable method for identifying collection locations.

Note that the Surveys table includes fields for country_name, site_name, collection_start, and collection_end; however, these fields are provided solely for convenience. They allow users to quickly scan the table to see where and when data were collected, but ideally they should not be used for spatial analysis. A more robust approach is to overlay the spatial coordinates with a shapefile (e.g., from GADM) to determine the country or region of each survey, and then to use this as the country identifier in the analysis, rather than the value stored in country_name. This method enables country-level analysis while avoiding the errors and ambiguities associated with inconsistent or conflicting country names.

When exact locations or collection timings are unavailable, data imputation may be necessary. For example, the centroid of a region could be used to approximate the spatial location, or the midpoint of a collection range to estimate the timing. In these cases, the optional fields spatial_notes and time_notes should be used to document the methods and assumptions applied during data preparation.
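As a sketch of the midpoint approach, the collection_day values in the example table above can be reproduced from the start and end dates (the helper name is my own, not part of STAVE):

```r
# approximate a single collection day as the midpoint of a date range,
# returned as a YYYY-MM-DD character string
midpoint_day <- function(start, end) {
  format(mean(c(as.Date(start), as.Date(end))))
}

midpoint_day("2014-01-01", "2014-12-31")  # "2014-07-02"
midpoint_day("2017-01-01", "2017-12-31")  # "2017-07-02"
```

If you use this kind of imputation, record it in time_notes (e.g., "automated midpoint") so the assumption is visible to downstream users.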

Survey IDs

As with study IDs, survey IDs must be valid identifiers. The same survey ID can be reused across different studies (for example, two studies can both include a survey with the ID "south_district"), but if the same ID appeared twice within a single study, an error would be thrown. This ensures the integrity of the relational links while allowing some flexibility across studies.
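One way to pre-check this constraint is to look for duplicated (study, survey) pairs in your Surveys data. A sketch, using made-up example data:

```r
# a survey_id may repeat across studies, but not within one study
df <- data.frame(
  study_key = c("Asua_2019", "Asua_2019", "Dama_2017"),
  survey_id = c("south_district", "south_district", "south_district")
)

# TRUE marks a (study_key, survey_id) pair seen earlier in the table;
# only the repeat within Asua_2019 is flagged, not the reuse by Dama_2017
dup <- duplicated(df[, c("study_key", "survey_id")])
dup  # FALSE TRUE FALSE
```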

Collection times

Dates are always tricky, as there are many possible conventions. STAVE requires that dates be stored in YYYY-MM-DD format; for example, "2024-01-19" is a valid date. This avoids the confusion caused by regional date formats, such as MM/DD/YYYY (common in the US) vs. DD/MM/YYYY (common in Europe). It also has the added advantage that dates in this format sort chronologically when sorted as plain strings.

Dates should be represented as character strings. There is no need to convert them to a specific Date class using packages like lubridate.
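A simple pre-check for this format (a sketch, not STAVE's internal validation) combines a pattern match with a parse check:

```r
# TRUE if x is a character string in YYYY-MM-DD form that parses as a real date
is_valid_date <- function(x) {
  grepl("^\\d{4}-\\d{2}-\\d{2}$", x) & !is.na(as.Date(x, format = "%Y-%m-%d"))
}

is_valid_date("2024-01-19")  # TRUE
is_valid_date("19/01/2024")  # FALSE (wrong format)
is_valid_date("2024-02-30")  # FALSE (not a real date)

# YYYY-MM-DD strings sort chronologically with no date class needed
sort(c("2014-12-31", "2014-01-01"))  # "2014-01-01" "2014-12-31"
```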

Data types

Each column has its own rules about data type:

| Column | Compulsory | Type |
|---|---|---|
| study_key | Y | Valid identifier (see above) |
| survey_id | Y | Valid identifier (see above) |
| country_name | N | Character string |
| site_name | N | Character string |
| latitude | Y | Numeric, from -90 to +90 |
| longitude | Y | Numeric, from -180 to +180 |
| spatial_notes | N | Character string |
| collection_start | N | Valid date string (see above) |
| collection_end | N | Valid date string (see above) |
| collection_day | Y | Valid date string (see above) |
| time_notes | N | Character string |

The Counts table

The final table is the Counts table, which stores the genetic information. An example of a correctly formatted Counts table is given below:

| study_key | survey_key | variant_string | variant_num | total_num |
|---|---|---|---|---|
| Dama_2017 | Bamako_2014 | crt:76:T | 130 | 170 |
| Dama_2017 | Bamako_2014 | mdr1:86:Y | 46 | 158 |
| Asua_2019 | Agago_2017 | k13:469:Y | 42 | 42 |
| Asua_2019 | Agago_2017 | k13:675:V | 42 | 42 |
| Asua_2019 | Arua_2017 | k13:675:V | 43 | 43 |
| Asua_2019 | Kole_2017 | k13:469:Y | 47 | 47 |
| Asua_2019 | Kole_2017 | k13:675:V | 47 | 47 |
| Asua_2019 | Lamwo_2017 | k13:469:Y | 43 | 43 |
| Asua_2019 | Lamwo_2017 | k13:675:V | 43 | 43 |
| Asua_2019 | Mubende_2017 | k13:469:F | 45 | 45 |

All columns in this table are compulsory. The variant_string format is defined in the variantstring package; see that package's documentation for details. The variant_num gives the number of times this variant was observed, and the total_num gives the number of times this locus or combination of loci was successfully sequenced. You can think of variant_num as the numerator in a prevalence calculation, and total_num as the denominator.
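For example, the prevalence implied by the first row of the table above can be computed as follows (a sketch of the numerator/denominator relationship; STAVE stores counts, not prevalences):

```r
# crt:76:T in Bamako_2014: observed 130 times out of 170 successful sequences
variant_num <- 130
total_num   <- 170
prevalence  <- variant_num / total_num
round(prevalence, 3)  # 0.765
```

Storing counts rather than prevalences preserves the sample size, which downstream methods need in order to weight surveys appropriately.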

Data types

Each column has its own rules about data type:

| Column | Compulsory | Type |
|---|---|---|
| study_key | Y | Valid identifier (see above) |
| survey_key | Y | Valid identifier (see above) |
| variant_string | Y | Valid variant string |
| variant_num | Y | Positive integer or zero |
| total_num | Y | Positive integer |

The next page shows how to encode genetic data in the Counts table.