Input Formats
A custom data class
STAVE works via a single class (an R6 object) that acts as the main data container. This class allows users to efficiently import, store, and manipulate genetic data via specialized member functions.
For example, a new object can be created and data read in like this:
```r
# create new object
s <- STAVE_object$new()

# append data using a member function
s$append_data(studies_dataframe = df_studies,
              surveys_dataframe = df_surveys,
              counts_dataframe = df_counts)
```
Notice that the function is attached to the object, accessed via the $ symbol.
Using a custom class offers several key advantages. Once loaded, all data remain consolidated within a single object, avoiding fragmentation. The class structure also ensures the data are encapsulated, meaning they cannot be directly edited by the user. This built-in protection minimizes the risk of accidental data corruption.
STAVE performs close to 100 rigorous checks on the data during the import process. These checks cover a wide range of validations, from ensuring proper character formatting to verifying that prevalences do not exceed 100%. If the data pass all checks, they are successfully loaded into the object. If not, the import is rejected, and an informative error message is provided.
A core principle of STAVE is that it does not modify the input data during import. For instance, if column headers in your dataset contain capital letters when only lowercase is allowed, STAVE will not convert them for you, even though it could. Instead, it will reject the import. This strict approach ensures that users know the exact structure of their input data, and that a single format is shared by everyone using STAVE. The downside is that you must conform to this structure in order to use the package.
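Because STAVE leaves the input untouched, any clean-up has to happen in your own script before import. The following is a minimal sketch in base R, assuming a hypothetical raw table `df_raw` whose headers contain capital letters; the object names and the example URL are illustrative only and are not part of STAVE.

```r
# hypothetical raw table with capitalised headers (illustrative values only)
df_raw <- data.frame(Study_ID = "Bloggs_2024",
                     URL = "https://example.org/bloggs_2024")

# convert the headers to lowercase explicitly, so the change is
# recorded in your own code rather than applied silently at import
df_fixed <- df_raw
names(df_fixed) <- tolower(names(df_raw))

names(df_fixed)
```

Doing the conversion yourself keeps a visible record of every change made between the raw data and the imported table.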
The remainder of this page specifies the formatting requirements for each of the three input tables.
The Studies table
This table captures information about the origin of the data. An example of a correctly formatted Studies table is given below:
study_id | study_name | study_type | authors | publication_year | url |
---|---|---|---|---|---|
Dama_2017 | Reduced ex vivo susceptibility of Plasmodium falciparum after oral artemether-lumefantrine treatment in Mali | peer_reviewed | Dama et al | 2017 | https://doi.org/10.1186/s12936-017-1700-8 |
Asua_2019 | Changing Molecular Markers of Antimalarial Drug Sensitivity across Uganda | peer_reviewed | Asua et al | 2019 | https://doi.org/10.1128/aac.01818-18 |
The only mandatory fields are study_id and url. Please ensure that URLs are accurate and permanent, as they serve as the sole external reference for verifying the origin of the data.
All other fields are optional, meaning cells can be left blank, although column headings must still be included. These optional fields have minimal formatting requirements and are primarily intended for storing descriptive information to help you quickly identify a study. The exception is study_type, which must adhere to a predefined set of options (see below).
Study IDs
Study IDs must be “valid identifiers”, meaning they must:
- Contain only English letters (uppercase or lowercase), numbers (0-9), or underscores (_).
- Not begin with a number or an underscore.
Beyond these restrictions, any naming convention can be used. However, it is recommended to adopt a systematic approach to avoid potential conflicts. For instance, using generic IDs like “study1” is not a good idea, as such IDs could overlap with those from other datasets, causing issues when combining data. A better approach is to use a concise, descriptive format, such as the first author’s surname and the year of publication, e.g., Bloggs_2024.
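The two rules above can be checked before import with a regular expression. This is a sketch of the stated rules only; the helper name `is_valid_identifier` is hypothetical and this is not necessarily how STAVE performs the check internally.

```r
# TRUE when x starts with an English letter and otherwise contains
# only English letters, digits, or underscores
is_valid_identifier <- function(x) {
  grepl("^[A-Za-z][A-Za-z0-9_]*$", x, perl = TRUE)
}

# the first ID passes; the others start with a digit, start with an
# underscore, or contain a hyphen
is_valid_identifier(c("Bloggs_2024", "2024_Bloggs", "_Bloggs", "Bloggs-2024"))
```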
Data types
Each column has its own rules about data type:
Column | Compulsory | Type |
---|---|---|
study_id | Y | Valid identifier (see above) |
study_name | N | Character string |
study_type | N | One of {‘peer_reviewed’, ‘preprint’, ‘other’, ‘private’} |
authors | N | Character string |
publication_year | N | Positive integer |
url | Y | Character string |
If any entry has a study_type of private, a warning message is printed when the data are imported. This warning does not prevent the data from being loaded; it simply flags that private data are present, in case this was not intentional.
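As an illustration, the example Studies table can be built as a data frame and its study_type column pre-checked against the allowed options. This is a sketch in base R; storing publication_year as an integer and treating blank cells as NA are assumptions, not documented STAVE behaviour.

```r
# the example Studies table from above, with all six column headings present
df_studies <- data.frame(
  study_id = c("Dama_2017", "Asua_2019"),
  study_name = c(
    "Reduced ex vivo susceptibility of Plasmodium falciparum after oral artemether-lumefantrine treatment in Mali",
    "Changing Molecular Markers of Antimalarial Drug Sensitivity across Uganda"
  ),
  study_type = c("peer_reviewed", "peer_reviewed"),
  authors = c("Dama et al", "Asua et al"),
  publication_year = c(2017L, 2019L),  # integer storage is an assumption
  url = c("https://doi.org/10.1186/s12936-017-1700-8",
          "https://doi.org/10.1128/aac.01818-18")
)

# allowed options for study_type, as listed in the table above
allowed_types <- c("peer_reviewed", "preprint", "other", "private")

# entries outside the allowed set (blank cells, assumed here to be NA, are skipped)
bad_type <- !is.na(df_studies$study_type) &
  !(df_studies$study_type %in% allowed_types)

# number of private studies, which would trigger the warning on import
n_private <- sum(df_studies$study_type %in% "private")
```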
The Surveys table
The Surveys table captures information about the context within which data were collected. We can think of a survey here as a single instance of data collection. An example of a correctly formatted Surveys table is given below:
study_key | survey_id | country_name | site_name | latitude | longitude | spatial_notes | collection_start | collection_end | collection_day | time_notes |
---|---|---|---|---|---|---|---|---|---|---|
Dama_2017 | Bamako_2014 | Mali | Koulikoro | 12.612900 | -8.13560 | WWARN lat and long | 2014-01-01 | 2014-12-31 | 2014-07-02 | automated midpoint |
Asua_2019 | Agago_2017 | Uganda | Agago | 2.984722 | 33.33055 | WWARN lat and long | 2017-01-01 | 2017-12-31 | 2017-07-02 | automated midpoint |
Asua_2019 | Arua_2017 | Uganda | Arua | 3.030000 | 30.91000 | WWARN lat and long | 2017-01-01 | 2017-12-31 | 2017-07-02 | automated midpoint |
Asua_2019 | Kole_2017 | Uganda | Kole | 2.428611 | 32.80111 | WWARN lat and long | 2017-01-01 | 2017-12-31 | 2017-07-02 | automated midpoint |
Asua_2019 | Lamwo_2017 | Uganda | Lamwo | 3.533333 | 32.80000 | WWARN lat and long | 2017-01-01 | 2017-12-31 | 2017-07-02 | automated midpoint |
Asua_2019 | Mubende_2017 | Uganda | Mubende | 0.557500 | 31.39500 | WWARN lat and long | 2017-01-01 | 2017-12-31 | 2017-07-02 | automated midpoint |
Notice that the study_key links back to the Studies table. This table must include the fields latitude, longitude, and collection_day. In some cases, this information might not be directly available in the raw data. For example, locations may only be reported at a regional level, or collection periods might span an entire season. Nonetheless, STAVE strictly enforces the requirement that data must be provided as a single point in space and time. There are several reasons for this strict requirement:
- Support for spatial modeling methods: Many spatial methods, such as those modeling prevalence as a continuous surface in space and time, rely on precise point-level data. These methods struggle to accommodate areal data, such as prevalence reported only at the province level.
- Avoidance of ambiguity in location reporting: Using spatial coordinates eliminates the ambiguities associated with place names. For instance, “Côte d’Ivoire” could also appear as “Cote d’Ivoire” (without accents), “Republic of Côte d’Ivoire,” “Ivory Coast,” or many other variations. This issue is even more pronounced for site names, where interpretations may vary — for example, the name of a health facility versus the name of a nearby village. Even standardized identifiers like ISO 3166 country codes can pose challenges, as countries and their political boundaries may change over time. Latitude and longitude are inherently stable and precise, making them the most reliable method for identifying collection locations.
Note that the Surveys table includes fields for country_name, site_name, collection_start and collection_end. However, these fields are solely for convenience: they allow users to quickly scan the table to identify where and when data were collected. Ideally, they should not be used for spatial analysis. A more robust approach would be to overlay the spatial coordinates with a shapefile (e.g., from GADM) to determine the country or region of each survey, and then to use this as the country identifier in the analysis, rather than the value stored in country_name. This method enables country-level analysis while avoiding the risks of errors or ambiguities associated with inconsistent or conflicting country names.
When exact locations or collection timings are unavailable, data imputation may be necessary. For example, the centroid of a region could be used to approximate the spatial location, or the midpoint of a collection range to estimate the timing. In these cases, the optional fields spatial_notes and time_notes should be used to document the methods and assumptions applied during data preparation.
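For instance, the midpoint of a collection range can be computed with base R dates and then converted back to a character string. This is a sketch of one possible imputation rule (midpoint rounded down to a whole day), not a method prescribed by STAVE; the variable names are illustrative.

```r
# collection range taken from the first row of the example Surveys table
collection_start <- as.Date("2014-01-01")
collection_end   <- as.Date("2014-12-31")

# number of days in the range, then the midpoint rounded down to a whole day
n_days <- as.numeric(collection_end - collection_start)
collection_day <- format(collection_start + floor(n_days / 2), "%Y-%m-%d")

collection_day  # "2014-07-02"
```

With this rule, the result agrees with the collection_day shown in the example table. Whatever rule you use, record it in time_notes.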
Survey IDs
As with Study IDs, Survey IDs must be valid identifiers. The same survey ID can be reused across different studies. For example, two studies can both include a survey with the ID “south_district”. However, if the same ID appears twice within a single study, an error is thrown. This ensures the integrity of the relational links while allowing some flexibility across studies.
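The uniqueness rule amounts to requiring that each (study_key, survey_id) pair appears only once. Below is a sketch of a pre-import check in base R, using hypothetical study and survey IDs; it is not STAVE's own implementation.

```r
# hypothetical keys: "south_district" is reused across two different studies
ids_ok <- data.frame(
  study_key = c("Bloggs_2024", "Smith_2023", "Bloggs_2024"),
  survey_id = c("south_district", "south_district", "north_district")
)

# duplicated() on a data frame flags rows that repeat an earlier row
any(duplicated(ids_ok))

# adding a second "south_district" to the same study creates a repeated pair
ids_bad <- rbind(ids_ok,
                 data.frame(study_key = "Bloggs_2024",
                            survey_id = "south_district"))
duplicated(ids_bad)
```

Only the final row of `ids_bad` is flagged, because it repeats a pair that already exists within the same study.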
Collection times
Dates are always a tricky issue, as there are so many possible conventions. STAVE requires that dates be stored in YYYY-MM-DD format; for example, a valid date would be "2024-01-19". This avoids the confusion caused by regional date formats, such as MM/DD/YYYY (common in the US) versus DD/MM/YYYY (common in Europe). It also has the added advantage that dates in this format sort chronologically when sorted as plain text.
Dates should be represented as character strings. There is no need to convert to a specific Date class using packages like lubridate.
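A date string can be checked before import by testing both its shape and whether it parses to a real calendar date. This is a sketch in base R; the helper name `is_valid_date_string` is hypothetical, and STAVE's own checks may differ in detail.

```r
# TRUE when x has the shape YYYY-MM-DD and is a real calendar date
is_valid_date_string <- function(x) {
  has_shape <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)
  parsed <- as.Date(x, format = "%Y-%m-%d")  # NA when the date does not exist
  has_shape & !is.na(parsed)
}

# a valid date, a regional format, and a date that does not exist
is_valid_date_string(c("2024-01-19", "19/01/2024", "2024-02-30"))

# character strings in this format sort in chronological order
sort(c("2024-01-19", "2023-12-01"))
```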
Data types
Each column has its own rules about data type:
Column | Compulsory | Type |
---|---|---|
study_key | Y | Valid identifier (see above) |
survey_id | Y | Valid identifier (see above) |
country_name | N | Character string |
site_name | N | Character string |
latitude | Y | Numeric, from -90 to +90 |
longitude | Y | Numeric, from -180 to +180 |
spatial_notes | N | Character string |
collection_start | N | Valid date string (see above) |
collection_end | N | Valid date string (see above) |
collection_day | Y | Valid date string (see above) |
time_notes | N | Character string |
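Since study_key links each survey back to the Studies table, it can be useful to confirm before import that every key has a matching study_id. This is a sketch in base R using the IDs from the example tables; the variable names are illustrative.

```r
# study IDs from the example Studies table
study_ids <- c("Dama_2017", "Asua_2019")

# study_key values from the first three rows of the example Surveys table
survey_study_keys <- c("Dama_2017", "Asua_2019", "Asua_2019")

# keys with no matching study_id; an empty result means every link resolves
orphan_keys <- setdiff(survey_study_keys, study_ids)
length(orphan_keys) == 0
```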
The Counts table
The final table is the Counts table, which stores the genetic information. An example of a correctly formatted Counts table is given below:
study_key | survey_key | variant_string | variant_num | total_num |
---|---|---|---|---|
Dama_2017 | Bamako_2014 | crt:76:T | 130 | 170 |
Dama_2017 | Bamako_2014 | mdr1:86:Y | 46 | 158 |
Asua_2019 | Agago_2017 | k13:469:Y | 42 | 42 |
Asua_2019 | Agago_2017 | k13:675:V | 42 | 42 |
Asua_2019 | Arua_2017 | k13:675:V | 43 | 43 |
Asua_2019 | Kole_2017 | k13:469:Y | 47 | 47 |
Asua_2019 | Kole_2017 | k13:675:V | 47 | 47 |
Asua_2019 | Lamwo_2017 | k13:469:Y | 43 | 43 |
Asua_2019 | Lamwo_2017 | k13:675:V | 43 | 43 |
Asua_2019 | Mubende_2017 | k13:469:F | 45 | 45 |
All columns in this table are compulsory. The variant_string format is defined in the variantstring package; see that package's documentation for details. The variant_num column gives the number of times this variant was observed, and total_num gives the number of times this locus or combination of loci was successfully sequenced. You can think of variant_num as the numerator in a prevalence calculation, and total_num as the denominator.
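The numerator and denominator roles can be made concrete with the first two rows of the example table. This sketch in base R computes the prevalence separately rather than adding a column to the Counts table; the variable names are illustrative.

```r
# first two rows of the example Counts table
df_counts <- data.frame(
  study_key = c("Dama_2017", "Dama_2017"),
  survey_key = c("Bamako_2014", "Bamako_2014"),
  variant_string = c("crt:76:T", "mdr1:86:Y"),
  variant_num = c(130L, 46L),
  total_num = c(170L, 158L)
)

# a numerator larger than its denominator would imply a prevalence above 100%
counts_ok <- all(df_counts$variant_num <= df_counts$total_num)

# prevalence as a percentage: 130/170 and 46/158
prevalence <- 100 * df_counts$variant_num / df_counts$total_num
round(prevalence, 1)  # 76.5 29.1
```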
Data types
Each column has its own rules about data type:
Column | Compulsory | Type |
---|---|---|
study_key | Y | Valid identifier (see above) |
survey_key | Y | Valid identifier (see above) |
variant_string | Y | Valid variant string |
variant_num | Y | Positive integer or zero |
total_num | Y | Positive integer |
The next page shows how to encode genetic data in the Counts table.