Input Formats • STAVE

A custom data class

STAVE works via a single class (an R6 object) that acts as the main data container. This class allows users to efficiently import, store, and manipulate genetic data via specialized member functions.

For example, a new object can be created and data read in like this:

# create new object
s <- STAVE_object$new()

# append data using a member function
s$append_data(studies_dataframe = df_studies,
              surveys_dataframe = df_surveys,
              counts_dataframe = df_counts)

Notice that the function is attached to the object, accessed via the $ symbol.

Using a custom class offers several key advantages. Once loaded, all data remain consolidated within a single object, avoiding fragmentation. The class structure also ensures the data are encapsulated, meaning they cannot be directly edited by the user. This built-in protection minimizes the risk of accidental data corruption.

STAVE performs close to 100 rigorous checks on the data during the import process. These checks cover a wide range of validations, from ensuring proper character formatting to verifying that prevalences do not exceed 100%. If the data pass all checks, they are successfully loaded into the object. If not, the import is rejected, and an informative error message is provided.

A core principle of STAVE is that it does not modify the input data during import. For instance, if column headers in your dataset contain capital letters when only lowercase is allowed, STAVE will not automatically convert them for you, even though it could. Instead, it will reject the import. This strict approach ensures that users are fully aware of the exact structure of their input data, and it means that a single format is shared by everyone who uses STAVE. The downside is that you must conform to this structure in order to use the package.
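Because headers are not corrected for you, one option is to inspect and normalise them in your own script before import. The following is a minimal base R sketch of that idea; the data frame `df_raw` and its contents are hypothetical, it shows only a fragment of a Studies table, and it does not call any STAVE function.

```r
# Hypothetical raw table fragment whose headers contain capital letters
df_raw <- data.frame(Study_ID = "Bloggs_2024",
                     URL = "https://example.org/bloggs2024",
                     stringsAsFactors = FALSE)

# Identify headers that are not entirely lowercase
bad_headers <- names(df_raw)[names(df_raw) != tolower(names(df_raw))]
print(bad_headers)

# Make the conversion an explicit, visible step in your own code
df_studies <- df_raw
names(df_studies) <- tolower(names(df_studies))
print(names(df_studies))
```

Doing the conversion yourself keeps the change visible in your data preparation code rather than hidden inside the import step.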

The remainder of this page specifies the formatting requirements for each of the three input tables.

The Studies table

This table captures information about the origin of the data. An example of a correctly formatted Studies table is given below:

study_id	study_name	study_type	authors	publication_year	url
Dama_2017	Reduced ex vivo susceptibility of Plasmodium falciparum after oral artemether-lumefantrine treatment in Mali	peer_reviewed	Dama et al	2017	https://doi.org/10.1186/s12936-017-1700-8
Asua_2019	Changing Molecular Markers of Antimalarial Drug Sensitivity across Uganda	peer_reviewed	Asua et al	2019	https://doi.org/10.1128/aac.01818-18

The only mandatory fields are study_id and url. Please ensure that URLs are accurate and permanent, as they serve as the sole external reference for verifying the origin of the data.

All other fields are optional, meaning cells can be left blank, although column headings must still be included. These optional fields have minimal formatting requirements and are primarily intended for storing descriptive information to help you quickly identify a study. The exception is study_type, which must adhere to a predefined set of options (see below).

Study IDs

Study IDs must be “valid identifiers”, meaning they must:

Contain only English letters (uppercase or lowercase), numbers (0-9), or underscores (_).
Not begin with a number or an underscore.

Beyond these restrictions, any naming convention can be used. However, it is recommended to adopt a systematic approach to avoid potential conflicts. For instance, using generic IDs like “study1” is not a good idea, as such IDs could overlap with those from other datasets, causing issues when combining data. A better approach is to use a concise, descriptive format, such as the first author’s surname and the year of publication, e.g., Bloggs_2024.
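The identifier rules above can be sketched as a regular expression. This is an illustration in base R, not the check STAVE itself runs; the helper name `is_valid_identifier` is hypothetical.

```r
# Letters, digits and underscores only; the first character must be a letter
is_valid_identifier <- function(x) {
  grepl("^[A-Za-z][A-Za-z0-9_]*$", x, perl = TRUE)
}

ids <- c("Bloggs_2024", "2024_Bloggs", "_Bloggs", "Bloggs-2024", "study1")
print(data.frame(id = ids, valid = is_valid_identifier(ids)))
```

Note that "study1" passes this test: it is a valid identifier, just not a recommended one.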

Data types

Each column has its own rules about data type:

Column	Compulsory	Type
study_id	Y	Valid identifier (see above)
study_name	N	Character string
study_type	N	One of {‘peer_reviewed’, ‘preprint’, ‘other’, ‘private’}
authors	N	Character string
publication_year	N	Positive integer
url	Y	Character string

If any entry has a study_type of private, a warning message is printed when the data are imported. This warning does not prevent the data from being loaded; it simply flags that private data are present, in case this was not intentional.
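As a rough illustration, the example Studies table can be built as an R data frame and screened against the rules in the table above before import. The checks are a sketch in base R and are not STAVE's own validation code; the variable names are hypothetical.

```r
# Studies table built from the example rows shown earlier
df_studies <- data.frame(
  study_id = c("Dama_2017", "Asua_2019"),
  study_name = c(
    "Reduced ex vivo susceptibility of Plasmodium falciparum after oral artemether-lumefantrine treatment in Mali",
    "Changing Molecular Markers of Antimalarial Drug Sensitivity across Uganda"
  ),
  study_type = c("peer_reviewed", "peer_reviewed"),
  authors = c("Dama et al", "Asua et al"),
  publication_year = c(2017L, 2019L),
  url = c("https://doi.org/10.1186/s12936-017-1700-8",
          "https://doi.org/10.1128/aac.01818-18"),
  stringsAsFactors = FALSE
)

# study_type must come from the predefined set of options
allowed_types <- c("peer_reviewed", "preprint", "other", "private")
type_ok <- df_studies$study_type %in% allowed_types

# Mandatory fields must be present and non-empty
mandatory_ok <- !is.na(df_studies$study_id) & nzchar(df_studies$study_id) &
  !is.na(df_studies$url) & nzchar(df_studies$url)

# Flag private studies, which would trigger a warning on import
has_private <- any(df_studies$study_type == "private")
if (has_private) message("Private studies present")
```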

The Surveys table

The Surveys table captures information about the context within which data were collected. We can think of a survey here as a single instance of data collection. An example of a correctly formatted Surveys table is given below:

study_key	survey_id	country_name	site_name	latitude	longitude	spatial_notes	collection_start	collection_end	collection_day	time_notes
Dama_2017	Bamako_2014	Mali	Koulikoro	12.612900	-8.13560	WWARN lat and long	2014-01-01	2014-12-31	2014-07-02	automated midpoint
Asua_2019	Agago_2017	Uganda	Agago	2.984722	33.33055	WWARN lat and long	2017-01-01	2017-12-31	2017-07-02	automated midpoint
Asua_2019	Arua_2017	Uganda	Arua	3.030000	30.91000	WWARN lat and long	2017-01-01	2017-12-31	2017-07-02	automated midpoint
Asua_2019	Kole_2017	Uganda	Kole	2.428611	32.80111	WWARN lat and long	2017-01-01	2017-12-31	2017-07-02	automated midpoint
Asua_2019	Lamwo_2017	Uganda	Lamwo	3.533333	32.80000	WWARN lat and long	2017-01-01	2017-12-31	2017-07-02	automated midpoint
Asua_2019	Mubende_2017	Uganda	Mubende	0.557500	31.39500	WWARN lat and long	2017-01-01	2017-12-31	2017-07-02	automated midpoint

Notice that the study_key links back to the Studies table. This table must include the fields latitude, longitude, and collection_day. In some cases, this information might not be directly available in the raw data. For example, locations may only be reported at a regional level, or collection periods might span an entire season. Nonetheless, STAVE strictly enforces the requirement that data must be provided as a single point in space and time. There are several reasons for this strict requirement:

Support for spatial modeling methods: Many spatial methods, such as those modeling prevalence as a continuous surface in space and time, rely on precise point-level data. These methods struggle to accommodate areal data, such as prevalence reported only at the province level.
Avoidance of ambiguity in location reporting: Using spatial coordinates eliminates the ambiguities associated with place names. For instance, “Côte d’Ivoire” could also appear as “Cote d’Ivoire” (without accents), “Republic of Côte d’Ivoire,” “Ivory Coast,” or many other variations. This issue is even more pronounced for site names, where interpretations may vary, for example between the name of a health facility and the name of a nearby village. Even standardized identifiers like ISO 3166 country codes can pose challenges, as countries and their political boundaries may change over time. Latitude and longitude are inherently stable and precise, making them the most reliable method for identifying collection locations.

Note that the Surveys table includes fields for country_name, site_name, collection_start and collection_end. However, these fields are solely for convenience: they allow users to quickly scan the table to identify where and when data were collected. Ideally, they should not be used for spatial analysis. A more robust approach would be to overlay the spatial coordinates with a shapefile (e.g., from GADM) to determine the country or region of each survey, and then to use this as the country identifier in the analysis, rather than the value stored in country_name. This method enables country-level analysis while avoiding the risks of errors or ambiguities associated with inconsistent or conflicting country names.

When exact locations or collection timings are unavailable, data imputation may be necessary. For example, the centroid of a region could be used to approximate the spatial location, or the midpoint of a collection range to estimate the timing. In these cases, the optional fields spatial_notes and time_notes should be used to document the methods and assumptions applied during data preparation.
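As an illustration of the midpoint approach, the sketch below takes a collection range and computes its midpoint in base R, rounding down to a whole day. The rounding rule is an assumption made for this example. The result matches the collection_day shown for the Bamako_2014 survey in the example table.

```r
# Collection range taken from the Bamako_2014 row of the example table
collection_start <- as.Date("2014-01-01")
collection_end <- as.Date("2014-12-31")

# Midpoint: start date plus half the span in whole days (rounded down)
span_days <- as.integer(collection_end - collection_start)
collection_day <- format(collection_start + span_days %/% 2, "%Y-%m-%d")
print(collection_day)

# Record the assumption alongside the imputed value
time_notes <- "automated midpoint"
```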

Survey IDs

As with Study IDs, Survey IDs must be valid identifiers. The same survey ID can be reused across different studies; for example, two studies can both include a survey with the ID “south_district”. However, if this ID appears twice within the same study, an error is thrown. This ensures the integrity of the relational links while allowing some flexibility across studies.
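One way to look for repeated survey IDs within a study before import is to test for duplicated pairs of study_key and survey_id. The sketch below uses base R with hypothetical IDs, and is not STAVE's own check.

```r
# Hypothetical surveys: "south_district" is reused across studies (allowed)
# and repeated within study_B (not allowed)
df_surveys <- data.frame(
  study_key = c("study_A", "study_B", "study_B"),
  survey_id = c("south_district", "south_district", "south_district"),
  stringsAsFactors = FALSE
)

# duplicated() on the pair of columns flags repeats within the same study only
is_dup <- duplicated(df_surveys[, c("study_key", "survey_id")])
print(df_surveys[is_dup, ])
```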

Collection times

Dates are always a tricky issue because there are so many possible conventions. STAVE requires dates to be stored in YYYY-MM-DD format; for example, "2024-01-19" is a valid date. This avoids the confusion caused by regional formats, such as MM/DD/YYYY (common in the US) versus DD/MM/YYYY (common in Europe). It also has the added advantage that dates in this format sort chronologically when sorted as plain text.

Dates should be represented as character strings. There is no need to convert to a specific Date class using packages like lubridate.
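A minimal sketch of how date strings could be screened before import is shown below. It uses base R only; the helper name `is_valid_date_string` is hypothetical and the logic is an approximation of the stated format rule, not STAVE's own implementation.

```r
is_valid_date_string <- function(x) {
  # Must be character, match the YYYY-MM-DD pattern exactly,
  # and correspond to a real calendar date
  is.character(x) &
    grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x) &
    !is.na(as.Date(x, format = "%Y-%m-%d"))
}

dates <- c("2024-01-19", "19/01/2024", "2024-1-19", "2024-02-30")
print(data.frame(date = dates, valid = is_valid_date_string(dates)))

# Strings in this format sort chronologically without conversion
sorted_dates <- sort(c("2024-01-19", "2023-12-31", "2024-01-02"))
print(sorted_dates)
```

The pattern check is needed in addition to as.Date() because the latter tolerates missing leading zeros, such as "2024-1-19".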

Data types

Each column has its own rules about data type:

Column	Compulsory	Type
study_key	Y	Valid identifier (see above)
survey_id	Y	Valid identifier (see above)
country_name	N	Character string
site_name	N	Character string
latitude	Y	Numeric, from -90 to +90
longitude	Y	Numeric, from -180 to +180
spatial_notes	N	Character string
collection_start	N	Valid date string (see above)
collection_end	N	Valid date string (see above)
collection_day	Y	Valid date string (see above)
time_notes	N	Character string
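As a sketch, one row of the example Surveys table can be built in R and checked for the expected columns, basic column types, and a study_key that links back to a known study ID. The checks and variable names below are illustrative assumptions in base R, not STAVE's own validation.

```r
# First row of the example Surveys table
df_surveys <- data.frame(
  study_key = "Dama_2017",
  survey_id = "Bamako_2014",
  country_name = "Mali",
  site_name = "Koulikoro",
  latitude = 12.6129,
  longitude = -8.1356,
  spatial_notes = "WWARN lat and long",
  collection_start = "2014-01-01",
  collection_end = "2014-12-31",
  collection_day = "2014-07-02",
  time_notes = "automated midpoint",
  stringsAsFactors = FALSE
)

# All column headings must be present, even for optional fields
expected_cols <- c("study_key", "survey_id", "country_name", "site_name",
                   "latitude", "longitude", "spatial_notes",
                   "collection_start", "collection_end", "collection_day",
                   "time_notes")
missing_cols <- setdiff(expected_cols, names(df_surveys))

# Coordinates are numeric; dates are kept as character strings
coords_numeric <- is.numeric(df_surveys$latitude) && is.numeric(df_surveys$longitude)
dates_character <- is.character(df_surveys$collection_day)

# Every study_key should match a study_id in the Studies table
study_ids <- c("Dama_2017", "Asua_2019")
unlinked <- setdiff(df_surveys$study_key, study_ids)
```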

The Counts table

The final table is the Counts table, which stores the genetic information. An example of a correctly formatted Counts table is given below:

study_key	survey_key	variant_string	variant_num	total_num
Dama_2017	Bamako_2014	crt:76:T	130	170
Dama_2017	Bamako_2014	mdr1:86:Y	46	158
Asua_2019	Agago_2017	k13:469:Y	42	42
Asua_2019	Agago_2017	k13:675:V	42	42
Asua_2019	Arua_2017	k13:675:V	43	43
Asua_2019	Kole_2017	k13:469:Y	47	47
Asua_2019	Kole_2017	k13:675:V	47	47
Asua_2019	Lamwo_2017	k13:469:Y	43	43
Asua_2019	Lamwo_2017	k13:675:V	43	43
Asua_2019	Mubende_2017	k13:469:F	45	45

All columns in this table are compulsory. The variant_string format is defined in the variantstring package; see that package's documentation for details. The variant_num gives the number of times this variant was observed, and the total_num gives the number of times this locus or combination of loci was successfully sequenced. You can think of variant_num as the numerator in a prevalence calculation, and total_num as the denominator.
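To make the numerator and denominator roles concrete, the sketch below computes prevalence for the first two rows of the example Counts table in base R. The column name `prevalence_pct` is introduced here for illustration only and is not part of the STAVE format.

```r
# First two rows of the example Counts table
df_counts <- data.frame(
  study_key = c("Dama_2017", "Dama_2017"),
  survey_key = c("Bamako_2014", "Bamako_2014"),
  variant_string = c("crt:76:T", "mdr1:86:Y"),
  variant_num = c(130L, 46L),
  total_num = c(170L, 158L),
  stringsAsFactors = FALSE
)

# Prevalence (%) = 100 * numerator / denominator
df_counts$prevalence_pct <- 100 * df_counts$variant_num / df_counts$total_num
print(df_counts[, c("variant_string", "prevalence_pct")])
```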

Data types

Each column has its own rules about data type:

Column	Compulsory	Type
study_key	Y	Valid identifier (see above)
survey_key	Y	Valid identifier (see above)
variant_string	Y	Valid variant string
variant_num	Y	Positive integer or zero
total_num	Y	Positive integer
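The numeric rules for the two count columns, together with the requirement that prevalence cannot exceed 100%, can be sketched as follows. This is a base R illustration with hypothetical values, including two deliberately invalid rows; the helper `is_whole` is not a STAVE function.

```r
# Hypothetical counts; rows 3 and 4 are deliberately invalid
df_counts <- data.frame(
  variant_num = c(130, 0, 50, -1),
  total_num = c(170, 42, 40, 10)
)

# TRUE where a value is a non-missing whole number
is_whole <- function(x) is.numeric(x) & !is.na(x) & x == floor(x)

# variant_num: whole number >= 0; total_num: whole number > 0
variant_ok <- is_whole(df_counts$variant_num) & df_counts$variant_num >= 0
total_ok <- is_whole(df_counts$total_num) & df_counts$total_num > 0

# The numerator cannot exceed the denominator
within_total <- df_counts$variant_num <= df_counts$total_num

df_counts$passes <- variant_ok & total_ok & within_total
print(df_counts)
```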

The next page shows how to encode genetic data in the Counts table.