
A custom data class

As mentioned in the previous page, STAVE adopts a relational database structure that connects three tables together via linked IDs. This is implemented through a single class (an R6 object) that acts as the main data container. The class allows users to efficiently import, store, and manipulate genetic data via specialized member functions.

For example, a new object can be created and data read in like this:

# create new object
s <- STAVE_object$new()

# append data using a member function
s$append_data(studies_dataframe = df_studies,
              surveys_dataframe = df_surveys,
              counts_dataframe = df_counts)

Using a custom class offers several key advantages. Once loaded, all data remain consolidated within a single object, avoiding fragmentation across separate objects. Additionally, the class structure ensures the data are “encapsulated”, meaning they cannot be directly edited by the user. This built-in protection minimizes the risk of accidental data corruption.

STAVE also performs nearly 100 rigorous checks on the data during the import process. These checks cover a wide range of validations, from ensuring proper character formatting to verifying that prevalence values do not exceed 100%. If the data pass all checks, they are successfully loaded into the object. If not, the import is rejected, and an informative error message is provided.

A core principle of STAVE is that it does not modify the input data during import. For instance, if column headers in your dataset contain capital letters when only lowercase is allowed, STAVE will not automatically convert them for you, even though it could. This strict approach shifts the responsibility for data formatting to you, the user, ensuring that you are fully aware of the exact structure of your input data.
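Because this formatting is left to you, a short clean-up step before import can help. The following is a minimal sketch in base R, not part of STAVE itself; the data frame `df_raw` and its capitalised headers are hypothetical and exist only for illustration.

```r
# Hypothetical raw table whose column headers contain capital letters
df_raw <- data.frame(Study_ID = "Bloggs_2024",
                     URL = "https://doi.org/10.1093%2Fgenetics%2F16.2.97")

# Convert the headers to lowercase explicitly, before attempting an import
df_studies <- df_raw
names(df_studies) <- tolower(names(df_studies))

names(df_studies)
```

Making the conversion an explicit line in your own script keeps a record of exactly how the raw data were changed.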

The remainder of this page lists the formatting requirements for each of the three tables.

The Studies table

This table captures information about the origin of the data. An example of a correctly formatted Studies table is given below:

study_id     study_name    study_type     authors      publication_year  url
Bloggs_2024  first study   peer_reviewed  Bloggs_etal  2024              https://doi.org/10.1093%2Fgenetics%2F16.2.97
Globbs_2020  second study  peer_reviewed  Globbs_etal  2020              https://doi.org/10.1093%2Fgenetics%2F16.2.97

The only mandatory fields are study_id and url, as these provide the minimum information required to link the tables and identify the data source. It is crucial to ensure that URLs are accurate and permanent, as they serve as the sole external reference for verifying the origin of the data.

All other fields are optional, meaning cells can be left blank, though column headings should still be included. These optional fields have minimal formatting requirements and are primarily intended for storing descriptive information to help you quickly identify a study. The exception is study_type, which must adhere to a predefined set of options (see below).
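As an illustration, the example table above could be built as a data frame in base R along the following lines. This is only a sketch: the `study_type` values use the `peer_reviewed` spelling from the list of allowed options, and storing `publication_year` as a plain number is an assumption rather than a documented requirement.

```r
# Studies table with all six column headings present
df_studies <- data.frame(
  study_id = c("Bloggs_2024", "Globbs_2020"),        # compulsory
  study_name = c("first study", "second study"),     # optional
  study_type = c("peer_reviewed", "peer_reviewed"),  # optional, restricted values
  authors = c("Bloggs_etal", "Globbs_etal"),         # optional
  publication_year = c(2024, 2020),                  # optional
  url = c("https://doi.org/10.1093%2Fgenetics%2F16.2.97",
          "https://doi.org/10.1093%2Fgenetics%2F16.2.97")  # compulsory
)

str(df_studies)
```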

Study IDs

Study IDs must be valid identifiers, which means they must:

  • Contain only English letters (uppercase or lowercase), numbers (0-9), or underscores (_).
  • Not begin with a number or an underscore.

Beyond these restrictions, any naming convention can be used. However, it is recommended to adopt a systematic approach to avoid potential conflicts in the future. For instance, using generic IDs like “study1” is not advisable, as such IDs could overlap with those from other datasets, causing issues when combining data. A better approach is to use a concise, descriptive format, such as the first author’s surname and the year of publication, e.g., Bloggs_2024.
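The two rules for valid identifiers can be expressed as a single regular expression. The helper below is a hypothetical function written for this page, not a STAVE function, and STAVE's own check may be implemented differently.

```r
# TRUE for strings made only of English letters, digits and underscores
# that begin with a letter (so not with a digit or an underscore)
is_valid_identifier <- function(x) {
  is.character(x) & !is.na(x) & grepl("^[A-Za-z][A-Za-z0-9_]*$", x, perl = TRUE)
}

is_valid_identifier(c("Bloggs_2024", "2024_Bloggs", "_Bloggs", "Bloggs-2024"))
# TRUE FALSE FALSE FALSE
```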

Data types

Each column has its own rules about data type:

Column            Compulsory  Type
study_id          Y           Valid identifier (see above)
study_name        N           Character string
study_type        N           One of {peer_reviewed, preprint, other, private}
authors           N           Character string
publication_year  N           Positive integer
url               Y           Character string

Note that if any entry has a study_type of private, a warning message is printed when the data are imported. This does not prevent the data from loading, but alerts you in case this was not intentional.
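Rules of this kind can be checked by hand before import. The sketch below uses base R and made-up example vectors; treating blank optional cells as `NA` is an assumption, and the checks are illustrative rather than a copy of what STAVE runs internally.

```r
allowed_types <- c("peer_reviewed", "preprint", "other", "private")

# Hypothetical column values, including one bad entry in each vector
study_type <- c("peer_reviewed", "private", NA, "Preprint")
publication_year <- c(2024, 2020, NA, 2019.5)

# Optional fields: a blank (NA) cell passes, anything else must obey the rule
type_ok <- is.na(study_type) | study_type %in% allowed_types
year_ok <- is.na(publication_year) |
  (publication_year > 0 & publication_year == round(publication_year))

# Entries of the kind that trigger the warning about private studies
is_private <- !is.na(study_type) & study_type == "private"
```

Here the fourth entry fails both checks: "Preprint" is capitalised, and 2019.5 is not a whole number.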

The Surveys table

This table captures information about the context within which data were collected. An example of a correctly formatted Surveys table is given below:

study_key    survey_id  country_name  site_name     latitude  longitude  spatial_notes  collection_start  collection_end  collection_day  time_notes
Bloggs_2024  site_01    Gambia        example site  0         0          example data   2020-01-01        2020-01-01      2020-01-01      example data

This table must include latitude, longitude, and collection day, as these are essential for the data to be tied to a specific point in space and time. However, there may be instances where this information is not directly available from the raw data. For example, the location might only be reported at a regional level, or the collection period might span an entire season. Despite this, STAVE enforces the requirement that data be reported at a single point in space and time, as this is a fundamental prerequisite for certain types of spatial analysis.

When exact locations or timings are unavailable, imputation may be necessary. For instance, you might use the centroid of a region for the location or the midpoint of a collection range for the timing. In such cases, the optional fields (e.g. spatial_notes, time_notes) in the table provide a space to document how the raw data were manipulated, ensuring transparency and reproducibility in the data preparation process.
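As a sketch of the midpoint approach, the following base R code imputes a collection day from a hypothetical collection range and records what was done. The dates and the wording of the note are invented for illustration.

```r
# Hypothetical collection range reported by a study
collection_start <- "2020-01-01"
collection_end <- "2020-03-31"

start_date <- as.Date(collection_start, format = "%Y-%m-%d")
end_date <- as.Date(collection_end, format = "%Y-%m-%d")

# Midpoint of the range, rounded down to a whole day
n_days <- as.numeric(difftime(end_date, start_date, units = "days"))
midpoint <- start_date + floor(n_days / 2)

# Convert back to a character string and document the imputation
collection_day <- format(midpoint, "%Y-%m-%d")
time_notes <- "collection_day imputed as midpoint of collection range"

collection_day
# "2020-02-15"
```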

Survey IDs

Similar to study IDs, survey IDs must also be valid identifiers. Additionally, they must satisfy relational linking in both directions: each study_key in the Surveys table must correspond to an existing study_id in the Studies table, and each survey_id must be referenced in the survey_key column of the Counts table.
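Both directions of linking can be checked with set operations before import. The data frames below are hypothetical and reduced to the linking columns only; the real import may match on more than these columns, so treat this as a rough pre-check rather than a reproduction of STAVE's validation.

```r
# Hypothetical tables, reduced to the columns used for linking
df_studies <- data.frame(study_id = c("Bloggs_2024", "Globbs_2020"))
df_surveys <- data.frame(study_key = c("Bloggs_2024", "Smith_2019"),
                         survey_id = c("site_01", "site_02"))
df_counts <- data.frame(survey_key = "site_01")

# Surveys that point to a study missing from the Studies table
orphan_study_keys <- setdiff(df_surveys$study_key, df_studies$study_id)

# Surveys that are never referenced from the Counts table
unreferenced_surveys <- setdiff(df_surveys$survey_id, df_counts$survey_key)

orphan_study_keys     # "Smith_2019"
unreferenced_surveys  # "site_02"
```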

Survey IDs can be reused across different studies, but they must remain unique within the same study. For example, two studies can both include a survey with the ID “south_district” without any issues, as long as this ID is not duplicated within a single study. This ensures the integrity of the relational links while allowing flexibility across studies.
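Uniqueness within a study amounts to requiring that each combination of study_key and survey_id appears only once. A minimal base R sketch, using a hypothetical Surveys table reduced to these two columns:

```r
# "south_district" is reused across studies (fine) and repeated within
# Globbs_2020 (not allowed)
df_surveys <- data.frame(
  study_key = c("Bloggs_2024", "Globbs_2020", "Globbs_2020"),
  survey_id = c("south_district", "south_district", "south_district")
)

# Flag rows that repeat an earlier study_key / survey_id combination
is_duplicate <- duplicated(df_surveys[, c("study_key", "survey_id")])

is_duplicate
# FALSE FALSE TRUE
```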

Collection times

Storing dates is always a thorny issue because so many different conventions exist. STAVE requires dates to be stored in YYYY-MM-DD format; for example, 2024-01-19 is a valid date. This avoids the confusion caused by regional date formats, such as MM/DD/YYYY (common in the US) versus DD/MM/YYYY (common in Europe). It also has the added advantage that dates sort chronologically when sorted as plain text.

Dates should be represented as character strings that follow this convention. There is no need to convert them to the Date class or to use packages like lubridate; in fact, Date objects will be rejected.
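The sketch below shows one way to check date strings in base R before import, and how to turn an existing Date object back into a character string. The helper `is_valid_date_string` is hypothetical and written for this page; STAVE's own check may differ in its details.

```r
# TRUE only for character strings in YYYY-MM-DD form that are real dates
is_valid_date_string <- function(x) {
  if (!is.character(x)) {
    return(rep(FALSE, length(x)))  # e.g. Date objects are not accepted
  }
  pattern_ok <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)
  parsed <- as.Date(x, format = "%Y-%m-%d")  # NA when the date cannot be parsed
  pattern_ok & !is.na(parsed)
}

is_valid_date_string(c("2024-01-19", "19/01/2024"))
# TRUE FALSE

# A Date object fails the check, but can be converted to a valid string
d <- as.Date("2024-01-19")
is_valid_date_string(d)
# FALSE
format(d, "%Y-%m-%d")
# "2024-01-19"
```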

Data types

Each column has its own rules about data type:

Column            Compulsory  Type
study_key         Y           Valid identifier (see above)
survey_id         Y           Valid identifier (see above)
country_name      N           Character string
site_name         N           Character string
latitude          Y           Numeric, from -90 to +90
longitude         Y           Numeric, from -180 to +180
spatial_notes     N           Character string
collection_start  N           Valid date string (see above)
collection_end    N           Valid date string (see above). Must be
collection_day    Y           Valid date string (see above)
time_notes        N           Character string
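A quick pre-import check that every compulsory Surveys column is present and filled in might look like the sketch below. The example row is hypothetical, with longitude deliberately left blank, and representing a blank cell as `NA` is an assumption.

```r
compulsory <- c("study_key", "survey_id", "latitude", "longitude", "collection_day")

# Hypothetical single-row Surveys table with a missing longitude
df_surveys <- data.frame(
  study_key = "Bloggs_2024",
  survey_id = "site_01",
  country_name = "Gambia",
  site_name = "example site",
  latitude = 0,
  longitude = NA_real_,
  spatial_notes = NA_character_,
  collection_start = NA_character_,
  collection_end = NA_character_,
  collection_day = "2020-01-01",
  time_notes = NA_character_
)

# Compulsory columns that are absent altogether
missing_cols <- setdiff(compulsory, names(df_surveys))

# Compulsory columns that are present but contain blank cells
present <- intersect(compulsory, names(df_surveys))
has_blank <- vapply(df_surveys[present], anyNA, logical(1))

missing_cols                 # character(0)
names(has_blank)[has_blank]  # "longitude"
```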

The next page goes into further detail on the long string format used to encode genetic variants.