Input Data Formats
A custom data class
As mentioned on the previous page, STAVE adopts a relational database structure that connects three tables together via linked IDs. This is implemented through a single class (an R6 object) that acts as the main data container. The class allows users to efficiently import, store, and manipulate genetic data via specialized member functions.
For example, a new object can be created and data read in like this:
# create new object
s <- STAVE_object$new()
# append data using a member function
s$append_data(studies_dataframe = df_studies,
              surveys_dataframe = df_surveys,
              counts_dataframe = df_counts)
Using a custom class offers several key advantages. Once loaded, all data remain consolidated within a single object, avoiding fragmentation across separate objects. Additionally, the class structure ensures the data are “encapsulated”, meaning they cannot be directly edited by the user. This built-in protection minimizes the risk of accidental data corruption.
STAVE also performs nearly 100 rigorous checks on the data during the import process. These checks cover a wide range of validations, from ensuring proper character formatting to verifying that prevalence values do not exceed 100%. If the data pass all checks, they are successfully loaded into the object. If not, the import is rejected, and an informative error message is provided.
A core principle of STAVE is that it does not modify the input data during import. For instance, if column headers in your dataset contain capital letters when only lowercase is allowed, STAVE will not automatically convert them for you — even though it could! This strict approach shifts the responsibility of data formatting to the user, ensuring they (you) are fully aware of the exact structure of their input data.
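Because formatting is left to you, a short clean-up step in your own script is one way to handle this before import. The following is a minimal base R sketch, not part of STAVE; the data frame `df_raw` and its headers are hypothetical and used only for illustration.

```r
# Hypothetical raw table whose headers contain capital letters
df_raw <- data.frame(
  Study_ID = "Bloggs_2024",
  URL = "https://doi.org/10.1093%2Fgenetics%2F16.2.97",
  stringsAsFactors = FALSE
)

# List the headers that are not entirely lowercase
bad_headers <- names(df_raw)[names(df_raw) != tolower(names(df_raw))]

# Fix them explicitly in a copy, so the change is visible in your own code
df_studies_clean <- df_raw
names(df_studies_clean) <- tolower(names(df_studies_clean))
```

Doing the conversion in your own script keeps a record of exactly what was changed relative to the raw data.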
The remainder of this page lists the formatting requirements for each of the three tables.
The Studies table
This table captures information about the origin of the data. An example of a correctly formatted Studies table is given below:
study_id | study_name | study_type | authors | publication_year | url |
---|---|---|---|---|---|
Bloggs_2024 | first study | peer_reviewed | Bloggs_etal | 2024 | https://doi.org/10.1093%2Fgenetics%2F16.2.97 |
Globbs_2020 | second study | peer_reviewed | Globbs_etal | 2020 | https://doi.org/10.1093%2Fgenetics%2F16.2.97 |
The only mandatory fields are study_id and url, as these provide the minimum information required to link the tables and identify the data source. It is crucial to ensure that URLs are accurate and permanent, as they serve as the sole external reference for verifying the origin of the data.
All other fields are optional, meaning cells can be left blank, though column headings should still be included. These optional fields have minimal formatting requirements and are primarily intended for storing descriptive information to help you quickly identify a study. The exception is study_type, which must adhere to a predefined set of options (see below).
Study IDs
Study IDs must be valid identifiers, which means they must:
- Contain only English letters (uppercase or lowercase), numbers (0-9), or underscores (_).
- Not begin with a number or an underscore.
Beyond these restrictions, any naming convention can be used. However, it is recommended to adopt a systematic approach to avoid potential conflicts in the future. For instance, using generic IDs like “study1” is not advisable, as such IDs could overlap with those from other datasets, causing issues when combining data. A better approach is to use a concise, descriptive format, such as the first author’s surname and the year of publication, e.g., Bloggs_2024.
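The two identifier rules above can be expressed as a single regular expression, which is handy for checking IDs before attempting an import. This is a sketch only: `is_valid_identifier()` is a hypothetical helper written for this page, not a STAVE function, and STAVE's own check may differ in detail.

```r
# Hypothetical helper (not part of STAVE): TRUE when a string starts with a
# letter and contains only English letters, digits and underscores
is_valid_identifier <- function(x) {
  grepl("^[A-Za-z][A-Za-z0-9_]*$", x)
}

is_valid_identifier(c("Bloggs_2024", "2024_Bloggs", "_Bloggs", "Bloggs-2024"))
# TRUE FALSE FALSE FALSE
```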
Data types
Each column has its own rules about data type:
Column | Compulsory | Type |
---|---|---|
study_id | Y | Valid identifier (see above) |
study_name | N | Character string |
study_type | N | One of {peer_reviewed, preprint, other, private} |
authors | N | Character string |
publication_year | N | Positive integer |
url | Y | Character string |
Note that if any entry has a study_type of private, a warning message is printed when the data are imported. This does not prevent the data from loading, but alerts the user in case the setting was not intentional.
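As an illustration of these rules, the sketch below builds a Studies table in base R and runs two simple pre-checks. The object names are hypothetical, and treating blank optional cells as `NA` is an assumption made for this example rather than something confirmed by STAVE.

```r
# Hypothetical Studies table following the column rules above
df_studies_example <- data.frame(
  study_id = c("Bloggs_2024", "Globbs_2020"),
  study_name = c("first study", "second study"),
  study_type = c("peer_reviewed", "preprint"),
  authors = c("Bloggs_etal", "Globbs_etal"),
  publication_year = c(2024L, 2020L),
  url = c("https://doi.org/10.1093%2Fgenetics%2F16.2.97",
          "https://doi.org/10.1093%2Fgenetics%2F16.2.97"),
  stringsAsFactors = FALSE
)

# study_type is optional, but non-blank values must come from the allowed set
allowed_types <- c("peer_reviewed", "preprint", "other", "private")
type_ok <- is.na(df_studies_example$study_type) |
  df_studies_example$study_type %in% allowed_types

# The two compulsory columns must be filled in on every row
mandatory_ok <- !is.na(df_studies_example$study_id) &
  !is.na(df_studies_example$url)
```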
The Surveys table
This table captures information about the context within which data were collected. An example of a correctly formatted Surveys table is given below:
study_key | survey_id | country_name | site_name | latitude | longitude | spatial_notes | collection_start | collection_end | collection_day | time_notes |
---|---|---|---|---|---|---|---|---|---|---|
Bloggs_2024 | site_01 | Gambia | example site | 0 | 0 | example data | 2020-01-01 | 2020-01-01 | 2020-01-01 | example data |
This table must include latitude, longitude, and collection_day, as these are essential for the data to be tied to a specific point in space and time. However, there may be instances where this information is not directly available from the raw data. For example, the location might only be reported at a regional level, or the collection period might span an entire season. Despite this, STAVE enforces the requirement that data be reported at a single point in space and time, as this is a fundamental prerequisite for certain types of spatial analysis.
When exact locations or timings are unavailable, imputation may be necessary. For instance, you might use the centroid of a region for the location or the midpoint of a collection range for the timing. In such cases, the optional fields (e.g. spatial_notes, time_notes) in the table provide a space to document how the raw data were manipulated, ensuring transparency and reproducibility in the data preparation process.
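A simple version of this imputation can be done in base R. The sketch below is illustrative only: the dates and bounding box are made-up values, and averaging the corners of a bounding box is a crude stand-in for a true regional centroid.

```r
# Hypothetical collection window reported by a study
collection_start <- as.Date("2020-01-01")
collection_end <- as.Date("2020-03-31")

# Midpoint of the window, stored as a YYYY-MM-DD character string
n_days <- as.integer(collection_end - collection_start)
collection_day <- format(collection_start + n_days %/% 2, "%Y-%m-%d")
collection_day
# "2020-02-15"

# Hypothetical bounding box of the reported region (made-up coordinates)
bbox <- list(lat = c(13.0, 13.8), lon = c(-16.8, -13.8))
latitude <- mean(bbox$lat)
longitude <- mean(bbox$lon)

# Record what was done so the imputation is transparent
time_notes <- "collection_day imputed as midpoint of collection_start and collection_end"
spatial_notes <- "coordinates imputed as centre of regional bounding box"
```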
Survey IDs
Similar to study IDs, survey IDs must also be valid identifiers. Additionally, they must satisfy relational linking in both directions: each study_key in the Surveys table must correspond to an existing study_id in the Studies table, and each survey_id must be referenced in the survey_key column of the Counts table.
Survey IDs can be reused across different studies, but they must remain unique within the same study. For example, two studies can both include a survey with the ID “south_district” without any issues, as long as this ID is not duplicated within a single study. This ensures the integrity of the relational links while allowing flexibility across studies.
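The linking and uniqueness rules above can be checked in a few lines of base R before import. This is a sketch under stated assumptions: the three small data frames are hypothetical, and only the columns needed for the checks are shown (the other Counts columns are omitted).

```r
# Hypothetical tables, reduced to the linking columns
studies <- data.frame(study_id = c("Bloggs_2024", "Globbs_2020"),
                      stringsAsFactors = FALSE)
surveys <- data.frame(
  study_key = c("Bloggs_2024", "Bloggs_2024", "Globbs_2020"),
  survey_id = c("south_district", "north_district", "south_district"),
  stringsAsFactors = FALSE
)
counts <- data.frame(survey_key = c("south_district", "north_district"),
                     stringsAsFactors = FALSE)

# 1. Every study_key must match a study_id in the Studies table
orphan_study_keys <- setdiff(surveys$study_key, studies$study_id)

# 2. survey_id may repeat across studies, but not within one study
dup_within_study <- duplicated(surveys[, c("study_key", "survey_id")])

# 3. Every survey_id must be referenced in the Counts survey_key column
unreferenced_surveys <- setdiff(surveys$survey_id, counts$survey_key)
```

Here "south_district" appears under two different studies, which is allowed, so all three checks come back empty or `FALSE`.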
Collection times
Storing dates is always a thorny issue as there are so many different conventions. STAVE requires that dates be stored in YYYY-MM-DD format; for example, a valid date would be 2024-01-19. This avoids confusion caused by regional date formats, for example MM/DD/YYYY common in the US vs. DD/MM/YYYY common in Europe. It also has the added advantage that dates sort chronologically when sorted as plain text.
Dates should be represented as character strings that follow this convention. There is no need to convert to Date class using packages like lubridate, and in fact Date objects will be rejected.
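To check date columns before import, something along the following lines may help. It is a base R sketch, not STAVE's own validation: `is_valid_date_string()` is a hypothetical helper that accepts only character strings in YYYY-MM-DD form that correspond to a real calendar date.

```r
# Hypothetical helper (not part of STAVE)
is_valid_date_string <- function(x) {
  is.character(x) &                                  # must be text, not Date class
    grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x) &       # must look like YYYY-MM-DD
    !is.na(as.Date(x, format = "%Y-%m-%d"))          # must be a real calendar date
}

is_valid_date_string(c("2024-01-19", "19/01/2024", "2024-02-30"))
# TRUE FALSE FALSE

# If dates are already held as Date objects, convert them back to strings
date_string <- format(as.Date("2024-01-19"), "%Y-%m-%d")
```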
Data types
Each column has its own rules about data type:
Column | Compulsory | Type |
---|---|---|
study_key | Y | Valid identifier (see above) |
survey_id | Y | Valid identifier (see above) |
country_name | N | Character string |
site_name | N | Character string |
latitude | Y | Numeric, from -90 to +90 |
longitude | Y | Numeric, from -180 to +180 |
spatial_notes | N | Character string |
collection_start | N | Valid date string (see above) |
collection_end | N | Valid date string (see above). Must be |
collection_day | Y | Valid date string (see above) |
time_notes | N | Character string |
The next page goes into further detail on the long string format used to encode genetic variants.