STAVE data object (R6 class)
STAVE_object.Rd
The main class that stores the data and is responsible for all data input, output, and processing functions. Most of the functionality of the STAVE package is accessed through this class in the form of member functions.
Details
The raw data are stored as private variables within this object, meaning they
cannot (or should not) be edited directly. Rather, tables can be extracted
using get_counts(), and similarly for the other tables. The three tables are:
studies: Information on where the data came from, for example a url and author names. Each study is indexed with a unique study_id.
surveys: Information on the surveys represented within a study. A survey is defined here as a discrete instance of data collection, which includes information on geography (latitude and longitude) and collection time. Surveys are given survey_ids and are linked to a particular study through the study_id.
counts: The actual genetic information, which is linked to a particular survey through the survey_id. Genetic variants are encoded in character strings that must follow a specified format. The number of times each variant was observed (variant_num) and the total sample size (total_num) are stored in separate columns.
This combination of linked tables allows efficient and flexible encoding of variants, while avoiding unnecessary duplication of information.
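As a minimal sketch of how the three tables link together, the base R snippet below builds toy data.frames using the column names listed under append_data(). Every value is invented for illustration, and the study_type value and the date format are placeholders, since the allowed values are not described in this section.

```r
# Toy tables; every value here is made up for illustration.
studies <- data.frame(
  study_id = "study_01",
  study_name = "Example study",
  study_type = "example_type",   # placeholder; allowed values are an assumption
  authors = "A. Author",
  publication_year = 2020,
  url = "https://example.org/study_01"
)

surveys <- data.frame(
  study_key = "study_01",                       # links to studies$study_id
  survey_id = c("survey_01", "survey_02"),
  country_name = "Exampleland",
  site_name = c("Site A", "Site B"),
  latitude = c(1.5, 2.5),
  longitude = c(30.0, 31.0),
  spatial_notes = NA_character_,
  collection_start = c("2019-01-01", "2019-06-01"),  # date format is an assumption
  collection_end = c("2019-03-01", "2019-08-01"),
  time_notes = NA_character_
)

counts <- data.frame(
  survey_key = c("survey_01", "survey_02"),     # links to surveys$survey_id
  variant_string = "crt:72:C",
  variant_num = c(40, 15),
  total_num = c(50, 20)
)

# Each key must resolve to an id in the parent table.
surveys_linked <- all(surveys$study_key %in% studies$study_id)
counts_linked <- all(counts$survey_key %in% surveys$survey_id)

# Study and survey details are stored once and joined on demand.
counts_with_site <- merge(counts, surveys, by.x = "survey_key", by.y = "survey_id")
```

Because site and study details live in their own tables, adding further variants for a survey only adds rows to counts.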
Methods
Method get_version()
Extract the version number of the STAVE object. This is important because the member functions of a STAVE object are bound to the object itself, and will not be updated by updating the version of the package in your environment. To update a STAVE object to a new package version, you should first extract the data and then load them into a new STAVE object created with the most recent version.
Method append_data()
Append new data
Arguments
studies_dataframe
a data.frame containing information at the study level. This data.frame must have the following columns: study_id, study_name, study_type, authors, publication_year, url
surveys_dataframe
a data.frame containing information at the survey level. This data.frame must have the following columns: study_key, survey_id, country_name, site_name, latitude, longitude, spatial_notes, collection_start, collection_end, time_notes. The study_key element must correspond to a study_id in the studies_dataframe.
counts_dataframe
a data.frame of genetic information. Must contain the following columns: survey_key, variant_string, variant_num, total_num. The survey_key element must correspond to a valid survey_id in the surveys_dataframe.
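The column and key requirements above can be checked before calling append_data(). The following is a rough sketch in base R, not the package's own validation: check_stave_inputs() is a hypothetical helper, and the commented-out constructor call at the end assumes the usual R6 convention of `$new()` on a class named STAVE_object.

```r
# Required columns, as listed in the argument descriptions above.
required_cols <- list(
  studies = c("study_id", "study_name", "study_type", "authors",
              "publication_year", "url"),
  surveys = c("study_key", "survey_id", "country_name", "site_name",
              "latitude", "longitude", "spatial_notes",
              "collection_start", "collection_end", "time_notes"),
  counts = c("survey_key", "variant_string", "variant_num", "total_num")
)

# Hypothetical helper: returns a character vector of problems (empty if none).
check_stave_inputs <- function(studies_dataframe, surveys_dataframe, counts_dataframe) {
  inputs <- list(studies = studies_dataframe,
                 surveys = surveys_dataframe,
                 counts = counts_dataframe)
  problems <- character(0)
  for (nm in names(inputs)) {
    missing_cols <- setdiff(required_cols[[nm]], names(inputs[[nm]]))
    if (length(missing_cols) > 0) {
      problems <- c(problems, sprintf("%s: missing column(s) %s", nm,
                                      paste(missing_cols, collapse = ", ")))
    }
  }
  if (!all(surveys_dataframe$study_key %in% studies_dataframe$study_id)) {
    problems <- c(problems, "surveys: study_key without a matching study_id")
  }
  if (!all(counts_dataframe$survey_key %in% surveys_dataframe$survey_id)) {
    problems <- c(problems, "counts: survey_key without a matching survey_id")
  }
  problems
}

# Toy inputs (all values invented); counts deliberately points at a missing survey.
studies <- data.frame(study_id = "study_01", study_name = "Example study",
                      study_type = "example_type", authors = "A. Author",
                      publication_year = 2020, url = "https://example.org/study_01")
surveys <- data.frame(study_key = "study_01", survey_id = "survey_01",
                      country_name = "Exampleland", site_name = "Site A",
                      latitude = 1.5, longitude = 30, spatial_notes = NA,
                      collection_start = "2019-01-01", collection_end = "2019-03-01",
                      time_notes = NA)
counts <- data.frame(survey_key = "survey_99", variant_string = "crt:72:C",
                     variant_num = 40, total_num = 50)

problems <- check_stave_inputs(studies, surveys, counts)

# After correcting the key, the check passes.
counts$survey_key <- "survey_01"
problems_fixed <- check_stave_inputs(studies, surveys, counts)

# With clean inputs, the data could then be appended. The constructor call
# is an assumption; only the argument names come from the text above.
# s <- STAVE_object$new()
# s$append_data(studies_dataframe = studies,
#               surveys_dataframe = surveys,
#               counts_dataframe = counts)
```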
Method get_prevalence()
Calculate prevalence
Arguments
target_variant
the name of the variant on which we want to calculate prevalence, for example crt:72:C. Note that there can be no heterozygous calls within this name.
keep_ambiguous
there may be variants in the data for which the target_variant could be in the sample, but this cannot be proven conclusively. For example, the sequence A_A_A may or may not be a match to the sequence A/C_A/C_A; these are unphased genotypes so we cannot be sure. If keep_ambiguous = TRUE then both a min and a max numerator are reported that either exclude all ambiguous calls (min) or include all ambiguous calls (max). If FALSE (the default) then only the min is reported.
prev_from_min
the output object includes a point estimate of the prevalence along with exact binomial confidence intervals. These must be calculated from one of numerator_min or numerator_max in the case of ambiguous calls. This argument sets which one of these numerators is used in the calculation.
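To illustrate how the choice of numerator affects the estimate, here is a small base R sketch using binom.test(), which returns an exact (Clopper-Pearson) binomial confidence interval. The counts are invented, prev_with_ci() is a hypothetical helper, and the 95% level and proportion scale are assumptions; the package's own output format may differ.

```r
# Invented example: 40 unambiguous matches, 5 further ambiguous calls,
# out of 50 samples in total.
numerator_min <- 40   # excludes all ambiguous calls
numerator_max <- 45   # includes all ambiguous calls
denominator <- 50

# Hypothetical helper: point estimate plus exact binomial interval.
prev_with_ci <- function(numerator, denominator, conf_level = 0.95) {
  ci <- binom.test(numerator, denominator, conf.level = conf_level)$conf.int
  c(prevalence = numerator / denominator, lower = ci[1], upper = ci[2])
}

est_min <- prev_with_ci(numerator_min, denominator)  # conservative estimate
est_max <- prev_with_ci(numerator_max, denominator)  # upper-bound estimate

print(rbind(from_min = est_min, from_max = est_max))
```

The two rows bracket the range of prevalence values compatible with the unphased data, which is why a single numerator has to be chosen for the reported estimate.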
Method drop_study()
Drop one or more study_ids from the data. This will drop from all internally stored data objects, including the corresponding surveys and counts data.
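The cascading behaviour described here can be pictured with plain data.frames. This is only a sketch of the idea in base R, under the assumption that dropping follows the study_key and survey_key links; drop_study_sketch() is a hypothetical function and not the method's implementation. Only the key columns are shown, with invented values.

```r
# Toy tables reduced to their key columns.
tables <- list(
  studies = data.frame(study_id = c("study_01", "study_02")),
  surveys = data.frame(study_key = c("study_01", "study_01", "study_02"),
                       survey_id = c("survey_01", "survey_02", "survey_03")),
  counts = data.frame(survey_key = c("survey_01", "survey_02", "survey_03", "survey_03"),
                      variant_string = "crt:72:C",
                      variant_num = c(40, 15, 3, 7),
                      total_num = c(50, 20, 10, 10))
)

# Hypothetical helper: remove studies, then their surveys, then those surveys' counts.
drop_study_sketch <- function(tables, drop_ids) {
  dropped_surveys <- tables$surveys$survey_id[tables$surveys$study_key %in% drop_ids]
  tables$studies <- tables$studies[!tables$studies$study_id %in% drop_ids, , drop = FALSE]
  tables$surveys <- tables$surveys[!tables$surveys$study_key %in% drop_ids, , drop = FALSE]
  tables$counts <- tables$counts[!tables$counts$survey_key %in% dropped_surveys, , drop = FALSE]
  tables
}

remaining <- drop_study_sketch(tables, "study_01")
```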