The Relational Data Structure
relational_structure.Rmd
What’s the issue?
Imagine you’ve extracted some data on genetic variants from an academic paper and are now building a data table to organize the results. Often, a single paper will yield multiple entries — for example, each row might correspond to a different sampling location. To ensure that others can trace the results back to their original source, you’ll need to include information about the paper in the table, for example the author list and year of publication. However, this raises an issue: should you duplicate the same paper details in every row of your table?
On the one hand, duplicating the information ensures that the variant counts are always linked to their source, which is crucial for reproducibility. On the other hand, it introduces redundancy, increasing file size unnecessarily and heightening the risk of data entry errors.
The solution
Fortunately, this is a common class of problem, and the well-known solution is to use a relational database structure. In this approach, unique IDs are employed to link fields across multiple tables. One table would store metadata about the paper (e.g., authors, year of publication etc.), while a second table would hold data about the genetic variants (e.g., gene name, counts). The two tables are connected by a relational key, such that the same key in both tables links related information. This setup avoids redundancy, minimizes errors, and maintains a clear link between data and source.
In STAVE, we split data into three tables:
- Studies: This is a record of where the data came from. This could be academic publications, reports, or other sources.
- Surveys: A survey is defined here as a discrete instance of data collection. Survey-level data include information on location (latitude and longitude) and time (day) of collection. There may be multiple surveys within a given study, for example different sampling sites or different collection periods may come from the same academic publication.
- Counts: This is where the aggregate genetic data are stored. Data include the name of the variant that was observed (i.e. the gene, locus, and mutation), how many times it was observed, and how many times it was successfully tested for.
These tables are linked together through two sets of keys; Studies-to-Surveys, and Surveys-to-Counts.
Why use an R package?
There are many software platforms specifically designed for managing relational databases, notable examples being MySQL and Oracle. So why is STAVE written in R? There are few reasons for this:
- R is my language of choice, making it my fastest option to develop and maintain the package.
- Many downstream tools are written in R, making it is convenient to stay within the same environment.
- Although R is missing some of the more advanced features of solutions tailored to relational databases (e.g. allowing synchronous editing while avoiding clashes), these are also not needed for this simple application. At the same time, R has excellent capabilities for writing more general functions, e.g. for visualisation and mapping.
The next page goes into the specific formatting requirements of the three linked tables.