Skip to contents

What’s the issue?

Imagine you’ve extracted some data on genetic variants from an academic paper and are now building a table to organize the results. Often, a single paper will contain multiple entries — for example, each row might correspond to a different sampling location. To ensure that others can trace the results back to their original source, you’ll need to include information about the paper in the table, for example the author list and year of publication. However, this raises a common issue: should you duplicate the same details in every row of your table?

On the one hand, duplicating the information ensures that the variant counts are always linked to their source, which is crucial for reproducibility. On the other hand, it introduces redundancy, increasing file size unnecessarily and heightening the risk of data entry errors.

The solution

Fortunately, this is a common class of problem, and the well-known solution is to use a relational database structure. In this approach, unique IDs are employed to link fields across multiple tables. One table would store metadata about the paper (e.g., authors, year of publication etc.), while a second table would hold data about the genetic variants (e.g., gene name, counts). The two tables would be connected by a relational key, such that the same key in both tables links related information. This setup avoids redundancy, minimizes errors, and maintains a clear link between data and source.

In STAVE, we split data into three tables:

  1. Studies: This is a record of where the data came from. This could be academic publications, reports, or other sources.
  2. Surveys: We define as survey as a discrete instance of data collection. Survey-level data include information on location (latitude and longitude) and time (day) of collection. There may be multiple surveys within a given study, for example different sampling sites or different collection periods may come from the same academic publication.
  3. Counts: This is where the aggregate genetic data are stored. Fields include the name of the variant that was observed (i.e. the gene, locus, and mutation), how many times it was observed (the numerator), and how many times it was successfully tested for (the denominator).

These three tables are linked through two sets of keys.

Why use an R package?

There are many software platforms specifically designed for managing relational databases, notable examples being MySQL and Oracle. So why is STAVE written in R? There are few reasons for this:

  • Many downstream tools are written in R, making it convenient to stay within the same environment.
  • Although R is missing some of the more advanced features of solutions tailored to relational databases (e.g. allowing synchronous editing while avoiding clashes), these are also not needed for this simple application.
  • R has excellent capabilities for writing more general functions, e.g. for visualisation and mapping.
  • R is my language of choice, making it my fastest option to develop and maintain this package.

The next page goes into the specific formatting requirements of the three linked tables.