Ensure that each article gets a unique id across all tables
Source: R/make_unique_id.R
make_unique_id.Rd
Arguments
- articles
A dataframe with the articles table.
- df
A dataframe with the table that needs to be checked for duplicate covidence ids. Both articles and df will be loaded through load_epidata.
- df_name
The name of the df that will be loaded through load_epidata.
Details
In some instances, an article is associated with more than one id in the
parameters, models, or outbreaks tables.
This can lead to unexpected failures because we use the id to join the
articles with other dataframes. This function will resolve the issue by
first checking if a covidence id is mapped to more than one id. If it is, we
replace one of the two, ensuring that the same id is used across
articles, models, outbreaks, and parameters. This function is not expected to
be used directly by the user, but is called by load_epidata. Hence
checks on arguments are not implemented.
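The check-and-replace step described above can be sketched in base R as follows. This is a minimal illustration on toy data, not the package's implementation: the column names `id` and `covidence_id`, the example values, and the choice of the id held in articles as the one to keep are all assumptions made for the example.

```r
# Hypothetical toy tables; the column names `id` and `covidence_id` are assumed.
articles <- data.frame(
  id = c("a1", "b1"),
  covidence_id = c(101, 102),
  stringsAsFactors = FALSE
)
params <- data.frame(
  id = c("a1", "a2", "b1"),
  covidence_id = c(101, 101, 102),
  stringsAsFactors = FALSE
)

# Count the distinct ids attached to each covidence id in the table.
n_ids <- tapply(params$id, params$covidence_id, function(x) length(unique(x)))
dup_cov <- names(n_ids)[n_ids > 1]

# For rows whose covidence id maps to more than one id, take the id recorded
# in articles (an assumed rule) so that the same id is used in both tables.
fix <- as.character(params$covidence_id) %in% dup_cov
params$id[fix] <- articles$id[match(params$covidence_id[fix], articles$covidence_id)]
```

After this step every row of the toy params table with covidence id 101 carries the id stored in articles, so a join on id no longer drops rows.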
Need for article ids
Why do we need article ids in the first place? Why not use covidence id?
To ease the process of data extraction, we created a separate database for
each extractor, with the goal of then merging the databases into a single
database. Within each individual database, the different dataframes are linked
by Access-generated primary keys. These keys are unique to each database,
but are not unique across databases. To merge the databases, we therefore
generate a unique id for each article using random_id.
Note that we cannot use covidence id for merging tables within each database
because covidence id is only present in the articles table.
Why does an article end up with multiple ids?
Articles that have been extracted by two extractors will have multiple ids. In
principle, this should not be a problem: as extractors resolve differences in
the data extracted, they generate a consensus entry by deleting one of the two
entries, so only one of the two ids should enter back into the database when
the resolved entries are merged with the rest of the data. In practice,
however, while resolving conflicts for multiple parameters or models,
extractors may delete the row with Id1 in one case and the row with Id2 in
another. This can lead to the same article having multiple ids in parameters,
models, or outbreaks. Because an article that has been double extracted will
have been assigned two ids, it is also possible that the id retained in
articles is not the same as the id retained in parameters, models, or
outbreaks. This can lead to rows in parameters, models, or outbreaks that are
not linked to any article.
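The unlinked rows described above can be illustrated with a small base R example. The tables, the column names `id` and `covidence_id`, and the values are hypothetical and chosen only to show the effect; the ids stand in for the Id1 and Id2 of the text.

```r
# Hypothetical double-extracted article: articles kept "id1", while the
# parameters table still holds one row under each of the two ids.
articles <- data.frame(id = "id1", covidence_id = 7, stringsAsFactors = FALSE)
params <- data.frame(
  id = c("id1", "id2"),
  covidence_id = c(7, 7),
  stringsAsFactors = FALSE
)

# An inner join on id keeps only the row whose id also exists in articles.
joined <- merge(articles, params, by = "id", suffixes = c("_article", "_param"))

# Rows of params whose id is absent from articles are not linked to any article.
orphans <- params[!params$id %in% articles$id, ]
```

Here `joined` contains a single row and `orphans` holds the row stored under "id2", even though both rows belong to the same covidence id.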