Ensure that each article gets a unique id across all tables

Usage

make_unique_id(articles, df, df_name)

Arguments

articles: A dataframe with the articles table
df: A dataframe with the table to be checked for covidence ids that are mapped to more than one id. Both articles and df will be loaded through load_epidata.
df_name: The name of the df that will be loaded through load_epidata.

Value

A dataframe with the same structure as df, but with a single, unique id for each covidence id.

Details

In some instances, an article is associated with more than one id in the parameters, models, or outbreaks tables. This can lead to unexpected failures because we use the id to join the articles table with the other dataframes. This function resolves the issue by first checking whether any covidence id is mapped to more than one id. If it is, we replace one of the two ids, ensuring that the same id is used across articles, models, outbreaks, and parameters. This function is not expected to be used directly by the user; it is called by load_epidata. Hence, checks on arguments are not implemented.
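The check-and-replace logic described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper name harmonise_ids, the column names id and covidence_id, and the choice to treat the id in articles as the canonical one are all assumptions made for the example.

```r
# Hypothetical helper: NOT the implementation of make_unique_id.
# Assumes both tables have columns `id` and `covidence_id`.
harmonise_ids <- function(articles, df) {
  # Canonical id per covidence id, taken from the articles table (assumption)
  canonical <- articles[!duplicated(articles$covidence_id),
                        c("covidence_id", "id")]
  # Position of each row's covidence id in the canonical lookup
  idx <- match(df$covidence_id, canonical$covidence_id)
  # Use the canonical id where one exists; otherwise keep the original id
  df$id <- ifelse(is.na(idx), df$id, canonical$id[idx])
  df
}

# Toy data: covidence id 10 is mapped to two ids ("a1" and "a2") in params
articles <- data.frame(
  id = c("a1", "b1"),
  covidence_id = c(10, 20),
  stringsAsFactors = FALSE
)
params <- data.frame(
  id = c("a1", "a2", "b1"),
  covidence_id = c(10, 10, 20),
  parameter_value = c(1.5, 2.0, 0.3),
  stringsAsFactors = FALSE
)

# Step 1: find covidence ids that are mapped to more than one id
pairs <- unique(params[, c("covidence_id", "id")])
dup_covids <- unique(pairs$covidence_id[duplicated(pairs$covidence_id)])

# Step 2: replace ids so that each covidence id has a single id
fixed <- harmonise_ids(articles, params)
```

After the replacement, every row of fixed for covidence id 10 carries the same id as the articles table, so a join on id no longer drops or splits rows.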

Need for article ids

Why do we need article ids in the first place? Why not use covidence id? To ease the process of data extraction, we created a separate database for each extractor, with the goal of then merging these into a single database. Within each individual database, the different dataframes are linked by Access-generated primary keys. These keys are unique within each database, but are not unique across databases. To merge the databases, we therefore generate a unique id for each article using random_id. Note that we cannot use covidence id for merging tables within each database because covidence id is only present in the articles table.

Why does an article end up with multiple ids?

Articles that have been extracted by two extractors will have multiple ids. In principle, this should not be a problem: as extractors resolve differences in the extracted data, they generate a consensus entry by deleting one of the two entries, so that only one of the two ids enters back into the database when the resolved entries are merged with the rest of the data. In practice, however, while resolving conflicts for multiple parameters or models, extractors may delete the row with Id1 in one case and the row with Id2 in another. This can lead to the same article having multiple ids in parameters, models, or outbreaks. Because an article that has been double extracted will have been assigned two ids, it is also possible that the id retained in articles is not the same as the id retained in parameters, models, or outbreaks. This can lead to rows in parameters, models, or outbreaks that are not linked to any article.
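To make the failure mode concrete, the following sketch uses toy data in which the article kept Id1 while one parameter row kept Id2. The column names id and covidence_id and all values are assumptions made for illustration only.

```r
# Toy data (hypothetical): one double-extracted article with covidence id 10.
# The articles table retained "Id1"; one parameter row retained "Id2".
articles <- data.frame(
  id = "Id1",
  covidence_id = 10,
  stringsAsFactors = FALSE
)
params <- data.frame(
  id = c("Id1", "Id2"),
  covidence_id = c(10, 10),
  parameter_type = c("R0", "CFR"),
  stringsAsFactors = FALSE
)

# Rows in params whose id has no match in articles
orphans <- params[!params$id %in% articles$id, ]

# An inner join on id silently drops the orphaned row
joined <- merge(articles, params, by = "id")
```

Here orphans contains the row with Id2, and joined has one row fewer than params, even though both parameter rows belong to the same article.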