orderly2 is a package designed with two complementary goals in mind:
- to make analyses reproducible without significant effort on the part of the analyst
- to make it easy to collaborate on analyses by allowing easy sharing of artefacts from an analysis amongst a group of analysts.
In this vignette we will expand on these two aims, and show that the first one is a prerequisite for the second. The second is more interesting, though, and we start there. If you just want to get started using orderly2, you might prefer vignette("introduction"), and if you are already familiar with version 1 you might prefer vignette("migrating").
Collaborative analysis
Many analyses only involve a single person and a single machine; in this case there are any number of workflow tools that will make orchestrating this analysis easy. In a workflow model you have a graph of dependencies over your analysis, over which data flows. So for example you might have
[raw data] -> [processed data] -> [model fits] -> [forecasts] -> [report]
If you update the data, the whole pipeline should rerun. But if you update the code for the forecast, then only the forecasts and report should rerun.
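The rerun rule above can be sketched in a few lines. This is an illustration of the general workflow model only, not of any particular tool: the `PIPELINE` dictionary and the `steps_to_rerun` function are hypothetical names introduced here, and the graph simply encodes the five-step example.

```python
from collections import deque

# Hypothetical encoding of the example pipeline: each step maps to the
# steps that consume its output.
PIPELINE = {
    "raw data": ["processed data"],
    "processed data": ["model fits"],
    "model fits": ["forecasts"],
    "forecasts": ["report"],
    "report": [],
}


def steps_to_rerun(graph, changed):
    """Return the changed step plus everything downstream of it."""
    seen = [changed]
    queue = deque([changed])
    while queue:
        step = queue.popleft()
        for child in graph[step]:
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


# Changing the data invalidates the whole pipeline ...
print(steps_to_rerun(PIPELINE, "raw data"))
# ... but changing the forecast code only affects forecasts and the report.
print(steps_to_rerun(PIPELINE, "forecasts"))
```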
In our experience, this model works well for a single-user setting but falls over in a collaborative setting, especially where the analysis is partitioned by person; so Alice is handling the data pipeline, Bob is running fits, while Carol is organising forecasts and the final report. In this context, changes upstream affect downstream analysis, and require the same sort of care around integration as you might be used to with version controlling source code.
For example, if Alice is dealing with a change in the incoming data format which is going to break the analysis at the same time that Bob is trying to get the model fits working, Bob should not be trying to integrate both his code changes and Alice’s new data. We typically deal with this for source code by using branches within git; for the code Bob would work on a branch that is isolated from Alice’s changes. But in most contexts like this you will not have (and should not have) the data and analysis products in git. What is needed is a way of versioning the outputs of each step of analysis and controlling when these are integrated into subsequent analyses.
Another way of looking at the problem is that we seek a way of making analysis composable, in the same way that functions and OOP make programs composable, or that docker and containerisation have made software deployment composable. To do this we need a way of putting interfaces around pieces of analysis, and of allowing people to refer to them and fetch them from wherever they have been run.
The conceptual pieces that are needed here are:
- some way of referring unambiguously (and globally) to a piece of analysis so that it can be depended upon, and so that everyone can agree they’re talking about the same piece of analysis
- a system of storage so that results of running analysis can be shared among a group
- a system to control how integration of these pieces of analysis takes place
We refer to a transportable unit of analysis as a “packet”. This conceptually is a directory of files created by running some code, and is our atomic unit of work from the point of view of orderly. Each packet has an underlying source form, which anyone can run. However, most of the time people will use pre-run packets that they or their collaborators have run as inputs to onward analyses (see vignette("dependencies") and vignette("collaboration") for more details).
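To make the idea of a packet more concrete, here is a minimal sketch of what such a unit might record. It is not the orderly2 data model or API: the `Packet` class, the `make_packet` and `new_packet_id` helpers, and the identifier format (a UTC timestamp plus random hex) are all assumptions made for illustration. The point is only that a packet needs a globally unique identifier, a record of its files, and references to the packets it used.

```python
import hashlib
import secrets
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Packet:
    id: str         # globally unique identifier (hypothetical format)
    name: str       # name of the piece of analysis that was run
    files: dict     # file path -> SHA-256 hash of its contents
    depends: tuple  # ids of the packets this packet pulled files from


def new_packet_id(now=None):
    """Build an id from a UTC timestamp and 8 random hex characters."""
    now = now or datetime.now(timezone.utc)
    return f"{now:%Y%m%d-%H%M%S}-{secrets.token_hex(4)}"


def make_packet(name, contents, depends=()):
    """Record a finished run: hash every file and note its dependencies."""
    files = {path: hashlib.sha256(data).hexdigest()
             for path, data in contents.items()}
    return Packet(new_packet_id(), name, files, tuple(depends))


# Alice produces a data packet; Bob's fit refers to it by id, so he keeps
# using this exact version until he chooses to depend on a newer one.
data = make_packet("data", {"data.csv": b"x,y\n1,2\n"})
fit = make_packet("fit", {"fit.txt": b"fitted model"}, depends=[data.id])
print(fit.depends)
```

Because a dependency is an explicit identifier rather than "whatever is currently upstream", integration of new upstream results becomes a deliberate step.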
Reproducible analyses
Any degree of collaboration in the style above requires reproducibility, but there are several aspects of this.
With the system we describe here, even though everyone can typically run any step of an analysis, they typically don’t. This differs from workflow tools, which users may be familiar with.
Difference from a workflow system
Workflow systems have been hugely influential in scientific computing, from people co-opting build systems like make through to sophisticated systems designed for parallel running of large and complex workflows such as nextflow. The general approach is to define interdependencies among parts of an analysis, forming a graph over those parts, and to track inputs and outputs through the workflow.
This model of computation has lots of good points:
- it defines an interface over an analysis and allows (and encourages) breaking a monolithic analysis into component pieces which can be reasoned about
- it allows high-level parallelism by making obvious parts of the workflow that can be run concurrently
- by tracking the way data flows through an analysis, it allows the minimum amount of recalculation to be done on change, with only downstream parts triggered
- with a shared workspace or online runner, it allows a degree of collaboration so long as everyone is happy to be working with a constantly changing set of code and analysis artefacts
We have designed orderly2 for working patterns that do not suit the above. Some motivating reasons include:
- Workflows where some nodes in the computational graph are very expensive to compute or require exotic hardware
- Workflows where the upstream data never settle, but we need to know which version of the data ends up being used in a particular analysis
- Workflows where upstream analyses are used in many downstream analyses, and where the upstream developer may not know much about the downstream use
- Nondeterministic analyses, e.g., those involving stochastic simulations, where rerunning a node is not expected to return the exact same numerical results, and so we can’t rely on different users recovering the same results on different occasions[*]
In all these cases the missing piece we need is a way of versioning the nodes within the computational graph, and shifting the emphasis from automatically rerunning portions of the graph to tracking how data has flowed through the graph. This in turn shifts the reproducibility emphasis from “everyone will run the same code and get the same results” to “everyone could run the same code, but will instead work with the results”.
For those familiar with docker, our approach is similar to working with pre-built docker images, whereas the workflow approach is more similar to working directly with Dockerfiles; in many situations the end result is the same, but the approaches differ in guarantees, in where the computation happens, and in how users refer to versions.
[*] We discourage trying to force determinism by manually setting seeds, as this has the potential to violate the statistical properties of random number streams, and is fragile at best.
What is reproducibility anyway?
Reproducibility means different things to different people, even within the narrow sense of “rerun an analysis and retrieve the same results”. In the last decade, the idea that one should be able to rerun a piece of analysis and retrieve the same results has slightly morphed into the idea that one must rerun a piece of analysis. Similarly, the emphasis on the utility of reproducibility has shifted from authors being able to rerun their own work (or have confidence that they could rerun it) to some hypothetical third party wanting to rerun an analysis.
Our approach flips the perspective around a bit, based on our experiences with collaborative research projects, and draws from an (overly) ambitious aim we had:
Can we prove that a given set of inputs produced a given set of outputs?
We quickly found that this was impossible, but that, provided a few systems were in place, one could be satisfied with this statement to a given level of trust in a system. So if a piece of analysis comes from a server where the primary way people run analyses is through our web front-end (currently OrderlyWeb, soon to be Packit), we know that the analysis was run end-to-end with no modification. Because orderly2 preserves inputs alongside outputs, we also know that the files present in the final packet were the files that went into the analysis, and that the recorded R and package versions were the full set that were used.
Because this system naturally involves running on multiple machines (typically we will have the analysts’ laptops, a server and perhaps an HPC environment), and because of the way that orderly2 treats paths, in practice there is very little problem getting analyses working in multiple places. This trivially satisfies the typical reproducibility aim, even though it is not what people are typically focussed on.
This shift in focus has proved valuable. In any analysis that is run on more than one occasion (e.g., regular reporting, or simply updating a figure for a final submission of a manuscript after revision), the outputs may change. Understanding why these changes have happened is important. Because orderly2 automatically saves a lot of metadata about what was run, it is easy to find out why things might have changed. Further, you can start interrogating the graph among packets to find out what effect that change has had; for example, you can find all the previously run packets that pulled in the old version of a data set, or that used the previous release of a package.
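The kind of question described above amounts to a traversal of the dependency graph recorded in packet metadata. The sketch below illustrates the idea on a toy example; it does not use the orderly2 query interface, and the `DEPENDS` table, the packet ids, and the `used_by` function are hypothetical.

```python
# Hypothetical metadata: packet id -> ids of the packets it pulled files from.
DEPENDS = {
    "data-v1": [],
    "data-v2": [],
    "fit-a": ["data-v1"],
    "fit-b": ["data-v2"],
    "forecast-a": ["fit-a"],
    "report-a": ["forecast-a"],
}


def used_by(depends, target):
    """Return the ids of all packets that used `target`, directly or
    through a chain of intermediate packets, in sorted order."""
    def uses(packet_id):
        return any(dep == target or uses(dep) for dep in depends[packet_id])

    return sorted(p for p in depends if uses(p))


# Which previously run packets were built on the old version of the data?
print(used_by(DEPENDS, "data-v1"))
```

The recursion assumes the graph is acyclic, which holds whenever a packet can only depend on packets that were run before it.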