Orderly Tutorial

MRC Centre for Global Infectious Disease Analysis

What is orderly?

A reproducible reporting framework
A way of keeping track of versions of data
A way of collaborating with other people
A way of thinking about analysis

Original aims:

“With orderly we have two main hopes:

analysts can write code that will straightforwardly run on someone else’s machine (or a remote machine)

when an analysis that is run several times starts behaving differently it will be easy to see when the outputs started changing, and what inputs started changing at the same time”

(~ 2018)

But what is it?

A package designed to make analysis more reproducible
A way of structuring your analysis so to make it easy to understand, run and reuse
A set of tools that make it easy to:
- track all inputs into an analysis (packages, code, and data resources)
- store multiple versions of an analysis where it is repeated
- track outputs of an analysis
- create analyses that depend on the outputs of previous analyses

Who uses it?

Developed since May 2017 for the Vaccine Impact Modelling Consortium
Used in the 2018-2020 DRC Ebola responses
Used in the COVID-19 response, especially within the “real time modelling” group
Used within research groups (HIV, Malaria, possibly others?)

Historical notes

orderly (version 1)
- Created for Vaccine Impact Modelling Consortium and strongly focussed on reproducible research
- Used YAML everywhere
- Supported simple ways of working for a small centralised team
orderly2 (soon to be orderly 2.0.0)
- A complete rewrite taking the best ideas from version 1 and dropping many less useful bits
- Easier to program against
- No more YAML
- Focusses on distributed collaborative analysis
- Also available as a python package!

Part 1: Getting started

Install `orderly2`

From the mrc-ide r-universe (recommended)

install.packages(
  "orderly2",
  repos = c("https://mrc-ide.r-universe.dev",
            "https://cloud.r-project.org"))

From GitHub using remotes:

remotes::install_github("mrc-ide/orderly2")

From PyPi (for Python)

pip install pyorderly

My first orderly report / task

There is a discussion to have here about naming. We might have this in a break…

The setup

First, load the package and create a new empty orderly root.

library(orderly2)
orderly_init("workdir/part1")
## ✔ Created orderly root at '/home/runner/work/orderly-tutorial/orderly-tutorial/workdir/part1'
## ✔ Wrote '.gitignore'

(for the rest of this section, we have setwd() into this directory; you should create an RStudio “Project” here.)

What’s in the box?

fs::dir_tree("workdir/part1")
## workdir/part1
## └── orderly_config.yml

Really:

fs::dir_tree("workdir/part1", all = TRUE)
## workdir/part1
## ├── .gitignore
## ├── .outpack
## │   ├── config.json
## │   ├── location
## │   ├── metadata
## │   └── r
## │       └── git_ok
## └── orderly_config.yml

But leave everything in .outpack/ alone, just like .git/

Create an empty report

orderly_new("example")
## ✔ Created 'src/example/example.R'

Our contents now:

fs::dir_tree("workdir/part1")
## workdir/part1
## ├── orderly_config.yml
## └── src
##     └── example
##         └── example.R

The name example.R is important; this always has the form src/<name>/name.R

Hello orderly world

We have edited src/example/example.R to contain:

d <- data.frame(greeting = "hello", to = "world")
write.csv(d, "hello.csv", row.names = FALSE)

Now we run

id <- orderly_run("example")
## ℹ Starting packet 'example' `20241023-074719-5042592d` at 2024-10-23 07:47:19.325677
## > d <- data.frame(greeting = "hello", to = "world")
## > write.csv(d, "hello.csv", row.names = FALSE)
## ✔ Finished running 'example.R'
## ℹ Finished 20241023-074719-5042592d at 2024-10-23 07:47:19.377219 (0.05154157 secs)

Files created

fs::dir_tree("workdir/part1")
## workdir/part1
## ├── archive
## │   └── example
## │       └── 20241023-074719-5042592d
## │           ├── example.R
## │           └── hello.csv
## ├── draft
## │   └── example
## ├── orderly_config.yml
## └── src
##     └── example
##         └── example.R

Directory named after the id (in archive/example)
We have copied example.R into the directory
Output sits next to inputs
Metadata is stored in a hidden location

Contents of hello.csv:

read.csv(file.path("workdir/part1/archive/example", id, "hello.csv"))
##   greeting    to
## 1    hello world

Every packet has a unique id

id
## [1] "20241023-074719-5042592d"

and a bunch of metadata:

orderly_metadata(id)
## $schema_version
## [1] "0.1.1"
## 
## $name
## [1] "example"
## 
## $id
## [1] "20241023-074719-5042592d"
## 
## $time
## $time$start
## [1] "2024-10-23 07:47:19 UTC"
## 
## $time$end
## [1] "2024-10-23 07:47:19 UTC"
## 
## 
## $parameters
## NULL
## 
## $files
##        path size
## 1 example.R   95
## 2 hello.csv   32
##                                                                      hash
## 1 sha256:541682d8b8dba9b2ddb4ac5809c03e6bedd58b52ab3e64a662f3f48e66a9639f
## 2 sha256:b9f0704f459f7ad9785ddee01a281d81f95a461dbb682436a263e0b7252e92b7
## 
## $depends
## [1] packet query  files 
## <0 rows> (or 0-length row.names)
## 
## $git
## $git$sha
## [1] "c6f469f362fd925206fad90a6749d5f9beaa6f02"
## 
## $git$branch
## [1] "main"
## 
## $git$url
## [1] "https://github.com/mrc-ide/orderly-tutorial"
## 
## 
## $custom
## $custom$orderly
## $custom$orderly$artefacts
## [1] description paths      
## <0 rows> (or 0-length row.names)
## 
## $custom$orderly$role
##        path    role
## 1 example.R orderly
## 
## $custom$orderly$description
## $custom$orderly$description$display
## NULL
## 
## $custom$orderly$description$long
## NULL
## 
## $custom$orderly$description$custom
## NULL
## 
## 
## $custom$orderly$shared
## [1] here  there
## <0 rows> (or 0-length row.names)
## 
## $custom$orderly$session
## $custom$orderly$session$platform
## $custom$orderly$session$platform$version
## [1] "R version 4.4.1 (2024-06-14)"
## 
## $custom$orderly$session$platform$os
## [1] "Ubuntu 22.04.5 LTS"
## 
## $custom$orderly$session$platform$system
## [1] "x86_64, linux-gnu"
## 
## 
## $custom$orderly$session$packages
##        package version attached
## 1     orderly2 1.99.48     TRUE
## 2        vctrs   0.6.5    FALSE
## 3          cli   3.6.3    FALSE
## 4        knitr    1.48    FALSE
## 5        rlang   1.1.4    FALSE
## 6         xfun    0.48    FALSE
## 7     jsonlite   1.8.9    FALSE
## 8         glue   1.8.0    FALSE
## 9      openssl   2.2.2    FALSE
## 10     askpass   1.2.1    FALSE
## 11   htmltools 0.5.8.1    FALSE
## 12         sys   3.4.3    FALSE
## 13       fansi   1.0.6    FALSE
## 14   rmarkdown    2.28    FALSE
## 15    evaluate   1.0.1    FALSE
## 16      tibble   3.2.1    FALSE
## 17     fastmap   1.2.0    FALSE
## 18        yaml  2.3.10    FALSE
## 19   lifecycle   1.0.4    FALSE
## 20    compiler   4.4.1    FALSE
## 21          fs   1.6.4    FALSE
## 22   pkgconfig   2.0.3    FALSE
## 23      digest  0.6.37    FALSE
## 24        gert   2.1.4    FALSE
## 25          R6   2.5.1    FALSE
## 26        utf8   1.2.4    FALSE
## 27      pillar   1.9.0    FALSE
## 28 credentials   2.0.2    FALSE
## 29    magrittr   2.0.3    FALSE
## 30       withr   3.0.1    FALSE
## 31       tools   4.4.1    FALSE

What is a hash?

A one-way transformation from data to a fairly short string

orderly_hash_data("hello", "sha256")
## [1] "sha256:2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"

Very small changes to the string give large changes to the hash

orderly_hash_data("hel1o", "sha256")
## [1] "sha256:28ad6a376c5c22d1a3ad7f115c0cd8f4b8a9c94f55325c17f9302b6a4e41b29c"

This means we can compare hashes and be confident we are looking at the same file (git does a lot of this)

Run it again, Sam

orderly_run("example")
## ℹ Starting packet 'example' `20241023-074719-752c6978` at 2024-10-23 07:47:19.462132
## > d <- data.frame(greeting = "hello", to = "world")
## > write.csv(d, "hello.csv", row.names = FALSE)
## ✔ Finished running 'example.R'
## ℹ Finished 20241023-074719-752c6978 at 2024-10-23 07:47:19.494556 (0.03242469 secs)
## [1] "20241023-074719-752c6978"

We have a new id with the new copy

A copy saved every time we run

Stop naming files data_final-rgf (2).csv, please

fs::dir_tree("workdir/part1")
## workdir/part1
## ├── archive
## │   └── example
## │       ├── 20241023-074719-5042592d
## │       │   ├── example.R
## │       │   └── hello.csv
## │       └── 20241023-074719-752c6978
## │           ├── example.R
## │           └── hello.csv
## ├── draft
## │   └── example
## ├── orderly_config.yml
## └── src
##     └── example
##         └── example.R

A high-level overview of of packets:

orderly_metadata_extract()
##                         id    name parameters
## 1 20241023-074719-5042592d example           
## 2 20241023-074719-752c6978 example

(more on this later).

Part 2: orderly code

What can I do?

orderly code is any code you can use from R
Use (almost) any package, any sort of file

But…

The directory above your file does not exist
Don’t use absolute paths or ../ path fragments
You can add metadata to help future you/others

A clean beginning

orderly_init("workdir/part2")
## ✔ Created orderly root at '/home/runner/work/orderly-tutorial/orderly-tutorial/workdir/part2'
## ✔ Wrote '.gitignore'

Suppose we’re working on a data analysis pipeline, starting with “incoming data”:

orderly_new("incoming")
## ✔ Created 'src/incoming/incoming.R'

Our setup:

fs::dir_tree("workdir/part2")
## workdir/part2
## ├── orderly_config.yml
## └── src
##     └── incoming
##         └── incoming.R

Incoming data

I’ve copied some data in as data.xlsx into src/incoming.

fs::dir_tree("workdir/part2")
## workdir/part2
## ├── orderly_config.yml
## └── src
##     └── incoming
##         ├── data.xlsx
##         └── incoming.R

Modify incoming.R to tidy that up for consumption using your favourite packages.
Set your working directory to src/incoming and just edit things as usual
Which sheet contains the data?
Where is the data in that sheet?
Do you like those column names?
How about that date format?

Incoming data, cleaned

My attempt at cleaning:

d <- readxl::read_excel("data.xlsx", sheet = 2, skip = 2)
names(d) <- gsub(" ", "_", tolower(names(d)))
d$date <- as.Date(d$date)
write.csv(d, "data.csv", row.names = FALSE)

Read in the data
Clean up the names (could have used janitor)
Convert date format
Write as csv

Incoming data, running things

id <- orderly_run("incoming")
## ℹ Starting packet 'incoming' `20241023-074719-a46c26cc` at 2024-10-23 07:47:19.646878
## > d <- readxl::read_excel("data.xlsx", sheet = 2, skip = 2)
## > names(d) <- gsub(" ", "_", tolower(names(d)))
## > d$date <- as.Date(d$date)
## > write.csv(d, "data.csv", row.names = FALSE)
## ✔ Finished running 'incoming.R'
## ℹ Finished 20241023-074719-a46c26cc at 2024-10-23 07:47:19.678587 (0.03170943 secs)

Our generated metadata:

orderly_metadata(id)
## $schema_version
## [1] "0.1.1"
## 
## $name
## [1] "incoming"
## 
## $id
## [1] "20241023-074719-a46c26cc"
## 
## $time
## $time$start
## [1] "2024-10-23 07:47:19 UTC"
## 
## $time$end
## [1] "2024-10-23 07:47:19 UTC"
## 
## 
## $parameters
## NULL
## 
## $files
##         path  size
## 1   data.csv   466
## 2  data.xlsx 10851
## 3 incoming.R   174
##                                                                      hash
## 1 sha256:9117700f079b786812cc20904f4c34f5659f2986a923d2771216551cf378e86f
## 2 sha256:149445ecc545eb987ecc0d5255a48f21165ee1c9b6b54c84b882ce1fbda066b7
## 3 sha256:8ce9dc59614d62b39cd4ff8cfba08e48f993a9be6d3c29357118df386bf7688f
## 
## $depends
## [1] packet query  files 
## <0 rows> (or 0-length row.names)
## 
## $git
## $git$sha
## [1] "c6f469f362fd925206fad90a6749d5f9beaa6f02"
## 
## $git$branch
## [1] "main"
## 
## $git$url
## [1] "https://github.com/mrc-ide/orderly-tutorial"
## 
## 
## $custom
## $custom$orderly
## $custom$orderly$artefacts
## [1] description paths      
## <0 rows> (or 0-length row.names)
## 
## $custom$orderly$role
##         path    role
## 1 incoming.R orderly
## 
## $custom$orderly$description
## $custom$orderly$description$display
## NULL
## 
## $custom$orderly$description$long
## NULL
## 
## $custom$orderly$description$custom
## NULL
## 
## 
## $custom$orderly$shared
## [1] here  there
## <0 rows> (or 0-length row.names)
## 
## $custom$orderly$session
## $custom$orderly$session$platform
## $custom$orderly$session$platform$version
## [1] "R version 4.4.1 (2024-06-14)"
## 
## $custom$orderly$session$platform$os
## [1] "Ubuntu 22.04.5 LTS"
## 
## $custom$orderly$session$platform$system
## [1] "x86_64, linux-gnu"
## 
## 
## $custom$orderly$session$packages
##        package version attached
## 1     orderly2 1.99.48     TRUE
## 2        vctrs   0.6.5    FALSE
## 3          cli   3.6.3    FALSE
## 4        knitr    1.48    FALSE
## 5        rlang   1.1.4    FALSE
## 6         xfun    0.48    FALSE
## 7     jsonlite   1.8.9    FALSE
## 8         glue   1.8.0    FALSE
## 9      openssl   2.2.2    FALSE
## 10     askpass   1.2.1    FALSE
## 11   htmltools 0.5.8.1    FALSE
## 12         sys   3.4.3    FALSE
## 13      readxl   1.4.3    FALSE
## 14       fansi   1.0.6    FALSE
## 15   rmarkdown    2.28    FALSE
## 16  cellranger   1.1.0    FALSE
## 17    evaluate   1.0.1    FALSE
## 18      tibble   3.2.1    FALSE
## 19     fastmap   1.2.0    FALSE
## 20        yaml  2.3.10    FALSE
## 21   lifecycle   1.0.4    FALSE
## 22    compiler   4.4.1    FALSE
## 23          fs   1.6.4    FALSE
## 24   pkgconfig   2.0.3    FALSE
## 25      digest  0.6.37    FALSE
## 26        gert   2.1.4    FALSE
## 27          R6   2.5.1    FALSE
## 28        utf8   1.2.4    FALSE
## 29      pillar   1.9.0    FALSE
## 30 credentials   2.0.2    FALSE
## 31    magrittr   2.0.3    FALSE
## 32       withr   3.0.1    FALSE
## 33       tools   4.4.1    FALSE

“Resources”

Any file that is an input
For example:
- Scripts that you source()
- R Markdown files for knitr or rmarkdown
- Data files (.csv, .xlsx, etc)
- Plain text files (README.md, licence info, etc)
Here, data.xlsx is an input

Telling orderly about resources

orderly_resource("data.xlsx")
d <- readxl::read_excel("data.xlsx", sheet = 2, skip = 2)
names(d) <- gsub(" ", "_", tolower(names(d)))
d$date <- as.Date(d$date)
write.csv(d, "data.csv", row.names = FALSE)

Tells orderly data.xlsx is a resource
Fail early if resource not found
Error if resource is modified
Extra metadata, advertising what files were used

id <- orderly_run("incoming")
## ℹ Starting packet 'incoming' `20241023-074719-bbb08e35` at 2024-10-23 07:47:19.737755
## > orderly_resource("data.xlsx")
## > d <- readxl::read_excel("data.xlsx", sheet = 2, skip = 2)
## > names(d) <- gsub(" ", "_", tolower(names(d)))
## > d$date <- as.Date(d$date)
## > write.csv(d, "data.csv", row.names = FALSE)
## ✔ Finished running 'incoming.R'
## ℹ Finished 20241023-074719-bbb08e35 at 2024-10-23 07:47:19.764167 (0.02641225 secs)
orderly_metadata(id)$custom$orderly$role
##         path     role
## 1 incoming.R  orderly
## 2  data.xlsx resource

“Artefacts”

Any file that is an output
For example:
- Datasets you generate
- html or pdf output from knitr or rmarkdown
- Plain text files
- Inputs themselves, sometimes
Here, data.csv is an artefact

Telling orderly about artefacts

orderly_resource("data.xlsx")
orderly_artefact(files = "data.csv", description = "Cleaned data")
d <- readxl::read_excel("data.xlsx", sheet = 2, skip = 2)
names(d) <- gsub(" ", "_", tolower(names(d)))
d$date <- as.Date(d$date)
write.csv(d, "data.csv", row.names = FALSE)

Tells orderly csv.xlsx is an artefact
Fail if artefact not produced
Extra metadata, advertising what files were produced

id <- orderly_run("incoming")
## ℹ Starting packet 'incoming' `20241023-074719-cf0bc2a0` at 2024-10-23 07:47:19.813293
## > orderly_resource("data.xlsx")
## > orderly_artefact(files = "data.csv", description = "Cleaned data")
## > d <- readxl::read_excel("data.xlsx", sheet = 2, skip = 2)
## > names(d) <- gsub(" ", "_", tolower(names(d)))
## > d$date <- as.Date(d$date)
## > write.csv(d, "data.csv", row.names = FALSE)
## ✔ Finished running 'incoming.R'
## ℹ Finished 20241023-074719-cf0bc2a0 at 2024-10-23 07:47:19.844239 (0.03094578 secs)
orderly_metadata(id)$custom$orderly$artefacts
##    description    paths
## 1 Cleaned data data.csv

More metadata

orderly_description(
  display = "Incoming data from Otherlandia",
  long = "Data as given to us from the MoH in Otherlandia.",
  custom = list(received = "2024-10-22"))
orderly_resource("data.xlsx")
orderly_artefact(files = "data.csv", description = "Cleaned data")
d <- readxl::read_excel("data.xlsx", sheet = 2, skip = 2)
names(d) <- gsub(" ", "_", tolower(names(d)))
d$date <- as.Date(d$date)
write.csv(d, "data.csv", row.names = FALSE)

Running this:

id <- orderly_run("incoming", echo = FALSE)
## ℹ Starting packet 'incoming' `20241023-074719-e393a4e3` at 2024-10-23 07:47:19.893583
## ✔ Finished running 'incoming.R'
## ℹ Finished 20241023-074719-e393a4e3 at 2024-10-23 07:47:19.920295 (0.02671218 secs)
orderly_metadata(id)$custom$orderly$description
## $display
## [1] "Incoming data from Otherlandia"
## 
## $long
## [1] "Data as given to us from the MoH in Otherlandia."
## 
## $custom
## $custom$received
## [1] "2024-10-22"

Dependencies

This is really the point of orderly
You can pull in any file from any previously run packet
You can use queries to select packets to depend on

Our aim: We want to use data.csv in some analysis

orderly_new("analysis")
## ✔ Created 'src/analysis/analysis.R'

Setting up a dependency

orderly_dependency("incoming", "latest", "data.csv")
orderly_artefact(files = c("coverage-gf.png", "coverage-bf.png"),
                 description = "Plots of coverage")

d <- read.csv("data.csv")
d$date <- as.Date(d$date)

png("coverage-gf.png")
plot(gf_coverage ~ date, d, type = "l")
dev.off()

png("coverage-bf.png")
plot(bf_coverage ~ date, d, type = "l")
dev.off()

This is the only file within our analysis directory:

fs::dir_tree("workdir/part2/src/analysis")
## workdir/part2/src/analysis
## └── analysis.R

Running the report

id <- orderly_run("analysis")
## ℹ Starting packet 'analysis' `20241023-074719-fca3dc27` at 2024-10-23 07:47:19.991456
## > orderly_dependency("incoming", "latest", "data.csv")
## ℹ Depending on incoming @ `20241023-074719-e393a4e3` (via latest(name == "incoming"))
## > orderly_artefact(files = c("coverage-gf.png", "coverage-bf.png"),
## +                  description = "Plots of coverage")
## > d <- read.csv("data.csv")
## > d$date <- as.Date(d$date)
## > png("coverage-gf.png")
## > plot(gf_coverage ~ date, d, type = "l")
## > dev.off()
## png 
##   2 
## > png("coverage-bf.png")
## > plot(bf_coverage ~ date, d, type = "l")
## > dev.off()
## png 
##   2
## ✔ Finished running 'analysis.R'
## ℹ Finished 20241023-074719-fca3dc27 at 2024-10-23 07:47:20.083316 (0.09186029 secs)

The aftermath

fs::dir_tree("workdir/part2")
## workdir/part2
## ├── archive
## │   ├── analysis
## │   │   └── 20241023-074719-fca3dc27
## │   │       ├── analysis.R
## │   │       ├── coverage-bf.png
## │   │       ├── coverage-gf.png
## │   │       └── data.csv
## │   └── incoming
## │       ├── 20241023-074719-a46c26cc
## │       │   ├── data.csv
## │       │   ├── data.xlsx
## │       │   └── incoming.R
## │       ├── 20241023-074719-bbb08e35
## │       │   ├── data.csv
## │       │   ├── data.xlsx
## │       │   └── incoming.R
## │       ├── 20241023-074719-cf0bc2a0
## │       │   ├── data.csv
## │       │   ├── data.xlsx
## │       │   └── incoming.R
## │       └── 20241023-074719-e393a4e3
## │           ├── data.csv
## │           ├── data.xlsx
## │           └── incoming.R
## ├── draft
## │   ├── analysis
## │   └── incoming
## ├── orderly_config.yml
## └── src
##     ├── analysis
##     │   └── analysis.R
##     └── incoming
##         ├── data.xlsx
##         └── incoming.R

Some comments on this

The data.csv file has been copied from the final copy of incoming into analysis
The dependency system works interactively too (try it!)
The logs indicate how dependency resolution occurred
Metadata about the dependencies is included:

orderly_metadata(id)$depends
##                     packet                      query        files
## 1 20241023-074719-e393a4e3 latest(name == "incoming") data.csv....

Using specific versions

orderly_dependency("incoming", "20241023-074719-bbb08e35", "data.csv")
orderly_artefact(files = c("coverage-gf.png", "coverage-bf.png"),
                 description = "Plots of coverage")

d <- read.csv("data.csv")
d$date <- as.Date(d$date)

png("coverage-gf.png")
plot(gf_coverage ~ date, d, type = "l")
dev.off()

png("coverage-bf.png")
plot(bf_coverage ~ date, d, type = "l")
dev.off()

Running this

id <- orderly_run("analysis")
## ℹ Starting packet 'analysis' `20241023-074720-27b9baf4` at 2024-10-23 07:47:20.159643
## > orderly_dependency("incoming", "20241023-074719-bbb08e35", "data.csv")
## ℹ Depending on incoming @ `20241023-074719-bbb08e35` (via single(id == "20241023-074719-bbb08e35" && name == "incoming"))
## > orderly_artefact(files = c("coverage-gf.png", "coverage-bf.png"),
## +                  description = "Plots of coverage")
## > d <- read.csv("data.csv")
## > d$date <- as.Date(d$date)
## > png("coverage-gf.png")
## > plot(gf_coverage ~ date, d, type = "l")
## > dev.off()
## png 
##   2 
## > png("coverage-bf.png")
## > plot(bf_coverage ~ date, d, type = "l")
## > dev.off()
## png 
##   2
## ✔ Finished running 'analysis.R'
## ℹ Finished 20241023-074720-27b9baf4 at 2024-10-23 07:47:20.215387 (0.05574393 secs)

with metadata

orderly_metadata(id)$depends
##                     packet
## 1 20241023-074719-bbb08e35
##                                                            query        files
## 1 single(id == "20241023-074719-bbb08e35" && name == "incoming") data.csv....

Part 3: Collaboration

Working with other people

Where do you store your code?
Where do you store your data?
Where do you store your outputs?
How will things change over time?
Is it sensitive?

The setup

Here we ignore the git side for now and focus on sharing outputs

getwd()
## [1] "/home/runner/work/orderly-tutorial/orderly-tutorial"
orderly_init("workdir/part3")
## ✔ Created orderly root at '/home/runner/work/orderly-tutorial/orderly-tutorial/workdir/part3'
## ✔ Wrote '.gitignore'

Thom Rawson has kindly set up a bunch of case data for us to use. He’s put it in an orderly root that we can use as an orderly location.

The sausage factory

We’ll hide this in the final build

local({
  path_upstream <- "workdir/part3-upstream"
  orderly_init(path_upstream)
  orderly_init(path_upstream)
  orderly_new("cases", root = path_upstream)
  fs::file_copy("inputs/part3/cases.R",
                file.path(path_upstream, "src/cases/cases.R"),
                overwrite = TRUE)
  re <- "^(.+)-2020\\.csv$"
  files <- dir("inputs/part3/cases", re)
  regions <- sub(re, "\\1", files)
  names(files) <- regions
  for (region in regions) {
    fs::file_copy(file.path("inputs/part3/cases", files[[region]]),
                  file.path(path_upstream, "src/cases/cases.csv"),
                  overwrite = TRUE)
    orderly_run("cases", list(region = region, year = 2020), root = path_upstream)
  }
})
## ✔ Created orderly root at '/home/runner/work/orderly-tutorial/orderly-tutorial/workdir/part3-upstream'
## ✔ Wrote '.gitignore'
## ✔ Created 'src/cases/cases.R'
## ℹ Starting packet 'cases' `20241023-074720-562a1556` at 2024-10-23 07:47:20.347022
## ℹ Parameters:
## • region: east_of_england
## • year: 2020
## > orderly_parameters(region = NULL, year = NULL)
## > orderly_resource(files = "cases.csv")
## > orderly_artefact(files = "cases.csv", description = "Case data")
## ✔ Finished running 'cases.R'
## ℹ Finished 20241023-074720-562a1556 at 2024-10-23 07:47:20.38153 (0.03450727 secs)
## ℹ Starting packet 'cases' `20241023-074720-67c51964` at 2024-10-23 07:47:20.411574
## ℹ Parameters:
## • region: london
## • year: 2020
## > orderly_parameters(region = NULL, year = NULL)
## > orderly_resource(files = "cases.csv")
## > orderly_artefact(files = "cases.csv", description = "Case data")
## ✔ Finished running 'cases.R'
## ℹ Finished 20241023-074720-67c51964 at 2024-10-23 07:47:20.443204 (0.03163028 secs)
## ℹ Starting packet 'cases' `20241023-074720-79050123` at 2024-10-23 07:47:20.479728
## ℹ Parameters:
## • region: midlands
## • year: 2020
## > orderly_parameters(region = NULL, year = NULL)
## > orderly_resource(files = "cases.csv")
## > orderly_artefact(files = "cases.csv", description = "Case data")
## ✔ Finished running 'cases.R'
## ℹ Finished 20241023-074720-79050123 at 2024-10-23 07:47:20.511881 (0.03215313 secs)
## ℹ Starting packet 'cases' `20241023-074720-89dc11be` at 2024-10-23 07:47:20.544771
## ℹ Parameters:
## • region: north_east_and_yorkshire
## • year: 2020
## > orderly_parameters(region = NULL, year = NULL)
## > orderly_resource(files = "cases.csv")
## > orderly_artefact(files = "cases.csv", description = "Case data")
## ✔ Finished running 'cases.R'
## ℹ Finished 20241023-074720-89dc11be at 2024-10-23 07:47:20.580047 (0.0352757 secs)
## ℹ Starting packet 'cases' `20241023-074720-9af20659` at 2024-10-23 07:47:20.611648
## ℹ Parameters:
## • region: north_west
## • year: 2020
## > orderly_parameters(region = NULL, year = NULL)
## > orderly_resource(files = "cases.csv")
## > orderly_artefact(files = "cases.csv", description = "Case data")
## ✔ Finished running 'cases.R'
## ℹ Finished 20241023-074720-9af20659 at 2024-10-23 07:47:20.643338 (0.03169012 secs)
## ℹ Starting packet 'cases' `20241023-074720-ab86950a` at 2024-10-23 07:47:20.676958
## ℹ Parameters:
## • region: south_east
## • year: 2020
## > orderly_parameters(region = NULL, year = NULL)
## > orderly_resource(files = "cases.csv")
## > orderly_artefact(files = "cases.csv", description = "Case data")
## ✔ Finished running 'cases.R'
## ℹ Finished 20241023-074720-ab86950a at 2024-10-23 07:47:20.715482 (0.03852415 secs)
## ℹ Starting packet 'cases' `20241023-074720-bde9b16f` at 2024-10-23 07:47:20.748124
## ℹ Parameters:
## • region: south_west
## • year: 2020
## > orderly_parameters(region = NULL, year = NULL)
## > orderly_resource(files = "cases.csv")
## > orderly_artefact(files = "cases.csv", description = "Case data")
## ✔ Finished running 'cases.R'
## ℹ Finished 20241023-074720-bde9b16f at 2024-10-23 07:47:20.778946 (0.03082228 secs)

Adding a location

orderly_location_add_path("thom", fs::path_abs("workdir/part3-upstream"))

Here we have used the path location type
We will try this with the packit type in the workshop
Locations are just another orderly root where you can find packets
We use fs::path_abs() for now, for uninteresting reasons

What has Thom been up to?

orderly_location_pull_metadata()
## ℹ Fetching metadata from 1 location: 'thom'
## ✔ Found 7 packets at 'thom', of which 7 are new
orderly_metadata_extract(location = "thom")
##                         id  name   parameters
## 1 20241023-074720-562a1556 cases east_of_....
## 2 20241023-074720-67c51964 cases london, 2020
## 3 20241023-074720-79050123 cases midlands....
## 4 20241023-074720-89dc11be cases north_ea....
## 5 20241023-074720-9af20659 cases north_we....
## 6 20241023-074720-ab86950a cases south_ea....
## 7 20241023-074720-bde9b16f cases south_we....

Slightly easier to read, but harder to write

orderly_metadata_extract(
  location = "thom",
  extract = c("name",
              region = "parameters.region is string",
              year = "parameters.year is number"))
##                         id  name                   region year
## 1 20241023-074720-562a1556 cases          east_of_england 2020
## 2 20241023-074720-67c51964 cases                   london 2020
## 3 20241023-074720-79050123 cases                 midlands 2020
## 4 20241023-074720-89dc11be cases north_east_and_yorkshire 2020
## 5 20241023-074720-9af20659 cases               north_west 2020
## 6 20241023-074720-ab86950a cases               south_east 2020
## 7 20241023-074720-bde9b16f cases               south_west 2020

Depending on this

orderly_new("analysis")
## ✔ Created 'src/analysis/analysis.R'

And code:

orderly_dependency("cases",
                   'latest(parameter:region == "london")',
                   c("london.csv" = "cases.csv"))
d <- read.csv("london.csv")
png("london.png")
plot(Week_Cases ~ Week, d)
dev.off()

Query syntax

orderly_search("latest", name = "cases", location = "thom")
## [1] "20241023-074720-bde9b16f"
orderly_search("latest(parameter:region == 'east_of_england')", name = "cases",
               location = "thom")
## [1] "20241023-074720-562a1556"
orderly_search(
  "latest(parameter:region == 'london' && parameter:year == 2020)",
  name = "cases",
  location = "thom")
## [1] "20241023-074720-67c51964"

You can use this elsewhere:

orderly_metadata_extract("parameter:region == 'london'", location = "thom")
##                         id  name   parameters
## 1 20241023-074720-67c51964 cases london, 2020

The files argument

    c("london.csv" = "cases.csv")

“Save the file cases.csv as london.csv in the working version”

Run the new report

orderly_run("analysis")
## ℹ Starting packet 'analysis' `20241023-074720-f4fe1ce6` at 2024-10-23 07:47:20.961586
## > orderly_dependency("cases",
## +                    'latest(parameter:region == "london")',
## +                    c("london.csv" = "cases.csv"))
## ✖ Error running 'analysis.R'
## ℹ Finished 20241023-074720-f4fe1ce6 at 2024-10-23 07:47:21.005248 (0.04366183 secs)
## Error in `orderly_run()`:
## ! Failed to run report
## Caused by error in `outpack_packet_use_dependency()`:
## ! Failed to find packet for query 'latest(parameter:region == "london"
##   && name == "cases")'
## ℹ See 'rlang::last_error()$explanation' for details

oh no

Using packets from elsewhere

Two options

Pull the packet to make it local (orderly_location_pull_packet)
Tell orderly which locations to use

Running with location

id <- orderly_run("analysis", location = "thom") # or allow_remote = TRUE
## ℹ Starting packet 'analysis' `20241023-074721-0f28013a` at 2024-10-23 07:47:21.063695
## > orderly_dependency("cases",
## +                    'latest(parameter:region == "london")',
## +                    c("london.csv" = "cases.csv"))
## ℹ Looking for suitable files already on disk
## ℹ Need to fetch 1 file (669.9 kB) from 1 location
## ℹ Depending on cases @ `20241023-074720-67c51964` (via latest(parameter:region == "london" && name == "cases"))
## > d <- read.csv("london.csv")
## > png("london.png")
## > plot(Week_Cases ~ Week, d)
## > dev.off()
## png 
##   2
## ✔ Finished running 'analysis.R'
## ℹ Finished 20241023-074721-0f28013a at 2024-10-23 07:47:21.178957 (0.115262 secs)

Where dependencies are resolved is a property of orderly_run(), not the source of the report
Files are fetched as we run - only used files are copied

The result

Include many dependencies

Thom has new data for us!

orderly_location_pull_metadata()
## ℹ Fetching metadata from 1 location: 'thom'
## ✔ Found 8 packets at 'thom', of which 1 is new

orderly_metadata_extract(
  location = "thom",
  extract = c("name",
              region = "parameters.region is string",
              year = "parameters.year is number"))
##                         id  name                   region year
## 1 20241023-074720-562a1556 cases          east_of_england 2020
## 2 20241023-074720-67c51964 cases                   london 2020
## 3 20241023-074720-79050123 cases                 midlands 2020
## 4 20241023-074720-89dc11be cases north_east_and_yorkshire 2020
## 5 20241023-074720-9af20659 cases               north_west 2020
## 6 20241023-074720-ab86950a cases               south_east 2020
## 7 20241023-074720-bde9b16f cases               south_west 2020
## 8 20241023-074721-3ced1c27 cases                   london 2021

Run our analysis with this data

This time we try pulling

orderly_location_pull_packet("latest", name = "cases", location = "thom")
## ℹ Looking for suitable files already on disk
## ✔ Found 1 file on disk
## ℹ Need to fetch 1 file (150 B) from 1 location
id <- orderly_run("analysis")
## ℹ Starting packet 'analysis' `20241023-074721-672ddcb4` at 2024-10-23 07:47:21.407465
## > orderly_dependency("cases",
## +                    'latest(parameter:region == "london")',
## +                    c("london.csv" = "cases.csv"))
## ℹ Depending on cases @ `20241023-074721-3ced1c27` (via latest(parameter:region == "london" && name == "cases"))
## > d <- read.csv("london.csv")
## > png("london.png")
## > plot(Week_Cases ~ Week, d)
## > dev.off()
## png 
##   2
## ✔ Finished running 'analysis.R'
## ℹ Finished 20241023-074721-672ddcb4 at 2024-10-23 07:47:21.491953 (0.0844872 secs)

Some thoughts

Documentation

The orderly reference

Details for writing reports/tasks

Shared resources
Resources
Loops over dependencies
Metadata

What is saved?

Information about files you consumed, and produced

Ways of collaborating

Assorted issues

Misc

How to get data into orderly in the first place
- git versioned files
- git ignored files
- files from canonical locations
- databases
Coping with failure
Running knitr/rmarkdown
Getting files out of orderly

The right number of packets

Similar to “how big is a git repo”
Some issues:
- People fragmenting packets to overcome flakey analysis
- Millions of packets, leading to complex and slow queries
- Hard to get the right combination of packets

The right number of parameters

Too few is too inflexible
Too many

Interaction with git

Don’t save outputs
Save some inputs?
Don’t save secrets
Don’t save locations

Orderly Tutorial

What is orderly?

Original aims:

But what is it?

Who uses it?

Historical notes

Part 1: Getting started

Install orderly2

My first orderly report / task

The setup

What’s in the box?

Create an empty report

Hello orderly world

Files created

Every packet has a unique id

What is a hash?

Run it again, Sam

A copy saved every time we run

A high-level overview of of packets:

Part 2: orderly code

What can I do?

A clean beginning

Incoming data

Incoming data, cleaned

Incoming data, running things

“Resources”

Telling orderly about resources

“Artefacts”

Telling orderly about artefacts

More metadata

Dependencies

Setting up a dependency

Running the report

The aftermath

Some comments on this

Using specific versions

Running this

Part 3: Collaboration

Working with other people

The setup

The sausage factory

Adding a location

What has Thom been up to?

Slightly easier to read, but harder to write

Depending on this

Query syntax

The files argument

Run the new report

Using packets from elsewhere

Running with location

The result

Include many dependencies

Thom has new data for us!

Run our analysis with this data

Some thoughts

Documentation

Details for writing reports/tasks

What is saved?

Ways of collaborating

Assorted issues

Misc

The right number of packets

The right number of parameters

Interaction with git

Install `orderly2`