1 How to read and use this guide

This is intended as a quick reference guide to the R you will need for Infectious Disease Modelling, starting from the basics, and noting common sources of error along the way. If you want to go further with R or programming in general, there is a lot of information on the internet: see section 10 for some suggestions.

  • R code within text looks like this: x <- 7.
  • Functions will be referred to like this: function() (with parentheses after the function name)
  • You can copy and paste code from here into your own R script to run it. Where code blocks in this document include example function output, the output lines start with #> (or with ## in some blocks).

2 Things to know

No-one memorises everything! If you know how to look up what you want to do, and understand what you find, you’re most of the way there.

2.1 When writing your own code

Put # comments everywhere to annotate your code - for your benefit while learning and for referring back to after a period of time. (Also, for other people who might use your code.)

Aim for consistent formatting and naming:

  • R doesn’t enforce much formatting, but code that isn’t indented where appropriate is difficult to read, and it is even harder to find problems in it or make changes.
  • Variable names: keep in mind that R is case-sensitive. snake_case (words separated by underscores) is often recommended for variable names; names can’t contain spaces, and separating words with dots can cause confusion because dots are used in other contexts in R. Meaningful variable names help you follow your work.
  • For a style guide, see https://style.tidyverse.org/index.html

There is usually more than one way of programming something.

2.2 Not covered here

Many concepts in programming and R are not covered (or not in any detail): of note,

2.3 Computers are logical; languages can be weird

Like human languages, programming languages have similarities and differences; you may have prior knowledge about how other languages do things - or you might find this out in the future. In general, if you think a language has some strange syntax or weird function names, or behaves oddly, remember that this might be due to:

  • historical reasons - a language evolves as people develop it, not always totally logically
  • speed optimisation
  • storage optimisation (storage used to be expensive!)
  • other implementation reasons

R was developed for statistics; some aspects of R can be traced to this history.

So whether or not you already know other languages, keep an open mind when it comes to ‘idiosyncrasies’, and be aware that other programming languages do differ!


3 Get started

3.1 On your computer: R and RStudio

If you can and want to work on your own computer, and become familiar with a common R user setup:

Install R from https://cran.r-project.org/ (searching for “R download” should find this).

Install RStudio (free version) from https://www.rstudio.com/products/rstudio/download/.

RStudio is an Integrated Development Environment (IDE) for R. IDEs add many useful features to your programming experience, such as panels showing history, help and plots.
See http://ncss-tech.github.io/stats_for_soil_survey/chapters/1_introduction/1_introduction.html#3_rstudio:_an_integrated_development_environment_(ide)_for_r for an illustration. An IDE has more features than a good text editor such as Notepad++, which will at least do things such as syntax highlighting (colouring your code).

Get started after installation:

  • open RStudio and open a new R script
  • decide on a working directory, especially if you will load data from files, and/or have multiple scripts
    • it’s good to keep files associated with your work in one place
    • type setwd("[insert path here]")
setwd("C:/mydocs/mynewproject") # in Windows, use / for a separator
setwd("~/mydocs/mynewproject") # linux/ mac

You’ll mostly type or copy into your script, like writing a text file; and then run (all or some of it). Output will appear in the console, plots in the plot panel, and as you create variables, they’ll appear in the environment panel.

If you’re re-opening RStudio, having worked in it before: RStudio can save your open scripts and the contents of your environment, which can help if it was closed unexpectedly.
However, you should not rely on anything in particular being saved.
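
A minimal sketch of checking the working directory and building a file path relative to it. The folder and file names used here (data, cases.csv) are hypothetical and only for illustration; nothing needs to exist at that path for the code to run.

```r
# Where is R currently looking for files?
current_dir <- getwd()
print(current_dir)

# Build a path relative to the working directory.
# "data" and "cases.csv" are hypothetical names. file.path() only
# builds the text of the path; it does not create or check the file.
cases_path <- file.path("data", "cases.csv")
print(cases_path)

# file.exists() tells you whether there really is a file at that path
file.exists(cases_path)
```

Building paths relative to the working directory means a script keeps working if the whole project folder is moved.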

4 Basics / refresher: comments, variable assignment, function calls

# text starting with # is a comment (it doesn't get run)

x <- 10 # 'assigns' the value 10 to a name, x
# read this as "x gets 10"
x # if you just want to see what is assigned to a name, you can run a line of code that just has the name of the variable

# once you have created an object, you can refer to it by name in later code
y <- 2 * x

# R doesn't tell you if you've already used a name, it just reassigns
x <- 20 # x is now 20

a <- x # a is given the same value as x
x <- 50 # now x is 50. a keeps the value it had before, rather than changing when you change x
# not all languages work this way!

# R is case sensitive:
X <- 51 # capital X, not the same as x

## Functions
# to call a function, use the function name, with the arguments - inputs - in parentheses
print(x)  # this returns the contents of x to the screen

5 Data Types

5.1 Atomic types, and creating atomic vectors

R has 6 ‘atomic’ - fundamental - types: logical, integer, double, complex, character, and raw (for binary data). Integer and double are together referred to as numeric. They can be the building blocks for other types.

The type factor (you may remember this from the statistics module) is special.

The simplest data structures are atomic vectors (usually just called vectors). All elements in an atomic vector must be of one type. The code below uses the function c() to create vectors of the 4 common atomic types.

# the function "c()" combines what you provide it. By default, it makes a vector.

# logical vector
infected         <- c(TRUE, FALSE, TRUE, TRUE, FALSE) # sometimes abbreviated to T and F
# Often useful as the output of a function:
# i.e. test whether something is TRUE or FALSE and then use that to determine what happens somewhere else

# character vector
parameters       <- c("mu", "gamma", "beta") # character elements are always within quotes

# numeric: double (short for: double precision floating point)
min_temperature  <- c(1.51, 2.7, 3.1, 0.05, -2, 5)

# numeric: integer? 
number_of_eggs   <- c(2, 4, 6, 8, 5) # doesn't actually make an integer vector
typeof(number_of_eggs)
#> [1] "double" 
# Not an integer. To make an integer vector, either:
number_of_eggs   <- as.integer(number_of_eggs) # converts your vector
number_of_eggs   <- c(2L, 4L) 
# L specifies the numbers are integers. We don't use "I" for integer - too close to a "1" and "i" is used in complex numbers.

number_of_eggs <- c(2, 4, 6, 8, 5) did not create an integer vector, even though all the elements were given as integers. R made it a double.

Many languages require you to define your data type when you create (or ‘declare’) a variable. R does not. When you don’t specify a type, R decides what it is. This can make getting started quick, but can hide issues if you make a mistake or if you don’t know what R automatically decides.

#What does R do with this?
stuff <- c(3, 4, TRUE, "cat")
stuff
#> [1] "3"    "4"    "TRUE" "cat" 
# R has made everything into character elements

5.1.1 Coercion

A function often expects certain data types and structures as inputs. An atomic vector can hold only one type, so when c() is given a mixture of types, R converts everything to the first type in this hierarchy (from least to most flexible) that can represent all the inputs: logical < integer < double < character. Similarly, if you pass a variable to a function which is expecting a different type, R will try to coerce what you give it into what it needs.

#The function "paste()" expects character inputs.
paste(number_of_eggs)
#> [1] "2" "4" # the output here is a character vector; you can tell because the elements are in quotes
 
# Try taking the mean() of a character vector. Does R coerce to anything?
mean(parameters)
#> Warning in mean.default(parameters): argument is not numeric or logical:
#> returning NA
#> [1] NA
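
A short sketch of how to check what coercion has done, using typeof(). The vectors are made-up examples; suppressWarnings() is used only so the last line runs quietly.

```r
# typeof() reports the type R settled on after c() combined the inputs
typeof(c(TRUE, FALSE))            # only logicals, so it stays logical
typeof(c(TRUE, 2L))               # logical and integer together: integer
typeof(c(TRUE, 2L, 3.5))          # add a double: double
typeof(c(TRUE, 2L, 3.5, "cat"))   # add a character: character

# Logicals coerce to numbers as TRUE = 1 and FALSE = 0,
# so sum() counts how many elements are TRUE
infected   <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
n_infected <- sum(infected)
n_infected

# You can also coerce explicitly with the as.*() functions
as.numeric(c("1.5", "2"))

# Text that is not a number cannot be coerced: the result is NA
# (R also gives a warning, silenced here)
not_a_number <- suppressWarnings(as.numeric("cat"))
not_a_number
```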

5.1.2 Missing values: NA

Feeding a character vector to mean() returned NA. NA represents missing values in R (similar to NULL in SQL). When a function cannot use the input given, it might return NA. If you need to show that data values are missing (e.g. one measurement wasn’t available or applicable for a particular patient), you should include an NA in the relevant place.

It is important to distinguish between “missing” and “blank”. Missing values often have to be specially considered, or removed.
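
A minimal sketch of working with missing values. The viral_load values are hypothetical; is.na() and the na.rm argument of mean() are standard base R.

```r
# NA marks a missing measurement; here the third patient has no result
viral_load <- c(120, 85, NA, 240)   # hypothetical values

is.na(viral_load)                   # which elements are missing?
n_missing <- sum(is.na(viral_load)) # how many are missing?
n_missing

# One missing value makes the mean unknown, so the answer is NA
mean_with_na <- mean(viral_load)
mean_with_na

# na.rm = TRUE removes the NAs first, then takes the mean
mean_without_na <- mean(viral_load, na.rm = TRUE)
mean_without_na

# NA means "unknown", so comparing with == also gives NA; use is.na() instead
NA == NA

# A blank string is a value, not a missing value
is.na("")
```

Removing missing values is a decision about your data, so it is worth recording in a comment why you did it.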

5.1.3 Attributes

Attributes are “metadata”. Two important attributes:

  • names: if given, can be useful for referring to elements within structures.
  • dimensions: giving a vector a dimensions attribute turns it into an array. An array with 2 dimensions is a matrix.
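
A small sketch of both attributes in use. The values in rates and counts are hypothetical.

```r
# names: label the elements, then select by name
rates <- c(0.02, 0.1, 0.3)                # hypothetical values
names(rates) <- c("mu", "gamma", "beta")
rates["gamma"]

# dimensions: giving a vector a dim attribute turns it into a matrix
counts <- 1:6
dim(counts) <- c(2, 3)    # 2 rows, 3 columns, filled column by column
counts
is.matrix(counts)

# attributes() lists the attributes an object has
attributes(rates)
attributes(counts)
```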

6 Data Structures

6.1 Vectors and matrices

Atomic vectors (all one type)

6.1.1 Selecting elements: []

Square brackets select elements:

min_temperature  <- c(1.51, 2.7, 3.14, 0.05, -2, 5)
parameters       <- c("mu", "gamma", "beta")

min_temperature[2] # returns the second element. 
## [1] 2.7
min_temperature[1] <- 1.4 # select and replace: the first element is now 1.4

# Many other languages count from 0, so the second element would be [1].

# you can select more than one element, using a vector or a sequence within your square brackets:
parameters[c(2,3)]
## [1] "gamma" "beta"
# 2 dimensions: 
# to select from a matrix, you can use a single number, 
# which will count down each column in turn
# or like this [i, j]
# to select the item at the i'th row, j'th column

my_mat <- matrix(c(1:9), ncol = 3) # c(1:9) - the colon is shorthand for creating a sequence
my_mat[2, 3]
## [1] 8
my_mat[8]
## [1] 8
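
A few more ways of selecting, as a sketch. Selecting with a sequence and leaving i or j blank follow directly from the notation above; negative and logical indices go slightly beyond what this guide covers, but are standard base R.

```r
min_temperature <- c(1.51, 2.7, 3.14, 0.05, -2, 5)

min_temperature[2:4]                   # a sequence: elements 2, 3 and 4
min_temperature[-1]                    # a negative index drops that element
min_temperature[min_temperature < 0]   # a logical test keeps elements where it is TRUE

my_mat <- matrix(1:9, ncol = 3)
my_mat[2, ]    # leave j blank: the whole second row
my_mat[, 3]    # leave i blank: the whole third column
```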

6.2 Lists

Sometimes called generic vectors, distinguishing them from atomic vectors.

The elements of a list contain other data structures. You can have different structures and different types in one list: vectors, matrices, other lists, functions…

6.2.1 Selecting from a list: [[]] and [], and $name

my_list <- list(infected, parameters)

my_list[1] # gets you a slice of the list (it's called a slice in some other languages too)
## [[1]]
## [1]  TRUE FALSE  TRUE  TRUE FALSE
# it returns you part of your original list, as a list. Clearer if you select more than one element:
my_list[c(1,2)]
## [[1]]
## [1]  TRUE FALSE  TRUE  TRUE FALSE
## 
## [[2]]
## [1] "mu"    "gamma" "beta"
my_list[[1]] # gets out the element itself. 
## [1]  TRUE FALSE  TRUE  TRUE FALSE
# It's like the double bracket selects the slice and then the element out of that slice.

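my_list[1][1] # a slice of a slice: still a list holding the first element, so the output is the same as my_list[1]
## [[1]]
## [1]  TRUE FALSE  TRUE  TRUE FALSE
my_list[[c(1,2)]] # a vector inside [[ ]] selects recursively: the second element of the first element
## [1] FALSE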
# You can give the elements in the list names when you create it:
my_list <- list(status = infected, params = parameters) # each element is given the values from the vectors, and the names "status" and "params".

my_list
## $status
## [1]  TRUE FALSE  TRUE  TRUE FALSE
## 
## $params
## [1] "mu"    "gamma" "beta"
# then you can select elements using their names and the $ sign
my_list$params
## [1] "mu"    "gamma" "beta"
my_list$params[1]
## [1] "mu"

See http://www.r-tutor.com/r-introduction/list for a basic introduction.

6.2.2 Dataframes: a special case of a list

Given that R has been developed for scientists, imagine a structure like a results table for recording outcomes of your experiments.

Every element in a dataframe holds a vector of the same length. These vectors can be different types. You should be able to treat a column like a vector, and the whole dataframe like a list. You can select using the [i, j] notation that works for matrices as well as the ways of selecting from a list.

observations <- data.frame(patientID = c("A", "B", "C"), 
                           test1 = c(27, 40, 48), 
                           test2 = c(28, 25, 50),
                           stringsAsFactors = FALSE) # this last argument means the character inputs - strings - are not treated as factors (this is the default from R 4.0.0 onwards)
head(observations) # have a look - at the first 6 rows by default
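
A sketch of the different ways of selecting from a dataframe described above, using the same observations dataframe. The change column is a hypothetical addition for illustration.

```r
observations <- data.frame(patientID = c("A", "B", "C"),
                           test1 = c(27, 40, 48),
                           test2 = c(28, 25, 50),
                           stringsAsFactors = FALSE)

# list-style selection: a column comes out as a vector
observations$test1
observations[["test2"]]

# matrix-style selection: [row, column]
observations[2, 3]          # row 2, column 3
observations[2, ]           # all of row 2 (still a dataframe)
observations[, "test1"]     # a column, selected by name

# treat a column like a vector
mean_test1 <- mean(observations$test1)
mean_test1

# add a new column calculated from existing ones
observations$change <- observations$test2 - observations$test1
observations
```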

Note: Another commonly used structure for storing data is the tibble. Further information on tibbles can be found at https://r4ds.had.co.nz/tibbles.html.

6.3 Data formats: Wide and long

When you imagine a “results table” you might think of a table like the observations dataframe:

observations

This is “wide” format. Lots of situations require “long” format. There’s a useful pair of functions that helps here:

# install.packages("reshape2") # the package name must be in quotes
require(reshape2)
## Loading required package: reshape2
melt(observations, id.vars = c("patientID"))

The ultimate shape you want to get your data into will depend on what you are doing with it.


7 Functions

A function takes in inputs, called arguments, and does “something” with them. For example, c() combines what you give it into a vector. Functions generally expect particular types of input; the arguments you give might be coerced to that type or structure if possible, or you might get an error.

Generally, you’ll want to run a function and store the output - i.e. assign it to a name so you can use it somewhere else.

mean(c(1,2,3,4,5))            # the argument here is a numeric vector - created within the parentheses
my_mean <- mean(min_temperature)     # you can just pass in the name of a variable, as long as it exists in your environment.

# some functions look a bit different, e.g. +
1 + 2 # here, you are applying the function '+', to the arguments 1 and 2. For some functions, this is intuitive syntax.

7.1 Vectorisation

Many functions can take a vector as input. Either the function is applied using the whole vector, such as the mean() example, or functions can be applied element-by-element. https://bookdown.org/rdpeng/rprogdatascience/vectorized-operations.html shows a couple of examples.
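
A minimal sketch of the two kinds of behaviour, using made-up numbers.

```r
temps_f <- c(32, 50, 212)            # hypothetical temperatures in Fahrenheit

# element-by-element: the arithmetic is applied to every element in turn
temps_c <- (temps_f - 32) * 5 / 9
temps_c

# two vectors of the same length are combined element by element
test1 <- c(27, 40, 48)
test2 <- c(28, 25, 50)
change <- test2 - test1
change

# whole-vector: one answer summarising all the elements
mean_f <- mean(temps_f)
mean_f

# comparisons are vectorised too, and give a logical vector
temps_f > 100
```

Because of vectorisation, you often don't need to write a loop in R to do the same thing to every element.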

7.2 Reading the help file for a function

?functionname retrieves the help file. Help files are all structured similarly. The Arguments section tells you the arguments the function expects. Many help files have a Value section, which describes what the function returns.

Arguments of a function can themselves be functions. The function apply() takes in a data structure and a function, and applies the function to the relevant divisions of the data structure (for example, to each row of a matrix in turn).
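
A short sketch of apply() on a small matrix. The second argument (called MARGIN in the help file) says which divisions to use: 1 for rows, 2 for columns.

```r
my_mat <- matrix(1:9, ncol = 3)
my_mat

# the function sum is passed in as an argument, and applied to each row
row_totals <- apply(my_mat, 1, sum)
row_totals

# the function mean, applied to each column
col_means <- apply(my_mat, 2, mean)
col_means

# you can also pass in a function that you write on the spot
row_ranges <- apply(my_mat, 1, function(r) max(r) - min(r))
row_ranges
```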

7.3 Make your own function

If a function that you require does not exist, you can also write your own function.

# Make a function which takes in a character vector, and says Hi.

greetings <- function(x) { # function takes in 1 argument
                           # the argument gets assigned the name x within the environment of the function

    hello <- paste("Hi", x, "!")
    return(hello)  

}

greetings("Doctor")
greetings(c("Doctor", "Professor")) # Vectorised
greetings(1) # coerces the input to character.

# make a function which requires a numeric argument:

f_to_c <- function(f) { # function takes in 1 argument

  celsius <- (f - 32) * (5 / 9) # named celsius rather than c, to avoid confusion with the function c()
  return(celsius)
}

f_to_c(90)
# f_to_c("ninety") # would it work?
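
Passing a character string to a function like this stops with an error, because R cannot subtract 32 from text. One possible way of handling this, sketched below, is to check the input yourself and give a clearer message. The name f_to_c_checked is hypothetical, and this is only one of several ways to check inputs.

```r
f_to_c_checked <- function(f) {
  # stop with a clear message if the input is not numeric
  if (!is.numeric(f)) {
    stop("f must be numeric, for example f_to_c_checked(90)")
  }
  celsius <- (f - 32) * (5 / 9)
  return(celsius)
}

boiling <- f_to_c_checked(212)
boiling

# tryCatch() catches the error, so the script carries on;
# here we keep the error message as a character string
result <- tryCatch(f_to_c_checked("ninety"),
                   error = function(e) conditionMessage(e))
result
```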

8 Base R and packages

Base R comes with the default installation. A package is a collection of functions (and sometimes data), developed for particular uses. You can think of the R “library” as containing many packages.

install.packages(): installs packages
library() takes a package name and loads it, if installed. Then you can directly use the functions in it.
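
A sketch of the usual pattern. The stats package is used for the runnable lines because it comes with R, so nothing has to be installed; the lines for a package that needs installing are left as comments.

```r
# requireNamespace() checks whether a package is installed, without loading it
pkg <- "stats"   # comes with R, so this runs without installing anything
is_installed <- requireNamespace(pkg, quietly = TRUE)
is_installed

# package::function() uses one function without loading the whole package
middle <- stats::median(c(1, 5, 9))
middle

# library() loads a package for the rest of the session,
# after which you can use its functions directly
library(stats)
median(c(1, 5, 9))

# For a package that does not come with R, the pattern would be (not run here):
# install.packages("reshape2")   # once per computer; the name is in quotes
# library(reshape2)              # once per R session
```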


9 Plotting

Base R’s plot() function is versatile; you can often get a quick look at your data easily. plot() has many methods, and R chooses between them depending on the data you give it, to produce a suitable plot. There are also base functions such as boxplot() which can be quick and easy.

9.1 ggplot

A powerful system for visualising data.

  • makes you think about the structure and the meaning of your data in order to visualise it
  • the syntax is very different to base plotting functions

You create a ggplot object, and then add layers with whatever aesthetics you require.

To learn about ggplot, go here: http://r-statistics.co/ggplot2-Tutorial-With-R.html - a nice tutorial, which we will not simply rewrite here! A note: early on, that tutorial says to state what dataframe to use first. This is not necessary; each layer of a ggplot can be drawn from different dataframes, using a data = argument. But you should think carefully about your data if you are doing this!

require(ggplot2)
## Loading required package: ggplot2
# have a quick look at what these do.

# examples of basic, base plots
plot(observations$test1, observations$test2)

boxplot(observations$test1, observations$test2)

# ggplot
patient_plot <- ggplot(data = observations)
patient_plot <- patient_plot + geom_point(aes(x = test1, y = test2))
patient_plot
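
As a sketch of the point about each layer drawing from its own dataframe: the reference dataframe below is hypothetical, invented only to give the second layer something to draw. The plotting code runs only if ggplot2 is installed.

```r
observations <- data.frame(patientID = c("A", "B", "C"),
                           test1 = c(27, 40, 48),
                           test2 = c(28, 25, 50),
                           stringsAsFactors = FALSE)

# a second, hypothetical dataframe: a line along which test1 equals test2
reference <- data.frame(test1 = c(25, 50), test2 = c(25, 50))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # no dataframe is given to ggplot(); each layer states its own data
  layered_plot <- ggplot() +
    geom_point(data = observations, aes(x = test1, y = test2)) +
    geom_line(data = reference, aes(x = test1, y = test2), linetype = "dashed") +
    labs(x = "Test 1", y = "Test 2")

  print(layered_plot)
}
```

Points above the dashed line are patients whose second test result was higher than their first.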

9.2 tidyverse

To learn about tidyverse, go here: https://www.linkedin.com/learning/learning-the-r-tidyverse/what-is-the-tidyverse - a nice tutorial, which we will not simply rewrite here!


10 Finding general help and resources

Within R:

  • ?functionname to get the help file for a function
    • parts of the help file also show up automatically in RStudio when you type functions
  • When you get error messages - search for them on the internet if you can’t work out what’s wrong
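
A few other base R functions that can help when you are looking something up, as a sketch. sd() and mean() are used only as examples.

```r
# args() shows the arguments of a function without opening the help file
args(sd)

# the argument names on their own
arg_names <- names(formals(sd))
arg_names

# apropos() lists the names of available functions and objects
# that contain the text you give it
mean_like <- apropos("mean")
head(mean_like)

# example() runs the examples from the bottom of a help file
example(mean)
```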

On the internet:

Bear in mind that R, and even more so certain packages, change fast! Often, someone else will be writing something to solve an issue that you’ve been having, and the next time you look for help, you might find something new. This also means when you search the internet for help, you should check whether what you find is still valid.