This is intended as a quick reference guide to the R you will need for Infectious Disease Modelling, starting from the basics, and noting common sources of error along the way. If you want to go further with R or programming in general, there is a lot of information on the internet: see section 10 for some suggestions.
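Conventions used in this guide: code is shown like x <- 7, function names are written as function() (with parentheses after the function name), and lines starting with #> show output.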
No-one memorises everything! If you know how to look up what you want to do, and understand what you find, you’re most of the way there.
Put # comments everywhere to annotate your code, both for your own benefit while learning and for referring back to after a period of time. (Also, for other people who might use your code.)
Aim for consistent formatting and naming:
There is usually more than one way of programming something.
Many concepts in programming and R are not covered (or not in any detail): of note,
Like human languages, programming languages have similarities and differences; you may have prior knowledge about how other languages do things - or you might find this out in the future. In general, if you think a language has some strange syntax or weird function names, or behaves oddly, remember that this might be due to:
R was developed for statistics; some aspects of R can be traced to this history.
So whether or not you already know other languages, keep an open mind when it comes to ‘idiosyncrasies’, and be aware that other programming languages do differ!
If you can and want to work on your own computer, and become familiar with a common R user setup:
Install R from https://cran.r-project.org/ (searching for "R download" should find this).
Install RStudio (free version) from https://www.rstudio.com/products/rstudio/download/.
RStudio is an Integrated Development Environment (IDE) for R. IDEs add many useful features to your programming experience, such as panels showing history, help, plots, etc.
See http://ncss-tech.github.io/stats_for_soil_survey/chapters/1_introduction/1_introduction.html#3_rstudio:_an_integrated_development_environment_(ide)_for_r for an illustration. An IDE has more features than a good text editor such as Notepad++, which will at least do things such as syntax highlighting (colouring your code).
Get started after installation:
setwd("[insert path here]")
setwd("C:/mydocs/mynewproject") # in Windows, use / for a separator
setwd("~/mydocs/mynewproject") # linux/ mac
You’ll mostly type or copy into your script, like writing a text file; and then run (all or some of it). Output will appear in the console, plots in the plot panel, and as you create variables, they’ll appear in the environment panel.
If you’re re-opening RStudio, having worked in it before: RStudio can save your open scripts and the contents of your environment, which can help if it was closed unexpectedly.
Beware that you should not rely on anything specifically being saved.
# text starting with # is a comment (it doesn't get run)
x <- 10 # 'assigns' the value 10 to a name, x
# read this as "x gets 10"
x # if you just want to see what is assigned to a name, you can run a line of code that just has the name of the variable
# once you have created an object, you can refer to it by name in later code
y <- 2 * x
# R doesn't tell you if you've already used a name, it just reassigns
x <- 20 # x is now 20
a <- x # a is given the same value as x
x <- 50 # now x is 50. a keeps the value it had before, rather than changing when you change x
# not all languages work this way!
# R is case sensitive:
X <- 51 # capital X, not the same as x
## Functions
# to call a function, use the function name, with the arguments - inputs - in parentheses
print(x) # this prints the contents of x to the screen
R has 6 ‘atomic’ (fundamental) types: logical, integer, double, complex, character, and raw (for binary data). Integer and double are both numeric types. They can be the building blocks for other types.
The type factor (you may remember this from the statistics module) is special.
The simplest data structures are atomic vectors (usually just called vectors). All elements in an atomic vector must be of one type. This code chunk shows you a function, c(), using it to create vectors of the 4 common atomic types.
# the function "c()" combines what you provide it. By default, it makes a vector.
# logical vector
infected <- c(TRUE, FALSE, TRUE, TRUE, FALSE) # sometimes abbreviated to T and F
# Often useful as the output of a function:
# i.e. test whether something is TRUE or FALSE and then use that to determine what happens somewhere else
# character vector
parameters <- c("mu", "gamma", "beta") # character elements are always within quotes
# numeric: double (short for: double precision floating point)
min_temperature <- c(1.51, 2.7, 3.1, 0.05, -2, 5)
# numeric: integer?
number_of_eggs <- c(2, 4, 6, 8, 5) # doesn't actually make an integer vector
typeof(number_of_eggs)
#> [1] "double"
# Not an integer. To make an integer vector, either:
number_of_eggs <- as.integer(number_of_eggs) # converts your vector
number_of_eggs <- c(2L, 4L)
# L specifies the numbers are integers. We don't use "I" for integer - too close to a "1" and "i" is used in complex numbers.
Note that number_of_eggs <- c(2, 4, 6, 8, 5) did not create an integer vector, even though all the elements were given as whole numbers. R made it a double.
Many languages require you to define your data type when you create (or ‘declare’) a variable. R does not. When you don’t specify a type, R decides what it is. This can make getting started quick, but can hide issues if you make a mistake or if you don’t know what R automatically decides.
#What does R do with this?
stuff <- c(3, 4, TRUE, "cat")
stuff
#> [1] "3" "4" "TRUE" "cat"
# R has made everything into character elements
A function often expects certain data types and structures as inputs. The function c() expects all the inputs to be the same type, so R chooses the first type in this hierarchy that all the inputs can be converted to: logical < integer < double < character. If you pass a function an input of a different type from the one it expects, R will try to coerce what you give it into what it needs.
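The hierarchy can be checked directly with typeof(). The following is a minimal sketch; the name mixed is only an illustration and is not used elsewhere in this guide.

```r
# typeof() reports the type R chose for each combined vector
typeof(c(TRUE, FALSE))   # "logical"
typeof(c(TRUE, 2L))      # "integer": TRUE is coerced to 1L
typeof(c(2L, 3.5))       # "double"
typeof(c(3.5, "cat"))    # "character"

# Coercion can also be requested explicitly
as.numeric("3.5")        # 3.5
as.integer(TRUE)         # 1

# Illustrative vector: logical, integer and double inputs end up as double
mixed <- c(TRUE, 2L, 3.5)
mixed                    # 1.0 2.0 3.5
```

Checking typeof() after building a vector is a quick way to catch an unintended coercion before it causes problems later in your code.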
#The function "paste()" expects character inputs.
paste(3, 4, 5) # numeric inputs are coerced to character and joined with spaces
#> [1] "3 4 5" # the output here is a character vector (with one element); you can tell because the element is in quotes
# Try taking the mean() of a character vector. Does R coerce to anything?
mean(parameters)
#> Warning in mean.default(parameters): argument is not numeric or logical:
#> returning NA
#> [1] NA
Feeding a character vector to mean() returned NA. NA represents missing values in R (the equivalent of SQL’s NULL). When a function cannot use the input given, it might return NA. If you need to show that data values are missing (e.g. one measurement wasn’t available or applicable for a particular patient), you should include an NA in the relevant place.
It is important to distinguish between “missing” and “blank”. Missing values often have to be specially considered, or removed.
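As a small sketch of how missing values behave, the example below uses made-up measurements (weights_kg and comments are hypothetical names, not data from this module). It shows how to detect NA, how it propagates through mean(), and how an empty string differs from a missing value.

```r
# Hypothetical measurements; the third patient has no recorded value
weights_kg <- c(61.5, 70.2, NA, 58.0)

is.na(weights_kg)                    # FALSE FALSE TRUE FALSE
mean(weights_kg)                     # NA: one unknown value makes the mean unknown
mean(weights_kg, na.rm = TRUE)       # na.rm = TRUE drops the NA before averaging
n_missing <- sum(is.na(weights_kg))  # 1

# "Blank" is not the same as missing: an empty string is a real character value
comments <- c("ok", "", NA)
is.na(comments)                      # FALSE FALSE TRUE
```

Whether dropping missing values with na.rm = TRUE is appropriate depends on why they are missing, so it is worth counting them first.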
Attributes are “metadata”. 2 important attributes:
names: if given, can be useful for referring to elements within structures.
dimensions (dim): giving a vector a dimensions attribute turns it into an array; an array with 2 dimensions is a matrix.
Atomic vectors (all one type)
Square brackets, [], select elements:
min_temperature <- c(1.51, 2.7, 3.14, 0.05, -2, 5)
parameters <- c("mu", "gamma", "beta")
min_temperature[2] # returns the second element.
## [1] 2.7
min_temperature[1] <- 1.4 # select and replace: the first element is now 1.4
# Many other languages count from 0, so the second element would be [1].
# you can select more than one element, using a vector or a sequence within your square brackets:
parameters[c(2,3)]
## [1] "gamma" "beta"
# 2 dimensions:
# to select from a matrix, you can use a single number,
# which will count down each column in turn
# or like this [i, j]
# to select the item at the i'th row, j'th column
my_mat <- matrix(c(1:9), ncol = 3) # c(1:9) - the colon is shorthand for creating a sequence
my_mat[2, 3]
## [1] 8
my_mat[8]
## [1] 8
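The names and dimensions attributes described earlier can be set and used as in the sketch below. The vectors rates and values are illustrative only, and the numbers are arbitrary.

```r
# names: label the elements, then select by name instead of position
rates <- c(0.02, 0.1, 0.3)
names(rates) <- c("mu", "gamma", "beta")
rates["gamma"]           # the element named "gamma"

# dim: giving a vector dimensions turns it into a matrix, filled column by column
values <- 1:6
dim(values) <- c(2, 3)   # 2 rows, 3 columns
values[2, 3]             # 6
is.matrix(values)        # TRUE
attributes(values)       # shows the dim attribute
```

Selecting by name tends to be more robust than selecting by position, because it still works if the order of the elements changes.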
Lists are sometimes called generic vectors, distinguishing them from atomic vectors.
The elements of a list contain other data structures. You can have different structures and different types in one list: vectors, matrices, other lists, functions…
Select from a list using [[]], [] and $name:
my_list <- list(infected, parameters)
my_list[1] # gets you a slice of the list (it's called a slice in some other languages too)
## [[1]]
## [1] TRUE FALSE TRUE TRUE FALSE
# it returns you part of your original list, as a list. Clearer if you select more than one element:
my_list[c(1,2)]
## [[1]]
## [1] TRUE FALSE TRUE TRUE FALSE
##
## [[2]]
## [1] "mu" "gamma" "beta"
my_list[[1]] # gets out the element itself.
## [1] TRUE FALSE TRUE TRUE FALSE
# It's like the double bracket selects the slice and then the element out of that slice.
my_list[1][1] # a slice of a slice: still a list holding the first element, the same as my_list[1]
## [[1]]
## [1] TRUE FALSE TRUE TRUE FALSE
my_list[[c(1,2)]] # indexes recursively: the 2nd element of the 1st element, i.e. my_list[[1]][[2]]
## [1] FALSE
# You can give the elements in the list names when you create it:
my_list <- list(status = infected, params = parameters) # each element is given the values from the vectors, and the names "status" and "params".
my_list
## $status
## [1] TRUE FALSE TRUE TRUE FALSE
##
## $params
## [1] "mu" "gamma" "beta"
# then you can select elements using their names and the $ sign
my_list$params
## [1] "mu" "gamma" "beta"
my_list$params[1]
## [1] "mu"
See http://www.r-tutor.com/r-introduction/list for a basic introduction.
Given that R has been developed for scientists, imagine a structure like a results table for recording outcomes of your experiments.
Every element in a dataframe holds a vector of the same length. These vectors can be different types. You should be able to treat a column like a vector, and the whole dataframe like a list. You can select using the [i, j] notation that works for matrices as well as the ways of selecting from a list.
observations <- data.frame(patientID = c("A", "B", "C"),
                           test1 = c(27, 40, 48),
                           test2 = c(28, 25, 50),
                           stringsAsFactors = FALSE) # this last argument means the character inputs - strings - are not treated as factors
head(observations) # have a look - at the first 6 rows by default
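The different ways of selecting from a dataframe can be sketched as follows. The block recreates the observations dataframe so that it runs on its own.

```r
observations <- data.frame(patientID = c("A", "B", "C"),
                           test1 = c(27, 40, 48),
                           test2 = c(28, 25, 50),
                           stringsAsFactors = FALSE)

observations$test1                        # list-style: one column, returned as a vector
observations[["test2"]]                   # the same idea, giving the name as a string
observations[2, 3]                        # matrix-style: row 2, column 3, which is 25
observations[observations$test1 > 30, ]   # all columns, for rows where test1 is above 30
mean(observations$test1)                  # a column can be used like any other vector
```

Selecting columns by name rather than by number usually makes code easier to read and less likely to break if columns are added later.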
Note: Another commonly used structure for storing data is the tibble. Further information on tibbles can be found at https://r4ds.had.co.nz/tibbles.html.
When you imagine a “results table” you might think of a table like the observations dataframe:
observations
This is “wide” format. Lots of situations require “long” format. There’s a useful pair of functions that helps here:
# install.packages("reshape2") # the package name must be in quotes
require(reshape2)
## Loading required package: reshape2
melt(observations, id.vars = c("patientID"))
The ultimate shape you want to get your data into will depend on what you are doing with it.
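To make the difference between wide and long concrete without relying on any package, the sketch below builds a long version of observations by hand in base R. The column names variable and value are chosen to resemble the layout that melt() produces; long_obs is an illustrative name. In reshape2, dcast() is the usual function for going back from long to wide.

```r
observations <- data.frame(patientID = c("A", "B", "C"),
                           test1 = c(27, 40, 48),
                           test2 = c(28, 25, 50),
                           stringsAsFactors = FALSE)

# Wide to long: one row per patient per test
long_obs <- data.frame(
  patientID = rep(observations$patientID, times = 2),
  variable  = rep(c("test1", "test2"), each = nrow(observations)),
  value     = c(observations$test1, observations$test2),
  stringsAsFactors = FALSE
)
long_obs

# Long format makes per-group summaries straightforward
tapply(long_obs$value, long_obs$variable, mean)
```

In practice a reshaping function is less error-prone than building the long table by hand; the manual version is only meant to show what the result looks like.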
A function takes in inputs, called arguments, and does “something” with them. For example, c() combines what you give it into a vector. Functions generally expect particular types of input; the arguments you give might be coerced to that type or structure if possible, or you might get an error.
Generally, you’ll want to run a function and store the output - i.e. assign it to a name so you can use it somewhere else.
mean(c(1,2,3,4,5)) # the argument here is a numeric vector - created within the parentheses
my_mean <- mean(min_temperature) # you can just pass in the name of a variable, as long as it exists in your environment.
# some functions look a bit different, e.g. +
1 + 2 # here, you are applying the function '+', to the arguments 1 and 2. For some functions, this is intuitive syntax.
Many functions can take a vector as input. Either the function is applied using the whole vector, such as the mean() example, or functions can be applied element-by-element. https://bookdown.org/rdpeng/rprogdatascience/vectorized-operations.html shows a couple of examples.
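A short sketch of the two behaviours, using made-up temperatures (temps_f and temps_c are illustrative names):

```r
temps_f <- c(32, 50, 212)

# Whole-vector functions return one summary value
mean(temps_f)
max(temps_f)                        # 212

# Element-by-element: the operation is applied to each element in turn
temps_c <- (temps_f - 32) * 5 / 9   # 0 10 100
sqrt(c(1, 4, 9))                    # 1 2 3
c(1, 2, 3) + c(10, 20, 30)          # 11 22 33
temps_f > 40                        # FALSE TRUE TRUE
```

Because arithmetic and comparisons are already element-by-element, many tasks that need a loop in other languages can be written as a single line in R.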
?functionname retrieves the help file for a function. Help files are all structured similarly. The Arguments section tells you the arguments the function expects. Many functions have a Value section: this describes the value returned by the function.
Arguments of a function can themselves be functions. The function apply() takes in a function and a data structure and applies the function to relevant divisions of the data structure (for example, if you need to apply a function over each row of a matrix in turn).
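As a minimal sketch, apply() can be used on a small matrix like this. The second argument chooses the division: 1 for rows, 2 for columns. The result names are illustrative.

```r
my_mat <- matrix(1:9, ncol = 3)   # filled column by column

row_totals <- apply(my_mat, 1, sum)   # 1 = over rows: 12 15 18
col_maxima <- apply(my_mat, 2, max)   # 2 = over columns: 3 6 9

# Any function can be passed in, including one written on the spot
row_ranges <- apply(my_mat, 1, function(r) max(r) - min(r))   # 6 6 6
```

For common cases such as row or column sums, base R also has dedicated functions like rowSums() and colSums(), which are simpler to read.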
If a function that you require does not exist, you can also write your own function.
# Make a function which takes in a character vector, and says Hi.
greetings <- function(x) { # function takes in 1 argument
  # the argument gets assigned the name x within the environment of the function
  hello <- paste("Hi", x, "!")
  return(hello)
}
greetings("Doctor")
greetings(c("Doctor", "Professor")) # Vectorised
greetings(1) # coerces the input to character.
# make a function which requires a numeric argument:
f_to_c <- function(f) { # function takes in 1 argument
  celsius <- (f - 32) * (5 / 9) # named celsius rather than c, to avoid confusion with the function c()
  return(celsius)
}
f_to_c(90)
# f_to_c("ninety") # would it work?
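Calling f_to_c("ninety") stops with an error, because subtraction is not defined for character input and R does not coerce "ninety" to a number. One possible way to make such failures easier to understand is to check the input yourself. The function below is a hypothetical variant (f_to_c_checked is not part of this guide's original code) that gives a clearer message.

```r
# Hypothetical variant of f_to_c that checks its input first
f_to_c_checked <- function(f) {
  if (!is.numeric(f)) {
    stop("f must be numeric, but got type: ", typeof(f))
  }
  (f - 32) * (5 / 9)   # the last evaluated expression is returned
}

f_to_c_checked(212)    # approximately 100

# tryCatch() captures the error so the message can be inspected
result <- tryCatch(f_to_c_checked("ninety"),
                   error = function(e) conditionMessage(e))
result
```

Checking inputs at the top of a function means mistakes are reported where they happen, rather than as a more cryptic error further down.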
Base R comes with the default installation. A package is a collection of functions (and sometimes data), developed for particular uses. You can think of the R “library” as containing many packages.
install.packages(): installs packages.
library(): takes a package name and loads it, if installed. Then you can directly use the functions in it.
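A typical pattern might look like the sketch below. It uses stats, which ships with R, so the first part runs without installing anything; ggplot2 is used only as an example of a package that may or may not be installed on your computer.

```r
# Packages that ship with R, such as stats, are already installed
library(stats)

# pkg::fun calls one function from a package without loading the whole package
stats::median(c(1, 5, 3))   # 3

# Check whether a contributed package is installed before trying to load it
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)
if (has_ggplot2) {
  library(ggplot2)
} else {
  message("ggplot2 is not installed; run install.packages(\"ggplot2\") once, then library(ggplot2)")
}
```

Installing is done once per computer, whereas library() is needed in every new R session that uses the package.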
Base R’s plot() function is versatile; you can often get a quick look at your data easily. plot() has a lot of methods which R chooses from depending on the data you give it, to produce a suitable plot. There are also base functions such as boxplot() which can be quick and easy.
ggplot2 is a powerful system for visualising data.
You create a ggplot object, and then add layers with whatever aesthetics you require.
To learn about ggplot, go here: http://r-statistics.co/ggplot2-Tutorial-With-R.html - a nice tutorial, which we will not simply rewrite here! A note: early on, that tutorial says to state what dataframe to use first. This is not necessary; each layer of a ggplot can be drawn from different dataframes, using a data = argument. But you should think carefully about your data if you are doing this!
require(ggplot2)
## Loading required package: ggplot2
# have a quick look at what these do.
# examples of basic, base plots
plot(observations$test1, observations$test2)
boxplot(observations$test1, observations$test2)
# ggplot
patient_plot <- ggplot(data = observations)
patient_plot <- patient_plot + geom_point(aes(x = test1, y = test2))
patient_plot
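The point about drawing layers from different dataframes can be sketched as follows. Here test_means is a hypothetical second dataframe made up for this illustration, and the plotting code only runs if ggplot2 is installed.

```r
observations <- data.frame(patientID = c("A", "B", "C"),
                           test1 = c(27, 40, 48),
                           test2 = c(28, 25, 50),
                           stringsAsFactors = FALSE)

# Hypothetical second dataframe: one row holding the mean of each test
test_means <- data.frame(test1 = mean(observations$test1),
                         test2 = mean(observations$test2))

layered_plot <- NULL
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # No data in ggplot(); each layer names its own dataframe
  layered_plot <- ggplot() +
    geom_point(data = observations, aes(x = test1, y = test2)) +
    geom_point(data = test_means, aes(x = test1, y = test2),
               colour = "red", size = 4)
  print(layered_plot)
}
```

Both layers use the same column names for x and y, which keeps the axes meaningful; mixing dataframes with unrelated columns on one plot is where care is needed.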
To learn about tidyverse, go here: https://www.linkedin.com/learning/learning-the-r-tidyverse/what-is-the-tidyverse - a nice tutorial, which we will not simply rewrite here!
Within R: ?functionname to get the help file for a function.
On the internet:
General guides - RStudio website https://www.rstudio.com/online-learning/#r-programming has plenty of links and suggestions for R programming and related subjects.
Books (free, online)
ggplot help
R language definitions: the official explanation of R
Q & A sites: such as https://stackoverflow.com/
Bear in mind that R, and even more so certain packages, change fast! Often, someone else will be writing something to solve an issue that you’ve been having, and the next time you look for help, you might find something new. This also means when you search the internet for help, you should check whether what you find is still valid.