R for Reproducible Scientific Analysis: Reference

Key Points

Introduction to R and RStudio
  • Use RStudio to write and run R programs.

  • R has the usual arithmetic operators and mathematical functions.

  • Use <- to assign values to variables.

  • Use ls() to list the variables in a program.

  • Use rm() to delete objects in a program.

  • Use install.packages() to install packages (libraries).
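
The points above can be sketched in a few lines of base R. The variable names are hypothetical, and the install.packages() call is left commented out because it needs network access.

```r
# Arithmetic operators and mathematical functions
total <- 1 + 2 * 3         # multiplication binds tighter than addition
root  <- sqrt(16)

# Assign values to variables with <-
weight_kg <- 57.5          # hypothetical variable
weight_lb <- 2.2 * weight_kg

# List the variables defined so far, then delete one
ls()
rm(weight_lb)
exists("weight_lb")        # FALSE once the object has been removed

# install.packages("ggplot2")  # run once per library; needs internet access
```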

Project Management With RStudio
  • Use RStudio to create and manage projects with consistent layout.

  • Treat raw data as read-only.

  • Treat generated output as disposable.

  • Separate function definition and application.

Seeking Help
  • Use help() to get online help in R.

Data Structures
  • Use read.csv() to read tabular data into R.

  • The basic data types in R are double, integer, complex, logical, and character.

  • Use factors to represent categories in R.
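
As a minimal illustration of the five basic types and of factors, the sketch below uses made-up values; typeof() reports how R stores each one.

```r
# One value of each basic type, queried with typeof()
types <- sapply(list(3.14, 2L, 1+4i, TRUE, "cat"), typeof)
types   # "double" "integer" "complex" "logical" "character"

# A factor stores categories as integer codes plus a set of levels
coat <- factor(c("calico", "black", "tabby", "black"))  # hypothetical data
levels(coat)    # levels are sorted alphabetically by default
typeof(coat)    # "integer": the codes, not the labels, are stored
```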

Exploring Data Frames
  • Use cbind() to add a new column to a data frame.

  • Use rbind() to add a new row to a data frame.

  • Remove rows from a data frame with negative indexing.

  • Use na.omit() to remove rows from a data frame with NA values.

  • Use levels() and as.character() to explore and manipulate factors.

  • Use str(), nrow(), ncol(), dim(), colnames(), rownames(), head(), and typeof() to understand the structure of a data frame.

  • Read in a CSV file using read.csv().

  • Understand that length() of a data frame returns its number of columns.
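
The sketch below walks through these operations on a small, hypothetical data frame. The data are read from an inline string so the example is self-contained; with a real file you would pass its path to read.csv() instead.

```r
# Hypothetical data, supplied inline via the text argument
cats <- read.csv(text = "coat,weight
calico,2.1
black,5.0
tabby,3.2")

# Add a column, then a row
cats <- cbind(cats, age = c(2, 3, 5))
cats <- rbind(cats, data.frame(coat = "black", weight = NA, age = 4))

str(cats)       # column names and types
dim(cats)       # rows, then columns
length(cats)    # number of columns, not rows

cats_no_first <- cats[-1, ]      # a negative index removes a row
cats_complete <- na.omit(cats)   # drop every row that contains an NA
```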

Subsetting Data
  • Indexing in R starts at 1, not 0.

  • Access individual values by location using [].

  • Access slices of data using [low:high].

  • Access arbitrary sets of data using [c(...)].

  • Use logical operations and logical vectors to access subsets of data.
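
Each form of subsetting listed above is shown once below, on a hypothetical named vector.

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)  # hypothetical data

first   <- x[1]               # indexing starts at 1
middle  <- x[2:4]             # a slice: elements 2 through 4
ends    <- x[c(1, 5)]         # an arbitrary set of positions
large   <- x[x > 7]           # keep elements where the test is TRUE
between <- x[x > 5 & x < 7]   # combine tests with & (and), | (or), ! (not)
```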

Control Flow
  • Use if and else to make choices.

  • Use for to repeat operations.
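
A minimal sketch of both constructs, using hypothetical values:

```r
x <- 8   # hypothetical value

# Choose between two branches
if (x >= 10) {
  size <- "large"
} else {
  size <- "small"
}

# Repeat an operation; the result vector is allocated before the loop
squares <- numeric(5)
for (i in seq_along(squares)) {
  squares[i] <- i^2
}
```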

Creating Publication-Quality Graphics
  • Use ggplot2 to create plots.

  • Think about graphics in layers: aesthetics, geometry, statistics, scale transformation, and grouping.
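
The layered approach might look like the sketch below, which assumes the ggplot2 package is installed and uses a made-up data frame. Each `+` adds one layer to the plot object.

```r
library(ggplot2)

# Hypothetical data
df <- data.frame(x = c(1, 10, 100), y = c(2, 4, 8), grp = c("a", "a", "b"))

p <- ggplot(df, aes(x = x, y = y, colour = grp)) +  # aesthetics and grouping
  geom_point() +                                    # geometry
  scale_x_log10()                                   # scale transformation

# print(p) draws the plot in an interactive session
```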

Vectorization
  • Use vectorized operations instead of loops.
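
To make the contrast concrete, the sketch below doubles a vector twice: once with an explicit loop and once with a vectorized operation. Both give the same values.

```r
x <- 1:4

# Loop version: one element at a time
doubled_loop <- numeric(length(x))
for (i in seq_along(x)) {
  doubled_loop[i] <- x[i] * 2
}

# Vectorized version: the operator is applied to every element at once
doubled_vec <- x * 2
```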

Functions Explained
  • Use function to define a new function in R.

  • Use parameters to pass values into functions.

  • Use stopifnot() to flexibly check function arguments in R.

  • Load functions into programs using source().
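
A minimal sketch of a function with an argument check follows. The function name is hypothetical and chosen only for illustration.

```r
# Hypothetical helper: convert a temperature from Fahrenheit to Celsius
fahr_to_celsius <- function(temp) {
  stopifnot(is.numeric(temp))   # stop with an error on non-numeric input
  (temp - 32) * 5 / 9           # the last expression is the return value
}

boiling <- fahr_to_celsius(212)
```

If this definition were saved in its own file (say, a hypothetical functions.R), an analysis script could load it with source("functions.R") before calling it.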

Writing Data
  • Save plots from RStudio using the ‘Export’ button.

  • Use write.table() to save tabular data.
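
As a sketch of saving tabular data, the example below writes a hypothetical data frame as comma-separated text. It writes to a temporary file only to stay self-contained; in a project you would normally write to a dedicated output directory.

```r
results <- data.frame(id = 1:3, score = c(0.5, 0.8, 0.9))  # hypothetical data

out_file <- tempfile(fileext = ".csv")

# sep sets the delimiter; quote and row.names suppress quotes and row labels
write.table(results, file = out_file, sep = ",",
            quote = FALSE, row.names = FALSE)

written <- readLines(out_file)   # read the file back to inspect it
```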

Putting it together
  • Intermediate results need to be kept.

  • Thinking about where output files go is important for keeping a project organised.

Dataframe Manipulation with dplyr
  • Use the dplyr package to manipulate dataframes.

  • Use select() to choose variables from a dataframe.

  • Use filter() to choose data based on values.

  • Use group_by() and summarise() to work with subsets of data.

  • Use mutate() to create new variables.

  • Join dataframes with left_join(), full_join(), or inner_join(), or find non-matching rows with anti_join().
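
The verbs above can be chained as in the sketch below, which assumes the dplyr package is installed. Both data frames and their column names are hypothetical.

```r
library(dplyr)

# Hypothetical data
surveys <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2000, 2010, 2000, 2010),
  pop     = c(10, 12, 50, 55)
)
regions <- data.frame(country = c("A", "C"), region = c("North", "South"))

latest <- surveys %>%
  filter(year == 2010) %>%        # choose rows by value
  select(country, pop) %>%        # choose columns by name
  mutate(pop_doubled = pop * 2)   # create a new variable

mean_pop <- surveys %>%
  group_by(country) %>%
  summarise(mean_pop = mean(pop)) # one row per country

joined    <- left_join(surveys, regions, by = "country")  # keeps every survey row
unmatched <- anti_join(surveys, regions, by = "country")  # survey rows with no region
```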

Intro to the tidyverse
  • For data to be tidy, each variable must be in its own column.

  • For data to be tidy, each case must be in its own row.

  • For data to be tidy, each value must be in its own cell.

Dataframe Manipulation with tidyr
  • Use the tidyr package to change the layout of dataframes.

  • Use gather() to go from wide to long format.

  • Use spread() to go from long to wide format.
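
A round trip between the two layouts might look like the sketch below, which assumes the tidyr package is installed and uses made-up data. Note that tidyr 1.0.0 and later mark gather() and spread() as superseded by pivot_longer() and pivot_wider(); the older functions still work.

```r
library(tidyr)

# Hypothetical wide-format data: one column per year
wide <- data.frame(
  country  = c("A", "B"),
  pop_2000 = c(10, 50),
  pop_2010 = c(12, 55)
)

# Wide to long: column names go into "year", cell values into "pop"
long <- gather(wide, key = "year", value = "pop", pop_2000, pop_2010)

# Long to wide: the reverse operation
back <- spread(long, key = year, value = pop)
```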

Producing Reports With knitr
  • Mix reporting written in R Markdown with software written in R.

  • Specify chunk options to control formatting.

  • Use knitr to convert these documents into PDF and other formats.

Writing Good Software
  • Document what and why, not how.

  • Break programs into short single-purpose functions.

  • Write re-runnable tests.

  • Don’t repeat yourself.

  • Be consistent in naming, indentation, and other aspects of style.

Reference

Introduction to R and RStudio

Project management with RStudio

Seeking help

Data structures

Individual values in R must be one of five data types; multiple values can be grouped in data structures.

Data types

Basic data structures in R:

Remember that matrices are really atomic vectors under the hood, and that data.frames are really lists under the hood (this explains some of the weirder behaviour of R).
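
A short sketch with hypothetical objects makes this visible: typeof() and length() report on the underlying vector or list rather than on the two-dimensional view.

```r
m <- matrix(1:6, nrow = 2)
typeof(m)    # "integer": the elements form a single atomic vector
dim(m)       # the dim attribute is what makes it a matrix
length(m)    # counts every element

df <- data.frame(id = 1:3, name = c("a", "b", "c"))
typeof(df)   # "list": each column is one element of the list
length(df)   # counts columns, as length() does for any list
```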

Vectors

Factors

Lists

Matrices

Data Frames

Useful functions for querying data structures:

Exploring Data Frames

Subsetting data

Control flow

Creating publication quality graphics

Vectorization

Functions explained

Writing data

Split-apply-combine

Dataframe manipulation with dplyr

Dataframe manipulation with tidyr

Producing reports with knitr

Best practices for writing good code

Glossary

argument
A value given to a function or program when it runs. The term is often used interchangeably (and inconsistently) with parameter.
assign
To give a value a name by associating a variable with it.
body
(of a function): the statements that are executed when a function runs.
comment
A remark in a program that is intended to help human readers understand what is going on, but is ignored by the computer. Comments in Python, R, and the Unix shell start with a # character and run to the end of the line; comments in SQL start with --, and other languages have other conventions.
comma-separated values
(CSV) A common textual representation for tables in which the values in each row are separated by commas.
delimiter
A character or characters used to separate individual values, such as the commas between columns in a CSV file.
documentation
Human-language text written to explain what software does, how it works, or how to use it.
floating-point number
A number containing a fractional part and an exponent. See also: integer.
for loop
A loop that is executed once for each value in some kind of set, list, or range. See also: while loop.
index
A subscript that specifies the location of a single value in a collection, such as a single pixel in an image.
integer
A whole number, such as -12343. See also: floating-point number.
library
In R, the directory(ies) where packages are stored.
package
A collection of R functions, data and compiled code in a well-defined format. Packages are stored in a library and loaded using the library() function.
parameter
A variable named in the function’s declaration that is used to hold a value passed into the call. The term is often used interchangeably (and inconsistently) with argument.
return statement
A statement that causes a function to stop executing and return a value to its caller immediately.
sequence
A collection of information that is presented in a specific order.
shape
An array’s dimensions, represented as a vector. For example, a 5×3 array’s shape is (5,3).
string
Short for “character string”, a sequence of zero or more characters.
syntax error
A programming error that occurs when statements are in an order or contain characters not expected by the programming language.
type
The classification of something in a program (for example, the contents of a variable) as a kind of number (e.g. floating-point, integer), string, or something else. In R, the function typeof() is used to query a variable's type.
while loop
A loop that keeps executing as long as some condition is true. See also: for loop.