Intro to the tidyverse

Overview

Teaching: 40 min
Exercises: 15 min

Questions

How can I use a consistent underlying data structure?

Objectives

To recognise tidy vs messy data

To know where to find more information about the tidyverse

Happy families are all alike; every unhappy family is unhappy in its own way

Leo Tolstoy

Tidy datasets are all alike, but every messy dataset is messy in its own way

Hadley Wickham

Data can come in many different shapes and forms, and people often structure it in whatever way makes sense to them. As a result, a great deal of time is spent restructuring data into a format that R can use.

Within R, different packages can have different expectations about data structures, which can make it difficult to move between functions in different packages.

The tidyverse is a collection of R packages that share a common philosophy about data structure.

The concept of tidy data can be distilled into three principles. A data set can be considered ‘tidy’ if:

Each variable is in its own column
Each case is in its own row
Each value is in its own cell

Challenge 1

What makes the following table untidy?

id	rep1	rep2
1	1.44	2.07
2	1.77	2.13
3	3.56	3.72

Challenge 2

What would it look like if it were tidy?

Solution to Challenge 2

id	rep	value
1	1	1.44
2	1	1.77
3	1	3.56
1	2	2.07
2	2	2.13
3	2	3.72

In the table above, the same variable (a measurement value) was stored in two different columns. Making the data tidy meant combining those two columns into one, which doubled the number of rows. This is usually called going from “wide” to “long” format, and in the simplest cases it is all that tidying requires.
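As a minimal sketch of this wide-to-long conversion, the table from Challenge 1 can be reshaped with pivot_longer() from tidyr. This assumes tidyr 1.0.0 or later (older versions offer gather() for the same job); the object names wide and long are illustrative only.

```r
library(tidyr)
library(tibble)

# The untidy table from Challenge 1: one row per id, one column per replicate
wide <- tribble(
  ~id, ~rep1, ~rep2,
    1,  1.44,  2.07,
    2,  1.77,  2.13,
    3,  3.56,  3.72
)

# Combine rep1 and rep2 into a single 'value' column.
# names_prefix strips the "rep" prefix, so the new 'rep' column holds "1" or "2".
long <- pivot_longer(
  wide,
  cols = c(rep1, rep2),
  names_to = "rep",
  names_prefix = "rep",
  values_to = "value"
)

long
```

The result has the same six measurements as the Solution to Challenge 2, although the rows come out ordered by id rather than by rep, and the rep column is stored as text unless it is converted afterwards.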

Challenge 3

Open the file plates.xlsx (download here). This is a very common format to store data from 96-well plates. What would this look like if it was tidy? Discuss the steps you would need to go through to convert it to a tidy format.

As well as consistency of the underlying data structure, packages in the tidyverse are also all compatible with the %>% operator.
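As a small illustration of the %>% operator, the sketch below passes a data frame through two dplyr verbs. The values come from the tidy table in the Solution to Challenge 2; the object name measurements and the choice of filter() and summarise() are only for illustration.

```r
library(dplyr)
library(tibble)

# Tidy table from the Solution to Challenge 2
measurements <- tribble(
  ~id, ~rep, ~value,
    1,    1,   1.44,
    2,    1,   1.77,
    3,    1,   3.56,
    1,    2,   2.07,
    2,    2,   2.13,
    3,    2,   3.72
)

# x %>% f(y) is read as f(x, y): the left-hand side becomes the first argument
rep1_mean <- measurements %>%
  filter(rep == 1) %>%
  summarise(mean_value = mean(value))

# The same computation written without the pipe, as nested calls
rep1_mean_nested <- summarise(filter(measurements, rep == 1), mean_value = mean(value))
```

Both forms give the same result; the piped version reads top to bottom in the order the steps are carried out.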

There is also a tidyverse package, whose main job is to attach the core tidyverse packages in a single call.

library(tidyverse)

Warning: replacing previous import by 'tidyr::%>%' when loading 'broom'

Warning: replacing previous import by 'tidyr::gather' when loading 'broom'

Warning: replacing previous import by 'tidyr::spread' when loading 'broom'

── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──

✔ tibble  1.4.2     ✔ purrr   0.2.4
✔ tibble  1.4.2     ✔ forcats 0.3.0

── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

This is equivalent to loading each of the core packages individually:

library(ggplot2)
library(dplyr)
library(tidyr)
library(readr)
library(purrr)
library(tibble)
library(stringr)
library(forcats)
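The Conflicts section of the startup message means that, once dplyr is attached, a bare filter() or lag() call refers to the dplyr version rather than the one in stats. A minimal sketch of calling each version explicitly follows; the data frame df and the moving-average example are made up for illustration.

```r
library(dplyr)

df <- data.frame(x = 1:5)

# After attaching dplyr, filter() refers to dplyr::filter() (row subsetting)
kept <- filter(df, x > 3)

# The masked function is still reachable through its namespace;
# stats::filter() applies a linear filter, here a 2-point moving average
smoothed <- stats::filter(1:5, rep(1 / 2, 2))
```

Writing the package name before the function, as in stats::filter(), removes any ambiguity about which version is being called.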

Other great resources

The original tidy data paper

Key Points

For data to be tidy, each variable must be in its own column

For data to be tidy, each case must be in its own row

For data to be tidy, each value must be in its own cell

R for Reproducible Scientific Analysis
