Overview: https://dplyr.tidyverse.org

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.

1 Functions

2 iris data

iris
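Before using the verbs below, it can help to look at the shape of the data. A minimal sketch, assuming the dplyr package is installed; the name iris_tbl is only an illustrative choice:

```r
library(dplyr)

# iris ships with base R: 150 flowers, four numeric measurements and a Species factor
dim(iris)

# glimpse() prints one line per column with its type and first values
glimpse(iris)

# Optional: a tibble copy prints more compactly than a plain data frame
iris_tbl <- as_tibble(iris)
iris_tbl
```

The later examples work the same way on either iris or the tibble copy.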

3 select

Subset columns using their names and types. The examples in the table use the babynames package, as in Posit Primers: Work with Data; the subsections that follow apply the same helpers to iris.

Helper function | Use | Example
- | Columns except | select(babynames, -prop)
: | Columns between (inclusive) | select(babynames, year:n)
contains() | Columns that contain a string | select(babynames, contains("n"))
ends_with() | Columns that end with a string | select(babynames, ends_with("n"))
matches() | Columns that match a regex | select(babynames, matches("n"))
num_range() | Columns with a numerical suffix in the range | Not applicable with babynames
one_of() | Columns whose names appear in the given set | select(babynames, one_of(c("sex", "gender")))
starts_with() | Columns that start with a string | select(babynames, starts_with("n"))

3.1 Columns except

select(iris, -Species)

3.2 Columns between (inclusive)

select(iris, 3:5)

3.3 Columns that contain a string

select(iris, contains("Sepal"))

3.4 Columns that end with a string

select(iris, ends_with("Width"))

3.5 Columns that match a regex

select(iris, matches("pe"))

3.6 Columns with a numerical suffix in the range

Not applicable to iris data
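Since iris has no numbered columns, here is a minimal sketch on a made-up data frame. The object scores and its columns x1 to x4 are hypothetical and exist only to show what num_range() matches:

```r
library(dplyr)

# Hypothetical data frame (not iris or babynames) with numbered column names
scores <- data.frame(
  id = 1:3,
  x1 = c(1, 2, 3),
  x2 = c(4, 5, 6),
  x3 = c(7, 8, 9),
  x4 = c(0, 0, 0)
)

# num_range("x", 2:3) matches the columns named x2 and x3
middle <- select(scores, num_range("x", 2:3))
names(middle)
```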

3.7 Columns whose names appear in the given set

select(iris, one_of("Sepal.Length", "Petal.Width"))
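Current dplyr and tidyselect documentation marks one_of() as superseded in favour of any_of() and all_of(). A minimal sketch of the two replacements; the vector name wanted and the non-existent column "gender" are illustrative only:

```r
library(dplyr)

# "gender" is deliberately not a column of iris
wanted <- c("Sepal.Length", "Petal.Width", "gender")

# any_of() keeps the names that exist and silently skips the rest
flexible <- select(iris, any_of(wanted))
names(flexible)

# all_of() is strict: it would raise an error if any listed name were missing
strict <- select(iris, all_of(c("Sepal.Length", "Petal.Width")))
names(strict)
```

Which one fits depends on whether a missing column should be tolerated or treated as a mistake.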

3.8 Columns that start with a string

select(iris, starts_with("s"))

3.9 Columns selected under new names

select(iris, sl = Sepal.Length, sw = Sepal.Width, sp = Species)

4 filter

Subset rows using column values

Logical operator | Tests | Example
> | Is x greater than y? | x > y
>= | Is x greater than or equal to y? | x >= y
< | Is x less than y? | x < y
<= | Is x less than or equal to y? | x <= y
== | Is x equal to y? | x == y
!= | Is x not equal to y? | x != y
is.na() | Is x an NA? | is.na(x)
!is.na() | Is x not an NA? | !is.na(x)

4.1 Is x greater than y?

filter(iris, Sepal.Length > 6.0)

4.2 Is x greater than or equal to y?

And: &

filter(iris, Sepal.Length >= 5.0 & Sepal.Width >= 4.0)

4.3 Is x less than y?

filter(iris, Sepal.Length < 5.0)

4.4 Is x less than or equal to y?

Or: |

filter(iris, Sepal.Length <= 5.0 | Sepal.Width <= 4.0)

4.5 Is x equal to y?

filter(iris, Species == "virginica")
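The table also lists !=, which has no subsection of its own. A minimal sketch of it on iris, together with %in%, which is not in the table but is a common way to test equality against several values at once:

```r
library(dplyr)

# != keeps every row whose Species is not "setosa"
not_setosa <- filter(iris, Species != "setosa")

# %in% (an addition to the table above) keeps rows matching any listed value
two_species <- filter(iris, Species %in% c("setosa", "virginica"))

count(not_setosa, Species)
count(two_species, Species)
```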

4.6 Is x an NA?

filter(iris, is.na(Sepal.Width))

4.7 Is x not an NA?

filter(iris, !is.na(Sepal.Width))
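The built-in iris data has no missing values, so the first call above returns zero rows and the second returns all 150. A minimal sketch on a modified copy makes the difference visible; iris_na is a hypothetical name, and the two NA values are inserted purely for illustration:

```r
library(dplyr)

# Copy iris and blank out two widths so there is something to find
iris_na <- iris
iris_na$Sepal.Width[c(1, 2)] <- NA

# Rows where Sepal.Width is missing
missing_rows <- filter(iris_na, is.na(Sepal.Width))

# Rows where Sepal.Width is present
complete_rows <- filter(iris_na, !is.na(Sepal.Width))

# filter() keeps only rows where the condition is TRUE,
# so rows where the comparison evaluates to NA are dropped as well
kept <- filter(iris_na, Sepal.Width > 3)

nrow(missing_rows)
nrow(complete_rows)
```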

4.8 Extra: comparing two columns

filter(iris, Sepal.Width > Petal.Length)

5 arrange

Unlike other dplyr verbs, arrange() largely ignores grouping; you need to explicitly mention grouping variables (or use .by_group = TRUE) in order to group by them, and functions of variables are evaluated once per data frame, not once per group.

arrange(iris, Sepal.Length, desc(Sepal.Width))
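The grouping behaviour described above can be sketched as follows. The object names by_species, global_sort and group_sort are illustrative:

```r
library(dplyr)

by_species <- group_by(iris, Species)

# Grouping is ignored: rows are sorted by Sepal.Length across the whole table
global_sort <- arrange(by_species, Sepal.Length)

# .by_group = TRUE sorts by Species first, then by Sepal.Length within each species
group_sort <- arrange(by_species, Sepal.Length, .by_group = TRUE)

head(global_sort)
head(group_sort)
```

Writing arrange(by_species, Species, Sepal.Length) is the explicit alternative mentioned in the text.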

5.1 slice, slice_head, slice_tail, slice_min, slice_max, slice_sample

slice_min(iris, Sepal.Length, n = 6)
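Only slice_min() is shown above. A minimal sketch of the other functions named in the heading, assuming a dplyr version that provides the slice_*() family (1.0.0 or later); the seed value and object names are arbitrary:

```r
library(dplyr)

first_three <- slice(iris, 1:3)          # rows by position
top_rows    <- slice_head(iris, n = 3)   # first n rows
bottom_rows <- slice_tail(iris, n = 3)   # last n rows

# Rows with the largest Sepal.Length; ties are kept by default
longest <- slice_max(iris, Sepal.Length, n = 1)

# with_ties = FALSE returns exactly n rows even when values are tied
shortest_six <- slice_min(iris, Sepal.Length, n = 6, with_ties = FALSE)

# Random rows; set a seed to make the draw reproducible
set.seed(1)
random_five <- slice_sample(iris, n = 5)
```

Because ties are kept by default, slice_min() and slice_max() can return more than n rows.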

6 pipes (R4DS)
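The pipe operator %>% passes the value on its left as the first argument of the call on its right, so a chain of verbs reads top to bottom instead of inside out. A minimal sketch comparing the two styles on iris; the threshold 7.5 and the object names are arbitrary:

```r
library(dplyr)

# Nested calls: read from the innermost call outwards
nested <- arrange(
  filter(select(iris, Species, Sepal.Length), Sepal.Length > 7.5),
  desc(Sepal.Length)
)

# The same steps with the pipe: read from top to bottom
piped <- iris %>%
  select(Species, Sepal.Length) %>%
  filter(Sepal.Length > 7.5) %>%
  arrange(desc(Sepal.Length))

identical(nested, piped)
```

R 4.1 and later also provide a native pipe, |>, which works the same way for simple chains like this one.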

7 mutate

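# Rank rows by Sepal.Length (tied values share the smallest rank), then sort by rank
mutate(iris, rank = min_rank(Sepal.Length)) %>%
  arrange(rank)

# Add a length-to-width ratio column and sort by it
iris %>%
  mutate(Sepal.Ratio = Sepal.Length / Sepal.Width) %>%
  arrange(Sepal.Ratio)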
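mutate() is also where the vectorised helpers if_else() and case_when() are typically used, since they return one value per row. A minimal sketch; the cut-offs 6, 2 and 5 and the labels are arbitrary choices made for illustration:

```r
library(dplyr)

sized <- iris %>%
  mutate(
    # Two-way split on one condition
    long_sepal = if_else(Sepal.Length > 6, "long", "short"),
    # Multi-way split: the first matching condition wins
    size = case_when(
      Petal.Length < 2 ~ "small",
      Petal.Length < 5 ~ "medium",
      TRUE             ~ "large"
    )
  )

count(sized, Species, size)
```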

8 group_by

Most data operations are done on groups defined by variables. group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed “by group”. ungroup() removes grouping.
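A minimal sketch of grouping and ungrouping on iris. The derived column sl_centered is an illustrative name, not part of the data:

```r
library(dplyr)

grouped <- group_by(iris, Species)

group_vars(grouped)  # the grouping variable(s)
n_groups(grouped)    # number of groups

# Inside a grouped tbl, mean() is computed per species, not over all rows
centered <- grouped %>%
  mutate(sl_centered = Sepal.Length - mean(Sepal.Length)) %>%
  ungroup()

# After ungroup() the result is no longer grouped
is_grouped_df(centered)
```

Calling ungroup() at the end of a pipeline avoids later verbs silently operating by group.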

9 summarise or summarize

Summary functions

You can use any function in summarise() so long as it meets one criterion: the function must take a vector of values as input and return a single value as output. Functions that do this are known as summary functions and they are common in the field of descriptive statistics. Some of the most useful summary functions include:

  1. Measures of location - mean(x), median(x), quantile(x, 0.25), min(x), and max(x)
  2. Measures of spread - sd(x), var(x), IQR(x), and mad(x)
  3. Measures of position - first(x), nth(x, 2), and last(x)
  4. Counts - n_distinct(x) and n(); n() takes no arguments and returns the size of the current group or data frame.
  5. Counts and proportions of logical values - sum(!is.na(x)), which counts the number of TRUEs returned by a logical test; mean(y == 0), which returns the proportion of TRUEs returned by a logical test.
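  6. Vectorised helpers, which are not summary functions themselves but are often used to prepare the values being summarised - if_else(), recode(), and case_when()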

# Mean of each measurement per species
iris %>%
  group_by(Species) %>%
  summarize(
    sl_mean = mean(Sepal.Length),
    sw_mean = mean(Sepal.Width),
    pl_mean = mean(Petal.Length),
    pw_mean = mean(Petal.Width)
  )

# Rank Sepal.Length within each species, then sort by rank
iris %>%
  group_by(Species) %>%
  mutate(rank = min_rank(Sepal.Length)) %>%
  arrange(rank)
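The other kinds of summary function in the list can be combined in one summarise() call. A minimal sketch; the output column names and the cut-off of 6 are illustrative choices:

```r
library(dplyr)

stats <- iris %>%
  group_by(Species) %>%
  summarise(
    count       = n(),                     # rows in the current group
    distinct_sl = n_distinct(Sepal.Length),
    sl_median   = median(Sepal.Length),    # location
    sl_iqr      = IQR(Sepal.Length),       # spread
    first_sl    = first(Sepal.Length),     # position
    prop_long   = mean(Sepal.Length > 6)   # proportion of TRUEs
  )

stats
```

Each expression returns a single value per group, so the result has one row per species.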

10 Posit Primers

Please practice with Posit Primers - Work with Data

Work with Data

Learn the most important data handling skills in R: how to extract values from a table, subset tables, calculate summary statistics, and derive new variables.

Working with Tibbles

Learn to use tibbles, the most user-friendly tabular data structure in R, as well as how to manage tidyverse packages with... the tidyverse package.

Isolating Data with dplyr

Master three simple functions for finding, and extracting, the data in your data set. Here you will learn to select variables, filter observations, and arrange values. Here, you will also meet R’s pipe operator, %>%.

Deriving Information with dplyr

Data sets contain more information than they display, and this tutorial will show you how to access that information. You’ll learn to derive new variables and to compute groupwise summary statistics.

