Overview: https://dplyr.tidyverse.org

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.
Functions

- select() picks variables based on their names.
- filter() picks cases based on their values.
- mutate() adds new variables that are functions of existing variables.
- summarise() reduces multiple values down to a single summary.
- arrange() changes the ordering of the rows.
- group_by() takes an existing tbl and converts it into a grouped tbl.
Subset columns using their names and types. The first set of examples uses the babynames package, as in Posit Primers: Work with Data.
| Helper | Use | Example |
|---|---|---|
| - | Columns except | select(babynames, -prop) |
| : | Columns between (inclusive) | select(babynames, year:n) |
| contains() | Columns that contain a string | select(babynames, contains("n")) |
| ends_with() | Columns that end with a string | select(babynames, ends_with("n")) |
| matches() | Columns that match a regex | select(babynames, matches("n")) |
| num_range() | Columns with a numerical suffix in the range | Not applicable with babynames |
| one_of() | Columns whose names appear in the given set | select(babynames, one_of(c("sex", "gender"))) |
| starts_with() | Columns that start with a string | select(babynames, starts_with("n")) |
The same helpers applied to the iris data:

| Use | Example |
|---|---|
| Columns except | select(iris, -Species) |
| Columns between (inclusive) | select(iris, 3:5) |
| Columns that contain a string | select(iris, contains("Sepal")) |
| Columns that end with a string | select(iris, ends_with("Width")) |
| Columns that match a regex | select(iris, matches("pe")) |
| Columns with a numerical suffix in the range | Not applicable to iris data |
| Columns whose names appear in the given set | select(iris, one_of("Sepal.Length", "Petal.Width")) |
| Columns that start with a string | select(iris, starts_with("s")) |
| Columns renamed while selecting | select(iris, sl = Sepal.Length, sw = Sepal.Width, sp = Species) |
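Neither babynames nor iris has numbered columns, so num_range() has no example in the tables above. The following is a minimal sketch on a made-up data frame; the name scores and the columns wk1 to wk4 are assumptions for illustration only. It also shows any_of(), which selects from a set of names like one_of() but skips names that are missing without a warning.

```r
library(dplyr)

# Hypothetical data (not from the original examples) with numbered columns.
scores <- tibble(
  id  = 1:3,
  wk1 = c(10, 12, 9),
  wk2 = c(11, 13, 8),
  wk3 = c(12, 15, 7),
  wk4 = c(14, 15, 9)
)

# num_range("wk", 2:3) matches the columns named wk2 and wk3.
wk_cols <- select(scores, id, num_range("wk", 2:3))
names(wk_cols)

# any_of() keeps the names that exist and silently ignores the rest.
some_cols <- select(scores, any_of(c("id", "not_a_column")))
names(some_cols)
```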
Subset rows using column values
| Operator | Test | Example |
|---|---|---|
| > | Is x greater than y? | x > y |
| >= | Is x greater than or equal to y? | x >= y |
| < | Is x less than y? | x < y |
| <= | Is x less than or equal to y? | x <= y |
| == | Is x equal to y? | x == y |
| != | Is x not equal to y? | x != y |
| is.na() | Is x an NA? | is.na(x) |
| !is.na() | Is x not an NA? | !is.na(x) |
The same tests applied to the iris data:

- Is x greater than y? filter(iris, Sepal.Length > 6.0)
- Is x greater than or equal to y? Two tests combined with AND (&): filter(iris, Sepal.Length >= 5.0 & Sepal.Width >= 4.0)
- Is x less than y? filter(iris, Sepal.Length < 5.0)
- Is x less than or equal to y? Two tests combined with OR (|): filter(iris, Sepal.Length <= 5.0 | Sepal.Width <= 4.0)
- Is x equal to y? filter(iris, Species == "virginica")
- Is x an NA? filter(iris, is.na(Sepal.Width))
- Is x not an NA? filter(iris, !is.na(Sepal.Width))
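Since iris contains no missing values, the is.na() examples above return every row or none. The sketch below uses a small made-up data frame (df, x, and y are assumptions for illustration) to show how & and | differ, and that filter() keeps only rows where the condition is TRUE, so rows where it evaluates to NA are dropped.

```r
library(dplyr)

# Hypothetical toy data (not from the original examples) with missing values.
df <- tibble(
  x = c(1, 5, NA, 8),
  y = c(2, NA, 3, 9)
)

# AND: both tests must be TRUE. TRUE & NA is NA, so the row (5, NA) is dropped.
both <- filter(df, x >= 5 & y >= 5)

# OR: one TRUE test is enough. TRUE | NA is TRUE, so the row (5, NA) is kept.
either <- filter(df, x >= 5 | y >= 5)

# Rows where x is missing, and rows where it is not.
x_missing <- filter(df, is.na(x))
x_present <- filter(df, !is.na(x))
```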
arrange() orders the rows of a data frame by the values of selected columns.

Unlike other dplyr verbs, arrange() largely ignores grouping; you need to explicitly mention grouping variables (or use .by_group = TRUE) in order to group by them, and functions of variables are evaluated once per data frame, not once per group.
arrange(iris, Sepal.Length, desc(Sepal.Width))
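To make the grouping behaviour concrete, here is a short sketch on iris grouped by Species. The object names by_species, ignored, and grouped are illustrative only.

```r
library(dplyr)

by_species <- group_by(iris, Species)

# Grouping is ignored by default: rows are sorted by Sepal.Length alone,
# so the species end up interleaved.
ignored <- arrange(by_species, Sepal.Length)

# With .by_group = TRUE, rows are sorted by Species first,
# then by Sepal.Length within each species.
grouped <- arrange(by_species, Sepal.Length, .by_group = TRUE)

head(ignored)
head(grouped)
```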
slice, slice_head, slice_tail, slice_min, slice_max, slice_sample
slice_min(iris, Sepal.Length, n = 6)
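The other members of the slice family follow the same pattern. This is a brief sketch rather than a full tour; the object names are illustrative, and the seed is set only so that the random sample is repeatable.

```r
library(dplyr)

by_pos    <- slice(iris, 10:12)          # rows 10 to 12, by position
first_two <- slice_head(iris, n = 2)     # first two rows
last_two  <- slice_tail(iris, n = 2)     # last two rows

# Row(s) with the largest Sepal.Length.
longest <- slice_max(iris, Sepal.Length, n = 1)

# slice_min() and slice_max() keep ties by default, so they can return
# more than n rows; with_ties = FALSE caps the result at exactly n rows.
shortest <- slice_min(iris, Sepal.Length, n = 3, with_ties = FALSE)

# Five rows chosen at random.
set.seed(1)
random_five <- slice_sample(iris, n = 5)
```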
mutate(iris, rank = min_rank(Sepal.Length)) %>%
  arrange(rank)
iris %>%
  mutate(Sepal.Ratio = Sepal.Length / Sepal.Width) %>%
  arrange(Sepal.Ratio)
Most data operations are done on groups defined by variables. group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed “by group”. ungroup() removes grouping.
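As a small illustration of what “by group” means, the sketch below centres Sepal.Length within each species and then removes the grouping. The column name sl_centered is made up for this example.

```r
library(dplyr)

grouped <- group_by(iris, Species)
group_vars(grouped)    # the grouping variable(s) of the grouped tbl

# Inside a grouped tbl, mean(Sepal.Length) is computed once per species,
# so each value is centred on the mean of its own group.
centered <- grouped %>%
  mutate(sl_centered = Sepal.Length - mean(Sepal.Length)) %>%
  ungroup()

group_vars(centered)   # empty after ungroup()
```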
Summary functions
You can use any function in summarise() so long as it meets one criterion: the function must take a vector of values as input and return a single value as output. Functions that do this are known as summary functions and they are common in the field of descriptive statistics. Some of the most useful summary functions include:
- Measures of location - mean(x), median(x), quantile(x, 0.25), min(x), and max(x)
- Measures of spread - sd(x), var(x), IQR(x), and mad(x)
- Measures of position - first(x), nth(x, 2), and last(x)
- Counts - n_distinct(x) and n(), which takes no arguments and returns the size of the current group or data frame.
- Counts and proportions of logical values - sum(!is.na(x)), which counts the number of TRUEs returned by a logical test; mean(y == 0), which returns the proportion of TRUEs returned by a logical test.
- Vectorised helpers - if_else(), recode(), and case_when(). These are not summary functions: they return one value per input element, so they are used inside mutate() or nested within a summary function.
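The counting and proportion idioms in the list above can be sketched as follows. The data frame toy and its columns g and x are made up for illustration. Note that n_distinct() counts NA as a distinct value by default.

```r
library(dplyr)

# Hypothetical toy data (not from the original examples).
toy <- tibble(
  g = c("a", "a", "a", "b", "b"),
  x = c(1, NA, 3, 0, 0)
)

counts <- toy %>%
  group_by(g) %>%
  summarise(
    rows        = n(),                         # size of the current group
    distinct_x  = n_distinct(x),               # distinct values, NA included
    non_missing = sum(!is.na(x)),              # number of TRUEs
    prop_zero   = mean(x == 0, na.rm = TRUE)   # proportion of TRUEs
  )

counts
```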
iris %>%
  group_by(Species) %>%
  summarize(sl_mean = mean(Sepal.Length), sw_mean = mean(Sepal.Width),
            pl_mean = mean(Petal.Length), pw_mean = mean(Petal.Width))
iris %>%
  group_by(Species) %>%
  mutate(rank = min_rank(Sepal.Length)) %>%
  arrange(rank)
Posit Primers
Please practice with Posit Primers - Work with Data.

Work with Data
Learn the most important data handling skills in R: how to extract values from a table, subset tables, calculate summary statistics, and derive new variables.

Working with Tibbles
Learn to use tibbles, the most user-friendly tabular data structure in R, as well as how to manage tidyverse packages with... the tidyverse package.

Isolating Data with dplyr
Master three simple functions for finding, and extracting, the data in your data set. Here you will learn to select variables, filter observations, and arrange values. Here, you will also meet R’s pipe operator, %>%.

Deriving Information with dplyr
Data sets contain more information than they display, and this tutorial will show you how to access that information. You’ll learn to derive new variables and to compute groupwise summary statistics.
