DEV Community

loading...

Dplyr Cheat Sheet

Lâm
・4 min read

Reference

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

Syntax

dplyr::tbl_df(iris)
Enter fullscreen mode Exit fullscreen mode

Converts data to tbl class. tbl’s are easier to examine than data frames. R displays only the data that fits onscreen:

dplyr::glimpse(iris)
Enter fullscreen mode Exit fullscreen mode

Information dense summary of tbl data.

utils::View(iris)
Enter fullscreen mode Exit fullscreen mode

View data set in the spreadsheet-like display (note capital V).

dplyr::%>%
Enter fullscreen mode Exit fullscreen mode

Passes object on left hand side as the first argument (or .argument) of function on the righthand side.

x %>% f(y) is the same as f(x, y)
y %>% f(x, ., z) is the same as f(x, y, z )
Enter fullscreen mode Exit fullscreen mode

"Piping" with %>% makes code more readable, e.g.

iris %>%
    group_by(Species) %>%
    summarise(avg = mean(Sepal.Width)) %>%
    arrange(avg)
Enter fullscreen mode Exit fullscreen mode

Reshaping Data

  tidyr::gather(cases, "year", "n", 2:4)
Enter fullscreen mode Exit fullscreen mode

Gather columns into rows.

  tidyr::unite(data, col, ..., sep)
Enter fullscreen mode Exit fullscreen mode

Unite several columns into one.

  dplyr::data_frame(a = 1:3, b = 4:6)
Enter fullscreen mode Exit fullscreen mode

Combine vectors into data frame (optimized).

  dplyr::arrange(mtcars, mpg)
Enter fullscreen mode Exit fullscreen mode

Order rows by values of a column (low to high).

  dplyr::arrange(mtcars, desc(mpg))
Enter fullscreen mode Exit fullscreen mode

Order rows by values of a column (high to low).

  dplyr::rename(tb, y = year)
Enter fullscreen mode Exit fullscreen mode

Rename the columns of a data frame.

  tidyr::spread(pollution, size, amount)
Enter fullscreen mode Exit fullscreen mode

Spread rows into columns.

  tidyr::separate(storms, date, c("y", "m", "d"))
Enter fullscreen mode Exit fullscreen mode

Separate one column into several.

Subset Observations (Rows)

dplyr::filter(iris, Sepal.Length > 7)
Enter fullscreen mode Exit fullscreen mode

Extract rows that meet logical criteria.

dplyr::distinct(iris)
Enter fullscreen mode Exit fullscreen mode

Remove duplicate rows.

dplyr::sample_frac(iris, 0.5, replace = TRUE)
Enter fullscreen mode Exit fullscreen mode

Randomly select fraction of rows.

dplyr::sample_n(iris, 10, replace = TRUE)
Enter fullscreen mode Exit fullscreen mode

Randomly select n rows.

dplyr::slice(iris, 10:15)
Enter fullscreen mode Exit fullscreen mode

Select rows by position.

dplyr::top_n(storms, 2, date)
Enter fullscreen mode Exit fullscreen mode

Select and order top n entries (by group if grouped data).

Logic in R

< Less than != Not equal to
> Greater than %in% Group membership
== Equal to is.na Is NA
<= Less than or equal to !is.na Is not NA
>= Greater than or equal to &,\ ,!,xor,any,all

Subset Variables (Columns)

dplyr::select(iris, Sepal.Width, Petal.Length, Species)
Enter fullscreen mode Exit fullscreen mode

Select columns by name or helper function.

Helper functions for select

select(iris, contains("."))
Enter fullscreen mode Exit fullscreen mode

Select columns whose name contains a character string.

select(iris, ends_with("Length"))
Enter fullscreen mode Exit fullscreen mode

Select columns whose name ends with a character string.

select(iris, everything())
Enter fullscreen mode Exit fullscreen mode

Select every column.

select(iris, matches(".t."))
Enter fullscreen mode Exit fullscreen mode

Select columns whose name matches a regular expression.

select(iris, num_range("x", 1:5))
Enter fullscreen mode Exit fullscreen mode

Select columns named x1, x2, x3, x4, x5.

select(iris, one_of(c("Species", "Genus")))
Enter fullscreen mode Exit fullscreen mode

Select columns whose names are in a group of names.

select(iris, starts_with("Sepal"))
Enter fullscreen mode Exit fullscreen mode

Select columns whose name starts with a character string.

select(iris, Sepal.Length:Petal.Width)
Enter fullscreen mode Exit fullscreen mode

Select all columns between Sepal.Length and Petal.Width (inclusive).

select(iris, -Species)
Enter fullscreen mode Exit fullscreen mode

Select all columns except Species.

Make New Variables

dplyr::mutate(iris, sepal = Sepal.Length + Sepal. Width)
Enter fullscreen mode Exit fullscreen mode

Compute and append one or more new columns.

dplyr::mutate_each(iris, funs(min_rank))
Enter fullscreen mode Exit fullscreen mode

Apply window function to each column.

dplyr::transmute(iris, sepal = Sepal.Length + Sepal. Width)
Enter fullscreen mode Exit fullscreen mode

Compute one or more new columns. Drop original columns.

Window function

Mutate uses window functions, functions that take a vector of values and return another vector of values, such as:

dplyr::lead
Enter fullscreen mode Exit fullscreen mode

Copy with values shifted by 1.

dplyr::lag
Enter fullscreen mode Exit fullscreen mode

Copy with values lagged by 1.

dplyr::dense_rank
Enter fullscreen mode Exit fullscreen mode

Ranks with no gaps.

dplyr::min_rank
Enter fullscreen mode Exit fullscreen mode

Ranks. Ties get min rank.

dplyr::percent_rank
Enter fullscreen mode Exit fullscreen mode

Ranks rescaled to [0, 1].

dplyr::row_number
Enter fullscreen mode Exit fullscreen mode

Ranks. Ties got to first value.

dplyr::ntile
Enter fullscreen mode Exit fullscreen mode

Bin vector into n buckets.

dplyr::between
Enter fullscreen mode Exit fullscreen mode

Are values between a and b?

dplyr::cume_dist
Enter fullscreen mode Exit fullscreen mode

Cumulative distribution.

dplyr::cumall
Enter fullscreen mode Exit fullscreen mode

Cumulative all

dplyr::cumany
Enter fullscreen mode Exit fullscreen mode

Cumulative any

dplyr::cummean
Enter fullscreen mode Exit fullscreen mode

Cumulative mean

cumsum
Enter fullscreen mode Exit fullscreen mode

Cumulative sum

cummax
Enter fullscreen mode Exit fullscreen mode

Cumulative max

cummin
Enter fullscreen mode Exit fullscreen mode

Cumulative min

cumprod
Enter fullscreen mode Exit fullscreen mode

Cumulative prod

pmax
Enter fullscreen mode Exit fullscreen mode

Element-wise max

pmin
Enter fullscreen mode Exit fullscreen mode

Element-wise min

Summarise Data

dplyr::summarise(iris, avg = mean(Sepal.Length))
Enter fullscreen mode Exit fullscreen mode

Summarise data into single row of values.

dplyr::summarise_each(iris, funs(mean))
Enter fullscreen mode Exit fullscreen mode

Apply summary function to each column.

dplyr::count(iris, Species, wt = Sepal.Length)
Enter fullscreen mode Exit fullscreen mode

Count the number of rows with each unique value of a variable (with or without weights).

Summary function

Summarise uses summary functions, functions that take a vector of values and return a single value, such as:

dplyr::first
Enter fullscreen mode Exit fullscreen mode

The first value of a vector.

dplyr::last
Enter fullscreen mode Exit fullscreen mode

Last value of a vector.

dplyr::nth
Enter fullscreen mode Exit fullscreen mode

Nth value of a vector.

dplyr::n
Enter fullscreen mode Exit fullscreen mode

of values in a vector.

dplyr::n_distinct
Enter fullscreen mode Exit fullscreen mode

of distinct values in a vector.

IQR
Enter fullscreen mode Exit fullscreen mode

IQR of a vector.

min
Enter fullscreen mode Exit fullscreen mode

The minimum value in a vector.

max
Enter fullscreen mode Exit fullscreen mode

The maximum value in a vector.

mean
Enter fullscreen mode Exit fullscreen mode

Mean value of a vector.

median
Enter fullscreen mode Exit fullscreen mode

The median value of a vector.

var
Enter fullscreen mode Exit fullscreen mode

The variance of a vector.

sd
Enter fullscreen mode Exit fullscreen mode

The standard deviation of a vector

Combine Data Sets

Mutating Joins

dplyr::lef_join(a, b, by = "x1")
Enter fullscreen mode Exit fullscreen mode

Join matching rows from b to a.

dplyr::right_join(a, b, by = "x1")
Enter fullscreen mode Exit fullscreen mode

Join matching rows from a to b.

dplyr::inner_join(a, b, by = "x1")
Enter fullscreen mode Exit fullscreen mode

Join data. Retain only rows in both sets.

dplyr::full_join(a, b, by = "x1")
Enter fullscreen mode Exit fullscreen mode

Join data. Retain all values, all rows.

Filtering Joins

dplyr::semi_join(a, b, by = "x1")
Enter fullscreen mode Exit fullscreen mode

All rows in a that have a match in b.

dplyr::anti_join(a, b, by = "x1")
Enter fullscreen mode Exit fullscreen mode

All rows in a that do not have a match in b.

Set Operations

dplyr::intersect(y, z)
Enter fullscreen mode Exit fullscreen mode

Rows that appear in both y and z.

dplyr::union(y, z)
Enter fullscreen mode Exit fullscreen mode

Rows that appear in either or both y and z.

dplyr::setdiff(y, z)
Enter fullscreen mode Exit fullscreen mode

Rows that appear in y but not z.

Binding

dplyr::bind_rows(y, z)
Enter fullscreen mode Exit fullscreen mode

Append z to y as new rows.

dplyr::bind_cols(y, z)
Enter fullscreen mode Exit fullscreen mode

Append z to y as new columns.

Caution: matches rows by position.

Group Data

null

Group data into rows with the same value of Species.

dplyr::group_by(iris, Species)
Enter fullscreen mode Exit fullscreen mode

Remove grouping information from data frame.

dplyr::ungroup(iris)
Enter fullscreen mode Exit fullscreen mode

Compute separate summary row for each group.

iris %>% group_by(Species) %>% summarise(…)
Enter fullscreen mode Exit fullscreen mode

Compute new variables by group.

iris %>% group_by(Species) %>% mutate(…)
Enter fullscreen mode Exit fullscreen mode

Usage

library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
Enter fullscreen mode Exit fullscreen mode

Installation

null

You can install from CRAN with

install.packages("dtplyr")
Enter fullscreen mode Exit fullscreen mode

Discussion (0)