R is an open-source programming language that is mainly used for statistical analysis, data analysis, graphical representation of data, and reporting. It is a popular language among data scientists, statisticians, researchers, and marketers for analyzing and visualizing data. So, today we will be checking out the 13 most asked R programming questions.
13 Most Asked R Programming Questions
1. How to make a great R reproducible example?
Answer:
A minimal reproducible example consists of the following items:
- a minimal dataset, necessary to demonstrate the problem
- the minimal runnable code necessary to reproduce the error, which can be run on the given dataset
- the necessary information on the used packages, R version, and the system it is run on.
- in the case of random processes, a seed (set by set.seed()) for reproducibility
For examples of good minimal reproducible examples, see the help files of the function you are using. In general, all the code given there fulfills the requirements of a minimal reproducible example: data is provided, minimal code is provided, and everything is runnable.
Producing a minimal dataset
In most cases, this can be easily done by just providing a vector/data frame with some values. Or you can use one of the built-in datasets, which are provided with most packages. A comprehensive list of built-in datasets can be seen with library(help = "datasets"). There is a short description of every dataset, and more information can be obtained, for example, with ?mtcars, where ‘mtcars’ is one of the datasets in the list. Other packages might contain additional datasets.
Making a vector is easy. Sometimes it is necessary to add some randomness to it, and there are a number of functions for that. sample() can randomize a vector, or give a random vector with only a few values. letters is a useful vector containing the alphabet. This can be used for making factors.
A few examples:
- random values: x <- rnorm(10) for a normal distribution, x <- runif(10) for a uniform distribution, …
- a permutation of some values: x <- sample(1:10) for the vector 1:10 in random order.
- a random factor: x <- sample(letters[1:4], 20, replace = TRUE)
For matrices, one can use matrix(), e.g.:
matrix(1:10, ncol = 2)
Making data frames can be done using data.frame(). One should pay attention to naming the entries in the data frame, and to not making it overly complicated.
An example:
set.seed(1)
Data <- data.frame(
X = sample(1:10),
Y = sample(c("yes", "no"), 10, replace = TRUE)
)
For some questions, specific formats can be needed. For these, one can use any of the provided as.someType functions: as.factor, as.Date, as.xts. These can be combined with the vector and/or data frame tricks above.
Copy your data
If you have some data that would be too difficult to construct using these tips, then you can always make a subset of your original data, using head(), subset(), or the indices. Then use dput() to give us something that can be put in R immediately:
> dput(iris[1:4, ]) # first four rows of the iris data set
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6), Sepal.Width = c(3.5,
3, 3.2, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5), Petal.Width = c(0.2,
0.2, 0.2, 0.2), Species = structure(c(1L, 1L, 1L, 1L), .Label = c("setosa",
"versicolor", "virginica"), class = "factor")), .Names = c("Sepal.Length",
"Sepal.Width", "Petal.Length", "Petal.Width", "Species"), row.names = c(NA,
4L), class = "data.frame")
If your data frame has a factor with many levels, the dput output can be unwieldy because it will still list all the possible factor levels even if they aren’t present in the subset of your data. To solve this issue, you can use the droplevels() function. Notice below how Species is a factor with only one level:
> dput(droplevels(iris[1:4, ]))
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6), Sepal.Width = c(3.5,
3, 3.2, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5), Petal.Width = c(0.2,
0.2, 0.2, 0.2), Species = structure(c(1L, 1L, 1L, 1L), .Label = "setosa",
class = "factor")), .Names = c("Sepal.Length", "Sepal.Width",
"Petal.Length", "Petal.Width", "Species"), row.names = c(NA,
4L), class = "data.frame")
When using dput, you may also want to include only relevant columns:
> dput(mtcars[1:3, c(2, 5, 6)]) # first three rows of columns 2, 5, and 6
structure(list(cyl = c(6, 6, 4), drat = c(3.9, 3.9, 3.85), wt = c(2.62,
2.875, 2.32)), row.names = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710"
), class = "data.frame")
One other caveat for dput is that it will not work for keyed data.table objects or for grouped tbl_df (class grouped_df) from dplyr. In these cases you can convert back to a regular data frame before sharing: dput(as.data.frame(my_data)).
Worst case scenario, you can give a text representation that can be read in using the text parameter of read.table:
zz <- "Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa"
Data <- read.table(text=zz, header = TRUE)
Producing minimal code
This should be the easy part but often isn’t. What you should not do is:
- add all kinds of data conversions. Make sure the provided data is already in the correct format (unless that is the problem, of course)
- copy-paste a whole function/chunk of code that gives an error. First, try to locate which lines exactly result in the error. More often than not you’ll find out what the problem is yourself.
What you should do is:
- add which packages should be used if you use any (using library())
- if you open connections or create files, add some code to close them or delete the files (using unlink())
- if you change options, make sure the code contains a statement to revert them back to the original ones (e.g. op <- par(mfrow=c(1,2)) ...some code... par(op))
- test run your code in a new, empty R session to make sure the code is runnable. People should be able to just copy-paste your data and your code in the console and get exactly the same as you have.
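As a rough illustration, the points above could be combined into a skeleton like the following. This is only a sketch under assumptions: the data, the option being changed (digits), and the temporary file are placeholders chosen for illustration, not part of any particular question.

```r
# Reproducible random data (the seed makes the sample repeatable)
set.seed(1)
Data <- data.frame(
  X = sample(1:10),
  Y = sample(c("yes", "no"), 10, replace = TRUE)
)

# Change an option, keeping the previous value so it can be restored
op <- options(digits = 4)
print(mean(Data$X) / 3)
options(op)                          # revert to the original setting

# If the example creates a file, clean it up again
tmp <- tempfile(fileext = ".csv")
write.csv(Data, tmp, row.names = FALSE)
unlink(tmp)

# Report the environment the code was run in
print(R.version.string)
```

Running this in a fresh session (for example with Rscript) is a quick way to check that nothing depends on objects left over in your workspace.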
Give extra information
In most cases, just the R version and the operating system will suffice. When conflicts arise with packages, giving the output of sessionInfo() can really help. When talking about connections to other applications (be it through ODBC or anything else), one should also provide version numbers for those, and if possible also the necessary information on the setup.
If you are running R in RStudio, rstudioapi::versionInfo() can be helpful to report your RStudio version.
If you have a problem with a specific package, you may want to provide the version of the package by giving the output of packageVersion("name of the package").
Note: The results of random sampling after set.seed() differ between R 3.6.0 and later and previous versions. Do specify which R version you used for the random process, and don’t be surprised if you get slightly different results when following old questions. To get the same result in such cases, you can use the RNGversion() function before set.seed() (e.g. RNGversion("3.5.2")).
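A minimal sketch of that workflow, assuming a running R version of 3.6.0 or later: switch to the older generator settings, draw, then switch back to the defaults of the current version. The variable names are illustrative only.

```r
# Use the pre-3.6.0 sampling rules; R >= 3.6.0 warns about this on purpose
suppressWarnings(RNGversion("3.5.2"))
set.seed(1)
old_style <- sample(1:10)

# Restore the defaults of the R version that is actually running
RNGversion(as.character(getRversion()))
set.seed(1)
new_style <- sample(1:10)

print(old_style)
print(new_style)
```

Restoring the defaults afterwards matters, because RNGversion() changes the generator settings for the rest of the session.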
2. How to sort a dataframe by multiple column(s)?
Answer:
You can use the order() function directly without resorting to add-on tools. The following uses a trick right from the top of the example(order) code:
R> dd[with(dd, order(-z, b)), ]
b x y z
4 Low C 9 2
2 Med D 3 1
1 Hi A 8 1
3 Hi A 9 1
If you want to do this by column index, simply pass the desired sorting column(s) to the order() function:
R> dd[order(-dd[,4], dd[,1]), ]
b x y z
4 Low C 9 2
2 Med D 3 1
1 Hi A 8 1
3 Hi A 9 1
This indexes the columns by position rather than using the column names (and with() for easier/more direct access).
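The data frame dd is not defined in the text. One definition that is consistent with the printed output is sketched below; treating b as an ordered factor with levels Low < Med < Hi is an assumption inferred from the sort order shown.

```r
# Assumed definition of dd, reconstructed from the printed output
dd <- data.frame(
  b = factor(c("Hi", "Med", "Hi", "Low"),
             levels = c("Low", "Med", "Hi"), ordered = TRUE),
  x = c("A", "D", "A", "C"),
  y = c(8, 3, 9, 9),
  z = c(1, 1, 1, 2)
)

# Sort by z descending, then by b ascending
sorted <- dd[order(-dd$z, dd$b), ]
print(sorted)
```

The minus sign only works for numeric columns such as z, which is why the descending key is negated rather than the factor.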
3. How to join (merge) data frames (inner, outer, left, right)?
Answer:
By using the merge function and its optional parameters:
Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.
Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)
Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)
Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)
Cross join: merge(x = df1, y = df2, by = NULL)
It’s almost always best to explicitly state the identifiers on which you want to merge; it’s safer if the input data.frames change unexpectedly and easier to read later on.
You can merge on multiple columns by giving by a vector, e.g., by = c("CustomerId", "OrderId").
If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2", where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)
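The join types above can be tried on two small data frames. The definitions of df1 and df2 below are assumptions, chosen so that the inner join has the three matching customers (2, 4 and 6); they are a sketch for experimenting, not data from the text.

```r
# Assumed example data
df1 <- data.frame(CustomerId = 1:6,
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c(rep("Alabama", 2), "Ohio"))

inner <- merge(df1, df2, by = "CustomerId")                # matching ids only
outer <- merge(df1, df2, by = "CustomerId", all = TRUE)    # all ids from both
left  <- merge(df1, df2, by = "CustomerId", all.x = TRUE)  # all rows of df1
right <- merge(df1, df2, by = "CustomerId", all.y = TRUE)  # all rows of df2
cross <- merge(df1, df2, by = NULL)                        # every combination

# Number of rows produced by each join
print(sapply(list(inner = inner, outer = outer, left = left,
                  right = right, cross = cross), nrow))
```

In the left join, customers without a match in df2 get NA in the State column.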
Alternative Answer:
There is the data.table approach for an inner join, which is very time and memory-efficient (and necessary for some larger data.frames):
library(data.table)
dt1 <- data.table(df1, key = "CustomerId")
dt2 <- data.table(df2, key = "CustomerId")
joined.dt1.dt.2 <- dt1[dt2]
merge also works on data.tables (as it is generic and calls merge.data.table):
merge(dt1, dt2)
Yet another option is the join function found in the plyr package.
library(plyr)
join(df1, df2,
type = "inner")
# CustomerId Product State
# 1 2 Toaster Alabama
# 2 4 Radio Alabama
# 3 6 Radio Ohio
Options for type: inner, left, right, full. From ?join:
Unlike merge, [join] preserves the order of x no matter what join type is used.
4. How and when to use grouping functions (tapply, by, aggregate) and the *apply family?
Answer:
R has many *apply functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning users may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that “We should be using an *apply function here”, but it can be tough to keep them all straight at first.
Despite the fact that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.
This answer is intended to act as a sort of signpost for new users to help direct them to the correct apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation. The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.
- apply – When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.
# Two dimensional matrix
M <- matrix(seq(1,16), 4, 4)
# apply min to rows
apply(M, 1, min)
[1] 1 2 3 4
# apply max to columns
apply(M, 2, max)
[1] 4 8 12 16
# 3 dimensional array
M <- array( seq(32), dim = c(4,4,2))
# Apply sum across each M[*, , ] - i.e Sum across 2nd and 3rd dimension
apply(M, 1, sum)
# Result is one-dimensional
[1] 120 128 136 144
# Apply sum across each M[*, *, ] - i.e Sum across 3rd dimension
apply(M, c(1,2), sum)
# Result is two-dimensional
[,1] [,2] [,3] [,4]
[1,] 18 26 34 42
[2,] 20 28 36 44
[3,] 22 30 38 46
[4,] 24 32 40 48
If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.
- lapply – When you want to apply a function to each element of a list in turn and get a list back. This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath.
x <- list(a = 1, b = 1:3, c = 10:100)
lapply(x, FUN = length)
$a
[1] 1
$b
[1] 3
$c
[1] 91
lapply(x, FUN = sum)
$a
[1] 1
$b
[1] 6
$c
[1] 5005
- sapply – When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list. If you find yourself typing unlist(lapply(...)), stop and consider sapply.
x <- list(a = 1, b = 1:3, c = 10:100)
# Compare with above; a named vector, not a list
sapply(x, FUN = length)
a b c
1 3 91
sapply(x, FUN = sum)
a b c
1 6 5005
In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:
sapply(1:5,function(x) rnorm(3,x))
If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:
sapply(1:5,function(x) matrix(x,2,2))
Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:
sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
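The three shapes can be checked with dim(). In this sketch rep() stands in for rnorm() so that the results are deterministic; that substitution is ours, made only for illustration.

```r
# Equal-length vectors become the columns of a matrix
m1 <- sapply(1:5, function(x) rep(x, 3))

# Each 2 x 2 matrix is flattened into one column of length 4
m2 <- sapply(1:5, function(x) matrix(x, 2, 2))

# With simplify = "array" the matrices are stacked along a third dimension
a3 <- sapply(1:5, function(x) matrix(x, 2, 2), simplify = "array")

print(dim(m1))
print(dim(m2))
print(dim(a3))
```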
- vapply – When you want to use sapply but perhaps need to squeeze some more speed out of your code. For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.
x <- list(a = 1, b = 1:3, c = 10:100)
#Note that since the advantage here is mainly speed, this
# example is only for illustration. We're telling R that
# everything returned by length() should be an integer of
# length 1.
vapply(x, FUN = length, FUN.VALUE = 0L)
a b c
1 3 91
- mapply – For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply. This is multivariate in the sense that your function must accept multiple arguments.
#Sums the 1st elements, the 2nd elements, etc.
mapply(sum, 1:5, 1:5, 1:5)
[1] 3 6 9 12 15
#To do rep(1,4), rep(2,3), etc.
mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1
[[2]]
[1] 2 2 2
[[3]]
[1] 3 3
[[4]]
[1] 4
- Map – A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.
Map(sum, 1:5, 1:5, 1:5)
[[1]]
[1] 3
[[2]]
[1] 6
[[3]]
[1] 9
[[4]]
[1] 12
[[5]]
[1] 15
- rapply – For when you want to apply a function to each element of a nested list structure, recursively. rapply is comparatively uncommon, but YMMV. It is best illustrated with a user-defined function to apply:
# Append ! to string, otherwise increment
myFun <- function(x){
if(is.character(x)){
return(paste(x,"!",sep=""))
}
else{
return(x + 1)
}
}
#A nested list structure
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
b = 3, c = "Yikes",
d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))
# Result is named vector, coerced to character
rapply(l, myFun)
# Result is a nested list like l, with values altered
rapply(l, myFun, how="replace")
- tapply – For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor. The black sheep of the *apply family, of sorts. The help file’s use of the phrase “ragged array” can be a bit confusing, but it is actually quite simple. A vector:
x <- 1:20
A factor (of the same length!) defining groups:
y <- factor(rep(letters[1:5], each = 4))
Add up the values in x within each subgroup defined by y:
tapply(x, y, sum)
a b c d e
10 26 42 58 74
More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors. tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.). Hence its black sheep status.
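For comparison, the same grouped sum can be written with aggregate and by, the two grouping functions named in the question. This is a sketch on the x and y vectors from the tapply example, wrapped in a data frame whose name (df) is our own choice.

```r
x <- 1:20
y <- factor(rep(letters[1:5], each = 4))
df <- data.frame(x = x, y = y)

# aggregate returns a data frame with one row per group
agg <- aggregate(x ~ y, data = df, FUN = sum)

# by passes each group's sub-data-frame to the function
by_res <- by(df, df$y, function(d) sum(d$x))

print(agg)
print(by_res)
```

aggregate is convenient when you want a data frame back, while by is handy when the function needs several columns of each group at once.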
5. How to drop data frame columns by name?
Answer:
You can use a simple list of names:
DF <- data.frame(
x=1:10,
y=10:1,
z=rep(5,10),
a=11:20
)
drops <- c("x","z")
DF[ , !(names(DF) %in% drops)]
Or, alternatively, you can make a list of those to keep and refer to them by name:
keeps <- c("y", "a")
DF[keeps]
For those still not acquainted with the drop argument of the indexing function, if you want to keep one column as a data frame, you do:
keeps <- "y"
DF[ , keeps, drop = FALSE]
drop=TRUE (or not mentioning it) will drop unnecessary dimensions, and hence return a vector with the values of column y.
Alternative Answer:
There’s also the subset command, useful if you know which columns you want:
df <- data.frame(a = 1:10, b = 2:11, c = 3:12)
df <- subset(df, select = c(a, c))
To drop columns a,c you could do:
df <- subset(df, select = -c(a, c))
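Two further base R variants are sketched below for when the names to drop are stored in a character vector. The object names kept1 and kept2 are illustrative only.

```r
DF <- data.frame(x = 1:10, y = 10:1, z = rep(5, 10), a = 11:20)
drops <- c("x", "z")

# Keep every column whose name is not listed in drops
kept1 <- DF[setdiff(names(DF), drops)]

# Or remove the listed columns from a copy by assigning NULL
kept2 <- DF
kept2[drops] <- NULL

print(names(kept1))
print(names(kept2))
```

Unlike subset(), both variants work with plain strings, which makes them easier to use inside functions.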
6. How to remove rows with all or some NAs (missing values) in data.frame?
Answer:
Check complete.cases:
> final[complete.cases(final), ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
na.omit is nicer for just removing all NAs. complete.cases allows partial selection by including only certain columns of the data frame:
> final[complete.cases(final[ , 5:6]),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
If you insist on using is.na, you have to reduce its result to one value per row, with something like:
> final[rowSums(is.na(final[ , 5:6])) == 0, ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
but using complete.cases is quite a lot clearer, and faster.
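The data frame final is not defined in the text. The following self-contained sketch uses a small made-up stand-in (the gene labels and values are invented for illustration) to show the three approaches side by side.

```r
# Made-up stand-in for 'final'
final <- data.frame(
  gene = c("g1", "g2", "g3", "g4"),
  hsap = c(0, 0, NA, 0),
  rnor = c(NA, 2, 1, 1),
  cfam = c(2, 2, 2, NA)
)

# Rows with no NA in any column
all_complete <- final[complete.cases(final), ]

# Rows with no NA in the selected columns only
some_complete <- final[complete.cases(final[, c("rnor", "cfam")]), ]

# na.omit keeps the same rows as the first call
omitted <- na.omit(final)

print(all_complete)
print(some_complete)
```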
7. data.table vs dplyr: Can one do something well the other can’t or does poorly?
Answer:
Let’s cover these aspects to better understand: Speed, Memory usage, Syntax, and Features.
Note: Unless explicitly mentioned otherwise, by referring to dplyr, we refer to dplyr’s data.frame interface whose internals are in C++ using Rcpp. The data.table syntax is consistent in its form – DT[i, j, by]. To keep i, j, and by together is by design. Keeping related operations together allows operations to be optimized easily for speed and, more importantly, memory usage, and also provides some powerful features, all while maintaining consistency in syntax.
i. Speed
Quite a few benchmarks (though mostly on grouping operations) have been added to the question already showing data.table gets faster than dplyr as the number of groups and/or rows to group by increase, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 – 10 million groups and varying grouping columns, which also compares pandas. See also updated benchmarks, which include Spark and pydatatable as well. On benchmarks, it would be great to cover these remaining aspects as well:
- Grouping operations involving a subset of rows, i.e., DT[x > val, sum(y), by = z] type operations.
- Benchmark other operations such as update and joins.
- Also benchmark memory footprint for each operation in addition to runtime.
ii. Memory usage
- Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables). See this post.
- data.table interface at the moment allows one to modify/update columns by reference (note that we don’t need to re-assign the result back to a variable).
# sub-assign by reference, updates 'y' in-place
DT[x >= 1L, y := NA]
But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned):
# copies the entire 'y' column
ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))
A concern for this is referential transparency. Updating a data.table object by reference, especially within a function, may not always be desirable. But this is an incredibly useful feature: see this and this posts for interesting cases. And we want to keep it. Therefore we are working towards exporting a shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable to not modify the input data.table within a function, one can then do:
foo <- function(DT) {
DT = shallow(DT) ## shallow copy DT
DT[, newcol := 1L] ## does not affect the original DT
DT[x > 2L, newcol := 2L] ## no need to copy (internally), as this column exists only in shallow copied DT
DT[x > 2L, x := 3L] ## have to copy (like base R / dplyr does always); otherwise original DT will
## also get modified.
}
By not using shallow(), the old functionality is retained:
bar <- function(DT) {
DT[, newcol := 1L] ## old behaviour, original DT gets updated by reference
DT[x > 2L, x := 3L] ## old behaviour, update column x in original DT.
}
By creating a shallow copy using shallow(), we understand that you don’t want to modify the original object. We take care of everything internally to ensure that, while also ensuring to copy columns you modify only when it is absolutely necessary. When implemented, this should settle the referential transparency issue altogether while providing the user with both possibilities.
Also, once shallow() is exported, dplyr’s data.table interface should avoid almost all copies. So those who prefer dplyr’s syntax can use it with data.tables. But it will still lack many features that data.table provides, including (sub)-assignment by reference.
- Aggregate while joining: Suppose you have two data.tables as follows:
DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
# x y z
# 1: 1 a 1
# 2: 1 a 2
# 3: 1 b 3
# 4: 1 b 4
# 5: 2 a 5
# 6: 2 a 6
# 7: 2 b 7
# 8: 2 b 8
DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
# x y mul
# 1: 1 a 4
# 2: 2 b 3
And you would like to get sum(z) * mul for each row in DT2 while joining by columns x, y. We can either:
- 1) aggregate DT1 to get sum(z), 2) perform a join and 3) multiply (or)
# data.table way
DT1[, .(z = sum(z)), keyby = .(x,y)][DT2][, z := z*mul][]
# dplyr equivalent
DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
right_join(DF2) %>% mutate(z = z * mul)
- 2) do it all in one go (using by = .EACHI feature):
DT1[DT2, list(z=sum(z) * mul), by = .EACHI]
What is the advantage?
- We don’t have to allocate memory for the intermediate result.
- We don’t have to group/hash twice (once for aggregation and once for joining).
- And more importantly, the operation that we wanted to perform is clear by looking at j in (2).
Check this post for a detailed explanation of by = .EACHI. No intermediate results are materialised, and the join+aggregate is performed all in one go. Have a look at this, this and this posts for real usage scenarios. In dplyr you would have to join and aggregate or aggregate first and then join, neither of which is as efficient in terms of memory (which in turn translates to speed).
- Update and joins: Consider the data.table code shown below:
DT1[DT2, col := i.mul]
This adds/updates DT1’s column col with mul from DT2 on those rows where DT2’s key column matches DT1. We don’t think there is an exact equivalent of this operation in dplyr, i.e., one that avoids a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary. Check this post for a real usage scenario.
To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!
iii. Syntax
What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side. We will work with the dummy data shown below:
DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)
- Basic aggregation/update operations.
# case (a)
DT[, sum(y), by = z] ## data.table syntax
DF %>% group_by(z) %>% summarise(sum(y)) ## dplyr syntax
DT[, y := cumsum(y), by = z]
ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))
# case (b)
DT[x > 2, sum(y), by = z]
DF %>% filter(x>2) %>% group_by(z) %>% summarise(sum(y))
DT[x > 2, y := cumsum(y), by = z]
ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y)))
# case (c)
DT[, if(any(x > 5L)) y[1L]-y[2L] else y[2L], by = z]
DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
DT[, if(any(x > 5L)) y[1L] - y[2L], by = z]
DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])
data.table syntax is compact and dplyr’s quite verbose. Things are more or less equivalent in case (a).
In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table however, we express both operations with the same logic: operate on rows where x > 2, but in the first case, get sum(y), whereas in the second case update those rows for y with its cumulative sum. This is what we mean when we say the DT[i, j, by] form is consistent.
Similarly in case (c), when we have an if-else condition, we are able to express the logic “as-is” in both data.table and dplyr. However, if we would like to return just those rows where the if condition is satisfied and skip otherwise, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise because summarise() always expects a single value. While it returns the same result, using filter() here makes the actual operation less obvious. It might very well be possible to use filter() in the first case as well (does not seem obvious), but the point is that we should not have to.
- Aggregation/update on multiple columns
# case (a)
DT[, lapply(.SD, sum), by = z] ## data.table syntax
DF %>% group_by(z) %>% summarise_each(funs(sum)) ## dplyr syntax
DT[, (cols) := lapply(.SD, sum), by = z]
ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))
# case (b)
DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
DF %>% group_by(z) %>% summarise_each(funs(sum, mean))
# case (c)
DT[, c(.N, lapply(.SD, sum)), by = z]
DF %>% group_by(z) %>% summarise_each(funs(n(), mean))
- In case (a), the codes are more or less equivalent. data.table uses the familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions to funs().
- data.table’s := requires column names to be provided, whereas dplyr generates them automatically.
- In case (b), dplyr’s syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table’s list.
- In case (c) though, dplyr would return n() as many times as there are columns, instead of just once. In data.table, all we need to do is to return a list in j. Each element of the list will become a column in the result. So, we can use, once again, the familiar base function c() to concatenate .N to a list, which returns a list.
Note: Once again, in data.table, all we need to do is return a list in j. Each element of the list will become a column in the result. You can use c(), as.list(), lapply(), list() etc. base functions to accomplish this, without having to learn any new functions. You will need to learn just the special variables, .N and .SD at least. The equivalents in dplyr are n() and . (the dot).
- Joins
dplyr provides separate functions for each type of join, whereas data.table allows joins using the same syntax DT[i, j, by] (and with reason). It also provides an equivalent merge.data.table() function as an alternative.
setkey(DT1, x, y)
# 1. normal join
DT1[DT2] ## data.table syntax
left_join(DT2, DT1) ## dplyr syntax
# 2. select columns while join
DT1[DT2, .(z, i.mul)]
left_join(select(DT2, x, y, mul), select(DT1, x, y, z))
# 3. aggregate while join
DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)
# 4. update while join
DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]
## ?? (no dplyr equivalent given)
# 5. rolling join
DT1[DT2, roll = -Inf]
## ?? (no dplyr equivalent given)
# 6. other arguments to control output
DT1[DT2, mult = "first"]
## ?? (no dplyr equivalent given)
Some might find a separate function for each join much nicer (left, right, inner, anti, semi etc.), whereas others might like data.table’s DT[i, j, by], or merge(), which is similar to base R.
However, dplyr joins do just that. Nothing more. Nothing less.
data.tables can select columns while joining (2), and in dplyr you will need to select() first on both data.frames before joining, as shown above. Otherwise you would materialise the join with unnecessary columns only to remove them later, and that is inefficient.
data.tables can aggregate while joining (3) and also update while joining (4), using the by = .EACHI feature. Why materialise the entire join result to add/update just a few columns?
data.table is capable of rolling joins (5): roll forward, LOCF, roll backward, NOCB, nearest.
data.table also has the mult = argument, which selects first, last or all matches (6).
data.table has the allow.cartesian = TRUE argument to protect from accidental invalid joins.
Once again, the syntax is consistent with DT[i, j, by], with additional arguments allowing for controlling the output further.
- do()
dplyr’s summarise is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to do(). You have to know beforehand what all your functions return.
DT[, list(x[1], y[1]), by = z] ## data.table syntax
DF %>% group_by(z) %>% summarise(x[1], y[1]) ## dplyr syntax
DT[, list(x[1:2], y[1]), by = z]
DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))
DT[, quantile(x, 0.25), by = z]
DF %>% group_by(z) %>% summarise(quantile(x, 0.25))
DT[, quantile(x, c(0.25, 0.75)), by = z]
DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))
DT[, as.list(summary(x)), by = z]
DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))
.SD’s equivalent is . (the dot).
In data.table, you can throw pretty much anything in j; the only thing to remember is for it to return a list so that each element of the list gets converted to a column.
In dplyr, you cannot do that. You have to resort to do() depending on how sure you are as to whether your function would always return a single value. And it is quite slow.
Once again, data.table’s syntax is consistent with DT[i, j, by]. We can just keep throwing expressions in j without having to worry about these things.
To summarise, we have highlighted several instances where dplyr’s syntax is either inefficient, limited, or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about its “harder to read/learn” syntax.
But it is important to realise dplyr’s syntax and feature limitations as well. data.table has its quirks too. We are also attempting to improve data.table’s joins as highlighted here. But one should also consider the number of features that dplyr lacks in comparison to data.table.
iv. Features
We have pointed out most of the features here and also in this post. In addition:
- fread – fast file reader has been available for a long time now.
- fwrite – a parallelised fast file writer is now available. See this post for a detailed explanation on the implementation and #1664 for keeping track of further developments.
- Automatic indexing – another handy feature to optimise base R syntax as is, internally.
- Ad-hoc grouping: dplyr automatically sorts the results by grouping variables during summarise(), which may not be always desirable.
- Numerous advantages in data.table joins (for speed / memory efficiency and syntax) mentioned above.
- Non-equi joins: Allows joins using other operators <=, <, >, >= along with all other advantages of data.table joins.
- Overlapping range joins were implemented in data.table recently. Check this post for an overview with benchmarks.
- setorder() function in data.table that allows really fast reordering of data.tables by reference.
- dplyr provides an interface to databases using the same syntax, which data.table does not at the moment.
- data.table provides faster equivalents of set operations (written by Jan Gorecki) – fsetdiff, fintersect, funion, and fsetequal with an additional all argument (as in SQL).
- data.table loads cleanly with no masking warnings and has a mechanism described here for [.data.frame compatibility when passed to any R package. dplyr changes base functions filter, lag, and [ which can cause problems; e.g. here and here.
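A few of the features listed above can be tried on toy data. The following is a minimal sketch, assuming the data.table package is installed; the tables a, b, values, and ranges are made up for illustration and do not come from the original benchmarks.

```r
library(data.table)

# Toy tables (hypothetical data, for illustration only).
a <- data.table(id = c(3L, 1L, 2L, 2L))
b <- data.table(id = c(2L, 4L))

# setorder() sorts `a` by reference, i.e. without making a copy.
setorder(a, id)

# Set operations; with the default all = FALSE, duplicates are dropped.
common   <- fintersect(a, b)  # rows present in both tables
only_a   <- fsetdiff(a, b)    # rows of `a` that are not in `b`
combined <- funion(a, b)      # distinct rows from either table

# Non-equi join: find the range [lo, hi] that contains each value v.
values  <- data.table(v = c(5L, 15L))
ranges  <- data.table(lo = c(0L, 10L), hi = c(9L, 19L),
                      label = c("low", "high"))
matched <- ranges[values, on = .(lo <= v, hi >= v)]
matched$label
```

The non-equi join uses the same DT[i, on = ...] form as an ordinary join; only the operators inside .() change.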
Finally:
- On databases – there is no reason why data.table cannot provide a similar interface, but this is not a priority now. It might get bumped up if users would very much like that feature, not sure.
- On parallelism – Everything is difficult, until someone goes ahead and does it. Of course it will take effort (being thread-safe).
  - Progress is being made currently (in v1.9.7 devel) towards parallelising known time-consuming parts for incremental performance gains using OpenMP.
8. How to replace NA values with zeros in an R dataframe?
Answer:
A simple example:
> m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
> d <- as.data.frame(m)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 3 NA 3 7 6 6 10 6 5
2 9 8 9 5 10 NA 2 1 7 2
3 1 1 6 3 6 NA 1 4 1 6
4 NA 4 NA 7 10 2 NA 4 1 8
5 1 2 4 NA 2 6 2 6 7 4
6 NA 3 NA NA 10 2 1 10 8 4
7 4 4 9 10 9 8 9 4 10 NA
8 5 8 3 2 1 4 5 9 4 7
9 3 9 10 1 9 9 10 5 3 3
10 4 2 2 5 NA 9 7 2 5 5
> d[is.na(d)] <- 0
> d
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 3 0 3 7 6 6 10 6 5
2 9 8 9 5 10 0 2 1 7 2
3 1 1 6 3 6 0 1 4 1 6
4 0 4 0 7 10 2 0 4 1 8
5 1 2 4 0 2 6 2 6 7 4
6 0 3 0 0 10 2 1 10 8 4
7 4 4 9 10 9 8 9 4 10 0
8 5 8 3 2 1 4 5 9 4 7
9 3 9 10 1 9 9 10 5 3 3
10 4 2 2 5 0 9 7 2 5 5
There’s no need to apply apply(). You should also take a look at the norm package. It has a lot of nice features for missing data analysis.
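When a data frame mixes numeric and non-numeric columns, it can be safer to restrict the replacement to the numeric ones. The following is a minimal sketch of that variation; the data frame d and the helper vector num_cols are made up for illustration.

```r
# Small example data frame with one character column and two numeric columns.
d <- data.frame(
  id = c("a", "b", "c"),
  x  = c(1, NA, 3),
  y  = c(NA, 5, 6),
  stringsAsFactors = FALSE
)

# Logical vector marking which columns are numeric.
num_cols <- vapply(d, is.numeric, logical(1))

# Replace NA with 0 in the numeric columns only; `id` is left untouched.
d[num_cols][is.na(d[num_cols])] <- 0
d
```

This uses the same is.na() indexing idea as the example above, just applied to a subset of columns.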
9. What are the differences between “=” and “<-” assignment operators in R?
Answer:
The R documentation states: “The operator <- can be used anywhere, whereas the operator = is only allowed at the top level (e.g., in the complete expression typed at the command prompt) or as one of the subexpressions in a braced list of expressions.”
Let’s not put too fine a point on it: the R documentation is (subtly) wrong. This is easy to show: we just need to find a counter-example of the = operator that isn’t (a) at the top level, nor (b) a subexpression in a braced list of expressions (i.e. {…; …}).
Without further ado:
x
# Error: object 'x' not found
sum((x = 1), 2)
# [1] 3
x
# [1] 1
Clearly we’ve performed an assignment, using =, outside of contexts (a) and (b). So, why has the documentation of a core R language feature been wrong for decades? It’s because in R’s syntax the symbol = has two distinct meanings that get routinely conflated:
- The first meaning is as an assignment operator. This is all we’ve talked about so far.
- The second meaning isn’t an operator but rather a syntax token that signals named argument passing in a function call. Unlike the = operator it performs no action at runtime, it merely changes the way an expression is parsed.
Let’s see. In any piece of code of the general form:
‹function_name›(‹argname› = ‹value›, …)
‹function_name›(‹args›, ‹argname› = ‹value›, …)
the = is the token that defines named argument passing: it is not the assignment operator. Furthermore, = is entirely forbidden in some syntactic contexts:
if (‹var› = ‹value›) …
while (‹var› = ‹value›) …
for (‹var› = ‹value› in ‹value2›) …
for (‹var1› in ‹var2› = ‹value›) …
Any of these will raise an error “unexpected ‘=’ in ‹bla›”. In any other context, = refers to the assignment operator call. In particular, merely putting parentheses around the subexpression makes any of the above (a) valid, and (b) an assignment. For instance, the following performs assignment:
median((x = 1 : 10))
But also:
if (! (nf = length(from))) return()
Now you might object that such code is atrocious (and you may be right). But this code is taken from the base::file.copy function (replacing <- with =) — it’s a pervasive pattern in much of the core R codebase.
A more accurate formulation of the rule reads: “[= assignment is] allowed in only two places in the grammar: at the top level (as a complete program or user-typed expression); and when isolated from surrounding logical structure, by braces or an extra pair of parentheses.”
There is one additional difference between the = and <- operators: they call distinct functions. By default these functions do the same thing but you can override either of them separately to change the behaviour. By contrast, <- and -> (left-to-right assignment), though syntactically distinct, always call the same function. Overriding one also overrides the other. Knowing this is rarely practical but it can be used for some fun shenanigans.
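The two meanings of = and the functions behind each operator can be checked directly. The following is a minimal sketch using only base R; the variable names (created_by_named_arg, arrow_fn, and so on) are made up for illustration.

```r
# Remove any existing `x` so the checks below start from a clean state.
if (exists("x", inherits = FALSE)) rm(x)

# Named argument passing: `=` is a syntax token here, so no `x` is created.
median(x = 1:10)
created_by_named_arg <- exists("x", inherits = FALSE)

# An extra pair of parentheses turns `=` into the assignment operator.
median((x = 1:10))
created_by_parens <- exists("x", inherits = FALSE)

# quote() returns the parsed call; its first element names the function used.
arrow_fn  <- as.character(quote(x <- 5)[[1]])
right_fn  <- as.character(quote(5 -> x)[[1]])
equals_fn <- as.character(quote((x = 5))[[2]][[1]])  # [[2]] drops the parentheses

c(arrow_fn, right_fn, equals_fn)
```

Both arrow forms resolve to the same function name, while the parenthesised = resolves to a different one, which is why they can be overridden separately.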
Alternative Answer:
The difference in assignment operators is clearer when you use them to set an argument value in a function call. For example:
median(x = 1:10)
x
## Error: object 'x' not found
In this case, x is declared within the scope of the function, so it does not exist in the user workspace.
median(x <- 1:10)
x
## [1] 1 2 3 4 5 6 7 8 9 10
In this case, x is declared in the user workspace, so you can use it after the function call has been completed. There is a general preference among the R community for using <- for assignment (other than in function signatures) for compatibility with (very) old versions of S-Plus. Note that the spaces help to clarify situations like:
x<-3
# Does this mean assignment?
x <- 3
# Or less than?
x < -3
Most R IDEs have keyboard shortcuts to make <- easier to type: Ctrl + = in Architect, Alt + - in RStudio (Option + - under macOS), Shift + - (underscore) in emacs+ESS. If you prefer writing = to <- but want to use the more common assignment symbol for publicly released code (on CRAN, for example), then you can use one of the tidy_* functions in the formatR package to automatically replace = with <-.
library(formatR)
tidy_source(text = "x=1:5", arrow = TRUE)
## x <- 1:5
The answer to the question “Why does x <- y = 5 throw an error but not x <- y <- 5?” is “It’s down to the magic contained in the parser”. R’s syntax contains many ambiguous cases that have to be resolved one way or another. The parser chooses to resolve the bits of the expression in different orders depending on whether = or <- was used. To understand what is happening, you need to know that the assignment silently returns the value that was assigned. You can see that more clearly by explicitly printing, for example print(x <- 2 + 3). Secondly, it’s clearer if we use prefix notation for assignment. So
x <- 5
`<-`(x, 5) #same thing
y = 5
`=`(y, 5) #also the same thing
The parser interprets x <- y <- 5 as
`<-`(x, `<-`(y, 5))
We might expect that x <- y = 5 would then be
`<-`(x, `=`(y, 5))
but actually it gets interpreted as
`=`(`<-`(x, y), 5)
This is because = is lower precedence than <-, as shown on the ?Syntax help page.
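The grouping chosen by the parser can be inspected without evaluating anything. The following is a minimal sketch using base R’s parse(); the names nested and mixed are made up for illustration.

```r
# Parse both expressions without evaluating them.
nested <- parse(text = "x <- y <- 5")[[1]]
mixed  <- parse(text = "x <- y = 5")[[1]]

# For x <- y <- 5 the outer call is `<-` and its value argument is y <- 5.
nested_outer <- as.character(nested[[1]])
nested_value <- deparse(nested[[3]])

# For x <- y = 5 the outer call is `=` and its target is the call x <- y,
# which is not something a value can be assigned to.
mixed_outer  <- as.character(mixed[[1]])
mixed_target <- deparse(mixed[[2]])

c(nested_outer, nested_value, mixed_outer, mixed_target)
```

Because = binds less tightly, it ends up as the outermost call, and the assignment target becomes x <- y rather than a plain name.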
10. How to convert a factor to integer\numeric without loss of information?
Answer:
See the Warning section of ?factor:
“In particular, as.numeric applied to a factor is meaningless, and may happen by implicit coercion. To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f)).”
The FAQ on R has similar advice.
Why is as.numeric(levels(f))[f] more efficient than as.numeric(as.character(f))?
as.numeric(as.character(f)) is effectively as.numeric(levels(f)[f]), so you are performing the conversion to numeric on length(f) values, rather than on nlevels(f) values. The speed difference will be most apparent for long vectors with few levels. If the values are mostly unique, there won’t be much difference in speed. However you do the conversion, this operation is unlikely to be the bottleneck in your code, so don’t worry too much about it.
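To see why the plain as.numeric(f) loses information while the other two forms do not, here is a minimal sketch on a made-up factor; the names original, f, codes, via_levels, and via_character are for illustration only.

```r
# A factor whose labels are numbers stored as text.
original <- c(10, 20, 10, 5)
f <- factor(c("10", "20", "10", "5"))

# as.numeric() on the factor returns the internal integer codes,
# not the values the labels represent.
codes <- as.numeric(f)

# Convert the levels once, then index them by the factor's codes.
via_levels <- as.numeric(levels(f))[f]

# Convert every element to character first, then to numeric.
via_character <- as.numeric(as.character(f))

via_levels
```

Both via_levels and via_character recover the original values; only the number of string-to-number conversions differs.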
Some timings
library(microbenchmark)
microbenchmark(
as.numeric(levels(f))[f],
as.numeric(levels(f)[f]),
as.numeric(as.character(f)),
paste0(x),
paste(x),
times = 1e5
)
## Unit: microseconds
## expr min lq mean median uq max neval
## as.numeric(levels(f))[f] 3.982 5.120 6.088624 5.405 5.974 1981.418 1e+05
## as.numeric(levels(f)[f]) 5.973 7.111 8.352032 7.396 8.250 4256.380 1e+05
## as.numeric(as.character(f)) 6.827 8.249 9.628264 8.534 9.671 1983.694 1e+05
## paste0(x) 7.964 9.387 11.026351 9.956 10.810 2911.257 1e+05
## paste(x) 7.965 9.387 11.127308 9.956 11.093 2419.458 1e+05
Alternative Answer:
R has a number of (undocumented) convenience functions for converting factors:
as.character.factor
as.data.frame.factor
as.Date.factor
as.list.factor
as.vector.factor
…
But annoyingly, there is nothing to handle the factor -> numeric conversion. We would suggest overcoming this omission with the definition of your own idiomatic function:
as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
that you can store at the beginning of your script, or even better in your .Rprofile file.
11. How to plot two graphs in the same plot in R?
Answer:
lines() or points() will add to the existing graph, but will not create a new window. So you’d need to do
plot(x,y1,type="l",col="red")
lines(x,y2,col="green")
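The snippet above assumes x, y1, and y2 already exist. The following is a self-contained sketch with made-up data; setting ylim from both series is an addition not in the original answer, so that the second line is not clipped by axes sized for the first.

```r
# Made-up data: two curves over the same x values.
x  <- seq(0, 2 * pi, length.out = 50)
y1 <- sin(x)
y2 <- 2 * cos(x)

# Size the y axis for both series before drawing the first one.
plot(x, y1, type = "l", col = "red", ylim = range(y1, y2), ylab = "y")

# Add the second series to the existing plot.
lines(x, y2, col = "green")

# Label the two lines.
legend("topright", legend = c("y1", "y2"), col = c("red", "green"), lty = 1)
```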
Alternative Answer:
You can also use par and plot on the same graph but with different axes. Something as follows:
plot( x, y1, type="l", col="red" )
par(new=TRUE)
plot( x, y2, type="l", col="green" )
If you read in detail about par in R, you will be able to generate really interesting graphs.
12. What is the difference between require() and library()?
Answer:
It is probably best to avoid using require() unless you will actually be using the value it returns, e.g. in an error-checking construct.
In most other cases it is better to use library(), because this will give an error message at package loading time if the package is not available, whereas require() will just return FALSE with a warning, not an error, if the package is not there. Package loading time is the best time to find out if the package needs to be installed (or perhaps doesn’t even exist because it is spelled wrong). Getting error feedback early and at the relevant time will avoid possible headaches with tracking down why later code fails when it attempts to use library routines.
Alternative Answer:
According to the documentation for both functions (accessed by putting a ? before the function name and hitting enter), require is used inside functions, as it outputs a warning and continues if the package is not found, whereas library will throw an error.
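The difference is easy to observe with a package that is not installed. The following is a minimal sketch; the package name is hypothetical and assumed not to exist on your system.

```r
# Hypothetical package name, assumed not to be installed.
pkg <- "notARealPackage12345"

# require() returns FALSE (and issues a warning) instead of stopping.
ok <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))

# library() signals an error, which is caught here so the script continues.
result <- tryCatch(
  {
    library(pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)

c(require_returned = ok, library_result = result)
```

The return value of require() is what makes it usable in a conditional, for example to install a missing package before continuing.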
13. How to deal with “package ‘xxx’ is not available (for R version x.y.z)” warning?
Answer:
i. You can’t spell
The first thing to test is have you spelled the name of the package correctly? Package names are case sensitive in R.
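A case-insensitive comparison can reveal a name that differs only in capitalisation. The following is a minimal sketch; known_packages is a hand-made stand-in so the example runs offline, whereas in practice the names could come from the row names of the matrix returned by available.packages().

```r
# Hypothetical stand-in for a list of available package names.
known_packages <- c("ggplot2", "data.table", "Matrix", "MASS")

# The name as (mis)typed by the user.
requested <- "GGplot2"

# Exact, case-sensitive check, as install.packages() would need it.
exact_match <- requested %in% known_packages

# Case-insensitive check to suggest the correctly capitalised name.
near_matches <- known_packages[tolower(known_packages) == tolower(requested)]

near_matches
```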
ii. You didn’t look in the right repository
Next, you should check to see if the package is available. To see which repositories R will look in for your package, and optionally select some additional ones, type
setRepositories()
See also ?setRepositories.
At the very least, you will usually want CRAN to be selected, and CRAN (extras) if you use Windows, and the Bioc* repositories if you do any biological analyses. To permanently change this, add a line like setRepositories(ind = c(1:6, 8)) to your Rprofile.site file.
iii. The package is not in the repositories you selected
Return all the available packages using
ap <- available.packages()
See also Names of R’s available packages, ?available.packages.
Since this is a large matrix, you may wish to use the data viewer to examine it. Alternatively, you can quickly check to see if the package is available by testing against the row names.
View(ap)
"foobarbaz" %in% rownames(ap)
Alternatively, the list of available packages can be seen in a browser for CRAN, CRAN (extras), R-forge, RForge, and GitHub. Another possible warning message you may get when interacting with CRAN mirrors is:
Warning: unable to access index for repository
This may indicate that the selected CRAN repository is currently unavailable. You can select a different mirror with chooseCRANmirror() and try the installation again. There are several reasons why a package may not be available.
iv. You don’t want a package
Perhaps you don’t really want a package. It is common to be confused about the difference between a package and a library, or a package and a dataset.
A package is a standardized collection of material extending R, e.g. providing code, data, or documentation. A library is a place (directory) where R knows to find packages it can use.
To see available datasets, type
data()
v. R or Bioconductor is out of date
The package may have a dependency on a more recent version of R (or one of the packages that it imports/depends upon does). Look at
ap["foobarbaz", "Depends"]
and consider updating your R installation to the current version. On Windows, this is most easily done via the installr package.
library(installr)
updateR()
(Of course, you may need to install.packages("installr") first.) Equivalently for Bioconductor packages, you may need to update your Bioconductor installation.
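# biocLite() is defunct in current Bioconductor releases; BiocManager replaces it.
install.packages("BiocManager")
BiocManager::install()  # pass version = "<release>" to move to a specific release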
vi. The package is out of date
The package may have been archived (if it is no longer maintained and no longer passes R CMD check tests). In this case, you can load an old version of the package using install_version().
library(remotes)
install_version("foobarbaz", "0.1.2")
An alternative is to install from the github CRAN mirror.
library(remotes)
install_github("cran/foobarbaz")
vii. There is no Windows/OS X/Linux binary
It may not have a Windows binary due to requiring additional software that CRAN does not have. Additionally, some packages are available only via the sources for some or all platforms. In this case, there may be a version in the CRAN (extras) repository (see setRepositories above). If the package requires compiling code (e.g. C, C++, FORTRAN) then on Windows install Rtools or on OS X install the developer tools accompanying XCode, and install the source version of the package via:
install.packages("foobarbaz", type = "source")
# Or equivalently, for Bioconductor packages
# (BiocManager replaces the defunct biocLite() script):
BiocManager::install("foobarbaz", type = "source")
On CRAN, you can tell if you’ll need special tools to build the package from source by looking at the NeedsCompilation flag in the description.
viii. The package is on GitHub/Bitbucket/Gitorious
It may have a repository on GitHub/Bitbucket/Gitorious. These packages require the remotes package to install.
library(remotes)
install_github("packageauthor/foobarbaz")
install_bitbucket("packageauthor/foobarbaz")
# Gitorious has shut down and remotes has no install_gitorious();
# install_git() accepts the URL of any other Git repository.
(As with installr, you may need to install.packages("remotes") first.)
ix. There is no source version of the package
Although the binary version of your package is available, the source version is not. You can turn off this check by setting
options(install.packages.check.source = "no")
as described in this SO answer by imanuelc and the Details section of ?install.packages.
x. The package is in a non-standard repository
Your package is in a non-standard repository (e.g. Rbbg). Assuming that it is reasonably compliant with CRAN standards, you can still download it using install.packages; you just have to specify the repository URL.
install.packages("Rbbg", repos = "http://r.findata.org")
RHIPE on the other hand isn’t in a CRAN-like repository and has its own installation instructions.
In Conclusion
These are the 13 most commonly asked R programming questions. If you have any suggestions or any confusion, please comment below. If you need any help, we will be glad to help you.
We, at Truemark, provide services like web and mobile app development, digital marketing, and website development. So, if you need any help and want to work with us, please feel free to contact us.
Hope this article helped you.
This post was first published on DevpostbyTruemark.