The {svTidy} package provides a set of functions to manipulate data frames in a tidy way (like {dplyr} and {tidyr} do), but by evaluating their arguments in a standard way, or by means of formulas instead of data masking or tidy selection. This has several advantages over the equivalent Tidyverse functions, and we will develop some of them here.
Before we present the formula masking mechanism of {svTidy}, we will first explain what non-standard evaluation and data masking are, and why they can be a problem in some cases because they are not referentially transparent. Then, we will show how formula masking works and how it can solve these problems. If you are familiar with these concepts, you can jump directly to the section “Formula masking”.
In R, when you call a function, the provided arguments are evaluated in the calling environment by default. Here is a simple example:
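A minimal sketch of such a call, assuming a numeric vector `x` defined in the global environment (the values match those discussed in the next paragraph):

```r
# A numeric vector defined in the calling (here, global) environment
x <- c(1, 3, 8)
# `x` is evaluated first, then log() is applied to the resulting values
log(x)
#> [1] 0.000000 1.098612 2.079442
```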
Thus, in log(x), x is first evaluated in
the global environment where the code is run. It resolves to a numerical
vector containing three numbers: 1, 3, and
8. Then, the logarithm of that numerical vector is
computed. Now, when the numbers are not directly in a vector, but they
are in a data frame, say df, we must indicate that we want
to use the column named x from it by df$x or
df[['x']]. However, since df$x does not
evaluate in a standard way, we will use the second form:
df <- data.frame(x = c(1, 3, 8), y = rep(FALSE, 3))
rm(x) # To make sure we do not use the old `x` vector
log(df[['x']])
#> [1] 0.000000 1.098612 2.079442
This is quite simple and understandable R code. However, in some
cases, you have to repeat the name of the data frame several times,
which can be quite tedious. For instance, if you want to filter the rows
of df where the value of x is greater than
2, you can write in base R:
df[df[['x']] > 2, ]
#>   x     y
#> 2 3 FALSE
#> 3 8 FALSE
Note that df is repeated twice here. Not a big deal, but
it is annoying enough for some that they prefer an alternate approach
where the first argument of the function is the data=
argument, indicating only once which data frame is used. Subsequent
arguments refer by default to variables in that data frame, and if not
found, are looked up in the search path, starting from the environment
where the code is executed. dplyr::filter() is such a
function. The following code does roughly the same operation on
df as above, but without repeating the name of the data
frame:
filter(df, x > 2)
#>   x     y
#> 1 3 FALSE
#> 2 8 FALSE
Arguably, this form is simpler and easier to read. But in this case,
the second argument x > 2 cannot be evaluated in
a standard way, because we do not refer to x in
the calling environment, but to x as a column of
df. Fortunately, R allows arguments to be manipulated in a
non-standard way before they are evaluated. The mechanism that
provides the magic for dplyr::filter() to work like this is
called data masking.
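To give an idea of what happens behind the scenes, here is a minimal base R sketch of data masking. It is not how {dplyr} implements it (which relies on {rlang}); the names `filter_sketch_nse()` and `df_demo` are hypothetical and only serve the illustration:

```r
# Hypothetical helper: a bare-bones data-masking filter in base R
filter_sketch_nse <- function(data, cond) {
  # Capture the argument as an unevaluated expression (non-standard evaluation)
  cond_expr <- substitute(cond)
  # Evaluate it with the columns of `data` masking the variables of the caller
  keep <- eval(cond_expr, data, parent.frame())
  data[keep, , drop = FALSE]
}

df_demo <- data.frame(x = c(1, 3, 8), y = rep(FALSE, 3))
res_nse <- filter_sketch_nse(df_demo, x > 2) # `x` is found in df_demo
res_nse[["x"]]
#> [1] 3 8
```

The key point is that `x > 2` is never evaluated in the calling environment: it is captured first, then evaluated with the data frame acting as a "mask" over that environment.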
In the Tidyverse, non-standard evaluation is used to get more concise code, closer to English grammar. The focus is on interactive data analysis.
Well OK, that’s nice… so, what is the problem? The “Programming with dplyr” vignette (version 1.2.0) states:
“Data masking and tidy selection make interactive data exploration fast and fluid, but they add some new challenges when you attempt to use them indirectly such as in a for loop or a function.”
The problem thus arises when the Tidyverse functions are used in a function or a for loop. A detailed explanation follows.
Code is said to be referentially transparent when you can replace a part of an expression by an equivalent one without changing the result. Standard evaluation of function arguments in R is referentially transparent. You can write this:
y <- df[['x']] > 2
df[y, ] # Functionally equivalent to `df[df[['x']] > 2, ]`
#>   x     y
#> 2 3 FALSE
#> 3 8 FALSE
But with data masking, it does not work (at least not the way you expected):
y <- rlang::quo(x > 2) # Need quo() here to defer evaluation of the expression
filter(df, y) # No error, but 'y' is not the object you meant!
#> [1] x y
#> <0 rows> (or 0-length row.names)
Here, it is the y variable in df that is
used, not the expression in y in the calling environment.
The workaround is, indeed, more complicated: you have to
“inject” the expression in y into the
filter() call, using the {{ operator:
filter(df, {{ y }})
#>   x     y
#> 1 3 FALSE
#> 2 8 FALSE
Another situation where data masking hurts because of its non-standard evaluation is when the code is called from a function and part or all of a “data masked” argument becomes an argument of that function. In base R, you can write this:
my_filter_base <- function(data, subset) {
  data[subset, ]
}
my_filter_base(df, df[['x']] > 2)
#>   x     y
#> 2 3 FALSE
#> 3 8 FALSE
Again, referential transparency and standard evaluation of the
arguments help here to make everything smooth.
dplyr::filter() does not allow you to do this:
my_filter_dplyr <- function(data, subset) {
  filter(data, subset)
}
my_filter_dplyr(df, x > 2)
#> Error in `filter()`:
#> ℹ In argument: `subset`.
#> Caused by error:
#> ! object 'x' not found
Again, not the result you expected. You have to do an
“injection” (or “quasiquotation”), using the
embracing operator {{ to indicate that you want to inject
an expression inside another one before its evaluation.
my_filter_dplyr2 <- function(data, subset) {
  filter(data, {{ subset }})
}
my_filter_dplyr2(df, x > 2)
#>   x     y
#> 1 3 FALSE
#> 2 8 FALSE
As nice as this “injection” mechanism may look, you will be bitten by it one day (because our brain tends to think in a referentially transparent way, which data masking is not)!
In {svTidy} we introduce an alternate non-standard mechanism called
formula masking. It is based on the use of R formulas
to indicate that an argument should be evaluated in a non-standard way.
For instance, the equivalent of filter(df, x > 2) in
{svTidy} is:
filter_(df, ~x > 2)
#>   x     y
#> 1 3 FALSE
#> 2 8 FALSE
First note that {svTidy} functions equivalent to {dplyr} or {tidyr}
ones have an underscore at the end of their name (filter_()
vs. filter(), or mutate_()
vs. mutate(), etc.). In older versions of {dplyr}
and {tidyr}, the functions with an underscore at the end were the
standard evaluation versions of the functions without underscore. They
are now defunct in {dplyr} version >= 1.2.0. Since the {svTidy}
functions can also evaluate their arguments in a standard
way, we keep this convention:
filter_(df, df[['x']] > 2) # Standard evaluation of the arguments, alternate form
#>   x     y
#> 1 3 FALSE
#> 2 8 FALSE
So, svTidy::filter_() allows both standard and
non-standard evaluation of its arguments in the same function.
Non-standard evaluation is signaled by using a formula,
which is created in R thanks to the ~ operator. This way,
when you read code you can immediately spot the non-standard evaluated
arguments, thanks to the presence of that ~ operator. Also,
notice that you can easily convert {dplyr} code into {svTidy} one: add
an underscore at the end of the function name, and place a tilde in
front of non-standard evaluated arguments… and it will be good most of
the time.
OK, but how is this better than data masking? Well, it is, to some extent, referentially transparent. So, you can write this:
y <- ~x > 2 # Note that you do not need quo() here: ~ already captures the expression
filter_(df, y)
#>   x     y
#> 1 3 FALSE
#> 2 8 FALSE
Here, the argument y is not itself a formula and is thus evaluated in a
standard way. It resolves to ~x > 2, which is a formula.
Thus, filter_() evaluates it in a non-standard way, looking
for x in df, as we are expecting, since we
provided that context as first argument of filter_().
We thus have both the advantages of non-standard evaluation of
the argument (no repetition of df) and the advantages of
standard evaluation (referential transparency).
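The principle can be sketched in a few lines of base R. This is only an illustration of the idea, not the actual implementation of {svTidy}; the names `filter_sketch_formula()`, `df_demo` and `cond_f` are hypothetical:

```r
# Hypothetical helper: accepts either a formula (NSE) or a plain value (SE)
filter_sketch_formula <- function(data, cond) {
  # `cond` is always evaluated in the standard way first
  if (inherits(cond, "formula")) {
    # One-sided formula: its right-hand side is the last element
    rhs <- cond[[length(cond)]]
    # Evaluate it in `data`, falling back on the formula's own environment
    cond <- eval(rhs, data, environment(cond))
  }
  data[cond, , drop = FALSE]
}

df_demo <- data.frame(x = c(1, 3, 8), y = rep(FALSE, 3))
cond_f <- ~x > 2                                         # formula stored in a variable
r1 <- filter_sketch_formula(df_demo, ~x > 2)             # formula given directly
r2 <- filter_sketch_formula(df_demo, cond_f)             # same formula, passed by name
r3 <- filter_sketch_formula(df_demo, df_demo[["x"]] > 2) # standard evaluation
identical(r1, r2) && identical(r1, r3)
#> [1] TRUE
```

Because the argument is evaluated normally before its class is inspected, a formula stored in a variable behaves exactly like the same formula typed in place.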
Note that formulas are not new in R. They are used in functions
like stats::t.test() or stats::lm(), for
instance. So, we do not introduce a new mechanism. We reuse one that
has existed for a long time in the R language. However, in {svTidy}
the formula is handled in a way that makes it as similar to the
Tidyverse as possible.
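As a reminder of what a formula is in base R, the following sketch inspects one and uses another with stats::lm(). The objects `f`, `d` and `fit` are illustrative names only:

```r
# A formula is an unevaluated call to `~` that remembers where it was created
f <- ~x > 2
class(f)        # "formula"
f[[2]]          # the captured expression: x > 2
all.vars(f)     # the variables it refers to: "x"
environment(f)  # the environment where the formula was created

# stats::lm() resolves `y` and `x` in `data`, not in the calling environment
d <- data.frame(x = 1:5, y = 2 * (1:5) + 1)
fit <- lm(y ~ x, data = d)
coef(fit)       # intercept close to 1, slope close to 2
```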
Before we compare a more complex example, we have to introduce two other features of {svTidy} functions: computed argument names and the data-dot mechanism. We also have to replace the pipeline by a bullet-list construct.
There is another situation that is difficult to resolve with the
Tidyverse functions. It is when the argument name is the name of a
variable we create, say, with mutate(). A simple
example:
mutate(df, x2 = x^2)
#>   x     y x2
#> 1 1 FALSE  1
#> 2 3 FALSE  9
#> 3 8 FALSE 64
The problem occurs when we want to compute the name of the new
variable x2. In {dplyr}, you have to use {{}}
(but here, it is using a different mechanism provided by the
glue() function) and the := operator instead
of the = operator.
my_mutate_dplyr <- function(data, var, expr) {
  mutate(data, "{{var}}" := {{ expr }})
}
my_mutate_dplyr(df, x2, x^2)
#>   x     y x2
#> 1 1 FALSE  1
#> 2 3 FALSE  9
#> 3 8 FALSE 64
With {svTidy}, we have used one-sided formulas until now (the
tilde ~ is on the left of the expression), but we can also
use two-sided formulas, where the tilde is in the middle, like in
'varname' ~ expr. This makes it possible to compute the name of the
new variable in a more straightforward way:
my_mutate_svTidy <- function(data, expr) {
  mutate_(data, expr)
}
my_mutate_svTidy(df, 'x2' ~ x^2)
#>   x     y x2
#> 1 1 FALSE  1
#> 2 3 FALSE  9
#> 3 8 FALSE 64
If you want to separate the two terms of the formula as two different arguments, you can do it, but then you have to evaluate both arguments in a standard way (unless you process them in a special way using macro expansion, which we will see below):
my_mutate_svTidy2 <- function(data, name, expr) {
  mutate_(data, name ~ expr)
}
my_mutate_svTidy2(df, 'x2', df$x^2)
#>   x     y x2
#> 1 1 FALSE  1
#> 2 3 FALSE  9
#> 3 8 FALSE 64
Finally, you can replace one or more variables inside the right-hand
side of formulas, the same way indirection does for the Tidyverse, without
any special notation (note that you can deactivate it with
.__indirection__. <- FALSE in the function, if
needed).
my_mutate_svTidy3 <- function(data, name, var) {
  mutate_(data, name ~ var^2)
}
my_mutate_svTidy3(df, 'x2', ~x)
#>   x     y x2
#> 1 1 FALSE  1
#> 2 3 FALSE  9
#> 3 8 FALSE 64
The Tidyverse uses the pipeline operator |> (or
%>% from {magrittr}) to chain together several
operations on a data frame. This seems nice and reads well, but it glues
together several expressions into a giant one that is much harder to
debug. The pipe operator |> is nice to make R code more
readable when an instruction is made of several functions nested into
each other. But we believe that chaining several separate operations
using the same pipe operator |> is overusing it. As an
alternative, equally readable, we propose to use a pseudo-operator
.= that we call a “bullet-list” operator.
The idea is to present successive related operations a little
bit like a bullet list. An example will be clearer than a long
explanation. In the Tidyverse, you would write something like this
(using both data masking and tidy selection):
data(starwars)
# A Tidyverse pipeline using five of the main {dplyr} verbs
starwars_sum <-
  starwars |>
  filter(species == "Human") |>
  select(name:homeworld) |>
  # Note: get age 2 years after battle of Yavin (birth_year is year born Before Battle of Yavin)
  mutate(age = 2 + birth_year) |>
  group_by(gender) |>
  summarise(
    mean_age = mean(age, na.rm = TRUE),
    sd_age = sd(age, na.rm = TRUE),
    n_age = sum(!is.na(age)),
    mean_mass = mean(mass, na.rm = TRUE),
    sd_mass = sd(mass, na.rm = TRUE),
    n_mass = sum(!is.na(mass))
  )
starwars_sum
#> # A tibble: 2 × 7
#>   gender    mean_age sd_age n_age mean_mass sd_mass n_mass
#>   <chr>        <dbl>  <dbl> <int>     <dbl>   <dbl>  <int>
#> 1 feminine      48.4   18.8     5      56.3    16.3      3
#> 2 masculine     57.5   25.4    21      85.7    16.5     17
filter(), mutate(), group_by() and
summarise() use data masking. select() uses tidy selection.
Nothing in the code indicates this. You have to look at the
documentation, and this is mandatory to
understand what this code does. Yet, this reduces the typing by avoiding
quotes around variable names, and by avoiding the repetition of
df$var. It is possible to rewrite this code with {svTidy}
the way we learned, by just appending an underscore to the function name
and prepending a tilde to non-standard evaluated arguments. But we can
also replace the pipe operator by the bullet-list operator this way:
starwars_sum2 <- {
  .= starwars
  .= filter_(~species == "Human")
  .= select_(~name:homeworld)
  .= mutate_(age = ~2 + birth_year)
  .= group_by_(~gender)
  .= summarise_(
    mean_age = ~mean(age, na.rm = TRUE),
    sd_age = ~sd(age, na.rm = TRUE),
    n_age = ~sum(!is.na(age)),
    mean_mass = ~mean(mass, na.rm = TRUE),
    sd_mass = ~sd(mass, na.rm = TRUE),
    n_mass = ~sum(!is.na(mass))
  )
}
starwars_sum2
#> # A tibble: 2 × 7
#>   gender    mean_age sd_age n_age mean_mass sd_mass n_mass
#>   <chr>        <dbl>  <dbl> <int>     <dbl>   <dbl>  <int>
#> 1 feminine      48.4   18.8     5      56.3    16.3      3
#> 2 masculine     57.5   25.4    21      85.7    16.5     17
identical(starwars_sum, starwars_sum2)
#> [1] TRUE
The ‘{’ operator groups together several separate
expressions that can be debugged more easily. We believe that
the .= at the beginning of each line makes it even clearer
that we have successive operations than when using the pipe
|> at the end of the line (compare both codes). However,
there is something special here. .= is a pseudo-operator
because it does nothing special. It is . followed by
=, meaning that we assign the right-hand side of
the expression after = to the dot (.). However, we do not specify the
data= argument in the {svTidy} functions. We would have to
write .= filter_(., ~species == "Human") for instance… but
we dropped . here. This is because the {svTidy} functions
use an additional mechanism called “data-dot”. When the
data= argument is not provided, the default .
is inserted in the call of the function before it is executed. This
makes it possible to get code closer to the Tidyverse one, with just three changes:
(1) replace the pipe |> at the end by a bullet-list
.= at the beginning of a line, (2) add an underscore after
the name of the functions, and (3) add a tilde before non-standard
arguments (and, optionally, group together the successive operations
with ‘{}’).
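Both ideas can be tried in base R without {svTidy}. In the sketch below, `first_rows()` is a hypothetical function that mimics data-dot by looking up `.` in the calling environment when no data is supplied; this is an assumption made for the illustration, whereas {svTidy} is described above as inserting `.` into the call itself:

```r
# `.=` is just `.` followed by `=`: each line assigns its result to `.`
res <- {
  .= c(4, 9, 16)
  .= sqrt(.)   # `.` holds the result of the previous line
  .= sum(.)
}
res
#> [1] 9

# Hypothetical "data-dot" function: falls back on `.` from the caller
# when no data is supplied (illustration only)
first_rows <- function(data = NULL, n = 2L) {
  if (is.null(data)) data <- get(".", envir = parent.frame())
  head(data, n)
}
res2 <- {
  .= data.frame(x = c(1, 3, 8))
  .= first_rows(n = 2L)  # same as first_rows(., n = 2L)
}
res2[["x"]]
#> [1] 1 3
```

Since each `.=` line is an ordinary assignment, the intermediate value of `.` can be inspected at any step, which is what makes the construct easy to debug.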
Now, if we want to reuse this code in a function with various arguments, things become much more complicated with the Tidyverse, because of the required injections and special constructs for names of variables, as briefly explained at the beginning of this vignette (it is more detailed in “Programming with dplyr”).
my_summarise_dplyr <- function(data, subset, selection, group, year, var, var2) {
  var2_sym <- as.symbol(var2) # Must provide a symbol for names!
  data |>
    filter({{ subset }}) |>
    select({{ selection }}) |>
    mutate({{var}} := .env$year + .data$birth_year) |>
    group_by({{ group }}) |>
    summarise(
      "mean_{{var}}" := mean({{ var }}, na.rm = TRUE),
      "sd_{{var}}" := sd({{ var }}, na.rm = TRUE),
      "n_{{var}}" := sum(!is.na({{ var }})),
      "mean_{{var2_sym}}" := mean(.data[[var2]], na.rm = TRUE),
      "sd_{{var2_sym}}" := sd(.data[[var2]], na.rm = TRUE),
      "n_{{var2_sym}}" := sum(!is.na(.data[[var2]]))
    )
}
starwars_sum3 <- my_summarise_dplyr(starwars, subset = species == "Human",
  selection = name:homeworld, group = gender, year = 2, var = age, var2 = 'mass')
starwars_sum3
#> # A tibble: 2 × 7
#>   gender    mean_age sd_age n_age mean_mass sd_mass n_mass
#>   <chr>        <dbl>  <dbl> <int>     <dbl>   <dbl>  <int>
#> 1 feminine      48.4   18.8     5      56.3    16.3      3
#> 2 masculine     57.5   25.4    21      85.7    16.5     17
identical(starwars_sum, starwars_sum3)
#> [1] TRUE
Note that var= and var2= illustrate the two
ways of defining a variable, by a symbol for var= and by
its name for var2= (character string). In the case of
var2=, it cannot be used as such in the name substitution.
It must be converted into a symbol first (in var2_sym). The
way they are dealt with by the Tidyverse functions differs, as you can
see. Now, here is the {svTidy} version:
my_summarise_svTidy <- function(data, subset, selection, group, year, var, var2) {
  fvar2 <- f_(var2)
  .= data
  .= filter_(subset)
  .= select_(selection)
  .= mutate_(var ~ year + birth_year)
  .= group_by_(group)
  .= summarise_(
    'mean_{{var}}' ~ mean(var, na.rm = TRUE),
    'sd_{{var}}' ~ sd(var, na.rm = TRUE),
    'n_{{var}}' ~ sum(!is.na(var)),
    'mean_{{var2}}' ~ mean(fvar2, na.rm = TRUE),
    'sd_{{var2}}' ~ sd(fvar2, na.rm = TRUE),
    'n_{{var2}}' ~ sum(!is.na(fvar2))
  )
}
starwars_sum3 <- my_summarise_svTidy(starwars, subset = ~species == "Human",
  selection = ~name:homeworld, group = ~gender, year = 2, var = ~age, var2 = 'mass')
starwars_sum3
#> # A tibble: 2 × 7
#>   gender    mean_age sd_age n_age mean_mass sd_mass n_mass
#>   <chr>        <dbl>  <dbl> <int>     <dbl>   <dbl>  <int>
#> 1 feminine      48.4   18.8     5      56.3    16.3      3
#> 2 masculine     57.5   25.4    21      85.7    16.5     17
identical(starwars_sum, starwars_sum3)
#> [1] TRUE
You notice that this last code is leaner than the Tidyverse version
and it is also much closer to the initial bullet-list version. The first
line var <- { was replaced by the function definition
fun <- function(args) {.
.__macros__. <- TRUE is added in the body of the
function only if it is required (here for summarise_()).
Then, you simply replace the expressions by the argument names like you
do in plain R code (replace starwars by data,
~species == "Human" inside filter_() by
subset, etc.). Finally, since macro expansion only works for
variables that contain formulas, and var2 is a character
string, we have to convert it into a formula before use. The function
svBase::f_() does this in a simple way. In practice, you
should prefer to directly use a formula for such arguments, like
var=.
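The internals of svBase::f_() are not detailed here. As a rough base R analogue (an assumption made for illustration, not a description of how f_() works), a column name stored as a character string can be turned into a one-sided formula with stats::reformulate():

```r
var2 <- "mass"
fvar2 <- reformulate(var2)   # builds the one-sided formula ~mass
fvar2
all.vars(fvar2)              # "mass"

# The formula can then be evaluated against a data frame
# (`d` is a small made-up data frame)
d <- data.frame(mass = c(70, NA, 82))
mean(eval(fvar2[[2]], d, environment(fvar2)), na.rm = TRUE)
#> [1] 76
```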
bm <- bench::mark(
  dplyr = my_summarise_dplyr(starwars, subset = species == "Human",
    selection = name:homeworld, group = gender, year = 2, var = age, var2 = 'mass'),
  svTidy = my_summarise_svTidy(starwars, subset = ~species == "Human",
    selection = ~name:homeworld, group = ~gender, year = 2, var = ~age, var2 = 'mass')
)
bm
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr         6.8ms   6.88ms      144.   20.58KB     6.66
#> 2 svTidy      755.9µs 800.73µs     1232.    9.81KB    10.5
With such a small dataset, we essentially measure the overhead of the two approaches, and we can see that {svTidy} is about 8.6 times faster (median time) and allocates about 2.1 times less memory than {dplyr} in this case. With larger datasets, the overhead becomes negligible, and results will be different. However, for code to be incorporated in functions that can possibly be run a large number of times (for instance in loops), this may be important.
If you are convinced, you will probably have to convert existing or future {dplyr}/{tidyr} code into {svTidy}. You have only a few rules to remember to do so:
Append ‘_’ at the end of the function name (e.g.,
select() -> select_()), and make sure that
{svTidy} is loaded higher in the search path than {dplyr} and {tidyr},
if the latter packages are loaded too.
Either:
use standard evaluation (for instance, df$var instead of
var for a column named “var” in a data frame
df), or
place ~ in front of your NSE code and do not quote
variable names. You can keep ~var instead of
df$var.
Use “fast” collapse functions instead of base equivalents (for
instance, fmean() instead of mean()). In fact,
you can continue to use base functions, but you will not benefit from the
speed increase of the fast functions, especially if your code involves
grouped data. Of course, also load the {collapse} package using
library(collapse) before use.
The ‘_’ functions automatically ungroup the data at the end, contrary to their Tidyverse equivalents [note: not true for all functions for now, check your results].
You benefit from referential transparency in SE mode: if
x <- 'var', you can use x instead of
'var' everywhere. You do not need to “embrace” the
argument, like this {{ x }} (only required in Tidyverse
functions). The same holds for formulas: write x <- ~var, and you
can use x everywhere instead of ~var.
To rename variables, you replace the Tidyverse syntax
{{varname}} := expr by a two-sided formula:
varname ~ expr.
If a function accepts either a data frame or a vector as first
argument (e.g., replace_na_()), you must write
v = vector if you provide a vector, to mark your intention
to use it with something other than a data frame.
The ‘_’ functions are “data-dot”. It means they inject
. as first argument (usually .data=) if no data
frame is provided.
You cannot mix SE code and NSE code through formulas. Either use SE code for all arguments, or formulas only, inside a function call.
Formulas are converted into expressions that are evaluated in the
environment where the first provided formula was created. If you need an
evaluation in a different environment, you can use
retarget(formula) to change its environment.
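The role of the formula's environment can be illustrated in base R. This sketch does not use retarget(), whose exact behavior is not covered here; it changes the environment with `environment<-`, which is assumed to be the closest base R operation. All object names are illustrative:

```r
df_demo <- data.frame(x = c(1, 3, 8))

# The formula remembers the environment of the function that created it
make_cond <- function() {
  threshold <- 5
  ~x > threshold
}
f <- make_cond()

# `x` comes from the data frame, `threshold` from the formula's environment
keep1 <- eval(f[[2]], df_demo, environment(f))
keep1
#> [1] FALSE FALSE  TRUE

# Point the formula to another environment with a different `threshold`
other_env <- new.env()
assign("threshold", 2, envir = other_env)
environment(f) <- other_env
keep2 <- eval(f[[2]], df_demo, environment(f))
keep2
#> [1] FALSE  TRUE  TRUE
```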