
Functions for tidying data by separating, uniting, filling, and handling missing values.
These are SciViews::R versions of tidyverse functions with standard
evaluation and formula-based non-standard evaluation (ending with underscore
_). They work with data.frame, data.table, and tibbles.
Functions:
separate_() - Separate one column into multiple columns by splitting on a separator
unite_() - Unite multiple columns into one by pasting strings together
fill_() - Fill missing values using previous or next non-missing value
drop_na_() - Drop rows containing missing values
replace_na_() - Replace missing values with a specified value
uncount_() - Duplicate rows according to a weighting variable
separate_(
.data = (.),
col,
into,
sep = "[^[:alnum:]]+",
remove = TRUE,
convert = FALSE,
extra = "warn",
fill = "warn",
fixed = FALSE,
...
)
unite_(.data = (.), col, ..., sep = "_", remove = TRUE, na.rm = FALSE)
fill_(.data = (.), ..., .direction = "down")
drop_na_(.data = (.), ..., .na.attr = FALSE, .prop = 0)
replace_na_(.data = (.), replace, ..., v = NULL)
uncount_(.data = (.), weights, ..., .remove = TRUE, .id = NULL)A data frame (data.frame, data.table, or tibble)
For separate_(): the column to separate. Can be a column name
as character, or a formula (e.g., ~col_name). For unite_(): the name
of the new united column (character string or formula).
For separate_(): names of new variables to create as a character
vector. Use NA to omit a variable in the output.
For separate_() and unite_(): separator between columns. For
separate_(), can be a character vector, a numeric vector of positions to
split at, or a regular expression pattern. Default is "[^[:alnum:]]+" for
separate_() and "_" for unite_().
Logical. If TRUE (default), remove input columns from output.
For separate_(): logical. If TRUE, attempts to convert
new columns to appropriate types. Default is FALSE.
For separate_() when sep is a character: what to do when
there are too many pieces. Options: "warn" (default, warn and drop),
"drop" (drop without warning), or "merge" (merge extra pieces with last).
For separate_() when sep is a character: what to do when
there are not enough pieces. Options: "warn" (default, warn and fill right
with NA), "right" (fill right without warning), or "left" (fill left).
For separate_(): logical. If TRUE, sep is a fixed string.
If FALSE (default), sep is a (perl) regular expression.
For separate_() and unite_(): additional arguments (currently
unused). For fill_() and drop_na_(): columns to fill or check for NAs.
Use formulas (e.g., ~col1, ~col2) or column names. If not provided,
uses all columns.
If TRUE, NAs are eliminated before uniting the values.
Direction in which to fill missing data: "down" (by
default), "up", or "downup" (first down, then up), "updown"
(the opposite).
logical. TRUE adds an attribute containing the removed
cases. For compatibility reasons this is exactly the same format as
na.omit(), i.e. the attribute is called "na.action" and of class omit
numeric. The proportion missing values in each case for the case to be considered as missing required to keep a
If .data is a vector, a unique value to replace NAs,
otherwise, a list of values, one per column of the data frame.
a vector where to replace NAs.
A vector of weight to use to "uncount" .data.
If TRUE (default), and weights is the name of a column,
that column is removed from .data.
The name of the column for the origin id, either names if all other arguments are named, or numbers.
A data frame of the same type as .data with the transformation applied.
separate_() returns a data frame with the specified column split into
multiple columns
unite_() returns a data frame with specified columns combined into one
fill_() returns a data frame with missing values filled
drop_na_() returns a data frame with rows containing NAs removed
replace_na_() returns a data frame or vector with NAs replaced by specified values
uncount_() returns a data frame with rows duplicated according to weights
library(svTidy)
# separate_() - split one column into multiple
df <- data.frame(x = c("a_b_c", "d_e_f", "g_h_i"))
df |> separate_(~x, into = c("A", "B", "C"), sep = "_")
#> A B C
#> 1 a b c
#> 2 d e f
#> 3 g h i
# Use character name instead of formula
df |> separate_("x", into = c("A", "B", "C"), sep = "_")
#> A B C
#> 1 a b c
#> 2 d e f
#> 3 g h i
# Drop a column with NA in into
df |> separate_(~x, into = c("A", NA, "C"), sep = "_")
#> A C
#> 1 a c
#> 2 d f
#> 3 g i
# Keep original column
df |> separate_(~x, into = c("A", "B", "C"), sep = "_", remove = FALSE)
#> x A B C
#> 1 a_b_c a b c
#> 2 d_e_f d e f
#> 3 g_h_i g h i
# Separate by numeric positions is not implemented yet
#df2 <- data.frame(date = c("20201231", "20210115", "20220601"))
#df2 |> separate_(~date, into = c("year", "month", "day"), sep = c(4, 6))
# Handle too many pieces
df3 <- data.frame(x = c("a_b_c", "d_e_f_g", "h_i"))
df3 |> separate_(~x, into = c("A", "B"), extra = "drop")
#> A B
#> 1 a b
#> 2 d e
#> 3 h i
df3 |> separate_(~x, into = c("A", "B"), extra = "merge")
#> A B
#> 1 a b_c
#> 2 d e_f_g
#> 3 h i
# Handle too few pieces
df3 |> separate_(~x, into = c("A", "B", "C"), fill = "right")
#> Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
#> A B C
#> 1 a b c
#> 2 d e f
#> 3 h i <NA>
# unite_() - combine multiple columns into one
df4 <- data.frame(year = 2020:2022, month = 1:3, day = 10:12)
df4 |> unite_(~date, ~year, ~month, ~day, sep = "-")
#> date
#> 1 2020-1-10
#> 2 2021-2-11
#> 3 2022-3-12
# Keep original columns
df4 |> unite_(~date, ~year, ~month, ~day, sep = "-", remove = FALSE)
#> date year month day
#> 1 2020-1-10 2020 1 10
#> 2 2021-2-11 2021 2 11
#> 3 2022-3-12 2022 3 12
# Handle NAs in unite
df5 <- data.frame(x = c("a", "b", NA), y = c("d", NA, "f"))
df5 |> unite_(~z, ~x, ~y)
#> z
#> 1 a_d
#> 2 b_NA
#> 3 NA_f
df5 |> unite_(~z, ~x, ~y, na.rm = TRUE)
#> z
#> 1 a_d
#> 2 b
#> 3 f
# fill_() - fill missing values
df6 <- data.frame(
group = c(1, 1, 1, 2, 2, 2),
value = c(10, NA, NA, 20, NA, 30)
)
df6 |> fill_(~value)
#> group value
#> 1 1 10
#> 2 1 10
#> 3 1 10
#> 4 2 20
#> 5 2 20
#> 6 2 30
# Fill upward
df6 |> fill_(~value, .direction = "up")
#> group value
#> 1 1 10
#> 2 1 20
#> 3 1 20
#> 4 2 20
#> 5 2 30
#> 6 2 30
# Fill down then up
df6 |> fill_(~value, .direction = "downup")
#> group value
#> 1 1 10
#> 2 1 10
#> 3 1 10
#> 4 2 20
#> 5 2 20
#> 6 2 30
# Fill specific columns
df7 <- data.frame(x = c(1, NA, 3), y = c(NA, 2, NA), z = c(1, 2, 3))
df7 |> fill_(~x, ~y, .direction = "down")
#> x y z
#> 1 1 NA 1
#> 2 1 2 2
#> 3 3 2 3
# Fill with grouped data
df6 |>
group_by_(~group) |>
fill_(~value)
#> group value
#> 1 1 10
#> 2 1 10
#> 3 1 10
#> 4 2 20
#> 5 2 20
#> 6 2 30
# drop_na_() - remove rows with missing values
df8 <- data.frame(x = c(1, 2, NA), y = c("a", NA, "c"), z = 1:3)
df8 |> drop_na_()
#> x y z
#> 1 1 a 1
# Drop NAs from specific columns only
df8 |> drop_na_(~x)
#> x y z
#> 1 1 a 1
#> 2 2 <NA> 2
df8 |> drop_na_(~x, ~y)
#> x y z
#> 1 1 a 1
# Use proportion threshold
df9 <- data.frame(x = c(1, NA, NA), y = c(NA, 2, NA), z = c(NA, NA, 3))
df9 |> drop_na_(.prop = 0.5) # Drop rows with >= 50% NAs
#> [1] x y z
#> <0 rows> (or 0-length row.names)
# Keep track of removed rows
result <- df8 |> drop_na_(.na.attr = TRUE)
attr(result, "na.action")
#> [1] 2 3
#> attr(,"class")
#> [1] "omit"
# replace_na_() - replace NAs with a value
df10 <- data.frame(x = c(1, 2, NA), y = c(NA, "b", "c"))
df10 |> replace_na_(list(x = 0, y = "missing"))
#> x y
#> 1 1 missing
#> 2 2 b
#> 3 0 c
# Replace in a single vector
vec <- c(1, 2, NA, 4, NA)
replace_na_(v = vec, replace = 0)
#> [1] 1 2 0 4 0
# Replace all NAs with same value (not standard tidyr)
df10 |> replace_na_(list(everywhere = 999))
#> x y
#> 1 1 <NA>
#> 2 2 b
#> 3 NA c
# uncount_() - duplicate rows according to weights
df11 <- data.frame(x = c("a", "b", "c"), n = c(1, 2, 3))
df11 |> uncount_(~n)
#> x
#> 1 a
#> 2 b
#> 2.1 b
#> 3 c
#> 3.1 c
#> 3.2 c
# Keep the weight column
df11 |> uncount_(~n, .remove = FALSE)
#> x n
#> 1 a 1
#> 2 b 2
#> 2.1 b 2
#> 3 c 3
#> 3.1 c 3
#> 3.2 c 3
# Add ID column to track original rows
df11 |> uncount_(~n, .id = "id")
#> x id
#> 1 a 1
#> 2 b 1
#> 2.1 b 2
#> 3 c 1
#> 3.1 c 2
#> 3.2 c 3
# Use numeric weights vector
df12 <- data.frame(x = c("a", "b", "c"))
df12 |> uncount_(weights = c(2, 1, 3))
#> x
#> 1 a
#> 1.1 a
#> 2 b
#> 3 c
#> 3.1 c
#> 3.2 c