Tidying Functions

Functions for tidying data by separating, uniting, filling, and handling missing values.

These are SciViews::R versions of tidyverse functions with standard evaluation and formula-based non-standard evaluation (ending with underscore _). They work with data.frame, data.table, and tibbles.

Functions:

separate_() - Separate one column into multiple columns by splitting on a separator
unite_() - Unite multiple columns into one by pasting strings together
fill_() - Fill missing values using previous or next non-missing value
drop_na_() - Drop rows containing missing values
replace_na_() - Replace missing values with a specified value
uncount_() - Duplicate rows according to a weighting variable

separate_(
  .data = (.),
  col,
  into,
  sep = "[^[:alnum:]]+",
  remove = TRUE,
  convert = FALSE,
  extra = "warn",
  fill = "warn",
  fixed = FALSE,
  ...
)

unite_(.data = (.), col, ..., sep = "_", remove = TRUE, na.rm = FALSE)

fill_(.data = (.), ..., .direction = "down")

drop_na_(.data = (.), ..., .na.attr = FALSE, .prop = 0)

replace_na_(.data = (.), replace, ..., v = NULL)

uncount_(.data = (.), weights, ..., .remove = TRUE, .id = NULL)

Arguments

.data: A data frame (data.frame, data.table, or tibble)
col: For separate_(): the column to separate. Can be a column name as character, or a formula (e.g., ~col_name). For unite_(): the name of the new united column (character string or formula).
into: For separate_(): names of new variables to create as a character vector. Use NA to omit a variable in the output.
sep: For separate_() and unite_(): separator between columns. For separate_(), can be a character vector, a numeric vector of positions to split at, or a regular expression pattern. Default is "[^[:alnum:]]+" for separate_() and "_" for unite_().
remove: Logical. If TRUE (default), remove input columns from output.
convert: For separate_(): logical. If TRUE, attempts to convert new columns to appropriate types. Default is FALSE.
extra: For separate_() when sep is a character: what to do when there are too many pieces. Options: "warn" (default, warn and drop), "drop" (drop without warning), or "merge" (merge extra pieces with last).
fill: For separate_() when sep is a character: what to do when there are not enough pieces. Options: "warn" (default, warn and fill right with NA), "right" (fill right without warning), or "left" (fill left).
fixed: For separate_(): logical. If TRUE, sep is a fixed string. If FALSE (default), sep is a (perl) regular expression.
...: For separate_() and unite_(): additional arguments (currently unused). For fill_() and drop_na_(): columns to fill or check for NAs. Use formulas (e.g., ~col1, ~col2) or column names. If not provided, uses all columns.
na.rm: If TRUE, NAs are eliminated before uniting the values.
.direction: Direction in which to fill missing data: "down" (by default), "up", or "downup" (first down, then up), "updown" (the opposite).
.na.attr: logical. TRUE adds an attribute containing the removed cases. For compatibility reasons this is exactly the same format as na.omit(), i.e. the attribute is called "na.action" and of class omit
.prop: numeric. The proportion missing values in each case for the case to be considered as missing required to keep a
replace: If .data is a vector, a unique value to replace NAs, otherwise, a list of values, one per column of the data frame.
v: a vector where to replace NAs.
weights: A vector of weight to use to "uncount" .data.
.remove: If TRUE (default), and weights is the name of a column, that column is removed from .data.
.id: The name of the column for the origin id, either names if all other arguments are named, or numbers.

Value

A data frame of the same type as .data with the transformation applied.

separate_() returns a data frame with the specified column split into multiple columns
unite_() returns a data frame with specified columns combined into one
fill_() returns a data frame with missing values filled
drop_na_() returns a data frame with rows containing NAs removed
replace_na_() returns a data frame or vector with NAs replaced by specified values
uncount_() returns a data frame with rows duplicated according to weights

Examples

library(svTidy)

# separate_() - split one column into multiple
df <- data.frame(x = c("a_b_c", "d_e_f", "g_h_i"))
df |> separate_(~x, into = c("A", "B", "C"), sep = "_")
#>   A B C
#> 1 a b c
#> 2 d e f
#> 3 g h i

# Use character name instead of formula
df |> separate_("x", into = c("A", "B", "C"), sep = "_")
#>   A B C
#> 1 a b c
#> 2 d e f
#> 3 g h i

# Drop a column with NA in into
df |> separate_(~x, into = c("A", NA, "C"), sep = "_")
#>   A C
#> 1 a c
#> 2 d f
#> 3 g i

# Keep original column
df |> separate_(~x, into = c("A", "B", "C"), sep = "_", remove = FALSE)
#>       x A B C
#> 1 a_b_c a b c
#> 2 d_e_f d e f
#> 3 g_h_i g h i

# Separate by numeric positions is not implemented yet
#df2 <- data.frame(date = c("20201231", "20210115", "20220601"))
#df2 |> separate_(~date, into = c("year", "month", "day"), sep = c(4, 6))

# Handle too many pieces
df3 <- data.frame(x = c("a_b_c", "d_e_f_g", "h_i"))
df3 |> separate_(~x, into = c("A", "B"), extra = "drop")
#>   A B
#> 1 a b
#> 2 d e
#> 3 h i
df3 |> separate_(~x, into = c("A", "B"), extra = "merge")
#>   A     B
#> 1 a   b_c
#> 2 d e_f_g
#> 3 h     i

# Handle too few pieces
df3 |> separate_(~x, into = c("A", "B", "C"), fill = "right")
#> Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
#>   A B    C
#> 1 a b    c
#> 2 d e    f
#> 3 h i <NA>

# unite_() - combine multiple columns into one
df4 <- data.frame(year = 2020:2022, month = 1:3, day = 10:12)
df4 |> unite_(~date, ~year, ~month, ~day, sep = "-")
#>        date
#> 1 2020-1-10
#> 2 2021-2-11
#> 3 2022-3-12

# Keep original columns
df4 |> unite_(~date, ~year, ~month, ~day, sep = "-", remove = FALSE)
#>        date year month day
#> 1 2020-1-10 2020     1  10
#> 2 2021-2-11 2021     2  11
#> 3 2022-3-12 2022     3  12

# Handle NAs in unite
df5 <- data.frame(x = c("a", "b", NA), y = c("d", NA, "f"))
df5 |> unite_(~z, ~x, ~y)
#>      z
#> 1  a_d
#> 2 b_NA
#> 3 NA_f
df5 |> unite_(~z, ~x, ~y, na.rm = TRUE)
#>     z
#> 1 a_d
#> 2   b
#> 3   f

# fill_() - fill missing values
df6 <- data.frame(
  group = c(1, 1, 1, 2, 2, 2),
  value = c(10, NA, NA, 20, NA, 30)
)
df6 |> fill_(~value)
#>   group value
#> 1     1    10
#> 2     1    10
#> 3     1    10
#> 4     2    20
#> 5     2    20
#> 6     2    30

# Fill upward
df6 |> fill_(~value, .direction = "up")
#>   group value
#> 1     1    10
#> 2     1    20
#> 3     1    20
#> 4     2    20
#> 5     2    30
#> 6     2    30

# Fill down then up
df6 |> fill_(~value, .direction = "downup")
#>   group value
#> 1     1    10
#> 2     1    10
#> 3     1    10
#> 4     2    20
#> 5     2    20
#> 6     2    30

# Fill specific columns
df7 <- data.frame(x = c(1, NA, 3), y = c(NA, 2, NA), z = c(1, 2, 3))
df7 |> fill_(~x, ~y, .direction = "down")
#>   x  y z
#> 1 1 NA 1
#> 2 1  2 2
#> 3 3  2 3

# Fill with grouped data
df6 |>
  group_by_(~group) |>
  fill_(~value)
#>   group value
#> 1     1    10
#> 2     1    10
#> 3     1    10
#> 4     2    20
#> 5     2    20
#> 6     2    30

# drop_na_() - remove rows with missing values
df8 <- data.frame(x = c(1, 2, NA), y = c("a", NA, "c"), z = 1:3)
df8 |> drop_na_()
#>   x y z
#> 1 1 a 1

# Drop NAs from specific columns only
df8 |> drop_na_(~x)
#>   x    y z
#> 1 1    a 1
#> 2 2 <NA> 2
df8 |> drop_na_(~x, ~y)
#>   x y z
#> 1 1 a 1

# Use proportion threshold
df9 <- data.frame(x = c(1, NA, NA), y = c(NA, 2, NA), z = c(NA, NA, 3))
df9 |> drop_na_(.prop = 0.5)  # Drop rows with >= 50% NAs
#> [1] x y z
#> <0 rows> (or 0-length row.names)

# Keep track of removed rows
result <- df8 |> drop_na_(.na.attr = TRUE)
attr(result, "na.action")
#> [1] 2 3
#> attr(,"class")
#> [1] "omit"

# replace_na_() - replace NAs with a value
df10 <- data.frame(x = c(1, 2, NA), y = c(NA, "b", "c"))
df10 |> replace_na_(list(x = 0, y = "missing"))
#>   x       y
#> 1 1 missing
#> 2 2       b
#> 3 0       c

# Replace in a single vector
vec <- c(1, 2, NA, 4, NA)
replace_na_(v = vec, replace = 0)
#> [1] 1 2 0 4 0

# Replace all NAs with same value (not standard tidyr)
df10 |> replace_na_(list(everywhere = 999))
#>    x    y
#> 1  1 <NA>
#> 2  2    b
#> 3 NA    c

# uncount_() - duplicate rows according to weights
df11 <- data.frame(x = c("a", "b", "c"), n = c(1, 2, 3))
df11 |> uncount_(~n)
#>     x
#> 1   a
#> 2   b
#> 2.1 b
#> 3   c
#> 3.1 c
#> 3.2 c

# Keep the weight column
df11 |> uncount_(~n, .remove = FALSE)
#>     x n
#> 1   a 1
#> 2   b 2
#> 2.1 b 2
#> 3   c 3
#> 3.1 c 3
#> 3.2 c 3

# Add ID column to track original rows
df11 |> uncount_(~n, .id = "id")
#>     x id
#> 1   a  1
#> 2   b  1
#> 2.1 b  2
#> 3   c  1
#> 3.1 c  2
#> 3.2 c  3

# Use numeric weights vector
df12 <- data.frame(x = c("a", "b", "c"))
df12 |> uncount_(weights = c(2, 1, 3))
#>     x
#> 1   a
#> 1.1 a
#> 2   b
#> 3   c
#> 3.1 c
#> 3.2 c

Arguments

Value

See also

Examples