loading...

Functions for tidying data by separating, uniting, filling, and handling missing values.

These are SciViews::R versions of tidyverse functions with standard evaluation and formula-based non-standard evaluation (ending with underscore _). They work with data.frame, data.table, and tibbles.

Functions:

  • separate_() - Separate one column into multiple columns by splitting on a separator

  • unite_() - Unite multiple columns into one by pasting strings together

  • fill_() - Fill missing values using previous or next non-missing value

  • drop_na_() - Drop rows containing missing values

  • replace_na_() - Replace missing values with a specified value

  • uncount_() - Duplicate rows according to a weighting variable

separate_(
  .data = (.),
  col,
  into,
  sep = "[^[:alnum:]]+",
  remove = TRUE,
  convert = FALSE,
  extra = "warn",
  fill = "warn",
  fixed = FALSE,
  ...
)

unite_(.data = (.), col, ..., sep = "_", remove = TRUE, na.rm = FALSE)

fill_(.data = (.), ..., .direction = "down")

drop_na_(.data = (.), ..., .na.attr = FALSE, .prop = 0)

replace_na_(.data = (.), replace, ..., v = NULL)

uncount_(.data = (.), weights, ..., .remove = TRUE, .id = NULL)

Arguments

.data

A data frame (data.frame, data.table, or tibble)

col

For separate_(): the column to separate. Can be a column name as character, or a formula (e.g., ~col_name). For unite_(): the name of the new united column (character string or formula).

into

For separate_(): names of new variables to create as a character vector. Use NA to omit a variable in the output.

sep

For separate_() and unite_(): separator between columns. For separate_(), can be a character vector, a numeric vector of positions to split at, or a regular expression pattern. Default is "[^[:alnum:]]+" for separate_() and "_" for unite_().

remove

Logical. If TRUE (default), remove input columns from output.

convert

For separate_(): logical. If TRUE, attempts to convert new columns to appropriate types. Default is FALSE.

extra

For separate_() when sep is a character: what to do when there are too many pieces. Options: "warn" (default, warn and drop), "drop" (drop without warning), or "merge" (merge extra pieces with last).

fill

For separate_() when sep is a character: what to do when there are not enough pieces. Options: "warn" (default, warn and fill right with NA), "right" (fill right without warning), or "left" (fill left).

fixed

For separate_(): logical. If TRUE, sep is a fixed string. If FALSE (default), sep is a (perl) regular expression.

...

For separate_() and unite_(): additional arguments (currently unused). For fill_() and drop_na_(): columns to fill or check for NAs. Use formulas (e.g., ~col1, ~col2) or column names. If not provided, uses all columns.

na.rm

If TRUE, NAs are eliminated before uniting the values.

.direction

Direction in which to fill missing data: "down" (by default), "up", or "downup" (first down, then up), "updown" (the opposite).

.na.attr

logical. TRUE adds an attribute containing the removed cases. For compatibility reasons this is exactly the same format as na.omit(), i.e. the attribute is called "na.action" and of class omit

.prop

numeric. The proportion missing values in each case for the case to be considered as missing required to keep a

replace

If .data is a vector, a unique value to replace NAs, otherwise, a list of values, one per column of the data frame.

v

a vector where to replace NAs.

weights

A vector of weight to use to "uncount" .data.

.remove

If TRUE (default), and weights is the name of a column, that column is removed from .data.

.id

The name of the column for the origin id, either names if all other arguments are named, or numbers.

Value

A data frame of the same type as .data with the transformation applied.

  • separate_() returns a data frame with the specified column split into multiple columns

  • unite_() returns a data frame with specified columns combined into one

  • fill_() returns a data frame with missing values filled

  • drop_na_() returns a data frame with rows containing NAs removed

  • replace_na_() returns a data frame or vector with NAs replaced by specified values

  • uncount_() returns a data frame with rows duplicated according to weights

Examples

library(svTidy)

# separate_() - split one column into multiple
df <- data.frame(x = c("a_b_c", "d_e_f", "g_h_i"))
df |> separate_(~x, into = c("A", "B", "C"), sep = "_")
#>   A B C
#> 1 a b c
#> 2 d e f
#> 3 g h i

# Use character name instead of formula
df |> separate_("x", into = c("A", "B", "C"), sep = "_")
#>   A B C
#> 1 a b c
#> 2 d e f
#> 3 g h i

# Drop a column with NA in into
df |> separate_(~x, into = c("A", NA, "C"), sep = "_")
#>   A C
#> 1 a c
#> 2 d f
#> 3 g i

# Keep original column
df |> separate_(~x, into = c("A", "B", "C"), sep = "_", remove = FALSE)
#>       x A B C
#> 1 a_b_c a b c
#> 2 d_e_f d e f
#> 3 g_h_i g h i

# Separate by numeric positions is not implemented yet
#df2 <- data.frame(date = c("20201231", "20210115", "20220601"))
#df2 |> separate_(~date, into = c("year", "month", "day"), sep = c(4, 6))

# Handle too many pieces
df3 <- data.frame(x = c("a_b_c", "d_e_f_g", "h_i"))
df3 |> separate_(~x, into = c("A", "B"), extra = "drop")
#>   A B
#> 1 a b
#> 2 d e
#> 3 h i
df3 |> separate_(~x, into = c("A", "B"), extra = "merge")
#>   A     B
#> 1 a   b_c
#> 2 d e_f_g
#> 3 h     i

# Handle too few pieces
df3 |> separate_(~x, into = c("A", "B", "C"), fill = "right")
#> Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
#>   A B    C
#> 1 a b    c
#> 2 d e    f
#> 3 h i <NA>

# unite_() - combine multiple columns into one
df4 <- data.frame(year = 2020:2022, month = 1:3, day = 10:12)
df4 |> unite_(~date, ~year, ~month, ~day, sep = "-")
#>        date
#> 1 2020-1-10
#> 2 2021-2-11
#> 3 2022-3-12

# Keep original columns
df4 |> unite_(~date, ~year, ~month, ~day, sep = "-", remove = FALSE)
#>        date year month day
#> 1 2020-1-10 2020     1  10
#> 2 2021-2-11 2021     2  11
#> 3 2022-3-12 2022     3  12

# Handle NAs in unite
df5 <- data.frame(x = c("a", "b", NA), y = c("d", NA, "f"))
df5 |> unite_(~z, ~x, ~y)
#>      z
#> 1  a_d
#> 2 b_NA
#> 3 NA_f
df5 |> unite_(~z, ~x, ~y, na.rm = TRUE)
#>     z
#> 1 a_d
#> 2   b
#> 3   f

# fill_() - fill missing values
df6 <- data.frame(
  group = c(1, 1, 1, 2, 2, 2),
  value = c(10, NA, NA, 20, NA, 30)
)
df6 |> fill_(~value)
#>   group value
#> 1     1    10
#> 2     1    10
#> 3     1    10
#> 4     2    20
#> 5     2    20
#> 6     2    30

# Fill upward
df6 |> fill_(~value, .direction = "up")
#>   group value
#> 1     1    10
#> 2     1    20
#> 3     1    20
#> 4     2    20
#> 5     2    30
#> 6     2    30

# Fill down then up
df6 |> fill_(~value, .direction = "downup")
#>   group value
#> 1     1    10
#> 2     1    10
#> 3     1    10
#> 4     2    20
#> 5     2    20
#> 6     2    30

# Fill specific columns
df7 <- data.frame(x = c(1, NA, 3), y = c(NA, 2, NA), z = c(1, 2, 3))
df7 |> fill_(~x, ~y, .direction = "down")
#>   x  y z
#> 1 1 NA 1
#> 2 1  2 2
#> 3 3  2 3

# Fill with grouped data
df6 |>
  group_by_(~group) |>
  fill_(~value)
#>   group value
#> 1     1    10
#> 2     1    10
#> 3     1    10
#> 4     2    20
#> 5     2    20
#> 6     2    30

# drop_na_() - remove rows with missing values
df8 <- data.frame(x = c(1, 2, NA), y = c("a", NA, "c"), z = 1:3)
df8 |> drop_na_()
#>   x y z
#> 1 1 a 1

# Drop NAs from specific columns only
df8 |> drop_na_(~x)
#>   x    y z
#> 1 1    a 1
#> 2 2 <NA> 2
df8 |> drop_na_(~x, ~y)
#>   x y z
#> 1 1 a 1

# Use proportion threshold
df9 <- data.frame(x = c(1, NA, NA), y = c(NA, 2, NA), z = c(NA, NA, 3))
df9 |> drop_na_(.prop = 0.5)  # Drop rows with >= 50% NAs
#> [1] x y z
#> <0 rows> (or 0-length row.names)

# Keep track of removed rows
result <- df8 |> drop_na_(.na.attr = TRUE)
attr(result, "na.action")
#> [1] 2 3
#> attr(,"class")
#> [1] "omit"

# replace_na_() - replace NAs with a value
df10 <- data.frame(x = c(1, 2, NA), y = c(NA, "b", "c"))
df10 |> replace_na_(list(x = 0, y = "missing"))
#>   x       y
#> 1 1 missing
#> 2 2       b
#> 3 0       c

# Replace in a single vector
vec <- c(1, 2, NA, 4, NA)
replace_na_(v = vec, replace = 0)
#> [1] 1 2 0 4 0

# Replace all NAs with same value (not standard tidyr)
df10 |> replace_na_(list(everywhere = 999))
#>    x    y
#> 1  1 <NA>
#> 2  2    b
#> 3 NA    c

# uncount_() - duplicate rows according to weights
df11 <- data.frame(x = c("a", "b", "c"), n = c(1, 2, 3))
df11 |> uncount_(~n)
#>     x
#> 1   a
#> 2   b
#> 2.1 b
#> 3   c
#> 3.1 c
#> 3.2 c

# Keep the weight column
df11 |> uncount_(~n, .remove = FALSE)
#>     x n
#> 1   a 1
#> 2   b 2
#> 2.1 b 2
#> 3   c 3
#> 3.1 c 3
#> 3.2 c 3

# Add ID column to track original rows
df11 |> uncount_(~n, .id = "id")
#>     x id
#> 1   a  1
#> 2   b  1
#> 2.1 b  2
#> 3   c  1
#> 3.1 c  2
#> 3.2 c  3

# Use numeric weights vector
df12 <- data.frame(x = c("a", "b", "c"))
df12 |> uncount_(weights = c(2, 1, 3))
#>     x
#> 1   a
#> 1.1 a
#> 2   b
#> 3   c
#> 3.1 c
#> 3.2 c