
Functions for joining two data frames based on matching values in key columns.
These are SciViews::R versions of tidyverse functions with standard
evaluation and formula-based non-standard evaluation (ending with underscore
_). They work with data.frame, data.table, and tibbles.
Functions:
left_join_() - Keep all rows from x, add matching columns from y
right_join_() - Keep all rows from y, add matching columns from x
inner_join_() - Keep only rows with matches in both x and y
full_join_() - Keep all rows from both x and y
semi_join_() - Keep rows in x that have a match in y (no columns from y)
anti_join_() - Keep rows in x that do NOT have a match in y
join_() - Generic join function with how parameter
join_(
x,
y,
by = NULL,
copy = FALSE,
suffix = c(".x", ".y"),
...,
keep = NULL,
na_matches = c("na", "never"),
multiple = "all",
unmatched = "drop",
relationship = NULL,
sort = FALSE,
verbose = 0,
column = NULL,
attr = NULL,
how = "full"
)
left_join_(
x = (.),
y,
by = NULL,
copy = FALSE,
suffix = c(".x", ".y"),
...,
keep = NULL,
na_matches = c("na", "never"),
multiple = "all",
unmatched = "drop",
relationship = NULL,
sort = FALSE,
verbose = 0,
column = NULL,
attr = NULL
)
right_join_(
x = (.),
y,
by = NULL,
copy = FALSE,
suffix = c(".x", ".y"),
...,
keep = NULL,
na_matches = c("na", "never"),
multiple = "all",
unmatched = "drop",
relationship = NULL,
sort = FALSE,
verbose = 0,
column = NULL,
attr = NULL
)
inner_join_(
x = (.),
y,
by = NULL,
copy = FALSE,
suffix = c(".x", ".y"),
...,
keep = NULL,
na_matches = c("na", "never"),
multiple = "all",
unmatched = "drop",
relationship = NULL,
sort = FALSE,
verbose = 0,
column = NULL,
attr = NULL
)
full_join_(
x = (.),
y,
by = NULL,
copy = FALSE,
suffix = c(".x", ".y"),
...,
keep = NULL,
na_matches = c("na", "never"),
multiple = "all",
relationship = NULL,
sort = FALSE,
verbose = 0,
column = NULL,
attr = NULL
)
semi_join_(
x = (.),
y,
by = NULL,
copy = FALSE,
...,
na_matches = c("na", "never"),
sort = FALSE,
verbose = 0,
column = NULL,
attr = NULL
)
anti_join_(
x = (.),
y,
by = NULL,
copy = FALSE,
...,
na_matches = c("na", "never"),
sort = FALSE,
verbose = 0,
column = NULL,
attr = NULL
)A data frame (data.frame, data.table, or tibble)
A data frame to join with x
A character vector of column names to join by. If NULL, uses all
columns with common names across x and y. Can also be a named character
vector to join by different column names (e.g., c("a" = "b") joins x$a to
y$b), or a dplyr::join_by() object for complex joins.
If x and y are not from the same data source, and copy is TRUE,
then y will be copied into the same source as x. This allows you to join
tables across data sources, but is potentially expensive.
Character vector of length 2 specifying suffixes to append to
duplicate column names. Default is c(".x", ".y").
Additional arguments (currently unused, for compatibility)
Should the join keys from both x and y be preserved in the output?
If NULL (default), keeps join keys from x only for equality joins, but
keeps keys from both for inequality joins. If TRUE, all keys from both
inputs are retained. If FALSE, only keys from x are retained. Note: when
keep = TRUE, calculation is delegated to dplyr join methods.
Should two NA or two NaN values match? "na" (default)
treats two NA or two NaN values as equal, like %in%, match(), and
merge(). "never" treats two NA or two NaN values as different, and
will never match them together or to any other values. When "never",
calculation is delegated to dplyr join methods.
Handling of rows in x with multiple matches in y. Options:
"all" (default) returns every match detected in y (SQL behavior),
"first" returns the first match detected in y,
"last" returns the last match detected in y,
"any" returns one match (faster but non-deterministic).
For "any" and "last" (and "first" for right joins), calculation is
delegated to dplyr join methods.
How should unmatched keys that would result in dropped rows
be handled? "drop" (default) drops unmatched keys from the result.
"error" throws an error if unmatched keys are detected. Can also be a
named list like list(x = 1, y = 0.5, fail = "warning") specifying the
proportions that must match and the action ("message", "warning", or
"error"). Not available for full_join_(), semi_join_(), and
anti_join_().
Expected relationship between the keys of x and y:
NULL (default) has no expectations but warns for many-to-many,
"one-to-one" expects each row in x matches at most 1 row in y and vice versa,
"one-to-many" expects each row in y matches at most 1 row in x,
"many-to-one" expects each row in x matches at most 1 row in y,
"many-to-many" has no restrictions (explicit about relationship).
Logical. If TRUE, the result is sorted by the grouping variables.
Default is FALSE.
Integer controlling information printed about the join:
0 means no output (default), 1 prints join information, and
2 additionally prints the classes of the by columns.
Note: This parameter is ignored when using dplyr join methods.
Name for an extra column to generate in the output indicating
which dataset a record came from. TRUE creates a column named ".join",
or provide a custom name as a character string. NULL (default) does not
add this column. Note: This parameter is ignored when using dplyr join methods.
Name for an attribute providing information about the join
performed (including output of collapse::fmatch()) attached to the result.
TRUE creates an attribute named "join.match", or provide a custom name.
NULL (default) does not add this attribute. Note: This parameter is
ignored when using dplyr join methods.
Character string specifying the join type for join_():
"full" (default), "inner", "left", "right", "semi", or "anti".
A data frame of the same type as x``. The order of the rows and columns of x` is preserved as much as possible.
left_join_() returns all rows from x, and all columns from x and y
right_join_() returns all rows from y, and all columns from x and y
inner_join_() returns only rows with matches in both x and y
full_join_() returns all rows from both x and y
semi_join_() returns rows from x (no columns added from y)
anti_join_() returns rows from x with no match in y (no columns from
y)
library(svTidy)
# Create example datasets
band_members <- data.frame(
name = c("Mick", "John", "Paul"),
band = c("Stones", "Beatles", "Beatles")
)
band_instruments <- data.frame(
name = c("John", "Paul", "Keith"),
plays = c("guitar", "bass", "guitar")
)
# Left join - keep all rows from x
band_members |> left_join_(band_instruments, by = "name")
#> name band plays
#> 1 Mick Stones <NA>
#> 2 John Beatles guitar
#> 3 Paul Beatles bass
# Right join - keep all rows from y
band_members |> right_join_(band_instruments, by = "name")
#> name band plays
#> 1 John Beatles guitar
#> 2 Paul Beatles bass
#> 3 Keith <NA> guitar
# Inner join - keep only matching rows
band_members |> inner_join_(band_instruments, by = "name")
#> name band plays
#> 1 John Beatles guitar
#> 2 Paul Beatles bass
# Full join - keep all rows from both
band_members |> full_join_(band_instruments, by = "name")
#> name band plays
#> 1 Mick Stones <NA>
#> 2 John Beatles guitar
#> 3 Paul Beatles bass
#> 4 Keith <NA> guitar
# Semi join - filter x to rows matching y (no columns from y)
band_members |> semi_join_(band_instruments, by = "name")
#> name band
#> 1 John Beatles
#> 2 Paul Beatles
# Anti join - filter x to rows NOT matching y
band_members |> anti_join_(band_instruments, by = "name")
#> name band
#> 1 Mick Stones
# Join by different column names
band_instruments2 <- data.frame(
artist = c("John", "Paul", "Keith"),
plays = c("guitar", "bass", "guitar")
)
band_members |> left_join_(band_instruments2, by = c("name" = "artist"))
#> name band plays
#> 1 Mick Stones <NA>
#> 2 John Beatles guitar
#> 3 Paul Beatles bass
# Add suffix to duplicate columns
df1 <- data.frame(id = 1:3, value = c("a", "b", "c"))
df2 <- data.frame(id = 2:4, value = c("B", "C", "D"))
df1 |> full_join_(df2, by = "id", suffix = c("_x", "_y"))
#> id value_x value_y
#> 1 1 a <NA>
#> 2 2 b B
#> 3 3 c C
#> 4 4 <NA> D
# Control handling of multiple matches
df_x <- data.frame(key = c(1, 1, 2), x = c("a", "b", "c"))
df_y <- data.frame(key = c(1, 1, 2), y = c("A", "B", "C"))
df_x |> left_join_(df_y, by = "key", multiple = "all")
#> key x y
#> 1 1 a A
#> 2 1 a B
#> 3 1 b A
#> 4 1 b B
#> 5 2 c C
df_x |> left_join_(df_y, by = "key", multiple = "first")
#> key x y
#> 1 1 a A
#> 2 1 b A
#> 3 2 c C
# Validate relationships
df_one <- data.frame(id = 1:3, val = c("a", "b", "c"))
df_many <- data.frame(id = c(1, 1, 2), val = c("A", "B", "C"))
if (FALSE) { # \dontrun{
# This will error - expects one-to-one but is one-to-many
df_one |> left_join_(df_many, by = "id", relationship = "one-to-one")
} # }
# This works - explicitly one-to-many
df_one |> left_join_(df_many, by = "id", relationship = "one-to-many")
#> id val.x val.y
#> 1 1 a A
#> 2 1 a B
#> 3 2 b C
#> 4 3 c <NA>
# Add indicator column showing source
band_members |>
full_join_(band_instruments, by = "name", column = "source")
#> name band plays source
#> 1 Mick Stones <NA> band_members
#> 2 John Beatles guitar matched
#> 3 Paul Beatles bass matched
#> 4 Keith <NA> guitar band_instruments
# Use generic join_() with how parameter
band_members |> join_(band_instruments, by = "name", how = "inner")
#> name band plays
#> 1 John Beatles guitar
#> 2 Paul Beatles bass
band_members |> join_(band_instruments, by = "name", how = "left")
#> name band plays
#> 1 Mick Stones <NA>
#> 2 John Beatles guitar
#> 3 Paul Beatles bass
# Handle unmatched keys
if (FALSE) { # \dontrun{
# Error if any keys don't match
band_members |>
inner_join_(band_instruments, by = "name", unmatched = "error")
} # }