Joining Functions

Functions for joining two data frames based on matching values in key columns.

These are SciViews::R versions of tidyverse functions with standard evaluation and formula-based non-standard evaluation (ending with underscore _). They work with data.frame, data.table, and tibbles.

Functions:

left_join_() - Keep all rows from x, add matching columns from y
right_join_() - Keep all rows from y, add matching columns from x
inner_join_() - Keep only rows with matches in both x and y
full_join_() - Keep all rows from both x and y
semi_join_() - Keep rows in x that have a match in y (no columns from y)
anti_join_() - Keep rows in x that do NOT have a match in y
join_() - Generic join function with how parameter

join_(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL,
  na_matches = c("na", "never"),
  multiple = "all",
  unmatched = "drop",
  relationship = NULL,
  sort = FALSE,
  verbose = 0,
  column = NULL,
  attr = NULL,
  how = "full"
)

left_join_(
  x = (.),
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL,
  na_matches = c("na", "never"),
  multiple = "all",
  unmatched = "drop",
  relationship = NULL,
  sort = FALSE,
  verbose = 0,
  column = NULL,
  attr = NULL
)

right_join_(
  x = (.),
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL,
  na_matches = c("na", "never"),
  multiple = "all",
  unmatched = "drop",
  relationship = NULL,
  sort = FALSE,
  verbose = 0,
  column = NULL,
  attr = NULL
)

inner_join_(
  x = (.),
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL,
  na_matches = c("na", "never"),
  multiple = "all",
  unmatched = "drop",
  relationship = NULL,
  sort = FALSE,
  verbose = 0,
  column = NULL,
  attr = NULL
)

full_join_(
  x = (.),
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL,
  na_matches = c("na", "never"),
  multiple = "all",
  relationship = NULL,
  sort = FALSE,
  verbose = 0,
  column = NULL,
  attr = NULL
)

semi_join_(
  x = (.),
  y,
  by = NULL,
  copy = FALSE,
  ...,
  na_matches = c("na", "never"),
  sort = FALSE,
  verbose = 0,
  column = NULL,
  attr = NULL
)

anti_join_(
  x = (.),
  y,
  by = NULL,
  copy = FALSE,
  ...,
  na_matches = c("na", "never"),
  sort = FALSE,
  verbose = 0,
  column = NULL,
  attr = NULL
)

Arguments

x: A data frame (data.frame, data.table, or tibble)
y: A data frame to join with x
by: A character vector of column names to join by. If NULL, uses all columns with common names across x and y. Can also be a named character vector to join by different column names (e.g., c("a" = "b") joins x$a to y$b), or a dplyr::join_by() object for complex joins.
copy: If x and y are not from the same data source, and copy is TRUE, then y will be copied into the same source as x. This allows you to join tables across data sources, but is potentially expensive.
suffix: Character vector of length 2 specifying suffixes to append to duplicate column names. Default is c(".x", ".y").
...: Additional arguments (currently unused, for compatibility)
keep: Should the join keys from both x and y be preserved in the output? If NULL (default), keeps join keys from x only for equality joins, but keeps keys from both for inequality joins. If TRUE, all keys from both inputs are retained. If FALSE, only keys from x are retained. Note: when keep = TRUE, calculation is delegated to dplyr join methods.
na_matches: Should two NA or two NaN values match? "na" (default) treats two NA or two NaN values as equal, like %in%, match(), and merge(). "never" treats two NA or two NaN values as different, and will never match them together or to any other values. When "never", calculation is delegated to dplyr join methods.
multiple: Handling of rows in x with multiple matches in y. Options: "all" (default) returns every match detected in y (SQL behavior), "first" returns the first match detected in y, "last" returns the last match detected in y, "any" returns one match (faster but non-deterministic). For "any" and "last" (and "first" for right joins), calculation is delegated to dplyr join methods.
unmatched: How should unmatched keys that would result in dropped rows be handled? "drop" (default) drops unmatched keys from the result. "error" throws an error if unmatched keys are detected. Can also be a named list like list(x = 1, y = 0.5, fail = "warning") specifying the proportions that must match and the action ("message", "warning", or "error"). Not available for full_join_(), semi_join_(), and anti_join_().
relationship: Expected relationship between the keys of x and y: NULL (default) has no expectations but warns for many-to-many, "one-to-one" expects each row in x matches at most 1 row in y and vice versa, "one-to-many" expects each row in y matches at most 1 row in x, "many-to-one" expects each row in x matches at most 1 row in y, "many-to-many" has no restrictions (explicit about relationship).
sort: Logical. If TRUE, the result is sorted by the grouping variables. Default is FALSE.
verbose: Integer controlling information printed about the join: 0 means no output (default), 1 prints join information, and 2 additionally prints the classes of the by columns. Note: This parameter is ignored when using dplyr join methods.
column: Name for an extra column to generate in the output indicating which dataset a record came from. TRUE creates a column named ".join", or provide a custom name as a character string. NULL (default) does not add this column. Note: This parameter is ignored when using dplyr join methods.
attr: Name for an attribute providing information about the join performed (including output of collapse::fmatch()) attached to the result. TRUE creates an attribute named "join.match", or provide a custom name. NULL (default) does not add this attribute. Note: This parameter is ignored when using dplyr join methods.
how: Character string specifying the join type for join_(): "full" (default), "inner", "left", "right", "semi", or "anti".

Value

A data frame of the same type as x``. The order of the rows and columns of x` is preserved as much as possible.

left_join_() returns all rows from x, and all columns from x and y
right_join_() returns all rows from y, and all columns from x and y
inner_join_() returns only rows with matches in both x and y
full_join_() returns all rows from both x and y
semi_join_() returns rows from x (no columns added from y)
anti_join_() returns rows from x with no match in y (no columns from y)

Examples

library(svTidy)

# Create example datasets
band_members <- data.frame(
  name = c("Mick", "John", "Paul"),
  band = c("Stones", "Beatles", "Beatles")
)
band_instruments <- data.frame(
  name = c("John", "Paul", "Keith"),
  plays = c("guitar", "bass", "guitar")
)

# Left join - keep all rows from x
band_members |> left_join_(band_instruments, by = "name")
#>   name    band  plays
#> 1 Mick  Stones   <NA>
#> 2 John Beatles guitar
#> 3 Paul Beatles   bass

# Right join - keep all rows from y
band_members |> right_join_(band_instruments, by = "name")
#>    name    band  plays
#> 1  John Beatles guitar
#> 2  Paul Beatles   bass
#> 3 Keith    <NA> guitar

# Inner join - keep only matching rows
band_members |> inner_join_(band_instruments, by = "name")
#>   name    band  plays
#> 1 John Beatles guitar
#> 2 Paul Beatles   bass

# Full join - keep all rows from both
band_members |> full_join_(band_instruments, by = "name")
#>    name    band  plays
#> 1  Mick  Stones   <NA>
#> 2  John Beatles guitar
#> 3  Paul Beatles   bass
#> 4 Keith    <NA> guitar

# Semi join - filter x to rows matching y (no columns from y)
band_members |> semi_join_(band_instruments, by = "name")
#>   name    band
#> 1 John Beatles
#> 2 Paul Beatles

# Anti join - filter x to rows NOT matching y
band_members |> anti_join_(band_instruments, by = "name")
#>   name   band
#> 1 Mick Stones

# Join by different column names
band_instruments2 <- data.frame(
  artist = c("John", "Paul", "Keith"),
  plays = c("guitar", "bass", "guitar")
)
band_members |> left_join_(band_instruments2, by = c("name" = "artist"))
#>   name    band  plays
#> 1 Mick  Stones   <NA>
#> 2 John Beatles guitar
#> 3 Paul Beatles   bass

# Add suffix to duplicate columns
df1 <- data.frame(id = 1:3, value = c("a", "b", "c"))
df2 <- data.frame(id = 2:4, value = c("B", "C", "D"))
df1 |> full_join_(df2, by = "id", suffix = c("_x", "_y"))
#>   id value_x value_y
#> 1  1       a    <NA>
#> 2  2       b       B
#> 3  3       c       C
#> 4  4    <NA>       D

# Control handling of multiple matches
df_x <- data.frame(key = c(1, 1, 2), x = c("a", "b", "c"))
df_y <- data.frame(key = c(1, 1, 2), y = c("A", "B", "C"))
df_x |> left_join_(df_y, by = "key", multiple = "all")
#>   key x y
#> 1   1 a A
#> 2   1 a B
#> 3   1 b A
#> 4   1 b B
#> 5   2 c C
df_x |> left_join_(df_y, by = "key", multiple = "first")
#>   key x y
#> 1   1 a A
#> 2   1 b A
#> 3   2 c C

# Validate relationships
df_one <- data.frame(id = 1:3, val = c("a", "b", "c"))
df_many <- data.frame(id = c(1, 1, 2), val = c("A", "B", "C"))
if (FALSE) { # \dontrun{
# This will error - expects one-to-one but is one-to-many
df_one |> left_join_(df_many, by = "id", relationship = "one-to-one")
} # }
# This works - explicitly one-to-many
df_one |> left_join_(df_many, by = "id", relationship = "one-to-many")
#>   id val.x val.y
#> 1  1     a     A
#> 2  1     a     B
#> 3  2     b     C
#> 4  3     c  <NA>

# Add indicator column showing source
band_members |>
  full_join_(band_instruments, by = "name", column = "source")
#>    name    band  plays           source
#> 1  Mick  Stones   <NA>     band_members
#> 2  John Beatles guitar          matched
#> 3  Paul Beatles   bass          matched
#> 4 Keith    <NA> guitar band_instruments

# Use generic join_() with how parameter
band_members |> join_(band_instruments, by = "name", how = "inner")
#>   name    band  plays
#> 1 John Beatles guitar
#> 2 Paul Beatles   bass
band_members |> join_(band_instruments, by = "name", how = "left")
#>   name    band  plays
#> 1 Mick  Stones   <NA>
#> 2 John Beatles guitar
#> 3 Paul Beatles   bass

# Handle unmatched keys
if (FALSE) { # \dontrun{
# Error if any keys don't match
band_members |>
  inner_join_(band_instruments, by = "name", unmatched = "error")
} # }

Arguments

Value

See also

Examples