Last data update: 2014.03.03

R: Merge two data frames (fast) by common columns by performing...
matchmergeR Documentation

Merge two data frames (fast) by common columns by performing a left (outer) join or an inner join.

Description

Merge two data frames (fast) by common columns by performing a left (outer) join or an inner join.
The data frames are merged on the columns given by by.x and by.y. Columns can be specified only by name. This differs from the merge function from the base package in that merging is done based on 1 column key only. If more than one column is supplied in by.x and by.y, these columns will be concatenated together to form 1 key which will be used to match. Alternatively, by.x and by.y can be 2 vectors of length NROW(x) which will be used as keys.

Usage

matchmerge(x, y, by.x, by.y, all.x = FALSE, by.iskey = FALSE,
  suffix = ".y", add.columns = colnames(y), check.duplicates = TRUE,
  trace = FALSE)

Arguments

x

the left hand side data frame to merge

y

the right hand side data frame to merge
or a vector in which case you always need to supply by.y as a vector, make sure by.iskey is set to TRUE and provide in add.columns the column name for which y will be relabelled to in the joined data frame (see the example).

by.x

either the name of 1 column in x or a character vector of length NROW(x) which will be used as key to merge the 2 data frames

by.y

either the name of 1 column in y or a character vector of length NROW(x) which will be used as key to merge the 2 data frames. Duplicate values in by.y are not allowed.

all.x

logical, if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have NAs in those columns that are usually filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output. The default value corresponds to an inner join. If TRUE is supplied, this corresponds to a left (outer) join.

by.iskey

Logical, indicating that the by.x and the by.y inputs are vectors of length NROW(x) and NROW(y) instead of column names in x and y. If this is FALSE, the input columns will be pasted together to create a key to merge upon. Otherwise, the function will use the by.x and by.y vectors directly as matching key. Defaults to FALSE indicating the by.x and by.y are column names in x and y.

suffix

a character string to be used for duplicate column names in x and y to make the y columns unique.

add.columns

character vector of column names in y to merge to the x data frame. Defaults to all columns in y.

check.duplicates

checks if by.y contains duplicates which is not allowed. Defaults to TRUE.

trace

logical, indicating to print some informative messages about the progress

Details

The rows in the right hand side data frame that match on the specific key are extracted, and joined together with the left hand side data frame.

Merging is done based on the match function on the key value. This makes the function a lot faster when compared to applying merge, especially for large data frames (see the example). And also the memory consumption is a lot smaller.

In SQL database terminology, the default value of all.x = FALSE gives a natural join, a special case of an inner join. Specifying all.x = FALSE gives a left (outer) join. Right (outer) join or (full) outer join are not provided in this function.

Value

data frame with x joined with y based on the supplied columns. The output columns are the columns in x followed by the extra columns in y.

See Also

cbind, match, merge

Examples

left <- data.frame(idlhs = c(1:4, 3:5), a = LETTERS[1:7], stringsAsFactors = FALSE)
right <- data.frame(idrhs = c(1:4), b = LETTERS[8:11], stringsAsFactors = FALSE)
## Inner join
matchmerge(x=left, y=right, by.x = "idlhs", by.y = "idrhs")

## Left outer join in 2 ways
matchmerge(x=left, y=right, by.x = "idlhs", by.y = "idrhs", all.x=TRUE)
matchmerge(x=left, y=right, by.x = left$idlhs, by.y = right$idrhs, all.x=TRUE, by.iskey=TRUE)

## Show usage when y is just a vector instead of a data.frame
matchmerge(x=left, y=right$b, by.x = left$idlhs, by.y = right$idrhs, all.x=TRUE,
by.iskey=TRUE, add.columns="b.renamed")

## Show speedup difference with merge
## Not run: 
size <- 100000
dimension <- seq(Sys.Date(), Sys.Date()+10, by = "day")
left <- data.frame(date = rep(dimension, size), sales = rnorm(size))
right <- data.frame(date = dimension, feature = dimension-7, feature = dimension-14)
dim(left)
dim(right)
print(system.time(merge(left, right, by.x="date", by.y="date", all.x=TRUE, all.y=FALSE)))
print(system.time(matchmerge(left, right, by.x="date", by.y="date", all.x=TRUE, by.iskey=FALSE)))

## End(Not run)
## Show example usage
products <- expand.grid(product = c("Pepsi", "Coca Cola"), type = c("Can","Bottle"),
size = c("6Ml","8Ml"), distributor = c("Distri X","Distri Y"), salesperson = c("Mr X","Mr Y"),
stringsAsFactors=FALSE)
products <- products[!duplicated(products[, c("product","type","size")]), ]
products$key <- paste(products$product, products$type, products$size, sep=".")
sales <- expand.grid(item = unique(products$key), sales = rnorm(10000, mean = 100))
str(products)
str(sales)
info <- matchmerge(x=sales, y=products,
  by.x=sales$item, by.y=products$key, all.x=TRUE, by.iskey=TRUE,
  add.columns=c("size","distributor"), check.duplicates=FALSE)
str(info)
tapply(info$sales, info$distributor, FUN=sum)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(ETLUtils)
Loading required package: ff
Loading required package: bit
Attaching package bit
package:bit (c) 2008-2012 Jens Oehlschlaegel (GPL-2)
creators: bit bitwhich
coercion: as.logical as.integer as.bit as.bitwhich which
operator: ! & | xor != ==
querying: print length any all min max range sum summary
bit access: length<- [ [<- [[ [[<-
for more help type ?bit

Attaching package: 'bit'

The following object is masked from 'package:base':

    xor

Attaching package ff
- getOption("fftempdir")=="/tmp/RtmpFYgpUK"

- getOption("ffextension")=="ff"

- getOption("ffdrop")==TRUE

- getOption("fffinonexit")==TRUE

- getOption("ffpagesize")==65536

- getOption("ffcaching")=="mmnoflush"  -- consider "ffeachflush" if your system stalls on large writes

- getOption("ffbatchbytes")==16777216 -- consider a different value for tuning your system

- getOption("ffmaxbytes")==536870912 -- consider a different value for tuning your system


Attaching package: 'ff'

The following objects are masked from 'package:bit':

    clone, clone.default, clone.list

The following objects are masked from 'package:utils':

    write.csv, write.csv2

The following objects are masked from 'package:base':

    is.factor, is.ordered

> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/ETLUtils/matchmerge.Rd_%03d_medium.png", width=480, height=480)
> ### Name: matchmerge
> ### Title: Merge two data frames (fast) by common columns by performing a
> ###   left (outer) join or an inner join.
> ### Aliases: matchmerge
> 
> ### ** Examples
> 
> left <- data.frame(idlhs = c(1:4, 3:5), a = LETTERS[1:7], stringsAsFactors = FALSE)
> right <- data.frame(idrhs = c(1:4), b = LETTERS[8:11], stringsAsFactors = FALSE)
> ## Inner join
> matchmerge(x=left, y=right, by.x = "idlhs", by.y = "idrhs")
  idlhs a idrhs b
1     1 A     1 H
2     2 B     2 I
3     3 C     3 J
4     4 D     4 K
5     3 E     3 J
6     4 F     4 K
> 
> ## Left outer join in 2 ways
> matchmerge(x=left, y=right, by.x = "idlhs", by.y = "idrhs", all.x=TRUE)
  idlhs a idrhs    b
1     1 A     1    H
2     2 B     2    I
3     3 C     3    J
4     4 D     4    K
5     3 E     3    J
6     4 F     4    K
7     5 G    NA <NA>
> matchmerge(x=left, y=right, by.x = left$idlhs, by.y = right$idrhs, all.x=TRUE, by.iskey=TRUE)
  idlhs a idrhs    b
1     1 A     1    H
2     2 B     2    I
3     3 C     3    J
4     4 D     4    K
5     3 E     3    J
6     4 F     4    K
7     5 G    NA <NA>
> 
> ## Show usage when y is just a vector instead of a data.frame
> matchmerge(x=left, y=right$b, by.x = left$idlhs, by.y = right$idrhs, all.x=TRUE,
+ by.iskey=TRUE, add.columns="b.renamed")
  idlhs a b.renamed
1     1 A         H
2     2 B         I
3     3 C         J
4     4 D         K
5     3 E         J
6     4 F         K
7     5 G      <NA>
> 
> ## Show speedup difference with merge
> ## Not run: 
> ##D size <- 100000
> ##D dimension <- seq(Sys.Date(), Sys.Date()+10, by = "day")
> ##D left <- data.frame(date = rep(dimension, size), sales = rnorm(size))
> ##D right <- data.frame(date = dimension, feature = dimension-7, feature = dimension-14)
> ##D dim(left)
> ##D dim(right)
> ##D print(system.time(merge(left, right, by.x="date", by.y="date", all.x=TRUE, all.y=FALSE)))
> ##D print(system.time(matchmerge(left, right, by.x="date", by.y="date", all.x=TRUE, by.iskey=FALSE)))
> ## End(Not run)
> ## Show example usage
> products <- expand.grid(product = c("Pepsi", "Coca Cola"), type = c("Can","Bottle"),
+ size = c("6Ml","8Ml"), distributor = c("Distri X","Distri Y"), salesperson = c("Mr X","Mr Y"),
+ stringsAsFactors=FALSE)
> products <- products[!duplicated(products[, c("product","type","size")]), ]
> products$key <- paste(products$product, products$type, products$size, sep=".")
> sales <- expand.grid(item = unique(products$key), sales = rnorm(10000, mean = 100))
> str(products)
'data.frame':	8 obs. of  6 variables:
 $ product    : chr  "Pepsi" "Coca Cola" "Pepsi" "Coca Cola" ...
 $ type       : chr  "Can" "Can" "Bottle" "Bottle" ...
 $ size       : chr  "6Ml" "6Ml" "6Ml" "6Ml" ...
 $ distributor: chr  "Distri X" "Distri X" "Distri X" "Distri X" ...
 $ salesperson: chr  "Mr X" "Mr X" "Mr X" "Mr X" ...
 $ key        : chr  "Pepsi.Can.6Ml" "Coca Cola.Can.6Ml" "Pepsi.Bottle.6Ml" "Coca Cola.Bottle.6Ml" ...
 - attr(*, "out.attrs")=List of 2
  ..$ dim     : Named int  2 2 2 2 2
  .. ..- attr(*, "names")= chr  "product" "type" "size" "distributor" ...
  ..$ dimnames:List of 5
  .. ..$ product    : chr  "product=Pepsi" "product=Coca Cola"
  .. ..$ type       : chr  "type=Can" "type=Bottle"
  .. ..$ size       : chr  "size=6Ml" "size=8Ml"
  .. ..$ distributor: chr  "distributor=Distri X" "distributor=Distri Y"
  .. ..$ salesperson: chr  "salesperson=Mr X" "salesperson=Mr Y"
> str(sales)
'data.frame':	80000 obs. of  2 variables:
 $ item : Factor w/ 8 levels "Pepsi.Can.6Ml",..: 1 2 3 4 5 6 7 8 1 2 ...
 $ sales: num  101 101 101 101 101 ...
 - attr(*, "out.attrs")=List of 2
  ..$ dim     : Named int  8 10000
  .. ..- attr(*, "names")= chr  "item" "sales"
  ..$ dimnames:List of 2
  .. ..$ item : chr  "item=Pepsi.Can.6Ml" "item=Coca Cola.Can.6Ml" "item=Pepsi.Bottle.6Ml" "item=Coca Cola.Bottle.6Ml" ...
  .. ..$ sales: chr  "sales=101.24399" "sales=100.90323" "sales= 98.64241" "sales= 99.40253" ...
> info <- matchmerge(x=sales, y=products,
+   by.x=sales$item, by.y=products$key, all.x=TRUE, by.iskey=TRUE,
+   add.columns=c("size","distributor"), check.duplicates=FALSE)
> str(info)
'data.frame':	80000 obs. of  4 variables:
 $ item       : Factor w/ 8 levels "Pepsi.Can.6Ml",..: 1 2 3 4 5 6 7 8 1 2 ...
 $ sales      : num  101 101 101 101 101 ...
 $ size       : chr  "6Ml" "6Ml" "6Ml" "6Ml" ...
 $ distributor: chr  "Distri X" "Distri X" "Distri X" "Distri X" ...
> tapply(info$sales, info$distributor, FUN=sum)
Distri X 
 8000568 
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>