R: Parallelize a Vector Map Function using Forking
pvec
R Documentation
Description
pvec parallelizes the execution of a function on vector elements
by splitting the vector and submitting each part to one core. The
function must be a vectorized map, i.e. it takes a vector input and
creates a vector output of exactly the same length as the input which
doesn't depend on the partition of the vector.
It relies on forking and hence is not available on Windows unless
mc.cores = 1.
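The "vectorized map" requirement can be checked serially before any forking is involved. The sketch below uses a hypothetical helper, is_vector_map (not part of the parallel package), that compares the result on the whole vector with the concatenated results on two pieces. It tests only one split, so a TRUE result is evidence rather than proof.

```r
# Hypothetical helper (an assumption for illustration, not a parallel API):
# TRUE if FUN preserves length and gives the same answer when the input
# is cut into two pieces at position k.
is_vector_map <- function(FUN, v, k = length(v) %/% 2) {
  whole <- FUN(v)
  parts <- c(FUN(head(v, k)), FUN(tail(v, length(v) - k)))
  length(whole) == length(v) && identical(whole, parts)
}

v <- c(1, 4, 9, 16, 25, 36)
is_vector_map(sqrt, v)    # TRUE: element-wise, independent of the split
is_vector_map(cumsum, v)  # FALSE: same length, but depends on the split
is_vector_map(sum, v)     # FALSE: output length differs from input length
```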
Arguments
...
any further arguments passed to FUN after the vector
mc.set.seed
See mcparallel.
mc.silent
if set to TRUE then all output on ‘stdout’ will
be suppressed for all parallel processes forked (‘stderr’ is not
affected).
mc.cores
The number of cores to use, i.e. at most how many
child processes will be run simultaneously. Must be at least one,
and at least two for parallel operation. The option is initialized
from environment variable MC_CORES if set.
mc.cleanup
See the description of this argument in
mclapply.
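As a rough illustration of how these arguments combine, the sketch below passes an extra argument to FUN through ... and picks mc.cores in a way that also runs on Windows. The fallback to one core on Windows and the choice of two cores elsewhere are assumptions made for this example.

```r
library(parallel)

# Assumption: one core on Windows (no forking), two cores elsewhere.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# 'power' is forwarded to FUN after the vector, via '...'
res <- pvec(1:10, function(x, power) x^power, power = 2,
            mc.cores = cores, mc.silent = TRUE)

stopifnot(all(res == (1:10)^2))
```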
Details
pvec parallelizes FUN(x, ...) where FUN is a
function that returns a vector of the same length as
x. FUN must also be pure (i.e., without side-effects)
since side-effects are not collected from the parallel processes. The
vector is split into nearly identically sized subvectors on which
FUN is run. Although it is in principle possible to use
functions that are not necessarily maps, the interpretation would be
case-specific as the splitting is in theory arbitrary (a warning is
given in such cases).
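The splitting step can be pictured with a serial sketch. The helper split_even below is hypothetical and is not pvec's own code. It cuts a non-empty vector into contiguous, nearly equal pieces, then shows why a map recombines correctly while a non-map does not.

```r
# Hypothetical helper (illustration only): split a non-empty vector into
# at most 'cores' contiguous chunks whose sizes differ by at most one.
split_even <- function(v, cores) {
  n <- length(v)
  k <- min(n, cores)
  sizes <- rep(n %/% k, k)
  extra <- n %% k
  if (extra > 0) sizes[seq_len(extra)] <- sizes[seq_len(extra)] + 1L
  unname(split(v, rep(seq_len(k), times = sizes)))
}

chunks <- split_even(1:10, 3)
lengths(chunks)                             # 4 3 3

# A map: applying FUN per chunk and concatenating matches FUN on the whole
mapped <- do.call(c, lapply(chunks, sqrt))
stopifnot(isTRUE(all.equal(mapped, sqrt(1:10))))

# Not a map: one value per chunk, so the result length depends on the split
folded <- do.call(c, lapply(chunks, sum))
length(folded)                              # 3, not 10
```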
The major difference between pvec and mclapply is
that mclapply will run FUN on each element separately
whereas pvec assumes that c(FUN(x[1]), FUN(x[2])) is
equivalent to FUN(x[1:2]) and thus will split into as many
calls to FUN as there are cores (or elements, if fewer), each
handling a subset vector. This makes it more efficient than
mclapply but requires the above assumption on FUN.
If mc.cores == 1 this evaluates FUN(v, ...) in the
current process.
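The difference in call granularity can be made visible with a function that reports the size of the input it received. The function chunk_size below is illustrative only: it preserves length but deliberately depends on the partition, so it is not a valid map and is used here purely as a probe. The core counts are assumptions for this sketch.

```r
library(parallel)

# Assumption: one core on Windows (no forking), two cores elsewhere.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Probe: returns, for every element, the length of the vector FUN was given
chunk_size <- function(x) rep(length(x), length(x))

v <- 1:10
by_chunk   <- pvec(v, chunk_size, mc.cores = cores)
by_element <- unlist(mclapply(v, chunk_size, mc.cores = cores))

# With two cores, pvec calls FUN on two subvectors of five elements each;
# mclapply calls FUN once per element, so every entry is 1.
print(by_chunk)
print(by_element)
```

With mc.cores = 1 the pvec result instead reports the full length of v for every element, because FUN(v, ...) is evaluated once in the current process.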
Value
The result of the computation – in a successful case it should be of
the same length as v. If an error occurred or the function was
not a map the result may be shorter or longer, and a warning is given.
Note
Due to the nature of the parallelization, error handling does not
follow the usual rules since errors will be returned as strings and
results from killed child processes will show up simply as
non-existent data. Therefore it is the responsibility of the user to
check the length of the result to make sure it is of the correct size.
pvec raises a warning if the length differs, since it cannot
know whether such an outcome is intentional or not.
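One way to act on this advice is a thin wrapper that turns a length mismatch into an error. The wrapper checked_pvec below is a hypothetical name, not part of the parallel package. It suppresses all warnings from the call in favour of its own check, which may hide unrelated warnings, and it does not detect an error string that happens to leave the length unchanged.

```r
library(parallel)

# Assumption: one core on Windows (no forking), two cores elsewhere.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Hypothetical wrapper: stop if the result is not the same length as v
checked_pvec <- function(v, FUN, ..., mc.cores = cores) {
  res <- suppressWarnings(pvec(v, FUN, ..., mc.cores = mc.cores))
  if (length(res) != length(v))
    stop("pvec returned ", length(res), " values for ",
         length(v), " inputs")
  res
}

ok  <- checked_pvec(1:100, sqrt)          # a map: passes the check
bad <- tryCatch(checked_pvec(1:100, sum), # not a map: wrong length
                error = function(e) conditionMessage(e))
print(bad)
```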
See mcfork for the inadvisability of using this with
GUI front-ends.
Examples
library(parallel)
x <- pvec(1:1000, sqrt)
stopifnot(all(x == sqrt(1:1000)))
# One use is to convert date strings to unix time in large datasets
# as that is a relatively slow operation.
# So let's get some random dates first
# (A small test only with 2 cores: set options("mc.cores")
# and increase N for a larger-scale test.)
N <- 1e5
dates <- sprintf('%04d-%02d-%02d', as.integer(2000+rnorm(N)),
                 as.integer(runif(N, 1, 12)), as.integer(runif(N, 1, 28)))
system.time(a <- as.POSIXct(dates))
# But specifying the format is faster
system.time(a <- as.POSIXct(dates, format = "%Y-%m-%d"))
# pvec ought to be faster, but system overhead can be high
system.time(b <- pvec(dates, as.POSIXct, format = "%Y-%m-%d"))
stopifnot(all(a == b))
# using mclapply for this would be much slower because each value
# requires a separate call to as.POSIXct(),
# just as lapply(dates, as.POSIXct) does
system.time(c <- unlist(mclapply(dates, as.POSIXct, format = "%Y-%m-%d")))
stopifnot(all(a == c))
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
> library(parallel)
> png(filename="/home/ddbj/snapshot/RGM3/R_rel/result/parallel/pvec.Rd_%03d_medium.png", width=480, height=480)
> ### Name: pvec
> ### Title: Parallelize a Vector Map Function using Forking
> ### Aliases: pvec
> ### Keywords: interface
>
> ### ** Examples
>
> x <- pvec(1:1000, sqrt)
> stopifnot(all(x == sqrt(1:1000)))
>
> ## No test:
> # One use is to convert date strings to unix time in large datasets
> # as that is a relatively slow operation.
> # So let's get some random dates first
> # (A small test only with 2 cores: set options("mc.cores")
> # and increase N for a larger-scale test.)
> N <- 1e5
> dates <- sprintf('%04d-%02d-%02d', as.integer(2000+rnorm(N)),
+ as.integer(runif(N, 1, 12)), as.integer(runif(N, 1, 28)))
>
> system.time(a <- as.POSIXct(dates))
   user  system elapsed
  0.472   0.224   0.700
>
> # But specifying the format is faster
> system.time(a <- as.POSIXct(dates, format = "%Y-%m-%d"))
   user  system elapsed
  0.124   0.064   0.187
>
> # pvec ought to be faster, but system overhead can be high
> system.time(b <- pvec(dates, as.POSIXct, format = "%Y-%m-%d"))
   user  system elapsed
  0.092   0.064   0.202
> stopifnot(all(a == b))
>
> # using mclapply for this would be much slower because each value
> # requires a separate call to as.POSIXct(),
> # just as lapply(dates, as.POSIXct) does
> system.time(c <- unlist(mclapply(dates, as.POSIXct, format = "%Y-%m-%d")))
   user  system elapsed
  3.012   0.244   3.599
> stopifnot(all(a == c))
> ## End(No test)
>
> dev.off()
null device
1
>