Computations are distributed in parallel by file. Data subsets are
extracted and manipulated (MAP) and optionally combined (REDUCE)
within a single file.
A GRangesList implies a grouping of the ranges; MAP
is applied to each element of the GRangesList vs each range
when ranges is a GRanges.
When ranges is a GenomicFiles object, the files
argument is omitted; both ranges and files are extracted
from the object.
files
A character vector or List of filenames. A List
implies a grouping of the files; MAP is applied to each
element of the List vs each file individually.
MAP
A function executed on each worker. The signature must contain a
minimum of two arguments representing the ranges and files. There is no
restriction on argument names and additional arguments can be provided.
MAP = function(range, file, ...)
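As a rough illustration of this signature, the sketch below defines a MAP-style function with one additional argument and invokes it once per file / range combination. It is written in base R so that it runs without BAM files; the names toy_map and min_width, the data frame standing in for ranges, and the file names are all hypothetical.

```r
## Hypothetical stand-ins: a data frame of ranges and a character vector
## of file names. Nothing is read from disk in this sketch.
ranges <- data.frame(start = c(1, 101, 201), end = c(50, 180, 205))
files  <- c("a.bam", "b.bam")

## The range comes first and the file second; 'min_width' is an
## additional argument with a default value.
toy_map <- function(range, file, ..., min_width = 10) {
    width <- range$end - range$start + 1
    data.frame(file = file, width = width, keep = width >= min_width)
}

## Call the function once per file / range combination.
mapped <- lapply(files, function(f)
    lapply(seq_len(nrow(ranges)), function(i)
        toy_map(ranges[i, ], f, min_width = 20)))

length(mapped)          # one element per file
lengths(mapped)         # one result per range within each file
mapped[[1]][[3]]$keep   # the third range has width 5, below min_width
```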
REDUCE
An optional function that combines output from the MAP step. The
signature must contain at least one argument representing the list
output from MAP. There is no restriction on argument names and
additional arguments can be provided.
REDUCE = function(mapped, ...)
Reduction combines data from a single worker and is always
performed as part of the distributed step. When iterate=TRUE,
REDUCE is applied after each MAP step;
depending on the nature of REDUCE, iterative reduction
can substantially decrease the data stored in memory. When
iterate=FALSE, REDUCE is applied once to the list of MAP
output from all files / ranges.
When REDUCE is missing, output is a list from MAP.
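The two reduction modes can be compared with a small base R analogy. This is a sketch, not the package's implementation: toy_map, toy_reduce and the integer "ranges" are hypothetical, and the loop is only one assumed way of folding results in as they arrive.

```r
## Hypothetical MAP: pretend each range yields a vector of per-record values.
toy_map <- function(range, file, ...) seq_len(range)

## REDUCE receives a list of MAP outputs (or partial results) and
## collapses it to a single number.
toy_reduce <- function(mapped, ...) sum(unlist(mapped))

ranges <- c(3L, 4L, 5L)   # stand-ins for three ranges
file   <- "a.bam"         # never opened in this sketch

## iterate=FALSE analogue: keep every MAP result, then reduce once.
all_mapped <- lapply(ranges, toy_map, file = file)
one_shot   <- toy_reduce(all_mapped)

## iterate=TRUE analogue: fold each MAP result into a running value, so
## only the accumulator and one MAP result are held at a time.
acc <- toy_map(ranges[1], file)
for (r in ranges[-1])
    acc <- toy_reduce(list(acc, toy_map(r, file)))

c(one_shot = one_shot, iterative = acc)
```

Both routes agree here because summation gives the same answer however the pieces are grouped. A REDUCE used iteratively also has to accept its own earlier output as one of the list elements.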
iterate
A logical indicating if the REDUCE function
should be applied iteratively to the output of
MAP. When REDUCE is missing iterate
is set to FALSE. This argument applies to reduceByFile only
(reduceFiles calls MAP a single time on each worker).
Collapsing results iteratively is useful when the number of
records to be processed is large (possibly complete files) but
the end result is a much reduced representation of all records.
Iteratively applying REDUCE reduces the amount of
data on each worker at any one time and can substantially
reduce the memory footprint.
summarize
A logical indicating if results should be returned as a
SummarizedExperiment object instead of a list;
data are returned in the assays slot named 'data'.
This argument applies to reduceByFile only.
When REDUCE is provided summarize is ignored
(i.e., set to FALSE). A SummarizedExperiment requires the number
of rows in rowRanges and assays to match. Because REDUCE
collapses the data across ranges, the dimension of the result no longer
matches that of the original ranges.
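The dimension mismatch can be seen with a plain matrix. The sketch below uses base R only; the counts are made-up values and the ranges-by-files layout is an assumed stand-in for the assay.

```r
n_ranges <- 2
n_files  <- 3

## Without REDUCE there is one MAP result per range / file combination,
## which fills a ranges-by-files grid.
mapped <- matrix(seq_len(n_ranges * n_files), nrow = n_ranges,
                 dimnames = list(paste0("range", 1:2), paste0("file", 1:3)))
nrow(mapped) == n_ranges   # rows line up with the ranges

## A REDUCE that collapses across ranges leaves one value per file.
reduced <- colSums(mapped)
length(reduced)            # one value per file
is.null(nrow(reduced))     # no range dimension is left to match rowRanges
```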
init
An optional initial value for REDUCE when
iterate=TRUE. init must be an object of the same
type as the elements returned from MAP. REDUCE
logically adds init to the start (when proceeding left
to right) or end of results obtained with MAP.
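The description of init resembles the behavior of base R's Reduce(). Whether reduceByFile delegates to that function is not stated here, so the sketch below only illustrates the base function, with hypothetical stand-ins for MAP output.

```r
mapped  <- list(1:2, 3:4, 5:6)        # stand-ins for MAP output
combine <- function(a, b) c(a, b)

## Left to right: init is placed before the first element.
left  <- Reduce(combine, mapped, 0L)

## Right to left: init is placed after the last element.
right <- Reduce(combine, mapped, 0L, right = TRUE)

left
right
```

Here init is an integer, the same type as the elements of mapped.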
...
Arguments passed to other methods.
Details
reduceByFile extracts, manipulates and combines multiple ranges
within a single file. Each file is sent to a worker where MAP is
invoked on each file / range combination. This approach allows multiple
ranges extracted from a single file to be kept separate or combined with
REDUCE.
In contrast, reduceFiles does not iterate through the individual
ranges but instead treats them as a group. MAP is invoked
once for each file using all ranges as the range argument.
In general, REDUCE does not play a significant role in
reduceFiles because MAP is only called once on each worker.
Both MAP and REDUCE are applied in the distributed
step (“on the worker”). There is no built-in ability to combine
results across workers in the distributed step.
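The contrast between the two functions can be sketched in base R. This is an analogy under stated assumptions: the "files" are numeric vectors of record positions, the "ranges" are start / end pairs, and count_in is a hypothetical MAP. No parallel back end is involved.

```r
## Hypothetical stand-ins for two files and two ranges.
files  <- list(a = c(5, 12, 18, 40), b = c(3, 15, 16, 22, 35))
ranges <- list(r1 = c(1, 20), r2 = c(21, 50))

## MAP counts the records that fall inside the range(s) it is given.
count_in <- function(range, file, ...) {
    range <- if (is.list(range)) range else list(range)
    hits <- vapply(range, function(r) sum(file >= r[1] & file <= r[2]),
                   integer(1))
    sum(hits)
}

## reduceByFile analogue: MAP is invoked per file / range combination.
by_file <- lapply(files, function(f) lapply(ranges, count_in, file = f))

## reduceFiles analogue: MAP is invoked once per file with all ranges.
grouped <- lapply(files, function(f) count_in(ranges, f))

## Combining across files happens afterwards, outside the distributed step.
total <- sum(unlist(grouped))
```

by_file keeps a separate count for each range within each file, grouped holds one count per file, and total is computed by the caller after all per-file results have been returned.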
Value
reduceByFile:
When summarize=FALSE the return value is a list or
the value from the final invocation of REDUCE. When
summarize=TRUE output is a SummarizedExperiment.
When ranges is a GenomicFiles object data from
rowRanges, colData and metadata are transferred
to the SummarizedExperiment.
reduceFiles:
A list or the value returned by the final invocation of
REDUCE.
Author(s)
Martin Morgan and Valerie Obenchain
See Also
reduceRanges
reduceByRange
GenomicFiles-class
Examples
if (requireNamespace("RNAseqData.HNRNPC.bam.chr14", quietly=TRUE)) {
## -----------------------------------------------------------------------
## Count junction reads in BAM files
## -----------------------------------------------------------------------
fls <- ## 8 bam files
RNAseqData.HNRNPC.bam.chr14::RNAseqData.HNRNPC.bam.chr14_BAMFILES
## Ranges of interest.
gr <- GRanges("chr14", IRanges(c(19100000, 106000000), width=1e7))
## MAP outputs a table of junction counts per range.
MAP <- function(range, file, ...) {
    ## for readGAlignments(), Rsamtools::ScanBamParam()
    requireNamespace("GenomicAlignments", quietly=TRUE)
    ## Restrict the read to alignments overlapping 'range'.
    param <- Rsamtools::ScanBamParam(which=range)
    gal <- GenomicAlignments::readGAlignments(file, param=param)
    ## Tabulate the number of junctions per alignment.
    table(GenomicAlignments::njunc(gal))
}
## -----------------------------------------------------------------------
## reduceByFile:
## With no REDUCE, counts are computed for each range / file combination.
counts1 <- reduceByFile(gr, fls, MAP)
length(counts1) ## 8 files
elementNROWS(counts1) ## 2 ranges each
## Tables of counts for each range:
counts1[[1]]
## With a REDUCE, results are combined on the fly. This reducer sums the
## number of records in each range with exactly 1 junction.
REDUCE <- function(mapped, ...)
sum(sapply(mapped, "[", "1"))
reduceByFile(gr, fls, MAP, REDUCE)
## -----------------------------------------------------------------------
## reduceFiles:
## All ranges are treated as a single group:
counts2 <- reduceFiles(gr, fls, MAP)
## Counts are for all ranges grouped:
counts2[[1]]
## Contrast the above with that from reduceByFile() where counts
## are for each range separately:
counts1[[1]]
## -----------------------------------------------------------------------
## Methods for the GenomicFiles class:
## Both reduceByFile() and reduceFiles() can operate on a GenomicFiles
## object.
colData <- DataFrame(method=rep("RNASeq", length(fls)),
format=rep("bam", length(fls)))
gf <- GenomicFiles(files=fls, rowRanges=gr, colData=colData)
gf
## Subset on ranges or files for different experimental runs.
dim(gf)
gf_sub <- gf[2, 3:4]
dim(gf_sub)
## When summarize = TRUE and no REDUCE is given, the output is a
## SummarizedExperiment object.
se <- reduceByFile(gf, MAP=MAP, summarize=TRUE)
se
## Data from the rowRanges, colData and metadata slots in the
## GenomicFiles are transferred to the SummarizedExperiment.
colData(se)
## Results are in the assays slot named 'data'.
assays(se)
}
Results
> library(GenomicFiles)
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: GenomicRanges
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: BiocParallel
Loading required package: Rsamtools
Loading required package: Biostrings
Loading required package: XVector
Loading required package: rtracklayer
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/GenomicFiles/reduceByFile-methods.Rd_%03d_medium.png", width=480, height=480)
> ### Name: reduceByFile
> ### Title: Parallel computations by files
> ### Aliases: reduceByFile reduceByFile,GRanges,ANY-method
> ### reduceByFile,GRangesList,ANY-method
> ### reduceByFile,GenomicFiles,missing-method reduceFiles
> ### Keywords: methods
>
> ### ** Examples
>
>
> if (requireNamespace("RNAseqData.HNRNPC.bam.chr14", quietly=TRUE)) {
+ ## -----------------------------------------------------------------------
+ ## Count junction reads in BAM files
+ ## -----------------------------------------------------------------------
+ fls <- ## 8 bam files
+ RNAseqData.HNRNPC.bam.chr14::RNAseqData.HNRNPC.bam.chr14_BAMFILES
+
+ ## Ranges of interest.
+ gr <- GRanges("chr14", IRanges(c(19100000, 106000000), width=1e7))
+
+ ## MAP outputs a table of junction counts per range.
+ MAP <- function(range, file, ...) {
+ ## for readGAlignments(), Rsamtools::ScanBamParam()
+ requireNamespace("GenomicAlignments", quietly=TRUE)
+ param = Rsamtools::ScanBamParam(which=range)
+ gal = GenomicAlignments::readGAlignments(file, param=param)
+ table(GenomicAlignments::njunc(gal))
+ }
+
+ ## -----------------------------------------------------------------------
+ ## reduceByFile:
+
+ ## With no REDUCE, counts are computed for each range / file combination.
+ counts1 <- reduceByFile(gr, fls, MAP)
+ length(counts1) ## 8 files
+ elementNROWS(counts1) ## 2 ranges each
+
+ ## Tables of counts for each range:
+ counts1[[1]]
+
+ ## With a REDUCE, results are combined on the fly. This reducer sums the
+ ## number of records in each range with exactly 1 junction.
+ REDUCE <- function(mapped, ...)
+ sum(sapply(mapped, "[", "1"))
+
+ reduceByFile(gr, fls, MAP, REDUCE)
+
+ ## -----------------------------------------------------------------------
+ ## reduceFiles:
+
+ ## All ranges are treated as a single group:
+ counts2 <- reduceFiles(gr, fls, MAP)
+
+ ## Counts are for all ranges grouped:
+ counts2[[1]]
+
+ ## Contrast the above with that from reduceByFile() where counts
+ ## are for each range separately:
+ counts1[[1]]
+
+ ## -----------------------------------------------------------------------
+ ## Methods for the GenomicFiles class:
+
+ ## Both reduceByFile() and reduceFiles() can operate on a GenomicFiles
+ ## object.
+ colData <- DataFrame(method=rep("RNASeq", length(fls)),
+ format=rep("bam", length(fls)))
+ gf <- GenomicFiles(files=fls, rowRanges=gr, colData=colData)
+ gf
+
+ ## Subset on ranges or files for different experimental runs.
+ dim(gf)
+ gf_sub <- gf[2, 3:4]
+ dim(gf_sub)
+
+ ## When summarize = TRUE and no REDUCE is given, the output is a
+ ## SummarizedExperiment object.
+ se <- reduceByFile(gf, MAP=MAP, summarize=TRUE)
+ se
+
+ ## Data from the rowRanges, colData and metadata slots in the
+ ## GenomicFiles are transferred to the SummarizedExperiment.
+ colData(se)
+
+ ## Results are in the assays slot named 'data'.
+ assays(se)
+ }
List of length 1
names(1): data
>
>
>
>
>
> dev.off()
null device
1
>