R: Iterate through a BAM (or other) file, reducing output to a...
reduceByYield
R Documentation
Iterate through a BAM (or other) file, reducing output to a single result.
Description
Rsamtools files can be created with a ‘yieldSize’ argument that
influences the number of records (chunk size) input at one time (see,
e.g,. BamFile). reduceByYield iterates
through the file, processing each chunk and reducing it with previously
input chunks. This is a memory efficient way to process large data files,
especially when the final result fits in memory.
A BamFile instance (or other class
for which isOpen, open, close methods are defined,
and which support extraction of sequential chunks).
YIELD
A function name or user-supplied function that operates
on X to produce a VALUE that is passed to DONE
and MAP. Generally YIELD will be a data extractor such as
readGAlignments, scanBam, yield, etc. and
VALUE is a chunk of data.
YIELD(X)
MAP
A function of one or more arguments that operates on
the chunk of data from YIELD.
MAP(VALUE, ...)
REDUCE
A function of one (iterate=FALSE or two
(iterate=TRUE) arguments, returning the reduction (e.g., sum,
mean, concatenate) of the arguments.
REDUCE(mapped, ...) ## iterate=FALSE
REDUCE(x, y, ...) ## iterate=TRUE
DONE
A function of one argument, the VALUE output of
the most recent call to YIELD(X, ...). If missing, DONE
is function(VALUE) length(VALUE) == 0.
...
Additional arguments, passed to MAP.
iterate
logical(1) determines whether the call to
REDUCE is iterative (iterate=TRUE) or cumulative
(iterate=FALSE).
parallel
logical(1) determines if the MAP step
is run in parallel. bpiterate is used under the hood
and is currently supported for Unix/Mac only. For Windows machines,
parallel is ignored.
init
(Optional) Initial value used for REDUCE when
iterate=TRUE.
sampleSize
Initial value used for REDUCEsampler.
verbose
logical(1) determines if total records sampled are
reported at each iteration. Applicable to REDUCEsampler only.
Details
reduceByYield:
When iterate=TRUE, REDUCE requires 2 arguments and is
invoked with init and the output from the first call to
MAP. If init is missing, it operates on the first two
outputs from MAP.
When iterate=FALSE, REDUCE requires 1 argument and is
is invoked with a list containing a list containing all results from
MAP.
REDUCEsampler:
REDUCEsampler creates a function that can be used as the
REDUCE argument to reduceByYield.
Invoking REDUCEsampler with sampleSize returns a function
(call it myfun) that takes two arguments, x and y.
As with any iterative REDUCE function, x represents records
that have been yield'ed and y is the new chunk of records.
myfun samples records from consecutive chunks returned by the
YIELD function. (Re)sampling takes into consideration
the total number of records yield'ed, the sampleSize, and the
size of the new chunk.
Value
The value returned by the final invocation of REDUCE, or init
if provided and no data were yield'ed, or list() if init is
missing and no data were yield'ed.
Author(s)
Martin Morgan and Valerie Obenchain
See Also
BamFile and
TabixFile for examples of 'X'.
reduceByFile and reduceByRange
Examples
if (all(require(RNAseqData.HNRNPC.bam.chr14) &&
require(GenomicAlignments))) {
## -----------------------------------------------------------------------
## Nucleotide frequency of mapped reads
## -----------------------------------------------------------------------
## In this example nucleotide frequency of mapped reads is computed
## for a single file. The MAP step is run in parallel and REDUCE
## is iterative.
## Create a BamFile and set a 'yieldSize'.
fl <- system.file(package="Rsamtools", "extdata", "ex1.bam")
bf <- BamFile(fl, yieldSize=500)
## Define 'YIELD', 'MAP' and 'REDUCE' functions.
YIELD <- function(X, ...) {
flag = scanBamFlag(isUnmappedQuery=FALSE)
param = ScanBamParam(flag=flag, what="seq")
scanBam(X, param=param, ...)[[1]][['seq']]
}
MAP <- function(value, ...) {
requireNamespace("Biostrings", quietly=TRUE) ## for alphabetFrequency()
Biostrings::alphabetFrequency(value, collapse=TRUE)
}
REDUCE <- `+` # add successive alphabetFrequency matrices
## 'parallel=TRUE' runs the MAP step in parallel and is currently
## implemented for Unix/Mac only.
register(MulticoreParam(3))
reduceByYield(bf, YIELD, MAP, REDUCE, parallel=TRUE)
## -----------------------------------------------------------------------
## Coverage
## -----------------------------------------------------------------------
## If sufficient resources are available coverage can be computed
## across several large BAM files by combining reduceByYield() with
## bplapply().
## Create a BamFileList with a few sample files and a Snow cluster
## with the same number of workers as files.
bfl <- BamFileList(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1:3])
bpparam <- SnowParam(length(bfl))
## 'FUN' is run on each worker. Because these are Snow workers each
## variable used in 'FUN' must be explicitly passed. (This is not the case
## when using Multicore.)
FUN <- function(bf, YIELD, MAP, REDUCE, parallel, ...) {
requireNamespace("GenomicFiles", quietly=TRUE) ## for reduceByYield()
GenomicFiles::reduceByYield(bf, YIELD, MAP, REDUCE, parallel=parallel)
}
## Passing parallel=FALSE to reduceByYield() runs the MAP step in serial on
## each worker. In this example, parallel dispatch is at the file-level
## only (bplapply()).
YIELD <- `readGAlignments`
MAP <- function(value, ...) {
requireNamespace("GenomicAlignments", quietly=TRUE)
GenomicAlignments::coverage(value)[["chr14"]]
}
bplapply(bfl, FUN, YIELD=YIELD, MAP=MAP, REDUCE=`+`,
parallel=FALSE, BPPARAM = bpparam)
## -----------------------------------------------------------------------
## Sample records from a Bam file
## -----------------------------------------------------------------------
fl <- system.file(package="Rsamtools", "extdata", "ex1.bam")
bf <- BamFile(fl, yieldSize=1000)
yield <- function(x)
readGAlignments(x, param=ScanBamParam(what=c( "qwidth", "mapq" )))
map <- identity
## Samples records from successive chunks of aligned reads.
reduceByYield(bf, yield, map, REDUCEsampler(1000, TRUE))
}
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(GenomicFiles)
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: GenomicRanges
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: BiocParallel
Loading required package: Rsamtools
Loading required package: Biostrings
Loading required package: XVector
Loading required package: rtracklayer
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/GenomicFiles/reduceByYield.Rd_%03d_medium.png", width=480, height=480)
> ### Name: reduceByYield
> ### Title: Iterate through a BAM (or other) file, reducing output to a
> ### single result.
> ### Aliases: reduceByYield REDUCEsampler
> ### Keywords: manip
>
> ### ** Examples
>
>
> if (all(require(RNAseqData.HNRNPC.bam.chr14) &&
+ require(GenomicAlignments))) {
+
+ ## -----------------------------------------------------------------------
+ ## Nucleotide frequency of mapped reads
+ ## -----------------------------------------------------------------------
+
+ ## In this example nucleotide frequency of mapped reads is computed
+ ## for a single file. The MAP step is run in parallel and REDUCE
+ ## is iterative.
+
+ ## Create a BamFile and set a 'yieldSize'.
+ fl <- system.file(package="Rsamtools", "extdata", "ex1.bam")
+ bf <- BamFile(fl, yieldSize=500)
+
+ ## Define 'YIELD', 'MAP' and 'REDUCE' functions.
+ YIELD <- function(X, ...) {
+ flag = scanBamFlag(isUnmappedQuery=FALSE)
+ param = ScanBamParam(flag=flag, what="seq")
+ scanBam(X, param=param, ...)[[1]][['seq']]
+ }
+ MAP <- function(value, ...) {
+ requireNamespace("Biostrings", quietly=TRUE) ## for alphabetFrequency()
+ Biostrings::alphabetFrequency(value, collapse=TRUE)
+ }
+ REDUCE <- `+` # add successive alphabetFrequency matrices
+
+ ## 'parallel=TRUE' runs the MAP step in parallel and is currently
+ ## implemented for Unix/Mac only.
+ register(MulticoreParam(3))
+ reduceByYield(bf, YIELD, MAP, REDUCE, parallel=TRUE)
+
+ ## -----------------------------------------------------------------------
+ ## Coverage
+ ## -----------------------------------------------------------------------
+
+ ## If sufficient resources are available coverage can be computed
+ ## across several large BAM files by combining reduceByYield() with
+ ## bplapply().
+
+ ## Create a BamFileList with a few sample files and a Snow cluster
+ ## with the same number of workers as files.
+ bfl <- BamFileList(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1:3])
+ bpparam <- SnowParam(length(bfl))
+
+ ## 'FUN' is run on each worker. Because these are Snow workers each
+ ## variable used in 'FUN' must be explicitly passed. (This is not the case
+ ## when using Multicore.)
+ FUN <- function(bf, YIELD, MAP, REDUCE, parallel, ...) {
+ requireNamespace("GenomicFiles", quietly=TRUE) ## for reduceByYield()
+ GenomicFiles::reduceByYield(bf, YIELD, MAP, REDUCE, parallel=parallel)
+ }
+
+ ## Passing parallel=FALSE to reduceByYield() runs the MAP step in serial on
+ ## each worker. In this example, parallel dispatch is at the file-level
+ ## only (bplapply()).
+ YIELD <- `readGAlignments`
+ MAP <- function(value, ...) {
+ requireNamespace("GenomicAlignments", quietly=TRUE)
+ GenomicAlignments::coverage(value)[["chr14"]]
+ }
+ bplapply(bfl, FUN, YIELD=YIELD, MAP=MAP, REDUCE=`+`,
+ parallel=FALSE, BPPARAM = bpparam)
+
+
+ ## -----------------------------------------------------------------------
+ ## Sample records from a Bam file
+ ## -----------------------------------------------------------------------
+
+ fl <- system.file(package="Rsamtools", "extdata", "ex1.bam")
+ bf <- BamFile(fl, yieldSize=1000)
+
+ yield <- function(x)
+ readGAlignments(x, param=ScanBamParam(what=c( "qwidth", "mapq" )))
+ map <- identity
+
+ ## Samples records from successive chunks of aligned reads.
+ reduceByYield(bf, yield, map, REDUCEsampler(1000, TRUE))
+ }
Loading required package: RNAseqData.HNRNPC.bam.chr14
Loading required package: GenomicAlignments
starting worker localhost:11033
starting worker localhost:11033
starting worker localhost:11033
REDUCEsampler total=2000
REDUCEsampler total=3000
REDUCEsampler total=3271
GAlignments object with 1000 alignments and 2 metadata columns:
seqnames strand cigar qwidth start end width
<Rle> <Rle> <character> <integer> <integer> <integer> <integer>
[1] seq1 - 35M 35 1515 1549 35
[2] seq2 + 35M 35 95 129 35
[3] seq2 - 35M 35 915 949 35
[4] seq2 + 35M 35 1031 1065 35
[5] seq1 - 35M 35 1189 1223 35
... ... ... ... ... ... ... ...
[996] seq1 + 36M 36 182 217 36
[997] seq1 + 35M 35 437 471 35
[998] seq1 - 35M 35 544 578 35
[999] seq1 + 35M 35 1190 1224 35
[1000] seq2 + 36M 36 353 388 36
njunc | qwidth mapq
<integer> | <integer> <integer>
[1] 0 | 35 99
[2] 0 | 35 99
[3] 0 | 35 99
[4] 0 | 35 99
[5] 0 | 35 99
... ... . ... ...
[996] 0 | 36 99
[997] 0 | 35 99
[998] 0 | 35 60
[999] 0 | 35 99
[1000] 0 | 36 99
-------
seqinfo: 2 sequences from an unspecified genome
>
>
>
>
>
> dev.off()
null device
1
>