R Graphical Manual

Last data update: 2014.03.03

R: Parallel computations by files

reduceByFile

R Documentation

Parallel computations by files

Description

Computations are distributed in parallel by file. Data subsets are extracted and manipulated (MAP) and optionally combined (REDUCE) within a single file.

Usage

## S4 method for signature 'GRanges,ANY'
reduceByFile(ranges, files, MAP, 
    REDUCE, ..., summarize=FALSE, iterate=TRUE, init)
## S4 method for signature 'GRangesList,ANY'
reduceByFile(ranges, files, MAP, 
    REDUCE, ..., summarize=FALSE, iterate=TRUE, init)
## S4 method for signature 'GenomicFiles,missing'
reduceByFile(ranges, files, MAP, 
    REDUCE, ..., summarize=FALSE, iterate=TRUE, init)

reduceFiles(ranges, files, MAP, REDUCE, ..., init)

Arguments

`ranges`	A `GRanges`, `GRangesList` or `GenomicFiles` object. A `GRangesList` implies a grouping of the ranges: `MAP` is applied to each element of the `GRangesList`, whereas for a `GRanges` it is applied to each range. When `ranges` is a `GenomicFiles` object the `files` argument is missing; both ranges and files are extracted from the object.
`files`	A `character` vector or `List` of filenames. A `List` implies a grouping of the files: `MAP` is applied to each element of the `List` rather than to each file individually.
`MAP`	A function executed on each worker. The signature must contain a minimum of two arguments representing the ranges and files. There is no restriction on argument names and additional arguments can be provided, e.g., `MAP = function(range, file, ...)`.
`REDUCE`	An optional function that combines output from the `MAP` step. The signature must contain at least one argument representing the list output from `MAP`. There is no restriction on argument names and additional arguments can be provided, e.g., `REDUCE = function(mapped, ...)`. Reduction combines data from a single worker and is always performed as part of the distributed step. When `iterate=TRUE`, `REDUCE` is applied after each `MAP` step; depending on the nature of `REDUCE`, iterative reduction can substantially decrease the data stored in memory. When `iterate=FALSE`, `REDUCE` is applied once to the list of `MAP` output from all files / ranges. When `REDUCE` is missing, the output is the list returned by `MAP`.
`iterate`	A logical indicating if the `REDUCE` function should be applied iteratively to the output of `MAP`. When `REDUCE` is missing, `iterate` is set to FALSE. This argument applies to `reduceByFile` only (`reduceFiles` calls `MAP` a single time on each worker). Collapsing results iteratively is useful when the number of records to be processed is large (possibly complete files) but the end result is a much reduced representation of all records. Iteratively applying `REDUCE` limits the amount of data held on each worker at any one time and can substantially reduce the memory footprint.
`summarize`	A logical indicating if results should be returned as a `SummarizedExperiment` object instead of a list; data are returned in the `assays` slot named 'data'. This argument applies to `reduceByFile` only. When `REDUCE` is provided, `summarize` is ignored (i.e., set to FALSE): a `SummarizedExperiment` requires the number of rows in `rowRanges` and `assays` to match, and because `REDUCE` collapses the data across ranges, the dimension of the result no longer matches that of the original ranges.
`init`	An optional initial value for `REDUCE` when `iterate=TRUE`. `init` must be an object of the same type as the elements returned from `MAP`. `REDUCE` logically adds `init` to the start (when proceeding left to right) or end of results obtained with `MAP`.
`...`	Arguments passed to other methods.
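
The interplay of `REDUCE`, `iterate` and `init` can be pictured with a small base-R sketch. This is an emulation under stated assumptions, not the package's code: `toy_map`, `toy_reduce`, `fold` and the file name are hypothetical, no file is read, and it is assumed that the iterative mode hands `REDUCE` a list holding the running result and the newest `MAP` output.

```r
## Stand-in for MAP: returns a small named count vector per range.
toy_map <- function(range, file, ...) c(n0 = range, n1 = 2 * range)

## Stand-in for REDUCE: takes a list of MAP-like results, sums element-wise.
toy_reduce <- function(mapped, ...) Reduce(`+`, mapped)

ranges <- 1:3        # stand-ins for three ranges
file <- "a.bam"      # hypothetical label; the file is never opened

## iterate = FALSE: collect every MAP result, then reduce once.
all_mapped <- lapply(ranges, toy_map, file = file)
batch <- toy_reduce(all_mapped)

## iterate = TRUE (assumed behaviour): fold each new MAP result into a
## running value, so at most two results are held at any time.
fold <- function(ranges, file, init) {
    result <- if (missing(init)) NULL else init
    for (r in ranges) {
        mapped <- toy_map(r, file)
        if (is.null(result)) {
            result <- mapped
        } else {
            result <- toy_reduce(list(result, mapped))
        }
    }
    result
}

batch
fold(ranges, file)                              # same totals as 'batch'
fold(ranges, file, init = c(n0 = 10, n1 = 0))   # 'init' seeds the fold
```

Both strategies agree here because the stand-in reducer returns an object of the same shape as a single `MAP` result; a reducer that changes the shape of its output may behave differently when applied iteratively.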

Details

reduceByFile extracts, manipulates and combines multiple ranges within a single file. Each file is sent to a worker where MAP is invoked on each file / range combination. This approach allows multiple ranges extracted from a single file to be kept separate or combined with REDUCE.
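
The calling pattern described above can be sketched in base R. This is only an illustration: `lapply` stands in for the parallel dispatch to workers, and `toy_map`, the file names and the range labels are hypothetical.

```r
## Hypothetical inputs: file names are labels only, ranges are plain strings.
files  <- c(f1 = "sample1.bam", f2 = "sample2.bam")
ranges <- c("chr14:1-100", "chr14:101-200", "chr14:201-300")

## Stand-in for MAP: records which range / file pair it was called on.
toy_map <- function(range, file, ...) paste(file, range, sep = " @ ")

## One "worker" per file; inside it, MAP runs once per range.
by_file <- lapply(files, function(file) {
    lapply(ranges, toy_map, file = file)
})

length(by_file)        # one element per file
lengths(by_file)       # one MAP result per range within each file
by_file[["f2"]][[3]]
```

The nested list mirrors the shape reported in the Examples section, where the outer length corresponds to files and the inner length to ranges.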

In contrast, reduceFiles does not iterate through the individual ranges but instead treats them as a group. MAP is invoked once for each file using all ranges as the range argument. In general, REDUCE does not play a significant role in reduceFiles because MAP is only called once on each worker.
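
For contrast, a base-R sketch of the grouped calling pattern might look as follows. Again this is an emulation with hypothetical names (`toy_map`, `calls`, the file labels), not the package's implementation.

```r
files  <- c(f1 = "sample1.bam", f2 = "sample2.bam")   # hypothetical labels
ranges <- c("chr14:1-100", "chr14:101-200", "chr14:201-300")

## Counter recording how often the stand-in MAP is invoked.
calls <- 0L
toy_map <- function(range, file, ...) {
    calls <<- calls + 1L
    ## In this pattern 'range' holds every range at once.
    list(file = file, n_ranges = length(range))
}

## One call per file, with the full set of ranges as the first argument.
grouped <- lapply(files, function(file) toy_map(ranges, file))

calls                       # equals the number of files
grouped[["f1"]]$n_ranges    # all ranges were seen in a single call
```

Because each file produces exactly one `MAP` result, there is nothing left to combine within a worker, which is why a reducer adds little in this mode.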

Both MAP and REDUCE are applied in the distributed step (“on the worker”). There is no built-in ability to combine results across workers in the distributed step.
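
Results from different workers can still be combined after the call returns, using ordinary list operations. The sketch below assumes a hypothetical return value `per_file` (one named count vector per file, with made-up numbers); it shows one possible approach and is not part of the package.

```r
## Hypothetical return value: one named count vector per file, as might come
## back from a call whose REDUCE summed counts on each worker.
per_file <- list(
    sample1 = c(n0 = 120, n1 = 30),
    sample2 = c(n0 = 95,  n1 = 41),
    sample3 = c(n0 = 210, n1 = 12)
)

## Option 1: keep files separate but tabulate them side by side.
by_sample <- do.call(rbind, per_file)

## Option 2: collapse across files with a second, local reduction.
overall <- Reduce(`+`, per_file)

by_sample
overall
```

This second reduction runs in the calling R session, so it only sees data that the workers have already returned.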

Value

reduceByFile: When summarize=FALSE the return value is a list or the value from the final invocation of REDUCE. When summarize=TRUE output is a SummarizedExperiment. When ranges is a GenomicFiles object data from rowRanges, colData and metadata are transferred to the SummarizedExperiment.
reduceFiles: A list or the value returned by the final invocation of REDUCE.
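
The reason `summarize=TRUE` is only meaningful without a reducer can be seen from the dimensions alone. The base-R sketch below uses hypothetical stand-ins (`toy_map`, file and range labels) and does not construct a real `SummarizedExperiment`; it only illustrates the shape argument.

```r
files  <- c("sample1.bam", "sample2.bam")     # hypothetical labels
ranges <- c("rangeA", "rangeB", "rangeC")

## Stand-in for MAP returning one number per range / file pair.
toy_map <- function(range, file, ...) nchar(range) + nchar(file)

## Without REDUCE: one value per range for each file ...
unreduced <- lapply(files, function(f) {
    vapply(ranges, toy_map, integer(1), file = f)
})
## ... which can be arranged as a ranges-by-files matrix.
assay_like <- do.call(cbind, unreduced)
dim(assay_like)      # rows match the ranges, columns match the files

## With a reducer that collapses ranges, each file yields a single value,
## so there is no longer one row per range.
reduced <- lapply(unreduced, sum)
lengths(reduced)
```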

Author(s)

Martin Morgan and Valerie Obenchain

Examples


if (requireNamespace("RNAseqData.HNRNPC.bam.chr14", quietly=TRUE)) {
  ## -----------------------------------------------------------------------
  ## Count junction reads in BAM files
  ## -----------------------------------------------------------------------
  fls <-                                      ## 8 bam files
      RNAseqData.HNRNPC.bam.chr14::RNAseqData.HNRNPC.bam.chr14_BAMFILES
 
  ## Ranges of interest.
  gr <- GRanges("chr14", IRanges(c(19100000, 106000000), width=1e7))
 
  ## MAP outputs a table of junction counts per range.
  MAP <- function(range, file, ...) {
      ## for readGAlignments(), Rsamtools::ScanBamParam()
      requireNamespace("GenomicAlignments", quietly=TRUE)
      param = Rsamtools::ScanBamParam(which=range)
      gal = GenomicAlignments::readGAlignments(file, param=param)
      table(GenomicAlignments::njunc(gal))
  } 

  ## -----------------------------------------------------------------------
  ## reduceByFile:

  ## With no REDUCE, counts are computed for each range / file combination.
  counts1 <- reduceByFile(gr, fls, MAP)
  length(counts1)          ## 8 files
  elementNROWS(counts1)    ## 2 ranges each
 
  ## Tables of counts for each range:
  counts1[[1]]

  ## With a REDUCE, results are combined on the fly. This reducer sums the 
  ## number of records in each range with exactly 1 junction.
  REDUCE <- function(mapped, ...)
      sum(sapply(mapped, "[", "1"))
 
  reduceByFile(gr, fls, MAP, REDUCE)

  ## -----------------------------------------------------------------------
  ## reduceFiles:

  ## All ranges are treated as a single group:
  counts2 <- reduceFiles(gr, fls, MAP)

  ## Counts are for all ranges grouped:
  counts2[[1]]

  ## Contrast the above with that from reduceByFile() where counts 
  ## are for each range separately:
  counts1[[1]]

  ## -----------------------------------------------------------------------
  ## Methods for the GenomicFiles class:
 
  ## Both reduceByFile() and reduceFiles() can operate on a GenomicFiles
  ## object.
  colData <- DataFrame(method=rep("RNASeq", length(fls)),
                       format=rep("bam", length(fls)))
  gf <- GenomicFiles(files=fls, rowRanges=gr, colData=colData)
  gf
  
  ## Subset on ranges or files for different experimental runs.
  dim(gf)
  gf_sub <- gf[2, 3:4]
  dim(gf_sub)
  
  ## When summarize = TRUE and no REDUCE is given, the output is a 
  ## SummarizedExperiment object.
  se <- reduceByFile(gf, MAP=MAP, summarize=TRUE)
  se
  
  ## Data from the rowRanges, colData and metadata slots in the
  ## GenomicFiles are transferred to the SummarizedExperiment.
  colData(se)
  
  ## Results are in the assays slot named 'data'.
  assays(se) 
}

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> library(GenomicFiles)
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: GenomicRanges
Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: BiocParallel
Loading required package: Rsamtools
Loading required package: Biostrings
Loading required package: XVector
Loading required package: rtracklayer
> ### Name: reduceByFile
> ### Title: Parallel computations by files
> ### Aliases: reduceByFile reduceByFile,GRanges,ANY-method
> ###   reduceByFile,GRangesList,ANY-method
> ###   reduceByFile,GenomicFiles,missing-method reduceFiles
> ### Keywords: methods
> 
> ### ** Examples
> 
> 
> if (requireNamespace("RNAseqData.HNRNPC.bam.chr14", quietly=TRUE)) {
+   ## -----------------------------------------------------------------------
+   ## Count junction reads in BAM files
+   ## -----------------------------------------------------------------------
+   fls <-                                      ## 8 bam files
+       RNAseqData.HNRNPC.bam.chr14::RNAseqData.HNRNPC.bam.chr14_BAMFILES
+  
+   ## Ranges of interest.
+   gr <- GRanges("chr14", IRanges(c(19100000, 106000000), width=1e7))
+  
+   ## MAP outputs a table of junction counts per range.
+   MAP <- function(range, file, ...) {
+       ## for readGAlignments(), Rsamtools::ScanBamParam()
+       requireNamespace("GenomicAlignments", quietly=TRUE)
+       param = Rsamtools::ScanBamParam(which=range)
+       gal = GenomicAlignments::readGAlignments(file, param=param)
+       table(GenomicAlignments::njunc(gal))
+   } 
+ 
+   ## -----------------------------------------------------------------------
+   ## reduceByFile:
+ 
+   ## With no REDUCE, counts are computed for each range / file combination.
+   counts1 <- reduceByFile(gr, fls, MAP)
+   length(counts1)          ## 8 files
+   elementNROWS(counts1)    ## 2 ranges each
+  
+   ## Tables of counts for each range:
+   counts1[[1]]
+ 
+   ## With a REDUCE, results are combined on the fly. This reducer sums the 
+   ## number of records in each range with exactly 1 junction.
+   REDUCE <- function(mapped, ...)
+       sum(sapply(mapped, "[", "1"))
+  
+   reduceByFile(gr, fls, MAP, REDUCE)
+ 
+   ## -----------------------------------------------------------------------
+   ## reduceFiles:
+ 
+   ## All ranges are treated as a single group:
+   counts2 <- reduceFiles(gr, fls, MAP)
+ 
+   ## Counts are for all ranges grouped:
+   counts2[[1]]
+ 
+   ## Contrast the above with that from reduceByFile() where counts 
+   ## are for each range separately:
+   counts1[[1]]
+ 
+   ## -----------------------------------------------------------------------
+   ## Methods for the GenomicFiles class:
+  
+   ## Both reduceByFile() and reduceFiles() can operate on a GenomicFiles
+   ## object.
+   colData <- DataFrame(method=rep("RNASeq", length(fls)),
+                        format=rep("bam", length(fls)))
+   gf <- GenomicFiles(files=fls, rowRanges=gr, colData=colData)
+   gf
+   
+   ## Subset on ranges or files for different experimental runs.
+   dim(gf)
+   gf_sub <- gf[2, 3:4]
+   dim(gf_sub)
+   
+   ## When summarize = TRUE and no REDUCE is given, the output is a 
+   ## SummarizedExperiment object.
+   se <- reduceByFile(gf, MAP=MAP, summarize=TRUE)
+   se
+   
+   ## Data from the rowRanges, colData and metadata slots in the
+   ## GenomicFiles are transferred to the SummarizedExperiment.
+   colData(se)
+   
+   ## Results are in the assays slot named 'data'.
+   assays(se) 
+ }
List of length 1
names(1): data