Last data update: 2014.03.03

R: Parallel computations by ranges

reduceByRange

R Documentation

Description

Computations are distributed in parallel by range. Data subsets are extracted and manipulated (MAP) and optionally combined (REDUCE) across all files.

Usage

## S4 method for signature 'GRanges,ANY'
reduceByRange(ranges, files, MAP, 
    REDUCE, ..., summarize=FALSE, iterate=TRUE, init)
## S4 method for signature 'GRangesList,ANY'
reduceByRange(ranges, files, MAP, 
    REDUCE, ..., summarize=FALSE, iterate=TRUE, init)
## S4 method for signature 'GenomicFiles,missing'
reduceByRange(ranges, files, MAP, 
    REDUCE, ..., summarize=FALSE, iterate=TRUE, init)

reduceRanges(ranges, files, MAP, REDUCE, ..., init)

Arguments

`ranges`	A `GRanges`, `GRangesList` or `GenomicFiles` object. A `GRangesList` implies a grouping of the ranges; `MAP` is applied to each element of the `GRangesList` rather than to each individual range, as it is when `ranges` is a `GRanges`. When `ranges` is a `GenomicFiles`, the `files` argument is missing; both ranges and files are extracted from the object.
`files`	A `character` vector or `List` of filenames. A `List` implies a grouping of the files; `MAP` is applied to each element of the `List` rather than to each file individually.
`MAP`	A function executed on each worker. The signature must contain a minimum of two arguments representing the ranges and files. There is no restriction on argument names and additional arguments can be provided, for example `MAP = function(range, file, ...)`.
`REDUCE`	An optional function that combines output from the `MAP` step. The signature must contain at least one argument representing the list output from `MAP`. There is no restriction on argument names and additional arguments can be provided, for example `REDUCE = function(mapped, ...)`. Reduction combines data from a single worker and is always performed as part of the distributed step. When `iterate=TRUE`, `REDUCE` is applied after each `MAP` step; depending on the nature of `REDUCE`, iterative reduction can substantially decrease the data stored in memory. When `iterate=FALSE`, `REDUCE` is applied once to the list of `MAP` output from all files / ranges. When `REDUCE` is missing, output is a list from `MAP`.
`iterate`	A logical indicating if the `REDUCE` function should be applied iteratively to the output of `MAP`. When `REDUCE` is missing, `iterate` is set to `FALSE`. This argument applies to `reduceByRange` only (`reduceRanges` calls `MAP` a single time on each worker). Collapsing results iteratively is useful when the number of records to be processed is large (possibly complete files) but the end result is a much reduced representation of all records. Iteratively applying `REDUCE` reduces the amount of data on each worker at any one time and can substantially reduce the memory footprint.
`summarize`	A logical indicating if results should be returned as a `SummarizedExperiment` object instead of a list; data are returned in the `assays` slot named 'data'. This argument applies to `reduceByRange` only. When `REDUCE` is provided `summarize` is ignored (i.e., set to FALSE). A `SummarizedExperiment` requires the number of rows in `colData` and the columns in `assays` to match. Because `REDUCE` collapses the data across files, the dimension of the result no longer matches that of the original ranges.
`init`	An optional initial value for `REDUCE` when `iterate=TRUE`. `init` must be an object of the same type as the elements returned from `MAP`. `REDUCE` logically adds `init` to the start (when proceeding left to right) or end of results obtained with `MAP`.
`...`	Arguments passed to other methods. Currently not used.
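
The difference between `iterate=TRUE` and `iterate=FALSE` can be sketched in base R. This is only a simulation of the semantics described above, not the package's implementation; `toy_map`, `toy_reduce`, `reduce_at_end`, `reduce_iteratively` and the file names are hypothetical stand-ins.

```r
## Hypothetical stand-ins: three "files" and a MAP that returns one number
## per range / file combination (a real MAP would read from the file).
files <- c("a.bam", "b.bam", "c.bam")
toy_map <- function(range, file, ...) nchar(file) * range
toy_reduce <- function(mapped, ...) Reduce(`+`, mapped)

## iterate = FALSE: collect every MAP result, then call REDUCE once on
## the complete list.
reduce_at_end <- function(range, files, MAP, REDUCE) {
    mapped <- lapply(files, function(f) MAP(range, f))
    REDUCE(mapped)
}

## iterate = TRUE: fold each new MAP result into the running value, so at
## most two results are held at any one time. 'init', when supplied, is
## placed at the start of the fold.
reduce_iteratively <- function(range, files, MAP, REDUCE, init) {
    result <- if (missing(init)) NULL else init
    for (f in files) {
        current <- MAP(range, f)
        if (is.null(result)) {
            result <- current
        } else {
            result <- REDUCE(list(result, current))
        }
    }
    result
}

at_end    <- reduce_at_end(2, files, toy_map, toy_reduce)
iterative <- reduce_iteratively(2, files, toy_map, toy_reduce)
with_init <- reduce_iteratively(2, files, toy_map, toy_reduce, init = 100)
identical(at_end, iterative)
```

For an associative `REDUCE` such as addition, both strategies give the same answer; the iterative form simply never holds the full list of `MAP` output.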

Details

reduceByRange extracts, manipulates and combines ranges across different files. Each element of ranges is sent to a worker; this is a single range when ranges is a GRanges and may be multiple ranges when ranges is a GRangesList. MAP is invoked on each range / file combination. This approach allows ranges extracted from multiple files to be kept separate or combined with REDUCE.

In contrast, reduceRanges does not iterate through the individual files but instead treats them as a group. MAP is invoked once for each range using all files as the files argument. Depending on the nature of MAP, REDUCE may play a minor role by concatenating or unlisting results.

Both MAP and REDUCE are applied in the distributed step ("on the worker"). There is no built-in ability to combine results across workers in the distributed step.
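
As a rough illustration of how the two functions differ in calling MAP, the following base R sketch counts MAP invocations under each scheme. It assumes nothing about the package internals; `make_counting_map`, the toy ranges and the file names are all hypothetical.

```r
## Stand-ins for two genomic ranges and two files (hypothetical names).
ranges <- list(r1 = 1:3, r2 = 4:6)
files  <- c(f1 = "x.bam", f2 = "y.bam")

## A MAP that records how many times it has been called.
make_counting_map <- function() {
    calls <- 0L
    list(
        MAP = function(range, file, ...) {
            calls <<- calls + 1L
            length(range) * length(file)
        },
        calls = function() calls
    )
}

## reduceByRange-style: MAP is invoked on each range / file combination.
per_file <- make_counting_map()
res_by_range <- lapply(ranges, function(rng)
    lapply(files, function(f) per_file$MAP(rng, f)))

## reduceRanges-style: MAP is invoked once per range and receives all
## files as its 'file' argument.
grouped <- make_counting_map()
res_grouped <- lapply(ranges, function(rng) grouped$MAP(rng, files))

per_file$calls()
grouped$calls()
```

In the second scheme MAP itself is responsible for looping over the files, or for calling a method that accepts several files at once.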

Value

reduceByRange: When summarize=FALSE, the return value is a list or the value from the final invocation of REDUCE. When summarize=TRUE, output is a SummarizedExperiment. When ranges is a GenomicFiles object, data from rowRanges, colData and metadata are transferred to the SummarizedExperiment.
reduceRanges: A list or the value returned by the final invocation of REDUCE.
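
The reason `summarize` is ignored when REDUCE is supplied can be pictured with plain matrices. The sketch below assumes a ranges-by-files layout for the unreduced results (3 ranges and 8 files, as in the example that follows); the matrix contents are made up and only the dimensions matter.

```r
n_ranges <- 3   # ranges of interest
n_files  <- 8   # BAM files

## Without REDUCE: one MAP value per range / file combination, so the
## results fill a ranges x files grid (assumed layout: rows are ranges,
## columns are files).
unreduced <- matrix(seq_len(n_ranges * n_files),
                    nrow = n_ranges, ncol = n_files)
dim(unreduced)

## With REDUCE: the per-file values for each range collapse to a single
## value, so there is no longer one column per file.
reduced <- matrix(rowSums(unreduced), nrow = n_ranges)
dim(reduced)
```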

Author(s)

Martin Morgan and Valerie Obenchain

Examples


if (all(requireNamespace("RNAseqData.HNRNPC.bam.chr14", quietly=TRUE) &&
        require(GenomicAlignments))) {
  ## -----------------------------------------------------------------------
  ## Compute coverage across BAM files.
  ## -----------------------------------------------------------------------
  fls <-                                ## 8 bam files
      RNAseqData.HNRNPC.bam.chr14::RNAseqData.HNRNPC.bam.chr14_BAMFILES
  
  ## Regions of interest.
  gr <- GRanges("chr14", IRanges(c(62262735, 63121531, 63980327),
                width=214700))
  
  ## The MAP computes the coverage ...
  MAP <- function(range, file, ...) {
      requireNamespace("GenomicFiles", quietly=TRUE)
      ## for coverage(), Rsamtools::ScanBamParam()
      param = Rsamtools::ScanBamParam(which=range)
      GenomicFiles::coverage(file, param=param)[range]
  } 
  ## and REDUCE adds the last and current results. 
  REDUCE <- function(mapped, ...)
      Reduce("+", mapped)

  ## -----------------------------------------------------------------------
  ## reduceByRange:

  ## With no REDUCE, coverage is computed for each range / file combination.
  cov1 <- reduceByRange(gr, fls, MAP)
  cov1[[1]]
  
  ## Each call to coverage() produces an RleList, and these accumulate on
  ## the workers. We can use a reducer to combine these lists either
  ## iteratively or non-iteratively. When iterate = TRUE the current result
  ## is collapsed with the previous one, resulting in a maximum of 2
  ## RleLists on a worker at any given time.
  cov2 <- reduceByRange(gr, fls, MAP, REDUCE, iterate=TRUE)
  cov2[[1]]
  
  ## If memory use is not a concern (or if MAP output is not large) the 
  ## REDUCE function can be applied non-iteratively. 
  cov3 <- reduceByRange(gr, fls, MAP, REDUCE, iterate=FALSE)
  
  ## Results match those obtained with the iterative REDUCE.
  cov3[[1]]

  ## When 'ranges' is a GRangesList, the list elements are sent to the
  ## workers instead of a single range as in the case of a GRanges.
  grl <- GRangesList(gr[1], gr[2:3])
  grl

  cov4 <- reduceByRange(grl, fls, MAP)
  length(cov4)          ## length of GRangesList
  elementNROWS(cov4)  ## number of files

  ## -----------------------------------------------------------------------
  ## reduceRanges:

  ## This function passes the character vector of all file names to MAP.
  ## MAP must handle each file separately or invoke a method that operates
  ## on a list of files.

  ## TODO: example
}

Results


> library(GenomicFiles)
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: GenomicRanges
Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: BiocParallel
Loading required package: Rsamtools
Loading required package: Biostrings
Loading required package: XVector
Loading required package: rtracklayer
> ### Name: reduceByRange
> ### Title: Parallel computations by ranges
> ### Aliases: reduceByRange reduceByRange,GRanges,ANY-method
> ###   reduceByRange,GRangesList,ANY-method
> ###   reduceByRange,GenomicFiles,missing-method reduceRanges
> ### Keywords: methods
> 
> ### ** Examples
> 
> 
> if (all(requireNamespace("RNAseqData.HNRNPC.bam.chr14", quietly=TRUE) &&
+         require(GenomicAlignments))) {
+   ## -----------------------------------------------------------------------
+   ## Compute coverage across BAM files.
+   ## -----------------------------------------------------------------------
+   fls <-                                ## 8 bam files
+       RNAseqData.HNRNPC.bam.chr14::RNAseqData.HNRNPC.bam.chr14_BAMFILES
+   
+   ## Regions of interest.
+   gr <- GRanges("chr14", IRanges(c(62262735, 63121531, 63980327),
+                 width=214700))
+   
+   ## The MAP computes the coverage ...
+   MAP <- function(range, file, ...) {
+       requireNamespace("GenomicFiles", quietly=TRUE)
+       ## for coverage(), Rsamtools::ScanBamParam()
+       param = Rsamtools::ScanBamParam(which=range)
+       GenomicFiles::coverage(file, param=param)[range]
+   } 
+   ## and REDUCE adds the last and current results. 
+   REDUCE <- function(mapped, ...)
+       Reduce("+", mapped)
+ 
+   ## -----------------------------------------------------------------------
+   ## reduceByRange:
+ 
+   ## With no REDUCE, coverage is computed for each range / file combination.
+   cov1 <- reduceByRange(gr, fls, MAP)
+   cov1[[1]]
+   
+   ## Each call to coverage() produces an RleList which accumulate on the 
+   ## workers. We can use a reducer to combine these lists either iteratively
+   ## or non-iteratively. When iterate = TRUE the current result 
+   ## is collapsed with the last resulting in a maximum of 2 RleLists on 
+   ## a worker at any given time.
+   cov2 <- reduceByRange(gr, fls, MAP, REDUCE, iterate=TRUE)
+   cov2[[1]]
+   
+   ## If memory use is not a concern (or if MAP output is not large) the 
+   ## REDUCE function can be applied non-iteratively. 
+   cov3 <- reduceByRange(gr, fls, MAP, REDUCE, iterate=FALSE)
+   
+   ## Results match those obtained with the iterative REDUCE.
+   cov3[[1]]
+ 
+   ## When 'ranges' is a GRangesList, the list elements are sent to the
+   ## workers instead of a single range as in the case of a GRanges.
+   grl <- GRangesList(gr[1], gr[2:3])
+   grl
+ 
+   cov4 <- reduceByRange(grl, fls, MAP)
+   length(cov4)          ## length of GRangesList
+   elementNROWS(cov4)  ## number of files
+ 
+   ## -----------------------------------------------------------------------
+   ## reduceRanges:
+ 
+   ## This function passes the character vector of all file names to MAP.
+   ## MAP must handle each file separately or invoke a method that operates
+   ## on a list of files.
+ 
+   ## TODO: example
+ }
Loading required package: GenomicAlignments
[1] 8 8