Last data update: 2014.03.03

R: Maintain and use BAM files
BamFileR Documentation

Maintain and use BAM files

Description

Use BamFile() to create a reference to a BAM file (and optionally its index). The reference remains open across calls to methods, avoiding costly index re-loading.

BamFileList() provides a convenient way of managing a list of BamFile instances.

Usage


## Constructors

BamFile(file, index=file, ..., yieldSize=NA_integer_, obeyQname=FALSE,
        asMates=FALSE, qnamePrefixEnd=NA, qnameSuffixStart=NA)
BamFileList(..., yieldSize=NA_integer_, obeyQname=FALSE, asMates=FALSE,
            qnamePrefixEnd=NA, qnameSuffixStart=NA)

## Opening / closing

## S3 method for class 'BamFile'
open(con, ...)
## S3 method for class 'BamFile'
close(con, ...)

## accessors; also path(), index(), yieldSize()

## S4 method for signature 'BamFile'
isOpen(con, rw="")
## S4 method for signature 'BamFile'
isIncomplete(con)
## S4 method for signature 'BamFile'
obeyQname(object, ...)
obeyQname(object, ...) <- value
## S4 method for signature 'BamFile'
asMates(object, ...)
asMates(object, ...) <- value
## S4 method for signature 'BamFile'
qnamePrefixEnd(object, ...)
qnamePrefixEnd(object, ...) <- value
## S4 method for signature 'BamFile'
qnameSuffixStart(object, ...)
qnameSuffixStart(object, ...) <- value

## actions

## S4 method for signature 'BamFile'
scanBamHeader(files, ..., what=c("targets", "text"))
## S4 method for signature 'BamFile'
seqinfo(x)
## S4 method for signature 'BamFileList'
seqinfo(x)
## S4 method for signature 'BamFile'
filterBam(file, destination, index=file, ...,
    filter=FilterRules(), indexDestination=TRUE,
    param=ScanBamParam(what=scanBamWhat()))
## S4 method for signature 'BamFile'
indexBam(files, ...)
## S4 method for signature 'BamFile'
sortBam(file, destination, ..., byQname=FALSE, maxMemory=512)
## S4 method for signature 'BamFileList'
mergeBam(files, destination, ...)

## reading

## S4 method for signature 'BamFile'
scanBam(file, index=file, ..., param=ScanBamParam(what=scanBamWhat()))

## counting

## S4 method for signature 'BamFile'
countBam(file, index=file, ..., param=ScanBamParam())
## S4 method for signature 'BamFileList'
countBam(file, index=file, ..., param=ScanBamParam())
## S4 method for signature 'BamFile'
quickBamFlagSummary(file, ..., param=ScanBamParam(), main.groups.only=FALSE)

Arguments

...

Additional arguments.

For BamFileList, this can either be a single character vector of paths to BAM files, or several instances of BamFile objects. When a character vector of paths, a second named argument ‘index’ can be a character() vector of length equal to the first argument specifying the paths to the index files, or character() to indicate that no index file is available. See BamFile.

con

An instance of BamFile.

x, object, file, files

A character vector of BAM file paths (for BamFile) or a BamFile instance (for other methods).

index

character(1); the BAM index file path (for BamFile); ignored for all other methods on this page.

yieldSize

Number of records to yield each time the file is read from with scanBam. See ‘Fields’ section for details.

asMates

Logical indicating if records should be paired as mates. See ‘Fields’ section for details.

qnamePrefixEnd

Single character (or NA) marking the end of the qname prefix. When specified, all characters prior to and including the qnamePrefixEnd are removed from the qname. If the prefix is not found in the qname the qname is not trimmed. Currently only implemented for mate-pairing (i.e., when asMates=TRUE in a BamFile.

qnameSuffixStart

Single character (or NA) marking the start of the qname suffix. When specified, all characters following and including the qnameSuffixStart are removed from the qname. If the suffix is not found in the qname the qname is not trimmmed. Currently only implemented for mate-pairing (i.e., when asMates=TRUE in a BamFile.

obeyQname

Logical indicating if the BAM file is sorted by qname. In Bioconductor > 2.12 paired-end files do not need to be sorted by qname. Instead use asMates=TRUE for reading paired-end data. See ‘Fields’ section for details.

value

Logical value for setting asMates and obeyQname in a BamFile instance.

what

For scanBamHeader, a character vector specifying that either or both of c("targets", "text") are to be extracted from the header; see scanBam for additional detail.

filter

A FilterRules instance. Functions in the FilterRules instance should expect a single DataFrame argument representing all information specified by param. Each function must return a logical vector, usually of length equal to the number of rows of the DataFrame. Return values are used to include (when TRUE) corresponding records in the filtered BAM file.

destination

character(1) file path to write filtered reads to.

indexDestination

logical(1) indicating whether the destination file should also be indexed.

byQname, maxMemory

See sortBam.

param

An optional ScanBamParam instance to further influence scanning, counting, or filtering.

rw

Mode of file; ignored.

main.groups.only

See quickBamFlagSummary.

Objects from the Class

Objects are created by calls of the form BamFile().

Fields

The BamFile class inherits fields from the RsamtoolsFile class and has fields:

yieldSize:

Number of records to yield each time the file is read from using scanBam or, when length(bamWhich()) != 0, a threshold which yields records in complete ranges whose sum first exceeds yieldSize. Setting yieldSize on a BamFileList does not alter existing yield sizes set on the individual BamFile instances.

asMates:

A logical indicating if the records should be returned as mated pairs. When TRUE scanBam attempts to mate (pair) the records and returns two additional fields groupid and mate_status. groupid is an integer vector of unique group ids; mate_status is a factor with level mated for records successfully paired by the algorithm, ambiguous for records that are possibly mates but cannot be assigned unambiguously, or unmated for reads that did not have valid mates.

Mate criteria:

  • Bit 0x40 and 0x80: Segments are a pair of first/last OR neither segment is marked first/last

  • Bit 0x100: Both segments are secondary OR both not secondary

  • Bit 0x10 and 0x20: Segments are on opposite strands

  • mpos match: segment1 mpos matches segment2 pos AND segment2 mpos matches segment1 pos

  • tid match

Flags, tags and ranges may be specified in the ScanBamParam for fine tuning of results.

obeyQname:

A logical(0) indicating if the file was sorted by qname. In Bioconductor > 2.12 paired-end files do not need to be sorted by qname. Instead set asMates=TRUE in the BamFile when using the readGAlignmentsList function from the GenomicAlignments package.

Functions and methods

BamFileList inherits additional methods from RsamtoolsFileList and SimpleList.

Opening / closing:

open.BamFile

Opens the (local or remote) path and index (if bamIndex is not character(0)), files. Returns a BamFile instance.

close.BamFile

Closes the BamFile con; returning (invisibly) the updated BamFile. The instance may be re-opened with open.BamFile.

isOpen

Tests whether the BamFile con has been opened for reading.

isIncomplete

Tests whether the BamFile con is niether closed nor at the end of the file.

Accessors:

path

Returns a character(1) vector of BAM path names.

index

Returns a character(0) or character(1) vector of BAM index path names.

yieldSize, yieldSize<-

Return or set an integer(1) vector indicating yield size.

obeyQname, obeyQname<-

Return or set a logical(0) indicating if the file was sorted by qname.

asMates, asMates<-

Return or set a logical(0) indicating if the records should be returned as mated pairs.

Methods:

scanBamHeader

Visit the path in path(file), returning the information contained in the file header; see scanBamHeader.

seqinfo, seqnames, seqlength

Visit the path in path(file), returning a Seqinfo, character, or named integer vector containing information on the anmes and / or lengths of each sequence. Seqnames are ordered as they appear in the file.

scanBam

Visit the path in path(file), returning the result of scanBam applied to the specified path.

countBam

Visit the path(s) in path(file), returning the result of countBam applied to the specified path.

filterBam

Visit the path in path(file), returning the result of filterBam applied to the specified path. A single file can be filtered to one or several destinations, as described in filterBam.

indexBam

Visit the path in path(file), returning the result of indexBam applied to the specified path.

sortBam

Visit the path in path(file), returning the result of sortBam applied to the specified path.

mergeBam

Merge several BAM files into a single BAM file. See mergeBam for details; additional arguments supported by mergeBam,character-method are also available for BamFileList.

show

Compactly display the object.

Author(s)

Martin Morgan and Marc Carlson

See Also

  • The readGAlignments, readGAlignmentPairs, and readGAlignmentsList functions defined in the GenomicAlignments package.

  • summarizeOverlaps and findSpliceOverlaps-methods in the GenomicAlignments package for methods that work on a BamFile and BamFileList objects.

Examples


##
## BamFile options.
##

fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
bf <- BamFile(fl)
bf

## When 'asMates=TRUE' scanBam() reads the data in as
## pairs. See 'asMates' above for details of the pairing
## algorithm.
asMates(bf) <- TRUE

## When 'yieldSize' is set, scanBam() will iterate
## through the file in chunks.
yieldSize(bf) <- 500 

## Some applications append a filename (e.g., NCBI Sequence Read 
## Archive (SRA) toolkit) or allele identifier to the sequence qname.
## This may result in a unique qname for each record which presents a
## problem when mating paired-end reads (identical qnames is one
## criteria for paired-end mating). 'qnamePrefixEnd' and 
## 'qnameSuffixStart' can be used to trim an unwanted prefix or suffix.
qnamePrefixEnd(bf) <- "/"
qnameSuffixStart(bf) <- "." 

##
## Reading Bam files.
##

fl <- system.file("extdata", "ex1.bam", package="Rsamtools",
                  mustWork=TRUE)
(bf <- BamFile(fl))
head(seqlengths(bf))                    # sequences and lengths in BAM file

if (require(RNAseqData.HNRNPC.bam.chr14)) {
    bfl <- BamFileList(RNAseqData.HNRNPC.bam.chr14_BAMFILES)
    bfl
    bfl[1:2]                            # subset
    bfl[[1]]                            # select first element -- BamFile
    ## merged across BAM files
    seqinfo(bfl)
    head(seqlengths(bfl))
}


length(scanBam(fl)[[1]][[1]])  # all records

bf <- open(BamFile(fl))        # implicit index
bf
identical(scanBam(bf), scanBam(fl))
close(bf)

## Use 'yieldSize' to iterate through a file in chunks.
bf <- open(BamFile(fl, yieldSize=1000)) 
while (nrec <- length(scanBam(bf)[[1]][[1]]))
    cat("records:", nrec, "\n")
close(bf)

## Repeatedly visit multiple ranges in the BamFile. 
rng <- GRanges(c("seq1", "seq2"), IRanges(1, c(1575, 1584)))
bf <- open(BamFile(fl))
sapply(seq_len(length(rng)), function(i, bamFile, rng) {
    param <- ScanBamParam(which=rng[i], what="seq")
    bam <- scanBam(bamFile, param=param)[[1]]
    alphabetFrequency(bam[["seq"]], baseOnly=TRUE, collapse=TRUE)
}, bf, rng)
close(bf)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(Rsamtools)
Loading required package: GenomeInfoDb
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: Biostrings
Loading required package: XVector
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/Rsamtools/BamFile-class.Rd_%03d_medium.png", width=480, height=480)
> ### Name: BamFile
> ### Title: Maintain and use BAM files
> ### Aliases: BamFile-class BamFileList-class BamFile BamFileList
> ###   open.BamFile close.BamFile isOpen,BamFile-method
> ###   isIncomplete,BamFile-method scanBamHeader,BamFile-method
> ###   seqinfo,BamFile-method seqinfo,BamFileList-method obeyQname
> ###   obeyQname<- obeyQname,BamFile-method obeyQname<-,BamFile-method
> ###   obeyQname,BamFileList-method obeyQname<-,BamFileList-method asMates
> ###   asMates<- asMates,BamFile-method asMates<-,BamFile-method
> ###   asMates,BamFileList-method asMates<-,BamFileList-method
> ###   qnamePrefixEnd qnamePrefixEnd<- qnamePrefixEnd,BamFile-method
> ###   qnamePrefixEnd<-,BamFile-method qnamePrefixEnd,BamFileList-method
> ###   qnamePrefixEnd<-,BamFileList-method qnameSuffixStart
> ###   qnameSuffixStart<- qnameSuffixStart,BamFile-method
> ###   qnameSuffixStart<-,BamFile-method qnameSuffixStart,BamFileList-method
> ###   qnameSuffixStart<-,BamFileList-method scanBam,BamFile-method
> ###   countBam,BamFile-method countBam,BamFileList-method
> ###   filterBam,BamFile-method indexBam,BamFile-method
> ###   sortBam,BamFile-method mergeBam,BamFileList-method
> ###   quickBamFlagSummary,BamFile-method show,BamFile-method
> ###   show,BamFileList-method
> ### Keywords: classes
> 
> ### ** Examples
> 
> 
> ##
> ## BamFile options.
> ##
> 
> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
> bf <- BamFile(fl)
> bf
class: BamFile 
path: /home/ddbj/local/lib64/R/library/Rsamtools/extdata/ex1.bam
index: /home/ddbj/local/lib64/R/library/Rsamtools/extdata/ex1.bam.bai
isOpen: FALSE 
yieldSize: NA 
obeyQname: FALSE 
asMates: FALSE 
qnamePrefixEnd: NA 
qnameSuffixStart: NA 
> 
> ## When 'asMates=TRUE' scanBam() reads the data in as
> ## pairs. See 'asMates' above for details of the pairing
> ## algorithm.
> asMates(bf) <- TRUE
> 
> ## When 'yieldSize' is set, scanBam() will iterate
> ## through the file in chunks.
> yieldSize(bf) <- 500 
> 
> ## Some applications append a filename (e.g., NCBI Sequence Read 
> ## Archive (SRA) toolkit) or allele identifier to the sequence qname.
> ## This may result in a unique qname for each record which presents a
> ## problem when mating paired-end reads (identical qnames is one
> ## criteria for paired-end mating). 'qnamePrefixEnd' and 
> ## 'qnameSuffixStart' can be used to trim an unwanted prefix or suffix.
> qnamePrefixEnd(bf) <- "/"
> qnameSuffixStart(bf) <- "." 
> 
> ##
> ## Reading Bam files.
> ##
> 
> fl <- system.file("extdata", "ex1.bam", package="Rsamtools",
+                   mustWork=TRUE)
> (bf <- BamFile(fl))
class: BamFile 
path: /home/ddbj/local/lib64/R/library/Rsamtools/extdata/ex1.bam
index: /home/ddbj/local/lib64/R/library/Rsamtools/extdata/ex1.bam.bai
isOpen: FALSE 
yieldSize: NA 
obeyQname: FALSE 
asMates: FALSE 
qnamePrefixEnd: NA 
qnameSuffixStart: NA 
> head(seqlengths(bf))                    # sequences and lengths in BAM file
seq1 seq2 
1575 1584 
> 
> if (require(RNAseqData.HNRNPC.bam.chr14)) {
+     bfl <- BamFileList(RNAseqData.HNRNPC.bam.chr14_BAMFILES)
+     bfl
+     bfl[1:2]                            # subset
+     bfl[[1]]                            # select first element -- BamFile
+     ## merged across BAM files
+     seqinfo(bfl)
+     head(seqlengths(bfl))
+ }
Loading required package: RNAseqData.HNRNPC.bam.chr14
                 chr1                 chr10                 chr11 
            249250621             135534747             135006516 
chr11_gl000202_random                 chr12                 chr13 
                40103             133851895             115169878 
> 
> 
> length(scanBam(fl)[[1]][[1]])  # all records
[1] 3307
> 
> bf <- open(BamFile(fl))        # implicit index
> bf
class: BamFile 
path: /home/ddbj/local/lib64/R/library/Rsamtools/extdata/ex1.bam
index: /home/ddbj/local/lib64/R/library/Rsamtools/extdata/ex1.bam.bai
isOpen: TRUE 
yieldSize: NA 
obeyQname: FALSE 
asMates: FALSE 
qnamePrefixEnd: NA 
qnameSuffixStart: NA 
> identical(scanBam(bf), scanBam(fl))
[1] TRUE
> close(bf)
> 
> ## Use 'yieldSize' to iterate through a file in chunks.
> bf <- open(BamFile(fl, yieldSize=1000)) 
> while (nrec <- length(scanBam(bf)[[1]][[1]]))
+     cat("records:", nrec, "\n")
records: 1000 
records: 1000 
records: 1000 
records: 307 
> close(bf)
> 
> ## Repeatedly visit multiple ranges in the BamFile. 
> rng <- GRanges(c("seq1", "seq2"), IRanges(1, c(1575, 1584)))
> bf <- open(BamFile(fl))
> sapply(seq_len(length(rng)), function(i, bamFile, rng) {
+     param <- ScanBamParam(which=rng[i], what="seq")
+     bam <- scanBam(bamFile, param=param)[[1]]
+     alphabetFrequency(bam[["seq"]], baseOnly=TRUE, collapse=TRUE)
+ }, bf, rng)
       [,1] [,2]
A     12843    0
C     13142    0
G     10833    0
T     16021    0
other    12    0
> close(bf)
> 
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>