A character vector or factor containing the extended CIGAR strings.
It can be of arbitrary length except for queryLoc2refLoc which
only accepts a single CIGAR (as a character vector or factor of length 1).
ops
Character vector containing the extended CIGAR operations to actually
consider. Zero-length operations or operations not listed ops
are ignored.
flag
NULL or an integer vector containing the SAM flag for
each read.
According to the SAM Spec v1.4, flag bit 0x4 is the only reliable place
to tell whether a segment (or read) is mapped (bit is 0) or not (bit
is 1). If flag is supplied, then
cigarRangesAlongReferenceSpace, cigarRangesAlongQuerySpace,
cigarRangesAlongPairwiseSpace, and
extractAlignmentRangesOnReference don't produce any range
for unmapped reads i.e. they treat them as if their CIGAR was empty
(independently of what their CIGAR is). If flag is supplied, then
cigarWidthAlongReferenceSpace, cigarWidthAlongQuerySpace, and
cigarWidthAlongPairwiseSpace return NAs for unmapped reads.
N.regions.removed
TRUE or FALSE.
If TRUE, then cigarRangesAlongReferenceSpace and
cigarWidthAlongReferenceSpace report ranges/widths with respect
to the "reference" space from which the N regions have been removed,
and cigarRangesAlongPairwiseSpace and
cigarWidthAlongPairwiseSpace report them with respect to
the "pairwise" space from which the N regions have been removed.
pos
An integer vector containing the 1-based leftmost
position/coordinate for each (eventually clipped) read
sequence. Must have length 1 (in which case it's recycled to the
length of cigar), or the same length as cigar.
f
NULL or a factor of length cigar.
If NULL, then the ranges are grouped by alignment i.e. the
returned IRangesList object has 1 list element per
element in cigar. Otherwise they are grouped by factor level
i.e. the returned IRangesList object has 1 list element
per level in f and is named with those levels.
For example, if f is a factor containing the chromosome for each
read, then the returned IRangesList object will have
1 list element per chromosome and each list element will contain all
the ranges on that chromosome.
drop.empty.ranges
Should empty ranges be dropped?
reduce.ranges
Should adjacent ranges coming from the same cigar be merged or not?
Using TRUE can significantly reduce the size of the returned
object.
with.ops
TRUE or FALSE indicating whether the returned ranges should
be named with their corresponding CIGAR operation.
before.hard.clipping
TRUE or FALSE.
If TRUE, then cigarRangesAlongQuerySpace and
cigarWidthAlongQuerySpace report ranges/widths with respect
to the "query" space to which the H regions have been added.
before.hard.clipping and after.soft.clipping cannot
both be TRUE.
after.soft.clipping
TRUE or FALSE.
If TRUE, then cigarRangesAlongQuerySpace and
cigarWidthAlongQuerySpace report ranges/widths with respect
to the "query" space from which the S regions have been removed.
before.hard.clipping and after.soft.clipping cannot
both be TRUE.
dense
TRUE or FALSE.
If TRUE, then cigarRangesAlongPairwiseSpace and
cigarWidthAlongPairwiseSpace report ranges/widths with respect to
the "pairwise" space from which the I, D, and N regions have been removed.
N.regions.removed and dense cannot both be TRUE.
drop.D.ranges
Should the ranges corresponding to a deletion from the
reference (encoded with a D in the CIGAR) be dropped?
By default we keep them to be consistent with the pileup tool
from SAMtools.
Note that, when drop.D.ranges is TRUE, then Ds
and Ns in the CIGAR are equivalent.
start,end,width
Vectors of integers. NAs and negative values are accepted and
"solved" according to the rules of the SEW (Start/End/Width)
interface (see ?solveUserSEW for the details).
qloc
An integer vector containing "query-based locations" i.e.
1-based locations relative to the query sequence
stored in the SAM/BAM file.
qlocs
A list of the same length as cigar where each
element is an integer vector containing "query-based
locations" i.e. 1-based locations relative to the corresponding
query sequence stored in the SAM/BAM file.
Value
CIGAR_OPS is a predefined character vector containing the supported
extended CIGAR operations: M, I, D, N, S, H, P, =, X. See p. 4 of the
SAM Spec v1.4 at http://samtools.sourceforge.net/ for the list of
extended CIGAR operations and their meanings.
For explodeCigarOps and explodeCigarOpLengths:
Both functions return a list of the same length as cigar where each
list element is a character vector (for explodeCigarOps) or an integer
vector (for explodeCigarOpLengths). The 2 lists have the same shape,
that is, same length() and same elementNROWS(). The i-th
character vector in the list returned by explodeCigarOps contains one
single-letter string per CIGAR operation in cigar[i]. The i-th integer
vector in the list returned by explodeCigarOpLengths contains the
corresponding CIGAR operation lengths. Zero-length operations or operations
not listed in ops are ignored.
For cigarToRleList: A CompressedRleList object.
For cigarOpTable: An integer matrix with number of rows equal
to the length of cigar and nine columns, one for each extended
CIGAR operation.
For cigarRangesAlongReferenceSpace, cigarRangesAlongQuerySpace,
cigarRangesAlongPairwiseSpace, and
extractAlignmentRangesOnReference: An IRangesList
object (more precisely a CompressedIRangesList object)
with 1 list element per element in cigar.
However, if f is a factor, then the returned
IRangesList object can be a SimpleIRangesList
object (instead of CompressedIRangesList), and in that case,
has 1 list element per level in f and is named with those levels.
For cigarWidthAlongReferenceSpace and
cigarWidthAlongPairwiseSpace: An integer vector of the same
length as cigar where each element is the width of the alignment
with respect to the "reference" and "pairwise" space, respectively.
More precisely, for cigarWidthAlongReferenceSpace, the returned
widths are the lengths of the alignments on the reference,
N gaps included (except if N.regions.removed is TRUE).
NAs or "*" in cigar will produce NAs in the returned vector.
For cigarWidthAlongQuerySpace: An integer vector of the same
length as cigar where each element is the length of the corresponding
query sequence as inferred from the CIGAR string. Note that, by default
(i.e. if before.hard.clipping and after.soft.clipping are
FALSE), this is the length of the query sequence stored in the
SAM/BAM file. If before.hard.clipping or after.soft.clipping
is TRUE, the returned widths are the lengths of the query sequences
before hard clipping or after soft clipping.
NAs or "*" in cigar will produce NAs in the returned vector.
For cigarNarrow and cigarQNarrow: A character vector
of the same length as cigar containing the narrowed cigars.
In addition the vector has an "rshift" attribute which is an integer
vector of the same length as cigar. It contains the values that
would need to be added to the POS field of a SAM/BAM file as a
consequence of this cigar narrowing.
For queryLoc2refLoc: An integer vector of the same length as
qloc containing the "reference-based locations" (i.e. the
1-based locations relative to the reference sequence) corresponding
to the "query-based locations" passed in qloc.
For queryLocs2refLocs: A list of the same length as
qlocs where each element is an integer vector containing
the "reference-based locations" corresponding to the "query-based
locations" passed in the corresponding element in qlocs.
Author(s)
Herv<c3><83><c2><a9> Pag<c3><83><c2><a8>s & P. Aboyoun
The sequenceLayer function in the
GenomicAlignments package for laying the query sequences
alongside the "reference" or "pairwise" spaces.
The GAlignments container for storing a set of genomic
alignments.
The IRanges, IRangesList, and
RleList classes in the IRanges package.
The coverage generic and methods for
computing the coverage across a set of ranges or genomic ranges.
Examples
## ---------------------------------------------------------------------
## A. CIGAR_OPS, explodeCigarOps(), explodeCigarOpLengths(),
## cigarToRleList(), and cigarOpTable()
## ---------------------------------------------------------------------
## Supported CIGAR operations:
CIGAR_OPS
## Transform CIGARs into other useful representations:
cigar1 <- "3H15M55N4M2I6M2D5M6S"
cigar2 <- c("40M2I9M", cigar1, "2S10M2000N15M", "3H33M5H")
explodeCigarOps(cigar2)
explodeCigarOpLengths(cigar2)
explodeCigarOpLengths(cigar2, ops=c("I", "S"))
cigarToRleList(cigar2)
## Summarize CIGARs:
cigarOpTable(cigar2)
## ---------------------------------------------------------------------
## B. From CIGARs to ranges and to sequence lengths
## ---------------------------------------------------------------------
## CIGAR ranges along the "reference" space:
cigarRangesAlongReferenceSpace(cigar1, with.ops=TRUE)[[1]]
cigarRangesAlongReferenceSpace(cigar1,
reduce.ranges=TRUE, with.ops=TRUE)[[1]]
ops <- setdiff(CIGAR_OPS, "N")
cigarRangesAlongReferenceSpace(cigar1, ops=ops, with.ops=TRUE)[[1]]
cigarRangesAlongReferenceSpace(cigar1, ops=ops,
reduce.ranges=TRUE, with.ops=TRUE)[[1]]
ops <- setdiff(CIGAR_OPS, c("D", "N"))
cigarRangesAlongReferenceSpace(cigar1, ops=ops, with.ops=TRUE)[[1]]
cigarWidthAlongReferenceSpace(cigar1)
pos2 <- c(1, 1001, 1, 351)
cigarRangesAlongReferenceSpace(cigar2, pos=pos2, with.ops=TRUE)
res1a <- extractAlignmentRangesOnReference(cigar2, pos=pos2)
res1b <- cigarRangesAlongReferenceSpace(cigar2,
pos=pos2,
ops=setdiff(CIGAR_OPS, "N"),
reduce.ranges=TRUE)
stopifnot(identical(res1a, res1b))
res2a <- extractAlignmentRangesOnReference(cigar2, pos=pos2,
drop.D.ranges=TRUE)
res2b <- cigarRangesAlongReferenceSpace(cigar2,
pos=pos2,
ops=setdiff(CIGAR_OPS, c("D", "N")),
reduce.ranges=TRUE)
stopifnot(identical(res2a, res2b))
seqnames <- factor(c("chr6", "chr6", "chr2", "chr6"),
levels=c("chr2", "chr6"))
extractAlignmentRangesOnReference(cigar2, pos=pos2, f=seqnames)
## CIGAR ranges along the "query" space:
cigarRangesAlongQuerySpace(cigar2, with.ops=TRUE)
cigarWidthAlongQuerySpace(cigar1)
cigarWidthAlongQuerySpace(cigar1, before.hard.clipping=TRUE)
## CIGAR ranges along the "pairwise" space:
cigarRangesAlongPairwiseSpace(cigar2, with.ops=TRUE)
cigarRangesAlongPairwiseSpace(cigar2, dense=TRUE, with.ops=TRUE)
## ---------------------------------------------------------------------
## C. COMPUTE THE COVERAGE OF THE READS STORED IN A BAM FILE
## ---------------------------------------------------------------------
## The information stored in a BAM file can be used to compute the
## "coverage" of the mapped reads i.e. the number of reads that hit any
## given position in the reference genome.
## The following function takes the path to a BAM file and returns an
## object representing the coverage of the mapped reads that are stored
## in the file. The returned object is an RleList object named with the
## names of the reference sequences that actually receive some coverage.
flag0 <- scanBamFlag(isUnmappedQuery=FALSE, isDuplicate=FALSE)
extractCoverageFromBAM <- function(bamfile)
{
stopifnot(is(bamfile, "BamFile"))
## This ScanBamParam object allows us to load only the necessary
## information from the file.
param <- ScanBamParam(flag=flag0, what=c("rname", "pos", "cigar"))
bam <- scanBam(bamfile, param=param)[[1]]
## Note that unmapped reads and reads that are PCR/optical duplicates
## have already been filtered out by using the ScanBamParam object
## above.
f <- factor(bam$rname, levels=seqlevels(bamfile))
irl <- extractAlignmentRangesOnReference(bam$cigar, pos=bam$pos, f=f)
coverage(irl, width=seqlengths(bamfile))
}
library(Rsamtools)
f1 <- system.file("extdata", "ex1.bam", package="Rsamtools")
cvg <- extractCoverageFromBAM(BamFile(f1))
## extractCoverageFromBAM() is equivalent but slightly more efficient
## than loading a GAlignments object and computing its coverage:
cvg2 <- coverage(readGAlignments(f1, param=ScanBamParam(flag=flag0)))
stopifnot(identical(cvg, cvg2))
## ---------------------------------------------------------------------
## D. cigarNarrow() and cigarQNarrow()
## ---------------------------------------------------------------------
## cigarNarrow():
cigarNarrow(cigar1) # only drops the soft/hard clipping
cigarNarrow(cigar1, start=10)
cigarNarrow(cigar1, start=15)
cigarNarrow(cigar1, start=15, width=57)
cigarNarrow(cigar1, start=16)
#cigarNarrow(cigar1, start=16, width=55) # ERROR! (empty cigar)
cigarNarrow(cigar1, start=71)
cigarNarrow(cigar1, start=72)
cigarNarrow(cigar1, start=75)
## cigarQNarrow():
cigarQNarrow(cigar1, start=4, end=-3)
cigarQNarrow(cigar1, start=10)
cigarQNarrow(cigar1, start=19)
cigarQNarrow(cigar1, start=24)
## ---------------------------------------------------------------------
## E. PERFORMANCE
## ---------------------------------------------------------------------
if (interactive()) {
## We simulate 20 millions aligned reads, all 40-mers. 95% of them
## align with no indels. 5% align with a big deletion in the
## reference. In the context of an RNAseq experiment, those 5% would
## be suspected to be "junction reads".
set.seed(123)
nreads <- 20000000L
njunctionreads <- nreads * 5L / 100L
cigar3 <- character(nreads)
cigar3[] <- "40M"
junctioncigars <- paste(
paste(10:30, "M", sep=""),
paste(sample(80:8000, njunctionreads, replace=TRUE), "N", sep=""),
paste(30:10, "M", sep=""), sep="")
cigar3[sample(nreads, njunctionreads)] <- junctioncigars
some_fake_rnames <- paste("chr", c(1:6, "X"), sep="")
rname <- factor(sample(some_fake_rnames, nreads, replace=TRUE),
levels=some_fake_rnames)
pos <- sample(80000000L, nreads, replace=TRUE)
## The following takes < 3 sec. to complete:
system.time(irl1 <- extractAlignmentRangesOnReference(cigar3, pos=pos))
## The following takes < 4 sec. to complete:
system.time(irl2 <- extractAlignmentRangesOnReference(cigar3, pos=pos,
f=rname))
## The sizes of the resulting objects are about 240M and 160M,
## respectively:
object.size(irl1)
object.size(irl2)
}
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(GenomicAlignments)
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: Biostrings
Loading required package: XVector
Loading required package: Rsamtools
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/GenomicAlignments/cigar-utils.Rd_%03d_medium.png", width=480, height=480)
> ### Name: cigar-utils
> ### Title: CIGAR utility functions
> ### Aliases: cigar-utils validCigar CIGAR_OPS explodeCigarOps
> ### explodeCigarOpLengths cigarToRleList cigarOpTable
> ### cigarRangesAlongReferenceSpace cigarRangesAlongQuerySpace
> ### cigarRangesAlongPairwiseSpace extractAlignmentRangesOnReference
> ### cigarWidthAlongReferenceSpace cigarWidthAlongQuerySpace
> ### cigarWidthAlongPairwiseSpace cigarNarrow cigarQNarrow queryLoc2refLoc
> ### queryLocs2refLocs
> ### Keywords: manip
>
> ### ** Examples
>
> ## ---------------------------------------------------------------------
> ## A. CIGAR_OPS, explodeCigarOps(), explodeCigarOpLengths(),
> ## cigarToRleList(), and cigarOpTable()
> ## ---------------------------------------------------------------------
>
> ## Supported CIGAR operations:
> CIGAR_OPS
[1] "M" "I" "D" "N" "S" "H" "P" "=" "X"
>
> ## Transform CIGARs into other useful representations:
> cigar1 <- "3H15M55N4M2I6M2D5M6S"
> cigar2 <- c("40M2I9M", cigar1, "2S10M2000N15M", "3H33M5H")
>
> explodeCigarOps(cigar2)
[[1]]
[1] "M" "I" "M"
[[2]]
[1] "H" "M" "N" "M" "I" "M" "D" "M" "S"
[[3]]
[1] "S" "M" "N" "M"
[[4]]
[1] "H" "M" "H"
> explodeCigarOpLengths(cigar2)
[[1]]
[1] 40 2 9
[[2]]
[1] 3 15 55 4 2 6 2 5 6
[[3]]
[1] 2 10 2000 15
[[4]]
[1] 3 33 5
> explodeCigarOpLengths(cigar2, ops=c("I", "S"))
[[1]]
[1] 2
[[2]]
[1] 2 6
[[3]]
[1] 2
[[4]]
integer(0)
> cigarToRleList(cigar2)
RleList of length 4
[[1]]
character-Rle of length 51 with 3 runs
Lengths: 40 2 9
Values : "M" "I" "M"
[[2]]
character-Rle of length 98 with 9 runs
Lengths: 3 15 55 4 2 6 2 5 6
Values : "H" "M" "N" "M" "I" "M" "D" "M" "S"
[[3]]
character-Rle of length 2027 with 4 runs
Lengths: 2 10 2000 15
Values : "S" "M" "N" "M"
[[4]]
character-Rle of length 41 with 3 runs
Lengths: 3 33 5
Values : "H" "M" "H"
>
> ## Summarize CIGARs:
> cigarOpTable(cigar2)
M I D N S H P = X
[1,] 49 2 0 0 0 0 0 0 0
[2,] 30 2 2 55 6 3 0 0 0
[3,] 25 0 0 2000 2 0 0 0 0
[4,] 33 0 0 0 0 8 0 0 0
>
> ## ---------------------------------------------------------------------
> ## B. From CIGARs to ranges and to sequence lengths
> ## ---------------------------------------------------------------------
>
> ## CIGAR ranges along the "reference" space:
> cigarRangesAlongReferenceSpace(cigar1, with.ops=TRUE)[[1]]
IRanges object with 9 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
H 1 0 0
M 1 15 15
N 16 70 55
M 71 74 4
I 75 74 0
M 75 80 6
D 81 82 2
M 83 87 5
S 88 87 0
>
> cigarRangesAlongReferenceSpace(cigar1,
+ reduce.ranges=TRUE, with.ops=TRUE)[[1]]
IRanges object with 1 range and 0 metadata columns:
start end width
<integer> <integer> <integer>
HMNMIMDMS 1 87 87
>
> ops <- setdiff(CIGAR_OPS, "N")
>
> cigarRangesAlongReferenceSpace(cigar1, ops=ops, with.ops=TRUE)[[1]]
IRanges object with 8 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
H 1 0 0
M 1 15 15
M 71 74 4
I 75 74 0
M 75 80 6
D 81 82 2
M 83 87 5
S 88 87 0
>
> cigarRangesAlongReferenceSpace(cigar1, ops=ops,
+ reduce.ranges=TRUE, with.ops=TRUE)[[1]]
IRanges object with 2 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
HM 1 15 15
MIMDMS 71 87 17
>
> ops <- setdiff(CIGAR_OPS, c("D", "N"))
>
> cigarRangesAlongReferenceSpace(cigar1, ops=ops, with.ops=TRUE)[[1]]
IRanges object with 7 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
H 1 0 0
M 1 15 15
M 71 74 4
I 75 74 0
M 75 80 6
M 83 87 5
S 88 87 0
>
> cigarWidthAlongReferenceSpace(cigar1)
[1] 87
>
> pos2 <- c(1, 1001, 1, 351)
>
> cigarRangesAlongReferenceSpace(cigar2, pos=pos2, with.ops=TRUE)
IRangesList of length 4
[[1]]
IRanges object with 3 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
M 1 40 40
I 41 40 0
M 41 49 9
[[2]]
IRanges object with 9 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
H 1001 1000 0
M 1001 1015 15
N 1016 1070 55
M 1071 1074 4
I 1075 1074 0
M 1075 1080 6
D 1081 1082 2
M 1083 1087 5
S 1088 1087 0
[[3]]
IRanges object with 4 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
S 1 0 0
M 1 10 10
N 11 2010 2000
M 2011 2025 15
[[4]]
IRanges object with 3 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
H 351 350 0
M 351 383 33
H 384 383 0
>
> res1a <- extractAlignmentRangesOnReference(cigar2, pos=pos2)
> res1b <- cigarRangesAlongReferenceSpace(cigar2,
+ pos=pos2,
+ ops=setdiff(CIGAR_OPS, "N"),
+ reduce.ranges=TRUE)
> stopifnot(identical(res1a, res1b))
>
> res2a <- extractAlignmentRangesOnReference(cigar2, pos=pos2,
+ drop.D.ranges=TRUE)
> res2b <- cigarRangesAlongReferenceSpace(cigar2,
+ pos=pos2,
+ ops=setdiff(CIGAR_OPS, c("D", "N")),
+ reduce.ranges=TRUE)
> stopifnot(identical(res2a, res2b))
>
> seqnames <- factor(c("chr6", "chr6", "chr2", "chr6"),
+ levels=c("chr2", "chr6"))
> extractAlignmentRangesOnReference(cigar2, pos=pos2, f=seqnames)
IRangesList of length 2
$chr2
IRanges object with 2 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 1 10 10
[2] 2011 2025 15
$chr6
IRanges object with 4 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 1 49 49
[2] 1001 1015 15
[3] 1071 1087 17
[4] 351 383 33
>
> ## CIGAR ranges along the "query" space:
> cigarRangesAlongQuerySpace(cigar2, with.ops=TRUE)
IRangesList of length 4
[[1]]
IRanges object with 3 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
M 1 40 40
I 41 42 2
M 43 51 9
[[2]]
IRanges object with 9 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
H 1 0 0
M 1 15 15
N 16 15 0
M 16 19 4
I 20 21 2
M 22 27 6
D 28 27 0
M 28 32 5
S 33 38 6
[[3]]
IRanges object with 4 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
S 1 2 2
M 3 12 10
N 13 12 0
M 13 27 15
[[4]]
IRanges object with 3 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
H 1 0 0
M 1 33 33
H 34 33 0
> cigarWidthAlongQuerySpace(cigar1)
[1] 38
> cigarWidthAlongQuerySpace(cigar1, before.hard.clipping=TRUE)
[1] 41
>
> ## CIGAR ranges along the "pairwise" space:
> cigarRangesAlongPairwiseSpace(cigar2, with.ops=TRUE)
IRangesList of length 4
[[1]]
IRanges object with 3 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
M 1 40 40
I 41 42 2
M 43 51 9
[[2]]
IRanges object with 9 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
H 1 0 0
M 1 15 15
N 16 70 55
M 71 74 4
I 75 76 2
M 77 82 6
D 83 84 2
M 85 89 5
S 90 89 0
[[3]]
IRanges object with 4 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
S 1 0 0
M 1 10 10
N 11 2010 2000
M 2011 2025 15
[[4]]
IRanges object with 3 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
H 1 0 0
M 1 33 33
H 34 33 0
> cigarRangesAlongPairwiseSpace(cigar2, dense=TRUE, with.ops=TRUE)
IRangesList of length 4
[[1]]
IRanges object with 3 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
M 1 40 40
I 41 40 0
M 41 49 9
[[2]]
IRanges object with 9 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
H 1 0 0
M 1 15 15
N 16 15 0
M 16 19 4
I 20 19 0
M 20 25 6
D 26 25 0
M 26 30 5
S 31 30 0
[[3]]
IRanges object with 4 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
S 1 0 0
M 1 10 10
N 11 10 0
M 11 25 15
[[4]]
IRanges object with 3 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
H 1 0 0
M 1 33 33
H 34 33 0
>
> ## ---------------------------------------------------------------------
> ## C. COMPUTE THE COVERAGE OF THE READS STORED IN A BAM FILE
> ## ---------------------------------------------------------------------
> ## The information stored in a BAM file can be used to compute the
> ## "coverage" of the mapped reads i.e. the number of reads that hit any
> ## given position in the reference genome.
> ## The following function takes the path to a BAM file and returns an
> ## object representing the coverage of the mapped reads that are stored
> ## in the file. The returned object is an RleList object named with the
> ## names of the reference sequences that actually receive some coverage.
>
> flag0 <- scanBamFlag(isUnmappedQuery=FALSE, isDuplicate=FALSE)
>
> extractCoverageFromBAM <- function(bamfile)
+ {
+ stopifnot(is(bamfile, "BamFile"))
+ ## This ScanBamParam object allows us to load only the necessary
+ ## information from the file.
+ param <- ScanBamParam(flag=flag0, what=c("rname", "pos", "cigar"))
+ bam <- scanBam(bamfile, param=param)[[1]]
+ ## Note that unmapped reads and reads that are PCR/optical duplicates
+ ## have already been filtered out by using the ScanBamParam object
+ ## above.
+ f <- factor(bam$rname, levels=seqlevels(bamfile))
+ irl <- extractAlignmentRangesOnReference(bam$cigar, pos=bam$pos, f=f)
+ coverage(irl, width=seqlengths(bamfile))
+ }
>
> library(Rsamtools)
> f1 <- system.file("extdata", "ex1.bam", package="Rsamtools")
> cvg <- extractCoverageFromBAM(BamFile(f1))
>
> ## extractCoverageFromBAM() is equivalent but slightly more efficient
> ## than loading a GAlignments object and computing its coverage:
> cvg2 <- coverage(readGAlignments(f1, param=ScanBamParam(flag=flag0)))
> stopifnot(identical(cvg, cvg2))
>
> ## ---------------------------------------------------------------------
> ## D. cigarNarrow() and cigarQNarrow()
> ## ---------------------------------------------------------------------
>
> ## cigarNarrow():
> cigarNarrow(cigar1) # only drops the soft/hard clipping
[1] "15M55N4M2I6M2D5M"
attr(,"rshift")
[1] 0
> cigarNarrow(cigar1, start=10)
[1] "6M55N4M2I6M2D5M"
attr(,"rshift")
[1] 9
> cigarNarrow(cigar1, start=15)
[1] "1M55N4M2I6M2D5M"
attr(,"rshift")
[1] 14
> cigarNarrow(cigar1, start=15, width=57)
[1] "1M55N1M"
attr(,"rshift")
[1] 14
> cigarNarrow(cigar1, start=16)
[1] "4M2I6M2D5M"
attr(,"rshift")
[1] 70
> #cigarNarrow(cigar1, start=16, width=55) # ERROR! (empty cigar)
> cigarNarrow(cigar1, start=71)
[1] "4M2I6M2D5M"
attr(,"rshift")
[1] 70
> cigarNarrow(cigar1, start=72)
[1] "3M2I6M2D5M"
attr(,"rshift")
[1] 71
> cigarNarrow(cigar1, start=75)
[1] "6M2D5M"
attr(,"rshift")
[1] 74
>
> ## cigarQNarrow():
> cigarQNarrow(cigar1, start=4, end=-3)
[1] "15M55N4M2I6M2D5M4S"
attr(,"rshift")
[1] 0
> cigarQNarrow(cigar1, start=10)
[1] "9M55N4M2I6M2D5M6S"
attr(,"rshift")
[1] 6
> cigarQNarrow(cigar1, start=19)
[1] "4M2I6M2D5M6S"
attr(,"rshift")
[1] 70
> cigarQNarrow(cigar1, start=24)
[1] "1I6M2D5M6S"
attr(,"rshift")
[1] 74
>
> ## ---------------------------------------------------------------------
> ## E. PERFORMANCE
> ## ---------------------------------------------------------------------
>
> #if (interactive()) {
> ## We simulate 20 millions aligned reads, all 40-mers. 95% of them
> ## align with no indels. 5% align with a big deletion in the
> ## reference. In the context of an RNAseq experiment, those 5% would
> ## be suspected to be "junction reads".
> set.seed(123)
> nreads <- 20000000L
> njunctionreads <- nreads * 5L / 100L
> cigar3 <- character(nreads)
> cigar3[] <- "40M"
> junctioncigars <- paste(
+ paste(10:30, "M", sep=""),
+ paste(sample(80:8000, njunctionreads, replace=TRUE), "N", sep=""),
+ paste(30:10, "M", sep=""), sep="")
> cigar3[sample(nreads, njunctionreads)] <- junctioncigars
> some_fake_rnames <- paste("chr", c(1:6, "X"), sep="")
> rname <- factor(sample(some_fake_rnames, nreads, replace=TRUE),
+ levels=some_fake_rnames)
> pos <- sample(80000000L, nreads, replace=TRUE)
>
> ## The following takes < 3 sec. to complete:
> system.time(irl1 <- extractAlignmentRangesOnReference(cigar3, pos=pos))
user system elapsed
2.164 0.056 2.222
>
> ## The following takes < 4 sec. to complete:
> system.time(irl2 <- extractAlignmentRangesOnReference(cigar3, pos=pos,
+ f=rname))
user system elapsed
1.384 0.040 1.424
>
> ## The sizes of the resulting objects are about 240M and 160M,
> ## respectively:
> object.size(irl1)
248004064 bytes
> object.size(irl2)
168012128 bytes
> #}
>
>
>
>
>
> dev.off()
null device
1
>