R: Use filters and output formats to calculate pile-up...
pileup
R Documentation
Use filters and output formats to calculate pile-up statistics for a
BAM file.
Description
pileup uses PileupParam and ScanBamParam objects
to calculate pileup statistics for a BAM file. The result is a
data.frame with columns summarizing counts of reads overlapping
each genomic position, optionally differentiated on nucleotide,
strand, and position within read.
When file is character(1), an optional character(1) of BAM
index file path; see scanBam.
...
Additional arguments, perhaps used by methods.
scanBamParam
An instance of ScanBamParam.
pileupParam
An instance of PileupParam.
max_depth
integer(1); maximum number of overlapping alignments
considered for each position in the pileup.
min_base_quality
integer(1); minimum ‘QUAL’ value for
each nucleotide in an alignment. Use phred2ASCIIOffset to
help translate numeric or character values to these offsets.
min_mapq
integer(1); minimum ‘MAPQ’ value for an
alignment to be included in pileup.
min_nucleotide_depth
integer(1); minimum count of each
nucleotide (independent of other nucleotides) at a given
position required for said nucleotide to appear in the result.
min_minor_allele_depth
integer(1); minimum count of all
nucleotides other than the major allele at a given position required
for a particular nucleotide to appear in the result.
distinguish_strands
logical(1); TRUE if result should
differentiate between strands.
distinguish_nucleotides
logical(1); TRUE if result
should differentiate between nucleotides.
ignore_query_Ns
logical(1); TRUE if non-determinate
nucleotides in alignments should be excluded from the pileup.
include_deletions
logical(1); TRUE to include deletions
in pileup.
include_insertions
logical(1); TRUE to include
insertions in pileup.
left_bins
numeric; all same sign; unique positions within a
read to delimit bins. Position within read is based on counting from
the 5' end regardless of strand. Minimum of two values are
required so at least one range can be formed. NULL (default)
indicates no binning. Use negative values to count from the opposite
end. Sorted order not required. Floating-point values are coerced to
integer.
If you want to delimit bins based on sequencing cycles to, e.g.,
discard later cycles, query_bins probably gives the
desired behavior.
query_bins
numeric; all same sign; unique positions within a
read to delimit bins. Position within a read is based on counting
from the 5' end when the read aligns to plus strand,
counting from the 3' end when read aligns to minus
strand. Minimum of two values are required so at least one range can
be formed. NULL (default) indicates no binning. Use negative
values to count from the opposite end. Sorted order not
required. Floating-point values are coerced to integer.
phred
Either a numeric() (coerced to integer()) PHRED score
(e.g., in the range (0, 41) for the ‘Illumina 1.8+’ scheme)
or character() of printable ASCII characters. When phred is
character(), it can be of length 1 with 1 or more characters, or of
any length where all elements have exactly 1 character.
scheme
Encoding scheme, used to translate numeric() PHRED
scores to their ASCII code. Ignored when phred is already
character().
cycle_bins
DEPRECATED. See left_bins for
identical behavior.
Details
NB: Use of pileup assumes familiarity with
ScanBamParam, and use of left_bins and
query_bins assumes familiarity with cut.
pileup visits each position in the BAM file, subject to
constraints implied by which and flag of
scanBamParam. For a given position, all reads overlapping the
position that are consistent with constraints dictated by flag
of scanBamParam and pileupParam are used for
counting. pileup also automatically excludes reads with flags
that indicate any of:
unmapped alignment (isUnmappedQuery)
secondary alignment (isSecondaryAlignment)
not passing quality controls (isNotPassingQualityControls)
PCR or optical duplicate (isDuplicate)
If no which argument is supplied to the ScanBamParam,
behavior reflects that of scanBam: the entire file is visited
and counted.
Use a yieldSize value when creating a BamFile
instance to manage memory consumption when using pileup with large BAM
files. There are some differences in how pileup behavior is
affected when the yieldSize value is set on the BAM file. See
more details below.
Many of the parameters of the pileupParam interact. A simple
illustration is ignore_query_Ns and
distinguish_nucleotides, as mentioned in the
ignore_query_Ns section.
Parameters for pileupParam belong to two categories: parameters
that affect only the filtering criteria (so-called
‘behavior-only’ policies), and parameters that affect
filtering behavior and the schema of the results
(‘behavior+structure’ policies).
distinguish_nucleotides and distinguish_strands when set
to TRUE each add a column (nucleotide and strand,
respectively) to the resulting data.frame. If both are TRUE,
each combination of nucleotide and strand at a given
position is counted separately. Setting only one to TRUE
behaves as expected; for example, if only nucleotide is set to
TRUE, each nucleotide at a given position is counted
separately, but the distinction of on which strand the nucleotide
appears is ignored.
ignore_query_Ns determines how ambiguous nucloetides are
treated. By default, ambiguous nucleotides (typically ‘N’ in
BAM files) in alignments are ignored. If ignore_query_Ns is
FALSE, ambiguous nucleotides are included in counts; further,
if ignore_query_Ns is FALSE and
distinguish_nucleotides is TRUE the ‘N’
nucleotide value appears in the nucleotide column when a base at a
given position is ambiguous.
By default, deletions with respect to the reference genome to which
the reads were aligned are included in the counts in a pileup. If
include_deletions is TRUE and
distinguish_nucleotides is TRUE, the nucleotide column
uses a ‘-’ character to indicate a deletion at a position.
The ‘=’ nucleotide code in the SEQ field (to mean
‘identical to reference genome’) is supported by pileup, such
that a match with the reference genome is counted separately in the
results if distinguish_nucleotides is TRUE.
CIGAR support: pileup handles the extended CIGAR format by only
heeding operations that contribute to counts (‘M’, ‘D’,
‘I’). If you are confused about how the different CIGAR
operations interact, see the SAM versions of the BAM files used for
pileup unit testing for a fairly comprehensive human-readable
treatment.
Deletions and clipping:
The extended CIGAR allows a number of operations conceptually
similar to a ‘deletion’ with respect to the reference
genome, but offer more specialized meanings than a simple
deletion. CIGAR ‘N’ operations (not to be confused with
‘N’ used for non-determinate bases in the SEQ field)
indicate a large skipped region, ‘S’ a soft clip, and
‘H’ a hard clip. ‘N’, ‘S’, and ‘H’
CIGAR operations are never counted: only counts of true deletions
(‘D’ in the CIGAR) can be included by setting
include_deletions to TRUE.
Soft clip operations contribute to the relative position of
nucleotides within a read, whereas hard clip and refskip
operations do not. For example, consider a sequence in a bam file
that starts at 11, with a CIGAR 2S1M and SEQ field
ttA. The cycle position of the A nucleotide will be
3, but the reported position will be 11. If using
left_bins or query_bins it might make sense to first
preprocess your files to remove soft clipping.
Insertions and padding:
pileup can include counts of insertion operations by
setting include_insertions to TRUE. Details about
insertions are effectively truncated such that each insertion is
reduced to a single ‘+’ nucleotide. Information about
e.g. the nucleotide code and base quality within the insertion is
not included.
Because ‘+’ is used as a nucleotide code to represent
insertions in pileup, counts of the ‘+’ nucleotide
participate in voting for min_nucleotide_depth and
min_minor_allele_depth functionality.
The position of an insertion is the position that precedes it on
the reference sequence. Note: this means if
include_insertions is TRUE a single read will
contribute two nucleotide codes (+, plus that of the
non-insertion base) at a given position if the non-insertion base
passes filter criteria.
‘P’ CIGAR (padding) operations never affect counts.
The “bin” arguments query_bins and left_bins
allow users to differentiate pileup counts based on arbitrary
non-overlapping regions within a read. pileup relies on
cut to derive bins, but limits input to numeric values
of the same sign (coerced to integers), including
+/-Inf. If a “bin” argument is set pileup
automatically excludes bases outside the implicit outer range. Here
are some important points regarding bin arguments:
query_bins vs. left_bins:
BAM files store sequence data from the 5' end to the 3' end
(regardless of to which strand the read aligns). That means for
reads aligned to the minus strand, cycle position within a read is
effectively reversed. Take care to use the appropriate bin
argument for your use case.
The most common use of a bin argument is to bin later cycles
separately from earlier cycles; this is because accuracy typically
degrades in later cycles. For this application, query_bins
yields the correct behavior because bin positions are relative to
cycle order (i.e., sensitive to strand).
left_bins (in contrast with query_bins) determines
bin positions from the 5' end, regardless of strand.
Positive or negative bin values can be used to delmit bins
based on the effective “start” or “end” of a
read. For left_bin the effective start is always the 5' end
(left-to-right as appear in the BAM file).
For query_bin the effective start is the first cycle of the
read as it came off the sequencer; that means the 5' end for reads
aligned to the plus strand and 3' for reads aligned to the minus
strand.
From effective start of reads: use positive values,
0, and (+)Inf. Because cut
produces ranges in the form (first,last], ‘0’ should be
used to create a bin that includes the first position. To
account for variable-length reads in BAM files, use
(+)Inf as the upper bound on a bin that extends to the
end of all reads.
From effective end of reads: use negative values
and -Inf. -1 denotes the last position in a
read. Because cut produces ranges in the form
(first,last], specify the lower bound of a bin by using one
less than the desired value, e.g., a bin that captures only
the second nucleotide from the last position:
query_bins=c(-3, -2). To account for variable-length
reads in BAM files, use -Inf as the lower bound on a
bin that extends to the beginning of all reads.
pileup obeys yieldSize on BamFile objects
with some differences from how scanBam responds to
yieldSize. Here are some points to keep in mind when using
pileup in conjunction with yieldSize:
BamFiles with a yieldSize value set, once
opened and used with pileup, should not be used with
other functions that accept a BamFile instance as
input. Create a new BamFile instance instead of trying to
reuse.
pileup only returns genomic positions for which all
input has been processed. pileup will hold in reserve
positions for which only partial data has been
processed. Positions held in reserve will be returned upon
subsequent calls to pileup when all the input for a given
position has been processed.
The correspondence between yieldSize and the number of rows
in the data.frame returned from pileup is not
1-to-1. yieldSize only limits the number of
alignments processed from the BAM file each time the file
is read. Only a handful of alignments can lead to many distinct
records in the result.
Like scanBam, pileup uses an empty return
object (a zero-row data.frame) to indicate end-of-input.
Sometimes reading yieldSize records from the BAM file
does not result in any completed positions. In order to avoid
returning a zero-row data.framepileup will
repeatedly process yieldSize additional records until at
least one position can be returned to the user.
Value
For pileup a data.frame with 1 row per unique
combination of differentiating column values that satisfied filter
criteria, with frequency (count) of unique combination. Columns
seqnames, pos, and count always appear; optional
strand, nulceotide, and left_bin /
query_bin columns appear as dictated by arguments to
PileupParam.
Column details:
seqnames: factor. SAM ‘RNAME’ field of
alignment.
pos: integer(1). Genomic position of base. Derived by
offset from SAM ‘POS’ field of alignment.
strand: factor. ‘strand’ to which read aligns.
nucleotide: factor. ‘nucleotide’ of base,
extracted from SAM ‘SEQ’ field of alignment.
left_bin / query_bin: factor. Bin in which base
appears.
count: integer(1). Frequency of combination of
differentiating fields, as indicated by values of other columns.
See scanBam for more detailed explanation of SAM fields.
If a which argument is specified for the scanBamParam, a
which_label column (factor in the form
‘rname:start-end’) is automatically included. The
which_label column is used to maintain grouping of results,
such that two queries of the same genomic region can be
differentiated.
Order of rows in data.frame is first by order of
seqnames column based on the BAM index file, then
non-decreasing order on columns pos, then nucleotide,
then strand, then left_bin / query_bin.
PileupParam returns an instance of PileupParam class.
phred2ASCIIOffset returns a named integer vector of the same
length or number of characters in phred. The names are the
ASCII encoding, and the values are the offsets to be used with
min_base_quality.
## The examples below apply equally to pileup queries for specific
## genomic ranges (specified by the 'which' parameter of 'ScanBamParam')
## and queries across entire files; the entire-file examples are
## included first to make them easy to find. The more detailed examples
## of pileup use queries with a 'which' argument.
library("RNAseqData.HNRNPC.bam.chr14")
fl <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
## no 'which' argument to ScanBamParam: process entire file at once
res <- pileup(fl)
head(res)
table(res$strand, res$nucleotide)
## no 'which' argument to ScanBamParam with 'yieldSize' set on BamFile
bf <- open(BamFile(fl, yieldSize=5e4))
repeat {
res <- pileup(bf)
message(nrow(res), " rows in result data.frame")
if(nrow(res) == 0L)
break
}
close(bf)
## pileup for the first half of chr14 with default Pileup parameters
## 'which' argument: process only specified genomic range(s)
sbp <- ScanBamParam(which=GRanges("chr14", IRanges(1, 53674770)))
res <- pileup(fl, scanBamParam=sbp)
head(res)
table(res$strand, res$nucleotide)
## specify pileup parameters: include ambiguious nucleotides
## (the 'N' level in the nucleotide column of the data.frame)
p_param <- PileupParam(ignore_query_Ns=FALSE)
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
head(res)
table(res$strand, res$nucleotide)
## Don't distinguish strand, require a minimum frequency of 7 for a
## nucleotide at a genomic position to be included in results.
p_param <- PileupParam(distinguish_strands=TRUE,
min_nucleotide_depth=7)
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
head(res)
table(res$nucleotide, res$strand)
## Any combination of the filtering criteria is possible: let's say we
## want a "coverage pileup" that only counts reads with mapping
## quality >= 13, and bases with quality >= 10 that appear at least 4
## times at each genomic position
p_param <- PileupParam(distinguish_nucleotides=FALSE,
distinguish_strands=FALSE,
min_mapq=13,
min_base_quality=10,
min_nucleotide_depth=4)
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
head(res)
res <- res[, c("pos", "count")] ## drop seqnames and which_label cols
plot(count ~ pos, res, pch=".")
## ASCII offsets to help specify min_base_quality, e.g., quality of at
## least 10 on the Illumina 1.3+ scale
phred2ASCIIOffset(10, "Illumina 1.3+")
## Well-supported polymorphic positions (minor allele present in at
## least 5 alignments) with high map quality
p_param <- PileupParam(min_minor_allele_depth=5,
min_mapq=40,
distinguish_strand=FALSE)
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
dim(res) ## reduced to our biologically interesting positions
head(xtabs(count ~ pos + nucleotide, res))
## query_bins
## basic use of positive bins: single pivot; count bases that appear in
## first 10 cycles of a read separately from the rest
p_param <- PileupParam(query_bins=c(0, 10, Inf))
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
## basic use of positive bins: simple range; only include bases in
## cycle positions 6-10 within read
p_param <- PileupParam(query_bins=c(5, 10))
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
## basic use of negative bins: single pivot; count bases that appear in
## last 3 cycle positions of a read separately from the rest.
p_param <- PileupParam(query_bins=c(-Inf, -4, -1))
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
## basic use of negative bins: drop nucleotides in last two cycle
## positions of each read
p_param <- PileupParam(query_bins=c(-Inf, -3))
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
## typical use: beginning, middle, and end segments; because of the
## nature of sequencing technology, it is common for bases in the
## beginning and end segments of each read to be less reliable. pileup
## makes it easy to count each segment separately.
## Assume determined ahead of time that for the data 1-7 makes sense for
## beginning, 8-12 middle, >=13 end (actual cycle positions should be
## tailored to data in actual BAM files).
p_param <- PileupParam(query_bins=c(0, 7, 12, Inf)) ## note non-symmetric
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
xt <- xtabs(count ~ nucleotide + query_bin, res)
print(xt)
t(t(xt) / colSums(xt)) ## cheap normalization for illustrative purposes
## 'implicit outer range': in contrast to c(0, 7, 12, Inf), suppose we
## still want to have beginning, middle, and end segements, but know
## that the first three cycles and any bases beyond the 16th cycle are
## irrelevant. Hence, the implicit outer range is (3,16]; all bases
## outside of that are dropped.
p_param <- PileupParam(query_bins=c(3, 7, 12, 16))
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
xt <- xtabs(count ~ nucleotide + query_bin, res)
print(xt)
t(t(xt) / colSums(xt))
## single-width bins: count each cycle within a read separately.
p_param <- PileupParam(query_bins=seq(1,20))
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
xt <- xtabs(count ~ nucleotide + query_bin, res)
print(xt[ , c(1:3, 18:19)]) ## fit on one screen
print(t(t(xt) / colSums(xt))[ , c(1:3, 18:19)])
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(Rsamtools)
Loading required package: GenomeInfoDb
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: Biostrings
Loading required package: XVector
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/Rsamtools/pileup.Rd_%03d_medium.png", width=480, height=480)
> ### Name: pileup
> ### Title: Use filters and output formats to calculate pile-up statistics
> ### for a BAM file.
> ### Aliases: PileupParam-class PileupParam show,PileupParam-method
> ### max_depth min_base_quality min_mapq min_nucleotide_depth
> ### min_minor_allele_depth distinguish_strands distinguish_nucleotides
> ### ignore_query_Ns include_deletions include_insertions left_bins
> ### query_bins cycle_bins pileup pileup,character-method
> ### pileup,BamFile-method phred2ASCIIOffset
>
> ### ** Examples
>
>
> ## The examples below apply equally to pileup queries for specific
> ## genomic ranges (specified by the 'which' parameter of 'ScanBamParam')
> ## and queries across entire files; the entire-file examples are
> ## included first to make them easy to find. The more detailed examples
> ## of pileup use queries with a 'which' argument.
>
> library("RNAseqData.HNRNPC.bam.chr14")
>
> fl <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
>
> ## No test:
> ## no 'which' argument to ScanBamParam: process entire file at once
> res <- pileup(fl)
> head(res)
[1] seqnames pos strand nucleotide count
<0 rows> (or 0-length row.names)
> table(res$strand, res$nucleotide)
A C G T N = - +
+ 0 0 0 0 0 0 0 0
- 0 0 0 0 0 0 0 0
* 0 0 0 0 0 0 0 0
> ## End(No test)
>
> ## No test:
> ## no 'which' argument to ScanBamParam with 'yieldSize' set on BamFile
> bf <- open(BamFile(fl, yieldSize=5e4))
> repeat {
+ res <- pileup(bf)
+ message(nrow(res), " rows in result data.frame")
+ if(nrow(res) == 0L)
+ break
+ }
0 rows in result data.frame
> close(bf)
> ## End(No test)
>
> ## pileup for the first half of chr14 with default Pileup parameters
> ## 'which' argument: process only specified genomic range(s)
> sbp <- ScanBamParam(which=GRanges("chr14", IRanges(1, 53674770)))
> res <- pileup(fl, scanBamParam=sbp)
> head(res)
[1] seqnames pos strand nucleotide count which_label
<0 rows> (or 0-length row.names)
> table(res$strand, res$nucleotide)
A C G T N = - +
+ 0 0 0 0 0 0 0 0
- 0 0 0 0 0 0 0 0
* 0 0 0 0 0 0 0 0
>
> ## specify pileup parameters: include ambiguious nucleotides
> ## (the 'N' level in the nucleotide column of the data.frame)
> p_param <- PileupParam(ignore_query_Ns=FALSE)
> res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
> head(res)
[1] seqnames pos strand nucleotide count which_label
<0 rows> (or 0-length row.names)
> table(res$strand, res$nucleotide)
A C G T N = - +
+ 0 0 0 0 0 0 0 0
- 0 0 0 0 0 0 0 0
* 0 0 0 0 0 0 0 0
>
> ## Don't distinguish strand, require a minimum frequency of 7 for a
> ## nucleotide at a genomic position to be included in results.
>
> p_param <- PileupParam(distinguish_strands=TRUE,
+ min_nucleotide_depth=7)
> res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
> head(res)
[1] seqnames pos strand nucleotide count which_label
<0 rows> (or 0-length row.names)
> table(res$nucleotide, res$strand)
+ - *
A 0 0 0
C 0 0 0
G 0 0 0
T 0 0 0
N 0 0 0
= 0 0 0
- 0 0 0
+ 0 0 0
>
> ## Any combination of the filtering criteria is possible: let's say we
> ## want a "coverage pileup" that only counts reads with mapping
> ## quality >= 13, and bases with quality >= 10 that appear at least 4
> ## times at each genomic position
> p_param <- PileupParam(distinguish_nucleotides=FALSE,
+ distinguish_strands=FALSE,
+ min_mapq=13,
+ min_base_quality=10,
+ min_nucleotide_depth=4)
> res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
> head(res)
seqnames pos count which_label
1 chr14 19681651 4 chr14:1-53674770
2 chr14 19681655 4 chr14:1-53674770
3 chr14 19681657 4 chr14:1-53674770
4 chr14 19681658 4 chr14:1-53674770
5 chr14 19681661 4 chr14:1-53674770
6 chr14 19681662 4 chr14:1-53674770
> res <- res[, c("pos", "count")] ## drop seqnames and which_label cols
> plot(count ~ pos, res, pch=".")
>
> ## ASCII offsets to help specify min_base_quality, e.g., quality of at
> ## least 10 on the Illumina 1.3+ scale
> phred2ASCIIOffset(10, "Illumina 1.3+")
J
41
>
> ## Well-supported polymorphic positions (minor allele present in at
> ## least 5 alignments) with high map quality
> p_param <- PileupParam(min_minor_allele_depth=5,
+ min_mapq=40,
+ distinguish_strand=FALSE)
> res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
> dim(res) ## reduced to our biologically interesting positions
[1] 0 5
> head(xtabs(count ~ pos + nucleotide, res))
< table of extent 0 x 8 >
>
> ## query_bins
>
> ## No test:
>
> ## basic use of positive bins: single pivot; count bases that appear in
> ## first 10 cycles of a read separately from the rest
> p_param <- PileupParam(query_bins=c(0, 10, Inf))
> res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
>
> ## basic use of positive bins: simple range; only include bases in
> ## cycle positions 6-10 within read
> p_param <- PileupParam(query_bins=c(5, 10))
> res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
>
> ## basic use of negative bins: single pivot; count bases that appear in
> ## last 3 cycle positions of a read separately from the rest.
> p_param <- PileupParam(query_bins=c(-Inf, -4, -1))
> res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
>
> ## basic use of negative bins: drop nucleotides in last two cycle
> ## positions of each read
> p_param <- PileupParam(query_bins=c(-Inf, -3))
> res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
> ## End(No test)
>
> ## typical use: beginning, middle, and end segments; because of the
> ## nature of sequencing technology, it is common for bases in the
> ## beginning and end segments of each read to be less reliable. pileup
> ## makes it easy to count each segment separately.
>
> ## Assume determined ahead of time that for the data 1-7 makes sense for
> ## beginning, 8-12 middle, >=13 end (actual cycle positions should be
> ## tailored to data in actual BAM files).
> p_param <- PileupParam(query_bins=c(0, 7, 12, Inf)) ## note non-symmetric
> res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
> xt <- xtabs(count ~ nucleotide + query_bin, res)
> print(xt)
query_bin
nucleotide (0,7] (7,12] (12,Inf]
A 0 0 0
C 0 0 0
G 0 0 0
T 0 0 0
N 0 0 0
= 0 0 0
- 0 0 0
+ 0 0 0
> t(t(xt) / colSums(xt)) ## cheap normalization for illustrative purposes
query_bin
nucleotide (0,7] (7,12] (12,Inf]
A
C
G
T
N
=
-
+
>
> ## 'implicit outer range': in contrast to c(0, 7, 12, Inf), suppose we
> ## still want to have beginning, middle, and end segements, but know
> ## that the first three cycles and any bases beyond the 16th cycle are
> ## irrelevant. Hence, the implicit outer range is (3,16]; all bases
> ## outside of that are dropped.
> p_param <- PileupParam(query_bins=c(3, 7, 12, 16))
> res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
> xt <- xtabs(count ~ nucleotide + query_bin, res)
> print(xt)
query_bin
nucleotide (3,7] (7,12] (12,16]
A 0 0 0
C 0 0 0
G 0 0 0
T 0 0 0
N 0 0 0
= 0 0 0
- 0 0 0
+ 0 0 0
> t(t(xt) / colSums(xt))
query_bin
nucleotide (3,7] (7,12] (12,16]
A
C
G
T
N
=
-
+
>
> ## No test:
> ## single-width bins: count each cycle within a read separately.
> p_param <- PileupParam(query_bins=seq(1,20))
> res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
> xt <- xtabs(count ~ nucleotide + query_bin, res)
> print(xt[ , c(1:3, 18:19)]) ## fit on one screen
query_bin
nucleotide (1,2] (2,3] (3,4] (18,19] (19,20]
A 0 0 0 0 0
C 0 0 0 0 0
G 0 0 0 0 0
T 0 0 0 0 0
N 0 0 0 0 0
= 0 0 0 0 0
- 0 0 0 0 0
+ 0 0 0 0 0
> print(t(t(xt) / colSums(xt))[ , c(1:3, 18:19)])
query_bin
nucleotide (1,2] (2,3] (3,4] (18,19] (19,20]
A
C
G
T
N
=
-
+
> ## End(No test)
>
>
>
>
>
> dev.off()
null device
1
>