R: Map range coordinates between reads and genome space using...
mapToAlignments
R Documentation
Map range coordinates between reads and genome space using
CIGAR alignments
Description
Map range coordinates between reads (local) and genome (reference) space
using the CIGAR in a GAlignments object.
See ?mapToTranscripts in the
GenomicRanges package for mapping coordinates between features
in the transcriptome and genome space.
Usage
## S4 method for signature 'GenomicRanges,GAlignments'
mapToAlignments(x, alignments, ...)
## S4 method for signature 'GenomicRanges,GAlignments'
pmapToAlignments(x, alignments, ...)
## S4 method for signature 'GenomicRanges,GAlignments'
mapFromAlignments(x, alignments, ...)
## S4 method for signature 'GenomicRanges,GAlignments'
pmapFromAlignments(x, alignments, ...)
Arguments
x
GenomicRanges object of positions to be mapped.
x must have names when mapping to the genome.
alignments
A GAlignments object that represents the alignment of
x to the genome. The aligments object must have names. When
mapping to the genome names are used to determine mapping pairs and in the
reverse direction they are used as the seqlevels of the output object.
...
Arguments passed to other methods.
Details
These methods use a GAlignments object to represent the alignment
between the ranges in x and the output. The following CIGAR
operations in the "Extended CIGAR format" are used in the mapping
algorithm:
M, X, = Sequence match or mismatch
I Insertion to the reference
D Deletion from the reference
N Skipped region from the reference
S Soft clip on the read
H Hard clip on the read
P Silent deletion from the padded reference
mapToAlignments, pmapToAlignments
The CIGAR is used to map the genomic (reference) position x to
local coordinates. The mapped position starts at
start(x) - start(alignments) + 1
and is incremented or decremented as the algorithm walks the length of
the CIGAR. A successful mapping in this direction requires that
x fall within alignments.
The seqlevels of the return object are taken from the
alignments object and will be a name descriptive of the read
or aligned region. In this direction, mapping is attempted between all
elements of x and all elements of alignments.
mapFromAlignments, pmapFromAlignments
The CIGAR is used to map the local position x to genomic
(reference) coordinates. The mapped position starts at
start(x) + start(alignments) - 1
and is incremented or decremented as the algorithm walks the length of
the CIGAR. A successful mapping in this direction requires that the
width of alignments is <= the width of x.
When mapping to the genome, name matching is used to determine the
mapping pairs (vs attempting to match all possible pairs). Ranges in
x are only mapped to ranges in alignments with the
same name. Name matching is motivated by use cases such as
differentially expressed regions where the expressed regions in
x would only be related to a subset of regions in
alignments, which may contains gene or transcript ranges.
element-wise versions
pmapToAlignments and pmapFromAlignments are element-wise
(aka 'parallel') versions of mapToAlignments and
mapFromAlignments. The i-th range in x is mapped to the
i-th range in alignments; x and alignments must
have the same length.
Ranges in x that do not map (out of bounds) are returned as
zero-width ranges starting at 0. These ranges are given the special
seqname of "UNMAPPED". Note the non-parallel methods do not return
unmapped ranges so the "UNMAPPED" seqname is unique to
pmapToAlignments and pmapFromAlignments.
strand
By SAM convention, the CIGAR string is reported for mapped reads on the
forward genomic strand. There is no need to consider strand in these
methods. The output of these methods will always be unstranded
(i.e., "*").
Value
An object the same class as x.
Parallel methods return an object the same shape as x. Ranges that
cannot be mapped (out of bounds) are returned as zero-width ranges starting
at 0 with a seqname of "UNMAPPED".
Non-parallel methods return an object that varies in length similar to a
Hits object. The result only contains mapped records, out of bound ranges
are not returned. xHits and alignmentsHits metadata columns
indicate the elements of x and alignments used in the mapping.
When present, names from x are propagated to the output. When
mapping locally, the seqlevels of the output are the names on the
alignment object. When mapping globally, the output seqlevels are
the seqlevels of alignment which are usually chromosome names.
Author(s)
V. Obenchain, M. Lawrence and H. Pag<c3><83><c2><a8>s
See Also
?mapToTranscripts in the
in the GenomicFeatures package for methods mapping between
transcriptome and genome space.
## ---------------------------------------------------------------------
## A. Basic use
## ---------------------------------------------------------------------
## 1. Map to local space with mapToAlignments()
## ---------------------------------------------------------------------
## Mapping to local coordinates requires 'x' to be within 'alignments'.
## In this 'x', the second range is too long and can't be mapped.
alignments <- GAlignments("chr1", 10L, "11M", strand("*"), names="read_A")
x <- GRanges("chr1", IRanges(c(12, 12), width=c(6, 20)))
mapToAlignments(x, alignments)
## The element-wise version of the function returns unmapped ranges
## as zero-width ranges with a seqlevel of "UNMAPPED":
pmapToAlignments(x, c(alignments, alignments))
## Mapping the same range through different alignments demonstrates
## how the CIGAR operations affect the outcome.
ops <- c("no-op", "junction", "insertion", "deletion")
x <- GRanges(rep("chr1", 4), IRanges(rep(12, 4), width=rep(6, 4), names=ops))
alignments <- GAlignments(rep("chr1", 4), rep(10L, 4),
cigar = c("11M", "5M2N4M", "5M2I4M", "5M2D4M"),
strand = strand(rep("*", 4)),
names = paste0("region_", 1:4))
pmapToAlignments(x, alignments)
## 2. Map to genome space with mapFromAlignments()
## ---------------------------------------------------------------------
## One of the criteria when mapping to genomic coordinates is that the
## shifted 'x' range falls within 'alignments'. Here the first 'x'
## range has a shifted start value of 14 (5 + 10 - 1 = 14) with a width of
## 2 and so is successfully mapped. The second has a shifted start of 29
## (20 + 10 - 1 = 29) which is outside the range of 'alignments'.
x <- GRanges("chr1", IRanges(c(5, 20), width=2, names=rep("region_A", 2)))
alignments <- GAlignments("chr1", 10L, "11M", strand("*"), names="region_A")
mapFromAlignments(x, alignments)
## Another characteristic of mapping this direction is the name matching
## used to determine pairs. Mapping is only attempted between ranges in 'x'
## and 'alignments' with the same name. If we change the name of the first 'x'
## range, only the second will be mapped to 'alignment'. We know the second
## range fails to map so we get an empty result.
names(x) <- c("region_B", "region_A")
mapFromAlignments(x, alignments)
## CIGAR operations: insertions reduce the width of the output while
## junctions and deletions increase it.
ops <- c("no-op", "junction", "insertion", "deletion")
x <- GRanges(rep("chr1", 4), IRanges(rep(3, 4), width=rep(5, 4), names=ops))
alignments <- GAlignments(rep("chr1", 4), rep(10L, 4),
cigar = c("11M", "5M2N4M", "5M2I4M", "5M2D4M"),
strand = strand(rep("*", 4)))
pmapFromAlignments(x, alignments)
## ---------------------------------------------------------------------
## B. TATA box motif: mapping from read -> genome -> transcript
## ---------------------------------------------------------------------
## The TATA box motif is a conserved DNA sequence in the core promoter
## region. Many eukaryotic genes have a TATA box located approximately
## 25-35 base pairs upstream of the transcription start site. The motif is
## the binding site of general transcription factors or histones and
## plays a key role in transcription.
## In this example, the position of the TATA box motif (if present) is
## located in the DNA sequence corresponding to read ranges. The local
## motif positions are mapped to genome coordinates and then mapped
## to gene features such as promoters regions.
## Load reads from chromosome 4 of D. melanogaster (dm3):
library(pasillaBamSubset)
fl <- untreated1_chr4()
gal <- readGAlignments(fl)
## Extract DNA sequences corresponding to the read ranges:
library(GenomicFeatures)
library(BSgenome.Dmelanogaster.UCSC.dm3)
dna <- extractTranscriptSeqs(BSgenome.Dmelanogaster.UCSC.dm3, grglist(gal))
## Search for the consensus motif TATAAA in the sequences:
box <- vmatchPattern("TATAAA", dna)
## Some sequences had more than one match:
table(elementNROWS(box))
## The element-wise function we'll use for mapping to genome coordinates
## requires the two input argument to have the same length. We need to
## replicate the read ranges to match the number of motifs found.
## Expand the read ranges to match motifs found:
motif <- elementNROWS(box) != 0
alignments <- rep(gal[motif], elementNROWS(box)[motif])
## We make the IRanges into a GRanges object so the seqlevels can
## propagate to the output. Seqlevels are needed in the last mapping step.
readCoords <- GRanges(seqnames(alignments), unlist(box, use.names=FALSE))
## Map the local position of the motif to genome coordinates:
genomeCoords <- pmapFromAlignments(readCoords, alignments)
genomeCoords
## We are interested in the location of the TATA box motifs in the
## promoter regions. To perform the mapping we need the promoter ranges
## as a GRanges or GRangesList.
## Extract promoter regions 50 bp upstream from the transcription start site:
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
promoters <- promoters(txdb, upstream=50, downstream=0)
## Map the genome coordinates to the promoters:
names(promoters) <- mcols(promoters)$tx_name ## must be named
mapToTranscripts(genomeCoords, promoters)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(GenomicAlignments)
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: Biostrings
Loading required package: XVector
Loading required package: Rsamtools
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/GenomicAlignments/coordinate-mapping-methods.Rd_%03d_medium.png", width=480, height=480)
> ### Name: mapToAlignments
> ### Title: Map range coordinates between reads and genome space using CIGAR
> ### alignments
> ### Aliases: coordinate-mapping-methods mapToAlignments
> ### mapToAlignments,Ranges,GAlignments-method
> ### mapToAlignments,GenomicRanges,GAlignments-method pmapToAlignments
> ### pmapToAlignments,Ranges,GAlignments-method
> ### pmapToAlignments,GenomicRanges,GAlignments-method mapFromAlignments
> ### mapFromAlignments,Ranges,GAlignments-method
> ### mapFromAlignments,GenomicRanges,GAlignments-method pmapFromAlignments
> ### pmapFromAlignments,Ranges,GAlignments-method
> ### pmapFromAlignments,GenomicRanges,GAlignments-method
> ### Keywords: methods utilities
>
> ### ** Examples
>
> ## ---------------------------------------------------------------------
> ## A. Basic use
> ## ---------------------------------------------------------------------
>
> ## 1. Map to local space with mapToAlignments()
> ## ---------------------------------------------------------------------
>
> ## Mapping to local coordinates requires 'x' to be within 'alignments'.
> ## In this 'x', the second range is too long and can't be mapped.
> alignments <- GAlignments("chr1", 10L, "11M", strand("*"), names="read_A")
> x <- GRanges("chr1", IRanges(c(12, 12), width=c(6, 20)))
> mapToAlignments(x, alignments)
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | xHits alignmentsHits
<Rle> <IRanges> <Rle> | <integer> <integer>
[1] read_A [3, 8] * | 1 1
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
> ## The element-wise version of the function returns unmapped ranges
> ## as zero-width ranges with a seqlevel of "UNMAPPED":
> pmapToAlignments(x, c(alignments, alignments))
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] read_A [3, 8] *
[2] UNMAPPED [0, -1] *
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
>
> ## Mapping the same range through different alignments demonstrates
> ## how the CIGAR operations affect the outcome.
> ops <- c("no-op", "junction", "insertion", "deletion")
> x <- GRanges(rep("chr1", 4), IRanges(rep(12, 4), width=rep(6, 4), names=ops))
> alignments <- GAlignments(rep("chr1", 4), rep(10L, 4),
+ cigar = c("11M", "5M2N4M", "5M2I4M", "5M2D4M"),
+ strand = strand(rep("*", 4)),
+ names = paste0("region_", 1:4))
> pmapToAlignments(x, alignments)
GRanges object with 4 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
no-op region_1 [3, 8] *
junction region_2 [3, 6] *
insertion region_3 [3, 10] *
deletion region_4 [3, 6] *
-------
seqinfo: 4 sequences from an unspecified genome; no seqlengths
>
> ## 2. Map to genome space with mapFromAlignments()
> ## ---------------------------------------------------------------------
>
> ## One of the criteria when mapping to genomic coordinates is that the
> ## shifted 'x' range falls within 'alignments'. Here the first 'x'
> ## range has a shifted start value of 14 (5 + 10 - 1 = 14) with a width of
> ## 2 and so is successfully mapped. The second has a shifted start of 29
> ## (20 + 10 - 1 = 29) which is outside the range of 'alignments'.
> x <- GRanges("chr1", IRanges(c(5, 20), width=2, names=rep("region_A", 2)))
> alignments <- GAlignments("chr1", 10L, "11M", strand("*"), names="region_A")
> mapFromAlignments(x, alignments)
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | xHits alignmentsHits
<Rle> <IRanges> <Rle> | <integer> <integer>
region_A chr1 [14, 15] * | 1 1
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
> ## Another characteristic of mapping this direction is the name matching
> ## used to determine pairs. Mapping is only attempted between ranges in 'x'
> ## and 'alignments' with the same name. If we change the name of the first 'x'
> ## range, only the second will be mapped to 'alignment'. We know the second
> ## range fails to map so we get an empty result.
> names(x) <- c("region_B", "region_A")
> mapFromAlignments(x, alignments)
GRanges object with 0 ranges and 2 metadata columns:
seqnames ranges strand | xHits alignmentsHits
<Rle> <IRanges> <Rle> | <integer> <integer>
-------
seqinfo: no sequences
>
> ## CIGAR operations: insertions reduce the width of the output while
> ## junctions and deletions increase it.
> ops <- c("no-op", "junction", "insertion", "deletion")
> x <- GRanges(rep("chr1", 4), IRanges(rep(3, 4), width=rep(5, 4), names=ops))
> alignments <- GAlignments(rep("chr1", 4), rep(10L, 4),
+ cigar = c("11M", "5M2N4M", "5M2I4M", "5M2D4M"),
+ strand = strand(rep("*", 4)))
> pmapFromAlignments(x, alignments)
GRanges object with 4 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
no-op chr1 [12, 16] *
junction chr1 [12, 18] *
insertion chr1 [12, 14] *
deletion chr1 [12, 18] *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
> ## ---------------------------------------------------------------------
> ## B. TATA box motif: mapping from read -> genome -> transcript
> ## ---------------------------------------------------------------------
>
> ## The TATA box motif is a conserved DNA sequence in the core promoter
> ## region. Many eukaryotic genes have a TATA box located approximately
> ## 25-35 base pairs upstream of the transcription start site. The motif is
> ## the binding site of general transcription factors or histones and
> ## plays a key role in transcription.
>
> ## In this example, the position of the TATA box motif (if present) is
> ## located in the DNA sequence corresponding to read ranges. The local
> ## motif positions are mapped to genome coordinates and then mapped
> ## to gene features such as promoters regions.
>
> ## Load reads from chromosome 4 of D. melanogaster (dm3):
> library(pasillaBamSubset)
> fl <- untreated1_chr4()
> gal <- readGAlignments(fl)
>
> ## Extract DNA sequences corresponding to the read ranges:
> library(GenomicFeatures)
Loading required package: AnnotationDbi
> library(BSgenome.Dmelanogaster.UCSC.dm3)
Loading required package: BSgenome
Loading required package: rtracklayer
> dna <- extractTranscriptSeqs(BSgenome.Dmelanogaster.UCSC.dm3, grglist(gal))
>
> ## Search for the consensus motif TATAAA in the sequences:
> box <- vmatchPattern("TATAAA", dna)
>
> ## Some sequences had more than one match:
> table(elementNROWS(box))
0 1 2
197616 6508 231
>
> ## The element-wise function we'll use for mapping to genome coordinates
> ## requires the two input argument to have the same length. We need to
> ## replicate the read ranges to match the number of motifs found.
>
> ## Expand the read ranges to match motifs found:
> motif <- elementNROWS(box) != 0
> alignments <- rep(gal[motif], elementNROWS(box)[motif])
>
> ## We make the IRanges into a GRanges object so the seqlevels can
> ## propagate to the output. Seqlevels are needed in the last mapping step.
> readCoords <- GRanges(seqnames(alignments), unlist(box, use.names=FALSE))
>
> ## Map the local position of the motif to genome coordinates:
> genomeCoords <- pmapFromAlignments(readCoords, alignments)
> genomeCoords
GRanges object with 6970 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr4 [6378, 6383] *
[2] chr4 [6490, 6495] *
[3] chr4 [7177, 7182] *
[4] chr4 [7181, 7186] *
[5] chr4 [7205, 7210] *
... ... ... ...
[6966] chr4 [1347507, 1347512] *
[6967] chr4 [1347509, 1347514] *
[6968] chr4 [1347511, 1347516] *
[6969] chr4 [1347519, 1347524] *
[6970] chr4 [1347519, 1347524] *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
> ## We are interested in the location of the TATA box motifs in the
> ## promoter regions. To perform the mapping we need the promoter ranges
> ## as a GRanges or GRangesList.
>
> ## Extract promoter regions 50 bp upstream from the transcription start site:
> library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
> txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
> promoters <- promoters(txdb, upstream=50, downstream=0)
>
> ## Map the genome coordinates to the promoters:
> names(promoters) <- mcols(promoters)$tx_name ## must be named
> mapToTranscripts(genomeCoords, promoters)
GRanges object with 4 ranges and 2 metadata columns:
seqnames ranges strand | xHits transcriptsHits
<Rle> <IRanges> <Rle> | <integer> <integer>
[1] FBtr0100600 [40, 45] - | 4429 23869
[2] FBtr0100600 [40, 45] - | 4430 23869
[3] FBtr0100600 [40, 45] - | 4431 23869
[4] FBtr0333702 [14, 19] + | 6417 23759
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> dev.off()
null device
1
>