Last data update: 2014.03.03

R: Map range coordinates between reads and genome space using...
mapToAlignmentsR Documentation

Map range coordinates between reads and genome space using CIGAR alignments

Description

Map range coordinates between reads (local) and genome (reference) space using the CIGAR in a GAlignments object.

See ?mapToTranscripts in the GenomicRanges package for mapping coordinates between features in the transcriptome and genome space.

Usage

## S4 method for signature 'GenomicRanges,GAlignments'
mapToAlignments(x, alignments, ...) 
## S4 method for signature 'GenomicRanges,GAlignments'
pmapToAlignments(x, alignments, ...) 

## S4 method for signature 'GenomicRanges,GAlignments'
mapFromAlignments(x, alignments, ...) 
## S4 method for signature 'GenomicRanges,GAlignments'
pmapFromAlignments(x, alignments, ...) 

Arguments

x

GenomicRanges object of positions to be mapped. x must have names when mapping to the genome.

alignments

A GAlignments object that represents the alignment of x to the genome. The aligments object must have names. When mapping to the genome names are used to determine mapping pairs and in the reverse direction they are used as the seqlevels of the output object.

...

Arguments passed to other methods.

Details

These methods use a GAlignments object to represent the alignment between the ranges in x and the output. The following CIGAR operations in the "Extended CIGAR format" are used in the mapping algorithm:

  • M, X, = Sequence match or mismatch

  • I Insertion to the reference

  • D Deletion from the reference

  • N Skipped region from the reference

  • S Soft clip on the read

  • H Hard clip on the read

  • P Silent deletion from the padded reference

  • mapToAlignments, pmapToAlignments The CIGAR is used to map the genomic (reference) position x to local coordinates. The mapped position starts at

          start(x) - start(alignments) + 1
          

    and is incremented or decremented as the algorithm walks the length of the CIGAR. A successful mapping in this direction requires that x fall within alignments.

    The seqlevels of the return object are taken from the alignments object and will be a name descriptive of the read or aligned region. In this direction, mapping is attempted between all elements of x and all elements of alignments.

  • mapFromAlignments, pmapFromAlignments The CIGAR is used to map the local position x to genomic (reference) coordinates. The mapped position starts at

          start(x) + start(alignments) - 1
          

    and is incremented or decremented as the algorithm walks the length of the CIGAR. A successful mapping in this direction requires that the width of alignments is <= the width of x.

    When mapping to the genome, name matching is used to determine the mapping pairs (vs attempting to match all possible pairs). Ranges in x are only mapped to ranges in alignments with the same name. Name matching is motivated by use cases such as differentially expressed regions where the expressed regions in x would only be related to a subset of regions in alignments, which may contains gene or transcript ranges.

  • element-wise versions pmapToAlignments and pmapFromAlignments are element-wise (aka 'parallel') versions of mapToAlignments and mapFromAlignments. The i-th range in x is mapped to the i-th range in alignments; x and alignments must have the same length.

    Ranges in x that do not map (out of bounds) are returned as zero-width ranges starting at 0. These ranges are given the special seqname of "UNMAPPED". Note the non-parallel methods do not return unmapped ranges so the "UNMAPPED" seqname is unique to pmapToAlignments and pmapFromAlignments.

  • strand By SAM convention, the CIGAR string is reported for mapped reads on the forward genomic strand. There is no need to consider strand in these methods. The output of these methods will always be unstranded (i.e., "*").

Value

An object the same class as x.

Parallel methods return an object the same shape as x. Ranges that cannot be mapped (out of bounds) are returned as zero-width ranges starting at 0 with a seqname of "UNMAPPED".

Non-parallel methods return an object that varies in length similar to a Hits object. The result only contains mapped records, out of bound ranges are not returned. xHits and alignmentsHits metadata columns indicate the elements of x and alignments used in the mapping.

When present, names from x are propagated to the output. When mapping locally, the seqlevels of the output are the names on the alignment object. When mapping globally, the output seqlevels are the seqlevels of alignment which are usually chromosome names.

Author(s)

V. Obenchain, M. Lawrence and H. Pag<c3><83><c2><a8>s

See Also

  • ?mapToTranscripts in the in the GenomicFeatures package for methods mapping between transcriptome and genome space.

  • http://samtools.sourceforge.net/ for a description of the Extended CIGAR format.

Examples

## ---------------------------------------------------------------------
## A. Basic use 
## ---------------------------------------------------------------------

## 1. Map to local space with mapToAlignments()
## ---------------------------------------------------------------------

## Mapping to local coordinates requires 'x' to be within 'alignments'.
## In this 'x', the second range is too long and can't be mapped.
alignments <- GAlignments("chr1", 10L, "11M", strand("*"), names="read_A")
x <- GRanges("chr1", IRanges(c(12, 12), width=c(6, 20)))
mapToAlignments(x, alignments)

## The element-wise version of the function returns unmapped ranges
## as zero-width ranges with a seqlevel of "UNMAPPED":
pmapToAlignments(x, c(alignments, alignments))

## Mapping the same range through different alignments demonstrates 
## how the CIGAR operations affect the outcome.
ops <- c("no-op", "junction", "insertion", "deletion")
x <- GRanges(rep("chr1", 4), IRanges(rep(12, 4), width=rep(6, 4), names=ops)) 
alignments <- GAlignments(rep("chr1", 4), rep(10L, 4), 
                          cigar = c("11M", "5M2N4M", "5M2I4M", "5M2D4M"),
                          strand = strand(rep("*", 4)),
                          names = paste0("region_", 1:4))
pmapToAlignments(x, alignments)

## 2. Map to genome space with mapFromAlignments()
## ---------------------------------------------------------------------

## One of the criteria when mapping to genomic coordinates is that the
## shifted 'x' range falls within 'alignments'. Here the first 'x' 
## range has a shifted start value of 14 (5 + 10 - 1 = 14) with a width of 
## 2 and so is successfully mapped. The second has a shifted start of 29
## (20 + 10 - 1 = 29) which is outside the range of 'alignments'.
x <- GRanges("chr1", IRanges(c(5, 20), width=2, names=rep("region_A", 2)))
alignments <- GAlignments("chr1", 10L, "11M", strand("*"), names="region_A")
mapFromAlignments(x, alignments)

## Another characteristic of mapping this direction is the name matching
## used to determine pairs. Mapping is only attempted between ranges in 'x' 
## and 'alignments' with the same name. If we change the name of the first 'x' 
## range, only the second will be mapped to 'alignment'. We know the second
## range fails to map so we get an empty result.
names(x) <- c("region_B", "region_A")
mapFromAlignments(x, alignments)

## CIGAR operations: insertions reduce the width of the output while
## junctions and deletions increase it.
ops <- c("no-op", "junction", "insertion", "deletion")
x <- GRanges(rep("chr1", 4), IRanges(rep(3, 4), width=rep(5, 4), names=ops)) 
alignments <- GAlignments(rep("chr1", 4), rep(10L, 4), 
                         cigar = c("11M", "5M2N4M", "5M2I4M", "5M2D4M"),
                         strand = strand(rep("*", 4)))
pmapFromAlignments(x, alignments)

## ---------------------------------------------------------------------
## B. TATA box motif: mapping from read -> genome -> transcript
## ---------------------------------------------------------------------

## The TATA box motif is a conserved DNA sequence in the core promoter
## region. Many eukaryotic genes have a TATA box located approximately
## 25-35 base pairs upstream of the transcription start site. The motif is 
## the binding site of general transcription factors or histones and
## plays a key role in transcription.

## In this example, the position of the TATA box motif (if present) is 
## located in the DNA sequence corresponding to read ranges. The local 
## motif positions are mapped to genome coordinates and then mapped
## to gene features such as promoters regions.

## Load reads from chromosome 4 of D. melanogaster (dm3):
library(pasillaBamSubset)
fl <- untreated1_chr4()
gal <- readGAlignments(fl)

## Extract DNA sequences corresponding to the read ranges:
library(GenomicFeatures)
library(BSgenome.Dmelanogaster.UCSC.dm3)
dna <- extractTranscriptSeqs(BSgenome.Dmelanogaster.UCSC.dm3, grglist(gal))

## Search for the consensus motif TATAAA in the sequences:
box <- vmatchPattern("TATAAA", dna)

## Some sequences had more than one match:
table(elementNROWS(box))

## The element-wise function we'll use for mapping to genome coordinates
## requires the two input argument to have the same length. We need to
## replicate the read ranges to match the number of motifs found.

## Expand the read ranges to match motifs found:
motif <- elementNROWS(box) != 0
alignments <- rep(gal[motif], elementNROWS(box)[motif])

## We make the IRanges into a GRanges object so the seqlevels can
## propagate to the output. Seqlevels are needed in the last mapping step.
readCoords <- GRanges(seqnames(alignments), unlist(box, use.names=FALSE))

## Map the local position of the motif to genome coordinates:
genomeCoords <- pmapFromAlignments(readCoords, alignments) 
genomeCoords

## We are interested in the location of the TATA box motifs in the
## promoter regions. To perform the mapping we need the promoter ranges 
## as a GRanges or GRangesList.

## Extract promoter regions 50 bp upstream from the transcription start site:
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
promoters <- promoters(txdb, upstream=50, downstream=0)

## Map the genome coordinates to the promoters:
names(promoters) <- mcols(promoters)$tx_name  ## must be named 
mapToTranscripts(genomeCoords, promoters) 









Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(GenomicAlignments)
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: Biostrings
Loading required package: XVector
Loading required package: Rsamtools
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/GenomicAlignments/coordinate-mapping-methods.Rd_%03d_medium.png", width=480, height=480)
> ### Name: mapToAlignments
> ### Title: Map range coordinates between reads and genome space using CIGAR
> ###   alignments
> ### Aliases: coordinate-mapping-methods mapToAlignments
> ###   mapToAlignments,Ranges,GAlignments-method
> ###   mapToAlignments,GenomicRanges,GAlignments-method pmapToAlignments
> ###   pmapToAlignments,Ranges,GAlignments-method
> ###   pmapToAlignments,GenomicRanges,GAlignments-method mapFromAlignments
> ###   mapFromAlignments,Ranges,GAlignments-method
> ###   mapFromAlignments,GenomicRanges,GAlignments-method pmapFromAlignments
> ###   pmapFromAlignments,Ranges,GAlignments-method
> ###   pmapFromAlignments,GenomicRanges,GAlignments-method
> ### Keywords: methods utilities
> 
> ### ** Examples
> 
> ## ---------------------------------------------------------------------
> ## A. Basic use 
> ## ---------------------------------------------------------------------
> 
> ## 1. Map to local space with mapToAlignments()
> ## ---------------------------------------------------------------------
> 
> ## Mapping to local coordinates requires 'x' to be within 'alignments'.
> ## In this 'x', the second range is too long and can't be mapped.
> alignments <- GAlignments("chr1", 10L, "11M", strand("*"), names="read_A")
> x <- GRanges("chr1", IRanges(c(12, 12), width=c(6, 20)))
> mapToAlignments(x, alignments)
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |     xHits alignmentsHits
         <Rle> <IRanges>  <Rle> | <integer>      <integer>
  [1]   read_A    [3, 8]      * |         1              1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 
> ## The element-wise version of the function returns unmapped ranges
> ## as zero-width ranges with a seqlevel of "UNMAPPED":
> pmapToAlignments(x, c(alignments, alignments))
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]   read_A   [3,  8]      *
  [2] UNMAPPED   [0, -1]      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
> 
> ## Mapping the same range through different alignments demonstrates 
> ## how the CIGAR operations affect the outcome.
> ops <- c("no-op", "junction", "insertion", "deletion")
> x <- GRanges(rep("chr1", 4), IRanges(rep(12, 4), width=rep(6, 4), names=ops)) 
> alignments <- GAlignments(rep("chr1", 4), rep(10L, 4), 
+                           cigar = c("11M", "5M2N4M", "5M2I4M", "5M2D4M"),
+                           strand = strand(rep("*", 4)),
+                           names = paste0("region_", 1:4))
> pmapToAlignments(x, alignments)
GRanges object with 4 ranges and 0 metadata columns:
            seqnames    ranges strand
               <Rle> <IRanges>  <Rle>
      no-op region_1   [3,  8]      *
   junction region_2   [3,  6]      *
  insertion region_3   [3, 10]      *
   deletion region_4   [3,  6]      *
  -------
  seqinfo: 4 sequences from an unspecified genome; no seqlengths
> 
> ## 2. Map to genome space with mapFromAlignments()
> ## ---------------------------------------------------------------------
> 
> ## One of the criteria when mapping to genomic coordinates is that the
> ## shifted 'x' range falls within 'alignments'. Here the first 'x' 
> ## range has a shifted start value of 14 (5 + 10 - 1 = 14) with a width of 
> ## 2 and so is successfully mapped. The second has a shifted start of 29
> ## (20 + 10 - 1 = 29) which is outside the range of 'alignments'.
> x <- GRanges("chr1", IRanges(c(5, 20), width=2, names=rep("region_A", 2)))
> alignments <- GAlignments("chr1", 10L, "11M", strand("*"), names="region_A")
> mapFromAlignments(x, alignments)
GRanges object with 1 range and 2 metadata columns:
           seqnames    ranges strand |     xHits alignmentsHits
              <Rle> <IRanges>  <Rle> | <integer>      <integer>
  region_A     chr1  [14, 15]      * |         1              1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 
> ## Another characteristic of mapping this direction is the name matching
> ## used to determine pairs. Mapping is only attempted between ranges in 'x' 
> ## and 'alignments' with the same name. If we change the name of the first 'x' 
> ## range, only the second will be mapped to 'alignment'. We know the second
> ## range fails to map so we get an empty result.
> names(x) <- c("region_B", "region_A")
> mapFromAlignments(x, alignments)
GRanges object with 0 ranges and 2 metadata columns:
   seqnames    ranges strand |     xHits alignmentsHits
      <Rle> <IRanges>  <Rle> | <integer>      <integer>
  -------
  seqinfo: no sequences
> 
> ## CIGAR operations: insertions reduce the width of the output while
> ## junctions and deletions increase it.
> ops <- c("no-op", "junction", "insertion", "deletion")
> x <- GRanges(rep("chr1", 4), IRanges(rep(3, 4), width=rep(5, 4), names=ops)) 
> alignments <- GAlignments(rep("chr1", 4), rep(10L, 4), 
+                          cigar = c("11M", "5M2N4M", "5M2I4M", "5M2D4M"),
+                          strand = strand(rep("*", 4)))
> pmapFromAlignments(x, alignments)
GRanges object with 4 ranges and 0 metadata columns:
            seqnames    ranges strand
               <Rle> <IRanges>  <Rle>
      no-op     chr1  [12, 16]      *
   junction     chr1  [12, 18]      *
  insertion     chr1  [12, 14]      *
   deletion     chr1  [12, 18]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 
> ## ---------------------------------------------------------------------
> ## B. TATA box motif: mapping from read -> genome -> transcript
> ## ---------------------------------------------------------------------
> 
> ## The TATA box motif is a conserved DNA sequence in the core promoter
> ## region. Many eukaryotic genes have a TATA box located approximately
> ## 25-35 base pairs upstream of the transcription start site. The motif is 
> ## the binding site of general transcription factors or histones and
> ## plays a key role in transcription.
> 
> ## In this example, the position of the TATA box motif (if present) is 
> ## located in the DNA sequence corresponding to read ranges. The local 
> ## motif positions are mapped to genome coordinates and then mapped
> ## to gene features such as promoters regions.
> 
> ## Load reads from chromosome 4 of D. melanogaster (dm3):
> library(pasillaBamSubset)
> fl <- untreated1_chr4()
> gal <- readGAlignments(fl)
> 
> ## Extract DNA sequences corresponding to the read ranges:
> library(GenomicFeatures)
Loading required package: AnnotationDbi
> library(BSgenome.Dmelanogaster.UCSC.dm3)
Loading required package: BSgenome
Loading required package: rtracklayer
> dna <- extractTranscriptSeqs(BSgenome.Dmelanogaster.UCSC.dm3, grglist(gal))
> 
> ## Search for the consensus motif TATAAA in the sequences:
> box <- vmatchPattern("TATAAA", dna)
> 
> ## Some sequences had more than one match:
> table(elementNROWS(box))

     0      1      2 
197616   6508    231 
> 
> ## The element-wise function we'll use for mapping to genome coordinates
> ## requires the two input argument to have the same length. We need to
> ## replicate the read ranges to match the number of motifs found.
> 
> ## Expand the read ranges to match motifs found:
> motif <- elementNROWS(box) != 0
> alignments <- rep(gal[motif], elementNROWS(box)[motif])
> 
> ## We make the IRanges into a GRanges object so the seqlevels can
> ## propagate to the output. Seqlevels are needed in the last mapping step.
> readCoords <- GRanges(seqnames(alignments), unlist(box, use.names=FALSE))
> 
> ## Map the local position of the motif to genome coordinates:
> genomeCoords <- pmapFromAlignments(readCoords, alignments) 
> genomeCoords
GRanges object with 6970 ranges and 0 metadata columns:
         seqnames             ranges strand
            <Rle>          <IRanges>  <Rle>
     [1]     chr4       [6378, 6383]      *
     [2]     chr4       [6490, 6495]      *
     [3]     chr4       [7177, 7182]      *
     [4]     chr4       [7181, 7186]      *
     [5]     chr4       [7205, 7210]      *
     ...      ...                ...    ...
  [6966]     chr4 [1347507, 1347512]      *
  [6967]     chr4 [1347509, 1347514]      *
  [6968]     chr4 [1347511, 1347516]      *
  [6969]     chr4 [1347519, 1347524]      *
  [6970]     chr4 [1347519, 1347524]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 
> ## We are interested in the location of the TATA box motifs in the
> ## promoter regions. To perform the mapping we need the promoter ranges 
> ## as a GRanges or GRangesList.
> 
> ## Extract promoter regions 50 bp upstream from the transcription start site:
> library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
> txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
> promoters <- promoters(txdb, upstream=50, downstream=0)
> 
> ## Map the genome coordinates to the promoters:
> names(promoters) <- mcols(promoters)$tx_name  ## must be named 
> mapToTranscripts(genomeCoords, promoters) 
GRanges object with 4 ranges and 2 metadata columns:
         seqnames    ranges strand |     xHits transcriptsHits
            <Rle> <IRanges>  <Rle> | <integer>       <integer>
  [1] FBtr0100600  [40, 45]      - |      4429           23869
  [2] FBtr0100600  [40, 45]      - |      4430           23869
  [3] FBtr0100600  [40, 45]      - |      4431           23869
  [4] FBtr0333702  [14, 19]      + |      6417           23759
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>