R Graphical Manual

Browse All

Last data update: 2014.03.03

R: Map range coordinates between transcripts and genome space

mapToTranscripts

R Documentation

Map range coordinates between transcripts and genome space

Description

Map range coordinates between features in the transcriptome and genome (reference) space.

See ?mapToAlignments in the GenomicAlignments package for mapping coordinates between reads (local) and genome (reference) space using a CIGAR alignment.

Usage

## mapping to transcripts
## S4 method for signature 'GenomicRanges,GRangesList'
mapToTranscripts(x, transcripts, 
          ignore.strand = FALSE) 
## S4 method for signature 'ANY,TxDb'
mapToTranscripts(x, transcripts, ignore.strand = FALSE, 
          extractor.fun = GenomicFeatures::transcripts, ...) 
## S4 method for signature 'GenomicRanges,GRangesList'
pmapToTranscripts(x, transcripts, 
          ignore.strand = FALSE) 

## mapping from transcripts
## S4 method for signature 'GenomicRanges,GRangesList'
mapFromTranscripts(x, transcripts, 
          ignore.strand = FALSE) 
## S4 method for signature 'GenomicRanges,GRangesList'
pmapFromTranscripts(x, transcripts, 
          ignore.strand = FALSE)
## S4 method for signature 'Ranges,GRangesList'
pmapFromTranscripts(x, transcripts)

Arguments

`x`	`GenomicRanges` object of positions to be mapped. The seqnames of `x` are used in `mapFromTranscripts`, i.e., when mapping from transcripts to the genome. In the case of `pmapFromTranscripts`, `x` can be a `Ranges` object.
`transcripts`	A named `GenomicRanges` or `GRangesList` object used to map between `x` and the result. The ranges can be any feature in the transcriptome extracted from a `TxDb` (e.g., introns, exons, cds regions). See ?`transcripts` and ?`transcriptsBy` for a list of extractor functions. The `transcripts` object must have names. When mapping from transcripts to the genome, they are used to determine mapping pairs; in the reverse direction they become the seqlevels of the output object.
`ignore.strand`	When `ignore.strand` is TRUE, strand is ignored in overlaps operations (i.e., all strands are considered "+") and the strand in the output is '*'. When `ignore.strand` is FALSE strand in the output is taken from the `transcripts` argument. When `transcripts` is a `GRangesList`, all inner list elements of a common list element must have the same strand or an error is thrown. Mapped position is computed by counting from the transcription start site (TSS) and is not affected by the value of `ignore.strand`.
`extractor.fun`	Function to extract genomic features from a `TxDb` object. This argument is only applicable to `mapToTranscripts` when `transcripts` is a `TxDb` object. The `extractor` should be the name of a function (not a character()) described on the `?transcripts`, `?transcriptsBy`, or `?microRNAs` man page. Valid `extractor` functions: transcripts ## default exons cds genes promoters disjointExons transcriptsBy exonsBy cdsBy intronsByTranscript fiveUTRsByTranscript threeUTRsByTranscript microRNAs tRNAs
`...`	Additional arguments passed to `extractor.fun` functions.

Details

In GenomicFeatures >= 1.21.10, the default for ignore.strand was changed to FALSE for consistency with other methods in GenomicRanges and GenomicAlignments. Additionally, the mapped position is computed from the TSS and does not depend on the ignore.strand argument. See the section on ignore.strand for details.

mapToTranscripts, pmapToTranscripts The genomic range in x is mapped to the local position in the transcripts ranges. A successful mapping occurs when x is completely within the transcripts range, equivalent to:
```
      findOverlaps(..., type="within")
      
```
Transcriptome-based coordinates start counting at 1 at the beginning of the transcripts range and return positions where x was aligned. The seqlevels of the return object are taken from the transcripts object and should be transcript names. In this direction, mapping is attempted between all elements of x and all elements of transcripts.

mapToTranscripts uses findOverlaps to map ranges in x to ranges in transcripts. This method does not return unmapped ranges.

pmapToTranscripts maps the i-th range in x to the i-th range in transcripts. Recycling is supported for both x and transcripts when either is length == 1L; otherwise the lengths must match. Ranges in x that do not map (out of bounds or strand mismatch) are returned as zero-width ranges starting at 0. These ranges are given the seqname of "UNMAPPED".
mapFromTranscripts, pmapFromTranscripts The transcript-based position in x is mapped to genomic coordinates using the ranges in transcripts. A successful mapping occurs when the following is TRUE:
```
      width(transcripts) >= start(x) + width(x)
      
```
x is aligned to transcripts by moving in start(x) positions in from the beginning of the transcripts range. The seqlevels of the return object are chromosome names.

mapFromTranscripts uses the seqname of x and the names of transcripts to determine mapping pairs (vs attempting to match all possible pairs). Name matching is motivated by use cases such as differentially expressed regions where the expressed regions in x would only be related to a subset of regions in transcripts. This method does not return unmapped ranges.

pmapFromTranscripts maps the i-th range in x to the i-th range in transcripts and therefore does not use name matching. Recycling is supported in pmapFromTranscripts when either x or transcripts is length == 1L; otherwise the lengths must match. Ranges in x that do not map (out of bounds or strand mismatch) are returned as zero-width ranges starting at 0. These ranges are given the seqname of "UNMAPPED".

Value

pmapToTranscripts returns a GRanges the same length as x.

pmapFromTranscripts returns a GRanges when transcripts is a GRanges and a GRangesList when transcripts is a GRangesList. In both cases the return object is the same length as x. The rational for returning the GRangesList is to preserve exon structure; ranges in a list element that are not overlapped by x are returned as a zero-width range. The GRangesList return object will have no seqlevels called "UNMAPPED"; those will only occur when a GRanges is returned.

mapToTranscripts and mapFromTranscripts return GRanges objects that vary in length similar to a Hits object. The result contains mapped records only; strand mismatch and out of bound ranges are not returned. xHits and transcriptsHits metadata columns (similar to the queryHits and subjectHits of a Hits object) indicate elements of x and transcripts used in the mapping.

When mapping to transcript coordinates, seqlevels of the output are the names on the transcripts object and most often these will be transcript names. When mapping to the genome, seqlevels of the output are the seqlevels of transcripts which are usually chromosome names.

Author(s)

V. Obenchain, M. Lawrence and H. Pag<c3><83><c2><a8>s

Examples


## ---------------------------------------------------------------------
## A. Basic Use
## ---------------------------------------------------------------------

## ------------------------------------
## (i) Map from genome to transcript:

## The seqnames of the output are the transcript names, not chromosomes. For
## this reason 'transcripts' must be named. 
x <- GRanges("A", IRanges(16, 18))
gr1 <- GRanges("A", IRanges(1, 10, names="tx_a"))
gr2 <- GRanges("A", IRanges(15, 20, names="tx_b"))

## 'transcripts' as GRanges:
mapToTranscripts(x, gr2)

## 'transcripts' as GRangesList:
mapToTranscripts(x, GRangesList("tx_c" = c(gr1, gr2)))

## Round trip from genomic -> transcript -> genomic coordinates:
tx_coord <- mapToTranscripts(x, gr2)
mapFromTranscripts(tx_coord, gr2)

## ------------------------------------
## (ii) Map from transcript to genome:

## A prerequisite for mapping from transcript -> genome is that the seqname 
## of the range in 'x' match the name of the range in 'transcripts'. Here 
## the seqname of 'x' is "TX_1" and mapping is only attempted with the second 
## range in 'gr':
x <- GRanges("TX_1", IRanges(5, 10))
gr <- GRanges("chr3", IRanges(c(1, 1), width=50, names=c("TX_2", "TX_1")))
mapFromTranscripts(x, gr)

## ------------------------------------
## (iii) Element-wise versions:

## Recycling is supported when length(transcripts) == 1; otherwise the
## lengths of 'x' and 'transcripts' must be the same.
x <- GRanges("A", IRanges(c(1, 5, 10), width=1))
transcripts <- GRanges("A", IRanges(4, 7))
pmapToTranscripts(x, transcripts)


## ---------------------------------------------------------------------
## B. Map local sequence locations to the genome
## ---------------------------------------------------------------------

## NAGNAG alternative splicing plays an essential role in biological processes 
## and represents a highly adaptable system for posttranslational regulation 
## of gene function. The majority of NAGNAG studies largely focus on messenger 
## RNA. A study by Sun, Lin, and Yan 
## (http://www.hindawi.com/journals/bmri/2014/736798/) demonstrated that
## NAGNAG splicing is also operative in large intergenic noncoding RNA
## (lincRNA). 

## One finding of interest was that linc-POLR3G-10 exhibited two NAGNAG 
## acceptors located in two distinct transcripts: TCONS_00010012 and 
## TCONS_00010010. 

## Extract the exon coordinates of TCONS_00010012 and TCONS_00010010: 
lincrna <- c("TCONS_00010012", "TCONS_00010010")
library(TxDb.Hsapiens.UCSC.hg19.lincRNAsTranscripts)
txdb <- TxDb.Hsapiens.UCSC.hg19.lincRNAsTranscripts
exons <- exonsBy(txdb, by="tx", use.names=TRUE)[lincrna]
exons

## The two NAGNAG acceptors were identified in the upstream region of 
## the fourth and fifth exons located in TCONS_00010012.
## Extract the sequences for transcript TCONS_00010012:
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
exons_seq <- getSeq(genome, exons[[1]])

## TCONS_00010012 has 4 exons:
exons_seq

## The most common triplet among the lincRNA sequences was CAG. Identify
## the location of this pattern in all exons.
cag_loc <- vmatchPattern("CAG", exons_seq)

## Convert the first occurance of CAG in each exon back to genome coordinates.
first_loc <- do.call(c, sapply(cag_loc, "[", 1, simplify=TRUE))
pmapFromTranscripts(first_loc, exons[[1]])


## -----------------------------------------------------------------------
## C. Map 3'UTR variants to genome coordinates
## -----------------------------------------------------------------------

## A study by Skeeles et. al (PLoS ONE 8(3): e58609. doi:
## 10.1371/journal.pone.0058609) investigated the impact of 3'UTR variants 
## on the expression of cancer susceptibility genes.

## 8 candidate miRNA genes on chromosome 12 were used to test for 
## differential luciferase expression in mice. In Table 2 of the manuscript
## variant locations are given as nucleotide position within the gene.
geneNames <- c("Bcap29", "Dgkb", "Etv1", "Hbp1", "Hbp1", "Ifrd1", 
               "Ifrd1", "Pik3cg", "Pik3cg", "Tspan13", "Twistnb")
starts <- c(1409, 3170, 3132, 2437, 2626, 3239, 3261, 4947, 4979, 958, 1489) 
snps <- GRanges(geneNames, IRanges(starts, width=1))

## To map these transcript-space coordinates to the genome we need gene ranges
## in genome space.
library(org.Mm.eg.db)
geneid <- select(org.Mm.eg.db, unique(geneNames), "ENTREZID", "SYMBOL")
geneid

## Extract the gene regions:
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
genes <- genes(txdb)[geneid$ENTREZID]

## A prerequesite of the mapping from transcript space to genome space
## is that seqnames in 'x' match names in 'transcripts'. Rename
## 'genes' with the appropriate gene symbol.
names(genes) <- geneid$SYMBOL 

## The xHits and transcriptsHits metadta columns indicate which ranges in
## 'snps' and 'genes' were involved in the mapping.
mapFromTranscripts(snps, genes)


## -----------------------------------------------------------------------
## D. Map dbSNP variants to cds or cDNA coordinates
## -----------------------------------------------------------------------

## The GIPR gene encodes a G-protein coupled receptor for gastric inhibitory 
## polypeptide (GIP). Originally GIP was identified to inhibited gastric acid 
## secretion and gastrin release but was later demonstrated to stimulate 
## insulin release in the presence of elevated glucose.

## In this example 5 SNPs located in the GIPR gene are mapped to cDNA 
## coordinates. A list of SNPs in GIPR can be downloaded from dbSNP or NCBI.
rsids <- c("rs4803846", "rs139322374", "rs7250736", "rs7250754", "rs9749185")

## Extract genomic coordinates with a SNPlocs package.
library(SNPlocs.Hsapiens.dbSNP141.GRCh38)
snps <- snpid2grange(SNPlocs.Hsapiens.dbSNP141.GRCh38, rsids)

## Gene regions of GIPR can be extracted from a TxDb package of compatible
## build. The TxDb package uses Entrez gene identifiers and GIPR is a gene 
## symbol. Conversion between gene symbols and Entrez gene IDs is done by 
## calling select() on an organism db package.
library(org.Hs.eg.db)
geneid <- select(org.Hs.eg.db, "GIPR", "ENTREZID", "SYMBOL")

## The transcriptsBy() extractor returns a range for each transcript that
## includes the UTR and exon regions (i.e., cDNA).
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
txbygene <- transcriptsBy(txdb, "gene")
cDNA <- txbygene[geneid$ENTREZID]
cDNA

## Before mapping, the chromosome names (seqlevels) in the two objects must 
## be harmonized. The style for 'snps' is dbSNP and 'cDNA' is UCSC.
seqlevelsStyle(snps)
seqlevelsStyle(cDNA)

## Modify the style and genome in 'snps' to match 'cDNA'.
seqlevelsStyle(snps) <- seqlevelsStyle(cDNA)
genome(snps) <- genome(cDNA)

## The 'cDNA' object is a GRangesList of length 1. This single list element
## contains the cDNA range for 4 different transcripts. To map to each
## transcript individually 'cDNA' must be unlisted before mapping.

## Map all 5 SNPS to all 4 transcripts:
mapToTranscripts(snps, unlist(cDNA))

## Map the first SNP to transcript uc002pct.1 and the second to uc002pcu.1.
pmapToTranscripts(snps[1:2], unlist(cDNA)[1:2])

## The cdsBy() extractor returns coding regions by gene or by transcript.
## Extract the coding regions for transcript uc002pct.1.
cds <- cdsBy(txdb, "tx", use.names=TRUE)["uc002pct.1"]
cds

## The 'cds' object is a GRangesList of length 1 containing all cds ranges
## for the single transcript uc002pct.1.

## To map to the concatenated group of ranges leave 'cds' as a GRangesList.
mapToTranscripts(snps, cds)

## Only the second SNP could be mapped. Unlisting the 'cds' object maps the
## SNPs to the individual cds ranges (vs the concatenated range).
mapToTranscripts(snps[2], unlist(cds))

## The location is the same because the SNP hit the first cds range. If the
## transcript were on the "-" strand the difference in concatenated vs
## non-concatenated position would be more obvious.

## Change the strand of 'cds':
strand(cds) <- "-"

## Re-map using 'ignore.strand'. The 'ignore.strand' argument is used
## in overlaps operations but does not affect the mapped position.

## Map to concatenated group of cds regions:
mapToTranscripts(snps[2], cds, ignore.strand=TRUE)

## Map to individual cds regions:
mapToTranscripts(snps[2], unlist(cds), ignore.strand=TRUE)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(GenomicFeatures)
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/GenomicFeatures/coordinate-mapping-methods.Rd_%03d_medium.png", width=480, height=480)
> ### Name: mapToTranscripts
> ### Title: Map range coordinates between transcripts and genome space
> ### Aliases: coordinate-mapping mapToTranscripts
> ###   mapToTranscripts,GenomicRanges,GenomicRanges-method
> ###   mapToTranscripts,GenomicRanges,GRangesList-method
> ###   mapToTranscripts,ANY,TxDb-method pmapToTranscripts
> ###   pmapToTranscripts,GenomicRanges,GenomicRanges-method
> ###   pmapToTranscripts,GenomicRanges,GRangesList-method
> ###   pmapToTranscripts,GRangesList,GRangesList-method mapFromTranscripts
> ###   mapFromTranscripts,GenomicRanges,GenomicRanges-method
> ###   mapFromTranscripts,GenomicRanges,GRangesList-method
> ###   pmapFromTranscripts pmapFromTranscripts,Ranges,GenomicRanges-method
> ###   pmapFromTranscripts,Ranges,GRangesList-method
> ###   pmapFromTranscripts,GenomicRanges,GenomicRanges-method
> ###   pmapFromTranscripts,GenomicRanges,GRangesList-method
> ### Keywords: methods utilities
> 
> ### ** Examples
> 
> 
> ## ---------------------------------------------------------------------
> ## A. Basic Use
> ## ---------------------------------------------------------------------
> 
> ## ------------------------------------
> ## (i) Map from genome to transcript:
> 
> ## The seqnames of the output are the transcript names, not chromosomes. For
> ## this reason 'transcripts' must be named. 
> x <- GRanges("A", IRanges(16, 18))
> gr1 <- GRanges("A", IRanges(1, 10, names="tx_a"))
> gr2 <- GRanges("A", IRanges(15, 20, names="tx_b"))
> 
> ## 'transcripts' as GRanges:
> mapToTranscripts(x, gr2)
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |     xHits transcriptsHits
         <Rle> <IRanges>  <Rle> | <integer>       <integer>
  [1]     tx_b    [2, 4]      * |         1               1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 
> ## 'transcripts' as GRangesList:
> mapToTranscripts(x, GRangesList("tx_c" = c(gr1, gr2)))
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |     xHits transcriptsHits
         <Rle> <IRanges>  <Rle> | <integer>       <integer>
  [1]     tx_c  [12, 14]      * |         1               1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 
> ## Round trip from genomic -> transcript -> genomic coordinates:
> tx_coord <- mapToTranscripts(x, gr2)
> mapFromTranscripts(tx_coord, gr2)
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |     xHits transcriptsHits
         <Rle> <IRanges>  <Rle> | <integer>       <integer>
  [1]        A  [16, 18]      * |         1               1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 
> ## ------------------------------------
> ## (ii) Map from transcript to genome:
> 
> ## A prerequisite for mapping from transcript -> genome is that the seqname 
> ## of the range in 'x' match the name of the range in 'transcripts'. Here 
> ## the seqname of 'x' is "TX_1" and mapping is only attempted with the second 
> ## range in 'gr':
> x <- GRanges("TX_1", IRanges(5, 10))
> gr <- GRanges("chr3", IRanges(c(1, 1), width=50, names=c("TX_2", "TX_1")))
> mapFromTranscripts(x, gr)
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |     xHits transcriptsHits
         <Rle> <IRanges>  <Rle> | <integer>       <integer>
  [1]     chr3   [5, 10]      * |         1               2
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 
> ## ------------------------------------
> ## (iii) Element-wise versions:
> 
> ## Recycling is supported when length(transcripts) == 1; otherwise the
> ## lengths of 'x' and 'transcripts' must be the same.
> x <- GRanges("A", IRanges(c(1, 5, 10), width=1))
> transcripts <- GRanges("A", IRanges(4, 7))
> pmapToTranscripts(x, transcripts)
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1] UNMAPPED   [0, -1]      *
  [2]        1   [2,  2]      *
  [3] UNMAPPED   [0, -1]      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
> 
> 
> ## ---------------------------------------------------------------------
> ## B. Map local sequence locations to the genome
> ## ---------------------------------------------------------------------
> 
> ## NAGNAG alternative splicing plays an essential role in biological processes 
> ## and represents a highly adaptable system for posttranslational regulation 
> ## of gene function. The majority of NAGNAG studies largely focus on messenger 
> ## RNA. A study by Sun, Lin, and Yan 
> ## (http://www.hindawi.com/journals/bmri/2014/736798/) demonstrated that
> ## NAGNAG splicing is also operative in large intergenic noncoding RNA
> ## (lincRNA). 
> 
> ## One finding of interest was that linc-POLR3G-10 exhibited two NAGNAG 
> ## acceptors located in two distinct transcripts: TCONS_00010012 and 
> ## TCONS_00010010. 
> 
> ## Extract the exon coordinates of TCONS_00010012 and TCONS_00010010: 
> lincrna <- c("TCONS_00010012", "TCONS_00010010")
> library(TxDb.Hsapiens.UCSC.hg19.lincRNAsTranscripts)
> txdb <- TxDb.Hsapiens.UCSC.hg19.lincRNAsTranscripts
> exons <- exonsBy(txdb, by="tx", use.names=TRUE)[lincrna]
> exons
GRangesList object of length 2:
$TCONS_00010012 
GRanges object with 5 ranges and 3 metadata columns:
      seqnames               ranges strand |   exon_id   exon_name exon_rank
         <Rle>            <IRanges>  <Rle> | <integer> <character> <integer>
  [1]     chr5 [87577501, 87577923]      + |     16943        <NA>         1
  [2]     chr5 [87578532, 87578683]      + |     16947        <NA>         2
  [3]     chr5 [87578827, 87578903]      + |     16950        <NA>         3
  [4]     chr5 [87581561, 87581668]      + |     16953        <NA>         4
  [5]     chr5 [87583253, 87583828]      + |     16957        <NA>         5

$TCONS_00010010 
GRanges object with 2 ranges and 3 metadata columns:
      seqnames               ranges strand | exon_id exon_name exon_rank
  [1]     chr5 [87577114, 87578683]      + |   16941      <NA>         1
  [2]     chr5 [87581558, 87581668]      + |   16952      <NA>         2

-------
seqinfo: 93 sequences (1 circular) from hg19 genome
> 
> ## The two NAGNAG acceptors were identified in the upstream region of 
> ## the fourth and fifth exons located in TCONS_00010012.
> ## Extract the sequences for transcript TCONS_00010012:
> library(BSgenome.Hsapiens.UCSC.hg19)
Loading required package: BSgenome
Loading required package: Biostrings
Loading required package: XVector
Loading required package: rtracklayer
> genome <- BSgenome.Hsapiens.UCSC.hg19
> exons_seq <- getSeq(genome, exons[[1]])
> 
> ## TCONS_00010012 has 4 exons:
> exons_seq
  A DNAStringSet instance of length 5
    width seq
[1]   423 CAAGAGAGAAGCAGAAGCCACCATGGTTTTTTAT...TGCTGGTGCCGTGCTTCTACAGCCTGCAGAACT
[2]   152 GAAGGACTTTACAGCTTCTCTTTCGCATATCCAA...TGATAACCAAGAAGTCTACTAAGTGACTAACAG
[3]    77 TTAATATTTTTGGACATGAGGTGACTGCAGTTAA...GGAAAGCAAACCCATAGATAAAGGGATACTATG
[4]   108 AGCAACTGGGAACCTAATTCTGACCAAAAAGGAG...AAGCAACCGCATACCAAGGATAATTATAAGAAG
[5]   576 AAGAGAAGACATGGAGAAGAGGAGGGAGTAGTTG...CTAGGGACTTTCCAAGAGGATGACCAAAGAAGG
> 
> ## The most common triplet among the lincRNA sequences was CAG. Identify
> ## the location of this pattern in all exons.
> cag_loc <- vmatchPattern("CAG", exons_seq)
> 
> ## Convert the first occurance of CAG in each exon back to genome coordinates.
> first_loc <- do.call(c, sapply(cag_loc, "[", 1, simplify=TRUE))
> pmapFromTranscripts(first_loc, exons[[1]])
GRanges object with 5 ranges and 0 metadata columns:
      seqnames               ranges strand
         <Rle>            <IRanges>  <Rle>
  [1]     chr5 [87577512, 87577514]      +
  [2]     chr5 [87578543, 87578545]      +
  [3]     chr5 [87578854, 87578856]      +
  [4]     chr5 [87581620, 87581622]      +
  [5]     chr5 [87583299, 87583301]      +
  -------
  seqinfo: 93 sequences from an unspecified genome; no seqlengths
> 
> 
> ## -----------------------------------------------------------------------
> ## C. Map 3'UTR variants to genome coordinates
> ## -----------------------------------------------------------------------
> 
> ## A study by Skeeles et. al (PLoS ONE 8(3): e58609. doi:
> ## 10.1371/journal.pone.0058609) investigated the impact of 3'UTR variants 
> ## on the expression of cancer susceptibility genes.
> 
> ## 8 candidate miRNA genes on chromosome 12 were used to test for 
> ## differential luciferase expression in mice. In Table 2 of the manuscript
> ## variant locations are given as nucleotide position within the gene.
> geneNames <- c("Bcap29", "Dgkb", "Etv1", "Hbp1", "Hbp1", "Ifrd1", 
+                "Ifrd1", "Pik3cg", "Pik3cg", "Tspan13", "Twistnb")
> starts <- c(1409, 3170, 3132, 2437, 2626, 3239, 3261, 4947, 4979, 958, 1489) 
> snps <- GRanges(geneNames, IRanges(starts, width=1))
> 
> ## To map these transcript-space coordinates to the genome we need gene ranges
> ## in genome space.
> library(org.Mm.eg.db)

> geneid <- select(org.Mm.eg.db, unique(geneNames), "ENTREZID", "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
> geneid
   SYMBOL ENTREZID
1  Bcap29    12033
2    Dgkb   217480
3    Etv1    14009
4    Hbp1    73389
5   Ifrd1    15982
6  Pik3cg    30955
7 Tspan13    66109
8 Twistnb    28071
> 
> ## Extract the gene regions:
> library(TxDb.Mmusculus.UCSC.mm10.knownGene)
> txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
> genes <- genes(txdb)[geneid$ENTREZID]
> 
> ## A prerequesite of the mapping from transcript space to genome space
> ## is that seqnames in 'x' match names in 'transcripts'. Rename
> ## 'genes' with the appropriate gene symbol.
> names(genes) <- geneid$SYMBOL 
> 
> ## The xHits and transcriptsHits metadta columns indicate which ranges in
> ## 'snps' and 'genes' were involved in the mapping.
> mapFromTranscripts(snps, genes)
GRanges object with 11 ranges and 2 metadata columns:
       seqnames               ranges strand |     xHits transcriptsHits
          <Rle>            <IRanges>  <Rle> | <integer>       <integer>
   [1]    chr12 [31633250, 31633250]      - |         1               1
   [2]    chr12 [37883343, 37883343]      + |         2               2
   [3]    chr12 [38783389, 38783389]      + |         3               3
   [4]    chr12 [31948099, 31948099]      - |         4               4
   [5]    chr12 [31947910, 31947910]      - |         5               4
   [6]    chr12 [40219951, 40219951]      - |         6               5
   [7]    chr12 [40219929, 40219929]      - |         7               5
   [8]    chr12 [32203703, 32203703]      - |         8               6
   [9]    chr12 [32203671, 32203671]      - |         9               6
  [10]    chr12 [36041521, 36041521]      - |        10               7
  [11]    chr12 [33431112, 33431112]      + |        11               8
  -------
  seqinfo: 66 sequences from an unspecified genome; no seqlengths
> 
> 
> ## -----------------------------------------------------------------------
> ## D. Map dbSNP variants to cds or cDNA coordinates
> ## -----------------------------------------------------------------------
> 
> ## The GIPR gene encodes a G-protein coupled receptor for gastric inhibitory 
> ## polypeptide (GIP). Originally GIP was identified to inhibited gastric acid 
> ## secretion and gastrin release but was later demonstrated to stimulate 
> ## insulin release in the presence of elevated glucose.
> 
> ## In this example 5 SNPs located in the GIPR gene are mapped to cDNA 
> ## coordinates. A list of SNPs in GIPR can be downloaded from dbSNP or NCBI.
> rsids <- c("rs4803846", "rs139322374", "rs7250736", "rs7250754", "rs9749185")
> 
> ## Extract genomic coordinates with a SNPlocs package.
> library(SNPlocs.Hsapiens.dbSNP141.GRCh38)
> snps <- snpid2grange(SNPlocs.Hsapiens.dbSNP141.GRCh38, rsids)
> 
> ## Gene regions of GIPR can be extracted from a TxDb package of compatible
> ## build. The TxDb package uses Entrez gene identifiers and GIPR is a gene 
> ## symbol. Conversion between gene symbols and Entrez gene IDs is done by 
> ## calling select() on an organism db package.
> library(org.Hs.eg.db)

> geneid <- select(org.Hs.eg.db, "GIPR", "ENTREZID", "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
> 
> ## The transcriptsBy() extractor returns a range for each transcript that
> ## includes the UTR and exon regions (i.e., cDNA).
> library(TxDb.Hsapiens.UCSC.hg38.knownGene)
> txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
> txbygene <- transcriptsBy(txdb, "gene")
> cDNA <- txbygene[geneid$ENTREZID]
> cDNA
GRangesList object of length 1:
$2696 
GRanges object with 4 ranges and 2 metadata columns:
      seqnames               ranges strand |     tx_id     tx_name
         <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]    chr19 [45668244, 45682155]      + |     79258  uc002pct.1
  [2]    chr19 [45668244, 45682459]      + |     79259  uc002pcu.1
  [3]    chr19 [45668244, 45682459]      + |     79260  uc010xxp.1
  [4]    chr19 [45668244, 45682459]      + |     79261  uc010xxq.1

-------
seqinfo: 455 sequences (1 circular) from hg38 genome
> 
> ## Before mapping, the chromosome names (seqlevels) in the two objects must 
> ## be harmonized. The style for 'snps' is dbSNP and 'cDNA' is UCSC.
> seqlevelsStyle(snps)
[1] "dbSNP"
> seqlevelsStyle(cDNA)
[1] "UCSC"
> 
> ## Modify the style and genome in 'snps' to match 'cDNA'.
> seqlevelsStyle(snps) <- seqlevelsStyle(cDNA)
> genome(snps) <- genome(cDNA)
> 
> ## The 'cDNA' object is a GRangesList of length 1. This single list element
> ## contains the cDNA range for 4 different transcripts. To map to each
> ## transcript individually 'cDNA' must be unlisted before mapping.
> 
> ## Map all 5 SNPS to all 4 transcripts:
> mapToTranscripts(snps, unlist(cDNA))
GRanges object with 20 ranges and 2 metadata columns:
       seqnames         ranges strand |     xHits transcriptsHits
          <Rle>      <IRanges>  <Rle> | <integer>       <integer>
   [1]     2696   [8649, 8649]      + |         1               2
   [2]     2696   [8649, 8649]      + |         1               3
   [3]     2696   [8649, 8649]      + |         1               4
   [4]     2696   [8649, 8649]      + |         1               1
   [5]     2696   [1322, 1322]      + |         2               2
   ...      ...            ...    ... .       ...             ...
  [16]     2696 [12560, 12560]      + |         4               1
  [17]     2696 [ 3915,  3915]      + |         5               2
  [18]     2696 [ 3915,  3915]      + |         5               3
  [19]     2696 [ 3915,  3915]      + |         5               4
  [20]     2696 [ 3915,  3915]      + |         5               1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 
> ## Map the first SNP to transcript uc002pct.1 and the second to uc002pcu.1.
> pmapToTranscripts(snps[1:2], unlist(cDNA)[1:2])
GRanges object with 2 ranges and 0 metadata columns:
      seqnames       ranges strand
         <Rle>    <IRanges>  <Rle>
  [1]     2696 [8649, 8649]      +
  [2]     2696 [1322, 1322]      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 
> ## The cdsBy() extractor returns coding regions by gene or by transcript.
> ## Extract the coding regions for transcript uc002pct.1.
> cds <- cdsBy(txdb, "tx", use.names=TRUE)["uc002pct.1"]
> cds
GRangesList object of length 1:
$uc002pct.1 
GRanges object with 12 ranges and 3 metadata columns:
       seqnames               ranges strand |    cds_id    cds_name exon_rank
          <Rle>            <IRanges>  <Rle> | <integer> <character> <integer>
   [1]    chr19 [45669521, 45669592]      + |    198246        <NA>         2
   [2]    chr19 [45670635, 45670734]      + |    198247        <NA>         3
   [3]    chr19 [45671285, 45671392]      + |    198248        <NA>         4
   [4]    chr19 [45672851, 45672954]      + |    198249        <NA>         5
   [5]    chr19 [45674074, 45674177]      + |    198250        <NA>         6
   ...      ...                  ...    ... .       ...         ...       ...
   [8]    chr19 [45677323, 45677383]      + |    198253        <NA>         9
   [9]    chr19 [45677710, 45677779]      + |    198254        <NA>        10
  [10]    chr19 [45677906, 45677994]      + |    198255        <NA>        11
  [11]    chr19 [45678088, 45678226]      + |    198256        <NA>        12
  [12]    chr19 [45681604, 45681711]      + |    198258        <NA>        13

-------
seqinfo: 455 sequences (1 circular) from hg38 genome
> 
> ## The 'cds' object is a GRangesList of length 1 containing all cds ranges
> ## for the single transcript uc002pct.1.
> 
> ## To map to the concatenated group of ranges leave 'cds' as a GRangesList.
> mapToTranscripts(snps, cds)
GRanges object with 1 range and 2 metadata columns:
        seqnames    ranges strand |     xHits transcriptsHits
           <Rle> <IRanges>  <Rle> | <integer>       <integer>
  [1] uc002pct.1  [45, 45]      + |         2               1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 
> ## Only the second SNP could be mapped. Unlisting the 'cds' object maps the
> ## SNPs to the individual cds ranges (vs the concatenated range).
> mapToTranscripts(snps[2], unlist(cds))
GRanges object with 1 range and 2 metadata columns:
        seqnames    ranges strand |     xHits transcriptsHits
           <Rle> <IRanges>  <Rle> | <integer>       <integer>
  [1] uc002pct.1  [45, 45]      + |         1               1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 
> ## The location is the same because the SNP hit the first cds range. If the
> ## transcript were on the "-" strand the difference in concatenated vs
> ## non-concatenated position would be more obvious.
> 
> ## Change the strand of 'cds':
> strand(cds) <- "-"
> 
> ## Re-map using 'ignore.strand'. The 'ignore.strand' argument is used
> ## in overlaps operations but does not affect the mapped position.
> 
> ## Map to concatenated group of cds regions:
> mapToTranscripts(snps[2], cds, ignore.strand=TRUE)
GRanges object with 1 range and 2 metadata columns:
        seqnames    ranges strand |     xHits transcriptsHits
           <Rle> <IRanges>  <Rle> | <integer>       <integer>
  [1] uc002pct.1  [45, 45]      * |         1               1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 
> ## Map to individual cds regions:
> mapToTranscripts(snps[2], unlist(cds), ignore.strand=TRUE)
GRanges object with 1 range and 2 metadata columns:
        seqnames    ranges strand |     xHits transcriptsHits
           <Rle> <IRanges>  <Rle> | <integer>       <integer>
  [1] uc002pct.1  [45, 45]      * |         1               1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>