Last data update: 2014.03.03

R: Add annotation to ranges
detailRangesR Documentation

Add annotation to ranges

Description

Add detailed exon-based annotation to specified genomic regions.

Usage

detailRanges(incoming, txdb, orgdb, dist=5000, promoter=c(3000, 1000), 
    max.intron=1e6, key.field="ENTREZID", name.field="SYMBOL", 
    ignore.strand=TRUE)

Arguments

incoming

a GRanges object containing the ranges to be annotated

txdb

a TranscriptDb object for the genome of interest

orgdb

a genome wide annotation object for the genome of interest

dist

an integer scalar specifying the flanking distance to annotate

promoter

an integer vector of length 2, where first and second values define the promoter as some distance upstream and downstream from the TSS, respectively

max.intron

an integer scalar indicating the maximum distance between exons

key.field

a character scalar specifying the keytype for name extraction

name.field

a character scalar or vector specifying the column(s) to use as the gene name

ignore.strand

a logical scalar indicating whether strandedness in incoming should be ignored

Details

This function adds exon-based annotations to a given set of genomic regions, in the form of compact character strings specifying the features overlapping and flanking each region. The aim is to determine the genic context of empirically identified regions. This allows some basic biological interpretation of binding/marking in those regions. All neighboring genes within a specified range are reported, rather than just the closest gene to the region. If a region in incoming is stranded and ignore.strand=FALSE, annotated features will only be reported if they lie on the same strand as that region.

If incoming is missing, then the annotation will be provided directly to the user in the form of a GRanges object. This may be more useful when further work on the annotation is required. Exon numbers are provided in the metadata with promoters and gene bodies labelled as 0 and -1, respectively. Overlaps to introns can be identified by finding those regions that overlap with gene bodies but not with any of the corresponding exons.

Value

If incoming is not provided, a GRanges object will be returned containing ranges for the exons, promoters and gene bodies. Gene keys (e.g., Entrez IDs) are stored as row names. Gene symbols, exon numbers and internal groupings (for exons of genes with multiple genomic locations) are also stored as metadata.

If incoming is a GRanges object, a list will be returned with overlap, left and right elements. Each element is a character vector of length equal to the number of ranges in incoming. Each non-empty string records the gene symbol, the overlapped exons and the strand. For left and right, the gap between the range and the annotated feature is also included.

Explanation of fields

For annotated features overlapping a region, the character string in the overlap output vector will be of the form GENE|EXONS|STRAND. GENE is the gene symbol by default, but reverts to <XXX> if no symbol is defined for a gene with the Entrez ID XXX. The EXONS indicate the exon or range of exons that are overlapped. The STRAND is, obviously, the strand on which the gene is coded. For annotated regions flanking the region within a distance of dist, the character string in the left or right output vectors will have an additional [DIST] value. This represents the gap between the edge of the region and the closest exon for that gene.

Exons are numbered in order of increasing start or end position for genes on the forward or reverse strands, respectively. Promoters are defined as the region of length promoter upstream of the gene TSS, itself defined as the start of the first exon (for genes on the forward strand) or the end of the last exon (otherwise). All promoters are marked as exon 0 for simplicity. Exon ranges in EXON are reported from as a comma-separated list where stretches of consecutive exons are summarized into a range. If the region overlaps an intron, it is labelled with I in EXON. No intronic overlaps are reported if there is an exonic overlap.

Note that promoter and intronic annotations are only reported for the overlap vector to reduce redundancy in the output. For example, it makes little sense to report that the region is both flanking and overlapping an intron. Similarly, the value of DIST is more relevant when it is reported to the nearest exon rather than to an intron (in which case, the distance would be zero if the intron overlaps the region). In cases where the distance is reported to the first exon, it can be used to refine the choice of promoter.

Other options

The max.intron value is necessary to deal with genes that have ambiguous locations on the genome. If a gene has exons on different chromosomes, its location is uncertain and the gene is partitioned into two sets of exons for separate processing. However, this is less obvious when the ambiguous locations belong to the same chromosome. The max.intron value protects against excessively large genes that may occur from considering those locations as a single transcriptional unit. Exons are partitioned into two (or more) internal groupings for further processing.

The default settings for key.field and name.field will work for human and mouse genomes, but may not work for other organisms. The key.field should refer to the key type in the OrgDb object, and also correspond to the GENEID of the TxDb object. For example, in S. cerevisiae, key.field is set to "ORF" while name.field is set to "GENENAME". If multiple entries are supplied in name.field, the value of GENE is defined as a semicolon-separated list of each of those entries.

Author(s)

Aaron Lun

Examples

 
require(org.Mm.eg.db)
require(TxDb.Mmusculus.UCSC.mm10.knownGene)

current <- readRDS(system.file("exdata", "exrange.rds", package="csaw"))
output <- detailRanges(current, orgdb=org.Mm.eg.db,
    txdb=TxDb.Mmusculus.UCSC.mm10.knownGene) 
head(output$overlap)
head(output$right)
head(output$left)

detailRanges(txdb=TxDb.Mmusculus.UCSC.mm10.knownGene, orgdb=org.Mm.eg.db)

## Not run: 
output <- detailRanges(current, txdb=TxDb.Mmusculus.UCSC.mm10.knownGene, 
    orgdb=org.Mm.eg.db, name.field=c("ENTREZID"))
head(output$overlap)

output <- detailRanges(current, txdb=TxDb.Mmusculus.UCSC.mm10.knownGene, 
    orgdb=org.Mm.eg.db, name.field=c("SYMBOL", "ENTREZID"))
head(output$overlap)

## End(Not run)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(csaw)
Loading required package: GenomicRanges
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/csaw/detailRanges.Rd_%03d_medium.png", width=480, height=480)
> ### Name: detailRanges
> ### Title: Add annotation to ranges
> ### Aliases: detailRanges
> ### Keywords: annotation
> 
> ### ** Examples
>  
> require(org.Mm.eg.db)
Loading required package: org.Mm.eg.db
Loading required package: AnnotationDbi

> require(TxDb.Mmusculus.UCSC.mm10.knownGene)
Loading required package: TxDb.Mmusculus.UCSC.mm10.knownGene
Loading required package: GenomicFeatures
> 
> current <- readRDS(system.file("exdata", "exrange.rds", package="csaw"))
> output <- detailRanges(current, orgdb=org.Mm.eg.db,
+     txdb=TxDb.Mmusculus.UCSC.mm10.knownGene) 
> head(output$overlap)
[1] "Nrxn3|8|+"         ""                  ""                 
[4] "1700007G11Rik|I|+" "Mannr|2|+"         ""                 
> head(output$right)
[1] "Nrxn3|9|+[3846]" "Rprm|1|-[2293]"  ""                ""               
[5] ""                ""               
> head(output$left)
[1] ""                        ""                       
[3] ""                        "1700007G11Rik|5|+[2890]"
[5] ""                        ""                       
> 
> detailRanges(txdb=TxDb.Mmusculus.UCSC.mm10.knownGene, orgdb=org.Mm.eg.db)
GRanges object with 288419 ranges and 3 metadata columns:
            seqnames                 ranges strand |      symbol      exon
               <Rle>              <IRanges>  <Rle> | <character> <integer>
  100009600     chr9   [21062393, 21062717]      - |       Zglp1         7
  100009600     chr9   [21062894, 21062987]      - |       Zglp1         6
  100009600     chr9   [21063314, 21063396]      - |       Zglp1         5
  100009600     chr9   [21066024, 21066377]      - |       Zglp1         4
  100009600     chr9   [21066940, 21067925]      - |       Zglp1         3
        ...      ...                    ...    ... .         ...       ...
      99889     chr3 [ 85785218,  85887518]      - |      Arfip1        -1
      99890     chr3 [110246104, 110250999]      - |       Prmt6        -1
      99899     chr3 [151730923, 151749959]      - |       Ifi44        -1
      99929     chr3 [ 65528447,  65555518]      + |      Tiparp        -1
      99982     chr4 [136550533, 136602723]      - |       Kdm1a        -1
             internal
            <integer>
  100009600         1
  100009600         1
  100009600         1
  100009600         1
  100009600         1
        ...       ...
      99889     24180
      99890     24181
      99899     24182
      99929     24183
      99982     24184
  -------
  seqinfo: 66 sequences (1 circular) from mm10 genome
> 
> ## Not run: 
> ##D output <- detailRanges(current, txdb=TxDb.Mmusculus.UCSC.mm10.knownGene, 
> ##D     orgdb=org.Mm.eg.db, name.field=c("ENTREZID"))
> ##D head(output$overlap)
> ##D 
> ##D output <- detailRanges(current, txdb=TxDb.Mmusculus.UCSC.mm10.knownGene, 
> ##D     orgdb=org.Mm.eg.db, name.field=c("SYMBOL", "ENTREZID"))
> ##D head(output$overlap)
> ## End(Not run)
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>