Last data update: 2014.03.03

exonsBy    R Documentation

Retrieve annotation data from an Ensembl based package

Description

Retrieve gene, transcript and exon annotations stored in an Ensembl-based database package generated with the makeEnsembldbPackage function.

Usage


## S4 method for signature 'EnsDb'
exons(x, columns=listColumns(x,"exon"),
                        filter, order.by, order.type="asc",
                        return.type="GRanges")

## S4 method for signature 'EnsDb'
exonsBy(x, by=c("tx", "gene"),
                          columns=listColumns(x, "exon"), filter, use.names=FALSE)

## S4 method for signature 'EnsDb'
exonsByOverlaps(x, ranges, maxgap=0L, minoverlap=1L,
                                  type=c("any", "start", "end"),
                                  columns=listColumns(x, "exon"),
                                  filter)

## S4 method for signature 'EnsDb'
transcripts(x, columns=listColumns(x, "tx"),
                              filter, order.by, order.type="asc",
                              return.type="GRanges")

## S4 method for signature 'EnsDb'
transcriptsBy(x, by=c("gene", "exon"),
                                columns=listColumns(x, "tx"), filter)

## S4 method for signature 'EnsDb'
transcriptsByOverlaps(x, ranges, maxgap=0L, minoverlap=1L,
                                        type=c("any", "start", "end"),
                                        columns=listColumns(x, "tx"),
                                        filter)

## S4 method for signature 'EnsDb'
promoters(x, upstream=2000, downstream=200, ...)

## S4 method for signature 'EnsDb'
genes(x, columns=listColumns(x, "gene"), filter,
                        order.by, order.type="asc",
                        return.type="GRanges")

## S4 method for signature 'EnsDb'
disjointExons(x, aggregateGenes=FALSE,
                                includeTranscripts=TRUE, filter, ...)

## S4 method for signature 'EnsDb'
cdsBy(x, by=c("tx", "gene"), columns=NULL, filter,
                        use.names=FALSE)

## S4 method for signature 'EnsDb'
fiveUTRsByTranscript(x, columns=NULL, filter)

## S4 method for signature 'EnsDb'
threeUTRsByTranscript(x, columns=NULL, filter)

## S4 method for signature 'GRangesList'
toSAF(x, ...)

Arguments

(In alphabetical order)

...

For promoters: additional arguments to be passed to the transcripts method.

aggregateGenes

For disjointExons: When FALSE (default) exon fragments that overlap multiple genes are dropped. When TRUE, all fragments are kept and the gene_id metadata column includes all gene IDs that overlap the exon fragment.

by

For exonsBy: whether exons should be fetched by genes or by transcripts, as in the corresponding function of the GenomicFeatures package. For transcriptsBy: whether transcripts should be fetched by genes or by exons; fetching transcripts by cds, as supported by the transcriptsBy method in the GenomicFeatures package, is currently not implemented. For cdsBy: whether cds should be fetched by transcript or by gene.

columns

Columns to be retrieved from the database tables.

Default values for genes are all columns from the gene database table, for exons and exonsBy the column names of the exon database table, and for transcripts and transcriptsBy the columns of the tx database table (see details below for more information).

Note that any of the column names of the database tables can be submitted to any of the methods (use listTables or listColumns methods for a complete list of allowed column names).

For cdsBy: this argument is only supported for by="tx".

downstream

For method promoters: the number of nucleotides downstream of the transcription start site that should be included in the promoter region.

filter

A filter object extending BasicFilter, or a list of such objects, to select specific entries from the database (see examples below).

includeTranscripts

For disjointExons: When TRUE (default) a tx_name metadata column is included that lists all transcript IDs that overlap the exon fragment. Note: this differs from the disjointExons function in the GenomicFeatures package, which lists the transcript names, not IDs.

maxgap

For exonsByOverlaps and transcriptsByOverlaps: see exonsByOverlaps help page in the GenomicFeatures package.

minoverlap

For exonsByOverlaps and transcriptsByOverlaps: see exonsByOverlaps help page in the GenomicFeatures package.

order.by

Name of one of the columns above on which the results should be sorted.

order.type

Whether the results should be ordered ascending ("asc", default) or descending ("desc").

ranges

For exonsByOverlaps and transcriptsByOverlaps: a GRanges object specifying the genomic regions.

return.type

Type of the returned object. Can be either "data.frame", "DataFrame" or "GRanges". In the latter case the returned object will be a GRanges object with the ranges specifying the chromosomal start and end coordinates of the feature (gene, transcript or exon, depending on whether genes, transcripts or exons was called). All additional columns are added as metadata columns to the GRanges object.

type

For exonsByOverlaps and transcriptsByOverlaps: see exonsByOverlaps help page in the GenomicFeatures package.

upstream

For method promoters: the number of nucleotides upstream of the transcription start site that should be included in the promoter region.

use.names

For cdsBy and exonsBy: only for by="gene": use the names of the genes instead of their IDs as names of the resulting GRangesList.

x

For toSAF a GRangesList object. For all other methods an EnsDb instance.

Details

A detailed description of all database tables and the associated attributes/column names is also given in the vignette of this package. An overview of the columns is given below:

gene_id

the Ensembl gene ID of the gene.

gene_name

the name of the gene (in most cases its official symbol).

entrezid

the NCBI Entrezgene ID of the gene; note that this can also be a ";" separated list of IDs for Ensembl genes mapped to more than one Entrezgene.

gene_biotype

the biotype of the gene.

gene_seq_start

the start coordinate of the gene on the sequence (usually a chromosome).

gene_seq_end

the end coordinate of the gene.

seq_name

the name of the sequence on which the gene is encoded (usually a chromosome).

seq_strand

the strand on which the gene is encoded.

seq_coord_system

the coordinate system of the sequence.

tx_id

the Ensembl transcript ID.

tx_biotype

the biotype of the transcript.

tx_seq_start

the chromosomal start coordinate of the transcript.

tx_seq_end

the chromosomal end coordinate of the transcript.

tx_cds_seq_start

the start coordinate of the coding region of the transcript (NULL for non-coding transcripts).

tx_cds_seq_end

the end coordinate of the coding region.

exon_id

the ID of the exon. In Ensembl, each exon specified by a unique chromosomal start and end position has its own ID. Thus, the same exon might be part of several transcripts.

exon_seq_start

the chromosomal start coordinate of the exon.

exon_seq_end

the chromosomal end coordinate of the exon.

exon_idx

the index of the exon in the transcript model. As noted above, an exon can be part of several transcripts and thus its position inside these transcripts might differ.

Also, the vignette provides examples on how to retrieve sequences for genes/transcripts/exons.

Value

For exons, transcripts and genes, a data.frame, DataFrame or a GRanges, depending on the value of the return.type parameter. The result is ordered as specified by the parameter order.by or, if not provided, by seq_name and chromosomal start coordinate, but NOT by the order of the values in any filter objects that were supplied.

For exonsBy, transcriptsBy: a GRangesList. The list elements are ordered by the name of the grouping feature selected with the by parameter.

For exonsByOverlaps and transcriptsByOverlaps: a GRanges with the exons or transcripts overlapping the specified regions.

For toSAF: a data.frame with column names "GeneID" (the group name from the GRangesList, i.e. the ID by which the GRanges are split), "Chr" (the seqnames from the GRanges), "Start" (the start coordinate), "End" (the end coordinate) and "Strand" (the strand).

For disjointExons: a GRanges of non-overlapping exon parts.

For cdsBy: a GRangesList with one GRanges per transcript or per gene, specifying the start and end coordinates of the coding region of the transcript or gene.

For fiveUTRsByTranscript: a GRangesList with GRanges for each protein coding transcript representing the start and end coordinates of full or partial exons that constitute the 5' untranslated region of the transcript.

For threeUTRsByTranscript: a GRangesList with GRanges for each protein coding transcript representing the start and end coordinates of full or partial exons that constitute the 3' untranslated region of the transcript.

Methods and Functions

exons

Retrieve exon information from the database. Additional columns from transcripts or genes associated with the exons can be specified and are added to the respective exon annotation.

exonsBy

Retrieve exons grouped by transcript or by gene. This function returns a GRangesList as does the analogous function in the GenomicFeatures package. Using the columns parameter it is possible to determine which additional values should be retrieved from the database. These will be included in the GRanges object for the exons as metadata columns. For by="tx", the exons in the inner GRanges are ordered by the exon index within the transcript. For by="gene", they are ordered increasingly by the chromosomal start position of the exon if the gene is encoded on the + strand, or decreasingly by the chromosomal end position of the exon if it is encoded on the - strand. The GRanges in the GRangesList will be ordered by the name of the gene or transcript.
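The strand-dependent ordering described for by="gene" can be sketched in base R. The helper order_gene_exons is hypothetical and only mimics the rule stated above on a plain data.frame; it is not the code used by exonsBy. The coordinates are three exons of ENSG00000184895 (- strand) taken from the Results section.

```r
## Hypothetical helper (not part of ensembldb): order the exons of ONE gene
## following the rule described for exonsBy(..., by="gene").
order_gene_exons <- function(exons) {
    if (exons$strand[1] == "+") {
        ## + strand: increasing chromosomal start position
        exons[order(exons$start), ]
    } else {
        ## - strand: decreasing chromosomal end position
        exons[order(exons$end, decreasing = TRUE), ]
    }
}

exons <- data.frame(
    exon_id = c("ENSE00001494622", "ENSE00002323146", "ENSE00002214525"),
    start = c(2654896, 2655049, 2655145),
    end = c(2655740, 2655069, 2655168),
    strand = "-",
    stringsAsFactors = FALSE
)

ordered <- order_gene_exons(exons)
ordered$exon_id
```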

exonsByOverlaps

Retrieve exons overlapping specified genomic ranges. For more information see the exonsByOverlaps method in the GenomicFeatures package. The functionality overlaps to some extent with that of the exons method used in combination with a GRangesFilter.

transcripts

Retrieve transcript information from the database. Additional columns from genes or exons associated with the transcripts can be specified and are added to the respective transcript annotation.

transcriptsBy

Retrieve transcripts grouped by gene or exon. This function returns a GRangesList as does the analogous function in the GenomicFeatures package. Using the columns parameter it is possible to determine which additional values should be retrieved from the database. These will be included in the GRanges object for the transcripts as metadata columns. The transcripts in the inner GRanges are ordered increasingly by the chromosomal start position of the transcript for genes encoded on the + strand and in a decreasing manner by the chromosomal end position of the transcript for genes encoded on the - strand. The GRanges in the GRangesList will be ordered by the name of the gene or exon.

transcriptsByOverlaps

Retrieve transcripts overlapping specified genomic ranges. For more information see the transcriptsByOverlaps method in the GenomicFeatures package. The functionality overlaps to some extent with that of the transcripts method used in combination with a GRangesFilter.

promoters

Retrieve promoter information from the database. Additional columns from genes or exons associated with the promoters can be specified and are added to the respective promoter annotation.
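As a rough illustration of how upstream and downstream translate into coordinates, the following base-R sketch computes a promoter region from a transcript's start, end and strand. The helper promoter_range is hypothetical and not part of ensembldb. It assumes the usual convention that the transcription start site is the transcript start on the + strand and the transcript end on the - strand, which agrees with the promoters output for ENST00000383070 and ENST00000346432 shown in the Results section.

```r
## Hypothetical helper (not part of ensembldb): promoter region around the
## transcription start site (TSS) of a single transcript.
promoter_range <- function(start, end, strand,
                           upstream = 2000, downstream = 200) {
    if (strand == "+") {
        ## TSS is the transcript start; upstream means lower coordinates.
        c(start = start - upstream, end = start + downstream - 1)
    } else {
        ## TSS is the transcript end; upstream means higher coordinates.
        c(start = end - downstream + 1, end = end + upstream)
    }
}

## ENST00000383070, - strand, transcript range [2654896, 2655740]
promoter_range(2654896, 2655740, "-")

## ENST00000346432, + strand, transcript range [6778727, 6959724]
promoter_range(6778727, 6959724, "+")
```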

genes

Retrieve gene information from the database. Additional columns from transcripts or exons associated with the genes can be specified and are added to the respective gene annotation.

disjointExons

This method is identical to disjointExons defined in the GenomicFeatures package. It creates a GRanges of non-overlapping exon parts with metadata columns of gene_id and exonic_part. Exon parts that overlap more than one gene can be dropped with aggregateGenes=FALSE.

cdsBy

Returns the coding region grouped either by transcript or by gene. Each element in the GRangesList represents the cds for one transcript or gene, with the individual ranges corresponding to the coding part of its exons. For by="tx" additional annotation columns can be added to the individual GRanges (in addition to the default columns exon_id and exon_rank). Note that the GRangesList is sorted by its names.

fiveUTRsByTranscript

Returns the 5' untranslated region for protein coding transcripts.

threeUTRsByTranscript

Returns the 3' untranslated region for protein coding transcripts.

toSAF

Reformats a GRangesList object into a data.frame corresponding to a standard SAF (Simplified Annotation Format) file (i.e. with column names "GeneID", "Chr", "Start", "End" and "Strand"). Note: this method only makes sense on a GRangesList that groups features (exons, transcripts) by gene.
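To make the SAF layout concrete, the following base-R sketch builds a data.frame with the five SAF columns from a plain named list of per-gene ranges. This only illustrates the output format and is not the toSAF implementation; the helper saf_from_list, the gene names and the coordinates are made up for the example.

```r
## Hypothetical stand-in for a GRangesList: one data.frame of ranges per gene.
grouped <- list(
    geneA = data.frame(seqnames = "Y", start = c(100, 300), end = c(200, 400),
                       strand = "+", stringsAsFactors = FALSE),
    geneB = data.frame(seqnames = "X", start = 50, end = 80,
                       strand = "-", stringsAsFactors = FALSE)
)

## Hypothetical helper: flatten the list into the five SAF columns, using
## the list names as the GeneID of each range.
saf_from_list <- function(x) {
    data.frame(
        GeneID = rep(names(x), vapply(x, nrow, integer(1))),
        Chr = unlist(lapply(x, `[[`, "seqnames"), use.names = FALSE),
        Start = unlist(lapply(x, `[[`, "start"), use.names = FALSE),
        End = unlist(lapply(x, `[[`, "end"), use.names = FALSE),
        Strand = unlist(lapply(x, `[[`, "strand"), use.names = FALSE),
        stringsAsFactors = FALSE
    )
}

saf <- saf_from_list(grouped)
saf
```

A data.frame of this shape can be written to a tab-separated file with write.table(saf, sep="\t", quote=FALSE, row.names=FALSE) if a SAF file on disk is needed.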

Note

Ensembl defines genes not only on standard chromosomes, but also on patched chromosomes and chromosome variants. Thus it might be advisable to restrict the queries to just those chromosomes of interest (e.g. by specifying a SeqnameFilter(c(1:22, "X", "Y"))). In addition, so-called LRG genes (Locus Reference Genomic) are defined in Ensembl. Their gene IDs start with LRG instead of ENS, so a filter can be applied to specifically select or exclude those genes (see examples below).

Depending on the value of the global option "ucscChromosomeNames" (use getOption("ucscChromosomeNames", FALSE) to get its value or options(ucscChromosomeNames=TRUE) to change its value) the sequence/chromosome names of the returned GRanges objects, or those provided in the returned data.frame or DataFrame, correspond to Ensembl chromosome names (if the value is FALSE) or UCSC chromosome names (if TRUE). This ensures a better integration with the Gviz package, in which this option is set to TRUE by default.
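The option can be inspected and changed with base R's getOption and options functions. A minimal sketch follows; restoring the previous value at the end is only a suggestion, not something the package requires.

```r
## Read the current setting; FALSE is returned if the option was never set.
getOption("ucscChromosomeNames", FALSE)

## Switch to UCSC-style chromosome names, keeping the previous settings.
old <- options(ucscChromosomeNames = TRUE)
getOption("ucscChromosomeNames", FALSE)

## Restore the previous settings.
options(old)
```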

Author(s)

Johannes Rainer, Tim Triche

See Also

makeEnsembldbPackage, BasicFilter, listColumns, lengthOf

Examples


library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75

######   genes
##
## get all genes encoded on chromosome Y
AllY <- genes(edb, filter=SeqnameFilter("Y"))
AllY

## return result as DataFrame.
AllY.granges <- genes(edb,
                      filter=SeqnameFilter("Y"),
                      return.type="DataFrame")
AllY.granges

## include all transcripts of the gene and their chromosomal
## coordinates, sort by chrom start of transcripts and return as
## GRanges.
AllY.granges.tx <- genes(edb,
                         filter=SeqnameFilter("Y"),
                         columns=c("gene_id", "seq_name",
                             "seq_strand", "tx_id", "tx_biotype",
                             "tx_seq_start", "tx_seq_end"),
                         order.by="tx_seq_start")
AllY.granges.tx



######   transcripts
##
## get all transcripts of a gene
Tx <- transcripts(edb,
                  filter=GeneidFilter("ENSG00000184895"),
                  order.by="tx_seq_start")
Tx

## get all transcripts of two genes along with some information on the
## gene and transcript
Tx <- transcripts(edb,
                  filter=GeneidFilter(c("ENSG00000184895",
                      "ENSG00000092377")),
                      columns=c("gene_id", "gene_seq_start",
                          "gene_seq_end", "gene_biotype", "tx_biotype"))
Tx

######   promoters
##
## get the bona-fide promoters (2k up- to 200nt downstream of TSS)
promoters(edb, filter=GeneidFilter(c("ENSG00000184895",
                                     "ENSG00000092377")))

######   exons
##
## get all exons of the provided genes
Exon <- exons(edb,
              filter=GeneidFilter(c("ENSG00000184895",
                  "ENSG00000092377")),
              order.by="exon_seq_start",
              columns=c( "gene_id", "gene_seq_start",
                  "gene_seq_end", "gene_biotype"))
Exon



#####    exonsBy
##
## get all exons for transcripts encoded on chromosomes X and Y.
ETx <- exonsBy(edb, by="tx",
               filter=SeqnameFilter(c("X", "Y")))
ETx
## get all exons for genes encoded on chromosomes X and Y and
## include additional annotation columns in the result
EGenes <- exonsBy(edb, by="gene",
                  filter=SeqnameFilter(c("X", "Y")),
                  columns=c("gene_biotype", "gene_name"))
EGenes

## Note that this might also contain "LRG" genes.
length(grep(names(EGenes), pattern="LRG"))

## to fetch just Ensembl genes, use a GeneidFilter with value
## "ENS%" and condition "like"


#####    transcriptsBy
##
TGenes <- transcriptsBy(edb, by="gene",
                        filter=SeqnameFilter(c("X", "Y")))
TGenes

## convert this to a SAF formatted data.frame that can be used by the
## featureCounts function from the Rsubread package.
head(toSAF(TGenes))


#####   transcriptsByOverlaps
##
ir <- IRanges(start=c(2654890, 2709520, 28111770),
              end=c(2654900, 2709550, 28111790))
gr <- GRanges(rep("Y", length(ir)), ir)

## Retrieve all transcripts overlapping any of the regions.
txs <- transcriptsByOverlaps(edb, gr)
txs

## Alternatively, use a GRangesFilter
grf <- GRangesFilter(gr, condition="overlapping")
txs <- transcripts(edb, filter=grf)
txs


####    cdsBy
## Get the coding region for all transcripts on chromosome Y.
## Specifying also additional annotation columns (in addition to the default
## exon_id and exon_rank).
cds <- cdsBy(edb, by="tx", filter=SeqnameFilter("Y"),
             columns=c("tx_biotype", "gene_name"))

####    the 5' untranslated regions:
fUTRs <- fiveUTRsByTranscript(edb, filter=SeqnameFilter("Y"))

####    the 3' untranslated regions with additional column gene_name.
tUTRs <- threeUTRsByTranscript(edb, filter=SeqnameFilter("Y"),
                               columns="gene_name")


Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"

> library(ensembldb)

> ### Name: exonsBy
> ### Title: Retrieve annotation data from an Ensembl based package
> ### Aliases: disjointExons,EnsDb-method cdsBy cdsBy,EnsDb-method
> ###   fiveUTRsByTranscript,EnsDb-method threeUTRsByTranscript,EnsDb-method
> ###   exons exons,EnsDb-method exonsBy exonsBy,EnsDb-method
> ###   exonsByOverlaps,EnsDb-method genes genes,EnsDb-method toSAF
> ###   toSAF,GRangesList-method transcripts transcripts,EnsDb-method
> ###   transcriptsBy transcriptsBy,EnsDb-method
> ###   transcriptsByOverlaps,EnsDb-method promoters promoters,EnsDb-method
> ### Keywords: classes
> 
> ### ** Examples
> 
> 
> library(EnsDb.Hsapiens.v75)
> edb <- EnsDb.Hsapiens.v75
> 
> ######   genes
> ##
> ## get all genes encoded on chromosome Y
> AllY <- genes(edb, filter=SeqnameFilter("Y"))
> AllY
GRanges object with 495 ranges and 5 metadata columns:
                  seqnames               ranges strand |         gene_id
                     <Rle>            <IRanges>  <Rle> |     <character>
  ENSG00000251841        Y   [2652790, 2652894]      + | ENSG00000251841
  ENSG00000184895        Y   [2654896, 2655740]      - | ENSG00000184895
  ENSG00000237659        Y   [2657868, 2658369]      + | ENSG00000237659
  ENSG00000232195        Y   [2696023, 2696259]      + | ENSG00000232195
  ENSG00000129824        Y   [2709527, 2800041]      + | ENSG00000129824
              ...      ...                  ...    ... .             ...
  ENSG00000224240        Y [28695572, 28695890]      + | ENSG00000224240
  ENSG00000227629        Y [28732789, 28737748]      - | ENSG00000227629
  ENSG00000237917        Y [28740998, 28780799]      - | ENSG00000237917
  ENSG00000231514        Y [28772667, 28773306]      - | ENSG00000231514
  ENSG00000235857        Y [59001391, 59001635]      + | ENSG00000235857
                    gene_name    entrezid   gene_biotype seq_coord_system
                  <character> <character>    <character>      <character>
  ENSG00000251841  RNU6-1334P                      snRNA       chromosome
  ENSG00000184895         SRY        6736 protein_coding       chromosome
  ENSG00000237659  RNASEH2CP1                 pseudogene       chromosome
  ENSG00000232195    TOMM22P2                 pseudogene       chromosome
  ENSG00000129824      RPS4Y1        6192 protein_coding       chromosome
              ...         ...         ...            ...              ...
  ENSG00000224240     CYCSP49                 pseudogene       chromosome
  ENSG00000227629  SLC25A15P1                 pseudogene       chromosome
  ENSG00000237917     PARP4P1                 pseudogene       chromosome
  ENSG00000231514     FAM58CP                 pseudogene       chromosome
  ENSG00000235857     CTBP2P1                 pseudogene       chromosome
  -------
  seqinfo: 1 sequence from GRCh37 genome
> 
> ## return result as DataFrame.
> AllY.granges <- genes(edb,
+                       filter=SeqnameFilter("Y"),
+                       return.type="DataFrame")
> AllY.granges
DataFrame with 495 rows and 9 columns
            gene_id   gene_name    entrezid   gene_biotype gene_seq_start
        <character> <character> <character>    <character>      <integer>
1   ENSG00000251841  RNU6-1334P                      snRNA        2652790
2   ENSG00000184895         SRY        6736 protein_coding        2654896
3   ENSG00000237659  RNASEH2CP1                 pseudogene        2657868
4   ENSG00000232195    TOMM22P2                 pseudogene        2696023
5   ENSG00000129824      RPS4Y1        6192 protein_coding        2709527
...             ...         ...         ...            ...            ...
491 ENSG00000224240     CYCSP49                 pseudogene       28695572
492 ENSG00000227629  SLC25A15P1                 pseudogene       28732789
493 ENSG00000237917     PARP4P1                 pseudogene       28740998
494 ENSG00000231514     FAM58CP                 pseudogene       28772667
495 ENSG00000235857     CTBP2P1                 pseudogene       59001391
    gene_seq_end    seq_name seq_strand seq_coord_system
       <integer> <character>  <integer>      <character>
1        2652894           Y          1       chromosome
2        2655740           Y         -1       chromosome
3        2658369           Y          1       chromosome
4        2696259           Y          1       chromosome
5        2800041           Y          1       chromosome
...          ...         ...        ...              ...
491     28695890           Y          1       chromosome
492     28737748           Y         -1       chromosome
493     28780799           Y         -1       chromosome
494     28773306           Y         -1       chromosome
495     59001635           Y          1       chromosome
> 
> ## include all transcripts of the gene and their chromosomal
> ## coordinates, sort by chrom start of transcripts and return as
> ## GRanges.
> AllY.granges.tx <- genes(edb,
+                          filter=SeqnameFilter("Y"),
+                          columns=c("gene_id", "seq_name",
+                              "seq_strand", "tx_id", "tx_biotype",
+                              "tx_seq_start", "tx_seq_end"),
+                          order.by="tx_seq_start")
> AllY.granges.tx
GRanges object with 731 ranges and 5 metadata columns:
                  seqnames               ranges strand |         gene_id
                     <Rle>            <IRanges>  <Rle> |     <character>
  ENSG00000251841        Y   [2652790, 2652894]      + | ENSG00000251841
  ENSG00000184895        Y   [2654896, 2655740]      - | ENSG00000184895
  ENSG00000184895        Y   [2654896, 2655740]      - | ENSG00000184895
  ENSG00000184895        Y   [2654896, 2655740]      - | ENSG00000184895
  ENSG00000237659        Y   [2657868, 2658369]      + | ENSG00000237659
              ...      ...                  ...    ... .             ...
  ENSG00000224240        Y [28695572, 28695890]      + | ENSG00000224240
  ENSG00000227629        Y [28732789, 28737748]      - | ENSG00000227629
  ENSG00000237917        Y [28740998, 28780799]      - | ENSG00000237917
  ENSG00000231514        Y [28772667, 28773306]      - | ENSG00000231514
  ENSG00000235857        Y [59001391, 59001635]      + | ENSG00000235857
                            tx_id             tx_biotype tx_seq_start
                      <character>            <character>    <integer>
  ENSG00000251841 ENST00000516032                  snRNA      2652790
  ENSG00000184895 ENST00000383070         protein_coding      2654896
  ENSG00000184895 ENST00000525526         protein_coding      2655049
  ENSG00000184895 ENST00000534739         protein_coding      2655145
  ENSG00000237659 ENST00000454281   processed_pseudogene      2657868
              ...             ...                    ...          ...
  ENSG00000224240 ENST00000420810   processed_pseudogene     28695572
  ENSG00000227629 ENST00000456738 unprocessed_pseudogene     28732789
  ENSG00000237917 ENST00000435945 unprocessed_pseudogene     28740998
  ENSG00000231514 ENST00000435741   processed_pseudogene     28772667
  ENSG00000235857 ENST00000431853   processed_pseudogene     59001391
                  tx_seq_end
                   <integer>
  ENSG00000251841    2652894
  ENSG00000184895    2655740
  ENSG00000184895    2655644
  ENSG00000184895    2655644
  ENSG00000237659    2658369
              ...        ...
  ENSG00000224240   28695890
  ENSG00000227629   28737748
  ENSG00000237917   28780799
  ENSG00000231514   28773306
  ENSG00000235857   59001635
  -------
  seqinfo: 1 sequence from GRCh37 genome
> 
> 
> 
> ######   transcripts
> ##
> ## get all transcripts of a gene
> Tx <- transcripts(edb,
+                   filter=GeneidFilter("ENSG00000184895"),
+                   order.by="tx_seq_start")
> Tx
GRanges object with 3 ranges and 5 metadata columns:
                  seqnames             ranges strand |           tx_id
                     <Rle>          <IRanges>  <Rle> |     <character>
  ENST00000383070        Y [2654896, 2655740]      - | ENST00000383070
  ENST00000525526        Y [2655049, 2655644]      - | ENST00000525526
  ENST00000534739        Y [2655145, 2655644]      - | ENST00000534739
                      tx_biotype tx_cds_seq_start tx_cds_seq_end
                     <character>        <numeric>      <numeric>
  ENST00000383070 protein_coding          2655030        2655644
  ENST00000525526 protein_coding          2655049        2655644
  ENST00000534739 protein_coding          2655145        2655644
                          gene_id
                      <character>
  ENST00000383070 ENSG00000184895
  ENST00000525526 ENSG00000184895
  ENST00000534739 ENSG00000184895
  -------
  seqinfo: 1 sequence from GRCh37 genome
> 
> ## get all transcripts of two genes along with some information on the
> ## gene and transcript
> Tx <- transcripts(edb,
+                   filter=GeneidFilter(c("ENSG00000184895",
+                       "ENSG00000092377")),
+                       columns=c("gene_id", "gene_seq_start",
+                           "gene_seq_end", "gene_biotype", "tx_biotype"))
> Tx
GRanges object with 6 ranges and 6 metadata columns:
                  seqnames             ranges strand |           tx_id
                     <Rle>          <IRanges>  <Rle> |     <character>
  ENST00000383070        Y [2654896, 2655740]      - | ENST00000383070
  ENST00000525526        Y [2655049, 2655644]      - | ENST00000525526
  ENST00000534739        Y [2655145, 2655644]      - | ENST00000534739
  ENST00000346432        Y [6778727, 6959724]      + | ENST00000346432
  ENST00000355162        Y [6778727, 6959724]      + | ENST00000355162
  ENST00000383032        Y [6778727, 6959724]      + | ENST00000383032
                          gene_id gene_seq_start gene_seq_end   gene_biotype
                      <character>      <integer>    <integer>    <character>
  ENST00000383070 ENSG00000184895        2654896      2655740 protein_coding
  ENST00000525526 ENSG00000184895        2654896      2655740 protein_coding
  ENST00000534739 ENSG00000184895        2654896      2655740 protein_coding
  ENST00000346432 ENSG00000092377        6778727      6959724 protein_coding
  ENST00000355162 ENSG00000092377        6778727      6959724 protein_coding
  ENST00000383032 ENSG00000092377        6778727      6959724 protein_coding
                      tx_biotype
                     <character>
  ENST00000383070 protein_coding
  ENST00000525526 protein_coding
  ENST00000534739 protein_coding
  ENST00000346432 protein_coding
  ENST00000355162 protein_coding
  ENST00000383032 protein_coding
  -------
  seqinfo: 1 sequence from GRCh37 genome
> 
> ######   promoters
> ##
> ## get the bona-fide promoters (2k up- to 200nt downstream of TSS)
> promoters(edb, filter=GeneidFilter(c("ENSG00000184895",
+                                      "ENSG00000092377")))
GRanges object with 6 ranges and 5 metadata columns:
                  seqnames             ranges strand |           tx_id
                     <Rle>          <IRanges>  <Rle> |     <character>
  ENST00000383070        Y [2655541, 2657740]      - | ENST00000383070
  ENST00000525526        Y [2655445, 2657644]      - | ENST00000525526
  ENST00000534739        Y [2655445, 2657644]      - | ENST00000534739
  ENST00000346432        Y [6776727, 6778926]      + | ENST00000346432
  ENST00000355162        Y [6776727, 6778926]      + | ENST00000355162
  ENST00000383032        Y [6776727, 6778926]      + | ENST00000383032
                      tx_biotype tx_cds_seq_start tx_cds_seq_end
                     <character>        <numeric>      <numeric>
  ENST00000383070 protein_coding          2655030        2655644
  ENST00000525526 protein_coding          2655049        2655644
  ENST00000534739 protein_coding          2655145        2655644
  ENST00000346432 protein_coding          6893126        6959533
  ENST00000355162 protein_coding          6893126        6959533
  ENST00000383032 protein_coding          6893126        6959533
                          gene_id
                      <character>
  ENST00000383070 ENSG00000184895
  ENST00000525526 ENSG00000184895
  ENST00000534739 ENSG00000184895
  ENST00000346432 ENSG00000092377
  ENST00000355162 ENSG00000092377
  ENST00000383032 ENSG00000092377
  -------
  seqinfo: 1 sequence from GRCh37 genome
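> 
> ## A minimal sketch, not part of the original run (so no output is shown):
> ## the promoter window can be changed through the upstream and downstream
> ## arguments listed in the Usage section. The values 500 and 100 and the
> ## object name prom are arbitrary, illustrative choices.
> prom <- promoters(edb, upstream=500, downstream=100,
+                   filter=GeneidFilter(c("ENSG00000184895",
+                                         "ENSG00000092377")))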
> 
> ######   exons
> ##
> ## get all exons of the provided genes
> Exon <- exons(edb,
+               filter=GeneidFilter(c("ENSG00000184895",
+                   "ENSG00000092377")),
+               order.by="exon_seq_start",
+               columns=c( "gene_id", "gene_seq_start",
+                   "gene_seq_end", "gene_biotype"))
> Exon
GRanges object with 24 ranges and 5 metadata columns:
                  seqnames             ranges strand |         exon_id
                     <Rle>          <IRanges>  <Rle> |     <character>
  ENSE00001494622        Y [2654896, 2655740]      - | ENSE00001494622
  ENSE00002323146        Y [2655049, 2655069]      - | ENSE00002323146
  ENSE00002201849        Y [2655075, 2655644]      - | ENSE00002201849
  ENSE00002214525        Y [2655145, 2655168]      - | ENSE00002214525
  ENSE00002144027        Y [2655171, 2655644]      - | ENSE00002144027
              ...      ...                ...    ... .             ...
  ENSE00001654905        Y [6953939, 6954013]      + | ENSE00001654905
  ENSE00001729171        Y [6954331, 6954458]      + | ENSE00001729171
  ENSE00001763966        Y [6955308, 6955473]      + | ENSE00001763966
  ENSE00001593786        Y [6958130, 6958231]      + | ENSE00001593786
  ENSE00001370395        Y [6959513, 6959724]      + | ENSE00001370395
                          gene_id gene_seq_start gene_seq_end   gene_biotype
                      <character>      <integer>    <integer>    <character>
  ENSE00001494622 ENSG00000184895        2654896      2655740 protein_coding
  ENSE00002323146 ENSG00000184895        2654896      2655740 protein_coding
  ENSE00002201849 ENSG00000184895        2654896      2655740 protein_coding
  ENSE00002214525 ENSG00000184895        2654896      2655740 protein_coding
  ENSE00002144027 ENSG00000184895        2654896      2655740 protein_coding
              ...             ...            ...          ...            ...
  ENSE00001654905 ENSG00000092377        6778727      6959724 protein_coding
  ENSE00001729171 ENSG00000092377        6778727      6959724 protein_coding
  ENSE00001763966 ENSG00000092377        6778727      6959724 protein_coding
  ENSE00001593786 ENSG00000092377        6778727      6959724 protein_coding
  ENSE00001370395 ENSG00000092377        6778727      6959724 protein_coding
  -------
  seqinfo: 1 sequence from GRCh37 genome
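> 
> ## A minimal sketch, not part of the original run (so no output is shown):
> ## the same query returned as a data.frame instead of a GRanges object,
> ## assuming that "data.frame" is an accepted value of the return.type
> ## argument listed in the Usage section. The name ExonDf is illustrative.
> ExonDf <- exons(edb,
+                 filter=GeneidFilter(c("ENSG00000184895",
+                     "ENSG00000092377")),
+                 order.by="exon_seq_start",
+                 columns=c("gene_id", "gene_seq_start",
+                     "gene_seq_end", "gene_biotype"),
+                 return.type="data.frame")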
> 
> 
> 
> #####    exonsBy
> ##
> ## get all exons for transcripts encoded on chromosomes X and Y.
> ETx <- exonsBy(edb, by="tx",
+                filter=SeqnameFilter(c("X", "Y")))
> ETx
GRangesList object of length 6754:
$ENST00000014935 
GRanges object with 9 ranges and 3 metadata columns:
      seqnames                 ranges strand |         exon_id           tx_id
         <Rle>              <IRanges>  <Rle> |     <character>     <character>
  [1]        X [153637754, 153638375]      - | ENSE00001050512 ENST00000014935
  [2]        X [153637448, 153637532]      - | ENSE00001050508 ENST00000014935
  [3]        X [153633775, 153633996]      - | ENSE00001450968 ENST00000014935
  [4]        X [153633336, 153633424]      - | ENSE00000678418 ENST00000014935
  [5]        X [153633169, 153633255]      - | ENSE00003627274 ENST00000014935
  [6]        X [153631863, 153631963]      - | ENSE00000678413 ENST00000014935
  [7]        X [153631610, 153631722]      - | ENSE00003549633 ENST00000014935
  [8]        X [153631283, 153631531]      - | ENSE00000678409 ENST00000014935
  [9]        X [153630099, 153631182]      - | ENSE00001200781 ENST00000014935
      exon_rank
      <integer>
  [1]         1
  [2]         2
  [3]         3
  [4]         4
  [5]         5
  [6]         6
  [7]         7
  [8]         8
  [9]         9

...
<6753 more elements>
-------
seqinfo: 2 sequences from GRCh37 genome
> ## get all exons for genes encoded on chromosomes X and Y and
> ## include additional annotation columns in the result
> EGenes <- exonsBy(edb, by="gene",
+                   filter=SeqnameFilter(c("X", "Y")),
+                   columns=c("gene_biotype", "gene_name"))
> EGenes
GRangesList object of length 2908:
$ENSG00000000003 
GRanges object with 17 ranges and 4 metadata columns:
       seqnames               ranges strand |   gene_biotype   gene_name
          <Rle>            <IRanges>  <Rle> |    <character> <character>
   [1]        X [99894942, 99894988]      - | protein_coding      TSPAN6
   [2]        X [99891790, 99892101]      - | protein_coding      TSPAN6
   [3]        X [99891605, 99891803]      - | protein_coding      TSPAN6
   [4]        X [99891188, 99891686]      - | protein_coding      TSPAN6
   [5]        X [99890555, 99890743]      - | protein_coding      TSPAN6
   ...      ...                  ...    ... .            ...         ...
  [13]        X [99888439, 99888536]      - | protein_coding      TSPAN6
  [14]        X [99887482, 99887565]      - | protein_coding      TSPAN6
  [15]        X [99887538, 99887565]      - | protein_coding      TSPAN6
  [16]        X [99885756, 99885863]      - | protein_coding      TSPAN6
  [17]        X [99883667, 99884983]      - | protein_coding      TSPAN6
               gene_id         exon_id
           <character>     <character>
   [1] ENSG00000000003 ENSE00001828996
   [2] ENSG00000000003 ENSE00001863395
   [3] ENSG00000000003 ENSE00001855382
   [4] ENSG00000000003 ENSE00001886883
   [5] ENSG00000000003 ENSE00003662440
   ...             ...             ...
  [13] ENSG00000000003 ENSE00001895484
  [14] ENSG00000000003 ENSE00000401072
  [15] ENSG00000000003 ENSE00001849132
  [16] ENSG00000000003 ENSE00000868868
  [17] ENSG00000000003 ENSE00001459322

...
<2907 more elements>
-------
seqinfo: 2 sequences from GRCh37 genome
> 
> ## Note that this might also contain "LRG" genes.
> length(grep(names(EGenes), pattern="LRG"))
[1] 21
> 
> ## to fetch just Ensembl genes, use a GeneidFilter with value
> ## "ENS%" and condition "like"
> 
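> ## A minimal sketch of the filter described above, not part of the original
> ## run (so no output is shown). It assumes that several filters can be
> ## combined by passing them as a list; the name EnsGenes is illustrative.
> EnsGenes <- exonsBy(edb, by="gene",
+                     filter=list(SeqnameFilter(c("X", "Y")),
+                                 GeneidFilter("ENS%", condition="like")),
+                     columns=c("gene_biotype", "gene_name"))
> 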
> 
> #####    transcriptsBy
> ##
> TGenes <- transcriptsBy(edb, by="gene",
+                         filter=SeqnameFilter(c("X", "Y")))
> TGenes
GRangesList object of length 2908:
$ENSG00000000003 
GRanges object with 3 ranges and 4 metadata columns:
      seqnames               ranges strand |           tx_id
         <Rle>            <IRanges>  <Rle> |     <character>
  [1]        X [99888439, 99894988]      - | ENST00000494424
  [2]        X [99883667, 99891803]      - | ENST00000373020
  [3]        X [99887538, 99891686]      - | ENST00000496771
                tx_biotype tx_cds_seq_start tx_cds_seq_end
               <character>        <numeric>      <numeric>
  [1] processed_transcript             <NA>           <NA>
  [2]       protein_coding         99885795       99891691
  [3] processed_transcript             <NA>           <NA>

$ENSG00000000005 
GRanges object with 2 ranges and 4 metadata columns:
      seqnames               ranges strand |           tx_id
  [1]        X [99839799, 99854882]      + | ENST00000373031
  [2]        X [99848621, 99852528]      + | ENST00000485971
                tx_biotype tx_cds_seq_start tx_cds_seq_end
  [1]       protein_coding         99840016       99854714
  [2] processed_transcript             <NA>           <NA>

$ENSG00000001497 
GRanges object with 6 ranges and 4 metadata columns:
      seqnames               ranges strand |           tx_id
  [1]        X [64732463, 64754655]      - | ENST00000484069
  [2]        X [64732463, 64754636]      - | ENST00000312391
  [3]        X [64732463, 64754636]      - | ENST00000374804
  [4]        X [64732462, 64754636]      - | ENST00000374811
  [5]        X [64732462, 64754634]      - | ENST00000374807
  [6]        X [64740309, 64743497]      - | ENST00000469091
                   tx_biotype tx_cds_seq_start tx_cds_seq_end
  [1] nonsense_mediated_decay         64744901       64754595
  [2]          protein_coding         64744901       64754595
  [3]          protein_coding         64732655       64754595
  [4]          protein_coding         64732655       64754595
  [5]          protein_coding         64732655       64754595
  [6]          protein_coding         64740535       64743497

...
<2905 more elements>
-------
seqinfo: 2 sequences from GRCh37 genome
> 
> ## convert this to a SAF-formatted data.frame that can be used by the
> ## featureCounts function from the Rsubread package.
> head(toSAF(TGenes))
           GeneID Chr    Start      End Strand
1 ENSG00000000003   X 99888439 99894988      -
2 ENSG00000000003   X 99883667 99891803      -
3 ENSG00000000003   X 99887538 99891686      -
4 ENSG00000000005   X 99839799 99854882      +
5 ENSG00000000005   X 99848621 99852528      +
6 ENSG00000001497   X 64732463 64754655      -
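> 
> ## A minimal sketch, not part of the original run: save the SAF table as a
> ## tab-delimited file so that it can later be supplied to featureCounts as
> ## an external annotation. The temporary file path and the names saf and
> ## saf_file are illustrative.
> saf <- toSAF(TGenes)
> saf_file <- tempfile(fileext=".saf")
> write.table(saf, file=saf_file, sep="\t", quote=FALSE, row.names=FALSE)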
> 
> 
> #####   transcriptsByOverlaps
> ##
> ir <- IRanges(start=c(2654890, 2709520, 28111770),
+               end=c(2654900, 2709550, 28111790))
> gr <- GRanges(rep("Y", length(ir)), ir)
> 
> ## Retrieve all transcripts overlapping any of the regions.
> txs <- transcriptsByOverlaps(edb, gr)
> txs
GRanges object with 3 ranges and 5 metadata columns:
                  seqnames               ranges strand |           tx_id
                     <Rle>            <IRanges>  <Rle> |     <character>
  ENST00000383070        Y [ 2654896,  2655740]      - | ENST00000383070
  ENST00000250784        Y [ 2709527,  2735309]      + | ENST00000250784
  ENST00000598545        Y [28111776, 28114889]      - | ENST00000598545
                      tx_biotype tx_cds_seq_start tx_cds_seq_end
                     <character>        <numeric>      <numeric>
  ENST00000383070 protein_coding          2655030        2655644
  ENST00000250784 protein_coding          2709666        2734935
  ENST00000598545 protein_coding         28111776       28114889
                          gene_id
                      <character>
  ENST00000383070 ENSG00000184895
  ENST00000250784 ENSG00000129824
  ENST00000598545 ENSG00000269393
  -------
  seqinfo: 1 sequence from GRCh37 genome
> 
> ## Alternatively, use a GRangesFilter
> grf <- GRangesFilter(gr, condition="overlapping")
> txs <- transcripts(edb, filter=grf)
> txs
GRanges object with 3 ranges and 5 metadata columns:
                  seqnames               ranges strand |           tx_id
                     <Rle>            <IRanges>  <Rle> |     <character>
  ENST00000383070        Y [ 2654896,  2655740]      - | ENST00000383070
  ENST00000250784        Y [ 2709527,  2735309]      + | ENST00000250784
  ENST00000598545        Y [28111776, 28114889]      - | ENST00000598545
                      tx_biotype tx_cds_seq_start tx_cds_seq_end
                     <character>        <numeric>      <numeric>
  ENST00000383070 protein_coding          2655030        2655644
  ENST00000250784 protein_coding          2709666        2734935
  ENST00000598545 protein_coding         28111776       28114889
                          gene_id
                      <character>
  ENST00000383070 ENSG00000184895
  ENST00000250784 ENSG00000129824
  ENST00000598545 ENSG00000269393
  -------
  seqinfo: 1 sequence from GRCh37 genome
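> 
> ## A minimal sketch, not part of the original run (so no output is shown):
> ## the same filter passed to genes, on the assumption that a GRangesFilter
> ## is accepted there as it is by transcripts. The name gns is illustrative.
> gns <- genes(edb, filter=grf)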
> 
> 
> ####    cdsBy
> ## Get the coding regions for all transcripts on chromosome Y, requesting
> ## extra annotation columns in addition to the default exon_id and
> ## exon_rank.
> cds <- cdsBy(edb, by="tx", filter=SeqnameFilter("Y"),
+              columns=c("tx_biotype", "gene_name"))
> 
> ####    the 5' untranslated regions:
> fUTRs <- fiveUTRsByTranscript(edb, filter=SeqnameFilter("Y"))
> 
> ####    the 3' untranslated regions with additional column gene_name.
> tUTRs <- threeUTRsByTranscript(edb, filter=SeqnameFilter("Y"),
+                                columns="gene_name")
> 