Last data update: 2014.03.03

R: Extract sequences upstream of a set of genes or transcripts
extractUpstreamSeqsR Documentation

Extract sequences upstream of a set of genes or transcripts

Description

extractUpstreamSeqs is a generic function for extracting sequences upstream of a supplied set of genes or transcripts.

Usage

extractUpstreamSeqs(x, genes, width=1000, ...)

## Dispatch is on the 2nd argument!

## S4 method for signature 'GenomicRanges'
extractUpstreamSeqs(x, genes, width=1000)

## S4 method for signature 'TxDb'
extractUpstreamSeqs(x, genes, width=1000, exclude.seqlevels=NULL)

Arguments

x

An object containing the chromosome sequences from which to extract the upstream sequences. It can be a BSgenome, TwoBitFile, or FaFile object, or any genome sequence container. More formally, x must be an object for which seqinfo and getSeq are defined.

genes

An object containing the locations (i.e. chromosome name, start, end, and strand) of the genes or transcripts with respect to the reference genome. Only GenomicRanges and TxDb objects are supported at the moment. If the latter, the gene locations are obtained by calling the genes function on the TxDb object internally.

width

How many bases to extract upstream of each TSS (transcription start site).

...

Additional arguments, for use in specific methods.

exclude.seqlevels

A character vector containing the chromosome names (a.k.a. sequence levels) to exclude when the genes are obtained from a TxDb object.

Value

A DNAStringSet object containing one upstream sequence per gene (or per transcript if genes is a GenomicRanges object containing transcript ranges).

More precisely, if genes is a GenomicRanges object, the returned object is parallel to it, that is, the i-th element in the returned object is the upstream sequence corresponding to the i-th gene (or transcript) in genes. Also the names on the GenomicRanges object are propagated to the returned object.

If genes is a TxDb object, the names on the returned object are the gene IDs found in the TxDb object. To see the type of gene IDs (i.e. Entrez gene ID or Ensembl gene ID or ...), you can display genes with show(genes).

In addition, the returned object has the following metadata columns (accessible with mcols) that provide some information about the gene (or transcript) corresponding to each upstream sequence:

  • gene_seqnames: the chromosome name of the gene (or transcript);

  • gene_strand: the strand of the gene (or transcript);

  • gene_TSS: the transcription start site of the gene (or transcript).

Note

IMPORTANT: Always make sure to use a TxDb package (or TxDb object) that contains a gene model compatible with the genome sequence container x, that is, a gene model based on the exact same reference genome as x.

See http://bioconductor.org/packages/release/BiocViews.html#___TxDb for the list of TxDb packages available in the current release of Bioconductor. Note that you can make your own custom TxDb object from various annotation resources. See the makeTxDbFromUCSC, makeTxDbFromBiomart, and makeTxDbFromGFF functions for more information about this.

Author(s)

Herv<c3><83><c2><a9> Pag<c3><83><c2><a8>s

See Also

  • The available.genomes function in the BSgenome package for checking avaibility of BSgenome data packages (and installing the desired one).

  • The makeTxDbFromUCSC, makeTxDbFromBiomart, and makeTxDbFromGFF functions for making your own custom TxDb object from various annotation resources.

  • The BSgenome, TwoBitFile, and FaFile classes, defined and documented in the BSgenome, rtracklayer, and Rsamtools packages, respectively.

  • The TxDb class.

  • The genes function for extracting gene ranges from a TxDb object.

  • The GenomicRanges class defined and documented in the GenomicRanges package.

  • The DNAStringSet class defined and documented in the Biostrings package.

  • The seqinfo getter defined and documented in the GenomeInfoDb package.

  • The getSeq function for extracting subsequences from a sequence container.

Examples

## Load a genome:
library(BSgenome.Dmelanogaster.UCSC.dm3)
genome <- BSgenome.Dmelanogaster.UCSC.dm3
genome

## Use a TxDb object:
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
txdb  # contains Ensembl gene IDs

## Because the chrU and chrUextra sequences are made of concatenated
## scaffolds (see http://genome.ucsc.edu/cgi-bin/hgGateway?db=dm3),
## extracting the upstream sequences for genes located on these
## scaffolds is not reliable. So we exclude them:
exclude <- c("chrU", "chrUextra")
up1000seqs <- extractUpstreamSeqs(genome, txdb, width=1000,
                                  exclude.seqlevels=exclude)
up1000seqs  # the names are Ensembl gene IDs
mcols(up1000seqs)

## Upstream sequences for genes close to the chromosome bounds can be
## shorter than 1000 (note that this does not happen for circular
## chromosomes like chrM):
table(width(up1000seqs))
mcols(up1000seqs)[width(up1000seqs) != 1000, ]

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(GenomicFeatures)
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/GenomicFeatures/extractUpstreamSeqs.Rd_%03d_medium.png", width=480, height=480)
> ### Name: extractUpstreamSeqs
> ### Title: Extract sequences upstream of a set of genes or transcripts
> ### Aliases: extractUpstreamSeqs extractUpstreamSeqs,GenomicRanges-method
> ###   extractUpstreamSeqs,TxDb-method
> ###   extractUpstreamSeqs,GRangesList-method
> ### Keywords: manip
> 
> ### ** Examples
> 
> ## Load a genome:
> library(BSgenome.Dmelanogaster.UCSC.dm3)
Loading required package: BSgenome
Loading required package: Biostrings
Loading required package: XVector
Loading required package: rtracklayer
> genome <- BSgenome.Dmelanogaster.UCSC.dm3
> genome
Fly genome:
# organism: Drosophila melanogaster (Fly)
# provider: UCSC
# provider version: dm3
# release date: Apr. 2006
# release name: BDGP Release 5
# 15 sequences:
#   chr2L     chr2R     chr3L     chr3R     chr4      chrX      chrU     
#   chrM      chr2LHet  chr2RHet  chr3LHet  chr3RHet  chrXHet   chrYHet  
#   chrUextra                                                            
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)
> 
> ## Use a TxDb object:
> library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
> txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
> txdb  # contains Ensembl gene IDs
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: dm3
# Organism: Drosophila melanogaster
# Taxonomy ID: 7227
# UCSC Table: ensGene
# Resource URL: http://genome.ucsc.edu/
# Type of Gene ID: Ensembl gene ID
# Full dataset: yes
# miRBase build ID: NA
# transcript_nrow: 29173
# exon_nrow: 76920
# cds_nrow: 62135
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2015-10-07 18:15:53 +0000 (Wed, 07 Oct 2015)
# GenomicFeatures version at creation time: 1.21.30
# RSQLite version at creation time: 1.0.0
# DBSCHEMAVERSION: 1.1
> 
> ## Because the chrU and chrUextra sequences are made of concatenated
> ## scaffolds (see http://genome.ucsc.edu/cgi-bin/hgGateway?db=dm3),
> ## extracting the upstream sequences for genes located on these
> ## scaffolds is not reliable. So we exclude them:
> exclude <- c("chrU", "chrUextra")
> up1000seqs <- extractUpstreamSeqs(genome, txdb, width=1000,
+                                   exclude.seqlevels=exclude)
> up1000seqs  # the names are Ensembl gene IDs
  A DNAStringSet instance of length 15530
        width seq                                           names               
    [1]  1000 TTATTTCAAAAAAGTTAATTA...AAATATTAGACATATCCATTG FBgn0031208
    [2]  1000 ACTCACCGACTACGCCTCTAT...ACGGAATGTATGTGAGGAAGG FBgn0263584
    [3]  1000 GACTTTTGAAAAACTTTCGTG...AGTCGTCTTTTACTTTACGAG FBgn0067779
    [4]  1000 AGATACGGAAAAACCTTCAGA...ATATCAGTTTGAAAAGCTTCG FBgn0031213
    [5]  1000 TCTGGAAGTGAAGGGAGATGT...TTTTAATGTCAAACTCTGGGT FBgn0031214
    ...   ... ...
[15526]  1000 TTCGAGTTAAAAAATTGTATT...ATATGCCAGAATGTCAGACAG FBgn0058441
[15527]  1000 GCTCCATTGATAAGGTAAAAC...CACCGACAACGAGGAAACATT FBgn0085831
[15528]  1000 ATTAGAGATAAAAAACAAAAG...GTCAACTTTCCATAAAGAACC FBgn0085790
[15529]  1000 AATTAGAGATAAAAAACAAAA...GTCAACTTTCCATAAAGAACC FBgn0085791
[15530]  1000 AAAGCTAAGTCTTATAGAAAA...ATCGCAATTAACACCGATACT FBgn0085792
> mcols(up1000seqs)
DataFrame with 15530 rows and 3 columns
      gene_seqnames gene_strand  gene_TSS
              <Rle>       <Rle> <integer>
1             chr2L           +      7529
2             chr2L           +     21952
3             chr2L           +     66584
4             chr2L           +     71757
5             chr2L           +     76348
...             ...         ...       ...
15526       chrYHet           -    205372
15527       chrYHet           -    307365
15528       chrYHet           -    313714
15529       chrYHet           -    320997
15530       chrYHet           -    328489
> 
> ## Upstream sequences for genes close to the chromosome bounds can be
> ## shorter than 1000 (note that this does not happen for circular
> ## chromosomes like chrM):
> table(width(up1000seqs))

  310   353   898  1000 
    1     1     1 15527 
> mcols(up1000seqs)[width(up1000seqs) != 1000, ]
DataFrame with 3 rows and 3 columns
  gene_seqnames gene_strand  gene_TSS
          <Rle>       <Rle> <integer>
1         chr3R           +       354
2      chr3LHet           +       899
3       chrYHet           +       311
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>