An object representing a single chromosome or a collection of chromosomes.
More precisely, x can be a DNAString object
(single chromosome), or a BSgenome object (collection
of chromosomes).
Other objects representing a collection of chromosomes are supported
(e.g. FaFile objects in the Rsamtools package)
as long as seqinfo and
getSeq work on them.
transcripts
An object representing the exon ranges of each transcript to extract.
More precisely:
If x is a DNAString object, then
transcripts must be an RangesList object.
If x is a BSgenome object or any object
representing a collection of chromosomes, then transcripts
must be a GRangesList object or any object
for which exonsBy is implemented (e.g. a TxDb
object). If the latter, then it's first turned into a
GRangesList object with
exonsBy(transcripts, by="tx", ...).
Note that, for each transcript, the exons must be ordered by ascending
rank, that is, by ascending position in the transcript
(when going in the 5' to 3' direction). This generally means (but not
always) that they are also ordered from 5' to 3' on the reference genome.
More precisely:
For a transcript located on the plus strand, the exons will
typically (but not necessarily) be ordered by ascending position
on the reference genome.
For a transcript located on the minus strand, the exons will
typically (but not necessarily) be ordered by descending position
on the reference genome.
If transcripts was obtained with exonsBy (see above),
then the exons are guaranteed to be ordered by ascending rank. See
?exonsBy for more information.
...
Additional arguments, for use in specific methods.
For the default method, additional arguments are allowed only when
transcripts is not a GRangesList object,
in which case they are passed to the internal call to exonsBy
(see above).
strand
Only supported when x is a DNAString object.
Can be an atomic vector, a factor, or an Rle object,
in which case it indicates the strand of each transcript (i.e. all the
exons in a transcript are considered to be on the same strand).
More precisely: it's turned into a factor (or factor-Rle)
that has the "standard strand levels" (this is done by calling the
strand function on it). Then it's recycled
to the length of RangesList object transcripts
if needed. In the resulting object, the i-th element is interpreted
as the strand of all the exons in the i-th transcript.
strand can also be a list-like object, in which case it indicates
the strand of each exon, individually. Thus it must have the same
shape as RangesList object transcripts
(i.e. same length plus strand[[i]] must have the same length
as transcripts[[i]] for all i).
strand can only contain "+" and/or "-" values.
"*" is not allowed.
Value
A DNAStringSet object parallel to
transcripts, that is, the i-th element in it is the sequence
of the i-th transcript in transcripts.
Author(s)
Herv<c3><83><c2><a9> Pag<c3><83><c2><a8>s
See Also
coverageByTranscript for computing coverage by
transcript (or CDS) of a set of ranges.
transcriptLengths for extracting the transcript
lengths from a TxDb object.
The transcriptLocs2refLocs function for converting
transcript-based locations into reference-based locations.
The available.genomes function in the
BSgenome package for checking avaibility of BSgenome
data packages (and installing the desired one).
The DNAString and DNAStringSet
classes defined and documented in the Biostrings package.
The translate function in the
Biostrings package for translating DNA or RNA sequences
into amino acid sequences.
The GRangesList class defined and documented
in the GenomicRanges package.
The RangesList class defined and documented
in the IRanges package.
The exonsBy function for extracting exon ranges
grouped by transcript.
The TxDb class.
Examples
## ---------------------------------------------------------------------
## 1. A TOY EXAMPLE
## ---------------------------------------------------------------------
library(Biostrings)
## A chromosome of length 30:
x <- DNAString("ATTTAGGACACTCCCTGAGGACAAGACCCC")
## 2 transcripts on 'x':
tx1 <- IRanges(1, 8) # 1 exon
tx2 <- c(tx1, IRanges(12, 30)) # 2 exons
transcripts <- IRangesList(tx1=tx1, tx2=tx2)
extractTranscriptSeqs(x, transcripts)
## By default, all the exons are considered to be on the plus strand.
## We can use the 'strand' argument to tell extractTranscriptSeqs()
## to extract them from the minus strand.
## Extract all the exons from the minus strand:
extractTranscriptSeqs(x, transcripts, strand="-")
## Note that, for a transcript located on the minus strand, the exons
## should typically be ordered by descending position on the reference
## genome in order to reflect their rank in the transcript:
extractTranscriptSeqs(x, IRangesList(tx1=tx1, tx2=rev(tx2)), strand="-")
## Extract the exon of the 1st transcript from the minus strand:
extractTranscriptSeqs(x, transcripts, strand=c("-", "+"))
## Extract the 2nd exon of the 2nd transcript from the minus strand:
extractTranscriptSeqs(x, transcripts, strand=list("-", c("+", "-")))
## ---------------------------------------------------------------------
## 2. A REAL EXAMPLE
## ---------------------------------------------------------------------
## Load a genome:
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
## Load a TxDb object:
txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite",
package="GenomicFeatures")
txdb <- loadDb(txdb_file)
## Check that 'txdb' is based on the hg19 assembly:
txdb
## Extract the exon ranges grouped by transcript from 'txdb':
transcripts <- exonsBy(txdb, by="tx", use.names=TRUE)
## Extract the transcript sequences from the genome:
tx_seqs <- extractTranscriptSeqs(genome, transcripts)
tx_seqs
## A sanity check:
stopifnot(identical(width(tx_seqs), unname(sum(width(transcripts)))))
## Note that 'tx_seqs' can also be obtained with:
extractTranscriptSeqs(genome, txdb, use.names=TRUE)
## ---------------------------------------------------------------------
## 3. USING extractTranscriptSeqs() TO EXTRACT CDS SEQUENCES
## ---------------------------------------------------------------------
cds <- cdsBy(txdb, by="tx", use.names=TRUE)
cds_seqs <- extractTranscriptSeqs(genome, cds)
cds_seqs
## A sanity check:
stopifnot(identical(width(cds_seqs), unname(sum(width(cds)))))
## Note that, alternatively, the CDS sequences can be obtained from the
## transcript sequences by removing the 5' and 3' UTRs:
tx_lens <- transcriptLengths(txdb, with.utr5_len=TRUE, with.utr3_len=TRUE)
stopifnot(identical(tx_lens$tx_name, names(tx_seqs))) # sanity
## Keep the rows in 'tx_lens' that correspond to a sequence in 'cds_seqs'
## and put them in the same order as in 'cds_seqs':
m <- match(names(cds_seqs), names(tx_seqs))
tx_lens <- tx_lens[m, ]
utr5_width <- tx_lens$utr5_len
utr3_width <- tx_lens$utr3_len
cds_seqs2 <- narrow(tx_seqs[m],
start=utr5_width+1L, end=-(utr3_width+1L))
stopifnot(identical(as.character(cds_seqs2), as.character(cds_seqs)))
## ---------------------------------------------------------------------
## 4. TRANSLATE THE CDS SEQUENCES
## ---------------------------------------------------------------------
prot_seqs <- translate(cds_seqs, if.fuzzy.codon="solve")
## Note that, by default, translate() uses The Standard Genetic Code to
## translate codons into amino acids. However, depending on the organism,
## a different genetic code might be needed to translate CDS sequences
## located on the mitochodrial chromosome. For example, for vertebrates,
## the following code could be used to correct 'prot_seqs':
SGC1 <- getGeneticCode("SGC1")
chrM_idx <- which(all(seqnames(cds) == "chrM"))
prot_seqs[chrM_idx] <- translate(cds_seqs[chrM_idx], genetic.code=SGC1,
if.fuzzy.codon="solve")
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(GenomicFeatures)
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/GenomicFeatures/extractTranscriptSeqs.Rd_%03d_medium.png", width=480, height=480)
> ### Name: extractTranscriptSeqs
> ### Title: Extract transcript (or CDS) sequences from chromosome sequences
> ### Aliases: extractTranscriptSeqs extractTranscriptSeqs,DNAString-method
> ### extractTranscriptSeqs,ANY-method
> ### Keywords: manip
>
> ### ** Examples
>
> ## ---------------------------------------------------------------------
> ## 1. A TOY EXAMPLE
> ## ---------------------------------------------------------------------
>
> library(Biostrings)
Loading required package: XVector
>
> ## A chromosome of length 30:
> x <- DNAString("ATTTAGGACACTCCCTGAGGACAAGACCCC")
>
> ## 2 transcripts on 'x':
> tx1 <- IRanges(1, 8) # 1 exon
> tx2 <- c(tx1, IRanges(12, 30)) # 2 exons
> transcripts <- IRangesList(tx1=tx1, tx2=tx2)
> extractTranscriptSeqs(x, transcripts)
A DNAStringSet instance of length 2
width seq names
[1] 8 ATTTAGGA tx1
[2] 27 ATTTAGGATCCCTGAGGACAAGACCCC tx2
>
> ## By default, all the exons are considered to be on the plus strand.
> ## We can use the 'strand' argument to tell extractTranscriptSeqs()
> ## to extract them from the minus strand.
>
> ## Extract all the exons from the minus strand:
> extractTranscriptSeqs(x, transcripts, strand="-")
A DNAStringSet instance of length 2
width seq names
[1] 8 TCCTAAAT tx1
[2] 27 TCCTAAATGGGGTCTTGTCCTCAGGGA tx2
>
> ## Note that, for a transcript located on the minus strand, the exons
> ## should typically be ordered by descending position on the reference
> ## genome in order to reflect their rank in the transcript:
> extractTranscriptSeqs(x, IRangesList(tx1=tx1, tx2=rev(tx2)), strand="-")
A DNAStringSet instance of length 2
width seq names
[1] 8 TCCTAAAT tx1
[2] 27 GGGGTCTTGTCCTCAGGGATCCTAAAT tx2
>
> ## Extract the exon of the 1st transcript from the minus strand:
> extractTranscriptSeqs(x, transcripts, strand=c("-", "+"))
A DNAStringSet instance of length 2
width seq names
[1] 8 TCCTAAAT tx1
[2] 27 ATTTAGGATCCCTGAGGACAAGACCCC tx2
>
> ## Extract the 2nd exon of the 2nd transcript from the minus strand:
> extractTranscriptSeqs(x, transcripts, strand=list("-", c("+", "-")))
A DNAStringSet instance of length 2
width seq names
[1] 8 TCCTAAAT tx1
[2] 27 ATTTAGGAGGGGTCTTGTCCTCAGGGA tx2
>
> ## ---------------------------------------------------------------------
> ## 2. A REAL EXAMPLE
> ## ---------------------------------------------------------------------
>
> ## Load a genome:
> library(BSgenome.Hsapiens.UCSC.hg19)
Loading required package: BSgenome
Loading required package: rtracklayer
> genome <- BSgenome.Hsapiens.UCSC.hg19
>
> ## Load a TxDb object:
> txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite",
+ package="GenomicFeatures")
> txdb <- loadDb(txdb_file)
>
> ## Check that 'txdb' is based on the hg19 assembly:
> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: hg19
# Organism: Homo sapiens
# UCSC Table: knownGene
# Resource URL: http://genome.ucsc.edu/
# Type of Gene ID: Entrez Gene ID
# Full dataset: no
# miRBase build ID: NA
# transcript_nrow: 178
# exon_nrow: 620
# cds_nrow: 523
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2014-10-08 10:31:15 -0700 (Wed, 08 Oct 2014)
# GenomicFeatures version at creation time: 1.17.21
# RSQLite version at creation time: 0.11.4
# DBSCHEMAVERSION: 1.0
>
> ## Extract the exon ranges grouped by transcript from 'txdb':
> transcripts <- exonsBy(txdb, by="tx", use.names=TRUE)
>
> ## Extract the transcript sequences from the genome:
> tx_seqs <- extractTranscriptSeqs(genome, transcripts)
> tx_seqs
A DNAStringSet instance of length 178
width seq names
[1] 2053 AATGACGTACTTCGCAGGCGCG...TAAAGGTTTTTCTTAGATTCTG uc001bum.2
[2] 2293 AATGACGTACTTCGCAGGCGCG...TAAAGGTTTTTCTTAGATTCTG uc009vua.2
[3] 2264 AATGACGTACTTCGCAGGCGCG...TAAAGGTTTTTCTTAGATTCTG uc010ogz.1
[4] 577 AAACACTCTGTGTGGCTCCTCG...TAAAGTCTCTTCCTCCAAGTCA uc001fbq.3
[5] 874 AATCGGCTAGGAGCAGCGAGCG...ACCTGGTCCCTACCATCGTGGG uc010pgn.2
... ... ...
[174] 1033 GCCGGAAGGATTCTGATTGCGA...AGGCATAAATAACCAGCAATGA uc011jbw.2
[175] 1590 GCCCGTCAGTGCTGGGAGGGGC...AAAAGCATATTTCTAAGAGGTT uc011jor.1
[176] 1155 GCCCGTCAGTGCTGGGAGGGGC...AAAAGCATATTTCTAAGAGGTT uc011jos.1
[177] 1567 GCCCGTCAGTGCTGGGAGGGGC...ATGGTGGGCTCAGTCATTCCTC uc011jov.1
[178] 696 GCCCGTCAGTGCTGGGAGGGGC...CAGGAAAGGCTGGACCACCTGC uc011jox.1
>
> ## A sanity check:
> stopifnot(identical(width(tx_seqs), unname(sum(width(transcripts)))))
>
> ## Note that 'tx_seqs' can also be obtained with:
> extractTranscriptSeqs(genome, txdb, use.names=TRUE)
A DNAStringSet instance of length 178
width seq names
[1] 2053 AATGACGTACTTCGCAGGCGCG...TAAAGGTTTTTCTTAGATTCTG uc001bum.2
[2] 2293 AATGACGTACTTCGCAGGCGCG...TAAAGGTTTTTCTTAGATTCTG uc009vua.2
[3] 2264 AATGACGTACTTCGCAGGCGCG...TAAAGGTTTTTCTTAGATTCTG uc010ogz.1
[4] 577 AAACACTCTGTGTGGCTCCTCG...TAAAGTCTCTTCCTCCAAGTCA uc001fbq.3
[5] 874 AATCGGCTAGGAGCAGCGAGCG...ACCTGGTCCCTACCATCGTGGG uc010pgn.2
... ... ...
[174] 1033 GCCGGAAGGATTCTGATTGCGA...AGGCATAAATAACCAGCAATGA uc011jbw.2
[175] 1590 GCCCGTCAGTGCTGGGAGGGGC...AAAAGCATATTTCTAAGAGGTT uc011jor.1
[176] 1155 GCCCGTCAGTGCTGGGAGGGGC...AAAAGCATATTTCTAAGAGGTT uc011jos.1
[177] 1567 GCCCGTCAGTGCTGGGAGGGGC...ATGGTGGGCTCAGTCATTCCTC uc011jov.1
[178] 696 GCCCGTCAGTGCTGGGAGGGGC...CAGGAAAGGCTGGACCACCTGC uc011jox.1
>
> ## ---------------------------------------------------------------------
> ## 3. USING extractTranscriptSeqs() TO EXTRACT CDS SEQUENCES
> ## ---------------------------------------------------------------------
>
> cds <- cdsBy(txdb, by="tx", use.names=TRUE)
> cds_seqs <- extractTranscriptSeqs(genome, cds)
> cds_seqs
A DNAStringSet instance of length 152
width seq names
[1] 1401 ATGGAGCCAGAGCTGCTGGTTC...GAAAACAGAACCACCTGGCTAG uc001bum.2
[2] 1641 ATGGAGCCAGAGCTGCTGGTTC...GAAAACAGAACCACCTGGCTAG uc009vua.2
[3] 1101 ATGGTGCTGAAGAAGTCAGGAG...GAAAACAGAACCACCTGGCTAG uc010ogz.1
[4] 345 ATGACTTGCAAAATGTCGCAGC...CCTCGGGGAGGGCACCCCCTAA uc001fbq.3
[5] 813 ATGGCGGGCGGGGCCCGGGAGG...ACCTGGTCCCTACCATCGTGGG uc010pgn.2
... ... ...
[148] 924 ATGCTGAATACAACCTCAGTCA...TAGAGTGATCAGAAGGCTTTGA uc011jbw.2
[149] 1191 ATGGATCCCAGGGGGACCAAGA...CAAGACTCCCTCTCCCAAATAG uc011jor.1
[150] 414 ATGCACAGCCCTGGCCAATGGA...CAAGACTCCCTCTCCCAAATAG uc011jos.1
[151] 360 ATGGATCCCAGGGGGACCAAGA...CCGAGGCCGGTTGGAGGGGTGA uc011jov.1
[152] 318 ATGGATCCCAGGGGGACCAAGA...CCAGGAAAGGCTGGACCACCTG uc011jox.1
>
> ## A sanity check:
> stopifnot(identical(width(cds_seqs), unname(sum(width(cds)))))
>
> ## Note that, alternatively, the CDS sequences can be obtained from the
> ## transcript sequences by removing the 5' and 3' UTRs:
> tx_lens <- transcriptLengths(txdb, with.utr5_len=TRUE, with.utr3_len=TRUE)
> stopifnot(identical(tx_lens$tx_name, names(tx_seqs))) # sanity
> ## Keep the rows in 'tx_lens' that correspond to a sequence in 'cds_seqs'
> ## and put them in the same order as in 'cds_seqs':
> m <- match(names(cds_seqs), names(tx_seqs))
> tx_lens <- tx_lens[m, ]
> utr5_width <- tx_lens$utr5_len
> utr3_width <- tx_lens$utr3_len
> cds_seqs2 <- narrow(tx_seqs[m],
+ start=utr5_width+1L, end=-(utr3_width+1L))
> stopifnot(identical(as.character(cds_seqs2), as.character(cds_seqs)))
>
> ## ---------------------------------------------------------------------
> ## 4. TRANSLATE THE CDS SEQUENCES
> ## ---------------------------------------------------------------------
>
> prot_seqs <- translate(cds_seqs, if.fuzzy.codon="solve")
>
> ## Note that, by default, translate() uses The Standard Genetic Code to
> ## translate codons into amino acids. However, depending on the organism,
> ## a different genetic code might be needed to translate CDS sequences
> ## located on the mitochodrial chromosome. For example, for vertebrates,
> ## the following code could be used to correct 'prot_seqs':
> SGC1 <- getGeneticCode("SGC1")
> chrM_idx <- which(all(seqnames(cds) == "chrM"))
> prot_seqs[chrM_idx] <- translate(cds_seqs[chrM_idx], genetic.code=SGC1,
+ if.fuzzy.codon="solve")
>
>
>
>
>
> dev.off()
null device
1
>