Last data update: 2014.03.03

R: BSgenome objects
BSgenome-classR Documentation

BSgenome objects

Description

The BSgenome class is a container for storing the full genome sequences of a given organism. BSgenome objects are usually made in advance by a volunteer and made available to the Bioconductor community as "BSgenome data packages". See ?available.genomes for how to get the list of "BSgenome data packages" curently available.

Accessor methods

In the code snippets below, x is a BSgenome object. Note that, because the BSgenome class contains the GenomeDescription class, then all the accessor methods for GenomeDescription objects can also be used on x.

sourceUrl(x)

Returns the source URL i.e. the permanent URL to the place where the FASTA files used to produce the sequences contained in x can be found (and downloaded).

seqnames(x), seqnames(x) <- value

Gets or sets the names of the single sequences contained in x. Each single sequence is stored in a DNAString or MaskedDNAString object and typically comes from a source file (FASTA) with a single record. The names returned by seqnames(x) usually reflect the names of those source files but a common prefix or suffix was eventually removed in order to keep them as short as possible.

seqlengths(x)

Returns the lengths of the single sequences contained in x.

See ?`length,XVector-method` and ?`length,MaskedXString-method` for the definition of the length of a DNAString or MaskedDNAString object. Note that the length of a masked sequence (MaskedXString object) is not affected by the current set of active masks but the nchar method for MaskedXString objects is.

names(seqlengths(x)) is guaranteed to be identical to seqnames(x).

mseqnames(x)

Returns the index of the multiple sequences contained in x. Each multiple sequence is stored in a DNAStringSet object and typically comes from a source file (FASTA) with multiple records. The names returned by mseqnames(x) usually reflect the names of those source files but a common prefix or suffix was eventually removed in order to keep them as short as possible.

names(x)

Returns the index of all sequences contained in x. This is the same as c(seqnames(x), mseqnames(x)).

length(x)

Returns the length of x, i.e., the total number of sequences in it (single and multiple sequences). This is the same as length(names(x)).

x[[name]]

Returns the sequence (single or multiple) in x named name (name must be a single string). No sequence is actually loaded into memory until this is explicitely requested with a call to x[[name]] or x$name. When loaded, a sequence is kept in a cache. It will be automatically removed from the cache at garbage collection if it's not in use anymore i.e. if there are no reference to it (other than the reference stored in the cache). With options(verbose=TRUE), a message is printed each time a sequence is removed from the cache.

x$name

Same as x[[name]] but name is not evaluated and therefore must be a literal character string or a name (possibly backtick quoted).

masknames(x)

The names of the built-in masks that are defined for all the single sequences. There can be up to 4 built-in masks per sequence. These will always be (in this order): (1) the mask of assembly gaps, aka "the AGAPS mask";

(2) the mask of intra-contig ambiguities, aka "the AMB mask";

(3) the mask of repeat regions that were determined by the RepeatMasker software, aka "the RM mask";

(4) the mask of repeat regions that were determined by the Tandem Repeats Finder software (where only repeats with period less than or equal to 12 were kept), aka "the TRF mask".

All the single sequences in a given package are guaranteed to have the same collection of built-in masks (same number of masks and in the same order).

masknames(x) gives the names of the masks in this collection. Therefore the value returned by masknames(x) is a character vector made of the first N elements of c("AGAPS", "AMB", "RM", "TRF"), where N depends only on the BSgenome data package being looked at (0 <= N <= 4). The man page for most BSgenome data packages should provide the exact list and permanent URLs of the source data files that were used to extract the built-in masks. For example, if you've installed the BSgenome.Hsapiens.UCSC.hg38 package, load it and see the Note section in ?`BSgenome.Hsapiens.UCSC.hg38`.

Author(s)

H. Pag<c3><83><c2><a8>s

See Also

available.genomes, GenomeDescription-class, BSgenome-utils, DNAString-class, DNAStringSet-class, MaskedDNAString-class, getSeq,BSgenome-method, injectSNPs, subseq,XVector-method, rm, gc

Examples

## Loading a BSgenome data package doesn't load its sequences
## into memory:
library(BSgenome.Celegans.UCSC.ce2)

## Number of sequences in this genome:
length(Celegans) 

## Display a summary of the sequences:
Celegans

## Index of single sequences:
seqnames(Celegans)

## Lengths (i.e. number of nucleotides) of the single sequences:
seqlengths(Celegans)

## Load chromosome I from disk to memory (hence takes some time)
## and keep a reference to it:
chrI <- Celegans[["chrI"]]  # equivalent to Celegans$chrI

chrI

class(chrI)   # a DNAString instance
length(chrI)  # with 15080483 nucleotides

## Single sequence can be renamed:
seqnames(Celegans) <- sub("^chr", "", seqnames(Celegans))
seqlengths(Celegans)
Celegans$I
seqnames(Celegans) <- paste0("chr", seqnames(Celegans))

## Multiple sequences:
library(BSgenome.Rnorvegicus.UCSC.rn5)
rn5 <- BSgenome.Rnorvegicus.UCSC.rn5
rn5
seqnames(rn5)
rn5_chr1 <- rn5$chr1
mseqnames(rn5) 
rn5_random  <- Rnorvegicus$random
rn5_random
class(rn5_random)  # a DNAStringSet instance
## Character vector containing the description lines of the first
## 4 sequences in the original FASTA file:
names(rn5_random)[1:4]

## ---------------------------------------------------------------------
## PASS-BY-ADDRESS SEMANTIC, CACHING AND MEMORY USAGE
## ---------------------------------------------------------------------

## We want a message to be printed each time a sequence is removed
## from the cache:
options(verbose=TRUE)

gc()  # nothing seems to be removed from the cache
rm(rn5_chr1, rn5_random)
gc()  # rn5_chr1 and rn5_random are removed from the cache (they are
      # not in use anymore)

options(verbose=FALSE)

## Get the current amount of data in memory (in Mb):
mem0 <- gc()["Vcells", "(Mb)"]

system.time(rn5_chr2 <- rn5$chr2)  # read from disk
  
gc()["Vcells", "(Mb)"] - mem0  # 'rn5_chr2' occupies 20Mb in memory

system.time(tmp <- rn5$chr2)  # much faster! (sequence
                              # is in the cache)

gc()["Vcells", "(Mb)"] - mem0  # we're still using 20Mb (sequences
                               # have a pass-by-address semantic
                               # i.e. the sequence data are not
                               # duplicated)
  
## subseq() doesn't copy the sequence data either, hence it is very
## fast and memory efficient (but the returned object will hold a
## reference to 'rn5_chr2'):
y <- subseq(rn5_chr2, 10, 8000000) 
gc()["Vcells", "(Mb)"] - mem0

## We must remove all references to 'rn5_chr2' before it can be
## removed from the cache (so the 20Mb of memory used by this
## sequence are freed):
options(verbose=TRUE)
rm(rn5_chr2, tmp)
gc()

## Remember that 'y' holds a reference to 'rn5_chr2' too:
rm(y)
gc()

options(verbose=FALSE)
gc()["Vcells", "(Mb)"] - mem0

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(BSgenome)
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: Biostrings
Loading required package: XVector
Loading required package: rtracklayer
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/BSgenome/BSgenome-class.Rd_%03d_medium.png", width=480, height=480)
> ### Name: BSgenome-class
> ### Title: BSgenome objects
> ### Aliases: class:BSgenome BSgenome-class BSgenome length,BSgenome-method
> ###   sourceUrl sourceUrl,BSgenome-method mseqnames
> ###   mseqnames,BSgenome-method names,BSgenome-method masknames
> ###   masknames,BSgenome-method seqinfo,BSgenome-method
> ###   seqinfo<-,BSgenome-method seqnames<-,BSgenome-method BSgenome
> ###   as.list,BSgenome-method show,BSgenome-method [[,BSgenome-method
> ###   $,BSgenome-method
> ### Keywords: methods classes
> 
> ### ** Examples
> 
> ## Loading a BSgenome data package doesn't load its sequences
> ## into memory:
> library(BSgenome.Celegans.UCSC.ce2)
> 
> ## Number of sequences in this genome:
> length(Celegans) 
[1] 7
> 
> ## Display a summary of the sequences:
> Celegans
Worm genome:
# organism: Caenorhabditis elegans (Worm)
# provider: UCSC
# provider version: ce2
# release date: Mar. 2004
# release name: WormBase v. WS120
# 7 sequences:
#   chrI   chrII  chrIII chrIV  chrV   chrX   chrM                              
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)
> 
> ## Index of single sequences:
> seqnames(Celegans)
[1] "chrI"   "chrII"  "chrIII" "chrIV"  "chrV"   "chrX"   "chrM"  
> 
> ## Lengths (i.e. number of nucleotides) of the single sequences:
> seqlengths(Celegans)
    chrI    chrII   chrIII    chrIV     chrV     chrX     chrM 
15080483 15279308 13783313 17493791 20922231 17718849    13794 
> 
> ## Load chromosome I from disk to memory (hence takes some time)
> ## and keep a reference to it:
> chrI <- Celegans[["chrI"]]  # equivalent to Celegans$chrI
> 
> chrI
  15080483-letter "DNAString" instance
seq: GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA...TTAGGCTTAGGCTTAGGCTTAGGTTTAGGCTTAGGC
> 
> class(chrI)   # a DNAString instance
[1] "DNAString"
attr(,"package")
[1] "Biostrings"
> length(chrI)  # with 15080483 nucleotides
[1] 15080483
> 
> ## Single sequence can be renamed:
> seqnames(Celegans) <- sub("^chr", "", seqnames(Celegans))
> seqlengths(Celegans)
       I       II      III       IV        V        X        M 
15080483 15279308 13783313 17493791 20922231 17718849    13794 
> Celegans$I
  15080483-letter "DNAString" instance
seq: GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA...TTAGGCTTAGGCTTAGGCTTAGGTTTAGGCTTAGGC
> seqnames(Celegans) <- paste0("chr", seqnames(Celegans))
> 
> ## Multiple sequences:
> library(BSgenome.Rnorvegicus.UCSC.rn5)
> rn5 <- BSgenome.Rnorvegicus.UCSC.rn5
> rn5
Rat genome:
# organism: Rattus norvegicus (Rat)
# provider: UCSC
# provider version: rn5
# release date: Mar. 2012
# release name: RGSC 5.0
# 
# 22 single sequences:
#   chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chr10 chr11 chr12
#   chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chrX  chrM             
# (use 'seqnames()' to see all the single sequence names, use the '$' or '[['
# operator to access a given single sequence)
# 
# multiple sequences:
#   random chrUn                                                                
# (use 'mseqnames()' to see all the multiple sequence names, use the '$' or '[['
# operator to access a given multiple sequence)
> seqnames(rn5)
 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chrX"  "chrM" 
> rn5_chr1 <- rn5$chr1
> mseqnames(rn5) 
[1] "random" "chrUn" 
> rn5_random  <- Rnorvegicus$random
> rn5_random
  A DNAStringSet instance of length 1278
        width seq                                           names               
   [1]   1013 AGTCTTGAACTCTTCTTCGTT...GGAATGCTACAACCTAGAAAT chr10_AABR0611010...
   [2]   1765 CAGGAGTAAAGTCTTCTGAAC...ACTAAATCCCCAACCCCGGTG chr10_JH620367_ra...
   [3]    780 AGCACACAATCTGGGAGAATA...GTTCAGAAGACTTTACCCCGG chr10_AABR0611010...
   [4]   4563 AAGACTGGAGAGATGGCTCAG...TGCTCGCCAGCTCGAGCTGGA chr10_AABR0611010...
   [5]   2250 TTGTTAGAGGTGGAGTTATGT...ATCAAAAGTTTAAGATTACCA chr10_AABR0611010...
   ...    ... ...
[1274]  15658 GGATAGTAAGTATAGAAGAGA...TAGGAACACAACTTTGAAGAA chrX_JH620457_random
[1275]    548 ATCAGATAGGTTTAATGCAGA...GAAATCAGGGACTAGACAAGG chrX_AABR06110855...
[1276]   1012 AAAGATATTCTGTAATTTGGT...AGCCATCAGGTTGTCTCTGGA chrX_AABR06110856...
[1277]   2100 TTCTCATGACAAATTTGCTTT...AGAAGGACAGGCAACCCTTTC chrX_JH620458_random
[1278]   5429 TTCTCTTGGGAAATTTTAGCT...TAGAGTGTTATTCCCTTCCCG chrX_JH620459_random
> class(rn5_random)  # a DNAStringSet instance
[1] "DNAStringSet"
attr(,"package")
[1] "Biostrings"
> ## Character vector containing the description lines of the first
> ## 4 sequences in the original FASTA file:
> names(rn5_random)[1:4]
[1] "chr10_AABR06110104_random" "chr10_JH620367_random"    
[3] "chr10_AABR06110107_random" "chr10_AABR06110108_random"
> 
> ## ---------------------------------------------------------------------
> ## PASS-BY-ADDRESS SEMANTIC, CACHING AND MEMORY USAGE
> ## ---------------------------------------------------------------------
> 
> ## We want a message to be printed each time a sequence is removed
> ## from the cache:
> options(verbose=TRUE)
> 
> gc()  # nothing seems to be removed from the cache
Garbage collection 71 = 46+11+14 (level 2) ... 
146.7 Mbytes of cons cells used (58%)
308.8 Mbytes of vectors used (70%)
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  2745522 146.7    4703850 251.3  3205452 171.2
Vcells 40474731 308.8   57918492 441.9 40530568 309.3
> rm(rn5_chr1, rn5_random)
> gc()  # rn5_chr1 and rn5_random are removed from the cache (they are
Garbage collection 72 = 46+11+15 (level 2) ... 
146.7 Mbytes of cons cells used (58%)
308.8 Mbytes of vectors used (58%)
uncaching random
uncaching chr1
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  2745651 146.7    4703850 251.3  3205452 171.2
Vcells 40475078 308.9   69582190 530.9 40530568 309.3
>       # not in use anymore)
> 
> options(verbose=FALSE)
> 
> ## Get the current amount of data in memory (in Mb):
> mem0 <- gc()["Vcells", "(Mb)"]
> 
> system.time(rn5_chr2 <- rn5$chr2)  # read from disk
   user  system elapsed 
  1.216   0.084   1.301 
>   
> gc()["Vcells", "(Mb)"] - mem0  # 'rn5_chr2' occupies 20Mb in memory
[1] 271.8
> 
> system.time(tmp <- rn5$chr2)  # much faster! (sequence
   user  system elapsed 
  0.000   0.000   0.001 
>                               # is in the cache)
> 
> gc()["Vcells", "(Mb)"] - mem0  # we're still using 20Mb (sequences
[1] 271.8
>                                # have a pass-by-address semantic
>                                # i.e. the sequence data are not
>                                # duplicated)
>   
> ## subseq() doesn't copy the sequence data either, hence it is very
> ## fast and memory efficient (but the returned object will hold a
> ## reference to 'rn5_chr2'):
> y <- subseq(rn5_chr2, 10, 8000000) 
> gc()["Vcells", "(Mb)"] - mem0
[1] 271.8
> 
> ## We must remove all references to 'rn5_chr2' before it can be
> ## removed from the cache (so the 20Mb of memory used by this
> ## sequence are freed):
> options(verbose=TRUE)
> rm(rn5_chr2, tmp)
> gc()
Garbage collection 79 = 46+11+22 (level 2) ... 
146.5 Mbytes of cons cells used (58%)
300.0 Mbytes of vectors used (61%)
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  2742950 146.5    4703850 251.3  3205452 171.2
Vcells 39317664 300.0   64302945 490.6 40530568 309.3
> 
> ## Remember that 'y' holds a reference to 'rn5_chr2' too:
> rm(y)
> gc()
Garbage collection 80 = 46+11+23 (level 2) ... 
146.5 Mbytes of cons cells used (58%)
300.0 Mbytes of vectors used (61%)
uncaching chr2
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  2742873 146.5    4703850 251.3  3205452 171.2
Vcells 39317645 300.0   64302945 490.6 40530568 309.3
> 
> options(verbose=FALSE)
> gc()["Vcells", "(Mb)"] - mem0
[1] 0
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>