Last data update: 2014.03.03

R: The BSgenomeForge functions
BSgenomeForgeR Documentation

The BSgenomeForge functions

Description

A set of functions for making a BSgenome data package.

Usage

## Top-level BSgenomeForge function:

forgeBSgenomeDataPkg(x, seqs_srcdir=".", destdir=".", verbose=TRUE)

## Low-level BSgenomeForge functions:

forgeSeqlengthsFile(seqnames, prefix="", suffix=".fa",
                    seqs_srcdir=".", seqs_destdir=".", verbose=TRUE)

forgeSeqFiles(seqnames, mseqnames=NULL,
              seqfile_name=NA, prefix="", suffix=".fa",
              seqs_srcdir=".", seqs_destdir=".",
              ondisk_seq_format=c("2bit", "rda", "fa.rz", "fa"),
              verbose=TRUE)

forgeMasksFiles(seqnames, nmask_per_seq,
                seqs_destdir=".",
                ondisk_seq_format=c("2bit", "rda", "fa.rz", "fa"),
                masks_srcdir=".", masks_destdir=".",
                AGAPSfiles_type="gap", AGAPSfiles_name=NA,
                AGAPSfiles_prefix="", AGAPSfiles_suffix="_gap.txt",
                RMfiles_name=NA, RMfiles_prefix="", RMfiles_suffix=".fa.out",
                TRFfiles_name=NA, TRFfiles_prefix="", TRFfiles_suffix=".bed",
                verbose=TRUE)

Arguments

x

A BSgenomeDataPkgSeed object or the name of a BSgenome data package seed file. See the BSgenomeForge vignette in this package for more information.

seqs_srcdir, masks_srcdir

Single strings indicating the path to the source directories i.e. to the directories containing the source data files. Only read access to these directories is needed. See the BSgenomeForge vignette in this package for more information.

destdir

A single string indicating the path to the directory where the source tree of the target package should be created. This directory must already exist. See the BSgenomeForge vignette in this package for more information.

ondisk_seq_format

Specifies how the single sequences should be stored in the forged package. Can be "2bit", "rda", "fa.rz", or "fa". If "2bit" (the default), then all the single sequences are stored in a single twoBit file. If "rda", then each single sequence is stored in a separated serialized XString object (one per single sequence). If "fa.rz" or "fa", then all the single sequences are stored in a single FASTA file (compressed in the RAZip format if "fa.rz").

verbose

TRUE or FALSE.

seqnames, mseqnames

A character vector containing the names of the single (for seqnames) and multiple (for mseqnames) sequences to forge. See the BSgenomeForge vignette in this package for more information.

seqfile_name, prefix, suffix

See the BSgenomeForge vignette in this package for more information, in particular the description of the seqfile_name, seqfiles_prefix and seqfiles_suffix fields of a BSgenome data package seed file.

seqs_destdir, masks_destdir

During the forging process the source data files are converted into serialized Biostrings objects. seqs_destdir and masks_destdir must be single strings indicating the path to the directories where these serialized objects should be saved. These directories must already exist.

forgeSeqlengthsFile will produce a single .rda file. Both forgeSeqFiles and forgeMasksFiles will produce one .rda file per sequence.

nmask_per_seq

A single integer indicating the desired number of masks per sequence. See the BSgenomeForge vignette in this package for more information.

AGAPSfiles_type, AGAPSfiles_name, AGAPSfiles_prefix, AGAPSfiles_suffix, RMfiles_name, RMfiles_prefix, RMfiles_suffix, TRFfiles_name, TRFfiles_prefix, TRFfiles_suffix

These arguments are named accordingly to the corresponding fields of a BSgenome data package seed file. See the BSgenomeForge vignette in this package for more information.

Details

These functions are intended for Bioconductor users who want to make a new BSgenome data package, not for regular users of these packages. See the BSgenomeForge vignette in this package (vignette("BSgenomeForge")) for an extensive coverage of this topic.

Author(s)

H. Pag<c3><83><c2><a8>s

Examples

seqs_srcdir <- system.file("extdata", package="BSgenome")
seqnames <- c("chrX", "chrM")

## Forge .rda sequence files:
forgeSeqFiles(seqnames, prefix="ce2", suffix=".fa.gz",
              seqs_srcdir=seqs_srcdir,
              seqs_destdir=tempdir(), ondisk_seq_format="rda")

## Forge .2bit sequence files:
forgeSeqFiles(seqnames, prefix="ce2", suffix=".fa.gz",
              seqs_srcdir=seqs_srcdir,
              seqs_destdir=tempdir(), ondisk_seq_format="2bit")

## Sanity checks:
library(BSgenome.Celegans.UCSC.ce2)
genome <- BSgenome.Celegans.UCSC.ce2

load(file.path(tempdir(), "chrX.rda"))
stopifnot(genome$chrX == chrX)
load(file.path(tempdir(), "chrM.rda"))
stopifnot(genome$chrM == chrM)

ce2_sequences <- import(file.path(tempdir(), "single_sequences.2bit"))
ce2_sequences0 <- DNAStringSet(list(chrX=genome$chrX, chrM=genome$chrM))
stopifnot(identical(names(ce2_sequences0), names(ce2_sequences)) &&
          all(ce2_sequences0 == ce2_sequences))

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(BSgenome)
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: Biostrings
Loading required package: XVector
Loading required package: rtracklayer
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/BSgenome/BSgenomeForge.Rd_%03d_medium.png", width=480, height=480)
> ### Name: BSgenomeForge
> ### Title: The BSgenomeForge functions
> ### Aliases: BSgenomeForge BSgenomeDataPkgSeed class:BSgenomeDataPkgSeed
> ###   BSgenomeDataPkgSeed-class forgeBSgenomeDataPkg
> ###   forgeBSgenomeDataPkg,BSgenomeDataPkgSeed-method
> ###   forgeBSgenomeDataPkg,list-method
> ###   forgeBSgenomeDataPkg,character-method forgeSeqlengthsFile
> ###   forgeSeqFiles forgeMasksFiles
> ### Keywords: manip
> 
> ### ** Examples
> 
> seqs_srcdir <- system.file("extdata", package="BSgenome")
> seqnames <- c("chrX", "chrM")
> 
> ## Forge .rda sequence files:
> forgeSeqFiles(seqnames, prefix="ce2", suffix=".fa.gz",
+               seqs_srcdir=seqs_srcdir,
+               seqs_destdir=tempdir(), ondisk_seq_format="rda")
Loading FASTA file '/home/ddbj/local/lib64/R/library/BSgenome/extdata/ce2chrX.fa.gz' in 'chrX' object ... DONE
Saving 'chrX' object to compressed data file '/tmp/Rtmp9aMazk/chrX.rda' ... DONE
Loading FASTA file '/home/ddbj/local/lib64/R/library/BSgenome/extdata/ce2chrM.fa.gz' in 'chrM' object ... DONE
Saving 'chrM' object to compressed data file '/tmp/Rtmp9aMazk/chrM.rda' ... DONE
> 
> ## Forge .2bit sequence files:
> forgeSeqFiles(seqnames, prefix="ce2", suffix=".fa.gz",
+               seqs_srcdir=seqs_srcdir,
+               seqs_destdir=tempdir(), ondisk_seq_format="2bit")
Loading 'chrX' sequence from FASTA file '/home/ddbj/local/lib64/R/library/BSgenome/extdata/ce2chrX.fa.gz' ... DONE
Loading 'chrM' sequence from FASTA file '/home/ddbj/local/lib64/R/library/BSgenome/extdata/ce2chrM.fa.gz' ... DONE
Writing all sequences to '/tmp/Rtmp9aMazk/single_sequences.2bit' ... DONE
> 
> ## Sanity checks:
> library(BSgenome.Celegans.UCSC.ce2)
> genome <- BSgenome.Celegans.UCSC.ce2
> 
> load(file.path(tempdir(), "chrX.rda"))
> stopifnot(genome$chrX == chrX)
> load(file.path(tempdir(), "chrM.rda"))
> stopifnot(genome$chrM == chrM)
> 
> ce2_sequences <- import(file.path(tempdir(), "single_sequences.2bit"))
> ce2_sequences0 <- DNAStringSet(list(chrX=genome$chrX, chrM=genome$chrM))
> stopifnot(identical(names(ce2_sequences0), names(ce2_sequences)) &&
+           all(ce2_sequences0 == ce2_sequences))
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>