Last data update: 2014.03.03

R: Function to simulate randomly generated DNA sequence.
sim.DNAseqR Documentation

Function to simulate randomly generated DNA sequence.

Description

The function randomly generated DNA sequence of a given length and with fixed GC content to simulate DNA sequence representing a (proportion of the) genome for non-model species without available reference genome sequence.

Usage

sim.DNAseq(size = 10000, GCfreq = 0.46)

Arguments

size

DNA sequence length to generate in bp. Could be the known or estimated size of the species genome or more conveniently a fraction large enough to be representative of the full length size to limit memory usage and speed up computations.

GCfreq

GC content expressed as the proportion of G and C bases in the genome.

Details

If no reference genome sequence or some reference contigs are available for a species, knowledge of GC content and genome size are needed to generate random DNA sequence representative of a species. If reference genome sequence (even draft or unassembled contigs) are available, a randomly generated sequence can be simulate following CG content and length of an import proportion of the reference sequence for comparison purpose (see example below).

Value

A single continuous DNA sequence (character string).

Author(s)

Olivier Lepais

References

Lepais O & Weir JT. 2014. SimRAD: an R package for simulation-based prediction of the number of loci expected in RADseq and similar genotyping by sequencing approaches. Molecular Ecology Resources, 14, 1314-1321. DOI: 10.1111/1755-0998.12273.

See Also

ref.DNAseq, insilico.digest.

Examples


## example 1: simulation of random sequence for non-model species without reference genome:
# generating 1Mb of DNA sequence with 44.4% GC content: 
smsq <- sim.DNAseq(size=1000000, GCfreq=0.444)

# length:
width(smsq)

# GC content:
require(seqinr)
GC(s2c(smsq))


## example 2: simulating random sequence with parameter following a known reference genome sequence:

# generating a Fasta file for the example:
sq<-c()
for(i in 1:10){
sq <- c(sq, sim.DNAseq(size=rpois(1, 1000), GCfreq=0.444))
}
sq <- DNAStringSet(sq)
writeFasta(sq, file="SimRAD-exampleRefSeq-Fasta.fa", mode="w")

# importing the Fasta file and sub-selecting 25% of the contigs
rfsq <- ref.DNAseq("SimRAD-exampleRefSeq-Fasta.fa", subselect.contigs = TRUE, prop.contigs = 0.25)

# length of the reference sequence:
width(rfsq)

# computing GC content:
require(seqinr)
GC(s2c(rfsq))

# simulating random generated DNA sequence with characteristics equivalent to 
#    the sub-selected reference genome for comparison purpose:
smsq <- sim.DNAseq(size=width(rfsq), GCfreq=GC(s2c(rfsq)))


Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(SimRAD)
Loading required package: Biostrings
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: XVector
Loading required package: ShortRead
Loading required package: BiocParallel
Loading required package: Rsamtools
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: GenomicAlignments
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: zlibbioc
> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/SimRAD/sim.DNAseq.Rd_%03d_medium.png", width=480, height=480)
> ### Name: sim.DNAseq
> ### Title: Function to simulate randomly generated DNA sequence.
> ### Aliases: sim.DNAseq
> ### Keywords: double digestion restriction enzyme library construction
> ###   adaptator ligation fragment selection restriction exclusion RESTseq
> ###   RAD GBS ddRAD ezRAD
> 
> ### ** Examples
> 
> 
> ## example 1: simulation of random sequence for non-model species without reference genome:
> # generating 1Mb of DNA sequence with 44.4% GC content: 
> smsq <- sim.DNAseq(size=1000000, GCfreq=0.444)
> 
> # length:
> width(smsq)
[1] 1000000
> 
> # GC content:
> require(seqinr)
Loading required package: seqinr

Attaching package: 'seqinr'

The following object is masked from 'package:Biostrings':

    translate

> GC(s2c(smsq))
[1] 0.444212
> 
> 
> ## example 2: simulating random sequence with parameter following a known reference genome sequence:
> 
> # generating a Fasta file for the example:
> sq<-c()
> for(i in 1:10){
+ sq <- c(sq, sim.DNAseq(size=rpois(1, 1000), GCfreq=0.444))
+ }
> sq <- DNAStringSet(sq)
> writeFasta(sq, file="SimRAD-exampleRefSeq-Fasta.fa", mode="w")
> 
> # importing the Fasta file and sub-selecting 25% of the contigs
> rfsq <- ref.DNAseq("SimRAD-exampleRefSeq-Fasta.fa", subselect.contigs = TRUE, prop.contigs = 0.25)
> 
> # length of the reference sequence:
> width(rfsq)
[1] 2064
> 
> # computing GC content:
> require(seqinr)
> GC(s2c(rfsq))
[1] 0.4462209
> 
> # simulating random generated DNA sequence with characteristics equivalent to 
> #    the sub-selected reference genome for comparison purpose:
> smsq <- sim.DNAseq(size=width(rfsq), GCfreq=GC(s2c(rfsq)))
> 
> 
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>