R: Function to simulate randomly generated DNA sequence.
sim.DNAseq
R Documentation
Function to simulate randomly generated DNA sequence.
Description
The function randomly generated DNA sequence of a given length and with fixed GC content to simulate DNA sequence representing a (proportion of the) genome for non-model species without available reference genome sequence.
Usage
sim.DNAseq(size = 10000, GCfreq = 0.46)
Arguments
size
DNA sequence length to generate in bp. Could be the known or estimated size of the species genome or more conveniently a fraction large enough to be representative of the full length size to limit memory usage and speed up computations.
GCfreq
GC content expressed as the proportion of G and C bases in the genome.
Details
If no reference genome sequence or some reference contigs are available for a species, knowledge of GC content and genome size are needed to generate random DNA sequence representative of a species.
If reference genome sequence (even draft or unassembled contigs) are available, a randomly generated sequence can be simulate following CG content and length of an import proportion of the reference sequence for comparison purpose (see example below).
Value
A single continuous DNA sequence (character string).
Author(s)
Olivier Lepais
References
Lepais O & Weir JT. 2014. SimRAD: an R package for simulation-based prediction of the number of loci expected in RADseq and similar genotyping by sequencing approaches. Molecular Ecology Resources, 14, 1314-1321. DOI: 10.1111/1755-0998.12273.
See Also
ref.DNAseq, insilico.digest.
Examples
## example 1: simulation of random sequence for non-model species without reference genome:
# generating 1Mb of DNA sequence with 44.4% GC content:
smsq <- sim.DNAseq(size=1000000, GCfreq=0.444)
# length:
width(smsq)
# GC content:
require(seqinr)
GC(s2c(smsq))
## example 2: simulating random sequence with parameter following a known reference genome sequence:
# generating a Fasta file for the example:
sq<-c()
for(i in 1:10){
sq <- c(sq, sim.DNAseq(size=rpois(1, 1000), GCfreq=0.444))
}
sq <- DNAStringSet(sq)
writeFasta(sq, file="SimRAD-exampleRefSeq-Fasta.fa", mode="w")
# importing the Fasta file and sub-selecting 25% of the contigs
rfsq <- ref.DNAseq("SimRAD-exampleRefSeq-Fasta.fa", subselect.contigs = TRUE, prop.contigs = 0.25)
# length of the reference sequence:
width(rfsq)
# computing GC content:
require(seqinr)
GC(s2c(rfsq))
# simulating random generated DNA sequence with characteristics equivalent to
# the sub-selected reference genome for comparison purpose:
smsq <- sim.DNAseq(size=width(rfsq), GCfreq=GC(s2c(rfsq)))
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(SimRAD)
Loading required package: Biostrings
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: IRanges
Loading required package: XVector
Loading required package: ShortRead
Loading required package: BiocParallel
Loading required package: Rsamtools
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: GenomicAlignments
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: zlibbioc
> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/SimRAD/sim.DNAseq.Rd_%03d_medium.png", width=480, height=480)
> ### Name: sim.DNAseq
> ### Title: Function to simulate randomly generated DNA sequence.
> ### Aliases: sim.DNAseq
> ### Keywords: double digestion restriction enzyme library construction
> ### adaptator ligation fragment selection restriction exclusion RESTseq
> ### RAD GBS ddRAD ezRAD
>
> ### ** Examples
>
>
> ## example 1: simulation of random sequence for non-model species without reference genome:
> # generating 1Mb of DNA sequence with 44.4% GC content:
> smsq <- sim.DNAseq(size=1000000, GCfreq=0.444)
>
> # length:
> width(smsq)
[1] 1000000
>
> # GC content:
> require(seqinr)
Loading required package: seqinr
Attaching package: 'seqinr'
The following object is masked from 'package:Biostrings':
translate
> GC(s2c(smsq))
[1] 0.444212
>
>
> ## example 2: simulating random sequence with parameter following a known reference genome sequence:
>
> # generating a Fasta file for the example:
> sq<-c()
> for(i in 1:10){
+ sq <- c(sq, sim.DNAseq(size=rpois(1, 1000), GCfreq=0.444))
+ }
> sq <- DNAStringSet(sq)
> writeFasta(sq, file="SimRAD-exampleRefSeq-Fasta.fa", mode="w")
>
> # importing the Fasta file and sub-selecting 25% of the contigs
> rfsq <- ref.DNAseq("SimRAD-exampleRefSeq-Fasta.fa", subselect.contigs = TRUE, prop.contigs = 0.25)
>
> # length of the reference sequence:
> width(rfsq)
[1] 2064
>
> # computing GC content:
> require(seqinr)
> GC(s2c(rfsq))
[1] 0.4462209
>
> # simulating random generated DNA sequence with characteristics equivalent to
> # the sub-selected reference genome for comparison purpose:
> smsq <- sim.DNAseq(size=width(rfsq), GCfreq=GC(s2c(rfsq)))
>
>
>
>
>
>
>
> dev.off()
null device
1
>