Last data update: 2014.03.03

R: Function to import a reference DNA sequence from a Fasta...
ref.DNAseqR Documentation

Function to import a reference DNA sequence from a Fasta file.

Description

This function read a Fasta file containing a genome reference DNA sequence in the form of several contigs (or alternatively a single sequence) and can optionally randomly sub-select a fraction of the contigs sequence to allow for faster computation and avoid memory saturation.

Usage

ref.DNAseq(FASTA.file, subselect.contigs = TRUE, prop.contigs = 0.1)

Arguments

FASTA.file

Full patch to a Fasta file (e.g."C:/Users/Me/Documents/RAD/refGenome.fa").

subselect.contigs

If TRUE (the default) and if the Fasta file contains more than one sequence, sub-select a fraction of the contig sequences. If FALSE or if the Fasta file contains only one continuous DNA sequence, all data contain in the Fasta file is imported.

prop.contigs

Proportion of contigs to randomly sub-select for subsequent analyses.

Details

To limit memory usage and speed up computation time, it is strongly recommended to sub-select a fraction of the available reference DNA contigs. Usually, a proportion of 0.10 of the reference contigs should be representative enough to give a good idea of the genome characteristics of the species. The ratio of the length of contig sequence kept to the estimated total size of the genome can then be used to estimate the number of loci by cross-multiplication (see example below).

Contig sequences will be randomly concatenated to form a chimeric unique continuous sequence (comparable to simulated sequence by sim.DNAseq function). This may create some bias as two restriction site located in two different contigs may thus form a fragments that would not exist in the real genome. Yet, contigs do not represent real entities and keeping them apart would also create biases by randomly generating false digested restriction site which would heavily inflate the number of sites and loci particularly for draft reference genome with hundred of thousand of contigs. On the contrary, the choice to concatenate contigs may only slightly increase the number of restriction site and the number of fragments (loci).

Value

A single continuous DNA sequence (character string).

Author(s)

Olivier Lepais

References

Lepais O & Weir JT. 2014. SimRAD: an R package for simulation-based prediction of the number of loci expected in RADseq and similar genotyping by sequencing approaches. Molecular Ecology Resources, 14, 1314-1321. DOI: 10.1111/1755-0998.12273.

See Also

sim.DNAseq, insilico.digest.

Examples


# generating a Fasta file for the example:
sq<-c()
for(i in 1:20){
sq <- c(sq, sim.DNAseq(size=rpois(1, 10000), GCfreq=0.444))
}
sq <- DNAStringSet(sq)
writeFasta(sq, file="SimRAD-exampleRefSeq.fa", mode="w")

# importing the Fasta file and sub selecting 25% of the contigs
rfsq <- ref.DNAseq("SimRAD-exampleRefSeq.fa", subselect.contigs = TRUE, prop.contigs = 0.25)

# length of the reference sequence:
width(rfsq)

# ratio for the cross-multiplication of the number of fragments and loci at the genomes scale:
genome.size <- 100000000 # genome size: 100Mb
ratio <- genome.size/width(rfsq)
ratio

# computing GC content:
require(seqinr)
GC(s2c(rfsq))

# simulating random generated DNA sequence with characteristics equivalent to
#     the sub-selected reference genome for comparison purpose:
smsq <- sim.DNAseq(size=width(rfsq), GCfreq=GC(s2c(rfsq)))



Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(SimRAD)
Loading required package: Biostrings
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: XVector
Loading required package: ShortRead
Loading required package: BiocParallel
Loading required package: Rsamtools
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: GenomicAlignments
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: zlibbioc
> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/SimRAD/ref.DNAseq.Rd_%03d_medium.png", width=480, height=480)
> ### Name: ref.DNAseq
> ### Title: Function to import a reference DNA sequence from a Fasta file.
> ### Aliases: ref.DNAseq
> ### Keywords: double digestion restriction enzyme library construction
> ###   adaptator ligation fragment selection restriction exclusion RESTseq
> ###   RAD GBS ddRAD ezRAD
> 
> ### ** Examples
> 
> 
> # generating a Fasta file for the example:
> sq<-c()
> for(i in 1:20){
+ sq <- c(sq, sim.DNAseq(size=rpois(1, 10000), GCfreq=0.444))
+ }
> sq <- DNAStringSet(sq)
> writeFasta(sq, file="SimRAD-exampleRefSeq.fa", mode="w")
> 
> # importing the Fasta file and sub selecting 25% of the contigs
> rfsq <- ref.DNAseq("SimRAD-exampleRefSeq.fa", subselect.contigs = TRUE, prop.contigs = 0.25)
> 
> # length of the reference sequence:
> width(rfsq)
[1] 50021
> 
> # ratio for the cross-multiplication of the number of fragments and loci at the genomes scale:
> genome.size <- 100000000 # genome size: 100Mb
> ratio <- genome.size/width(rfsq)
> ratio
[1] 1999.16
> 
> # computing GC content:
> require(seqinr)
Loading required package: seqinr

Attaching package: 'seqinr'

The following object is masked from 'package:Biostrings':

    translate

> GC(s2c(rfsq))
[1] 0.4408748
> 
> # simulating random generated DNA sequence with characteristics equivalent to
> #     the sub-selected reference genome for comparison purpose:
> smsq <- sim.DNAseq(size=width(rfsq), GCfreq=GC(s2c(rfsq)))
> 
> 
> 
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>