Last data update: 2014.03.03

R: Function to exclude fragments containing a specified...
exclude.seqsiteR Documentation

Function to exclude fragments containing a specified restriction site.

Description

Given a vector of sequences representing DNA fragments digested by restriction enzyme, the function return the DNA fragments that do not contain a specified restriction site, which is typically used to reduce the number of loci in the RESTseq method. The function can be use repeatedly for excluding fragments containing several restriction sites.

Usage

exclude.seqsite(sequences, site, verbose=TRUE)

Arguments

sequences

a vector of DNA sequences representing DNA fragments after digestion by restriction enzyme(s), typically the output of another function such as adapt.select, insilico.digest or exclude.seqsite.

site

restriction site to target DNA fragments to exclude. This typically corresponds to recognition site of a frequent cutter restriction enzyme.

verbose

If TRUE (the default), returns the number of fragments excluded and kept. FALSE makes the function silent to be used in a loop.

Details

Frequent cutter restriction enzyme can be easily used to further reduce the number of fragments as demonstrated by RESTseq method. This approach looks interesting in some species with complex genomes as it allows removing parts of the genomes containing highly repetitive CG or / and AT rich sequences.

This function can be used directly after a single enzyme digestion using insilico.digest function to remove fragments containing restriction site of a second enzyme. An equivalent alternative would be to simulate a double digestion using insilico.digest followed by adapt.select with type = "AA", which would remove fragments containing restriction site of the enzyme 2 (see example below).

An unlimited number of exclusion steps using different restriction enzyme can be simulated by running the function with the output of a previous execution of the function (see example below).

Value

A vector of DNA fragment sequences.

Author(s)

Olivier Lepais

References

Lepais O & Weir JT. 2014. SimRAD: an R package for simulation-based prediction of the number of loci expected in RADseq and similar genotyping by sequencing approaches. Molecular Ecology Resources, 14, 1314-1321. DOI: 10.1111/1755-0998.12273.

Stolle & Moritz 2013. RESTseq - Efficient benchtop population genomics with RESTriction fragment SEQuencing. PLoS ONE 8: e63960. doi:10.1371/journal.pone.0063960

See Also

adapt.select, size.select.

Examples

### Example 1:
# simulating some sequence:
simseq <-  sim.DNAseq(size=1000000, GCfreq=0.433)

#Restriction Enzyme 1
#PstI
cs_5p1 <- "CTGCA"
cs_3p1 <- "G"

#Restriction Enzyme 2
#MseI #
cs_5p2 <- "T"
cs_3p2 <- "TAA"
# hence, recognition site: "TTAA"

# single digestion:
simseq.dig <- insilico.digest(simseq, cs_5p1, cs_3p1, cs_5p1, cs_3p1, verbose=TRUE)
# excluding fragments coutaining restriction site of the enzyme 2
simseq.exc <- exclude.seqsite(simseq.dig, "TTAA")

## which is equivalent to:
simseq.dig2 <- insilico.digest(simseq, cs_5p1, cs_3p1, cs_5p2, cs_3p2, verbose=TRUE)
simseq.selectAA <- adapt.select(simseq.dig2, type="AA", cs_5p1, cs_3p1, cs_5p2, cs_3p2)
length(simseq.selectAA)


### Example 2:
simseq <-  sim.DNAseq(size=1000000, GCfreq=0.51)

#Restriction Enzyme 1
#TaqI
cs_5p1 <- "T"
cs_3p1 <- "CGA"

simseq.dig <- insilico.digest(simseq, cs_5p1, cs_3p1, cs_5p1, cs_3p1, verbose=TRUE)

# removing fragments countaining restiction sites of MseI ("TTAA"), MliCI ("AATT"),
#         HaellI ("GGCC"), MspI ("CCGG") and HinP1I ("GCGC"):
excl1 <- exclude.seqsite(simseq.dig, "TTAA")
excl2 <- exclude.seqsite(excl1, "AATT")
excl3 <- exclude.seqsite(excl2, "GGCC")
excl4 <- exclude.seqsite(excl3, "CCGG")
excl5 <- exclude.seqsite(excl4, "GCGC")
# which can be followed by  size selection step.


Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(SimRAD)
Loading required package: Biostrings
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: XVector
Loading required package: ShortRead
Loading required package: BiocParallel
Loading required package: Rsamtools
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: GenomicAlignments
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: zlibbioc
> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/SimRAD/exclude.seqsite.Rd_%03d_medium.png", width=480, height=480)
> ### Name: exclude.seqsite
> ### Title: Function to exclude fragments containing a specified restriction
> ###   site.
> ### Aliases: exclude.seqsite
> ### Keywords: double digestion restriction enzyme library construction
> ###   adaptator ligation fragment selection restriction exclusion RESTseq
> 
> ### ** Examples
> 
> ### Example 1:
> # simulating some sequence:
> simseq <-  sim.DNAseq(size=1000000, GCfreq=0.433)
> 
> #Restriction Enzyme 1
> #PstI
> cs_5p1 <- "CTGCA"
> cs_3p1 <- "G"
> 
> #Restriction Enzyme 2
> #MseI #
> cs_5p2 <- "T"
> cs_3p2 <- "TAA"
> # hence, recognition site: "TTAA"
> 
> # single digestion:
> simseq.dig <- insilico.digest(simseq, cs_5p1, cs_3p1, cs_5p1, cs_3p1, verbose=TRUE)
Number of restriction sites for the first enzyme: 177
Number of restriction sites for the second enzyme: 177
Number of type AB and BA fragments:352
Number of type AA fragments:176
Number of type BB fragments:176
Warning messages:
1: In NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
  subscript is an array, passing it thru as.vector() first
2: In NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
  subscript is an array, passing it thru as.vector() first
3: In NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
  subscript is an array, passing it thru as.vector() first
4: In NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
  subscript is an array, passing it thru as.vector() first
> # excluding fragments coutaining restriction site of the enzyme 2
> simseq.exc <- exclude.seqsite(simseq.dig, "TTAA")
171 out of 178 sequences with the restriction site detected and removed 
7 sequence remaining 
> 
> ## which is equivalent to:
> simseq.dig2 <- insilico.digest(simseq, cs_5p1, cs_3p1, cs_5p2, cs_3p2, verbose=TRUE)
Number of restriction sites for the first enzyme: 177
Number of restriction sites for the second enzyme: 6444
Number of type AB and BA fragments:340
Number of type AA fragments:7
Number of type BB fragments:6274
Warning messages:
1: In NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
  subscript is an array, passing it thru as.vector() first
2: In NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
  subscript is an array, passing it thru as.vector() first
3: In NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
  subscript is an array, passing it thru as.vector() first
4: In NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
  subscript is an array, passing it thru as.vector() first
> simseq.selectAA <- adapt.select(simseq.dig2, type="AA", cs_5p1, cs_3p1, cs_5p2, cs_3p2)
Warning message:
In NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
  subscript is an array, passing it thru as.vector() first
> length(simseq.selectAA)
[1] 7
> 
> 
> ### Example 2:
> simseq <-  sim.DNAseq(size=1000000, GCfreq=0.51)
> 
> #Restriction Enzyme 1
> #TaqI
> cs_5p1 <- "T"
> cs_3p1 <- "CGA"
> 
> simseq.dig <- insilico.digest(simseq, cs_5p1, cs_3p1, cs_5p1, cs_3p1, verbose=TRUE)
Number of restriction sites for the first enzyme: 3958
Number of restriction sites for the second enzyme: 3958
Number of type AB and BA fragments:7914
Number of type AA fragments:3957
Number of type BB fragments:3957
Warning messages:
1: In NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
  subscript is an array, passing it thru as.vector() first
2: In NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
  subscript is an array, passing it thru as.vector() first
3: In NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
  subscript is an array, passing it thru as.vector() first
4: In NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
  subscript is an array, passing it thru as.vector() first
> 
> # removing fragments countaining restiction sites of MseI ("TTAA"), MliCI ("AATT"),
> #         HaellI ("GGCC"), MspI ("CCGG") and HinP1I ("GCGC"):
> excl1 <- exclude.seqsite(simseq.dig, "TTAA")
1892 out of 3959 sequences with the restriction site detected and removed 
2067 sequence remaining 
> excl2 <- exclude.seqsite(excl1, "AATT")
637 out of 2067 sequences with the restriction site detected and removed 
1430 sequence remaining 
> excl3 <- exclude.seqsite(excl2, "GGCC")
424 out of 1430 sequences with the restriction site detected and removed 
1006 sequence remaining 
> excl4 <- exclude.seqsite(excl3, "CCGG")
197 out of 1006 sequences with the restriction site detected and removed 
809 sequence remaining 
> excl5 <- exclude.seqsite(excl4, "GCGC")
141 out of 809 sequences with the restriction site detected and removed 
668 sequence remaining 
> # which can be followed by  size selection step.
> 
> 
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>