Last data update: 2014.03.03

R: Cut up the genome
cutGenomeR Documentation

Cut up the genome

Description

Perform an in silico restriction digest of a target genome.

Usage

cutGenome(bs, pattern, overhang=4L)

Arguments

bs

a BSgenome object or a character string pointing to a FASTA file

pattern

character string describing the recognition site

overhang

integer scalar specifying the length of the 5' overhang

Details

This function simulates a restriction digestion of a specified genome, given the recognition site and 5' overhang of the cutter The total sequence spanned by each fragment is recorded, including the two sticky ends. No support is currently provided for searching the reverse strand, so the recognition site should be an inverse palindrome.

The genome should be specified as a BSgenome object. However, a character string can also be provided, specifying a FASTA file containing all the reference sequences in a genome. The latter may be necessary to synchronise the fragments with the genome used for alignment.

Note that some of the reported fragments may be physically impossible to form, e.g., for overlapping sites or consecutive sites when overhang==nchar(pattern). Nonetheless, they are still reported to maintain the correspondence between fragments and cut sites. Cleavage sites on the forward strand can be obtained as the start locations of all fragments (excepting the first fragment on each chromosome).

Value

A GRanges object containing the boundaries of each restriction fragment in the genome.

Warning

If bs is a FASTQ file, the chromosome names in the FASTQ headers will be loaded faithfully by cutGenome. However, many mapping pipelines will drop the rest of the name past the first whitespace when constructing the alignment index. To be safe, users should ensure that the chromosome names in the FASTQ headers consist of one word. Otherwise, there will be a discrepancy between the chromosome names in the output GRanges, and those in the BAM files after alignment.

Author(s)

Aaron Lun

See Also

matchPattern

Examples

require(BSgenome.Ecoli.NCBI.20080805)

cutGenome(Ecoli, "AAGCTT", overhang=4L) # HindIII
cutGenome(Ecoli, "CCGCGG", overhang=2L) # SacII
cutGenome(Ecoli, "AGCT", overhang=0L) # AluI

# Trying with FastA files.
x <- system.file("extdata", "fastaEx.fa", package="Biostrings")
cutGenome(x, "AGCT", overhang=2)
cutGenome(x, "AGCT", overhang=4)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(diffHic)
Loading required package: GenomicRanges
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: InteractionSet
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/diffHic/cutGenome.Rd_%03d_medium.png", width=480, height=480)
> ### Name: cutGenome
> ### Title: Cut up the genome
> ### Aliases: cutGenome
> ### Keywords: preprocessing
> 
> ### ** Examples
> 
> require(BSgenome.Ecoli.NCBI.20080805)
Loading required package: BSgenome.Ecoli.NCBI.20080805
Loading required package: BSgenome
Loading required package: Biostrings
Loading required package: XVector
Loading required package: rtracklayer
> 
> cutGenome(Ecoli, "AAGCTT", overhang=4L) # HindIII
GRanges object with 7530 ranges and 0 metadata columns:
          seqnames             ranges strand
             <Rle>          <IRanges>  <Rle>
     [1] NC_008253     [    1,  5631]      *
     [2] NC_008253     [ 5628,  8815]      *
     [3] NC_008253     [ 8812, 14390]      *
     [4] NC_008253     [14387, 15273]      *
     [5] NC_008253     [15270, 73083]      *
     ...       ...                ...    ...
  [7526] AC_000091 [4598710, 4604166]      *
  [7527] AC_000091 [4604163, 4624393]      *
  [7528] AC_000091 [4624390, 4634429]      *
  [7529] AC_000091 [4634426, 4646149]      *
  [7530] AC_000091 [4646146, 4646332]      *
  -------
  seqinfo: 13 sequences from 2008/08/05 genome
> cutGenome(Ecoli, "CCGCGG", overhang=2L) # SacII
GRanges object with 8995 ranges and 0 metadata columns:
          seqnames             ranges strand
             <Rle>          <IRanges>  <Rle>
     [1] NC_008253     [    1, 14798]      *
     [2] NC_008253     [14797, 15173]      *
     [3] NC_008253     [15172, 32923]      *
     [4] NC_008253     [32922, 37533]      *
     [5] NC_008253     [37532, 41591]      *
     ...       ...                ...    ...
  [8991] AC_000091 [4612262, 4625241]      *
  [8992] AC_000091 [4625240, 4627146]      *
  [8993] AC_000091 [4627145, 4634502]      *
  [8994] AC_000091 [4634501, 4641114]      *
  [8995] AC_000091 [4641113, 4646332]      *
  -------
  seqinfo: 13 sequences from 2008/08/05 genome
> cutGenome(Ecoli, "AGCT", overhang=0L) # AluI
GRanges object with 186117 ranges and 0 metadata columns:
            seqnames             ranges strand
               <Rle>          <IRanges>  <Rle>
       [1] NC_008253       [   1,    2]      *
       [2] NC_008253       [   3,   69]      *
       [3] NC_008253       [  70,  929]      *
       [4] NC_008253       [ 930, 1103]      *
       [5] NC_008253       [1104, 1443]      *
       ...       ...                ...    ...
  [186113] AC_000091 [4644580, 4645143]      *
  [186114] AC_000091 [4645144, 4645330]      *
  [186115] AC_000091 [4645331, 4645508]      *
  [186116] AC_000091 [4645509, 4646147]      *
  [186117] AC_000091 [4646148, 4646332]      *
  -------
  seqinfo: 13 sequences from 2008/08/05 genome
> 
> # Trying with FastA files.
> x <- system.file("extdata", "fastaEx.fa", package="Biostrings")
> cutGenome(x, "AGCT", overhang=2)
GRanges object with 4 ranges and 0 metadata columns:
       seqnames    ranges strand
          <Rle> <IRanges>  <Rle>
  [1] sequence1  [ 1, 32]      *
  [2] sequence1  [31, 58]      *
  [3] sequence1  [57, 72]      *
  [4] sequence2  [ 1, 70]      *
  -------
  seqinfo: 2 sequences from an unspecified genome
> cutGenome(x, "AGCT", overhang=4)
GRanges object with 4 ranges and 0 metadata columns:
       seqnames    ranges strand
          <Rle> <IRanges>  <Rle>
  [1] sequence1  [ 1, 33]      *
  [2] sequence1  [30, 59]      *
  [3] sequence1  [56, 72]      *
  [4] sequence2  [ 1, 70]      *
  -------
  seqinfo: 2 sequences from an unspecified genome
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>