Last data update: 2014.03.03

R: Methods for processing DNase Hi-C data
DNaseHiCR Documentation

Methods for processing DNase Hi-C data

Description

Processing of BAM files for DNase Hi-C into index files

Usage

segmentGenome(bs, size=500) 
prepPseudoPairs(bam, param, file, dedup=TRUE, ichim=TRUE, 
    chim.span=1000, minq=NA, output.dir=NULL)

Arguments

bs

a BSgenome object, or a character string pointing to a FASTA file, or a named integer vector of chromosome lengths

size

an integer scalar indicating the size of the pseudo-fragments

bam

a character string containing the path to a name-sorted BAM file

param

a pairParam object containing read extraction parameters

file

a character string specifying the path to an output index file

dedup

a logical scalar indicating whether marked duplicate reads should be removed

ichim

a logical scalar indicating whether invalid chimeras should be counted

chim.span

an integer scalar specifying the maximum span between a chimeric 3' end and a mate read

minq

an integer scalar specifying the minimum mapping quality for each read

output.dir

a character string specifying a directory for temporary files

Details

DNase Hi-C involves random fragmentation with DNase instead of restriction enzymes. This is accommodated in diffHic by partitioning the genome into small pseudo-fragments, using segmentGenome. Reads are then assigned into these pseudo-fragments using prepPseudoPairs. The rest of the analysis pipeline can then be used in the same manner as that for standard Hi-C.

The behavior of prepPseudoPairs is almost identical to that for preparePairs, if the latter were asked to assign reads into pseudo-fragments. However, for prepPseudoPairs, no reporting or removal of self-circles or dangling ends is performed, as these have no meaning for artificial fragments. Also, invalidity of chimeras is determined by checking whether the 3' end is more than chim.span away from the mate read, rather than checking for localization in different fragments.

The size of the pseudo-fragments is determined by, well, size in segmentGenome. Smaller sizes provide better resolution but increase computational work. Needless to say, the param$fragments field should contain the output from segmentGenome, rather than from cutGenome. Also see cutGenome documentation for a warning about the chromosome names.

Some loss of spatial resolution is inevitable when reads are summarized into pseudo-fragments. This is largely irrelevant, though, as counting across the interaction space will ultimately use much larger bins (usually at least 2 kbp).

Value

For segmentGenome, a GRanges object is produced containing the coordinates of the pseudo-fragments in the specified genome.

For prepPseudoPairs, a HDF5-formatted index file is produced at the specified location. A list of diagnostic vectors are also returned in the same format as that from preparePairs, without the same.id entry.

Author(s)

Aaron Lun

See Also

preparePairs, cutGenome

Examples

require(BSgenome.Ecoli.NCBI.20080805)
segmentGenome(BSgenome.Ecoli.NCBI.20080805)
segmentGenome(BSgenome.Ecoli.NCBI.20080805, size=1000)

# Pretend that this example is DNase Hi-C.
hic.file <- system.file("exdata", "hic_sort.bam", package="diffHic")
cuts <- readRDS(system.file("exdata", "cuts.rds", package="diffHic"))
pseudo <- segmentGenome(seqlengths(cuts), size=50) 
param <- pairParam(pseudo) 

tmpf <- "gunk.h5"
prepPseudoPairs(hic.file, param, tmpf)
prepPseudoPairs(hic.file, param, tmpf, dedup=FALSE)
prepPseudoPairs(hic.file, param, tmpf, minq=50)
prepPseudoPairs(hic.file, param, tmpf, chim.span=20)


Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(diffHic)
Loading required package: GenomicRanges
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: InteractionSet
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/diffHic/DNaseHiC.Rd_%03d_medium.png", width=480, height=480)
> ### Name: DNaseHiC
> ### Title: Methods for processing DNase Hi-C data
> ### Aliases: prepPseudoPairs segmentGenome
> ### Keywords: preprocessing
> 
> ### ** Examples
> 
> require(BSgenome.Ecoli.NCBI.20080805)
Loading required package: BSgenome.Ecoli.NCBI.20080805
Loading required package: BSgenome
Loading required package: Biostrings
Loading required package: XVector
Loading required package: rtracklayer
> segmentGenome(BSgenome.Ecoli.NCBI.20080805)
GRanges object with 129516 ranges and 0 metadata columns:
            seqnames             ranges strand
               <Rle>          <IRanges>  <Rle>
       [1] NC_008253       [   1,  500]      *
       [2] NC_008253       [ 501, 1000]      *
       [3] NC_008253       [1001, 1500]      *
       [4] NC_008253       [1501, 2000]      *
       [5] NC_008253       [2001, 2500]      *
       ...       ...                ...    ...
  [129512] AC_000091 [4644001, 4644500]      *
  [129513] AC_000091 [4644501, 4645000]      *
  [129514] AC_000091 [4645001, 4645500]      *
  [129515] AC_000091 [4645501, 4646000]      *
  [129516] AC_000091 [4646001, 4646332]      *
  -------
  seqinfo: 13 sequences from an unspecified genome
> segmentGenome(BSgenome.Ecoli.NCBI.20080805, size=1000)
GRanges object with 64762 ranges and 0 metadata columns:
           seqnames             ranges strand
              <Rle>          <IRanges>  <Rle>
      [1] NC_008253       [   1, 1000]      *
      [2] NC_008253       [1001, 2000]      *
      [3] NC_008253       [2001, 3000]      *
      [4] NC_008253       [3001, 4000]      *
      [5] NC_008253       [4001, 5000]      *
      ...       ...                ...    ...
  [64758] AC_000091 [4642001, 4643000]      *
  [64759] AC_000091 [4643001, 4644000]      *
  [64760] AC_000091 [4644001, 4645000]      *
  [64761] AC_000091 [4645001, 4646000]      *
  [64762] AC_000091 [4646001, 4646332]      *
  -------
  seqinfo: 13 sequences from an unspecified genome
> 
> # Pretend that this example is DNase Hi-C.
> hic.file <- system.file("exdata", "hic_sort.bam", package="diffHic")
> cuts <- readRDS(system.file("exdata", "cuts.rds", package="diffHic"))
> pseudo <- segmentGenome(seqlengths(cuts), size=50) 
> param <- pairParam(pseudo) 
> 
> tmpf <- "gunk.h5"
> prepPseudoPairs(hic.file, param, tmpf)
$pairs
   total   marked filtered   mapped 
      30        7        3       20 

$singles
[1] 2

$chimeras
  total  mapped   multi invalid 
     12       8       7       4 

> prepPseudoPairs(hic.file, param, tmpf, dedup=FALSE)
$pairs
   total   marked filtered   mapped 
      30        7        3       27 

$singles
[1] 2

$chimeras
  total  mapped   multi invalid 
     12      12      12       6 

> prepPseudoPairs(hic.file, param, tmpf, minq=50)
$pairs
   total   marked filtered   mapped 
      30        7        9       14 

$singles
[1] 2

$chimeras
  total  mapped   multi invalid 
     12       6       3       2 

> prepPseudoPairs(hic.file, param, tmpf, chim.span=20)
$pairs
   total   marked filtered   mapped 
      30        7        3       20 

$singles
[1] 2

$chimeras
  total  mapped   multi invalid 
     12       8       7       6 

> 
> ## Don't show: 
> unlink(tmpf)
> ## End(Don't show)
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>