Last data update: 2014.03.03

R: Iterative correction of Hi-C counts
correctedContactR Documentation

Iterative correction of Hi-C counts

Description

Perform iterative correction on counts for Hi-C interactions to correct for biases between fragments.

Usage

correctedContact(data, iterations=50, exclude.local=1, ignore.low=0.02, 
    winsor.high=0.02, average=TRUE, dist.correct=FALSE)

Arguments

data

an InteractionSet object produced by squareCounts

iterations

an integer scalar specifying the number of correction iterations

exclude.local

an integer scalar, indicating the distance off the diagonal under which bin pairs are excluded

ignore.low

a numeric scalar, indicating the proportion of low-abundance bins to ignore

winsor.high

a numeric scalar indicating the proportion of high-abundance bin pairs to winsorize

average

a logical scalar specifying whether counts should be averaged across libraries

dist.correct

a logical scalar indicating whether to correct for distance effects

Details

This function implements the iterative correction procedure described by Imakaev et al. in their 2012 paper. Briefly, this aims to factorize the count for each bin pair into the biases for each of the two anchor bins and the true interaction probability. The bias represents the ease of sequencing/mapping/other for the genome sequence in each bin.

The data argument should be generated by taking the output of squareCounts after setting filter=1. Filtering should be avoided as counts in low-abundance bin pairs may be informative upon summation for each bin. For example, a large count sum for a bin may be formed from many bin pairs with low counts. Removal of those bin pairs would result in loss of information.

For average=TRUE, if multiple libraries are used to generate data, an average count will be computed for each bin pairs across all libraries using mglmOneGroup. The average count will then be used for correction. Otherwise, correction will be performed on the counts for each library separately.

The maximum step size in the output can be used as a measure of convergence. Ideally, the step size should approach 1 as iterations pass. This indicates that the correction procedure is converging to a single solution, as the maximum change to the computed biases is decreasing.

Value

A list with several components.

truth:

a numeric vector containing the true interaction probabilities for each bin pair

bias:

a numeric vector of biases for all bins

max:

a numeric vector containing the maximum fold-change change in biases at each iteration

trend:

a numeric vector specifying the fitted value for the distance-dependent trend, if dist.correct=TRUE

If average=FALSE, each component is a numeric matrix instead. Each column of the matrix contains the specified information for each library in data.

Additional parameter settings

Some robustness is provided by winsorizing out strong interactions with winsor.high to ensure that they do not overly influence the computed biases. This is useful for removing spikes around repeat regions or due to PCR duplication. Low-abundance bins can also be removed with ignore.low to avoid instability during correction, though this will result in NA values in the output.

Local bin pairs can be excluded as these are typically irrelevant to long-range interactions. They are also typically very high-abundance and may have excessive weight during correction, if not removed. This can be done by removing all bin pairs where the difference between the first and second anchor indices is less than exclude.local. Setting exclude.local=NA will only use inter-chromosomal bin pairs for correction.

If dist.correct=TRUE, abundances will be adjusted for distance-dependent effects. This is done by computing residuals from the fitted distance-abundance trend, using the filterTrended function. These residuals are then used for iterative correction, such that local interactions will not always have higher contact probabilities.

Ideally, the probability sums to unity across all bin pairs for a given bin (ignoring NA entries). This is complicated by winsorizing of high-abundance interactions and removal of local interactions. These interactions are not involved in correction, but are still reported in the output truth. As a result, the sum may not equal unity, i.e., values are not strictly interpretable as probabilities.

Author(s)

Aaron Lun

References

Imakaev M et al. (2012). Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat. Methods 9, 999-1003.

See Also

squareCounts, mglmOneGroup

Examples

# Dummying up some data.
set.seed(3423746)
npts <- 100
npairs <- 5000
nlibs <- 4
anchor1 <- sample(npts, npairs, replace=TRUE)
anchor2 <- sample(npts, npairs, replace=TRUE)
data <- InteractionSet(
    matrix(rpois(npairs*nlibs, runif(npairs, 10, 100)), nrow=npairs),
    GInteractions(anchor1=anchor1, anchor2=anchor2,
        regions=GRanges("chrA", IRanges(1:npts, 1:npts)), mode="reverse"),
	colData=DataFrame(totals=runif(nlibs, 1e6, 2e6)))

# Correcting.
stuff <- correctedContact(data)
head(stuff$truth)
head(stuff$bias)
plot(stuff$max)

# Different behavior with average=FALSE.
stuff <- correctedContact(data, average=FALSE)
head(stuff$truth)
head(stuff$bias)
head(stuff$max)

# Creating an offset matrix, for use in glmFit.
anchor1.bias <- stuff$bias[anchors(data, type="first", id=TRUE),]
anchor2.bias <- stuff$bias[anchors(data, type="second", id=TRUE),]
offsets <- log(anchor1.bias * anchor2.bias)



# Adjusting for distance, and computing offsets with trend correction.
stuff <- correctedContact(data, average=FALSE, dist.correct=TRUE)
head(stuff$truth)
head(stuff$trend)
offsets <- log(stuff$bias[anchors(data, type="first", id=TRUE),]) +
    log(stuff$bias[anchors(data, type="second", id=TRUE),])


Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(diffHic)
Loading required package: GenomicRanges
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: InteractionSet
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/diffHic/correctedContact.Rd_%03d_medium.png", width=480, height=480)
> ### Name: correctedContact
> ### Title: Iterative correction of Hi-C counts
> ### Aliases: correctedContact
> ### Keywords: normalization
> 
> ### ** Examples
> 
> # Dummying up some data.
> set.seed(3423746)
> npts <- 100
> npairs <- 5000
> nlibs <- 4
> anchor1 <- sample(npts, npairs, replace=TRUE)
> anchor2 <- sample(npts, npairs, replace=TRUE)
> data <- InteractionSet(
+     matrix(rpois(npairs*nlibs, runif(npairs, 10, 100)), nrow=npairs),
+     GInteractions(anchor1=anchor1, anchor2=anchor2,
+         regions=GRanges("chrA", IRanges(1:npts, 1:npts)), mode="reverse"),
+ 	colData=DataFrame(totals=runif(nlibs, 1e6, 2e6)))
> 
> # Correcting.
> stuff <- correctedContact(data)
> head(stuff$truth)
[1] 0.015090382 0.005454726 0.013199564 0.006194694 0.002427702 0.004512986
> head(stuff$bias)
[1] 79.47863 78.21909 64.91889 65.05703 66.16683 82.45563
> plot(stuff$max)
> 
> # Different behavior with average=FALSE.
> stuff <- correctedContact(data, average=FALSE)
> head(stuff$truth)
            [,1]        [,2]        [,3]        [,4]
[1,] 0.015993377 0.014932314 0.017935997 0.011874704
[2,] 0.004716752 0.006718097 0.004782074 0.005428051
[3,] 0.013893050 0.012935873 0.012576284 0.013340685
[4,] 0.005775059 0.005370371 0.008488667 0.005514521
[5,] 0.002373992 0.002651176 0.002128739 0.002709963
[6,] 0.004616494 0.004093229 0.004725339 0.004833168
> head(stuff$bias)
         [,1]     [,2]     [,3]     [,4]
[1,] 79.44583 80.12882 79.19669 78.00497
[2,] 77.57805 76.53334 79.33583 79.85193
[3,] 64.72275 66.18180 63.19447 63.92046
[4,] 66.95121 68.03298 64.37778 67.20136
[5,] 65.72483 66.44462 65.15494 65.79029
[6,] 84.05451 83.19919 79.98695 82.43838
> head(stuff$max)
          [,1]      [,2]      [,3]      [,4]
[1,] 82.194890 82.231381 81.926797 83.006024
[2,]  1.069459  1.076306  1.071531  1.072110
[3,]  1.035208  1.040115  1.037664  1.038077
[4,]  1.018830  1.021603  1.020410  1.020586
[5,]  1.010647  1.011826  1.011274  1.011314
[6,]  1.006084  1.006556  1.006316  1.006297
> 
> # Creating an offset matrix, for use in glmFit.
> anchor1.bias <- stuff$bias[anchors(data, type="first", id=TRUE),]
> anchor2.bias <- stuff$bias[anchors(data, type="second", id=TRUE),]
> offsets <- log(anchor1.bias * anchor2.bias)
> 
> ## Don't show: 
> # effective function of offset in GLMs.
> difference <- log(stuff$truth) - (log(assay(data)) - offsets) 
> stopifnot(all(is.na(difference) | difference < 1e-8))
> ## End(Don't show)
> 
> # Adjusting for distance, and computing offsets with trend correction.
> stuff <- correctedContact(data, average=FALSE, dist.correct=TRUE)
> head(stuff$truth)
            [,1]        [,2]        [,3]        [,4]
[1,] 0.016230412 0.015107505 0.018223365 0.012133830
[2,] 0.004882928 0.006944673 0.004909311 0.005607093
[3,] 0.014195053 0.013285178 0.012929646 0.013395824
[4,] 0.006016098 0.005582093 0.008747207 0.005711547
[5,] 0.002418694 0.002694319 0.002181806 0.002791897
[6,] 0.004640510 0.004127641 0.004784632 0.004843673
> head(stuff$trend)
         [,1]     [,2]     [,3]     [,4]
[1,] 5.613923 5.616414 5.608766 5.596893
[2,] 5.588246 5.587704 5.594276 5.587591
[3,] 5.601978 5.586651 5.582532 5.614958
[4,] 5.587631 5.587378 5.595316 5.587722
[5,] 5.617216 5.620255 5.613407 5.598193
[6,] 5.626500 5.620901 5.610264 5.624400
> offsets <- log(stuff$bias[anchors(data, type="first", id=TRUE),]) +
+     log(stuff$bias[anchors(data, type="second", id=TRUE),])
> 
> ## Don't show: 
> hidden.offsets <- offsets + stuff$trend/log2(exp(1))
> difference <- log(stuff$truth) - (log(assay(data)) - hidden.offsets)
> stopifnot(all(is.na(difference) | difference < 1e-8))
> ## End(Don't show)
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>