R Graphical Manual

Browse All

Last data update: 2014.03.03

R: Segment genomic signals using the CBS method

segmentByCBS

R Documentation

Segment genomic signals using the CBS method

Description

Segment genomic signals using the CBS method of the DNAcopy package. This is a convenient low-level wrapper for the DNAcopy::segment() method. It is intended to be applied to a sample at the time. For more details on the Circular Binary Segmentation (CBS) method see [1,2].

Usage

## Default S3 method:
segmentByCBS(y, chromosome=0L, x=NULL, index=seq(along = y), w=NULL, undo=0,
  avg=c("mean", "median"), ..., joinSegments=TRUE, knownSegments=NULL, seed=NULL,
  verbose=FALSE)

Arguments

`y`	A `numeric` `vector` of J genomic signals to be segmented.
`chromosome`	Optional `numeric` `vector` of length J, specifying the chromosome of each loci. If a scalar, it is expanded to a vector of length J.
`x`	Optional `numeric` `vector` of J genomic locations. If `NULL`, index locations `1:J` are used.
`index`	An optional `integer` `vector` of length J specifying the genomewide indices of the loci.
`w`	Optional `numeric` `vector` in [0,1] of J weights.
`undo`	A non-negative `numeric`. If greater than zero, then arguments `undo.splits="sdundo"` and `undo.SD=undo` are passed to `DNAcopy::segment()`. In the special case when `undo` is +`Inf`, the segmentation result will not contain any changepoints (in addition to what is specified by argument `knownSegments`).
`avg`	A `character` string specifying how to calculating segment mean levels after change points have been identified.
`...`	Additional arguments passed to the `DNAcopy::segment()` segmentation function.
`joinSegments`	If `TRUE`, there are no gaps between neighboring segments. If `FALSE`, the boundaries of a segment are defined by the support that the loci in the segments provides, i.e. there exist a locus at each end point of each segment. This also means that there is a gap between any neighboring segments, unless the change point is in the middle of multiple loci with the same position. The latter is what `DNAcopy::segment()` returns.
`knownSegments`	Optional `data.frame` specifying non-overlapping known segments. These segments must not share loci. See `findLargeGaps`() and `gapsToSegments`().
`seed`	An (optional) `integer` specifying the random seed to be set before calling the segmentation method. The random seed is set to its original state when exiting. If `NULL`, it is not set.
`verbose`	See `Verbose`.

Details

Internally segment of DNAcopy is used to segment the signals. This segmentation method support weighted segmentation.

Value

Returns a CBS object.

Reproducibility

The DNAcopy::segment() implementation of CBS uses approximation through random sampling for some estimates. Because of this, repeated calls using the same signals may result in slightly different results, unless the random seed is set/fixed.

Missing and non-finite values

Signals may contain missing values (NA or NaN), but not infinite values (+/-Inf). Loci with missing-value signals are preserved and keep in the result.

Likewise, genomic positions may contain missing values. However, if they do, such loci are silently excluded before performing the segmentation, and are not kept in the results. The mapping between the input locus-level data and ditto of the result can be inferred from the index column of the locus-level data of the result.

None of the input data may have infinite values, i.e. -Inf or +Inf. If so, an informative error is thrown.

Author(s)

Henrik Bengtsson

References

[1] A.B. Olshen, E.S. Venkatraman (aka Venkatraman E. Seshan), R. Lucito and M. Wigler, Circular binary segmentation for the analysis of array-based DNA copy number data, Biostatistics, 2004
[2] E.S. Venkatraman and A.B. Olshen, A faster circular binary segmentation algorithm for the analysis of array CGH data, Bioinformatics, 2007

Examples

 
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Simulating copy-number data
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
set.seed(0xBEEF)

# Number of loci
J <- 1000

mu <- double(J)
mu[200:300] <- mu[200:300] + 1
mu[350:400] <- NA # centromere
mu[650:800] <- mu[650:800] - 1
eps <- rnorm(J, sd=1/2)
y <- mu + eps
x <- sort(runif(length(y), max=length(y))) * 1e5
w <- runif(J)
w[650:800] <- 0.001


# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Segmentation
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
fit <- segmentByCBS(y, x=x)
print(fit)
plotTracks(fit)


     
xlab <- "Position (Mb)"
ylim <- c(-3,3)
xMb <- x/1e6
plot(xMb,y, pch=20, col="#aaaaaa", xlab=xlab, ylim=ylim)
drawLevels(fit, col="red", lwd=2, xScale=1e-6)

 
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# TESTS
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
fit <- segmentByCBS(y, x=x, seed=0xBEEF)
print(fit)
##   id chromosome       start      end nbrOfLoci    mean
## 1  y          0    55167.82 20774251       201  0.0164
## 2  y          0 20774250.85 29320105        99  1.0474
## 3  y          0 29320104.86 65874675       349 -0.0227
## 4  y          0 65874675.06 81348129       151 -1.0813
## 5  y          0 81348129.20 99910827       200 -0.0612


# Test #1: Reverse the ordering and segment
fitR <- segmentByCBS(rev(y), x=rev(x), seed=0xBEEF)
# Sanity check
stopifnot(all.equal(getSegments(fitR), getSegments(fit)))
# Sanity check
stopifnot(all.equal(rev(getLocusData(fitR)$index), getLocusData(fit)$index))

# Test #2: Reverse, but preserve ordering of 'data' object
fitRP <- segmentByCBS(rev(y), x=rev(x), preserveOrder=TRUE)
stopifnot(all.equal(getSegments(fitRP), getSegments(fit)))


# (Test #3: Change points inbetween data points at the same locus)
x[650:654] <- x[649]
fitC <- segmentByCBS(rev(y), x=rev(x), preserveOrder=TRUE, seed=0xBEEF)

# Test #4: Allow for some missing values in signals
y[450] <- NA
fitD <- segmentByCBS(y, x=x, seed=0xBEEF)


# Test #5: Allow for some missing genomic annotations
x[495] <- NA
fitE <- segmentByCBS(y, x=x, seed=0xBEEF)


# Test #6: Undo all change points found
fitF <- segmentByCBS(y, x=x, undo=Inf, seed=0xBEEF)
print(fitF)
stopifnot(nbrOfSegments(fitF) == 1L)


# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# MISC.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Emulate a centromere
x[650:699] <- NA
fit <- segmentByCBS(y, x=x, seed=0xBEEF)
xMb <- x/1e6
plot(xMb,y, pch=20, col="#aaaaaa", xlab=xlab, ylim=ylim)
drawLevels(fit, col="red", lwd=2, xScale=1e-6)

fitC <- segmentByCBS(y, x=x, joinSegments=FALSE, seed=0xBEEF)
drawLevels(fitC, col="blue", lwd=2, xScale=1e-6)



# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Multiple chromosomes
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Appending CBS results
fit1 <- segmentByCBS(y, chromosome=1, x=x)
fit2 <- segmentByCBS(y, chromosome=2, x=x)
fit <- append(fit1, fit2)
print(fit)
plotTracks(fit, subset=NULL, lwd=2, Clim=c(-3,3))


# Segmenting multiple chromosomes at once
chromosomeWG <- rep(1:2, each=J)
xWG <- rep(x, times=2)
yWG <- rep(y, times=2)
fitWG <- segmentByCBS(yWG, chromosome=chromosomeWG, x=xWG)
print(fitWG)
plotTracks(fitWG, subset=NULL, lwd=2, Clim=c(-3,3))

# Assert same results
fit$data[,"index"] <- getLocusData(fitWG)[,"index"] # Ignore 'index'
stopifnot(all.equal(getLocusData(fitWG), getLocusData(fit)))
stopifnot(all.equal(getSegments(fitWG), getSegments(fit)))

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(PSCBS)
PSCBS v0.61.0 (2016-02-03) successfully loaded. See ?PSCBS for help.

Attaching package: 'PSCBS'

The following objects are masked from 'package:base':

    append, load

> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/PSCBS/segmentByCBS.Rd_%03d_medium.png", width=480, height=480)
> ### Name: segmentByCBS
> ### Title: Segment genomic signals using the CBS method
> ### Aliases: segmentByCBS.default segmentByCBS segmentByCBS.data.frame
> ###   segmentByCBS.CBS segmentByCBS.CNA segmentByCBS
> ### Keywords: IO
> 
> ### ** Examples
> 
>  
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> # Simulating copy-number data
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> set.seed(0xBEEF)
> 
> # Number of loci
> J <- 1000
> 
> mu <- double(J)
> mu[200:300] <- mu[200:300] + 1
> mu[350:400] <- NA # centromere
> mu[650:800] <- mu[650:800] - 1
> eps <- rnorm(J, sd=1/2)
> y <- mu + eps
> x <- sort(runif(length(y), max=length(y))) * 1e5
> w <- runif(J)
> w[650:800] <- 0.001
> 
> 
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> # Segmentation
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> fit <- segmentByCBS(y, x=x)
> print(fit)
  sampleName chromosome       start      end nbrOfLoci    mean
1       <NA>          0    55167.82 20774251       201  0.0164
2       <NA>          0 20774250.85 29320105        99  1.0474
3       <NA>          0 29320104.86 65874675       298 -0.0203
4       <NA>          0 65874675.06 81348129       151 -1.0813
5       <NA>          0 81348129.20 99910827       200 -0.0612
> plotTracks(fit)
> 
> 
>      
> xlab <- "Position (Mb)"
> ylim <- c(-3,3)
> xMb <- x/1e6
> plot(xMb,y, pch=20, col="#aaaaaa", xlab=xlab, ylim=ylim)
> drawLevels(fit, col="red", lwd=2, xScale=1e-6)
NULL
> 
>  
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> # TESTS
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> fit <- segmentByCBS(y, x=x, seed=0xBEEF)
> print(fit)
  sampleName chromosome       start      end nbrOfLoci    mean
1       <NA>          0    55167.82 20774251       201  0.0164
2       <NA>          0 20774250.85 29320105        99  1.0474
3       <NA>          0 29320104.86 65874675       298 -0.0203
4       <NA>          0 65874675.06 81348129       151 -1.0813
5       <NA>          0 81348129.20 99910827       200 -0.0612
> ##   id chromosome       start      end nbrOfLoci    mean
> ## 1  y          0    55167.82 20774251       201  0.0164
> ## 2  y          0 20774250.85 29320105        99  1.0474
> ## 3  y          0 29320104.86 65874675       349 -0.0227
> ## 4  y          0 65874675.06 81348129       151 -1.0813
> ## 5  y          0 81348129.20 99910827       200 -0.0612
> 
> 
> # Test #1: Reverse the ordering and segment
> fitR <- segmentByCBS(rev(y), x=rev(x), seed=0xBEEF)
> # Sanity check
> stopifnot(all.equal(getSegments(fitR), getSegments(fit)))
> # Sanity check
> stopifnot(all.equal(rev(getLocusData(fitR)$index), getLocusData(fit)$index))
> 
> # Test #2: Reverse, but preserve ordering of 'data' object
> fitRP <- segmentByCBS(rev(y), x=rev(x), preserveOrder=TRUE)
> stopifnot(all.equal(getSegments(fitRP), getSegments(fit)))
> 
> 
> # (Test #3: Change points inbetween data points at the same locus)
> x[650:654] <- x[649]
> fitC <- segmentByCBS(rev(y), x=rev(x), preserveOrder=TRUE, seed=0xBEEF)
> 
> # Test #4: Allow for some missing values in signals
> y[450] <- NA
> fitD <- segmentByCBS(y, x=x, seed=0xBEEF)
> 
> 
> # Test #5: Allow for some missing genomic annotations
> x[495] <- NA
> fitE <- segmentByCBS(y, x=x, seed=0xBEEF)
> 
> 
> # Test #6: Undo all change points found
> fitF <- segmentByCBS(y, x=x, undo=Inf, seed=0xBEEF)
> print(fitF)
  sampleName chromosome    start      end nbrOfLoci        mean
1       <NA>          0 55167.82 99910827       999 -0.07979396
> stopifnot(nbrOfSegments(fitF) == 1L)
> 
> 
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> # MISC.
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> # Emulate a centromere
> x[650:699] <- NA
> fit <- segmentByCBS(y, x=x, seed=0xBEEF)
> xMb <- x/1e6
> plot(xMb,y, pch=20, col="#aaaaaa", xlab=xlab, ylim=ylim)
> drawLevels(fit, col="red", lwd=2, xScale=1e-6)
NULL
> 
> fitC <- segmentByCBS(y, x=x, joinSegments=FALSE, seed=0xBEEF)
> drawLevels(fitC, col="blue", lwd=2, xScale=1e-6)
NULL
> 
> 
> 
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> # Multiple chromosomes
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> # Appending CBS results
> fit1 <- segmentByCBS(y, chromosome=1, x=x)
> fit2 <- segmentByCBS(y, chromosome=2, x=x)
> fit <- append(fit1, fit2)
> print(fit)
   sampleName chromosome       start      end nbrOfLoci    mean
1        <NA>          1    55167.82 20585463       198  0.0096
2        <NA>          1 20585463.34 29320105       102  1.0304
3        <NA>          1 29320104.86 67893865       296 -0.0238
4        <NA>          1 67893865.37 81348129       101 -1.0803
5        <NA>          1 81348129.20 99910827       200 -0.0612
6        <NA>         NA          NA       NA        NA      NA
7        <NA>          2    55167.82 20585463       198  0.0096
8        <NA>          2 20585463.34 29320105       102  1.0304
9        <NA>          2 29320104.86 67893865       296 -0.0238
10       <NA>          2 67893865.37 81348129       101 -1.0803
11       <NA>          2 81348129.20 99910827       200 -0.0612
> plotTracks(fit, subset=NULL, lwd=2, Clim=c(-3,3))
> 
> 
> # Segmenting multiple chromosomes at once
> chromosomeWG <- rep(1:2, each=J)
> xWG <- rep(x, times=2)
> yWG <- rep(y, times=2)
> fitWG <- segmentByCBS(yWG, chromosome=chromosomeWG, x=xWG)
> print(fitWG)
   sampleName chromosome       start      end nbrOfLoci    mean
1        <NA>          1    55167.82 20585463       198  0.0096
2        <NA>          1 20585463.34 29320105       102  1.0304
3        <NA>          1 29320104.86 67893865       296 -0.0238
4        <NA>          1 67893865.37 81348129       101 -1.0803
5        <NA>          1 81348129.20 99910827       200 -0.0612
6        <NA>         NA          NA       NA        NA      NA
7        <NA>          2    55167.82 20585463       198  0.0096
8        <NA>          2 20585463.34 29320105       102  1.0304
9        <NA>          2 29320104.86 67893865       296 -0.0238
10       <NA>          2 67893865.37 81348129       101 -1.0803
11       <NA>          2 81348129.20 99910827       200 -0.0612
> plotTracks(fitWG, subset=NULL, lwd=2, Clim=c(-3,3))
> 
> # Assert same results
> fit$data[,"index"] <- getLocusData(fitWG)[,"index"] # Ignore 'index'
> stopifnot(all.equal(getLocusData(fitWG), getLocusData(fit)))
> stopifnot(all.equal(getSegments(fitWG), getSegments(fit)))
> 
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>