Last data update: 2014.03.03

R: Segment genomic signals using the CBS method
segmentByCBSR Documentation

Segment genomic signals using the CBS method

Description

Segment genomic signals using the CBS method of the DNAcopy package. This is a convenient low-level wrapper for the DNAcopy::segment() method. It is intended to be applied to a sample at the time. For more details on the Circular Binary Segmentation (CBS) method see [1,2].

Usage

## Default S3 method:
segmentByCBS(y, chromosome=0L, x=NULL, index=seq(along = y), w=NULL, undo=0,
  avg=c("mean", "median"), ..., joinSegments=TRUE, knownSegments=NULL, seed=NULL,
  verbose=FALSE)

Arguments

y

A numeric vector of J genomic signals to be segmented.

chromosome

Optional numeric vector of length J, specifying the chromosome of each loci. If a scalar, it is expanded to a vector of length J.

x

Optional numeric vector of J genomic locations. If NULL, index locations 1:J are used.

index

An optional integer vector of length J specifying the genomewide indices of the loci.

w

Optional numeric vector in [0,1] of J weights.

undo

A non-negative numeric. If greater than zero, then arguments undo.splits="sdundo" and undo.SD=undo are passed to DNAcopy::segment(). In the special case when undo is +Inf, the segmentation result will not contain any changepoints (in addition to what is specified by argument knownSegments).

avg

A character string specifying how to calculating segment mean levels after change points have been identified.

...

Additional arguments passed to the DNAcopy::segment() segmentation function.

joinSegments

If TRUE, there are no gaps between neighboring segments. If FALSE, the boundaries of a segment are defined by the support that the loci in the segments provides, i.e. there exist a locus at each end point of each segment. This also means that there is a gap between any neighboring segments, unless the change point is in the middle of multiple loci with the same position. The latter is what DNAcopy::segment() returns.

knownSegments

Optional data.frame specifying non-overlapping known segments. These segments must not share loci. See findLargeGaps() and gapsToSegments().

seed

An (optional) integer specifying the random seed to be set before calling the segmentation method. The random seed is set to its original state when exiting. If NULL, it is not set.

verbose

See Verbose.

Details

Internally segment of DNAcopy is used to segment the signals. This segmentation method support weighted segmentation.

Value

Returns a CBS object.

Reproducibility

The DNAcopy::segment() implementation of CBS uses approximation through random sampling for some estimates. Because of this, repeated calls using the same signals may result in slightly different results, unless the random seed is set/fixed.

Missing and non-finite values

Signals may contain missing values (NA or NaN), but not infinite values (+/-Inf). Loci with missing-value signals are preserved and keep in the result.

Likewise, genomic positions may contain missing values. However, if they do, such loci are silently excluded before performing the segmentation, and are not kept in the results. The mapping between the input locus-level data and ditto of the result can be inferred from the index column of the locus-level data of the result.

None of the input data may have infinite values, i.e. -Inf or +Inf. If so, an informative error is thrown.

Author(s)

Henrik Bengtsson

References

[1] A.B. Olshen, E.S. Venkatraman (aka Venkatraman E. Seshan), R. Lucito and M. Wigler, Circular binary segmentation for the analysis of array-based DNA copy number data, Biostatistics, 2004
[2] E.S. Venkatraman and A.B. Olshen, A faster circular binary segmentation algorithm for the analysis of array CGH data, Bioinformatics, 2007

See Also

To segment allele-specific tumor copy-number signals from a tumor with a matched normal, see segmentByPairedPSCBS(). For the same without a matched normal, see segmentByNonPairedPSCBS().

It is also possible to prune change points after segmentation (with identical results) using pruneBySdUndo().

Examples

 
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Simulating copy-number data
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
set.seed(0xBEEF)

# Number of loci
J <- 1000

mu <- double(J)
mu[200:300] <- mu[200:300] + 1
mu[350:400] <- NA # centromere
mu[650:800] <- mu[650:800] - 1
eps <- rnorm(J, sd=1/2)
y <- mu + eps
x <- sort(runif(length(y), max=length(y))) * 1e5
w <- runif(J)
w[650:800] <- 0.001


# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Segmentation
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
fit <- segmentByCBS(y, x=x)
print(fit)
plotTracks(fit)


     
xlab <- "Position (Mb)"
ylim <- c(-3,3)
xMb <- x/1e6
plot(xMb,y, pch=20, col="#aaaaaa", xlab=xlab, ylim=ylim)
drawLevels(fit, col="red", lwd=2, xScale=1e-6)

 
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# TESTS
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
fit <- segmentByCBS(y, x=x, seed=0xBEEF)
print(fit)
##   id chromosome       start      end nbrOfLoci    mean
## 1  y          0    55167.82 20774251       201  0.0164
## 2  y          0 20774250.85 29320105        99  1.0474
## 3  y          0 29320104.86 65874675       349 -0.0227
## 4  y          0 65874675.06 81348129       151 -1.0813
## 5  y          0 81348129.20 99910827       200 -0.0612


# Test #1: Reverse the ordering and segment
fitR <- segmentByCBS(rev(y), x=rev(x), seed=0xBEEF)
# Sanity check
stopifnot(all.equal(getSegments(fitR), getSegments(fit)))
# Sanity check
stopifnot(all.equal(rev(getLocusData(fitR)$index), getLocusData(fit)$index))

# Test #2: Reverse, but preserve ordering of 'data' object
fitRP <- segmentByCBS(rev(y), x=rev(x), preserveOrder=TRUE)
stopifnot(all.equal(getSegments(fitRP), getSegments(fit)))


# (Test #3: Change points inbetween data points at the same locus)
x[650:654] <- x[649]
fitC <- segmentByCBS(rev(y), x=rev(x), preserveOrder=TRUE, seed=0xBEEF)

# Test #4: Allow for some missing values in signals
y[450] <- NA
fitD <- segmentByCBS(y, x=x, seed=0xBEEF)


# Test #5: Allow for some missing genomic annotations
x[495] <- NA
fitE <- segmentByCBS(y, x=x, seed=0xBEEF)


# Test #6: Undo all change points found
fitF <- segmentByCBS(y, x=x, undo=Inf, seed=0xBEEF)
print(fitF)
stopifnot(nbrOfSegments(fitF) == 1L)


# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# MISC.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Emulate a centromere
x[650:699] <- NA
fit <- segmentByCBS(y, x=x, seed=0xBEEF)
xMb <- x/1e6
plot(xMb,y, pch=20, col="#aaaaaa", xlab=xlab, ylim=ylim)
drawLevels(fit, col="red", lwd=2, xScale=1e-6)

fitC <- segmentByCBS(y, x=x, joinSegments=FALSE, seed=0xBEEF)
drawLevels(fitC, col="blue", lwd=2, xScale=1e-6)



# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Multiple chromosomes
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Appending CBS results
fit1 <- segmentByCBS(y, chromosome=1, x=x)
fit2 <- segmentByCBS(y, chromosome=2, x=x)
fit <- append(fit1, fit2)
print(fit)
plotTracks(fit, subset=NULL, lwd=2, Clim=c(-3,3))


# Segmenting multiple chromosomes at once
chromosomeWG <- rep(1:2, each=J)
xWG <- rep(x, times=2)
yWG <- rep(y, times=2)
fitWG <- segmentByCBS(yWG, chromosome=chromosomeWG, x=xWG)
print(fitWG)
plotTracks(fitWG, subset=NULL, lwd=2, Clim=c(-3,3))

# Assert same results
fit$data[,"index"] <- getLocusData(fitWG)[,"index"] # Ignore 'index'
stopifnot(all.equal(getLocusData(fitWG), getLocusData(fit)))
stopifnot(all.equal(getSegments(fitWG), getSegments(fit)))

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(PSCBS)
PSCBS v0.61.0 (2016-02-03) successfully loaded. See ?PSCBS for help.

Attaching package: 'PSCBS'

The following objects are masked from 'package:base':

    append, load

> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/PSCBS/segmentByCBS.Rd_%03d_medium.png", width=480, height=480)
> ### Name: segmentByCBS
> ### Title: Segment genomic signals using the CBS method
> ### Aliases: segmentByCBS.default segmentByCBS segmentByCBS.data.frame
> ###   segmentByCBS.CBS segmentByCBS.CNA segmentByCBS
> ### Keywords: IO
> 
> ### ** Examples
> 
>  
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> # Simulating copy-number data
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> set.seed(0xBEEF)
> 
> # Number of loci
> J <- 1000
> 
> mu <- double(J)
> mu[200:300] <- mu[200:300] + 1
> mu[350:400] <- NA # centromere
> mu[650:800] <- mu[650:800] - 1
> eps <- rnorm(J, sd=1/2)
> y <- mu + eps
> x <- sort(runif(length(y), max=length(y))) * 1e5
> w <- runif(J)
> w[650:800] <- 0.001
> 
> 
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> # Segmentation
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> fit <- segmentByCBS(y, x=x)
> print(fit)
  sampleName chromosome       start      end nbrOfLoci    mean
1       <NA>          0    55167.82 20774251       201  0.0164
2       <NA>          0 20774250.85 29320105        99  1.0474
3       <NA>          0 29320104.86 65874675       298 -0.0203
4       <NA>          0 65874675.06 81348129       151 -1.0813
5       <NA>          0 81348129.20 99910827       200 -0.0612
> plotTracks(fit)
> 
> 
>      
> xlab <- "Position (Mb)"
> ylim <- c(-3,3)
> xMb <- x/1e6
> plot(xMb,y, pch=20, col="#aaaaaa", xlab=xlab, ylim=ylim)
> drawLevels(fit, col="red", lwd=2, xScale=1e-6)
NULL
> 
>  
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> # TESTS
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> fit <- segmentByCBS(y, x=x, seed=0xBEEF)
> print(fit)
  sampleName chromosome       start      end nbrOfLoci    mean
1       <NA>          0    55167.82 20774251       201  0.0164
2       <NA>          0 20774250.85 29320105        99  1.0474
3       <NA>          0 29320104.86 65874675       298 -0.0203
4       <NA>          0 65874675.06 81348129       151 -1.0813
5       <NA>          0 81348129.20 99910827       200 -0.0612
> ##   id chromosome       start      end nbrOfLoci    mean
> ## 1  y          0    55167.82 20774251       201  0.0164
> ## 2  y          0 20774250.85 29320105        99  1.0474
> ## 3  y          0 29320104.86 65874675       349 -0.0227
> ## 4  y          0 65874675.06 81348129       151 -1.0813
> ## 5  y          0 81348129.20 99910827       200 -0.0612
> 
> 
> # Test #1: Reverse the ordering and segment
> fitR <- segmentByCBS(rev(y), x=rev(x), seed=0xBEEF)
> # Sanity check
> stopifnot(all.equal(getSegments(fitR), getSegments(fit)))
> # Sanity check
> stopifnot(all.equal(rev(getLocusData(fitR)$index), getLocusData(fit)$index))
> 
> # Test #2: Reverse, but preserve ordering of 'data' object
> fitRP <- segmentByCBS(rev(y), x=rev(x), preserveOrder=TRUE)
> stopifnot(all.equal(getSegments(fitRP), getSegments(fit)))
> 
> 
> # (Test #3: Change points inbetween data points at the same locus)
> x[650:654] <- x[649]
> fitC <- segmentByCBS(rev(y), x=rev(x), preserveOrder=TRUE, seed=0xBEEF)
> 
> # Test #4: Allow for some missing values in signals
> y[450] <- NA
> fitD <- segmentByCBS(y, x=x, seed=0xBEEF)
> 
> 
> # Test #5: Allow for some missing genomic annotations
> x[495] <- NA
> fitE <- segmentByCBS(y, x=x, seed=0xBEEF)
> 
> 
> # Test #6: Undo all change points found
> fitF <- segmentByCBS(y, x=x, undo=Inf, seed=0xBEEF)
> print(fitF)
  sampleName chromosome    start      end nbrOfLoci        mean
1       <NA>          0 55167.82 99910827       999 -0.07979396
> stopifnot(nbrOfSegments(fitF) == 1L)
> 
> 
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> # MISC.
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> # Emulate a centromere
> x[650:699] <- NA
> fit <- segmentByCBS(y, x=x, seed=0xBEEF)
> xMb <- x/1e6
> plot(xMb,y, pch=20, col="#aaaaaa", xlab=xlab, ylim=ylim)
> drawLevels(fit, col="red", lwd=2, xScale=1e-6)
NULL
> 
> fitC <- segmentByCBS(y, x=x, joinSegments=FALSE, seed=0xBEEF)
> drawLevels(fitC, col="blue", lwd=2, xScale=1e-6)
NULL
> 
> 
> 
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> # Multiple chromosomes
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> # Appending CBS results
> fit1 <- segmentByCBS(y, chromosome=1, x=x)
> fit2 <- segmentByCBS(y, chromosome=2, x=x)
> fit <- append(fit1, fit2)
> print(fit)
   sampleName chromosome       start      end nbrOfLoci    mean
1        <NA>          1    55167.82 20585463       198  0.0096
2        <NA>          1 20585463.34 29320105       102  1.0304
3        <NA>          1 29320104.86 67893865       296 -0.0238
4        <NA>          1 67893865.37 81348129       101 -1.0803
5        <NA>          1 81348129.20 99910827       200 -0.0612
6        <NA>         NA          NA       NA        NA      NA
7        <NA>          2    55167.82 20585463       198  0.0096
8        <NA>          2 20585463.34 29320105       102  1.0304
9        <NA>          2 29320104.86 67893865       296 -0.0238
10       <NA>          2 67893865.37 81348129       101 -1.0803
11       <NA>          2 81348129.20 99910827       200 -0.0612
> plotTracks(fit, subset=NULL, lwd=2, Clim=c(-3,3))
> 
> 
> # Segmenting multiple chromosomes at once
> chromosomeWG <- rep(1:2, each=J)
> xWG <- rep(x, times=2)
> yWG <- rep(y, times=2)
> fitWG <- segmentByCBS(yWG, chromosome=chromosomeWG, x=xWG)
> print(fitWG)
   sampleName chromosome       start      end nbrOfLoci    mean
1        <NA>          1    55167.82 20585463       198  0.0096
2        <NA>          1 20585463.34 29320105       102  1.0304
3        <NA>          1 29320104.86 67893865       296 -0.0238
4        <NA>          1 67893865.37 81348129       101 -1.0803
5        <NA>          1 81348129.20 99910827       200 -0.0612
6        <NA>         NA          NA       NA        NA      NA
7        <NA>          2    55167.82 20585463       198  0.0096
8        <NA>          2 20585463.34 29320105       102  1.0304
9        <NA>          2 29320104.86 67893865       296 -0.0238
10       <NA>          2 67893865.37 81348129       101 -1.0803
11       <NA>          2 81348129.20 99910827       200 -0.0612
> plotTracks(fitWG, subset=NULL, lwd=2, Clim=c(-3,3))
> 
> # Assert same results
> fit$data[,"index"] <- getLocusData(fitWG)[,"index"] # Ignore 'index'
> stopifnot(all.equal(getLocusData(fitWG), getLocusData(fit)))
> stopifnot(all.equal(getSegments(fitWG), getSegments(fit)))
> 
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>