a vector of weights for the probes. The weights should be
inversely proportional to their variances. Currently all weights
should be positive i.e. remove probes with zero weight prior to
segmentation.
alpha
significance levels for the test to accept change-points.
nperm
number of permutations used for p-value computation.
p.method
method used for p-value computation. For the "perm"
method the p-value is based on full permutation. For the "hybrid"
method the maximum over the entire region is split into maximum of
max over small segments and max over the rest. Approximation is
used for the larger segment max. Default is hybrid.
min.width
the minimum number of markers for a changed segment.
The default is 2 but can be made larger. Maximum possible value is
set at 5 since arbitrary widths can have the undesirable effect of
incorrect change-points when a true signal of narrow widths exists.
kmax
the maximum width of smaller segment for permutation
in the hybrid method.
nmin
the minimum length of data for which the approximation of
maximum statistic is used under the hybrid method. should be larger
than 4*kmax
eta
the probability to declare a change conditioned on the
permuted statistic exceeding the observed statistic exactly
j (= 1,...,nperm*alpha) times.
sbdry
the sequential boundary used to stop and declare a
change. This boundary is a function of nperm, alpha and eta. It can
be obtained using the function "getbdry" and used instead of having
the "segment" function compute it every time it is called.
trim
proportion of data to be trimmed for variance calculation
for smoothing outliers and undoing splits based on SD.
undo.splits
A character string specifying how change-points are
to be undone, if at all. Default is "none". Other choices are
"prune", which uses a sum of squares criterion, and "sdundo", which
undoes splits that are not at least this many SDs apart.
undo.prune
the proportional increase in sum of squares allowed
when eliminating splits if undo.splits="prune".
undo.SD
the number of SDs between means to keep a split if
undo.splits="sdundo".
verbose
level of verbosity for monitoring the program's
progress where 0 produces no printout, 1 prints the current sample,
2 the current chromosome and 3 the current segment. The default
level is 1.
Details
This function implements the cicular binary segmentation (CBS)
algorithm of Olshen and Venkatraman (2004). Given a set of genomic
data, either continuous or binary, the algorithm recursively splits
chromosomes into either two or three subsegments based on a maximum
t-statistic. A reference distribution, used to decided whether or not
to split, is estimated by permutation. Options are given to eliminate
splits when the means of adjacent segments are not sufficiently far
apart. Note that after the first split the α-levels of the
tests for splitting are not unconditional.
We recommend using one of the undoing options to remove change-points
detected due to local trends (see the manuscript below for examples of
local trends).
Since the segmentation procedure uses a permutation reference
distribution, R commands for setting and saving seeds should be used
if the user wishes to reproduce the results.
Data that are NA, Inf, NaN will be removed on a per sample basis for
"genomdat" and all samples for "chrom" and "maploc".
Value
An object of class DNAcopy. It has three elements:
data
The original CNA object which was the input for segment
out
a data frame with six columns. Each row of the data frame
contains a segment for which there are six variables: the sample id,
the chromosome number, the map position of the start of the segment,
the map position of the end of the segment, the number of markers in
the segment, and the average value in the segment.
segRows
a data frame with the start and end row of each segment
in the data matrix. print command shows it with the showSegRows=T
Olshen, A. B., Venkatraman, E. S., Lucito, R., Wigler, M. (2004).
Circular binary segmentation for the analysis of array-based DNA copy
number data. Biostatistics 5: 557-572.
Venkatraman, E. S., Olshen, A. B. (2007) A faster circular binary
segmentation algorithm for the analysis of array CGH data.
Bioinformatics 23: 657-63.
Examples
# test code on an easy data set
set.seed(25)
genomdat <- rnorm(500, sd=0.1) +
rep(c(-0.2,0.1,1,-0.5,0.2,-0.5,0.1,-0.2),c(137,87,17,49,29,52,87,42))
plot(genomdat)
chrom <- rep(1:2,c(290,210))
maploc <- c(1:290,1:210)
test1 <- segment(CNA(genomdat, chrom, maploc))
# test code on a noisier and hence more difficult data set
set.seed(51)
genomdat <- rnorm(500, sd=0.2) +
rep(c(-0.2,0.1,1,-0.5,0.2,-0.5,0.1,-0.2),c(137,87,17,49,29,52,87,42))
plot(genomdat)
chrom <- rep(1:2,c(290,210))
maploc <- c(1:290,1:210)
test2 <- segment(CNA(genomdat, chrom, maploc))
# test code for weighted CBS
set.seed(97)
wts <- sample(1:3, 500, replace=TRUE)
genomdat <- rnorm(500, sd=0.3)/sqrt(wts) +
rep(c(-0.2,0.1,1,-0.5,0.2,-0.5,0.1,-0.2),c(137,87,17,49,29,52,87,42))
plot(genomdat)
chrom <- rep(1:2,c(290,210))
maploc <- c(1:290,1:210)
test3 <- segment(CNA(genomdat, chrom, maploc), weights=wts)
#A real analyis
data(coriell)
#Combine into one CNA object to prepare for analysis on Chromosomes 1-23
CNA.object <- CNA(cbind(coriell$Coriell.05296,coriell$Coriell.13330),
coriell$Chromosome,coriell$Position,
data.type="logratio",sampleid=c("c05296","c13330"))
#We generally recommend smoothing single point outliers before analysis
#Make sure to check that the smoothing is proper
smoothed.CNA.object <- smooth.CNA(CNA.object)
#Segmentation at default parameters
segment.smoothed.CNA.object <- segment(smoothed.CNA.object, verbose=1)
data(coriell)
#Combine into one CNA object to prepare for analysis on Chromosomes 1-23
CNA.object <- CNA(cbind(coriell$Coriell.05296,coriell$Coriell.13330),
coriell$Chromosome,coriell$Position,
data.type="logratio",sampleid=c("c05296","c13330"))
#We generally recommend smoothing single point outliers before analysis
#Make sure to check that the smoothing is proper
smoothed.CNA.object <- smooth.CNA(CNA.object)
#Segmentation at default parameters
segment.smoothed.CNA.object <- segment(smoothed.CNA.object, verbose=1)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(DNAcopy)
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/DNAcopy/segment.Rd_%03d_medium.png", width=480, height=480)
> ### Name: segment
> ### Title: Genome Segmentation Program
> ### Aliases: segment
> ### Keywords: nonparametric
>
> ### ** Examples
>
>
> # test code on an easy data set
> set.seed(25)
> genomdat <- rnorm(500, sd=0.1) +
+ rep(c(-0.2,0.1,1,-0.5,0.2,-0.5,0.1,-0.2),c(137,87,17,49,29,52,87,42))
> plot(genomdat)
> chrom <- rep(1:2,c(290,210))
> maploc <- c(1:290,1:210)
> test1 <- segment(CNA(genomdat, chrom, maploc))
Analyzing: Sample.1
>
> # test code on a noisier and hence more difficult data set
> set.seed(51)
> genomdat <- rnorm(500, sd=0.2) +
+ rep(c(-0.2,0.1,1,-0.5,0.2,-0.5,0.1,-0.2),c(137,87,17,49,29,52,87,42))
> plot(genomdat)
> chrom <- rep(1:2,c(290,210))
> maploc <- c(1:290,1:210)
> test2 <- segment(CNA(genomdat, chrom, maploc))
Analyzing: Sample.1
>
> # test code for weighted CBS
> set.seed(97)
> wts <- sample(1:3, 500, replace=TRUE)
> genomdat <- rnorm(500, sd=0.3)/sqrt(wts) +
+ rep(c(-0.2,0.1,1,-0.5,0.2,-0.5,0.1,-0.2),c(137,87,17,49,29,52,87,42))
> plot(genomdat)
> chrom <- rep(1:2,c(290,210))
> maploc <- c(1:290,1:210)
> test3 <- segment(CNA(genomdat, chrom, maploc), weights=wts)
Analyzing: Sample.1
>
> #A real analyis
>
> data(coriell)
>
> #Combine into one CNA object to prepare for analysis on Chromosomes 1-23
>
> CNA.object <- CNA(cbind(coriell$Coriell.05296,coriell$Coriell.13330),
+ coriell$Chromosome,coriell$Position,
+ data.type="logratio",sampleid=c("c05296","c13330"))
Warning message:
In CNA(cbind(coriell$Coriell.05296, coriell$Coriell.13330), coriell$Chromosome, :
array has repeated maploc positions
>
> #We generally recommend smoothing single point outliers before analysis
> #Make sure to check that the smoothing is proper
>
> smoothed.CNA.object <- smooth.CNA(CNA.object)
>
> #Segmentation at default parameters
>
> segment.smoothed.CNA.object <- segment(smoothed.CNA.object, verbose=1)
Analyzing: c05296
Analyzing: c13330
> data(coriell)
>
> #Combine into one CNA object to prepare for analysis on Chromosomes 1-23
>
> CNA.object <- CNA(cbind(coriell$Coriell.05296,coriell$Coriell.13330),
+ coriell$Chromosome,coriell$Position,
+ data.type="logratio",sampleid=c("c05296","c13330"))
Warning message:
In CNA(cbind(coriell$Coriell.05296, coriell$Coriell.13330), coriell$Chromosome, :
array has repeated maploc positions
>
> #We generally recommend smoothing single point outliers before analysis
> #Make sure to check that the smoothing is proper
>
> smoothed.CNA.object <- smooth.CNA(CNA.object)
>
> #Segmentation at default parameters
>
> segment.smoothed.CNA.object <- segment(smoothed.CNA.object, verbose=1)
Analyzing: c05296
Analyzing: c13330
>
>
>
>
>
>
> dev.off()
null device
1
>