a vector of SNP names. SNPs must be ordered by chromosme locations
chr
chromosomes of all the SNPs specified in snpNames
pos
positions of all the SNPs specified in snpNames
LRR
Log R Ratio of all the SNPs specified in snpNames
BAF
B Allele Frequency of all the SNPs specified in snpNames
pBs
population frequency of of all the SNPs specified in snpNames
sampleID
symbol/name of the studied sample. Only one sample is studied each time
Para
a list of initial parameters for the HMM. If Para is NULL, The
default initial parameters: init.Para.CNA is used
fixPara
if fixPara is TRUE, the parameters in Para are fixed, and are
used directly to calculate posterior probabilities. It is not recommended to
set fixPara as TRUE for CNA studies.
cnv.only
a vector indicating those CNV-only probes, for which we
only consider their Log R ratio. If it is NULL, there is no CNV-only probes
estimate.pi.r
to estimate pi.r (proportion of uniform component for
LRR) or not. By default, estimate.pi.r=FALSE, and the initial value of pi.r is
used to estimate other parameters
estimate.pi.b
to estimate pi.b (proportion of uniform component for
BAF) or not. By default, estimate.pi.b=FALSE, and the initial value of pi.b is
used to estimate other parameters
estimate.trans.m
to estimate transition probability matrix or not. By
default, estimate.trans.m=FALSE, and the initial value of estimate.trans.m is
used to estimate other parameters
outputSeg
wether to output the information of copy number altered
segments
outputSNP
if outputSNP is 0, do not output SNP specific information; if outputSNP is 1, output the most likely copy number and genotype state of the SNPs that are within copy number altered regions; if outputSNP is 2, output the most likely copy number and genotype state of all the SNPs (whether it is within CNV regions or not), if outputSNP is 3, output the posterior probability for all the copy number and genotype states for the SNPs.
outputTag
the prefix of the output files, output of copy number
altered segments is written into file outputTag_segment.txt, and output of
SNP information is written into file outputTag_SNP.txt
outputViterbi
whether to output the copy altered regions identified by
the viterbi algorithm. see details
Ds
Parameter to for transition probability of the HMM. A vector of
length N, where N is the number of states in the HMM
pBs.alpha
pBs.alpha is the lower limit of population B allele frequency, and the
upper limit is 1 - pBs.alpha
contamination
whether tissue contamination is considered
normalGtp
normalGtp is specified only if paired tumor-normal
SNP array is availalble. It is the normal tissue genotype for all the SNPs
specified in snpNames, which can only take four different values:
-1, 0, 1, and 2. Values 0, 1, 2 correspond to the number of B alleles,
and value -1 indicates the normal genotype is missing. By default,
it is NULL, then all the normal genotype are set missing (-1)
geno.error
probability of genotyping error in normal tissue genotypes
min.tp
the minimum of transition probability.
max.diff
Due to normalization procedure, the BAF may not be symmetric.
Let's use state (AAA, AAB, ABB, BBB) as an example. Ideally, mean values of
normal components AAB and ABB, denoted by mu1 and mu2, respectively, should
have the relation mu1 = 1-mu2 if BAF is symmetric. However, this may not be
true due to normalization procedures. We restrict the difference of mu1 and
(1-mu2) by this parameter max.diff.
distThreshold
If distance between adjacent probes is larger than
distThreshold, restart the transition probability by the default values
in transB.
transB
The default transition probability.
epsilon
see explanation of K
K
epsilon and K are used to specify the convergence
criteria. We say the estimate.para is converged if for K consecutive
updates, the maximum change of parameter estimates in every adjacent
step is smaller than epsilon
maxIt
the maximum number of iterations of the EM algorithm to estimate parameters
seg.nSNP
the minimum number of SNPs per segment
traceIt
if traceIt is a integer n, then the running time is printed out in every n iterations of the EM algorithm.
if traceIt is 0 or negative, no tracing information is printed out.
Value
results are written into output files
Note
Copy number altered regions are identified, by default, based on the SNP level
copy number calls. A CNA region boundary is declared simply when the adjacent
SNPs have different copy numbers. An alternative approach is to use viterbi
algorithm to output the “best path”. Most time the results based on the SNP
level copy number calls are the same as the results from viterbi algorithm.
For the following up association studies, the SNP level information is more
relevant if we examine the association SNP by SNP.
Author(s)
Wei Sun and Zhengzheng Tang
Examples
data(snpData)
data(snpInfo)
dim(snpData)
dim(snpInfo)
snpData[1:2,]
snpInfo[1:2,]
snpInfo[c(1001,1100,10001,10200),]
plotCN(pos=snpInfo$Position, LRR=snpData$LRR, BAF=snpData$BAF,
main = "simulated data on Chr22")
snpNames = snpInfo$Name
chr = snpInfo$Chr
pos = snpInfo$Position
LRR = snpData$LRR
BAF = snpData$BAF
pBs = snpInfo$PFB
cnv.only=(snpInfo$PFB>1)
sampleID="simu1"
# Note this simulated data is more of CNV rather than CNA.
# For example, there is no tissue contamination.
# We just use it to illustrate the usage of genoCNA.
Theta = genoCNA(snpNames, chr, pos, LRR, BAF, pBs, contamination=TRUE,
normalGtp=NULL, sampleID, cnv.only=cnv.only, outputSeg = TRUE,
outputSNP = 1, outputTag = "simu1")
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(genoCN)
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/genoCN/genoCNA.Rd_%03d_medium.png", width=480, height=480)
> ### Name: genoCNA
> ### Title: Copy Number Aberration
> ### Aliases: genoCNA
> ### Keywords: methods
>
> ### ** Examples
>
> data(snpData)
> data(snpInfo)
>
> dim(snpData)
[1] 17348 3
> dim(snpInfo)
[1] 17348 4
>
> snpData[1:2,]
Name LRR BAF
1 rs2334386 -0.2440655 1.000000000
2 rs9617528 -0.1422038 0.007824231
> snpInfo[1:2,]
Name Chr Position PFB
1075853 rs2334386 22 14430353 0.956
1075854 rs9617528 22 14441016 0.176
>
> snpInfo[c(1001,1100,10001,10200),]
Name Chr Position PFB
1076853 rs1934895 22 17460150 0.140
1076952 rs1206549 22 17595860 0.740
1085853 rs17750152 22 35827972 0.856
1086052 rs8141057 22 36007895 0.780
>
> plotCN(pos=snpInfo$Position, LRR=snpData$LRR, BAF=snpData$BAF,
+ main = "simulated data on Chr22")
>
> snpNames = snpInfo$Name
> chr = snpInfo$Chr
> pos = snpInfo$Position
> LRR = snpData$LRR
> BAF = snpData$BAF
> pBs = snpInfo$PFB
> cnv.only=(snpInfo$PFB>1)
> sampleID="simu1"
>
> # Note this simulated data is more of CNV rather than CNA.
> # For example, there is no tissue contamination.
> # We just use it to illustrate the usage of genoCNA.
>
> Theta = genoCNA(snpNames, chr, pos, LRR, BAF, pBs, contamination=TRUE,
+ normalGtp=NULL, sampleID, cnv.only=cnv.only, outputSeg = TRUE,
+ outputSNP = 1, outputTag = "simu1")
5 Wed Jul 6 16:20:37 2016
10 Wed Jul 6 16:20:42 2016
15 Wed Jul 6 16:20:47 2016
20 Wed Jul 6 16:20:52 2016
converges after 20 iterations
estimate copy number states for chromosome:
22
>
>
>
>
>
> dev.off()
null device
1
>