is the integer index of the column (gene) of geData from which the search starts. Optionally, a list of genes (indexes of columns of geData) can be provided.
logFilePrefix
is a string containing the prefix of the log file generated by the algorithm. It is no longer necessary in this version of the package.
coeffMissingAllowed
This parameter controls the number of missing values tolerated by the pam classification procedure (see details).
subsetToUse
If necessary, the construction of the signature can be restricted to a subset of genes. In this case, a list of the columns of geData has to be provided.
cpuCluster
If a parallel search is necessary, this variable has to be set to the output of the NCPUS() function.
stopCpuCluster
flag controlling whether the channel to the cpu-cluster has to be closed
Details
In the global environment two variables have to be set up: geData and
stData. geData is a matrix whose columns are the gene expressions and whose rows
are the samples (see geNSCLC for an example). It is recommended that the column
names are set. stData is a variable of the "Surv" class from the
package "survival" (see stNSCLC for an example).
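As an illustration of the expected shapes, a toy pair of inputs could be built as follows. This is only a sketch: the gene names, sample names and survival values are invented, and it assumes the "survival" package is installed.

```r
library(survival)  # provides Surv()

set.seed(1)
nSamples <- 6

# geData: rows are samples, columns are genes (names are made up)
geData <- matrix(rnorm(nSamples * 3), nrow = nSamples,
                 dimnames = list(paste0("sample", 1:nSamples),
                                 c("geneA", "geneB", "geneC")))

# stData: follow-up times and event indicator (1 = event, 0 = censored);
# the values are invented for illustration
stData <- Surv(time = c(5, 8, 12, 3, 9, 15),
               event = c(1, 0, 1, 1, 0, 1))

dim(geData)       # samples x genes
colnames(geData)  # gene labels used to name the signature
class(stData)
```

Each row of geData is meant to correspond to one survival record in stData.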
Starting from the seed gene (a list of seeds is allowed), the next gene added is the one that maximizes the distance between the two survival curves. The list of genes grows until no further gene improves the distance between the survival curves.
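The growth of the gene list can be pictured as a greedy forward selection. The sketch below is not the package's implementation: greedyForward and toyScore are hypothetical names, and toyScore is an arbitrary stand-in for the distance between the two survival curves.

```r
# Greedy forward selection sketch (hypothetical helper, not package code).
# `scoreFun` stands in for the distance between the two survival curves.
greedyForward <- function(seed, candidates, scoreFun) {
  signature <- seed
  best <- scoreFun(signature)
  repeat {
    remaining <- setdiff(candidates, signature)
    if (length(remaining) == 0) break
    # score of the running signature extended by each remaining gene
    scores <- vapply(remaining,
                     function(g) scoreFun(c(signature, g)),
                     numeric(1))
    if (max(scores) <= best) break  # no gene improves the score: stop
    signature <- c(signature, remaining[which.max(scores)])
    best <- max(scores)
  }
  list(signature = signature, score = best)
}

# Toy score: rewards genes 2 and 5 and penalizes the signature length.
toyScore <- function(sig) sum(sig %in% c(2, 5)) - 0.1 * length(sig)

res <- greedyForward(seed = 1, candidates = 1:6, scoreFun = toyScore)
res$signature
```

With this toy score the search adds gene 2, then gene 5, and stops because no remaining gene increases the score.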
A gene (candidateGene) can be added to the running signature if it passes two
checks: given the classification computed on the gene expressions of
candidateGene + runningSignature, 1) no cluster can have a size lower than
floor(0.1 * nrow(geData)), and 2) the survival curves cannot cross. When more
than 1 candidate gene is proposed, if the number of candidates is greater than
0.01*ncol(geData) the search stops; otherwise a subset of the candidates is
selected using a backward strategy.
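The cluster-size check and the threshold on the number of candidates can be illustrated with a small sketch. The helper clusterSizeOk and all the numbers below are hypothetical and only mirror the two formulas above; the check on crossing survival curves is not covered.

```r
# Hypothetical helper (not part of the package): applies check 1 only.
# `classification` is a two-level factor with one entry per sample.
clusterSizeOk <- function(classification, nSamples) {
  minSize <- floor(0.1 * nSamples)
  all(table(classification) >= minSize)
}

nSamples <- 50  # stands in for nrow(geData)
balanced   <- factor(rep(c("good", "poor"), times = c(30, 20)))
unbalanced <- factor(rep(c("good", "poor"), times = c(47, 3)))

okBalanced   <- clusterSizeOk(balanced, nSamples)    # TRUE: both clusters >= 5
okUnbalanced <- clusterSizeOk(unbalanced, nSamples)  # FALSE: one cluster has 3

# Threshold on the number of candidates proposed at one step
nGenes <- 2000                  # stands in for ncol(geData)
maxCandidates <- 0.01 * nGenes  # the search stops above this count
```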
The parameter coeffMissingAllowed controls an empirical rule intended to prevent the pam() function from crashing. The number of joint missing values allowed in a sample described by p gene expression levels is given by floor(p^coeffMissingAllowed).
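To give a feeling for the rule, the number of tolerated missing values can be tabulated for a few signature lengths. The value 0.75 used here is only an illustrative choice for coeffMissingAllowed.

```r
coeffMissingAllowed <- 0.75   # illustrative value only
p <- c(1, 2, 5, 10, 20)       # number of genes describing a sample

# joint missing values tolerated for each signature length
allowed <- floor(p^coeffMissingAllowed)
data.frame(p = p, allowed = allowed)
```

The tolerated count grows more slowly than p, so longer signatures admit proportionally fewer missing values per sample.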
Value
The function returns a list with the following slots
signatureName
is a string for identifying the signature. By default it is
set to (colnames(geData)[seedGene])[1].
startingSignature
is a list of strings set to colnames(geData)[seedGene]
coeffMissingAllowed
same as input
startingClassification
(factor) classification of the samples computed by using the gene expression levels of the startingSignature
startingTValue
test-value of the log-rank test computed on the startingSignature
startingPValue
p-value corresponding to the startingTValue
signatureIDs
indexes pointing to the columns of geData providing the sequence of gene expression levels that maximizes the distance between the two survival curves
signature
labels corresponding to signatureIDs: colnames(geData)[signatureIDs]
tValue
test-value of the log-rank test computed on the signature
pValue
p-value corresponding to the tValue
classification
(factor) classification of the samples computed by using the gene expression levels of the signature
Author(s)
Stefano M. Pagnotta and Michele Ceccarelli
See Also
geNSCLC, stNSCLC.
Examples
# find the signature starting from the gene SELP for the Non Small Cell Lung Cancer
#############
# set the working data
data(geNSCLC)
geData <- geNSCLC
data(stNSCLC)
stData <- stNSCLC
##############
# set the dimension of the cpu's cluster
aMakeCluster <- makeCluster(2)
################
# set the starting gene to SELP
geneSeed <- which(colnames(geData) == "SELP")
##################
# run ...
ans <- signatureFinder(geneSeed, cpuCluster = aMakeCluster)
ans