Determines the significance of pre-defined sets of genes with respect to
an outcome
variable, such as a group indicator, a quantitative variable or a survival time
Data x: p by n matrix of features (expression values),
one observation per column (missing values allowed); y: n-vector of outcome measurements
y
Vector of response values: 1,2 for two class
problem, or 1,2,3 ... for multiclass problem, or real numbers for quantitative
or survival problems
genesets
Gene set collection (a list)
genenames
Vector of genenames in expression dataset
method
Method for summarizing a gene set: "maxmean" (default), "mean" or "absmean"
resp.type
Problem type: "quantitative" for a continuous parameter;
"Two class unpaired" ; "Survival" for censored survival outcome; "Multiclass" :
more than 2 groups, coded 1,2,3...; "Two class paired" for paired outcomes, coded -1,1 (first pair), -2,2 (second pair), etc
censoring.status
Vector of censoring status values for survival problems,
1 mean death or failure, 0 means censored
random.seed
Optional initial seed for random number generator (integer)
knn.neighbors
Number of nearest neighbors to use for imputation
of missing features values
s0
Exchangeability factor for denominator of test statistic; Default
is automatic choice
s0.perc
Percentile of standard deviation values to use for s0; default is
automatic choice; -1 means s0=0 (different from s0.perc=0, meaning
s0=zeroeth percentile of standard deviation values= min of sd values)
minsize
Minimum number of genes in genesets to be considered
maxsize
Maximum number of genes in genesets to be considered
restand
Should restandardization be done? Default TRUE
,
restand.basis
What should be used to do the restandardization?
The set of genes in the genesets ("catalog", the default) or the
genes in the data set ("data")
nperms
Number of permutations used to estimate false discovery rates
xl.mode
Used by Excel interface
xl.time
Used by Excel interface
xl.prevfit
Used by Excel interface
Details
Carries out a Gene set analysis, as described in the paper
by Efron and Tibshirani (2006). It differs from a Gene Set Enrichment Analysis
(Subramanian et al 2006)
in its use of the "maxmean" statistic: this is the mean of the positive or
negative part of gene scores in the gene set, whichever is large in absolute
values. Efron and Tibshirani shows that this is often more powerful
than the modified KS statistic used in GSEA.
GSA also does "restandardization" of the genes (rows), on top of the
permutation of columns (done in GSEA).
Gene set analysis is applicable to microarray data
and other data with a large number of features. This is also the R package
that is called by the "official" SAM Excel package v3.0.
The format of the response vector y and the calling sequence
is illustrated in the examples below. A more complete description
is given in the SAM manual
at http://www-stat.stanford.edu/~tibs/SAM
Value
A list with components
GSA.scores
Gene set scores for each gene set
GSA.scores.perm
Matrix of Gene set scores from permutions, one column per
permutation
fdr.lo
Estimated false discovery rates for negative gene sets
(negative means lower expression correlates with class 2 in two sample problems,
lower expression correlates with increased y for quantitative problems,
lower expression correlates with higher risk for survival problems)
fdr.hi
Estimated false discovery rates for positive gene sets; positive
is opposite of negative, as defined above
pvalues.lo
P-values for negative gene sets
pvalues.hi
P-values for positive gene sets
stand.info
Information from restandardization process
stand.info.star
Information from restandardization process in permutations
ngenes
Number of genes in union of gene sets
nperms
Number of permutations used
gene.scores
Individual gene scores (eg t-statistics for two class problem)
s0
Computed exchangeability factor
s0.perc
Computed percentile of standard deviation values.
s0= s0.perc percentile of the gene standard deviations
call
The call to GSA
x
For internal use
y
For internal use
genesets
For internal use
genenames
For internal use
r.obs
For internal use
r.star
For internal use
gs.mat
For internal use
gs.ind
For internal use
catalog
For internal use
catalog.unique
For internal use
Author(s)
Robert Tibshirani
References
Efron, B. and Tibshirani, R.
On testing the significance of sets of genes. Stanford tech report rep 2006.
http://www-stat.stanford.edu/~tibs/ftp/GSA.pdf
Subramanian, A. and Tamayo, P. Mootha, V. K. and Mukherjee, S. and Ebert, B. L. and Gillette, M. A. and Paulovich, A. and Pomeroy, S. L. and Golub, T. R. and Lander, E. S. and Mesirov, J. P. (2005) A knowledge-based approach for interpreting genome-wide expression profiles. PNAS. 102, pg 15545-15550.
Examples
######### two class unpaired comparison
# y must take values 1,2
set.seed(100)
x<-matrix(rnorm(1000*20),ncol=20)
dd<-sample(1:1000,size=100)
u<-matrix(2*rnorm(100),ncol=10,nrow=100)
x[dd,11:20]<-x[dd,11:20]+u
y<-c(rep(1,10),rep(2,10))
genenames=paste("g",1:1000,sep="")
#create some random gene sets
genesets=vector("list",50)
for(i in 1:50){
genesets[[i]]=paste("g",sample(1:1000,size=30),sep="")
}
geneset.names=paste("set",as.character(1:50),sep="")
GSA.obj<-GSA(x,y, genenames=genenames, genesets=genesets, resp.type="Two class unpaired", nperms=100)
GSA.listsets(GSA.obj, geneset.names=geneset.names,FDRcut=.5)
#to use "real" gene set collection, we read it in from a gmt file:
#
# geneset.obj<- GSA.read.gmt("file.gmt")
#
# where file.gmt is a gene set collection from GSEA collection or
# or the website http://www-stat.stanford.edu/~tibs/GSA, or one
# that you have created yourself. Then
# GSA.obj<-GSA(x,y, genenames=genenames, genesets=geneset.obj$genesets, resp.type="Two class unpaired", nperms=100)
#
#
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(GSA)
> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/GSA/GSA.Rd_%03d_medium.png", width=480, height=480)
> ### Name: GSA
> ### Title: Gene set analysis
> ### Aliases: GSA
> ### Keywords: univar survival ts nonparametric
>
> ### ** Examples
>
>
> ######### two class unpaired comparison
> # y must take values 1,2
>
> set.seed(100)
> x<-matrix(rnorm(1000*20),ncol=20)
> dd<-sample(1:1000,size=100)
>
> u<-matrix(2*rnorm(100),ncol=10,nrow=100)
> x[dd,11:20]<-x[dd,11:20]+u
> y<-c(rep(1,10),rep(2,10))
>
>
> genenames=paste("g",1:1000,sep="")
>
> #create some random gene sets
> genesets=vector("list",50)
> for(i in 1:50){
+ genesets[[i]]=paste("g",sample(1:1000,size=30),sep="")
+ }
> geneset.names=paste("set",as.character(1:50),sep="")
>
> GSA.obj<-GSA(x,y, genenames=genenames, genesets=genesets, resp.type="Two class unpaired", nperms=100)
perm= 10 / 100
perm= 20 / 100
perm= 30 / 100
perm= 40 / 100
perm= 50 / 100
perm= 60 / 100
perm= 70 / 100
perm= 80 / 100
perm= 90 / 100
perm= 100 / 100
>
>
> GSA.listsets(GSA.obj, geneset.names=geneset.names,FDRcut=.5)
$FDRcut
[1] 0.5
$negative
Gene_set Gene_set_name Score p-value FDR
[1,] "11" "set11" "-0.3151" "0" "0"
[2,] "15" "set15" "-0.6354" "0" "0"
[3,] "50" "set50" "-0.3587" "0.02" "0.3333"
$positive
Gene_set Gene_set_name Score p-value FDR
[1,] "6" "set6" "0.5037" "0" "0"
[2,] "17" "set17" "0.7162" "0" "0"
[3,] "31" "set31" "0.4192" "0" "0"
$nsets.neg
[1] 3
$nsets.pos
[1] 3
>
>
>
> #to use "real" gene set collection, we read it in from a gmt file:
> #
> # geneset.obj<- GSA.read.gmt("file.gmt")
> #
> # where file.gmt is a gene set collection from GSEA collection or
> # or the website http://www-stat.stanford.edu/~tibs/GSA, or one
> # that you have created yourself. Then
>
> # GSA.obj<-GSA(x,y, genenames=genenames, genesets=geneset.obj$genesets, resp.type="Two class unpaired", nperms=100)
> #
> #
>
>
>
>
>
>
>
>
> dev.off()
null device
1
>