R Graphical Manual

Last data update: 2014.03.03

R: Cross-validation for PCA

Q2	R Documentation

Cross-validation for PCA

Description

Internal cross-validation can be used for estimating the level of structure in a data set and to optimise the choice of number of principal components.

Usage

Q2(object, originalData = completeObs(object), fold = 5, nruncv = 1,
  type = c("krzanowski", "impute"), verbose = interactive(),
  variables = 1:nVar(object), ...)

Arguments

`object`	A `pcaRes` object (the result of a previous PCA analysis).
`originalData`	The matrix (or ExpressionSet) that was used to obtain the pcaRes object.
`fold`	The number of groups to divide the data into.
`nruncv`	The number of times to repeat the whole cross-validation.
`type`	The type of cross-validation: "krzanowski" (the default) or "impute".
`verbose`	`boolean`. If TRUE, Q2 outputs a primitive progress bar.
`variables`	Indices of the variables to use during the cross-validation calculation. Other variables are kept as they are and do not contribute to the total sum-of-squares.
`...`	Further arguments passed to the `pca` function called within Q2.
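
To make the role of `fold` concrete, here is a minimal base-R sketch of dividing the rows of a data set into `fold` groups of near-equal size. The helper `make_folds` is hypothetical and is not part of pcaMethods; how Q2 partitions the data internally may differ.

```r
# Hypothetical helper (not part of pcaMethods): assign n items to `fold`
# groups of near-equal size, in random order.
make_folds <- function(n, fold = 5) {
  stopifnot(fold >= 2, fold <= n)
  sample(rep(seq_len(fold), length.out = n))
}

set.seed(1)
groups <- make_folds(nrow(iris), fold = 5)
table(groups)  # number of rows assigned to each group
```

Each group would be left out once while a model is fitted on the remaining rows; repeating the whole procedure with a new random assignment corresponds to the idea behind `nruncv`.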

Details

This method calculates Q^2 for a PCA model. Q^2 is the cross-validated version of R^2 and can be interpreted as the fraction of variance that can be predicted independently by the PCA model. A poor (low) Q^2 indicates that the PCA model only describes noise and is unrelated to the true data structure. Q^2 is defined as:

Q^2 = 1 - \frac{\sum_{i}^{k} \sum_{j}^{n} (x - \hat{x})^2}{\sum_{i}^{k} \sum_{j}^{n} x^2}

for the matrix x, which has n rows and k columns. For a given number of PCs, x is estimated as \hat{x} = TP' (T are the scores and P are the loadings). Although this defines leave-one-out cross-validation, that is not what is performed if fold is less than the number of rows and/or columns.

In 'impute' type CV, diagonal rows of elements in the matrix are deleted and then re-estimated. In 'krzanowski' type CV, rows are sequentially left out to build fold PCA models, which give the loadings. Then, columns are sequentially left out to build fold models for the scores. By combining scores and loadings from different models, we can estimate completely left-out values.

The two types may seem similar but can give very different results: krzanowski typically yields more stable and reliable results for estimating data structure, whereas impute is better for evaluating missing value imputation performance. Note that since Krzanowski CV operates on a reduced matrix, it is not possible to estimate Q2 for all components, and the result vector may therefore be shorter than nPcs(object).
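
The formula and the reconstruction \hat{x} = TP' can be illustrated with a short base-R sketch. It assumes column-centred data and obtains scores and loadings from `svd`; the function name `q2_formula` is made up for this illustration. No cross-validation is performed here, so the result is the ordinary R^2 of the reconstruction, not the Q^2 that `Q2` returns.

```r
# Base-R sketch of the Q^2 formula. No values are left out, so this
# evaluates to the plain R^2 of the PCA reconstruction.
q2_formula <- function(x, xhat) {
  1 - sum((x - xhat)^2) / sum(x^2)
}

# Column-centred data matrix (n = 150 rows, k = 4 columns)
x <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
s <- svd(x)

npcs <- 2
scores   <- s$u[, 1:npcs, drop = FALSE] %*% diag(s$d[1:npcs], npcs)  # T
loadings <- s$v[, 1:npcs, drop = FALSE]                               # P
xhat     <- scores %*% t(loadings)                                    # hat{x} = TP'

r2 <- q2_formula(x, xhat)
r2
```

In a cross-validated setting, the entries of `xhat` would instead come from models that did not see the corresponding entries of `x`, which is why Q^2 can be much lower than R^2 and can even be negative.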

Value

A matrix or vector with Q^2 estimates.

Author(s)

Henning Redestig, Ondrej Mikula

References

Krzanowski, W.J. (1987). Cross-validation in principal component analysis. Biometrics 43(3), 575-584.

Examples

library(pcaMethods)  # provides pca() and Q2()
data(iris)
x <- iris[,1:4]
## Krzanowski-type CV (the default)
pcIr <- pca(x, nPcs=3)
q2 <- Q2(pcIr, x)
barplot(q2, main="Krzanowski CV", xlab="Number of PCs", ylab=expression(Q^2))
## q2 for a single variable
Q2(pcIr, x, variables=2)
## imputation-type CV on a NIPALS model
pcIr <- pca(x, nPcs=3, method="nipals")
q2 <- Q2(pcIr, x, type="impute")
barplot(q2, main="Imputation CV", xlab="Number of PCs", ylab=expression(Q^2))
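
Since Q^2 is meant to help choose the number of principal components, one simple way to read a vector of estimates is to take the component count with the highest Q^2. This is only one possible heuristic and an assumption of this sketch, not a rule prescribed by the package. The values below are the single-variable estimates printed in the Results section, entered by hand so the snippet runs without pcaMethods.

```r
# Q^2 estimates entered by hand (taken from the single-variable run in
# the Results transcript); in practice this would be the output of Q2().
q2 <- c("PC 1" = 0.1417291, "PC 2" = 0.3654055, "PC 3" = 0.4815427)

# Simple heuristic (an assumption, not a package rule): pick the number
# of components that maximises Q^2.
best <- which.max(q2)
names(best)
```

A Q^2 curve that is still rising at the last fitted component, as it is here, suggests checking a model with more components before settling on a choice.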

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> library(pcaMethods)

> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/pcaMethods/Q2.Rd_%03d_medium.png", width=480, height=480)
> ### Name: Q2
> ### Title: Cross-validation for PCA
> ### Aliases: Q2
> ### Keywords: multivariate
> 
> ### ** Examples
> 
> data(iris)
> x <- iris[,1:4]
> pcIr <- pca(x, nPcs=3)
> q2 <- Q2(pcIr, x)
> barplot(q2, main="Krzanowski CV", xlab="Number of PCs", ylab=expression(Q^2))
> ## q2 for a single variable
> Q2(pcIr, x, variables=2)
     PC 1      PC 2      PC 3 
0.1417291 0.3654055 0.4815427 
> pcIr <- pca(x, nPcs=3, method="nipals")
> q2 <- Q2(pcIr, x, type="impute")
> barplot(q2, main="Imputation CV", xlab="Number of PCs", ylab=expression(Q^2))
> dev.off()
null device 
          1 
>