Implementation of probabilistic PCA (PPCA). PPCA makes it possible to perform
PCA on incomplete data and may be used for missing value
estimation. This script was implemented after the Matlab version
provided by Jakob Verbeek (see
http://lear.inrialpes.fr/~verbeek/) and the draft “EM
Algorithms for PCA and Sensible PCA” written by Sam Roweis.
matrix – Data containing the variables in
columns and observations in rows. The data may contain missing
values, denoted as NA.
nPcs
numeric – Number of components to
estimate. The accuracy of the missing value estimation depends
on the number of components, which should reflect the internal
structure of the data.
seed
numeric – Set the seed for the random number
generator. PPCA fills the initial loading matrix with
random numbers drawn from a normal distribution, so results may
vary slightly between runs. Set the seed to reproduce your
results exactly.
threshold
Convergence threshold.
maxIterations
Maximum number of allowed iterations.
...
Reserved for future use. Currently no further
parameters are used.
Details
Probabilistic PCA combines an EM approach for PCA with a
probabilistic model. The EM approach is based on the assumption
that the latent variables as well as the noise are normally
distributed.
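As a rough illustration of the EM approach described above, the following base R sketch fits the PPCA model to complete (non-missing) data, using the standard EM updates for a loading matrix W and an isotropic noise variance sigma2. It is not the code used by this package: the function name ppca_em_sketch, the stopping rule (largest absolute change in W), the starting value sigma2 = 1 and the simulated data are assumptions made for illustration, and the handling of missing values is left out.

```r
# Minimal EM sketch for PPCA on complete data (illustrative, not the package code).
ppca_em_sketch <- function(Y, nPcs = 2, threshold = 1e-5,
                           maxIterations = 1000, seed = 1) {
  set.seed(seed)
  Y <- scale(Y, center = TRUE, scale = FALSE)   # remove the column means
  n <- nrow(Y)
  d <- ncol(Y)
  W <- matrix(rnorm(d * nPcs), nrow = d, ncol = nPcs)  # random initial loadings
  sigma2 <- 1                                          # assumed starting noise variance
  for (iter in seq_len(maxIterations)) {
    # E-step: posterior means (rows of X) and summed second moments (Sxx)
    # of the latent variables, given the current W and sigma2
    M_inv <- solve(crossprod(W) + sigma2 * diag(nPcs))
    X <- Y %*% W %*% M_inv
    Sxx <- n * sigma2 * M_inv + crossprod(X)
    # M-step: update the loadings, then the noise variance
    W_new <- crossprod(Y, X) %*% solve(Sxx)
    sigma2 <- (sum(Y^2) - 2 * sum(X * (Y %*% W_new)) +
                 sum(Sxx * crossprod(W_new))) / (n * d)
    change <- max(abs(W_new - W))
    W <- W_new
    if (change < threshold) break
  }
  list(W = W, sigma2 = sigma2, iterations = iter)
}

# Simulated data: 200 observations, 6 variables, 2 latent components, small noise
set.seed(42)
scores_true <- matrix(rnorm(200 * 2), ncol = 2)
loadings_true <- matrix(rnorm(2 * 6), nrow = 2)
Y <- scores_true %*% loadings_true + matrix(rnorm(200 * 6, sd = 0.1), ncol = 6)

fit <- ppca_em_sketch(Y, nPcs = 2)
fit$sigma2      # estimated noise variance
fit$iterations  # number of EM iterations used
```

The argument names mirror nPcs, threshold, maxIterations and seed from this page only to make the correspondence easy to follow; the package may apply its convergence threshold to a different quantity.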
In standard PCA, data that are far from the training set but close
to the principal subspace may have the same reconstruction error
as data that lie near the training set.
PPCA defines a likelihood function such that the likelihood of
data far from the training set is much lower, even if they are
close to the principal subspace. This can improve the
estimation accuracy.
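The difference between reconstruction error and likelihood can be made concrete with a small base R sketch. It assumes a hypothetical two-dimensional model with a single principal axis; the values of W, sigma2 and mu and the helper names log_density and recon_error are chosen for illustration and do not come from this package. Under the PPCA model the observed data are Gaussian with covariance W W^T + sigma2 I, which is what log_density evaluates.

```r
# Hypothetical PPCA model in 2-D with one principal axis along (1, 0)
W <- matrix(c(2, 0), nrow = 2)        # loadings (d x q), assumed values
sigma2 <- 0.1                         # assumed noise variance
mu <- c(0, 0)                         # assumed data mean
C <- W %*% t(W) + sigma2 * diag(2)    # marginal covariance of the observed data

# Gaussian log-density of a point under the PPCA model
log_density <- function(y) {
  r <- y - mu
  drop(-0.5 * (length(y) * log(2 * pi) + log(det(C)) + t(r) %*% solve(C, r)))
}

# Squared distance between a point and its projection onto the principal subspace
recon_error <- function(y) {
  P <- W %*% solve(crossprod(W)) %*% t(W)   # orthogonal projector onto span(W)
  r <- y - mu
  sum((r - P %*% r)^2)
}

near <- c(1, 0)    # on the principal axis, close to the data mean
far  <- c(50, 0)   # on the principal axis, far from the data mean

recon_error(near)  # both points lie in the subspace, so both errors are zero
recon_error(far)
log_density(near) > log_density(far)   # the likelihood still separates them
```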
A method called kEstimate is provided to estimate the
optimal number of components via cross validation. In general, few
components are sufficient for reasonable estimation accuracy. See
also the package documentation for further discussion of the kinds
of data for which PCA-based missing value estimation is advisable.
Complexity: Runtime is linear in the number of data points, the
number of data dimensions and the number of principal components.
Convergence: The threshold indicating convergence was
changed from 1e-3 in 1.2.x to 1e-5 in the current version, leading
to more stable results. For reproducibility you can set the seed
(parameter seed) of the random number generator. If PPCA is used for
missing value estimation, results may be checked by running
the algorithm several times with different seeds; if the estimated
values show little variance, the algorithm converged well.
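The seed check described above could be organised along the following lines. This is a sketch in base R under stated assumptions: seed_spread is a hypothetical helper, not part of this package, and toy_impute is a deliberately simple stand-in imputer (column means plus seed-dependent jitter) used only so that the block runs on its own.

```r
# Hypothetical helper: impute once per seed and return, for every missing
# cell, the standard deviation of its estimates across the seeds.
seed_spread <- function(X, impute, seeds = 1:5) {
  missing <- is.na(X)
  est <- vapply(seeds, function(s) impute(X, s)[missing],
                numeric(sum(missing)))
  est <- matrix(est, nrow = sum(missing))   # one row per missing cell
  apply(est, 1, sd)
}

# Stand-in imputer for illustration only: column means plus small random jitter
toy_impute <- function(X, seed) {
  set.seed(seed)
  col_means <- colMeans(X, na.rm = TRUE)
  idx <- which(is.na(X), arr.ind = TRUE)
  X[idx] <- col_means[idx[, "col"]] + rnorm(nrow(idx), sd = 0.01)
  X
}

X <- matrix(c(1, NA, 3, 4, 5, NA, 7, 8, 9), nrow = 3)
seed_spread(X, toy_impute)   # small values indicate seed-stable estimates
```

With this package, impute could wrap the call shown in the Examples section, for instance function(X, seed) completeObs(pca(X, method = "ppca", nPcs = 3, seed = seed)).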
Value
Standard PCA result object used by all PCA-based methods
of this package. Contains scores, loadings, data mean and
more. See pcaRes for details.
Note
Requires MASS. It is not recommended to use this
function directly; use the pca() wrapper function instead.
Author(s)
Wolfram Stacklies
See Also
bpca, svdImpute, prcomp,
nipalsPca, pca, pcaRes.
Examples
## Load a sample metabolite dataset with 5% missing values (metaboliteData)
data(metaboliteData)
## Perform probabilistic PCA using the 3 largest components
result <- pca(t(metaboliteData), method="ppca", nPcs=3, seed=123)
## Get the estimated complete observations
cObs <- completeObs(result)
## Plot the scores
plotPcs(result, type = "scores")
Results
> library(pcaMethods)
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'pcaMethods'
The following object is masked from 'package:stats':
loadings
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/pcaMethods/ppca.Rd_%03d_medium.png", width=480, height=480)
> ### Name: ppca
> ### Title: Probabilistic PCA
> ### Aliases: ppca
> ### Keywords: multivariate
>
> ### ** Examples
>
> ## Load a sample metabolite dataset with 5% missing values (metaboliteData)
> data(metaboliteData)
> ## Perform probabilistic PCA using the 3 largest components
> result <- pca(t(metaboliteData), method="ppca", nPcs=3, seed=123)
> ## Get the estimated complete observations
> cObs <- completeObs(result)
> ## Plot the scores
> plotPcs(result, type = "scores")
> ## Don't show:
> stopifnot(sum((fitted(result) - t(metaboliteData))^2, na.rm=TRUE) < 200)
> ## End(Don't show)
> dev.off()
null device
1
>