Implements a Bayesian PCA missing value estimator. The script
is a port of the Matlab version provided by Shigeyuki OBA. See
also http://ishiilab.jp/member/oba/tools/BPCAFill.html.
BPCA combines an EM approach for PCA with a Bayesian model. In
standard PCA data far from the training set but close to the
principal subspace may have the same reconstruction error. BPCA
defines a likelihood function such that the likelihood for data
far from the training set is much lower, even if they are close to
the principal subspace.
matrix – Pre-processed matrix (centered,
scaled) with variables in columns and observations in rows. The
data may contain missing values, denoted as NA.
nPcs
numeric – Number of components used for
re-estimation. Choosing few components may decrease the
estimation precision.
maxSteps
numeric – Maximum number of estimation
steps.
verbose
boolean – BPCA prints the number of steps
and the increase in precision if set to TRUE. Default is
interactive().
threshold
convergence threshold
...
Reserved for future use. Currently no further
parameters are used
Details
Scores and loadings obtained with Bayesian PCA slightly differ
from those obtained with conventional PCA. This is because BPCA
was developed especially for missing value estimation. The
algorithm does not force orthogonality between factor loadings, as
a result factor loadings are not necessarily orthogonal. However,
the BPCA authors found that including an orthogonality criterion
made the predictions worse.
The authors also state that the difference between real and
predicted Eigenvalues becomes larger when the number of
observation is smaller, because it reflects the lack of
information to accurately determine true factor loadings from the
limited and noisy data. As a result, weights of factors to
predict missing values are not the same as with conventional PCA,
but the missing value estimation is improved.
BPCA works iteratively, the complexity is growing with
O(n^3) because several matrix inversions are
required. The size of the matrices to invert depends on the
number of components used for re-estimation.
Finding the optimal number of components for estimation is not a
trivial task; the best choice depends on the internal structure of
the data. A method called kEstimate is provided to
estimate the optimal number of components via cross validation.
In general few components are sufficient for reasonable estimation
accuracy. See also the package documentation for further
discussion about on what data PCA-based missing value estimation
makes sense.
It is not recommended to use this function directely but rather to
use the pca() wrapper function.
There is a difference with respect the interpretation of rows
(observations) and columns (variables) compared to matlab
implementation. For estimation of missing values for microarray
data, the suggestion in the original bpca is to intepret genes as
observations and the samples as variables. In pcaMethods however,
genes are interpreted as variables and samples as observations
which arguably also is the more natural interpretation. For bpca
behavior like in the matlab implementation, simply transpose your
input matrix.
Details about the probabilistic model underlying BPCA are found in
Oba et. al 2003. The algorithm uses an expectation maximation
approach together with a Bayesian model to approximate the
principal axes (eigenvectors of the covariance matrix in PCA).
The estimation is done iteratively, the algorithm terminates if
either the maximum number of iterations was reached or if the
estimated increase in precision falls below 1e^-4.
Complexity: The relatively high complexity of the method is
a result of several matrix inversions required in each step.
Considering the case that the maximum number of iteration steps is
needed, the approximate complexity is given by the term
maxSteps * row_miss
* O(n^3)
Where row_miss is the number of rows
containing missing values and O(n^3) is the
complexity for inverting a matrix of size
components. Components is the number of
components used for re-estimation.
Value
Standard PCA result object used by all PCA-based methods
of this package. Contains scores, loadings, data mean and
more. See pcaRes for details.
Note
Requires MASS.
Author(s)
Wolfram Stacklies
References
Shigeyuki Oba, Masa-aki Sato, Ichiro Takemasa, Morito
Monden, Ken-ichi Matsubara and Shin Ishii. A Bayesian missing
value estimation method for gene expression profile
data. Bioinformatics, 19(16):2088-2096, Nov 2003.
## Load a sample metabolite dataset with 5% missig values (metaboliteData)e
data(metaboliteData)
## Perform Bayesian PCA with 2 components
pc <- pca(t(metaboliteData), method="bpca", nPcs=2)
## Get the estimated principal axes (loadings)
loadings <- loadings(pc)
## Get the estimated scores
scores <- scores(pc)
## Get the estimated complete observations
cObs <- completeObs(pc)
## Now make a scores and loadings plot
slplot(pc)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(pcaMethods)
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'pcaMethods'
The following object is masked from 'package:stats':
loadings
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/pcaMethods/bpca.Rd_%03d_medium.png", width=480, height=480)
> ### Name: bpca
> ### Title: Bayesian PCA missing value estimation
> ### Aliases: bpca
> ### Keywords: multivariate
>
> ### ** Examples
>
> ## Load a sample metabolite dataset with 5% missig values (metaboliteData)e
> data(metaboliteData)
> ## Perform Bayesian PCA with 2 components
> pc <- pca(t(metaboliteData), method="bpca", nPcs=2)
> ## Get the estimated principal axes (loadings)
> loadings <- loadings(pc)
> ## Get the estimated scores
> scores <- scores(pc)
> ## Get the estimated complete observations
> cObs <- completeObs(pc)
> ## Now make a scores and loadings plot
> slplot(pc)
> ## Don't show:
> stopifnot(sum((fitted(pc) - t(metaboliteData))^2, na.rm=TRUE) < 200)
> ## End(Don't show)
>
>
>
>
>
> dev.off()
null device
1
>