R: Maximum Likelihood Estimation for Multivariate Normal Data...
monomvn
R Documentation
Maximum Likelihood Estimation for Multivariate Normal
Data with Monotone Missingness
Description
Maximum likelihood estimation of the mean and covariance matrix of
multivariate normal (MVN) distributed data with a monotone missingness pattern.
Through the use of parsimonious/shrinkage regressions (e.g., plsr, pcr,
ridge, lasso, etc.), where standard regressions fail,
this function can handle an (almost) arbitrary amount of missing data
data matrix were each row is interpreted as a
random sample from a MVN distribution with missing
values indicated by NA
pre
logical indicating whether pre-processing of the
y is to be performed. This sorts the columns so that the
number of NAs is non-decreasing with the column index
method
describes the type of parsimonious
(or shrinkage) regression to
be performed when standard least squares regression fails.
From the pls package we have "plsr"
(plsr, the default) for partial least squares and
"pcr" (pcr) for standard principal
component regression. From the lars package (see the
"type" argument to lars)
we have "lasso" for L1-constrained regression, "lar"
for least angle regression, "forward.stagewise" and
"stepwise" for fast implementations of classical forward
selection of covariates. From the MASS package we have
"ridge" as implemented by the lm.ridge
function. The "factor" method treats the first p
columns of y as known factors
p
when performing regressions, p is the proportion of the
number of columns to rows in the design matrix before an
alternative regression method (those above) is performed as if
least-squares regression has “failed”. Least-squares regression is
known to fail when the number of columns equals the number of rows,
hence a default of p = 0.9 <= 1. Alternatively, setting
p = 0 forces method to be used for every regression.
Intermediate settings of p allow the user to control when
least-squares regressions stop and the method ones start.
When method = "factor" the p argument represents an
integer (positive) number of initial columns of y to treat
as known factors
ncomp.max
maximal number of (principal) components to include
in a method—only meaningful for the "plsr" or
"pcr" methods. Large settings can cause the execution to be
slow as it drastically increases the cross-validation (CV) time
batch
indicates whether the columns with equal missingness
should be processed together using a multi-response regression.
This is more efficient if many OLS regressions are used, but can
lead to slightly poorer, even unstable, fits when parsimonious
regressions are used
validation
method for cross validation when applying
a parsimonious regression method. The default setting
of "CV" (randomized 10-fold cross-validation) is the faster
method, but does not yield a deterministic result and does not apply
for regressions on less than ten responses.
"LOO" (leave-one-out cross-validation)
is deterministic, always applicable, and applied automatically whenever
"CV" cannot be used. When standard least squares is
appropriate, the methods implemented the
lars package (e.g. lasso) support model choice via the
"Cp" statistic, which defaults to the "CV" method
when least squares fails. This argument is ignored for the
"ridge" method; see details below
obs
logical indicating whether or not to (additionally)
compute a mean vector and covariance matrix based only on the observed
data, without regressions. I.e., means are calculated as averages
of each non-NA entry in the columns of y, and entries
(a,b) of the
covariance matrix are calculated by applying cov(ya,yb)
to the jointly non-NA entries of columns a and b
of y
verb
whether or not to print progress indicators. The default
(verb = 0) keeps quiet, while any positive number causes brief
statement about dimensions of each regression to print to
the screen as it happens. verb = 2 causes each of the ML
regression estimators to be printed along with the corresponding
new entries of the mean and columns of the covariance matrix.
verb = 3 requires that the RETURN key be pressed between
each print statement
quiet
causes warnings about regressions to be silenced
when TRUE
Details
If pre = TRUE then monomvn first re-arranges the columns
of y into nondecreasing order with respect to the number of
missing (NA) entries. Then (at least) the first column should
be completely observed. The mean components and covariances between
the first set of complete columns are obtained through the standard
mean and cov routines.
Next each successive group of columns with the same missingness pattern
is processed in sequence (assuming batch = TRUE).
Suppose a total of j columns have
been processed this way already. Let y2 represent the non-missing
contingent of the next group of k columns of y
with and identical missingness pattern, and let y1 be the
previously processed j-1 columns of y
containing only the rows
corresponding to each non-NA entry in y2. I.e.,
nrow(y1) = nrow(y2). Note that y1 contains no
NA entries since the missing data pattern is monotone.
The k next entries (indices j:(j+k)) of the mean vector,
and the j:(j+k) rows and columns of the covariance matrix are
obtained by multivariate regression of y2 on y1.
The regression method used (except in the case of method =
"factor" depends on the number of rows and columns
in y1 and on the p parameter. Whenever ncol(y1)
< p*nrow(y1) least-squares regression is used, otherwise
method = c("pcr", "plsr"). If ever a least-squares regression
fails due to co-linearity then one of the other methods is
tried. The "factor" method always involves an OLS regression
on (a subset of) the first p columns of y.
All methods require a scheme for estimating the amount of
variability explained by increasing the numbers of coefficients
(or principal components) in the model.
Towards this end, the pls and lars packages support
10-fold cross validation (CV) or leave-one-out (LOO) CV estimates of
root mean squared error. See pls and lars for
more details. monomvn uses
CV in all cases except when nrow(y1) <= 10, in which case CV fails and
LOO is used. Whenever nrow(y1) <= 3pcr
fails, so plsr is used instead.
If quiet = FALSE then a warning
is given whenever the first choice for a regression fails.
For pls methods, RMSEs are calculated for a number of
components in 1:ncomp.max where
a NULL value for ncomp.max it is replaced with
ncomp.max <- min(ncomp.max, ncol(y2), nrow(y1)-1)
which is the max allowed by the pls package.
Simple heuristics are used to select a small number of components
(ncomp for pls), or number of coefficients (for
lars), which explains a large amount of the variability (RMSE).
The lars methods use a “one-standard error rule” outlined
in Section 7.10, page 216 of HTF below. The
pls package does not currently support the calculation of
standard errors for CV estimates of RMSE, so a simple linear penalty
for increasing ncomp is used instead. The ridge constant
(lambda) for lm.ridge is set using the
optimize function on the GCV output.
Based on the ML ncol(y1)+1 regression coefficients (including
intercept) obtained for each of the
columns of y2, and on the corresponding matrix of
residual sum of squares, and on the previous j-1 means
and rows/cols of the covariance matrix, the j:(j+k) entries and
rows/cols can be filled in as described by Little and Rubin, section 7.4.3.
Once every column has been processed, the entries of the mean vector, and
rows/cols of the covariance matrix are re-arranged into their original
order.
Value
monomvn returns an object of class "monomvn", which is a
list containing a subset of the components below.
call
a copy of the function call as used
mu
estimated mean vector with columns corresponding to the
columns of y
S
estimated covariance matrix with rows and columns
corresponding to the columns of y
na
when pre = TRUE this is a vector containing number of
NA entries in each column of y
o
when pre = TRUE this is a vector containing the
index of each column in the sorting of the columns of y
obtained by o <- order(na)
method
method of regression used on each column, or
"complete" indicating that no regression was necessary
ncomp
number of components in a plsr or
pcr regression, or NA if such a method was
not used. This field is used to record lambda
when lm.ridge is used
lambda
if method is one of c("lasso",
"forward.stagewise", "ridge"), then this field records the
lambda penalty parameters used
mu.obs
when obs = TRUE this is the “observed”
mean vector
S.obs
when obs = TRUE this is the “observed”
covariance matrix, as described above. Note that S.obs is
usually not positive definite
Note
The CV in plsr and lars are random in nature, and so
can be dependent on the random seed. Use validation=LOO for
deterministic (but slower) result.
When using method = "factor" in the current version of
the package, the factors in the first p
columns of y must also obey the monotone pattern, and,
have no more NA entries than the other columns of y.
Be warned that the lars implementation of
"forward.stagewise" can sometimes get stuck in
(what seems like) an infinite loop.
This is not a bug in the monomvn package;
the bug has been reported to the authors of lars
Robert B. Gramacy, Joo Hee Lee, and Ricardo Silva (2007).
On estimating covariances between many assets with histories
of highly variable length. Preprint available on arXiv:0710.5837:
http://arxiv.org/abs/0710.5837
Roderick J.A. Little and Donald B. Rubin (2002).
Statistical Analysis with Missing Data, Second Edition.
Wilely.
Bjorn-Helge Mevik and Ron Wehrens (2007).
The pls Package: Principal Component and Partial
Least Squares Regression in R.
Journal of Statistical Software 18(2)