R: A function to fit finite mixtures using the gamlss family of...
gamlssNP
R Documentation
A function to fit finite mixtures using the gamlss family of distributions
Description
This function will fit a finite (or normal) mixture distribution where the kernel distribution can belong to
any gamlss family of distributions using the EM algorithm.
The function is based on functions alldist() and allvc of the npmlreg package of
Jochen Einbeck, John Hinde and Ross Darnell.
Usage
gamlssNP(formula, random = ~1, family = NO(), data = NULL, K = 4,
mixture = c("np", "gq"),
tol = 0.5, weights, pluginz, control = NP.control(...),
g.control = gamlss.control(trace = FALSE, ...), ...)
Arguments
formula
a formula defining the response and the fixed effects for the mu parameters
random
a formula defining the random part of the model
family
a gamlss family object
data
the data frame which for this function is mandatory even if it the data are attached
K
the number of mass points/integretion points (supported values are 1:10,20)
mixture
the mixing distribution, "np" for non-parametric or "gq" for Gaussian Quadrature
tol
the toletance scalar ussualy between zero and one
weights
prior weights
pluginz
optional
control
this sets the control parameters for the EM iterations algorithm.
The default setting is the NP.control function
g.control
the gamlss control function, gamlss.control, passed to the gamlss fit
...
for extra arguments
Details
The function gamlssNP() is a modification of the R functions
alldist() and allvc created by Jochen Einbeck and John Hinde.
Both functions were originally created by Ross Darnell (2002). Here the two
functions are merged to one gamlssNP and allows finite mixture from
gamlss family of distributions.
The following are comments from the original Einbeck and Hinde
documentation.
"The nonparametric maximum likelihood (NPML) approach was
introduced in Aitkin (1996) as a tool to fit overdispersed
generalized linear models. Aitkin (1999) extended this method to
generalized linear models with shared random effects arising through
variance component or repeated measures structure. Applications are
two-stage sample designs, when firstly the primary sampling units
(the upper-level units, e.g. classes) and then the secondary
sampling units (lower-level units, e.g. students) are selected, or
longitudinal data.
Models of this type have also been referred to as
multi-level models (Goldstein, 2003). This R function is restricted
to 2-level models. The idea of NPML is to approximate the unknown
and unspecified distribution of the random effect by a discrete
mixture of k exponential family densities, leading to a simple
expression of the marginal likelihood, which can then be maximized
using a standard EM algorithm. When option 'gq' is set, then
Gauss-Hermite masses and mass points are used and considered as
fixed, otherwise they serve as starting points for the EM algorithm.
The position of the starting points can be concentrated or extended
by setting tol smaller or larger than one, respectively. Variance
component models with random coefficients (Aitkin, Hinde & Francis,
2005, p. 491) are also possible, in this case the option
random.distribution is restricted to the setting 'np' . The weights
have to be understood as frequency weights, i.e. setting all weights
equal to 2 will duplicate each data point and hence double the
disparity and deviance. Warning: There might be some options and
circumstances which had not been tested and where the weights do not
work."
Note that in keeping with the gamlss notation disparity is called global deviance.
Value
The function gamlssNP produces an object of class "gamlssNP".
This object contain several components.
family
the name of the gamlss family
type
the type of distribution which in this case is "Mixture"
parameters
the parameters for the kernel gamlss family distribution
call
the call of the gamlssNP function
y
the response variable
bd
the binomial demominator, only for BI and BB models
control
the NP.control settings
weights
the vector of weights of te expanded fit
G.deviance
the global deviance
N
the number of observations in the fit
rqres
a function to calculate the normalized (randomized) quantile
residuals of the object (here is the gamlss object rather than gamlssNP and it should change??)
iter
the number of external iterations in the last gamlss fitting (?? do we need this?)
type
the type of the distribution or the response variable here set to "Mixture"
method
which algorithm is used for the gamlss fit, RS(), CG() or mixed()
contrasts
the type of contrasts use in the fit
converged
whether the gamlss fit has converged
residuals
the normalized (randomized) quantile residuals of the model
mu.fv
the fitted values of the extended mu model, also sigma.fv, nu.fv,
tau.fv for the other parameters if present
mu.lp
the linear predictor of the extended mu model, also sigma.lp, nu.lp,
tau.lp for the other parameters if present
mu.wv
the working variable of the extended mu model, also sigma.wv, nu.wv,
tau.wv for the other parameters if present
mu.wt
the working weights of the mu model, also sigma.wt, nu.wt,
tau.wt for the other parameters if present
mu.link
the link function for the mu model, also sigma.link,
nu.link, tau.link for the other parameters if present
mu.terms
the terms for the mu model, also sigma.terms, nu.terms,
tau.terms for the other parameters if present
mu.x
the design matrix for the mu, also sigma.x, nu.x, tau.x for
the other parameters if present
mu.qr
the QR decomposition of the mu model, also sigma.qr, nu.qr,
tau.qr for the other parameters if present
mu.coefficients
the linear coefficients of the mu model, also
sigma.coefficients, nu.coefficients, tau.coefficients for the
other parameters if present
mu.formula
the formula for the mu model, also sigma.formula,
nu.formula, tau.formula for the other parameters if present
mu.df
the mu degrees of freedom also sigma.df, nu.df, tau.df for
the other parameters if present
mu.nl.df
the non linear degrees of freedom, also sigma.nl.df,
nu.nl.df, tau.nl.df for the other parameters if present
df.fit
the total degrees of freedom use by the model
df.residual
the residual degrees of freedom left after the model is
fitted
data
the original data set
EMiter
the number of EM iterations
EMconverged
whether the EM has converged
allresiduals
the residuas for the long fit
mass.points
the estimates mass point (if "np" mixture is used)
K
the number of mass points used
post.prob
contains a matrix of posteriori probabilities,
prob
the estimated mixture probalilities
aic
the Akaike information criterion
sbc
the Bayesian information criterion
formula
the formula used in the expanded fit
random
the random effect formula
pweights
prior weights
ebp
the Empirical Bayes Predictions (Aitkin, 1996b) on the scale of the
linear predictor
Note that in case of Gaussian quadrature, the coefficient given at 'z' in coefficients corresponds
to the standard deviation of the mixing distribution.
As a by-product, gamlssNP produces a plot showing the global deviance against the iteration number.
Further, a plot with the EM trajectories is given.
The x-axis corresponds to the iteration number, and the y-axis to the value of the mass points at a particular iteration.
This plot is not produced when mixture is set to "gq"
Author(s)
Mikis Stasinopoulos based on function created by Jochen Einbeck John Hinde and Ross Darnell
References
Aitkin, M. and Francis, B. (1995). Fitting overdispersed generalized linear models by
nonparametric maximum likelihood. GLIM Newsletter 25 , 37-45.
Aitkin, M. (1996a). A general maximum likelihood analysis of overdispersion in generalized linear models.
Statistics and Computing 6 , 251-262.
Aitkin, M. (1996b). Empirical Bayes shrinkage using posterior random effect means from nonparametric maximum
likelihood estimation in general random effect models. Statistical Modelling:
Proceedings of the 11th IWSM 1996 , 87-94.
Aitkin, M., Francis, B. and Hinde, J. (2005) Statistical Modelling in GLIM 4. Second Edition,
Oxford Statistical Science Series, Oxford, UK.
Einbeck, J. & Hinde, J. (2005). A note on NPML estimation for exponential family regression models
with unspecified dispersion parameter. Technical Report IRL-GLWY-2005-04, National University of Ireland, Galway.
Einbeck, J. Darnell R. and Hinde J. (2006) npmlreg: Nonparametric maximum likelihood estimation for random effect
models, R package version 0.34
Hinde, J. (1982). Compound Poisson regression models. Lecture Notes in Statistics 14 ,109-121.
Rigby, R. A. and Stasinopoulos D. M. (2005). Generalized additive models for location, scale and shape,
(with discussion), Appl. Statist., 54, part 3, pp 507-554.
Stasinopoulos D. M., Rigby R.A. and Akantziliotou C. (2003) Instructions on how to use the GAMLSS package in R.
Accompanying documentation in the current GAMLSS help files, (see also http://www.gamlss.org/).