Maximum likelihood estimation for P-splines density estimation. Histogram binning
produces frequency counts, which are modelled by constrained B-splines in a Poisson regression. A penalty
based on differences in the sequences B-spline coefficients is used to smooth/interpolate the counts.
Iterated weighted least squares (IWLS) for a mixed model representation of the P-splines regression,
conditional on a particular penalty coefficient, is used for estimating the B-spline coefficients.
Leave-one-out cross-validation deviances are available for estimation of the penalty coefficient.
vector of λ's (or scalar) to be considered in profile likelihood. Required.
breaks
histogram breaks (as in hist function)
xrange
vector of minimum and maximum of B-spline (support of density)
nseg
number of segments between knots
degree
degree of B-splines (0 is constant, 1 is linear, etc.)
design.knots
spline knots for splineDesign function
ord
order of difference used in the penalty term
beta
vector of B-spline coefficients (required)
bsplines
matrix of B-splines
nbinwidth
scaling to convert count frequency into proper density
log
logical, if TRUE then log density
pvector
vector of initial values of GPD parameters (sigmau, xi) or NULL
finitelik
logical, should log-likelihood return finite value for invalid parameters
lambda
penalty coefficient
counts
counts from histogram binning
Details
The P-splines density estimator is fitted using maximum likelihood estimation, following
the approach of Eilers and Marx (1996). Histogram binning produces frequency counts, which are
modelled by constrained B-splines in a Poisson regression. A penalty
based on differences in the sequences B-spline coefficients is used to smooth/interpolate the counts.
The B-splines are defined as in Eiler and Marx (1996), so that those are meet the boundary are simply
shifted and truncated version of the internal B-splines. No renormalisation is carried out. They are not
"natural" B-spline which are also commonly in use. Note that atural B-splines can be obtained by suitable
linear combinations of these B-splines. Hence, in practice there is little difference in the fit obtained
from either B-spline definition, even with the penalty constraining the coefficients. If the user desires
they can force the use of natural B-splines, by prior specification of the design.knots
with appropriate replication of the boundaries, see dpsden.
Iterated weighted least squares (IWLS) for a mixed model representation of the P-splines regression,
conditional on a particular penalty coefficient, is used for estimating the B-spline coefficients which
is equivalent to maximum likelihood estimation. Leave-one-out cross-validation deviances are available
for estimation of the penalty coefficient.
The parameter vector is the B-spline coefficients beta, no matter whether the penalty coefficient is
fixed or estimated. The penalty coefficient lambda is treated separately.
The log-likelihood functions lpsden and nlpsden
evaluate the likelihood for the original dataset, using the fitted P-splines density estimator. The
log-likelihood is output as nllh from the fitting function fpsden.
They do not provide the likelihood for the Poisson regression of the histogram counts, which is usually
evaluated using the deviance. The deviance (via CVMSE for Poisson counts) is also output as cvlambda
from the fitting function fpsden.
The iwlspsden function performs the IWLS. The
cvpsden function calculates the leave-one-out cross-validation
sum of the squared errors. They are not designed to be used directly by users. No checks of the
inputs are carried out.
Value
Log-likelihood for original data is given by lpsden and it's
wrappers for negative log-likelihood from nlpsden. Cross-validation
sum of square of errors is provided by cvpsden. Poisson regression
fitting by IWLS is carried out in iwlspsden. Fitting function
fpsden returns a simple list with the
following elements
call:
optim call
x:
data vector x
xrange:
range of support of B-splines
degree:
degree of B-splines
nseg:
number of internal segments
design.knots:
knots used in splineDesign
ord:
order of penalty term
binned:
histogram results
breaks:
histogram breaks
mids:
histogram mid-bins
counts:
histogram counts
nbinwidth:
scaling factor to convert counts to density
bsplines:
B-splines matrix used for binned counts
databsplines:
B-splines matrix used for data
counts:
histogram counts
lambdaseq:
λ vector for profile likelihood or scalar for fixed λ
cvlambda:
CV MSE for each λ
mle and beta:
vector of MLE of coefficients
nllh:
negative log-likelihood for original data
n:
total original sample size
lambda:
Estimated or fixed λ
Acknowledgments
The Poisson regression and leave-one-out cross-validation functions
are based on the code of Eilers and Marx (1996) available from Brian Marx's website
http://www.stat.lsu.edu/faculty/marx, which is gratefully acknowledged.
Note
The data are both vectors. Infinite and missing sample values are dropped.
No initial values for the coefficients are needed.
It is advised to specify the range of support xrange, using finite end-points. This is
especially important when the support is bounded. By default xrange is simply the range of the
input data range(x).
Further, it is advised to always set the histogram bin breaks, expecially if the support is bounded.
By default 10*ln(n) equi-spaced bins are defined between xrange.