npksum computes kernel sums on evaluation
data, given a set of training data, data to be weighted (optional), and a
bandwidth specification (any bandwidth object).
a symbolic description of variables on which the sum is
to be performed. The details of constructing a formula are
described below.
data
an optional data frame, list or environment (or object
coercible to a data frame by as.data.frame) containing the variables
in the model. If not found in data, the variables are taken from
environment(formula), typically the environment from which the
function is called.
newdata
An optional data frame in which to look for evaluation data. If
omitted, data is used.
subset
an optional vector specifying a subset of observations to be used.
na.action
a function which indicates what should happen when the data contain
NAs. The default is set by the na.action setting of options, and is
na.fail if that is unset. The (recommended) default is
na.omit.
...
additional arguments supplied to specify the parameters to the
default S3 method, which is called during estimation.
txdat
a p-variate data frame of sample realizations (training data) used to
compute the sum.
tydat
a numeric vector of data to be weighted. The ith kernel weight
is applied to the ith element. Defaults to 1.
exdat
a p-variate data frame of sum evaluation points (if omitted,
defaults to the training data itself).
bws
a bandwidth specification. This can be set as any suitable bandwidth
object returned from a bandwidth-generating function, or a numeric vector.
weights
a n by q matrix of weights which can optionally be
applied to tydat in the sum. See details.
leave.one.out
a logical value to specify whether or not to compute the leave one
out sums. Will not work if exdat is specified. Defaults to
FALSE.
kernel.pow
an integer specifying the power to which the kernels will be raised
in the sum. Defaults to 1.
bandwidth.divide
a logical specifying whether or not to divide continuous kernel
weights by their bandwidths. Use this with nearest-neighbor
methods. Defaults to FALSE.
operator
a string specifying whether the normal, convolution,
derivative, or integral kernels are to be
used. Operators scale results by factors of h or 1/h where
appropriate. Defaults to normal and applies to all elements in a
multivariate object. See details.
permutation.operator
a string which can have a value of none, normal,
derivative, or integral. If set to something other
than none (the default), then a separate result will be returned for each
term in the product kernel, with the operator applied to that term.
Permutation operators scale results by factors of h or 1/h where
appropriate. This is useful for computing gradients, for example.
compute.score
a logical specifying whether or not to return the score
(the ‘grad h’ terms) for each dimension in addition to the kernel
sum. Cannot be TRUE if a permutation operator other than
"none" is selected.
compute.ocg
a logical specifying whether or not to return a separate result for
each unordered and ordered dimension, where the product kernel term
for that dimension is evaluated at an appropriate reference
category. This is used primarily in np to compute ordered and
unordered categorical gradients. See details.
return.kernel.weights
a logical specifying whether or not to return the matrix of
generalized product kernel weights. Defaults to FALSE. See
details.
Details
npksum
exists so that you can create your own kernel objects with
or without a variable to be weighted (default Y=1). With the options
available, you could create new nonparametric tests or even new kernel
estimators. The convolution kernel option would allow you to create,
say, the least squares cross-validation function for kernel density
estimation.
npksum uses highly-optimized C code that strives to minimize
its ‘memory footprint’, while there is low overhead involved
when using repeated calls to this function (see, by way of
illustration, the example below that conducts leave-one-out
cross-validation for a local constant regression estimator via calls
to the R function nlm, and compares this to the
npregbw function).
npksum implements a variety of methods for computing
multivariate kernel sums (p-variate) defined over a set of
possibly continuous and/or discrete (unordered, ordered) data. The
approach is based on Li and Racine (2003) who employ
‘generalized product kernels’ that admit a mix of continuous
and discrete data types.
Three classes of kernel estimators for the continuous data types are
available: fixed, adaptive nearest-neighbor, and generalized
nearest-neighbor. Adaptive nearest-neighbor bandwidths change with
each sample realization in the set, x[i], when estimating
the kernel sum at the point x. Generalized nearest-neighbor
bandwidths change with the point at which the sum is computed,
x. Fixed bandwidths are constant over the support of x.
npksum computes sum(t(W[j]) %*% Y[j] * K(X[j])), where W[j]
represents a row vector extracted from W. That is, it computes
the kernel weighted sum of the outer product of the rows of W
and Y. In the examples, the uses of such sums are illustrated.
npksum may be invoked either with a formula-like
symbolic
description of variables on which the sum is to be
performed or through a simpler interface whereby data is passed
directly to the function via the txdat and tydat
parameters. Use of these two interfaces is mutually exclusive.
Data contained in the data frame txdat (and also exdat)
may be a mix of continuous (default), unordered discrete (to be
specified in the data frame txdat using the
factor command), and ordered discrete (to be specified
in the data frame txdat using the ordered
command). Data can be entered in an arbitrary order and data types
will be detected automatically by the routine (see np
for details).
Data for which bandwidths are to be estimated may be specified
symbolically. A typical description has the form dependent data
~ explanatory data, where dependent data and explanatory
data are both series of variables specified by name, separated by the
separation character '+'. For example, y1 ~ x1 + x2 specifies
that y1 is to be kernel-weighted by x1 and x2
throughout the sum. See below for further examples.
A variety of kernels may be specified by the user. Kernels implemented
for continuous data types include the second, fourth, sixth, and
eighth order Gaussian and Epanechnikov kernels, and the uniform
kernel. Unordered discrete data types use a variation on Aitchison and
Aitken's (1976) kernel, while ordered data types use a variation of
the Wang and van Ryzin (1981) kernel (see np for
details).
The option operator= can be used to ‘mix and match’
operator strings to create a ‘hybrid’ kernel provided they
match the dimension of the data. For example, for a two-dimensional
data frame of numeric datatypes,
operator=c("normal","derivative") will use the normal
(i.e. PDF) kernel for variable one and the derivative of the PDF
kernel for variable two. Please note that applying operators will scale the
results by factors of h or 1/h where appropriate.
The option permutation.operator= can be used to ‘mix and match’
operator strings to create a ‘hybrid’ kernel, in addition to
the kernel sum with no operators applied, one for each continuous
dimension in the data. For example, for a two-dimensional
data frame of numeric datatypes,
permutation.operator=c("derivative") will return the usual
kernel sum as if operator = c("normal","normal") in the
ksum member, and in the p.ksum member, it will return
kernel sums for operator = c("derivative","normal"), and
operator = c("normal","derivative"). This makes the computation
of gradients much easier.
The option compute.score= can be used to compute the gradients
with respect to h in addition to the normal kernel sum. Like
permutations, the additional results are returned in the
p.ksum. This option does not work in conjunction with
permutation.operator.
The option compute.ocg= works much like permutation.operator,
but for discrete variables. The kernel is evaluated at a reference
category in each dimension: for ordered data, the next lowest category
is selected, except in the case of the lowest category, where the
second lowest category is selected; for unordered data, the first
category is selected. These additional data are returned in the
p.ksum member. This option can be set simultaneously with
permutation.operator.
The option return.kernel.weights=TRUE returns a matrix of
dimension ‘number of training observations’ by ‘number
of evaluation observations’ and contains only the generalized product
kernel weights ignoring all other objects and options that may be
provided to npksum (e.g. bandwidth.divide=TRUE will be
ignored, etc.). Summing the columns of the weight matrix and dividing
by ‘number of training observations’ times the product of the
bandwidths (i.e. colMeans(foo$kw)/prod(h)) would produce
the kernel estimator of a (multivariate) density
(operator="normal") or multivariate cumulative distribution
(operator="integral").
Value
npksum returns a npkernelsum object
with the following components:
eval
the evaluation points
ksum
the sum at the evaluation points
kw
the kernel weights (when return.kernel.weights=TRUE
is specified)
Usage Issues
If you are using data of mixed types, then it is advisable to use the
data.frame function to construct your input data and not
cbind, since cbind will typically not work as
intended on mixed data types and will coerce the data to the same
type.
Aitchison, J. and C.G.G. Aitken (1976), “ Multivariate binary
discrimination by the kernel method,” Biometrika, 63, 413-420.
Li, Q. and J.S. Racine (2007), Nonparametric Econometrics: Theory
and Practice, Princeton University Press.
Li, Q. and J.S. Racine (2003), “Nonparametric estimation of
distributions with categorical and continuous data,” Journal
of Multivariate Analysis, 86, 266-292.
Pagan, A. and A. Ullah (1999), Nonparametric Econometrics, Cambridge
University Press.
Scott, D.W. (1992), Multivariate Density Estimation. Theory,
Practice and Visualization, New York: Wiley.
Silverman, B.W. (1986), Density Estimation, London: Chapman and
Hall.
Wang, M.C. and J. van Ryzin (1981), “A class of smooth
estimators for discrete distributions,” Biometrika, 68, 301-309.
Examples
## Not run:
# EXAMPLE 1: For this example, we generate 100,000 observations from a
# N(0, 1) distribution, then evaluate the kernel density on a grid of 50
# equally spaced points using the npksum() function, then compare
# results with the (identical) npudens() function output
n <- 100000
x <- rnorm(n)
x.eval <- seq(-4, 4, length=50)
# Compute the bandwidth with the normal-reference rule-of-thumb
bw <- npudensbw(dat=x, bwmethod="normal-reference")
# Compute the univariate kernel density estimate using the 100,000
# training data points evaluated on the 50 evaluation data points,
# i.e., 1/nh times the sum of the kernel function evaluated at each of
# the 50 points
den.ksum <- npksum(txdat=x, exdat=x.eval, bws=bw$bw)$ksum/(n*bw$bw[1])
# Note that, alternatively (easier perhaps), you could use the
# bandwidth.divide=TRUE argument and drop the *bw$bw[1] term in the
# denominator, as in
# npksum(txdat=x, exdat=x.eval, bws=bw$bw, bandwidth.divide=TRUE)$ksum/n
# Compute the density directly with the npudens() function...
den <- fitted(npudens(tdat=x, edat=x.eval, bws=bw$bw))
# Plot the true DGP, the npksum()/(nh) estimate and (identical)
# npudens() estimate
plot(x.eval, dnorm(x.eval), xlab="X", ylab="Density", type="l")
points(x.eval, den.ksum, col="blue")
points(x.eval, den, col="red", cex=0.2)
legend(1, .4,
c("DGP", "npksum()",
"npudens()"),
col=c("black", "blue", "red"),
lty=c(1, 1, 1))
# Sleep for 5 seconds so that we can examine the output...
Sys.sleep(5)
# EXAMPLE 2: For this example, we first obtain residuals from a
# parametric regression model, then compute a vector of leave-one-out
# kernel weighted sums of squared residuals where the kernel function is
# raised to the power 2. Note that this returns a vector of kernel
# weighted sums, one for each element of the error vector. Note also
# that you can specify the bandwidth type, kernel function, kernel order
# etc.
data("cps71")
attach(cps71)
error <- residuals(lm(logwage~age+I(age^2)))
bw <- npreg(error~age)
ksum <- npksum(txdat=age,
tydat=error^2,
bws=bw$bw,
leave.one.out=TRUE,
kernel.pow=2)
ksum
# Obviously, if we wanted the sum of these weighted kernel sums then,
# trivially,
sum(ksum$ksum)
detach(cps71)
# Sleep for 5 seconds so that we can examine the output...
Sys.sleep(5)
# Note that weighted leave-one-out sums of squared residuals are used to
# construct consistent model specification tests. In fact, the
# npcmstest() routine in this package is constructed purely from calls
# to npksum(). You can type npcmstest to see the npcmstest()
# code and also examine some examples of the many uses of
# npksum().
# EXAMPLE 3: For this example, we conduct local-constant (i.e.,
# Nadaraya-Watson) kernel regression. We shall use cross-validated
# bandwidths using npregbw() by way of example. Note we extract
# the kernel sum from npksum() via the `$ksum' argument in both
# the numerator and denominator.
data("cps71")
attach(cps71)
bw <- npregbw(xdat=age, ydat=logwage)
fit.lc <- npksum(txdat=age, tydat=logwage, bws=bw$bw)$ksum/
npksum(txdat=age, bws=bw$bw)$ksum
plot(age, logwage, xlab="Age", ylab="log(wage)")
lines(age, fit.lc)
# Sleep for 5 seconds so that we can examine the output...
Sys.sleep(5)
# If you wished to compute the kernel sum for a set of evaluation points,
# you first generate the set of points then feed this to npksum,
# e.g., for the set (20, 30, ..., 60) we would use
age.seq <- seq(20, 60, 10)
fit.lc <- npksum(txdat=age, exdat=age.seq, tydat=logwage, bws=bw$bw)$ksum/
npksum(txdat=age, exdat=age.seq, bws=bw$bw)$ksum
# Note that now fit.lc contains the 5 values of the local constant
# estimator corresponding to age.seq...
fit.lc
detach(cps71)
# EXAMPLE 4: For this example, we conduct least-squares cross-validation
# for the local-constant regression estimator. We first write an R
# function `ss' that computes the leave-one-out sum of squares using the
# npksum() function, and then feed this function, along with
# random starting values for the bandwidth vector, to the nlm() routine
# in R (nlm = Non-Linear Minimization). Finally, we compare results with
# the function npregbw() that is written solely in C and calls
# a tightly coupled C-level search routine. Note that one could make
# repeated calls to nlm() using different starting values for h (highly
# recommended in general).
# Increase the number of digits printed out by default in R and avoid
# using scientific notation for this example (we wish to compare
# objective function minima)
options(scipen=100, digits=12)
# Generate 100 observations from a simple DGP where one explanatory
# variable is irrelevant.
n <- 100
x1 <- runif(n)
x2 <- rnorm(n)
x3 <- runif(n)
txdat <- data.frame(x1, x2, x3)
# Note - x3 is irrelevant
tydat <- x1 + sin(x2) + rnorm(n)
# Write an R function that returns the average leave-one-out sum of
# squared residuals for the local constant estimator based upon
# npksum(). This function accepts one argument and presumes that
# txdat and tydat have been defined already.
ss <- function(h) {
# Test for valid (non-negative) bandwidths - return infinite penalty
# when this occurs
if(min(h)<=0) {
return(.Machine$double.xmax)
} else {
mean <- npksum(txdat,
tydat,
leave.one.out=TRUE,
bandwidth.divide=TRUE,
bws=h)$ksum/
npksum(txdat,
leave.one.out=TRUE,
bandwidth.divide=TRUE,
bws=h)$ksum
return(sum((tydat-mean)^2)/length(tydat))
}
}
# Now pass this function to R's nlm() routine along with random starting
# values and place results in `nlm.return'.
nlm.return <- nlm(ss, runif(length(txdat)))
bw <- npregbw(xdat=txdat, ydat=tydat)
# Bandwidths from nlm()
nlm.return$estimate
# Bandwidths from npregbw()
bw$bw
# Function value (minimum) from nlm()
nlm.return$minimum
# Function value (minimum) from npregbw()
bw$fval
# Sleep for 5 seconds so that we can examine the output...
Sys.sleep(5)
# EXAMPLE 5: For this example, we use npksum() to plot the kernel
# function itself. Our `training data' is the singleton, 0, and our
# evaluation data a grid in [-4,4], while we use a bandwidth of 1. By
# way of example we plot a sixth order Gaussian kernel (note that this
# kernel function can assume negative values)
x <- 0
x.eval <- seq(-4,4,length=500)
kernel <- npksum(txdat=x,exdat=x.eval,bws=1,
bandwidth.divide=TRUE,
ckertype="gaussian",
ckerorder=6)$ksum
plot(x.eval,kernel,type="l",xlab="X",ylab="Kernel Function")
abline(0,0)
## End(Not run)