Last data update: 2014.03.03

R: Bivariate kernel density estimates
bivariate.densityR Documentation

Bivariate kernel density estimates

Description

Provides an adaptive or fixed bandwidth kernel density estimate of bivariate data.

Usage

bivariate.density(data, ID = NULL, pilotH, globalH = pilotH,
		  adaptive = TRUE, edgeCorrect = TRUE, res = 50, WIN = NULL,
		  counts = NULL, intensity = FALSE, xrange = NULL,
		  yrange = NULL, trim = 5, gamma = NULL, pdef = NULL, 
		  atExtraCoords = NULL, use.ppp.methods = TRUE, comment = TRUE)

Arguments

data

An object of type data.frame, list, matrix, or ppp giving the observed data from which we wish to calculate the density estimate. Optional ID information (e.g. a dichotomous indicator for cases and controls) may also be provided in these four data structures. See ‘Details’ for further information on how to properly specify each one.

ID

If data is a data structure with a third component/column indicating case (1) or control (0) status, ID must specify which of these groups we wish to estimate a density for. If ID is NULL (default), a density is estimated for all present observations, regardless of any status information.

pilotH

A single numeric, positive ‘smoothing parameter’ or ‘bandwidth’. When adaptive is TRUE (default), this value is taken to be the pilot bandwidth, used to construct the bivariate pilot density required for adaptive smoothing (see ‘Details’). For a fixed bandwidth kernel density estimate, pilotH simply represents the fixed amount of smoothing. Currently, all smoothing is isotropic in nature.

globalH

A single numeric, positive smoothing multiplier referred to as the global bandwidth, used to calculate the adaptive bandwidths (see ‘Details’). When adaptive is TRUE, this defaults to be the same as the pilot bandwidth. Ignored for a fixed density estimate.

adaptive

Boolean. Whether or not to produce an adaptive (variable bandwidth) density estimate, with the alternative being a fixed bandwith density estimate. Defaults to TRUE.

edgeCorrect

Boolean. Whether or not to perform edge-correction on the density estimate according to the methods demonstrated by Diggle (1985) (fixed bandwidth) and Marshall and Hazelton (2010) (adaptive). This can have a noticable effect on computation time in some cases. Defaults to TRUE. When adaptive = TRUE, the fixed-bandwidth pilot density is also edge-corrected according to edgeCorrect.

res

A single, numeric, positive integer indicating the square root of the desired resolution of the evaluation grid. That is, each of the evaluation grid axes will have length res. Currently, only res*res grids are supported. Defaults to 50 for computational reasons.

WIN

A polygonal object of class owin from the package spatstat giving the study region or ‘window’. All functions in the package sparr that require knowledge of the specific study region make use of this class; no other method of defining the study region is currently supported. If no window is supplied (default), the function defines (and returns) it's own rectangular owin based on xrange and yrange. Ignored if data is an object of type ppp.

counts

To perform binned kernel estimation, a numeric, positive, integer vector of giving counts associated with each observed coordinate in data, if data contains unique observations. If NULL (default), the function assumes each coordinate in data corresponds to one observation at that point. Should the data being supplied to bivariate.density contain duplicated coordinates, the function computes the counts vector internally (overriding any supplied value for counts), issues a warning, and continues with binned estimation. Non-integer values are rounded to the nearest integer.

intensity

A boolean value indicating whether or not to return an intensity (interpreted as the the expected number of observations per unit area and integrating to the number of observations in the study region) function, rather than a density (integrating to one). Defaults to FALSE.

xrange

Required only when no study region is supplied (WIN = NULL) and data is not an object of class ppp, and ignored otherwise. A vector of length 2 giving the upper and lower limits of the estimation interval for the x axis, in which case an evenly spaced set of values of length res is generated.

yrange

As above, but for the y axis.

trim

A numeric value (defaulting to 5) that prevents excessively large bandwidths in adaptive smoothing by trimming the originally computed bandwidths h by trim times median(h). A value of NA or a negative numeric value requests no trimming. Ignored when adaptive is FALSE.

gamma

An optional positive numeric value to use in place of gamma for adaptive bandwidth calculation (see ‘Details’). For adaptive relative risk estimation, this value can sensibly be chosen as common for both case and control densities (such as the gamma value from the adaptive density estimate of the ‘pooled’ (full) dataset) - see Davies and Hazelton (2010). If nothing is supplied (default), this value is computed from the data being used to estimate the density in the defined fashion (again, see ‘Details’). Ignored for fixed bandwidth estimation.

pdef

An optional object of class bivden for adaptive density estimation. This object is used as an alternative or ‘external’ way to specify the pilot density for computing the variable bandwidth factors and must have the same grid resolution and coordinates as the estimate currently being constructed. If NULL (default) the pilot density is computed internally using pilotH from above, but if supplied, pilotH need not be given. Bandwidth trimming value is computed based upon the data points making up pdef. Ignored if adaptive = FALSE.

atExtraCoords

It can occasionally be useful to retrieve the values of the estimated density at specific coordinates that are not the specific observations or the exact grid coordinates, for further analysis or plotting. atExtraCoords allows the user to specify an additional object of type data.frame with 2 colums giving the x atExtraCoords[,1] and y atExtraCoords[,2] coordinates at which to calculate and return the estimated density and other statistics (see ‘Value’).

use.ppp.methods

Boolean. Whether or not to switch to using methods defined for objects of class ppp.object from the package spatstat to estimate the density. This approach is much, much faster than forcing bivariate.density to do the explicit calculations (due to implementation of a Fast Fourier Transform; see density.ppp) and is highly recommended for large datasets. To further reduce computation time in the adaptive case when use.ppp.methods = TRUE, the variable edge-correction factors are calculated using the integer percentiles of the varying bandwidths. Defaults to TRUE.

comment

Boolean. Whether or not to print function progress (including starting and ending times) during execution. Defaults to TRUE.

Details

This function calculates an adaptive or fixed bandwidth bivariate kernel density estimate, using the bivariate Gaussian kernel. Abramson's method is used for adaptive smoothing (Abramson, 1982). Suppose our data argumnent is a data.frame or matrix. Then for each observation data[i,1:2] (i = 1, 2, ... n), the bandwidth h[i] is given by

h[i]=globalH / ( w(data[i,1:2]; pilotH)^(1/2)*gamma )

where w is the fixed bandwidth pilot density constructed with bandwidth pilotH and the scaling parameter gamma is the geometric mean of the w^(-1/2) values. A detailed discussion on this construction is given in Silverman (1986).

If the data argument is a data.frame or a matrix, this must have exactly two columns containing the x ([,1]) and y ([,2]) data values, or exactly three columns with the third (rightmost) column giving ID information by way of a numeric, dichotomous indicator. Should data be a list, this must have two vector components of equal length named x and y. The user may specify a third component with the name ID giving the vector of corresponding ID information (must be of equal length to x and y). Alternatively, data may be an object of class ppp (see ppp.object). ID information can be stored in such an object through the argument marks. If data is a ppp object, the value of window of this object overrides the value of the argument WIN above.

Value

An object of class "bivden". This is effectively a list with the following components:

Zm

a numeric matrix giving the value of the estimated (edge-corrected if elected) density at each of the coordinates of the grid. Values corresponding to points on the grid that fall outside the study region WIN are set to NA

X

a the sequence of values that were used as x grid coordinates. Will have length res

Y

a the sequence of values that were used as y grid coordinates. Will have length res

kType

the kernel function used in estimation. Currently fixed at "gaus"

h

a numeric vector with length equal to the number of observations, giving the bandwidths assigned to each observation in the order they appeared in data. For a fixed bandwidth estimate, this will simply be the identical value passed to and returned as pilotH

pilotH

the pilot or fixed bandwidth depending on whether adaptive smoothing is employed or not, respectively

globalH

the global bandwidth globalH if adaptive smoothing is employed, NA for fixed smoothing

hypoH

the matrix of ‘hypothetical’ bandwidths (with element placement corresponding to Zm) for each coordinate of the evaluation grid. That is, these values are the bandwidths at that grid coordinate if, hypothetically, there was an observation there (along with the original data). These are used for edge-correction in adaptive densities (Marshall and Hazelton, 2010). Will be NA for fixed bandwidth estimates

zSpec

a numeric vector with length equal to the number of observations used, giving the values of the density at the specific coordinates of the observations. Order corresponds to the order of the observations in data

zExtra

as zSpec for the observations in atExtraCoords, NA if atExtraCoords is not supplied

WIN

the object of class owin used as the study region

qhz

a numeric matrix of the edge-correction factors for the entire evaluation grid (with placement corresponding to Zm. If edgeCorrect = FALSE, all edge correction factors are set to and returned as 1

qhzSpec

edge-correction factors for the individual observations; order corresponding to data

qhzExtra

as qhzSpec for the observations in atExtraCoords; NA if atExtraCoords is not supplied

pdef

a copy of the object originally supplied as the pilot density as per the pdef argument; NULL if unused

pilotvals

the values of the pilot density used to compute the adaptive bandwidths. Order corresponds to the order of the observations in data. NULL when adaptive = FALSE

gamma

the value of gamma that was passed to the function, or the geometric mean term of the reciprocal of the square root of the pilot density values used to scale the adaptive bandwidths if gamma is not supplied. NULL when adaptive = FALSE

counts

the counts vector used in estimation of the density/intensity. If all values in data were unique and counts = NULL, the returned counts will be a vector of ones equal to the number of coordinates in data

data

a two-column numeric data frame giving the observations in the originally supplied data that were used for the density estimation. If data originally contained duplicated coordinates, the returned data will contain only the unique coordinates, and should be viewed with respect to the returned value of counts

Warning

Explicit calculation of bivariate kernel density estimates is computationally expensive. The decision to produce adaptive over fixed bandwidth estimates, the size of the dataset, the evaluation grid resolution specified by res, the complexity of the study region and electing to edge-correct all have a direct impact upon the time it will take to estimate the density. Keeping use.ppp.methods = TRUE can drastically reduce this computational cost at the expense a degree of accuracy that is generally considered negligible for most practical purposes.

Author(s)

T.M. Davies

References

Abramson, I. (1982). On bandwidth variation in kernel estimates — a square root law, Annals of Statistics, 10(4), 1217-1223.

Davies, T.M. and Hazelton, M.L. (2010), Adaptive kernel estimation of spatial relative risk, Statistics in Medicine, 29(23) 2423-2437.

Davies, T.M., Jones, K. and Hazelton, M.L. (2015), Symmetric adaptive smoothing regimens for estimation of the spatial relative risk function, Submitted for publication.

Diggle, P.J. (1985), A kernel method for smoothing point process data, Journal of the Royal Statistical Society, Series C, 34(2), 138-147.

Marshall, J.C. and Hazelton, M.L. (2010) Boundary kernels for adaptive density estimators on regions with irregular boundaries, Journal of Multivariate Analysis, 101, 949-963.

Silverman, B.W. (1986), Density Estimation for Statistics and Data Analysis, Chapman & Hall, New York.

Examples

##Chorley-Ribble laryngeal cancer data ('spatstat' library)
data(chorley)

ch.lar.density <- bivariate.density(data = chorley, ID = "larynx",
 pilotH = 1.5, adaptive = FALSE)

plot(ch.lar.density, col = "lightblue", phi = 30, theta = -30,
 ticktype = "detailed", main = "chorley.larynx", display = "persp")

## Not run: 

##PBC liver disease data
data(PBC)
pbc.adaptive.density <- bivariate.density(data = PBC, ID = "case",
 pilotH = 350)

#3D plot - may need to adjust size of RGL device. Hold left click
# to pan, hold right to zoom.
plot(pbc.adaptive.density, display = "3d", col = heat.colors(20),
 main = "Density of PBC in north-east England", aspect = 1:2)

## End(Not run)

Results