R Graphical Manual

Browse All

Last data update: 2014.03.03

R: Bivariate kernel density estimates

bivariate.density

R Documentation

Bivariate kernel density estimates

Description

Provides an adaptive or fixed bandwidth kernel density estimate of bivariate data.

Usage

bivariate.density(data, ID = NULL, pilotH, globalH = pilotH,
		  adaptive = TRUE, edgeCorrect = TRUE, res = 50, WIN = NULL,
		  counts = NULL, intensity = FALSE, xrange = NULL,
		  yrange = NULL, trim = 5, gamma = NULL, pdef = NULL, 
		  atExtraCoords = NULL, use.ppp.methods = TRUE, comment = TRUE)

Arguments

`data`	An object of type `data.frame`, `list`, `matrix`, or `ppp` giving the observed data from which we wish to calculate the density estimate. Optional ID information (e.g. a dichotomous indicator for cases and controls) may also be provided in these four data structures. See ‘Details’ for further information on how to properly specify each one.
`ID`	If `data` is a data structure with a third component/column indicating case (1) or control (0) status, `ID` must specify which of these groups we wish to estimate a density for. If `ID` is `NULL` (default), a density is estimated for all present observations, regardless of any status information.
`pilotH`	A single numeric, positive ‘smoothing parameter’ or ‘bandwidth’. When `adaptive` is `TRUE` (default), this value is taken to be the pilot bandwidth, used to construct the bivariate pilot density required for adaptive smoothing (see ‘Details’). For a fixed bandwidth kernel density estimate, `pilotH` simply represents the fixed amount of smoothing. Currently, all smoothing is isotropic in nature.
`globalH`	A single numeric, positive smoothing multiplier referred to as the global bandwidth, used to calculate the adaptive bandwidths (see ‘Details’). When `adaptive` is `TRUE`, this defaults to be the same as the pilot bandwidth. Ignored for a fixed density estimate.
`adaptive`	Boolean. Whether or not to produce an adaptive (variable bandwidth) density estimate, with the alternative being a fixed bandwith density estimate. Defaults to `TRUE`.
`edgeCorrect`	Boolean. Whether or not to perform edge-correction on the density estimate according to the methods demonstrated by Diggle (1985) (fixed bandwidth) and Marshall and Hazelton (2010) (adaptive). This can have a noticable effect on computation time in some cases. Defaults to `TRUE`. When `adaptive = TRUE`, the fixed-bandwidth pilot density is also edge-corrected according to `edgeCorrect`.
`res`	A single, numeric, positive integer indicating the square root of the desired resolution of the evaluation grid. That is, each of the evaluation grid axes will have length `res`. Currently, only `res*res` grids are supported. Defaults to 50 for computational reasons.
`WIN`	A polygonal object of class `owin` from the package `spatstat` giving the study region or ‘window’. All functions in the package `sparr` that require knowledge of the specific study region make use of this class; no other method of defining the study region is currently supported. If no window is supplied (default), the function defines (and returns) it's own rectangular `owin` based on `xrange` and `yrange`. Ignored if `data` is an object of type `ppp`.
`counts`	To perform binned kernel estimation, a numeric, positive, integer vector of giving counts associated with each observed coordinate in `data`, if `data` contains unique observations. If `NULL` (default), the function assumes each coordinate in `data` corresponds to one observation at that point. Should the `data` being supplied to `bivariate.density` contain duplicated coordinates, the function computes the counts vector internally (overriding any supplied value for `counts`), issues a warning, and continues with binned estimation. Non-integer values are rounded to the nearest integer.
`intensity`	A boolean value indicating whether or not to return an intensity (interpreted as the the expected number of observations per unit area and integrating to the number of observations in the study region) function, rather than a density (integrating to one). Defaults to `FALSE`.
`xrange`	Required only when no study region is supplied (`WIN = NULL`) and `data` is not an object of class `ppp`, and ignored otherwise. A vector of length 2 giving the upper and lower limits of the estimation interval for the x axis, in which case an evenly spaced set of values of length `res` is generated.
`yrange`	As above, but for the y axis.
`trim`	A numeric value (defaulting to `5`) that prevents excessively large bandwidths in adaptive smoothing by trimming the originally computed bandwidths h by `trim` times `median(`h`)`. A value of `NA` or a negative numeric value requests no trimming. Ignored when `adaptive` is `FALSE`.
`gamma`	An optional positive numeric value to use in place of `gamma` for adaptive bandwidth calculation (see ‘Details’). For adaptive relative risk estimation, this value can sensibly be chosen as common for both case and control densities (such as the `gamma` value from the adaptive density estimate of the ‘pooled’ (full) dataset) - see Davies and Hazelton (2010). If nothing is supplied (default), this value is computed from the data being used to estimate the density in the defined fashion (again, see ‘Details’). Ignored for fixed bandwidth estimation.
`pdef`	An optional object of class `bivden` for adaptive density estimation. This object is used as an alternative or ‘external’ way to specify the pilot density for computing the variable bandwidth factors and must have the same grid resolution and coordinates as the estimate currently being constructed. If `NULL` (default) the pilot density is computed internally using `pilotH` from above, but if supplied, `pilotH` need not be given. Bandwidth trimming value is computed based upon the data points making up `pdef`. Ignored if `adaptive = FALSE`.
`atExtraCoords`	It can occasionally be useful to retrieve the values of the estimated density at specific coordinates that are not the specific observations or the exact grid coordinates, for further analysis or plotting. `atExtraCoords` allows the user to specify an additional object of type `data.frame` with 2 colums giving the x `atExtraCoords[,1]` and y `atExtraCoords[,2]` coordinates at which to calculate and return the estimated density and other statistics (see ‘Value’).
`use.ppp.methods`	Boolean. Whether or not to switch to using methods defined for objects of class `ppp.object` from the package `spatstat` to estimate the density. This approach is much, much faster than forcing `bivariate.density` to do the explicit calculations (due to implementation of a Fast Fourier Transform; see `density.ppp`) and is highly recommended for large datasets. To further reduce computation time in the adaptive case when `use.ppp.methods = TRUE`, the variable edge-correction factors are calculated using the integer percentiles of the varying bandwidths. Defaults to `TRUE`.
`comment`	Boolean. Whether or not to print function progress (including starting and ending times) during execution. Defaults to `TRUE`.

Details

This function calculates an adaptive or fixed bandwidth bivariate kernel density estimate, using the bivariate Gaussian kernel. Abramson's method is used for adaptive smoothing (Abramson, 1982). Suppose our data argumnent is a data.frame or matrix. Then for each observation data[i,1:2] (i = 1, 2, ... n), the bandwidth h[i] is given by

h[i]=globalH / ( w(data[i,1:2]; pilotH)^(1/2)*gamma )

where w is the fixed bandwidth pilot density constructed with bandwidth pilotH and the scaling parameter gamma is the geometric mean of the w^(-1/2) values. A detailed discussion on this construction is given in Silverman (1986).

If the data argument is a data.frame or a matrix, this must have exactly two columns containing the x ([,1]) and y ([,2]) data values, or exactly three columns with the third (rightmost) column giving ID information by way of a numeric, dichotomous indicator. Should data be a list, this must have two vector components of equal length named x and y. The user may specify a third component with the name ID giving the vector of corresponding ID information (must be of equal length to x and y). Alternatively, data may be an object of class ppp (see ppp.object). ID information can be stored in such an object through the argument marks. If data is a ppp object, the value of window of this object overrides the value of the argument WIN above.

Value

An object of class "bivden". This is effectively a list with the following components:

`Zm`	a numeric matrix giving the value of the estimated (edge-corrected if elected) density at each of the coordinates of the grid. Values corresponding to points on the grid that fall outside the study region `WIN` are set to `NA`
`X`	a the sequence of values that were used as x grid coordinates. Will have length `res`
`Y`	a the sequence of values that were used as y grid coordinates. Will have length `res`
`kType`	the kernel function used in estimation. Currently fixed at `"gaus"`
`h`	a numeric vector with length equal to the number of observations, giving the bandwidths assigned to each observation in the order they appeared in `data`. For a fixed bandwidth estimate, this will simply be the identical value passed to and returned as `pilotH`
`pilotH`	the pilot or fixed bandwidth depending on whether adaptive smoothing is employed or not, respectively
`globalH`	the global bandwidth `globalH` if adaptive smoothing is employed, `NA` for fixed smoothing
`hypoH`	the matrix of ‘hypothetical’ bandwidths (with element placement corresponding to `Zm`) for each coordinate of the evaluation grid. That is, these values are the bandwidths at that grid coordinate if, hypothetically, there was an observation there (along with the original data). These are used for edge-correction in adaptive densities (Marshall and Hazelton, 2010). Will be `NA` for fixed bandwidth estimates
`zSpec`	a numeric vector with length equal to the number of observations used, giving the values of the density at the specific coordinates of the observations. Order corresponds to the order of the observations in `data`
`zExtra`	as `zSpec` for the observations in `atExtraCoords`, `NA` if `atExtraCoords` is not supplied
`WIN`	the object of class `owin` used as the study region
`qhz`	a numeric matrix of the edge-correction factors for the entire evaluation grid (with placement corresponding to `Zm`. If `edgeCorrect = FALSE`, all edge correction factors are set to and returned as `1`
`qhzSpec`	edge-correction factors for the individual observations; order corresponding to `data`
`qhzExtra`	as `qhzSpec` for the observations in `atExtraCoords`; `NA` if `atExtraCoords` is not supplied
`pdef`	a copy of the object originally supplied as the pilot density as per the `pdef` argument; `NULL` if unused
`pilotvals`	the values of the pilot density used to compute the adaptive bandwidths. Order corresponds to the order of the observations in `data`. `NULL` when `adaptive = FALSE`
`gamma`	the value of `gamma` that was passed to the function, or the geometric mean term of the reciprocal of the square root of the pilot density values used to scale the adaptive bandwidths if `gamma` is not supplied. `NULL` when `adaptive = FALSE`
`counts`	the counts vector used in estimation of the density/intensity. If all values in `data` were unique and `counts = NULL`, the returned `counts` will be a vector of ones equal to the number of coordinates in `data`
`data`	a two-column numeric data frame giving the observations in the originally supplied `data` that were used for the density estimation. If `data` originally contained duplicated coordinates, the returned `data` will contain only the unique coordinates, and should be viewed with respect to the returned value of `counts`

Warning

Explicit calculation of bivariate kernel density estimates is computationally expensive. The decision to produce adaptive over fixed bandwidth estimates, the size of the dataset, the evaluation grid resolution specified by res, the complexity of the study region and electing to edge-correct all have a direct impact upon the time it will take to estimate the density. Keeping use.ppp.methods = TRUE can drastically reduce this computational cost at the expense a degree of accuracy that is generally considered negligible for most practical purposes.

Author(s)

T.M. Davies

References

Abramson, I. (1982). On bandwidth variation in kernel estimates — a square root law, Annals of Statistics, 10(4), 1217-1223.

Davies, T.M. and Hazelton, M.L. (2010), Adaptive kernel estimation of spatial relative risk, Statistics in Medicine, 29(23) 2423-2437.

Davies, T.M., Jones, K. and Hazelton, M.L. (2015), Symmetric adaptive smoothing regimens for estimation of the spatial relative risk function, Submitted for publication.

Diggle, P.J. (1985), A kernel method for smoothing point process data, Journal of the Royal Statistical Society, Series C, 34(2), 138-147.

Marshall, J.C. and Hazelton, M.L. (2010) Boundary kernels for adaptive density estimators on regions with irregular boundaries, Journal of Multivariate Analysis, 101, 949-963.

Silverman, B.W. (1986), Density Estimation for Statistics and Data Analysis, Chapman & Hall, New York.

Examples

##Chorley-Ribble laryngeal cancer data ('spatstat' library)
data(chorley)

ch.lar.density <- bivariate.density(data = chorley, ID = "larynx",
 pilotH = 1.5, adaptive = FALSE)

plot(ch.lar.density, col = "lightblue", phi = 30, theta = -30,
 ticktype = "detailed", main = "chorley.larynx", display = "persp")

## Not run: 

##PBC liver disease data
data(PBC)
pbc.adaptive.density <- bivariate.density(data = PBC, ID = "case",
 pilotH = 350)

#3D plot - may need to adjust size of RGL device. Hold left click
# to pan, hold right to zoom.
plot(pbc.adaptive.density, display = "3d", col = heat.colors(20),
 main = "Density of PBC in north-east England", aspect = 1:2)

## End(Not run)