Last data update: 2014.03.03

R: Semi-Parametric Regression
semip    R Documentation

Semi-Parametric Regression

Description

Estimates a semi-parametric model with the form y = X β + f(z) + u, where f(z) is either fully nonparametric with f(z) = f(z_1) or conditionally parametric with f(z) = z_2 λ (z_1).

Usage

 
semip(form,nonpar,conpar,window1=.25,window2=.25,bandwidth1=0,bandwidth2=0, 
 kern="tcub",distance="Mahal",targetfull=NULL, print.summary=TRUE, data=NULL)

Arguments

form

Model formula. Specifies the base parametric form of the model, y = X β. Any number of variables can be included in X. Format: semip(y~x1+x2..., ...).

nonpar

List of variables in z_1. Formats: semip(..., nonpar=~z1a, ...) or semip(..., nonpar=~z1a+z1b, ...). Important: note the "~" before the first z1 variable. At most two variables can be included in z_1.

conpar

List of variables in z_2. By default, conpar = NULL and f(z) has the fully nonparametric form f(z) = f(z_1); in this case the variables in z_1 are taken from the list provided by nonpar. If a list of variables is provided for conpar, the conditionally parametric form f(z) = z_2 λ (z_1) is assumed for f(z), with the variables for z_1 taken from nonpar and the variables for z_2 taken from conpar. Any number of variables can be included in conpar. Format: semip(..., conpar=~z2a+z2b+z2c+..., ...). Important: note the "~" before the first z2 variable.

window1

Window size for the LWR or CPAR regressions of y and x on z. Default = .25.

window2

Window size for the LWR or CPAR regression of y-X β on z. Default = .25.

bandwidth1

Bandwidth for the LWR or CPAR regressions of y and x on z. Default: bandwidth1=0, i.e., not specified.

bandwidth2

Bandwidth for the LWR or CPAR regression of y-X β on z. Default: bandwidth2=0, i.e., not specified.
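The difference between a window and a bandwidth can be sketched in base R. The sketch assumes that a window is the share of observations nearest the target point that receive positive weight, so the implied bandwidth is the distance to the k-th nearest observation with k = round(window*n), whereas a bandwidth is a fixed distance. The helper local_bandwidth is hypothetical and is not part of McSpatial.

```r
# Hypothetical helper: bandwidth implied by a nearest-neighbour window.
# Assumption: window = fraction of observations given positive weight.
local_bandwidth <- function(z, z0, window = 0.25) {
  k <- max(2, round(window * length(z)))   # number of observations in the window
  sort(abs(z - z0))[k]                     # distance to the k-th nearest point
}

z <- seq(0, 1, length.out = 101)           # evenly spaced points 0, 0.01, ..., 1
h_mid  <- local_bandwidth(z, 0.5, window = 0.25)   # target in the middle
h_edge <- local_bandwidth(z, 0.0, window = 0.25)   # target at the boundary
```

Under this assumption the implied bandwidth is wider at the boundary than in the middle of the data, because the same number of neighbours has to be found on one side only. A fixed bandwidth would instead use fewer observations near the boundary.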

kern

Kernel weighting functions. Default is the tri-cube. Options include "rect", "tria", "epan", "bisq", "tcub", "trwt", and "gauss".
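As a rough illustration of the kernel options, the textbook forms of these weighting functions can be written in base R for u = distance/bandwidth. The mapping of each option name to a formula (in particular "trwt" as the tri-weight kernel) and the scaling of the Gaussian are assumptions, not taken from the McSpatial source. Normalising constants are omitted because they cancel in weighted least squares.

```r
# Textbook kernel shapes, evaluated at u = distance / bandwidth.
# Assumed mapping of option names to formulas; constants are omitted.
kernels <- list(
  rect  = function(u) ifelse(abs(u) < 1, 1, 0),
  tria  = function(u) ifelse(abs(u) < 1, 1 - abs(u), 0),
  epan  = function(u) ifelse(abs(u) < 1, 1 - u^2, 0),
  bisq  = function(u) ifelse(abs(u) < 1, (1 - u^2)^2, 0),
  tcub  = function(u) ifelse(abs(u) < 1, (1 - abs(u)^3)^3, 0),
  trwt  = function(u) ifelse(abs(u) < 1, (1 - u^2)^3, 0),
  gauss = function(u) exp(-0.5 * u^2)
)

u <- c(0, 0.5, 1, 1.5)
w <- sapply(kernels, function(k) k(u))   # rows: values of u, columns: kernels
```

All of the compact kernels give zero weight at and beyond u = 1, while the Gaussian gives positive weight to every observation.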

distance

Options: "Euclid", "Mahal", or "Latlong" for Euclidean, Mahalanobis, or great-circle geographic distance. May be abbreviated to the first letter, which must be capitalized. Note: with "Latlong", semip uses the first two letters of the variable names to determine which variable is latitude and which is longitude, so the data set must be attached first or specified using the data option; expressions like data$latitude will not work. Default: "Mahal".
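The three distance measures can be sketched in base R as follows. This is only an illustration: the function names are hypothetical, the great-circle distance uses the haversine formula with an assumed Earth radius of 6371 km, and none of it is taken from the McSpatial source, which may scale distances differently.

```r
# Distance from a target point z0 to every row of Z (illustrative helpers).
euclid_dist <- function(Z, z0) sqrt(rowSums(sweep(Z, 2, z0)^2))

# stats::mahalanobis returns the squared distance, hence the square root.
mahal_dist <- function(Z, z0) sqrt(mahalanobis(Z, center = z0, cov = cov(Z)))

# Haversine great-circle distance in kilometres (radius is an assumption).
latlong_dist <- function(lon, lat, lon0, lat0, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat - lat0) * to_rad
  dlon <- (lon - lon0) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat * to_rad) * cos(lat0 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Made-up coordinates near Chicago, used only to exercise the functions.
Z <- cbind(longitude = c(-87.63, -87.90, -87.75, -87.60),
           latitude  = c(41.88, 41.98, 41.79, 42.05))
d_e <- euclid_dist(Z, Z[1, ])
d_m <- mahal_dist(Z, Z[1, ])
d_g <- latlong_dist(Z[, 1], Z[, 2], Z[1, 1], Z[1, 2])
```

Euclidean distance treats a degree of longitude and a degree of latitude as equal, Mahalanobis distance rescales by the sample covariance of the variables, and the great-circle distance accounts for the curvature of the Earth.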

targetfull

Target options to be passed to the lwr command if conpar = NULL or the cparlwr command if a list of variables is provided for conpar. Options include NULL, "alldata", or the full output of the maketarget command. The appropriate argument will then be passed on to the lwr or cparlwr command.

print.summary

If print.summary=TRUE, prints a summary of the regression of ey = y - fitted(y) on ex = X - fitted(X), i.e., the parametric portion of the model. Default: print.summary=TRUE.

data

A data frame containing the data. Default: NULL, in which case the variables are taken from the current workspace or an attached data frame.

Details

If conpar = NULL, the function implements Robinson's (1988) semi-parametric estimator for the model y = X β + f(z) + u. In this case, the list of variables in z is taken from nonpar and z can have at most two variables. If a list of variables is provided for conpar, the function implements the semi-parametric estimator for the model y = X β + z_2 λ (z_1) + u. In this case, the list of variables in z_1 is taken from nonpar and the list of variables in z_2 is taken from conpar. z_1 can have at most two variables. There is no limit on the number of variables in z_2.
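The conditionally parametric term z_2 λ (z_1) can be illustrated with a small base R sketch: at each target value of z_1, a kernel-weighted least squares regression on z_2 yields a local coefficient that estimates λ at that point. The simulated data, the helper cpar_at, and the choice of a local intercept plus a single z_2 term are all assumptions made for illustration; the cparlwr function may include additional local terms.

```r
set.seed(1)
n  <- 400
z1 <- runif(n)                        # variable that drives the coefficient
z2 <- rnorm(n)                        # variable that enters linearly
lambda_true <- 1 + 2 * z1             # coefficient on z2 varies smoothly with z1
y  <- z2 * lambda_true + rnorm(n, sd = 0.3)

tricube <- function(u) ifelse(abs(u) < 1, (1 - abs(u)^3)^3, 0)

# Hypothetical helper: local coefficient on z2 at one target value of z1.
cpar_at <- function(z0, window = 0.25) {
  d <- abs(z1 - z0)
  h <- sort(d)[round(window * n)]     # nearest-neighbour bandwidth
  w <- tricube(d / h)                 # kernel weights in z1 only
  X <- cbind(1, z2)
  solve(crossprod(X, w * X), crossprod(X, w * y))[2]   # weighted LS slope
}

targets    <- c(0.25, 0.50, 0.75)
lambda_hat <- sapply(targets, cpar_at)
```

In this simulation the true values of λ at the three targets are 1.5, 2, and 2.5, so lambda_hat should rise with z_1 in roughly that pattern.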

The estimation procedure has the following three steps under either specification:

1. Nonparametric regressions of y on z and each X on z using the lwr function when conpar=NULL and the cparlwr function when a list of variables is provided for conpar. The window or bandwidth for these regressions is set by window1 or bandwidth1.

2. OLS regression of y-fitted(y) on the k-1 variables in X - fitted(X), omitting the intercept. The coefficients from this regression are the estimated values of β.

3. Nonparametric regression of y-X β on z using the lwr function when conpar=NULL and the cparlwr function when a list of variables is provided for conpar. The window or bandwidth for this regression is set by window2 or bandwidth2.
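The three steps above can be sketched in base R for the fully nonparametric case with one variable in X and one in z. The simulated data and the hand-written local linear smoother smooth_on_z are stand-ins chosen for illustration; they are not the lwr function, and the nearest-neighbour window and tri-cube weights are assumptions about how such a smoother might be set up.

```r
set.seed(42)
n <- 500
z <- runif(n, 0, 2 * pi)
x <- 0.5 * z + rnorm(n)                  # x is correlated with z
y <- 1.5 * x + sin(z) + rnorm(n, sd = 0.5)

tricube <- function(u) ifelse(abs(u) < 1, (1 - abs(u)^3)^3, 0)

# Local linear fit of v on z at every observation (stand-in for lwr).
smooth_on_z <- function(v, window) {
  k <- round(window * n)
  sapply(z, function(z0) {
    d <- abs(z - z0)
    w <- tricube(d / sort(d)[k])         # nearest-neighbour tri-cube weights
    Z <- cbind(1, z - z0)
    solve(crossprod(Z, w * Z), crossprod(Z, w * v))[1]  # intercept = fit at z0
  })
}

# Step 1: nonparametric regressions of y and x on z.
ey <- y - smooth_on_z(y, window = 0.25)
ex <- x - smooth_on_z(x, window = 0.25)

# Step 2: OLS of ey on ex, omitting the intercept, estimates beta.
beta_hat <- coef(lm(ey ~ ex - 1))

# Step 3: nonparametric regression of y - x * beta on z estimates f(z).
f_hat <- smooth_on_z(y - x * beta_hat, window = 0.25)
```

In this simulation the true coefficient is 1.5 and the true f(z) is sin(z), so beta_hat and f_hat can be checked against them. A naive OLS regression of y on x alone would be biased here because x is correlated with z.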

The stage-two OLS regression uses k degrees of freedom. The stage-three nonparametric regression uses 2*df1-df2 degrees of freedom, where df1 = tr(L), df2 = tr(L'L), and L is the n x n smoother matrix of the lwr or cparlwr regression, so that the fitted values are L(y - X β). The estimated variance is sig2 = rss/(n-2*df1+df2), where rss = sum((y - X β - f(z))^2). The covariance matrix estimate for β is sig2*((X-fitted(X))'(X-fitted(X)))^(-1), which is stored as vmat.
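The degrees-of-freedom and variance formulas in this paragraph can be traced with an explicit smoother matrix. The sketch below is a base R illustration under stated assumptions: simulated data, one variable in X, a hand-written local linear smoother in place of lwr, and the same matrix L in every stage (semip allows window1 and window2 to differ). It follows the definitions df1 = tr(L) and df2 = tr(L'L) given in this paragraph.

```r
set.seed(7)
n <- 300
z <- runif(n)
x <- z + rnorm(n)
y <- 2 * x + cos(2 * pi * z) + rnorm(n, sd = 0.5)

tricube <- function(u) ifelse(abs(u) < 1, (1 - abs(u)^3)^3, 0)

# Row i of L holds the local linear weights mapping a response to its fit at z[i].
smoother_matrix <- function(z, window) {
  n <- length(z)
  k <- round(window * n)
  t(sapply(z, function(z0) {
    d <- abs(z - z0)
    w <- tricube(d / sort(d)[k])
    Z <- cbind(1, z - z0)
    (solve(crossprod(Z, w * Z)) %*% t(w * Z))[1, ]
  }))
}

L  <- smoother_matrix(z, window = 0.25)
ex <- x - drop(L %*% x)                      # X - fitted(X)
ey <- y - drop(L %*% y)                      # y - fitted(y)
beta_hat <- sum(ex * ey) / sum(ex^2)         # no-intercept OLS, one regressor

f_hat <- drop(L %*% (y - x * beta_hat))      # stage-three fit of f(z)
df1  <- sum(diag(L))                         # tr(L)
df2  <- sum(L^2)                             # tr(L'L)
rss  <- sum((y - x * beta_hat - f_hat)^2)
sig2 <- rss / (n - 2 * df1 + df2)
vmat <- sig2 / sum(ex^2)                     # sig2 * (ex'ex)^(-1), here 1 x 1
```

The simulated error variance is 0.25, so sig2 should be close to that value, and sqrt(vmat) is the standard error of beta_hat.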

The nonparametric regressions are estimated using either the lwr or cparlwr function. See their descriptions for more information.

Value

xcoef

The estimated coefficients for the parametric part of the model, β.

vmat

The covariance matrix for the estimates of β.

xbhat

The predicted values of X β, the parametric part of y, for the full data set.

nphat

The predicted values of f(z) for the full data set. mean(xbhat)+mean(nphat) will be close but not necessarily identical to mean(y).

nphat.se

Standard errors for the predicted values of f(z) for the full data set.

npfit

The complete set of lwr or cparlwr results from the nonparametric regression of y - X β on Z.

df1

k + tr(L), where k is the number of explanatory variables in X β (including the constant) and L is the nxn matrix used to calculate the final-stage nonparametric or conditionally parametric regression of Y - X β on Z. df1 is one measure of the degrees of freedom used in estimation.

df2

An alternative measure of the degrees of freedom used in estimation, df2 = k + tr(L'L).

sig2

Estimated residual variance, sig2 = rss/(n-2*df1+df2).

References

Cleveland, William S. and Susan J. Devlin, "Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting," Journal of the American Statistical Association 83 (1988), 596-610.

Loader, Clive. Local Regression and Likelihood. New York: Springer, 1999.

McMillen, Daniel P., "Issues in Spatial Data Analysis," Journal of Regional Science 50 (2010), 119-141.

McMillen, Daniel P., "Employment Densities, Spatial Autocorrelation, and Subcenters in Large Metropolitan Areas," Journal of Regional Science 44 (2004), 225-243.

McMillen, Daniel P. and Christian Redfearn, "Estimation and Hypothesis Testing for Nonparametric Hedonic House Price Functions," Journal of Regional Science 50 (2010), 712-733.

Pagan, Adrian and Aman Ullah. Nonparametric Econometrics. New York: Cambridge University Press, 1999.

Robinson, Paul M., "Root-N-Consistent Semiparametric Regression," Econometrica 56 (1988), 931-954.

See Also

cparlwr

lwr

maketarget

Examples


# Single variable in f(z)
library(McSpatial)
par(ask=TRUE)
n <- 1000
x <- runif(n,0,2*pi)
x <- sort(x)
z <- runif(n,0,2*pi)
xsq <- x^2
sinx <- sin(x)
cosx <- cos(x)
sin2x <- sin(2*x)
cos2x <- cos(2*x)
ybase1 <-  x - .1*xsq + sinx - cosx - .5*sin2x + .5*cos2x
ybase2 <- -z + .1*(z^2) - sin(z) + cos(z) + .5*sin(2*z) - .5*cos(2*z)
ybase <- ybase1+ybase2
sig <- sd(ybase)/2
y <- ybase + rnorm(n,0,sig)

# Correct specification for x; z in f(z)
fit <- semip(y~x+xsq+sinx+cosx+sin2x+cos2x,nonpar=~z,window1=.20,window2=.20)
2*fit$df1 - fit$df2
yvect <- c(min(ybase1,fit$xbhat), max(ybase1, fit$xbhat))
xbhat  <- fit$xbhat - mean(fit$xbhat) + mean(ybase1)
plot(x,ybase1,type="l",xlab="x",ylab="ybase1",ylim=yvect, main="Predictions for XB")
lines(x, xbhat, col="red")

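# Standard error for predicting a new y (not used in the plot below)
predse <- sqrt(fit$sig2 + fit$nphat.se^2)
# Recenter the estimate of f(z) and form a pointwise 95% confidence band
nphat <- fit$nphat - mean(fit$nphat) + mean(ybase2)
lower <- nphat + qnorm(.025)*fit$nphat.se
upper <- nphat + qnorm(.975)*fit$nphat.se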
o <- order(z)
yvect <- c(min(lower), max(upper))
plot(z[o], ybase2[o], type="l", xlab="z", ylab="f(z) ",
   main="Predictions for f(z) ", ylim=yvect)
lines(z[o], nphat[o], col="red")
lines(z[o], lower[o], col="red", lty="dashed")
lines(z[o], upper[o], col="red", lty="dashed")

## Not run: 
# Chicago Housing Sales
data(matchdata)
match05 <- data.frame(matchdata[matchdata$year==2005,])
match05$age <- 2005-match05$yrbuilt

tfit1 <- maketarget(~dcbd,window=.3,data=match05)
tfit2 <- maketarget(~longitude+latitude,window=.5,data=match05)

# nonparametric control for dcbd

fit <- semip(lnprice~lnland+lnbldg+rooms+bedrooms+bathrooms+centair+fireplace+brick+
garage1+garage2+ age+rr, nonpar=~dcbd, data=match05,targetfull=tfit1)

# nonparametric controls for longitude and latitude

fit <- semip(lnprice~lnland+lnbldg+rooms+bedrooms+bathrooms+centair+fireplace+brick+
garage1+garage2+ age+rr+dcbd, nonpar=~longitude+latitude, data=match05, targetfull=tfit2,
distance="Latlong")

# Conditionally parametric model:  y = XB + dcbd*lambda(longitude,latitude) + u
fit <- semip(lnprice~lnland+lnbldg+rooms+bedrooms+bathrooms+centair+fireplace+
 brick+garage1+garage2+age+rr, nonpar=~longitude+latitude, conpar=~dcbd, 
 data=match05, distance="Latlong",targetfull=tfit1)

# Conditionally parametric model:  y = XB + Z*lambda(longitude,latitude) + u
# Z = (dcbd,lnland,lnbldg,age)
fit <- semip(lnprice~rooms+bedrooms+bathrooms+centair+fireplace+brick+
garage1+garage2+rr, nonpar=~longitude+latitude, conpar=~dcbd+lnland+lnbldg+age, 
 data=match05, distance="Latlong",targetfull=tfit2)

## End(Not run)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> library(McSpatial)
Loading required package: lattice
Loading required package: locfit
locfit 1.5-9.1 	 2013-03-22
Loading required package: maptools
Loading required package: sp
Checking rgeos availability: TRUE
Loading required package: quantreg
Loading required package: SparseM

Attaching package: 'SparseM'

The following object is masked from 'package:base':

    backsolve

Loading required package: RANN
> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/McSpatial/semip.Rd_%03d_medium.png", width=480, height=480)
> ### Name: semip
> ### Title: Semi-Parametric Regression
> ### Aliases: semip
> ### Keywords: Semi-Parametric Models
> 
> ### ** Examples
> 
> 
> # Single variable in f(z)
> par(ask=TRUE)
> n = 1000
> x <- runif(n,0,2*pi)
> x <- sort(x)
> z <- runif(n,0,2*pi)
> xsq <- x^2
> sinx <- sin(x)
> cosx <- cos(x)
> sin2x <- sin(2*x)
> cos2x <- cos(2*x)
> ybase1 <-  x - .1*xsq + sinx - cosx - .5*sin2x + .5*cos2x
> ybase2 <- -z + .1*(z^2) - sin(z) + cos(z) + .5*sin(2*z) - .5*cos(2*z)
> ybase <- ybase1+ybase2
> sig = sd(ybase)/2
> y <- ybase + rnorm(n,0,sig)
> 
> # Correct specification for x; z in f(z)
> fit <- semip(y~x+xsq+sinx+cosx+sin2x+cos2x,nonpar=~z,window1=.20,window2=.20)
Parametric Portion 
  
        Estimate Std. Error    z-value     Pr(>|z|)
x      1.0539782 0.45331991   2.325021 2.007085e-02
xsq   -0.1082117 0.07205667  -1.501758 1.331595e-01
sinx   1.0289993 0.07781862  13.223047 0.000000e+00
cosx  -0.9060827 0.29133738  -3.110080 1.870364e-03
sin2x -0.5624848 0.05232637 -10.749549 0.000000e+00
cos2x  0.5173237 0.08357031   6.190281 6.005700e-10
> 2*fit$df1 - fit$df2
[1] 17.12679
> yvect <- c(min(ybase1,fit$xbhat), max(ybase1, fit$xbhat))
> xbhat  <- fit$xbhat - mean(fit$xbhat) + mean(ybase1)
> plot(x,ybase1,type="l",xlab="x",ylab="ybase1",ylim=yvect, main="Predictions for XB")
> lines(x, xbhat, col="red")
> 
> predse <- sqrt(fit$sig2 + fit$nphat.se^2)
> nphat <- fit$nphat - mean(fit$nphat) + mean(ybase2)
> lower <- nphat + qnorm(.025)*fit$nphat.se
> upper <- nphat + qnorm(.975)*fit$nphat.se
> o <- order(z)
> yvect <- c(min(lower), max(upper))
> plot(z[o], ybase2[o], type="l", xlab="z", ylab="f(z) ",
+    main="Predictions for f(z) ", ylim=yvect)
> lines(z[o], nphat[o], col="red")
> lines(z[o], lower[o], col="red", lty="dashed")
> lines(z[o], upper[o], col="red", lty="dashed")
> 
> dev.off()
null device 
          1 
>