model.build {ModelMap}    R Documentation

Model Building

Description

Create sophisticated models using Random Forest, Quantile Regression Forests, Conditional Forests, or Stochastic Gradient Boosting from training data.

Usage

 

model.build(model.type = NULL, qdata.trainfn = NULL, folder = NULL,
     MODELfn = NULL, predList = NULL, predFactor = FALSE, response.name = NULL,
     response.type = NULL, unique.rowname = NULL, seed = NULL, na.action = NULL,
     keep.data = TRUE, ntree = switch(model.type, RF = 500, QRF = 1000, CF = 500, 500),
     mtry = switch(model.type, RF = NULL, QRF = ceiling(length(predList)/3),
          CF = min(5, length(predList) - 1), NULL), replace = TRUE, strata = NULL,
     sampsize = NULL, proximity = TRUE, importance = FALSE,
     quantiles = c(0.1, 0.5, 0.9), subset = NULL, weights = NULL,
     controls = NULL, xtrafo = NULL, ytrafo = NULL, scores = NULL,
     n.trees = NULL, shrinkage = 0.001, interaction.depth = 10,
     bag.fraction = 0.5, train.fraction = NULL, nTrain = NULL,
     n.minobsinnode = 10, var.monotone = NULL)

Arguments

model.type

String. Model type. "RF" (random forest), "QRF" (quantile random forest), "CF" (conditional forest), or "SGB" (stochastic gradient boosting).

qdata.trainfn

String. The name (full path or base name with path specified by folder) of the training data file used for building the model (the file should include columns for both response and predictor variables). The file must be a comma-delimited file *.csv with column headings. qdata.trainfn can also be an R dataframe. If predictions will be made (predict = TRUE or map = TRUE), the predictor column headers must match the names of the raster layer files, or a rastLUT must be provided to match predictor columns to the appropriate raster and band. If qdata.trainfn = NULL (the default), a GUI interface prompts the user to browse to the training data file.

folder

String. The folder used for all output from predictions and/or maps. Do not add an ending slash to the path string. If folder = NULL (the default), a GUI interface prompts the user to browse to a folder. To use the working directory, specify folder = getwd().

MODELfn

String. The file name used to save files related to the model object. If MODELfn = NULL (the default), a default name is generated by pasting model.type, response.type, and response.name, separated by underscores. If the other output filenames are left unspecified, MODELfn is used as the base name to generate them. The filename can be the full path, or it can be the simple basename, in which case output is written to the folder specified by folder.

predList

String. A character vector of the predictor short names used to build the model. These names must match the column names in the training/test data files and the names in column two of the rastLUT. If predList = NULL (the default), a GUI interface prompts the user to select predictors from column 2 of the rastLUT.

If both predList = NULL and rastLUT = NULL, a GUI interface prompts the user to browse to the rasters used as predictors and to select, from a generated list, the individual layers (bands) used to build the model. In this case (i.e., rastLUT = NULL), the predictor column names of the training data must be in standard format, consisting of the raster stack name followed by b1, b2, etc., giving the band number within each stack (example: stacknameb1, stacknameb2, stacknameb3, ...).
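The standard column-name format described above can be generated in base R; a small sketch (stackname is a placeholder, as in the example above):

```r
# Generate standard-format predictor column names for a raster stack.
# "stackname" is a placeholder for the actual raster stack name.
stack.name <- "stackname"
n.bands <- 3
pred.names <- paste0(stack.name, "b", seq_len(n.bands))
pred.names  # "stacknameb1" "stacknameb2" "stacknameb3"
```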

predFactor

String. A character vector of the short names of the predictors from predList that are factors (i.e., categorical predictors). These must be a subset of the predictor names given in predList. Categorical predictors may have multiple categories.

response.name

String. The name of the response variable used to build the model. If response.name = NULL, a GUI interface prompts the user to select a variable from the list of column names in the training data file. response.name must be a column name in the training/test data files.

response.type

String. Response type: "binary", "categorical", or "continuous". A binary response must be a 0/1 variable. All zeros are treated as one category, and everything else is treated as the second category.
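In other words, any nonzero value is collapsed into the presence category. A minimal base R sketch of that recoding:

```r
# Collapse a response to 0/1: zeros form one category,
# all other values form the second category.
resp <- c(0, 1, 0, 2, 5, 0)   # hypothetical response values
resp.binary <- as.numeric(resp != 0)
resp.binary  # 0 1 0 1 1 0
```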

unique.rowname

String. The name of the unique identifier used to identify each row in the training data. If unique.rowname = NULL, a GUI interface prompts the user to select a variable from the list of column names in the training data file. If unique.rowname = FALSE, a variable of numbers from 1 to nrow(qdata) is generated to index each row.

seed

Integer. The number used to initialize randomization to build RF or SGB models. If you want to produce the same model later, use the same seed. If seed = NULL (the default), a new seed is created each run.

na.action

String. Specifies the action to take if there are NA values in the predictor data. There are two options: (1) na.action = "na.omit", where any data point with missing predictors is omitted from the model-building data; (2) na.action = "na.roughfix", where a missing categorical predictor is replaced with the most common category, and a missing continuous predictor is replaced with the median. Note: na.roughfix is applied only to missing predictors; data points with a missing response are always omitted.
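The effect of na.roughfix on predictors can be sketched in base R (fill.na below is a hypothetical toy helper for illustration, not the actual randomForest::na.roughfix implementation):

```r
# Toy roughfix-style imputation: factors get the most common category,
# continuous variables get the median. Hypothetical helper for
# illustration only; ModelMap uses randomForest::na.roughfix.
fill.na <- function(x) {
  if (is.factor(x)) {
    x[is.na(x)] <- names(which.max(table(x)))
  } else {
    x[is.na(x)] <- median(x, na.rm = TRUE)
  }
  x
}

elev <- c(100, NA, 300)  # hypothetical continuous predictor
fill.na(elev)            # 100 200 300
```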

keep.data

Logical. RF and SGB models. Should a copy of the predictor data be included in the model object? Useful if model.interaction.plot will be used later.

ntree

Integer. RF, QRF, and CF models. The number of forest trees. The default is 500 trees for RF and CF models and 1000 trees for QRF models.

mtry

Integer. RF, QRF, and CF models. The number of variables to try at each node of the forest trees. By default, RF models use the tuneRF() function to optimize mtry.
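The QRF and CF defaults shown in the Usage section can be computed directly; for example, with a hypothetical predList of seven predictors:

```r
# Default mtry values from the Usage section, for a hypothetical
# predictor list of length 7:
predList <- c("TCB", "TCG", "TCW", "NLCD", "ELEV", "ASPECT", "SLOPE")
mtry.QRF <- ceiling(length(predList) / 3)   # 3
mtry.CF  <- min(5, length(predList) - 1)    # 5
```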

replace

Logical. RF models. Should sampling of cases be done with or without replacement?

strata

Factor or String. RF models. A (factor) variable that is used for stratified sampling. Can be in the form of either the name of the column in qdata or a factor or vector with one element for each row of qdata.

sampsize

Vector. RF models. Size(s) of sample to draw. For classification, if sampsize is a vector of length equal to the number of factor levels in strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from each stratum. If argument strata is not provided and response.type = "binary", then sampling is stratified by presence/absence. If argument sampsize is not provided, model.build() uses the default value from the randomForest package: if (replace) nrow(data) else ceiling(.632*nrow(data)).
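The randomForest default described above can be sketched as a small helper:

```r
# Default sampsize used by the randomForest package when none is given:
default.sampsize <- function(n, replace = TRUE) {
  if (replace) n else ceiling(0.632 * n)
}
default.sampsize(100, replace = TRUE)    # 100
default.sampsize(100, replace = FALSE)   # 64
```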

proximity

Logical. RF models. Should proximity measure among the rows be calculated?

importance

Logical. QRF models. For QRF models only, importance must be specified at the time of model building. If TRUE, the importance of predictors is assessed at the given quantiles. Warning: on large datasets, calculating QRF importances is very memory intensive and may require increasing memory limits with memory.limit().

quantiles

Numeric. Used for QRF models if importance = TRUE. Specifies which quantiles of the response variable to use. Importance plots can later be made only for quantiles specified at the time of model building.

subset

CF models. An optional vector specifying a subset of observations to be used in the fitting process. Note: subset is not supported for cross validation diagnostics.

weights

CF models. An optional vector of weights to be used in the fitting process. Non-negative integer-valued weights are allowed, as are non-negative real weights. Observations are sampled (with or without replacement) according to probabilities weights/sum(weights). The fraction of observations to be sampled (without replacement) is computed based on the sum of the weights if all weights are integer-valued, and based on the number of weights greater than zero otherwise. Alternatively, weights can be a double matrix defining case weights for all ncol(weights) trees in the forest directly. This requires more storage but gives the user more control. Note: weights is not supported for cross validation diagnostics.
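The rule for choosing the sampling basis can be sketched as follows (illustrative only; the actual computation happens inside the party package):

```r
# Basis for the sampling fraction, as described above: the sum of the
# weights when all weights are integer-valued, otherwise the count of
# weights greater than zero. Illustrative sketch only.
sampling.basis <- function(weights) {
  if (all(weights == floor(weights))) sum(weights) else sum(weights > 0)
}
sampling.basis(c(1, 2, 0, 1))    # integer-valued: sum = 4
sampling.basis(c(0.5, 1.5, 0))   # real-valued: count of positives = 2
```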

controls

CF models. An object of class ForestControl-class, which can be obtained using cforest_control (and its convenience interfaces cforest_unbiased and cforest_classical). If controls is specified, then the stand-alone arguments mtry and ntree are ignored, and these parameters must be specified as part of the controls argument. If controls is not specified, model.build defaults to cforest_unbiased(mtry=mtry, ntree=ntree), with the values of mtry and ntree given by the stand-alone arguments.

xtrafo

CF models. A function to be applied to all input variables. Defaults to xtrafo = ptrafo.

ytrafo

CF models. A function to be applied to all response variables. Defaults to ytrafo = ptrafo.

scores

CF models. An optional named list of scores to be attached to ordered factors. Note: scores is not supported for cross validation diagnostics.

n.trees

Integer. SGB models. The number of stochastic gradient boosting trees for an SGB model.

For response.type = "binary" and "continuous", if n.trees = NULL (the default), the model creation code increases the number of trees 100 at a time until the OOB error rate stops improving. The gbm function gbm.perf(), with argument method = "OOB", is then used to select the best number of trees for model predictions. The gbm package warns that OOB generally underestimates the optimal number of iterations, although predictive performance is reasonably competitive.

For response.type = "categorical", the gbm package has a bug preventing the use of the function gbm.more; therefore, if n.trees is not provided, the default is to build a model with 5000 trees. If the best number of trees selected by gbm.perf() falls within 90% of this total, a larger value of n.trees may be needed.

If n.trees is given and train.fraction is less than 1, then the SGB model is built with the given number of trees, and the best number of trees is calculated with gbm method="test".
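The incremental strategy used when n.trees = NULL can be sketched as a loop. Here oob.error is a made-up stand-in for the out-of-bag error estimate; the real implementation grows trees with gbm.more() and selects the best count with gbm.perf():

```r
# Sketch of "add 100 trees at a time until OOB error stops improving".
# oob.error is a hypothetical toy error curve with a minimum near 1000
# trees; the real estimate comes from the gbm package.
oob.error <- function(n) 1 / n + n / 1e6

n.trees <- 100
repeat {
  if (oob.error(n.trees + 100) >= oob.error(n.trees)) break  # no gain
  n.trees <- n.trees + 100
}
n.trees  # 1000
```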

shrinkage

Numeric. SGB models. A shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction.

interaction.depth

Integer. SGB models. The maximum depth of variable interactions. interaction.depth = 1 implies an additive model, interaction.depth = 2 implies a model with up to 2-way interactions, etc...

bag.fraction

Numeric. SGB models. bag.fraction must be a number between 0 and 1, giving the fraction of the training set observations randomly selected to propose the next tree in the expansion. This introduces randomness into the model fit. If bag.fraction < 1, then running the same model twice will result in similar but different fits.

train.fraction

Numeric. SGB models. The first train.fraction * nrow(data) observations are used to fit the model, and the remainder are used for computing out-of-sample estimates of the loss function. Deprecated; use nTrain instead.

nTrain

Integer. SGB models. The number of cases on which to train. This is the preferred way of specification; the option train.fraction is deprecated and maintained only for backward compatibility. The two parameters are mutually exclusive. If both are unspecified, all data is used for training.
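The relationship between the two specifications can be sketched with a hypothetical 200-row training set (the exact rounding is an internal detail of gbm; floor() is assumed here for illustration):

```r
# nTrain is the preferred argument; train.fraction is deprecated.
# Rough equivalence for a hypothetical 200-row training set
# (floor() rounding is an assumption for illustration):
n.rows <- 200
train.fraction <- 0.8
nTrain <- floor(train.fraction * n.rows)
nTrain  # 160
```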

n.minobsinnode

Integer. SGB models. Minimum number of observations in the trees' terminal nodes. Note that this is the actual number of observations, not the total weight.

var.monotone

String. SGB models. An optional vector, the same length as the number of predictors, indicating which variables have a monotone increasing (+1), decreasing (-1), or arbitrary (0) relationship with the outcome.

Details

This package provides a push-button approach to complex model building and production mapping. It contains three main functions: model.build, model.diagnostics, and model.mapmake.

In addition, it contains a simple function get.test that can be used to randomly divide a training dataset into training and test/validation sets; build.rastLUT, which uses GUI prompts to walk a user through setting up a raster lookup table to link predictors from the training data with the rasters used for map construction; model.explore, for preliminary data exploration; and model.importance.plot and model.interaction.plot, for interpreting the effects of individual model predictors.

These functions can be run in a traditional R command mode, where all arguments are specified in the function call. However, they can also be used in a full push-button mode, where you type in, for example, the simple command model.build, and GUI pop-up windows ask questions about the type of model, the file locations of the data, etc.

When running the ModelMap package on non-Windows platforms, file names and folders need to be specified in the argument list, but other push-button selections are handled by the select.list() function, which is platform independent.

Binary, categorical, and continuous response models are supported for Random Forest, Conditional Forest, and Stochastic Gradient Boosting. Quantile Random Forest is appropriate for only continuous response models.

Random Forest is implemented through the randomForest package within R. Random Forest is more user friendly than Stochastic Gradient Boosting, as it has fewer parameters to be set by the user, and is less sensitive to tuning of these parameters. A Random Forest model consists of multiple trees that vote on predictions. For each tree a random subset of the training data is used to construct the tree, with the remaining data points used to construct out-of-bag (OOB) error estimates. At each node of the tree a random selection of predictors is chosen to determine the split. The number of predictors used to select the splits (argument mtry) is the primary user specified parameter that can affect model performance.

By default, mtry is automatically optimized using the randomForest package's tuneRF() function. Note that this is a stochastic process. If there is a chance that models may later be combined with the randomForest package's combine function, then for consistency it is important to provide the mtry argument rather than using the default optimization process.

Random Forest will not overfit data, therefore the only penalty of increasing the number of trees is computation time. Random Forest can compute variable importance, an advantage over some "black box" modeling techniques if it is important to understand the ecological relationships underlying a model (Breiman, 2001).

Quantile Regression Forests are implemented through the quantregForest package.

Conditional Forests are implemented with the cforest() function in the party package. As stated in the party package, ensembles of conditional inference trees have not yet been extensively tested, so this routine is meant for the expert user only and its current state is rather experimental.

For CF models, ModelMap currently only supports binary, categorical, and continuous response models. Also, for some CF model parameters (subset, weights, and scores), ModelMap only provides OOB and independent test set diagnostics, and does not support cross validation diagnostics.

Stochastic gradient boosting (Friedman 2001, 2002), is related to both boosting and bagging. Many small classification or regression trees are built sequentially from "pseudo"-residuals (the gradient of the loss function of the previous tree).

At each iteration, a tree is built from a random sub-sample of the dataset (selected without replacement) and provides an incremental improvement to the model. Using only a fraction of the training data increases both the computation speed and the prediction accuracy, while also helping to avoid over-fitting the data. An advantage of stochastic gradient boosting is that it is not necessary to pre-select or transform predictor variables. It is also resistant to outliers, as the steepest gradient algorithm emphasizes points that are close to their correct classification. Stochastic gradient boosting is implemented through the gbm package within R.

One disadvantage of Stochastic Gradient Boosting, compared to Random Forest, is the increased number of user-specified parameters, and SGB models tend to be more sensitive to these parameters. Model fitting parameter options include distribution, interaction depth, bagging fraction, shrinkage rate, and training fraction. Values for these parameters other than the defaults cannot be set by point and click in the GUI pop-up windows, and must be given in the argument list when calling model.build(). Friedman (2001, 2002) and Ridgeway (1999) provide guidelines on appropriate settings for model fitting options.

Also, unlike Random Forest models, in Stochastic Gradient Boosting there is a penalty for using too many trees. The default behavior in model.build() is to increase the number of trees 100 at a time until the model stops improving, then call the gbm function gbm.perf(method="OOB") to select the best number of iterations. Alternatively, the model.build() argument n.trees can be used to set some large number of trees to be calculated all at once and, again, gbm.perf(method="OOB") will be used to select the best number of trees. Note that the gbm package warns that "OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive." The gbm package offers two alternative techniques for calculating the best number of trees, but these are not yet implemented in the ModelMap package, as they require the use of a formula interface for model building.

Value

The function will return the model object. Additionally, it will write a text file to disk, in the folder specified by folder. This file lists the values of each argument, whether specified in the function call or chosen from GUI prompts.

Author(s)

Elizabeth Freeman and Tracey Frescino

References

Breiman, L. (2001) Random Forests. Machine Learning, 45:5-32.

Elith, J., Leathwick, J. R. and Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology. 77:802-813.

Friedman, J.H. (2001). Greedy function approximation: a gradient boosting machine. Ann. Stat., 29(5):1189-1232.

Friedman, J.H. (2002). Stochastic gradient boosting. Comput. Stat. Data An., 38(4):367-378.

Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News 2(3), 18–22.

Meinshausen, N. (2006). Quantile Regression Forests. Journal of Machine Learning Research, 7:983-999. http://jmlr.csail.mit.edu/papers/v7/

Ridgeway, G. (1999). The state of boosting. Comp. Sci. Stat. 31:172-181.

Strobl, C., Boulesteix, A.-L., Zeileis, A. and Hothorn, T. (2007). Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution. BMC Bioinformatics, 8, 25. http://www.biomedcentral.com/1471-2105/8/25

Strobl, C., Malley, J. and Tutz, G. (2009). An Introduction to Recursive Partitioning: Rationale, Application, and Characteristics of Classification and Regression Trees, Bagging, and Random Forests. Psychological Methods, 14(4), 323-348.

Hothorn, T., Lausen, B., Benner, A. and Radespiel-Troger, M. (2004). Bagging Survival Trees. Statistics in Medicine, 23(1), 77-91.

Hothorn, T., Buhlmann, P., Dudoit, S., Molinaro, A. and van der Laan, M.J. (2006a). Survival Ensembles. Biostatistics, 7(3), 355-373.

Hothorn, T., Hornik, K. and Zeileis, A. (2006b). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651-674. Preprint available from http://statmath.wu-wien.ac.at/~zeileis/papers/Hothorn+Hornik+Zeileis-2006.pdf

See Also

get.test, model.diagnostics, model.mapmake

Examples


###########################################################################
############################# Run this set up code: #######################
###########################################################################

# set seed:
seed=38

# Define training and test files:

qdata.trainfn = system.file("extdata", "helpexamples","DATATRAIN.csv", package = "ModelMap")

# Define folder for all output:
folder=getwd()	

#identifier for individual training and test data points

unique.rowname="ID"


###########################################################################
############## Pick one of the following sets of definitions: #############
###########################################################################


########## Continuous Response, Continuous Predictors ############

#file name:
MODELfn="RF_Bio_TC"				

#predictors:
predList=c("TCB","TCG","TCW")	

#define which predictors are categorical:
predFactor=FALSE	

# Response name and type:
response.name="BIO"
response.type="continuous"


########## binary Response, Continuous Predictors ############

#file name to store model:
MODELfn="RF_CONIFTYP_TC"				

#predictors:
predList=c("TCB","TCG","TCW")		

#define which predictors are categorical:
predFactor=FALSE

# Response name and type:
response.name="CONIFTYP"

# This variable is 1 if a conifer or mixed conifer type is present, 
# otherwise 0.

response.type="binary"


########## Continuous Response, Categorical Predictors ############

# In this example, NLCD is a categorical predictor.
#
# You must decide what you want to happen if there are categories
# present in the data to be predicted (either the validation/test set
# or in the image file) that were not present in the original training data.
# Choices:
#       na.action = "na.omit"
#                    Any validation datapoint or image pixel with a value for any
#                    categorical predictor not found in the training data will be
#                    returned as NA.
#       na.action = "na.roughfix"
#                    Any validation datapoint or image pixel with a value for any
#                    categorical predictor not found in the training data will have
#                    the most common category for that predictor substituted,
#                    the most common category for that predictor substituted,
#                    and then a prediction will be made.

# You must also let R know which of the predictors are categorical, in other
# words, which ones R needs to treat as factors.
# This vector must be a subset of the predictors given in predList

#file name to store model:
MODELfn="RF_BIO_TCandNLCD"			

#predictors:
predList=c("TCB","TCG","TCW","NLCD")

#define which predictors are categorical:
predFactor=c("NLCD")

# Response name and type:
response.name="BIO"
response.type="continuous"



###########################################################################
########################### build model: ##################################
###########################################################################


### create model before batching (only run this code once ever!) ###

model.obj = model.build( model.type="RF",
                       qdata.trainfn=qdata.trainfn,
                       folder=folder,		
                       unique.rowname=unique.rowname,	
                       MODELfn=MODELfn,
                       predList=predList,
                       predFactor=predFactor,
                       response.name=response.name,
                       response.type=response.type,
                       seed=seed,
                       na.action="na.roughfix"
)



Results

Running the Random Forest example above, the tuneRF() optimization of mtry prints output similar to:

mtry = 1  OOB error = 1819.048 
Searching left ...
Searching right ...
mtry = 2 	OOB error = 1907.638 
-0.04870136 0.05 