Last data update: 2014.03.03

R: Fitting distributions to observations/Monte Carlo simulations
fitDistrR Documentation

Fitting distributions to observations/Monte Carlo simulations

Description

This function fits 21 different continuous distributions by (weighted) NLS to the histogram or kernel density of the Monte Carlo simulation results as obtained by propagate or any other vector containing large-scale observations. Finally, the fits are sorted by ascending AIC.

Usage

fitDistr(object, type = c("hist", "dens"), nbin = 100,
         weights = NULL, verbose = TRUE, plot = TRUE,  ...)

Arguments

object

an object of class 'propagate' or a vector containing observations.

type

propability density functions are fitted to either the density values of the "hist"ogram or of the kernel "dens"ity estimates.

nbin

in case of type = "hist", the number of bins in the histogram.

weights

numeric or logical. Either a numeric vector of weights, or if TRUE, the distributions are fitted with weights = 1/(counts per bin).

verbose

logical. If TRUE, steps of the analysis are printed to the console.

plot

logical. If TRUE, a plot with the "best" distribution (in terms of lowest AIC) on top of the histogram or kernel density curve is displayed.

...

other parameters to be passed to density or histogram .

Details

Fits the following 21 distributions using residual sum-of-squares as the minimization criterion for optim:
1) Normal distribution (dnorm) => http://en.wikipedia.org/wiki/Normal_distribution
2) Skewed-normal distribution (propagate:::dsn) => http://en.wikipedia.org/wiki/Skew_normal_distribution
3) Generalized normal distribution (propagate:::dgnorm) => http://en.wikipedia.org/wiki/Generalized_normal_distribution
4) Log-normal distribution (dlnorm) => http://en.wikipedia.org/wiki/Log-normal_distribution
5) Scaled and shifted t-distribution (propagate:::dst) => GUM 2008, Chapter 6.4.9.2.
6) Logistic distribution (dlogis) => http://en.wikipedia.org/wiki/Logistic_distribution
7) Uniform distribution (dunif) => http://en.wikipedia.org/wiki/Uniform_distribution_(continuous)
8) Triangular distribution (propagate:::dtriang) => http://en.wikipedia.org/wiki/Triangular_distribution
9) Trapezoidal distribution (propagate:::dtrap) => http://link.springer.com/article/10.1007%2Fs001840200230
10) Curvilinear Trapezoidal distribution (propagate:::dctrap) => GUM 2008, Chapter 6.4.3.1
11) Generalized Trapezoidal distribution (propagate:::dgtrap) => http://www.seas.gwu.edu/~dorpjr/Publications/JournalPapers/Metrika2003VanDorp.pdf
12) Gamma distribution (dgamma) => http://en.wikipedia.org/wiki/Gamma_distribution
13) Cauchy distribution (dcauchy) => http://en.wikipedia.org/wiki/Cauchy_distribution
14) Laplace distribution (propagate:::dlaplace) => http://en.wikipedia.org/wiki/Laplace_distribution
15) Gumbel distribution (propagate:::dgumbel) => http://en.wikipedia.org/wiki/Gumbel_distribution
16) Johnson SU distribution (propagate:::dJSU) => http://en.wikipedia.org/wiki/Johnson_SU_distribution
17) Johnson SB distribution (propagate:::dJSB) => http://www.mathwave.com/articles/johnson_sb_distribution.html
18) Three-parameter Weibull distribution (propagate:::dweibull2) => http://en.wikipedia.org/wiki/Weibull_distribution
19) Four-parameter beta distribution (propagate:::dbeta2) => http://en.wikipedia.org/wiki/http://en.wikipedia.org/wiki/Beta_distribution#Four_parameters_2
20) Arcsine distribution (propagate:::darcsin) => http://en.wikipedia.org/wiki/Arcsine_distribution
21) Von Mises distribution (propagate:::dmises) => http://en.wikipedia.org/wiki/Von_Mises_distribution

Distributions 3) and 16) - 19) are sometimes hard to fit because the start parameters are not readily deducible from the kernel density estimates or some parameters are highly sensitive to shape changes. For these five cases, a grid of starting values with different magnitudes is used to obtain the best parameter combination with respect to lowest residual sum-of-squares ("brute force" approach).

The goodness-of-fit (GOF) is calculated with AIC from the (weighted) log-likelihood of the fit:

m{ln}(L) = 0.5 cdot ≤ft(-N cdot ≤ft( m{ln}(2π) + 1 + m{ln}(N) - ∑_{i=1}^n log(w_i) + m{ln}≤ft(∑_{i=1}^n w_i cdot x_i^2 ight) ight) ight) ; , ; AIC = 2k - 2 m{ln}(L)

with x_i = the residuals from the NLS fit, N = the length of the residual vector, k = the number of parameters of the fitted model and w_i = the weights.

In contrast to some other distribution fitting softwares (i.e. Easyfit, Mathwave) that use residual sum-of-squares/Anderson-Darling/Kolmogorov-Smirnov statistics as GOF measures, the application of AIC accounts for increasing number of parameters in the distribution fit and therefore compensates for overfitting. Hence, this approach is more similar to ModelRisk (Vose Software) and as employed in fitdistr of the 'MASS' package. Another application is to identify a possible distribution for the raw data prior to using Monte Carlo simulations from this distribution. However, a decent number of observations should be at hand in order to obtain a realistic estimate of the proper distribution. See 'Examples'.
The code for the density functions is in file "distr-densities.R".

IMPORTANT: It can be feasible to set weights = TRUE in order to give more weight to bins with low counts. See 'Examples'.

ALSO: Distribution fitting is highly sensitive to the number of defined histogram bins, so it is advisable to change this parameter and inspect if the order of fitted distributions remains stable!

Value

A list with the following items:

aic: the ascendingly sorted AIC dataframe.
fit: a list of the results from nls.lm for each distribution model.
bestfit: the best model in terms of lowest AIC.
fitted: the fitted values from the best model.
residuals: the residuals from the best model.

Author(s)

Andrej-Nikolai Spiess

References

Continuous univariate distributions, Volume 1.
Johnson NL, Kotz S and Balakrishnan N.
Wiley Series in Probability and Statistics, 2.ed (2004).

A guide on probability distributions.
R-forge distributions core team.
http://dutangc.free.fr/pub/prob/probdistr-main.pdf.

Univariate distribution relationships.
Leemis LM and McQueston JT.
The American Statistician (2008), 62: 45-53.

Examples

## Not run: 
## Linear example, small error
## => family of normal distributions.
EXPR1 <- expression(x + 2 * y)
x <- c(5, 0.01)
y <- c(1, 0.01)
DF1 <- cbind(x, y)
RES1 <- propagate(expr = EXPR1, data = DF1, type = "stat", 
                  do.sim = TRUE, verbose = TRUE)
fitDistr(RES1)$aic

## Ratio example, larger error
## => family of skewed distributions.
EXPR2 <- expression(x/2 * y)
x <- c(5, 0.1)
y <- c(1, 0.02)
DF2 <- cbind(x, y)
RES2 <- propagate(expr = EXPR2, data = DF2, type = "stat", 
                  do.sim = TRUE, verbose = TRUE)
fitDistr(RES2)$aic

## Exponential example, large error
## => family of log-normal distributions.
EXPR3 <- expression(x^(2 * y))
x <- c(5, 0.1)
y <- c(1, 0.1)
DF3 <- cbind(x, y)
RES3 <- propagate(expr = EXPR3, data = DF3, type = "stat", 
                  do.sim = TRUE, verbose = TRUE)
fitDistr(RES3)$aic

## Rectangular input distributions result
## in trapezoidal output distribution.
A <- runif(100000, 20, 25)
B <- runif(100000, 3, 3.5)
DF4 <- cbind(A, B)
EXPR4 <- expression(A + B)
RES4 <- propagate(EXPR4, data = DF4, type = "sim", 
                 use.cov = FALSE, do.sim = TRUE)
fitDistr(RES4)$aic        

## Fitting with 1/counts as weights.
EXPR5 <- expression(x + 2 * y)
x <- c(5, 0.05)
y <- c(1, 0.05)
DF5 <- cbind(x, y)
RES5 <- propagate(expr = EXPR5, data = DF5, type = "stat", 
                  do.sim = TRUE, verbose = TRUE, weights = TRUE)
fitDistr(RES5)$aic

## End(Not run)

Results