R: Representativeness of Observations in a Data Set
dataRep
R Documentation
Description
These functions describe how well a given set of new observations
(e.g., new subjects) is represented in the dataset that was used to
develop a predictive model.
The dataRep function forms a data frame that contains all the unique
combinations of variable values that exist in a given dataset.
Cross-classifications of values are created using exact values of
variables, so for continuous numeric variables it is often necessary
to round them to the nearest v and possibly to curtail the values to
some lower and upper limit before rounding. Here v denotes a numeric
constant specifying the matching tolerance that will be used. dataRep
also stores marginal distribution summaries for all the variables. For
numeric variables, all 101 percentiles (0 through 100) are stored, and
for all variables the frequency distributions are also stored
(frequencies are computed after any rounding and curtailment of
numeric variables). The roundN function is provided for rounding and
curtailing. A print method summarizes the calculations made by
dataRep, and if long=TRUE all unique combinations of values and their
frequencies in the original dataset are printed.
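The rounding and curtailment described above can be sketched in base R. This is only an illustration of the arithmetic, not the Hmisc implementation: round_clip is a hypothetical helper, and the age values are made up.

```r
# Hypothetical helper (not part of Hmisc): curtail x to [clip[1], clip[2]],
# then round to the nearest multiple of tol.
round_clip <- function(x, tol = 1, clip = NULL) {
  if (!is.null(clip)) x <- pmin(pmax(x, clip[1]), clip[2])
  tol * round(x / tol)
}

age <- c(-3, 12, 47, 103)                       # made-up values
rounded <- round_clip(age, tol = 5, clip = c(0, 100))
print(rounded)                                  # 0 10 45 100

# Values that round to the same multiple of tol fall in the same matching cell
print(round_clip(c(11, 12.4), tol = 5))         # 10 10
```

With tol = 5, any two values that round to the same multiple of 5 are treated as identical when the cross-classification is built.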
The predict method for dataRep takes a new data frame having
variables named the same as the original ones (but whose factor levels
are not necessarily in the same order) and examines the collapsed
cross-classifications created by dataRep to find how many
observations were similar to each of the new observations after any
rounding or curtailment of limits is done. predict also does some
calculations to describe how the variable values of the new
observations "stack up" against the marginal distributions of the
original data. For categorical variables, the percent of original
observations whose value of the variable equals that of the new
observation (after rounding, for variables that were passed through
roundN in the formula given to dataRep) is computed. For numeric
variables, the percentile of the original distribution in which the
current value falls is computed. For this purpose, the data are not
rounded because the 101 original percentiles were retained; linear
interpolation is used to estimate percentiles for values between two
tabulated percentiles.
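The percentile lookup by linear interpolation can be illustrated with base R's quantile and approx. This is a sketch under assumptions, not the code used by predict.dataRep: orig is a made-up evenly spaced stand-in for the development data, pctl_of is a hypothetical helper, and clamping out-of-range values to 0 or 100 (rule = 2) is an assumed choice.

```r
# Made-up stand-in for a continuous variable in the development data
orig <- seq(0, 1, length.out = 1001)

# The 101 tabulated percentiles (0th through 100th)
pctl_values <- quantile(orig, probs = (0:100) / 100, names = FALSE)

# Hypothetical helper: interpolate linearly between tabulated percentiles;
# rule = 2 maps values outside the observed range to 0 or 100.
pctl_of <- function(new) {
  approx(x = pctl_values, y = 0:100, xout = new, rule = 2)$y
}

new_pctl <- pctl_of(c(0.255, 1.5, -2))
print(new_pctl)
```

In this toy setup a value of 0.255 falls halfway between the 25th and 26th tabulated percentiles, and values beyond the observed range are reported as the 0th or 100th percentile.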
The lowest marginal frequency of matching values across all variables
is also computed. For example, if an age, sex combination matches 10
subjects in the original dataset but the age value matches 100 ages
(after rounding) and the sex value matches the sex code of 300
observations, the lowest marginal frequency is 100, which is a "best
case" upper limit for multivariable matching. That is, matching on all
variables cannot yield a frequency higher than this amount.
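The relationship between the joint matching frequency and the lowest marginal frequency can be checked on a toy example in base R. The data frame and the new observation below are invented for illustration only.

```r
# Made-up development data
orig <- data.frame(
  age = c(30, 30, 30, 40, 40, 50, 60),
  sex = c("f", "m", "m", "f", "m", "f", "m")
)
new_obs <- list(age = 30, sex = "m")

# Number of original rows matching the new observation on all variables
joint <- sum(orig$age == new_obs$age & orig$sex == new_obs$sex)

# Number of original rows matching on each variable separately
marginal <- sapply(names(new_obs), function(v) sum(orig[[v]] == new_obs[[v]]))
min_marginal <- min(marginal)

print(joint)         # 2
print(marginal)      # age 3, sex 4
print(min_marginal)  # 3
```

The joint count can never exceed the smallest marginal count, because every row that matches on all variables also matches on each one individually.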
A print method for the output of predict.dataRep prints all
calculations done by predict by default. Calculations can be
selectively suppressed.
Usage
dataRep(formula, data, subset, na.action)
roundN(x, tol=1, clip=NULL)
## S3 method for class 'dataRep'
print(x, long=FALSE, ...)
## S3 method for class 'dataRep'
predict(object, newdata, ...)
## S3 method for class 'predict.dataRep'
print(x, prdata=TRUE, prpct=TRUE, ...)
Arguments
formula
a formula with no left-hand side. Continuous numeric variables that
need rounding should appear in the formula as, e.g., roundN(x,5),
which gives a matching tolerance of +/- 2.5. Factor or character
variables, as well as numeric variables not passed through roundN,
are matched exactly.
x
a numeric vector or an object created by dataRep
object
the object created by dataRep or predict.dataRep
data, subset, na.action
standard modeling arguments. Default na.action is na.delete,
i.e., observations in the original dataset having any variables
missing are deleted up front.
tol
rounding constant (tolerance is actually tol/2 as values are rounded
to the nearest tol)
clip
a 2-vector specifying a lower and upper limit to curtail values of x
before rounding
long
set to TRUE to see all unique combinations and their frequency counts
newdata
a data frame containing all the variables given to dataRep but not
necessarily in the same order or having factor levels in the same order
prdata
set to FALSE to suppress printing newdata and the count of matching
observations (plus the minimum marginal frequency).
prpct
set to FALSE to not print percentiles and percents
...
unused
Value
dataRep returns a list of class "dataRep" containing the collapsed
data frame and frequency counts along with marginal distribution
information. predict returns an object of class "predict.dataRep"
containing information determined by matching observations in
newdata with the original (collapsed) data.
Side Effects
print.dataRep prints.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
f.harrell@vanderbilt.edu
See Also
round, table
Examples
set.seed(13)
num.symptoms <- sample(1:4, 1000,TRUE)
sex <- factor(sample(c('female','male'), 1000,TRUE))
x <- runif(1000)
x[1] <- NA
table(num.symptoms, sex, .25*round(x/.25))
d <- dataRep(~ num.symptoms + sex + roundN(x,.25))
print(d, long=TRUE)
predict(d, data.frame(num.symptoms=1:3, sex=c('male','male','female'),
x=c(.03,.5,1.5)))
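A further variation on the example, using only the arguments listed under Usage, combines the clip argument of roundN with the prpct argument of the print method. It is a sketch that assumes the Hmisc package is installed (the block is skipped otherwise); the data frame name dev and the object names d2 and p are made up here, and no output is shown because it was not part of the original run.

```r
if (requireNamespace("Hmisc", quietly = TRUE)) {
  library(Hmisc)
  set.seed(13)
  # Made-up development data, mirroring the variables used above
  dev <- data.frame(
    num.symptoms = sample(1:4, 1000, TRUE),
    sex          = factor(sample(c("female", "male"), 1000, TRUE)),
    x            = runif(1000)
  )
  # clip = c(0, 1) curtails x to [0, 1] before rounding to the nearest 0.25
  d2 <- dataRep(~ num.symptoms + sex + roundN(x, 0.25, clip = c(0, 1)),
                data = dev)
  p <- predict(d2, data.frame(num.symptoms = 1:3,
                              sex = c("male", "male", "female"),
                              x = c(0.03, 0.5, 1.5)))
  # prpct = FALSE suppresses the percentiles and percents
  print(p, prpct = FALSE)
}
```

Passing the data through the data argument keeps the example self-contained instead of relying on variables in the global environment.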
Results
> library(Hmisc)
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2
Attaching package: 'Hmisc'
The following objects are masked from 'package:base':
format.pval, round.POSIXt, trunc.POSIXt, units
> ### Name: dataRep
> ### Title: Representativeness of Observations in a Data Set
> ### Aliases: dataRep print.dataRep predict.dataRep print.predict.dataRep
> ### roundN [.roundN
> ### Keywords: datasets category cluster manip models
>
> ### ** Examples
>
> set.seed(13)
> num.symptoms <- sample(1:4, 1000,TRUE)
> sex <- factor(sample(c('female','male'), 1000,TRUE))
> x <- runif(1000)
> x[1] <- NA
> table(num.symptoms, sex, .25*round(x/.25))
, , = 0
            sex
num.symptoms female male
           1     19   22
           2      9   11
           3     19   16
           4     19   12
, , = 0.25
            sex
num.symptoms female male
           1     30   31
           2     24   35
           3     37   35
           4     27   29
, , = 0.5
            sex
num.symptoms female male
           1     30   30
           2     31   30
           3     26   28
           4     44   28
, , = 0.75
            sex
num.symptoms female male
           1     24   36
           2     31   28
           3     31   31
           4     32   23
, , = 1
            sex
num.symptoms female male
           1     19   17
           2     17   16
           3     19   14
           4     26   13
>
>
> d <- dataRep(~ num.symptoms + sex + roundN(x,.25))
> print(d, long=TRUE)
Data Representativeness n=999
dataRep(formula = ~num.symptoms + sex + roundN(x, 0.25))
Frequencies of Missing Values Due to Each Variable
   num.symptoms             sex roundN(x, 0.25)
              0               0               1
Specifications for Matching
             Type              Parameters
num.symptoms exact numeric
sex          exact categorical female male
x            round             to nearest 0.25
Unique Combinations of Descriptor Variables
   num.symptoms    sex    x Frequency
1             1 female 0.00        19
2             2 female 0.00         9
3             3 female 0.00        19
4             4 female 0.00        19
5             1   male 0.00        22
6             2   male 0.00        11
7             3   male 0.00        16
8             4   male 0.00        12
9             1 female 0.25        30
10            2 female 0.25        24
11            3 female 0.25        37
12            4 female 0.25        27
13            1   male 0.25        31
14            2   male 0.25        35
15            3   male 0.25        35
16            4   male 0.25        29
17            1 female 0.50        30
18            2 female 0.50        31
19            3 female 0.50        26
20            4 female 0.50        44
21            1   male 0.50        30
22            2   male 0.50        30
23            3   male 0.50        28
24            4   male 0.50        28
25            1 female 0.75        24
26            2 female 0.75        31
27            3 female 0.75        31
28            4 female 0.75        32
29            1   male 0.75        36
30            2   male 0.75        28
31            3   male 0.75        31
32            4   male 0.75        23
33            1 female 1.00        19
34            2 female 1.00        17
35            3 female 1.00        19
36            4 female 1.00        26
37            1   male 1.00        17
38            2   male 1.00        16
39            3   male 1.00        14
40            4   male 1.00        13
>
>
> predict(d, data.frame(num.symptoms=1:3, sex=c('male','male','female'),
+ x=c(.03,.5,1.5)))
Descriptor Variable Values, Estimated Frequency in Original Dataset,
and Minimum Marginal Frequency for any Variable
  num.symptoms    sex    x Frequency Marginal.Freq
1            1   male 0.03        22           127
2            2   male 0.50        30           232
3            3 female 1.50         0             0
Percentiles for Continuous Descriptor Variables,
Percentage in Category for Categorical Variables
  num.symptoms sex   x
1           12  49   3
2           37  49  50
3           62  51 100