describe
R Documentation
Concise Statistical Description of a Vector, Matrix, Data Frame, or Formula
Description
describe is a generic method that invokes describe.data.frame,
describe.matrix, describe.vector, or
describe.formula. describe.vector is the basic
function for handling a single variable.
This function determines whether the variable is character, factor,
category, binary, discrete numeric, or continuous numeric, and prints
a concise statistical summary according to each. A numeric variable is
deemed discrete if it has <= 10 unique values. In this case,
quantiles are not printed. A frequency table is printed
for any non-binary variable if it has no more than 20 unique
values. For any variable with at least 20 unique values, the 5 lowest
and highest values are printed. This behavior can be overridden for long
character variables with many levels using the listunique
parameter, to get a complete tabulation.
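The classification rules above can be sketched in base R. This is a rough illustration only: the helper classify_variable is hypothetical and not part of Hmisc, it ignores the legacy "category" class, and it simply restates the thresholds given in the text (10 distinct values for discrete numeric, 20 for the frequency table and for the lowest/highest listing).

```r
# Hypothetical helper (not part of Hmisc): restates the thresholds from the
# text to show which summaries a variable would receive.
classify_variable <- function(x) {
  x <- x[!is.na(x)]
  n_distinct <- length(unique(x))
  kind <- if (is.character(x)) {
    "character"
  } else if (is.factor(x)) {
    "factor"
  } else if (n_distinct == 2) {
    "binary"                      # assumption: any two-valued non-text variable
  } else if (n_distinct <= 10) {
    "discrete numeric"
  } else {
    "continuous numeric"
  }
  list(
    kind           = kind,
    n_distinct     = n_distinct,
    quantiles      = kind == "continuous numeric",
    freq_table     = kind != "binary" && n_distinct <= 20,
    lowest_highest = n_distinct >= 20
  )
}

classify_variable(c(0, 1, 1, 0, 1))$kind
classify_variable(seq(0, 1, length.out = 50))$quantiles
```

A variable with exactly 20 distinct values satisfies both the frequency-table rule and the lowest/highest rule as the text states them, which the sketch reproduces rather than resolves.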
describe is especially useful for
describing data frames created by *.get, as labels, formats,
value labels, and (in the case of sas.get) frequencies of special
missing values are printed.
For a binary variable, the sum (number of 1's) and mean (proportion of
1's) are printed. If the first argument is a formula, a model frame
is created and passed to describe.data.frame. If a variable
is of class "impute", a count of the number of imputed values is
printed. If a date variable has an attribute partial.date
(this is set up by sas.get), counts of how many partial dates are
actually present (missing month, missing day, missing both) are also presented.
If a variable was created by the special-purpose function substi (which
substitutes values of a second variable if the first variable is NA),
the frequency table of substitutions is also printed.
For numeric variables, describe adds an item called Info
which is a relative information measure using the relative efficiency of
a proportional odds/Wilcoxon test on the variable relative to the same
test on a variable that has no ties. Info is related to how
continuous the variable is, and ties are less harmful the more untied
values there are. The formula for Info is one minus the sum of
the cubes of relative frequencies of values divided by one minus the
square of the reciprocal of the sample size. The lowest information
comes from a variable having only one unique value, followed by a
highly skewed binary variable. Info is reported to
two decimal places.
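The formula for Info as stated above can be written out directly. The following is a minimal sketch in base R, assuming at least two non-missing observations; info_measure is a hypothetical name and this is not the code Hmisc itself uses.

```r
# Sketch of the Info formula stated above (hypothetical helper, not Hmisc code):
# Info = (1 - sum(p^3)) / (1 - 1/n^2), with p the relative frequencies
# of the distinct values and n the number of non-missing observations.
info_measure <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  p <- as.vector(table(x)) / n
  (1 - sum(p^3)) / (1 - 1 / n^2)
}

info_measure(1:100)               # no ties: Info is 1
info_measure(rep(5, 100))         # a single distinct value: Info is 0
info_measure(rep(0:1, c(95, 5)))  # highly skewed binary variable: low Info
```

With no ties, each relative frequency is 1/n, so the numerator equals the denominator, which is why the denominator is there as a normalizing constant.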
A latex method exists for converting the describe object to a
LaTeX file. For numeric variables having at least 20 unique values,
describe saves in its returned object the frequencies of 100
evenly spaced bins running from minimum observed value to the maximum.
latex inserts a spike histogram displaying these frequency counts
in the tabular material using the LaTeX picture environment. For
example output see
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/Hmisc/counties.pdf.
Note that the latex method assumes you have the following styles
installed in your latex installation: setspace and relsize.
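The binning behind the spike histogram can be approximated as follows. This is only a sketch of the idea described above (100 evenly spaced bins from the minimum to the maximum); interval_freq is a hypothetical helper, and the exact bin boundaries used by describe may differ.

```r
# Hypothetical sketch of the binning described above (not Hmisc's own code):
# counts of observations in 100 equal-width bins from min(x) to max(x).
interval_freq <- function(x, nbins = 100) {
  x <- x[!is.na(x)]
  rng <- range(x)
  breaks <- seq(rng[1], rng[2], length.out = nbins + 1)
  # rightmost.closed = TRUE places the maximum in the last bin
  bin <- findInterval(x, breaks, rightmost.closed = TRUE)
  bin <- pmin(pmax(bin, 1), nbins)  # guard against floating-point edge cases
  list(range = rng, count = tabulate(bin, nbins = nbins))
}

set.seed(1)
f <- interval_freq(runif(200))
length(f$count)  # one count per bin
sum(f$count)     # every observation lands in exactly one bin
```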
Sample weights may be specified to any of the functions, resulting
in weighted means, quantiles, and frequency tables.
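To make the effect of weights concrete, here is a small base R sketch of a weighted mean and of the rescaling that the normwt argument describes. The function weighted_summary is hypothetical and is not how Hmisc computes its statistics; it only illustrates how the apparent sample size changes.

```r
# Hypothetical sketch (not Hmisc's implementation) of how frequency weights
# and normwt-style rescaling affect the mean and the apparent sample size.
weighted_summary <- function(x, weights, normwt = FALSE) {
  if (normwt) {
    # rescale so the weights sum to the number of observations
    weights <- weights * length(x) / sum(weights)
  }
  c(n = sum(weights), mean = sum(weights * x) / sum(weights))
}

x <- c(1, 2, 3)
w <- c(2, 2, 6)
weighted_summary(x, w)                 # n is sum(w)
weighted_summary(x, w, normwt = TRUE)  # n is length(x); the mean is unchanged
```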
Note: As discussed in Cox and Longton (2008), Stata Journal 8(4),
p. 557, the term "unique" should really be "distinct".
Usage
## S3 method for class 'vector'
describe(x, descript, exclude.missing=TRUE, digits=4,
listunique=0, listnchar=12,
weights=NULL, normwt=FALSE, minlength=NULL, ...)
## S3 method for class 'matrix'
describe(x, descript, exclude.missing=TRUE, digits=4, ...)
## S3 method for class 'data.frame'
describe(x, descript, exclude.missing=TRUE,
digits=4, ...)
## S3 method for class 'formula'
describe(x, descript, data, subset, na.action,
digits=4, weights, ...)
## S3 method for class 'describe'
print(x, condense=TRUE, ...)
## S3 method for class 'describe'
latex(object, title=NULL, condense=TRUE,
file=paste('describe',first.word(expr=attr(object,'descript')),'tex',sep='.'),
append=FALSE, size='small', tabular=TRUE, greek=TRUE,
spacing=0.7, lspace=c(0,0), ...)
## S3 method for class 'describe.single'
latex(object, title=NULL, condense=TRUE, vname,
file, append=FALSE, size='small', tabular=TRUE, greek=TRUE,
lspace=c(0,0), ...)
Arguments
x
a data frame, matrix, vector, or formula. For a data frame, the
describe.data.frame
function is automatically invoked. For a matrix, describe.matrix is
called. For a formula, describe.data.frame(model.frame(x))
is invoked. The formula may or may not have a response variable. For
print or latex, x is an object created by
describe.
descript
optional title to print for x. The default is the name of the argument
or the "label" attributes of individual variables. When the first argument
is a formula, descript defaults to a character representation of
the formula.
exclude.missing
set to TRUE to print the names of variables that contain only missing values.
This list appears at the bottom of the printout, and no space is taken
up for such variables in the main listing.
digits
number of significant digits to print
listunique
For a character variable that is not an mChoice variable, that
has its longest string length greater than listnchar, and that
has no more than listunique unique values, all values are
listed in alphabetic order. Any value having more than one occurrence
has the frequency of occurrence after it, in parentheses. Specify
listunique equal to some value at least as large as the number
of observations to ensure that all character variables will have all
their values listed. For purposes of tabulating character strings,
multiple white spaces of any kind are translated to a single space,
leading and trailing white space are ignored, and case is ignored.
listnchar
see listunique
weights
a numeric vector of frequencies or sample weights. Each observation
will be treated as if it were sampled weights times.
minlength
value passed to summary.mChoice.
normwt
The default, normwt=FALSE results in the use of weights as
weights in computing various statistics. In this case the sample size
is assumed to be equal to the sum of weights. Specify
normwt=TRUE to divide
weights by a constant so that weights sum to the number of
observations (length of vectors specified to describe). In this
case the number of observations is taken to be the actual number of
records given to describe.
object
a result of describe
title
unused
condense
default is TRUE to condense the output with regard to the 5 lowest and
highest values and the frequency table
data
subset
na.action
These are used if a formula is specified. na.action defaults to
na.retain which does not delete any NAs from the data frame.
Use na.action=na.omit or na.delete to drop any observation with
any NA before processing.
...
arguments passed to describe.default which are passed to calls
to format for numeric variables. For example if using R
POSIXct or Date date/time formats, specifying
describe(d,format='%d%b%y') will print date/time variables as
"01Jan00". This is useful for omitting the time
component. See the help file for format.POSIXct or
format.Date for more
information. For latex methods, ... is ignored.
file
name of output file (should have a suffix of .tex). Default name is
formed from the first word of the descript element of the
describe object, prefixed by "describe". Set
file="" to send LaTeX code to standard output instead of a file.
append
set to TRUE to have latex append text to an existing file
named file
size
LaTeX text size ("small", the default, or "normalsize", "tiny",
"scriptsize", etc.) for the describe output in LaTeX.
tabular
set to FALSE to use verbatim rather than tabular environment
for the summary statistics output. By default, tabular is used if the
output is not too wide.
greek
By default, the latex methods
will change LaTeX names of greek letters that appear in variable
labels to appropriate LaTeX symbols in math mode unless
greek=FALSE. greek=TRUE is not implemented in S-Plus
versions older than 6.2.
spacing
By default, the latex method for describe run
on a matrix or data frame uses the setspace LaTeX package with a
line spacing of 0.7 so as not to waste space. Specify spacing=0
to suppress the use of setspace's spacing environment,
or specify another positive value to use this environment with a
different spacing.
lspace
extra vertical space, in character size units (i.e., "ex"
as appended to the space). When using certain font sizes, there is
too much space left around LaTeX verbatim environments. This
two-vector specifies space to remove (i.e., the values are negated in
forming the vspace command) before (first element) and after
(second element of lspace) verbatims
vname
unused argument in latex.describe.single
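The string handling described for the listunique argument above (white space collapsed, leading and trailing white space dropped, case ignored) can be sketched in base R as follows. The helper normalize_strings is hypothetical and is not the code used inside describe.

```r
# Hypothetical sketch of the string normalization described for listunique
# (not Hmisc's own code): collapse white space, trim, ignore case, tabulate.
normalize_strings <- function(x) {
  x <- gsub("[[:space:]]+", " ", x)  # any run of white space -> one space
  x <- trimws(x)                     # drop leading/trailing white space
  tolower(x)                         # ignore case
}

s <- c("  Chest   pain", "chest pain ", "Headache", "CHEST\tPAIN")
tab <- table(normalize_strings(s))
tab[order(names(tab))]  # values in alphabetic order with their frequencies
```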
Details
If options(na.detail.response=TRUE)
has been set and na.action is "na.delete" or
"na.keep", summary statistics on
the response variable are printed separately for missing and non-missing
values of each predictor. The default summary function returns
the number of non-missing response values and the mean of the last
column of the response values, with a names attribute of c("N","Mean").
When the response is a Surv object and the mean is used, this will
result in the crude proportion of events being used to summarize
the response. The actual summary function can be designated through
options(na.fun.response = "function name").
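A summary function of the kind described above might look like the following. This is a sketch under assumptions: response_summary is a hypothetical name, the two-column matrix merely stands in for a Surv object, and N is counted on the last column only.

```r
# Hypothetical summary function in the spirit of the default described above
# (not Hmisc's own code): N non-missing and mean of the last response column.
response_summary <- function(y) {
  y <- as.matrix(y)
  last <- y[, ncol(y)]
  ok <- !is.na(last)
  c(N = sum(ok), Mean = mean(last[ok]))
}

# A two-column (time, event) response, standing in for a Surv object:
y <- cbind(time = c(5, 8, 12, 3), event = c(1, 0, 1, NA))
response_summary(y)  # Mean is the crude proportion of events
```

A function like this could then be named in options(na.fun.response = "response_summary"), following the mechanism the text describes.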
If you are modifying LaTeX parskip or certain other parameters,
you may need to shrink the area around tabular and
verbatim environments produced by latex.describe. You can
do this using for example
\usepackage{etoolbox}\makeatletter\preto{\@verbatim}{\topsep=-1.4pt
\partopsep=0pt}\preto{\@tabular}{\parskip=2pt
\parsep=0pt}\makeatother in the LaTeX preamble.
Value
a list containing elements descript, counts,
values. The list is of class describe. If the input
object was a matrix or a data
frame, the list is a list of lists, one list for each variable
analyzed. latex returns a standard latex object. For numeric
variables having at least 20 unique values, an additional component
intervalFreq is included. This component is a list with two elements, range
(containing two values) and count, a vector of 100 integer frequency
counts.
Examples
set.seed(1)
describe(runif(200),dig=2) #single variable, continuous
#get quantiles .05,.10,...
dfr <- data.frame(x=rnorm(400),y=sample(c('male','female'),400,TRUE))
describe(dfr)
## Not run:
d <- sas.get(".","mydata",special.miss=TRUE,recode=TRUE)
describe(d) #describe entire data frame
attach(d, 1)
describe(relig) #Has special missing values .D .F .M .R .T
#attr(relig,"label") is "Religious preference"
#relig : Religious preference Format:relig
# n missing D F M R T unique
# 4038 263 45 33 7 2 1 8
#
#0:none (251, 6%), 1:Jewish (372, 9%), 2:Catholic (1230, 30%)
#3:Jehovah's Witnes (25, 1%), 4:Christ Scientist (7, 0%)
#5:Seventh Day Adv (17, 0%), 6:Protestant (2025, 50%), 7:other (111, 3%)
# Method for describing part of a data frame:
describe(death.time ~ age*sex + rcs(blood.pressure))
describe(~ age+sex)
describe(~ age+sex, weights=freqs) # weighted analysis
fit <- lrm(y ~ age*sex + log(height))
describe(formula(fit))
describe(y ~ age*sex, na.action=na.delete)
# report on number deleted for each variable
options(na.detail.response=TRUE)
# keep missings separately for each x, report on dist of y by x=NA
describe(y ~ age*sex)
options(na.fun.response="quantile")
describe(y ~ age*sex) # same but use quantiles of y by x=NA
d <- describe(my.data.frame)
d$age # print description for just age
d[c('age','sex')] # print description for two variables
d[sort(names(d))] # print in alphabetic order by var. names
d2 <- d[20:30] # keep variables 20-30
page(d2) # pop-up window for these variables
# Test date/time formats and suppression of times when they don't vary
library(chron)
d <- data.frame(a=chron((1:20)+.1),
b=chron((1:20)+(1:20)/100),
d=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
hour=rep(11,20),min=rep(17,20),sec=rep(11,20)),
f=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
hour=1:20,min=1:20,sec=1:20),
g=ISOdate(year=2001:2020,month=rep(3,20),day=1:20))
describe(d)
# Make a function to run describe, latex.describe, and use the kdvi
# previewer in Linux to view the result and easily make a pdf file
ldesc <- function(data) {
options(xdvicmd='kdvi')
d <- describe(data, desc=deparse(substitute(data)))
dvi(latex(d, file='/tmp/z.tex'), nomargins=FALSE, width=8.5, height=11)
}
ldesc(d)
## End(Not run)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
> library(Hmisc)
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2
Attaching package: 'Hmisc'
The following objects are masked from 'package:base':
format.pval, round.POSIXt, trunc.POSIXt, units
> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/Hmisc/describe.Rd_%03d_medium.png", width=480, height=480)
> ### Name: describe
> ### Title: Concise Statistical Description of a Vector, Matrix, Data Frame,
> ### or Formula
> ### Aliases: describe describe.default describe.vector describe.matrix
> ### describe.formula describe.data.frame print.describe
> ### print.describe.single [.describe latex.describe latex.describe.single
> ### Keywords: interface nonparametric category distribution robust models
>
> ### ** Examples
>
> set.seed(1)
> describe(runif(200),dig=2) #single variable, continuous
runif(200)
n missing unique Info Mean .05 .10 .25 .50 .75
200 0 200 1 0.52 0.084 0.142 0.294 0.505 0.742
.90 .95
0.881 0.927
lowest : 0.013 0.013 0.023 0.036 0.059, highest: 0.976 0.985 0.992 0.992 0.993
> #get quantiles .05,.10,...
>
> dfr <- data.frame(x=rnorm(400),y=sample(c('male','female'),400,TRUE))
> describe(dfr)
dfr
2 Variables 400 Observations
--------------------------------------------------------------------------------
x
n missing unique Info Mean .05 .10 .25
400 0 400 1 0.001083 -1.64182 -1.32308 -0.64280
.50 .75 .90 .95
-0.05831 0.67754 1.35234 1.72182
lowest : -3.008 -2.889 -2.592 -2.403 -2.343
highest: 2.351 2.447 2.498 2.649 3.810
--------------------------------------------------------------------------------
y
n missing unique
400 0 2
female (187, 47%), male (213, 53%)
--------------------------------------------------------------------------------
>
> ## Not run:
> ##D d <- sas.get(".","mydata",special.miss=TRUE,recode=TRUE)
> ##D describe(d) #describe entire data frame
> ##D attach(d, 1)
> ##D describe(relig) #Has special missing values .D .F .M .R .T
> ##D #attr(relig,"label") is "Religious preference"
> ##D
> ##D #relig : Religious preference Format:relig
> ##D # n missing D F M R T unique
> ##D # 4038 263 45 33 7 2 1 8
> ##D #
> ##D #0:none (251, 6%), 1:Jewish (372, 9%), 2:Catholic (1230, 30%)
> ##D #3:Jehovah's Witnes (25, 1%), 4:Christ Scientist (7, 0%)
> ##D #5:Seventh Day Adv (17, 0%), 6:Protestant (2025, 50%), 7:other (111, 3%)
> ##D
> ##D
> ##D # Method for describing part of a data frame:
> ##D describe(death.time ~ age*sex + rcs(blood.pressure))
> ##D describe(~ age+sex)
> ##D describe(~ age+sex, weights=freqs) # weighted analysis
> ##D
> ##D fit <- lrm(y ~ age*sex + log(height))
> ##D describe(formula(fit))
> ##D describe(y ~ age*sex, na.action=na.delete)
> ##D # report on number deleted for each variable
> ##D options(na.detail.response=TRUE)
> ##D # keep missings separately for each x, report on dist of y by x=NA
> ##D describe(y ~ age*sex)
> ##D options(na.fun.response="quantile")
> ##D describe(y ~ age*sex) # same but use quantiles of y by x=NA
> ##D
> ##D d <- describe(my.data.frame)
> ##D d$age # print description for just age
> ##D d[c('age','sex')] # print description for two variables
> ##D d[sort(names(d))] # print in alphabetic order by var. names
> ##D d2 <- d[20:30] # keep variables 20-30
> ##D page(d2) # pop-up window for these variables
> ##D
> ##D # Test date/time formats and suppression of times when they don't vary
> ##D library(chron)
> ##D d <- data.frame(a=chron((1:20)+.1),
> ##D b=chron((1:20)+(1:20)/100),
> ##D d=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
> ##D hour=rep(11,20),min=rep(17,20),sec=rep(11,20)),
> ##D f=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
> ##D hour=1:20,min=1:20,sec=1:20),
> ##D g=ISOdate(year=2001:2020,month=rep(3,20),day=1:20))
> ##D describe(d)
> ##D
> ##D # Make a function to run describe, latex.describe, and use the kdvi
> ##D # previewer in Linux to view the result and easily make a pdf file
> ##D
> ##D ldesc <- function(data) {
> ##D options(xdvicmd='kdvi')
> ##D d <- describe(data, desc=deparse(substitute(data)))
> ##D dvi(latex(d, file='/tmp/z.tex'), nomargins=FALSE, width=8.5, height=11)
> ##D }
> ##D
> ##D ldesc(d)
> ## End(Not run)
>
>
>
>
>
> dev.off()
null device
1
>