Last data update: 2014.03.03

Histogram    R Documentation

Histogram

Description

Abbreviation: hs

Using the standard R function hist, plots a frequency histogram with default colors, including background color and grid lines, plus an option for a relative frequency and/or cumulative histogram. Also provides summary statistics and a table of the bins, midpoints, counts, proportions, cumulative counts and cumulative proportions. Bins can be selected in several ways besides the default, including specifying just the bin width and/or the bin start. When the specified bins do not contain all of the data, improved error diagnostics tell the user how to correct the problem.

If the object to analyze is a set of multiple variables, including an entire data frame, then each numeric variable in the data frame is analyzed and the results are written to a pdf file in the current working directory. The name and path of each output pdf file that contains a histogram are listed in the output.

When output is assigned to an object, such as h in h <- hs(Y), the pieces of output can be accessed for later analysis. A primary use is dynamic report generation with knitr: the Rmd option generates an R markdown file in which interpretative R output is embedded in the document. See Value below.

Usage

Histogram(x=NULL, data=mydata, n.cat=getOption("n.cat"), Rmd=NULL,

         color.fill=getOption("color.fill.bar"), 
         color.stroke=getOption("color.stroke.bar"),
         color.bg=getOption("color.bg"),
         color.grid=getOption("color.grid"),
         color.box=getOption("color.box"),

         color.reg="snow2", over.grid=FALSE,
         cex.axis=0.75, color.axis="gray30",

         rotate.values=0, offset=0.5,

         breaks="Sturges", bin.start=NULL, bin.width=NULL, bin.end=NULL,

         prop=FALSE, cumul=c("off", "on", "both"), hist.counts=FALSE, 
         digits.d=NULL, xlab=NULL, ylab=NULL, main=NULL, sub=NULL,

         quiet=getOption("quiet"),
         pdf.file=NULL, pdf.width=5, pdf.height=5,
         fun.call=NULL, ...)

hs(...)

Arguments

x

Variable(s) to analyze. Can be a single numerical variable, either within a data frame or as a vector in the user's workspace, or multiple variables in a data frame such as designated with the c function, or an entire data frame. If not specified, then defaults to all numerical variables in the specified data frame, mydata by default.

data

Optional data frame that contains the variable(s) of interest, default is mydata.

n.cat

For the analysis of multiple variables, such as a data frame, specifies the largest number of unique values of a numeric variable for which the variable is analyzed as categorical. Default is 0.

Rmd

File name for the file of R markdown to be written, if specified. The file type is .Rmd, which automatically opens in RStudio, but it is a simple text file that can be edited with any text editor, including RStudio.

color.fill

Color of the bars. To set transparency level, use lessR function theme or use rgb function directly.

color.stroke

Color of the border of the bars. To set transparency level, use function theme or use rgb function directly.

color.bg

Color of the plot background.

color.grid

Color of the grid lines.

color.box

Color of border around the plot background, the box, that encloses the plot, with a default of "black".

color.reg

The color of the superimposed, regular histogram when cumul="both".

over.grid

If TRUE, plot the grid lines over the histogram.

cex.axis

Scale magnification factor, which by default displays the axis values smaller than the axis labels. Provides the functionality of, and can be replaced by, the standard R cex.axis.

color.axis

Color of the font used to label the axis values.

rotate.values

Degrees that the axis values are rotated, usually to accommodate longer values, typically used in conjunction with offset.

offset

The amount of spacing between the axis values and the axis. Default is 0.5. Larger values such as 1.0 are used to create space for the label when longer axis value names are rotated.

breaks

The method for calculating the bins, or an explicit specification of the bins, such as with the standard R seq function or other options provided by the hist function.

bin.start

Optional specified starting value of the bins.

bin.width

Optional specified bin width, which can be specified with or without a bin.start value.

bin.end

Optional specified value that is within the last bin, so the actual endpoint of the last bin may be larger than the specified value.

prop

Specify proportions or relative frequencies on the vertical axis. Default is FALSE.

hist.counts

Replaces the standard R labels option, which has multiple definitions in R. If TRUE, displays the count of each bin.

cumul

Specify a cumulative histogram. The value of "on" displays the cumulative histogram, with default of "off". The value of "both" superimposes the regular histogram.

digits.d

Number of significant digits for each of the displayed summary statistics.

xlab

Label for x-axis. Defaults to variable name.

ylab

Label for y-axis. Defaults to Frequency or Proportion.

main

Title of graph.

sub

Sub-title of graph, below xlab.

quiet

If set to TRUE, no text output. Can change system default with theme function.

pdf.file

Name of the pdf file to which graphics are redirected. If there is no filetype of .pdf, the filetype is added to the name.

pdf.width

Width of the pdf file in inches.

pdf.height

Height of the pdf file in inches.

fun.call

Function call. Used with knitr to pass the function call when obtained from the abbreviated function call hs.

...

Other parameter values for graphics as defined and processed by hist and par, including xlim, ylim, lwd and cex.lab, col.main, col.lab, sub, col.sub, density, etc. Also includes col.ticks to specify the color of the tick marks and srt to rotate the axis value labels.

Details

OVERVIEW
Results are based on the standard R hist function to calculate and plot a histogram, plus the additional provided color capabilities, a relative frequency histogram and summary statistics. However, a histogram with densities is not supported. The freq option from the standard R hist function has no effect as it is always set to FALSE in each internal call to hist. To plot densities, which correspond to setting freq to FALSE, use the lessR function Density.

DATA
The data may be a vector from the global environment, the user's workspace, as illustrated in the examples below, or one or more variables in a data frame, or a complete data frame. The default input data frame is mydata. Specify a different source data frame with the data option. If multiple variables are specified, only the numerical variables in the list of variables are analyzed. The variables in the data frame are referenced directly by their names, that is, there is no need to invoke the standard R mechanisms of the mydata$name notation, the with function or the attach function. If the name of the vector in the global environment and of a variable in the input data frame are the same, the vector is analyzed.

To obtain a histogram of each numerical variable in the mydata data frame, use Histogram(). Or, for a data frame with a different name, insert the name between the parentheses. To analyze a subset of the variables in a data frame, specify the list with either a : or the c function, such as m01:m03 or c(m01,m02,m03).
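
The selection rule described above can be approximated in base R. The following is only a sketch, not the internal lessR code: it assumes a small data frame named mydata, and it uses a hypothetical threshold n_cat as a stand-in for the n.cat option, treating a numeric variable as categorical when its number of unique values does not exceed that threshold.

```r
# Sketch in base R (not lessR internals): which columns would get a histogram
set.seed(1)
mydata <- data.frame(X = rnorm(50), Y = rnorm(50),
                     G = rep(1:2, 25),            # numeric, only 2 unique values
                     C = rep(c("A", "B"), 25))    # non-numeric

# hypothetical stand-in for n.cat; 0 means no numeric variable
# is treated as categorical
n_cat <- 0

is_num   <- vapply(mydata, is.numeric, logical(1))
n_unique <- vapply(mydata, function(v) length(unique(v)), integer(1))

# keep numeric variables whose number of unique values exceeds the threshold
to_plot <- names(mydata)[is_num & n_unique > n_cat]
to_plot
```

Raising n_cat to 2 in this sketch would drop G from the list, because its two unique values would then mark it as categorical.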

COLORS
Individual colors in the plot can be manipulated with options such as color.fill for the color of the histogram bars. A color theme for all the colors can be chosen with the colors option of the lessR function theme. The default color theme is dodgerblue, but a gray scale is available with "gray", and other themes are available as explained in theme, such as "red" and "green". Use the option ghost=TRUE for a black background, no grid lines and partial transparency of plotted colors.

For the color options, such as color.grid, the value of "off" is the same as "transparent".

VARIABLE LABELS
If variable labels exist, then the corresponding variable label is by default listed as the label for the horizontal axis and on the text output. For more information, see Read.

ONLY VARIABLES ARE REFERENCED
The referenced variable in a lessR function can only be a variable name (or list of variable names). This referenced variable must exist in either the referenced data frame, such as the default mydata, or in the user's workspace, more formally called the global environment. That is, expressions cannot be directly evaluated. For example:

    > Histogram(rnorm(50))   # does NOT work

Instead, do the following:

    > Y <- rnorm(50)   # create vector Y in user workspace
    > Histogram(Y)     # directly reference Y

ERROR DETECTION
A relatively common error that beginning users of the base R hist function encounter is to manually specify a sequence of bins with the seq function that does not fully span the range of the data values. The result is a rather cryptic error message and program termination. Histogram detects this problem before attempting to generate the histogram with hist, and then informs the user of the problem with a more detailed and explanatory error message. Moreover, the entire range of bins need not be specified to customize the bins. Instead, just a bin width, bin.width, and/or a value that begins the first bin, bin.start, need be specified. If a starting value is specified without a bin width, the default Sturges method provides the bin width.
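
The situation described above can be reproduced with base R alone. The sketch below uses only hist and seq. The names start and width are illustrative stand-ins for bin.start and bin.width, and the way the breaks are built from them is an assumption about the general idea, not the actual lessR code.

```r
set.seed(1)
y <- round(rnorm(50), 3)

# breaks that do not span the data: base R hist stops with an error,
# captured here as a character string instead of halting the script
bad <- tryCatch(hist(y, breaks = seq(0, 1, 0.25), plot = FALSE),
                error = function(e) conditionMessage(e))
bad

# build breaks that span the data from only a starting value and a width
start  <- floor(min(y))    # illustrative stand-in for bin.start
width  <- 0.25             # illustrative stand-in for bin.width
n_bins <- ceiling((max(y) - start) / width)
brks   <- seq(start, by = width, length.out = n_bins + 1)

h <- hist(y, breaks = brks, plot = FALSE)
sum(h$counts) == length(y)   # TRUE when every value falls in some bin
```

Because the last break is computed from the data maximum, the sequence always reaches the largest value, which is the condition that a hand-written seq call can easily miss.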

PDF OUTPUT
Because of the customized graphic windowing system that maintains a unique graphic window for the Help function, the standard graphic output functions such as pdf do not work with the lessR graphics functions. Instead, to obtain pdf output, use the pdf.file option, perhaps with the optional pdf.width and pdf.height options. These files are written to the default working directory, which can be explicitly specified with the R setwd function.

Value

The output can optionally be saved into an R object, otherwise it simply appears in the console. Redesigned in lessR version 3.3 to provide two different types of components: the pieces of readable output, and a variety of statistics. The readable output consists of character strings, such as tables, amenable for reading. The statistics are numerical values amenable for further analysis. The motivation for these types of output is to facilitate R Markdown documents, as the name of each piece, preceded by the name of the saved object and a $, can be inserted into the R Markdown document (see examples).

READABLE OUTPUT
out_ss: Summary statistics
out_freq: Frequency distribution
out_outliers: Outlier analysis
out_file: Name and location of optional Rmd file

STATISTICS
bin_width: Bin width
n_bins: Number of bins
breaks: Breaks of the bins
mids: Bin midpoints
counts: Bin counts
prop: Bin proportion
counts_cumul: Bin cumulative counts
prop_cumul: Bin cumulative proportion

Although not typically needed, if the output is assigned to an object named, for example, h, then the contents of the object can be viewed directly with the unclass function, here as unclass(h).
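
To see what the statistics listed above represent, a table with the same kinds of columns can be assembled from the base R hist function. This is a sketch that assumes the components follow the usual definitions. It does not call lessR, and the names b and freq_table are illustrative only.

```r
set.seed(1)
y <- round(rnorm(50), 3)

b <- hist(y, plot = FALSE)     # default Sturges bins, no plot drawn

n_breaks <- length(b$breaks)
freq_table <- data.frame(
  bin_low     = b$breaks[-n_breaks],            # lower edge of each bin
  bin_high    = b$breaks[-1],                   # upper edge of each bin
  mid         = b$mids,                         # bin midpoints
  count       = b$counts,                       # bin counts
  prop        = b$counts / sum(b$counts),       # bin proportions
  count_cumul = cumsum(b$counts),               # cumulative counts
  prop_cumul  = cumsum(b$counts) / sum(b$counts)  # cumulative proportions
)
freq_table

bin_width <- diff(b$breaks)[1]   # width of the first bin (bins are equal width here)
n_bins    <- length(b$counts)    # number of bins
```

The last row of count_cumul equals the number of data values and the last row of prop_cumul equals 1, which is a quick check that the bins contain all of the data.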

Author(s)

David W. Gerbing (Portland State University; gerbing@pdx.edu)

See Also

hist, plot, par, theme.

Examples

# generate 50 random normal data values with three decimal digits
y <- round(rnorm(50),3)


# --------------------
# different histograms
# --------------------

# histogram with all defaults
Histogram(y)
# short form
hs(y)
# compare to standard R function hist
hist(y)

# output saved for later analysis into object h
h <- hs(y)
# view full text output
h
# view just the outlier analysis
h$out_outliers
# list the names of all the components
names(h)

# histogram with no borders for the bars
Histogram(y, color.stroke="off")

# save the histogram to a pdf file
Histogram(y, pdf.file="MyHistogram.pdf")

# histogram with no grid, red bars, black background, and black border
Histogram(y, color.grid="off", color.bg="black",
          color.fill="red", color.stroke="black")
# or set this color scheme for all subsequent analyses
theme(colors="red", color.grid="off", color.bg="black", color.stroke.bar="black")
Histogram(y)

# histogram with orange color theme, transparent orange bars, no grid lines
theme(colors="orange", ghost=TRUE)
Histogram(y)
# back to default of "blue" color theme
theme(colors="blue")

# histogram with specified bin width
# can also use bin.start
Histogram(y, bin.width=.25)

# histogram with rotated axis values, offset more from axis
# suppress text output
Histogram(y, rotate.values=45, offset=1, quiet=TRUE)

# histogram with specified bins and grid lines displayed over the histogram
Histogram(y, breaks=seq(-5,5,.25), xlab="My Variable", over.grid=TRUE)

# histogram with bins calculated with the Scott method and values displayed
Histogram(y, breaks="Scott", hist.counts=TRUE, quiet=TRUE)

# histogram with the number of suggested bins, with proportions
Histogram(y, breaks=15, prop=TRUE)

# histogram with specified colors, overriding defaults
# color.bg and color.grid are defined in Histogram
# all other parameters are defined in hist, par and plot functions
# generates caution messages that can be ignored regarding density and angle
#Histogram(y, color.fill="darkblue", color.stroke="lightsteelblue4", color.bg="ivory",
#  color.grid="darkgray", density=25, angle=-45, cex.lab=.8, cex.axis=.8,
#  col.lab="sienna3", main="My Title", col.main="gray40", xlim=c(-5,5), lwd=2,
#  xlab="My Favorite Variable")

# ---------------------
# cumulative histograms
# ---------------------

# cumulative histogram with superimposed regular histogram, all defaults
Histogram(y, cumul="both")

# cumulative histogram plus regular histogram
# present with proportions on vertical axis, override other defaults
Histogram(y, cumul="both", breaks=seq(-4,4,.25), prop=TRUE, 
  color.reg="mistyrose")


# -------------------------------------------------
# histograms for data frames and multiple variables
# -------------------------------------------------

# create data frame, mydata, to mimic reading data with Read function
# mydata contains both numeric and non-numeric data
mydata <- data.frame(rnorm(50), rnorm(50), rnorm(50), rep(c("A","B"),25))
names(mydata) <- c("X","Y","Z","C")

# although data not attached, access the variable directly by its name
Histogram(X)

# histograms for all numeric variables in data frame called mydata
#  except for numeric variables with unique values < n.cat
# mydata is the default name, so does not need to be specified with data
Histogram()

# variable of interest is in a data frame which is not the default mydata
# access the breaks variable in the R provided warpbreaks data set
# although data not attached, access the variable directly by its name
Histogram(breaks, data=warpbreaks)

# all histograms with specified options, including red axis labels
Histogram(color.fill="palegreen1", color.bg="ivory", hist.counts=TRUE, col.lab="red")

# histograms for all specified numeric variables
# use the combine or c function to specify a list of variables
Histogram(c(X,Y))

Results