Suppose that a small sample of i.i.d. random variables from a negative binomial distribution is given, and that unbiased estimates of the mean and the raw variance have been obtained. A new bias is then introduced when the squared coefficient of variation (SCV, a.k.a. dispersion) is calculated from these unbiased estimates by dividing the raw variance by the square of the mean. This bias can be calculated by numerical simulation, and a pre-calculated adjustment table (or rather a fit through tabulated values) is supplied with the package. The present function uses this fit to remove the bias from a raw SCV estimate.
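The origin of this bias can be illustrated with a small simulation. The following is only a sketch in base R, not the package's own adjustment procedure: it assumes that "raw variance" means the sample variance minus the mean (the Poisson part of the variance), and the parameter values (true_disp, mu, n_rep, n_sim) are chosen arbitrarily for illustration.

```r
set.seed(1)

true_disp <- 0.1    # true SCV (dispersion) of the simulated distribution
mu        <- 100    # true mean
n_rep     <- 3      # small sample size, e.g. three replicates
n_sim     <- 10000  # number of simulated samples

raw_scv <- replicate(n_sim, {
  # For the negative binomial, Var = mu + disp * mu^2, so size = 1 / disp
  x <- rnbinom(n_rep, mu = mu, size = 1 / true_disp)
  m <- mean(x)          # unbiased estimate of the mean
  raw_var <- var(x) - m # assumed definition: variance minus Poisson part
  raw_var / m^2         # raw SCV: a ratio of two unbiased estimates
})

# Comparing the average raw SCV with the true value shows the bias
c(true = true_disp, mean_raw_scv = mean(raw_scv))
```

Repeating the simulation for different values of n_rep gives an impression of how the discrepancy depends on the sample size, which is the kind of information a tabulated adjustment has to capture.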
The conditions vector is a factor that assigns to each column of the count data a condition (or treatment, or phenotype, or the like). This information is stored in the CountDataSet's "phenoData" slot as a column named "condition".
The counts slot holds the count data as a matrix of non-negative integer count values, one row for each observational unit (gene or the like), and one column for each sample.
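The layout of the count data and of a matching conditions factor can be sketched in base R as follows. This is a plain matrix and factor, not a CountDataSet; the gene names, sample names and count values are made up for illustration.

```r
# One row per observational unit (gene), one column per sample
counts <- matrix(
  c(12L, 0L, 7L,   15L, 3L, 9L,   20L, 1L, 4L,   18L, 2L, 6L),
  nrow = 3,
  dimnames = list(c("gene1", "gene2", "gene3"),
                  c("A1", "A2", "B1", "B2"))
)

# One condition label per column of the count matrix
conditions <- factor(c("A", "A", "B", "B"))

# The factor must have exactly one entry per sample
stopifnot(length(conditions) == ncol(counts))

# Which samples belong to which condition
samples_by_condition <- split(colnames(counts), conditions)
samples_by_condition
```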
The dispersion table ("dispTable") is a named vector that assigns to each condition (as name) a dispersion column (as value). If nbinomTest is called to compare two conditions, say "A" and "B", DESeq looks up in the dispTable which dispersion columns to use. In the standard case (see example), these are just the dispersions for "A" and "B", i.e., the columns disp_A and disp_B in fData(object). If the "pooled" or "blind" variance estimation is used, all conditions are assigned the same column.
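The lookup can be pictured with an ordinary named character vector. The sketch below mimics the idea in base R and does not call DESeq; the helper lookup_disp_columns is hypothetical, and the column name "disp_pooled" is an assumption about how a shared column might be called.

```r
# Per-condition estimation: each condition has its own dispersion column
disp_table_per_condition <- c(A = "disp_A", B = "disp_B")

# Pooled estimation: all conditions point to the same column
# (the name "disp_pooled" is assumed here for illustration)
disp_table_pooled <- c(A = "disp_pooled", B = "disp_pooled")

# Hypothetical helper: which columns would a comparison of two conditions use?
lookup_disp_columns <- function(disp_table, cond_a, cond_b) {
  unname(disp_table[c(cond_a, cond_b)])
}

lookup_disp_columns(disp_table_per_condition, "A", "B")
lookup_disp_columns(disp_table_pooled, "A", "B")
```

In the pooled case both lookups return the same column name, so the same dispersion values are used for both conditions.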
This function obtains dispersion estimates for a count data set. For each condition (or collectively for all conditions, see 'method' argument below), it proceeds in three steps. First, it computes an empirical dispersion value (a.k.a. a raw SCV value) for each gene. Then, it fits a dispersion-mean relationship by regression. Finally, for each gene it chooses, from the empirical and the fitted value, the dispersion parameter that will be used in subsequent tests; this choice is governed by the 'sharingMode' argument.
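The last two steps can be sketched in base R. This is a simplified illustration, not the package's implementation: the per-gene means and empirical dispersions are made-up numbers, an ordinary least-squares fit of dispersion against 1/mean stands in for the package's regression, and the helper choose_disp is hypothetical. The three mode names are assumed to be "maximum", "fit-only" and "gene-est-only".

```r
# Made-up per-gene mean expression and empirical (raw) dispersion values
base_mean <- c(10, 50, 100, 500, 1000)
emp_disp  <- c(0.35, 0.10, 0.12, 0.04, 0.06)

# Stand-in for the dispersion-mean regression: dispersion ~ a0 + a1 / mean
fit <- lm(emp_disp ~ I(1 / base_mean))
fitted_disp <- unname(fitted(fit))

# Hypothetical helper: choose the per-gene dispersion according to the mode
choose_disp <- function(empirical, fitted,
                        sharing_mode = c("maximum", "fit-only", "gene-est-only")) {
  sharing_mode <- match.arg(sharing_mode)  # defaults to the first mode listed
  switch(sharing_mode,
         "maximum"       = pmax(empirical, fitted),  # the larger of the two
         "fit-only"      = fitted,                   # ignore per-gene value
         "gene-est-only" = empirical)                # ignore the fit
}

data.frame(base_mean, emp_disp, fitted_disp,
           chosen = choose_disp(emp_disp, fitted_disp, "maximum"))
```

Taking the maximum is the conservative option in this sketch: a gene never receives a dispersion below its own empirical estimate or below the fitted trend.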
Given a matrix or data frame of count data, this function estimates the size factors as follows: Each column is divided by the geometric means of the rows. The median (or, if requested, another location estimator) of these ratios (skipping the genes with a geometric mean of zero) is used as the size factor for this column.
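The procedure described above can be written in a few lines of base R. This is a minimal sketch rather than the package's code; the function name size_factors, its locfunc parameter and the example counts are chosen for illustration.

```r
# Median-of-ratios size factors, computed on the log scale
size_factors <- function(counts, locfunc = median) {
  counts <- as.matrix(counts)
  # Log of the geometric mean of each row; -Inf if any count in the row is 0
  log_geo_means <- rowMeans(log(counts))
  # Skip genes whose geometric mean is zero
  keep <- is.finite(log_geo_means)
  apply(counts, 2, function(col) {
    # Log ratio of each count to its row's geometric mean, then back-transform
    exp(locfunc((log(col) - log_geo_means)[keep]))
  })
}

# Sample s2 is sequenced twice as deeply as s1; the last gene has a zero count
counts <- cbind(s1 = c(10, 20, 30, 0),
                s2 = c(20, 40, 60, 5))

sf <- size_factors(counts)
sf
```

In this example every retained gene has the ratio 1/sqrt(2) in s1 and sqrt(2) in s2, so the size factor of s2 is twice that of s1, and the gene with a zero count does not influence the result.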
After calling estimateDispersions, a CountDataSet object is populated with one or (in the case of a "per-condition" estimation) several fitInfo objects, which can be accessed with this function.