messinaDE    R Documentation

Detect differential expression in the presence of outliers


Description

Run the Messina algorithm to find differentially expressed features (e.g. genes) in the presence of outliers.


Usage

messinaDE(x, y, max_misattribution_rate, f_train = 0.9, n_boot = 50,
  seed = NULL, progress = TRUE, silent = FALSE)



Arguments

max_misattribution_rate: the maximum allowable sample misattribution rate, in [0, 0.5). Increasing this value increases the algorithm's resistance to outliers, at the cost of somewhat reduced sensitivity. Note that for very low values (0.05 or less, equivalent to requiring a sensitivity and specificity of at least 0.95), a conventional statistical approach to identifying differential expression (e.g. a t-test) will likely be more powerful than Messina. See Details and the vignette for more information on selecting this parameter.


x: feature expression values, either supplied as an ExpressionSet, or as an object that can be converted to a matrix by as.matrix. In the latter case, features should be in rows and samples in columns, with feature names taken from the rows of the object.


y: a binary vector (TRUE/FALSE or 1/0) of class membership information for each sample in x.


f_train: the fraction of samples to be used in the training splits of the bootstrap rounds.


n_boot: the number of bootstrap rounds to use.


seed: an optional random seed for the analysis. If NULL, a random seed derived from the current state of the PRNG is used.


progress: display a progress bar tracking the computation?


silent: be completely silent (except for error and warning messages)?
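
As an illustration of the layout expected for x and y, the following sketch builds both from a small made-up table. The gene names, sample names, and values are hypothetical and serve only to show the orientation described above (features in rows, samples in columns, one class label per sample).

```r
## Hypothetical expression table with samples in rows and features in
## columns, i.e. the transpose of the layout messinaDE expects.
expr_df = data.frame(
  geneA = c(5.1, 4.9, 8.2, 7.9),
  geneB = c(3.0, 3.2, 3.1, 2.9),
  row.names = c("s1", "s2", "s3", "s4")
)

## Transpose so that features are in rows and samples in columns.
## The row names of x then carry the feature names.
x = t(as.matrix(expr_df))

## One class label per sample (column of x); TRUE marks the class of interest.
sample_class = c("normal", "normal", "tumor", "tumor")
y = sample_class == "tumor"

## Basic consistency checks before passing x and y to messinaDE.
stopifnot(ncol(x) == length(y), !is.null(rownames(x)))
dim(x)
```

With x and y in this shape, the call takes the form shown in the usage above, for example messinaDE(x, y, max_misattribution_rate = 0.2).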


Details

The Messina classification algorithm (see the main page at messina) can be adapted to identify differentially expressed features in a two-class setting, with tunable resistance to outliers. This convenience function simplifies the setting of parameters for this task.

Outlier differential expression

Outliers in differential expression measurements are common in many experimental contexts. They may be due to experimental errors, sample misidentification, or the presence of unknown structure (e.g. disease subtypes) in what was supposed to be a homogeneous sample group. The latter two causes are particularly troublesome in clinical samples, where diagnoses can be incorrect, samples impure, and subtypes common. These outliers inflate within-group variance estimates, which reduces the power to detect differential expression. Messina provides a principled approach to detecting differential expression in datasets containing at most a specified level of outlier samples.
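
The effect of outliers on a conventional test can be seen in a small base R sketch. The expression values are made up for illustration (deterministic normal quantiles around assumed class means of 5 and 8), and the Welch t-test stands in for any variance-based test; nothing here uses Messina itself.

```r
## Hypothetical expression of one feature in two classes of 20 samples,
## built from normal quantiles so the example is fully deterministic.
n = 20
group_a = 5 + qnorm(ppoints(n))
group_b = 8 + qnorm(ppoints(n))

## Contaminate group B: three samples show group-A-like signal
## (e.g. misidentified or impure samples).
group_b_outl = group_b
group_b_outl[1:3] = c(4.5, 5.0, 5.5)

## Within-group variance before and after contamination.
var_clean = var(group_b)
var_outl = var(group_b_outl)

## Welch t statistics for the clean and contaminated comparisons.
t_clean = unname(t.test(group_a, group_b)$statistic)
t_outl = unname(t.test(group_a, group_b_outl)$statistic)

cat("Variance of group B (clean / with outliers):", var_clean, "/", var_outl, "\n")
cat("t statistic (clean / with outliers):", t_clean, "/", t_outl, "\n")
```

The contaminated group has a larger variance and a smaller absolute t statistic, which is the loss of power described above.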

Misattribution rate

In the Messina framework, for each feature, each of the two classes of samples is considered to have a typical signal level. Most samples in each class will display the level of signal that matches their class, but a small number will display a level of signal consistent with the wrong class. We call these samples, whose signal matches the wrong class, 'misattributed samples'. Messina can be tuned to ignore a given rate of sample misattribution when detecting differential expression, and can therefore be smoothly adjusted to deal with varying levels of outlier contamination in an experiment.
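
The idea of misattributed samples can be made concrete with a simplified base R sketch. The expression values and the fixed threshold below are hypothetical, and the check is only a picture of the concept; it is not Messina's actual fitting procedure, which selects thresholds by bootstrap.

```r
## Hypothetical expression values for one feature (not taken from real data).
normal = c(4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 4.7, 5.4, 8.1, 5.0)  # one tumour-like sample
tumour = c(8.2, 7.9, 8.4, 8.0, 5.1, 8.3, 7.8, 8.1, 5.2, 8.5)  # two normal-like samples

## An assumed threshold separating the two typical signal levels.
threshold = 6.5

## Samples on the wrong side of the threshold are 'misattributed'.
misattr_normal = mean(normal > threshold)
misattr_tumour = mean(tumour <= threshold)

## Taking tumour as the positive class, the matching performance figures are:
sensitivity = 1 - misattr_tumour
specificity = 1 - misattr_normal

## A feature is acceptable here if neither class exceeds the tolerated rate.
tolerated = function(max_rate) max(misattr_normal, misattr_tumour) <= max_rate
passes_lenient = tolerated(0.25)
passes_strict = tolerated(0.15)

cat("Misattribution (normal, tumour):", misattr_normal, misattr_tumour, "\n")
cat("Passes at 0.25:", passes_lenient, " Passes at 0.15:", passes_strict, "\n")
```

Raising the tolerated rate lets a feature with a few wrong-side samples still count as differentially expressed, which is the tuning described above.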

messinaDE assumes that the probability of an outlier sample is equal in each of the two classes. There are situations where this assumption is likely incorrect: for example, in a cancer vs normal comparison, the normal samples are likely to have much more consistent expression than the highly perturbed and variable cancer samples. In these cases, the user can call the worker function messina directly, with the min_sens and min_spec parameters set according to the expected outlier rate in each class. An example of how to calculate the required parameters is given in the vignette.
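
A minimal sketch of the unequal-rate case follows. It assumes that the simple relation used by messinaDE (a rate of 0.2 corresponds to minimum sensitivity and specificity of 0.8) carries over per class, that the TRUE class is the one sensitivity refers to, and that messina accepts x and y in the same form as messinaDE. The outlier rates and the simulated data are hypothetical; the vignette remains the reference for the exact calculation.

```r
## Assumed per-class outlier rates (hypothetical values).
outlier_rate_tumour = 0.30   # variable, heterogeneous class
outlier_rate_normal = 0.05   # consistent class

## Assumption: minimum = 1 - expected outlier rate, with tumour as the
## positive (TRUE) class, so sensitivity refers to tumours.
min_sens = 1 - outlier_rate_tumour
min_spec = 1 - outlier_rate_normal

## Small simulated dataset: 5 features in rows, 20 samples in columns.
set.seed(42)
y = rep(c(FALSE, TRUE), each = 10)
x = matrix(rnorm(5 * 20, mean = 5), nrow = 5,
           dimnames = list(paste0("feature", 1:5), paste0("s", 1:20)))
x[1, y] = x[1, y] + 4   # make feature1 differ between the classes

## Run the worker function only if the messina package is installed.
if (requireNamespace("messina", quietly = TRUE)) {
  fit_asym = messina::messina(x, y, min_sens = min_sens, min_spec = min_spec,
                              progress = FALSE, silent = TRUE)
}
```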


Author(s)

Mark Pinese


References

Pinese M, Scarlett CJ, Kench JG, et al. (2009) Messina: A Novel Analysis Tool to Identify Biologically Relevant Molecules in Disease. PLoS ONE 4(4): e5337. doi:10.1371/journal.pone.0005337

See Also






Examples

## Load some example data
## (apColonData is provided by the antiProfilesData package)
library(messina)
library(antiProfilesData)
data(apColonData)

x = exprs(apColonData)
y = pData(apColonData)$SubType

## Subset the data to only tumour and normal samples
sel = y %in% c("normal", "tumor")
x = x[,sel]
y = y[sel]

## Find differentially expressed probesets.  Allow a sample misattribution rate of
## at most 20%.
fit = messinaDE(x, y == "tumor", max_misattribution_rate = 0.2)

## Display the results.
fit
plot(fit)

> data(apColonData)
> x = exprs(apColonData)
> y = pData(apColonData)$SubType
> ## Subset the data to only tumour and normal samples
> sel = y %in% c("normal", "tumor")
> x = x[,sel]
> y = y[sel]
> ## Find differentially-expressed probesets.  Allow a sample misattribution rate of
> ## at most 20%.
> fit = messinaDE(x, y == "tumor", max_misattribution_rate = 0.2)
> ## Display the results.
> fit
An object of class MessinaClassResult

Problem type:classification
  An object of class MessinaParameters
  5339 features, 38 samples.
  Objective type: sensitivity/specificity.  Minimum sensitivity: 0.8  Minimum specificity: 0.8
  Minimum group fraction: 0
  Training fraction: 0.9
  Number of bootstraps: 50
  Random seed: 

Summary of results:
  An object of class MessinaFits
  610 / 5339 features passed performance requirements (11.43%)
  Top features:
            Passed Requirements Classifier Type Threshold Value Direction
206784_at                  TRUE       Threshold       10.689311        -1
207502_at                  TRUE       Threshold        6.835522        -1
206422_at                  TRUE       Threshold        8.839170        -1
209613_s_at                TRUE       Threshold        7.412960        -1
207003_at                  TRUE       Threshold        9.558430        -1
204719_at                  TRUE       Threshold        6.095796        -1
209735_at                  TRUE       Threshold        5.467084        -1
220834_at                  TRUE       Threshold       13.383229        -1
213921_at                  TRUE       Threshold        4.160415        -1
209612_s_at                TRUE       Threshold        7.685667        -1
206784_at   9.709529
207502_at   9.212592
206422_at   9.061072
209613_s_at 8.797458
207003_at   8.396076
204719_at   7.985464
209735_at   7.729767
220834_at   7.644725
213921_at   7.633863
209612_s_at 7.393653
> plot(fit)
