Last data update: 2014.03.03

R: Documenting missing data visualisation
missing-data    R Documentation

Documenting missing data visualisation

Description

There is a need for adequate handling of missing value imputation in quantitative proteomics. Before developing a framework to handle missing data imputation optimally, we propose a set of visualisation tools. This document serves as an internal notebook for current progress and ideas that will eventually materialise as exported functionality in the MSnbase package.

Details

To explore the structure of missing values, we propose to

1. Explore missing values in the frame of the experimental design. The imageNA2 function offers such a simple visualisation. It is currently limited to 2-group designs/comparisons. In case of time course experiments or sub-cellular fractionation along a density gradient, we propose to split the time/gradient into 2 groups (early/late, top/bottom) as a first approximation.

2. Explore the proportion of missing values in each group.

3. Explore the total and group-wise feature intensity distributions.
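The three steps above can be sketched in base R on a simulated intensity matrix. This is only an illustration under stated assumptions: the matrix x, the grouping factor grp, the 10% missing rate and the names prop_na and n_na are all hypothetical and are not part of MSnbase.

```r
## Simulated data: 100 features x 10 samples (hypothetical, for illustration)
set.seed(42)
x <- matrix(rlnorm(100 * 10, meanlog = 10), nrow = 100,
            dimnames = list(paste0("prot", 1:100), paste0("S", 1:10)))
x[sample(length(x), 100)] <- NA   # set 10% of the values to NA

## Step 1: split the samples into 2 groups (e.g. early/late or top/bottom)
grp <- factor(ifelse(seq_len(ncol(x)) <= 5, "A", "B"))
na_mat <- is.na(x)

## Step 2: proportion of missing values in each group
prop_na <- sapply(split(seq_len(ncol(x)), grp),
                  function(j) mean(na_mat[, j]))

## Number of missing values per feature, one column per group
n_na <- sapply(levels(grp),
               function(g) rowSums(na_mat[, grp == g, drop = FALSE]))

## Step 3: total and group-wise intensity distributions (NAs are dropped)
boxplot(list(All = log2(as.vector(x)),
             A = log2(as.vector(x[, grp == "A"])),
             B = log2(as.vector(x[, grp == "B"]))),
        ylab = "log2 intensity")
```

With an MSnSet, the same computations would presumably start from the expression matrix, as the Examples section does with is.na(nax).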

The existing plotNA function illustrates the completeness/missingness of the data.
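As a rough illustration of what completeness can mean here, the sketch below computes, in base R, the proportion of features that would be retained for each allowed number of missing values per feature. It is not the implementation of plotNA; the simulated matrix x and the names n_na, k and kept are assumptions made for this example.

```r
## Simulated data: 50 features x 8 samples with 60 missing values (hypothetical)
set.seed(1)
x <- matrix(rnorm(50 * 8), nrow = 50)
x[sample(length(x), 60)] <- NA

n_na <- rowSums(is.na(x))   # number of missing values per feature
k <- 0:ncol(x)              # allowed number of missing values per feature

## Proportion of features retained when at most k[i] missing values are allowed
kept <- sapply(k, function(i) mean(n_na <= i))

plot(k, kept, type = "b", ylim = c(0, 1),
     xlab = "Allowed missing values per feature",
     ylab = "Proportion of features retained")
grid()
```

The curve is non-decreasing and reaches 1 once every feature is allowed, which gives a quick view of how much data a given missingness threshold would discard.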

Author(s)

Laurent Gatto <lg390@cam.ac.uk>, Samuel Wieczorek and Thomas Burger

See Also

plotNA, imageNA2.

Examples

## Other suggestions
library("pRolocdata")
library("pRoloc")
data(dunkley2006)
set.seed(1)
nax <- makeNaData(dunkley2006, pNA = 0.10)
pcol <- factor(ifelse(dunkley2006$fraction <= 5, "A", "B"))
sel1 <- pcol == "A"

## missing values in each sample
barplot(colSums(is.na(nax)), col = pcol)


## table of missing values in proteins
par(mfrow = c(3, 1))
barplot(table(rowSums(is.na(nax))), main = "All")
## sel1 selects samples (columns), not proteins (rows)
barplot(table(rowSums(is.na(nax)[, sel1])), main = "Group A")
barplot(table(rowSums(is.na(nax)[, !sel1])), main = "Group B")


fData(nax)$nNA1 <- rowSums(is.na(nax)[, sel1])
fData(nax)$nNA2 <- rowSums(is.na(nax)[, !sel1])
fData(nax)$nNA <- rowSums(is.na(nax))
o <- MSnbase:::imageNA2(nax, pcol)

plot((fData(nax)$nNA1 - fData(nax)$nNA2)[o], type = "l")
grid()

plot(sort(fData(nax)$nNA1 - fData(nax)$nNA2), type = "l")
grid()


o2 <- order(fData(nax)$nNA1 - fData(nax)$nNA2)
MSnbase:::imageNA2(nax, pcol, Rowv=o2)

layout(matrix(c(rep(1, 10), rep(2, 5)), ncol = 3))
MSnbase:::imageNA2(nax, pcol, Rowv=o2)
plot((fData(nax)$nNA1 - fData(nax)$nNA)[o2], type = "l", col = "red",
     ylim = c(-9, 9), ylab = "")
lines((fData(nax)$nNA - fData(nax)$nNA2)[o2], col = "steelblue")
lines((fData(nax)$nNA1 - fData(nax)$nNA2)[o2], type = "l",
     lwd = 2)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> library(MSnbase)
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: mzR
Loading required package: Rcpp
Loading required package: BiocParallel
Loading required package: ProtGenerics

This is MSnbase version 1.20.7 
  Read '?MSnbase' and references therein for information
  about the package and how to get started.


Attaching package: 'MSnbase'

The following object is masked from 'package:stats':

    smooth

The following object is masked from 'package:base':

    trimws

> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/MSnbase/missing-data.Rd_%03d_medium.png", width=480, height=480)
> ### Name: missing-data
> ### Title: Documenting missing data visualisation
> ### Aliases: missing-data missingdata
> ### Keywords: documentation, internal
> 
> ### ** Examples
> 
> ## Other suggestions
> library("pRolocdata")

This is pRolocdata version 1.10.0.
Use 'pRolocdata()' to list available data sets.
> library("pRoloc")
Loading required package: MLInterfaces
Loading required package: annotate
Loading required package: AnnotationDbi
Loading required package: stats4
Loading required package: IRanges
Loading required package: S4Vectors

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: XML
Loading required package: cluster

This is pRoloc version 1.12.4 
  Read '?pRoloc' and references therein for information
  about the package and how to get started.

> data(dunkley2006)
> set.seed(1)
> nax <- makeNaData(dunkley2006, pNA = 0.10)
> pcol <- factor(ifelse(dunkley2006$fraction <= 5, "A", "B"))
> sel1 <- pcol == "A"
> 
> ## missing values in each sample
> barplot(colSums(is.na(nax)), col = pcol)
> 
> 
> ## table of missing values in proteins
> par(mfrow = c(3, 1))
> barplot(table(rowSums(is.na(nax))), main = "All")
> barplot(table(rowSums(is.na(nax)[sel1,])), main = "Group A")
> barplot(table(rowSums(is.na(nax)[!sel1,])), main = "Group B")
> 
> 
> fData(nax)$nNA1 <- rowSums(is.na(nax)[, sel1])
> fData(nax)$nNA2 <- rowSums(is.na(nax)[, !sel1])
> fData(nax)$nNA <- rowSums(is.na(nax))
> o <- MSnbase:::imageNA2(nax, pcol)
> 
> plot((fData(nax)$nNA1 - fData(nax)$nNA2)[o], type = "l")
> grid()
> 
> plot(sort(fData(nax)$nNA1 - fData(nax)$nNA2), type = "l")
> grid()
> 
> 
> o2 <- order(fData(nax)$nNA1 - fData(nax)$nNA2)
> MSnbase:::imageNA2(nax, pcol, Rowv=o2)
> 
> layout(matrix(c(rep(1, 10), rep(2, 5)), nc = 3))
> MSnbase:::imageNA2(nax, pcol, Rowv=o2)
> plot((fData(nax)$nNA1 - fData(nax)$nNA)[o2], type = "l", col = "red",
+      ylim = c(-9, 9), ylab = "")
> lines((fData(nax)$nNA - fData(nax)$nNA2)[o2], col = "steelblue")
> lines((fData(nax)$nNA1 - fData(nax)$nNA2)[o2], type = "l",
+      lwd = 2)
> dev.off()
null device 
          1 
>