Last data update: 2014.03.03

R: Filter and Impute Samples
filterAndImputeSamples {curatedBreastData}    R Documentation

Filter and Impute Samples

Description

Removes samples and genes with high NA rates, then imputes the remaining missing values by k-nearest neighbors (KNN).

Usage

filterAndImputeSamples(study, studyName = "study", 
outputFile = "createTestTrainSetsOutput.txt", 
impute = TRUE, knnFractionSize = 0.01, fractionSampleNAcutoff = 0.005,
fractionGeneNAcutoff = 0.01, exprIndex = "expr", classIndex, 
sampleCol = TRUE, returnErrorRate = TRUE)

Arguments

study

A list containing, at minimum, a molecular data matrix (typically gene expression) and a list of keys. The matrix has keys (molecular features, such as genes) in the rows and patient samples in the columns. The keys element is assumed to be named "keys"; because this function can be used for any type of molecular data, the name of the matrix element can be changed through exprIndex.

studyName

Character string naming the study. Useful when looping over multiple datasets: messages printed to the output file can then be attributed to each individual study by name.

outputFile

Output file for printing progress and statistics on gene/sample filtering and data imputation. Include the full path if the file should not be written to the current working directory.

impute

Impute data? A boolean TRUE or FALSE value. If FALSE, only genes and samples with high NA rates are removed, and the rest of the data is not imputed.

knnFractionSize

Fraction of the total dataset to use as nearest neighbors in KNN imputation. This is translated into the numeric "k" argument of impute.knn() from the impute package. Default is 0.01, or 1%.

fractionSampleNAcutoff

Maximum fraction of NAs allowed for a sample across all genes. Default is 0.005, or 0.5%. (If there are tens of thousands of genes in the data matrix, 0.5% still corresponds to a large number of genes for a sample.)

fractionGeneNAcutoff

Maximum fraction of NAs allowed for a gene across all samples. Default is 0.01, so a gene cannot be missing in more than 1% of patients. Increase this threshold for smaller datasets: with fewer than 100 samples, the default removes any gene that is missing in even one sample.

exprIndex

Character string. List slot name for the data matrix, presumably an expression matrix.

classIndex

Optional character string giving the list slot name for a phenotype vector or matrix, if available. If phenotype/class data such as survival is already in the list, samples removed for high NA rates must also be removed from the phenotype data; filterAndImputeSamples filters these samples out of the phenotype data accordingly.

sampleCol

Are samples in the columns of the expression matrix? If not, this function will first transpose the matrix to make sure impute.knn is running properly.

returnErrorRate

Logical TRUE or FALSE. If TRUE, a small number of real expression data points are held out and impute.knn is performed on them; the error rate of the imputed values versus the real values is returned (errorRate in the output list). This is helpful in early data analysis stages to determine whether KNN imputation is appropriate for your type of data. The Usage signature above lists TRUE as the default; set this to FALSE to reduce computation time.
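
The three fraction arguments above can be read on a toy matrix as follows. This is a minimal base-R sketch, not the package's implementation: the filtering order (samples first, then genes), the use of a strict "greater than" comparison, and the rounding that turns knnFractionSize into k (here a ceiling over the number of remaining genes) are all assumptions made for illustration.

```r
# Toy expression matrix: 10 genes (rows) x 5 samples (columns).
set.seed(1)
expr <- matrix(rnorm(50), nrow = 10,
               dimnames = list(paste0("gene", 1:10), paste0("sample", 1:5)))
expr[1:6, "sample2"] <- NA                                # sample2 misses 6 of 10 genes
expr["gene9", c("sample1", "sample3", "sample4")] <- NA   # gene9 misses 3 samples
expr["gene3", "sample5"] <- NA                            # one isolated NA to impute later

# Toy cutoffs; the documented defaults (0.005 and 0.01) are far stricter.
fractionSampleNAcutoff <- 0.5
fractionGeneNAcutoff   <- 0.5
knnFractionSize        <- 0.2

# Step 1 (assumed order): drop samples whose NA fraction exceeds the cutoff.
sampleNAfraction <- colMeans(is.na(expr))
exprFiltered <- expr[, sampleNAfraction <= fractionSampleNAcutoff, drop = FALSE]

# Step 2: drop genes whose NA fraction across the remaining samples exceeds the cutoff.
geneNAfraction <- rowMeans(is.na(exprFiltered))
exprFiltered <- exprFiltered[geneNAfraction <= fractionGeneNAcutoff, , drop = FALSE]

# Step 3 (assumed rounding): turn the neighbor fraction into an integer k.
k <- max(1, ceiling(knnFractionSize * nrow(exprFiltered)))

dim(exprFiltered)         # genes x samples left after filtering
sum(is.na(exprFiltered))  # NAs that would still need imputation
k
```

In this toy case sample2 and gene9 are removed and a single NA (gene3 in sample5) is left for imputation. With the documented default fractionGeneNAcutoff of 0.01 and, say, 50 samples, a gene missing in one sample has an NA fraction of 1/50 = 0.02 and would already be removed.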

Value

A list containing the following objects:

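expr

original expression matrix

exprFilterImpute

final filtered and imputed expression matrix

keys

original keys

keysFilterImpute

final keys, after filtering

class

original classes/phenotype data

classesFilter

final classes/phenotype data, removing any sample rows that were removed from the expression matrix after filtering.

errorRate

imputation error fraction (rate); returned in the example run, which sets returnErrorRate = TRUE.

meanAbsDiff

also returned in the example run with returnErrorRate = TRUE; judging by its name, the mean absolute difference between the held-out and imputed values.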
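
The relationship between the filtered expression matrix and the filtered phenotype data can be sketched in base R. This is an illustration only: the object names are hypothetical, the set of kept samples is made up, and it assumes the phenotype table has one row per sample with row names matching the expression matrix column names.

```r
# Toy data: 3 genes x 4 samples, plus one phenotype row per sample.
expr <- matrix(as.numeric(1:12), nrow = 3,
               dimnames = list(paste0("gene", 1:3), paste0("sample", 1:4)))
pheno <- data.frame(survival = c(10, 25, 7, 40), row.names = colnames(expr))

# Hypothetical outcome of sample filtering: sample2 was dropped.
keptSamples <- c("sample1", "sample3", "sample4")

# Subset both objects by the same sample names so they stay aligned.
exprFiltered  <- expr[, keptSamples, drop = FALSE]
phenoFiltered <- pheno[keptSamples, , drop = FALSE]

# Columns of the matrix and rows of the phenotype table now match one-to-one.
identical(colnames(exprFiltered), rownames(phenoFiltered))
```

Indexing by sample name rather than by position keeps the two objects in step even if their original orderings differ.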

Author(s)

Katie Planey <katie.planey@gmail.com>

Examples


#load up our datasets
data(curatedBreastDataExprSetList);

#just perform on one dataset as an example, GSE9893. This dataset does have NA values.
#this calculation may take a minute to run.
#create study list object. 
#take the expression matrix, keys and phenotype data from the same study (index 5)
study <- list(expr=exprs(curatedBreastDataExprSetList[[5]]),
keys=curatedBreastDataExprSetList[[5]]@featureData$gene_symbol,
phenoData=pData(curatedBreastDataExprSetList[[5]]))

filteredStudy <- filterAndImputeSamples(study, studyName = "study", 
outputFile = "createTestTrainSetsOutput.txt", impute = TRUE, 
knnFractionSize = 0.01, fractionSampleNAcutoff = 0.005, 
fractionGeneNAcutoff = 0.01, exprIndex = "expr", classIndex="phenoData",
sampleCol = TRUE, returnErrorRate = TRUE)

#see output list names 
names(filteredStudy)
#what is the imputation error fraction (rate)?
filteredStudy$errorRate
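
The hold-out check behind returnErrorRate can be sketched on toy data without the package. This is an illustration under stated assumptions, not the package's code: knnImputeSketch is a hypothetical, simplified stand-in for impute.knn, and the two summaries computed at the end (a mean absolute difference and a relative error) are assumed definitions that may differ from how the package computes meanAbsDiff and errorRate.

```r
# Impute each NA from the k nearest rows (genes). Distance is the
# root-mean-square difference over the columns observed in both rows.
# Simplified stand-in for impute::impute.knn, for illustration only.
knnImputeSketch <- function(x, k) {
  imputed <- x
  for (g in which(rowSums(is.na(x)) > 0)) {
    d <- apply(x, 1, function(other) sqrt(mean((x[g, ] - other)^2, na.rm = TRUE)))
    d[g] <- Inf  # a gene is never its own neighbor
    for (s in which(is.na(x[g, ]))) {
      # Neighbors must have an observed value in column s and a usable distance.
      candidates <- which(!is.na(x[, s]) & is.finite(d))
      nearest <- candidates[order(d[candidates])][seq_len(min(k, length(candidates)))]
      imputed[g, s] <- mean(x[nearest, s])
    }
  }
  imputed
}

# Toy data: 40 genes in two correlated groups, 12 samples, no missing values.
set.seed(42)
base1 <- rnorm(12)
base2 <- rnorm(12)
truth <- rbind(t(replicate(20, base1 + rnorm(12, sd = 0.1))),
               t(replicate(20, base2 + rnorm(12, sd = 0.1))))

# Hold out a small number of real values, impute them, compare with the truth.
heldOut <- sample(which(!is.na(truth)), size = 20)
masked <- truth
masked[heldOut] <- NA
imputed <- knnImputeSketch(masked, k = 5)

absDiff <- abs(imputed[heldOut] - truth[heldOut])
meanAbsDiffSketch   <- mean(absDiff)                                  # assumed definition
relativeErrorSketch <- meanAbsDiffSketch / mean(abs(truth[heldOut]))  # assumed definition
meanAbsDiffSketch
relativeErrorSketch
```

A small error on held-out values suggests that neighboring genes carry enough information for KNN imputation to be reasonable; a large error is a signal to reconsider imputing this type of data.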

Results


> library(curatedBreastData)
Loading required package: ggplot2
Loading required package: impute
Loading required package: XML
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel

Loading required package: BiocStyle
> ### Name: filterAndImputeSamples
> ### Title: Filter and Impute Samples
> ### Aliases: filterAndImputeSamples
> 
> ### ** Examples
> 
> 
> #load up our datasets
> data(curatedBreastDataExprSetList);
> 
> #just perform on one dataset as an example, GSE9893. This dataset does have NA values.
> #highestVariance calculation make take a minute to run.
> #create study list object. 
> study <- list(expr=exprs(curatedBreastDataExprSetList[[5]]),
+ keys=curatedBreastDataExprSetList[[2]]@featureData$gene_symbol,
+ phenoData=pData(curatedBreastDataExprSetList[[5]]))
> 
> filteredStudy <- filterAndImputeSamples(study, studyName = "study", 
+ outputFile = "createTestTrainSetsOutput.txt", impute = TRUE, 
+ knnFractionSize = 0.01, fractionSampleNAcutoff = 0.005, 
+ fractionGeneNAcutoff = 0.01, exprIndex = "expr", classIndex="phenoData",
+ sampleCol = TRUE, returnErrorRate = TRUE)
Cluster size 22898 broken into 7597 15301 
Cluster size 7597 broken into 1303 6294 
Done cluster 1303 
Cluster size 6294 broken into 2904 3390 
Cluster size 2904 broken into 1773 1131 
Cluster size 1773 broken into 1372 401 
Done cluster 1372 
Done cluster 401 
Done cluster 1773 
Done cluster 1131 
Done cluster 2904 
Cluster size 3390 broken into 1993 1397 
Cluster size 1993 broken into 977 1016 
Done cluster 977 
Done cluster 1016 
Done cluster 1993 
Done cluster 1397 
Done cluster 3390 
Done cluster 6294 
Done cluster 7597 
Cluster size 15301 broken into 6374 8927 
Cluster size 6374 broken into 4092 2282 
Cluster size 4092 broken into 1948 2144 
Cluster size 1948 broken into 951 997 
Done cluster 951 
Done cluster 997 
Done cluster 1948 
Cluster size 2144 broken into 979 1165 
Done cluster 979 
Done cluster 1165 
Done cluster 2144 
Done cluster 4092 
Cluster size 2282 broken into 613 1669 
Done cluster 613 
Cluster size 1669 broken into 780 889 
Done cluster 780 
Done cluster 889 
Done cluster 1669 
Done cluster 2282 
Done cluster 6374 
Cluster size 8927 broken into 6446 2481 
Cluster size 6446 broken into 3049 3397 
Cluster size 3049 broken into 1504 1545 
Cluster size 1504 broken into 1080 424 
Done cluster 1080 
Done cluster 424 
Done cluster 1504 
Cluster size 1545 broken into 899 646 
Done cluster 899 
Done cluster 646 
Done cluster 1545 
Done cluster 3049 
Cluster size 3397 broken into 1179 2218 
Done cluster 1179 
Cluster size 2218 broken into 1764 454 
Cluster size 1764 broken into 1009 755 
Done cluster 1009 
Done cluster 755 
Done cluster 1764 
Done cluster 454 
Done cluster 2218 
Done cluster 3397 
Done cluster 6446 
Cluster size 2481 broken into 1695 786 
Cluster size 1695 broken into 960 735 
Done cluster 960 
Done cluster 735 
Done cluster 1695 
Done cluster 786 
Done cluster 2481 
Done cluster 8927 
Done cluster 15301 
Cluster size 22895 broken into 7694 15201 
Cluster size 7694 broken into 1681 6013 
Cluster size 1681 broken into 547 1134 
Done cluster 547 
Done cluster 1134 
Done cluster 1681 
Cluster size 6013 broken into 2938 3075 
Cluster size 2938 broken into 1263 1675 
Done cluster 1263 
Cluster size 1675 broken into 1321 354 
Done cluster 1321 
Done cluster 354 
Done cluster 1675 
Done cluster 2938 
Cluster size 3075 broken into 1852 1223 
Cluster size 1852 broken into 920 932 
Done cluster 920 
Done cluster 932 
Done cluster 1852 
Done cluster 1223 
Done cluster 3075 
Done cluster 6013 
Done cluster 7694 
Cluster size 15201 broken into 4625 10576 
Cluster size 4625 broken into 2347 2278 
Cluster size 2347 broken into 1692 655 
Cluster size 1692 broken into 716 976 
Done cluster 716 
Done cluster 976 
Done cluster 1692 
Done cluster 655 
Done cluster 2347 
Cluster size 2278 broken into 559 1719 
Done cluster 559 
Cluster size 1719 broken into 642 1077 
Done cluster 642 
Done cluster 1077 
Done cluster 1719 
Done cluster 2278 
Done cluster 4625 
Cluster size 10576 broken into 6676 3900 
Cluster size 6676 broken into 3345 3331 
Cluster size 3345 broken into 1163 2182 
Done cluster 1163 
Cluster size 2182 broken into 1328 854 
Done cluster 1328 
Done cluster 854 
Done cluster 2182 
Done cluster 3345 
Cluster size 3331 broken into 1219 2112 
Done cluster 1219 
Cluster size 2112 broken into 1897 215 
Cluster size 1897 broken into 1425 472 
Done cluster 1425 
Done cluster 472 
Done cluster 1897 
Done cluster 215 
Done cluster 2112 
Done cluster 3331 
Done cluster 6676 
Cluster size 3900 broken into 2244 1656 
Cluster size 2244 broken into 1152 1092 
Done cluster 1152 
Done cluster 1092 
Done cluster 2244 
Cluster size 1656 broken into 993 663 
Done cluster 993 
Done cluster 663 
Done cluster 1656 
Done cluster 3900 
Done cluster 10576 
Done cluster 15201 
Warning message:
In filterAndImputeSamples(study, studyName = "study", outputFile = "createTestTrainSetsOutput.txt",  :
  
Just a warning: this function assumes your missing values
  are proper NAs, not "null",etc.

> 
> #see output list names 
> names(filteredStudy)
[1] "expr"             "exprFilterImpute" "class"            "classesFilter"   
[5] "keysFilterImpute" "keys"             "meanAbsDiff"      "errorRate"       
> #what is the imputation error fraction (rate)?
> filteredStudy$errorRate
[1] 0.1136878
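
The warning in the run above notes that missing values must be proper NAs rather than strings such as "null". A minimal base-R sketch of converting such markers before calling the function follows; the raw matrix and the set of marker strings are hypothetical and should be adapted to the data at hand.

```r
# Hypothetical raw matrix in which missing values arrived as strings.
raw <- matrix(c("1.2", "null", "0.7", "NA", "", "2.4"), nrow = 2,
              dimnames = list(c("geneA", "geneB"), paste0("sample", 1:3)))

# Assumed set of missing-value markers; extend as needed.
missingMarkers <- c("null", "NULL", "NA", "NaN", "")

# Replace the markers with real NAs, then convert to a numeric matrix.
raw[raw %in% missingMarkers] <- NA
expr <- matrix(as.numeric(raw), nrow = nrow(raw), dimnames = dimnames(raw))

is.numeric(expr)  # the matrix is now numeric
sum(is.na(expr))  # number of proper NAs
```

Converting the markers to NA before as.numeric() avoids coercion warnings and makes the set of values treated as missing explicit.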