A list containing, at minimum, a gene expression (or other molecular data) matrix with keys (molecular features, such as genes) in the rows and patient samples in the columns, plus a keys list. The keys entry is assumed to be named "keys". Because this function can be used for any type of molecular data, the name of the data matrix entry in the list can be changed via exprIndex.
studyName
Character string to name the study. Useful when looping over multiple datasets: messages printed to the output file can then be identified by each individual study name.
outputFile
Output file for printing progress and statistics on gene/sample filtering and data imputation. Include the full directory path if the file should not be written to the current working directory.
impute
Impute data? A boolean TRUE or FALSE value. If FALSE, only genes and samples with high NA rates are removed, and the rest of the data is not imputed.
knnFractionSize
Fraction of the total dataset to use as nearest neighbors for KNN imputation. This is translated into the numeric "k" argument of impute.knn() from the impute package. Default is .01, or 1%.
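As a rough illustration of how a fraction could be turned into k, the sketch below assumes the fraction is applied to the number of rows (features) of the matrix and rounded, with a floor of 1. The helper name fractionToK and the rounding rule are assumptions made for this example, not the package's actual code.

```r
# Hypothetical helper: convert a neighbor fraction into an integer k.
# Assumption: the fraction is applied to the number of rows (features).
fractionToK <- function(exprMatrix, knnFractionSize = 0.01) {
  k <- round(knnFractionSize * nrow(exprMatrix))
  max(1L, as.integer(k))  # never return fewer than one neighbor
}

# Toy matrix: 2000 features x 10 samples
toy <- matrix(rnorm(2000 * 10), nrow = 2000, ncol = 10)
k <- fractionToK(toy, knnFractionSize = 0.01)

# A very small matrix still yields k = 1
kSmall <- fractionToK(matrix(0, nrow = 5, ncol = 2))
```

The resulting value could then be passed as the k argument of impute.knn().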
fractionSampleNAcutoff
Max fraction of NAs allowed for a certain sample across all genes. Default is .005. (A cutoff of .005, or .5%, still corresponds to a large number of genes per sample when the data matrix has tens of thousands of genes.)
fractionGeneNAcutoff
Max fraction of NAs allowed for a certain gene across all samples. Default is .01; that is, a gene cannot be missing in more than 1% of patients. For smaller datasets, consider raising this threshold unless you want genes that are missing in only one sample to be removed.
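The two cutoffs can be pictured with a small base R sketch. It assumes both NA fractions are computed once on the original matrix and that a gene or sample is kept when its fraction is at or below the cutoff; the order of operations inside filterAndImputeSamples may differ. The toy matrix and all object names are made up for illustration.

```r
# Toy expression matrix: 6 genes (rows) x 50 samples (columns)
set.seed(1)
expr <- matrix(rnorm(6 * 50), nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("s", 1:50)))
expr["gene2", "s7"] <- NA               # gene2 is missing in 1 of 50 samples (2%)
expr[c("gene4", "gene5"), "s9"] <- NA   # s9 is missing 2 of its 6 genes

fractionGeneNAcutoff   <- 0.01
fractionSampleNAcutoff <- 0.005

# Fraction of NAs per gene (across samples) and per sample (across genes)
geneNAfrac   <- rowMeans(is.na(expr))
sampleNAfrac <- colMeans(is.na(expr))

# Keep only genes and samples at or below their cutoffs
keepGenes   <- geneNAfrac   <= fractionGeneNAcutoff
keepSamples <- sampleNAfrac <= fractionSampleNAcutoff
filtered <- expr[keepGenes, keepSamples, drop = FALSE]
dim(filtered)
```

With only 50 samples, a single missing value already puts a gene at 2% NA, above the default 1% cutoff, which is why a higher threshold may suit small datasets.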
exprIndex
Character string. List slot name for the data matrix, presumably an expression matrix.
classIndex
Optional character string giving the list slot name for a phenotype vector or matrix, if available. If phenotype/class data such as survival is already in the list, samples removed for high NA rates must also be removed from the phenotype data; filterAndImputeSamples filters these samples out of the phenotype data accordingly.
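A minimal sketch of keeping phenotype data in step with the filtered expression matrix is shown below. It assumes the phenotype rows are named by sample ID and that those IDs match the expression matrix column names; the toy objects are hypothetical and this is not the package's internal code.

```r
# Toy data: 3 genes x 4 samples, with one phenotype row per sample
expr <- matrix(1:12, nrow = 3,
               dimnames = list(paste0("g", 1:3), paste0("s", 1:4)))
pheno <- data.frame(survival = c(10, 24, 36, 5),
                    row.names = paste0("s", 1:4))

# Pretend sample s3 was dropped for a high NA rate
keptSamples  <- c("s1", "s2", "s4")
exprFiltered <- expr[, keptSamples, drop = FALSE]

# Subset the phenotype rows by the samples that survived filtering
phenoFiltered <- pheno[colnames(exprFiltered), , drop = FALSE]
```

Subsetting by name rather than by position keeps the two objects aligned even if their original orderings differ.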
sampleCol
Are samples in the columns of the expression matrix? If not, this function will first transpose the matrix to make sure impute.knn is running properly.
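The orientation check amounts to a conditional transpose. The helper below is a hypothetical illustration of that idea, not the function's own code.

```r
# Hypothetical helper: return a matrix with samples in the columns
orientSamplesInColumns <- function(x, sampleCol = TRUE) {
  if (!sampleCol) t(x) else x
}

# Toy matrix with samples in the rows (2 samples x 3 genes)
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("s1", "s2"), c("g1", "g2", "g3")))
oriented <- orientSamplesInColumns(m, sampleCol = FALSE)
dim(oriented)
```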
returnErrorRate
Boolean TRUE or FALSE. If TRUE, a small number of real expression data points are held out and impute.knn is run. The accuracy of the imputed values versus the real values is returned. This is helpful in early data analysis stages to determine whether KNN imputation is appropriate for your type of data. Default is FALSE to reduce computation time.
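The hold-out idea can be sketched as follows. To stay self-contained, this example swaps in a simple row-mean imputer where the function uses KNN imputation, and the 1% hold-out size and the relative error definition are assumptions made for illustration; how filterAndImputeSamples actually computes its error rate may differ.

```r
set.seed(42)
# Toy complete matrix: 100 genes x 20 samples (simulated values)
expr <- matrix(rnorm(100 * 20, mean = 8, sd = 1), nrow = 100)

# Hold out 1% of the observed entries and remember their true values
observed <- which(!is.na(expr))
heldOut  <- sample(observed, size = ceiling(0.01 * length(observed)))
truth    <- expr[heldOut]
masked   <- expr
masked[heldOut] <- NA

# Stand-in imputer: replace each NA with its row (gene) mean.
# impute::impute.knn(masked, k = ...)$data could be used here instead.
rowMeansObs <- rowMeans(masked, na.rm = TRUE)
imputed <- masked
naIdx <- which(is.na(imputed), arr.ind = TRUE)
imputed[naIdx] <- rowMeansObs[naIdx[, "row"]]

# Compare imputed values against the held-out truth
meanAbsDiff   <- mean(abs(imputed[heldOut] - truth))
relativeError <- meanAbsDiff / mean(abs(truth))  # assumed definition
```

A large error on held-out values would suggest that this kind of imputation is a poor fit for the data at hand.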
Value
A list containing the following objects:
expr
original expression matrix
exprFilterImpute
final filtered and imputed expression matrix
keys
original keys
keysFilterImpute
final keys after filtering and imputation
class
original classes/phenotype data
classesFilter
final classes/phenotype data, with any sample rows removed that were dropped from the expression matrix during filtering.
Author(s)
Katie Planey <katie.planey@gmail.com>
Examples
#load up our datasets
data(curatedBreastDataExprSetList);
#Run on just one dataset, GSE9893, as an example. This dataset does have NA values.
#The calculation may take a minute to run.
#create study list object.
study <- list(expr=exprs(curatedBreastDataExprSetList[[5]]),
keys=curatedBreastDataExprSetList[[2]]@featureData$gene_symbol,
phenoData=pData(curatedBreastDataExprSetList[[5]]))
filteredStudy <- filterAndImputeSamples(study, studyName = "study",
outputFile = "createTestTrainSetsOutput.txt", impute = TRUE,
knnFractionSize = 0.01, fractionSampleNAcutoff = 0.005,
fractionGeneNAcutoff = 0.01, exprIndex = "expr", classIndex="phenoData",
sampleCol = TRUE, returnErrorRate = TRUE)
#see output list names
names(filteredStudy)
#what is the imputation error fraction (rate)?
filteredStudy$errorRate
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
> library(curatedBreastData)
Loading required package: ggplot2
Loading required package: impute
Loading required package: XML
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel
Loading required package: BiocStyle
> ### Name: filterAndImputeSamples
> ### Title: Filter and Impute Samples
> ### Aliases: filterAndImputeSamples
>
> ### ** Examples
>
>
> #load up our datasets
> data(curatedBreastDataExprSetList);
>
> #just perform on one dataset as an example, GSE9893. This dataset does have NA values.
> #highestVariance calculation make take a minute to run.
> #create study list object.
> study <- list(expr=exprs(curatedBreastDataExprSetList[[5]]),
+ keys=curatedBreastDataExprSetList[[2]]@featureData$gene_symbol,
+ phenoData=pData(curatedBreastDataExprSetList[[5]]))
>
> filteredStudy <- filterAndImputeSamples(study, studyName = "study",
+ outputFile = "createTestTrainSetsOutput.txt", impute = TRUE,
+ knnFractionSize = 0.01, fractionSampleNAcutoff = 0.005,
+ fractionGeneNAcutoff = 0.01, exprIndex = "expr", classIndex="phenoData",
+ sampleCol = TRUE, returnErrorRate = TRUE)
Cluster size 22898 broken into 7597 15301
Cluster size 7597 broken into 1303 6294
Done cluster 1303
Cluster size 6294 broken into 2904 3390
Cluster size 2904 broken into 1773 1131
Cluster size 1773 broken into 1372 401
Done cluster 1372
Done cluster 401
Done cluster 1773
Done cluster 1131
Done cluster 2904
Cluster size 3390 broken into 1993 1397
Cluster size 1993 broken into 977 1016
Done cluster 977
Done cluster 1016
Done cluster 1993
Done cluster 1397
Done cluster 3390
Done cluster 6294
Done cluster 7597
Cluster size 15301 broken into 6374 8927
Cluster size 6374 broken into 4092 2282
Cluster size 4092 broken into 1948 2144
Cluster size 1948 broken into 951 997
Done cluster 951
Done cluster 997
Done cluster 1948
Cluster size 2144 broken into 979 1165
Done cluster 979
Done cluster 1165
Done cluster 2144
Done cluster 4092
Cluster size 2282 broken into 613 1669
Done cluster 613
Cluster size 1669 broken into 780 889
Done cluster 780
Done cluster 889
Done cluster 1669
Done cluster 2282
Done cluster 6374
Cluster size 8927 broken into 6446 2481
Cluster size 6446 broken into 3049 3397
Cluster size 3049 broken into 1504 1545
Cluster size 1504 broken into 1080 424
Done cluster 1080
Done cluster 424
Done cluster 1504
Cluster size 1545 broken into 899 646
Done cluster 899
Done cluster 646
Done cluster 1545
Done cluster 3049
Cluster size 3397 broken into 1179 2218
Done cluster 1179
Cluster size 2218 broken into 1764 454
Cluster size 1764 broken into 1009 755
Done cluster 1009
Done cluster 755
Done cluster 1764
Done cluster 454
Done cluster 2218
Done cluster 3397
Done cluster 6446
Cluster size 2481 broken into 1695 786
Cluster size 1695 broken into 960 735
Done cluster 960
Done cluster 735
Done cluster 1695
Done cluster 786
Done cluster 2481
Done cluster 8927
Done cluster 15301
Cluster size 22895 broken into 7694 15201
Cluster size 7694 broken into 1681 6013
Cluster size 1681 broken into 547 1134
Done cluster 547
Done cluster 1134
Done cluster 1681
Cluster size 6013 broken into 2938 3075
Cluster size 2938 broken into 1263 1675
Done cluster 1263
Cluster size 1675 broken into 1321 354
Done cluster 1321
Done cluster 354
Done cluster 1675
Done cluster 2938
Cluster size 3075 broken into 1852 1223
Cluster size 1852 broken into 920 932
Done cluster 920
Done cluster 932
Done cluster 1852
Done cluster 1223
Done cluster 3075
Done cluster 6013
Done cluster 7694
Cluster size 15201 broken into 4625 10576
Cluster size 4625 broken into 2347 2278
Cluster size 2347 broken into 1692 655
Cluster size 1692 broken into 716 976
Done cluster 716
Done cluster 976
Done cluster 1692
Done cluster 655
Done cluster 2347
Cluster size 2278 broken into 559 1719
Done cluster 559
Cluster size 1719 broken into 642 1077
Done cluster 642
Done cluster 1077
Done cluster 1719
Done cluster 2278
Done cluster 4625
Cluster size 10576 broken into 6676 3900
Cluster size 6676 broken into 3345 3331
Cluster size 3345 broken into 1163 2182
Done cluster 1163
Cluster size 2182 broken into 1328 854
Done cluster 1328
Done cluster 854
Done cluster 2182
Done cluster 3345
Cluster size 3331 broken into 1219 2112
Done cluster 1219
Cluster size 2112 broken into 1897 215
Cluster size 1897 broken into 1425 472
Done cluster 1425
Done cluster 472
Done cluster 1897
Done cluster 215
Done cluster 2112
Done cluster 3331
Done cluster 6676
Cluster size 3900 broken into 2244 1656
Cluster size 2244 broken into 1152 1092
Done cluster 1152
Done cluster 1092
Done cluster 2244
Cluster size 1656 broken into 993 663
Done cluster 993
Done cluster 663
Done cluster 1656
Done cluster 3900
Done cluster 10576
Done cluster 15201
Warning message:
In filterAndImputeSamples(study, studyName = "study", outputFile = "createTestTrainSetsOutput.txt", :
Just a warning: this function assumes your missing values
are proper NAs, not "null",etc.
>
> #see output list names
> names(filteredStudy)
[1] "expr" "exprFilterImpute" "class" "classesFilter"
[5] "keysFilterImpute" "keys" "meanAbsDiff" "errorRate"
> #what is the imputation error fraction (rate)?
> filteredStudy$errorRate
[1] 0.1136878