Expression matrix, with features in the rows and samples in the columns.
outputFile
File to which status messages about the removal of duplicated samples are written. Include the full path if the file should not be written to the current working directory.
varMetric
Standard options for the use argument of the var() function (from the stats package). The choice may matter if your data matrix contains NA values; otherwise, "everything" is usually fine.
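To see why the varMetric choice matters when NA values are present, here is a minimal sketch that calls var() directly on a toy vector. It assumes varMetric is forwarded to the use argument of var(), which the description above implies but does not spell out; the vector and variable names are made up for illustration.

```r
# Toy vector with one missing value (illustrative data only).
x <- c(2.1, 3.4, NA, 5.0)

# "everything": NA values propagate, so the variance itself is NA.
vEverything <- var(x, use = "everything")

# "complete.obs": NA values are dropped before the variance is computed.
vComplete <- var(x, use = "complete.obs")

print(vEverything)
print(vComplete)
```

With "everything", a single NA in a sample is enough to make that sample's variance NA, which is why the default is only safe for complete matrices.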
Value
exprMatrix: the final data matrix, with only one sample per patient ID.
Note
Suggestions are welcome for further ways to pick the best sample among several samples from the same patient. No curatedBreastData matrices currently contain samples that share a patient ID, but this function is especially useful for datasets such as TCGA, where this is often the case.
It is suggested that you impute missing values with the filterAndImpute function before running this function, to avoid -Inf and NA values in the variance calculations.
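The selection idea can be sketched on a toy matrix. This is not the package's implementation: it assumes that patient IDs are stored as column names and that the sample kept for each patient is the one with the highest variance across features, which this page suggests (via varMetric) but does not state. The helper name keepHighestVarianceSample is hypothetical.

```r
# Hypothetical helper, NOT the curatedBreastData implementation.
# Assumption: for each patient ID (column name), keep the column with
# the largest variance across features.
keepHighestVarianceSample <- function(exprMatrix, use = "everything") {
  # Variance of each sample (column) across all features (rows).
  sampleVar <- apply(exprMatrix, 2, stats::var, use = use)
  ids <- colnames(exprMatrix)
  keep <- vapply(unique(ids), function(id) {
    idx <- which(ids == id)
    best <- which.max(sampleVar[idx])
    # If every variance for this patient is NA, fall back to the first sample.
    if (length(best) == 0) best <- 1L
    idx[best]
  }, integer(1))
  # Preserve the original column order of the kept samples.
  exprMatrix[, sort(keep), drop = FALSE]
}

# Toy data: patient P1 has two samples, P2 and P3 have one each.
toy <- matrix(c(1, 2, 3,
                1, 5, 9,
                2, 2, 3,
                4, 4, 4),
              nrow = 3,
              dimnames = list(c("g1", "g2", "g3"),
                              c("P1", "P1", "P2", "P3")))

dedup <- keepHighestVarianceSample(toy)
print(dim(dedup))
print(colnames(dedup))
```

In the toy data the second P1 sample varies more across features than the first, so it is the one retained. Note that a sample whose variance is NA can never win this comparison, which is one reason to impute before deduplicating.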
Author(s)
Katie Planey <katie.planey@gmail.com>
Examples
# No curatedBreastData dataset has duplicated samples,
# but we can still run this function on one of the datasets.
# Load the package and its datasets:
library(curatedBreastData)
data(curatedBreastDataExprSetList)
# This dataset has no NA values, so it works as an example
# without extra pre-processing.
outputMatrix <- removeDuplicatedPatients(
  exprMatrix = exprs(curatedBreastDataExprSetList[[1]]),
  outputFile = "./duplicatedPatientsOutput.txt",
  varMetric = c("everything"))
# Final dimensions: unchanged in this case, because
# no samples share the same patient ID.
dim(outputMatrix)
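Before running the function on your own data, it can help to check whether any patient IDs are actually repeated. The following is a small sketch using base R only, assuming patient IDs are the column names of the expression matrix; the toy matrix and variable names are illustrative.

```r
# Toy matrix in which patient P1 appears twice (illustrative data only).
ids <- c("P1", "P2", "P1", "P3")
toy <- matrix(seq_len(8), nrow = 2,
              dimnames = list(c("g1", "g2"), ids))

# TRUE if at least one column name occurs more than once.
anyDup <- anyDuplicated(colnames(toy)) > 0

# Names of the patient IDs that occur more than once.
idCounts <- table(colnames(toy))
dupIds <- names(idCounts[idCounts > 1])

print(anyDup)
print(dupIds)
```

If anyDup is FALSE, as for the curatedBreastData matrices, the output matrix should have the same dimensions as the input.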
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
> library(curatedBreastData)
Loading required package: ggplot2
Loading required package: impute
Loading required package: XML
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: BiocStyle
> ### Name: removeDuplicatedPatients
> ### Title: Remove duplicated patient samples (samples from the same
> ### patient/column ID)
> ### Aliases: removeDuplicatedPatients
>
> ### ** Examples
>
> #No curatedBreastData has duplicated samples,
> #but we can still run this function on one of the datasets:
> #load up our datasets
> data(curatedBreastDataExprSetList);
>
> #This dataset does not have NA values, which makes for a good example without extra pre-processing.
> outputMatrix <- removeDuplicatedPatients(exprMatrix=
+ exprs(curatedBreastDataExprSetList[[1]]),
+ outputFile = "./duplicatedPatientsOutput.txt", varMetric = c("everything"))
Starting with 60patients.
found no multiple samples from the same patient(s)
> #final dimensions - unchanged in this case with
> #no samples sharing the same patient ID.
> dim(outputMatrix)
[1] 22582 60