Last data update: 2014.03.03

removeDuplicatedPatients    R Documentation

Remove duplicated patient samples (samples from the same patient/column ID)

Description

Keeps only one sample per patient (column ID) in the data matrix. When several samples share the same ID, the sample with the highest overall variance is kept.
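The selection rule described above can be sketched in base R. This is an illustration only, not the package source: the helper name keepHighestVarianceSample is hypothetical, and it assumes that "overall variance" means the variance of each column taken across all features and that ties go to the first column.

```r
# Illustrative sketch only; NOT the curatedBreastData implementation.
# keepHighestVarianceSample is a hypothetical helper name.
keepHighestVarianceSample <- function(exprMatrix, use = "everything") {
  ids <- colnames(exprMatrix)
  # Variance of each sample (column) across all features (assumed meaning
  # of "overall variance")
  colVars <- apply(exprMatrix, 2, var, use = use)
  # For each distinct ID, pick the column index with the largest variance.
  # which.max() returns the first maximum, so ties go to the first column.
  # If every variance for an ID is NA, which.max() returns nothing and
  # vapply() stops with an error.
  keep <- vapply(unique(ids), function(id) {
    idx <- which(ids == id)
    idx[which.max(colVars[idx])]
  }, integer(1))
  exprMatrix[, keep, drop = FALSE]
}

# Toy matrix: patient P1 has two samples, patient P2 has one
m <- cbind(c(1, 2, 3), c(1, 5, 9), c(2, 2, 4))
colnames(m) <- c("P1", "P1", "P2")
rownames(m) <- c("g1", "g2", "g3")

result <- keepHighestVarianceSample(m)
dim(result)
```

In this toy matrix the second P1 column varies more than the first, so it is the one retained alongside the single P2 column.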

Usage

removeDuplicatedPatients(exprMatrix, outputFile = "duplicatedPatientsOutput.txt", 
varMetric = c("everything", "all.obs", "complete.obs", "na.or.complete",
"pairwise.complete.obs"))

Arguments

exprMatrix

Expression matrix, with features in the rows and samples in the columns.

outputFile

File that receives the status messages printed while duplicated samples are removed. Give the full path if the file should not be written to the current working directory.

varMetric

One of the standard options for the use argument of the var() function (stats package). The choice matters mainly when the data matrix contains NA values; otherwise, "everything" is usually fine.
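To see why the choice matters with missing data, the following sketch calls var() directly on a made-up vector containing one NA. It only demonstrates the behavior of var() itself; how removeDuplicatedPatients passes varMetric through internally is not shown in this page and is assumed here to follow the same semantics.

```r
# A made-up expression profile with one missing value
x <- c(2.1, 3.4, NA, 5.0)

# "everything": the NA propagates, so the variance itself is NA
varEverything <- var(x, use = "everything")

# "complete.obs": the NA is dropped before the variance is computed
varComplete <- var(x, use = "complete.obs")

print(varEverything)
print(varComplete)
```

A sample whose variance is NA cannot be compared meaningfully with the others, which is why an NA-tolerant option or prior imputation is worth considering.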

Value

exprMatrix: the final data matrix with only 1 sample per patient ID.

Note

Suggestions are welcome for other ways to pick the best sample among several from the same patient. No curatedBreastData matrices currently contain samples that share a patient ID, but this function is especially useful for data such as TCGA, where this is often the case.

It is suggested that you impute missing values with the filterAndImpute function before running this function, to avoid -Inf and NA values in the variance calculations.
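A quick way to decide whether imputation is needed is to count the problematic entries first. The sketch below uses only base R on a made-up matrix; the variable names are illustrative and nothing here is part of the package.

```r
# Made-up matrix with one NA and one -Inf
m <- matrix(c(1, NA, 3, -Inf, 5, 6), nrow = 3)

# Number of missing entries
nMissing <- sum(is.na(m))

# Number of entries that are NA, NaN, Inf or -Inf
nNonFinite <- sum(!is.finite(m))

# Which samples (columns) contain at least one non-finite value
affectedColumns <- which(colSums(!is.finite(m)) > 0)

print(nMissing)
print(nNonFinite)
print(affectedColumns)
```

If both counts are zero, the variance calculations should not run into NA or -Inf values.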

Author(s)

Katie Planey <katie.planey@gmail.com>

Examples

library(curatedBreastData)

# No curatedBreastData dataset has duplicated samples,
# but we can still run this function on one of the datasets.
# Load the datasets:
data(curatedBreastDataExprSetList)

# This dataset has no NA values, so it works as an example without extra pre-processing.
outputMatrix <- removeDuplicatedPatients(
  exprMatrix = exprs(curatedBreastDataExprSetList[[1]]),
  outputFile = "./duplicatedPatientsOutput.txt", varMetric = "everything")
# Final dimensions: unchanged in this case, because
# no samples share the same patient ID.
dim(outputMatrix)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> library(curatedBreastData)
Loading required package: ggplot2
Loading required package: impute
Loading required package: XML
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: BiocStyle
> ### Name: removeDuplicatedPatients
> ### Title: Remove duplicated patient samples (samples from the same
> ###   patient/column ID)
> ### Aliases: removeDuplicatedPatients
> 
> ### ** Examples
> 
> #No curatedBreastData has duplicated samples, 
> #but we can still run this function on one of the datasets:
> #load up our datasets
> data(curatedBreastDataExprSetList);
> 
> #This dataset does not have NA values, which makes for a good example without extra pre-processing.
> outputMatrix <- removeDuplicatedPatients(exprMatrix=
+ exprs(curatedBreastDataExprSetList[[1]]), 
+ outputFile = "./duplicatedPatientsOutput.txt", varMetric = c("everything"))

Starting with  60patients.
found no multiple samples from the same patient(s)
> #final dimensions - unchanged in this case with 
> #no samples sharing the same patient ID.
> dim(outputMatrix)
[1] 22582    60