R: Compute matrix with pairwise distances between objects....
distGPS
R Documentation
Compute matrix with pairwise distances between objects. Several
GPS metrics are available.
Description
The function computes pairwise distances between individuals
(e.g. samples or genes) according to a user-specified metric.
Several metrics are available. The precise definition of each metric
depends on the class of the first argument (see details section).
Arguments
metric
Desired distance metric. Valid options for
chroGPS-factors maps are 'tanimoto',
'avgdist', 'chisquare' and 'chi' (see Details). For chroGPS-genes
maps, metrics 'wtanimoto', 'euclidean' and 'manhattan' are also
available.
weights
For signature(x='matrix'), an unnamed numeric vector with weights applied to every sample (column)
in the original data. The typical example is when we have a sample
(epigenetic factor) with several replicates available (biological or
technical replicate, different antibody, etc.), and we want to treat
them together (for instance giving a 1/nreplicates weight to each
one). If not supplied, each replicate is considered as an individual sample (using 1 as weight
for every sample).
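As a minimal sketch of the 1/nreplicates weighting described above, the following base R code derives a weight vector from a vector of factor labels, one per column. The names factor_labels, n_replicates and w, and the labels themselves, are hypothetical and not part of chroGPS.

```r
# Hypothetical labels: one entry per column (sample) of the data matrix.
# "HP1" has 3 replicates, "GAF" has 2 and "CTCF" has 1.
factor_labels <- c("HP1", "HP1", "HP1", "GAF", "GAF", "CTCF")

# Number of replicates available for each factor
n_replicates <- table(factor_labels)

# Each replicate gets weight 1/nreplicates, so every factor sums to 1
w <- 1 / as.numeric(n_replicates[factor_labels])

# The argument is documented as an unnamed vector, so drop any names
w <- unname(w)
print(w)
```

A vector built this way has one entry per column, in column order, which is the shape the weights argument is documented to expect.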
uniqueRows
If set to TRUE and x is a
matrix or data.frame, duplicated rows are removed
prior to distance calculation. This can save substantial computing
time and memory. Notice however that the dimension of the distance
matrix is equal to the number of unique rows in x, instead of
nrow(x).
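The effect on the dimension can be illustrated with base R alone. This sketch only shows how many rows remain after removing duplicates; it does not reproduce how distGPS handles them internally. The object name xu is illustrative.

```r
# Same toy matrix as in the Examples section: rows 'a' and 'b' are identical
x <- rbind(c(rep(0, 15), rep(1, 5)),
           c(rep(0, 15), rep(1, 5)),
           c(rep(0, 19), 1),
           c(rep(1, 5), rep(0, 15)))
rownames(x) <- letters[1:4]

# unique() on a matrix drops duplicated rows, keeping the first occurrence
xu <- unique(x)

nrow(x)   # rows in the original data
nrow(xu)  # unique rows, i.e. the dimension of the reduced distance matrix
```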
genomelength
For 'chi' and 'chisquare' metrics, numeric value
indicating the length of the genome. If not given the function
uses the minimum length necessary to fit the total length of the result.
mc.cores
If mc.cores>1 and the parallel package is
loaded, computations are performed in parallel with mc.cores
processors when possible.
Details
For RangedDataList objects, distances are defined as follows.
Let a1 and a2 be two RangedData objects.
Define as n1 the number of a1 intervals overlapping with
some interval in a2. Define n2 analogously.
The Tanimoto distance between a1 and a2 is defined as
(n1+n2)/(nrow(a1)+nrow(a2)).
The average distance between a1 and a2 is defined as
.5*(n1/nrow(a1) + n2/nrow(a2)).
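The two formulas can be worked through on a small example. The sketch below uses plain data frames with start and end columns in place of RangedData objects, takes the denominators to be the number of intervals in each set (an assumption), and counts overlaps with a hypothetical helper count_overlapping. It is not the package's implementation.

```r
# Two hypothetical interval sets on one chromosome (start/end in bp)
a1 <- data.frame(start = c(100, 500, 900), end = c(200, 600, 1000))
a2 <- data.frame(start = c(150, 2000),     end = c(250, 2100))

# Count the intervals in 'a' that overlap at least one interval in 'b'
count_overlapping <- function(a, b) {
  hits <- vapply(seq_len(nrow(a)), function(i) {
    any(a$start[i] <= b$end & a$end[i] >= b$start)
  }, logical(1))
  sum(hits)
}

n1 <- count_overlapping(a1, a2)  # a1 intervals overlapping some a2 interval
n2 <- count_overlapping(a2, a1)  # a2 intervals overlapping some a1 interval

# The two quantities as defined in the text
tanimoto <- (n1 + n2) / (nrow(a1) + nrow(a2))
avgdist  <- 0.5 * (n1 / nrow(a1) + n2 / nrow(a2))
print(c(tanimoto = tanimoto, avgdist = avgdist))
```

Here only the first interval of each set overlaps, so n1 = n2 = 1, giving 2/5 for the first formula and 5/12 for the second.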
The wtanimoto distance in chroGPS-genes weights each epigenetic factor
(table columns) according to its frequency (table rows).
The chi-square distance is defined as the usual chi-square distance on
a binary matrix B which is automatically computed by
distGPS.
The binary matrix B is the
matrix with length(x) rows and number of columns equal to the
genome length, where B[i,j]==1 indicates that element i
has a binding site at base pair j.
The chi distance is simply defined as the square root of the
chi-square distance.
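The text does not spell out which chi-square distance is meant. The sketch below assumes the definition commonly used in correspondence analysis, where squared differences between row profiles are divided by the column masses. The matrix B and all object names are illustrative, and columns without any binding site are dropped because they would otherwise contribute 0/0.

```r
# Small hypothetical binary matrix: rows are elements, columns are base pairs
B <- rbind(a = c(1, 1, 0, 0, 0),
           b = c(1, 1, 0, 0, 0),
           c = c(0, 1, 1, 0, 0),
           d = c(0, 0, 0, 1, 0))

# Drop columns that are zero for every element (they would give 0/0)
B <- B[, colSums(B) > 0, drop = FALSE]

profiles <- B / rowSums(B)        # row profiles: each row sums to 1
col_mass <- colSums(B) / sum(B)   # column masses

# Chi-square distance: squared Euclidean distance between row profiles
# after dividing each column by the square root of its mass
scaled     <- sweep(profiles, 2, sqrt(col_mass), "/")
chisq_dist <- as.matrix(dist(scaled))^2

# Chi distance: square root of the chi-square distance
chi_dist <- sqrt(chisq_dist)
print(round(chisq_dist, 3))
```

Under this definition the distance between a and d exceeds 8, which illustrates that the values are not confined to the interval from 0 to 1.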
Finally, euclidean and manhattan metrics have the same definition as
in the base R function dist.
When choosing a metric one should consider the effect of outliers,
i.e. samples with large distance to all other samples.
Tanimoto and Average Distance take values between 0 and 1, and
therefore outlying distances have a limited effect.
Chi-square and Chi distances are not limited between 0 and 1,
i.e. some distances may be much larger than others. The Chi metric is
slightly more robust to outliers than the Chi-square metric.
For matrix or data.frame objects, x must be a
matrix with 0's and 1's (or FALSE and TRUE).
The usual definitions
are used for Tanimoto (which is equivalent to Jaccard's index),
Chi-square and Chi.
Average overlap between rows i and j is simply the
average between the proportion of elements in i also in
j and the proportion of elements in j also in i.
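For a binary matrix, Jaccard's index and the average overlap can be computed with a few matrix operations. The sketch below uses base R only and stops at the indices themselves; it makes no claim about how distGPS turns them into the values it returns. All object names are illustrative.

```r
# Same toy matrix as in the Examples section
x <- rbind(c(rep(0, 15), rep(1, 5)),
           c(rep(0, 15), rep(1, 5)),
           c(rep(0, 19), 1),
           c(rep(1, 5), rep(0, 15)))
rownames(x) <- letters[1:4]

inter <- x %*% t(x)   # inter[i, j]: number of 1's shared by rows i and j
sizes <- rowSums(x)   # number of 1's in each row

# Jaccard's index: shared elements divided by elements in either row
union_size <- outer(sizes, sizes, "+") - inter
jaccard    <- inter / union_size

# prop[i, j]: proportion of the elements of row i that are also in row j
prop        <- inter / sizes
avg_overlap <- 0.5 * (prop + t(prop))

print(round(jaccard, 2))
print(round(avg_overlap, 2))
```

Rows a and c share one element out of five in their union, so their Jaccard index is 0.2, while their average overlap is 0.5*(1/5 + 1/1) = 0.6.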
Value
Object of class distGPS, with matrix of pairwise
dissimilarities (distances) between objects.
Methods
distGPS:
signature(x='RangedDataList')
Each element in x is
assumed to indicate the binding sites for a different sample,
e.g. epigenetic factor. Typically space(x) indicates the
chromosome, start(x) the start position and end(x) the
end position (in bp). Strand information is ignored.
signature(x='matrix')
Rows in x contain individuals for
which we want to compute distances. Columns in x contain the
variables, and should only contain either 0's and 1's or FALSE
and TRUE.
splitDistGPS:
This is a set of internal classes and functions to be used in the
parallel computation of Multidimensional Scaling.
uniqueCount:
This function collapses a chroGPS-genes matrix or data frame so that
elements with the same combination of variables are aggregated into a
single entry. Elements are then identified by their unique pattern,
and a frequency count is also returned.
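The idea of collapsing rows into unique patterns with counts can be sketched in base R as follows. This is an illustration of the concept only: the object names are hypothetical, and the structure returned by uniqueCount itself may differ.

```r
# Toy matrix in which rows 'a' and 'b' share the same pattern
x <- rbind(c(rep(0, 15), rep(1, 5)),
           c(rep(0, 15), rep(1, 5)),
           c(rep(0, 19), 1),
           c(rep(1, 5), rep(0, 15)))
rownames(x) <- letters[1:4]

# Encode each row's combination of variables as a single string
pattern <- apply(x, 1, paste, collapse = "")

# Frequency of each distinct pattern
freq <- table(pattern)

# Keep one representative row per pattern and attach its count
keep      <- !duplicated(pattern)
collapsed <- x[keep, , drop = FALSE]
result    <- data.frame(pattern = unname(pattern[keep]),
                        count   = as.integer(freq[pattern[keep]]))
print(result)
```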
See Also
mds to create MDS-oriented objects, procrustesAdj for
Procrustes adjustment.
Examples
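library(chroGPS)
# Toy binary matrix: 4 elements (rows) by 20 variables (columns)
x <- rbind(c(rep(0,15),rep(1,5)),c(rep(0,15),rep(1,5)),c(rep(0,19),1),c(rep(1,5),rep(0,15)))
rownames(x) <- letters[1:4]
# Tanimoto distances, with and without collapsing duplicated rows
d <- distGPS(x,metric='tanimoto')
du <- distGPS(x,metric='tanimoto',uniqueRows=TRUE)
# MDS representation of the Tanimoto distances
mds1 <- mds(d)
mds1
plot(mds1)
# Same analysis with the chi-square metric
d <- distGPS(x,metric='chisquare')
mds1 <- mds(d)
mds1
plot(mds1)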
Results
> library(chroGPS)
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/chroGPS/distGPS.Rd_%03d_medium.png", width=480, height=480)
> ### Name: distGPS
> ### Title: Compute matrix with pairwise distances between objects. Several
> ### GPS metrics are available.
> ### Aliases: distGPS distGPS-methods distGPS,RangedDataList-method
> ### distGPS,data.frame-method distGPS,matrix-method
> ### as.matrix,distGPS-method splitDistGPS,data.frame-method
> ### splitDistGPS,matrix-method uniqueCount
> ### Keywords: multivariate clustering
>
> ### ** Examples
>
> x <- rbind(c(rep(0,15),rep(1,5)),c(rep(0,15),rep(1,5)),c(rep(0,19),1),c(rep(1,5),rep(0,15)))
> rownames(x) <- letters[1:4]
> d <- distGPS(x,metric='tanimoto')
> du <- distGPS(x,metric='tanimoto',uniqueRows=TRUE)
> mds1 <- mds(d)
> mds1
Object of class MDS approximating distances between 4 objects
R-squared= 1 Stress= 0
> plot(mds1)
> d <- distGPS(x,metric='chisquare')
> mds1 <- mds(d)
> mds1
Object of class MDS approximating distances between 4 objects
R-squared= NaN Stress= 0
> plot(mds1)
> dev.off()
null device
1
>