number of objects in each cluster - positive integer value or vector with the same size as nrow(means),
e.g. numObjects=c(50,20)
means
matrix of cluster means (e.g. means=matrix(c(0,8,0,8),2,2)). If means=NULL, the matrix is read from the means_<modelNumber>.csv file
cov
covariance matrix (the same for each cluster, e.g. cov=matrix(c(1, 0, 0, 1), 2, 2)).
If cov=NULL, the matrix is read from the
cov_<modelNumber>.csv file.
Note: you cannot use this argument to generate clusters with different covariance matrices.
To do that, set fixedCov to FALSE and use an appropriate model
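The roles of numObjects, means and cov described above can be illustrated with a minimal base-R sketch. It assumes clusters are drawn from a multivariate normal distribution (this section does not state the distribution) and it is not clusterSim's internal code; the helper draw_cluster is hypothetical.

```r
# Sketch only: base-R illustration, not clusterSim's internal code.
# Assumption: each cluster is multivariate normal with a shared covariance.
set.seed(1)
means      <- matrix(c(0, 8, 0, 8), 2, 2)  # one row per cluster, one column per variable
cov        <- matrix(c(1, 0, 0, 1), 2, 2)  # covariance matrix shared by all clusters
numObjects <- c(50, 20)                    # same length as nrow(means)

# chol() returns an upper-triangular R with t(R) %*% R equal to cov,
# so z %*% R has covariance cov when z has independent N(0, 1) columns.
R <- chol(cov)

# Hypothetical helper: draws the objects of cluster k.
draw_cluster <- function(k) {
  n <- numObjects[k]
  z <- matrix(rnorm(n * ncol(means)), n, ncol(means))
  sweep(z %*% R, 2, means[k, ], "+")       # shift by the k-th cluster mean
}

data     <- do.call(rbind, lapply(seq_len(nrow(means)), draw_cluster))
clusters <- rep(seq_len(nrow(means)), times = numObjects)
```

Rows of means correspond to clusters, which is why numObjects must have the same length as nrow(means).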
model
model number:
model=1 - no cluster structure. Observations are simulated from a uniform distribution over the unit hypercube, with the number of
dimensions (variables) given in the numNoisyVar argument;
model=2 - means and covariances are taken from the arguments means and cov (see Example 1);
model=21,22,... - if fixedCov=TRUE, means are read from means_<modelNumber>.csv
and the covariance matrix for all clusters is read from cov_<modelNumber>.csv;
if fixedCov=FALSE, means are read from means_<modelNumber>.csv
and covariance matrices are read separately for each cluster
from
cov_<modelNumber>_<clusterNumber>.csv
fixedCov
if fixedCov=TRUE, the covariance matrix is the same for all clusters;
if fixedCov=FALSE, each cluster is generated from a different covariance matrix - see model
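The file-naming rules for model=21,22,... can be summarised in a short sketch. The helper expected_input_files below is hypothetical (it is not part of clusterSim); it only builds the file names that the descriptions of model and fixedCov say must be present in the working directory.

```r
# Hypothetical helper (not part of clusterSim): lists the input files
# required for a file-based model, following the naming rules above.
expected_input_files <- function(modelNumber, numClusters, fixedCov = TRUE) {
  means_file <- sprintf("means_%d.csv", modelNumber)
  cov_files <- if (fixedCov) {
    # one covariance matrix shared by all clusters
    sprintf("cov_%d.csv", modelNumber)
  } else {
    # one covariance matrix per cluster
    sprintf("cov_%d_%d.csv", modelNumber, seq_len(numClusters))
  }
  c(means_file, cov_files)
}

expected_input_files(24, 3)
expected_input_files(25, 5, fixedCov = FALSE)
```

A check such as file.exists(expected_input_files(25, 5, fixedCov = FALSE)) before calling cluster.Gen may help to catch a missing file early.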
numCategories
number of categories (for ordinal data only). Positive integer value or vector with the same size as ncol(means) plus the number of noisy variables.
numNoisyVar
number of noisy variables. For model=1 this is the total number of variables
numOutliers
number of outliers (for metric and symbolic interval data only). A positive integer gives the number of outliers; a value from <0,1> gives the fraction of outliers in the whole data set
rangeOutliers
range for outliers (for metric and symbolic interval data only). The default range is [1, 10]. The outliers are generated independently for each variable for the whole data set from a uniform distribution. Each generated value is randomly either added to the maximum of the j-th variable or subtracted from the minimum of the j-th variable
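The outlier rule described for numOutliers and rangeOutliers can be read as the following base-R sketch. It is an interpretation of the text above, not clusterSim's code; in particular, rounding the fraction to a whole number of objects and the equal chance of going above or below are assumptions.

```r
# Sketch only: one reading of the outlier description, not clusterSim's code.
set.seed(2)
data          <- matrix(rnorm(200), 100, 2)  # stand-in for generated data
numOutliers   <- 0.1                         # value below 1: fraction of the data set
rangeOutliers <- c(1, 10)                    # default range from the text

# Interpret numOutliers as a count or as a fraction of the data set.
# Assumption: a fraction is rounded to the nearest whole object.
n_out <- if (numOutliers < 1) round(numOutliers * nrow(data)) else numOutliers

# For each variable j, draw offsets from the uniform range and either
# add them to the maximum or subtract them from the minimum of column j.
outliers <- sapply(seq_len(ncol(data)), function(j) {
  offset <- runif(n_out, rangeOutliers[1], rangeOutliers[2])
  above  <- sample(c(TRUE, FALSE), n_out, replace = TRUE)  # assumed 50/50 choice
  ifelse(above, max(data[, j]) + offset, min(data[, j]) - offset)
})
```

Because every offset is at least rangeOutliers[1], each generated value lies outside the observed range of its variable.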
inputType
"csv" - a dot as decimal point or "csv2" - a comma as decimal point in
means_<modelNumber>.csv and cov_<modelNumber>.csv files
inputHeader
inputHeader=TRUE indicates that the input files (means_<modelNumber>.csv; cov_<modelNumber...>.csv) contain a header row
inputRowNames
inputRowNames=TRUE indicates that the input files (means_<modelNumber>.csv; cov_<modelNumber...>.csv) contain a first column with row names or with the number of objects (positive integer values)
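The difference between the "csv" and "csv2" input types can be seen with base-R readers and writers. This is a sketch only: whether cluster.Gen reads its input files with read.csv and read.csv2 is an assumption, and the semicolon field separator for "csv2" input is inferred from the outputCsv2 description rather than stated for input files.

```r
# Sketch only: shows the two decimal conventions with base-R functions.
means <- matrix(c(0.5, 8.25, 0.5, 8.25), 2, 2)

f_csv  <- tempfile(fileext = ".csv")
f_csv2 <- tempfile(fileext = ".csv")
write.csv(means,  f_csv)   # dot as decimal point, comma as field separator
write.csv2(means, f_csv2)  # comma as decimal point, semicolon as field separator

# Both files contain a header row and a first column with row names,
# which corresponds to inputHeader=TRUE and inputRowNames=TRUE.
m1 <- as.matrix(read.csv(f_csv,   header = TRUE, row.names = 1))
m2 <- as.matrix(read.csv2(f_csv2, header = TRUE, row.names = 1))
```

Both m1 and m2 hold the same numeric values; reading a "csv2" file with read.csv instead would not parse the numbers as intended.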
outputCsv
optional, name of the csv file with generated data (first column contains id, second - cluster number and others - data)
outputCsv2
optional, name of the csv file (a comma as decimal point and a semicolon as field separator) with generated data (first column contains id, second - cluster number and others - data)
outputColNames
outputColNames=TRUE indicates that the output file (given by the outputCsv or outputCsv2 parameter) contains a first row with column names
outputRowNames
outputRowNames=TRUE indicates that the output file (given by the outputCsv or outputCsv2 parameter) contains a vector of row names
Details
See file $R_HOME/library/clusterSim/pdf/clusterGen_details.pdf for further details
Value
clusters
cluster number for each object; for model=1 each
object belongs to its own cluster, so this variable contains object numbers
data
generated data: for metric and ordinal data - a matrix with
objects in rows and variables in columns;
for symbolic interval data - a three-dimensional structure: the first dimension represents object number,
the second - variable number and the third dimension contains the lower and upper bounds of intervals
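The three-dimensional layout described for symbolic interval data can be pictured with a hand-built array. This is a sketch only: the values are made up for illustration, and the assumption that the lower bound sits at index 1 and the upper bound at index 2 of the third dimension follows the order in which the text lists them.

```r
# Sketch only: a hand-built stand-in with the shape described above.
n <- 4  # objects
p <- 2  # variables
centers <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), n, p)

intervals <- array(NA_real_, dim = c(n, p, 2))
intervals[, , 1] <- centers - 0.5   # lower bounds (assumed to be index 1)
intervals[, , 2] <- centers + 0.5   # upper bounds (assumed to be index 2)

intervals[3, 2, ]   # bounds of variable 2 for object 3: 29.5 30.5
```

Indexing as [object, variable, bound] keeps the first two dimensions aligned with the rows and columns of the matrix returned for metric and ordinal data.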
References
Billard, L., Diday, E. (2006), Symbolic data analysis. Conceptual statistics and data mining, Wiley, Chichester.
Qiu, W., Joe, H. (2006), Generation of random clusters with specified degree of separation, "Journal of Classification", vol. 23, 315-334.
Steinley, D., Henson, R. (2005), OCLUS: an analytic method for generating clusters with known overlap, "Journal of Classification", vol. 22, 221-250.
Walesiak, M., Dudek, A. (2008), Identification of noisy variables for nonmetric and symbolic data in cluster analysis, In: C. Preisach, H. Burkhardt, L. Schmidt-Thieme, R. Decker (Eds.), Data analysis, machine learning and applications, Springer-Verlag, Berlin, Heidelberg, 85-92.
Examples
# Example 1
library(clusterSim)
means <- matrix(c(0,7,0,7),2,2)
cov <- matrix(c(1,0,0,1),2,2)
grnd <- cluster.Gen(numObjects=60,means=means,cov=cov,model=2,
numOutliers=8)
colornames <- c("red","blue","green")
grnd$clusters[grnd$clusters==0]<-length(colornames)
plot(grnd$data,col=colornames[grnd$clusters],ask=TRUE)
# Example 2
library(clusterSim)
grnd <- cluster.Gen(50,model=4,dataType="m",numNoisyVar=2)
data <- as.matrix(grnd$data)
colornames <- c("red","blue","green")
plot(grnd$data,col=colornames[grnd$clusters],ask=TRUE)
# Example 3
library(clusterSim)
grnd<-cluster.Gen(50,model=4,dataType="o",numCategories=7, numNoisyVar=2)
plotCategorial(grnd$data,,grnd$clusters,ask=TRUE)
# Example 4 (1 nonnoisy variable and 2 noisy variables, 3 clusters)
library(clusterSim)
grnd <- cluster.Gen(c(40,60,20), model=2, means=c(2,14,25),
cov=c(1.5,1.5,1.5),numNoisyVar=2)
colornames <- c("red","blue","green")
plot(grnd$data,col=colornames[grnd$clusters],ask=TRUE)
# Example 5
library(clusterSim)
grnd <- cluster.Gen(c(20,35,20,25),model=14,dataType="m",numNoisyVar=1,
fixedCov=FALSE, numOutliers=0.1, outputCsv2="data14.csv")
data <- as.matrix(grnd$data)
colornames <- c("red","blue","green","brown","black")
grnd$clusters[grnd$clusters==0]<-length(colornames)
plot(grnd$data,col=colornames[grnd$clusters],ask=TRUE)
# Example 6 (this example needs files means_24.csv
# and cov_24.csv to be placed in working directory)
# library(clusterSim)
# grnd<-cluster.Gen(c(50,80,20),model=24,dataType="m",numNoisyVar=1,
# numOutliers=10, rangeOutliers=c(1,5))
# print(grnd)
# data <- as.data.frame(grnd$data)
# colornames<-c("red","blue","green","brown")
# grnd$clusters[grnd$clusters==0]<-length(colornames)
# plot(data,col=colornames[grnd$clusters],ask=TRUE)
# Example 7 (this example needs files means_25.csv, cov_25_1.csv,
# cov_25_2.csv, cov_25_3.csv, cov_25_4.csv and cov_25_5.csv
# to be placed in working directory)
# library(clusterSim)
# grnd<-cluster.Gen(c(40,30,20,35,45),model=25,numNoisyVar=3,fixedCov=FALSE)
# data <- as.data.frame(grnd$data)
# colornames<-c("red","blue","green","magenta","brown")
# plot(data,col=colornames[grnd$clusters],ask=TRUE)