The sensitivity, or conditional probability of correct classification, of cluster k is calculated as follows. First, among the observations whose true cluster label is k, the proportion assigned to each classified cluster is computed. The largest of these proportions is then taken as the conditional probability of correct classification. Because this calculation returns a sensitivity of 1 for every cluster when all observations are classified into a single cluster, we also report the observed cluster labels returned by the algorithms.
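The calculation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `sensitivities` and its arguments are hypothetical and are not part of the package.

```python
import numpy as np

def sensitivities(true_labels, pred_labels):
    """Sensitivity per true cluster, following the description above.

    Hypothetical helper for illustration only.
    """
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    result = {}
    for k in np.unique(true_labels):
        # Classified labels of the observations whose true label is k.
        assigned = pred_labels[true_labels == k]
        # Proportion of these observations in each classified cluster.
        _, counts = np.unique(assigned, return_counts=True)
        # The largest proportion is reported as the sensitivity of cluster k.
        result[k.item()] = counts.max() / assigned.size
    return result

true = [1, 1, 1, 1, 2, 2, 2, 2]
pred = [1, 1, 1, 2, 1, 1, 1, 1]
sens = sensitivities(true, pred)
```

In this toy input every observation from true cluster 2 is placed in classified cluster 1, yet its sensitivity is still 1. This is the situation in which the observed cluster labels are needed to interpret the sensitivities.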
This dataset consists of features of handwritten numerals (‘0’–‘9’) (K=10) extracted from a collection of Dutch utility maps. Two hundred patterns per class (for a total of N=2,000 patterns) have been digitized as binary images. Raw observations are 32x45 bitmaps, which are divided into nonoverlapping blocks of 2x3, and the number of pixels is counted in each block. This generates p=240 (16x15) variables recording the normalized counts of pixels in each block; each element is an integer in the range 0 to 6. The row names of DutchUtility contain the true digits, and its column names contain the position of the block from which the normalized counts of pixels are taken.
The robust sparse K-means clustering method of Kondo (2011). In this algorithm, sparse K-means (Witten and Tibshirani (2010)) is robustified by iteratively trimming a prespecified proportion of cases in both the weighted squared Euclidean distances and the squared Euclidean distances.
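As a rough illustration of one trimming step, the sketch below flags the cases to trim given each case's distance to its nearest cluster center. It assumes that the prespecified proportion `alpha` is trimmed separately under each distance and that the union of the two sets is dropped; this is one reading of the description above, not a statement of the package's implementation. The function name and arguments are hypothetical.

```python
import numpy as np

def trimmed_case_indices(weighted_dist, unweighted_dist, alpha):
    """Indices of cases flagged for trimming (hypothetical helper).

    weighted_dist, unweighted_dist: distance of each case to its nearest
    cluster center under the weighted and the plain squared Euclidean distance.
    alpha: prespecified proportion of cases to trim under each distance.
    """
    weighted_dist = np.asarray(weighted_dist, dtype=float)
    unweighted_dist = np.asarray(unweighted_dist, dtype=float)
    n_trim = int(np.floor(alpha * weighted_dist.size))
    if n_trim == 0:
        return np.array([], dtype=int)
    # Cases with the largest distances under each measure.
    worst_weighted = np.argsort(weighted_dist)[-n_trim:]
    worst_unweighted = np.argsort(unweighted_dist)[-n_trim:]
    # Assumption: a case is trimmed if it is flagged under either distance.
    return np.union1d(worst_weighted, worst_unweighted)

flagged = trimmed_case_indices(
    weighted_dist=[0.1, 0.2, 9.0, 0.3, 0.1],
    unweighted_dist=[0.5, 8.0, 0.4, 0.3, 0.2],
    alpha=0.2,
)
```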
The dataset contains n = 64 e-mail bodies in binary bag-of-words representation, which Filannino manually collected from the DBWorld mailing list. The DBWorld mailing list announces conferences, jobs, books, software and grants. Filannino applied a supervised learning algorithm to classify the e-mails as “announces of conferences” or “everything else”. Of the 64 e-mails, 29 are conference announcements and 35 are not.
This function returns a revised silhouette plot, the cluster centers in weighted squared Euclidean distances, and a matrix containing the weighted squared Euclidean distance between each case and each cluster center. Missing values are adjusted for.
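A minimal sketch of how such a case-by-center matrix of weighted squared Euclidean distances could be computed. The text does not say how missing values are adjusted; the rescaling used here (dividing by the share of feature weight that is actually observed) is purely an assumption for illustration, as are the function name and arguments.

```python
import numpy as np

def weighted_sq_distances(X, centers, weights):
    """N x K matrix of weighted squared Euclidean distances (hypothetical helper).

    Missing entries (NaN) are skipped, and the distance is rescaled by
    total weight / observed weight. This adjustment is an assumption.
    """
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    weights = np.asarray(weights, dtype=float)
    diff = X[:, None, :] - centers[None, :, :]      # shape (N, K, p)
    observed = ~np.isnan(diff)
    sq = np.where(observed, diff ** 2, 0.0)         # missing terms contribute 0
    dist = (sq * weights).sum(axis=2)
    used = (observed * weights).sum(axis=2)         # weight of observed features
    # Rescale; pairs with no observed weighted feature are left as NaN.
    scale = np.divide(weights.sum(), used,
                      out=np.full_like(used, np.nan), where=used > 0)
    return dist * scale

D = weighted_sq_distances(
    X=[[1.0, 2.0], [np.nan, 0.0]],
    centers=[[0.0, 0.0], [1.0, 1.0]],
    weights=[0.5, 0.5],
)
```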
The dataset describes n = 1797 digits from 0 to 9 (K = 10), handwritten by 13 subjects. Raw observations are 32x32 bitmaps, which are divided into nonoverlapping blocks of 4x4, and the number of on pixels is counted in each block. This generates p = 64 (= 8x8) variables recording the normalized counts of pixels in each block; each element is an integer in the range 0 to 16. The row names of the matrix optd contain the true labels (between 0 and 9), and its column names contain the position of the block in the original bitmap.
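The block-counting step described above can be illustrated as follows. This sketch only reproduces the idea of counting on pixels in nonoverlapping blocks; it is not the code used to prepare the dataset, and the function name is hypothetical.

```python
import numpy as np

def block_counts(bitmap, block_rows, block_cols):
    """Count on pixels in nonoverlapping blocks of a binary bitmap."""
    bitmap = np.asarray(bitmap)
    n_rows, n_cols = bitmap.shape
    if n_rows % block_rows or n_cols % block_cols:
        raise ValueError("bitmap dimensions must be multiples of the block size")
    # Split each axis into (number of blocks, block size), then sum within blocks.
    blocks = bitmap.reshape(n_rows // block_rows, block_rows,
                            n_cols // block_cols, block_cols)
    return blocks.sum(axis=(1, 3))

# A 32x32 bitmap with every pixel on gives an 8x8 grid of counts, each 16.
full = block_counts(np.ones((32, 32), dtype=int), 4, 4)

# A single on pixel at row 5, column 9 falls in block (1, 2).
single = np.zeros((32, 32), dtype=int)
single[5, 9] = 1
single_counts = block_counts(single, 4, 4)
```

Flattening the resulting 8x8 grid gives the 64 variables of one observation.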
The function Clest performs Clest (Dudoit and Fridlyand (2002)) with CER as the measure of agreement between two partitions (in each training set). The following clustering algorithms can be used: K-means, trimmed K-means, sparse K-means and robust sparse K-means.
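For readers unfamiliar with CER, the sketch below computes it for two partitions of the same cases. It assumes CER denotes the classification error rate, that is, the proportion of pairs of cases on which the two partitions disagree about whether the pair shares a cluster (one minus the Rand index); the text above does not define it. The function name `cer` is hypothetical and is not the package's own function.

```python
from itertools import combinations

def cer(labels_a, labels_b):
    """Proportion of case pairs on which two partitions disagree."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two label vectors of equal length >= 2")
    disagreements = 0
    n_pairs = 0
    for i, j in combinations(range(n), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        # A pair disagrees when it is together in one partition only.
        disagreements += same_a != same_b
        n_pairs += 1
    return disagreements / n_pairs

identical_up_to_relabeling = cer([1, 1, 2, 2], [2, 2, 1, 1])
crossed = cer([1, 1, 2, 2], [1, 2, 1, 2])
```

Because only pairwise co-membership is compared, the measure does not depend on how the clusters are numbered, which is why relabeled but otherwise identical partitions give 0.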