clustindex              package:cclust              R Documentation

_C_l_u_s_t_e_r _I_n_d_e_x_e_s

_D_e_s_c_r_i_p_t_i_o_n:

     Calculates the values of several clustering indexes for `clres',
     the result of a clustering algorithm, i.e., an object of a class
     such as "cclust". The values of the indexes can be used
     independently of each other to determine the number of clusters
     existing in a data set.

_U_s_a_g_e:

      clustindex ( clres, x, index = "all" ) 

_A_r_g_u_m_e_n_t_s:

   clres: A clustering result object of a class such as "cclust"

       x: Data matrix

   index: The indexes to calculate: "calinski", "cindex", "db",
          "hartigan", "ratkowsky", "scott", "marriot", "ball",
          "trcovw", "tracew", "friedman", "rubin", "ssi", "likelihood",
          or "all" for all of the indexes.

_D_e_t_a_i_l_s:

     The indexes fall into three groups, according to the statistics
     mainly used to compute them.
     The first group is based on the sum of squares within (SSW) and
     between (SSB) the clusters. These statistics measure the
     dispersion of the data points within a cluster and between the
     clusters, respectively. These indexes are:

        *  calinski: (SSB/(k-1))/(SSW/(n-k)), where n is the number of
           data points and k is the number of clusters.

        *  hartigan: log(SSB/SSW).

        *  ratkowsky: mean(sqrt(varSSB/varSST)), where varSSB stands
           for the SSB of each variable and varSST for the total sum
           of squares of each variable.

        *  ball: SSW/k, where k is the number of clusters.
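
     As a rough illustration of the first-group formulas, the
     following sketch computes them by hand in base R. It is not the
     code used by `clustindex': the partition comes from `kmeans' in
     the stats package rather than from `cclust', the names varSSW,
     varSSB, varSST and first_group are made up for this example, and
     the values need not match what `clustindex' returns if the
     package applies any additional scaling.

```r
# Sketch only: first-group indexes computed directly from their formulas.
# Assumption: the partition comes from stats::kmeans, not from cclust.
set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
km <- kmeans(x, centers = 2)
n <- nrow(x)            # number of data points
k <- length(km$size)    # number of clusters

# Per-variable sums of squares: total, within clusters, between clusters
varSST <- colSums(sweep(x, 2, colMeans(x))^2)
varSSW <- colSums((x - km$centers[km$cluster, , drop = FALSE])^2)
varSSB <- varSST - varSSW

SSW <- sum(varSSW)
SSB <- sum(varSSB)

first_group <- c(
  calinski  = (SSB / (k - 1)) / (SSW / (n - k)),
  hartigan  = log(SSB / SSW),
  ratkowsky = mean(sqrt(varSSB / varSST)),
  ball      = SSW / k
)
print(first_group)
```

     The totals SSW and SSB computed this way agree with the
     `tot.withinss' and `betweenss' components that `kmeans' reports,
     which gives a quick consistency check on the decomposition.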

     The second group is based on the statistics of T, i.e., the
     scatter matrix of the data points, and W, which is the sum of the
     scatter matrices in every group. These indexes are:

        *  scott: n log(|T|/|W|), where n is the number of data points
           and |.| stands for the determinant of a matrix.

        *  marriot: k^2 |W|, where k is the number of clusters.

        *  trcovw: Trace Cov W.

        *  tracew: Trace W.

        *  friedman: Trace W^(-1) B, where B is the scatter matrix of
           the cluster centers.

        *  rubin: |T|/|W|.
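
     The scatter matrices behind the second group can be sketched in
     base R as follows. This is an illustration under assumptions, not
     the implementation in `clustindex': the partition again comes
     from `kmeans', B is taken to be T - W (the between-cluster
     scatter), which is only one reading of "the scatter matrix of the
     cluster centers", and `trcovw' is left out because the page does
     not spell out its formula. The names Tmat, W, B and second_group
     are made up for this example.

```r
# Sketch only: second-group indexes from the scatter matrices T and W.
# Assumptions: partition from stats::kmeans; B defined as T - W.
set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
km <- kmeans(x, centers = 2)
n <- nrow(x)            # number of data points
k <- length(km$size)    # number of clusters

# Total scatter: cross-products of the data centered on the overall mean
Tmat <- crossprod(sweep(x, 2, colMeans(x)))
# Within scatter: cross-products of the data centered on their cluster mean
W <- crossprod(x - km$centers[km$cluster, , drop = FALSE])
# Between scatter (assumed definition)
B <- Tmat - W

second_group <- c(
  scott    = n * log(det(Tmat) / det(W)),
  marriot  = k^2 * det(W),
  tracew   = sum(diag(W)),
  friedman = sum(diag(solve(W) %*% B)),
  rubin    = det(Tmat) / det(W)
)
print(second_group)
```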

     The third group consists of four indexes that belong to neither
     of the previous groups and have nothing in common with each
     other.

        *  cindex: for binary data sets. The C-Index is a cluster
           similarity measure, expressed as:
           [d_w-min(d_w)]/[max(d_w)-min(d_w)],
           where d_w is the sum of all n_d within-cluster distances,
           min(d_w) is the sum of the n_d smallest pairwise distances
           in the data set, and max(d_w) is the sum of the n_d biggest
           pairwise distances. In order to compute the C-Index, all
           pairwise distances in the data set have to be computed and
           stored. In the case of binary data, storing the distances
           creates no problems since there are only a few possible
           distances. However, the computation of all distances can
           make this index prohibitive for large data sets.

        *  db: R=(1/n)*sum(R_i), where R_i stands for the maximum
           value of R_ij for i != j, and
           R_ij=(SSW_i+SSW_j)/DC_ij, where DC_ij is the distance
           between the centers of two clusters i, j.

        *  likelihood: under the assumption of independence of the
           variables within a cluster, a cluster solution can be
           regarded as a mixture model for the data, where the cluster
           centers give the probabilities for each variable to be 1.
           Therefore, the negative log-likelihood can be computed and
           used as a quality measure for a cluster solution. Note that
           the assumptions for applying special penalty terms, like in
           AIC or BIC, are not fulfilled in this model, and they also
           show no effect for these data sets.

        *  ssi: this ``Simple Structure Index'' combines three
           elements which influence the interpretability of a
           solution, i.e., the maximum difference of each variable
           between the clusters, the sizes of the most contrasting
           clusters and the deviation of a variable in the cluster
           centers compared to its overall mean. These three elements
           are multiplicatively combined and normalized to give a
           value between 0 and 1.
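
     The C-Index formula above can be sketched in base R as follows.
     This is only an illustration under assumptions: the binary data
     are simulated, the partition comes from `kmeans' rather than from
     `cclust', and the Manhattan distance (which equals the Hamming
     distance on 0/1 data) is chosen here because the page does not
     name a distance. The names d_w, n_d, min_dw, max_dw and cindex are
     made up for this example.

```r
# Sketch only: C-Index computed directly from its definition.
# Assumptions: simulated binary data, stats::kmeans partition,
# Manhattan (= Hamming on 0/1 data) as the pairwise distance.
set.seed(1)
x <- rbind(matrix(rbinom(100, 1, 0.2), ncol = 5),
           matrix(rbinom(100, 1, 0.8), ncol = 5))
cl <- kmeans(x, centers = 2)$cluster

d <- as.matrix(dist(x, method = "manhattan"))
upper <- upper.tri(d)                  # each pair of points counted once
dvals <- d[upper]                      # all pairwise distances
same  <- outer(cl, cl, "==")[upper]    # TRUE for within-cluster pairs

d_w <- sum(dvals[same])                # sum of within-cluster distances
n_d <- sum(same)                       # number of within-cluster pairs

sorted <- sort(dvals)
min_dw <- sum(sorted[seq_len(n_d)])         # n_d smallest distances
max_dw <- sum(rev(sorted)[seq_len(n_d)])    # n_d biggest distances

cindex <- (d_w - min_dw) / (max_dw - min_dw)
print(cindex)
```

     Since all pairwise distances are held in memory, the sketch also
     shows why the index becomes expensive for large data sets.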

_V_a_l_u_e:

     Returns a vector with the values of the indexes.

_A_u_t_h_o_r(_s):

     Evgenia Dimitriadou and Andreas Weingessel

_R_e_f_e_r_e_n_c_e_s:

     Andreas Weingessel, Evgenia Dimitriadou and Sara Dolnicar, An
     Examination Of Indexes For Determining The Number Of Clusters In
     Binary Data Sets,
     <URL: http://www.wu-wien.ac.at/am/wp99.htm#29>
     and the references therein.

_S_e_e _A_l_s_o:

     `cclust', `kmeans'

_E_x_a_m_p_l_e_s:

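     library(cclust)

     # a 2-dimensional example: two Gaussian clouds of 50 points each
     x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
                matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
     # k-means with 2 centers and at most 20 iterations
     cl <- cclust(x, 2, 20, verbose = TRUE, method = "kmeans")
     # compute all indexes for this clustering
     resultindexes <- clustindex(cl, x, index = "all")
     resultindexes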

