

   CClluusstteerriinngg LLaarrggee AApppplliiccaattiioonnss

        clara(x, k, metric = "euclidean", stand = FALSE, samples = 5,
              sampsize = 40 + 2 * k)

   AArrgguummeennttss::

          x: data matrix or dataframe, each row corresponds to
             an observation, and each column corresponds to a
             variable. All variables must be numeric.  Missing
             values (NAs) are allowed.

          k: integer, the number of clusters.  It is required
             that 0 < k < n where n is the number of
             observations.

     metric: character string specifying the metric to be used
             for calculating dissimilarities between
             observations.  The currently available options are
             "euclidean" and "manhattan".  Euclidean distances
             are root sum-of-squares of differences, and
             manhattan distances are the sum of absolute
             differences.
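        The two metrics can be checked by hand on a pair of
        observations.  The following is a minimal sketch in base R;
        the vectors `a' and `b' are made up for illustration, and
        dist() is used only as a cross-check, not as a statement of
        how `clara' computes dissimilarities internally.

```r
# Two made-up observations measured on three variables.
a <- c(1, 2, 3)
b <- c(4, 6, 3)

# Euclidean: root sum-of-squares of the differences.
d_euclid <- sqrt(sum((a - b)^2))   # differences are -3, -4, 0, so this is 5

# Manhattan: sum of the absolute differences.
d_manhat <- sum(abs(a - b))        # 3 + 4 + 0 = 7

# Base R's dist() agrees with both hand computations.
m <- rbind(a, b)
stopifnot(isTRUE(all.equal(d_euclid, as.numeric(dist(m, method = "euclidean")))))
stopifnot(isTRUE(all.equal(d_manhat, as.numeric(dist(m, method = "manhattan")))))
```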

      stand: logical flag: if TRUE, then the measurements in
             `x' are standardized before calculating the
             dissimilarities. Measurements are standardized for
             each variable (column), by subtracting the
             variable's mean value and dividing by the variable's
             mean absolute deviation.
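        The standardization described for `stand' can be sketched
        in base R as follows.  The helper name `standardize_mad' is
        hypothetical and not part of the package, and skipping NAs
        with `na.rm = TRUE' is an assumption of this sketch.

```r
# Hypothetical helper (not part of the package): centre each column on its
# mean and divide by its mean absolute deviation about that mean.
standardize_mad <- function(x) {
  x <- as.matrix(x)
  centre  <- colMeans(x, na.rm = TRUE)            # per-variable mean
  centred <- sweep(x, 2, centre, "-")             # subtract the mean
  spread  <- colMeans(abs(centred), na.rm = TRUE) # mean absolute deviation
  sweep(centred, 2, spread, "/")                  # divide by the deviation
}

# Made-up data: two variables on very different scales.
x <- cbind(height = c(150, 160, 170, 180),
           weight = c(50, 60, 80, 90))
z <- standardize_mad(x)

# After standardization each column has mean 0 and mean absolute deviation 1.
round(colMeans(z), 10)
colMeans(abs(z))
```

        Note that this differs from scale(), which by default
        divides by the standard deviation rather than the mean
        absolute deviation.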

    samples: integer, number of samples to be drawn from the
             dataset.

   sampsize: integer, number of observations in each sample.
             `sampsize' should be higher than the number of
             clusters (`k') and at most the number of
             observations (nrow(`x')).
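        As an illustration of how the arguments fit together, the
        sketch below checks the stated constraints on `k' and
        `sampsize' before calling `clara' with non-default settings.
        It assumes the `cluster' package is installed; the data are
        simulated and the chosen values (10 samples, manhattan
        metric) are arbitrary.

```r
library(cluster)  # provides clara()

set.seed(1)  # makes the simulated data below repeatable
x <- rbind(cbind(rnorm(200, 0, 8), rnorm(200, 0, 8)),
           cbind(rnorm(300, 50, 8), rnorm(300, 50, 8)))

k <- 2
n <- nrow(x)
sampsize <- 40 + 2 * k   # the default from the usage line; 44 when k = 2

# Constraints stated in the argument descriptions.
stopifnot(k > 0, k < n, sampsize > k, sampsize <= n)

fit <- clara(x, k, metric = "manhattan", stand = TRUE,
             samples = 10, sampsize = sampsize)
fit$clusinfo   # per-cluster summary
```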

   DDeessccrriippttiioonn::

        Returns a list representing a clustering of the data
        into `k' clusters.

   DDeettaaiillss::

        `clara' is fully described in chapter 3 of Kaufman and
        Rousseeuw (1990).  Compared to other partitioning
        methods such as `pam', it can deal with much larger
        datasets. Internally, this is achieved by considering
        sub-datasets of fixed size, so that the time and
        storage requirements become linear in nrow(`x') rather
        than quadratic.
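        A back-of-envelope count shows why fixed-size sub-datasets
        help.  The figures below are a rough sketch, assuming a
        dataset of 3000 observations with the default `samples' and
        `sampsize'; they count dissimilarities only and are not an
        account of what the implementation actually stores.

```r
n <- 3000                 # observations (assumed for illustration)
k <- 3                    # clusters
samples  <- 5             # default number of samples
sampsize <- 40 + 2 * k    # default sample size: 46

# All pairwise dissimilarities of the full dataset: grows like n^2.
full_pairs <- n * (n - 1) / 2                   # 4498500

# Per sample: pairwise dissimilarities inside the sub-dataset (fixed size)
# plus one dissimilarity from every observation to each of the k medoids.
sample_pairs <- sampsize * (sampsize - 1) / 2   # 1035
assign_dists <- n * k                           # 9000, linear in n
clara_total  <- samples * (sample_pairs + assign_dists)   # 50175

full_pairs / clara_total   # roughly a 90-fold reduction in this case
```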

        Each sub-dataset is partitioned into `k' clusters using
        the same algorithm as in the `pam' function.  Once `k'
        representative objects (medoids) have been selected from
        the sub-dataset, each observation of the entire dataset
        is assigned to the nearest medoid.  The sum of the
        dissimilarities of the observations to their closest
        medoid is used as a measure of the quality of the
        clustering.  The sub-dataset for which the sum is minimal
        is retained.  A further analysis is carried out on the
        final partition.  Each sub-dataset after the first is
        forced to contain the medoids obtained from the best
        sub-dataset found so far.  Randomly drawn observations
        are added to this set until `sampsize' has been reached.
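        The procedure described above can be sketched in a few
        lines of R.  This is a simplified illustration, not the
        package's implementation: the function name `clara_sketch'
        is hypothetical, only Euclidean distances are used, missing
        values are not handled, and the further analysis of the
        final partition is omitted.  It assumes the `cluster'
        package is installed, for `pam'.

```r
library(cluster)  # provides pam()

# Hypothetical, simplified sketch of the sampling scheme (Euclidean only).
clara_sketch <- function(x, k, samples = 5, sampsize = 40 + 2 * k) {
  x <- as.matrix(x)
  n <- nrow(x)
  stopifnot(k > 0, k < n, sampsize > k, sampsize <= n)
  best <- list(cost = Inf, medoid_rows = integer(0))

  for (s in seq_len(samples)) {
    # Start from the best medoids found so far, then add randomly drawn
    # observations until the sub-dataset has sampsize rows.
    keep <- best$medoid_rows
    pool <- setdiff(seq_len(n), keep)
    rows <- c(keep, pool[sample.int(length(pool), sampsize - length(keep))])

    # Partition the sub-dataset around k medoids.
    fit <- pam(x[rows, , drop = FALSE], k)
    medoid_rows <- rows[fit$id.med]   # medoid indices in the full dataset

    # Distance from every observation of the full dataset to each medoid.
    med <- x[medoid_rows, , drop = FALSE]
    d <- sapply(seq_len(k), function(j) sqrt(colSums((t(x) - med[j, ])^2)))

    # Quality measure: sum of dissimilarities to the closest medoid.
    cost <- sum(apply(d, 1, min))
    if (cost < best$cost) {
      best <- list(cost = cost, medoid_rows = medoid_rows,
                   clustering = apply(d, 1, which.min))
    }
  }
  best
}

# Usage on simulated data with two well-separated groups.
set.seed(42)
x <- rbind(cbind(rnorm(200, 0, 8), rnorm(200, 0, 8)),
           cbind(rnorm(300, 50, 8), rnorm(300, 50, 8)))
res <- clara_sketch(x, 2)
table(res$clustering)
```

        Only the `sampsize' rows of each sub-dataset are passed to
        `pam', so no dissimilarity matrix for the full dataset is
        ever formed.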

   VVaalluuee::

        an object of class `"clara"' representing the
        clustering.  See clara.object for details.

   BBAACCKKGGRROOUUNNDD::

        Cluster analysis divides a dataset into groups
        (clusters) of observations that are similar to each other.
        Partitioning methods like `pam', `clara', and `fanny'
        require that the number of clusters be given by the
        user.  Hierarchical methods like `agnes', `diana', and
        `mona' construct a hierarchy of clusterings, with the
        number of clusters ranging from one to the number of
        observations.

   NNOOTTEE::

        For small datasets (say with fewer than 200
        observations), the function `pam' can be used directly.

   RReeffeerreenncceess::

        Kaufman, L. and Rousseeuw, P.J. (1990).  Finding Groups
        in Data: An Introduction to Cluster Analysis.  Wiley,
        New York.

        Struyf, A., Hubert, M. and Rousseeuw, P.J. (1997).
        Integrating Robust Clustering Techniques in S-PLUS,
        Computational Statistics and Data Analysis, 26, 17-37.

   SSeeee AAllssoo::

        `clara.object', `pam', `partition.object',
        `plot.partition'.

   EExxaammpplleess::

        ## generate 500 objects, divided into 2 clusters.
        x <- rbind(cbind(rnorm(200,0,8), rnorm(200,0,8)),
                   cbind(rnorm(300,50,8), rnorm(300,50,8)))
        clarax <- clara(x, 2)
        clarax
        clarax$clusinfo
        plot(clarax)

        ## `xclara' is an artificial data set with 3 clusters of 1000 bivariate
        ## objects each.
        data(xclara)
        ## Plot similar to Figure 5 in Struyf et al (1996)
        plot(clara(xclara, 3), ask = TRUE)

