src/library/stats/man/kmeans.Rd - R - Git at Google

 % File src/library/stats/man/kmeans.Rd
 % Part of the R package, https://www.R-project.org
 % Copyright 1995-2018 R Core Team
 % Distributed under GPL 2 or later

 \name{kmeans}
 \alias{kmeans}
 \alias{print.kmeans}
 \alias{fitted.kmeans}
 \title{
 K-Means Clustering
 }
 \description{
   Perform k-means clustering on a data matrix.
 }
 \usage{
 kmeans(x, centers, iter.max = 10, nstart = 1,
        algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",
                      "MacQueen"), trace=FALSE)
 \method{fitted}{kmeans}(object, method = c("centers", "classes"), ...)
 }
 \arguments{
   \item{x}{numeric matrix of data, or an object that can be coerced to
     such a matrix (such as a numeric vector or a data frame with all
     numeric columns).}
   \item{centers}{either the number of clusters, say \eqn{k}, or a set of
     initial (distinct) cluster centres.  If a number, a random set of
     (distinct) rows in \code{x} is chosen as the initial centres.}
   \item{iter.max}{the maximum number of iterations allowed.}
   \item{nstart}{if \code{centers} is a number, how many random sets
     should be chosen?}
   \item{algorithm}{character: may be abbreviated.  Note that
     \code{"Lloyd"} and \code{"Forgy"} are alternative names for one
     algorithm.}
   \item{object}{an \R object of class \code{"kmeans"}, typically the
     result \code{ob} of \code{ob <- kmeans(..)}.}
   \item{method}{character: may be abbreviated. \code{"centers"} causes
     \code{fitted} to return cluster centers (one for each input point) and
     \code{"classes"} causes \code{fitted} to return a vector of class
     assignments.}
   \item{trace}{logical or integer number, currently only used in the
     default method (\code{"Hartigan-Wong"}): if positive (or true),
     tracing information on the progress of the algorithm is
     produced.  Higher values may produce more tracing information.}
   \item{\dots}{not used.}
 }
 \details{
   The data given by \code{x} are clustered by the \eqn{k}-means method,
   which aims to partition the points into \eqn{k} groups such that the
   sum of squares from points to the assigned cluster centres is minimized.
   At the minimum, all cluster centres are at the mean of their Voronoi
   sets (the set of data points which are nearest to the cluster centre).

   The algorithm of Hartigan and Wong (1979) is used by default.  Note
   that some authors use \eqn{k}-means to refer to a specific algorithm
   rather than the general method: most commonly the algorithm given by
   MacQueen (1967) but sometimes that given by Lloyd (1957) and Forgy
   (1965).  The Hartigan--Wong algorithm generally does a better job than
   either of those, but trying several random starts (\code{nstart}\eqn{>
   1}) is often recommended.  In rare cases, when some of the points
   (rows of \code{x}) are extremely close, the algorithm may not converge
   in the \dQuote{Quick-Transfer} stage, signalling a warning (and
   returning \code{ifault = 4}).  Slight
   rounding of the data may be advisable in that case.

   For ease of programmatic exploration, \eqn{k=1} is allowed, notably
   returning the center and \code{withinss}.

   Except for the Lloyd--Forgy method, \eqn{k} clusters will always be
   returned if a number is specified.
   If an initial matrix of centres is supplied, it is possible that
   no point will be closest to one or more centres, which is currently
   an error for the Hartigan--Wong method.
 }
 \value{
   \code{kmeans} returns an object of class \code{"kmeans"} which has a
   \code{print} and a \code{fitted} method.  It is a list with at least
   the following components:
   \item{cluster}{
     A vector of integers (from \code{1:k}) indicating the cluster to
     which each point is allocated.
   }
   \item{centers}{A matrix of cluster centres.}
   \item{totss}{The total sum of squares.}
   \item{withinss}{Vector of within-cluster sum of squares,
     one component per cluster.}
   \item{tot.withinss}{Total within-cluster sum of squares,
     i.e.\sspace{}\code{sum(withinss)}.}
   \item{betweenss}{The between-cluster sum of squares,
     i.e.\sspace{}\code{totss-tot.withinss}.}
   \item{size}{The number of points in each cluster.}
   \item{iter}{The number of (outer) iterations.}
   \item{ifault}{integer: indicator of a possible algorithm problem
     -- for experts.}
 }
 \references{
   Forgy, E. W. (1965).
   Cluster analysis of multivariate data: efficiency vs interpretability
   of classifications.
   \emph{Biometrics}, \bold{21}, 768--769.

   Hartigan, J. A. and Wong, M. A. (1979).
   Algorithm AS 136: A K-means clustering algorithm.
   \emph{Applied Statistics}, \bold{28}, 100--108.
   \doi{10.2307/2346830}.

   Lloyd, S. P. (1957, 1982).
   Least squares quantization in PCM.
   Technical Note, Bell Laboratories.
   Published in 1982 in \emph{IEEE Transactions on Information Theory},
   \bold{28}, 128--137.

   MacQueen, J. (1967).
   Some methods for classification and analysis of multivariate
   observations.
   In \emph{Proceedings of the Fifth Berkeley Symposium on  Mathematical
     Statistics and Probability},
   eds L. M. Le Cam & J. Neyman,
   \bold{1}, pp.\sspace{}281--297.
   Berkeley, CA: University of California Press.
 }
 \examples{
 require(graphics)

 # a 2-dimensional example
 x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
            matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
 colnames(x) <- c("x", "y")
 (cl <- kmeans(x, 2))
 plot(x, col = cl$cluster)
 points(cl$centers, col = 1:2, pch = 8, cex = 2)

 # sum of squares
 ss <- function(x) sum(scale(x, scale = FALSE)^2)

 ## cluster centers "fitted" to each obs.:
 fitted.x <- fitted(cl);  head(fitted.x)
 resid.x <- x - fitted(cl)

 ## Equalities : ----------------------------------
 cbind(cl[c("betweenss", "tot.withinss", "totss")], # the same two columns
          c(ss(fitted.x), ss(resid.x),    ss(x)))
 stopifnot(all.equal(cl$ totss,        ss(x)),
 	  all.equal(cl$ tot.withinss, ss(resid.x)),
 	  ## these three are the same:
 	  all.equal(cl$ betweenss,    ss(fitted.x)),
 	  all.equal(cl$ betweenss, cl$totss - cl$tot.withinss),
 	  ## and hence also
 	  all.equal(ss(x), ss(fitted.x) + ss(resid.x))
 	  )

 kmeans(x,1)$withinss # trivial one-cluster, (its W.SS == ss(x))

 ## random starts do help here with too many clusters
 ## (and are often recommended anyway!):
 (cl <- kmeans(x, 5, nstart = 25))
 plot(x, col = cl$cluster)
 points(cl$centers, col = 1:5, pch = 8)
 }
 \keyword{multivariate}
 \keyword{cluster}
	% File src/library/stats/man/kmeans.Rd
	% Part of the R package, https://www.R-project.org
	% Copyright 1995-2018 R Core Team
	% Distributed under GPL 2 or later

	\name{kmeans}
	\alias{kmeans}
	\alias{print.kmeans}
	\alias{fitted.kmeans}
	\title{
	K-Means Clustering
	}
	\description{
	Perform k-means clustering on a data matrix.
	}
	\usage{
	kmeans(x, centers, iter.max = 10, nstart = 1,
	algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",
	"MacQueen"), trace=FALSE)
	\method{fitted}{kmeans}(object, method = c("centers", "classes"), ...)
	}
	\arguments{
	\item{x}{numeric matrix of data, or an object that can be coerced to
	such a matrix (such as a numeric vector or a data frame with all
	numeric columns).}
	\item{centers}{either the number of clusters, say \eqn{k}, or a set of
	initial (distinct) cluster centres. If a number, a random set of
	(distinct) rows in \code{x} is chosen as the initial centres.}
	\item{iter.max}{the maximum number of iterations allowed.}
	\item{nstart}{if \code{centers} is a number, how many random sets
	should be chosen?}
	\item{algorithm}{character: may be abbreviated. Note that
	\code{"Lloyd"} and \code{"Forgy"} are alternative names for one
	algorithm.}
	\item{object}{an \R object of class \code{"kmeans"}, typically the
	result \code{ob} of \code{ob <- kmeans(..)}.}
	\item{method}{character: may be abbreviated. \code{"centers"} causes
	\code{fitted} to return cluster centers (one for each input point) and
	\code{"classes"} causes \code{fitted} to return a vector of class
	assignments.}
	\item{trace}{logical or integer number, currently only used in the
	default method (\code{"Hartigan-Wong"}): if positive (or true),
	tracing information on the progress of the algorithm is
	produced. Higher values may produce more tracing information.}
	\item{\dots}{not used.}
	}
	\details{
	The data given by \code{x} are clustered by the \eqn{k}-means method,
	which aims to partition the points into \eqn{k} groups such that the
	sum of squares from points to the assigned cluster centres is minimized.
	At the minimum, all cluster centres are at the mean of their Voronoi
	sets (the set of data points which are nearest to the cluster centre).

	The algorithm of Hartigan and Wong (1979) is used by default. Note
	that some authors use \eqn{k}-means to refer to a specific algorithm
	rather than the general method: most commonly the algorithm given by
	MacQueen (1967) but sometimes that given by Lloyd (1957) and Forgy
	(1965). The Hartigan--Wong algorithm generally does a better job than
	either of those, but trying several random starts (\code{nstart}\eqn{>
	1}) is often recommended. In rare cases, when some of the points
	(rows of \code{x}) are extremely close, the algorithm may not converge
	in the \dQuote{Quick-Transfer} stage, signalling a warning (and
	returning \code{ifault = 4}). Slight
	rounding of the data may be advisable in that case.

	For ease of programmatic exploration, \eqn{k=1} is allowed, notably
	returning the center and \code{withinss}.

	Except for the Lloyd--Forgy method, \eqn{k} clusters will always be
	returned if a number is specified.
	If an initial matrix of centres is supplied, it is possible that
	no point will be closest to one or more centres, which is currently
	an error for the Hartigan--Wong method.
	}
	\value{
	\code{kmeans} returns an object of class \code{"kmeans"} which has a
	\code{print} and a \code{fitted} method. It is a list with at least
	the following components:
	\item{cluster}{
	A vector of integers (from \code{1:k}) indicating the cluster to
	which each point is allocated.
	}
	\item{centers}{A matrix of cluster centres.}
	\item{totss}{The total sum of squares.}
	\item{withinss}{Vector of within-cluster sum of squares,
	one component per cluster.}
	\item{tot.withinss}{Total within-cluster sum of squares,
	i.e.\sspace{}\code{sum(withinss)}.}
	\item{betweenss}{The between-cluster sum of squares,
	i.e.\sspace{}\code{totss-tot.withinss}.}
	\item{size}{The number of points in each cluster.}
	\item{iter}{The number of (outer) iterations.}
	\item{ifault}{integer: indicator of a possible algorithm problem
	-- for experts.}
	}
	\references{
	Forgy, E. W. (1965).
	Cluster analysis of multivariate data: efficiency vs interpretability
	of classifications.
	\emph{Biometrics}, \bold{21}, 768--769.

	Hartigan, J. A. and Wong, M. A. (1979).
	Algorithm AS 136: A K-means clustering algorithm.
	\emph{Applied Statistics}, \bold{28}, 100--108.
	\doi{10.2307/2346830}.

	Lloyd, S. P. (1957, 1982).
	Least squares quantization in PCM.
	Technical Note, Bell Laboratories.
	Published in 1982 in \emph{IEEE Transactions on Information Theory},
	\bold{28}, 128--137.

	MacQueen, J. (1967).
	Some methods for classification and analysis of multivariate
	observations.
	In \emph{Proceedings of the Fifth Berkeley Symposium on Mathematical
	Statistics and Probability},
	eds L. M. Le Cam & J. Neyman,
	\bold{1}, pp.\sspace{}281--297.
	Berkeley, CA: University of California Press.
	}
	\examples{
	require(graphics)

	# a 2-dimensional example
	x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
	matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
	colnames(x) <- c("x", "y")
	(cl <- kmeans(x, 2))
	plot(x, col = cl$cluster)
	points(cl$centers, col = 1:2, pch = 8, cex = 2)

	# sum of squares
	ss <- function(x) sum(scale(x, scale = FALSE)^2)

	## cluster centers "fitted" to each obs.:
	fitted.x <- fitted(cl); head(fitted.x)
	resid.x <- x - fitted(cl)

	## Equalities : ----------------------------------
	cbind(cl[c("betweenss", "tot.withinss", "totss")], # the same two columns
	c(ss(fitted.x), ss(resid.x), ss(x)))
	stopifnot(all.equal(cl$ totss, ss(x)),
	all.equal(cl$ tot.withinss, ss(resid.x)),
	## these three are the same:
	all.equal(cl$ betweenss, ss(fitted.x)),
	all.equal(cl$ betweenss, cl$totss - cl$tot.withinss),
	## and hence also
	all.equal(ss(x), ss(fitted.x) + ss(resid.x))
	)

	kmeans(x,1)$withinss # trivial one-cluster, (its W.SS == ss(x))

	## random starts do help here with too many clusters
	## (and are often recommended anyway!):
	(cl <- kmeans(x, 5, nstart = 25))
	plot(x, col = cl$cluster)
	points(cl$centers, col = 1:5, pch = 8)
	}
	\keyword{multivariate}
	\keyword{cluster}