src/library/stats/man/ks.test.Rd - R - Git at Google

 % File src/library/stats/man/ks.test.Rd
 % Part of the R package, https://www.R-project.org
 % Copyright 1995-2022 R Core Team
 % Distributed under GPL 2 or later

 \name{ks.test}
 \alias{ks.test}
 \alias{ks.test.default}
 \alias{ks.test.formula}
 \encoding{UTF-8}
 \title{Kolmogorov-Smirnov Tests}
 \description{
   Perform a one- or two-sample Kolmogorov-Smirnov test.
 }
 \usage{
 ks.test(x, \dots)
 \method{ks.test}{default}(x, y, \dots,
         alternative = c("two.sided", "less", "greater"),
         exact = NULL, simulate.p.value = FALSE, B = 2000)
 \method{ks.test}{formula}(formula, data, subset, na.action, \dots)
 }
 \arguments{
   \item{x}{a numeric vector of data values.}
   \item{y}{either a numeric vector of data values, or a character string
     naming a cumulative distribution function or an actual cumulative
     distribution function such as \code{pnorm}.  Only continuous CDFs
     are valid.}
   \item{\dots}{for the default method, parameters of the distribution
     specified (as a character string) by \code{y}.  Otherwise, further
     arguments to be passed to or from methods.}
   \item{alternative}{indicates the alternative hypothesis and must be
     one of \code{"two.sided"} (default), \code{"less"}, or
     \code{"greater"}.  You can specify just the initial letter of the
     value, but the argument name must be given in full.
     See \sQuote{Details} for the meanings of the possible values.}
   \item{exact}{\code{NULL} or a logical indicating whether an exact
     p-value should be computed.  See \sQuote{Details} for the meaning of
     \code{NULL}.}
   \item{simulate.p.value}{a logical indicating whether to compute
     p-values by Monte Carlo simulation.}
   \item{B}{an integer specifying the number of replicates used in the
     Monte Carlo test.}
   \item{formula}{a formula of the form \code{lhs ~ rhs} where \code{lhs}
     is a numeric variable giving the data values and \code{rhs} either
     \code{1} for a one-sample test or a factor with two levels giving
     the corresponding groups for a two-sample test.}
   \item{data}{an optional matrix or data frame (or similar: see
     \code{\link{model.frame}}) containing the variables in the
     formula \code{formula}.  By default the variables are taken from
     \code{environment(formula)}.}
   \item{subset}{an optional vector specifying a subset of observations
     to be used.}
   \item{na.action}{a function which indicates what should happen when
     the data contain \code{NA}s.  Defaults to
     \code{getOption("na.action")}.}
 }
 \details{
   If \code{y} is numeric, a two-sample (Smirnov) test of the null hypothesis
   that \code{x} and \code{y} were drawn from the same \emph{continuous}
   distribution is performed.

   Alternatively, \code{y} can be a character string naming a continuous
   (cumulative) distribution function, or such a function.  In this case,
   a one-sample (Kolmogorov) test is carried out of the null that the distribution
   function which generated \code{x} is distribution \code{y} with
   parameters specified by \code{\dots}.

   The presence of ties always generates a warning, since continuous
   distributions do not generate them.  If the ties arose from rounding
   the tests may be approximately valid, but even modest amounts of
   rounding can have a significant effect on the calculated statistic.

   Missing values are silently omitted from \code{x} and (in the
   two-sample case) \code{y}.

   The possible values \code{"two.sided"}, \code{"less"} and
   \code{"greater"} of \code{alternative} specify the null hypothesis
   that the true distribution function of \code{x} is equal to, not less
   than or not greater than the hypothesized distribution function
   (one-sample case) or the distribution function of \code{y} (two-sample
   case), respectively.  This is a comparison of cumulative distribution
   functions, and the test statistic is the maximum difference in value,
   with the statistic in the \code{"greater"} alternative being
   \eqn{D^+ = \max_u [ F_x(u) - F_y(u) ]}{D^+ = max[F_x(u) - F_y(u)]}.
   Thus in the two-sample case \code{alternative = "greater"} includes
   distributions for which \code{x} is stochastically \emph{smaller} than
   \code{y} (the CDF of \code{x} lies above and hence to the left of that
   for \code{y}), in contrast to \code{\link{t.test}} or
   \code{\link{wilcox.test}}.

   Exact p-values are not available for the one-sample case in the
   presence of ties.
   If \code{exact = NULL} (the default), an
   exact p-value is computed if the sample size is less than 100 in the
   one-sample case \emph{and there are no ties}, and if the product of
   the sample sizes is less than 10000 in the two-sample case, with or
   without ties (using the algorithm described in Schröer and Trenkler, 1995).
   Otherwise, asymptotic distributions are used whose approximations may
   be inaccurate in small samples.  In the one-sample two-sided case,
   exact p-values are obtained as described in Marsaglia, Tsang & Wang
   (2003) (but not using the optional approximation in the right tail, so
   this can be slow for small p-values).  The formula of Birnbaum &
   Tingey (1951) is used for the one-sample one-sided case.

   If a one-sample test is used, the parameters specified in
   \code{\dots} must be pre-specified and not estimated from the data.
   There is some more refined distribution theory for the KS test with
   estimated parameters (see Durbin, 1973), but that is not implemented
   in \code{ks.test}.
 }
 \value{
   A list inheriting from classes \code{"ks.test"} and \code{"htest"}
   containing the following components:
   \item{statistic}{the value of the test statistic.}
   \item{p.value}{the p-value of the test.}
   \item{alternative}{a character string describing the alternative
     hypothesis.}
   \item{method}{a character string indicating what type of test was
     performed.}
   \item{data.name}{a character string giving the name(s) of the data.}
 }
 \source{
   The two-sided one-sample distribution comes \emph{via}
   Marsaglia, Tsang and Wang (2003).

   Exact distributions for the two-sample (Smirnov) test are computed
   by the algorithm proposed Schröer (1991) and Schröer & Trenkler (1995).

 }
 \references{
   Z. W. Birnbaum and Fred H. Tingey (1951).
   One-sided confidence contours for probability distribution functions.
   \emph{The Annals of Mathematical Statistics}, \bold{22}/4, 592--596.
   \doi{10.1214/aoms/1177729550}.

   William J. Conover (1971).
   \emph{Practical Nonparametric Statistics}.
   New York: John Wiley & Sons.
   Pages 295--301 (one-sample Kolmogorov test),
   309--314 (two-sample Smirnov test).

   Durbin, J. (1973).
   \emph{Distribution theory for tests based on the sample distribution
     function}.
   SIAM.

   W. Feller (1948).
   On the Kolmogorov-Smirnov limit theorems for empirical distributions.
   \emph{The Annals of Mathematical Statistics}, \bold{19}(2), 177--189.
   \doi{10.1214/aoms/1177730243}.

   George Marsaglia, Wai Wan Tsang and Jingbo Wang (2003).
   Evaluating Kolmogorov's distribution.
   \emph{Journal of Statistical Software}, \bold{8}/18.
   \doi{10.18637/jss.v008.i18}.

   Gunar Schröer (1991),
   Computergestützte statistische Inferenz am Beispiel der
   Kolmogorov-Smirnov Tests.
   Diplomarbeit Universität Osnabrück.

   Gunar Schröer and Dietrich Trenkler (1995).
   Exact and Randomization Distributions of Kolmogorov-Smirnov Tests for
   Two or Three Samples.
   \emph{Computational Statistics & Data Analysis}, \bold{20}(2),
   185--202.
   \doi{10.1016/0167-9473(94)00040-P}.
 }
 \seealso{
   \code{\link{psmirnov}}.

   \code{\link{shapiro.test}} which performs the Shapiro-Wilk test for
   normality.
 }
 \examples{
 require("graphics")

 x <- rnorm(50)
 y <- runif(30)
 # Do x and y come from the same distribution?
 ks.test(x, y)
 # Does x come from a shifted gamma distribution with shape 3 and rate 2?
 ks.test(x+2, "pgamma", 3, 2) # two-sided, exact
 ks.test(x+2, "pgamma", 3, 2, exact = FALSE)
 ks.test(x+2, "pgamma", 3, 2, alternative = "gr")

 # test if x is stochastically larger than x2
 x2 <- rnorm(50, -1)
 plot(ecdf(x), xlim = range(c(x, x2)))
 plot(ecdf(x2), add = TRUE, lty = "dashed")
 t.test(x, x2, alternative = "g")
 wilcox.test(x, x2, alternative = "g")
 ks.test(x, x2, alternative = "l")

 # with ties, example from Schröer and Trenkler (1995)
 # D = 3 / 7, p = 0.2424242
 ks.test(c(1, 2, 2, 3, 3), c(1, 2, 3, 3, 4, 5, 6), exact = TRUE)

 # formula interface, see ?wilcox.test
 ks.test(Ozone ~ Month, data = airquality,
         subset = Month \%in\% c(5, 8))
 }
 \keyword{htest}
	% File src/library/stats/man/ks.test.Rd
	% Part of the R package, https://www.R-project.org
	% Copyright 1995-2022 R Core Team
	% Distributed under GPL 2 or later

	\name{ks.test}
	\alias{ks.test}
	\alias{ks.test.default}
	\alias{ks.test.formula}
	\encoding{UTF-8}
	\title{Kolmogorov-Smirnov Tests}
	\description{
	Perform a one- or two-sample Kolmogorov-Smirnov test.
	}
	\usage{
	ks.test(x, \dots)
	\method{ks.test}{default}(x, y, \dots,
	alternative = c("two.sided", "less", "greater"),
	exact = NULL, simulate.p.value = FALSE, B = 2000)
	\method{ks.test}{formula}(formula, data, subset, na.action, \dots)
	}
	\arguments{
	\item{x}{a numeric vector of data values.}
	\item{y}{either a numeric vector of data values, or a character string
	naming a cumulative distribution function or an actual cumulative
	distribution function such as \code{pnorm}. Only continuous CDFs
	are valid.}
	\item{\dots}{for the default method, parameters of the distribution
	specified (as a character string) by \code{y}. Otherwise, further
	arguments to be passed to or from methods.}
	\item{alternative}{indicates the alternative hypothesis and must be
	one of \code{"two.sided"} (default), \code{"less"}, or
	\code{"greater"}. You can specify just the initial letter of the
	value, but the argument name must be given in full.
	See \sQuote{Details} for the meanings of the possible values.}
	\item{exact}{\code{NULL} or a logical indicating whether an exact
	p-value should be computed. See \sQuote{Details} for the meaning of
	\code{NULL}.}
	\item{simulate.p.value}{a logical indicating whether to compute
	p-values by Monte Carlo simulation.}
	\item{B}{an integer specifying the number of replicates used in the
	Monte Carlo test.}
	\item{formula}{a formula of the form \code{lhs ~ rhs} where \code{lhs}
	is a numeric variable giving the data values and \code{rhs} either
	\code{1} for a one-sample test or a factor with two levels giving
	the corresponding groups for a two-sample test.}
	\item{data}{an optional matrix or data frame (or similar: see
	\code{\link{model.frame}}) containing the variables in the
	formula \code{formula}. By default the variables are taken from
	\code{environment(formula)}.}
	\item{subset}{an optional vector specifying a subset of observations
	to be used.}
	\item{na.action}{a function which indicates what should happen when
	the data contain \code{NA}s. Defaults to
	\code{getOption("na.action")}.}
	}
	\details{
	If \code{y} is numeric, a two-sample (Smirnov) test of the null hypothesis
	that \code{x} and \code{y} were drawn from the same \emph{continuous}
	distribution is performed.

	Alternatively, \code{y} can be a character string naming a continuous
	(cumulative) distribution function, or such a function. In this case,
	a one-sample (Kolmogorov) test is carried out of the null that the distribution
	function which generated \code{x} is distribution \code{y} with
	parameters specified by \code{\dots}.

	The presence of ties always generates a warning, since continuous
	distributions do not generate them. If the ties arose from rounding
	the tests may be approximately valid, but even modest amounts of
	rounding can have a significant effect on the calculated statistic.

	Missing values are silently omitted from \code{x} and (in the
	two-sample case) \code{y}.

	The possible values \code{"two.sided"}, \code{"less"} and
	\code{"greater"} of \code{alternative} specify the null hypothesis
	that the true distribution function of \code{x} is equal to, not less
	than or not greater than the hypothesized distribution function
	(one-sample case) or the distribution function of \code{y} (two-sample
	case), respectively. This is a comparison of cumulative distribution
	functions, and the test statistic is the maximum difference in value,
	with the statistic in the \code{"greater"} alternative being
	\eqn{D^+ = \max_u [ F_x(u) - F_y(u) ]}{D^+ = max[F_x(u) - F_y(u)]}.
	Thus in the two-sample case \code{alternative = "greater"} includes
	distributions for which \code{x} is stochastically \emph{smaller} than
	\code{y} (the CDF of \code{x} lies above and hence to the left of that
	for \code{y}), in contrast to \code{\link{t.test}} or
	\code{\link{wilcox.test}}.

	Exact p-values are not available for the one-sample case in the
	presence of ties.
	If \code{exact = NULL} (the default), an
	exact p-value is computed if the sample size is less than 100 in the
	one-sample case \emph{and there are no ties}, and if the product of
	the sample sizes is less than 10000 in the two-sample case, with or
	without ties (using the algorithm described in Schröer and Trenkler, 1995).
	Otherwise, asymptotic distributions are used whose approximations may
	be inaccurate in small samples. In the one-sample two-sided case,
	exact p-values are obtained as described in Marsaglia, Tsang & Wang
	(2003) (but not using the optional approximation in the right tail, so
	this can be slow for small p-values). The formula of Birnbaum &
	Tingey (1951) is used for the one-sample one-sided case.

	If a one-sample test is used, the parameters specified in
	\code{\dots} must be pre-specified and not estimated from the data.
	There is some more refined distribution theory for the KS test with
	estimated parameters (see Durbin, 1973), but that is not implemented
	in \code{ks.test}.
	}
	\value{
	A list inheriting from classes \code{"ks.test"} and \code{"htest"}
	containing the following components:
	\item{statistic}{the value of the test statistic.}
	\item{p.value}{the p-value of the test.}
	\item{alternative}{a character string describing the alternative
	hypothesis.}
	\item{method}{a character string indicating what type of test was
	performed.}
	\item{data.name}{a character string giving the name(s) of the data.}
	}
	\source{
	The two-sided one-sample distribution comes \emph{via}
	Marsaglia, Tsang and Wang (2003).

	Exact distributions for the two-sample (Smirnov) test are computed
	by the algorithm proposed Schröer (1991) and Schröer & Trenkler (1995).

	}
	\references{
	Z. W. Birnbaum and Fred H. Tingey (1951).
	One-sided confidence contours for probability distribution functions.
	\emph{The Annals of Mathematical Statistics}, \bold{22}/4, 592--596.
	\doi{10.1214/aoms/1177729550}.

	William J. Conover (1971).
	\emph{Practical Nonparametric Statistics}.
	New York: John Wiley & Sons.
	Pages 295--301 (one-sample Kolmogorov test),
	309--314 (two-sample Smirnov test).

	Durbin, J. (1973).
	\emph{Distribution theory for tests based on the sample distribution
	function}.
	SIAM.

	W. Feller (1948).
	On the Kolmogorov-Smirnov limit theorems for empirical distributions.
	\emph{The Annals of Mathematical Statistics}, \bold{19}(2), 177--189.
	\doi{10.1214/aoms/1177730243}.

	George Marsaglia, Wai Wan Tsang and Jingbo Wang (2003).
	Evaluating Kolmogorov's distribution.
	\emph{Journal of Statistical Software}, \bold{8}/18.
	\doi{10.18637/jss.v008.i18}.

	Gunar Schröer (1991),
	Computergestützte statistische Inferenz am Beispiel der
	Kolmogorov-Smirnov Tests.
	Diplomarbeit Universität Osnabrück.

	Gunar Schröer and Dietrich Trenkler (1995).
	Exact and Randomization Distributions of Kolmogorov-Smirnov Tests for
	Two or Three Samples.
	\emph{Computational Statistics & Data Analysis}, \bold{20}(2),
	185--202.
	\doi{10.1016/0167-9473(94)00040-P}.
	}
	\seealso{
	\code{\link{psmirnov}}.

	\code{\link{shapiro.test}} which performs the Shapiro-Wilk test for
	normality.
	}
	\examples{
	require("graphics")

	x <- rnorm(50)
	y <- runif(30)
	# Do x and y come from the same distribution?
	ks.test(x, y)
	# Does x come from a shifted gamma distribution with shape 3 and rate 2?
	ks.test(x+2, "pgamma", 3, 2) # two-sided, exact
	ks.test(x+2, "pgamma", 3, 2, exact = FALSE)
	ks.test(x+2, "pgamma", 3, 2, alternative = "gr")

	# test if x is stochastically larger than x2
	x2 <- rnorm(50, -1)
	plot(ecdf(x), xlim = range(c(x, x2)))
	plot(ecdf(x2), add = TRUE, lty = "dashed")
	t.test(x, x2, alternative = "g")
	wilcox.test(x, x2, alternative = "g")
	ks.test(x, x2, alternative = "l")

	# with ties, example from Schröer and Trenkler (1995)
	# D = 3 / 7, p = 0.2424242
	ks.test(c(1, 2, 2, 3, 3), c(1, 2, 3, 3, 4, 5, 6), exact = TRUE)

	# formula interface, see ?wilcox.test
	ks.test(Ozone ~ Month, data = airquality,
	subset = Month \%in\% c(5, 8))
	}
	\keyword{htest}