
 % File src/library/stats/man/fisher.test.Rd
 % Part of the R package, https://www.R-project.org
 % Copyright 1995-2022 R Core Team
 % Distributed under GPL 2 or later

 \name{fisher.test}
 \alias{fisher.test}
 \title{Fisher's Exact Test for Count Data}
 \description{
   Performs Fisher's exact test for testing the null of independence of
   rows and columns in a contingency table with fixed marginals.
 }
 \usage{
 fisher.test(x, y = NULL, workspace = 200000, hybrid = FALSE,
             hybridPars = c(expect = 5, percent = 80, Emin = 1),
             control = list(), or = 1, alternative = "two.sided",
             conf.int = TRUE, conf.level = 0.95,
             simulate.p.value = FALSE, B = 2000)
 }
 \arguments{
   \item{x}{either a two-dimensional contingency table in matrix form,
     or a factor object.}
   \item{y}{a factor object; ignored if \code{x} is a matrix.}
   \item{workspace}{an integer specifying the size of the workspace
     used in the network algorithm, in units of 4 bytes.  Only used for
     non-simulated p-values in larger than \eqn{2 \times 2}{2 by 2} tables.
     Since \R version 3.5.0, this also increases the internal stack size,
     which allows larger problems to be solved, although these can
     sometimes need hours.  In such cases, \code{simulate.p.value = TRUE}
     may be more reasonable.}
   \item{hybrid}{a logical.  Only used for larger than \eqn{2 \times 2}{2 by 2}
     tables, in which cases it indicates whether the exact probabilities
     (default) or a hybrid approximation thereof should be computed.}
   \item{hybridPars}{a numeric vector of length 3, by default describing
     \dQuote{Cochran's conditions} for the validity of the chi-squared
     approximation, see \sQuote{Details}.}
   \item{control}{a list with named components for low level algorithm
     control.  At present the only one used is \code{"mult"}, a positive
     integer \eqn{\ge 2} with default 30 used only for larger than
     \eqn{2 \times 2}{2 by 2} tables.  This says how many times as much
     space should be allocated to paths as to keys: see file
     \file{fexact.c} in the sources of this package.}
   \item{or}{the hypothesized odds ratio.  Only used in the
     \eqn{2 \times 2}{2 by 2} case.}
   \item{alternative}{indicates the alternative hypothesis and must be
     one of \code{"two.sided"}, \code{"greater"} or \code{"less"}.
     You can specify just the initial letter.  Only used in the
     \eqn{2 \times 2}{2 by 2} case.}
   \item{conf.int}{logical indicating if a confidence interval for the
     odds ratio in a \eqn{2 \times 2}{2 by 2} table should be
     computed (and returned).}
   \item{conf.level}{confidence level for the returned confidence
     interval.  Only used in the \eqn{2 \times 2}{2 by 2} case and if
     \code{conf.int = TRUE}.}
   \item{simulate.p.value}{a logical indicating whether to compute
     p-values by Monte Carlo simulation, in larger than \eqn{2 \times
       2}{2 by 2} tables.}
   \item{B}{an integer specifying the number of replicates used in the
     Monte Carlo test.}
 }
 \value{
   A list with class \code{"htest"} containing the following components:
   \item{p.value}{the p-value of the test.}
   \item{conf.int}{a confidence interval for the odds ratio.
     Only present in the \eqn{2 \times 2}{2 by 2} case and if argument
     \code{conf.int = TRUE}.}
   \item{estimate}{an estimate of the odds ratio.  Note that the
     \emph{conditional} Maximum Likelihood Estimate (MLE) rather than the
     unconditional MLE (the sample odds ratio) is used.
     Only present in the \eqn{2 \times 2}{2 by 2} case.}
   \item{null.value}{the odds ratio under the null, \code{or}.
     Only present in the \eqn{2 \times 2}{2 by 2} case.}
   \item{alternative}{a character string describing the alternative
     hypothesis.}
   \item{method}{the character string
     \code{"Fisher's Exact Test for Count Data"}.}
   \item{data.name}{a character string giving the name(s) of the data.}
 }
 \details{
   If \code{x} is a matrix, it is taken as a two-dimensional contingency
   table, and hence its entries should be nonnegative integers.
   Otherwise, both \code{x} and \code{y} must be vectors or factors of the same
   length.  Incomplete cases are removed, vectors are coerced into
   factor objects, and the contingency table is computed from these.

   For \eqn{2 \times 2}{2 by 2} cases, p-values are obtained directly
   using the (central or non-central) hypergeometric
   distribution.  Otherwise, computations are based on a C version of the
   FORTRAN subroutine FEXACT, which implements the network algorithm
   developed by Mehta and Patel (1983, 1986) and improved by Clarkson,
   Fan and Joe (1993).  The FORTRAN code can be obtained from
   \url{https://www.netlib.org/toms/643}.  Note that this code fails
   (with an error message) when the entries of the table are too large.
   (It transposes the table if necessary so that it has no more rows than
   columns.  One constraint is that the product of the row marginals be
   less than \eqn{2^{31} - 1}{2^31 - 1}.)

   For \eqn{2 \times 2}{2 by 2} tables, the null of conditional
   independence is equivalent to the hypothesis that the odds ratio
   equals one.  \sQuote{Exact} inference can be based on observing that in
   general, given all marginal totals fixed, the first element of the
   contingency table has a non-central hypergeometric distribution with
   non-centrality parameter given by the odds ratio (Fisher, 1935).  The
   alternative for a one-sided test is based on the odds ratio, so
   \code{alternative = "greater"} is a test of the odds ratio being bigger
   than \code{or}.

   Two-sided tests are based on the probabilities of the tables, and take
   as \sQuote{more extreme} all tables with probabilities less than or
   equal to that of the observed table, the p-value being the sum of such
   probabilities.

   For larger than \eqn{2 \times 2}{2 by 2} tables and \code{hybrid = TRUE},
   asymptotic chi-squared probabilities are only used if the
   \sQuote{Cochran conditions} (or modified version thereof) specified by
   \code{hybridPars = c(expect = 5, percent = 80, Emin = 1)} are
   satisfied, that is if no cell has expected counts less than
   \code{1} (\code{= Emin}) and more than 80\% (\code{= percent}) of the
   cells have expected counts at least 5 (\code{= expect}), otherwise
   the exact calculation is used.  A corresponding \code{if()} decision
   is made for all sub-tables considered.
   %
   By accident, \R versions 3.0.0 to 3.4.1 (inclusive) used \code{180}
   instead of \code{80} for \code{percent}, i.e., \code{hybridPars[2]}
   (all of the \code{hybridPars} values were hard-coded prior to
   \R 3.5.0).  Consequently, in these versions of \R,
   \code{hybrid = TRUE} never made a difference.

   In the \eqn{r \times c}{r x c} case with \eqn{r > 2} or \eqn{c > 2},
   internal tables can get too large for the exact test in which case an
   error is signalled.  Apart from increasing \code{workspace}
   sufficiently, which then may lead to very long running times, using
   \code{simulate.p.value = TRUE} may then often be sufficient and hence
   advisable.

   Simulation is done conditional on the row and column marginals, and
   works only if the marginals are strictly positive.  (A C translation
   of the algorithm of Patefield (1981) is used.)
   Note that the default number of replicates (\code{B = 2000}) implies a
   minimum p-value of about 0.0005 (\eqn{1/(B+1)}).
 }
 \references{
   Agresti, A. (1990).
   \emph{Categorical data analysis}.
   New York: Wiley.
   Pages 59--66.

   Agresti, A. (2002).
   \emph{Categorical data analysis}. Second edition.
   New York: Wiley.
   Pages 91--101.

   Fisher, R. A. (1935).
   The logic of inductive inference.
   \emph{Journal of the Royal Statistical Society Series A}, \bold{98},
   39--54.
   \doi{10.2307/2342435}.

   Fisher, R. A. (1962).
   Confidence limits for a cross-product ratio.
   \emph{Australian Journal of Statistics}, \bold{4}, 41.
   \doi{10.1111/j.1467-842X.1962.tb00285.x}.

   Fisher, R. A. (1970).
   \emph{Statistical Methods for Research Workers}.
   Oliver & Boyd.

   Mehta, Cyrus R. and Patel, Nitin R. (1983).
   A network algorithm for performing Fisher's exact test in \eqn{r
   \times c}{r x c} contingency tables.
   \emph{Journal of the American Statistical Association}, \bold{78},
   427--434.
   \doi{10.1080/01621459.1983.10477989}.

   Mehta, C. R. and Patel, N. R. (1986).
   Algorithm 643: FEXACT, a FORTRAN subroutine for Fisher's exact test
   on unordered \eqn{r \times c}{r x c} contingency tables.
   \emph{ACM Transactions on Mathematical Software}, \bold{12},
   154--161.
   \doi{10.1145/6497.214326}.

   Clarkson, D. B., Fan, Y. and Joe, H. (1993).
   A Remark on Algorithm 643: FEXACT: An Algorithm for Performing
   Fisher's Exact Test in \eqn{r \times c}{r x c} Contingency Tables.
   \emph{ACM Transactions on Mathematical Software}, \bold{19},
   484--488.
   \doi{10.1145/168173.168412}.

   Patefield, W. M. (1981).
   Algorithm AS 159: An efficient method of generating r x c tables
   with given row and column totals.
   \emph{Applied Statistics}, \bold{30}, 91--97.
   \doi{10.2307/2346669}.
 }
 \seealso{
   \code{\link{chisq.test}}

   \code{fisher.exact} in package \CRANpkg{exact2x2} for alternative
   interpretations of two-sided tests and confidence intervals for
   \eqn{2 \times 2}{2 by 2} tables.
 }
 \examples{
 ## Agresti (1990, p. 61f; 2002, p. 91) Fisher's Tea Drinker
 ## A British woman claimed to be able to distinguish whether milk or
 ##  tea was added to the cup first.  To test, she was given 8 cups of
 ##  tea, in four of which milk was added first.  The null hypothesis
 ##  is that there is no association between the true order of pouring
 ##  and the woman's guess, the alternative that there is a positive
 ##  association (that the odds ratio is greater than 1).
 TeaTasting <-
 matrix(c(3, 1, 1, 3),
        nrow = 2,
        dimnames = list(Guess = c("Milk", "Tea"),
                        Truth = c("Milk", "Tea")))
 fisher.test(TeaTasting, alternative = "greater")
 ## => p = 0.2429, association could not be established
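 ## A sketch (not part of fisher.test itself) of where this p-value comes
 ## from: with all margins fixed at 4, the count in the first cell is
 ## hypergeometric under the null, and "greater" sums P(X >= 3).
 ## 'tea' is an illustrative copy of the counts in TeaTasting.
 tea <- matrix(c(3, 1, 1, 3), nrow = 2)
 p.tea <- phyper(tea[1, 1] - 1, m = sum(tea[, 1]), n = sum(tea[, 2]),
                 k = sum(tea[1, ]), lower.tail = FALSE)
 p.tea  # 17/70, about 0.2429
 all.equal(p.tea, fisher.test(tea, alternative = "greater")$p.value)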

 ## Fisher (1962, 1970), Criminal convictions of like-sex twins
 Convictions <- matrix(c(2, 10, 15, 3), nrow = 2,
                       dimnames =
                         list(c("Dizygotic", "Monozygotic"),
                              c("Convicted", "Not convicted")))
 Convictions
 fisher.test(Convictions, alternative = "less")
 fisher.test(Convictions, conf.int = FALSE)
 fisher.test(Convictions, conf.level = 0.95)$conf.int
 fisher.test(Convictions, conf.level = 0.99)$conf.int
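 ## A sketch of the two-sided rule described in 'Details', assuming the
 ## null odds ratio or = 1: sum the probabilities of all tables that are
 ## no more probable than the observed one.  'tw' is an illustrative copy
 ## of Convictions; the small relative tolerance is an assumption meant
 ## to guard against rounding when probabilities tie.
 tw <- matrix(c(2, 10, 15, 3), nrow = 2)
 m <- sum(tw[, 1]); n <- sum(tw[, 2]); k <- sum(tw[1, ])
 support <- max(0, k - n):min(k, m)   # possible values of the first cell
 d <- dhyper(support, m, n, k)        # null probability of each table
 d.obs <- d[support == tw[1, 1]]
 p.two <- sum(d[d <= d.obs * (1 + 1e-7)])
 all.equal(p.two, fisher.test(tw)$p.value)
 ## The reported estimate is the conditional MLE, not the sample odds ratio:
 c(sample = (2 * 3) / (15 * 10),
   conditional = unname(fisher.test(tw)$estimate))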

 ## An r x c table: Agresti (2002, p. 57) Job Satisfaction
 Job <- matrix(c(1,2,1,0, 3,3,6,1, 10,10,14,9, 6,7,12,11), 4, 4,
            dimnames = list(income = c("< 15k", "15-25k", "25-40k", "> 40k"),
                      satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS")))
 fisher.test(Job) # 0.7827
 fisher.test(Job, simulate.p.value = TRUE, B = 1e5) # also close to 0.78
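 ## A rough sketch of a Monte Carlo p-value, not the internal code: tables
 ## with the observed margins are drawn with r2dtable() and ranked by
 ## -sum(lfactorial(table)), which equals the log-probability of a table
 ## given its margins up to an additive constant.  Any tolerance for ties
 ## is ignored here; 'job', 'stat' and the seed are illustrative choices.
 job <- matrix(c(1,2,1,0, 3,3,6,1, 10,10,14,9, 6,7,12,11), 4, 4)
 B <- 2000
 set.seed(1)  # arbitrary seed, only for reproducibility
 stat <- function(tab) -sum(lfactorial(tab))
 sims <- vapply(r2dtable(B, rowSums(job), colSums(job)), stat, numeric(1))
 p.sim <- (1 + sum(sims <= stat(job))) / (B + 1)
 p.sim
 1 / (B + 1)  # the smallest value p.sim can take, about 0.0005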

 ## 6th example in Mehta & Patel's JASA paper
 MP6 <- rbind(
         c(1,2,2,1,1,0,1),
         c(2,0,0,2,3,0,0),
         c(0,1,1,1,2,7,3),
         c(1,1,2,0,0,0,1),
         c(0,1,1,1,1,0,0))
 fisher.test(MP6)
 # Exactly the same p-value, as Cochran's conditions are never met:
 fisher.test(MP6, hybrid=TRUE)
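 ## A sketch of the check behind the comment above, applied to the full
 ## table only (fisher.test makes the decision for sub-tables as well) and
 ## following the wording in 'Details' with the default hybridPars.
 ## 'mp', 'E' and 'hp' are illustrative names; 'mp' copies MP6.
 mp <- rbind(
         c(1,2,2,1,1,0,1),
         c(2,0,0,2,3,0,0),
         c(0,1,1,1,2,7,3),
         c(1,1,2,0,0,0,1),
         c(0,1,1,1,1,0,0))
 E <- outer(rowSums(mp), colSums(mp)) / sum(mp)  # expected counts
 hp <- c(expect = 5, percent = 80, Emin = 1)
 min(E) >= hp[["Emin"]]    # FALSE: the smallest expected count is 16/39
 100 * mean(E >= hp[["expect"]]) > hp[["percent"]]  # FALSE: no cell reaches 5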
 }
 \keyword{htest}