 % File src/library/stats/man/lm.Rd
 % Part of the R package, https://www.R-project.org
 % Copyright 1995-2018 R Core Team
 % Distributed under GPL 2 or later

 \name{lm}
 \alias{lm}
 %\alias{print.lm}
 \concept{regression}
 \title{Fitting Linear Models}
 \description{
   \code{lm} is used to fit linear models.
   It can be used to carry out regression,
   single stratum analysis of variance and
   analysis of covariance (although \code{\link{aov}} may provide a more
   convenient interface for these).
 }
 \usage{
 lm(formula, data, subset, weights, na.action,
    method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
    singular.ok = TRUE, contrasts = NULL, offset, \dots)
 }
 \arguments{
   \item{formula}{an object of class \code{"\link{formula}"} (or one that
     can be coerced to that class): a symbolic description of the
     model to be fitted.  The details of model specification are given
     under \sQuote{Details}.}

   \item{data}{an optional data frame, list or environment (or object
     coercible by \code{\link{as.data.frame}} to a data frame) containing
     the variables in the model.  If not found in \code{data}, the
     variables are taken from \code{environment(formula)},
     typically the environment from which \code{lm} is called.}

   \item{subset}{an optional vector specifying a subset of observations
     to be used in the fitting process.}

   \item{weights}{an optional vector of weights to be used in the fitting
     process.  Should be \code{NULL} or a numeric vector.
     If non-NULL, weighted least squares is used with weights
     \code{weights} (that is, minimizing \code{sum(w*e^2)}); otherwise
     ordinary least squares is used.  See also \sQuote{Details}.}

   \item{na.action}{a function which indicates what should happen
     when the data contain \code{NA}s.  The default is set by
     the \code{na.action} setting of \code{\link{options}}, and is
     \code{\link{na.fail}} if that is unset.  The \sQuote{factory-fresh}
     default is \code{\link{na.omit}}.  Another possible value is
     \code{NULL}, no action.  Value \code{\link{na.exclude}} can be useful:
     it pads residuals and fitted values with \code{NA}s for the omitted
     cases.}

   \item{method}{the method to be used; for fitting, currently only
     \code{method = "qr"} is supported; \code{method = "model.frame"} returns
     the model frame (the same as with \code{model = TRUE}, see below).}

   \item{model, x, y, qr}{logicals.  If \code{TRUE} the corresponding
     components of the fit (the model frame, the model matrix, the
     response, the QR decomposition) are returned.
   }

   \item{singular.ok}{logical. If \code{FALSE} (the default in S but
     not in \R) a singular fit is an error.}

   \item{contrasts}{an optional list. See the \code{contrasts.arg}
     of \code{\link{model.matrix.default}}.}

   \item{offset}{this can be used to specify an \emph{a priori} known
     component to be included in the linear predictor during fitting.
     This should be \code{NULL} or a numeric vector or matrix of extents
     matching those of the response.  One or more \code{\link{offset}} terms can be
     included in the formula instead or as well, and if more than one are
     specified their sum is used.  See \code{\link{model.offset}}.}

   \item{\dots}{additional arguments to be passed to the low level
     regression fitting functions (see below).}
 }
 \details{
   Models for \code{lm} are specified symbolically.  A typical model has
   the form \code{response ~ terms} where \code{response} is the (numeric)
   response vector and \code{terms} is a series of terms which specifies a
   linear predictor for \code{response}.  A terms specification of the form
   \code{first + second} indicates all the terms in \code{first} together
   with all the terms in \code{second} with duplicates removed.  A
   specification of the form \code{first:second} indicates the set of
   terms obtained by taking the interactions of all terms in \code{first}
   with all terms in \code{second}.  The specification \code{first*second}
   indicates the \emph{cross} of \code{first} and \code{second}.  This is
   the same as \code{first + second + first:second}.

   If the formula includes an \code{\link{offset}}, this is evaluated and
   subtracted from the response.

   If \code{response} is a matrix, a linear model is fitted separately by
   least-squares to each column of the matrix.

   See \code{\link{model.matrix}} for some further details.  The terms in
   the formula will be re-ordered so that main effects come first,
   followed by the interactions, all second-order, all third-order and so
   on; to avoid this, pass a \code{terms} object as the formula (see
   \code{\link{aov}} and \code{demo(glm.vr)} for an example).

   A formula has an implied intercept term.  To remove this use either
   \code{y ~ x - 1} or \code{y ~ 0 + x}.  See \code{\link{formula}} for
   more details of allowed formulae.

   Non-\code{NULL} \code{weights} can be used to indicate that
   different observations have different variances (with the values in
   \code{weights} being inversely proportional to the variances); or
   equivalently, when the elements of \code{weights} are positive
   integers \eqn{w_i}, that each response \eqn{y_i} is the mean of
   \eqn{w_i} unit-weight observations (including the case that there
   are \eqn{w_i} observations equal to \eqn{y_i} and the data have been
   summarized). However, in the latter case, notice that within-group
   variation is not used.  Therefore, the sigma estimate and residual
   degrees of freedom may be suboptimal; in the case of replication
   weights, even wrong. Hence, standard errors and analysis of variance
   tables should be treated with care.

   \code{lm} calls the lower level functions \code{\link{lm.fit}}, etc,
   see below, for the actual numerical computations.  For programming
   only, you may consider doing likewise.

   All of \code{weights}, \code{subset} and \code{offset} are evaluated
   in the same way as variables in \code{formula}, that is first in
   \code{data} and then in the environment of \code{formula}.
 }
 \value{
   \code{lm} returns an object of \code{\link{class}} \code{"lm"} or, for
   multiple responses, of class \code{c("mlm", "lm")}.

   The functions \code{summary} and \code{\link{anova}} are used to
   obtain and print a summary and analysis of variance table of the
   results.  The generic accessor functions \code{coefficients},
   \code{effects}, \code{fitted.values} and \code{residuals} extract
   various useful features of the value returned by \code{lm}.

   An object of class \code{"lm"} is a list containing at least the
   following components:

   \item{coefficients}{a named vector of coefficients.}
   \item{residuals}{the residuals, that is response minus fitted values.}
   \item{fitted.values}{the fitted mean values.}
   \item{rank}{the numeric rank of the fitted linear model.}
   \item{weights}{(only for weighted fits) the specified weights.}
   \item{df.residual}{the residual degrees of freedom.}
   \item{call}{the matched call.}
   \item{terms}{the \code{\link{terms}} object used.}
   \item{contrasts}{(only where relevant) the contrasts used.}
   \item{xlevels}{(only where relevant) a record of the levels of the
     factors used in fitting.}
   \item{offset}{the offset used (missing if none were used).}
   \item{y}{if requested, the response used.}
   \item{x}{if requested, the model matrix used.}
   \item{model}{if requested (the default), the model frame used.}
   \item{na.action}{(where relevant) information returned by
     \code{\link{model.frame}} on the special handling of \code{NA}s.}

   In addition, non-null fits will have components \code{assign},
   \code{effects} and (unless not requested) \code{qr} relating to the linear
   fit, for use by extractor functions such as \code{summary} and
   \code{\link{effects}}.
 }
 \section{Using time series}{
   Considerable care is needed when using \code{lm} with time series.

   Unless \code{na.action = NULL}, the time series attributes are
   stripped from the variables before the regression is done.  (This is
   necessary as omitting \code{NA}s would invalidate the time series
   attributes, and if \code{NA}s are omitted in the middle of the series
   the result would no longer be a regular time series.)

   Even if the time series attributes are retained, they are not used to
   line up series, so that the time shift of a lagged or differenced
   regressor would be ignored.  It is good practice to prepare a
   \code{data} argument by \code{\link{ts.intersect}(\dots, dframe = TRUE)},
   then apply a suitable \code{na.action} to that data frame and call
   \code{lm} with \code{na.action = NULL} so that residuals and fitted
   values are time series.
 }
 \seealso{
   \code{\link{summary.lm}} for summaries and \code{\link{anova.lm}} for
   the ANOVA table; \code{\link{aov}} for a different interface.

   The generic functions \code{\link{coef}}, \code{\link{effects}},
   \code{\link{residuals}}, \code{\link{fitted}}, \code{\link{vcov}}.

   \code{\link{predict.lm}} (via \code{\link{predict}}) for prediction,
   including confidence and prediction intervals;
   \code{\link{confint}} for confidence intervals of \emph{parameters}.

   \code{\link{lm.influence}} for regression diagnostics, and
   \code{\link{glm}} for \bold{generalized} linear models.

   The underlying low level functions,
   \code{\link{lm.fit}} for plain, and \code{\link{lm.wfit}} for weighted
   regression fitting.

   More \code{lm()} examples are available, e.g., in
   \code{\link{anscombe}}, \code{\link{attitude}}, \code{\link{freeny}},
   \code{\link{LifeCycleSavings}}, \code{\link{longley}},
   \code{\link{stackloss}}, \code{\link{swiss}}.

   \code{biglm} in package \CRANpkg{biglm} for an alternative
   way to fit linear models to large datasets (especially those with many
   cases).
 }
 \references{
   Chambers, J. M. (1992)
   \emph{Linear models.}
   Chapter 4 of \emph{Statistical Models in S}
   eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.

   Wilkinson, G. N. and Rogers, C. E. (1973).
   Symbolic descriptions of factorial models for analysis of variance.
   \emph{Applied Statistics}, \bold{22}, 392--399.
   \doi{10.2307/2346786}.
 }
 \author{
   The design was inspired by the S function of the same name described
   in Chambers (1992).  The implementation of model formula by Ross Ihaka
   was based on Wilkinson & Rogers (1973).
 }

 \note{
   Offsets specified by \code{offset} will not be included in predictions
   by \code{\link{predict.lm}}, whereas those specified by an offset term
   in the formula will be.
 }
 \examples{
 require(graphics)

 ## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
 ## Page 9: Plant Weight Data.
 ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
 trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
 group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
 weight <- c(ctl, trt)
 lm.D9 <- lm(weight ~ group)
 lm.D90 <- lm(weight ~ group - 1) # omitting intercept
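 ## Sketch (not part of the original examples): Details states that
 ## "y ~ x - 1" and "y ~ 0 + x" both remove the intercept, so the two
 ## fits below should agree.  The cars data set and the names fit.m1 and
 ## fit.0 are chosen here only for illustration.
 fit.m1 <- lm(dist ~ speed - 1, data = cars)
 fit.0  <- lm(dist ~ 0 + speed, data = cars)
 stopifnot(all.equal(coef(fit.m1), coef(fit.0)))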
 \donttest{
 anova(lm.D9)
 summary(lm.D90)
 }
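 ## Sketch (illustrative only): the extractor functions named under
 ## "Value" return the corresponding list components of the fit.
 ## fit.cars is a name introduced here for the illustration.
 fit.cars <- lm(dist ~ speed, data = cars)
 stopifnot(all.equal(coef(fit.cars), fit.cars$coefficients),
           all.equal(residuals(fit.cars), fit.cars$residuals),
           all.equal(fitted(fit.cars), fit.cars$fitted.values),
           df.residual(fit.cars) == nrow(cars) - fit.cars$rank)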
 opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
 plot(lm.D9, las = 1)      # Residuals, Fitted, ...
 par(opar)
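
 ## Sketch (illustrative only): as described in Details, first*second is
 ## the same as first + second + first:second.  The warpbreaks data set
 ## and the names fit.cross and fit.long are assumptions made here.
 fit.cross <- lm(breaks ~ wool * tension, data = warpbreaks)
 fit.long  <- lm(breaks ~ wool + tension + wool:tension, data = warpbreaks)
 stopifnot(all.equal(coef(fit.cross), coef(fit.long)))

 ## Sketch (illustrative only): a matrix response is fitted column by
 ## column and gives an object of class c("mlm", "lm").  The mtcars
 ## columns are used purely as an example.
 fit.mlm <- lm(cbind(mpg, disp) ~ wt, data = mtcars)
 class(fit.mlm)
 stopifnot(inherits(fit.mlm, "mlm"),
           all.equal(coef(fit.mlm)[, "mpg"],
                     coef(lm(mpg ~ wt, data = mtcars))))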
 \dontshow{
 ## model frame :
 stopifnot(identical(lm(weight ~ group, method = "model.frame"),
                     model.frame(lm.D9)))
 }
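
 ## Sketch (illustrative only): an offset term in the formula is
 ## subtracted from the response before fitting, so the two fits below
 ## should give the same coefficients.  The offset 2 * speed is an
 ## arbitrary choice made for this illustration.
 fit.off <- lm(dist ~ speed + offset(2 * speed), data = cars)
 fit.sub <- lm(I(dist - 2 * speed) ~ speed, data = cars)
 stopifnot(all.equal(coef(fit.off), coef(fit.sub)))

 ## Sketch (illustrative only): integer weights reproduce the coefficients
 ## of a fit to replicated rows, but the residual degrees of freedom
 ## differ, which is why Details advises care with standard errors.
 ## The small data frame d.w is made up for this illustration.
 d.w <- data.frame(x = c(1, 2, 3, 4), y = c(1.1, 1.9, 3.2, 3.9),
                   w = c(1, 2, 1, 3))
 fit.w   <- lm(y ~ x, data = d.w, weights = w)
 d.rep   <- d.w[rep(seq_len(nrow(d.w)), d.w$w), ]
 fit.rep <- lm(y ~ x, data = d.rep)
 stopifnot(all.equal(coef(fit.w), coef(fit.rep)))
 c(weighted = df.residual(fit.w), replicated = df.residual(fit.rep))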
 ### less simple examples in "See Also" above
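
 ## Sketch (illustrative only): na.omit drops incomplete cases, whereas
 ## na.exclude pads residuals with NA so that they line up with the data.
 ## The data frame d.na is made up for this illustration.
 d.na <- data.frame(x = 1:6, y = c(2.1, 3.9, NA, 8.2, 9.8, 12.1))
 fit.omit <- lm(y ~ x, data = d.na, na.action = na.omit)
 fit.excl <- lm(y ~ x, data = d.na, na.action = na.exclude)
 length(residuals(fit.omit))  # 5: the incomplete case is dropped
 length(residuals(fit.excl))  # 6: NA is kept for the incomplete case

 ## Sketch (illustrative only): with the default singular.ok = TRUE an
 ## aliased coefficient is reported as NA; singular.ok = FALSE turns the
 ## singular fit into an error.  d.sing is a made-up data frame in which
 ## x2 is an exact multiple of x1.
 d.sing <- data.frame(y = c(1, 3, 2, 5, 4), x1 = 1:5)
 d.sing$x2 <- 2 * d.sing$x1
 coef(lm(y ~ x1 + x2, data = d.sing))
 res.sing <- try(lm(y ~ x1 + x2, data = d.sing, singular.ok = FALSE),
                 silent = TRUE)
 stopifnot(inherits(res.sing, "try-error"))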
 }
 \keyword{regression}