% File src/library/stats/man/lm.Rd
% Part of the R package, https://www.R-project.org
% Copyright 1995-2018 R Core Team
% Distributed under GPL 2 or later
\name{lm}
\alias{lm}
%\alias{print.lm}
\concept{regression}
\title{Fitting Linear Models}
\description{
\code{lm} is used to fit linear models.
It can be used to carry out regression,
single stratum analysis of variance and
analysis of covariance (although \code{\link{aov}} may provide a more
convenient interface for these).
}
\usage{
lm(formula, data, subset, weights, na.action,
method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
singular.ok = TRUE, contrasts = NULL, offset, \dots)
}
\arguments{
\item{formula}{an object of class \code{"\link{formula}"} (or one that
can be coerced to that class): a symbolic description of the
model to be fitted. The details of model specification are given
under \sQuote{Details}.}
\item{data}{an optional data frame, list or environment (or object
coercible by \code{\link{as.data.frame}} to a data frame) containing
the variables in the model. If not found in \code{data}, the
variables are taken from \code{environment(formula)},
typically the environment from which \code{lm} is called.}
\item{subset}{an optional vector specifying a subset of observations
to be used in the fitting process.}
\item{weights}{an optional vector of weights to be used in the fitting
process. Should be \code{NULL} or a numeric vector.
If non-NULL, weighted least squares is used with weights
\code{weights} (that is, minimizing \code{sum(w*e^2)}); otherwise
ordinary least squares is used. See also \sQuote{Details}.}
\item{na.action}{a function which indicates what should happen
when the data contain \code{NA}s. The default is set by
the \code{na.action} setting of \code{\link{options}}, and is
\code{\link{na.fail}} if that is unset. The \sQuote{factory-fresh}
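default is \code{\link{na.omit}}. Another possible value is
\code{NULL}, meaning no action. \code{\link{na.exclude}} can be useful
because residuals and fitted values are then padded with \code{NA}s
to the length of the original data.}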
\item{method}{the method to be used; for fitting, currently only
\code{method = "qr"} is supported; \code{method = "model.frame"} returns
the model frame (the same as with \code{model = TRUE}, see below).}
\item{model, x, y, qr}{logicals. If \code{TRUE} the corresponding
components of the fit (the model frame, the model matrix, the
response, the QR decomposition) are returned.
}
\item{singular.ok}{logical. If \code{FALSE} (the default in S but
not in \R) a singular fit is an error.}
\item{contrasts}{an optional list. See the \code{contrasts.arg}
of \code{\link{model.matrix.default}}.}
\item{offset}{this can be used to specify an \emph{a priori} known
component to be included in the linear predictor during fitting.
This should be \code{NULL} or a numeric vector or matrix of extents
matching those of the response. One or more \code{\link{offset}} terms can be
included in the formula instead or as well, and if more than one are
specified their sum is used. See \code{\link{model.offset}}.}
\item{\dots}{additional arguments to be passed to the low level
regression fitting functions (see below).}
}
\details{
Models for \code{lm} are specified symbolically. A typical model has
the form \code{response ~ terms} where \code{response} is the (numeric)
response vector and \code{terms} is a series of terms which specifies a
linear predictor for \code{response}. A terms specification of the form
\code{first + second} indicates all the terms in \code{first} together
with all the terms in \code{second} with duplicates removed. A
specification of the form \code{first:second} indicates the set of
terms obtained by taking the interactions of all terms in \code{first}
with all terms in \code{second}. The specification \code{first*second}
indicates the \emph{cross} of \code{first} and \code{second}. This is
the same as \code{first + second + first:second}.
If the formula includes an \code{\link{offset}}, this is evaluated and
subtracted from the response.
If \code{response} is a matrix, a linear model is fitted separately by
least squares to each column of the matrix.
See \code{\link{model.matrix}} for some further details. The terms in
the formula will be re-ordered so that main effects come first,
followed by the interactions, all second-order, all third-order and so
on. To avoid this, pass a \code{terms} object as the formula (see
\code{\link{aov}} and \code{demo(glm.vr)} for an example).
A formula has an implied intercept term. To remove this use either
\code{y ~ x - 1} or \code{y ~ 0 + x}. See \code{\link{formula}} for
more details of allowed formulae.
Non-\code{NULL} \code{weights} can be used to indicate that
different observations have different variances (with the values in
\code{weights} being inversely proportional to the variances); or
equivalently, when the elements of \code{weights} are positive
integers \eqn{w_i}, that each response \eqn{y_i} is the mean of
\eqn{w_i} unit-weight observations (including the case that there
are \eqn{w_i} observations equal to \eqn{y_i} and the data have been
summarized). However, in the latter case, notice that within-group
variation is not used. Therefore, the sigma estimate and residual
degrees of freedom may be suboptimal; in the case of replication
weights, even wrong. Hence, standard errors and analysis of variance
tables should be treated with care.
\code{lm} calls the lower level functions \code{\link{lm.fit}} (or
\code{\link{lm.wfit}} for weighted fits; see below) for the actual
numerical computations. For programming only, you may consider doing
likewise.
All of \code{weights}, \code{subset} and \code{offset} are evaluated
in the same way as variables in \code{formula}, that is first in
\code{data} and then in the environment of \code{formula}.
}
\value{
\code{lm} returns an object of \code{\link{class}} \code{"lm"} or, for
multiple responses, of class \code{c("mlm", "lm")}.
The functions \code{summary} and \code{\link{anova}} are used to
obtain and print a summary and analysis of variance table of the
results. The generic accessor functions \code{coefficients},
\code{effects}, \code{fitted.values} and \code{residuals} extract
various useful features of the value returned by \code{lm}.
An object of class \code{"lm"} is a list containing at least the
following components:
\item{coefficients}{a named vector of coefficients.}
\item{residuals}{the residuals, that is response minus fitted values.}
\item{fitted.values}{the fitted mean values.}
\item{rank}{the numeric rank of the fitted linear model.}
\item{weights}{(only for weighted fits) the specified weights.}
\item{df.residual}{the residual degrees of freedom.}
\item{call}{the matched call.}
\item{terms}{the \code{\link{terms}} object used.}
\item{contrasts}{(only where relevant) the contrasts used.}
\item{xlevels}{(only where relevant) a record of the levels of the
factors used in fitting.}
\item{offset}{the offset used (missing if none were used).}
\item{y}{if requested, the response used.}
\item{x}{if requested, the model matrix used.}
\item{model}{if requested (the default), the model frame used.}
\item{na.action}{(where relevant) information returned by
\code{\link{model.frame}} on the special handling of \code{NA}s.}
In addition, non-null fits will have components \code{assign},
\code{effects} and (unless not requested) \code{qr} relating to the linear
fit, for use by extractor functions such as \code{summary} and
\code{\link{effects}}.
}
\section{Using time series}{
Considerable care is needed when using \code{lm} with time series.
Unless \code{na.action = NULL}, the time series attributes are
stripped from the variables before the regression is done. (This is
necessary as omitting \code{NA}s would invalidate the time series
attributes, and if \code{NA}s are omitted in the middle of the series
the result would no longer be a regular time series.)
Even if the time series attributes are retained, they are not used to
line up series, so that the time shift of a lagged or differenced
regressor would be ignored. It is good practice to prepare a
\code{data} argument by \code{\link{ts.intersect}(\dots, dframe = TRUE)},
then apply a suitable \code{na.action} to that data frame and call
\code{lm} with \code{na.action = NULL} so that residuals and fitted
values are time series.
}
\seealso{
\code{\link{summary.lm}} for summaries and \code{\link{anova.lm}} for
the ANOVA table; \code{\link{aov}} for a different interface.
The generic functions \code{\link{coef}}, \code{\link{effects}},
\code{\link{residuals}}, \code{\link{fitted}}, \code{\link{vcov}}.
\code{\link{predict.lm}} (via \code{\link{predict}}) for prediction,
including confidence and prediction intervals;
\code{\link{confint}} for confidence intervals of \emph{parameters}.
\code{\link{lm.influence}} for regression diagnostics, and
\code{\link{glm}} for \bold{generalized} linear models.
The underlying low level functions,
\code{\link{lm.fit}} for plain, and \code{\link{lm.wfit}} for weighted
regression fitting.
More \code{lm()} examples are available e.g., in
\code{\link{anscombe}}, \code{\link{attitude}}, \code{\link{freeny}},
\code{\link{LifeCycleSavings}}, \code{\link{longley}},
\code{\link{stackloss}}, \code{\link{swiss}}.
\code{biglm} in package \CRANpkg{biglm} for an alternative
way to fit linear models to large datasets (especially those with many
cases).
}
\references{
Chambers, J. M. (1992)
\emph{Linear models.}
Chapter 4 of \emph{Statistical Models in S}
eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.
Wilkinson, G. N. and Rogers, C. E. (1973).
Symbolic descriptions of factorial models for analysis of variance.
\emph{Applied Statistics}, \bold{22}, 392--399.
\doi{10.2307/2346786}.
}
\author{
The design was inspired by the S function of the same name described
in Chambers (1992). The implementation of model formula by Ross Ihaka
was based on Wilkinson & Rogers (1973).
}
\note{
Offsets specified by \code{offset} will not be included in predictions
by \code{\link{predict.lm}}, whereas those specified by an offset term
in the formula will be.
}
\examples{
require(graphics)
## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## Page 9: Plant Weight Data.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
lm.D90 <- lm(weight ~ group - 1) # omitting intercept
\donttest{
anova(lm.D9)
summary(lm.D90)
}
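## As described under 'Details', first*second is shorthand for
## first + second + first:second.  A minimal sketch of that equivalence,
## assuming the built-in 'warpbreaks' data (not part of the example above;
## the object names fit.cross, fit.long, fit.m1 and fit.0 are illustrative).
fit.cross <- lm(breaks ~ wool * tension, data = warpbreaks)
fit.long  <- lm(breaks ~ wool + tension + wool:tension, data = warpbreaks)
stopifnot(all.equal(coef(fit.cross), coef(fit.long)))
## The two ways of removing the implied intercept give the same fit.
fit.m1 <- lm(breaks ~ tension - 1, data = warpbreaks)
fit.0  <- lm(breaks ~ 0 + tension, data = warpbreaks)
stopifnot(all.equal(coef(fit.m1), coef(fit.0)))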
opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
plot(lm.D9, las = 1) # Residuals, Fitted, ...
par(opar)
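## A sketch of the replication-weight caveat from 'Details', using small
## made-up data (the data frames 'summ' and 'full' are illustrative only).
## The weighted fit to the summarized data reproduces the coefficients of
## the fit to the expanded data, but not its residual degrees of freedom,
## so sigma and standard errors differ between the two fits.
summ <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8),
                   w = c(1, 2, 3, 2))
## expand each summarized row w times
full <- summ[rep(seq_len(nrow(summ)), summ$w), c("x", "y")]
fit.w    <- lm(y ~ x, data = summ, weights = w)
fit.full <- lm(y ~ x, data = full)
stopifnot(all.equal(coef(fit.w), coef(fit.full)))
c(weighted = df.residual(fit.w), expanded = df.residual(fit.full))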
\dontshow{
## model frame :
stopifnot(identical(lm(weight ~ group, method = "model.frame"),
model.frame(lm.D9)))
}
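## A sketch of how an offset term in the formula enters the fit, using
## made-up data (the data frame 'd' and its columns 'o' and 'ymo' are
## illustrative only): the offset is subtracted from the response before
## fitting, and the fitted values of the offset model include it again.
d <- data.frame(x = 1:8,
                y = c(3.1, 4.8, 7.2, 8.9, 11.1, 12.8, 15.2, 16.9))
d$o   <- 0.5 * d$x            # a priori known component
d$ymo <- d$y - d$o            # response with the offset removed by hand
fit.off <- lm(y ~ x + offset(o), data = d)
fit.sub <- lm(ymo ~ x, data = d)
stopifnot(all.equal(coef(fit.off), coef(fit.sub)),
          all.equal(unname(fitted(fit.off)),
                    unname(fitted(fit.sub)) + d$o))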
### less simple examples in "See Also" above
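## For programming, the low-level fitter can be called directly.  A
## minimal sketch, assuming the built-in 'cars' data (the names X,
## fit.low and fit.hi are illustrative): lm.fit() takes a model matrix
## and a response rather than a formula, and returns a plain list
## without class "lm".
X <- model.matrix(~ speed, data = cars)
fit.low <- lm.fit(x = X, y = cars$dist)
fit.hi  <- lm(dist ~ speed, data = cars)
stopifnot(all.equal(fit.low$coefficients, coef(fit.hi)),
          !inherits(fit.low, "lm"))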
}
\keyword{regression}