% File src/library/stats/man/reshape.Rd
% Part of the R package, https://www.R-project.org
% Copyright 1995-2012 R Core Team
% Distributed under GPL 2 or later
\name{reshape}
\alias{reshape}
\title{Reshape Grouped Data}
\description{
This function reshapes a data frame between \sQuote{wide} format (with
repeated measurements in separate columns of the same row) and
\sQuote{long} format (with the repeated measurements in separate
rows).
}
\usage{
reshape(data, varying = NULL, v.names = NULL, timevar = "time",
        idvar = "id", ids = 1:NROW(data),
        times = seq_along(varying[[1]]),
        drop = NULL, direction, new.row.names = NULL,
        sep = ".",
        split = if (sep == "") {
            list(regexp = "[A-Za-z][0-9]", include = TRUE)
        } else {
            list(regexp = sep, include = FALSE, fixed = TRUE)}
        )

### Typical usage for converting from long to wide format:

# reshape(data, direction = "wide",
#         idvar = "___", timevar = "___", # mandatory
#         v.names = c(___),    # time-varying variables
#         varying = list(___)) # auto-generated if missing

### Typical usage for converting from wide to long format:

### If names of wide-format variables are in a 'nice' format

# reshape(data, direction = "long",
#         varying = c(___), # vector
#         sep)              # to help guess 'v.names' and 'times'

### To specify long-format variable names explicitly

# reshape(data, direction = "long",
#         varying = ___,  # list / matrix / vector (use with care)
#         v.names = ___,  # vector of variable names in long format
#         timevar, times, # name / values of constructed time variable
#         idvar, ids)     # name / values of constructed id variable
}
\arguments{
\item{data}{a data frame}
\item{varying}{names of sets of variables in the wide format that
correspond to single variables in long format
(\sQuote{time-varying}). This is canonically a list of vectors of
variable names, but it can optionally be a matrix of names, or a
single vector of names. In each case, when \code{direction =
"long"}, the names can be replaced by indices which are interpreted
as referring to \code{names(data)}. See \sQuote{Details} for further
options.}
\item{v.names}{names of variables in the long format that correspond
to multiple variables in the wide format. See \sQuote{Details}.}
\item{timevar}{the variable in long format that differentiates multiple
records from the same group or individual. When reshaping to wide
format, if more than one record has the same combination of
\code{idvar} and \code{timevar} values, the first will be taken (with
a warning).}
\item{idvar}{Names of one or more variables in long format that
identify multiple records from the same group/individual. These
variables may also be present in wide format.}
\item{ids}{the values to use for a newly created \code{idvar}
variable in long format.}
\item{times}{the values to use for a newly created \code{timevar}
variable in long format. See \sQuote{Details}.}
\item{drop}{a vector of names of variables to drop before reshaping.}
\item{direction}{character string, partially matched to either
\code{"wide"} to reshape to wide format, or \code{"long"} to reshape
to long format.}
\item{new.row.names}{character or \code{NULL}: a non-null value will be
used for the row names of the result.}
\item{sep}{A character vector of length 1, indicating a separating
character in the variable names in the wide format. This is used for
guessing \code{v.names} and \code{times} arguments based on the
names in \code{varying}. If \code{sep == ""}, the split is just before
the first numeral that follows an alphabetic character. This is
also used to create variable names when reshaping to wide format.}
\item{split}{A list with three components, \code{regexp},
\code{include}, and (optionally) \code{fixed}. This allows an
extended interface to variable name splitting. See \sQuote{Details}.}
}
\details{
Although \code{reshape()} can be used in a variety of contexts, the
motivating application is data from longitudinal studies, and the
arguments of this function are named and described in those terms. A
longitudinal study is characterized by repeated measurements of the
same variable(s), e.g., height and weight, on each unit being studied
(e.g., individual persons) at different time points (which are assumed
to be the same for all units). These variables are called time-varying
variables. The study may include other variables that are measured
only once for each unit and do not vary with time (e.g., gender and
race); these are called time-constant variables.
A \sQuote{wide} format representation of a longitudinal dataset will
have one record (row) for each unit, typically with some time-constant
variables that occupy single columns, and some time-varying variables
that occupy multiple columns (one column for each time point). A
\sQuote{long} format representation of the same dataset will have
multiple records (rows) for each individual, with the time-constant
variables being constant across these records and the time-varying
variables varying across the records. The \sQuote{long} format
dataset will have two additional variables: a \sQuote{time} variable
identifying which time point each record comes from, and an
\sQuote{id} variable showing which records refer to the same unit.
The type of conversion (long to wide or wide to long) is determined by
the \code{direction} argument, which is mandatory unless the
\code{data} argument is the result of a previous call to
\code{reshape}. In that case, the operation can be reversed simply
using \code{reshape(data)} (the other arguments are stored as
attributes on the data frame).
Conversion from long to wide format with \code{direction = "wide"} is
the simpler operation, and is mainly useful in the context of
multivariate analysis where data is often expected as a wide-format
matrix. In this case, the time variable \code{timevar} and id variable
\code{idvar} must be specified. All other variables are assumed to be
time-varying, unless the time-varying variables are explicitly
specified via the \code{v.names} argument. A warning is issued if
time-constant variables are not actually constant.
Each time-varying variable is expanded into multiple variables in the
wide format. The names of these expanded variables are generated
automatically, unless they are specified as the \code{varying}
argument in the form of a list (or matrix) with one component (or row)
for each time-varying variable. If \code{varying} is a vector of
names, it is implicitly converted into a matrix, with one row for each
time-varying variable. Use this option with care if there are multiple
time-varying variables, as the ordering (by column, the default in the
\code{\link{matrix}} constructor) may be unintuitive, whereas the
explicit list or matrix form is unambiguous.
%% (FIXME: delete?) If 'direction = "wide"' and no \code{varying} or
%% \code{v.names} arguments are supplied it is assumed that all variables
%% except \code{idvar} and \code{timevar} are time-varying.
Conversion from wide to long with \code{direction = "long"} is the
more common operation as most (univariate) statistical modeling
functions expect data in the long format. In the simpler case where
there is only one time-varying variable, the corresponding columns in
the wide format input can be specified as the \code{varying} argument,
which can be either a vector of column names or the corresponding
column indices. The name of the corresponding variable in the long
format output combining these columns can be optionally specified as
the \code{v.names} argument, and the name of the time variable as the
\code{timevar} argument. The values to use as the time values
corresponding to the different columns in the wide format can be
specified as the \code{times} argument. If \code{v.names} is
unspecified, the function will attempt to guess \code{v.names} and
\code{times} from \code{varying} (an explicitly specified \code{times}
argument is unused in that case). The default expects variable names
like \code{x.1}, \code{x.2}, where \code{sep = "."} means that names
are split at the dot, which is then dropped from the name. For names
in which an alphabetic part is followed directly by a numeric time,
such as \code{x1}, use \code{sep = ""}.
Multiple time-varying variables can be specified in two ways, either
with \code{varying} as an atomic vector as above, or as a list (or a
matrix). The first form is useful (and mandatory) if the automatic
variable name splitting as described above is used; this requires the
names of all time-varying variables to be suitably formatted in the
same manner, and \code{v.names} to be unspecified. If \code{varying}
is a list (with one component for each time-varying variable) or a
matrix (one row for each time-varying variable), variable name
splitting is not attempted, and \code{v.names} and \code{times} will
generally need to be specified, although they will default to,
respectively, the first variable name in each set, and sequential
times.
Also, guessing is not attempted if \code{v.names} is given explicitly,
even if \code{varying} is an atomic vector. In that case, the number
of time-varying variables is taken to be the length of \code{v.names},
and \code{varying} is implicitly converted into a matrix, with one row
for each time-varying variable. As in the case of long to wide
conversion, the matrix is filled up by column, so careful attention needs
to be paid to the order of variable names (or indices) in
\code{varying}, which is taken to be like \code{x.1}, \code{y.1},
\code{x.2}, \code{y.2} (i.e., variables corresponding to the same time
point need to be grouped together).
The \code{split} argument should not usually be necessary. The
\code{split$regexp} component is passed to either
\code{\link{strsplit}} or \code{\link{regexpr}}, where the latter is
used if \code{split$include} is \code{TRUE}, in which case the
splitting occurs after the first character of the matched string. In
the \code{\link{strsplit}} case, the separator is not included in the
result, and it is possible to specify fixed-string matching using
\code{split$fixed}.
}
\value{
The reshaped data frame with added attributes to simplify reshaping
back to the original form.
}
\seealso{\code{\link{stack}}, \code{\link{aperm}};
\code{\link{relist}} for reshaping the result of
\code{\link{unlist}}. \code{\link{xtabs}} and
\code{\link{as.data.frame.table}} for creating contingency tables and
converting them back to data frames.
}
\examples{
summary(Indometh) # data in long format
## long to wide (direction = "wide") requires idvar and timevar at a minimum
reshape(Indometh, direction = "wide", idvar = "Subject", timevar = "time")
## can also explicitly specify name of combined variable
wide <- reshape(Indometh, direction = "wide", idvar = "Subject",
                timevar = "time", v.names = "conc", sep = "_")
wide
## reverse transformation
reshape(wide, direction = "long")
reshape(wide, idvar = "Subject", varying = list(2:12),
v.names = "conc", direction = "long")
## times need not be numeric
df <- data.frame(id = rep(1:4, rep(2,4)),
visit = I(rep(c("Before","After"), 4)),
x = rnorm(4), y = runif(4))
df
reshape(df, timevar = "visit", idvar = "id", direction = "wide")
## warns that y is really varying
reshape(df, timevar = "visit", idvar = "id", direction = "wide", v.names = "x")
## unbalanced 'long' data leads to NA fill in 'wide' form
df2 <- df[1:7, ]
df2
reshape(df2, timevar = "visit", idvar = "id", direction = "wide")
## Alternative regular expressions for guessing names
df3 <- data.frame(id = 1:4, age = c(40,50,60,50), dose1 = c(1,2,1,2),
dose2 = c(2,1,2,1), dose4 = c(3,3,3,3))
reshape(df3, direction = "long", varying = 3:5, sep = "")
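## A similar split can be requested through 'split' directly. This is a
## sketch with made-up data ('dfs' and its column names are illustrative
## only): names are split at the regular expression "_t", which is
## dropped from the names because include = FALSE.
dfs <- data.frame(id = 1:2, dose_t1 = c(1, 2), dose_t2 = c(3, 4))
dfs.long <- reshape(dfs, direction = "long", idvar = "id", varying = 2:3,
                    split = list(regexp = "_t", include = FALSE))
dfs.long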
## an example that isn't longitudinal data
state.x77 <- as.data.frame(state.x77)
long <- reshape(state.x77, idvar = "state", ids = row.names(state.x77),
times = names(state.x77), timevar = "Characteristic",
varying = list(names(state.x77)), direction = "long")
reshape(long, direction = "wide")
reshape(long, direction = "wide", new.row.names = unique(long$state))
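## several time-varying variables given as one atomic vector together
## with an explicit 'v.names': no name guessing takes place, so the
## entries of 'varying' must be grouped by time point (x.1, y.1, x.2, y.2)
## ('dfo' is a made-up data set used only for illustration)
dfo <- data.frame(id = 1:3,
                  x.1 = c(5, 6, 7), y.1 = c(1, 2, 3),
                  x.2 = c(8, 9, 10), y.2 = c(4, 5, 6))
dfo.long <- reshape(dfo, direction = "long", idvar = "id",
                    varying = c("x.1", "y.1", "x.2", "y.2"),
                    v.names = c("x", "y"))
dfo.long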
## multiple id variables
df3 <- data.frame(school = rep(1:3, each = 4), class = rep(9:10, 6),
time = rep(c(1,1,2,2), 3), score = rnorm(12))
wide <- reshape(df3, idvar = c("school", "class"), direction = "wide")
wide
## transform back
reshape(wide)
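## several time-varying variables in wide format: 'varying' as an atomic
## vector with automatic name splitting, and as an explicit list
## ('dfw' is a made-up data set used only for illustration)
dfw <- data.frame(id = 1:3, sex = c("F", "M", "F"),
                  x.1 = c(5, 6, 7), x.2 = c(8, 9, 10),
                  y.1 = c(1, 2, 3), y.2 = c(4, 5, 6))
## names are split at sep = ".", giving v.names "x", "y" and times 1, 2
dfw.guess <- reshape(dfw, direction = "long", idvar = "id",
                     varying = c("x.1", "x.2", "y.1", "y.2"))
dfw.guess
## list form: one component per time-varying variable, no name splitting,
## so 'v.names' and 'times' are given explicitly here
dfw.list <- reshape(dfw, direction = "long", idvar = "id",
                    varying = list(c("x.1", "x.2"), c("y.1", "y.2")),
                    v.names = c("x", "y"), timevar = "visit",
                    times = c(0, 6))
dfw.list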
}
\keyword{manip}