
 % File src/library/base/man/factor.Rd
 % Part of the R package, https://www.R-project.org
 % Copyright 1995-2018 R Core Team
 % Distributed under GPL 2 or later

 \name{factor}
 \title{Factors}
 \alias{factor}
 \alias{ordered}
 \alias{is.factor}
 \alias{is.ordered}
 \alias{as.factor}
 \alias{as.ordered}
 \alias{is.na<-.factor}
 \alias{Math.factor}
 \alias{Ops.factor}
 \alias{Summary.factor}
 \alias{Ops.ordered}
 \alias{Summary.ordered}
 \alias{addNA}
 \alias{.valid.factor}
 \concept{categorical variable}
 \concept{enumerated type}
 \concept{category}
 \description{
   The function \code{factor} is used to encode a vector as a factor (the
   terms \sQuote{category} and \sQuote{enumerated type} are also used for
   factors).  If argument \code{ordered} is \code{TRUE}, the factor
   levels are assumed to be ordered.  For compatibility with S there is
   also a function \code{ordered}.

   \code{is.factor}, \code{is.ordered}, \code{as.factor} and \code{as.ordered}
   are the membership and coercion functions for these classes.
 }
 \usage{
 factor(x = character(), levels, labels = levels,
        exclude = NA, ordered = is.ordered(x), nmax = NA)

 ordered(x, \dots)

 is.factor(x)
 is.ordered(x)

 as.factor(x)
 as.ordered(x)

 addNA(x, ifany = FALSE)
 }
 \arguments{
   \item{x}{a vector of data, usually taking a small number of distinct
     values.}
   \item{levels}{an optional vector of the unique values (as character strings)
     that \code{x} might have taken.  The default is the unique set of
     values taken by \code{\link{as.character}(x)}, sorted into
     increasing order \emph{of \code{x}}.  Note that this set can be
     specified as smaller than \code{sort(unique(x))}.}
   \item{labels}{\emph{either} an optional character vector of
     labels for the levels (in the same order as \code{levels} after
     removing those in \code{exclude}), \emph{or} a character string of
     length 1.  Duplicated values in \code{labels} can be used to map
     different values of \code{x} to the same factor level.}
   \item{exclude}{a vector of values to be excluded when forming the
     set of levels.  This may be a factor with the same level set as
     \code{x}, or should be a \code{character} vector.}
   \item{ordered}{logical flag to determine if the levels should be regarded
     as ordered (in the order given).}
   \item{nmax}{an upper bound on the number of levels; see \sQuote{Details}.}
   \item{\dots}{(in \code{ordered(.)}): any of the above, apart from
     \code{ordered} itself.}
   \item{ifany}{only add an \code{NA} level if it is used, i.e.
     if \code{any(is.na(x))}.}
 }
 \value{
   \code{factor} returns an object of class \code{"factor"} which has a
   set of integer codes the length of \code{x} with a \code{"levels"}
   attribute of mode \code{\link{character}} and unique
   (\code{!\link{anyDuplicated}(.)}) entries.  If argument \code{ordered}
   is true (or \code{ordered()} is used) the result has class
   \code{c("ordered", "factor")}.
   Although long undocumented, \code{factor(x)} loses all
   \code{\link{attributes}(x)} but \code{"names"}, and resets
   \code{"levels"} and \code{"class"}.

   Applying \code{factor} to an ordered or unordered factor returns a
   factor (of the same type) with just the levels which occur: see also
   \code{\link{[.factor}} for a more transparent way to achieve this.

   \code{is.factor} returns \code{TRUE} or \code{FALSE} depending on
   whether its argument is of type factor or not.  Correspondingly,
   \code{is.ordered} returns \code{TRUE} when its argument is an ordered
   factor and \code{FALSE} otherwise.

   \code{as.factor} coerces its argument to a factor.
   It is an abbreviated (sometimes faster) form of \code{factor}.

   \code{as.ordered(x)} returns \code{x} if this is ordered, and
   \code{ordered(x)} otherwise.

   \code{addNA} modifies a factor by turning \code{NA} into an extra
   level (so that \code{NA} values are counted in tables, for instance).

   \code{.valid.factor(object)} checks the validity of a factor,
   currently only \code{levels(object)}, and returns \code{TRUE} if it is
   valid, otherwise a string describing the validity problem.  This
   function is used for \code{\link{validObject}(<factor>)}.
 }
 \details{
   The type of the vector \code{x} is not restricted; it need only have
   an \code{\link{as.character}} method and be sortable (by
   \code{\link{order}}).

   Ordered factors differ from factors only in their class, but methods
   and the model-fitting functions treat the two classes quite differently.

   The encoding of the vector happens as follows.  First all the values
   in \code{exclude} are removed from \code{levels}. If \code{x[i]}
   equals \code{levels[j]}, then the \code{i}-th element of the result is
   \code{j}.  If no match is found for \code{x[i]} in \code{levels}
   (which will happen for excluded values) then the \code{i}-th element
   of the result is set to \code{\link{NA}}.

   Normally the \sQuote{levels} used as an attribute of the result are
   the reduced set of levels after removing those in \code{exclude}, but
   this can be altered by supplying \code{labels}.  This should either
   be a set of new labels for the levels, or a character string, in
   which case the levels are that character string with a sequence
   number appended.

   \code{factor(x, exclude = NULL)} applied to a factor without
   \code{\link{NA}}s is a no-operation unless there are unused levels: in
   that case, a factor with the reduced level set is returned.  If
   \code{exclude} is used, since \R version 3.4.0, excluding non-existing
   character levels is equivalent to excluding nothing, and when
   \code{exclude} is a \code{\link{character}} vector, that \emph{is}
   applied to the levels of \code{x}.
   Alternatively, \code{exclude} can be a factor with the same level set as
   \code{x} and will exclude the levels present in \code{exclude}.

   The codes of a factor may contain \code{\link{NA}}.  For a numeric
   \code{x}, set \code{exclude = NULL} to make \code{\link{NA}} an extra
   level (prints as \code{<NA>}); by default, this is the last level.

   If \code{NA} is a level, the way to set a code to be missing (as
   opposed to the code of the missing level) is to
   use \code{\link{is.na}} on the left-hand-side of an assignment (as in
   \code{is.na(f)[i] <- TRUE}; indexing inside \code{is.na} does not work).
   Under those circumstances missing values are currently printed as
   \code{<NA>}, i.e., identical to entries of level \code{NA}.

   \code{is.factor} is generic: you can write methods to handle
   specific classes of objects, see \link{InternalMethods}.

   Where \code{levels} is not supplied, \code{\link{unique}} is called.
   Since factors typically have quite a small number of levels, for large
   vectors \code{x} it is helpful to supply \code{nmax} as an upper bound
   on the number of unique values.
 }
 \section{Warning}{
   The interpretation of a factor depends on both the codes and the
   \code{"levels"} attribute.  Be careful only to compare factors with
   the same set of levels (in the same order).  In particular,
   \code{as.numeric} applied to a factor is meaningless, and may
   happen by implicit coercion.  To transform a factor \code{f} to
   approximately its original numeric values,
   \code{as.numeric(levels(f))[f]} is recommended and slightly more
   efficient than \code{as.numeric(as.character(f))}.

   The levels of a factor are by default sorted, but the sort order
   may well depend on the locale at the time of creation, and should
   not be assumed to be ASCII.

   There are some anomalies associated with factors that have
   \code{NA} as a level.  It is suggested to use them sparingly, e.g.,
   only for tabulation purposes.
 }
 %% Is this still true, after Ops.factor (==, !=) is fixed ?

 \section{Comparison operators and group generic methods}{
   There are \code{"factor"} and \code{"ordered"} methods for the
   \link{group generic} \code{\link[=S3groupGeneric]{Ops}} which
   provide methods for the \link{Comparison} operators,
   and for the \code{\link{min}}, \code{\link{max}}, and
   \code{\link{range}} generics in \code{\link[=S3groupGeneric]{Summary}}
   of \code{"ordered"}.  (The rest of the groups and the
   \code{\link[=S3groupGeneric]{Math}} group generate an error as they
   are not meaningful for factors.)

   Only \code{==} and \code{!=} can be used for factors: a factor can
   only be compared to another factor with an identical set of levels
   (not necessarily in the same ordering) or to a character vector.
   Ordered factors are compared in the same way, but the general dispatch
   mechanism precludes comparing ordered and unordered factors.

   All the comparison operators are available for ordered factors.
   Collation is done by the levels of the operands: if both operands are
   ordered factors they must have the same level set.
 }
 \note{
   In earlier versions of \R, storing character data as a factor was more
   space efficient if there was even a small proportion of
   repeats.  However, identical character strings now share storage, so
   the difference is small in most cases.  (Integer values are stored
   in 4 bytes whereas each reference to a character string needs a
   pointer of 4 or 8 bytes.)
 }
 \references{
   Chambers, J. M. and Hastie, T. J. (1992)
   \emph{Statistical Models in S}.
   Wadsworth & Brooks/Cole.
 }
 \seealso{
   \code{\link{[.factor}} for subsetting of factors.

   \code{\link{gl}} for construction of balanced factors and
   \code{\link{C}} for factors with specified contrasts.
   \code{\link{levels}} and \code{\link{nlevels}} for accessing the
   levels, and \code{\link{unclass}} to get integer codes.
 }
 \examples{
 (ff <- factor(substring("statistics", 1:10, 1:10), levels = letters))
 as.integer(ff)      # the internal codes
 (f. <- factor(ff))  # drops the levels that do not occur
 ff[, drop = TRUE]   # the same, more transparently

 factor(letters[1:20], labels = "letter")

 class(ordered(4:1)) # "ordered", inheriting from "factor"
 z <- factor(LETTERS[3:1], ordered = TRUE)
 ## and "relational" methods work:
 stopifnot(sort(z)[c(1,3)] == range(z), min(z) < max(z))
 \dontshow{
 of <- ordered(ff)
 stopifnot(identical(range(of, rev(of)), of[3:2]),
 	  identical(max(of), of[2]))
 }
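
 ## A minimal sketch of the encoding described in the Details section,
 ## using a made-up character vector 's' (an assumption for illustration):
 ## values with no match in 'levels', or removed via 'exclude', get code NA.
 s <- c("b", "a", "c", "b")
 as.integer(factor(s, levels = c("a", "b")))  # "c" has no level, so NA
 as.integer(factor(s, exclude = "b"))         # "b" is excluded, so NA
 levels(factor(s, exclude = "b"))             # the reduced level set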

 ## suppose you want "NA" as a level, and to allow missing values.
 (x <- factor(c(1, 2, NA), exclude = NULL))
 is.na(x)[2] <- TRUE
 x  # [1] 1    <NA> <NA>
 is.na(x)
 # [1] FALSE  TRUE FALSE

 ## More rational, since R 3.4.0 :
 factor(c(1:2, NA), exclude =  "" ) # keeps <NA> , as
 factor(c(1:2, NA), exclude = NULL) # always did
 ## exclude = <character>
 z # ordered levels 'A < B < C'
 factor(z, exclude = "C") # does exclude
 factor(z, exclude = "B") # ditto
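
 ## A small sketch of the comparison rules from the section on comparison
 ## operators; 'fa', 'fb' and 'sz' are made-up objects for illustration.
 fa <- factor(c("a", "b", "a"))
 fb <- factor(c("a", "b", "b"), levels = c("b", "a"))
 fa == "a"  # a factor can be compared to a character vector
 fa == fb   # same level set in a different order: compared by label
 sz <- ordered(c("lo", "hi", "mid"), levels = c("lo", "mid", "hi"))
 sz < "hi"  # the ordering follows the levels, not the alphabet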

 ## Labels may be duplicated:
 ## factor() with duplicated labels allows levels to be merged
 x <- c("Man", "Male", "Man", "Lady", "Female")
 ## Map from 4 different values to only two levels:
 (xf <- factor(x, levels = c("Male", "Man" , "Lady",   "Female"),
                  labels = c("Male", "Male", "Female", "Female")))
 #> [1] Male   Male   Male   Female Female
 #> Levels: Male Female

 ## Using addNA()
 Month <- airquality$Month
 table(addNA(Month))
 table(addNA(Month, ifany = TRUE))
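
 ## A minimal sketch of the idiom recommended in the Warning section,
 ## using a made-up numeric vector 'v' (an assumption for illustration):
 v <- c(10, 20, 10, 30)
 fv <- factor(v)
 as.integer(fv)              # the internal codes, not the original values
 as.numeric(levels(fv))[fv]  # converts the levels, then indexes by code
 stopifnot(identical(as.numeric(levels(fv))[fv], v),
           identical(as.numeric(as.character(fv)), v))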
 }
 \keyword{category}
 \keyword{NA}