src/library/utils/man/data.Rd - R - Git at Google

 % File src/library/utils/man/data.Rd
 % Part of the R package, https://www.R-project.org
 % Copyright 1995-2022 R Core Team
 % Distributed under GPL 2 or later

 \name{data}
 \alias{data}
 \alias{print.packageIQR}
 \title{Data Sets}
 \description{
   Loads specified data sets, or list the available data sets.
 }
 \usage{
 data(\dots, list = character(), package = NULL, lib.loc = NULL,
      verbose = getOption("verbose"), envir = .GlobalEnv,
      overwrite = TRUE)
 }
 \arguments{
   \item{\dots}{literal character strings or names.}
   \item{list}{a character vector.}
   \item{package}{
     a character vector giving the package(s) to look
     in for data sets, or \code{NULL}.

     By default, all packages in the search path are used, then
     the \file{data} subdirectory (if present) of the current working
     directory.
   }
   \item{lib.loc}{a character vector of directory names of \R libraries,
     or \code{NULL}.  The default value of \code{NULL} corresponds to all
     libraries currently known.}
   \item{verbose}{a logical.  If \code{TRUE}, additional diagnostics are
     printed.}
   \item{envir}{the \link{environment} where the data should be loaded.}
   \item{overwrite}{logical: should existing objects of the same name in
     \env{envir} be replaced?}
 }
 \details{
   Currently, four formats of data files are supported:

   \enumerate{
     \item files ending \file{.R} or \file{.r} are
     \code{\link{source}()}d in, with the \R working directory changed
     temporarily to the directory containing the respective file.
     (\code{data} ensures that the \pkg{utils} package is attached, in
     case it had been run \emph{via} \code{utils::data}.)

     \item files ending \file{.RData} or \file{.rda} are
     \code{\link{load}()}ed.

     \item files ending \file{.tab}, \file{.txt} or \file{.TXT} are read
     using \code{\link{read.table}(\dots, header = TRUE, as.is=FALSE)},
     and hence
     result in a data frame.

     \item files ending \file{.csv} or \file{.CSV} are read using
     \code{\link{read.table}(\dots, header = TRUE, sep = ";", as.is=FALSE)},
     and also result in a data frame.
   }
   If more than one matching file name is found, the first on this list
   is used.  (Files with extensions \file{.txt}, \file{.tab} or
   \file{.csv} can be compressed, with or without further extension
   \file{.gz}, \file{.bz2} or \file{.xz}.)

   The data sets to be loaded can be specified as a set of character
   strings or names, or as the character vector \code{list}, or as both.

   For each given data set, the first two types (\file{.R} or \file{.r},
   and \file{.RData} or \file{.rda} files) can create several variables
   in the load environment, which might all be named differently from the
   data set.  The third and fourth types will always result in the
   creation of a single variable with the same name (without extension)
   as the data set.

   If no data sets are specified, \code{data} lists the available data
   sets.  For each package,
   it looks for a data index in the \file{Meta} subdirectory or, if
   this is not found, scans the \file{data} subdirectory for data files
   using \code{\link{list_files_with_type}}.
   The information about
   available data sets is returned in an object of class
   \code{"packageIQR"}.  The structure of this class is experimental.
   Where the datasets have a different name from the argument that should
   be used to retrieve them the index will have an entry like
   \code{beaver1 (beavers)} which tells us that dataset \code{beaver1}
   can be retrieved by the call \code{data(beaver)}.

   If \code{lib.loc} and \code{package} are both \code{NULL} (the
   default), the data sets are searched for in all the currently loaded
   packages then in the \file{data} directory (if any) of the current
   working directory.

   If \code{lib.loc = NULL} but \code{package} is specified as a
   character vector, the specified package(s) are searched for first
   amongst loaded packages and then in the default library/ies
   (see \code{\link{.libPaths}}).

   If \code{lib.loc} \emph{is} specified (and not \code{NULL}), packages
   are searched for in the specified library/ies, even if they are
   already loaded from another library.

   To just look in the \file{data} directory of the current working
   directory, set \code{package = character(0)}
   (and \code{lib.loc = NULL}, the default).
 }
 \value{
   A character vector of all data sets specified (whether found or not),
   or information about all available data sets in an object of class
   \code{"packageIQR"} if none were specified.
 }
 \section{Good practice}{
   There is no requirement for \code{data(\var{foo})} to create an object
   named \code{\var{foo}} (nor to create one object), although it much
   reduces confusion if this convention is followed (and it is enforced
   if datasets are lazy-loaded).

   \code{data()} was originally intended to allow users to load datasets
   from packages for use in their examples, and as such it loaded the
   datasets into the workspace \code{\link{.GlobalEnv}}.  This avoided
   having large datasets in memory when not in use: that need has been
   almost entirely superseded by lazy-loading of datasets.

   The ability to specify a dataset by name (without quotes) is a
   convenience: in programming the datasets should be specified by
   character strings (with quotes).

   Use of \code{data} within a function without an \code{envir} argument
   has the almost always undesirable side-effect of putting an object in
   the user's workspace (and indeed, of replacing any object of that name
   already there).  It would almost always be better to put the object in
   the current evaluation environment by
   \code{data(\dots, envir = environment())}.
   However, two alternatives are usually preferable,
   both described in the \sQuote{Writing R Extensions} manual.
   \itemize{
     \item For sets of data, set up a package to use lazy-loading of data.
     \item For objects which are system data, for example lookup tables
     used in calculations within the function, use a file
     \file{R/sysdata.rda} in the package sources or create the objects by
     \R code at package installation time.
   }
   A sometimes important distinction is that the second approach places
   objects in the namespace but the first does not.  So if it is important
   that the function sees \code{mytable} as an object from the package,
   it is system data and the second approach should be used.  In the
   unusual case that a package uses a lazy-loaded dataset as a default
   argument to a function, that needs to be specified by \code{\link{::}},
   e.g., \code{survival::survexp.us}.
 }
 \note{
   One can take advantage of the search order and the fact that a
   \file{.R} file will change directory.  If raw data are stored in
   \file{mydata.txt} then one can set up \file{mydata.R} to read
   \file{mydata.txt} and pre-process it, e.g., using \code{\link{transform}()}.
   For instance one can convert numeric vectors to factors with the
   appropriate labels.  Thus, the \file{.R} file can effectively contain
   a metadata specification for the plaintext formats.

   %% In older versions of \R, up to 3.6.x, both \code{package = "base"} and
   %% \code{package = "stats"} were using \code{package = "datasets"}, (with a
   %% warning), as before 2004, (most of) the datasets in \pkg{datasets} were
   %% either in \pkg{base} or \pkg{stats}.  For these packages, the result
   %% is now empty as they contain no data sets.
 }
 \section{Warning}{
   This function creates objects in the \code{envir} environment (by
   default the user's workspace) replacing any which already
   existed. \code{data("foo")} can silently create objects other than
   \code{foo}: there have been instances in published  packages where it
   created/replaced \code{\link{.Random.seed}} and hence change the seed
   for the session.
 }
 \seealso{
   \code{\link{help}} for obtaining documentation on data sets,
   \code{\link{save}} for \emph{creating} the second (\file{.rda}) kind
   of data, typically the most efficient one.

   The \sQuote{Writing R Extensions} for considerations in preparing the
   \file{data} directory of a package.
 }
 \examples{
 require(utils)
 data()                         # list all available data sets
 try(data(package = "rpart"), silent = TRUE) # list the data sets in the rpart package
 data(USArrests, "VADeaths")    # load the data sets 'USArrests' and 'VADeaths'
 \dontrun{## Alternatively
 ds <- c("USArrests", "VADeaths"); data(list = ds)}
 help(USArrests)                # give information on data set 'USArrests'
 }
 \keyword{documentation}
 \keyword{datasets}
	% File src/library/utils/man/data.Rd
	% Part of the R package, https://www.R-project.org
	% Copyright 1995-2022 R Core Team
	% Distributed under GPL 2 or later

	\name{data}
	\alias{data}
	\alias{print.packageIQR}
	\title{Data Sets}
	\description{
	Loads specified data sets, or list the available data sets.
	}
	\usage{
	data(\dots, list = character(), package = NULL, lib.loc = NULL,
	verbose = getOption("verbose"), envir = .GlobalEnv,
	overwrite = TRUE)
	}
	\arguments{
	\item{\dots}{literal character strings or names.}
	\item{list}{a character vector.}
	\item{package}{
	a character vector giving the package(s) to look
	in for data sets, or \code{NULL}.

	By default, all packages in the search path are used, then
	the \file{data} subdirectory (if present) of the current working
	directory.
	}
	\item{lib.loc}{a character vector of directory names of \R libraries,
	or \code{NULL}. The default value of \code{NULL} corresponds to all
	libraries currently known.}
	\item{verbose}{a logical. If \code{TRUE}, additional diagnostics are
	printed.}
	\item{envir}{the \link{environment} where the data should be loaded.}
	\item{overwrite}{logical: should existing objects of the same name in
	\env{envir} be replaced?}
	}
	\details{
	Currently, four formats of data files are supported:

	\enumerate{
	\item files ending \file{.R} or \file{.r} are
	\code{\link{source}()}d in, with the \R working directory changed
	temporarily to the directory containing the respective file.
	(\code{data} ensures that the \pkg{utils} package is attached, in
	case it had been run \emph{via} \code{utils::data}.)

	\item files ending \file{.RData} or \file{.rda} are
	\code{\link{load}()}ed.

	\item files ending \file{.tab}, \file{.txt} or \file{.TXT} are read
	using \code{\link{read.table}(\dots, header = TRUE, as.is=FALSE)},
	and hence
	result in a data frame.

	\item files ending \file{.csv} or \file{.CSV} are read using
	\code{\link{read.table}(\dots, header = TRUE, sep = ";", as.is=FALSE)},
	and also result in a data frame.
	}
	If more than one matching file name is found, the first on this list
	is used. (Files with extensions \file{.txt}, \file{.tab} or
	\file{.csv} can be compressed, with or without further extension
	\file{.gz}, \file{.bz2} or \file{.xz}.)

	The data sets to be loaded can be specified as a set of character
	strings or names, or as the character vector \code{list}, or as both.

	For each given data set, the first two types (\file{.R} or \file{.r},
	and \file{.RData} or \file{.rda} files) can create several variables
	in the load environment, which might all be named differently from the
	data set. The third and fourth types will always result in the
	creation of a single variable with the same name (without extension)
	as the data set.

	If no data sets are specified, \code{data} lists the available data
	sets. For each package,
	it looks for a data index in the \file{Meta} subdirectory or, if
	this is not found, scans the \file{data} subdirectory for data files
	using \code{\link{list_files_with_type}}.
	The information about
	available data sets is returned in an object of class
	\code{"packageIQR"}. The structure of this class is experimental.
	Where the datasets have a different name from the argument that should
	be used to retrieve them the index will have an entry like
	\code{beaver1 (beavers)} which tells us that dataset \code{beaver1}
	can be retrieved by the call \code{data(beaver)}.

	If \code{lib.loc} and \code{package} are both \code{NULL} (the
	default), the data sets are searched for in all the currently loaded
	packages then in the \file{data} directory (if any) of the current
	working directory.

	If \code{lib.loc = NULL} but \code{package} is specified as a
	character vector, the specified package(s) are searched for first
	amongst loaded packages and then in the default library/ies
	(see \code{\link{.libPaths}}).

	If \code{lib.loc} \emph{is} specified (and not \code{NULL}), packages
	are searched for in the specified library/ies, even if they are
	already loaded from another library.

	To just look in the \file{data} directory of the current working
	directory, set \code{package = character(0)}
	(and \code{lib.loc = NULL}, the default).
	}
	\value{
	A character vector of all data sets specified (whether found or not),
	or information about all available data sets in an object of class
	\code{"packageIQR"} if none were specified.
	}
	\section{Good practice}{
	There is no requirement for \code{data(\var{foo})} to create an object
	named \code{\var{foo}} (nor to create one object), although it much
	reduces confusion if this convention is followed (and it is enforced
	if datasets are lazy-loaded).

	\code{data()} was originally intended to allow users to load datasets
	from packages for use in their examples, and as such it loaded the
	datasets into the workspace \code{\link{.GlobalEnv}}. This avoided
	having large datasets in memory when not in use: that need has been
	almost entirely superseded by lazy-loading of datasets.

	The ability to specify a dataset by name (without quotes) is a
	convenience: in programming the datasets should be specified by
	character strings (with quotes).

	Use of \code{data} within a function without an \code{envir} argument
	has the almost always undesirable side-effect of putting an object in
	the user's workspace (and indeed, of replacing any object of that name
	already there). It would almost always be better to put the object in
	the current evaluation environment by
	\code{data(\dots, envir = environment())}.
	However, two alternatives are usually preferable,
	both described in the \sQuote{Writing R Extensions} manual.
	\itemize{
	\item For sets of data, set up a package to use lazy-loading of data.
	\item For objects which are system data, for example lookup tables
	used in calculations within the function, use a file
	\file{R/sysdata.rda} in the package sources or create the objects by
	\R code at package installation time.
	}
	A sometimes important distinction is that the second approach places
	objects in the namespace but the first does not. So if it is important
	that the function sees \code{mytable} as an object from the package,
	it is system data and the second approach should be used. In the
	unusual case that a package uses a lazy-loaded dataset as a default
	argument to a function, that needs to be specified by \code{\link{::}},
	e.g., \code{survival::survexp.us}.
	}
	\note{
	One can take advantage of the search order and the fact that a
	\file{.R} file will change directory. If raw data are stored in
	\file{mydata.txt} then one can set up \file{mydata.R} to read
	\file{mydata.txt} and pre-process it, e.g., using \code{\link{transform}()}.
	For instance one can convert numeric vectors to factors with the
	appropriate labels. Thus, the \file{.R} file can effectively contain
	a metadata specification for the plaintext formats.

	%% In older versions of \R, up to 3.6.x, both \code{package = "base"} and
	%% \code{package = "stats"} were using \code{package = "datasets"}, (with a
	%% warning), as before 2004, (most of) the datasets in \pkg{datasets} were
	%% either in \pkg{base} or \pkg{stats}. For these packages, the result
	%% is now empty as they contain no data sets.
	}
	\section{Warning}{
	This function creates objects in the \code{envir} environment (by
	default the user's workspace) replacing any which already
	existed. \code{data("foo")} can silently create objects other than
	\code{foo}: there have been instances in published packages where it
	created/replaced \code{\link{.Random.seed}} and hence change the seed
	for the session.
	}
	\seealso{
	\code{\link{help}} for obtaining documentation on data sets,
	\code{\link{save}} for \emph{creating} the second (\file{.rda}) kind
	of data, typically the most efficient one.

	The \sQuote{Writing R Extensions} for considerations in preparing the
	\file{data} directory of a package.
	}
	\examples{
	require(utils)
	data() # list all available data sets
	try(data(package = "rpart"), silent = TRUE) # list the data sets in the rpart package
	data(USArrests, "VADeaths") # load the data sets 'USArrests' and 'VADeaths'
	\dontrun{## Alternatively
	ds <- c("USArrests", "VADeaths"); data(list = ds)}
	help(USArrests) # give information on data set 'USArrests'
	}
	\keyword{documentation}
	\keyword{datasets}