 % File src/library/base/man/Encoding.Rd
 % Part of the R package, https://www.R-project.org
 % Copyright 1995-2019 R Core Team
 % Distributed under GPL 2 or later

 \name{Encoding}
 \alias{Encoding}
 \alias{Encoding<-}
 \alias{enc2native}
 \alias{enc2utf8}
 \concept{encoding}
 \title{Read or Set the Declared Encodings for a Character Vector}
 \description{
   Read or set the declared encodings for a character vector.
 }
 \usage{
 Encoding(x)

 Encoding(x) <- value

 enc2native(x)
 enc2utf8(x)
 }
 \arguments{
   \item{x}{A character vector.}
   \item{value}{A character vector of positive length.}
 }
 \details{
   Character strings in \R can be declared to be encoded in
   \code{"latin1"} or \code{"UTF-8"} or as \code{"bytes"}.  These
   declarations can be read by \code{Encoding}, which will return a
   character vector of values \code{"latin1"}, \code{"UTF-8"},
   \code{"bytes"} or \code{"unknown"}, or set by the replacement form,
   in which case \code{value} is recycled as needed and other values are
   silently treated as \code{"unknown"}.  ASCII strings will never be
   marked with a declared
   encoding, since their representation is the same in all supported
   encodings.  Strings marked as \code{"bytes"} are intended to be
   non-ASCII strings which should be manipulated as bytes, and never
   converted to a character encoding (so writing them to a text file is
   supported only by \code{writeLines(useBytes = TRUE)}).
   % non-bug report PR#16327

   \code{enc2native} and \code{enc2utf8} convert elements of character
   vectors to the native encoding or UTF-8 respectively, taking any
   marked encoding into account.  They are \link{primitive} functions,
   designed to do minimal copying.

   There are other ways for character strings to acquire a declared
   encoding apart from explicitly setting it (and these have changed as
   \R has evolved).  Functions \code{\link{scan}},
   \code{\link{read.table}}, \code{\link{readLines}}, and
   \code{\link{parse}} have an \code{encoding} argument that is used to
   declare encodings, \code{\link{iconv}} declares encodings from its
   \code{to} argument, and console input in suitable locales is also
   declared.  \code{\link{intToUtf8}} declares its output as
   \code{"UTF-8"}, and output text connections (see
   \code{\link{textConnection}}) are marked if running in a
   suitable locale.  Under some circumstances (see its help page)
   \code{\link{source}(encoding=)} will mark encodings of character
   strings it outputs.

   Most character manipulation functions will set the encoding on output
   strings if it was declared on the corresponding input.  These include
   \code{\link{chartr}}, \code{\link{strsplit}(useBytes = FALSE)},
   \code{\link{tolower}} and \code{\link{toupper}} as well as
   \code{\link{sub}(useBytes = FALSE)} and \code{\link{gsub}(useBytes =
   FALSE)}.  Note that such functions do not \emph{preserve} the
   encoding, but if they know the input encoding and that the string has
   been successfully re-encoded (to the current encoding or UTF-8), they
   mark the output.

   \code{\link{substr}} does preserve the encoding, and
   \code{\link{chartr}}, \code{\link{tolower}} and \code{\link{toupper}}
   preserve UTF-8 encoding on systems with Unicode wide characters.  With
   their \code{fixed} and \code{perl} options, \code{\link{strsplit}},
   \code{\link{sub}} and \code{gsub} will give a marked UTF-8 result if
   any of the inputs are UTF-8.

   \code{\link{paste}} and \code{\link{sprintf}} return elements marked
   as bytes if any of the corresponding inputs is marked as bytes, and
   otherwise marked as UTF-8 if any of the inputs is marked as UTF-8.

   \code{\link{match}}, \code{\link{pmatch}}, \code{\link{charmatch}},
   \code{\link{duplicated}} and \code{\link{unique}} all match in UTF-8
   if any of the elements are marked as UTF-8.

   There is some ambiguity as to what is meant by a \sQuote{Latin-1}
   locale, since some OSes (notably Windows) make use of character
   positions used for control characters in the ISO 8859-1 character set.
   How such characters are interpreted is system-dependent, but as of \R
   3.5.0 they are, if possible, interpreted as per Windows codepage 1252
   (which Microsoft calls \sQuote{Windows Latin 1 (ANSI)}) when
   converting to e.g.\sspace{}UTF-8.
 }
 \value{
   A character vector.

   For \code{enc2utf8}, encodings are always marked; for
   \code{enc2native}, they are marked in UTF-8 and Latin-1 locales.
 }
 \examples{
 ## x is intended to be in latin1
 x <- "fa\xE7ile"
 Encoding(x)
 Encoding(x) <- "latin1"
 x
 xx <- iconv(x, "latin1", "UTF-8")
 Encoding(c(x, xx))
 c(x, xx)
 Encoding(xx) <- "bytes"
 xx # will be encoded in hex
 cat("xx = ", xx, "\n", sep = "")
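
 ## A further sketch of the marking rules described in Details.  The names
 ## a, z, z8 and b are illustrative only, and results that may depend on
 ## the locale are left without an expected value.
 ## ASCII strings are never marked, even after an explicit declaration
 a <- "facile"
 Encoding(a) <- "latin1"
 Encoding(a)   # "unknown"
 ## value is recycled along x
 z <- c("fa\xE7ile", "gar\xE7on")
 Encoding(z) <- "latin1"
 Encoding(z)   # "latin1" "latin1"
 ## enc2utf8 converts using the declared encoding and marks the result
 z8 <- enc2utf8(z)
 Encoding(z8)  # "UTF-8" "UTF-8"
 charToRaw(z8[1])  # c-cedilla is now the two bytes c3 a7
 ## paste marks its result as bytes if any corresponding input is bytes
 b <- z8[1]
 Encoding(b) <- "bytes"
 Encoding(paste(b, z8[1]))
 ## values other than the recognized ones are treated as "unknown"
 Encoding(z) <- "foo"
 Encoding(z)   # "unknown" "unknown"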
 }
 \keyword{utilities}
 \keyword{character}