% File src/library/base/man/Encoding.Rd
% Part of the R package, https://www.R-project.org
% Copyright 1995-2019 R Core Team
% Distributed under GPL 2 or later
\name{Encoding}
\alias{Encoding}
\alias{Encoding<-}
\alias{enc2native}
\alias{enc2utf8}
\concept{encoding}
\title{Read or Set the Declared Encodings for a Character Vector}
\description{
Read or set the declared encodings for a character vector.
}
\usage{
Encoding(x)
Encoding(x) <- value
enc2native(x)
enc2utf8(x)
}
\arguments{
\item{x}{A character vector.}
\item{value}{A character vector of positive length.}
}
\details{
Character strings in \R can be declared to be encoded in
\code{"latin1"} or \code{"UTF-8"} or as \code{"bytes"}. These
declarations can be read by \code{Encoding}, which will return a
character vector of values \code{"latin1"}, \code{"UTF-8"},
\code{"bytes"} or \code{"unknown"}, or set by \code{Encoding<-}, in
which case \code{value} is recycled as needed and any other values are
silently treated as \code{"unknown"}. ASCII strings will never be marked with a declared
encoding, since their representation is the same in all supported
encodings. Strings marked as \code{"bytes"} are intended to be
non-ASCII strings which should be manipulated as bytes, and never
converted to a character encoding (so writing them to a text file is
supported only by \code{writeLines(useBytes = TRUE)}).
% non-bug report PR#16327
\code{enc2native} and \code{enc2utf8} convert elements of character
vectors to the native encoding or UTF-8 respectively, taking any
marked encoding into account. They are \link{primitive} functions,
designed to do minimal copying.
There are other ways for character strings to acquire a declared
encoding apart from explicitly setting it (and these have changed as
\R has evolved). Functions \code{\link{scan}},
\code{\link{read.table}}, \code{\link{readLines}}, and
\code{\link{parse}} have an \code{encoding} argument that is used to
declare encodings, \code{\link{iconv}} declares encodings from its
\code{to} argument, and console input in suitable locales is also
declared. \code{\link{intToUtf8}} declares its output as
\code{"UTF-8"}, and output text connections (see
\code{\link{textConnection}}) are marked if running in a
suitable locale. Under some circumstances (see its help page)
\code{\link{source}(encoding=)} will mark encodings of character
strings it outputs.
Most character manipulation functions will set the encoding on output
strings if it was declared on the corresponding input. These include
\code{\link{chartr}}, \code{\link{strsplit}(useBytes = FALSE)},
\code{\link{tolower}} and \code{\link{toupper}} as well as
\code{\link{sub}(useBytes = FALSE)} and \code{\link{gsub}(useBytes =
FALSE)}. Note that such functions do not \emph{preserve} the
encoding, but if they know the input encoding and that the string has
been successfully re-encoded (to the current encoding or UTF-8), they
mark the output.
\code{\link{substr}} does preserve the encoding, and
\code{\link{chartr}}, \code{\link{tolower}} and \code{\link{toupper}}
preserve UTF-8 encoding on systems with Unicode wide characters. With
their \code{fixed} and \code{perl} options, \code{\link{strsplit}},
\code{\link{sub}} and \code{gsub} will give a marked UTF-8 result if
any of the inputs are UTF-8.
\code{\link{paste}} and \code{\link{sprintf}} return elements marked
as bytes if any of the corresponding inputs is marked as bytes, and
otherwise marked as UTF-8 if any of the inputs is marked as UTF-8.
\code{\link{match}}, \code{\link{pmatch}}, \code{\link{charmatch}},
\code{\link{duplicated}} and \code{\link{unique}} all match in UTF-8
if any of the elements are marked as UTF-8.
There is some ambiguity as to what is meant by a \sQuote{Latin-1}
locale, since some OSes (notably Windows) make use of character
positions used for control characters in the ISO 8859-1 character set.
How such characters are interpreted is system-dependent but as from \R
3.5.0 they are if possible interpreted as per Windows codepage 1252
(which Microsoft calls \sQuote{Windows Latin 1 (ANSI)}) when
converting to e.g.\sspace{}UTF-8.
}
\value{
A character vector.
For \code{enc2utf8} encodings are always marked; for
\code{enc2native} they are marked in UTF-8 and Latin-1 locales.
}
\examples{
## x is intended to be in latin1
x <- "fa\xE7ile"
Encoding(x)
Encoding(x) <- "latin1"
x
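## Illustrative sketch of the declaration rules described in Details.
## The names a and y are introduced here only for illustration.
a <- "facile"                     # pure ASCII
Encoding(a) <- "latin1"
Encoding(a)                       # ASCII strings are never marked
y <- c("fa\xE7ile", "na\xEFve")
Encoding(y) <- "latin1"           # value is recycled over both elements
Encoding(y)
Encoding(y) <- c("latin1", "no-such-encoding")
Encoding(y)                       # unrecognized values are treated as "unknown"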
xx <- iconv(x, "latin1", "UTF-8")
Encoding(c(x, xx))
c(x, xx)
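## Illustrative sketch of enc2utf8() and enc2native() on a string declared
## as latin1. The names s, s8 and sn are introduced only for illustration.
s <- "fa\xE7ile"
Encoding(s) <- "latin1"
s8 <- enc2utf8(s)
Encoding(s8)                      # always marked, see Value
charToRaw(s)                      # one byte (e7) for the c-cedilla
charToRaw(s8)                     # two bytes (c3 a7) after conversion
sn <- enc2native(s)
Encoding(sn)                      # depends on the current locale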
Encoding(xx) <- "bytes"
xx # will be encoded in hex
cat("xx = ", xx, "\n", sep = "")
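## Illustrative sketch of how paste() propagates marks, as described in
## Details. The names u and b are introduced only for illustration.
u <- iconv("fa\xE7ile", "latin1", "UTF-8")
b <- u
Encoding(b) <- "bytes"
Encoding(paste(u, "!"))           # a UTF-8 input should give a UTF-8 result
Encoding(paste(u, b))             # a bytes input should take precedence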
}
\keyword{utilities}
\keyword{character}