| % File src/library/base/man/iconv.Rd |
| % Part of the R package, https://www.R-project.org |
| % Copyright 1995-2017 R Core Team |
| % Distributed under GPL 2 or later |
| |
| \name{iconv} |
| \alias{iconv} |
| \alias{iconvlist} |
| \concept{encoding} |
| \title{Convert Character Vector between Encodings} |
| \description{ |
| This uses system facilities to convert a character vector between |
| encodings: the \sQuote{i} stands for \sQuote{internationalization}. |
| } |
| \usage{ |
| iconv(x, from = "", to = "", sub = NA, mark = TRUE, toRaw = FALSE) |
| |
| iconvlist() |
| } |
| |
| \arguments{ |
| \item{x}{A character vector, or an object to be converted to a character |
| vector by \code{\link{as.character}}, or a list with \code{NULL} and |
| \code{raw} elements as returned by \code{iconv(toRaw = TRUE)}.} |
| \item{from}{A character string describing the current encoding.} |
| \item{to}{A character string describing the target encoding.} |
| \item{sub}{character string. If not \code{NA} it is used to replace |
| any non-convertible bytes in the input. (This would normally be a |
| single character, but can be more.) If \code{"byte"}, the indication is |
| \code{"<xx>"} with the hex code of the byte.} |
| \item{mark}{logical, for expert use. Should encodings be marked?} |
| \item{toRaw}{logical. Should a list of raw vectors be returned rather |
| than a character vector?} |
| } |
| |
| \details{ |
| The names of encodings and which ones are available are |
| platform-dependent. All \R platforms support \code{""} (for the |
| encoding of the current locale), \code{"latin1"} and \code{"UTF-8"}. |
| Generally case is ignored when specifying an encoding. |
| |
| On most platforms \code{iconvlist} provides an alphabetical list of |
| the supported encodings. On others, the information is on the man |
| page for \code{iconv(5)} or elsewhere in the man pages (but beware |
| that the system command \code{iconv} may not support the same set of |
| encodings as the C functions \R calls). Unfortunately, the names are |
| rarely supported across all platforms. |
| |
| Elements of \code{x} which cannot be converted (perhaps because they |
| are invalid or because they cannot be represented in the target |
| encoding) will be returned as \code{NA} unless \code{sub} is specified. |
| |
| Most versions of \code{iconv} will allow transliteration by appending |
| \samp{//TRANSLIT} to the \code{to} encoding: see the examples. |
| |
| Encoding \code{"ASCII"} is accepted, and on most systems \code{"C"} |
| and \code{"POSIX"} are synonyms for ASCII. |
| |
| Any encoding bits (see \code{\link{Encoding}}) on elements of \code{x} |
| are ignored: they will always be translated as if from encoding |
| \code{from} even if declared otherwise. \code{\link{enc2native}} and |
| \code{\link{enc2utf8}} provide alternatives which do take declared |
| encodings into account. |
| |
| Note that implementations of \code{iconv} typically do not do much |
| validity checking and will often mis-convert inputs which are invalid |
| in encoding \code{from}. |
| } |
| |
| \section{Implementation Details}{ |
| There are three main implementations of \code{iconv} in use. |
| Linux's C runtime \samp{glibc} contains one. Several platforms |
| supply GNU \samp{libiconv}, including macOS, FreeBSD and Cygwin, in |
| some cases with additional encodings. On Windows we use a version of |
| Yukihiro Nakadaira's \samp{win_iconv}, which is based on Windows' |
| codepages. (We have added many encoding names for compatibility |
| with other systems.) All three have \code{iconvlist}, ignore case in |
| encoding names and support \samp{//TRANSLIT} (but with different |
| results, and for \samp{win_iconv} currently a \sQuote{best fit} |
| strategy is used except for \code{to = "ASCII"}). |
| |
| Most commercial Unixes contain an implementation of \code{iconv} but |
| none we have encountered have supported the encoding names we need: |
| the \dQuote{R Installation and Administration Manual} recommends |
| installing GNU \samp{libiconv} on Solaris and AIX, for example. |
| |
| There are other implementations, e.g.\sspace{} NetBSD has used one from the |
| Citrus project (which does not support \samp{//TRANSLIT}) and there is |
| an older FreeBSD port (\samp{libiconv} is usually used there): it has |
| not been reported whether or not these work with \R. |
| |
| Note that you cannot rely on invalid inputs being detected, especially |
| for \code{to = "ASCII"} where some implementations allow 8-bit |
| characters and pass them through unchanged or with transliteration. |
| |
| Some of the implementations have interesting extra encodings: for |
| example GNU \samp{libiconv} allows \code{to = "C99"} to use |
| \samp{\\uxxxx} escapes for non-ASCII characters. |
| } |
| |
| \section{Byte Order Marks}{ |
| most commonly known as \sQuote{BOMs}. |
| |
| Encodings using character units which are more than one byte in size |
| can be written on a file in either big-endian or little-endian order: |
| this applies most commonly to UCS-2, UTF-16 and UTF-32/UCS-4 |
| encodings. Some systems will write the Unicode character |
| \code{U+FEFF} at the beginning of a file in these encodings and |
| perhaps also in UTF-8. In that usage the character is known as a BOM, |
| and should be handled during input (see the \sQuote{Encodings} section |
| under \code{\link{connection}}: re-encoded connections have some |
| special handling of BOMs). The rest of this section applies when this |
| has not been done so \code{x} starts with a BOM. |
| |
| Implementations will generally interpret a BOM for \code{from} given |
| as one of \code{"UCS-2"}, \code{"UTF-16"} and |
| \code{"UTF-32"}. Implementations differ in how they treat BOMs in |
| \code{x} in other \code{from} encodings: they may be discarded, |
| returned as character \code{U+FEFF} or regarded as invalid. |
| } |
| |
| \value{ |
| If \code{toRaw = FALSE} (the default), the value is a character vector |
| of the same length and the same attributes as \code{x} (after |
| conversion to a character vector). |
| |
| If \code{mark = TRUE} (the default) the elements of the result have a |
| declared encoding if \code{to} is \code{"latin1"} or \code{"UTF-8"}, |
| or if \code{to = ""} and the current locale's encoding is detected as |
| Latin-1 (or its superset CP1252 on Windows) or UTF-8. |
| |
| If \code{toRaw = TRUE}, the value is a list of the same length and |
| the same attributes as \code{x} whose elements are either \code{NULL} |
| (if conversion fails) or a raw vector. |
| |
| For \code{iconvlist()}, a character vector (typically of a few hundred |
| elements) of known encoding names. |
| } |
| \note{ |
| The only reasonably portable name for the ISO 8859-15 encoding, |
| commonly known as \sQuote{Latin 9}, is \code{"latin-9"}: some |
| platforms support \code{"latin9"} but GNU \samp{libiconv} does not. |
| |
| Encoding names \code{"utf8"}, \code{"mac"} and \code{"macroman"} are |
| not portable. \code{"utf8"} is converted to \code{"UTF-8"} for |
| \code{from} and \code{to} by \code{iconv}, but not |
| for e.g.\sspace{}\code{fileEncoding} arguments. \code{"macintosh"} is |
| the official (and most widely supported) name for \sQuote{Mac Roman} |
| (\url{https://en.wikipedia.org/wiki/Mac_OS_Roman}). |
| } |
| |
| \seealso{ |
| \code{\link{localeToCharset}}, \code{\link{file}}. |
| } |
| \examples{ |
| ## In principle, as not all systems have iconvlist |
| try(utils::head(iconvlist(), n = 50)) |
| |
| \dontrun{ |
| ## convert from Latin-2 to UTF-8: two of the glibc iconv variants. |
| iconv(x, "ISO_8859-2", "UTF-8") |
| iconv(x, "LATIN2", "UTF-8") |
| } |
| |
| ## Both x below are in latin1 and will only display correctly in a |
| ## locale that can represent and display latin1. |
| x <- "fa\xE7ile" |
| Encoding(x) <- "latin1" |
| x |
| charToRaw(xx <- iconv(x, "latin1", "UTF-8")) |
| xx |
| |
| iconv(x, "latin1", "ASCII") # NA |
| iconv(x, "latin1", "ASCII", "?") # "fa?ile" |
| iconv(x, "latin1", "ASCII", "") # "faile" |
| iconv(x, "latin1", "ASCII", "byte") # "fa<e7>ile" |
| |
| ## Extracts from old R help files (they are nowadays in UTF-8) |
| x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher") |
| Encoding(x) <- "latin1" |
| x |
| try(iconv(x, "latin1", "ASCII//TRANSLIT")) # platform-dependent |
| iconv(x, "latin1", "ASCII", sub = "byte") |
| ## and for Windows' 'Unicode' |
| str(xx <- iconv(x, "latin1", "UTF-16LE", toRaw = TRUE)) |
| iconv(xx, "UTF-16LE", "UTF-8") |
| } |
| \keyword{ character } |
| \keyword{ utilities } |