% File src/library/utils/man/charClass.Rd
% Part of the R package, https://www.R-project.org
% Copyright 2021 R Core Team
% Distributed under GPL 2 or later
\name{charClass}
\alias{charClass}
\title{Character Classification}
\description{
An interface to the (C99) wide-character classification functions in
use on the current platform.
}
\usage{
charClass(x, class)
}
\arguments{
\item{x}{\strong{Either} a UTF-8-encoded length-1 character vector
\strong{or} an integer vector of Unicode points (or a vector
coercible to integer).}
\item{class}{A character string, one of those given in the
\sQuote{Details} section.}
}
\details{
The classification into character classes is platform-dependent. The
classes are determined by internal tables on Windows and (optionally
but by default) on macOS and AIX.
The character classes are interpreted as follows:
\describe{
\item{\code{"alnum"}}{Alphabetic or numeric.}
\item{\code{"alpha"}}{Alphabetic.}
\item{\code{"blank"}}{Space or tab.}
\item{\code{"cntrl"}}{Control characters.}
\item{\code{"digit"}}{Digits \code{0-9}.}
\item{\code{"graph"}}{Graphical characters (printable characters
except whitespace).}
\item{\code{"lower"}}{Lower-case alphabetic.}
\item{\code{"print"}}{Printable characters.}
\item{\code{"punct"}}{Punctuation characters. Some platforms treat all
non-alphanumeric graphical characters as punctuation.}
\item{\code{"space"}}{Whitespace, including tabs, form and line
feeds and carriage returns. Some OSes include non-breaking
spaces, some exclude them.}
\item{\code{"upper"}}{Upper-case alphabetic.}
\item{\code{"xdigit"}}{Hexadecimal character, one of \code{0-9A-Fa-f}.}
}
Alphabetic characters include all lower- and upper-case ones and some
others (for example, those in \sQuote{title case}).

Whether a character is printable is used to decide whether to escape
it when printing -- see the help for \code{\link{print.default}}.

If \code{x} is a character string it should either be ASCII or declared
as UTF-8 -- see \code{\link{Encoding}}.

\code{charClass} was added in \R 4.1.0. A less direct way to examine
character classes, which also worked in earlier versions, is to use
something like \code{grepl("[[:print:]]", intToUtf8(x))} -- however,
the regular-expression code might not use the same classification
functions as printing does, and on macOS it formerly did not.
}
\value{
A logical vector whose length is the number of characters or integers
in \code{x}.
}
\note{
Non-ASCII digits are excluded by the C99 standard from the class
\code{"digit"}: most platforms will have them as alphabetic.

It is an assumption that the system's wide character classification
functions are coded in Unicode points, but this is known to be true
for all recent platforms.

In principle the classification could depend on the locale even on
one platform, but this no longer seems to occur in practice.
}
\seealso{
Character classes are used in \link{regular expression}s.
The OS's \command{man} pages for \code{iswctype} and \code{wctype}.
}
\examples{
x <- c(48:70, 32, 0xa0) # Last is non-breaking space
cl <- c("alnum", "alpha", "blank", "digit", "graph", "punct", "upper", "xdigit")
X <- lapply(cl, function(y) charClass(x,y)); names(X) <- cl
X <- as.data.frame(X); row.names(X) <- sQuote(intToUtf8(x, multiple = TRUE))
X
charClass("ABC123", "alpha")
## Some accented capital Greek characters
(x <- "\u0386\u0388\u0389")
charClass(x, "upper")
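## A small sketch of the point made in the Note: C99 excludes non-ASCII
## digits from class "digit", and most platforms class them as alphabetic.
## Results are platform-dependent, so none are shown here.  The code points
## are illustrative choices only: ASCII '1', ARABIC-INDIC DIGIT ONE,
## DEVANAGARI DIGIT ONE and FULLWIDTH DIGIT ONE.
d <- c(0x31, 0x661, 0x967, 0xff11)
sapply(c("digit", "alpha", "alnum"), function(cl) charClass(d, cl))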
## How many printable characters are there? (Around 280,000 in Unicode 13.)
## There are 2^21-1 possible Unicode points (most not yet assigned).
pr <- charClass(1:0x1fffff, "print")
table(pr)
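## A hedged comparison with the regular-expression approach described in
## Details.  The two methods may disagree on some platforms, so this only
## cross-tabulates them over a small, arbitrarily chosen range (code points
## 1 to 255); the names 'cp' and 'viaRegex' are used just for illustration.
cp <- 1:0xff
viaRegex <- grepl("[[:print:]]", intToUtf8(cp, multiple = TRUE))
table(charClass = charClass(cp, "print"), regex = viaRegex)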
}