 % File src/library/utils/man/charClass.Rd
 % Part of the R package, https://www.R-project.org
 % Copyright 2021 R Core Team
 % Distributed under GPL 2 or later
 \name{charClass}
 \alias{charClass}
 \title{Character Classification}
 \description{
   An interface to the (C99) wide character classification functions in use.
 }
 \usage{
 charClass(x, class)
 }
 \arguments{
   \item{x}{\strong{Either} a UTF-8-encoded length-1 character vector
     \strong{or} an integer vector of Unicode points (or a vector
     coercible to integer).}
   \item{class}{A character string, one of those given in the
     \sQuote{Details} section.}
 }
 \details{
   The classification into character classes is platform-dependent.  The
   classes are determined by internal tables on Windows and (optionally
   but by default) on macOS and AIX.

   The character classes are interpreted as follows:
   \describe{
     \item{\code{"alnum"}}{Alphabetic or numeric.}
     \item{\code{"alpha"}}{Alphabetic.}
     \item{\code{"blank"}}{Space or tab.}
     \item{\code{"cntrl"}}{Control characters.}
     \item{\code{"digit"}}{Digits \code{0-9}.}
     \item{\code{"graph"}}{Graphical characters (printable characters
       except whitespace).}
     \item{\code{"lower"}}{Lower-case alphabetic.}
     \item{\code{"print"}}{Printable characters.}
     \item{\code{"punct"}}{Punctuation characters.  Some platforms treat all
       non-alphanumeric graphical characters as punctuation.}
     \item{\code{"space"}}{Whitespace, including tabs, form and line
       feeds and carriage returns.  Some OSes include non-breaking
       spaces, some exclude them.}
     \item{\code{"upper"}}{Upper-case alphabetic.}
     \item{\code{"xdigit"}}{Hexadecimal character, one of \code{0-9A-Fa-f}.}
   }

   Alphabetic characters contain all lower- and upper-case ones and some
   others (for example, those in \sQuote{title case}).

   Whether a character is printable is used to decide whether to escape
   it when printing -- see the help for \code{\link{print.default}}.

   If \code{x} is a character string it should either be ASCII or declared
   as UTF-8 -- see \code{\link{Encoding}}.

   \code{charClass} was added in \R 4.1.0.  A less direct way to examine
   character classes, which also works in earlier versions, is to use
   something like \code{grepl("[[:print:]]", intToUtf8(x))}.  However,
   the regular-expression code might not use the same classification
   functions as printing, and on macOS it formerly did not.
 }
 \value{
   A logical vector whose length is the number of characters or integers
   in \code{x}.
 }
 \note{
   Non-ASCII digits are excluded by the C99 standard from the class
   \code{"digit"}: most platforms will have them as alphabetic.

   It is an assumption that the system's wide character classification
   functions are coded in Unicode points, but this is known to be true
   for all recent platforms.

   In principle the classification could depend on the locale even on a
   single platform, but this no longer appears to happen in practice.
 }
 \seealso{
   Character classes are used in \link{regular expression}s.

   The OS's \command{man} pages for \code{iswctype} and \code{wctype}.
 }
 \examples{
 x <- c(48:70, 32, 0xa0) # Last is non-breaking space
 cl <- c("alnum", "alpha", "blank", "digit", "graph", "punct", "upper", "xdigit")
 X <- lapply(cl, function(y) charClass(x,y)); names(X) <- cl
 X <- as.data.frame(X); row.names(X) <- sQuote(intToUtf8(x, multiple = TRUE))
 X

 charClass("ABC123", "alpha")
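 ## A minimal illustration of the Value section: a length-1 string is
 ## classified character by character, so there is one result per character.
 ## (The name res is only illustrative.)
 res <- charClass("ABC123", "alpha")
 length(res) == nchar("ABC123")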
 ## Some accented capital Greek characters
 (x <- "\u0386\u0388\u0389")
 charClass(x, "upper")
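
 ## A small sketch relating to the Note on digits: U+0663 (ARABIC-INDIC DIGIT
 ## THREE) is a non-ASCII digit, so C99 excludes it from the digit class.
 ## How it is classified otherwise is platform-dependent, so no particular
 ## output is claimed here. (The names dg and dg_cls are only illustrative.)
 dg <- c(0x33, 0x0663) # ASCII digit three, then ARABIC-INDIC DIGIT THREE
 dg_cls <- sapply(c("digit", "alpha", "alnum"), function(cls) charClass(dg, cls))
 dg_cls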

 ## How many printable characters are there? (Around 280,000 in Unicode 13.)
 ## There are 2^21-1 possible Unicode points (most not yet assigned).
 pr <- charClass(1:0x1fffff, "print")
 table(pr)
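
 ## A hedged sketch of the less direct approach described in Details:
 ## cross-tabulate charClass() against a regular-expression test for the
 ## same class. The range 32:126 (printable ASCII) is chosen only for
 ## illustration; for non-ASCII points the two may differ on some platforms.
 cp <- 32:126
 by_class <- charClass(cp, "print")
 by_regex <- grepl("[[:print:]]", intToUtf8(cp, multiple = TRUE))
 table(by_class, by_regex)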
 }