src/library/base/man/nchar.Rd - R - Git at Google

 % File src/library/base/man/nchar.Rd
 % Part of the R package, https://www.R-project.org
 % Copyright 1995-2016 R Core Team
 % Distributed under GPL 2 or later

 \name{nchar}
 \alias{nchar}
 \alias{nzchar}
 \title{Count the Number of Characters (or Bytes or Width)}
 \usage{
 nchar(x, type = "chars", allowNA = FALSE, keepNA = NA)

 nzchar(x, keepNA = FALSE)
 }
 \description{
   \code{nchar} takes a character vector as an argument and
   returns a vector whose elements contain the sizes of
   the corresponding elements of \code{x}. Internally, it is a generic,
   for which methods can be defined.

   \code{nzchar} is a fast way to find out if elements of a character
   vector are non-empty strings.
 }
 \arguments{
   \item{x}{character vector, or a vector to be coerced to a character
     vector.  Giving a factor is an error.}
   \item{type}{character string: partial matching to one of
     \code{c("bytes", "chars", "width")}.  See \sQuote{Details}.}
   \item{allowNA}{logical: should \code{NA} be returned for invalid
     multibyte strings or \code{"bytes"}-encoded strings (rather than
     throwing an error)?}
   \item{keepNA}{logical: should \code{NA} be returned where ever
   \code{x} is \code{\link{NA}}?  If false, \code{nchar()} returns
   \code{2}, as that is the number of printing characters used when
   strings are written to output, and \code{nzchar()} is \code{TRUE}.  The
   default for \code{nchar()}, \code{NA}, means to use \code{keepNA = TRUE}
   unless \code{type} is \code{"width"}.  Used to be (implicitly) hard
   coded to \code{FALSE} in \R versions \eqn{\le}{<=} 3.2.0.}
 }
 \details{
   The \sQuote{size} of a character string can be measured in one of
   three ways (corresponding to the \code{type} argument):
   \describe{
     \item{\code{bytes}}{The number of bytes needed to store the string
       (plus in C a final terminator which is not counted).}
     \item{\code{chars}}{The number of human-readable characters.}
     \item{\code{width}}{The number of columns \code{\link{cat}} will use to
       print the string in a monospaced font.  The same as \code{chars}
       if this cannot be calculated.}
   }
   These will often be the same, and almost always will be in single-byte
   locales (but note how \code{type} determines the default for
   \code{keepNA}).  There will be differences between the first two with
   multibyte character sequences, e.g.\sspace{}in UTF-8 locales.

   The internal equivalent of the default method of
   \code{\link{as.character}} is performed on \code{x} (so there is no
   method dispatch).  If you want to operate on non-vector objects
   passing them through \code{\link{deparse}} first will be required.
 }
 \value{
   For \code{nchar}, an integer vector giving the sizes of each element.
   For missing values (i.e., \code{NA}, i.e., \code{\link{NA_character_}}),
   \code{nchar()} returns \code{\link{NA_integer_}} if \code{keepNA} is
   true, and \code{2}, the number of printing characters, if false.

   \code{type = "width"} gives (an approximation to) the number of
   columns used in printing each element in a terminal font, taking into
   account double-width, zero-width and \sQuote{composing} characters.

   If \code{allowNA = TRUE} and an element is detected as invalid in a
   multi-byte character set such as UTF-8, its number of characters and
   the width will be \code{NA}.  Otherwise the number of characters will
   be non-negative, so \code{!is.na(nchar(x, "chars", TRUE))} is a test
   of validity.

   A character string marked with \code{"bytes"} encoding (see
   \code{\link{Encoding}}) has a number of bytes, but neither a known
   number of characters nor a width, so the latter two types are
   \code{NA} if \code{allowNA = TRUE}, otherwise an error.

   Names, dims and dimnames are copied from the input.

   For \code{nzchar}, a logical vector of the same length as \code{x},
   true if and only if the element has non-zero length; if the element is
   \code{NA}, \code{nzchar()} is true when \code{keepNA} is false, as by
   default, and \code{NA} otherwise.
 }
 \note{
   This does \strong{not} by default give the number of characters that
   will be used to \code{print()} the string.  Use
   \code{\link{encodeString}} to find that.
 #ifdef windows
   This is particularly important on Windows when \samp{\\uxxxx}
   sequences have been used to enter Unicode characters not representable
   in the current encoding.  Thus \code{nchar("\\u2642")} is \code{1},
   and it is printed in \code{Rgui} as one character, but it will be
   printed in \code{Rterm} as \code{<U+2642>}, which is what
   \code{encodeString} gives.
 #endif
 #ifdef unix
   Where character strings have been marked as UTF-8, the number of
   characters and widths will be computed in UTF-8, even though printing
   may use escapes such as \samp{<U+2642>} in a non-UTF-8 locale.
 #endif

   The concept of \sQuote{width} is a slippery one even in a monospaced
   font. Some human languages have the concept of \emph{combining}
   characters, in which two or more characters are rendered together: an
   example would be \code{"y\u306"}, which is two characters of width
   one: combining characters are given width zero, and there are other
   zero-width characters such as the zero-width space \code{"\u200b"}.

   Some East Asian languages have \sQuote{wide} characters, ideographs
   which are conventionally printed across two columns when mixed with
   ASCII and other \sQuote{narrow} characters in those languages.  The
   problem is that whether a computer prints wide characters over two or
   one columns depends on the font, with it not being uncommon to use two
   columns in a font intended for East Asian users and a single column in
   a \sQuote{Western} font.  Unicode has encodings for \sQuote{fullwidth}
   versions of ASCII characters and \sQuote{halfwidth} versions of
   Katakana (Japanese) and Hangul (Korean) characters.  Then there is the
   \sQuote{East Asian Ambiguous class} (Greek, Cyrillic, signs, some
   accented Latin chars, etc), for which the historical practice was to
   use two columns in East Asia and one elsewhere.  The width quoted by
   \code{nchar} for characters in that class (and some others) depends on
   the locale, being one except in some East Asian locales on some OSes
   (notably Windows).

   Control characters are given width zero.
 }
 \references{
   Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988)
   \emph{The New S Language}.
   Wadsworth & Brooks/Cole.

   Unicode Standard Annex #11: \emph{East Asian Width.}
   \url{http://www.unicode.org/reports/tr11/}
 }
 \seealso{
   \code{\link{strwidth}} giving width of strings for plotting;
   \code{\link{paste}}, \code{\link{substr}}, \code{\link{strsplit}}
 }
 \examples{
 x <- c("asfef", "qwerty", "yuiop[", "b", "stuff.blah.yech")
 nchar(x)
 # 5  6  6  1 15

 nchar(deparse(mean))
 # 18 17  <-- unless mean differs from base::mean

 x[3] <- NA; x
 nchar(x, keepNA= TRUE) #  5  6 NA  1 15
 nchar(x, keepNA=FALSE) #  5  6  2  1 15
 stopifnot(identical(nchar(x     ), nchar(x, keepNA= TRUE)),
           identical(nchar(x, "w"), nchar(x, keepNA=FALSE)),
           identical(is.na(x), is.na(nchar(x))))

 ##' nchar() for all three types :
 nchars <- function(x, ...)
    vapply(c("chars", "bytes", "width"),
           function(tp) nchar(x, tp, ...), integer(length(x)))

 nchars("\\u200b") # in R versions (>= 2015-09-xx):
 ## chars bytes width
 ##     1     3     0

 data.frame(x, nchars(x)) ## all three types : same unless for NA
 ## force the same by forcing 'keepNA':
 (ncT <- nchars(x, keepNA = TRUE)) ## .... NA NA NA ....
 (ncF <- nchars(x, keepNA = FALSE))## ....  2  2  2 ....
 stopifnot(apply(ncT, 1, function(.) length(unique(.))) == 1,
           apply(ncF, 1, function(.) length(unique(.))) == 1)
 }
 \keyword{character}
	% File src/library/base/man/nchar.Rd
	% Part of the R package, https://www.R-project.org
	% Copyright 1995-2016 R Core Team
	% Distributed under GPL 2 or later

	\name{nchar}
	\alias{nchar}
	\alias{nzchar}
	\title{Count the Number of Characters (or Bytes or Width)}
	\usage{
	nchar(x, type = "chars", allowNA = FALSE, keepNA = NA)

	nzchar(x, keepNA = FALSE)
	}
	\description{
	\code{nchar} takes a character vector as an argument and
	returns a vector whose elements contain the sizes of
	the corresponding elements of \code{x}. Internally, it is a generic,
	for which methods can be defined.

	\code{nzchar} is a fast way to find out if elements of a character
	vector are non-empty strings.
	}
	\arguments{
	\item{x}{character vector, or a vector to be coerced to a character
	vector. Giving a factor is an error.}
	\item{type}{character string: partial matching to one of
	\code{c("bytes", "chars", "width")}. See \sQuote{Details}.}
	\item{allowNA}{logical: should \code{NA} be returned for invalid
	multibyte strings or \code{"bytes"}-encoded strings (rather than
	throwing an error)?}
	\item{keepNA}{logical: should \code{NA} be returned where ever
	\code{x} is \code{\link{NA}}? If false, \code{nchar()} returns
	\code{2}, as that is the number of printing characters used when
	strings are written to output, and \code{nzchar()} is \code{TRUE}. The
	default for \code{nchar()}, \code{NA}, means to use \code{keepNA = TRUE}
	unless \code{type} is \code{"width"}. Used to be (implicitly) hard
	coded to \code{FALSE} in \R versions \eqn{\le}{<=} 3.2.0.}
	}
	\details{
	The \sQuote{size} of a character string can be measured in one of
	three ways (corresponding to the \code{type} argument):
	\describe{
	\item{\code{bytes}}{The number of bytes needed to store the string
	(plus in C a final terminator which is not counted).}
	\item{\code{chars}}{The number of human-readable characters.}
	\item{\code{width}}{The number of columns \code{\link{cat}} will use to
	print the string in a monospaced font. The same as \code{chars}
	if this cannot be calculated.}
	}
	These will often be the same, and almost always will be in single-byte
	locales (but note how \code{type} determines the default for
	\code{keepNA}). There will be differences between the first two with
	multibyte character sequences, e.g.\sspace{}in UTF-8 locales.

	The internal equivalent of the default method of
	\code{\link{as.character}} is performed on \code{x} (so there is no
	method dispatch). If you want to operate on non-vector objects
	passing them through \code{\link{deparse}} first will be required.
	}
	\value{
	For \code{nchar}, an integer vector giving the sizes of each element.
	For missing values (i.e., \code{NA}, i.e., \code{\link{NA_character_}}),
	\code{nchar()} returns \code{\link{NA_integer_}} if \code{keepNA} is
	true, and \code{2}, the number of printing characters, if false.

	\code{type = "width"} gives (an approximation to) the number of
	columns used in printing each element in a terminal font, taking into
	account double-width, zero-width and \sQuote{composing} characters.

	If \code{allowNA = TRUE} and an element is detected as invalid in a
	multi-byte character set such as UTF-8, its number of characters and
	the width will be \code{NA}. Otherwise the number of characters will
	be non-negative, so \code{!is.na(nchar(x, "chars", TRUE))} is a test
	of validity.

	A character string marked with \code{"bytes"} encoding (see
	\code{\link{Encoding}}) has a number of bytes, but neither a known
	number of characters nor a width, so the latter two types are
	\code{NA} if \code{allowNA = TRUE}, otherwise an error.

	Names, dims and dimnames are copied from the input.

	For \code{nzchar}, a logical vector of the same length as \code{x},
	true if and only if the element has non-zero length; if the element is
	\code{NA}, \code{nzchar()} is true when \code{keepNA} is false, as by
	default, and \code{NA} otherwise.
	}
	\note{
	This does \strong{not} by default give the number of characters that
	will be used to \code{print()} the string. Use
	\code{\link{encodeString}} to find that.
	#ifdef windows
	This is particularly important on Windows when \samp{\\uxxxx}
	sequences have been used to enter Unicode characters not representable
	in the current encoding. Thus \code{nchar("\\u2642")} is \code{1},
	and it is printed in \code{Rgui} as one character, but it will be
	printed in \code{Rterm} as \code{<U+2642>}, which is what
	\code{encodeString} gives.
	#endif
	#ifdef unix
	Where character strings have been marked as UTF-8, the number of
	characters and widths will be computed in UTF-8, even though printing
	may use escapes such as \samp{<U+2642>} in a non-UTF-8 locale.
	#endif

	The concept of \sQuote{width} is a slippery one even in a monospaced
	font. Some human languages have the concept of \emph{combining}
	characters, in which two or more characters are rendered together: an
	example would be \code{"y\u306"}, which is two characters of width
	one: combining characters are given width zero, and there are other
	zero-width characters such as the zero-width space \code{"\u200b"}.

	Some East Asian languages have \sQuote{wide} characters, ideographs
	which are conventionally printed across two columns when mixed with
	ASCII and other \sQuote{narrow} characters in those languages. The
	problem is that whether a computer prints wide characters over two or
	one columns depends on the font, with it not being uncommon to use two
	columns in a font intended for East Asian users and a single column in
	a \sQuote{Western} font. Unicode has encodings for \sQuote{fullwidth}
	versions of ASCII characters and \sQuote{halfwidth} versions of
	Katakana (Japanese) and Hangul (Korean) characters. Then there is the
	\sQuote{East Asian Ambiguous class} (Greek, Cyrillic, signs, some
	accented Latin chars, etc), for which the historical practice was to
	use two columns in East Asia and one elsewhere. The width quoted by
	\code{nchar} for characters in that class (and some others) depends on
	the locale, being one except in some East Asian locales on some OSes
	(notably Windows).

	Control characters are given width zero.
	}
	\references{
	Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988)
	\emph{The New S Language}.
	Wadsworth & Brooks/Cole.

	Unicode Standard Annex #11: \emph{East Asian Width.}
	\url{http://www.unicode.org/reports/tr11/}
	}
	\seealso{
	\code{\link{strwidth}} giving width of strings for plotting;
	\code{\link{paste}}, \code{\link{substr}}, \code{\link{strsplit}}
	}
	\examples{
	x <- c("asfef", "qwerty", "yuiop[", "b", "stuff.blah.yech")
	nchar(x)
	# 5 6 6 1 15

	nchar(deparse(mean))
	# 18 17 <-- unless mean differs from base::mean

	x[3] <- NA; x
	nchar(x, keepNA= TRUE) # 5 6 NA 1 15
	nchar(x, keepNA=FALSE) # 5 6 2 1 15
	stopifnot(identical(nchar(x ), nchar(x, keepNA= TRUE)),
	identical(nchar(x, "w"), nchar(x, keepNA=FALSE)),
	identical(is.na(x), is.na(nchar(x))))

	##' nchar() for all three types :
	nchars <- function(x, ...)
	vapply(c("chars", "bytes", "width"),
	function(tp) nchar(x, tp, ...), integer(length(x)))

	nchars("\\u200b") # in R versions (>= 2015-09-xx):
	## chars bytes width
	## 1 3 0

	data.frame(x, nchars(x)) ## all three types : same unless for NA
	## force the same by forcing 'keepNA':
	(ncT <- nchars(x, keepNA = TRUE)) ## .... NA NA NA ....
	(ncF <- nchars(x, keepNA = FALSE))## .... 2 2 2 ....
	stopifnot(apply(ncT, 1, function(.) length(unique(.))) == 1,
	apply(ncF, 1, function(.) length(unique(.))) == 1)
	}
	\keyword{character}