 % File src/library/base/man/icuSetCollate.Rd
 % Part of the R package, https://www.R-project.org
 % Copyright 2008-2018 R Core Team
 % Distributed under GPL 2 or later

 \name{icuSetCollate}
 \alias{icuSetCollate}
 \alias{icuGetCollate}
 \alias{R_ICU_LOCALE}

 \title{ Setup Collation by ICU }
 \description{
   Controls the way collation is done by ICU (an optional part of the \R
   build).
 }
 \usage{
 icuSetCollate(...)

 icuGetCollate(type = c("actual", "valid"))
 }
 \arguments{
   \item{\dots}{Named arguments, see \sQuote{Details}.}
   \item{type}{character string: can be abbreviated.  Either the actual locale
     in use for collation or the most specific locale which would be valid.}
 }
 \details{
   Optionally, \R can be built to collate character strings by ICU
   (\url{http://site.icu-project.org}).  For such systems,
   \code{icuSetCollate} can be used to tune the way collation is done.
   On other builds, calling this function does nothing except issue a warning.

   Possible arguments are
   \describe{
     \item{\code{locale}:}{A character string such as \code{"da_DK"}
       giving the language and country whose collation rules are to be
       used.  If present, this should be the first argument.}
     \item{\code{case_first}:}{\code{"upper"}, \code{"lower"} or
       \code{"default"}, asking for upper- or lower-case characters to be
       sorted first.  The default is usually lower-case first, but not in
       all languages (not under the default settings for Danish, for example).}
     \item{\code{alternate_handling}:}{Controls the handling of
       \sQuote{variable} characters (mainly punctuation and symbols).
       Possible values are \code{"non_ignorable"} (primary strength) and
       \code{"shifted"} (quaternary strength).}
     \item{\code{strength}:}{Which components should be used?  Possible
       values \code{"primary"}, \code{"secondary"}, \code{"tertiary"}
       (default), \code{"quaternary"} and \code{"identical"}. }
     \item{\code{french_collation}:}{In a French locale the way accents
       affect collation is from right to left, whereas in most other locales
       it is from left to right.  Possible values \code{"on"}, \code{"off"}
       and \code{"default"}.}
     \item{\code{normalization}:}{Should strings be normalized?  Possible values
       are \code{"on"} and \code{"off"} (default).  This affects the
       collation of composite characters.}
     \item{\code{case_level}:}{An additional level between secondary and
       tertiary, used to distinguish large and small Japanese Kana
       characters. Possible values \code{"on"} and \code{"off"} (default).}
     \item{\code{hiragana_quaternary}:}{Possible values \code{"on"} (sort
       Hiragana first at quaternary level) and \code{"off"}.}
   }
   Only the first three are likely to be of interest except to those with a
   detailed understanding of collation and specialized requirements.

   Some special values are accepted for \code{locale}:
   \describe{
     \item{\code{"none"}:}{ICU is not used for collation: the OS's
       collation services are used instead.}
     \item{\code{"ASCII"}:}{ICU is not used for collation: the C function
       \code{strcmp} is used instead, which should sort byte-by-byte in
       (unsigned) numerical order.}
     \item{\code{"default"}:}{
       obtains the locale from the OS as is done at the start of the
       session.  If environment variable \env{R_ICU_LOCALE} is set to a
       non-empty value, its value is used rather than consulting the OS,
       unless environment variable \env{LC_ALL} is set to 'C' (or unset but
       \env{LC_COLLATE} is set to 'C').
     }
     \item{\code{""}, \code{"root"}:}{
       the \sQuote{root} collation: see
       \url{http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation}.
     }
   }
   For the specifications of \sQuote{real} ICU locales, see
   \url{http://userguide.icu-project.org/locale}.  Note that ICU does not
   report that a locale is not supported, but falls back to its idea of
   \sQuote{best fit} (which could be rather different and is reported by
   \code{icuGetCollate("actual")}, often \code{"root"}).  Most English
   locales fall back to \code{"root"}: although e.g.\sspace{}\code{"en_GB"} is
   a valid locale (at least on some platforms), it contains no special
   rules for collation.  Note that \code{"C"} is not a supported ICU locale
   and hence \env{R_ICU_LOCALE} should never be set to \code{"C"}.

   Some examples are \code{case_level = "on", strength = "primary"} to ignore
   accent differences and \code{alternate_handling = "shifted"} to ignore
   space and punctuation characters.

   Initially ICU will not be used for collation if the OS is set to use the
   \code{C} locale for collation and \env{R_ICU_LOCALE} is not set.  Once
   this function is called with a value for \code{locale}, ICU will be used
   until it is called again with \code{locale = "none"}.  ICU will not be
   used once \code{Sys.setlocale} is called with a \code{"C"} value for
   \code{LC_ALL} or \code{LC_COLLATE}, even if \env{R_ICU_LOCALE} is set.
   ICU will be used again honoring \env{R_ICU_LOCALE} once
   \code{Sys.setlocale} is called to set a different collation order.
   Environment variables \env{LC_ALL} (or \env{LC_COLLATE}) take precedence
   over \env{R_ICU_LOCALE} if and only if they are set to 'C'.  Due to the
   interaction with other ways of setting the collation order,
   \env{R_ICU_LOCALE} should be used with care and only when needed.

   All customizations are reset to the default for the locale if
   \code{locale} is specified: the collation engine is reset if the
   OS collation locale category is changed by \code{\link{Sys.setlocale}}.
 }
 \value{
   For \code{icuGetCollate}, a character string describing the ICU locale
   in use (which may be reported as \code{"ICU not in use"}).  The
   \sQuote{actual} locale may be simpler than the requested locale: for
   example \code{"da"} rather than \code{"da_DK"}: English locales are
   likely to report \code{"root"}.
 }
 \note{
   ICU is used by default wherever it is available: this includes macOS,
   Solaris and many Linux installations.  As it works internally in
   UTF-8, it will be most efficient in UTF-8 locales.

   It is optional on Windows: if \R has been built against ICU, it will
   only be used if environment variable \env{R_ICU_LOCALE} is set or once
   \code{icuSetCollate} is called to select the locale (as ICU and
   Windows differ in their idea of locale names).  Note that
   \code{icuSetCollate(locale = "default")} should work reasonably well
   for \R >= 3.2.0 and Windows Vista/Server 2008 and later (but finds the
   system default ignoring environment variables such as \env{LC_COLLATE}).
 }
 \seealso{
   \link{Comparison}, \code{\link{sort}}.

   \code{\link{capabilities}} for whether ICU is available;
   \code{\link{extSoftVersion}} for its version.

   The ICU user guide chapter on collation
   (\url{http://userguide.icu-project.org/collation}).
 }
 \examples{\donttest{
 ## These examples depend on having ICU available, and on the locale.
 ## As we don't know the current settings, we can only reset to the default.
 if(capabilities("ICU")) {
     print(icuGetCollate())
     print(icuGetCollate("valid"))
     x <- c("Aarhus", "aarhus", "safe", "test", "Zoo")
     print(sort(x))
     icuSetCollate(case_first = "upper"); print(sort(x))
     icuSetCollate(case_first = "lower"); print(sort(x))
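
     ## Sketch (an addition to the original examples): ICU falls back
     ## silently to its 'best fit', so it can help to compare the actual
     ## and valid locales after a request.  English locales are likely to
     ## report "root" as the actual locale; the result is platform-dependent,
     ## so no output is shown here.
     icuSetCollate(locale = "en_GB")
     print(icuGetCollate("actual")); print(icuGetCollate("valid"))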

     ## Danish collates upper-case-first and with 'aa' as a single letter
     icuSetCollate(locale = "da_DK", case_first = "default"); print(sort(x))
     ## Estonian collates Z between S and T
     icuSetCollate(locale = "et_EE"); print(sort(x))
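
     ## Sketch of two settings described in 'Details' (an addition to the
     ## original examples; the vector 'y' is made up for illustration, and
     ## the printed order may depend on the ICU version, so none is shown).
     ## Giving 'locale' first resets any earlier customizations.
     y <- c("co-op", "coop", "co op", "Coop")
     ## strength = "primary": case and accent differences are not used
     ## to distinguish strings
     icuSetCollate(locale = "root", strength = "primary"); print(sort(y))
     ## alternate_handling = "shifted": at the default tertiary strength,
     ## spaces and punctuation are ignored
     icuSetCollate(locale = "root", alternate_handling = "shifted")
     print(sort(y))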
     icuSetCollate(locale = "default"); print(icuGetCollate("valid"))
 }
 }}
 \keyword{ utilities }