| @node Character Set Handling, Locales, String and Array Utilities, Top |
| @c %MENU% Support for extended character sets |
| @chapter Character Set Handling |
| |
| @ifnottex |
| @macro cal{text} |
| \text\ |
| @end macro |
| @end ifnottex |
| |
| Character sets used in the early days of computing had only six, seven, |
| or eight bits for each character: there was never a case where more than |
| eight bits (one byte) were used to represent a single character. The |
| limitations of this approach became more apparent as more people |
| grappled with non-Roman character sets, where not all the characters |
| that make up a language's character set can be represented by @math{2^8} |
| choices. This chapter shows the functionality that was added to the C |
| library to support multiple character sets. |
| |
| @menu |
| * Extended Char Intro:: Introduction to Extended Characters. |
| * Charset Function Overview:: Overview about Character Handling |
| Functions. |
| * Restartable multibyte conversion:: Restartable multibyte conversion |
| Functions. |
| * Non-reentrant Conversion:: Non-reentrant Conversion Function. |
| * Generic Charset Conversion:: Generic Charset Conversion. |
| @end menu |
| |
| |
| @node Extended Char Intro |
| @section Introduction to Extended Characters |
| |
| A variety of solutions is available to overcome the differences between |
| character sets with a 1:1 relation between bytes and characters and |
| character sets with ratios of 2:1 or 4:1. The remainder of this |
| section gives a few examples to help understand the design decisions |
| made while developing the functionality of the @w{C library}. |
| |
| @cindex internal representation |
| A distinction we have to make right away is between internal and |
| external representation. @dfn{Internal representation} means the |
| representation used by a program while keeping the text in memory. |
| External representations are used when text is stored or transmitted |
| through some communication channel. Examples of external |
| representations include files waiting in a directory to be |
| read and parsed. |
| |
| Traditionally there has been no difference between the two representations. |
| It was equally comfortable and useful to use the same single-byte |
| representation internally and externally. This comfort level decreases |
| with more and larger character sets. |
| |
| One of the problems to overcome with the internal representation is |
| handling text that is externally encoded using different character |
| sets. Assume a program that reads two texts and compares them using |
| some metric. The comparison can be usefully done only if the texts are |
| internally kept in a common format. |
| |
| @cindex wide character |
| For such a common format (@math{=} character set) eight bits are certainly |
| no longer enough. So the smallest entity will have to grow: @dfn{wide |
| characters} will now be used. Instead of one byte per character, two or |
| four will be used. (Three are not good to address in memory and |
| more than four bytes seem not to be necessary.) |
| |
| @cindex Unicode |
| @cindex ISO 10646 |
| As shown in some other part of this manual, |
| @c !!! Ahem, wide char string functions are not yet covered -- drepper |
| a completely new family of functions has been created that can handle wide |
| character texts in memory. The most commonly used character sets for such |
| internal wide character representations are Unicode and @w{ISO 10646} |
| (also known as UCS for Universal Character Set). Unicode was originally |
| planned as a 16-bit character set, whereas @w{ISO 10646} was designed to |
| be a 31-bit large code space. The two standards are practically identical. |
| They have the same character repertoire and code table, but Unicode specifies |
| added semantics. At the moment, only characters in the first @code{0x10000} |
| code positions (the so-called Basic Multilingual Plane, BMP) have been |
| assigned, but the assignment of more specialized characters outside this |
| 16-bit space is already in progress. A number of encodings have been |
| defined for Unicode and @w{ISO 10646} characters: |
| @cindex UCS-2 |
| @cindex UCS-4 |
| @cindex UTF-8 |
| @cindex UTF-16 |
| UCS-2 is a 16-bit word that can only represent characters |
| from the BMP, UCS-4 is a 32-bit word that can represent any Unicode |
| and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where |
| ASCII characters are represented by ASCII bytes and non-ASCII characters |
| by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension |
| of UCS-2 in which pairs of certain UCS-2 words can be used to encode |
| non-BMP characters up to @code{0x10ffff}. |
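
The way UTF-16 extends UCS-2 can be made concrete with a little
arithmetic. The following is only a sketch and not part of the C
library: the function name @code{utf16_encode} is hypothetical. It
assumes the usual surrogate ranges, @code{0xd800} to @code{0xdbff} for
the first word and @code{0xdc00} to @code{0xdfff} for the second, which
this section does not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16.  Returns the number of 16-bit
   units written (1 or 2), or 0 if CP cannot be encoded.
   Hypothetical helper, for illustration only.  */
size_t
utf16_encode (uint32_t cp, uint16_t out[2])
{
  assert (out != NULL);
  if (cp >= 0xd800 && cp <= 0xdfff)
    return 0;                   /* Reserved for surrogates.  */
  if (cp < 0x10000)
    {
      out[0] = (uint16_t) cp;   /* BMP: same value as in UCS-2.  */
      return 1;
    }
  if (cp > 0x10ffff)
    return 0;                   /* Beyond the reach of UTF-16.  */
  cp -= 0x10000;                /* 20 bits remain.  */
  out[0] = (uint16_t) (0xd800 + (cp >> 10));    /* First word.  */
  out[1] = (uint16_t) (0xdc00 + (cp & 0x3ff));  /* Second word.  */
  return 2;
}
```

Each word of a pair carries ten bits of the code point, which is why
the encoding stops at @code{0x10ffff}.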
| |
| To represent wide characters the @code{char} type is not suitable. For |
| this reason the @w{ISO C} standard introduces a new type that is |
| designed to keep one character of a wide character string. To maintain |
| the similarity there is also a type corresponding to @code{int} for |
| those functions that take a single wide character. |
| |
| @comment stddef.h |
| @comment ISO |
| @deftp {Data type} wchar_t |
| This data type is used as the base type for wide character strings. |
| In other words, arrays of objects of this type are the equivalent of |
| @code{char[]} for multibyte character strings. The type is defined in |
| @file{stddef.h}. |
| |
| The @w{ISO C90} standard, where @code{wchar_t} was introduced, does not |
| say anything specific about the representation. It only requires that |
| this type is capable of storing all elements of the basic character set. |
| Therefore it would be legitimate to define @code{wchar_t} as @code{char}, |
| which might make sense for embedded systems. |
| |
| But in @theglibc{} @code{wchar_t} is always 32 bits wide and is therefore |
| capable of representing all UCS-4 values, covering all of |
| @w{ISO 10646}. Some Unix systems define @code{wchar_t} as a 16-bit type |
| and thereby follow Unicode very strictly. This definition is perfectly |
| fine with the standard, but it also means that to represent all |
| characters from Unicode and @w{ISO 10646} one has to use UTF-16 surrogate |
| characters, which is in fact a multi-wide-character encoding. But |
| resorting to multi-wide-character encoding contradicts the purpose of the |
| @code{wchar_t} type. |
| @end deftp |
| |
| @comment wchar.h |
| @comment ISO |
| @deftp {Data type} wint_t |
| @code{wint_t} is a data type used for parameters and variables that |
| contain a single wide character. As the name suggests this type is the |
| equivalent of @code{int} when using the normal @code{char} strings. The |
| types @code{wchar_t} and @code{wint_t} often have the same |
| representation if they are 32 bits wide, but if @code{wchar_t} is |
| defined as @code{char}, the type @code{wint_t} must be defined as |
| @code{int} due to the parameter promotion. |
| |
| @pindex wchar.h |
| This type is defined in @file{wchar.h} and was introduced in |
| @w{Amendment 1} to @w{ISO C90}. |
| @end deftp |
| |
| As for the @code{char} data type, macros are available for |
| specifying the minimum and maximum value representable in an object of |
| type @code{wchar_t}. |
| |
| @comment wchar.h |
| @comment ISO |
| @deftypevr Macro wint_t WCHAR_MIN |
| The macro @code{WCHAR_MIN} evaluates to the minimum value representable |
| by an object of type @code{wchar_t}. |
| |
| This macro was introduced in @w{Amendment 1} to @w{ISO C90}. |
| @end deftypevr |
| |
| @comment wchar.h |
| @comment ISO |
| @deftypevr Macro wint_t WCHAR_MAX |
| The macro @code{WCHAR_MAX} evaluates to the maximum value representable |
| by an object of type @code{wchar_t}. |
| |
| This macro was introduced in @w{Amendment 1} to @w{ISO C90}. |
| @end deftypevr |
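
As a small sketch of how these limits might be used, the following
hypothetical helper (the name @code{wchar_can_hold} is ours) checks
whether a given value fits into a single @code{wchar_t}. On a system
with a 32-bit @code{wchar_t} one would expect it to accept
@code{0x10ffff}; with a 16-bit @code{wchar_t} it would not.

```c
#include <assert.h>
#include <wchar.h>

/* Return nonzero if a single wchar_t can hold the value CP.
   Hypothetical helper, for illustration only.  */
int
wchar_can_hold (unsigned long cp)
{
  /* WCHAR_MAX is positive, so converting it to an unsigned type
     preserves its value.  */
  assert (WCHAR_MAX > 0);
  return (unsigned long long) WCHAR_MAX >= cp;
}
```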
| |
| Another special wide character value is the equivalent to @code{EOF}. |
| |
| @comment wchar.h |
| @comment ISO |
| @deftypevr Macro wint_t WEOF |
| The macro @code{WEOF} evaluates to a constant expression of type |
| @code{wint_t} whose value is different from any member of the extended |
| character set. |
| |
| @code{WEOF} need not be the same value as @code{EOF} and unlike |
| @code{EOF} it also need @emph{not} be negative. In other words, sloppy |
| code like |
| |
| @smallexample |
| @{ |
| int c; |
| @dots{} |
| while ((c = getc (fp)) >= 0) |
| @dots{} |
| @} |
| @end smallexample |
| |
| @noindent |
| has to be rewritten to use @code{WEOF} explicitly when wide characters |
| are used: |
| |
| @smallexample |
| @{ |
| wint_t c; |
| @dots{} |
| while ((c = getwc (fp)) != WEOF) |
| @dots{} |
| @} |
| @end smallexample |
| |
| @pindex wchar.h |
| This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is |
| defined in @file{wchar.h}. |
| @end deftypevr |
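
A complete, if minimal, reading loop could look like the following
sketch. The function name @code{count_wide_chars} is hypothetical; the
only library function used is the standard @code{getwc}. Note that the
variable receiving the result has type @code{wint_t}, not
@code{wchar_t}, so that it can hold @code{WEOF}.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Count the wide characters that can be read from FP.
   Hypothetical helper, for illustration only.  */
size_t
count_wide_chars (FILE *fp)
{
  size_t count = 0;
  wint_t c;                     /* Must be able to hold WEOF.  */

  assert (fp != NULL);
  while ((c = getwc (fp)) != WEOF)
    ++count;
  return count;
}
```

The loop stops both at the end of the file and on a conversion error; a
caller that needs to tell the two apart can check @code{ferror}
afterwards.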
| |
| |
| These internal representations present problems when it comes to storage |
| and transmittal. Because each single wide character consists of more |
| than one byte, they are affected by byte-ordering. Thus, machines with |
| different endiannesses would see different values when accessing the same |
| data. This byte ordering concern also applies to communication protocols |
| that are all byte-based and therefore require the sender to |
| decide how to split the wide character into bytes. A last (but not least |
| important) point is that wide characters often require more storage space |
| than a customized byte-oriented character set. |
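
To illustrate the byte-ordering concern, here is a sketch of how a
sender might fix the order explicitly instead of writing the in-memory
representation. The helper names @code{put_ucs4_be} and
@code{get_ucs4_be} are hypothetical, and the choice of big-endian order
is merely an assumption for the example; what matters is that both
sides agree on it.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Store the 32-bit value WC in OUT, most significant byte first.
   Hypothetical helper, for illustration only.  */
void
put_ucs4_be (uint32_t wc, unsigned char out[4])
{
  assert (out != NULL);
  out[0] = (unsigned char) (wc >> 24);
  out[1] = (unsigned char) (wc >> 16);
  out[2] = (unsigned char) (wc >> 8);
  out[3] = (unsigned char) wc;
}

/* Read back a value stored by put_ucs4_be.  The result is the same
   regardless of the endianness of the machine running this code.  */
uint32_t
get_ucs4_be (const unsigned char in[4])
{
  assert (in != NULL);
  return ((uint32_t) in[0] << 24) | ((uint32_t) in[1] << 16)
         | ((uint32_t) in[2] << 8) | (uint32_t) in[3];
}
```

Four bytes are spent even on an ASCII character, which is the storage
cost mentioned above.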
| |
| @cindex multibyte character |
| @cindex EBCDIC |
| For all the above reasons, an external encoding that is different from |
| the internal encoding is often used if the latter is UCS-2 or UCS-4. |
| The external encoding is byte-based and can be chosen appropriately for |
| the environment and for the texts to be handled. A variety of different |
| character sets can be used for this external encoding (information that |
| will not be exhaustively presented here--instead, a description of the |
| major groups will suffice). All of the ASCII-based character sets |
| fulfill one requirement: they are "filesystem safe." This means that |
| the character @code{'/'} is used in the encoding @emph{only} to |
| represent itself. Things are a bit different for character sets like |
| EBCDIC (Extended Binary Coded Decimal Interchange Code, a character set |
| family used by IBM), but if the operating system does not understand |
| EBCDIC directly the parameters to system calls have to be converted |
| first anyhow. |
| |
| @itemize @bullet |
| @item |
| The simplest character sets are single-byte character sets. There can |
| be only up to 256 characters (for @w{8 bit} character sets), which is |
| not sufficient to cover all languages but might be sufficient to handle |
| a specific text. Handling of @w{8 bit} character sets is simple. This |
| is not true for other kinds presented later, and therefore, the |
| application one uses might require the use of @w{8 bit} character sets. |
| |
| @cindex ISO 2022 |
| @item |
| The @w{ISO 2022} standard defines a mechanism for extended character |
| sets where one character @emph{can} be represented by more than one |
| byte. This is achieved by associating a state with the text. |
| Characters that can be used to change the state can be embedded in the |
| text. Each byte in the text might have a different interpretation in each |
| state. The state might even influence whether a given byte stands for a |
| character on its own or whether it has to be combined with some more |
| bytes. |
| |
| @cindex EUC |
| @cindex Shift_JIS |
| @cindex SJIS |
| In most uses of @w{ISO 2022} the defined character sets do not allow |
| state changes that cover more than the next character. This has the |
| big advantage that whenever one can identify the beginning of the byte |
| sequence of a character one can interpret a text correctly. Examples of |
| character sets using this policy are the various EUC character sets |
| (used by Sun's operating systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN) |
| or Shift_JIS (SJIS, a Japanese encoding). |
| |
| But there are also character sets using a state that is valid for more |
| than one character and has to be changed by another byte sequence. |
| Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN. |
| |
| @item |
| @cindex ISO 6937 |
| Early attempts to fix 8 bit character sets for other languages using the |
| Roman alphabet led to character sets like @w{ISO 6937}. Here bytes |
| representing characters like the acute accent do not produce output |
| themselves: one has to combine them with other characters to get the |
| desired result. For example, one writes the byte sequence @code{0xc2 0x61} |
| (non-spacing acute accent, followed by lower-case `a') to get the ``small |
| a with acute'' character. To get the acute accent character on its own, |
| one has to write @code{0xc2 0x20} (the non-spacing acute followed by a |
| space). |
| |
| Character sets like @w{ISO 6937} are used in some embedded systems such |
| as teletex. |
| |
| @item |
| @cindex UTF-8 |
| Instead of converting the Unicode or @w{ISO 10646} text used internally, |
| it is often also sufficient to simply use an encoding different from |
| UCS-2/UCS-4. The Unicode and @w{ISO 10646} standards even specify such an |
| encoding: UTF-8. This encoding is able to represent all of @w{ISO |
| 10646} 31 bits in a byte string of length one to six. |
| |
| @cindex UTF-7 |
| There were a few other attempts to encode @w{ISO 10646} such as UTF-7, |
| but UTF-8 is today the only encoding that should be used. In fact, with |
| any luck UTF-8 will soon be the only external encoding that has to be |
| supported. It proves to be universally usable and its only disadvantage |
| is that it favors Roman languages by making the byte string |
| representation of other scripts (Cyrillic, Greek, Asian scripts) longer |
| than necessary if using a specific character set for these scripts. |
| Methods like the Unicode compression scheme can alleviate these |
| problems. |
| @end itemize |
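
The UTF-8 scheme mentioned in the last item can be sketched in a few
lines. The function below is hypothetical (the name @code{utf8_encode}
is ours) and follows the one-to-six byte form described above, which
covers 31 bits. It assumes the usual bit layout, a lead byte whose
number of leading one bits gives the sequence length followed by
continuation bytes of the form @code{10xxxxxx}; later revisions of the
encoding restrict it to four bytes, which this sketch does not enforce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode CP (at most 31 bits) as a one-to-six byte UTF-8 sequence.
   Returns the number of bytes written, or 0 if CP is too large.
   Hypothetical helper, for illustration only.  */
size_t
utf8_encode (uint32_t cp, unsigned char out[6])
{
  size_t n, i;

  assert (out != NULL);
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;      /* ASCII stands for itself.  */
      return 1;
    }
  if (cp < 0x800)
    n = 2;
  else if (cp < 0x10000)
    n = 3;
  else if (cp < 0x200000)
    n = 4;
  else if (cp < 0x4000000)
    n = 5;
  else if (cp < 0x80000000)
    n = 6;
  else
    return 0;

  /* Fill the continuation bytes from the end, six bits each.  */
  for (i = n - 1; i > 0; --i)
    {
      out[i] = (unsigned char) (0x80 | (cp & 0x3f));
      cp >>= 6;
    }
  /* The lead byte starts with N one bits followed by a zero bit.  */
  out[0] = (unsigned char) (((0xff00 >> n) & 0xff) | cp);
  return n;
}
```

Every byte of a multi-byte sequence has its high bit set, so no
non-ASCII character ever produces a byte that could be mistaken for
@code{'/'} or any other ASCII character.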
| |
| The question remaining is: how to select the character set or encoding |
| to use. The answer: you cannot decide about it yourself; it is decided |
| by the developers of the system or the majority of the users. Since the |
| goal is interoperability one has to use whatever the other people one |
| works with use. If there are no constraints, the selection is based on |
| the requirements the expected circle of users will have. In other words, |
| if a project is expected to be used in only, say, Russia it is fine to use |
| KOI8-R or a similar character set. But if at the same time people from, |
| say, Greece are participating one should use a character set that allows |
| all people to collaborate. |
| |
| The most widely useful solution seems to be: go with the most general |
| character set, namely @w{ISO 10646}. Use UTF-8 as the external encoding |
| and problems about users not being able to use their own language |
| adequately are a thing of the past. |
| |
| One final comment about the choice of the wide character representation |
| is necessary at this point. We have said above that the natural choice |
| is using Unicode or @w{ISO 10646}. This is not required, but at least |
| encouraged, by the @w{ISO C} standard. The standard defines at least a |
| macro @code{__STDC_ISO_10646__} that is only defined on systems where |
| the @code{wchar_t} type encodes @w{ISO 10646} characters. If this |
| symbol is not defined one should avoid making assumptions about the wide |
| character representation. If the programmer uses only the functions |
| provided by the C library to handle wide character strings there should |
| be no compatibility problems with other systems. |
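
A program that does want to rely on the @w{ISO 10646} representation
can test the macro at compile time. The following sketch uses a
hypothetical function name, @code{wchar_is_iso10646}, and only shows
the shape of such a test.

```c
#include <assert.h>
#include <wchar.h>

/* Return 1 if wchar_t values are known to be ISO 10646 code points,
   0 if no such assumption may be made.
   Hypothetical helper, for illustration only.  */
int
wchar_is_iso10646 (void)
{
#ifdef __STDC_ISO_10646__
  /* The guarantee implies that L'A' has the value U+0041.  */
  assert (L'A' == 0x41);
  return 1;
#else
  return 0;
#endif
}
```

Code on the @code{0} branch should restrict itself to the library's
wide character functions and make no assumptions about numeric values.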
| |
| @node Charset Function Overview |
| @section Overview about Character Handling Functions |
| |
| A Unix @w{C library} contains three different sets of functions in two |
| families to handle character set conversion. One of the function families |
| (the most commonly used) is specified in the @w{ISO C90} standard and, |
| therefore, is portable even beyond the Unix world. Unfortunately this |
| family is the least useful one. These functions should be avoided |
| whenever possible, especially when developing libraries (as opposed to |
| applications). |
| |
| The second family of functions was introduced in the early Unix standards |
| (XPG2) and is still part of the latest and greatest Unix standard: |
| @w{Unix 98}. It is also the most powerful and useful set of functions. |
| But we will start with the functions defined in @w{Amendment 1} to |
| @w{ISO C90}. |
| |
| @node Restartable multibyte conversion |
| @section Restartable Multibyte Conversion Functions |
| |
| The @w{ISO C} standard defines functions to convert strings from a |
| multibyte representation to wide character strings. There are a number |
| of peculiarities: |
| |
| @itemize @bullet |
| @item |
| The character set assumed for the multibyte encoding is not specified |
| as an argument to the functions. Instead the character set specified by |
| the @code{LC_CTYPE} category of the current locale is used; see |
| @ref{Locale Categories}. |
| |
| @item |
| The functions handling more than one character at a time require NUL |
| terminated strings as the argument (i.e., converting blocks of text |
| does not work unless one can add a NUL byte at an appropriate place). |
| @Theglibc{} contains some extensions to the standard that allow |
| specifying a size, but basically they also expect terminated strings. |
| @end itemize |
| |
| Despite these limitations the @w{ISO C} functions can be used in many |
| contexts. In graphical user interfaces, for instance, it is not |
| uncommon to have functions that require text to be displayed in a wide |
| character string if the text is not simple ASCII. The text itself might |
| come from a file with translations and the user should decide about the |
| current locale, which determines the translation and therefore also the |
| external encoding used. In such a situation (and many others) the |
| functions described here are perfect. If more freedom while performing |
| the conversion is necessary take a look at the @code{iconv} functions |
| (@pxref{Generic Charset Conversion}). |
| |
| @menu |
| * Selecting the Conversion:: Selecting the conversion and its properties. |
| * Keeping the state:: Representing the state of the conversion. |
| * Converting a Character:: Converting Single Characters. |
| * Converting Strings:: Converting Multibyte and Wide Character |
| Strings. |
| * Multibyte Conversion Example:: A Complete Multibyte Conversion Example. |
| @end menu |
| |
| @node Selecting the Conversion |
| @subsection Selecting the conversion and its properties |
| |
| We already said above that the currently selected locale for the |
| @code{LC_CTYPE} category decides about the conversion that is performed |
| by the functions we are about to describe. Each locale uses its own |
| character set (given as an argument to @code{localedef}) and this is the |
| one assumed as the external multibyte encoding. The wide character |
| set is always UCS-4 in @theglibc{}. |
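
To see which multibyte encoding a locale selects, a program can ask the
system after setting the @code{LC_CTYPE} category. The sketch below
uses the POSIX function @code{nl_langinfo} with the @code{CODESET}
item, which is not discussed in this section; the wrapper name
@code{select_ctype_and_get_codeset} is hypothetical.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Select LOCALE_NAME for the LC_CTYPE category and return the name of
   its character set, or NULL if the locale cannot be selected.
   Hypothetical helper, for illustration only.  */
const char *
select_ctype_and_get_codeset (const char *locale_name)
{
  assert (locale_name != NULL);
  if (setlocale (LC_CTYPE, locale_name) == NULL)
    return NULL;
  /* The returned string may be overwritten by later calls.  */
  return nl_langinfo (CODESET);
}
```

Passing the empty string as the locale name selects the locale from the
user's environment, which is what most applications want.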
| |
| A characteristic of each multibyte character set is the maximum number |
| of bytes that can be necessary to represent one character. This |
| information is quite important when writing code that uses the |
| conversion functions (as shown in the examples below). |
| The @w{ISO C} standard defines two macros that provide this information. |
| |
| |
| @comment limits.h |
| @comment ISO |
| @deftypevr Macro int MB_LEN_MAX |
| @code{MB_LEN_MAX} specifies the maximum number of bytes in the multibyte |
| sequence for a single character in any of the supported locales. It is |
| a compile-time constant and is defined in @file{limits.h}. |
| @pindex limits.h |
| @end deftypevr |
| |
| @comment stdlib.h |
| @comment ISO |
| @deftypevr Macro int MB_CUR_MAX |
| @code{MB_CUR_MAX} expands into a positive integer expression that is the |
| maximum number of bytes in a multibyte character in the current locale. |
| The value is never greater than @code{MB_LEN_MAX}. Unlike |
| @code{MB_LEN_MAX} this macro need not be a compile-time constant, and in |
| @theglibc{} it is not. |
| |
| @pindex stdlib.h |
| @code{MB_CUR_MAX} is defined in @file{stdlib.h}. |
| @end deftypevr |
| |
| Two different macros are necessary since strictly @w{ISO C90} compilers |
| do not allow variable length array definitions, but still it is desirable |
| to avoid dynamic allocation. This incomplete piece of code shows the |
| problem: |
| |
| @smallexample |
| @{ |
| char buf[MB_LEN_MAX]; |
| ssize_t len = 0; |
| |
| while (! feof (fp)) |
| @{ |
| fread (&buf[len], 1, MB_CUR_MAX - len, fp); |
| /* @r{@dots{} process} buf */ |
| len -= used; |
| @} |
| @} |
| @end smallexample |
| |
| The code in the inner loop is expected to have always enough bytes in |
| the array @var{buf} to convert one multibyte character. The array |
| @var{buf} has to be sized statically since many compilers do not allow a |
| variable size. The @code{fread} call makes sure that @code{MB_CUR_MAX} |
| bytes are always available in @var{buf}. Note that it isn't |
| a problem if @code{MB_CUR_MAX} is not a compile-time constant. |
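
The relation between the two macros can be stated in code. This is
only a sketch with a hypothetical function name,
@code{mb_buffer_fill_limit}: the array is sized with the compile-time
constant, while the number of bytes actually needed is determined at
run time.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Return how many bytes of a MB_LEN_MAX-sized buffer are needed to
   hold one character in the current locale.
   Hypothetical helper, for illustration only.  */
size_t
mb_buffer_fill_limit (void)
{
  char buf[MB_LEN_MAX];         /* Constant size: valid in ISO C90.  */
  size_t limit = MB_CUR_MAX;    /* Evaluated at run time.  */

  /* The current maximum never exceeds the static buffer size.  */
  assert (limit >= 1 && limit <= sizeof buf);
  return limit;
}
```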
| |
| |
| @node Keeping the state |
| @subsection Representing the state of the conversion |
| |
| @cindex stateful |
| In the introduction of this chapter it was said that certain character |
| sets use a @dfn{stateful} encoding. That is, the encoded values depend |
| in some way on the previous bytes in the text. |
| |
| Since the conversion functions allow converting a text in more than one |
| step we must have a way to pass this information from one call of the |
| functions to another. |
| |
| @comment wchar.h |
| @comment ISO |
| @deftp {Data type} mbstate_t |
| @cindex shift state |
| A variable of type @code{mbstate_t} can contain all the information |
| about the @dfn{shift state} needed from one call to a conversion |
| function to another. |
| |
| @pindex wchar.h |
| @code{mbstate_t} is defined in @file{wchar.h}. It was introduced in |
| @w{Amendment 1} to @w{ISO C90}. |
| @end deftp |
| |
| To use objects of type @code{mbstate_t} the programmer has to define such |
| objects (normally as local variables on the stack) and pass a pointer to |
| the object to the conversion functions. This way the conversion function |
| can update the object if the current multibyte character set is stateful. |
| |
| There is no specific function or initializer to put the state object in |
| any specific state. The rules are that the object should always |
| represent the initial state before the first use, and this is achieved by |
| clearing the whole variable with code such as follows: |
| |
| @smallexample |
| @{ |
| mbstate_t state; |
| memset (&state, '\0', sizeof (state)); |
| /* @r{from now on @var{state} can be used.} */ |
| @dots{} |
| @} |
| @end smallexample |
| |
| When using the conversion functions to generate output it is often |
| necessary to test whether the current state corresponds to the initial |
| state. This is necessary, for example, to decide whether to emit |
| escape sequences to set the state to the initial state at certain |
| sequence points. Communication protocols often require this. |
| |
| @comment wchar.h |
| @comment ISO |
| @deftypefun int mbsinit (const mbstate_t *@var{ps}) |
| @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} |
| @c ps is dereferenced once, unguarded. This would call for @mtsrace:ps, |
| @c but since a single word-sized field is (atomically) accessed, any |
| @c race here would be harmless. Other functions that take an optional |
| @c mbstate_t* argument named ps are marked with @mtasurace:<func>/!ps, |
| @c to indicate that the function uses a static buffer if ps is NULL. |
| @c These could also have been marked with @mtsrace:ps, but we'll omit |
| @c that for brevity, for it's somewhat redundant with the @mtasurace. |
| The @code{mbsinit} function determines whether the state object pointed |
| to by @var{ps} is in the initial state. If @var{ps} is a null pointer or |
| the object is in the initial state the return value is nonzero. Otherwise |
| it is zero. |
| |
| @pindex wchar.h |
| @code{mbsinit} was introduced in @w{Amendment 1} to @w{ISO C90} and is |
| declared in @file{wchar.h}. |
| @end deftypefun |
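
As a minimal sketch of the two rules just given, the following
hypothetical function (the name @code{state_is_initial_after_reset} is
ours) clears a state object and confirms that @code{mbsinit} reports
the initial state for it, as well as for a null pointer.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Return 1 if a freshly cleared state object is reported as being in
   the initial state, 0 otherwise.
   Hypothetical helper, for illustration only.  */
int
state_is_initial_after_reset (void)
{
  mbstate_t state;

  /* A null pointer is always treated as the initial state.  */
  assert (mbsinit (NULL) != 0);

  memset (&state, '\0', sizeof (state));
  return mbsinit (&state) != 0;
}
```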
| |
| Code using @code{mbsinit} often looks similar to this: |
| |
| @c Fix the example to explicitly say how to generate the escape sequence |
| @c to restore the initial state. |
| @smallexample |
| @{ |
| mbstate_t state; |
| memset (&state, '\0', sizeof (state)); |
| /* @r{Use @var{state}.} */ |
| @dots{} |
| if (! mbsinit (&state)) |
| @{ |
| /* @r{Emit code to return to initial state.} */ |
| const wchar_t empty[] = L""; |
| const wchar_t *srcp = empty; |
| wcsrtombs (outbuf, &srcp, outbuflen, &state); |
| @} |
| @dots{} |
| @} |
| @end smallexample |
| |
| The code to emit the escape sequence to get back to the initial state is |
| interesting. The @code{wcsrtombs} function can be used to determine the |
| necessary output code (@pxref{Converting Strings}). Please note that with |
| @theglibc{} it is not necessary to perform this extra action for the |
| conversion from multibyte text to wide character text since the wide |
| character encoding is not stateful. But nothing in |
| any standard prohibits @code{wchar_t} from using a stateful |
| encoding. |
| |
| @node Converting a Character |
| @subsection Converting Single Characters |
| |
| The most fundamental of the conversion functions are those dealing with |
| single characters. Please note that this does not always mean single |
| bytes. But since there is very often a subset of the multibyte |
| character set that consists of single byte sequences, there are |
| functions to help with converting bytes. Frequently, ASCII is a subpart |
| of the multibyte character set. In such a scenario, each ASCII character |
| stands for itself, and all other characters have at least a first byte |
| that is beyond the range @math{0} to @math{127}. |
| |
| @comment wchar.h |
| @comment ISO |
| @deftypefun wint_t btowc (int @var{c}) |
| @safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| @c Calls btowc_fct or __fct; reads from locale, and from the |
| @c get_gconv_fcts result multiple times. get_gconv_fcts calls |
| @c __wcsmbs_load_conv to initialize the ctype if it's null. |
| @c wcsmbs_load_conv takes a non-recursive wrlock before allocating |
| @c memory for the fcts structure, initializing it, and then storing it |
| @c in the locale object. The initialization involves dlopening and a |
| @c lot more. |
| The @code{btowc} function (``byte to wide character'') converts a valid |
| single byte character @var{c} in the initial shift state into the wide |
| character equivalent using the conversion rules from the currently |
| selected locale of the @code{LC_CTYPE} category. |
| |
| If @code{(unsigned char) @var{c}} is no valid single byte multibyte |
| character or if @var{c} is @code{EOF}, the function returns @code{WEOF}. |
| |
| Please note the restriction of @var{c} being tested for validity only in |
| the initial shift state. No @code{mbstate_t} object is used from |
| which the state information is taken, and the function also does not use |
| any static state. |
| |
| @pindex wchar.h |
| The @code{btowc} function was introduced in @w{Amendment 1} to @w{ISO C90} |
| and is declared in @file{wchar.h}. |
| @end deftypefun |
| |
| Despite the limitation that the single byte value is always interpreted |
| in the initial state, this function is actually useful most of the time. |
| Most character sets are either entirely single-byte character sets or they |
| are extensions to ASCII. It is then possible to write code like this |
| (not that this specific example is very useful): |
| |
| @smallexample |
| wchar_t * |
| itow (unsigned long int val) |
| @{ |
| static wchar_t buf[30]; |
| wchar_t *wcp = &buf[29]; |
| *wcp = L'\0'; |
| while (val != 0) |
| @{ |
| *--wcp = btowc ('0' + val % 10); |
| val /= 10; |
| @} |
| if (wcp == &buf[29]) |
| *--wcp = L'0'; |
| return wcp; |
| @} |
| @end smallexample |
| |
| Why is it necessary to use such a complicated implementation and not |
| simply cast @code{'0' + val % 10} to a wide character? The answer is |
| that there is no guarantee that one can perform this kind of arithmetic |
| on the characters of the character set used for @code{wchar_t} |
| representation. In other situations the bytes are not constant at |
| compile time and so the compiler cannot do the work. In situations like |
| this, using @code{btowc} is required. |
| |
| @noindent |
| There is also a function for the conversion in the other direction. |
| |
| @comment wchar.h |
| @comment ISO |
| @deftypefun int wctob (wint_t @var{c}) |
| @safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| The @code{wctob} function (``wide character to byte'') takes as the |
| parameter a valid wide character. If the multibyte representation for |
| this character in the initial state is exactly one byte long, the return |
| value of this function is this character. Otherwise the return value is |
| @code{EOF}. |
| |
| @pindex wchar.h |
| @code{wctob} was introduced in @w{Amendment 1} to @w{ISO C90} and |
| is declared in @file{wchar.h}. |
| @end deftypefun |
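
The two functions are inverses of each other for every byte that is a
complete character in the initial shift state. The following sketch,
with the hypothetical name @code{byte_roundtrips}, checks this for one
byte in the current locale.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Return nonzero if the byte C is a valid single byte character in
   the current locale and survives the trip through btowc and wctob.
   Hypothetical helper, for illustration only.  */
int
byte_roundtrips (unsigned char c)
{
  wint_t wc;

  /* EOF is never a valid character.  */
  assert (btowc (EOF) == WEOF);

  wc = btowc (c);
  if (wc == WEOF)
    return 0;                   /* Not a single byte character.  */
  return wctob (wc) == c;
}
```

In a locale whose character set extends ASCII one would expect
@code{byte_roundtrips ('A')} to be nonzero, while the result for bytes
above 127 depends on the locale.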
| |
| There are more general functions to convert single characters from |
| multibyte representation to wide characters and vice versa. These |
| functions pose no limit on the length of the multibyte representation |
| and they also do not require it to be in the initial state. |
| |
| @comment wchar.h |
| @comment ISO |
| @deftypefun size_t mbrtowc (wchar_t *restrict @var{pwc}, const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps}) |
| @safety{@prelim{}@mtunsafe{@mtasurace{:mbrtowc/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| @cindex stateful |
| The @code{mbrtowc} function (``multibyte restartable to wide |
| character'') converts the next multibyte character in the string pointed |
| to by @var{s} into a wide character and stores it in the wide character |
| string pointed to by @var{pwc}. The conversion is performed according |
| to the locale currently selected for the @code{LC_CTYPE} category. If |
| the conversion for the character set used in the locale requires a state, |
| the multibyte string is interpreted in the state represented by the |
| object pointed to by @var{ps}. If @var{ps} is a null pointer, a static, |
| internal state variable used only by the @code{mbrtowc} function is |
| used. |
| |
| If the next multibyte character corresponds to the NUL wide character, |
| the return value of the function is @math{0} and the state object is |
| afterwards in the initial state. If the next @var{n} or fewer bytes |
| form a correct multibyte character, the return value is the number of |
| bytes starting from @var{s} that form the multibyte character. The |
| conversion state is updated according to the bytes consumed in the |
| conversion. In both cases the wide character (either the @code{L'\0'} |
| or the one found in the conversion) is stored in the string pointed to |
| by @var{pwc} if @var{pwc} is not null. |
| |
| If the first @var{n} bytes of the multibyte string possibly form a valid |
| multibyte character but there are more than @var{n} bytes needed to |
| complete it, the return value of the function is @code{(size_t) -2} and |
| no value is stored. Please note that this can happen even if @var{n} |
| has a value greater than or equal to @code{MB_CUR_MAX} since the input |
| might contain redundant shift sequences. |
| |
| If the first @var{n} bytes of the multibyte string cannot possibly form |
| a valid multibyte character, no value is stored, the global variable |
| @code{errno} is set to the value @code{EILSEQ}, and the function returns |
| @code{(size_t) -1}. The conversion state is afterwards undefined. |
| |
| @pindex wchar.h |
| @code{mbrtowc} was introduced in @w{Amendment 1} to @w{ISO C90} and |
| is declared in @file{wchar.h}. |
| @end deftypefun |
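| |
| The four kinds of return value described above can be told apart as in |
| the following sketch. The names @code{mbr_result} and |
| @code{classify_next} are made up for this illustration and are not part |
| of @theglibc{}; the sketch assumes the caller supplies its own state |
| object. |
| |

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical names, used only for this illustration.  */
enum mbr_result { MBR_NUL, MBR_CHAR, MBR_INCOMPLETE, MBR_INVALID };

/* Convert one multibyte character from the first N bytes of S and
   report which of the four documented outcomes occurred.  For
   MBR_CHAR and MBR_NUL the wide character is stored in *PWC; for
   MBR_CHAR the number of bytes consumed is stored in *USED.  */
enum mbr_result
classify_next (wchar_t *pwc, const char *s, size_t n,
               mbstate_t *ps, size_t *used)
{
  assert (s != NULL && ps != NULL && used != NULL);
  size_t r = mbrtowc (pwc, s, n, ps);
  *used = 0;
  if (r == (size_t) -1)
    return MBR_INVALID;      /* errno is EILSEQ; *PS is undefined.  */
  if (r == (size_t) -2)
    return MBR_INCOMPLETE;   /* More than N bytes are needed.  */
  if (r == 0)
    return MBR_NUL;          /* *PS is back in the initial state.  */
  *used = r;
  return MBR_CHAR;
}
```

| Both error values must be tested before the return value is used as a |
| byte count, since they are very large values of type @code{size_t}. |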
| |
| Use of @code{mbrtowc} is straightforward. A function that copies a |
| multibyte string into a wide character string while at the same time |
| converting all lowercase characters into uppercase could look like this |
| (this is only an example; its error handling is minimal): |
| |
| @smallexample |
| wchar_t * |
| mbstouwcs (const char *s) |
| @{ |
|   /* Include the terminating NUL byte in the conversion. */ |
|   size_t len = strlen (s) + 1; |
|   wchar_t *result = malloc (len * sizeof (wchar_t)); |
|   wchar_t *wcp = result; |
|   wchar_t tmp[1]; |
|   mbstate_t state; |
|   size_t nbytes; |
| |
|   if (result == NULL) |
|     return NULL; |
|   memset (&state, '\0', sizeof (state)); |
|   while ((nbytes = mbrtowc (tmp, s, len, &state)) > 0) |
|     @{ |
|       if (nbytes >= (size_t) -2) |
|         @{ |
|           /* Invalid input string. */ |
|           free (result); |
|           return NULL; |
|         @} |
|       *wcp++ = towupper (tmp[0]); |
|       len -= nbytes; |
|       s += nbytes; |
|     @} |
|   /* Terminate the result string. */ |
|   *wcp = L'\0'; |
|   return result; |
| @} |
| @end smallexample |
| |
| The use of @code{mbrtowc} should be clear. A single wide character is |
| stored in @code{@var{tmp}[0]}, and the number of consumed bytes is stored |
| in the variable @var{nbytes}. If the conversion is successful, the |
| uppercase variant of the wide character is stored in the @var{result} |
| array and the pointer to the input string and the number of available |
| bytes are adjusted. |
| |
| The only non-obvious thing about @code{mbstouwcs} might be the way memory |
| is allocated for the result. The above code uses the fact that there |
| can never be more wide characters in the converted result than there are |
| bytes in the multibyte input string. This method yields a pessimistic |
| guess about the size of the result, and if many wide character strings |
| have to be constructed this way or if the strings are long, the extra |
| memory allocated because the input string contains multibyte characters |
| might be significant. The allocated memory block can be resized to the |
| correct size before returning it, but a better solution might be to |
| allocate just the right amount of space for the result right away. |
| Unfortunately, none of the functions described so far computes the length |
| of the wide character string directly from the multibyte string. There |
| is, however, a function that does part of the work. |
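| |
| The resizing approach mentioned above could look like the following |
| sketch. The name @code{mbs_to_wcs_shrunk} is hypothetical; the code |
| allocates one wide character per input byte, converts, and then asks |
| @code{realloc} to shrink the block to the size actually used. |
| |

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert S to a newly allocated wide character
   string, first allocating one wide character per input byte and
   then shrinking the block to the size actually used.  Returns NULL
   on error.  */
wchar_t *
mbs_to_wcs_shrunk (const char *s)
{
  assert (s != NULL);
  size_t len = strlen (s) + 1;   /* Include the terminating NUL.  */
  wchar_t *result = malloc (len * sizeof (wchar_t));
  if (result == NULL)
    return NULL;

  mbstate_t state;
  memset (&state, '\0', sizeof (state));
  wchar_t *wcp = result;
  for (;;)
    {
      size_t nbytes = mbrtowc (wcp, s, len, &state);
      if (nbytes == 0)
        break;                   /* The NUL was stored in *wcp.  */
      if (nbytes >= (size_t) -2)
        {
          free (result);         /* Invalid or truncated input.  */
          return NULL;
        }
      ++wcp;
      len -= nbytes;
      s += nbytes;
    }

  /* Give back the part of the block that was not needed.  If the
     request fails the original block is still valid, so keep it.  */
  size_t used = (size_t) (wcp - result) + 1;
  wchar_t *shrunk = realloc (result, used * sizeof (wchar_t));
  return shrunk != NULL ? shrunk : result;
}
```

| The caller releases the returned block with @code{free}. |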
| |
| @comment wchar.h |
| @comment ISO |
| @deftypefun size_t mbrlen (const char *restrict @var{s}, size_t @var{n}, mbstate_t *@var{ps}) |
| @safety{@prelim{}@mtunsafe{@mtasurace{:mbrlen/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| The @code{mbrlen} function (``multibyte restartable length'') examines |
| at most @var{n} bytes starting at @var{s} and computes how many of them |
| form the next valid and complete multibyte character. |
| |
| If the next multibyte character corresponds to the NUL wide character, |
| the return value is @math{0}. If the next @var{n} bytes form a valid |
| multibyte character, the number of bytes belonging to this multibyte |
| character byte sequence is returned. |
| |
| If the first @var{n} bytes possibly form a valid multibyte |
| character but the character is incomplete, the return value is |
| @code{(size_t) -2}. Otherwise the multibyte character sequence is invalid |
| and the return value is @code{(size_t) -1}. |
| |
| The multibyte sequence is interpreted in the state represented by the |
| object pointed to by @var{ps}. If @var{ps} is a null pointer, a state |
| object local to @code{mbrlen} is used. |
| |
| @pindex wchar.h |
| @code{mbrlen} was introduced in @w{Amendment 1} to @w{ISO C90} and |
| is declared in @file{wchar.h}. |
| @end deftypefun |
| |
| The attentive reader now will note that @code{mbrlen} can be implemented |
| as |
| |
| @smallexample |
| mbrtowc (NULL, s, n, ps != NULL ? ps : &internal) |
| @end smallexample |
| |
| This is true and in fact is mentioned in the official specification. |
| How can this function be used to determine the length of the wide |
| character string created from a multibyte character string? It is not |
| directly usable, but we can define a function @code{mbslen} using it: |
| |
| @smallexample |
| size_t |
| mbslen (const char *s) |
| @{ |
|   mbstate_t state; |
|   size_t result = 0; |
|   size_t nbytes; |
|   memset (&state, '\0', sizeof (state)); |
|   while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0) |
|     @{ |
|       if (nbytes >= (size_t) -2) |
|         /* @r{Something is wrong.} */ |
|         return (size_t) -1; |
|       s += nbytes; |
|       ++result; |
|     @} |
|   return result; |
| @} |
| @end smallexample |
| |
| This function simply calls @code{mbrlen} for each multibyte character |
| in the string and counts the number of function calls. Please note that |
| we use @code{MB_LEN_MAX} here as the size argument in the @code{mbrlen} |
| call. This is acceptable since a) this value is at least as large as the |
| length of the longest multibyte character sequence and b) we know that the string |
| @var{s} ends with a NUL byte, which cannot be part of any other multibyte |
| character sequence but the one representing the NUL wide character. |
| Therefore, the @code{mbrlen} function will never read invalid memory. |
| |
| Now that this function is available (just to make this clear, this |
| function is @emph{not} part of @theglibc{}) we can compute the |
| number of bytes required to store the wide character string converted |
| from the multibyte character string @var{s} using |
| |
| @smallexample |
| wcs_bytes = (mbslen (s) + 1) * sizeof (wchar_t); |
| @end smallexample |
| |
| Please note that the @code{mbslen} function is quite inefficient. The |
| implementation of @code{mbstouwcs} with @code{mbslen} would have to |
| perform the conversion of the multibyte character input string twice, and |
| this conversion might be quite expensive. So it is necessary to think |
| about the consequences of using the easier but imprecise method before |
| doing the work twice. |
| |
| @comment wchar.h |
| @comment ISO |
| @deftypefun size_t wcrtomb (char *restrict @var{s}, wchar_t @var{wc}, mbstate_t *restrict @var{ps}) |
| @safety{@prelim{}@mtunsafe{@mtasurace{:wcrtomb/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| @c wcrtomb uses a static, non-thread-local unguarded state variable when |
| @c PS is NULL. When a state is passed in, and it's not used |
| @c concurrently in other threads, this function behaves safely as long |
| @c as gconv modules don't bring MT safety issues of their own. |
| @c Attempting to load gconv modules or to build conversion chains in |
| @c signal handlers may encounter gconv databases or caches in a |
| @c partially-updated state, and asynchronous cancellation may leave them |
| @c in such states, besides leaking the lock that guards them. |
| @c get_gconv_fcts ok |
| @c wcsmbs_load_conv ok |
| @c norm_add_slashes ok |
| @c wcsmbs_getfct ok |
| @c gconv_find_transform ok |
| @c gconv_read_conf (libc_once) |
| @c gconv_lookup_cache ok |
| @c find_module_idx ok |
| @c find_module ok |
| @c gconv_find_shlib (ok) |
| @c ->init_fct (assumed ok) |
| @c gconv_get_builtin_trans ok |
| @c gconv_release_step ok |
| @c do_lookup_alias ok |
| @c find_derivation ok |
| @c derivation_lookup ok |
| @c increment_counter ok |
| @c gconv_find_shlib ok |
| @c step->init_fct (assumed ok) |
| @c gen_steps ok |
| @c gconv_find_shlib ok |
| @c dlopen (presumed ok) |
| @c dlsym (presumed ok) |
| @c step->init_fct (assumed ok) |
| @c step->end_fct (assumed ok) |
| @c gconv_get_builtin_trans ok |
| @c gconv_release_step ok |
| @c add_derivation ok |
| @c gconv_close_transform ok |
| @c gconv_release_step ok |
| @c step->end_fct (assumed ok) |
| @c gconv_release_shlib ok |
| @c dlclose (presumed ok) |
| @c gconv_release_cache ok |
| @c ->tomb->__fct (assumed ok) |
| The @code{wcrtomb} function (``wide character restartable to |
| multibyte'') converts a single wide character into a multibyte string |
| corresponding to that wide character. |
| |
| If @var{s} is a null pointer, the function resets the state stored in |
| the object pointed to by @var{ps} (or the internal @code{mbstate_t} |
| object) to the initial state. This can also be achieved by a call like |
| this: |
| |
| @smallexample |
| wcrtomb (temp_buf, L'\0', ps) |
| @end smallexample |
| |
| @noindent |
| since, if @var{s} is a null pointer, @code{wcrtomb} performs as if it |
| writes into an internal buffer, which is guaranteed to be large enough. |
| |
| If @var{wc} is the NUL wide character, @code{wcrtomb} emits, if |
| necessary, a shift sequence to get the state @var{ps} into the initial |
| state followed by a single NUL byte, which is stored in the string |
| @var{s}. |
| |
| Otherwise a byte sequence (possibly including shift sequences) is written |
| into the string @var{s}. This only happens if @var{wc} is a valid wide |
| character (i.e., it has a multibyte representation in the character set |
| selected by the locale for the @code{LC_CTYPE} category). If @var{wc} is |
| not a valid wide character, nothing is stored in the string @var{s}, |
| @code{errno} is set to @code{EILSEQ}, the conversion state in @var{ps} |
| is undefined and the return value is @code{(size_t) -1}. |
| |
| If no error occurred the function returns the number of bytes stored in |
| the string @var{s}. This includes all bytes representing shift |
| sequences. |
| |
| One word about the interface of the function: there is no parameter |
| specifying the length of the array @var{s}. Instead the function |
| assumes that there are at least @code{MB_CUR_MAX} bytes available since |
| this is the maximum length of any byte sequence representing a single |
| character. So the caller has to make sure that there is enough space |
| available, otherwise buffer overruns can occur. |
| |
| @pindex wchar.h |
| @code{wcrtomb} was introduced in @w{Amendment 1} to @w{ISO C90} and is |
| declared in @file{wchar.h}. |
| @end deftypefun |
| |
| Using @code{wcrtomb} is as easy as using @code{mbrtowc}. The following |
| example appends a wide character string to a multibyte character string. |
| Again, the code is not really useful (or correct), it is simply here to |
| demonstrate the use and some problems. |
| |
| @smallexample |
| char * |
| mbscatwcs (char *s, size_t len, const wchar_t *ws) |
| @{ |
|   mbstate_t state; |
|   /* @r{Find the end of the existing string.} */ |
|   char *wp = strchr (s, '\0'); |
|   len -= wp - s; |
|   memset (&state, '\0', sizeof (state)); |
|   do |
|     @{ |
|       size_t nbytes; |
|       if (len < MB_CUR_MAX) |
|         @{ |
|           /* @r{We cannot guarantee that the next} |
|              @r{character fits into the buffer, so} |
|              @r{return an error.} */ |
|           errno = E2BIG; |
|           return NULL; |
|         @} |
|       nbytes = wcrtomb (wp, *ws, &state); |
|       if (nbytes == (size_t) -1) |
|         /* @r{Error in the conversion.} */ |
|         return NULL; |
|       len -= nbytes; |
|       wp += nbytes; |
|     @} |
|   while (*ws++ != L'\0'); |
|   return s; |
| @} |
| @end smallexample |
| |
| First the function has to find the end of the string currently in the |
| array @var{s}. The @code{strchr} call does this very efficiently since a |
| requirement for multibyte character representations is that the NUL byte |
| is never used except to represent itself (and in this context, the end |
| of the string). |
| |
| After initializing the state object the loop is entered where the first |
| task is to make sure there is enough room in the array @var{s}. We |
| abort if there are not at least @code{MB_CUR_MAX} bytes available. This |
| is not always optimal but we have no other choice. We might have fewer |
| than @code{MB_CUR_MAX} bytes available but the next multibyte character |
| might also be only one byte long. At the time the @code{wcrtomb} call |
| returns it is too late to decide whether the buffer was large enough. If |
| this solution is unsuitable, there is a very slow but more accurate |
| solution. |
| |
| @smallexample |
|   @dots{} |
|   if (len < MB_CUR_MAX) |
|     @{ |
|       mbstate_t temp_state; |
|       char temp_buf[MB_LEN_MAX]; |
|       memcpy (&temp_state, &state, sizeof (state)); |
|       if (wcrtomb (temp_buf, *ws, &temp_state) > len) |
|         @{ |
|           /* @r{We cannot guarantee that the next} |
|              @r{character fits into the buffer, so} |
|              @r{return an error.} */ |
|           errno = E2BIG; |
|           return NULL; |
|         @} |
|     @} |
|   @dots{} |
| @end smallexample |
| |
| Here we first perform the conversion that might overflow the buffer |
| into a temporary buffer, so that we are afterwards in the position to |
| make an exact decision about the buffer size. Please note that the |
| destination of the new @code{wcrtomb} call is @code{temp_buf} and not a |
| null pointer: with a null pointer @code{wcrtomb} behaves as if the NUL |
| wide character were converted, so its return value would say nothing |
| about @code{*@var{ws}}. The most unusual thing about this piece of code |
| certainly is the duplication of the conversion state object, but if a |
| change of the state is necessary to emit the next multibyte character, |
| we want to have the same shift state change performed in the real |
| conversion. Therefore, we have to preserve the initial shift state |
| information. |
| |
| There are certainly many more and even better solutions to this problem. |
| This example is only provided for educational purposes. |
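| |
| As one such alternative, the check and the conversion can be combined: |
| convert into a scratch buffer with a copy of the state, and commit both |
| the bytes and the state only if the result fits. The following is a |
| sketch under that assumption; the name @code{append_wc} is hypothetical |
| and not part of @theglibc{}. |
| |

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: append the multibyte form of WC at DST, where
   AVAIL bytes are free.  Returns the number of bytes written, or
   (size_t) -1 if WC cannot be converted or does not fit.  *PS is
   only updated when the bytes were actually written.  */
size_t
append_wc (char *dst, size_t avail, wchar_t wc, mbstate_t *ps)
{
  assert (dst != NULL && ps != NULL);
  char tmp[MB_LEN_MAX];          /* At least MB_CUR_MAX bytes.  */
  mbstate_t trial = *ps;         /* Work on a copy of the state.  */
  size_t nbytes = wcrtomb (tmp, wc, &trial);
  if (nbytes == (size_t) -1 || nbytes > avail)
    return (size_t) -1;
  memcpy (dst, tmp, nbytes);
  *ps = trial;                   /* Commit the new shift state.  */
  return nbytes;
}
```

| Each character is converted only once here, at the price of one extra |
| copy of at most @code{MB_LEN_MAX} bytes. |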
| |
| @node Converting Strings |
| @subsection Converting Multibyte and Wide Character Strings |
| |
| The functions described in the previous section only convert a single |
| character at a time. Most operations to be performed in real-world |
| programs include strings and therefore the @w{ISO C} standard also |
| defines conversions on entire strings. However, the defined set of |
| functions is quite limited; therefore, @theglibc{} contains a few |
| extensions that can help in some important situations. |
| |
| @comment wchar.h |
| @comment ISO |
| @deftypefun size_t mbsrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps}) |
| @safety{@prelim{}@mtunsafe{@mtasurace{:mbsrtowcs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| The @code{mbsrtowcs} function (``multibyte string restartable to wide |
| character string'') converts a NUL-terminated multibyte character |
| string at @code{*@var{src}} into an equivalent wide character string, |
| including the NUL wide character at the end. The conversion is started |
| using the state information from the object pointed to by @var{ps} or |
| from an internal object of @code{mbsrtowcs} if @var{ps} is a null |
| pointer. Before returning, the state object is updated to match the state |
| after the last converted character. The state is the initial state if the |
| terminating NUL byte is reached and converted. |
| |
| If @var{dst} is not a null pointer, the result is stored in the array |
| pointed to by @var{dst}; otherwise, the conversion result is not |
| available since it is stored in an internal buffer. |
| |
| If @var{len} wide characters are stored in the array @var{dst} before |
| reaching the end of the input string, the conversion stops and @var{len} |
| is returned. If @var{dst} is a null pointer, @var{len} is never checked. |
| |
| Another reason for a premature return from the function call is if the |
| input string contains an invalid multibyte sequence. In this case the |
| global variable @code{errno} is set to @code{EILSEQ} and the function |
| returns @code{(size_t) -1}. |
| |
| @c XXX The ISO C9x draft seems to have a problem here. It says that PS |
| @c is not updated if DST is NULL. This is not said straightforward and |
| @c none of the other functions is described like this. It would make sense |
| @c to define the function this way but I don't think it is meant like this. |
| |
| In all other cases the function returns the number of wide characters |
| converted during this call. If @var{dst} is not null, @code{mbsrtowcs} |
| stores in the pointer pointed to by @var{src} either a null pointer (if |
| the NUL byte in the input string was reached) or the address of the byte |
| following the last converted multibyte character. |
| |
| @pindex wchar.h |
| @code{mbsrtowcs} was introduced in @w{Amendment 1} to @w{ISO C90} and is |
| declared in @file{wchar.h}. |
| @end deftypefun |
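| |
| Because @var{len} is not checked and nothing is stored when @var{dst} is |
| a null pointer, a first call can be used just to obtain the number of |
| wide characters, and a second call to do the conversion. The following |
| sketch illustrates this; the name @code{mbs_dup_as_wcs} is hypothetical, |
| and the input is scanned twice, which may be expensive. |
| |

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: return a newly allocated wide character copy
   of the multibyte string S, or NULL on error.  */
wchar_t *
mbs_dup_as_wcs (const char *s)
{
  assert (s != NULL);
  mbstate_t state;
  const char *src = s;

  /* First pass: DST is null, so nothing is stored; the return value
     is the number of wide characters, not counting the NUL.  */
  memset (&state, '\0', sizeof (state));
  size_t n = mbsrtowcs (NULL, &src, 0, &state);
  if (n == (size_t) -1)
    return NULL;                 /* Invalid sequence; errno is EILSEQ.  */

  wchar_t *result = malloc ((n + 1) * sizeof (wchar_t));
  if (result == NULL)
    return NULL;

  /* Second pass: start again from the initial state and leave room
     for the terminating NUL wide character.  */
  memset (&state, '\0', sizeof (state));
  src = s;
  if (mbsrtowcs (result, &src, n + 1, &state) != n || src != NULL)
    {
      free (result);
      return NULL;
    }
  return result;
}
```

| After the second call @code{src} is a null pointer, which shows that the |
| terminating NUL byte was reached and converted. |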
| |
| The definition of the @code{mbsrtowcs} function has one important |
| limitation. The requirement that the input at @code{*@var{src}} be a |
| NUL-terminated string causes problems if one wants to convert buffers |
| with text. A buffer is normally not a collection of NUL-terminated |
| strings but instead a continuous collection of lines, separated by |
| newline characters. Now |
| assume that a function to convert one line from a buffer is needed. Since |
| the line is not NUL-terminated, the source pointer cannot directly point |
| into the unmodified text buffer. This means, either one inserts the NUL |
| byte at the appropriate place for the time of the @code{mbsrtowcs} |
| function call (which is not doable for a read-only buffer or in a |
| multi-threaded application) or one copies the line in an extra buffer |
| where it can be terminated by a NUL byte. Note that it is not in general |
| possible to limit the number of characters to convert by setting the |
| parameter @var{len} to any specific value. Since it is not known how |
| many bytes each multibyte character sequence is in length, one can only |
| guess. |
| |
| @cindex stateful |
| There is still a problem with the method of NUL-terminating a line right |
| after the newline character, which could lead to very strange results. |
| As said in the description of the @code{mbsrtowcs} function above the |
| conversion state is guaranteed to be in the initial shift state after |
| processing the NUL byte at the end of the input string. But this NUL |
| byte is not really part of the text (i.e., the conversion state after |
| the newline in the original text could be something different than the |
| initial shift state and therefore the first character of the next line |
| is encoded using this state). But the state in question is never |
| accessible to the user since the conversion stops after the NUL byte |
| (which resets the state). Most stateful character sets in use today |
| require that the shift state after a newline be the initial state--but |
| this is not a strict guarantee. Simply NUL-terminating a piece of |
| running text is therefore not always an adequate solution and should |
| not be used in general-purpose code. |
| |
| The generic conversion interface (@pxref{Generic Charset Conversion}) |
| does not have this limitation (it simply works on buffers, not |
| strings), and @theglibc{} contains a set of functions that take |
| additional parameters specifying the maximal number of bytes that are |
| consumed from the input string. This way the problem of |
| @code{mbsrtowcs}'s example above could be solved by determining the line |
| length and passing this length to the function. |
| |
| @comment wchar.h |
| @comment ISO |
| @deftypefun size_t wcsrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps}) |
| @safety{@prelim{}@mtunsafe{@mtasurace{:wcsrtombs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| The @code{wcsrtombs} function (``wide character string restartable to |
| multibyte string'') converts the NUL-terminated wide character string at |
| @code{*@var{src}} into an equivalent multibyte character string and |
| stores the result in the array pointed to by @var{dst}. The NUL wide |
| character is also converted. The conversion starts in the state |
| described in the object pointed to by @var{ps} or by a state object |
| local to @code{wcsrtombs} in case @var{ps} is a null pointer. If |
| @var{dst} is a null pointer, the conversion is performed as usual but the |
| result is not available. If all characters of the input string were |
| successfully converted and if @var{dst} is not a null pointer, the |
| pointer pointed to by @var{src} gets assigned a null pointer. |
| |
| If one of the wide characters in the input string has no valid multibyte |
| character equivalent, the conversion stops early, sets the global |
| variable @code{errno} to @code{EILSEQ}, and returns @code{(size_t) -1}. |
| |
| Another reason for a premature stop is if @var{dst} is not a null |
| pointer and the next converted character would require more than |
| @var{len} bytes in total in the array @var{dst}. In this case the |
| pointer pointed to by @var{src} is assigned a value pointing to the wide |
| character right after the last one successfully converted. |
| |
| Except in the case of an encoding error the return value of the |
| @code{wcsrtombs} function is the number of bytes in all the multibyte |
| character sequences stored in @var{dst}. Before returning the state in |
| the object pointed to by @var{ps} (or the internal object in case |
| @var{ps} is a null pointer) is updated to reflect the state after the |
| last conversion. The state is the initial shift state in case the |
| terminating NUL wide character was converted. |
| |
| @pindex wchar.h |
| The @code{wcsrtombs} function was introduced in @w{Amendment 1} to |
| @w{ISO C90} and is declared in @file{wchar.h}. |
| @end deftypefun |
| |
| The restriction mentioned above for the @code{mbsrtowcs} function applies |
| here also. There is no possibility of directly controlling the number of |
| input characters. One has to place the NUL wide character at the correct |
| place or control the consumed input indirectly via the available output |
| array size (the @var{len} parameter). |
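| |
| Controlling the consumed input through the output size can be sketched |
| as follows. The function name and the @code{chunk} parameter are made |
| up for this illustration; each call to @code{wcsrtombs} is offered at |
| most @code{chunk} bytes, and the updated source pointer tells where the |
| next call has to continue. |
| |

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert WS into OUT (OUTSIZE bytes), offering
   wcsrtombs at most CHUNK bytes of output per call.  Returns the
   number of bytes stored, not counting the terminating NUL byte, or
   (size_t) -1 on error or if OUT is too small.  */
size_t
wcs_to_mbs_in_chunks (char *out, size_t outsize,
                      const wchar_t *ws, size_t chunk)
{
  assert (out != NULL && ws != NULL && chunk >= MB_LEN_MAX);
  mbstate_t state;
  memset (&state, '\0', sizeof (state));
  const wchar_t *src = ws;
  size_t total = 0;

  /* wcsrtombs sets SRC to a null pointer once the terminating NUL
     wide character has been converted.  */
  while (src != NULL)
    {
      size_t room = outsize - total;
      if (room > chunk)
        room = chunk;
      size_t n = wcsrtombs (out + total, &src, room, &state);
      if (n == (size_t) -1)
        return (size_t) -1;      /* errno is EILSEQ.  */
      if (n == 0 && src != NULL)
        return (size_t) -1;      /* No room left in OUT.  */
      total += n;
    }
  return total;
}
```

| The same state object is used for all calls, so a shift state reached at |
| the end of one chunk carries over to the next. |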
| |
| @comment wchar.h |
| @comment GNU |
| @deftypefun size_t mbsnrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{nmc}, size_t @var{len}, mbstate_t *restrict @var{ps}) |
| @safety{@prelim{}@mtunsafe{@mtasurace{:mbsnrtowcs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| The @code{mbsnrtowcs} function is very similar to the @code{mbsrtowcs} |
| function. All the parameters are the same except for @var{nmc}, which is |
| new. The return value is the same as for @code{mbsrtowcs}. |
| |
| This new parameter specifies how many bytes at most can be used from the |
| multibyte character string. In other words, the multibyte character |
| string @code{*@var{src}} need not be NUL-terminated. But if a NUL byte |
| is found within the @var{nmc} first bytes of the string, the conversion |
| stops here. |
| |
| This function is a GNU extension. It is meant to work around the |
| problems mentioned above. Now it is possible to convert a buffer with |
| multibyte character text piece for piece without having to care about |
| inserting NUL bytes and the effect of NUL bytes on the conversion state. |
| @end deftypefun |
| |
| A function to convert a multibyte string into a wide character string |
| and display it could be written like this (this is not a really useful |
| example): |
| |
| @smallexample |
| void |
| showmbs (const char *src, FILE *fp) |
| @{ |
|   mbstate_t state; |
|   int cnt = 0; |
|   memset (&state, '\0', sizeof (state)); |
|   while (1) |
|     @{ |
|       wchar_t linebuf[100]; |
|       const char *endp = strchr (src, '\n'); |
|       size_t n; |
| |
|       /* @r{Exit if there is no more line.} */ |
|       if (endp == NULL) |
|         break; |
| |
|       n = mbsnrtowcs (linebuf, &src, endp - src, 99, &state); |
|       /* @r{Exit if the line contains an invalid sequence.} */ |
|       if (n == (size_t) -1) |
|         break; |
|       linebuf[n] = L'\0'; |
|       fprintf (fp, "line %d: \"%ls\"\n", ++cnt, linebuf); |
|       /* @r{Continue after the newline; a longer line is truncated.} */ |
|       src = endp + 1; |
|     @} |
| @} |
| @end smallexample |
| |
| There is no problem with the state after a call to @code{mbsnrtowcs}. |
| Since we don't insert characters in the strings that were not in there |
| right from the beginning and we use @var{state} only for the conversion |
| of the given buffer, there is no problem with altering the state. |
| |
| @comment wchar.h |
| @comment GNU |
| @deftypefun size_t wcsnrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{nwc}, size_t @var{len}, mbstate_t *restrict @var{ps}) |
| @safety{@prelim{}@mtunsafe{@mtasurace{:wcsnrtombs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| The @code{wcsnrtombs} function implements the conversion from wide |
| character strings to multibyte character strings. It is similar to |
| @code{wcsrtombs} but, just like @code{mbsnrtowcs}, it takes an extra |
| parameter, which specifies the length of the input string. |
| |
| No more than @var{nwc} wide characters from the input string |
| @code{*@var{src}} are converted. If the input string contains a NUL |
| wide character in the first @var{nwc} characters, the conversion stops at |
| this place. |
| |
| The @code{wcsnrtombs} function is a GNU extension and just like |
| @code{mbsnrtowcs} helps in situations where no NUL-terminated input |
| strings are available. |
| @end deftypefun |
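| |
| A use of @code{wcsnrtombs} on input that is not NUL-terminated might |
| look like the following sketch. The name @code{wcs_slice_to_mbs} is |
| hypothetical. Note that with a stateful encoding the result may end in |
| a shift state other than the initial one, because no NUL wide character |
| is converted; the sketch does not try to handle that. |
| |

```c
/* Ask glibc to declare the GNU extension wcsnrtombs.  */
#define _GNU_SOURCE 1
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the NWC wide characters at WS, which
   need not be NUL-terminated, into DST (DSTSIZE bytes) and terminate
   the result.  Returns the length in bytes, or (size_t) -1.  */
size_t
wcs_slice_to_mbs (char *dst, size_t dstsize,
                  const wchar_t *ws, size_t nwc)
{
  assert (dst != NULL && ws != NULL && dstsize > 0);
  mbstate_t state;
  memset (&state, '\0', sizeof (state));
  const wchar_t *src = ws;

  /* Leave one byte for the terminating NUL byte added below.  */
  size_t n = wcsnrtombs (dst, &src, nwc, dstsize - 1, &state);
  if (n == (size_t) -1)
    return (size_t) -1;          /* errno is EILSEQ.  */
  /* SRC is null if a NUL wide character was found, and WS + NWC if
     all NWC wide characters were converted.  Anything else means
     that the conversion stopped because DST was full.  */
  if (src != NULL && src != ws + nwc)
    return (size_t) -1;
  dst[n] = '\0';
  return n;
}
```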
| |
| |
| @node Multibyte Conversion Example |
| @subsection A Complete Multibyte Conversion Example |
| |
| The example programs given in the last sections are only brief and do |
| not contain all the error checking, etc. Presented here is a complete |
| and documented example. It features the @code{mbrtowc} function but it |
| should be easy to derive versions using the other functions. |
| |
| @smallexample |
| int |
| file_mbsrtowcs (int input, int output) |
| @{ |
|   /* @r{Note the use of @code{MB_LEN_MAX}.} |
|      @r{@code{MB_CUR_MAX} cannot portably be used here.} */ |
|   char buffer[BUFSIZ + MB_LEN_MAX]; |
|   mbstate_t state; |
|   int filled = 0; |
|   int eof = 0; |
| |
|   /* @r{Initialize the state.} */ |
|   memset (&state, '\0', sizeof (state)); |
| |
|   while (!eof) |
|     @{ |
|       ssize_t nread; |
|       ssize_t nwrite; |
|       char *inp = buffer; |
|       /* @r{Leave room for the bytes carried over from the last read.} */ |
|       wchar_t outbuf[BUFSIZ + MB_LEN_MAX]; |
|       wchar_t *outp = outbuf; |
|       int invalid = 0; |
| |
|       /* @r{Fill up the buffer from the input file.} */ |
|       nread = read (input, buffer + filled, BUFSIZ); |
|       if (nread < 0) |
|         @{ |
|           perror ("read"); |
|           return 0; |
|         @} |
|       /* @r{If we reach end of file, make a note to read no more.} */ |
|       if (nread == 0) |
|         eof = 1; |
| |
|       /* @r{@code{filled} is now the number of bytes in @code{buffer}.} */ |
|       filled += nread; |
| |
|       /* @r{Convert those bytes to wide characters--as many as we can.} */ |
|       while (filled > 0) |
|         @{ |
|           mbstate_t saved = state; |
|           size_t thislen = mbrtowc (outp, inp, filled, &state); |
|           /* @r{Stop converting at an invalid character.} */ |
|           if (thislen == (size_t) -1) |
|             @{ |
|               invalid = 1; |
|               break; |
|             @} |
|           /* @r{We have read just the first part of a valid character.} |
|              @r{Restore the state so that these bytes can be converted} |
|              @r{again once more input is available.} */ |
|           if (thislen == (size_t) -2) |
|             @{ |
|               state = saved; |
|               break; |
|             @} |
|           /* @r{We want to handle embedded NUL bytes} |
|              @r{but the return value is 0. Correct this.} */ |
|           if (thislen == 0) |
|             thislen = 1; |
|           /* @r{Advance past this character.} */ |
|           inp += thislen; |
|           filled -= thislen; |
|           ++outp; |
|         @} |
| |
|       /* @r{Write the wide characters we just made.} */ |
|       nwrite = write (output, outbuf, |
|                       (outp - outbuf) * sizeof (wchar_t)); |
|       if (nwrite < 0) |
|         @{ |
|           perror ("write"); |
|           return 0; |
|         @} |
| |
|       /* @r{See if we have a @emph{real} invalid character,} |
|          @r{or a partial one that cannot be carried forward.} */ |
|       if (invalid || (eof && filled > 0) || filled >= MB_CUR_MAX) |
|         @{ |
|           error (0, 0, "invalid multibyte character"); |
|           return 0; |
|         @} |
| |
|       /* @r{If any characters must be carried forward,} |
|          @r{put them at the beginning of @code{buffer}.} */ |
|       if (filled > 0) |
|         memmove (buffer, inp, filled); |
|     @} |
| |
|   return 1; |
| @} |
| @end smallexample |
| |
| |
| @node Non-reentrant Conversion |
| @section Non-reentrant Conversion Function |
| |
| The functions described in the previous section are defined in |
| @w{Amendment 1} to @w{ISO C90}, but the original @w{ISO C90} standard |
| also contained functions for character set conversion. The reason that |
| these original functions are not described first is that they are almost |
| entirely useless. |
| |
| The problem is that all the conversion functions described in the |
| original @w{ISO C90} use a local state. Using a local state implies that |
| multiple conversions at the same time (not only when using threads) |
| cannot be done, and that you cannot first convert single characters and |
| then strings since you cannot tell the conversion functions which state |
| to use. |
| |
| These original functions are therefore usable only in a very limited set |
| of situations. One must complete converting the entire string before |
| starting a new one, and each string/text must be converted with the same |
| function (there is no problem with the library itself; it is guaranteed |
| that no library function changes the state of any of these functions). |
| @strong{For the above reasons it is strongly recommended that the |
| functions described in the previous section be used in place of |
| non-reentrant conversion functions.} |
| |
| @menu |
| * Non-reentrant Character Conversion:: Non-reentrant Conversion of Single |
| Characters. |
| * Non-reentrant String Conversion:: Non-reentrant Conversion of Strings. |
| * Shift State:: States in Non-reentrant Functions. |
| @end menu |
| |
| @node Non-reentrant Character Conversion |
| @subsection Non-reentrant Conversion of Single Characters |
| |
| @comment stdlib.h |
| @comment ISO |
| @deftypefun int mbtowc (wchar_t *restrict @var{result}, const char *restrict @var{string}, size_t @var{size}) |
| @safety{@prelim{}@mtunsafe{@mtasurace{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| The @code{mbtowc} (``multibyte to wide character'') function when called |
| with non-null @var{string} converts the first multibyte character |
| beginning at @var{string} to its corresponding wide character code. It |
| stores the result in @code{*@var{result}}. |
| |
| @code{mbtowc} never examines more than @var{size} bytes. (The idea is |
| to supply for @var{size} the number of bytes of data you have in hand.) |
| |
| @code{mbtowc} with non-null @var{string} distinguishes three |
| possibilities: the first @var{size} bytes at @var{string} start with a |
| valid multibyte character, they start with an invalid byte sequence or |
| just part of a character, or @var{string} points to an empty string (a |
| null character). |
| |
| For a valid multibyte character, @code{mbtowc} converts it to a wide |
| character and stores that in @code{*@var{result}}, and returns the |
| number of bytes in that character (always at least @math{1} and never |
| more than @var{size}). |
| |
| For an invalid byte sequence, @code{mbtowc} returns @math{-1}. For an |
| empty string, it returns @math{0}, also storing @code{'\0'} in |
| @code{*@var{result}}. |
| |
| If the multibyte character code uses shift characters, then |
| @code{mbtowc} maintains and updates a shift state as it scans. If you |
| call @code{mbtowc} with a null pointer for @var{string}, that |
| initializes the shift state to its standard initial value. It also |
| returns nonzero if the multibyte character code in use actually has a |
| shift state. @xref{Shift State}. |
| @end deftypefun |
| |
| @comment stdlib.h |
| @comment ISO |
| @deftypefun int wctomb (char *@var{string}, wchar_t @var{wchar}) |
| @safety{@prelim{}@mtunsafe{@mtasurace{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| The @code{wctomb} (``wide character to multibyte'') function converts |
| the wide character code @var{wchar} to its corresponding multibyte |
| character sequence, and stores the result in bytes starting at |
| @var{string}. At most @code{MB_CUR_MAX} bytes are stored. |
| |
| @code{wctomb} with non-null @var{string} distinguishes three |
| possibilities for @var{wchar}: a valid wide character code (one that can |
| be translated to a multibyte character), an invalid code, and |
| @code{L'\0'}. |
| |
| Given a valid code, @code{wctomb} converts it to a multibyte character, |
| storing the bytes starting at @var{string}. Then it returns the number |
| of bytes in that character (always at least @math{1} and never more |
| than @code{MB_CUR_MAX}). |
| |
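| If @var{wchar} is an invalid wide character code, @code{wctomb} returns |
| @math{-1}. If @var{wchar} is @code{L'\0'}, it stores @code{'\0'} in |
| @var{string}, preceded by any shift sequence needed to restore the |
| initial shift state, and returns the number of bytes stored (@math{1} |
| when no shift sequence is needed). |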
| |
| If the multibyte character code uses shift characters, then |
| @code{wctomb} maintains and updates a shift state as it scans. If you |
| call @code{wctomb} with a null pointer for @var{string}, that |
| initializes the shift state to its standard initial value. It also |
| returns nonzero if the multibyte character code in use actually has a |
| shift state. @xref{Shift State}. |
| |
| Calling this function with a @var{wchar} argument of zero when |
| @var{string} is not null has the side-effect of reinitializing the |
| stored shift state @emph{as well as} storing the multibyte character |
| @code{'\0'} and returning the number of bytes stored. |
| @end deftypefun |
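| |
| The following sketch shows one way to use @code{wctomb} safely with a |
| bounded output buffer: the character is first converted into a |
| temporary array of @code{MB_LEN_MAX} bytes (from @file{limits.h}) and |
| copied only if it fits. The function name @code{append_wchar} and its |
| parameters are hypothetical. |
| |

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: append the multibyte form of WC to the
   null-terminated string in DST, which has room for CAP bytes in
   total.  Returns the number of bytes appended, or -1 if WC is not
   representable or does not fit.  */
int
append_wchar (char *dst, size_t cap, wchar_t wc)
{
  char tmp[MB_LEN_MAX];         /* Never smaller than MB_CUR_MAX.  */
  size_t used = strlen (dst);
  int n = wctomb (tmp, wc);

  if (n < 0 || used + (size_t) n + 1 > cap)
    return -1;

  memcpy (dst + used, tmp, (size_t) n);
  dst[used + (size_t) n] = '\0';
  return n;
}
```
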
| |
| Similar to @code{mbrlen} there is also a non-reentrant function that |
| computes the length of a multibyte character. It can be defined in |
| terms of @code{mbtowc}. |
| |
| @comment stdlib.h |
| @comment ISO |
| @deftypefun int mblen (const char *@var{string}, size_t @var{size}) |
| @safety{@prelim{}@mtunsafe{@mtasurace{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| The @code{mblen} function with a non-null @var{string} argument returns |
| the number of bytes that make up the multibyte character beginning at |
| @var{string}, never examining more than @var{size} bytes. (The idea is |
| to supply for @var{size} the number of bytes of data you have in hand.) |
| |
| The return value of @code{mblen} distinguishes three possibilities: the |
| first @var{size} bytes at @var{string} start with valid multibyte |
| characters, they start with an invalid byte sequence or just part of a |
| character, or @var{string} points to an empty string (a null character). |
| |
| For a valid multibyte character, @code{mblen} returns the number of |
| bytes in that character (always at least @code{1} and never more than |
| @var{size}). For an invalid byte sequence, @code{mblen} returns |
| @math{-1}. For an empty string, it returns @math{0}. |
| |
| If the multibyte character code uses shift characters, then @code{mblen} |
| maintains and updates a shift state as it scans. If you call |
| @code{mblen} with a null pointer for @var{string}, that initializes the |
| shift state to its standard initial value. It also returns a nonzero |
| value if the multibyte character code in use actually has a shift state. |
| @xref{Shift State}. |
| |
| @pindex stdlib.h |
| The function @code{mblen} is declared in @file{stdlib.h}. |
| @end deftypefun |
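| |
| The remark that @code{mblen} can be defined in terms of @code{mbtowc} |
| can be sketched as follows. The name @code{my_mblen} is hypothetical. |
| Note one difference: this sketch updates the shift state of |
| @code{mbtowc}, whereas the real @code{mblen} is specified not to |
| affect it. |
| |

```c
#include <stdlib.h>

/* Hypothetical stand-in for mblen: passing a null result pointer
   makes mbtowc report the character length without storing the
   converted wide character.  */
int
my_mblen (const char *string, size_t size)
{
  return mbtowc (NULL, string, size);
}
```
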
| |
| |
| @node Non-reentrant String Conversion |
| @subsection Non-reentrant Conversion of Strings |
| |
| For convenience the @w{ISO C90} standard also defines functions to |
| convert entire strings instead of single characters. These functions |
| suffer from the same problems as their reentrant counterparts from |
| @w{Amendment 1} to @w{ISO C90}; see @ref{Converting Strings}. |
| |
| @comment stdlib.h |
| @comment ISO |
| @deftypefun size_t mbstowcs (wchar_t *@var{wstring}, const char *@var{string}, size_t @var{size}) |
| @safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| @c Odd... Although this was supposed to be non-reentrant, the internal |
| @c state is not a static buffer, but an automatic variable. |
| The @code{mbstowcs} (``multibyte string to wide character string'') |
| function converts the null-terminated string of multibyte characters |
| @var{string} to an array of wide character codes, storing not more than |
| @var{size} wide characters into the array beginning at @var{wstring}. |
| The terminating null character counts towards the size, so if @var{size} |
| is less than the actual number of wide characters resulting from |
| @var{string}, no terminating null character is stored. |
| |
| The conversion of characters from @var{string} begins in the initial |
| shift state. |
| |
| If an invalid multibyte character sequence is found, the @code{mbstowcs} |
| function returns a value of @math{-1}. Otherwise, it returns the number |
| of wide characters stored in the array @var{wstring}. This number does |
| not include the terminating null character, which is present if the |
| number is less than @var{size}. |
| |
| Here is an example showing how to convert a string of multibyte |
| characters, allocating enough space for the result. |
| |
| @smallexample |
| wchar_t * |
| mbstowcs_alloc (const char *string) |
| @{ |
|   size_t size = strlen (string) + 1; |
|   wchar_t *buf = xmalloc (size * sizeof (wchar_t)); |
| |
|   size = mbstowcs (buf, string, size); |
|   if (size == (size_t) -1) |
|     @{ |
|       /* @r{Do not leak the buffer when the conversion fails.} */ |
|       free (buf); |
|       return NULL; |
|     @} |
|   buf = xrealloc (buf, (size + 1) * sizeof (wchar_t)); |
|   return buf; |
| @} |
| @end smallexample |
| |
| @end deftypefun |
| |
| @comment stdlib.h |
| @comment ISO |
| @deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size}) |
| @safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| The @code{wcstombs} (``wide character string to multibyte string'') |
| function converts the null-terminated wide character array @var{wstring} |
| into a string containing multibyte characters, storing not more than |
| @var{size} bytes starting at @var{string}, followed by a terminating |
| null character if there is room. The conversion of characters begins in |
| the initial shift state. |
| |
| The terminating null character counts towards the size, so if @var{size} |
| is less than or equal to the number of bytes needed for @var{wstring}, no |
| terminating null character is stored. |
| |
| If a code that does not correspond to a valid multibyte character is |
| found, the @code{wcstombs} function returns a value of @math{-1}. |
| Otherwise, the return value is the number of bytes stored in the array |
| @var{string}. This number does not include the terminating null character, |
| which is present if the number is less than @var{size}. |
| @end deftypefun |
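| |
| A counterpart to the @code{mbstowcs} example might look like the |
| sketch below. It assumes that each wide character needs at most |
| @code{MB_CUR_MAX} bytes and that one extra byte suffices for the |
| terminator, which holds for encodings without shift states; the name |
| @code{wcstombs_alloc} is hypothetical. |
| |

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert WSTRING to a newly allocated
   multibyte string.  Returns NULL on allocation failure or if
   WSTRING contains a character that cannot be converted.  */
char *
wcstombs_alloc (const wchar_t *wstring)
{
  /* Upper bound: MB_CUR_MAX bytes per character plus the
     terminating null byte (assumes no shift sequences).  */
  size_t size = wcslen (wstring) * MB_CUR_MAX + 1;
  char *buf = malloc (size);
  size_t n;

  if (buf == NULL)
    return NULL;

  n = wcstombs (buf, wstring, size);
  if (n == (size_t) -1)
    {
      free (buf);
      return NULL;
    }
  /* n < size here, so the terminating null byte was stored.  */
  return buf;
}
```

| |
| The caller owns the returned buffer and releases it with @code{free}. |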
| |
| @node Shift State |
| @subsection States in Non-reentrant Functions |
| |
| In some multibyte character codes, the @emph{meaning} of any particular |
| byte sequence is not fixed; it depends on what other sequences have come |
| earlier in the same string. Typically there are just a few sequences that |
| can change the meaning of other sequences; these few are called |
| @dfn{shift sequences} and we say that they set the @dfn{shift state} for |
| other sequences that follow. |
| |
| To illustrate shift state and shift sequences, suppose we decide that |
| the sequence @code{0200} (just one byte) enters Japanese mode, in which |
| pairs of bytes in the range from @code{0240} to @code{0377} are single |
| characters, while @code{0201} enters Latin-1 mode, in which single bytes |
| in the range from @code{0240} to @code{0377} are characters, and |
| interpreted according to the ISO Latin-1 character set. This is a |
| multibyte code that has two alternative shift states (``Japanese mode'' |
| and ``Latin-1 mode''), and two shift sequences that specify particular |
| shift states. |
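| |
| To make the idea concrete, here is a small sketch of a decoder for the |
| imaginary encoding just described. It is not library code. The name |
| @code{toy_count} is hypothetical, and two details the description |
| leaves open are assumed: Latin-1 mode is the initial state, and bytes |
| below @code{0240} are single characters in both modes. |
| |

```c
#include <stddef.h>

enum toy_mode { LATIN1_MODE, JAPANESE_MODE };

/* Hypothetical decoder: count the characters in the LEN bytes at S.
   Returns -1 if a two-byte character is truncated or malformed.  */
long
toy_count (const unsigned char *s, size_t len)
{
  enum toy_mode mode = LATIN1_MODE;     /* Assumed initial state.  */
  long count = 0;
  size_t i = 0;

  while (i < len)
    {
      unsigned char c = s[i];

      if (c == 0200)                    /* Shift to Japanese mode.  */
        {
          mode = JAPANESE_MODE;
          ++i;
        }
      else if (c == 0201)               /* Shift to Latin-1 mode.  */
        {
          mode = LATIN1_MODE;
          ++i;
        }
      else if (mode == JAPANESE_MODE && c >= 0240)
        {
          /* The same bytes now form one two-byte character.  */
          if (i + 1 >= len || s[i + 1] < 0240)
            return -1;
          i += 2;
          ++count;
        }
      else
        {
          ++i;
          ++count;
        }
    }
  return count;
}
```

| |
| The same two bytes @code{0241 0242} count as two characters in Latin-1 |
| mode but as one character after the shift sequence @code{0200}, which |
| is why a scanner must track the shift state. |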
| |
| When the multibyte character code in use has shift states, then |
| @code{mblen}, @code{mbtowc}, and @code{wctomb} must maintain and update |
| the current shift state as they scan the string. To make this work |
| properly, you must follow these rules: |
| |
| @itemize @bullet |
| @item |
| Before starting to scan a string, call the function with a null pointer |
| for the multibyte character address---for example, @code{mblen (NULL, |
| 0)}. This initializes the shift state to its standard initial value. |
| |
| @item |
| Scan the string one character at a time, in order. Do not ``back up'' |
| and rescan characters already scanned, and do not intersperse the |
| processing of different strings. |
| @end itemize |
| |
| Here is an example of using @code{mblen} following these rules: |
| |
| @smallexample |
| void |
| scan_string (const char *s) |
| @{ |
|   size_t length = strlen (s); |
| |
|   /* @r{Initialize shift state.} */ |
|   mblen (NULL, 0); |
| |
|   while (1) |
|     @{ |
|       int thischar = mblen (s, length); |
|       /* @r{Deal with end of string and invalid characters.} */ |
|       if (thischar == 0) |
|         break; |
|       if (thischar == -1) |
|         @{ |
|           error (0, 0, "invalid multibyte character"); |
|           break; |
|         @} |
|       /* @r{Advance past this character.} */ |
|       s += thischar; |
|       length -= thischar; |
|     @} |
| @} |
| @end smallexample |
| |
| The functions @code{mblen}, @code{mbtowc} and @code{wctomb} are not |
| reentrant when using a multibyte code that uses a shift state. However, |
| no other library functions call these functions, so you don't have to |
| worry that the shift state will be changed mysteriously. |
| |
| |
| @node Generic Charset Conversion |
| @section Generic Charset Conversion |
| |
| The conversion functions mentioned so far in this chapter all had in |
| common that they operate on character sets that are not directly |
| specified by the functions. The multibyte encoding used is specified by |
| the currently selected locale for the @code{LC_CTYPE} category. The |
| wide character set is fixed by the implementation (in the case of @theglibc{} |
| it is always UCS-4 encoded @w{ISO 10646}). |
| |
| This of course has several problems when it comes to general character |
| conversion: |
| |
| @itemize @bullet |
| @item |
| For every conversion where neither the source nor the destination |
| character set is the character set of the locale for the @code{LC_CTYPE} |
| category, one has to change the @code{LC_CTYPE} locale using |
| @code{setlocale}. |
| |
| Changing the @code{LC_CTYPE} locale introduces major problems for the rest |
| of the program since several more functions (e.g., the character |
| classification functions, @pxref{Classification of Characters}) use the |
| @code{LC_CTYPE} category. |
| |
| @item |
| Parallel conversions to and from different character sets are not |
| possible since the @code{LC_CTYPE} selection is global and shared by all |
| threads. |
| |
| @item |
| If neither the source nor the destination character set is the character |
| set used for @code{wchar_t} representation, there is at least a two-step |
| process necessary to convert a text using the functions above. One would |
| have to select the source character set as the multibyte encoding, |
| convert the text into a @code{wchar_t} text, select the destination |
| character set as the multibyte encoding, and convert the wide character |
| text to the multibyte (@math{=} destination) character set. |
| |
| Even if this is possible (which is not guaranteed), it is very tedious |
| work. It also suffers even more from the first two problems because the |
| locale must be changed repeatedly. |
| @end itemize |
| |
| The XPG2 standard defines a completely new set of functions, which has |
| none of these limitations. They are not at all coupled to the selected |
| locales, and they have no constraints on the character sets selected for |
| source and destination. Only the set of available conversions limits |
| them. The standard does not specify that any conversion at all must be |
| available. Such availability is a measure of the quality of the |
| implementation. |
| |
| The following text first describes the @code{iconv} interface and then |
| gives a complete example of its use. A comparison with other |
| implementations shows what obstacles stand in the way of portable |
| applications. Finally, the implementation is described insofar as it might |
| interest the advanced user who wants to extend conversion capabilities. |
| |
| @menu |
| * Generic Conversion Interface:: Generic Character Set Conversion Interface. |
| * iconv Examples:: A complete @code{iconv} example. |
| * Other iconv Implementations:: Some Details about other @code{iconv} |
| Implementations. |
| * glibc iconv Implementation:: The @code{iconv} Implementation in the GNU C |
| library. |
| @end menu |
| |
| @node Generic Conversion Interface |
| @subsection Generic Character Set Conversion Interface |
| |
| This set of functions follows the traditional cycle of using a resource: |
| open--use--close. The interface consists of three functions, each of |
| which implements one step. |
| |
| Before the interfaces are described it is necessary to introduce a |
| data type. Just like other open--use--close interfaces the functions |
| introduced here work using handles and the @file{iconv.h} header |
| defines a special type for the handles used. |
| |
| @comment iconv.h |
| @comment XPG2 |
| @deftp {Data Type} iconv_t |
| This data type is an abstract type defined in @file{iconv.h}. The user |
| must not assume anything about the definition of this type; it must be |
| completely opaque. |
| |
| Objects of this type can be assigned handles for the conversions using |
| the @code{iconv} functions. The objects themselves need not be freed, but |
| the conversions for which the handles stand must be. |
| @end deftp |
| |
| @noindent |
| The first step is the function to create a handle. |
| |
| @comment iconv.h |
| @comment XPG2 |
| @deftypefun iconv_t iconv_open (const char *@var{tocode}, const char *@var{fromcode}) |
| @safety{@prelim{}@mtsafe{@mtslocale{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| @c Calls malloc if tocode and/or fromcode are too big for alloca. Calls |
| @c strip and upstr on both, then gconv_open. strip and upstr call |
| @c isalnum_l and toupper_l with the C locale. gconv_open may MT-safely |
| @c tokenize toset, replace unspecified codesets with the current locale |
| @c (possibly two different accesses), and finally it calls |
| @c gconv_find_transform and initializes the gconv_t result with all the |
| @c steps in the conversion sequence, running each one's initializer, |
| @c destructing and releasing them all if anything fails. |
| |
| The @code{iconv_open} function has to be used before starting a |
| conversion. The two parameters this function takes determine the |
| source and destination character set for the conversion, and if the |
| implementation has the possibility to perform such a conversion, the |
| function returns a handle. |
| |
| If the wanted conversion is not available, the @code{iconv_open} function |
| returns @code{(iconv_t) -1}. In this case the global variable |
| @code{errno} can have the following values: |
| |
| @table @code |
| @item EMFILE |
| The process already has @code{OPEN_MAX} file descriptors open. |
| @item ENFILE |
| The system limit of open files is reached. |
| @item ENOMEM |
| Not enough memory to carry out the operation. |
| @item EINVAL |
| The conversion from @var{fromcode} to @var{tocode} is not supported. |
| @end table |
| |
| It is not possible to use the same descriptor in different threads to |
| perform independent conversions. The data structures associated |
| with the descriptor include information about the conversion state, |
| which must not be corrupted by using it in different conversions. |
| |
| An @code{iconv} descriptor is like a file descriptor in that a new |
| descriptor must be created for every use. The descriptor does not stand |
| for all of the conversions from @var{fromcode} to @var{tocode}. |
| |
| The @glibcadj{} implementation of @code{iconv_open} has one |
| significant extension compared to other implementations. To ease the extension |
| of the set of available conversions, the implementation allows storing |
| the necessary files with data and code in an arbitrary number of |
| directories. How this extension must be written will be explained below |
| (@pxref{glibc iconv Implementation}). Here it is only important to say |
| that all directories mentioned in the @code{GCONV_PATH} environment |
| variable are considered only if they contain a file @file{gconv-modules}. |
| These directories need not necessarily be created by the system |
| administrator. In fact, this extension was introduced to help users |
| write and use their own, new conversions. For security reasons, this |
| does not work in SUID binaries; in this case only the system |
| directory is considered, which normally is |
| @file{@var{prefix}/lib/gconv}. The @code{GCONV_PATH} environment |
| variable is examined exactly once at the first call of the |
| @code{iconv_open} function. Later modifications of the variable have no |
| effect. |
| |
| @pindex iconv.h |
| The @code{iconv_open} function was introduced early in the X/Open |
| Portability Guide, @w{version 2}. It is supported by all commercial |
| Unices as it is required for the Unix branding. However, the quality and |
| completeness of the implementations vary widely. The @code{iconv_open} |
| function is declared in @file{iconv.h}. |
| @end deftypefun |
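| |
| A small sketch of the error handling described above follows. The |
| wrapper name @code{open_conversion} is hypothetical; it only separates |
| the ``conversion not supported'' case (@code{EINVAL}) from resource |
| failures. Note that the destination character set is the first |
| argument of @code{iconv_open}. |
| |

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>

/* Hypothetical wrapper: open a conversion from FROMCODE to TOCODE
   and store the handle in *CD.  Returns 0 on success, -1 on error
   after printing a diagnostic.  */
int
open_conversion (const char *tocode, const char *fromcode, iconv_t *cd)
{
  *cd = iconv_open (tocode, fromcode);
  if (*cd == (iconv_t) -1)
    {
      if (errno == EINVAL)
        fprintf (stderr, "conversion from '%s' to '%s' not available\n",
                 fromcode, tocode);
      else
        perror ("iconv_open");
      return -1;
    }
  return 0;
}
```

| |
| Every handle obtained this way must eventually be released with |
| @code{iconv_close}. |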
| |
| The @code{iconv} implementation can associate large data structures with |
| the handle returned by @code{iconv_open}. Therefore, it is crucial to |
| free all the resources once all conversions are carried out and the |
| conversion is not needed anymore. |
| |
| @comment iconv.h |
| @comment XPG2 |
| @deftypefun int iconv_close (iconv_t @var{cd}) |
| @safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{}}} |
| @c Calls gconv_close to destruct and release each of the conversion |
| @c steps, release the gconv_t object, then call gconv_close_transform. |
| @c Access to the gconv_t object is not guarded, but calling iconv_close |
| @c concurrently with any other use is undefined. |
| |
| The @code{iconv_close} function frees all resources associated with the |
| handle @var{cd}, which must have been returned by a successful call to |
| the @code{iconv_open} function. |
| |
| If the function call was successful the return value is @math{0}. |
| Otherwise it is @math{-1} and @code{errno} is set appropriately. |
| Defined errors are: |
| |
| @table @code |
| @item EBADF |
| The conversion descriptor is invalid. |
| @end table |
| |
| @pindex iconv.h |
| The @code{iconv_close} function was introduced together with the rest |
| of the @code{iconv} functions in XPG2 and is declared in @file{iconv.h}. |
| @end deftypefun |
| |
| The standard defines only one actual conversion function. This has, |
| therefore, the most general interface: it allows conversion from one |
| buffer to another. Conversion from a file to a buffer, vice versa, or |
| even file to file can be implemented on top of it. |
| |
| @comment iconv.h |
| @comment XPG2 |
| @deftypefun size_t iconv (iconv_t @var{cd}, char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft}) |
| @safety{@prelim{}@mtsafe{@mtsrace{:cd}}@assafe{}@acunsafe{@acucorrupt{}}} |
| @c Without guarding access to the iconv_t object pointed to by cd, call |
| @c the conversion function to convert inbuf or flush the internal |
| @c conversion state. |
| @cindex stateful |
| The @code{iconv} function converts the text in the input buffer |
| according to the rules associated with the descriptor @var{cd} and |
| stores the result in the output buffer. It is possible to call the |
| function for the same text several times in a row since for stateful |
| character sets the necessary state information is kept in the data |
| structures associated with the descriptor. |
| |
| The input buffer is specified by @code{*@var{inbuf}} and it contains |
| @code{*@var{inbytesleft}} bytes. The extra indirection is necessary for |
| communicating the used input back to the caller (see below). It is |
| important to note that the buffer pointer is of type @code{char *} and the |
| length is measured in bytes even if the input text is encoded in wide |
| characters. |
| |
| The output buffer is specified in a similar way. @code{*@var{outbuf}} |
| points to the beginning of the buffer with at least |
| @code{*@var{outbytesleft}} bytes room for the result. The buffer |
| pointer again is of type @code{char *} and the length is measured in |
| bytes. If @var{outbuf} or @code{*@var{outbuf}} is a null pointer, the |
| conversion is performed but no output is available. |
| |
| If @var{inbuf} is a null pointer, the @code{iconv} function performs the |
| necessary action to put the state of the conversion into the initial |
| state. This is obviously a no-op for non-stateful encodings, but if the |
| encoding has a state, such a function call might put some byte sequences |
| in the output buffer, which perform the necessary state changes. The |
| next call with @var{inbuf} not being a null pointer then simply goes on |
| from the initial state. It is important that the programmer never makes |
| any assumption as to whether the conversion has to deal with states. |
| Even if the input and output character sets are not stateful, the |
| implementation might still have to keep states. This is due to the |
| implementation chosen for @theglibc{} as it is described below. |
| Therefore an @code{iconv} call to reset the state should always be |
| performed if some protocol requires this for the output text. |
| |
| The conversion stops for one of three reasons. The first is that all |
| characters from the input buffer are converted. This actually can mean |
| two things: either all bytes from the input buffer are consumed or |
| there are some bytes at the end of the buffer that possibly can form a |
| complete character but the input is incomplete. The second reason for a |
| stop is that the output buffer is full. And the third reason is that |
| the input contains invalid characters. |
| |
| In all of these cases the buffer pointers after the last successful |
| conversion, for input and output buffer, are stored in @code{*@var{inbuf}} and |
| @code{*@var{outbuf}}, and the available room in each buffer is stored in |
| @code{*@var{inbytesleft}} and @code{*@var{outbytesleft}}. |
| |
| Since the character sets selected in the @code{iconv_open} call can be |
| almost arbitrary, there can be situations where the input buffer contains |
| valid characters, which have no identical representation in the output |
| character set. The behavior in this situation is undefined. The |
| @emph{current} behavior of @theglibc{} in this situation is to |
| return with an error immediately. This certainly is not the most |
| desirable solution; therefore, future versions will provide better ones, |
| but they are not yet finished. |
| |
| If all input from the input buffer is successfully converted and stored |
| in the output buffer, the function returns the number of non-reversible |
| conversions performed. In all other cases the return value is |
| @code{(size_t) -1} and @code{errno} is set appropriately. In such cases |
| the value pointed to by @var{inbytesleft} is nonzero. |
| |
| @table @code |
| @item EILSEQ |
| The conversion stopped because of an invalid byte sequence in the input. |
| After the call, @code{*@var{inbuf}} points at the first byte of the |
| invalid byte sequence. |
| |
| @item E2BIG |
| The conversion stopped because it ran out of space in the output buffer. |
| |
| @item EINVAL |
| The conversion stopped because of an incomplete byte sequence at the end |
| of the input buffer. |
| |
| @item EBADF |
| The @var{cd} argument is invalid. |
| @end table |
| |
| @pindex iconv.h |
| The @code{iconv} function was introduced in the XPG2 standard and is |
| declared in the @file{iconv.h} header. |
| @end deftypefun |
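| |
| The open--use--close cycle for a single in-memory string can be |
| sketched as below. The name @code{convert_string} and its parameters |
| are hypothetical, and the sketch treats every error, including a full |
| output buffer (@code{E2BIG}), as failure instead of resuming. |
| |

```c
#include <iconv.h>
#include <string.h>

/* Hypothetical helper: convert the null-terminated SRC from FROMCODE
   to TOCODE, storing the null-terminated result in DST, which has
   room for DSTSIZE bytes.  Returns the number of bytes written, not
   counting the terminator, or (size_t) -1 on any error.  */
size_t
convert_string (const char *tocode, const char *fromcode,
                const char *src, char *dst, size_t dstsize)
{
  iconv_t cd;
  char *in = (char *) src;      /* iconv only reads the input.  */
  size_t inleft = strlen (src);
  char *out = dst;
  size_t outleft;
  size_t rc;

  if (dstsize == 0)
    return (size_t) -1;
  outleft = dstsize - 1;        /* Reserve room for the terminator.  */

  cd = iconv_open (tocode, fromcode);
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  rc = iconv (cd, &in, &inleft, &out, &outleft);
  if (rc != (size_t) -1)
    /* Emit any bytes needed to return to the initial state.  */
    rc = iconv (cd, NULL, NULL, &out, &outleft);

  iconv_close (cd);
  if (rc == (size_t) -1)
    return (size_t) -1;

  *out = '\0';
  return (size_t) (out - dst);
}
```

| |
| Both buffer pointers and both counters are updated by @code{iconv}, so |
| the number of bytes produced is simply the distance the output |
| pointer has moved. |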
| |
| The definition of the @code{iconv} function is quite good overall. It |
| provides quite flexible functionality. The only problems lie in the |
| boundary cases, which are incomplete byte sequences at the end of the |
| input buffer and invalid input. A third problem, which is not really |
| a design problem, is the way conversions are selected. The standard |
| does not say anything about the legitimate names or about a minimal set of |
| available conversions. We will see below how this negatively impacts other |
| implementations. |
| |
| @node iconv Examples |
| @subsection A complete @code{iconv} example |
| |
| The example below features a solution for a common problem. Given that |
| one knows the internal encoding used by the system for @code{wchar_t} |
| strings, one often is in the position to read text from a file and store |
| it in wide character buffers. One can do this using @code{mbsrtowcs}, |
| but then we run into the problems discussed above. |
| |
| @smallexample |
| int |
| file2wcs (int fd, const char *charset, wchar_t *outbuf, size_t avail) |
| @{ |
|   char inbuf[BUFSIZ]; |
|   size_t insize = 0; |
|   char *wrptr = (char *) outbuf; |
|   int result = 0; |
|   iconv_t cd; |
| |
|   cd = iconv_open ("WCHAR_T", charset); |
|   if (cd == (iconv_t) -1) |
|     @{ |
|       /* @r{Something went wrong.} */ |
|       if (errno == EINVAL) |
|         error (0, 0, "conversion from '%s' to wchar_t not available", |
|                charset); |
|       else |
|         perror ("iconv_open"); |
| |
|       /* @r{Terminate the output string.} */ |
|       *outbuf = L'\0'; |
| |
|       return -1; |
|     @} |
| |
|   while (avail > 0) |
|     @{ |
|       ssize_t nread; |
|       size_t nconv; |
|       char *inptr = inbuf; |
| |
|       /* @r{Read more input.} */ |
|       nread = read (fd, inbuf + insize, sizeof (inbuf) - insize); |
|       if (nread < 0) |
|         @{ |
|           /* @r{The read itself failed.} */ |
|           perror ("read"); |
|           result = -1; |
|           break; |
|         @} |
|       if (nread == 0) |
|         @{ |
|           /* @r{When we come here the file is completely read.} |
|              @r{This still could mean there are some unused} |
|              @r{characters in the @code{inbuf}. Put them back.} */ |
|           if (lseek (fd, -(off_t) insize, SEEK_CUR) == -1) |
|             result = -1; |
| |
|           /* @r{Now write out the byte sequence to get into the} |
|              @r{initial state if this is necessary.} */ |
|           iconv (cd, NULL, NULL, &wrptr, &avail); |
| |
|           break; |
|         @} |
|       insize += nread; |
| |
|       /* @r{Do the conversion.} */ |
|       nconv = iconv (cd, &inptr, &insize, &wrptr, &avail); |
|       if (nconv == (size_t) -1) |
|         @{ |
|           /* @r{Not everything went right. It might only be} |
|              @r{an unfinished byte sequence at the end of the} |
|              @r{buffer. Or it is a real problem.} */ |
|           if (errno == EINVAL) |
|             /* @r{This is harmless. Simply move the unused} |
|                @r{bytes to the beginning of the buffer so that} |
|                @r{they can be used in the next round.} */ |
|             memmove (inbuf, inptr, insize); |
|           else |
|             @{ |
|               /* @r{It is a real problem. Maybe we ran out of} |
|                  @r{space in the output buffer or we have invalid} |
|                  @r{input. In any case back the file pointer to} |
|                  @r{the position of the last processed byte.} */ |
|               lseek (fd, -(off_t) insize, SEEK_CUR); |
|               result = -1; |
|               break; |
|             @} |
|         @} |
|     @} |
| |
|   /* @r{Terminate the output string.} */ |
|   if (avail >= sizeof (wchar_t)) |
|     *((wchar_t *) wrptr) = L'\0'; |
| |
|   if (iconv_close (cd) != 0) |
|     perror ("iconv_close"); |
| |
|   return (wchar_t *) wrptr - outbuf; |
| @} |
| @end smallexample |
| |
| @cindex stateful |
| This example shows the most important aspects of using the @code{iconv} |
| functions. It shows how successive calls to @code{iconv} can be used to |
| convert large amounts of text. The user does not have to care about |
| stateful encodings as the functions take care of everything. |
| |
| An interesting point is the case where @code{iconv} returns an error and |
| @code{errno} is set to @code{EINVAL}. This is not really an error in the |
| transformation. It can happen whenever the input character set contains |
| byte sequences of more than one byte for some character and texts are not |
| processed in one piece. In this case there is a chance that a multibyte |
| sequence is cut. The caller can then simply read the remainder of the |
| text and feed the offending bytes together with new characters from the |
| input to @code{iconv} and continue the work. The internal state kept in |
| the descriptor is @emph{not} unspecified after such an event as is the |
| case with the conversion functions from the @w{ISO C} standard. |
| |
| The example also shows the problem of using wide character strings with |
| @code{iconv}. As explained in the description of the @code{iconv} |
| function above, the function always takes a pointer to a @code{char} |
| array and the available space is measured in bytes. In the example, the |
| output buffer is a wide character buffer; therefore, we use a local |
| variable @var{wrptr} of type @code{char *}, which is used in the |
| @code{iconv} calls. |
| |
| This looks rather innocent but can lead to problems on platforms that |
| have tight restrictions on alignment. Therefore the caller of @code{iconv} |
| has to make sure that the pointers passed are suitable for access of |
| characters from the appropriate character set. Since, in the |
| above case, the input parameter to the function is a @code{wchar_t} |
| pointer, this is the case (unless the user violates alignment when |
| computing the parameter). But in other situations, especially when |
| writing generic functions where one does not know what type of character |
| set one uses and, therefore, treats text as a sequence of bytes, it might |
| become tricky. |
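| |
| A generic routine that receives an untyped buffer could guard against |
| this with a check like the sketch below before treating the buffer as |
| @code{wchar_t} storage. The function name is hypothetical, and |
| @code{_Alignof} requires a C11 compiler. |
| |

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical check: return nonzero if P is suitably aligned to be
   accessed as an array of wchar_t.  */
int
aligned_for_wchar (const void *p)
{
  return ((uintptr_t) p % _Alignof (wchar_t)) == 0;
}
```
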
| |
| @node Other iconv Implementations |
| @subsection Some Details about other @code{iconv} Implementations |
| |
| This is not really the place to discuss the @code{iconv} implementation |
| of other systems but it is necessary to know a bit about them to write |
| portable programs. The above mentioned problems with the specification |
| of the @code{iconv} functions can lead to portability issues. |
| |
| The first thing to notice is that, due to the large number of character |
| sets in use, it is certainly not practical to encode the conversions |
| directly in the C library. Therefore, the conversion information must |
| come from files outside the C library. This is usually done in one or |
| both of the following ways: |
| |
| @itemize @bullet |
| @item |
| The C library contains a set of generic conversion functions that can |
| read the needed conversion tables and other information from data files. |
| These files get loaded when necessary. |
| |
| This solution is problematic as it requires a great deal of effort to |
| apply to all character sets (potentially an infinite set). The |
| differences in the structure of the different character sets are so large |
| that many different variants of the table-processing functions must be |
| developed. In addition, the generic nature of these functions makes them |
| slower than specifically implemented functions. |
| |
| @item |
| The C library only contains a framework that can dynamically load |
| object files and execute the conversion functions contained therein. |
| |
| This solution provides much more flexibility. The C library itself |
| contains only very little code and therefore reduces the general memory |
| footprint. Also, with a documented interface between the C library and |
| the loadable modules it is possible for third parties to extend the set |
| of available conversion modules. A drawback of this solution is that |
| dynamic loading must be available. |
| @end itemize |
| |
| Some implementations in commercial Unices implement a mixture of these |
| possibilities; the majority implement only the second solution. Using |
| loadable modules moves the code out of the library itself and keeps |
| the door open for extensions and improvements, but this design is also |
| limiting on some platforms since not many platforms support dynamic |
| loading in statically linked programs. On platforms without this |
| capability it is therefore not possible to use this interface in |
| statically linked programs. @Theglibc{} has, on ELF platforms, no |
| problems with dynamic loading in these situations; therefore, this |
| point is moot. The danger is that one gets accustomed to this |
| situation and forgets about the restrictions on other systems. |
| |
| A second thing to know about other @code{iconv} implementations is that |
| the number of available conversions is often very limited. Some |
| implementations provide, in the standard release (not special |
| international or developer releases), at most 100 to 200 conversion |
| possibilities. This does not mean 200 different character sets are |
| supported; for example, conversions from one character set to a set of 10 |
| others might count as 10 conversions. Together with the other direction |
| this makes 20 conversion possibilities used up by one character set. One |
| can imagine the thin coverage these platforms provide. Some Unix vendors |
| even provide only a handful of conversions, which renders them useless for |
| almost all uses. |
| |
| This leads directly to a third and probably the most problematic point. |
| On all known Unix systems, the way the @code{iconv} conversion functions |
| are implemented means that the availability of conversions from |
| character set @math{@cal{A}} to @math{@cal{B}} and from |
| @math{@cal{B}} to @math{@cal{C}} does @emph{not} imply that the |
| conversion from @math{@cal{A}} to @math{@cal{C}} is available. |
| |
| This might not seem unreasonable or problematic at first, but it is |
| quite a big problem, as one will notice shortly after hitting it. To show |
| the problem, assume we write a program that has to convert from |
| @math{@cal{A}} to @math{@cal{C}}. A call like |
| |
| @smallexample |
| cd = iconv_open ("@math{@cal{C}}", "@math{@cal{A}}"); |
| @end smallexample |
| |
| @noindent |
| fails according to the assumption above. But what does the program |
| do now? The conversion is necessary; therefore, simply giving up is not |
| an option. |
| |
| This is a nuisance. The @code{iconv} function should take care of this. |
| But how should the program proceed from here on? If it first tries to |
| convert to character set @math{@cal{B}}, the two @code{iconv_open} |
| calls |
| |
| @smallexample |
| cd1 = iconv_open ("@math{@cal{B}}", "@math{@cal{A}}"); |
| @end smallexample |
| |
| @noindent |
| and |
| |
| @smallexample |
| cd2 = iconv_open ("@math{@cal{C}}", "@math{@cal{B}}"); |
| @end smallexample |
| |
| @noindent |
| will succeed, but how to find @math{@cal{B}}? |
| |
| Unfortunately, the answer is: there is no general solution. On some |
| systems guessing might help. On those systems most character sets can |
| convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text. Besides |
| this, only some very system-specific methods can help. Since the |
| conversion functions come from loadable modules and these modules must |
| be stored somewhere in the filesystem, one @emph{could} try to find them |
| and determine from the available files which conversions are available |
| and whether there is an indirect route from @math{@cal{A}} to |
| @math{@cal{C}}. |
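| |
| The guessing strategy described above can be sketched in C as follows. |
| This is only a minimal sketch, not part of any library: the names |
| @code{struct route}, @code{open_route}, and @code{close_route} are |
| hypothetical, and the pivot character set (for example @code{"UTF-8"}) is |
| a guess that need not work on every system. |
| |

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper type: either a direct conversion or a two-step
   route through a pivot character set.  */
struct route
{
  iconv_t first;   /* FROM -> TO, or FROM -> PIVOT */
  iconv_t second;  /* PIVOT -> TO, or (iconv_t) -1 if unused */
};

/* Try the direct conversion first.  If it is not available, guess that
   PIVOT works as an intermediate character set.
   Returns 0 on success and -1 if no route could be opened.  */
int
open_route (struct route *r, const char *to, const char *from,
            const char *pivot)
{
  assert (r != NULL && to != NULL && from != NULL && pivot != NULL);

  r->second = (iconv_t) -1;
  r->first = iconv_open (to, from);
  if (r->first != (iconv_t) -1)
    return 0;                   /* Direct conversion exists.  */

  r->first = iconv_open (pivot, from);
  if (r->first == (iconv_t) -1)
    return -1;

  r->second = iconv_open (to, pivot);
  if (r->second == (iconv_t) -1)
    {
      iconv_close (r->first);
      r->first = (iconv_t) -1;
      return -1;
    }
  return 0;
}

/* Release whatever descriptors open_route obtained.  */
void
close_route (struct route *r)
{
  assert (r != NULL);
  if (r->first != (iconv_t) -1)
    iconv_close (r->first);
  if (r->second != (iconv_t) -1)
    iconv_close (r->second);
}
```

| |
| A caller would pass its text through @code{first} and, if @code{second} |
| is valid, pass that intermediate output through @code{second} with |
| ordinary @code{iconv} calls. The sketch cannot discover a suitable |
| pivot on its own; that is exactly the limitation discussed above. |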
| |
| This example shows one of the design errors of @code{iconv} mentioned |
| above. It should at least be possible to determine the list of available |
| conversions programmatically so that if @code{iconv_open} says there is no |
| such conversion, one could make sure this also is true for indirect |
| routes. |
| |
| @node glibc iconv Implementation |
| @subsection The @code{iconv} Implementation in @theglibc{} |
| |
| After reading about the problems of @code{iconv} implementations in the |
| last section it is certainly good to note that the implementation in |
| @theglibc{} has none of the problems mentioned above. What |
| follows is a step-by-step analysis of the points raised above. The |
| evaluation is based on the current state of the development (as of |
| January 1999). The development of the @code{iconv} functions is not |
| complete, but basic functionality has solidified. |
| |
| @Theglibc{}'s @code{iconv} implementation uses shared loadable |
| modules to implement the conversions. A very small number of |
| conversions are built into the library itself but these are only rather |
| trivial conversions. |
| |
| All the benefits of loadable modules are available in the @glibcadj{} |
| implementation. This is especially appealing since the interface is |
| well documented (see below), and it, therefore, is easy to write new |
| conversion modules. The drawback of using loadable objects is not a |
| problem in @theglibc{}, at least on ELF systems. Since the |
| library is able to load shared objects even in statically linked |
| binaries, static linking need not be forbidden in case one wants to use |
| @code{iconv}. |
| |
| The second mentioned problem is the number of supported conversions. |
| Currently, @theglibc{} supports more than 150 character sets. The |
| way the implementation is designed the number of supported conversions |
| is greater than 22350 (@math{150} times @math{149}). If any conversion |
| from or to a character set is missing, it can be added easily. |
| |
| Impressive as it may be, this high number is due to the |
| fact that the @glibcadj{} implementation of @code{iconv} does not have |
| the third problem mentioned above (i.e., whenever there is a conversion |
| from a character set @math{@cal{A}} to @math{@cal{B}} and from |
| @math{@cal{B}} to @math{@cal{C}} it is always possible to convert from |
| @math{@cal{A}} to @math{@cal{C}} directly). If @code{iconv_open} |
| returns an error and sets @code{errno} to @code{EINVAL}, there is no |
| known way, directly or indirectly, to perform the wanted conversion. |
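| |
| A program can test for this condition as in the following minimal |
| sketch. The function name @code{conversion_available} is hypothetical; |
| only @code{iconv_open}, @code{iconv_close}, and the @code{EINVAL} error |
| are taken from the description above. |
| |

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Returns 1 if a conversion from FROM to TO can be opened, 0 if
   iconv_open fails with EINVAL (no such conversion), and -1 if it
   fails for any other reason.  */
int
conversion_available (const char *to, const char *from)
{
  iconv_t cd;

  assert (to != NULL && from != NULL);

  cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return errno == EINVAL ? 0 : -1;

  iconv_close (cd);
  return 1;
}
```

| |
| With the @glibcadj{} implementation a result of @code{0} means that no |
| indirect route exists either, so there is no point in probing for an |
| intermediate character set by hand. |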
| |
| @cindex triangulation |
| Triangulation is achieved by providing for each character set a |
| conversion from and to UCS-4 encoded @w{ISO 10646}. Using @w{ISO 10646} |
| as an intermediate representation it is possible to @dfn{triangulate} |
| (i.e., convert with an intermediate representation). |
| |
| There is no inherent requirement to provide a conversion to @w{ISO |
| 10646} for a new character set, and it is also possible to provide other |
| conversions where neither source nor destination character set is @w{ISO |
| 10646}. The existing set of conversions is simply meant to cover all |
| conversions that might be of interest. |
| |
| @cindex ISO-2022-JP |
| @cindex EUC-JP |
| All currently available conversions use the triangulation method above, |
| making conversions run unnecessarily slowly. If, for example, somebody |
| often needs the conversion from ISO-2022-JP to EUC-JP, a quicker solution |
| would involve direct conversion between the two character sets, skipping |
| the intermediate conversion to @w{ISO 10646}. The two character sets of |
| interest are much more similar to each other than to @w{ISO 10646}. |
| |
| In such a situation one easily can write a new conversion and provide it |
| as a better alternative. The @glibcadj{} @code{iconv} implementation |
| would automatically use the module implementing the conversion if it is |
| specified to be more efficient. |
| |
| @subsubsection Format of @file{gconv-modules} files |
| |
| All information about the available conversions comes from a file named |
| @file{gconv-modules}, which can be found in any of the directories along |
| the @code{GCONV_PATH}. The @file{gconv-modules} files are line-oriented |
| text files, where each of the lines has one of the following formats: |
| |
| @itemize @bullet |
| @item |
| If the first non-whitespace character is a @kbd{#} the line contains only |
| comments and is ignored. |
| |
| @item |
| Lines starting with @code{alias} define an alias name for a character |
| set. Two more words are expected on the line. The first word |
| defines the alias name, and the second defines the original name of the |
| character set. The effect is that it is possible to use the alias name |
| in the @var{fromset} or @var{toset} parameters of @code{iconv_open} and |
| achieve the same result as when using the real character set name. |
| |
| This is quite important as a character set often has many different |
| names. There is normally an official name but this need not correspond to |
| the most popular name. Besides this, many character sets have special |
| names that are somehow constructed. For example, all character sets |
| specified by the ISO have an alias of the form @code{ISO-IR-@var{nnn}} |
| where @var{nnn} is the registration number. This allows programs that |
| know about the registration number to construct character set names and |
| use them in @code{iconv_open} calls. More on the available names and |
| aliases follows below. |
| |
| @item |
| Lines starting with @code{module} introduce an available conversion |
| module. These lines must contain three or four more words. |
| |
| The first word specifies the source character set, the second word the |
| destination character set of the conversion implemented in this module, and |
| the third word is the name of the loadable module. The filename is |
| constructed by appending the usual shared object suffix (normally |
| @file{.so}) and this file is then supposed to be found in the same |
| directory the @file{gconv-modules} file is in. The last word on the line, |
| which is optional, is a numeric value representing the cost of the |
| conversion. If this word is missing, a cost of @math{1} is assumed. The |
| numeric value itself does not matter that much; what counts are the |
| relative values of the sums of costs for all possible conversion paths. |
| Below is a more precise description of the use of the cost value. |
| @end itemize |
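| |
| The three line formats listed above can be illustrated with a small |
| parser. This is a sketch under stated assumptions, not the code the |
| library uses: the names @code{parse_line}, @code{struct parsed_line}, and |
| @code{enum line_kind} are hypothetical, names are limited to 63 bytes, |
| and blank lines are treated like comment lines, which the description |
| above does not state explicitly. |
| |

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum line_kind { LINE_IGNORED, LINE_ALIAS, LINE_MODULE, LINE_INVALID };

struct parsed_line
{
  enum line_kind kind;
  char a[64];   /* alias name, or source character set */
  char b[64];   /* original name, or destination character set */
  char c[64];   /* module name (module lines only) */
  int cost;     /* module lines only; 1 if not given */
};

/* Classify one line of a gconv-modules file and extract its words.  */
struct parsed_line
parse_line (const char *line)
{
  struct parsed_line p = { LINE_INVALID, "", "", "", 1 };
  char keyword[16];
  int n;

  assert (line != NULL);

  /* Skip leading whitespace to find the first significant character.  */
  line += strspn (line, " \t");
  if (*line == '#' || *line == '\0' || *line == '\n')
    {
      p.kind = LINE_IGNORED;    /* Comment (or, by assumption, blank).  */
      return p;
    }

  if (sscanf (line, "%15s", keyword) != 1)
    return p;

  if (strcmp (keyword, "alias") == 0)
    {
      /* Two more words: the alias name and the original name.  */
      if (sscanf (line, "%*s %63s %63s", p.a, p.b) == 2)
        p.kind = LINE_ALIAS;
    }
  else if (strcmp (keyword, "module") == 0)
    {
      /* Three or four more words: source, destination, module, cost.  */
      n = sscanf (line, "%*s %63s %63s %63s %d", p.a, p.b, p.c, &p.cost);
      if (n == 3 || n == 4)
        p.kind = LINE_MODULE;
      if (n == 3)
        p.cost = 1;             /* Missing cost defaults to 1.  */
    }
  return p;
}
```

| |
| For a @code{module} line with only three words after the keyword, the |
| returned @code{cost} is @math{1}, matching the default described above. |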
| |
| Let us return to the example above, where one has written a module to |
| convert directly from ISO-2022-JP to EUC-JP and back. All that has to be |
| done is to put the new module, let its name be @file{ISO2022JP-EUCJP.so}, |
| in a directory and add a file @file{gconv-modules} with the following |
| content in the same directory: |
| |
| @smallexample |
| module ISO-2022-JP// EUC-JP// ISO2022JP-EUCJP 1 |
| module EUC-JP// ISO-2022-JP// ISO2022JP-EUCJP 1 |
| @end smallexample |
| |
| To see why this is sufficient, it is necessary to understand how the |
| conversion used by @code{iconv} (and described in the descriptor) is |
| selected. The approach to this problem is quite simple. |
| |
| At the first call of the @code{iconv_open} function the program reads |
| all available @file{gconv-modules} files and builds up two tables: one |
| containing all the known aliases and another that contains the |
| information about the conversions and which shared object implements |
| them. |
| |
| @subsubsection Finding the conversion path in @code{iconv} |
| |
| The set of available conversions forms a directed graph with weighted |
| edges. The weights on the edges are the costs specified in the |
| @file{gconv-modules} files. The @code{iconv_open} function uses an |
| algorithm suitable for searching for the best path in such a graph and so |
| constructs a list of conversions that must be performed in succession |
| to get the transformation from the source to the destination character |
| set. |
| |
| Explaining why the above @file{gconv-modules} file allows the |
| @code{iconv} implementation to select the specific ISO-2022-JP to |
| EUC-JP conversion module instead of the conversion coming with the |
| library itself is straightforward. Since the latter conversion takes two |
| steps (from ISO-2022-JP to @w{ISO 10646} and then from @w{ISO 10646} to |
| EUC-JP), the cost is @math{1+1 = 2}. The above @file{gconv-modules} |
| file, however, specifies that the new conversion module can perform this |
| conversion with a cost of only @math{1}. |
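| |
| The cost comparison can be reproduced with a small path search. The text |
| does not say which algorithm @code{iconv_open} actually uses; the sketch |
| below uses Bellman-Ford relaxation purely because it is short. All |
| names (@code{struct edge}, @code{cheapest_cost}, the two edge tables) are |
| hypothetical, and the graph contains only the three character sets of |
| the example. |
| |

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MAX_NODES 16

/* One available conversion: an edge with a cost.  */
struct edge { int from, to, cost; };

enum { ISO_2022_JP, INTERNAL, EUC_JP, N_CHARSETS };

/* Conversions as shipped: everything goes through INTERNAL.  */
static const struct edge builtin_edges[] = {
  { ISO_2022_JP, INTERNAL, 1 }, { INTERNAL, ISO_2022_JP, 1 },
  { EUC_JP, INTERNAL, 1 },      { INTERNAL, EUC_JP, 1 },
};

/* The same graph plus the two lines of the new gconv-modules file.  */
static const struct edge with_direct_module[] = {
  { ISO_2022_JP, INTERNAL, 1 }, { INTERNAL, ISO_2022_JP, 1 },
  { EUC_JP, INTERNAL, 1 },      { INTERNAL, EUC_JP, 1 },
  { ISO_2022_JP, EUC_JP, 1 },   { EUC_JP, ISO_2022_JP, 1 },
};

/* Return the smallest total cost of any path from SRC to DST,
   or -1 if DST cannot be reached.  */
int
cheapest_cost (const struct edge *edges, size_t n_edges,
               int n_nodes, int src, int dst)
{
  int dist[MAX_NODES];
  int i, round;
  size_t e;

  assert (edges != NULL && n_nodes > 0 && n_nodes <= MAX_NODES);
  assert (src >= 0 && src < n_nodes && dst >= 0 && dst < n_nodes);

  for (i = 0; i < n_nodes; i++)
    dist[i] = INT_MAX;
  dist[src] = 0;

  /* Relax every edge n_nodes - 1 times.  */
  for (round = 1; round < n_nodes; round++)
    for (e = 0; e < n_edges; e++)
      if (dist[edges[e].from] != INT_MAX
          && dist[edges[e].from] + edges[e].cost < dist[edges[e].to])
        dist[edges[e].to] = dist[edges[e].from] + edges[e].cost;

  return dist[dst] == INT_MAX ? -1 : dist[dst];
}
```

| |
| Searching @code{builtin_edges} from @code{ISO_2022_JP} to @code{EUC_JP} |
| yields a cost of @math{2}, while searching @code{with_direct_module} |
| yields @math{1}, which is why the new module is preferred. |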
| |
| One mysterious aspect of the @file{gconv-modules} file above (and also |
| of the file coming with @theglibc{}) is the names of the character |
| sets specified in the @code{module} lines. Why do almost all the names |
| end in @code{//}? And this is not all: the names can actually be |
| regular expressions. At this point in time this mystery should not be |
| revealed, unless you have the relevant spell-casting materials: ashes |
| from an original @w{DOS 6.2} boot disk burnt in effigy, a crucifix |
| blessed by St.@: Emacs, assorted herbal roots from Central America, sand |
| from Cebu, etc. Sorry! @strong{The part of the implementation where |
| this is used is not yet finished. For now please simply follow the |
| existing examples. It'll become clearer once it is. --drepper} |
| |
| A last remark about the @file{gconv-modules} file concerns the names not |
| ending with @code{//}. A character set named @code{INTERNAL} is often |
| mentioned. From the discussion above and the chosen name it should have |
| become clear that this is the name for the representation used in the |
| intermediate step of the triangulation. We have said that this is UCS-4 |
| but actually that is not quite right. The UCS-4 specification also |
| includes the specification of the byte ordering used. Since a UCS-4 value |
| consists of four bytes, a stored value is affected by byte ordering. The |
| internal representation is @emph{not} the same as UCS-4 in case the byte |
| ordering of the processor (or at least the running process) is not the |
| same as the one required for UCS-4. This is done for performance reasons |
| as one does not want to perform unnecessary byte-swapping operations if |
| one is not interested in actually seeing the result in UCS-4. To avoid |
| trouble with endianness, the internal representation consistently is named |
| @code{INTERNAL} even on big-endian systems where the representations are |
| identical. |
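| |
| The difference between the two representations can be made visible with |
| a few lines of C. This is an illustrative sketch only: the function |
| names are hypothetical, and it assumes, as the paragraph above implies, |
| that UCS-4 stores the most significant byte first while @code{INTERNAL} |
| uses whatever order the processor uses for a 32-bit integer. |
| |

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store CH with the most significant byte first (assumed UCS-4 order).  */
void
store_ucs4 (unsigned char out[4], uint32_t ch)
{
  assert (out != NULL);
  out[0] = (unsigned char) ((ch >> 24) & 0xff);
  out[1] = (unsigned char) ((ch >> 16) & 0xff);
  out[2] = (unsigned char) ((ch >> 8) & 0xff);
  out[3] = (unsigned char) (ch & 0xff);
}

/* Store CH exactly as the processor keeps a uint32_t in memory.  */
void
store_native (unsigned char out[4], uint32_t ch)
{
  assert (out != NULL);
  memcpy (out, &ch, sizeof ch);
}

/* Nonzero if both representations are byte-for-byte identical on the
   machine running the code, which is the case on big-endian systems.  */
int
native_matches_ucs4 (void)
{
  unsigned char a[4], b[4];

  store_ucs4 (a, 0x20AC);
  store_native (b, 0x20AC);
  return memcmp (a, b, 4) == 0;
}
```

| |
| On a little-endian machine @code{native_matches_ucs4} returns zero, so |
| producing real UCS-4 output costs one byte swap per character that the |
| intermediate representation avoids. |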
| |
| @subsubsection @code{iconv} module data structures |
| |
| So far this section has described how modules are located and considered |
| to be used. What remains to be described is the interface of the modules |
| so that one can write new ones. This section describes the interface as |
| it is in use in January 1999. The interface will change a bit in the |
| future but, with luck, only in an upwardly compatible way. |
| |
| The definitions necessary to write new modules are publicly available |
| in the non-standard header @file{gconv.h}. The following text, |
| therefore, describes the definitions from this header file. First, |
| however, it is necessary to get an overview. |
| |
| From the perspective of the user of @code{iconv} the interface is quite |
| simple: the @code{iconv_open} function returns a handle that can be used |
| in calls to @code{iconv}, and finally the handle is freed with a call to |
| @code{iconv_close}. The problem is that the handle has to be able to |
| represent the possibly long sequences of conversion steps and also the |
| state of each conversion since the handle is all that is passed to the |
| @code{iconv} function. Therefore, the data structures are really the |
| key to understanding the implementation. |
| |
| We need two different kinds of data structures. The first describes the |
| conversion and the second describes the state etc. There are really two |
| type definitions like this in @file{gconv.h}. |
| @pindex gconv.h |
| |
| @comment gconv.h |
| @comment GNU |
| @deftp {Data type} {struct __gconv_step} |
| This data structure describes one conversion a module can perform. For |
| each function in a loaded module with conversion functions there is |
| exactly one object of this type. This object is shared by all users of |
| the conversion (i.e., this object does not contain any information |
| corresponding to an actual conversion; it only describes the conversion |
| itself). |
| |
| @table @code |
| @item struct __gconv_loaded_object *__shlib_handle |
| @itemx const char *__modname |
| @itemx int __counter |
| All these elements of the structure are used internally in the C library |
| to coordinate loading and unloading the shared object. One must not expect any |
| of the other elements to be available or initialized. |
| |
| @item const char *__from_name |
| @itemx const char *__to_name |
| @code{__from_name} and @code{__to_name} contain the names of the source and |
| destination character sets. They can be used to identify the actual |
| conversion to be carried out since one module might implement conversions |
| for more than one character set and/or direction. |
| |
| @item gconv_fct __fct |
| @itemx gconv_init_fct __init_fct |
| @itemx gconv_end_fct __end_fct |
| These elements contain pointers to the functions in the loadable module. |
| The interface will be explained below. |
| |
| @item int __min_needed_from |
| @itemx int __max_needed_from |
| @itemx int __min_needed_to |
| @itemx int __max_needed_to |
| These values have to be supplied in the init function of the module. The |
| @code{__min_needed_from} value specifies how many bytes a character of |
| the source character set at least needs. The @code{__max_needed_from} |
| specifies the maximum value that also includes possible shift sequences. |
| |
| The @code{__min_needed_to} and @code{__max_needed_to} values serve the |
| same purpose as @code{__min_needed_from} and @code{__max_needed_from} but |
| this time for the destination character set. |
| |
| It is crucial that these values be accurate since otherwise the |
| conversion functions will have problems or not work at all. |
| |
| @item int __stateful |
| This element must also be initialized by the init function. |
| @code{__stateful} is nonzero if the source character set is stateful. |
| Otherwise it is zero. |
| |
| @item void *__data |
| This element can be used freely by the conversion functions in the |
| module. @code{__data} can be used to communicate extra information |
| from one call to another. @code{__data} need not be initialized if |
| it is not needed at all. If the @code{__data} element is assigned a |
| pointer to dynamically allocated memory (presumably in the init |
| function), the end function must deallocate the memory. Otherwise |
| the application will leak memory. |
| |
| It is important to be aware that this data structure is shared by all |
| users of this specific conversion and therefore the @code{__data} |
| element must not contain data specific to one particular use of the |
| conversion function. |
| @end table |
| @end deftp |
| |
| @comment gconv.h |
| @comment GNU |
| @deftp {Data type} {struct __gconv_step_data} |
| This is the data structure that contains the information specific to |
| each use of the conversion functions. |
| |
| |
| @table @code |
| @item char *__outbuf |
| @itemx char *__outbufend |
| These elements specify the output buffer for the conversion step. The |
| @code{__outbuf} element points to the beginning of the buffer, and |
| @code{__outbufend} points to the byte following the last byte in the |
| buffer. The conversion function must not assume anything about the size |
| of the buffer but it can be safely assumed that there is room for at |
| least one complete character in the output buffer. |
| |
| Once the conversion is finished, if the conversion is the last step, the |
| @code{__outbuf} element must be modified to point after the last byte |
| written into the buffer to signal how much output is available. If this |
| conversion step is not the last one, the element must not be modified. |
| The @code{__outbufend} element must not be modified. |
| |
| @item int __is_last |
| This element is nonzero if this conversion step is the last one. This |
| information is necessary for the recursion. See the description of the |
| conversion function internals below. This element must never be |
| modified. |
| |
| @item int __invocation_counter |
| The conversion function can use this element to see how many calls of |
| the conversion function already happened. Some character sets require a |
| certain prolog when generating output, and by comparing this value with |
| zero, one can find out whether it is the first call and whether, |
| therefore, the prolog should be emitted. This element must never be |
| modified. |
| |
| @item int __internal_use |
| This element is another one rarely used but needed in certain |
| situations. It is assigned a nonzero value in case the conversion |
| functions are used to implement @code{mbsrtowcs} et al.@: (i.e., the |
| function is not used directly through the @code{iconv} interface). |
| |
| This sometimes makes a difference as it is expected that the |
| @code{iconv} functions are used to translate entire texts while the |
| @code{mbsrtowcs} functions are normally used only to convert single |
| strings and might be used multiple times to convert entire texts. |
| |
| But in this situation we would have a problem complying with some rules of |
| the character set specification. Some character sets require a prolog, |
| which must appear exactly once for an entire text. If a number of |
| @code{mbsrtowcs} calls are used to convert the text, only the first call |
| must add the prolog. However, because there is no communication between the |
| different calls of @code{mbsrtowcs}, the conversion functions have no |
| possibility to find this out. The situation is different for sequences |
| of @code{iconv} calls since the handle allows access to the needed |
| information. |
| |
| The @code{__internal_use} element is mostly used together with |
| @code{__invocation_counter} as follows: |
| |
| @smallexample |
| if (!data->__internal_use |
| && data->__invocation_counter == 0) |
| /* @r{Emit prolog.} */ |
| @dots{} |
| @end smallexample |
| |
| This element must never be modified. |
| |
| @item mbstate_t *__statep |
| The @code{__statep} element points to an object of type @code{mbstate_t} |
| (@pxref{Keeping the state}). The conversion of a stateful character |
| set must use the object pointed to by @code{__statep} to store |
| information about the conversion state. The @code{__statep} element |
| itself must never be modified. |
| |
| @item mbstate_t __state |
| This element must @emph{never} be used directly. It is only part of |
| this structure to have the needed space allocated. |
| @end table |
| @end deftp |
| |
| @subsubsection @code{iconv} module interfaces |
| |
| With the knowledge about the data structures we now can describe the |
| conversion function itself. To understand the interface a bit of |
| knowledge is necessary about the functionality in the C library that |
| loads the objects with the conversions. |
| |
| It is often the case that one conversion is used more than once (i.e., |
| there are several @code{iconv_open} calls for the same set of character |
| sets during one program run). The @code{mbsrtowcs} et al.@: functions in |
| @theglibc{} also use the @code{iconv} functionality, which |
| increases the number of uses of the same functions even more. |
| |
| Because of this multiple use of conversions, the modules do not get |
| loaded exclusively for one conversion. Instead a module once loaded can |
| be used by an arbitrary number of @code{iconv} or @code{mbsrtowcs} calls |
| at the same time. Splitting the information into |
| conversion-function-specific information and conversion data makes this |
| possible. |
| The last section showed the two data structures used to do this. |
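| |
| The effect of this split can be illustrated with two deliberately |
| simplified stand-in types. They are hypothetical and much smaller than |
| the real @code{struct __gconv_step} and @code{struct __gconv_step_data}; |
| the sketch only shows that one shared description can serve several |
| independent uses because all changing data lives in the per-use object. |
| |

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared description of one conversion.  */
struct step_info
{
  const char *from_name;
  const char *to_name;
  int stateful;
};

/* Hypothetical stand-in for the data private to one use.  */
struct use_data
{
  int invocation_counter;
  int shift_state;
};

/* Toy conversion call.  It only reads the shared description (note the
   const) and changes nothing but the per-use data.  Returns nonzero on
   the first call for this use, the point at which a prolog would have
   to be emitted.  */
int
toy_convert (const struct step_info *step, struct use_data *data)
{
  int first_call;

  assert (step != NULL && data != NULL);

  first_call = data->invocation_counter == 0;
  if (step->stateful && first_call)
    data->shift_state = 0;      /* Start in the initial shift state.  */
  data->invocation_counter++;
  return first_call;
}
```

| |
| Two @code{struct use_data} objects passed alternately with the same |
| @code{struct step_info} keep separate counters, so each use sees its own |
| first call. |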
| |
| This is of course also reflected in the interface and semantics of the |
| functions that the modules must provide. There are three functions that |
| must have the following names: |
| |
| @table @code |
| @item gconv_init |
| The @code{gconv_init} function initializes the conversion-function-specific |
| data structure. This very same object is shared by all |
| conversions that use this conversion and, therefore, state information |
| about the conversion itself must not be stored in here. If a module |
| implements more than one conversion, the @code{gconv_init} function will |
| be called multiple times. |
| |
| @item gconv_end |
| The @code{gconv_end} function is responsible for freeing all resources |
| allocated by the @code{gconv_init} function. If there is nothing to do, |
| this function can be missing. Special care must be taken if the module |
| implements more than one conversion and the @code{gconv_init} function |
| does not allocate the same resources for all conversions. |
| |
| @item gconv |
| This is the actual conversion function. It is called to convert one |
| block of text. It gets passed the conversion step information |
| initialized by @code{gconv_init} and the conversion data, specific to |
| this use of the conversion functions. |
| @end table |
| |
| There are three data types defined for the three module interface |
| functions and these define the interface. |
| |
| @comment gconv.h |
| @comment GNU |
| @deftypevr {Data type} int {(*__gconv_init_fct)} (struct __gconv_step *) |
| This specifies the interface of the initialization function of the |
| module. It is called exactly once for each conversion the module |
| implements. |
| |
| As explained in the description of the @code{struct __gconv_step} data |
| structure above the initialization function has to initialize parts of |
| it. |
| |
| @table @code |
| @item __min_needed_from |
| @itemx __max_needed_from |
| @itemx __min_needed_to |
| @itemx __max_needed_to |
| These elements must be initialized to the exact numbers of the minimum |
| and maximum number of bytes used by one character in the source and |
| destination character sets, respectively. If the characters all have the |
| same size, the minimum and maximum values are the same. |
| |
| @item __stateful |
| This element must be initialized to a nonzero value if the source |
| character set is stateful. Otherwise it must be zero. |
| @end table |
| |
| If the initialization function needs to communicate some information |
| to the conversion function, this communication can happen using the |
| @code{__data} element of the @code{__gconv_step} structure. But since |
| this data is shared by all the conversions, it must not be modified by |
| the conversion function. The example below shows how this can be used. |
| |
| @smallexample |
| #define MIN_NEEDED_FROM 1 |
| #define MAX_NEEDED_FROM 4 |
| #define MIN_NEEDED_TO 4 |
| #define MAX_NEEDED_TO 4 |
| |
| int |
| gconv_init (struct __gconv_step *step) |
| @{ |
| /* @r{Determine which direction.} */ |
| struct iso2022jp_data *new_data; |
| enum direction dir = illegal_dir; |
| enum variant var = illegal_var; |
| int result; |
| |
| if (__strcasecmp (step->__from_name, "ISO-2022-JP//") == 0) |
| @{ |
| dir = from_iso2022jp; |
| var = iso2022jp; |
| @} |
| else if (__strcasecmp (step->__to_name, "ISO-2022-JP//") == 0) |
| @{ |
| dir = to_iso2022jp; |
| var = iso2022jp; |
| @} |
| else if (__strcasecmp (step->__from_name, "ISO-2022-JP-2//") == 0) |
| @{ |
| dir = from_iso2022jp; |
| var = iso2022jp2; |
| @} |
| else if (__strcasecmp (step->__to_name, "ISO-2022-JP-2//") == 0) |
| @{ |
| dir = to_iso2022jp; |
| var = iso2022jp2; |
| @} |
| |
| result = __GCONV_NOCONV; |
| if (dir != illegal_dir) |
| @{ |
| new_data = (struct iso2022jp_data *) |
| malloc (sizeof (struct iso2022jp_data)); |
| |
| result = __GCONV_NOMEM; |
| if (new_data != NULL) |
| @{ |
| new_data->dir = dir; |
| new_data->var = var; |
| step->__data = new_data; |
| |
| if (dir == from_iso2022jp) |
| @{ |
| step->__min_needed_from = MIN_NEEDED_FROM; |
| step->__max_needed_from = MAX_NEEDED_FROM; |
| step->__min_needed_to = MIN_NEEDED_TO; |
| step->__max_needed_to = MAX_NEEDED_TO; |
| @} |
| else |
| @{ |
| step->__min_needed_from = MIN_NEEDED_TO; |
| step->__max_needed_from = MAX_NEEDED_TO; |
| step->__min_needed_to = MIN_NEEDED_FROM; |
| step->__max_needed_to = MAX_NEEDED_FROM + 2; |
| @} |
| |
| /* @r{Yes, this is a stateful encoding.} */ |
| step->__stateful = 1; |
| |
| result = __GCONV_OK; |
| @} |
| @} |
| |
| return result; |
| @} |
| @end smallexample |
| |
| The function first checks which conversion is wanted. The module from |
| which this function is taken implements four different conversions; |
| which one is selected can be determined by comparing the names. The |
| comparison should always be done without paying attention to the case. |
| |
| Next, a data structure, which contains the necessary information about |
| which conversion is selected, is allocated. The data structure |
| @code{struct iso2022jp_data} is locally defined since, outside the |
| module, this data is not used at all. Please note that if all four |
| conversions this module supports are requested, there are four data |
| blocks. |
| |
| One interesting thing is the initialization of the @code{__min_} and |
| @code{__max_} elements of the step data object. A single ISO-2022-JP |
| character can consist of one to four bytes. Therefore the |
| @code{MIN_NEEDED_FROM} and @code{MAX_NEEDED_FROM} macros are defined |
| this way. The output is always the @code{INTERNAL} character set (aka |
| UCS-4) and therefore each character consists of exactly four bytes. For |
| the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into |
| account that escape sequences might be necessary to switch the character |
| sets. Therefore the @code{__max_needed_to} element for this direction |
| gets assigned @code{MAX_NEEDED_FROM + 2}. This takes into account the |
| two bytes needed for the escape sequences to signal the switching. The |
| asymmetry in the maximum values for the two directions can be explained |
| easily: when reading ISO-2022-JP text, escape sequences can be handled |
| alone (i.e., it is not necessary to process a real character since the |
| effect of the escape sequence can be recorded in the state information). |
| The situation is different for the other direction. Since it is in |
| general not known which character comes next, one cannot emit escape |
| sequences to change the state in advance. This means the escape |
| sequences have to be emitted together with the next character. |
| Therefore one needs more room than only for the character itself. |
| |
| The possible return values of the initialization function are: |
| |
| @table @code |
| @item __GCONV_OK |
| The initialization succeeded. |
| @item __GCONV_NOCONV |
| The requested conversion is not supported in the module. This can |
| happen if the @file{gconv-modules} file has errors. |
| @item __GCONV_NOMEM |
| Memory required to store additional information could not be allocated. |
| @end table |
| @end deftypevr |
| |
| The function called before the module is unloaded is significantly |
| simpler. It often has nothing at all to do, in which case it can be left |
| out completely. |
| |
| @comment gconv.h |
| @comment GNU |
| @deftypevr {Data type} void {(*__gconv_end_fct)} (struct __gconv_step *) |
| The task of this function is to free all resources allocated in the |
| initialization function. Therefore only the @code{__data} element of |
| the object pointed to by the argument is of interest. Continuing the |
| example from the initialization function, the finalization function |
| looks like this: |
| |
| @smallexample |
| void |
| gconv_end (struct __gconv_step *data) |
| @{ |
| free (data->__data); |
| @} |
| @end smallexample |
| @end deftypevr |
| |
| The most important function is the conversion function itself, which can |
| get quite complicated for complex character sets. But since this is not |
| of interest here, we will only describe a possible skeleton for the |
| conversion function. |
| |
| @comment gconv.h |
| @comment GNU |
| @deftypevr {Data type} int {(*__gconv_fct)} (struct __gconv_step *, struct __gconv_step_data *, const char **, const char *, size_t *, int) |
| The conversion function can be called for two basic reasons: to convert |
| text or to reset the state. From the description of the @code{iconv} |
| function it can be seen why the flushing mode is necessary. What mode |
| is selected is determined by the sixth argument, an integer. This |
| argument being nonzero means that flushing is selected. |
| |
| Common to both modes is where the output buffer can be found. The |
| information about this buffer is stored in the conversion step data. A |
| pointer to this information is passed as the second argument to this |
| function. The description of the @code{struct __gconv_step_data} |
| structure has more information on the conversion step data. |
| |
| @cindex stateful |
| What has to be done for flushing depends on the source character set. |
| If the source character set is not stateful, nothing has to be done. |
| Otherwise the function has to emit a byte sequence to bring the state |
| object into the initial state. Once all of this has happened, the other |
| conversion modules in the chain of conversions have to get the same |
| chance. Whether another step follows can be determined from the |
| @code{__is_last} element of the step data structure to which the second |
| parameter points. |
| |
| The more interesting mode is when actual text has to be converted. The |
| first step in this case is to convert as much text as possible from the |
| input buffer and store the result in the output buffer. The start of the |
| input buffer is determined by the third argument, which is a pointer to a |
| pointer variable referencing the beginning of the buffer. The fourth |
| argument is a pointer to the byte right after the last byte in the buffer. |
| |
| The conversion has to be performed according to the current state if the |
| character set is stateful. The state is stored in an object pointed to |
| by the @code{__statep} element of the step data (second argument). Once |
| either the input buffer is empty or the output buffer is full the |
| conversion stops. At this point, the pointer variable referenced by the |
| third parameter must point to the byte following the last processed |
| byte (i.e., if all of the input is consumed, this pointer and the fourth |
| parameter have the same value). |
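
As an illustration of this stopping rule, here is a minimal sketch that
is not taken from the library sources. It converts ISO-8859-1 to UTF-8,
a stateless conversion chosen only because it is short. The names
@code{toy_convert}, @code{TOY_EMPTY_INPUT}, and @code{TOY_FULL_OUTPUT}
are hypothetical stand-ins for a real conversion loop and the
@code{__GCONV_*} status values, and the output pointer is passed
directly instead of through the step data.

```c
#include <stddef.h>

/* Hypothetical status codes standing in for the __GCONV_* values.  */
enum toy_status { TOY_EMPTY_INPUT, TOY_FULL_OUTPUT };

/* Convert ISO-8859-1 to UTF-8 until the input is used up or the next
   character no longer fits into the output buffer.  On return *INBUF
   points to the byte following the last processed byte and *OUTBUF
   points after the last written byte.  */
static enum toy_status
toy_convert (const unsigned char **inbuf, const unsigned char *inend,
             unsigned char **outbuf, unsigned char *outend)
{
  const unsigned char *in = *inbuf;
  unsigned char *out = *outbuf;
  enum toy_status status = TOY_EMPTY_INPUT;

  while (in < inend)
    {
      /* Bytes below 0x80 map to one UTF-8 byte, all others to two.  */
      size_t need = *in < 0x80 ? 1 : 2;

      if ((size_t) (outend - out) < need)
        {
          /* Stop before the first character that does not fit.  */
          status = TOY_FULL_OUTPUT;
          break;
        }

      if (need == 1)
        *out++ = *in;
      else
        {
          *out++ = 0xc0 | (*in >> 6);
          *out++ = 0x80 | (*in & 0x3f);
        }
      ++in;
    }

  /* Report how far we got in both buffers.  */
  *inbuf = in;
  *outbuf = out;
  return status;
}
```

Called with the three input bytes @code{'a'}, @code{0xe9}, @code{'b'}
and a three-byte output buffer, this sketch writes @code{0x61 0xc3 0xa9},
leaves the input pointer at @code{'b'}, and reports
@code{TOY_FULL_OUTPUT}. A second call with a fresh output buffer
consumes the remaining byte and reports @code{TOY_EMPTY_INPUT}.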
| |
| What now happens depends on whether this step is the last one. If it is |
| the last step, the only thing that has to be done is to update the |
| @code{__outbuf} element of the step data structure to point after the |
| last written byte. This update gives the caller the information on how |
| much text is available in the output buffer. In addition, the variable |
| pointed to by the fifth parameter, which is of type @code{size_t}, must |
| be incremented by the number of characters (@emph{not bytes}) that were |
| converted in a non-reversible way. Then, the function can return. |
| |
| In case the step is not the last one, the later conversion functions have |
| to get a chance to do their work. Therefore, the appropriate conversion |
| function has to be called. The information about the functions is |
| stored in the conversion data structures, passed as the first parameter. |
| This information and the step data are stored in arrays, so the next |
| element in both cases can be found by simple pointer arithmetic: |
| |
| @smallexample |
| int |
| gconv (struct __gconv_step *step, struct __gconv_step_data *data, |
|        const char **inbuf, const char *inbufend, size_t *written, |
|        int do_flush) |
| @{ |
|   struct __gconv_step *next_step = step + 1; |
|   struct __gconv_step_data *next_data = data + 1; |
|   @dots{} |
| @end smallexample |
| |
| The @code{next_step} pointer references the next step information and |
| @code{next_data} the next data record. The call of the next function |
| therefore will look similar to this: |
| |
| @smallexample |
|   next_step->__fct (next_step, next_data, &outerr, outbuf, |
|                     written, 0) |
| @end smallexample |
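
The parallel-array layout can also be tried out in isolation. The
following sketch uses simplified, hypothetical types (@code{toy_step},
@code{toy_step_data}, @code{toy_fct}) that carry only what is needed to
walk a chain; they are not the library's structures. Each step counts
its own invocation and, unless it is the last one, calls the function
of the element that follows it in the array.

```c
#include <stddef.h>

struct toy_step;
struct toy_step_data;

/* Hypothetical, much reduced counterpart of a conversion function.  */
typedef int (*toy_fct) (struct toy_step *, struct toy_step_data *);

struct toy_step
{
  toy_fct fct;          /* Function implementing this step.  */
};

struct toy_step_data
{
  int is_last;          /* Nonzero for the final step of the chain.  */
  int calls;            /* How often this step has run.  */
};

/* Run this step and all steps after it.  Returns the number of steps
   that were executed.  */
static int
toy_gconv (struct toy_step *step, struct toy_step_data *data)
{
  /* Both arrays are indexed in parallel, so the next records are
     simply one element further.  For the last step these pointers
     are formed but never dereferenced.  */
  struct toy_step *next_step = step + 1;
  struct toy_step_data *next_data = data + 1;

  ++data->calls;

  if (data->is_last)
    return 1;

  return 1 + next_step->fct (next_step, next_data);
}
```

With three array elements whose last data record has @code{is_last}
set, calling @code{toy_gconv} on the first element runs each step once.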
| |
| But this is not yet all. Once the function call returns, the conversion |
| function might have more to do. If the return value of the function |
| is @code{__GCONV_EMPTY_INPUT}, more room is available in the output |
| buffer. Unless the input buffer is empty, the conversion functions start |
| all over again and process the rest of the input buffer. If the return |
| value is not @code{__GCONV_EMPTY_INPUT}, something went wrong and we have |
| to recover from this. |
| |
| A requirement for the conversion function is that the input buffer |
| pointer (the third argument) always point to the last character that |
| was put in converted form into the output buffer. This is trivially |
| true after the conversion performed in the current step, but if the |
| conversion functions deeper downstream stop prematurely, not all |
| characters from the output buffer are consumed and, therefore, the input |
| buffer pointers must be backed off to the right position. |
| |
| Correcting the input buffers is easy to do if the input and output |
| character sets have a fixed width for all characters. In this situation |
| we can compute how many characters are left in the output buffer and, |
| therefore, can correct the input buffer pointer appropriately with a |
| similar computation. Things get tricky if either character set |
| has characters represented with variable-length byte sequences, and it |
| gets even more complicated if the conversion has to take care of the |
| state. In these cases the conversion has to be performed once again, from |
| the known state before the initial conversion (i.e., if necessary the |
| state of the conversion has to be reset and the conversion loop has to be |
| executed again). The difference now is that it is known how much output |
| must be created, and the conversion can stop before converting the first |
| unused character. Once this is done, the input buffer pointers must be |
| updated again and the function can return. |
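
For the fixed-width case the correction is plain arithmetic. The
following sketch assumes a hypothetical helper @code{backoff_input},
which is not part of the library. Its parameters @code{in_width} and
@code{out_width} are the assumed byte widths of one character in the
input and output character sets.

```c
#include <stddef.h>

/* Hypothetical helper for fixed-width character sets only.

   INPTR   points after the last input byte that was converted.
   OUTBUF  points after the last output byte that was produced.
   OUTERR  points after the last output byte the next step consumed.

   Returns the position the input pointer has to be moved back to.  */
static const char *
backoff_input (const char *inptr, const char *outbuf,
               const char *outerr, size_t in_width, size_t out_width)
{
  /* Characters that were produced but not consumed downstream.  */
  size_t unused_chars = (size_t) (outbuf - outerr) / out_width;

  /* The same number of characters must be given back on the input.  */
  return inptr - unused_chars * in_width;
}
```

With one-byte input characters and four-byte output characters, ten
converted characters fill 40 output bytes. If the next step stops after
consuming 28 of them, three characters remain unused, so the input
pointer is moved back by three bytes.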
| |
| One final thing should be mentioned. If it is necessary for the |
| conversion to know whether it is the first invocation (in case a prolog |
| has to be emitted), the conversion function should increment the |
| @code{__invocation_counter} element of the step data structure just |
| before returning to the caller. See the description of the @code{struct |
| __gconv_step_data} structure above for more information on how this can |
| be used. |
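
A minimal sketch of this technique follows. It is an assumption-laden
illustration, not library code: @code{struct toy_step_data} is a
hypothetical reduction of the step data to three fields, and the
prolog is assumed to be the two-byte UTF-16 little-endian byte order
mark. The actual conversion is left out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the step data; only the fields used
   here are present.  */
struct toy_step_data
{
  unsigned char *outbuf;      /* Next free byte in the output buffer.  */
  unsigned char *outbufend;   /* Byte after the end of the buffer.  */
  int invocation_counter;     /* Number of completed invocations.  */
};

/* Emit the prolog if this is the first invocation.  Returns 0 on
   success and -1 if the prolog does not fit.  */
static int
toy_emit_prolog (struct toy_step_data *data)
{
  static const unsigned char bom[2] = { 0xff, 0xfe };

  if (data->invocation_counter == 0)
    {
      if ((size_t) (data->outbufend - data->outbuf) < sizeof bom)
        return -1;
      memcpy (data->outbuf, bom, sizeof bom);
      data->outbuf += sizeof bom;
    }
  return 0;
}

static int
toy_step (struct toy_step_data *data)
{
  /* In this sketch a failed prolog returns early, so the counter
     stays at zero and the prolog is retried on the next call.  */
  if (toy_emit_prolog (data) != 0)
    return -1;

  /* The conversion loop would run here.  */

  /* We finished one use of this step.  */
  ++data->invocation_counter;
  return 0;
}
```

The first call of @code{toy_step} writes the two prolog bytes and
advances the output pointer. Later calls find a nonzero counter and
write nothing extra.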
| |
| The return value must be one of the following values: |
| |
| @table @code |
| @item __GCONV_EMPTY_INPUT |
| All input was consumed and there is room left in the output buffer. |
| @item __GCONV_FULL_OUTPUT |
| No more room in the output buffer. In case this is not the last step |
| this value is propagated down from the call of the next conversion |
| function in the chain. |
| @item __GCONV_INCOMPLETE_INPUT |
| The input buffer is not entirely empty since it contains an incomplete |
| character sequence. |
| @end table |
| |
| The following example provides a framework for a conversion function. |
| To write a new conversion, only the holes in this implementation have to |
| be filled in. |
| |
| @smallexample |
| int |
| gconv (struct __gconv_step *step, struct __gconv_step_data *data, |
|        const char **inbuf, const char *inbufend, size_t *written, |
|        int do_flush) |
| @{ |
|   struct __gconv_step *next_step = step + 1; |
|   struct __gconv_step_data *next_data = data + 1; |
|   __gconv_fct fct = next_step->__fct; |
|   int status; |
| |
|   /* @r{If the function is called with no input this means we have} |
|      @r{to reset to the initial state. The possibly partly} |
|      @r{converted input is dropped.} */ |
|   if (do_flush) |
|     @{ |
|       status = __GCONV_OK; |
| |
|       /* @r{Possibly emit a byte sequence which puts the state object} |
|          @r{into the initial state.} */ |
| |
|       /* @r{Call the steps down the chain if there are any but only} |
|          @r{if we successfully emitted the escape sequence.} */ |
|       if (status == __GCONV_OK && ! data->__is_last) |
|         status = fct (next_step, next_data, NULL, NULL, |
|                       written, 1); |
|     @} |
|   else |
|     @{ |
|       /* @r{We preserve the initial values of the pointer variables.} */ |
|       const char *inptr = *inbuf; |
|       char *outbuf = data->__outbuf; |
|       char *outend = data->__outbufend; |
|       char *outptr; |
| |
|       do |
|         @{ |
|           /* @r{Remember the start value for this round.} */ |
|           inptr = *inbuf; |
|           /* @r{The outbuf buffer is empty.} */ |
|           outptr = outbuf; |
| |
|           /* @r{For stateful encodings the state must be saved here.} */ |
| |
|           /* @r{Run the conversion loop. @code{status} is set} |
|              @r{appropriately afterwards.} */ |
| |
|           /* @r{If this is the last step, leave the loop. There is} |
|              @r{nothing we can do.} */ |
|           if (data->__is_last) |
|             @{ |
|               /* @r{Store information about how many bytes are} |
|                  @r{available.} */ |
|               data->__outbuf = outbuf; |
| |
|               /* @r{If any non-reversible conversions were performed,} |
|                  @r{add the number to @code{*written}.} */ |
| |
|               break; |
|             @} |
| |
|           /* @r{Write out all output that was produced.} */ |
|           if (outbuf > outptr) |
|             @{ |
|               const char *outerr = data->__outbuf; |
|               int result; |
| |
|               result = fct (next_step, next_data, &outerr, |
|                             outbuf, written, 0); |
| |
|               if (result != __GCONV_EMPTY_INPUT) |
|                 @{ |
|                   if (outerr != outbuf) |
|                     @{ |
|                       /* @r{Reset the input buffer pointer. We} |
|                          @r{document here the complex case.} */ |
|                       size_t nstatus; |
| |
|                       /* @r{Reload the pointers.} */ |
|                       *inbuf = inptr; |
|                       outbuf = outptr; |
| |
|                       /* @r{Possibly reset the state.} */ |
| |
|                       /* @r{Redo the conversion, but this time} |
|                          @r{the end of the output buffer is at} |
|                          @r{@code{outerr}.} */ |
|                     @} |
| |
|                   /* @r{Change the status.} */ |
|                   status = result; |
|                 @} |
|               else |
|                 /* @r{All the output is consumed, we can make} |
|                    @r{another run if everything was ok.} */ |
|                 if (status == __GCONV_FULL_OUTPUT) |
|                   status = __GCONV_OK; |
|             @} |
|         @} |
|       while (status == __GCONV_OK); |
| |
|       /* @r{We finished one use of this step.} */ |
|       ++data->__invocation_counter; |
|     @} |
| |
|   return status; |
| @} |
| @end smallexample |
| @end deftypevr |
| |
| This information should be sufficient to write new modules. Anybody |
| doing so should also take a look at the available source code in the |
| @glibcadj{} sources. It contains many examples of working and optimized |
| modules. |
| |
| @c File charset.texi edited October 2001 by Dennis Grace, IBM Corporation |