
 Q: Why does libiconv support encoding XXX? Why does libiconv not support
    encoding ZZZ?

 A: libiconv, as an internationalization library, supports those character
    sets and encodings which are in wide-spread use in at least one territory
    of the world.

    Hint1: On http://www.w3c.org/International/O-charset-lang.html you find a
    page "Languages, countries, and the charsets typically used for them".
    From this table, we can conclude that the following are in active use:

      ISO-8859-1, CP1252   Afrikaans, Albanian, Basque, Catalan, Danish, Dutch,
                           English, Faroese, Finnish, French, Galician, German,
                           Icelandic, Irish, Italian, Norwegian, Portuguese,
                           Scottish, Spanish, Swedish
      ISO-8859-2           Croatian, Czech, Hungarian, Polish, Romanian, Slovak,
                           Slovenian
      ISO-8859-3           Esperanto, Maltese
      ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
                           Serbian, Ukrainian
      ISO-8859-6           Arabic
      ISO-8859-7           Greek
      ISO-8859-8           Hebrew
      ISO-8859-9, CP1254   Turkish
      ISO-8859-10          Inuit, Lapp
      ISO-8859-13          Latvian, Lithuanian
      ISO-8859-15          Estonian
      KOI8-R               Russian
      SHIFT_JIS            Japanese
      ISO-2022-JP          Japanese
      EUC-JP               Japanese

    Ordered by frequency on the web (1997):
      ISO-8859-1, CP1252   96%
      SHIFT_JIS             1.6%
      ISO-2022-JP           1.2%
      EUC-JP                0.4%
      CP1250                0.3%
      CP1251                0.2%
      CP850                 0.1%
      MACINTOSH             0.1%
      ISO-8859-5            0.1%
      ISO-8859-2            0.0%

    Hint2: The character sets mentioned in the XFree86 4.0 locale.alias file.

      ISO-8859-1           Afrikaans, Basque, Breton, Catalan, Danish, Dutch,
                           English, Estonian, Faroese, Finnish, French,
                           Galician, German, Greenlandic, Icelandic,
                           Indonesian, Irish, Italian, Lithuanian, Norwegian,
                           Occitan, Portuguese, Scottish, Spanish, Swedish,
                           Walloon, Welsh
      ISO-8859-2           Albanian, Croatian, Czech, Hungarian, Polish,
                           Romanian, Serbian, Slovak, Slovenian
      ISO-8859-3           Esperanto
      ISO-8859-4           Estonian, Latvian, Lithuanian
      ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
                           Serbian, Ukrainian
      ISO-8859-6           Arabic
      ISO-8859-7           Greek
      ISO-8859-8           Hebrew
      ISO-8859-9           Turkish
      ISO-8859-14          Breton, Irish, Scottish, Welsh
      ISO-8859-15          Basque, Breton, Catalan, Danish, Dutch, Estonian,
                           Faroese, Finnish, French, Galician, German,
                           Greenlandic, Icelandic, Irish, Italian, Lithuanian,
                           Norwegian, Occitan, Portuguese, Scottish, Spanish,
                           Swedish, Walloon, Welsh
      KOI8-R               Russian
      KOI8-U               Russian, Ukrainian
      EUC-JP (alias eucJP)      Japanese
      ISO-2022-JP (alias JIS7)  Japanese
      SHIFT_JIS (alias SJIS)    Japanese
      U90                       Japanese
      S90                       Japanese
      EUC-CN (alias eucCN)      Chinese
      EUC-TW (alias eucTW)      Chinese
      BIG5                      Chinese
      EUC-KR (alias eucKR)      Korean
      ARMSCII-8                 Armenian
      GEORGIAN-ACADEMY          Georgian
      GEORGIAN-PS               Georgian
      TIS-620 (alias TACTIS)    Thai
      MULELAO-1                 Laotian
      IBM-CP1133                Laotian
      VISCII                    Vietnamese
      TCVN                      Vietnamese
      NUNACOM-8                 Inuktitut

    Hint3: The character sets supported by Netscape Communicator 4.

      Where is this documented? For the complete picture, I had to use
      "strings netscape" and then a lot of guesswork. For a quick take,
      look at the "View - Character set" menu of Netscape Communicator 4.6:

      ISO-8859-{1,2,5,7,9,15}
      WINDOWS-{1250,1251,1253}
      KOI8-R               Cyrillic
      CP866                Cyrillic
      Autodetect           Japanese  (EUC-JP, ISO-2022-JP, ISO-2022-JP-2, SJIS)
      EUC-JP               Japanese
      SHIFT_JIS            Japanese
      GB2312               Chinese
      BIG5                 Chinese
      EUC-TW               Chinese
      Autodetect           Korean    (EUC-KR, ISO-2022-KR, but not JOHAB)

      UTF-8
      UTF-7

    Hint4: The character sets supported by Microsoft Internet Explorer 4.

      ISO-8859-{1,2,3,4,5,6,7,8,9}
      WINDOWS-{1250,1251,1252,1253,1254,1255,1256,1257}
      KOI8-R               Cyrillic
      KOI8-RU              Ukrainian
      ASMO-708             Arabic
      EUC-JP               Japanese
      ISO-2022-JP          Japanese
      SHIFT_JIS            Japanese
      GB2312               Chinese
      HZ-GB-2312           Chinese
      BIG5                 Chinese
      EUC-KR               Korean
      ISO-2022-KR          Korean
      WINDOWS-874          Thai
      WINDOWS-1258         Vietnamese

      UTF-8
      UTF-7
      UNICODE             actually UNICODE-LITTLE
      UNICODEFEFF         actually UNICODE-BIG

      and various DOS character sets: DOS-720, DOS-862, IBM852, CP866.

    We take the union of all these four sets. The result is:

    European and Semitic languages
      * ASCII.
        We implement this because it is occasionally useful to know or to
        check whether some text is entirely ASCII (i.e. if the conversion
        ISO-8859-x -> UTF-8 is trivial).
      * ISO-8859-{1,2,3,4,5,6,7,8,9,10}
        We implement these because they are widely used, except ISO-8859-4,
        which appears to have been superseded by ISO-8859-13 in the Baltic
        countries. But it's an ISO standard anyway.
      * ISO-8859-13
        We implement this because it's a standard in Lithuania and Latvia.
      * ISO-8859-14
        We implement this because it's an ISO standard.
      * ISO-8859-15
        We implement this because it's increasingly used in Europe, because
        of the Euro symbol.
      * ISO-8859-16
        We implement this because it's an ISO standard.
      * KOI8-R, KOI8-U
        We implement these because they appear to be the predominant encodings
        on Unix in Russia and Ukraine, respectively.
      * KOI8-RU
        We implement this because MSIE4 supports it.
      * KOI8-T
        We implement this because it is the locale encoding in glibc's Tajik
        locale.
      * PT154
        We implement this because it is the locale encoding in glibc's Kazakh
        locale.
      * RK1048
        We implement this because it's a standard in Kazakhstan.
      * CP{1250,1251,1252,1253,1254,1255,1256,1257}
        We implement these because they are the predominant Windows encodings
        in Europe.
      * CP850
        We implement this because it is mentioned as occurring on the web
        in the aforementioned statistics.
      * CP862
        We implement this because Ron Aaron says it is sometimes used in web
        pages and emails.
      * CP866
        We implement this because Netscape Communicator does.
      * CP1131
        We implement this because it is the locale encoding of a Belarusian
        locale in FreeBSD and Mac OS X.
      * Mac{Roman,CentralEurope,Croatian,Romania,Cyrillic,Greek,Turkish} and
        Mac{Hebrew,Arabic}
        We implement these because the Sun JDK does, and because Mac users
        don't deserve to be punished.
      * Macintosh
        We implement this because it is mentioned as occurring on the web
        in the aforementioned statistics.
    Japanese
      * EUC-JP, SHIFT_JIS, ISO-2022-JP
        We implement these because they are widely used. EUC-JP and SHIFT_JIS
        are used more for files, whereas ISO-2022-JP is recommended for email.
      * CP932
        We implement this because it is the Microsoft variant of SHIFT_JIS,
        used on Windows.
      * ISO-2022-JP-2
        We implement this because it's the common way to represent mails which
        make use of JIS X 0212 characters.
      * ISO-2022-JP-1
        We implement this because it's in the RFCs, but I don't think it is
        really used.
      * ISO-2022-JP-MS
        We implement this because Microsoft Outlook Express / Microsoft MimeOLE
        sends emails in this encoding.
      * U90, S90
        We DON'T implement these because I have no information about what
        they are or who uses them.
    Simplified Chinese
      * EUC-CN = GB2312
        We implement this because it is the widely used representation
        of simplified Chinese.
      * GBK
        We implement this because it appears to be used on Solaris and Windows.
      * GB18030
        We implement this because it is an official requirement in the
        People's Republic of China.
      * ISO-2022-CN
        We implement this because it is in the RFCs, but I have no idea
        whether it is really used.
      * ISO-2022-CN-EXT
        We implement this because it's in the RFCs, but I don't think it is
        really used.
      * HZ = HZ-GB-2312
        We implement this because the RFCs recommend it for Usenet postings,
        and because MSIE4 supports it.
    Traditional Chinese
      * EUC-TW
        We implement it because it appears to be used on Unix.
      * BIG5
        We implement it because it is the de-facto standard for traditional
        Chinese.
      * CP950
        We implement this because it is the Microsoft variant of BIG5, used
        on Windows.
      * BIG5+
        We DON'T implement this because it doesn't appear to be in wide use.
        Only the CWEX fonts use this encoding. Furthermore, the conversion
        tables in the big5p package are not coherent: If you convert directly,
        you get different results than when you convert via GBK.
      * BIG5-HKSCS
        We implement it because it is the de-facto standard for traditional
        Chinese in Hong Kong.
    Korean
      * EUC-KR
        We implement this because it appears to be the widely used
        representation for Korean.
      * CP949
        We implement this because it is the Microsoft variant of EUC-KR, used
        on Windows.
      * ISO-2022-KR
        We implement it because it is in the RFCs and because MSIE4 supports
        it, but I have no idea whether it's really used.
      * JOHAB
        We implement this because it is apparently used on Windows as a locale
        encoding (codepage 1361).
      * ISO-646-KR
        We DON'T implement this because, although it is an old ASCII variant,
        its glyph for 0x7E is not clear: RFC 1345 and unicode.org's JOHAB.TXT
        say it's a tilde, but Ken Lunde's "CJKV Information Processing" says
        it's an overline. And it is not ISO-IR registered.
    Armenian
      * ARMSCII-8
        We implement it because XFree86 supports it.
    Georgian
      * Georgian-Academy, Georgian-PS
        We implement these because both appear to be used for Georgian;
        XFree86 supports them.
    Thai
      * ISO-8859-11, TIS-620
        We implement these because they seem to be the standard for Thai.
      * CP874
        We implement this because MSIE4 supports it.
      * MacThai
        We implement this because the Sun JDK does, and because Mac users
        don't deserve to be punished.
    Laotian
      * MuleLao-1, CP1133
        We implement these because XFree86 supports them. I have no idea which
        one is used more widely.
    Vietnamese
      * VISCII, TCVN
        We implement these because XFree86 supports them.
      * CP1258
        We implement this because MSIE4 supports it.
    Other languages
      * NUNACOM-8 (Inuktitut)
        We DON'T implement this because it isn't part of Unicode yet, and
        therefore doesn't convert to anything except itself.
    Platform specifics
      * HP-ROMAN8, NEXTSTEP
        We implement these because they were the native character set on HPs
        and NeXTs for a long time, and libiconv is intended to be usable on
        these old machines.
    Full Unicode
      * UTF-8, UCS-2, UCS-4
        We implement these. Obviously.
      * UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE
        We implement these because they are the preferred internal
        representation of strings in Unicode aware applications. These are
        non-ambiguous names, known to glibc. (glibc doesn't have
        UCS-2-INTERNAL and UCS-4-INTERNAL.)
      * UTF-16, UTF-16BE, UTF-16LE
        We implement these, because UTF-16 is still the favourite encoding of
        the president of the Unicode Consortium (for political reasons), and
        because they appear in RFC 2781.
      * UTF-32, UTF-32BE, UTF-32LE
        We implement these because they are part of Unicode 3.1.
      * UTF-7
        We implement this because it is essential functionality for mail
        applications.
      * C99
        We implement it because it's used for C and C++ programs and because
        it's a nice encoding for debugging.
      * JAVA
        We implement it because it's used for Java programs and because it's
        a nice encoding for debugging.
      * UNICODE (little endian), UNICODEFEFF (big endian)
        We DON'T implement these because they are stupid and not standardized.
    Full Unicode, in terms of 'uint16_t' or 'uint32_t'
    (with machine dependent endianness and alignment)
      * UCS-2-INTERNAL, UCS-4-INTERNAL
        We implement these because they are the preferred internal
        representation of strings in Unicode aware applications.
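
    The difference between the fixed-endian names (UCS-4BE, UCS-4LE) and the
    "internal" names can be illustrated with a small C sketch. It does not
    call libiconv; the function names are hypothetical, and it assumes, as
    the heading above says, that the internal form is simply the native
    'uint32_t' layout of the machine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* UCS-4BE: most significant byte first, on every machine. */
void put_ucs4be(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    out[0] = (unsigned char)(cp >> 24);
    out[1] = (unsigned char)(cp >> 16);
    out[2] = (unsigned char)(cp >> 8);
    out[3] = (unsigned char)cp;
}

/* UCS-4LE: least significant byte first, on every machine. */
void put_ucs4le(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    out[0] = (unsigned char)cp;
    out[1] = (unsigned char)(cp >> 8);
    out[2] = (unsigned char)(cp >> 16);
    out[3] = (unsigned char)(cp >> 24);
}

/* Assumed "internal" layout: the bytes of a native uint32_t, so the
   result depends on the endianness of the host. */
void put_ucs4_native(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    memcpy(out, &cp, sizeof cp);
}

/* Returns 1 on a big-endian host, 0 on a little-endian host. */
int host_is_big_endian(void)
{
    const uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 0;
}
```

    For U+20AC (EURO SIGN), put_ucs4be() always yields the bytes 00 00 20 AC
    and put_ucs4le() always yields AC 20 00 00, whereas put_ucs4_native()
    matches one or the other depending on the host. This is why the BE/LE
    names are the non-ambiguous ones for data that leaves the process.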

 Q: Support encodings mentioned in RFC 1345 ?
 A: No, they are not in use any more. Supporting ISO-646 variants is pointless
    since ISO-8859-* have been adopted.

 Q: Support EBCDIC ?
 A: No!

 Q: How do I add a new character set?
 A: 1. Explain the "why" in this file, above.
    2. You need to have a conversion table from/to Unicode. Transform it into
    the format used by the mapping tables found on ftp.unicode.org: each line
    contains the character code, in hex, with 0x prefix, then whitespace,
    then the Unicode code point, in hex, 4 hex digits, with 0x prefix. '#'
    counts as a comment delimiter until end of line.
    Please also send your table to Mark Leisher <mleisher@crl.nmsu.edu> so he
    can include it in his collection.
    3. If it's an 8-bit character set, use the '8bit_tab_to_h' program in the
    tools directory to generate the C code for the conversion. You may tweak
    the resulting C code if you are not satisfied with its quality, but this
    is rarely needed.
    If it's a two-dimensional character set (with rows and columns), use the
    'cjk_tab_to_h' program in the tools directory to generate the C code for
    the conversion. You will need to modify the main() function to recognize
    the new character set name, with the proper dimensions, but that shouldn't
    be too hard. This yields the CCS. The CES you have to write by hand.
    4. Store the resulting C code file in the lib directory. Add a #include
    directive to converters.h, and add an entry to the encodings.def file.
    5. Compile the package, and test your new encoding using a program like
    iconv(1) or clisp(1).
    6. Augment the testsuite: Add a line to tests/Makefile.in. For a stateless
    encoding, create the complete table as a TXT file. For a stateful encoding,
    provide a text snippet encoded using your new encoding and its UTF-8
    equivalent.
    7. Update the README and man/iconv_open.3, to mention the new encoding.
    Add a note in the NEWS file.
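
    The table format described in step 2 can be sketched as a small line
    parser. This is only an illustration of the format, not the code used by
    the programs in the tools directory; the function name and the return
    convention are hypothetical, and it assumes exactly 4 hex digits for the
    Unicode code point, as stated above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parse one line of a mapping table: "0xCODE <whitespace> 0xUUUU # comment".
   Returns 1 if a mapping was parsed, 0 for a blank or comment-only line,
   and -1 for a malformed line. */
int parse_mapping_line(const char *line, unsigned long *code,
                       unsigned long *ucs)
{
    char buf[256];
    char *p, *end;
    size_t len;

    assert(line != NULL && code != NULL && ucs != NULL);

    /* Keep only the part before a '#' comment or the end of the line. */
    len = strcspn(line, "#\r\n");
    if (len >= sizeof buf)
        return -1;
    memcpy(buf, line, len);
    buf[len] = '\0';

    p = buf + strspn(buf, " \t");
    if (*p == '\0')
        return 0;                       /* nothing but whitespace/comment */

    /* First field: the character code, in hex, with 0x prefix. */
    if (!(p[0] == '0' && (p[1] == 'x' || p[1] == 'X')))
        return -1;
    *code = strtoul(p, &end, 16);
    if (end - p <= 2)
        return -1;                      /* no hex digits after the prefix */

    /* The two fields must be separated by whitespace. */
    if (*end != ' ' && *end != '\t')
        return -1;
    p = end + strspn(end, " \t");

    /* Second field: the Unicode code point, 0x prefix and 4 hex digits. */
    if (!(p[0] == '0' && (p[1] == 'x' || p[1] == 'X')))
        return -1;
    *ucs = strtoul(p, &end, 16);
    if (end - p != 2 + 4)
        return -1;

    /* Only whitespace may follow. */
    p = end + strspn(end, " \t");
    return *p == '\0' ? 1 : -1;
}
```

    A line such as "0xA4<TAB>0x20AC<TAB># EURO SIGN" parses to code 0xA4 and
    code point 0x20AC; a line that lacks the 0x prefix is rejected.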

 Q: What about bidirectional text? Should it be tagged or reversed when
    converting from ISO-8859-8 or ISO-8859-6 to Unicode? Qt appears to do
    this, see qt-2.0.1/src/tools/qrtlcodec.cpp.
 A: After reading RFC 1556: I don't think so. Support for ISO-8859-8-I and
    ISO-8859-E remains to be implemented.
    On the other hand, a page on www.w3c.org says that ISO-8859-8 in *email*
    is visually encoded, ISO-8859-8 in *HTML* is logically encoded, i.e.
    the same as ISO-8859-8-I. I'm confused.

 Other character sets not implemented:
 "MNEMONIC" = "csMnemonic"
 "MNEM" = "csMnem"
 "ISO-10646-UCS-Basic" = "csUnicodeASCII"
 "ISO-10646-Unicode-Latin1" = "csUnicodeLatin1" = "ISO-10646"
 "ISO-10646-J-1"
 "UNICODE-1-1" = "csUnicode11"
 "csWindows31Latin5"

 Other aliases not implemented (and not implemented in glibc-2.1 either):
   From MSIE4:
     ISO-8859-1: alias ISO8859-1
     ISO-8859-2: alias ISO8859-2
     KSC_5601: alias KS_C_5601
     UTF-8: aliases UNICODE-1-1-UTF-8 UNICODE-2-0-UTF-8


 Q: How can I integrate libiconv into my package?
 A: Just copy the entire libiconv package into a subdirectory of your package.
    At configuration time, call libiconv's configure script with the
    appropriate --srcdir option and maybe --enable-static or --disable-shared.
    Then "cd libiconv && make && make install-lib libdir=... includedir=...".
    'install-lib' is a special (not GNU standardized) target which installs
    only the include file - in $(includedir) - and the library - in $(libdir) -
    and does not use other directory variables. After "installing" libiconv
    in your package's build directory, building of your package can proceed.
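
    Once the include file and the library are in place, the package's own
    code uses the standard iconv_open/iconv/iconv_close interface. The
    following is a minimal sketch; the helper name latin1_to_utf8 is
    hypothetical, and it assumes the 'char **' prototype of iconv() (some
    older headers declare the input buffer as 'const char **').

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert a NUL-terminated ISO-8859-1 string to UTF-8.
   Writes a NUL-terminated result into out (capacity outsize bytes).
   Returns 0 on success, -1 on any failure, including a too-small buffer. */
int latin1_to_utf8(const char *in, char *out, size_t outsize)
{
    iconv_t cd;
    char *inptr = (char *)in;   /* iconv() does not modify the input bytes */
    char *outptr = out;
    size_t inleft, outleft, rc;

    assert(in != NULL && out != NULL && outsize > 0);

    inleft = strlen(in);
    outleft = outsize - 1;      /* reserve room for the terminating NUL */

    /* Note the argument order: target encoding first, then source. */
    cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        return -1;

    rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *outptr = '\0';
    return 0;
}
```

    Depending on the platform, linking may need -liconv together with the
    libdir chosen at "make install-lib" time; on systems whose C library
    already provides iconv, no extra library is needed.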

 Q: Why is the testsuite so big?
 A: Because some of the tests are very comprehensive.
    If you don't feel like using the testsuite, you can simply remove the
    tests/ directory.