1Q: Why does libiconv support encoding XXX? Why does libiconv not support
2   encoding ZZZ?
3
4A: libiconv, as an internationalization library, supports those character
5   sets and encodings which are in wide-spread use in at least one territory
6   of the world.
7
   Hint1: At http://www.w3c.org/International/O-charset-lang.html you will find
   a page "Languages, countries, and the charsets typically used for them".
   From this table, we can conclude that the following are in active use:
11
12     ISO-8859-1, CP1252   Afrikaans, Albanian, Basque, Catalan, Danish, Dutch,
13                          English, Faroese, Finnish, French, Galician, German,
14                          Icelandic, Irish, Italian, Norwegian, Portuguese,
15                          Scottish, Spanish, Swedish
16     ISO-8859-2           Croatian, Czech, Hungarian, Polish, Romanian, Slovak,
17                          Slovenian
18     ISO-8859-3           Esperanto, Maltese
19     ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
20                          Serbian, Ukrainian
21     ISO-8859-6           Arabic
22     ISO-8859-7           Greek
23     ISO-8859-8           Hebrew
24     ISO-8859-9, CP1254   Turkish
25     ISO-8859-10          Inuit, Lapp
26     ISO-8859-13          Latvian, Lithuanian
27     ISO-8859-15          Estonian
28     KOI8-R               Russian
29     SHIFT_JIS            Japanese
30     ISO-2022-JP          Japanese
31     EUC-JP               Japanese
32
33   Ordered by frequency on the web (1997):
34     ISO-8859-1, CP1252   96%
35     SHIFT_JIS             1.6%
36     ISO-2022-JP           1.2%
37     EUC-JP                0.4%
38     CP1250                0.3%
39     CP1251                0.2%
40     CP850                 0.1%
41     MACINTOSH             0.1%
42     ISO-8859-5            0.1%
43     ISO-8859-2            0.0%
44
45   Hint2: The character sets mentioned in the XFree86 4.0 locale.alias file.
46
47     ISO-8859-1           Afrikaans, Basque, Breton, Catalan, Danish, Dutch,
48                          English, Estonian, Faroese, Finnish, French,
49                          Galician, German, Greenlandic, Icelandic,
50                          Indonesian, Irish, Italian, Lithuanian, Norwegian,
51                          Occitan, Portuguese, Scottish, Spanish, Swedish,
52                          Walloon, Welsh
53     ISO-8859-2           Albanian, Croatian, Czech, Hungarian, Polish,
54                          Romanian, Serbian, Slovak, Slovenian
55     ISO-8859-3           Esperanto
56     ISO-8859-4           Estonian, Latvian, Lithuanian
57     ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
58                          Serbian, Ukrainian
59     ISO-8859-6           Arabic
60     ISO-8859-7           Greek
61     ISO-8859-8           Hebrew
62     ISO-8859-9           Turkish
63     ISO-8859-14          Breton, Irish, Scottish, Welsh
64     ISO-8859-15          Basque, Breton, Catalan, Danish, Dutch, Estonian,
65                          Faroese, Finnish, French, Galician, German,
66                          Greenlandic, Icelandic, Irish, Italian, Lithuanian,
67                          Norwegian, Occitan, Portuguese, Scottish, Spanish,
68                          Swedish, Walloon, Welsh
69     KOI8-R               Russian
70     KOI8-U               Russian, Ukrainian
71     EUC-JP (alias eucJP)      Japanese
72     ISO-2022-JP (alias JIS7)  Japanese
73     SHIFT_JIS (alias SJIS)    Japanese
74     U90                       Japanese
75     S90                       Japanese
76     EUC-CN (alias eucCN)      Chinese
77     EUC-TW (alias eucTW)      Chinese
78     BIG5                      Chinese
79     EUC-KR (alias eucKR)      Korean
80     ARMSCII-8                 Armenian
81     GEORGIAN-ACADEMY          Georgian
82     GEORGIAN-PS               Georgian
83     TIS-620 (alias TACTIS)    Thai
     MULELAO-1                 Laotian
     IBM-CP1133                Laotian
86     VISCII                    Vietnamese
87     TCVN                      Vietnamese
88     NUNACOM-8                 Inuktitut
89
90   Hint3: The character sets supported by Netscape Communicator 4.
91
92     Where is this documented? For the complete picture, I had to use
93     "strings netscape" and then a lot of guesswork. For a quick take,
94     look at the "View - Character set" menu of Netscape Communicator 4.6:
95
96     ISO-8859-{1,2,5,7,9,15}
97     WINDOWS-{1250,1251,1253}
98     KOI8-R               Cyrillic
99     CP866                Cyrillic
100     Autodetect           Japanese  (EUC-JP, ISO-2022-JP, ISO-2022-JP-2, SJIS)
101     EUC-JP               Japanese
102     SHIFT_JIS            Japanese
103     GB2312               Chinese
104     BIG5                 Chinese
105     EUC-TW               Chinese
106     Autodetect           Korean    (EUC-KR, ISO-2022-KR, but not JOHAB)
107
108     UTF-8
109     UTF-7
110
111   Hint4: The character sets supported by Microsoft Internet Explorer 4.
112
113     ISO-8859-{1,2,3,4,5,6,7,8,9}
114     WINDOWS-{1250,1251,1252,1253,1254,1255,1256,1257}
115     KOI8-R               Cyrillic
116     KOI8-RU              Ukrainian
117     ASMO-708             Arabic
118     EUC-JP               Japanese
119     ISO-2022-JP          Japanese
120     SHIFT_JIS            Japanese
121     GB2312               Chinese
122     HZ-GB-2312           Chinese
123     BIG5                 Chinese
124     EUC-KR               Korean
125     ISO-2022-KR          Korean
126     WINDOWS-874          Thai
127     WINDOWS-1258         Vietnamese
128
129     UTF-8
130     UTF-7
131     UNICODE             actually UNICODE-LITTLE
132     UNICODEFEFF         actually UNICODE-BIG
133
134     and various DOS character sets: DOS-720, DOS-862, IBM852, CP866.
135
136   We take the union of all these four sets. The result is:
137
138   European and Semitic languages
139     * ASCII.
140       We implement this because it is occasionally useful to know or to
141       check whether some text is entirely ASCII (i.e. if the conversion
142       ISO-8859-x -> UTF-8 is trivial).
143     * ISO-8859-{1,2,3,4,5,6,7,8,9,10}
       We implement these because they are widely used, except for
       ISO-8859-4, which appears to have been superseded by ISO-8859-13 in
       the Baltic countries. But it's an ISO standard anyway.
147     * ISO-8859-13
148       We implement this because it's a standard in Lithuania and Latvia.
149     * ISO-8859-14
150       We implement this because it's an ISO standard.
151     * ISO-8859-15
152       We implement this because it's increasingly used in Europe, because
153       of the Euro symbol.
154     * ISO-8859-16
155       We implement this because it's an ISO standard.
156     * KOI8-R, KOI8-U
       We implement these because they appear to be the predominant
       encodings on Unix in Russia and Ukraine, respectively.
159     * KOI8-RU
160       We implement this because MSIE4 supports it.
161     * KOI8-T
162       We implement this because it is the locale encoding in glibc's Tajik
163       locale.
164     * PT154
165       We implement this because it is the locale encoding in glibc's Kazakh
166       locale.
167     * CP{1250,1251,1252,1253,1254,1255,1256,1257}
168       We implement these because they are the predominant Windows encodings
169       in Europe.
170     * CP850
       We implement this because it is mentioned as occurring on the web
172       in the aforementioned statistics.
173     * CP862
174       We implement this because Ron Aaron says it is sometimes used in web
175       pages and emails.
176     * CP866
177       We implement this because Netscape Communicator does.
178     * Mac{Roman,CentralEurope,Croatian,Romania,Cyrillic,Greek,Turkish} and
179       Mac{Hebrew,Arabic}
180       We implement these because the Sun JDK does, and because Mac users
181       don't deserve to be punished.
182     * Macintosh
       We implement this because it is mentioned as occurring on the web
184       in the aforementioned statistics.
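
   As a rough illustration of the ASCII entry above: the check it describes
   amounts to testing that no byte has its high bit set, in which case the
   conversion ISO-8859-x -> UTF-8 is a plain copy. The sketch below is in
   Python rather than libiconv's C, and the helper names is_ascii and
   latin1_to_utf8 are made up for this note; it does not show how libiconv
   itself performs the check.

```python
def is_ascii(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit range 0x00..0x7F."""
    return all(byte < 0x80 for byte in data)


def latin1_to_utf8(data: bytes) -> bytes:
    """Convert ISO-8859-1 bytes to UTF-8, skipping the work for pure ASCII."""
    if is_ascii(data):
        # ASCII bytes have the same encoding in ISO-8859-x and in UTF-8,
        # so the input can be returned unchanged.
        return data
    return data.decode("iso-8859-1").encode("utf-8")


plain = b"libiconv"
accented = b"caf\xe9"  # "cafe" with e-acute, encoded in ISO-8859-1

print(is_ascii(plain), is_ascii(accented))
print(latin1_to_utf8(accented))
```

   Only the second input needs a real conversion: its 0xE9 byte becomes the
   two-byte UTF-8 sequence 0xC3 0xA9.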
185   Japanese
186     * EUC-JP, SHIFT_JIS, ISO-2022-JP
       We implement these because they are widely used. EUC-JP and SHIFT_JIS
       are used more for files, whereas ISO-2022-JP is recommended for email.
189     * CP932
190       We implement this because it is the Microsoft variant of SHIFT_JIS,
191       used on Windows.
192     * ISO-2022-JP-2
       We implement this because it's the common way to represent email
       messages that make use of JIS X 0212 characters.
195     * ISO-2022-JP-1
196       We implement this because it's in the RFCs, but I don't think it is
197       really used.
198     * U90, S90
       We DON'T implement these because I have no information about what
       they are or who uses them.
201   Simplified Chinese
202     * EUC-CN = GB2312
203       We implement this because it is the widely used representation
204       of simplified Chinese.
205     * GBK
206       We implement this because it appears to be used on Solaris and Windows.
207     * GB18030
208       We implement this because it is an official requirement in the
209       People's Republic of China.
210     * ISO-2022-CN
211       We implement this because it is in the RFCs, but I have no idea
212       whether it is really used.
213     * ISO-2022-CN-EXT
214       We implement this because it's in the RFCs, but I don't think it is
215       really used.
216     * HZ = HZ-GB-2312
217       We implement this because the RFCs recommend it for Usenet postings,
218       and because MSIE4 supports it.
219   Traditional Chinese
220     * EUC-TW
221       We implement it because it appears to be used on Unix.
222     * BIG5
223       We implement it because it is the de-facto standard for traditional
224       Chinese.
225     * CP950
226       We implement this because it is the Microsoft variant of BIG5, used
227       on Windows.
228     * BIG5+
229       We DON'T implement this because it doesn't appear to be in wide use.
230       Only the CWEX fonts use this encoding. Furthermore, the conversion
231       tables in the big5p package are not coherent: If you convert directly,
232       you get different results than when you convert via GBK.
233     * BIG5-HKSCS
       We implement it because it is the de-facto standard for traditional
       Chinese in Hong Kong.
236   Korean
237     * EUC-KR
       We implement this because it appears to be the widely used
       representation for Korean.
240     * CP949
241       We implement this because it is the Microsoft variant of EUC-KR, used
242       on Windows.
243     * ISO-2022-KR
244       We implement it because it is in the RFCs and because MSIE4 supports
245       it, but I have no idea whether it's really used.
246     * JOHAB
247       We implement this because it is apparently used on Windows as a locale
248       encoding (codepage 1361).
249     * ISO-646-KR
       We DON'T implement this because, although it is an old ASCII variant,
       its glyph for 0x7E is not clear: RFC 1345 and unicode.org's JOHAB.TXT
       say it's a tilde, but Ken Lunde's "CJKV Information Processing" says
       it's an overline. And it is not ISO-IR registered.
254   Armenian
255     * ARMSCII-8
256       We implement it because XFree86 supports it.
257   Georgian
258     * Georgian-Academy, Georgian-PS
       We implement these because both appear to be used for Georgian;
       XFree86 supports them.
261   Thai
262     * ISO-8859-11, TIS-620
       We implement these because they seem to be the standard for Thai.
264     * CP874
265       We implement this because MSIE4 supports it.
266     * MacThai
267       We implement this because the Sun JDK does, and because Mac users
268       don't deserve to be punished.
269   Laotian
270     * MuleLao-1, CP1133
271       We implement these because XFree86 supports them. I have no idea which
272       one is used more widely.
273   Vietnamese
274     * VISCII, TCVN
275       We implement these because XFree86 supports them.
276     * CP1258
277       We implement this because MSIE4 supports it.
278   Other languages
279     * NUNACOM-8 (Inuktitut)
280       We DON'T implement this because it isn't part of Unicode yet, and
281       therefore doesn't convert to anything except itself.
282   Platform specifics
283     * HP-ROMAN8, NEXTSTEP
284       We implement these because they were the native character set on HPs
285       and NeXTs for a long time, and libiconv is intended to be usable on
286       these old machines.
287   Full Unicode
288     * UTF-8, UCS-2, UCS-4
289       We implement these. Obviously.
290     * UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE
291       We implement these because they are the preferred internal
292       representation of strings in Unicode aware applications. These are
293       non-ambiguous names, known to glibc. (glibc doesn't have
294       UCS-2-INTERNAL and UCS-4-INTERNAL.)
295     * UTF-16, UTF-16BE, UTF-16LE
296       We implement these, because UTF-16 is still the favourite encoding of
297       the president of the Unicode Consortium (for political reasons), and
298       because they appear in RFC 2781.
299     * UTF-32, UTF-32BE, UTF-32LE
300       We implement these because they are part of Unicode 3.1.
301     * UTF-7
302       We implement this because it is essential functionality for mail
303       applications.
304     * C99
305       We implement it because it's used for C and C++ programs and because
306       it's a nice encoding for debugging.
307     * JAVA
308       We implement it because it's used for Java programs and because it's
309       a nice encoding for debugging.
     * UNICODE (little endian), UNICODEFEFF (big endian)
311       We DON'T implement these because they are stupid and not standardized.
312   Full Unicode, in terms of `uint16_t' or `uint32_t'
313   (with machine dependent endianness and alignment)
314     * UCS-2-INTERNAL, UCS-4-INTERNAL
315       We implement these because they are the preferred internal
316       representation of strings in Unicode aware applications.
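
   The difference between the explicit-endian names and the machine
   dependent ones can be sketched as follows. This is a Python illustration,
   not libiconv code: Python's "utf-16-be" and "utf-16-le" codecs are used as
   stand-ins (for characters in the Basic Multilingual Plane they produce the
   same bytes as UCS-2BE and UCS-2LE), and native_codec is a hypothetical
   helper that mimics the idea behind the -INTERNAL names.

```python
import sys

text = "A\u20ac"  # 'A' (U+0041) followed by the Euro sign (U+20AC)

# Explicit byte order: the result is the same on every machine.
big = text.encode("utf-16-be")
little = text.encode("utf-16-le")
print(big.hex(), little.hex())


def native_codec() -> str:
    """Pick the 16-bit codec that matches this machine's byte order."""
    return "utf-16-le" if sys.byteorder == "little" else "utf-16-be"


# Machine dependent byte order: the result differs between platforms,
# which is why such data should not be exchanged between machines as is.
native = text.encode(native_codec())
print(sys.byteorder, native.hex())

# Decoding with the wrong byte order does not fail here; it silently
# yields different characters.
print(little.decode("utf-16-be") == text)
```

   The last line prints False, which is the practical argument for names
   that state the byte order explicitly.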
317
Q: Will the encodings mentioned in RFC 1345 be supported?
319A: No, they are not in use any more. Supporting ISO-646 variants is pointless
320   since ISO-8859-* have been adopted.
321
Q: Will EBCDIC be supported?
323A: No!
324
325Q: How do I add a new character set?
326A: 1. Explain the "why" in this file, above.
327   2. You need to have a conversion table from/to Unicode. Transform it into
328   the format used by the mapping tables found on ftp.unicode.org: each line
329   contains the character code, in hex, with 0x prefix, then whitespace,
330   then the Unicode code point, in hex, 4 hex digits, with 0x prefix. '#'
331   counts as a comment delimiter until end of line.
332   Please also send your table to Mark Leisher <mleisher@crl.nmsu.edu> so he
333   can include it in his collection.
334   3. If it's an 8-bit character set, use the '8bit_tab_to_h' program in the
335   tools directory to generate the C code for the conversion. You may tweak
336   the resulting C code if you are not satisfied with its quality, but this
337   is rarely needed.
338   If it's a two-dimensional character set (with rows and columns), use the
339   'cjk_tab_to_h' program in the tools directory to generate the C code for
340   the conversion. You will need to modify the main() function to recognize
341   the new character set name, with the proper dimensions, but that shouldn't
342   be too hard. This yields the CCS. The CES you have to write by hand.
343   4. Store the resulting C code file in the lib directory. Add a #include
344   directive to converters.h, and add an entry to the encodings.def file.
345   5. Compile the package, and test your new encoding using a program like
346   iconv(1) or clisp(1).
347   6. Augment the testsuite: Add a line to each of tests/Makefile.in,
348   tests/Makefile.msvc and tests/Makefile.os2. For a stateless encoding,
349   create the complete table as a TXT file. For a stateful encoding,
350   provide a text snippet encoded using your new encoding and its UTF-8
351   equivalent.
352   7. Update the README and man/iconv_open.3, to mention the new encoding.
353   Add a note in the NEWS file.
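
   The table format described in step 2 can be checked with a few lines of
   script before feeding it to the tools. The following is a minimal sketch
   in Python, assuming exactly the layout described above (character code,
   whitespace, Unicode code point, '#' comments). The function name
   parse_mapping_table and the sample entries are illustrative only and are
   not part of the libiconv tools.

```python
def parse_mapping_table(text: str) -> dict:
    """Parse lines of the form '0xXX <whitespace> 0xXXXX  # comment'.

    Returns a dict mapping character codes to Unicode code points.
    """
    table = {}
    for line in text.splitlines():
        # '#' starts a comment that runs until the end of the line.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            # A code without a Unicode value: treat it as unmapped.
            continue
        # int(..., 16) accepts the 0x prefix.
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


# Illustrative sample, not a complete character set.
SAMPLE = """\
# code  Unicode
0x41    0x0041  # LATIN CAPITAL LETTER A
0xA4    0x20AC  # EURO SIGN
"""

table = parse_mapping_table(SAMPLE)
for code, ucs in sorted(table.items()):
    print(f"0x{code:02X} -> U+{ucs:04X}")
```

   A quick sanity check on the parsed table, for instance that no character
   code appears twice with different values, catches most typos before the
   C code is generated.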
354
355Q: What about bidirectional text? Should it be tagged or reversed when
356   converting from ISO-8859-8 or ISO-8859-6 to Unicode? Qt appears to do
357   this, see qt-2.0.1/src/tools/qrtlcodec.cpp.
358A: After reading RFC 1556: I don't think so. Support for ISO-8859-8-I and
359   ISO-8859-E remains to be implemented.
360   On the other hand, a page on www.w3c.org says that ISO-8859-8 in *email*
361   is visually encoded, ISO-8859-8 in *HTML* is logically encoded, i.e.
362   the same as ISO-8859-8-I. I'm confused.
363
364Other character sets not implemented:
365"MNEMONIC" = "csMnemonic"
366"MNEM" = "csMnem"
367"ISO-10646-UCS-Basic" = "csUnicodeASCII"
368"ISO-10646-Unicode-Latin1" = "csUnicodeLatin1" = "ISO-10646"
369"ISO-10646-J-1"
370"UNICODE-1-1" = "csUnicode11"
371"csWindows31Latin5"
372
373Other aliases not implemented (and not implemented in glibc-2.1 either):
374  From MSIE4:
375    ISO-8859-1: alias ISO8859-1
376    ISO-8859-2: alias ISO8859-2
377    KSC_5601: alias KS_C_5601
378    UTF-8: aliases UNICODE-1-1-UTF-8 UNICODE-2-0-UTF-8
379
380
381Q: How can I integrate libiconv into my package?
382A: Just copy the entire libiconv package into a subdirectory of your package.
383   At configuration time, call libiconv's configure script with the
384   appropriate --srcdir option and maybe --enable-static or --disable-shared.
385   Then "cd libiconv && make && make install-lib libdir=... includedir=...".
386   'install-lib' is a special (not GNU standardized) target which installs
387   only the include file - in $(includedir) - and the library - in $(libdir) -
388   and does not use other directory variables. After "installing" libiconv
389   in your package's build directory, building of your package can proceed.
390
391Q: Why is the testsuite so big?
392A: Because some of the tests are very comprehensive.
393   If you don't feel like using the testsuite, you can simply remove the
394   tests/ directory.
395
396