#
# $Id: format.txt,v 1.2 2001/01/02 18:46:20 mleisher Exp $
#

CHARACTER DATA
==============

This package generates some data files that contain character properties useful
for text processing.

CHARACTER PROPERTIES
====================

The first data file is called "ctype.dat" and contains a compressed form of
the character properties found in the Unicode Character Database (UCDB).
Additional properties can be specified in limited UCDB format in another file
to avoid modifying the original UCDB.

The following is a property name and code table to be used with the character
data:

NAME CODE DESCRIPTION
---------------------
Mn   0    Mark, Non-Spacing
Mc   1    Mark, Spacing Combining
Me   2    Mark, Enclosing
Nd   3    Number, Decimal Digit
Nl   4    Number, Letter
No   5    Number, Other
Zs   6    Separator, Space
Zl   7    Separator, Line
Zp   8    Separator, Paragraph
Cc   9    Other, Control
Cf   10   Other, Format
Cs   11   Other, Surrogate
Co   12   Other, Private Use
Cn   13   Other, Not Assigned
Lu   14   Letter, Uppercase
Ll   15   Letter, Lowercase
Lt   16   Letter, Titlecase
Lm   17   Letter, Modifier
Lo   18   Letter, Other
Pc   19   Punctuation, Connector
Pd   20   Punctuation, Dash
Ps   21   Punctuation, Open
Pe   22   Punctuation, Close
Po   23   Punctuation, Other
Sm   24   Symbol, Math
Sc   25   Symbol, Currency
Sk   26   Symbol, Modifier
So   27   Symbol, Other
L    28   Left-To-Right
R    29   Right-To-Left
EN   30   European Number
ES   31   European Number Separator
ET   32   European Number Terminator
AN   33   Arabic Number
CS   34   Common Number Separator
B    35   Block Separator
S    36   Segment Separator
WS   37   Whitespace
ON   38   Other Neutrals
Pi   47   Punctuation, Initial
Pf   48   Punctuation, Final
#
# Implementation specific properties.
#
Cm   39   Composite
Nb   40   Non-Breaking
Sy   41   Symmetric (characters which are part of open/close pairs)
Hd   42   Hex Digit
Qm   43   Quote Mark
Mr   44   Mirroring
Ss   45   Space, Other (controls viewed as spaces in ctype isspace())
Cp   46   Defined character

The actual binary data is formatted as follows:

  Assumptions: unsigned short is at least 16 bits in size and unsigned long
               is at least 32 bits in size.

    unsigned short ByteOrderMark
    unsigned short OffsetArraySize
    unsigned long  Bytes
    unsigned short Offsets[OffsetArraySize + 1]
    unsigned long  Ranges[N], N = value of Offsets[OffsetArraySize]

  The Bytes field provides the total byte count used for the Offsets[] and
  Ranges[] arrays.  The Offsets[] array is aligned on a 4-byte boundary and
  there is always one extra node on the end to hold the final index of the
  Ranges[] array.  The Ranges[] array contains pairs of 4-byte values
  representing a range of Unicode characters.  The pairs are arranged in
  increasing order by the first character code in the range.

  Determining whether a particular character has a property requires a
  simple binary search over the ranges stored for that property.
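
  The search described above can be sketched in C as follows.  This is an
  illustration, not the package's own routine.  The function name
  prop_lookup() is hypothetical, and the sketch assumes that Offsets[p] and
  Offsets[p + 1] delimit the Ranges[] entries for property code p, that
  those offsets count individual Ranges[] elements, and that each range
  includes both of its end points.  The text implies these points but does
  not state them.

```c
#include <stddef.h>

/*
 * Sketch only: test whether `code` has the property numbered `prop`.
 * Assumes offsets[prop] and offsets[prop + 1] delimit that property's
 * slice of ranges[], and that ranges[] holds inclusive (first, last) pairs.
 */
static int prop_lookup(const unsigned short *offsets,
                       const unsigned long *ranges,
                       unsigned short prop, unsigned long code)
{
    /* Work in units of pairs; each pair occupies two array elements. */
    size_t lo = offsets[prop] / 2;
    size_t hi = offsets[prop + 1] / 2;   /* one past the last pair */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < ranges[2 * mid])
            hi = mid;                    /* code lies below this range */
        else if (code > ranges[2 * mid + 1])
            lo = mid + 1;                /* code lies above this range */
        else
            return 1;                    /* first <= code <= last */
    }
    return 0;
}
```

  Values read from a byte-swapped file would have to be swapped before
  being passed to a routine like this one.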

  If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
  machine with a different endian order and the values must be byte-swapped.

  To swap a 16-bit value:
     c = (c >> 8) | ((c & 0xff) << 8)

  To swap a 32-bit value:
     c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
         (((c >> 16) & 0xff) << 8) | (c >> 24)
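
  The two swaps can be wrapped in small C helpers, sketched below.  The
  names swap16(), swap32() and needs_swap() are hypothetical.  The extra
  0xff masks are an addition to the expressions above; they keep the
  results correct when unsigned short or unsigned long is wider than the
  16 or 32 bits assumed as a minimum.

```c
/* Swap the two low-order bytes of a 16-bit value. */
static unsigned short swap16(unsigned short c)
{
    return (unsigned short)(((c >> 8) & 0xff) | ((c & 0xff) << 8));
}

/* Reverse the four low-order bytes of a 32-bit value. */
static unsigned long swap32(unsigned long c)
{
    return ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
           (((c >> 16) & 0xff) << 8) | ((c >> 24) & 0xff);
}

/* Return 1 if the values that follow the header need swapping. */
static int needs_swap(unsigned short byte_order_mark)
{
    return byte_order_mark == 0xFFFE;
}
```

  A reader would check the ByteOrderMark once and, if needs_swap() reports
  a mismatch, apply swap16() to every 16-bit field and swap32() to every
  32-bit field before using them.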

CASE MAPPINGS
=============

The next data file is called "case.dat" and contains three case mapping tables
in the following order: upper, lower, and title case.  Each table is in
increasing order by character code and each mapping contains 3 unsigned longs
which represent the possible mappings.

The format for the binary form of these tables is:

  unsigned short ByteOrderMark
  unsigned short NumMappingNodes, count of all mapping nodes
  unsigned short CaseTableSizes[2], upper and lower mapping node counts
  unsigned long  CaseTables[NumMappingNodes * 3]

  The starting indexes of the case tables are calculated as follows:

    UpperIndex = 0;
    LowerIndex = CaseTableSizes[0] * 3;
    TitleIndex = LowerIndex + CaseTableSizes[1] * 3;

  The order of the fields for the three tables is:

    Upper case
    ----------
    unsigned long upper;
    unsigned long lower;
    unsigned long title;

    Lower case
    ----------
    unsigned long lower;
    unsigned long upper;
    unsigned long title;

    Title case
    ----------
    unsigned long title;
    unsigned long upper;
    unsigned long lower;

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  Because the tables are in increasing order by character code, locating a
  mapping requires a simple binary search on one of the 3 codes that make up
  each node.
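
  A lookup along these lines is sketched below in C.  The function name
  case_lookup() and its parameters are hypothetical.  The sketch assumes
  that each table is sorted on the first field of its nodes (the field
  named after the table) and that the title table holds whatever nodes
  remain after the upper and lower tables; neither point is spelled out
  above.

```c
#include <stddef.h>

/*
 * Sketch only: find the node for `code` in one of the three case tables.
 * `which` selects the table: 0 = upper, 1 = lower, 2 = title.
 * `num_nodes` is NumMappingNodes and `sizes` is CaseTableSizes[].
 * Returns a pointer to the node's three codes, or NULL if none is found.
 */
static const unsigned long *case_lookup(const unsigned long *tables,
                                        unsigned short num_nodes,
                                        const unsigned short sizes[2],
                                        int which, unsigned long code)
{
    size_t start, count, lo, hi;

    if (which == 0) {
        start = 0;                                   /* UpperIndex */
        count = sizes[0];
    } else if (which == 1) {
        start = (size_t)sizes[0] * 3;                /* LowerIndex */
        count = sizes[1];
    } else {
        start = ((size_t)sizes[0] + sizes[1]) * 3;   /* TitleIndex */
        count = (size_t)num_nodes - sizes[0] - sizes[1];
    }

    lo = 0;
    hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const unsigned long *node = tables + start + mid * 3;

        if (code < node[0])
            hi = mid;
        else if (code > node[0])
            lo = mid + 1;
        else
            return node;   /* node[0] is the key, node[1..2] the mappings */
    }
    return NULL;
}
```

  Under these assumptions, the upper case form of a lower case character
  is the second field of the node found in the lower case table.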

  It is important to note that there can be at most 65535 mapping nodes,
  because NumMappingNodes is usually a 16-bit number.  Divided into 3
  portions, that allows 21845 nodes for each case mapping table.  The
  distribution of mappings may be more or less than 21845 per table, but no
  more than 65535 nodes are allowed in total.

COMPOSITIONS
============

This data file is called "comp.dat" and contains data that tracks character
pairs that have a single Unicode value representing the combination of the two
characters.

The format for the binary form of this table is:

  unsigned short ByteOrderMark
  unsigned short NumCompositionNodes, count of composition nodes
  unsigned long  Bytes, total number of bytes used for composition nodes
  unsigned long  CompositionNodes[NumCompositionNodes * 4]

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  The CompositionNodes[] array consists of groups of 4 unsigned longs.  The
  first of these is the character code representing the combination of two
  other character codes, the second records the number of character codes that
  make up the composition (not currently used), and the last two are the pair
  of character codes whose combination is represented by the character code in
  the first field.
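
  As an illustration of the node layout, the C sketch below finds the
  composite for a pair of character codes.  The name compose_pair() is
  hypothetical.  It scans the nodes linearly because the text does not say
  how they are ordered, and it returns 0 when no node matches, on the
  assumption that 0 never appears as a composite.

```c
#include <stddef.h>

/*
 * Sketch only: return the composite for the pair (first, second),
 * or 0 if the table has no node for that pair.
 * Node layout: composite, count (unused), first, second.
 */
static unsigned long compose_pair(const unsigned long *nodes,
                                  unsigned short num_nodes,
                                  unsigned long first, unsigned long second)
{
    size_t i;

    for (i = 0; i < num_nodes; i++) {
        const unsigned long *n = nodes + i * 4;

        if (n[2] == first && n[3] == second)
            return n[0];
    }
    return 0;
}
```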

DECOMPOSITIONS
==============

The next data file is called "decomp.dat" and contains the decomposition data
for all characters whose decompositions contain more than one character and
are *not* compatibility decompositions.  Compatibility decompositions are
signaled in the UCDB format by the use of the <compat> tag in the
decomposition field.  Each list of character codes represents a full
decomposition of a composite character.  The nodes are arranged in increasing
order by character code.

The format for the binary form of this table is:

  unsigned short ByteOrderMark
  unsigned short NumDecompNodes, count of all decomposition nodes
  unsigned long  Bytes
  unsigned long  DecompNodes[(NumDecompNodes * 2) + 1]
  unsigned long  Decomp[N], N = sum of all counts in DecompNodes[]

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  The DecompNodes[] array consists of pairs of unsigned longs, the first of
  which is the character code and the second is the initial index of the list
  of character codes representing the decomposition.

  Locating the decomposition of a composite character requires a binary search
  for a character code in the DecompNodes[] array and using its index to
  locate the start of the decomposition.  The length of the decomposition list
  is the index stored in the following element of DecompNodes[] minus the
  current index.
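
  The lookup can be sketched in C as shown below.  The name decomp_lookup()
  is hypothetical.  The sketch reads the "+ 1" in the DecompNodes[]
  declaration as a single trailing element that holds the end index of the
  last decomposition; that reading is an assumption, since the text does
  not describe the extra element.

```c
#include <stddef.h>

/*
 * Sketch only: locate the decomposition of `code`.
 * On success, stores a pointer to the first decomposed code in *out and
 * the number of codes in *len, then returns 1.  Returns 0 if not found.
 * Assumes nodes[2 * num_nodes] holds the end index of the last list.
 */
static int decomp_lookup(const unsigned long *nodes, unsigned short num_nodes,
                         const unsigned long *decomp, unsigned long code,
                         const unsigned long **out, size_t *len)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < nodes[2 * mid]) {
            hi = mid;
        } else if (code > nodes[2 * mid]) {
            lo = mid + 1;
        } else {
            unsigned long start = nodes[2 * mid + 1];
            /* The next node's index, or the trailing element at the end. */
            unsigned long end = (mid + 1 < num_nodes)
                                    ? nodes[2 * (mid + 1) + 1]
                                    : nodes[2 * (size_t)num_nodes];

            *out = decomp + start;
            *len = (size_t)(end - start);
            return 1;
        }
    }
    return 0;
}
```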

COMBINING CLASSES
=================

The next data file is called "cmbcl.dat" and contains the characters with
non-zero combining classes.

The format for the binary form of this table is:

  unsigned short ByteOrderMark
  unsigned short NumCCLNodes
  unsigned long  Bytes
  unsigned long  CCLNodes[NumCCLNodes * 3]

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  The CCLNodes[] array consists of groups of three unsigned longs.  The first
  and second are the beginning and ending of a range and the third is the
  combining class of that range.

  If a character is not found in this table, then the combining class is
  assumed to be 0.
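
  A combining class lookup over this layout might look like the C sketch
  below.  The name combining_class() is hypothetical, and the sketch
  assumes that the ranges are sorted by their first code, do not overlap,
  and include both end points; the text does not state the ordering.

```c
#include <stddef.h>

/*
 * Sketch only: return the combining class of `code`, or 0 if `code`
 * falls in none of the ranges.
 * Node layout: range start, range end, combining class.
 */
static unsigned long combining_class(const unsigned long *nodes,
                                     unsigned short num_nodes,
                                     unsigned long code)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const unsigned long *n = nodes + mid * 3;

        if (code < n[0])
            hi = mid;
        else if (code > n[1])
            lo = mid + 1;
        else
            return n[2];
    }
    return 0;   /* not in the table: class 0 */
}
```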

  It is important to note that at most 65535 distinct ranges plus combining
  class can be specified because NumCCLNodes is usually a 16-bit number.

NUMBER TABLE
============

The final data file is called "num.dat" and contains the characters that have
a numeric value associated with them.

The format for the binary form of the table is:

  unsigned short ByteOrderMark
  unsigned short NumNumberNodes
  unsigned long  Bytes
  unsigned long  NumberNodes[NumNumberNodes]
  unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
                            / sizeof(short)]

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  The NumberNodes array contains pairs of values, the first of which is the
  character code and the second an index into the ValueNodes array.  The
  ValueNodes array contains pairs of integers which represent the numerator
  and denominator of the numeric value of the character.  If the character
  happens to map to an integer, both values in its ValueNodes pair will be
  the same.
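
  The C sketch below shows one way to read a value out of these two arrays.
  The name number_lookup() is hypothetical.  It takes the declaration
  NumberNodes[NumNumberNodes] literally, so the number of pairs is
  NumNumberNodes / 2, and it assumes that the pairs are sorted by character
  code and that the stored index counts individual ValueNodes elements.
  None of these points is stated in the text.

```c
#include <stddef.h>

/*
 * Sketch only: fetch the numerator and denominator stored for `code`.
 * `num_slots` is NumNumberNodes, taken here as the number of array
 * elements, so the table holds num_slots / 2 (code, index) pairs.
 * Returns 1 on success and 0 if `code` has no numeric value.
 */
static int number_lookup(const unsigned long *nodes, unsigned short num_slots,
                         const unsigned short *values, unsigned long code,
                         unsigned short *numerator,
                         unsigned short *denominator)
{
    size_t lo = 0, hi = num_slots / 2;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < nodes[2 * mid]) {
            hi = mid;
        } else if (code > nodes[2 * mid]) {
            lo = mid + 1;
        } else {
            size_t idx = (size_t)nodes[2 * mid + 1];

            *numerator = values[idx];
            *denominator = values[idx + 1];
            return 1;
        }
    }
    return 0;
}
```

  Following the rule above, a caller could treat the result as an integer
  whenever the two returned values are equal.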