#
# $Id: README,v 1.33 2001/01/02 18:46:19 mleisher Exp $
#

                           MUTT UCData Package 2.5
                           -----------------------

This is a package that supports ctype-like operations for Unicode UCS-2 text
(and surrogates), case mapping, and decomposition lookup, and provides a
bidirectional reordering algorithm.  To use it, you will need to get a
recent "UnicodeData-*.txt" file from the Unicode Web or FTP site.

The character information portion of the package consists of three parts:

  1. A program called "ucgendat" which generates six data files from the
     UnicodeData-*.txt file.  The files are:

     A. case.dat   - the case mappings.
     B. ctype.dat  - the character property tables.
     C. comp.dat   - the character composition pairs.
     D. decomp.dat - the character decompositions.
     E. cmbcl.dat  - the non-zero combining classes.
     F. num.dat    - the codes representing numbers.

  2. The "ucdata.[ch]" files which implement the functions needed to
     check whether a character matches groups of properties, to map between
     upper, lower, and title case, to look up the decomposition of a
     character, to look up the combining class of a character, and to get
     the numeric value of a character.

  3. The UCData.java class which provides the same API (with minor changes for
     the numbers) and loads the same binary data files as the C code.

A short reference to the functions available is in the "api.txt" file.
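
As a rough illustration of what a ctype-like property test over range data
can look like, the C sketch below does a binary search over sorted code point
ranges.  The table contents, the demo_ names, and the mask values are all
hypothetical and are not taken from "ucdata.[ch]"; the real functions and
macros are the ones listed in "api.txt".

```c
#include <stddef.h>

/* Hypothetical property bits; the real package defines its own masks. */
#define DEMO_UPPER 0x01UL
#define DEMO_LOWER 0x02UL
#define DEMO_DIGIT 0x04UL

/* One inclusive range of code points sharing the same property bits. */
struct demo_range {
    unsigned long start;
    unsigned long end;
    unsigned long props;
};

/* Tiny illustrative table: sorted by start, ranges do not overlap. */
static const struct demo_range demo_table[] = {
    { 0x0030, 0x0039, DEMO_DIGIT },
    { 0x0041, 0x005A, DEMO_UPPER },
    { 0x0061, 0x007A, DEMO_LOWER },
    { 0x0391, 0x03A1, DEMO_UPPER }   /* Greek capital Alpha..Rho */
};

/* Return nonzero if `code` has at least one of the bits in `mask`. */
int demo_isprop(unsigned long code, unsigned long mask)
{
    size_t lo = 0;
    size_t hi = sizeof(demo_table) / sizeof(demo_table[0]);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < demo_table[mid].start)
            hi = mid;                 /* look in the lower half */
        else if (code > demo_table[mid].end)
            lo = mid + 1;             /* look in the upper half */
        else
            return (demo_table[mid].props & mask) != 0;
    }
    return 0;                         /* code is in no range */
}
```

Passing several bits OR'ed together in the mask tests for "any of these
properties" in one call, which is the sense of "groups of properties" above.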

Techie Details
==============

The "ucgendat" program parses the files named on its command line, all of
which must be in the Unicode Character Database (UCDB) format.  An additional
properties file, "MUTTUCData.txt", provides some extra properties for some
characters.

The program looks for the two character properties fields (2 and 4), the
combining class field (3), the decomposition field (5), the numeric value
field (8), and the case mapping fields (12, 13, and 14), counting fields from
zero.  The decompositions are recursively expanded before being written out.
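
The field numbers above can be made concrete with a small C sketch that splits
one semicolon-separated line into its fields.  This is not the parser used in
"ucgendat.c"; the function name demo_split_fields and the DEMO_MAX_FIELDS
limit are made up for the illustration.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_FIELDS 15   /* fields 0..14 of a UnicodeData-style line */

/*
 * Split one line in place on ';'.  Empty fields are kept (strtok() would
 * silently skip them).  Stores at most `max` pointers in `fields` and
 * returns the number of fields stored.
 */
size_t demo_split_fields(char *line, char *fields[], size_t max)
{
    size_t n = 0;
    char *p = line;

    /* Drop a trailing newline, if any. */
    line[strcspn(line, "\r\n")] = '\0';

    while (n < max) {
        fields[n++] = p;
        p = strchr(p, ';');
        if (p == NULL)
            break;
        *p++ = '\0';          /* terminate this field, move to the next */
    }
    return n;
}
```

For the line "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;", fields 2
and 4 are "Lu" and "L", field 3 is "0", fields 5 and 8 are empty, and field 13
(the lower case mapping) is "0061".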

The decomposition table contains all the canonical decompositions.  This means
all decompositions that do not have tags such as "<compat>" or "<font>".
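
A minimal C sketch of the two ideas, filtering out tagged decompositions and
expanding the remaining ones recursively, might look like the following.  The
three-entry table and all demo_ names are hypothetical; "ucgendat" builds its
own tables from the input files.

```c
#include <stddef.h>

/* Hypothetical record: a character and its direct canonical decomposition
 * (at most two code points in this sketch). */
struct demo_decomp {
    unsigned long code;
    unsigned long parts[2];
    size_t nparts;
};

static const struct demo_decomp demo_decomps[] = {
    { 0x00C5, { 0x0041, 0x030A }, 2 },  /* A WITH RING ABOVE           */
    { 0x01FA, { 0x00C5, 0x0301 }, 2 },  /* A WITH RING ABOVE AND ACUTE */
    { 0x212B, { 0x00C5, 0 },      1 }   /* ANGSTROM SIGN               */
};

/* A decomposition field is canonical when it is non-empty and does not
 * start with a "<tag>" such as "<compat>" or "<font>". */
int demo_is_canonical(const char *field)
{
    return field[0] != '\0' && field[0] != '<';
}

/* Linear search is enough for a three-entry illustration. */
static const struct demo_decomp *demo_find(unsigned long code)
{
    size_t i;

    for (i = 0; i < sizeof(demo_decomps) / sizeof(demo_decomps[0]); i++)
        if (demo_decomps[i].code == code)
            return &demo_decomps[i];
    return NULL;
}

/*
 * Append the full expansion of `code` to out[n..max-1] and return the new
 * count.  Code points that do not fit are silently dropped in this sketch.
 */
size_t demo_expand(unsigned long code, unsigned long *out, size_t n,
                   size_t max)
{
    const struct demo_decomp *d = demo_find(code);
    size_t i;

    if (d == NULL) {                  /* no decomposition: emit as is */
        if (n < max)
            out[n++] = code;
        return n;
    }
    for (i = 0; i < d->nparts; i++)   /* expand each part in turn */
        n = demo_expand(d->parts[i], out, n, max);
    return n;
}
```

Expanding 0x01FA with this table gives 0x0041 0x030A 0x0301, because its
first part, 0x00C5, is itself decomposed one level further.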

The data is almost all stored as unsigned longs (32 bits assumed) and the
routines that load the data take care of endian swaps when necessary.  This
also means that supplementary characters (>= 0x10000) can be placed in the
data files the "ucgendat" program parses.
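
For readers unfamiliar with endian swapping, here is a small C sketch of
reversing the byte order of 32-bit values after they have been read.  It is an
illustration only: the DEMO_MARKER value and the demo_ function names are
assumptions made for this example, and the way the real loaders detect the
byte order is described in "format.txt", not here.

```c
#include <stddef.h>

/* Hypothetical 32-bit marker assumed to start a data file. */
#define DEMO_MARKER 0x0000FEFFUL

/* Reverse the byte order of one 32-bit quantity held in an unsigned long.
 * Any bits above the low 32 are discarded first. */
unsigned long demo_swap32(unsigned long v)
{
    v &= 0xFFFFFFFFUL;
    return ((v & 0x000000FFUL) << 24) |
           ((v & 0x0000FF00UL) << 8)  |
           ((v & 0x00FF0000UL) >> 8)  |
           ((v & 0xFF000000UL) >> 24);
}

/* The file was written with the opposite byte order if the marker reads
 * back byte-reversed. */
int demo_needs_swap(unsigned long marker_read)
{
    return (marker_read & 0xFFFFFFFFUL) == demo_swap32(DEMO_MARKER);
}

/* Swap every value of a freshly loaded table in place. */
void demo_swap_all(unsigned long *vals, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        vals[i] = demo_swap32(vals[i]);
}
```

Because each entry is a full 32-bit value, a supplementary code point such as
0x10000 survives the round trip unchanged, which a 16-bit table entry could
not hold.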

The data is written as external files and broken into six parts so it can be
selectively updated at runtime if necessary.

The data files currently generated from the "ucgendat" program total about 56K
in size.

The format of the binary data files is documented in the "format.txt" file.

==========================================================================

                       The "Pretty Good Bidi Algorithm"
                       --------------------------------

This routine provides an alternative to the Unicode Bidi algorithm.  The
difference is that this version of the PGBA does not handle the explicit
directional codes (LRE, RLE, LRO, RLO, PDF).  It should now produce the same
results as the Unicode BiDi algorithm for implicit reordering.  Included are
functions for doing cursor motion in both logical and visual order.

This implementation is provided to demonstrate an effective alternate method
for implicit reordering.  To make this useful for an application, it probably
needs some changes to the memory allocation and deallocation, as well as data
structure additions for rendering.

Mark Leisher <mleisher@crl.nmsu.edu>
19 November 1999

-----------------------------------------------------------------------------

CHANGES
=======
Version 2.5
-----------
1. Changed the number lookup to set the denominator to 1 in cases of digits.
   This restores functional compatibility with John Cowan's UCType package.

2. Added support for the AL property.

3. Modified load and reload functions to return error codes.

Version 2.4
-----------
1. Improved some bidi algorithm documentation in the code.

2. Fixed a code mixup that produced a non-working version.

Version 2.3
-----------
1. Fixed a misspelling in the ucpgba.h header file.

2. Fixed a bug which caused trailing weak non-digit sequences to be left out of
   the reordered string in the bidi algorithm.

3. Fixed a problem with weak sequences containing non-spacing marks in the
   bidi algorithm.

4. Fixed a problem with text runs of the opposite direction of the string
   surrounding a weak + neutral text run appearing in the wrong order in the
   bidi algorithm.

5. Added a default overall direction parameter to the reordering function for
   cases of strings with no strong directional characters in the bidi
   algorithm.

6. The bidi API documentation was improved.

7. Added a man page for the bidi API.

Version 2.2
-----------
1. Fixed a problem with the bidi algorithm locating directional section
   boundaries.

2. Fixed a problem with the bidi algorithm starting the reordering correctly.

3. Fixed a problem with the bidi algorithm determining end boundaries for LTR
   segments.

4. Fixed a problem with the bidi algorithm reordering weak (digits and number
   separators) segments.

5. Added automatic switching of symmetrically paired characters when
   reversing RTL segments.

6. Added a missing symmetric character to the extra character properties in
   MUTTUCData.txt.

7. Added support for doing logical and visual cursor traversal.

Version 2.1
-----------
1. Updated the ucgendat program to handle the Unicode 3.0 character database
   properties.  The AL and BN bidi properties get marked as strong RTL and
   Other Neutral, respectively; the NSM, LRE, RLE, PDF, LRO, and RLO controls
   all get marked as Other Neutral.

2. Fixed some problems with testing against signed values in the UCData.java
   code and some minor cleanup.

3. Added the "Pretty Good Bidi Algorithm."

Version 2.0
-----------
1. Removed the old Java stuff for a new class that loads directly from the
   same data files as the C code does.

2. Fixed a problem with choosing the correct field when mapping case.

3. Adjusted some search routines to start their search in the correct position.

4. Moved the copyright year to 1999.

Version 1.9
-----------
1. Fixed a problem with an incorrect amount of storage being allocated for the
   combining class nodes.

2. Fixed an invalid initialization in the number code.

3. Changed the Java template file formatting a bit.

4. Added tables and a function for getting decompositions in the Java class.

Version 1.8
-----------
1. Fixed a problem with adding certain ranges.

2. Added two more macros for testing for identifiers.

3. Tested with the UnicodeData-2.1.5.txt file.

Version 1.7
-----------
1. Fixed a problem with looking up decompositions in "ucgendat."

Version 1.6
-----------
1. Added two new properties introduced with UnicodeData-2.1.4.txt.

2. Changed the "ucgendat.c" program a little to automatically align the
   property data on a 4-byte boundary when new properties are added.

3. Changed the "ucgendat.c" program to only generate canonical
   decompositions.

4. Added two new macros ucisinitialpunct() and ucisfinalpunct() to check for
   initial and final punctuation characters.

5. Minor additions and changes to the documentation.

Version 1.5
-----------
1. Changed all file open calls to include binary mode with "b" for DOS/WIN
   platforms.

2. Wrapped the unistd.h include so it won't be included when compiled under
   Win32.

3. Fixed a bad range check for hex digits in ucgendat.c.

4. Fixed a bad endian swap for combining classes.

5. Added code to make a number table and associated lookup functions.
   Functions added are ucnumber(), ucdigit(), and ucgetnumber().  The last
   function is to maintain compatibility with John Cowan's "uctype" package.

Version 1.4
-----------
1. Fixed a bug with adding a range.

2. Fixed a bug with inserting a range in order.

3. Fixed incorrectly specified ucisdefined() and ucisundefined() macros.

4. Added the missing unload for the combining class data.

5. Fixed a bad macro placement in ucisweak().

Version 1.3
-----------
1. Bug with case mapping calculations fixed.

2. Bug with empty character property entries fixed.

3. Bug with incorrect type in the combining class lookup fixed.

4. Some corrections done to api.txt.

5. Bug in certain character property lookups fixed.

6. Added a character property table that records the defined characters.

7. Replaced ucisunknown() with ucisdefined() and ucisundefined().

Version 1.2
-----------
1. Added code to ucgendat to generate a combining class table.

2. Fixed an endian problem with the byte count of decompositions.

3. Fixed some minor problems in the "format.txt" file.

4. Removed some bogus "Ss" values from MUTTUCData.txt file.

5. Added API function to get combining class.

6. Changed the open mode to "rb" so binary data files will be opened correctly
   on DOS/WIN as well as other platforms.

7. Added the "api.txt" file.

Version 1.1
-----------
1. Added ucisxdigit() which I overlooked.

2. Added UC_LT to the ucisalpha() macro which I overlooked.

3. Changed uciscntrl() to include UC_CF.

4. Added ucisocntrl() and ucfntcntrl() macros.

5. Added a ucisblank() which I overlooked.

6. Added missing properties to ucissymbol() and ucisnumber().

7. Added ucisgraph() and ucisprint().

8. Changed the "Mr" property to "Sy" to mark this subset of mirroring
   characters as symmetric to avoid trampling the Unicode/ISO10646 sense of
   mirroring.

9. Added another property called "Ss" which includes control characters
   traditionally seen as spaces in the isspace() macro.

10. Added a bunch of macros to be API compatible with John Cowan's package.

ACKNOWLEDGEMENTS
================

Thanks go to John Cowan <cowan@locke.ccil.org> for pointing out lots of
missing things and giving me stuff, particularly a bunch of new macros.

Thanks go to Bob Verbrugge <bob_verbrugge@nl.compuware.com> for pointing out
various bugs.

Thanks go to Christophe Pierret <cpierret@businessobjects.com> for pointing
out that file modes need to have "b" for DOS/WIN machines, pointing out
unistd.h is not a Win32 header, and pointing out a problem with ucisalnum().

Thanks go to Kent Johnson <kent@pondview.mv.com> for finding a bug that caused
incomplete decompositions to be generated by the "ucgendat" program.

Thanks go to Valeriy E. Ushakov <uwe@ptc.spbu.ru> for spotting an allocation
error and an initialization error.

Thanks go to Stig Venaas <Stig.Venaas@uninett.no> for providing a patch to
support return types on load and reload, and for major updates to handle
canonical composition and decomposition.