#
# File: fsckFixDecompsNotes.txt
#
# Contains: Notes on fsckFixDecomps function and related tools
#
# Copyright: © 2002 by Apple Computer, Inc., all rights reserved.
#
# CVS change log:
#
# $Log: FixDecompsNotes.txt,v $
# Revision 1.2 2002/12/20 01:20:36 lindak
# Merged PR-2937515-2 into ZZ100
# Old HFS+ decompositions need to be repaired
#
# Revision 1.1.4.1 2002/12/16 18:55:22 jcotting
# integrated code from text group (Peter Edberg) that will correct some
# illegal names created with obsolete Unicode 2.1.2 decomposition rules
# Bug #: 2937515
# Submitted by: jerry cottingham
# Reviewed by: don brady
#
# Revision 1.1.2.1 2002/10/25 17:15:23 jcotting
# added code from Peter Edberg that will detect and offer replacement
# names for file system object names with pre-Jaguar decomp errors
# Bug #: 2937515
# Submitted by: jerry cottingham
# Reviewed by: don brady
#
# Revision 1.2 2002/10/16 20:17:21 pedberg
# Add more notes
#
# Revision 1.1 2002/10/16 06:53:54 pedberg
# [3066897, 2937515] Start adding notes about this code
#
# ---------------------------------------------------------------------------

Code provided per Radar #3066897 to fix bug #2937515.

The Unicode decomposition used to date for HFS+ volumes - as described in
    <http://developer.apple.com/technotes/tn/tn1150.html#CanonicalDecomposition>
    <http://developer.apple.com/technotes/tn/tn1150table.html>
- is based on a modified version of the decomposition rules for Unicode 2.1.2
(but even those were not correctly implemented for certain combinations of
multiple combining marks). Unicode has updated the decomposition and combining
mark reordering rules and data many times since then, but they have locked them
down for Unicode 3.1. This is because Unicode 3.1 is the basis of the Unicode
normalization forms such as NFC and NFD. We began supporting these normalization
forms in Jaguar.

Because of this, the Apple Unicode cross-functional committee decided to make a
one-time change to update the decomposition rules used for HFS+ volumes from the
Unicode 2.1.2 rules to the Unicode 3.1 rules. TEC and the kernel encoding
converters made this change in Jaguar. One other piece that was supposed to
happen was an enhancement to fsck to convert filenames on HFS+ volumes from the
old decomposition to the new.

That fsck change did not happen in Jaguar, and as a result there are bugs such
as 2937515 (in which users are seeing partial garbage for filenames). The update
affects the decomposition of Greek characters - about 80 of them (18 of which
correspond to characters in MacGreek). It also affects the decomposition of a
few others: around 23 Latin-script characters and 18 Cyrillic characters (none
of which correspond to anything in a traditional Mac encoding), 8 Arabic
characters (5 of which do correspond to MacArabic characters), and 16 Indic,
Thai, and Lao characters (3 of which correspond to characters in Mac encodings).
It also potentially affects the ordering of all combining marks.

This directory contains code provided per 3066897 that fsck can use to address
this problem for HFS+ volumes.

----
A. Data structure

The data is organized into a two-level trie. The first level is a table that
maps the high-order 12 bits of a UniChar to an index value. The value is -1 if
no character with those high 12 bits has either a decomposition update or a
nonzero combining class; otherwise, it is an index into an array of ranges that
map the low 4 bits of the UniChar to the necessary data. There are two such
arrays of ranges; one provides the mappings to combining class values, the other
provides the mappings to decomposition update information.
The latter is in the form of an index into an array of sequences that contain an
action code, an optional list of additional characters that need to be matched
for a complete sequence match (in the case where a 2-element or 3-element
sequence needs to be updated), and the replacement decomposition sequence.

There is one additional twist for the first-level trie table. Since the
characters that have class or decomposition data are all either in the range
0x0000-30FF or 0xFB00-FFFF, we can save about 3K of space in the table by
eliminating the middle. Before looking up a UTF-16 character in the table, we
first add 0x0500 to it. The addition wraps around modulo 0x10000, so 0xFB00-FFFF
maps to 0x0000-04FF and 0x0000-30FF maps to 0x0500-35FF; the resulting shifted
UniChar for any character with data is therefore in the range 0x0000-35FF. So if
the shifted UniChar is >= 0x3600, we don't bother looking in the table.

The table data is generated automatically by the fsckMakeDecompData tool; the
sources for this tool contain an array with the raw data for characters that
either have nonzero combining class or begin a sequence of characters that may
need to be updated. The tool generates the index, the two range arrays, and the
list of decomposition update actions.

----
B. Files

* fsckDecompDataEnums.h contains enums related to the data tables

* fsckMakeDecompData.c contains the raw data source; when this tool is compiled
and run, it writes to standard output the contents of the binary data tables;
this should be directed into a file fsckDecompData.h.

* fsckFixDecomps.h contains the interface for the fsckFixDecomps function (and
related types)

* fsckFixDecomps.c contains the function code.

----
C. Function interface

The basic interface (defined in fsckFixDecomps.h) is:

Boolean fsckFixDecomps( ConstHFSUniStr255Param inFilename,
                        HFSUniStr255 *outFilename );

If inFilename needs updating and the function was able to do this without
overflowing the 255-character limit, it returns 1 (true) and outFilename
contains the updated filename. If inFilename did not need updating, or an update
would overflow the limit, the function returns 0 (false) and the contents of
outFilename are undefined.

The function needs a couple of types from Apple interface files (not standard C
ones): HFSUniStr255 and Boolean. For now these are defined in fsckFixDecomps.h
if NO_APPLE_INCLUDES is 1. For building with fsck_hfs, the appropriate includes
should be put into fsckFixDecomps.h.

For the record, hfs_format.h defines HFSUniStr255 as follows:

struct HFSUniStr255 {
    u_int16_t length;       /* number of unicode characters */
    u_int16_t unicode[255]; /* unicode characters */
};
typedef struct HFSUniStr255 HFSUniStr255;
typedef const HFSUniStr255 *ConstHFSUniStr255Param;

----
D. Function implementation

Characters that don't require any special handling have combining class 0 and do
not begin a decomposition sequence (of 1-3 characters) that needs updating. For
these characters, the function just copies them from inFilename to outFilename
and sets the pointer outNameCombSeqPtr to NULL (when this pointer is not NULL,
it points to the beginning of a sequence of combining marks that continues up to
the current character; if the current character is combining, it may need to be
reordered into that sequence). The copying operation is cheap, and postponing it
until we know the filename needs modification would make the code much more
complicated.
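
The quickest way to recognize a character that needs no special handling is the
shifted first-level lookup described in section A. The following is only a
sketch of that lookup, not the code in fsckFixDecomps.c: the names
ShiftUniChar, FirstLevelIndex, sIndexTable and InitExampleTable are
hypothetical, the int8_t entry type is an assumption, and the table contents
are made up (the real ones are generated by fsckMakeDecompData).

```c
#include <assert.h>
#include <stdint.h>

enum {
    kShiftAmount  = 0x0500,             /* added before lookup (section A) */
    kShiftedLimit = 0x3600,             /* shifted values >= this have no data */
    kIndexEntries = kShiftedLimit >> 4  /* 864 entries instead of 4096 */
};

/* Hypothetical first-level table; -1 means "no data for this 16-char block". */
static int8_t sIndexTable[kIndexEntries];

/* Fill the table with made-up data: only U+0300..U+030F gets an index (0). */
static void InitExampleTable(void)
{
    for (int i = 0; i < kIndexEntries; i++)
        sIndexTable[i] = -1;
    sIndexTable[(uint16_t)(0x0300 + kShiftAmount) >> 4] = 0;
}

/* Add the shift; the cast makes the wraparound modulo 0x10000 explicit. */
static uint16_t ShiftUniChar(uint16_t ch)
{
    return (uint16_t)(ch + kShiftAmount);
}

/* Return the first-level index for ch, or -1 if ch needs no special handling. */
static int FirstLevelIndex(uint16_t ch)
{
    uint16_t shifted = ShiftUniChar(ch);

    if (shifted >= kShiftedLimit)
        return -1;                      /* in the eliminated middle range */
    assert((shifted >> 4) < kIndexEntries);
    return sIndexTable[shifted >> 4];   /* high-order 12 bits select the entry */
}
```

With this example table, FirstLevelIndex(0x0301) yields an index, while a CJK
character such as 0x4E00 is rejected by the range test alone, without touching
the table.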

This copying operation may be invoked from many places in the code, some deeply
nested - any time the code determines that the current character needs no
special handling. For this reason it has a label (CopyBaseChar) and is located
at the end of the character processing loop; various places in the code use goto
statements to jump to it (this is a situation where they are justified).

The main function loop has four parts.

First, it quickly determines if the high 12 bits of the character indicate that
it is in a range that has neither nonzero combining class nor any decomposition
sequences that need updating. If so, the code jumps straight to CopyBaseChar.

Second, the code determines if the character is part of a sequence that needs
updating. It checks if the current character has a corresponding action in the
replaceData array. If so, depending on the action, it may need to check for
additional matching characters in inFilename. If the sequence of 1-3 characters
is successfully matched, then a replacement sequence of 1-3 characters is
inserted at the corresponding position in outFilename. While this replacement
sequence is being inserted, each character must be checked to see if it has
nonzero combining class and needs reordering (some replacement sequences consist
entirely of combining characters and may interact with combining characters in
the filename before the updated sequence).

Third, the code handles characters whose high-order 12 bits indicated that some
action was required, but that were not part of sequences that needed updating.
These may include characters that were examined in the second part but belonged
to sequences that did not completely match, so the second part also has goto
fallthroughs to this code (labeled CheckCombClass). These characters have to be
checked for nonzero combining class; if the class is nonzero, they are reordered
as necessary. Each time a new nonzero-class character is encountered, it is
added to outFilename at the correct point in any active combining character
sequence (with other characters in the sequence moved as necessary), so the
sequence pointed to by outNameCombSeqPtr is always in correct order up to the
current character.

Finally, the fourth part has the default handlers to just copy characters to
outFilename.
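
The reordering step in the third part amounts to an insertion into a sequence
that is already sorted by combining class. The following is a rough sketch of
that idea under stated assumptions, not the actual code: ExampleCombClass and
InsertCombiningMark are hypothetical names, the class lookup is a hard-coded
stand-in for the generated range arrays, and seq plays the role of the
sequence that outNameCombSeqPtr points to. The caller is assumed to have
checked that there is room for one more character.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the generated combining class data; the values
   are the canonical combining classes of these four marks. */
static uint8_t ExampleCombClass(uint16_t ch)
{
    switch (ch) {
        case 0x0327: return 202;    /* COMBINING CEDILLA */
        case 0x0323: return 220;    /* COMBINING DOT BELOW */
        case 0x0301: return 230;    /* COMBINING ACUTE ACCENT */
        case 0x0308: return 230;    /* COMBINING DIAERESIS */
        default:     return 0;      /* treated as a base character here */
    }
}

/* Insert the nonzero-class character ch into seq[0 .. *len - 1], which is
   already ordered by class. Marks with a higher class move up one slot;
   marks with an equal class keep their relative order. */
static void InsertCombiningMark(uint16_t *seq, size_t *len, uint16_t ch)
{
    uint8_t cls = ExampleCombClass(ch);
    size_t  pos = *len;

    while (pos > 0 && ExampleCombClass(seq[pos - 1]) > cls) {
        seq[pos] = seq[pos - 1];    /* shift the higher-class mark right */
        pos--;
    }
    seq[pos] = ch;
    (*len)++;
}
```

Because each new mark is placed as soon as it is seen, the sequence never has
to be sorted as a whole; a mark with the same class as its predecessor is
simply appended, which preserves the original order of equal-class marks.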