<HTML>
<HEAD><TITLE>APR Canonical Filenames</TITLE></HEAD>
<BODY>
<h1>APR Canonical Filename</h1>

<h2>Requirements</h2>

<p>APR porters need to address the underlying discrepancies between
file systems.  To achieve a reasonable degree of security, a
program that depends on APR needs to know that two paths may be
compared, and that a mismatch is guaranteed to mean that the
two paths do not refer to the same resource.</p>

<p>The first discrepancy is in volume roots.  Unix and its pure derivatives
have only one root path, "/".  Win32 and OS2 share root paths of
the form "D:/", where D: is the volume designation.  However, this can
also be specified as "//./D:/", indicating the D: volume of
'this' machine.  Win32 and OS2 may also employ a UNC root path
of the form "//server/share/", where share is a share-point of the
specified network server.  Finally, NetWare root paths are of the
form "server/volume:/", or the simpler "volume:/" syntax for 'this'
machine.  All these non-Unix file systems accept volume:path,
without a slash following the colon, as a path relative to the
current working directory, which APR will treat as ambiguous, that
is, neither an absolute nor a relative path per se.</p>
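
<p>As a rough illustration of the root forms listed above, the following
Python sketch splits a slashed path into a root and a remainder.  The
function name <code>split_root</code>, the platform keys, and the kind
labels are assumptions made for this example and are not part of APR.  Only
the forms named in the text are recognized; anything else that starts with
a slash is reported as unclassified rather than guessed at.</p>

<pre>
```python
import re

# Root forms named in the text, per platform, most specific first.
# The platform keys and kind labels are illustrative, not APR names.
ROOT_FORMS = {
    "unix": [("root", r"/")],
    "win32": [
        ("device", r"//\./[A-Za-z]:/"),          # "//./D:/"
        ("unc", r"//[^/]+/[^/]+/"),              # "//server/share/"
        ("drive", r"[A-Za-z]:/"),                # "D:/"
    ],
    "netware": [
        ("server-volume", r"[^/:]+/[^/:]+:/"),   # "server/volume:/"
        ("volume", r"[^/:]+:/"),                 # "volume:/"
    ],
}
# "volume:path" without a slash after the colon.
AMBIGUOUS_FORMS = {"win32": r"[A-Za-z]:", "netware": r"[^/:]+:"}


def split_root(path, platform):
    """Return (kind, root, rest) for a slashed path string."""
    for kind, pattern in ROOT_FORMS[platform]:
        found = re.match(pattern, path)
        if found:
            return kind, found.group(0), path[found.end():]
    pattern = AMBIGUOUS_FORMS.get(platform)
    if pattern:
        found = re.match(pattern, path)
        if found:
            return "ambiguous", found.group(0), path[found.end():]
    if path.startswith("/"):
        return "unclassified", "", path   # form not covered by the text
    return "relative", "", path
```
</pre>

<p>With these tables, "D:tmp" is reported as ambiguous on the win32 key
but as an ordinary relative path on the unix key, where a colon has no
special meaning.</p>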

<p>The second discrepancy is in the meaning of the 'this' directory.
In general, 'this' must be eliminated from the path wherever it occurs.
The forms "path/./" and "path/" are both aliases of path.  However,
this isn't file system independent, since the double slash "//" has
a special meaning on OS2 and Win32 at the start of the path name,
and is invalid on those platforms before the "//server/share/" UNC
root path is completed.  Finally, as noted above, "//./volume/" is
legal root syntax on WinNT, and perhaps others.</p>

<p>The third discrepancy is in the context of the 'parent' directory.
When "parent/path/.." occurs, the path must be unwound to "parent".
It's also critical to simply truncate leading "/../" paths to "/",
since the parent of the root is root.  This gets tricky on the
Win32 and OS2 platforms, since the ".." element is invalid before
the "//server/share/" is complete, and the "//server/share/../"
sequence is the complete UNC root "//server/share/".  In relative
paths, leading ".." elements are significant until they are merged
with an absolute path.  The relative form must only retain the ".."
segments as leading segments, to be resolved once merged to another
relative or an absolute path.</p>

<p>The fourth discrepancy occurs with acceptance of alternate character
codes for the same element.  Path separators are not retained within
the APR canonical forms.  The OS filesystem and APR (slashed) forms
can both be returned as strings, to be used in the proper context.
Win32 and NetWare accept both slashes and backslashes as the
same path separator symbol, while Unix strictly accepts slashes.
While the APR form of the name strictly uses slashes, always consider
that there could be a platform that actually accepts slashes as a
character within a segment name.</p>
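
<p>A minimal sketch of moving between the OS form and the slashed APR
form might look like the following.  The helper names
<code>to_apr_form</code> and <code>to_os_form</code> are hypothetical, and
the caller is assumed to know whether the platform treats the backslash as
a separator.  On a platform where it does not, the backslash is an ordinary
name character and is left alone.</p>

<pre>
```python
def to_apr_form(os_path, backslash_is_separator):
    """Return the slashed (APR) form of an OS path string."""
    if backslash_is_separator:
        return os_path.replace("\\", "/")
    return os_path          # backslash is an ordinary name character


def to_os_form(apr_path, os_separator):
    """Return the OS form of a slashed path, using the given separator."""
    return apr_path.replace("/", os_separator)
```
</pre>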

<p>The fifth and worst discrepancy plagues Win32, OS2, NetWare, and some
filesystems mounted in Unix.  Case insensitivity can permit the same
file to slip through in both its proper case and alternate cases.
Simply changing the case is insufficient for any character set beyond
ASCII, since various dialectal forms of characters suffer from one-to-many
or many-to-one translations.  An example would be u-umlaut, which
might be accepted as a single character u-umlaut, a two character
sequence u and the zero-width umlaut, the upper case form of the same,
or perhaps even a capital U alone.  This can be handled in different
ways depending on the purposes of the APR based program, but the one
requirement is that the path must be absolute in order to resolve these
ambiguities.  Methods employed include comparison of device and inode
file identifiers, which is a fairly fast operation, or querying the OS
for the true form of the name, which can be much slower.  Only the
acknowledgement of the file names by the OS can validate the equality
of two different cases of the same filename.</p>
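
<p>The device and inode comparison can be sketched with Python's standard
<code>os.stat</code>, which reports both values as <code>st_dev</code> and
<code>st_ino</code>.  The function name <code>same_file</code> is made up
for this example.  The second half shows, for the u-umlaut case mentioned
above, why changing case alone cannot make two spellings compare equal as
strings.</p>

<pre>
```python
import os
import unicodedata


def same_file(path_a, path_b):
    """True when both paths resolve to the same device and inode."""
    a = os.stat(path_a)
    b = os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


# Case folding alone does not unify the two spellings of u-umlaut.
precomposed = "\u00fc"      # single character u-umlaut
decomposed = "u\u0308"      # u followed by the combining diaeresis
assert precomposed.lower() != decomposed.lower()

# Unicode normalization unifies these two strings, but only the file
# system can say whether it treats them as the same name.
assert unicodedata.normalize("NFC", decomposed) == precomposed
```
</pre>

<p>Both paths must exist for <code>same_file</code> to work, which matches
the requirement that only the OS can confirm that two names are equal.</p>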

<p>The sixth discrepancy, illegal or insignificant characters, is especially
significant in non-Unix file systems.  Trailing periods are accepted
but never stored, therefore trailing periods must be ignored for any
form of comparison.  And all OS's have certain expectations of what
characters are illegal (or undesirable because they cause confusion).</p>
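
<p>For a file system that discards trailing periods, a comparison helper
could be sketched as below.  The name <code>comparable_segment</code> is
hypothetical.  The '.' and '..' segments are left untouched because they
are navigation segments rather than file names, and the helper would not
apply on Unix, where trailing periods are stored.</p>

<pre>
```python
def comparable_segment(segment):
    """Drop trailing periods from a segment before comparing names."""
    if segment in (".", ".."):
        return segment          # navigation segments, not file names
    return segment.rstrip(".")
```
</pre>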

<p>A final warning: canonical functions don't transform or resolve case
or character ambiguity issues until the path is resolved into an absolute
path.  The relative canonical path, while useful for URL
or similar identifiers, cannot be used for testing or comparison of file
system objects.</p>

<hr>

<h2>Canonical API</h2>

<p>Functions to manipulate the apr_canon_file_t (an opaque type) include:</p>

<ul>
<li>Create apr_canon_file_t (from char* path and apr_canon_file_t parent path)
<li>Merge apr_canon_file_t (from path and parent, both apr_canon_file_t)
<li>Get char* path of all or some segments
<li>Get path flags of IsRelative, IsVirtualRoot, and IsAbsolute
<li>Compare two apr_canon_file_t structures for file equality
</ul>

<p>The path is corrected to the file system case only if it is in absolute
form.  The apr_canon_file_t should be preserved as long as possible and
used as the parent to create child entries to reduce the number of expensive
stat and case canonicalization calls to the OS.</p>

<p>The comparison operation allows APR to postpone correction
of case by simply relying upon the device and inode for equivalence.  The
stat implementation can establish that two files are the same even when their
strings are not equivalent, which eliminates the need for the operating
system to return the proper form of the name.</p>

<p>In any case, returning the char* path, with a flag to request the proper
case, forces the OS calls to resolve the true names of each segment.  Where
there is a penalty for this operation and the stat device and inode test
is faster, case correction is postponed until the char* result is requested.
On platforms that identify the inode, device, or proper name interchangeably
with no penalties, this may occur when the name is initially processed.</p>

<hr>

<h2>Unix Example</h2>

<p>First the simplest case:</p>

<pre>
Parse Canonical Name
accepts parent path as canonical_t
        this path as string

Split this path Segments on '/'

For each of this path Segments
  If first Segment
    If this Segment is Empty ([nothing]/)
      Append this Root Segment (don't merge)
      Continue to next Segment
    Else is relative
      Append parent Segments (to merge)
      Continue with this Segment
  If Segment is '.' or empty (2 slashes)
    Discard this Segment
    Continue with next Segment
  If Segment is '..'
    If no previous Segment or previous Segment is '..'
      Append this Segment
      Continue with next Segment
    If previous Segment and previous is not Root Segment
      Discard previous Segment
    Discard this Segment
    Continue with next Segment
  Append this Relative Segment
  Continue with next Segment
</pre>
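
<p>The pseudocode above can be turned into a small runnable sketch.  This
Python version is only an illustration: the names
<code>parse_canonical</code> and <code>to_string</code>, and the use of a
plain list of segments in place of the opaque canonical type, are
assumptions made for the example.  It covers the Unix rules only, with a
single root "/".</p>

<pre>
```python
ROOT = "/"   # marker for the root segment; never a valid name segment


def parse_canonical(parent, path):
    """Merge a path string onto a canonical parent (a list of segments)."""
    if path.startswith("/"):
        segments = [ROOT]            # absolute: do not merge the parent
    else:
        segments = list(parent)      # relative: start from the parent
    for part in path.split("/"):
        if part in ("", "."):
            continue                 # empty (double slash) or 'this'
        if part == "..":
            if not segments or segments[-1] == "..":
                segments.append(part)    # leading '..' stays significant
            elif segments[-1] != ROOT:
                segments.pop()           # unwind "parent/path/.."
            continue                     # at the root, '..' is dropped
        segments.append(part)
    return segments


def to_string(segments):
    """Return the slashed string form of a list of segments."""
    if segments[:1] == [ROOT]:
        return "/" + "/".join(segments[1:])
    return "/".join(segments)
```
</pre>

<p>For example, merging "www/../data" onto the parent segments for "/srv"
gives the segments for "/srv/data", while "../../a/b/.." with an empty
parent keeps its two leading ".." segments followed by "a".</p>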

</BODY>
</HTML>