Network Working Group                                        K. Zeilenga
Request for Comments: 4518                           OpenLDAP Foundation
Category: Standards Track                                      June 2006


             Lightweight Directory Access Protocol (LDAP):
                  Internationalized String Preparation

Status of This Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements.  Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol.  Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

   The previous Lightweight Directory Access Protocol (LDAP) technical
   specifications did not precisely define how character string matching
   is to be performed.  This led to a number of usability and
   interoperability problems.  This document defines string preparation
   algorithms for character-based matching rules defined for use in
   LDAP.

1.  Introduction

1.1.  Background

   A Lightweight Directory Access Protocol (LDAP) [RFC4510] matching
   rule [RFC4517] defines an algorithm for determining whether a
   presented value matches an attribute value in accordance with the
   criteria defined for the rule.  The proposition may be evaluated to
   True, False, or Undefined.

      True      - the attribute contains a matching value,

      False     - the attribute contains no matching value,

      Undefined - it cannot be determined whether the attribute contains
                  a matching value.

Zeilenga                    Standards Track                     [Page 1]

RFC 4518       LDAP: Internationalized String Preparation      June 2006

   For instance, the caseIgnoreMatch matching rule may be used to
   compare whether the commonName attribute contains a particular value
   without regard for case and insignificant spaces.

1.2.  X.500 String Matching Rules

   "X.520: Selected attribute types" [X.520] provides (among other
   things) value syntaxes and matching rules for comparing values
   commonly used in the directory [X.500].  These specifications are
   inadequate for strings composed of Unicode [Unicode] characters.

   The caseIgnoreMatch matching rule [X.520], for example, is simply
   defined as being a case-insensitive comparison where insignificant
   spaces are ignored.  For printableString, there is only one space
   character and case mapping is bijective, hence this definition is
   sufficient.  However, for Unicode string types such as
   universalString, this is not sufficient.  For example, a case-
   insensitive matching implementation that folded lowercase characters
   to uppercase would yield different results than an implementation
   that used uppercase to lowercase folding.  Or one implementation may
   view space as referring to only SPACE (U+0020), a second
   implementation may view any character with the space separator (Zs)
   property as a space, and another implementation may view any
   character with the whitespace (WS) category as a space.
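
   The divergence described above can be illustrated with a short,
   non-normative Python sketch.  All function names are illustrative
   assumptions and do not come from [X.520] or from this document.
   Python's built-in case conversion follows the interpreter's Unicode
   version rather than Unicode 3.2, and str.isspace() only approximates
   the Unicode whitespace property.

```python
import unicodedata


def equal_fold_upper(a: str, b: str) -> bool:
    """Compare after folding both strings to uppercase."""
    return a.upper() == b.upper()


def equal_fold_lower(a: str, b: str) -> bool:
    """Compare after folding both strings to lowercase."""
    return a.lower() == b.lower()


# Three candidate definitions of "space" that an implementation might pick.
def is_space_u0020(ch: str) -> bool:
    """Only SPACE (U+0020) counts as a space."""
    return ch == "\u0020"


def is_space_zs(ch: str) -> bool:
    """Any code point with general category Zs counts as a space."""
    return unicodedata.category(ch) == "Zs"


def is_space_ws(ch: str) -> bool:
    """Approximation of the whitespace property via str.isspace()."""
    return ch.isspace()


# LATIN SMALL LETTER SHARP S (U+00DF) uppercases to "SS", but "SS"
# lowercases to "ss", so the two folding directions disagree.
word_a = "stra\u00dfe"
word_b = "STRASSE"
upper_result = equal_fold_upper(word_a, word_b)
lower_result = equal_fold_lower(word_a, word_b)
```

   With these definitions, the two folding directions give different
   answers for the same pair of strings, and NO-BREAK SPACE (U+00A0) and
   CHARACTER TABULATION (U+0009) are classified differently by the three
   space predicates.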

   The lack of precise specification for character string matching has
   led to significant interoperability problems.  When used in
   certificate chain validation, security vulnerabilities can arise.  To
   address these problems, this document defines precise algorithms for
   preparing character strings for matching.

1.3.  Relationship to "stringprep"

   The character string preparation algorithms described in this
   document are based upon the "stringprep" approach [RFC3454].  In
   "stringprep", presented and stored values are first prepared for
   comparison so that a character-by-character comparison yields the
   "correct" result.

   The approach used here is a refinement of the "stringprep" [RFC3454]
   approach.  Each algorithm involves two additional preparation steps.

   a) Prior to applying the Unicode string preparation steps outlined in
      "stringprep", the string is transcoded to Unicode.

   b) After applying the Unicode string preparation steps outlined in
      "stringprep", the string is modified to appropriately handle
      characters insignificant to the matching rule.

   Hence, preparation of character strings for X.500 [X.500] matching
   [X.501] involves the following steps:

      1) Transcode
      2) Map
      3) Normalize
      4) Prohibit
      5) Check Bidi (Bidirectional)
      6) Insignificant Character Handling

   These steps are described in Section 2.

   It is noted that while various tables of Unicode characters included
   or referenced by this specification are derived from Unicode
   [Unicode] data, these tables are to be considered definitive for the
   purpose of implementing this specification.

1.4.  Relationship to the LDAP Technical Specification

   This document is an integral part of the LDAP technical specification
   [RFC4510], which obsoletes the previously defined LDAP technical
   specification [RFC3377] in its entirety.

   This document details new LDAP internationalized character string
   preparation algorithms used by [RFC4517] and possibly other technical
   specifications defining LDAP syntaxes and/or matching rules.

1.5.  Relationship to X.500

   LDAP is defined [RFC4510] in X.500 terms as an X.500 access
   mechanism.  As such, there is a strong desire for alignment between
   LDAP and X.500 syntax and semantics.  The character string
   preparation algorithms described in this document are based upon the
   "Internationalized String Matching Rules for X.500" [XMATCH] proposal
   to ITU/ISO Joint Study Group 2.

1.6.  Conventions and Terms

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in BCP 14 [RFC2119].

   Character names in this document use the notation for code points and
   names from the Unicode Standard [Unicode].  For example, the letter
   "a" may be represented as either <U+0061> or <LATIN SMALL LETTER A>.
   In the lists of mappings and the prohibited characters, the "U+" is
   left off to make the lists easier to read.  The comments for
   character ranges are shown in square brackets (such as "[CONTROL
   CHARACTERS]") and do not come from the standard.

   Note: a glossary of terms used in Unicode can be found in [Glossary].
   Information on the Unicode character encoding model can be found in
   [CharModel].

   The term "combining mark", as used in this specification, refers to
   any Unicode [Unicode] code point that has a mark property (Mn, Mc,
   Me).  Appendix A provides a definitive list of combining marks.

2.  String Preparation

   The following six-step process SHALL be applied to each presented and
   attribute value in preparation for character string matching rule
   evaluation.

      1) Transcode
      2) Map
      3) Normalize
      4) Prohibit
      5) Check bidi
      6) Insignificant Character Handling

   Failure in any step causes the assertion to evaluate to Undefined.

   The character repertoire of this process is Unicode 3.2 [Unicode].

   Note that this six-step process specification is intended to describe
   expected matching behavior.  Implementations are free to use
   alternative processes so long as the matching rule evaluation
   behavior provided is consistent with the behavior described by this
   specification.
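
   As a non-normative illustration, the process can be modeled as a
   chain of functions in which a failing step makes the whole
   comparison Undefined.  The sketch below is one possible arrangement
   in Python, not a required structure.  The names PreparationError,
   prepare, and match_equality are hypothetical, and None is used here
   to stand for Undefined.

```python
from typing import Callable, Optional, Sequence

# A step takes a string and returns the prepared string, or raises
# PreparationError to signal failure (hypothetical convention).
Step = Callable[[str], str]


class PreparationError(Exception):
    """Raised by a preparation step that fails."""


def prepare(value: str, steps: Sequence[Step]) -> Optional[str]:
    """Apply the steps in order; return None if any step fails."""
    try:
        for step in steps:
            value = step(value)
    except PreparationError:
        return None
    return value


def match_equality(presented: str, stored: str,
                   steps: Sequence[Step]) -> Optional[bool]:
    """Return True or False, or None to represent Undefined."""
    prepared_presented = prepare(presented, steps)
    prepared_stored = prepare(stored, steps)
    if prepared_presented is None or prepared_stored is None:
        return None  # failure in any step: the assertion is Undefined
    return prepared_presented == prepared_stored
```

   A caller would pass the six steps, in order, as the steps argument.
   The comparison itself is then a plain code point by code point
   equality test on the prepared strings.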

2.1.  Transcode

   Each non-Unicode string value is transcoded to Unicode.

   PrintableString [X.680] values are transcoded directly to Unicode.

   UniversalString, UTF8String, and bmpString [X.680] values need not be
   transcoded as they are Unicode-based strings (in the case of
   bmpString, a subset of Unicode).

   TeletexString [X.680] values are transcoded to Unicode.  As there is
   no standard for mapping TeletexString values to Unicode, the mapping
   is left as a local matter.

   For these and other reasons, use of TeletexString is NOT RECOMMENDED.

   The output is the transcoded string.

2.2.  Map

   SOFT HYPHEN (U+00AD) and MONGOLIAN TODO SOFT HYPHEN (U+1806) code
   points are mapped to nothing.  COMBINING GRAPHEME JOINER (U+034F) and
   VARIATION SELECTORs (U+180B-180D, FE00-FE0F) code points are also
   mapped to nothing.  The OBJECT REPLACEMENT CHARACTER (U+FFFC) is
   mapped to nothing.

   CHARACTER TABULATION (U+0009), LINE FEED (LF) (U+000A), LINE
   TABULATION (U+000B), FORM FEED (FF) (U+000C), CARRIAGE RETURN (CR)
   (U+000D), and NEXT LINE (NEL) (U+0085) are mapped to SPACE (U+0020).

   All other control code points (e.g., Cc) or code points with a
   control function (e.g., Cf) are mapped to nothing.  The following is
   a complete list of these code points: U+0000-0008, 000E-001F, 007F-
   0084, 0086-009F, 06DD, 070F, 180E, 200C-200F, 202A-202E, 2060-2063,
   206A-206F, FEFF, FFF9-FFFB, 1D173-1D17A, E0001, E0020-E007F.

   ZERO WIDTH SPACE (U+200B) is mapped to nothing.  All other code
   points with Separator (space, line, or paragraph) property (e.g., Zs,
   Zl, or Zp) are mapped to SPACE (U+0020).  The following is a complete
   list of these code points: U+0020, 00A0, 1680, 2000-200A, 2028-2029,
   202F, 205F, 3000.

   For case ignore, numeric, and stored prefix string matching rules,
   characters are case folded per B.2 of [RFC3454].

   The output is the mapped string.
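
   The Map step can be sketched in Python as follows.  This is a
   non-normative illustration.  The names MAP_TO_NOTHING, MAP_TO_SPACE,
   and map_step are assumptions made for this example.  The variation
   selector range is taken to be U+FE00-FE0F, the code points assigned
   to VARIATION SELECTOR-1 through VARIATION SELECTOR-16.  Case folding
   is delegated to map_table_b2() from the Python standard library
   "stringprep" module, which implements Table B.2 of [RFC3454].

```python
import stringprep


def _expand(ranges):
    """Expand inclusive (low, high) pairs into a set of code points."""
    points = set()
    for low, high in ranges:
        points.update(range(low, high + 1))
    return points


# Code points mapped to nothing, transcribed from the lists in Section 2.2.
MAP_TO_NOTHING = _expand([
    (0x00AD, 0x00AD), (0x1806, 0x1806), (0x034F, 0x034F),
    (0x180B, 0x180D), (0xFE00, 0xFE0F), (0xFFFC, 0xFFFC),
    (0x0000, 0x0008), (0x000E, 0x001F), (0x007F, 0x0084),
    (0x0086, 0x009F), (0x06DD, 0x06DD), (0x070F, 0x070F),
    (0x180E, 0x180E), (0x200C, 0x200F), (0x202A, 0x202E),
    (0x2060, 0x2063), (0x206A, 0x206F), (0xFEFF, 0xFEFF),
    (0xFFF9, 0xFFFB), (0x1D173, 0x1D17A), (0xE0001, 0xE0001),
    (0xE0020, 0xE007F), (0x200B, 0x200B),
])

# Code points mapped to SPACE (U+0020); U+0020 itself is left unchanged.
MAP_TO_SPACE = _expand([
    (0x0009, 0x000D), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029),
    (0x202F, 0x202F), (0x205F, 0x205F), (0x3000, 0x3000),
])


def map_step(value: str, case_fold: bool = False) -> str:
    """Apply the Section 2.2 mappings; optionally case fold per B.2."""
    out = []
    for ch in value:
        code_point = ord(ch)
        if code_point in MAP_TO_NOTHING:
            continue
        if code_point in MAP_TO_SPACE:
            out.append("\u0020")
        elif case_fold:
            out.append(stringprep.map_table_b2(ch))
        else:
            out.append(ch)
    return "".join(out)
```

   The case_fold flag would be set only for the matching rules named in
   the last paragraph of this section.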

2.3.  Normalize

   The input string is to be normalized to Unicode Form KC
   (compatibility composed) as described in [UAX15].  The output is the
   normalized string.
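
   A minimal, non-normative Python sketch of this step is shown below.
   It assumes that the "ucd_3_2_0" object in the standard library
   "unicodedata" module, which exposes the Unicode 3.2.0 database, is an
   acceptable source of normalization data.  The name normalize_step is
   hypothetical.

```python
import unicodedata


def normalize_step(value: str) -> str:
    """Normalize to Form KC using the Unicode 3.2.0 database."""
    return unicodedata.ucd_3_2_0.normalize("NFKC", value)
```

   For example, LATIN SMALL LIGATURE FI (U+FB01) is replaced by the two
   letters "fi", and "e" followed by COMBINING ACUTE ACCENT (U+0301) is
   composed into LATIN SMALL LETTER E WITH ACUTE (U+00E9).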

2.4.  Prohibit

   All Unassigned code points are prohibited.  Unassigned code points
   are listed in Table A.1 of [RFC3454].

   Characters that, per Section 5.8 of [RFC3454], change display
   properties or are deprecated are prohibited.  These characters are
   listed in Table C.8 of [RFC3454].

   Private Use code points are prohibited.  These characters are listed
   in Table C.3 of [RFC3454].

   All non-character code points are prohibited.  These code points are
   listed in Table C.4 of [RFC3454].

   Surrogate codes are prohibited.  These characters are listed in Table
   C.5 of [RFC3454].

   The REPLACEMENT CHARACTER (U+FFFD) code point is prohibited.

   The step fails if the input string contains any prohibited code
   point.  Otherwise, the output is the input string.
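
   The checks above can be sketched with the table lookups provided by
   the Python standard library "stringprep" module, which implements
   the [RFC3454] tables.  This is a non-normative illustration.  The
   names ProhibitedCodePoint and prohibit_step are hypothetical.

```python
import stringprep


class ProhibitedCodePoint(ValueError):
    """Raised when the input contains a prohibited code point."""


def is_prohibited(ch: str) -> bool:
    """Return True if ch is prohibited by Section 2.4."""
    return (
        stringprep.in_table_a1(ch)      # unassigned in Unicode 3.2
        or stringprep.in_table_c8(ch)   # change display / deprecated
        or stringprep.in_table_c3(ch)   # private use
        or stringprep.in_table_c4(ch)   # non-character code points
        or stringprep.in_table_c5(ch)   # surrogate codes
        or ch == "\ufffd"               # REPLACEMENT CHARACTER
    )


def prohibit_step(value: str) -> str:
    """Fail on any prohibited code point; otherwise return the input."""
    for ch in value:
        if is_prohibited(ch):
            raise ProhibitedCodePoint("U+%04X" % ord(ch))
    return value
```

   Raising an exception is only one way to signal failure.  What
   matters is that the caller treats the failure as causing the
   assertion to evaluate to Undefined.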

2.5.  Check bidi

   Bidirectional characters are ignored.

2.6.  Insignificant Character Handling

   In this step, the string is modified to ensure proper handling of
   characters insignificant to the matching rule.  This modification
   differs from matching rule to matching rule.

   Section 2.6.1 applies to case ignore and exact string matching.
   Section 2.6.2 applies to numericString matching.
   Section 2.6.3 applies to telephoneNumber matching.

2.6.1.  Insignificant Space Handling

   For the purposes of this section, a space is defined to be the SPACE
   (U+0020) code point followed by no combining marks.

       NOTE - The previous steps ensure that the string cannot contain
              any code points in the separator class, other than SPACE
              (U+0020).

   For input strings that are attribute values or non-substring
   assertion values:  If the input string contains no non-space
   character, then the output is exactly two SPACEs.  Otherwise (the
   input string contains at least one non-space character), the string
   is modified such that the string starts with exactly one SPACE
   character, ends with exactly one SPACE character, and any inner
   (non-empty) sequence of space characters is replaced with exactly two
   SPACE characters.  For instance, the input string
   "foo<SPACE>bar<SPACE><SPACE>" results in the output
   "<SPACE>foo<SPACE><SPACE>bar<SPACE>".
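
   The rule for attribute values and non-substring assertion values can
   be sketched in Python as follows.  This is a non-normative
   illustration.  The function names are hypothetical, and combining
   marks are detected through the Unicode 3.2.0 general categories Mn,
   Mc, and Me in the standard library "unicodedata" module, as a
   stand-in for the definitive table in Appendix A.

```python
import unicodedata


def is_combining_mark(ch: str) -> bool:
    """Approximate Appendix A using Unicode 3.2.0 categories."""
    return unicodedata.ucd_3_2_0.category(ch) in ("Mn", "Mc", "Me")


def handle_insignificant_spaces(value: str) -> str:
    """Prepare an attribute value or non-substring assertion value."""
    # Classify each code point: a "space" is U+0020 that is not
    # followed by a combining mark.
    units = []
    for i, ch in enumerate(value):
        followed_by_mark = (i + 1 < len(value)
                            and is_combining_mark(value[i + 1]))
        units.append((ch, ch == "\u0020" and not followed_by_mark))

    # No non-space character: the output is exactly two SPACEs.
    if all(is_space for _, is_space in units):
        return "\u0020\u0020"

    # Drop leading and trailing spaces.
    start, end = 0, len(units)
    while units[start][1]:
        start += 1
    while units[end - 1][1]:
        end -= 1

    # One leading SPACE, two SPACEs per inner run, one trailing SPACE.
    out = ["\u0020"]
    in_run = False
    for ch, is_space in units[start:end]:
        if is_space:
            if not in_run:
                out.append("\u0020\u0020")
            in_run = True
        else:
            out.append(ch)
            in_run = False
    out.append("\u0020")
    return "".join(out)
```

   Applied to "foo<SPACE>bar<SPACE><SPACE>", this sketch produces
   "<SPACE>foo<SPACE><SPACE>bar<SPACE>", in line with the example
   above.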

   For input strings that are substring assertion values: If the string
   being prepared contains no non-space characters, then the output
   string is exactly one SPACE.  Otherwise, the following steps are
   taken:

   -  If the input string is an initial substring, it is modified to
      start with exactly one SPACE character;

   -  If the input string is an initial or an any substring that ends in
      one or more space characters, it is modified to end with exactly
      one SPACE character;

   -  If the input string is an any or a final substring that starts
      with one or more space characters, it is modified to start with
      exactly one SPACE character; and

   -  If the input string is a final substring, it is modified to end
      with exactly one SPACE character.

   For instance, for the input string "foo<SPACE>bar<SPACE><SPACE>" as
   an initial substring, the output would be
   "<SPACE>foo<SPACE><SPACE>bar<SPACE>".  As an any or final substring,
   the same input would result in "foo<SPACE>bar<SPACE>".

   Appendix B discusses the rationale for the behavior.

2.6.2.  numericString Insignificant Character Handling

   For the purposes of this section, a space is defined to be the SPACE
   (U+0020) code point followed by no combining marks.

   All spaces are regarded as insignificant and are to be removed.

   For example, removal of spaces from the Form KC string:
       "<SPACE><SPACE>123<SPACE><SPACE>456<SPACE><SPACE>"
   would result in the output string:
       "123456"
   and the Form KC string:
       "<SPACE><SPACE><SPACE>"
   would result in the output string:
       "" (an empty string).
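
   A non-normative Python sketch of this removal is shown below.  The
   function names are hypothetical, and combining marks are detected
   through the Unicode 3.2.0 general categories in the standard library
   "unicodedata" module rather than through the table in Appendix A.

```python
import unicodedata


def is_combining_mark(ch: str) -> bool:
    """Approximate Appendix A using Unicode 3.2.0 categories."""
    return unicodedata.ucd_3_2_0.category(ch) in ("Mn", "Mc", "Me")


def remove_numeric_string_spaces(value: str) -> str:
    """Remove every SPACE that is not followed by a combining mark."""
    out = []
    for i, ch in enumerate(value):
        followed_by_mark = (i + 1 < len(value)
                            and is_combining_mark(value[i + 1]))
        if ch == "\u0020" and not followed_by_mark:
            continue  # insignificant space
        out.append(ch)
    return "".join(out)
```

   The two examples above correspond to
   remove_numeric_string_spaces("  123  456  ") and
   remove_numeric_string_spaces("   ").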

2.6.3.  telephoneNumber Insignificant Character Handling

   For the purposes of this section, a hyphen is defined to be a
   HYPHEN-MINUS (U+002D), ARMENIAN HYPHEN (U+058A), HYPHEN (U+2010),
   NON-BREAKING HYPHEN (U+2011), MINUS SIGN (U+2212), SMALL HYPHEN-MINUS
   (U+FE63), or FULLWIDTH HYPHEN-MINUS (U+FF0D) code point followed by
   no combining marks and a space is defined to be the SPACE (U+0020)
   code point followed by no combining marks.

   All hyphens and spaces are considered insignificant and are to be
   removed.

   For example, removal of hyphens and spaces from the Form KC string:
       "<SPACE><HYPHEN>123<SPACE><SPACE>456<SPACE><HYPHEN>"
   would result in the output string:
       "123456"
   and the Form KC string:
       "<HYPHEN><HYPHEN><HYPHEN>"
   would result in the (empty) output string:
       "".
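
   The same approach extends to hyphens.  The following non-normative
   Python sketch uses the hyphen code points listed in this section.
   The names INSIGNIFICANT and remove_telephone_number_fill are
   hypothetical, and combining marks are again detected through the
   Unicode 3.2.0 general categories rather than through Appendix A.

```python
import unicodedata

# SPACE plus the hyphen code points named in Section 2.6.3.
INSIGNIFICANT = frozenset(
    "\u0020\u002d\u058a\u2010\u2011\u2212\ufe63\uff0d"
)


def is_combining_mark(ch: str) -> bool:
    """Approximate Appendix A using Unicode 3.2.0 categories."""
    return unicodedata.ucd_3_2_0.category(ch) in ("Mn", "Mc", "Me")


def remove_telephone_number_fill(value: str) -> str:
    """Remove hyphens and spaces not followed by a combining mark."""
    out = []
    for i, ch in enumerate(value):
        followed_by_mark = (i + 1 < len(value)
                            and is_combining_mark(value[i + 1]))
        if ch in INSIGNIFICANT and not followed_by_mark:
            continue  # insignificant hyphen or space
        out.append(ch)
    return "".join(out)
```

   Some of the listed hyphens have compatibility mappings and may
   already have been replaced by the Normalize step.  The sketch keeps
   the full list so that it follows the definition as written.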

3.  Security Considerations

   "Preparation of Internationalized Strings ("stringprep")" [RFC3454]
   security considerations generally apply to the algorithms described
   here.

4.  Acknowledgements

   The approach used in this document is based upon design principles
   and algorithms described in "Preparation of Internationalized Strings
   ('stringprep')" [RFC3454] by Paul Hoffman and Marc Blanchet.  Some
   additional guidance was drawn from Unicode Technical Standards,
   Technical Reports, and Notes.

   This document is a product of the IETF LDAP Revision (LDAPBIS)
   Working Group.

5.  References

5.1.  Normative References

   [RFC2119]     Bradner, S., "Key words for use in RFCs to Indicate
                 Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3454]     Hoffman, P. and M. Blanchet, "Preparation of
                 Internationalized Strings ("stringprep")", RFC 3454,
                 December 2002.

   [RFC4510]     Zeilenga, K., "Lightweight Directory Access Protocol
                 (LDAP): Technical Specification Road Map", RFC 4510,
                 June 2006.

   [RFC4517]     Legg, S., Ed., "Lightweight Directory Access Protocol
                 (LDAP): Syntaxes and Matching Rules", RFC 4517, June
                 2006.

   [Unicode]     The Unicode Consortium, "The Unicode Standard, Version
                 3.2.0" is defined by "The Unicode Standard, Version
                 3.0" (Reading, MA, Addison-Wesley, 2000.  ISBN 0-201-
                 61633-5), as amended by the "Unicode Standard Annex
                 #27: Unicode 3.1"
                 (http://www.unicode.org/reports/tr27/) and by the
                 "Unicode Standard Annex #28: Unicode 3.2"
                 (http://www.unicode.org/reports/tr28/).

   [UAX15]       Davis, M. and M. Duerst, "Unicode Standard Annex #15:
                 Unicode Normalization Forms, Version 3.2.0".
                 <http://www.unicode.org/unicode/reports/tr15/tr15-
                 22.html>, March 2002.

   [X.680]       International Telecommunication Union -
                 Telecommunication Standardization Sector, "Abstract
                 Syntax Notation One (ASN.1) - Specification of Basic
                 Notation", X.680(2002) (also ISO/IEC 8824-1:2002).

5.2.  Informative References

   [X.500]       International Telecommunication Union -
                 Telecommunication Standardization Sector, "The
                 Directory -- Overview of concepts, models and
                 services", X.500(1993) (also ISO/IEC 9594-1:1994).

   [X.501]       International Telecommunication Union -
                 Telecommunication Standardization Sector, "The
                 Directory -- Models", X.501(1993) (also ISO/IEC 9594-
                 2:1994).

   [X.520]       International Telecommunication Union -
                 Telecommunication Standardization Sector, "The
                 Directory: Selected Attribute Types", X.520(1993) (also
                 ISO/IEC 9594-6:1994).

   [Glossary]    The Unicode Consortium, "Unicode Glossary",
                 <http://www.unicode.org/glossary/>.

   [CharModel]   Whistler, K. and M. Davis, "Unicode Technical Report
                 #17, Character Encoding Model", UTR17,
                 <http://www.unicode.org/unicode/reports/tr17/>, August
                 2000.

   [RFC3377]     Hodges, J. and R. Morgan, "Lightweight Directory Access
                 Protocol (v3): Technical Specification", RFC 3377,
                 September 2002.

   [RFC4515]     Smith, M., Ed. and T. Howes, "Lightweight Directory
                 Access Protocol (LDAP): String Representation of Search
                 Filters", RFC 4515, June 2006.

   [XMATCH]      Zeilenga, K., "Internationalized String Matching Rules
                 for X.500", Work in Progress.

Appendix A.  Combining Marks

   This appendix is normative.

   This table was derived from Unicode [Unicode] data files; it lists
   all code points with the Mn, Mc, or Me properties.  This table is to
   be considered definitive for the purposes of implementation of this
   specification.

         0300-034F 0360-036F 0483-0486 0488-0489 0591-05A1
         05A3-05B9 05BB-05BC 05BF 05C1-05C2 05C4 064B-0655 0670
         06D6-06DC 06DE-06E4 06E7-06E8 06EA-06ED 0711 0730-074A
         07A6-07B0 0901-0903 093C 093E-094F 0951-0954 0962-0963
         0981-0983 09BC 09BE-09C4 09C7-09C8 09CB-09CD 09D7
         09E2-09E3 0A02 0A3C 0A3E-0A42 0A47-0A48 0A4B-0A4D
         0A70-0A71 0A81-0A83 0ABC 0ABE-0AC5 0AC7-0AC9 0ACB-0ACD
         0B01-0B03 0B3C 0B3E-0B43 0B47-0B48 0B4B-0B4D 0B56-0B57
         0B82 0BBE-0BC2 0BC6-0BC8 0BCA-0BCD 0BD7 0C01-0C03
         0C3E-0C44 0C46-0C48 0C4A-0C4D 0C55-0C56 0C82-0C83
         0CBE-0CC4 0CC6-0CC8 0CCA-0CCD 0CD5-0CD6 0D02-0D03
         0D3E-0D43 0D46-0D48 0D4A-0D4D 0D57 0D82-0D83 0DCA
         0DCF-0DD4 0DD6 0DD8-0DDF 0DF2-0DF3 0E31 0E34-0E3A
         0E47-0E4E 0EB1 0EB4-0EB9 0EBB-0EBC 0EC8-0ECD 0F18-0F19
         0F35 0F37 0F39 0F3E-0F3F 0F71-0F84 0F86-0F87 0F90-0F97
         0F99-0FBC 0FC6 102C-1032 1036-1039 1056-1059 1712-1714
         1732-1734 1752-1753 1772-1773 17B4-17D3 180B-180D 18A9
         20D0-20EA 302A-302F 3099-309A FB1E FE00-FE0F FE20-FE23
         1D165-1D169 1D16D-1D172 1D17B-1D182 1D185-1D18B
         1D1AA-1D1AD

Appendix B.  Substrings Matching

   This appendix is non-normative.

   In the absence of substrings matching, the insignificant space
   handling for case ignore/exact matching could be simplified.
   Specifically, the handling could be to require that all sequences of
   one or more spaces be replaced with one space and, if the string
   contains non-space characters, removal of all leading spaces and
   trailing spaces.

   In the presence of substrings matching, this simplified space
   handling would lead to unexpected and undesirable matching behavior.
   For instance:

   1) (CN=foo\20*\20bar) would match the CN value "foobar";

   2) (CN=*\20foobar\20*) would match "foobar", but
      (CN=*\20*foobar*\20*) would not.

   Note to readers not familiar with LDAP substrings matching: the LDAP
   filter [RFC4515] assertion (CN=A*B*C) says to "match any value (of
   the attribute CN) that begins with A, contains B after A, ends with C
   where C is also after B."

   The first case illustrates that this simplified space handling would
   cause leading and trailing spaces in substrings of the string to be
   regarded as insignificant.  However, only leading and trailing spaces
   (as well as multiple consecutive spaces) of the string (as a whole)
   are insignificant.

   The second case illustrates that this simplified space handling would
   cause sub-partitioning failures.  That is, if a prepared any
   substring matches a partition of the attribute value, then an
   assertion constructed by subdividing that substring into multiple
   substrings should also match.
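
   The first case can be reproduced with a small, non-normative Python
   sketch.  The function simplified_prepare models the simplified
   handling described at the start of this appendix, not the handling
   specified in Section 2.6.1.  Both function names are hypothetical,
   combining marks are ignored for brevity, and the matcher handles
   only an initial and a final substring.

```python
import re


def simplified_prepare(value: str) -> str:
    """Collapse runs of spaces, then trim if any non-space remains."""
    collapsed = re.sub(" +", " ", value)
    trimmed = collapsed.strip(" ")
    return trimmed if trimmed else collapsed


def initial_final_match(value: str, initial: str, final: str,
                        prepare) -> bool:
    """Match (attr=initial*final) after preparing each part."""
    v, i, f = prepare(value), prepare(initial), prepare(final)
    # The initial and final portions must not overlap.
    return len(v) >= len(i) + len(f) and v.startswith(i) and v.endswith(f)


def identity(value: str) -> str:
    """Leave the string untouched, for comparison."""
    return value


# (CN=foo\20*\20bar) against the value "foobar"
with_simplified = initial_final_match("foobar", "foo ", " bar",
                                      simplified_prepare)
with_identity = initial_final_match("foobar", "foo ", " bar", identity)
```

   Under the simplified handling, the spaces in the substrings are
   trimmed away and the assertion matches "foobar".  When the
   substrings are left untouched, it does not.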
642226031Sstas
643226031Sstas   In designing an appropriate approach for space handling for
644226031Sstas   substrings matching, one must study key aspects of X.500 case
645226031Sstas   exact/ignore matching.  X.520 [X.520] says:
646226031Sstas
647226031Sstas      The [substrings] rule returns TRUE if there is a partitioning of
648226031Sstas      the attribute value (into portions) such that:
649226031Sstas
650226031Sstas         -  the specified substrings (initial, any, final) match
651226031Sstas            different portions of the value in the order of the strings
652226031Sstas            sequence;
653226031Sstas
654226031Sstas         -  initial, if present, matches the first portion of the value;
655226031Sstas
656226031Sstas         -  final, if present, matches the last portion of the value;
657226031Sstas
658226031Sstas         -  any, if present, matches some arbitrary portion of the
659226031Sstas            value.
660226031Sstas
661226031Sstas   That is, the substrings assertion (CN=foo\20*\20bar) matches the
662226031Sstas   attribute value "foo<SPACE><SPACE>bar" as the value can be
663226031Sstas   partitioned into the portions "foo<SPACE>" and "<SPACE>bar" meeting
664226031Sstas   the above requirements.
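
   The partitioning requirement quoted above can be sketched in Python
   as a literal matcher with no space handling at all.  The function
   name and parameters are hypothetical.  The sketch anchors the initial
   and final substrings at the ends of the value and then searches for
   each any substring, left to right, in the portion that remains.

```python
def substrings_match(value, initial=None, any_parts=(), final=None):
    # start/end delimit the part of the value not yet claimed
    # by a substring, so every substring gets a different portion.
    start, end = 0, len(value)
    if initial is not None:
        # initial must match the first portion of the value
        if not value.startswith(initial):
            return False
        start = len(initial)
    if final is not None:
        # final must match the last portion without overlapping initial
        if end - start < len(final) or not value.endswith(final):
            return False
        end -= len(final)
    for part in any_parts:
        # each any substring must appear, in order, in what remains
        found = value.find(part, start, end)
        if found < 0:
            return False
        start = found + len(part)
    return True


# (CN=foo\20*\20bar) against "foo<SPACE><SPACE>bar"
print(substrings_match("foo  bar", initial="foo ", final=" bar"))  # True
# With only one space the two portions would have to overlap.
print(substrings_match("foo bar", initial="foo ", final=" bar"))   # False
```

   Taking the leftmost occurrence of each any substring is sufficient
   here, since an earlier match never leaves less room for the
   substrings that follow.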
665226031Sstas
679226031Sstas   X.520 also says:
680226031Sstas
681226031Sstas      [T]he following spaces are regarded as not significant:
682226031Sstas
683226031Sstas         -  leading spaces (i.e., those preceding the first character
684226031Sstas            that is not a space);
685226031Sstas
686226031Sstas         -  trailing spaces (i.e., those following the last character
687226031Sstas            that is not a space);
688226031Sstas
689226031Sstas         -  multiple consecutive spaces (these are taken as equivalent
690226031Sstas            to a single space character).
691226031Sstas
   This statement applies to the assertion values and attribute values
   as whole strings, and not individually to substrings of an assertion
   value.  In particular, the statement should be taken to mean that if
   an assertion value and attribute value match without any
   consideration of insignificant characters, then that assertion value
   should also match any attribute value that differs only by inclusion
   or removal of insignificant characters.
699226031Sstas
700226031Sstas   Hence the assertion (CN=foo\20*\20bar) matches
701226031Sstas   "foo<SPACE><SPACE><SPACE>bar" and "foo<SPACE>bar" as these values
702226031Sstas   only differ from "foo<SPACE><SPACE>bar" by the inclusion or removal
703226031Sstas   of insignificant spaces.
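
   This interpretation can be checked by brute force.  The Python sketch
   below is an illustration only, and it is not the preparation
   algorithm specified in this document.  It enumerates values that
   differ from a given value only by insignificant spaces and reports a
   match if any of them matches literally.  The names are hypothetical,
   runs of spaces are bounded by an assumed max_run of 3, and only
   initial and final substrings are handled.

```python
from itertools import product


def literal_match(value, initial, final):
    # initial and final must cover different portions of the value
    return (
        len(value) >= len(initial) + len(final)
        and value.startswith(initial)
        and value.endswith(final)
    )


def space_variants(value, max_run=3):
    # Yield strings that differ from value only by insignificant
    # spaces: leading, trailing, and the length of inner runs.
    words = [w for w in value.split(" ") if w]
    gaps = max(len(words) - 1, 0)
    for lead, trail in product(range(max_run + 1), repeat=2):
        for runs in product(range(1, max_run + 1), repeat=gaps):
            text = words[0] if words else ""
            for run, word in zip(runs, words[1:]):
                text += " " * run + word
            yield " " * lead + text + " " * trail


def matches_ignoring_insignificant_spaces(value, initial, final):
    return any(
        literal_match(variant, initial, final)
        for variant in space_variants(value)
    )


# (CN=foo\20*\20bar) against three attribute values
for value in ("foo   bar", "foo bar", "foobar"):
    print(repr(value),
          matches_ignoring_insignificant_spaces(value, "foo ", " bar"))
```

   The first two values match because each has a variant,
   "foo<SPACE><SPACE>bar", that matches literally.  No variant of
   "foobar" contains a space between "foo" and "bar", so it does not
   match.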
704226031Sstas
705226031Sstas   Astute readers of this text will also note that there are special
706226031Sstas   cases where the specified space handling does not ignore spaces that
707226031Sstas   could be considered insignificant.  For instance, the assertion
708226031Sstas   (CN=\20*\20*\20) does not match "<SPACE><SPACE><SPACE>"
709226031Sstas   (insignificant spaces present in value) or " " (insignificant spaces
710226031Sstas   not present in value).  However, as these cases have no practical
711226031Sstas   application that cannot be met by simple assertions, e.g., (cn=\20),
712226031Sstas   and this minor anomaly can only be fully addressed by a preparation
713226031Sstas   algorithm to be used in conjunction with character-by-character
714226031Sstas   partitioning and matching, the anomaly is considered acceptable.
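
   The anomaly can be reproduced with a short Python sketch.  It rests
   on an assumption about the insignificant space handling specified
   earlier in this document (Section 2.6.1), namely that an attribute
   value or non-substring assertion value containing no non-space
   character is prepared as exactly two SPACEs, while a substring
   containing no non-space character is prepared as exactly one SPACE.
   The sketch covers only such all-space strings, and the function names
   are hypothetical.

```python
def prepare_all_space(text, is_substring):
    # Assumed rule (see lead-in): one SPACE for a substring,
    # two SPACEs for a value or non-substring assertion value.
    if not text or text.strip(" "):
        raise ValueError("sketch covers only strings made up of spaces")
    return " " if is_substring else "  "


def substrings_match(value, initial, any_parts, final):
    # Literal partition matching on already prepared strings.
    start, end = len(initial), len(value) - len(final)
    if end < start:
        return False
    if not (value.startswith(initial) and value.endswith(final)):
        return False
    for part in any_parts:
        found = value.find(part, start, end)
        if found < 0:
            return False
        start = found + len(part)
    return True


def anomaly_assertion_matches(value):
    # (CN=\20*\20*\20): initial, any, and final are each one space.
    one = prepare_all_space(" ", is_substring=True)
    prepared = prepare_all_space(value, is_substring=False)
    return substrings_match(prepared, one, [one], one)


def simple_assertion_matches(value):
    # (cn=\20): equality of the prepared assertion and attribute values.
    return (prepare_all_space(" ", is_substring=False)
            == prepare_all_space(value, is_substring=False))


for value in ("   ", " "):
    print(repr(value),
          anomaly_assertion_matches(value),
          simple_assertion_matches(value))
```

   Under that assumption, the three prepared substrings need three
   SPACEs between them, but every all-space value is prepared as two, so
   the substrings assertion cannot match, whereas the simple assertion
   matches both values.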
715226031Sstas
716226031SstasAuthor's Address
717226031Sstas
718226031Sstas   Kurt D. Zeilenga
719226031Sstas   OpenLDAP Foundation
720226031Sstas
721226031Sstas   EMail: Kurt@OpenLDAP.org
722226031Sstas
723226031Sstas
735226031SstasFull Copyright Statement
736226031Sstas
737226031Sstas   Copyright (C) The Internet Society (2006).
738226031Sstas
739226031Sstas   This document is subject to the rights, licenses and restrictions
740226031Sstas   contained in BCP 78, and except as set forth therein, the authors
741226031Sstas   retain all their rights.
742226031Sstas
743226031Sstas   This document and the information contained herein are provided on an
744226031Sstas   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
745226031Sstas   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
746226031Sstas   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
747226031Sstas   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
748226031Sstas   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
749226031Sstas   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
750226031Sstas
751226031SstasIntellectual Property
752226031Sstas
753226031Sstas   The IETF takes no position regarding the validity or scope of any
754226031Sstas   Intellectual Property Rights or other rights that might be claimed to
755226031Sstas   pertain to the implementation or use of the technology described in
756226031Sstas   this document or the extent to which any license under such rights
757226031Sstas   might or might not be available; nor does it represent that it has
758226031Sstas   made any independent effort to identify any such rights.  Information
759226031Sstas   on the procedures with respect to rights in RFC documents can be
760226031Sstas   found in BCP 78 and BCP 79.
761226031Sstas
762226031Sstas   Copies of IPR disclosures made to the IETF Secretariat and any
763226031Sstas   assurances of licenses to be made available, or the result of an
764226031Sstas   attempt made to obtain a general license or permission for the use of
765226031Sstas   such proprietary rights by implementers or users of this
766226031Sstas   specification can be obtained from the IETF on-line IPR repository at
767226031Sstas   http://www.ietf.org/ipr.
768226031Sstas
769226031Sstas   The IETF invites any interested party to bring to its attention any
770226031Sstas   copyrights, patents or patent applications, or other proprietary
771226031Sstas   rights that may cover technology that may be required to implement
772226031Sstas   this standard.  Please address the information to the IETF at
773226031Sstas   ietf-ipr@ietf.org.
774226031Sstas
775226031SstasAcknowledgement
776226031Sstas
777226031Sstas   Funding for the RFC Editor function is provided by the IETF
778226031Sstas   Administrative Support Activity (IASA).
779226031Sstas