1226031Sstas
2226031Sstas
3226031Sstas
4226031Sstas
5226031Sstas
6226031Sstas
7226031SstasNetwork Working Group                                        A. Costello
8226031SstasRequest for Comments: 3492                 Univ. of California, Berkeley
9226031SstasCategory: Standards Track                                     March 2003
10226031Sstas
11226031Sstas
12226031Sstas              Punycode: A Bootstring encoding of Unicode
13226031Sstas       for Internationalized Domain Names in Applications (IDNA)
14226031Sstas
15226031SstasStatus of this Memo
16226031Sstas
17226031Sstas   This document specifies an Internet standards track protocol for the
18226031Sstas   Internet community, and requests discussion and suggestions for
19226031Sstas   improvements.  Please refer to the current edition of the "Internet
20226031Sstas   Official Protocol Standards" (STD 1) for the standardization state
21226031Sstas   and status of this protocol.  Distribution of this memo is unlimited.
22226031Sstas
23226031SstasCopyright Notice
24226031Sstas
25226031Sstas   Copyright (C) The Internet Society (2003).  All Rights Reserved.
26226031Sstas
27226031SstasAbstract
28226031Sstas
29226031Sstas   Punycode is a simple and efficient transfer encoding syntax designed
30226031Sstas   for use with Internationalized Domain Names in Applications (IDNA).
31226031Sstas   It uniquely and reversibly transforms a Unicode string into an ASCII
32226031Sstas   string.  ASCII characters in the Unicode string are represented
33226031Sstas   literally, and non-ASCII characters are represented by ASCII
34226031Sstas   characters that are allowed in host name labels (letters, digits, and
35226031Sstas   hyphens).  This document defines a general algorithm called
36226031Sstas   Bootstring that allows a string of basic code points to uniquely
37226031Sstas   represent any string of code points drawn from a larger set.
38226031Sstas   Punycode is an instance of Bootstring that uses particular parameter
39226031Sstas   values specified by this document, appropriate for IDNA.
40226031Sstas
41226031SstasTable of Contents
42226031Sstas
43226031Sstas   1. Introduction...............................................2
44226031Sstas       1.1 Features..............................................2
45226031Sstas       1.2 Interaction of protocol parts.........................3
46226031Sstas   2. Terminology................................................3
47226031Sstas   3. Bootstring description.....................................4
48226031Sstas       3.1 Basic code point segregation..........................4
49226031Sstas       3.2 Insertion unsort coding...............................4
50226031Sstas       3.3 Generalized variable-length integers..................5
51226031Sstas       3.4 Bias adaptation.......................................7
52226031Sstas   4. Bootstring parameters......................................8
53226031Sstas   5. Parameter values for Punycode..............................8
54226031Sstas   6. Bootstring algorithms......................................9
55226031Sstas
56226031Sstas
57226031Sstas
58226031SstasCostello                    Standards Track                     [Page 1]
59226031Sstas
60226031SstasRFC 3492                     IDNA Punycode                    March 2003
61226031Sstas
62226031Sstas
63226031Sstas       6.1 Bias adaptation function.............................10
64226031Sstas       6.2 Decoding procedure...................................11
65226031Sstas       6.3 Encoding procedure...................................12
66226031Sstas       6.4 Overflow handling....................................13
67226031Sstas   7. Punycode examples.........................................14
68226031Sstas       7.1 Sample strings.......................................14
69226031Sstas       7.2 Decoding traces......................................17
70226031Sstas       7.3 Encoding traces......................................19
71226031Sstas   8. Security Considerations...................................20
72226031Sstas   9. References................................................21
73226031Sstas       9.1 Normative References.................................21
74226031Sstas       9.2 Informative References...............................21
75226031Sstas   A. Mixed-case annotation.....................................22
76226031Sstas   B. Disclaimer and license....................................22
77226031Sstas   C. Punycode sample implementation............................23
78226031Sstas   Author's Address.............................................34
79226031Sstas   Full Copyright Statement.....................................35
80226031Sstas
81226031Sstas1. Introduction
82226031Sstas
83226031Sstas   [IDNA] describes an architecture for supporting internationalized
84226031Sstas   domain names.  Labels containing non-ASCII characters can be
85226031Sstas   represented by ACE labels, which begin with a special ACE prefix and
86226031Sstas   contain only ASCII characters.  The remainder of the label after the
87226031Sstas   prefix is a Punycode encoding of a Unicode string satisfying certain
88226031Sstas   constraints.  For the details of the prefix and constraints, see
89226031Sstas   [IDNA] and [NAMEPREP].
90226031Sstas
91226031Sstas   Punycode is an instance of a more general algorithm called
92226031Sstas   Bootstring, which allows strings composed from a small set of "basic"
93226031Sstas   code points to uniquely represent any string of code points drawn
94226031Sstas   from a larger set.  Punycode is Bootstring with particular parameter
95226031Sstas   values appropriate for IDNA.
96226031Sstas
97226031Sstas1.1 Features
98226031Sstas
99226031Sstas   Bootstring has been designed to have the following features:
100226031Sstas
101226031Sstas   *  Completeness:  Every extended string (sequence of arbitrary code
102226031Sstas      points) can be represented by a basic string (sequence of basic
103226031Sstas      code points).  Restrictions on what strings are allowed, and on
104226031Sstas      length, can be imposed by higher layers.
105226031Sstas
106226031Sstas   *  Uniqueness:  There is at most one basic string that represents a
107226031Sstas      given extended string.
108226031Sstas
109226031Sstas   *  Reversibility:  Any extended string mapped to a basic string can
110226031Sstas      be recovered from that basic string.
111226031Sstas
112226031Sstas
113226031Sstas
114226031SstasCostello                    Standards Track                     [Page 2]
115226031Sstas
116226031SstasRFC 3492                     IDNA Punycode                    March 2003
117226031Sstas
118226031Sstas
119226031Sstas   *  Efficient encoding:  The ratio of basic string length to extended
120226031Sstas      string length is small.  This is important in the context of
121226031Sstas      domain names because RFC 1034 [RFC1034] restricts the length of a
122226031Sstas      domain label to 63 characters.
123226031Sstas
124226031Sstas   *  Simplicity:  The encoding and decoding algorithms are reasonably
125226031Sstas      simple to implement.  The goals of efficiency and simplicity are
126226031Sstas      at odds; Bootstring aims at a good balance between them.
127226031Sstas
128226031Sstas   *  Readability:  Basic code points appearing in the extended string
129226031Sstas      are represented as themselves in the basic string (although the
130226031Sstas      main purpose is to improve efficiency, not readability).
131226031Sstas
132226031Sstas   Punycode can also support an additional feature that is not used by
133226031Sstas   the ToASCII and ToUnicode operations of [IDNA].  When extended
134226031Sstas   strings are case-folded prior to encoding, the basic string can use
135226031Sstas   mixed case to tell how to convert the folded string into a mixed-case
136226031Sstas   string.  See appendix A "Mixed-case annotation".
137226031Sstas
138226031Sstas1.2 Interaction of protocol parts
139226031Sstas
140226031Sstas   Punycode is used by the IDNA protocol [IDNA] for converting domain
141226031Sstas   labels into ASCII; it is not designed for any other purpose.  It is
142226031Sstas   explicitly not designed for processing arbitrary free text.
143226031Sstas
144226031Sstas2. Terminology
145226031Sstas
146226031Sstas   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
147226031Sstas   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
148226031Sstas   document are to be interpreted as described in BCP 14, RFC 2119
149226031Sstas   [RFC2119].
150226031Sstas
151226031Sstas   A code point is an integral value associated with a character in a
152226031Sstas   coded character set.
153226031Sstas
154226031Sstas   As in the Unicode Standard [UNICODE], Unicode code points are denoted
155226031Sstas   by "U+" followed by four to six hexadecimal digits, while a range of
156226031Sstas   code points is denoted by two hexadecimal numbers separated by "..",
157226031Sstas   with no prefixes.
158226031Sstas
159226031Sstas   The operators div and mod perform integer division; (x div y) is the
160226031Sstas   quotient of x divided by y, discarding the remainder, and (x mod y)
161226031Sstas   is the remainder, so (x div y) * y + (x mod y) == x.  Bootstring uses
162226031Sstas   these operators only with nonnegative operands, so the quotient and
163226031Sstas   remainder are always nonnegative.
164226031Sstas
165226031Sstas   The break statement jumps out of the innermost loop (as in C).
166226031Sstas
167226031Sstas
168226031Sstas
169226031Sstas
170226031SstasCostello                    Standards Track                     [Page 3]
171226031Sstas
172226031SstasRFC 3492                     IDNA Punycode                    March 2003
173226031Sstas
174226031Sstas
175226031Sstas   An overflow is an attempt to compute a value that exceeds the maximum
176226031Sstas   value of an integer variable.
177226031Sstas
178226031Sstas3. Bootstring description
179226031Sstas
180226031Sstas   Bootstring represents an arbitrary sequence of code points (the
181226031Sstas   "extended string") as a sequence of basic code points (the "basic
182226031Sstas   string").  This section describes the representation.  Section 6
183226031Sstas   "Bootstring algorithms" presents the algorithms as pseudocode.
184226031Sstas   Sections 7.1 "Decoding traces" and 7.2 "Encoding traces" trace the
185226031Sstas   algorithms for sample inputs.
186226031Sstas
187226031Sstas   The following sections describe the four techniques used in
188226031Sstas   Bootstring.  "Basic code point segregation" is a very simple and
189226031Sstas   efficient encoding for basic code points occurring in the extended
190226031Sstas   string: they are simply copied all at once.  "Insertion unsort
191226031Sstas   coding" encodes the non-basic code points as deltas, and processes
192226031Sstas   the code points in numerical order rather than in order of
193226031Sstas   appearance, which typically results in smaller deltas.  The deltas
194226031Sstas   are represented as "generalized variable-length integers", which use
195226031Sstas   basic code points to represent nonnegative integers.  The parameters
196226031Sstas   of this integer representation are dynamically adjusted using "bias
197226031Sstas   adaptation", to improve efficiency when consecutive deltas have
198226031Sstas   similar magnitudes.
199226031Sstas
200226031Sstas3.1 Basic code point segregation
201226031Sstas
202226031Sstas   All basic code points appearing in the extended string are
203226031Sstas   represented literally at the beginning of the basic string, in their
204226031Sstas   original order, followed by a delimiter if (and only if) the number
205226031Sstas   of basic code points is nonzero.  The delimiter is a particular basic
206226031Sstas   code point, which never appears in the remainder of the basic string.
207226031Sstas   The decoder can therefore find the end of the literal portion (if
208226031Sstas   there is one) by scanning for the last delimiter.
209226031Sstas
210226031Sstas3.2 Insertion unsort coding
211226031Sstas
212226031Sstas   The remainder of the basic string (after the last delimiter if there
213226031Sstas   is one) represents a sequence of nonnegative integral deltas as
214226031Sstas   generalized variable-length integers, described in section 3.3.  The
215226031Sstas   meaning of the deltas is best understood in terms of the decoder.
216226031Sstas
217226031Sstas   The decoder builds the extended string incrementally.  Initially, the
218226031Sstas   extended string is a copy of the literal portion of the basic string
219226031Sstas   (excluding the last delimiter).  The decoder inserts non-basic code
220226031Sstas   points, one for each delta, into the extended string, ultimately
221226031Sstas   arriving at the final decoded string.
222226031Sstas
223226031Sstas
224226031Sstas
225226031Sstas
226226031SstasCostello                    Standards Track                     [Page 4]
227226031Sstas
228226031SstasRFC 3492                     IDNA Punycode                    March 2003
229226031Sstas
230226031Sstas
231226031Sstas   At the heart of this process is a state machine with two state
232226031Sstas   variables: an index i and a counter n.  The index i refers to a
233226031Sstas   position in the extended string; it ranges from 0 (the first
234226031Sstas   position) to the current length of the extended string (which refers
235226031Sstas   to a potential position beyond the current end).  If the current
236226031Sstas   state is <n,i>, the next state is <n,i+1> if i is less than the
237226031Sstas   length of the extended string, or <n+1,0> if i equals the length of
238226031Sstas   the extended string.  In other words, each state change causes i to
239226031Sstas   increment, wrapping around to zero if necessary, and n counts the
240226031Sstas   number of wrap-arounds.
241226031Sstas
242226031Sstas   Notice that the state always advances monotonically (there is no way
243226031Sstas   for the decoder to return to an earlier state).  At each state, an
244226031Sstas   insertion is either performed or not performed.  At most one
245226031Sstas   insertion is performed in a given state.  An insertion inserts the
246226031Sstas   value of n at position i in the extended string.  The deltas are a
247226031Sstas   run-length encoding of this sequence of events: they are the lengths
248226031Sstas   of the runs of non-insertion states preceeding the insertion states.
249226031Sstas   Hence, for each delta, the decoder performs delta state changes, then
250226031Sstas   an insertion, and then one more state change.  (An implementation
251226031Sstas   need not perform each state change individually, but can instead use
252226031Sstas   division and remainder calculations to compute the next insertion
253226031Sstas   state directly.)  It is an error if the inserted code point is a
254226031Sstas   basic code point (because basic code points were supposed to be
255226031Sstas   segregated as described in section 3.1).
256226031Sstas
257226031Sstas   The encoder's main task is to derive the sequence of deltas that will
258226031Sstas   cause the decoder to construct the desired string.  It can do this by
259226031Sstas   repeatedly scanning the extended string for the next code point that
260226031Sstas   the decoder would need to insert, and counting the number of state
261226031Sstas   changes the decoder would need to perform, mindful of the fact that
262226031Sstas   the decoder's extended string will include only those code points
263226031Sstas   that have already been inserted.  Section 6.3 "Encoding procedure"
264226031Sstas   gives a precise algorithm.
265226031Sstas
266226031Sstas3.3 Generalized variable-length integers
267226031Sstas
268226031Sstas   In a conventional integer representation the base is the number of
269226031Sstas   distinct symbols for digits, whose values are 0 through base-1.  Let
270226031Sstas   digit_0 denote the least significant digit, digit_1 the next least
271226031Sstas   significant, and so on.  The value represented is the sum over j of
272226031Sstas   digit_j * w(j), where w(j) = base^j is the weight (scale factor) for
273226031Sstas   position j.  For example, in the base 8 integer 437, the digits are
274226031Sstas   7, 3, and 4, and the weights are 1, 8, and 64, so the value is 7 +
275226031Sstas   3*8 + 4*64 = 287.  This representation has two disadvantages:  First,
276226031Sstas   there are multiple encodings of each value (because there can be
277226031Sstas   extra zeros in the most significant positions), which is inconvenient
278226031Sstas
279226031Sstas
280226031Sstas
281226031Sstas
282226031SstasCostello                    Standards Track                     [Page 5]
283226031Sstas
284226031SstasRFC 3492                     IDNA Punycode                    March 2003
285226031Sstas
286226031Sstas
287226031Sstas   when unique encodings are needed.  Second, the integer is not self-
288226031Sstas   delimiting, so if multiple integers are concatenated the boundaries
289226031Sstas   between them are lost.
290226031Sstas
291226031Sstas   The generalized variable-length representation solves these two
292226031Sstas   problems.  The digit values are still 0 through base-1, but now the
293226031Sstas   integer is self-delimiting by means of thresholds t(j), each of which
294226031Sstas   is in the range 0 through base-1.  Exactly one digit, the most
295226031Sstas   significant, satisfies digit_j < t(j).  Therefore, if several
296226031Sstas   integers are concatenated, it is easy to separate them, starting with
297226031Sstas   the first if they are little-endian (least significant digit first),
298226031Sstas   or starting with the last if they are big-endian (most significant
299226031Sstas   digit first).  As before, the value is the sum over j of digit_j *
300226031Sstas   w(j), but the weights are different:
301226031Sstas
302226031Sstas      w(0) = 1
303226031Sstas      w(j) = w(j-1) * (base - t(j-1)) for j > 0
304226031Sstas
305226031Sstas   For example, consider the little-endian sequence of base 8 digits
306226031Sstas   734251...  Suppose the thresholds are 2, 3, 5, 5, 5, 5...  This
307226031Sstas   implies that the weights are 1, 1*(8-2) = 6, 6*(8-3) = 30, 30*(8-5) =
308226031Sstas   90, 90*(8-5) = 270, and so on.  7 is not less than 2, and 3 is not
309226031Sstas   less than 3, but 4 is less than 5, so 4 is the last digit.  The value
310226031Sstas   of 734 is 7*1 + 3*6 + 4*30 = 145.  The next integer is 251, with
311226031Sstas   value 2*1 + 5*6 + 1*30 = 62.  Decoding this representation is very
312226031Sstas   similar to decoding a conventional integer:  Start with a current
313226031Sstas   value of N = 0 and a weight w = 1.  Fetch the next digit d and
314226031Sstas   increase N by d * w.  If d is less than the current threshold (t)
315226031Sstas   then stop, otherwise increase w by a factor of (base - t), update t
316226031Sstas   for the next position, and repeat.
317226031Sstas
318226031Sstas   Encoding this representation is similar to encoding a conventional
319226031Sstas   integer:  If N < t then output one digit for N and stop, otherwise
320226031Sstas   output the digit for t + ((N - t) mod (base - t)), then replace N
321226031Sstas   with (N - t) div (base - t), update t for the next position, and
322226031Sstas   repeat.
323226031Sstas
324226031Sstas   For any particular set of values of t(j), there is exactly one
325226031Sstas   generalized variable-length representation of each nonnegative
326226031Sstas   integral value.
327226031Sstas
328226031Sstas   Bootstring uses little-endian ordering so that the deltas can be
329226031Sstas   separated starting with the first.  The t(j) values are defined in
330226031Sstas   terms of the constants base, tmin, and tmax, and a state variable
331226031Sstas   called bias:
332226031Sstas
333226031Sstas      t(j) = base * (j + 1) - bias,
334226031Sstas      clamped to the range tmin through tmax
335226031Sstas
336226031Sstas
337226031Sstas
338226031SstasCostello                    Standards Track                     [Page 6]
339226031Sstas
340226031SstasRFC 3492                     IDNA Punycode                    March 2003
341226031Sstas
342226031Sstas
343226031Sstas   The clamping means that if the formula yields a value less than tmin
344226031Sstas   or greater than tmax, then t(j) = tmin or tmax, respectively.  (In
345226031Sstas   the pseudocode in section 6 "Bootstring algorithms", the expression
346226031Sstas   base * (j + 1) is denoted by k for performance reasons.)  These t(j)
347226031Sstas   values cause the representation to favor integers within a particular
348226031Sstas   range determined by the bias.
349226031Sstas
350226031Sstas3.4 Bias adaptation
351226031Sstas
352226031Sstas   After each delta is encoded or decoded, bias is set for the next
353226031Sstas   delta as follows:
354226031Sstas
355226031Sstas   1. Delta is scaled in order to avoid overflow in the next step:
356226031Sstas
357226031Sstas         let delta = delta div 2
358226031Sstas
359226031Sstas      But when this is the very first delta, the divisor is not 2, but
360226031Sstas      instead a constant called damp.  This compensates for the fact
361226031Sstas      that the second delta is usually much smaller than the first.
362226031Sstas
363226031Sstas   2. Delta is increased to compensate for the fact that the next delta
364226031Sstas      will be inserting into a longer string:
365226031Sstas
366226031Sstas         let delta = delta + (delta div numpoints)
367226031Sstas
368226031Sstas      numpoints is the total number of code points encoded/decoded so
369226031Sstas      far (including the one corresponding to this delta itself, and
370226031Sstas      including the basic code points).
371226031Sstas
372226031Sstas   3. Delta is repeatedly divided until it falls within a threshold, to
373226031Sstas      predict the minimum number of digits needed to represent the next
374226031Sstas      delta:
375226031Sstas
376226031Sstas         while delta > ((base - tmin) * tmax) div 2
377226031Sstas         do let delta = delta div (base - tmin)
378226031Sstas
379226031Sstas   4. The bias is set:
380226031Sstas
381226031Sstas         let bias =
382226031Sstas           (base * the number of divisions performed in step 3) +
383226031Sstas           (((base - tmin + 1) * delta) div (delta + skew))
384226031Sstas
385226031Sstas      The motivation for this procedure is that the current delta
386226031Sstas      provides a hint about the likely size of the next delta, and so
387226031Sstas      t(j) is set to tmax for the more significant digits starting with
388226031Sstas      the one expected to be last, tmin for the less significant digits
389226031Sstas      up through the one expected to be third-last, and somewhere
390226031Sstas      between tmin and tmax for the digit expected to be second-last
391226031Sstas
392226031Sstas
393226031Sstas
394226031SstasCostello                    Standards Track                     [Page 7]
395226031Sstas
396226031SstasRFC 3492                     IDNA Punycode                    March 2003
397226031Sstas
398226031Sstas
399226031Sstas      (balancing the hope of the expected-last digit being unnecessary
400226031Sstas      against the danger of it being insufficient).
401226031Sstas
402226031Sstas4. Bootstring parameters
403226031Sstas
404226031Sstas   Given a set of basic code points, one needs to be designated as the
405226031Sstas   delimiter.  The base cannot be greater than the number of
406226031Sstas   distinguishable basic code points remaining.  The digit-values in the
407226031Sstas   range 0 through base-1 need to be associated with distinct non-
408226031Sstas   delimiter basic code points.  In some cases multiple code points need
409226031Sstas   to have the same digit-value; for example, uppercase and lowercase
410226031Sstas   versions of the same letter need to be equivalent if basic strings
411226031Sstas   are case-insensitive.
412226031Sstas
413226031Sstas   The initial value of n cannot be greater than the minimum non-basic
414226031Sstas   code point that could appear in extended strings.
415226031Sstas
416226031Sstas   The remaining five parameters (tmin, tmax, skew, damp, and the
417226031Sstas   initial value of bias) need to satisfy the following constraints:
418226031Sstas
419226031Sstas      0 <= tmin <= tmax <= base-1
420226031Sstas      skew >= 1
421226031Sstas      damp >= 2
422226031Sstas      initial_bias mod base <= base - tmin
423226031Sstas
424226031Sstas   Provided the constraints are satisfied, these five parameters affect
425226031Sstas   efficiency but not correctness.  They are best chosen empirically.
426226031Sstas
427226031Sstas   If support for mixed-case annotation is desired (see appendix A),
428226031Sstas   make sure that the code points corresponding to 0 through tmax-1 all
429226031Sstas   have both uppercase and lowercase forms.
430226031Sstas
431226031Sstas5. Parameter values for Punycode
432226031Sstas
433226031Sstas   Punycode uses the following Bootstring parameter values:
434226031Sstas
435226031Sstas      base         = 36
436226031Sstas      tmin         = 1
437226031Sstas      tmax         = 26
438226031Sstas      skew         = 38
439226031Sstas      damp         = 700
440226031Sstas      initial_bias = 72
441226031Sstas      initial_n    = 128 = 0x80
442226031Sstas
443226031Sstas   Although the only restriction Punycode imposes on the input integers
444226031Sstas   is that they be nonnegative, these parameters are especially designed
445226031Sstas   to work well with Unicode [UNICODE] code points, which are integers
446226031Sstas   in the range 0..10FFFF (but not D800..DFFF, which are reserved for
447226031Sstas
448226031Sstas
449226031Sstas
450226031SstasCostello                    Standards Track                     [Page 8]
451226031Sstas
452226031SstasRFC 3492                     IDNA Punycode                    March 2003
453226031Sstas
454226031Sstas
455226031Sstas   use by the UTF-16 encoding of Unicode).  The basic code points are
456226031Sstas   the ASCII [ASCII] code points (0..7F), of which U+002D (-) is the
457226031Sstas   delimiter, and some of the others have digit-values as follows:
458226031Sstas
459226031Sstas      code points    digit-values
460226031Sstas      ------------   ----------------------
461226031Sstas      41..5A (A-Z) =  0 to 25, respectively
462226031Sstas      61..7A (a-z) =  0 to 25, respectively
463226031Sstas      30..39 (0-9) = 26 to 35, respectively
464226031Sstas
465226031Sstas   Using hyphen-minus as the delimiter implies that the encoded string
466226031Sstas   can end with a hyphen-minus only if the Unicode string consists
467226031Sstas   entirely of basic code points, but IDNA forbids such strings from
468226031Sstas   being encoded.  The encoded string can begin with a hyphen-minus, but
469226031Sstas   IDNA prepends a prefix.  Therefore IDNA using Punycode conforms to
470226031Sstas   the RFC 952 rule that host name labels neither begin nor end with a
471226031Sstas   hyphen-minus [RFC952].
472226031Sstas
473226031Sstas   A decoder MUST recognize the letters in both uppercase and lowercase
474226031Sstas   forms (including mixtures of both forms).  An encoder SHOULD output
475226031Sstas   only uppercase forms or only lowercase forms, unless it uses mixed-
476226031Sstas   case annotation (see appendix A).
477226031Sstas
478226031Sstas   Presumably most users will not manually write or type encoded strings
479226031Sstas   (as opposed to cutting and pasting them), but those who do will need
480226031Sstas   to be alert to the potential visual ambiguity between the following
481226031Sstas   sets of characters:
482226031Sstas
483226031Sstas      G 6
484226031Sstas      I l 1
485226031Sstas      O 0
486226031Sstas      S 5
487226031Sstas      U V
488226031Sstas      Z 2
489226031Sstas
490226031Sstas   Such ambiguities are usually resolved by context, but in a Punycode
491226031Sstas   encoded string there is no context apparent to humans.
492226031Sstas
493226031Sstas6. Bootstring algorithms
494226031Sstas
495226031Sstas   Some parts of the pseudocode can be omitted if the parameters satisfy
496226031Sstas   certain conditions (for which Punycode qualifies).  These parts are
497226031Sstas   enclosed in {braces}, and notes immediately following the pseudocode
498226031Sstas   explain the conditions under which they can be omitted.
499226031Sstas
500226031Sstas
501226031Sstas
502226031Sstas
503226031Sstas
504226031Sstas
505226031Sstas
506226031SstasCostello                    Standards Track                     [Page 9]
507226031Sstas
508226031SstasRFC 3492                     IDNA Punycode                    March 2003
509226031Sstas
510226031Sstas
511226031Sstas   Formally, code points are integers, and hence the pseudocode assumes
512226031Sstas   that arithmetic operations can be performed directly on code points.
513226031Sstas   In some programming languages, explicit conversion between code
514226031Sstas   points and integers might be necessary.
515226031Sstas
516226031Sstas6.1 Bias adaptation function
517226031Sstas
518226031Sstas   function adapt(delta,numpoints,firsttime):
519226031Sstas     if firsttime then let delta = delta div damp
520226031Sstas     else let delta = delta div 2
521226031Sstas     let delta = delta + (delta div numpoints)
522226031Sstas     let k = 0
523226031Sstas     while delta > ((base - tmin) * tmax) div 2 do begin
524226031Sstas       let delta = delta div (base - tmin)
525226031Sstas       let k = k + base
526226031Sstas     end
527226031Sstas     return k + (((base - tmin + 1) * delta) div (delta + skew))
528226031Sstas
529226031Sstas   It does not matter whether the modifications to delta and k inside
530226031Sstas   adapt() affect variables of the same name inside the
531226031Sstas   encoding/decoding procedures, because after calling adapt() the
532226031Sstas   caller does not read those variables before overwriting them.
533226031Sstas
534226031Sstas
535226031Sstas
536226031Sstas
537226031Sstas
538226031Sstas
539226031Sstas
540226031Sstas
541226031Sstas
542226031Sstas
543226031Sstas
544226031Sstas
545226031Sstas
546226031Sstas
547226031Sstas
548226031Sstas
549226031Sstas
550226031Sstas
551226031Sstas
552226031Sstas
553226031Sstas
554226031Sstas
555226031Sstas
556226031Sstas
557226031Sstas
558226031Sstas
559226031Sstas
560226031Sstas
561226031Sstas
562226031SstasCostello                    Standards Track                    [Page 10]
563226031Sstas
564226031SstasRFC 3492                     IDNA Punycode                    March 2003
565226031Sstas
566226031Sstas
567226031Sstas6.2 Decoding procedure
568226031Sstas
569226031Sstas   let n = initial_n
570226031Sstas   let i = 0
571226031Sstas   let bias = initial_bias
572226031Sstas   let output = an empty string indexed from 0
573226031Sstas   consume all code points before the last delimiter (if there is one)
574226031Sstas     and copy them to output, fail on any non-basic code point
575226031Sstas   if more than zero code points were consumed then consume one more
576226031Sstas     (which will be the last delimiter)
577226031Sstas   while the input is not exhausted do begin
578226031Sstas     let oldi = i
579226031Sstas     let w = 1
580226031Sstas     for k = base to infinity in steps of base do begin
581226031Sstas       consume a code point, or fail if there was none to consume
582226031Sstas       let digit = the code point's digit-value, fail if it has none
583226031Sstas       let i = i + digit * w, fail on overflow
584226031Sstas       let t = tmin if k <= bias {+ tmin}, or
585226031Sstas               tmax if k >= bias + tmax, or k - bias otherwise
586226031Sstas       if digit < t then break
587226031Sstas       let w = w * (base - t), fail on overflow
588226031Sstas     end
589226031Sstas     let bias = adapt(i - oldi, length(output) + 1, test oldi is 0?)
590226031Sstas     let n = n + i div (length(output) + 1), fail on overflow
591226031Sstas     let i = i mod (length(output) + 1)
592226031Sstas     {if n is a basic code point then fail}
593226031Sstas     insert n into output at position i
594226031Sstas     increment i
595226031Sstas   end
596226031Sstas
597226031Sstas   The full statement enclosed in braces (checking whether n is a basic
598226031Sstas   code point) can be omitted if initial_n exceeds all basic code points
599226031Sstas   (which is true for Punycode), because n is never less than initial_n.
600226031Sstas
601226031Sstas   In the assignment of t, where t is clamped to the range tmin through
602226031Sstas   tmax, "+ tmin" can always be omitted.  This makes the clamping
603226031Sstas   calculation incorrect when bias < k < bias + tmin, but that cannot
604226031Sstas   happen because of the way bias is computed and because of the
605226031Sstas   constraints on the parameters.
606226031Sstas
607226031Sstas   Because the decoder state can only advance monotonically, and there
608226031Sstas   is only one representation of any delta, there is therefore only one
609226031Sstas   encoded string that can represent a given sequence of integers.  The
610226031Sstas   only error conditions are invalid code points, unexpected end-of-
611226031Sstas   input, overflow, and basic code points encoded using deltas instead
612226031Sstas   of appearing literally.  If the decoder fails on these errors as
613226031Sstas   shown above, then it cannot produce the same output for two distinct
614226031Sstas   inputs.  Without this property it would have been necessary to re-
615226031Sstas
616226031Sstas
617226031Sstas
618226031SstasCostello                    Standards Track                    [Page 11]
619226031Sstas
620226031SstasRFC 3492                     IDNA Punycode                    March 2003
621226031Sstas
622226031Sstas
623226031Sstas   encode the output and verify that it matches the input in order to
624226031Sstas   guarantee the uniqueness of the encoding.
625226031Sstas
626226031Sstas6.3 Encoding procedure
627226031Sstas
628226031Sstas   let n = initial_n
629226031Sstas   let delta = 0
630226031Sstas   let bias = initial_bias
631226031Sstas   let h = b = the number of basic code points in the input
632226031Sstas   copy them to the output in order, followed by a delimiter if b > 0
633226031Sstas   {if the input contains a non-basic code point < n then fail}
634226031Sstas   while h < length(input) do begin
635226031Sstas     let m = the minimum {non-basic} code point >= n in the input
636226031Sstas     let delta = delta + (m - n) * (h + 1), fail on overflow
637226031Sstas     let n = m
638226031Sstas     for each code point c in the input (in order) do begin
639226031Sstas       if c < n {or c is basic} then increment delta, fail on overflow
640226031Sstas       if c == n then begin
641226031Sstas         let q = delta
642226031Sstas         for k = base to infinity in steps of base do begin
643226031Sstas           let t = tmin if k <= bias {+ tmin}, or
644226031Sstas                   tmax if k >= bias + tmax, or k - bias otherwise
645226031Sstas           if q < t then break
646226031Sstas           output the code point for digit t + ((q - t) mod (base - t))
647226031Sstas           let q = (q - t) div (base - t)
648226031Sstas         end
649226031Sstas         output the code point for digit q
650226031Sstas         let bias = adapt(delta, h + 1, test h equals b?)
651226031Sstas         let delta = 0
652226031Sstas         increment h
653226031Sstas       end
654226031Sstas     end
655226031Sstas     increment delta and n
656226031Sstas   end
657226031Sstas
658226031Sstas   The full statement enclosed in braces (checking whether the input
659226031Sstas   contains a non-basic code point less than n) can be omitted if all
660226031Sstas   code points less than initial_n are basic code points (which is true
661226031Sstas   for Punycode if code points are unsigned).
662226031Sstas
663226031Sstas   The brace-enclosed conditions "non-basic" and "or c is basic" can be
664226031Sstas   omitted if initial_n exceeds all basic code points (which is true for
665226031Sstas   Punycode), because the code point being tested is never less than
666226031Sstas   initial_n.
667226031Sstas
668226031Sstas   In the assignment of t, where t is clamped to the range tmin through
669226031Sstas   tmax, "+ tmin" can always be omitted.  This makes the clamping
670226031Sstas   calculation incorrect when bias < k < bias + tmin, but that cannot
671226031Sstas
672226031Sstas
673226031Sstas
674226031SstasCostello                    Standards Track                    [Page 12]
675226031Sstas
676226031SstasRFC 3492                     IDNA Punycode                    March 2003
677226031Sstas
678226031Sstas
679226031Sstas   happen because of the way bias is computed and because of the
680226031Sstas   constraints on the parameters.
681226031Sstas
682226031Sstas   The checks for overflow are necessary to avoid producing invalid
683226031Sstas   output when the input contains very large values or is very long.
684226031Sstas
685226031Sstas   The increment of delta at the bottom of the outer loop cannot
686226031Sstas   overflow because delta < length(input) before the increment, and
687226031Sstas   length(input) is already assumed to be representable.  The increment
688226031Sstas   of n could overflow, but only if h == length(input), in which case
689226031Sstas   the procedure is finished anyway.
690226031Sstas
691226031Sstas6.4 Overflow handling
692226031Sstas
693226031Sstas   For IDNA, 26-bit unsigned integers are sufficient to handle all valid
694226031Sstas   IDNA labels without overflow, because any string that needed a 27-bit
695226031Sstas   delta would have to exceed either the code point limit (0..10FFFF) or
696226031Sstas   the label length limit (63 characters).  However, overflow handling
697226031Sstas   is necessary because the inputs are not necessarily valid IDNA
698226031Sstas   labels.
699226031Sstas
700226031Sstas   If the programming language does not provide overflow detection, the
701226031Sstas   following technique can be used.  Suppose A, B, and C are
702226031Sstas   representable nonnegative integers and C is nonzero.  Then A + B
703226031Sstas   overflows if and only if B > maxint - A, and A + (B * C) overflows if
704226031Sstas   and only if B > (maxint - A) div C, where maxint is the greatest
705226031Sstas   integer for which maxint + 1 cannot be represented.  Refer to
706226031Sstas   appendix C "Punycode sample implementation" for demonstrations of
707226031Sstas   this technique in the C language.
708226031Sstas
709226031Sstas   The decoding and encoding algorithms shown in sections 6.2 and 6.3
710226031Sstas   handle overflow by detecting it whenever it happens.  Another
711226031Sstas   approach is to enforce limits on the inputs that prevent overflow
712226031Sstas   from happening.  For example, if the encoder were to verify that no
713226031Sstas   input code points exceed M and that the input length does not exceed
714226031Sstas   L, then no delta could ever exceed (M - initial_n) * (L + 1), and
715226031Sstas   hence no overflow could occur if integer variables were capable of
716226031Sstas   representing values that large.  This prevention approach would
717226031Sstas   impose more restrictions on the input than the detection approach
718226031Sstas   does, but might be considered simpler in some programming languages.
719226031Sstas
720226031Sstas   In theory, the decoder could use an analogous approach, limiting the
721226031Sstas   number of digits in a variable-length integer (that is, limiting the
722226031Sstas   number of iterations in the innermost loop).  However, the number of
723226031Sstas   digits that suffice to represent a given delta can sometimes
724226031Sstas   represent much larger deltas (because of the adaptation), and hence
725226031Sstas   this approach would probably need integers wider than 32 bits.
726226031Sstas
727226031Sstas
728226031Sstas
729226031Sstas
730226031SstasCostello                    Standards Track                    [Page 13]
731226031Sstas
732226031SstasRFC 3492                     IDNA Punycode                    March 2003
733226031Sstas
734226031Sstas
735226031Sstas   Yet another approach for the decoder is to allow overflow to occur,
736226031Sstas   but to check the final output string by re-encoding it and comparing
737226031Sstas   to the decoder input.  If and only if they do not match (using a
738226031Sstas   case-insensitive ASCII comparison) overflow has occurred.  This
739226031Sstas   delayed-detection approach would not impose any more restrictions on
740226031Sstas   the input than the immediate-detection approach does, and might be
741226031Sstas   considered simpler in some programming languages.
742226031Sstas
743226031Sstas   In fact, if the decoder is used only inside the IDNA ToUnicode
744226031Sstas   operation [IDNA], then it need not check for overflow at all, because
745226031Sstas   ToUnicode performs a higher level re-encoding and comparison, and a
746226031Sstas   mismatch has the same consequence as if the Punycode decoder had
747226031Sstas   failed.
748226031Sstas
749226031Sstas7. Punycode examples
750226031Sstas
751226031Sstas7.1 Sample strings
752226031Sstas
753226031Sstas   In the Punycode encodings below, the ACE prefix is not shown.
754226031Sstas   Backslashes show where line breaks have been inserted in strings too
755226031Sstas   long for one line.
756226031Sstas
757226031Sstas   The first several examples are all translations of the sentence "Why
758226031Sstas   can't they just speak in <language>?" (courtesy of Michael Kaplan's
759226031Sstas   "provincial" page [PROVINCIAL]).  Word breaks and punctuation have
760226031Sstas   been removed, as is often done in domain names.
761226031Sstas
762226031Sstas   (A) Arabic (Egyptian):
763226031Sstas       u+0644 u+064A u+0647 u+0645 u+0627 u+0628 u+062A u+0643 u+0644
764226031Sstas       u+0645 u+0648 u+0634 u+0639 u+0631 u+0628 u+064A u+061F
765226031Sstas       Punycode: egbpdaj6bu4bxfgehfvwxn
766226031Sstas
767226031Sstas   (B) Chinese (simplified):
768226031Sstas       u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587
769226031Sstas       Punycode: ihqwcrb4cv8a8dqg056pqjye
770226031Sstas
771226031Sstas   (C) Chinese (traditional):
772226031Sstas       u+4ED6 u+5011 u+7232 u+4EC0 u+9EBD u+4E0D u+8AAA u+4E2D u+6587
773226031Sstas       Punycode: ihqwctvzc91f659drss3x8bo0yb
774226031Sstas
775226031Sstas   (D) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
776226031Sstas       U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074
777226031Sstas       u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D
778226031Sstas       u+0065 u+0073 u+006B u+0079
779226031Sstas       Punycode: Proprostnemluvesky-uyb24dma41a
780226031Sstas
781226031Sstas
782226031Sstas
783226031Sstas
784226031Sstas
785226031Sstas
786226031SstasCostello                    Standards Track                    [Page 14]
787226031Sstas
788226031SstasRFC 3492                     IDNA Punycode                    March 2003
789226031Sstas
790226031Sstas
791226031Sstas   (E) Hebrew:
792226031Sstas       u+05DC u+05DE u+05D4 u+05D4 u+05DD u+05E4 u+05E9 u+05D5 u+05D8
793226031Sstas       u+05DC u+05D0 u+05DE u+05D3 u+05D1 u+05E8 u+05D9 u+05DD u+05E2
794226031Sstas       u+05D1 u+05E8 u+05D9 u+05EA
795226031Sstas       Punycode: 4dbcagdahymbxekheh6e0a7fei0b
796226031Sstas
797226031Sstas   (F) Hindi (Devanagari):
798226031Sstas       u+092F u+0939 u+0932 u+094B u+0917 u+0939 u+093F u+0928 u+094D
799226031Sstas       u+0926 u+0940 u+0915 u+094D u+092F u+094B u+0902 u+0928 u+0939
800226031Sstas       u+0940 u+0902 u+092C u+094B u+0932 u+0938 u+0915 u+0924 u+0947
801226031Sstas       u+0939 u+0948 u+0902
802226031Sstas       Punycode: i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd
803226031Sstas
804226031Sstas   (G) Japanese (kanji and hiragana):
805226031Sstas       u+306A u+305C u+307F u+3093 u+306A u+65E5 u+672C u+8A9E u+3092
806226031Sstas       u+8A71 u+3057 u+3066 u+304F u+308C u+306A u+3044 u+306E u+304B
807226031Sstas       Punycode: n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa
808226031Sstas
809226031Sstas   (H) Korean (Hangul syllables):
810226031Sstas       u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4 u+C774
811226031Sstas       u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C u+B2E4 u+BA74
812226031Sstas       u+C5BC u+B9C8 u+B098 u+C88B u+C744 u+AE4C
813226031Sstas       Punycode: 989aomsvi5e83db1d2a355cv1e0vak1dwrv93d5xbh15a0dt30a5j\
814226031Sstas                 psd879ccm6fea98c
815226031Sstas
816226031Sstas   (I) Russian (Cyrillic):
817226031Sstas       U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E
818226031Sstas       u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440
819226031Sstas       u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A
820226031Sstas       u+0438
821226031Sstas       Punycode: b1abfaaepdrnnbgefbaDotcwatmq2g4l
822226031Sstas
823226031Sstas   (J) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
824226031Sstas       U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070
825226031Sstas       u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070
826226031Sstas       u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061
827226031Sstas       u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070
828226031Sstas       u+0061 u+00F1 u+006F u+006C
829226031Sstas       Punycode: PorqunopuedensimplementehablarenEspaol-fmd56a
830226031Sstas
831226031Sstas   (K) Vietnamese:
832226031Sstas       T<adotbelow>isaoh<odotbelow>kh<ocirc>ngth<ecirchookabove>ch\
833226031Sstas       <ihookabove>n<oacute>iti<ecircacute>ngVi<ecircdotbelow>t
834226031Sstas       U+0054 u+1EA1 u+0069 u+0073 u+0061 u+006F u+0068 u+1ECD u+006B
835226031Sstas       u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+1EC3 u+0063 u+0068
836226031Sstas       u+1EC9 u+006E u+00F3 u+0069 u+0074 u+0069 u+1EBF u+006E u+0067
837226031Sstas       U+0056 u+0069 u+1EC7 u+0074
838226031Sstas       Punycode: TisaohkhngthchnitingVit-kjcr8268qyxafd2f1b9g
839226031Sstas
840226031Sstas
841226031Sstas
842226031SstasCostello                    Standards Track                    [Page 15]
843226031Sstas
844226031SstasRFC 3492                     IDNA Punycode                    March 2003
845226031Sstas
846226031Sstas
847226031Sstas   The next several examples are all names of Japanese music artists,
848226031Sstas   song titles, and TV programs, just because the author happens to have
849226031Sstas   them handy (but Japanese is useful for providing examples of single-
850226031Sstas   row text, two-row text, ideographic text, and various mixtures
851226031Sstas   thereof).
852226031Sstas
853226031Sstas   (L) 3<nen>B<gumi><kinpachi><sensei>
854226031Sstas       u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F
855226031Sstas       Punycode: 3B-ww4c5e180e575a65lsy2b
856226031Sstas
857226031Sstas   (M) <amuro><namie>-with-SUPER-MONKEYS
858226031Sstas       u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074
859226031Sstas       u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D
860226031Sstas       U+004F U+004E U+004B U+0045 U+0059 U+0053
861226031Sstas       Punycode: -with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n
862226031Sstas
863226031Sstas   (N) Hello-Another-Way-<sorezore><no><basho>
864226031Sstas       U+0048 u+0065 u+006C u+006C u+006F u+002D U+0041 u+006E u+006F
865226031Sstas       u+0074 u+0068 u+0065 u+0072 u+002D U+0057 u+0061 u+0079 u+002D
866226031Sstas       u+305D u+308C u+305E u+308C u+306E u+5834 u+6240
867226031Sstas       Punycode: Hello-Another-Way--fc4qua05auwb3674vfr0b
868226031Sstas
869226031Sstas   (O) <hitotsu><yane><no><shita>2
870226031Sstas       u+3072 u+3068 u+3064 u+5C4B u+6839 u+306E u+4E0B u+0032
871226031Sstas       Punycode: 2-u9tlzr9756bt3uc0v
872226031Sstas
873226031Sstas   (P) Maji<de>Koi<suru>5<byou><mae>
874226031Sstas       U+004D u+0061 u+006A u+0069 u+3067 U+004B u+006F u+0069 u+3059
875226031Sstas       u+308B u+0035 u+79D2 u+524D
876226031Sstas       Punycode: MajiKoi5-783gue6qz075azm5e
877226031Sstas
878226031Sstas   (Q) <pafii>de<runba>
879226031Sstas       u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
880226031Sstas       Punycode: de-jg4avhby1noc0d
881226031Sstas
882226031Sstas   (R) <sono><supiido><de>
883226031Sstas       u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067
884226031Sstas       Punycode: d9juau41awczczp
885226031Sstas
886226031Sstas   The last example is an ASCII string that breaks the existing rules
887226031Sstas   for host name labels.  (It is not a realistic example for IDNA,
888226031Sstas   because IDNA never encodes pure ASCII labels.)
889226031Sstas
890226031Sstas   (S) -> $1.00 <-
891226031Sstas       u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020
892226031Sstas       u+003C u+002D
893226031Sstas       Punycode: -> $1.00 <--
894226031Sstas
895226031Sstas
896226031Sstas
897226031Sstas
898226031SstasCostello                    Standards Track                    [Page 16]
899226031Sstas
900226031SstasRFC 3492                     IDNA Punycode                    March 2003
901226031Sstas
902226031Sstas
903226031Sstas7.2 Decoding traces
904226031Sstas
905226031Sstas   In the following traces, the evolving state of the decoder is shown
906226031Sstas   as a sequence of hexadecimal values, representing the code points in
907226031Sstas   the extended string.  An asterisk appears just after the most
908226031Sstas   recently inserted code point, indicating both n (the value preceeding
909226031Sstas   the asterisk) and i (the position of the value just after the
910226031Sstas   asterisk).  Other numerical values are decimal.
911226031Sstas
912226031Sstas   Decoding trace of example B from section 7.1:
913226031Sstas
914226031Sstas   n is 128, i is 0, bias is 72
915226031Sstas   input is "ihqwcrb4cv8a8dqg056pqjye"
916226031Sstas   there is no delimiter, so extended string starts empty
917226031Sstas   delta "ihq" decodes to 19853
918226031Sstas   bias becomes 21
919226031Sstas   4E0D *
920226031Sstas   delta "wc" decodes to 64
921226031Sstas   bias becomes 20
922226031Sstas   4E0D 4E2D *
923226031Sstas   delta "rb" decodes to 37
924226031Sstas   bias becomes 13
925226031Sstas   4E3A * 4E0D 4E2D
926226031Sstas   delta "4c" decodes to 56
927226031Sstas   bias becomes 17
928226031Sstas   4E3A 4E48 * 4E0D 4E2D
929226031Sstas   delta "v8a" decodes to 599
930226031Sstas   bias becomes 32
931226031Sstas   4E3A 4EC0 * 4E48 4E0D 4E2D
932226031Sstas   delta "8d" decodes to 130
933226031Sstas   bias becomes 23
934226031Sstas   4ED6 * 4E3A 4EC0 4E48 4E0D 4E2D
935226031Sstas   delta "qg" decodes to 154
936226031Sstas   bias becomes 25
937226031Sstas   4ED6 4EEC * 4E3A 4EC0 4E48 4E0D 4E2D
938226031Sstas   delta "056p" decodes to 46301
939226031Sstas   bias becomes 84
940226031Sstas   4ED6 4EEC 4E3A 4EC0 4E48 4E0D 4E2D 6587 *
941226031Sstas   delta "qjye" decodes to 88531
942226031Sstas   bias becomes 90
943226031Sstas   4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 * 4E2D 6587
944226031Sstas
945226031Sstas
946226031Sstas
947226031Sstas
948226031Sstas
949226031Sstas
950226031Sstas
951226031Sstas
952226031Sstas
953226031Sstas
954226031SstasCostello                    Standards Track                    [Page 17]
955226031Sstas
956226031SstasRFC 3492                     IDNA Punycode                    March 2003
957226031Sstas
958226031Sstas
959226031Sstas   Decoding trace of example L from section 7.1:
960226031Sstas
961226031Sstas   n is 128, i is 0, bias is 72
962226031Sstas   input is "3B-ww4c5e180e575a65lsy2b"
963226031Sstas   literal portion is "3B-", so extended string starts as:
964226031Sstas   0033 0042
965226031Sstas   delta "ww4c" decodes to 62042
966226031Sstas   bias becomes 27
967226031Sstas   0033 0042 5148 *
968226031Sstas   delta "5e" decodes to 139
969226031Sstas   bias becomes 24
970226031Sstas   0033 0042 516B * 5148
971226031Sstas   delta "180e" decodes to 16683
972226031Sstas   bias becomes 67
973226031Sstas   0033 5E74 * 0042 516B 5148
974226031Sstas   delta "575a" decodes to 34821
975226031Sstas   bias becomes 82
976226031Sstas   0033 5E74 0042 516B 5148 751F *
977226031Sstas   delta "65l" decodes to 14592
978226031Sstas   bias becomes 67
979226031Sstas   0033 5E74 0042 7D44 * 516B 5148 751F
980226031Sstas   delta "sy2b" decodes to 42088
981226031Sstas   bias becomes 84
982226031Sstas   0033 5E74 0042 7D44 91D1 * 516B 5148 751F
983226031Sstas
984226031Sstas
985226031Sstas
986226031Sstas
987226031Sstas
988226031Sstas
989226031Sstas
990226031Sstas
991226031Sstas
992226031Sstas
993226031Sstas
994226031Sstas
995226031Sstas
996226031Sstas
997226031Sstas
998226031Sstas
999226031Sstas
1000226031Sstas
1001226031Sstas
1002226031Sstas
1003226031Sstas
1004226031Sstas
1005226031Sstas
1006226031Sstas
1007226031Sstas
1008226031Sstas
1009226031Sstas
1010226031SstasCostello                    Standards Track                    [Page 18]
1011226031Sstas
1012226031SstasRFC 3492                     IDNA Punycode                    March 2003
1013226031Sstas
1014226031Sstas
1015226031Sstas7.3 Encoding traces
1016226031Sstas
1017226031Sstas   In the following traces, code point values are hexadecimal, while
1018226031Sstas   other numerical values are decimal.
1019226031Sstas
1020226031Sstas   Encoding trace of example B from section 7.1:
1021226031Sstas
1022226031Sstas   bias is 72
1023226031Sstas   input is:
1024226031Sstas   4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 4E2D 6587
1025226031Sstas   there are no basic code points, so no literal portion
1026226031Sstas   next code point to insert is 4E0D
1027226031Sstas   needed delta is 19853, encodes as "ihq"
1028226031Sstas   bias becomes 21
1029226031Sstas   next code point to insert is 4E2D
1030226031Sstas   needed delta is 64, encodes as "wc"
1031226031Sstas   bias becomes 20
1032226031Sstas   next code point to insert is 4E3A
1033226031Sstas   needed delta is 37, encodes as "rb"
1034226031Sstas   bias becomes 13
1035226031Sstas   next code point to insert is 4E48
1036226031Sstas   needed delta is 56, encodes as "4c"
1037226031Sstas   bias becomes 17
1038226031Sstas   next code point to insert is 4EC0
1039226031Sstas   needed delta is 599, encodes as "v8a"
1040226031Sstas   bias becomes 32
1041226031Sstas   next code point to insert is 4ED6
1042226031Sstas   needed delta is 130, encodes as "8d"
1043226031Sstas   bias becomes 23
1044226031Sstas   next code point to insert is 4EEC
1045226031Sstas   needed delta is 154, encodes as "qg"
1046226031Sstas   bias becomes 25
1047226031Sstas   next code point to insert is 6587
1048226031Sstas   needed delta is 46301, encodes as "056p"
1049226031Sstas   bias becomes 84
1050226031Sstas   next code point to insert is 8BF4
1051226031Sstas   needed delta is 88531, encodes as "qjye"
1052226031Sstas   bias becomes 90
1053226031Sstas   output is "ihqwcrb4cv8a8dqg056pqjye"
1054226031Sstas
1055226031Sstas
1056226031Sstas
1057226031Sstas
1058226031Sstas
1059226031Sstas
1060226031Sstas
1061226031Sstas
1062226031Sstas
1063226031Sstas
1064226031Sstas
1065226031Sstas
1066226031SstasCostello                    Standards Track                    [Page 19]
1067226031Sstas
1068226031SstasRFC 3492                     IDNA Punycode                    March 2003
1069226031Sstas
1070226031Sstas
1071226031Sstas   Encoding trace of example L from section 7.1:
1072226031Sstas
1073226031Sstas   bias is 72
1074226031Sstas   input is:
1075226031Sstas   0033 5E74 0042 7D44 91D1 516B 5148 751F
1076226031Sstas   basic code points (0033, 0042) are copied to literal portion: "3B-"
1077226031Sstas   next code point to insert is 5148
1078226031Sstas   needed delta is 62042, encodes as "ww4c"
1079226031Sstas   bias becomes 27
1080226031Sstas   next code point to insert is 516B
1081226031Sstas   needed delta is 139, encodes as "5e"
1082226031Sstas   bias becomes 24
1083226031Sstas   next code point to insert is 5E74
1084226031Sstas   needed delta is 16683, encodes as "180e"
1085226031Sstas   bias becomes 67
1086226031Sstas   next code point to insert is 751F
1087226031Sstas   needed delta is 34821, encodes as "575a"
1088226031Sstas   bias becomes 82
1089226031Sstas   next code point to insert is 7D44
1090226031Sstas   needed delta is 14592, encodes as "65l"
1091226031Sstas   bias becomes 67
1092226031Sstas   next code point to insert is 91D1
1093226031Sstas   needed delta is 42088, encodes as "sy2b"
1094226031Sstas   bias becomes 84
1095226031Sstas   output is "3B-ww4c5e180e575a65lsy2b"
1096226031Sstas
1097226031Sstas8. Security Considerations
1098226031Sstas
1099226031Sstas   Users expect each domain name in DNS to be controlled by a single
1100226031Sstas   authority.  If a Unicode string intended for use as a domain label
1101226031Sstas   could map to multiple ACE labels, then an internationalized domain
1102226031Sstas   name could map to multiple ASCII domain names, each controlled by a
1103226031Sstas   different authority, some of which could be spoofs that hijack
1104226031Sstas   service requests intended for another.  Therefore Punycode is
1105226031Sstas   designed so that each Unicode string has a unique encoding.
1106226031Sstas
1107226031Sstas   However, there can still be multiple Unicode representations of the
1108226031Sstas   "same" text, for various definitions of "same".  This problem is
1109226031Sstas   addressed to some extent by the Unicode standard under the topic of
1110226031Sstas   canonicalization, and this work is leveraged for domain names by
1111226031Sstas   Nameprep [NAMEPREP].
1112226031Sstas
1113226031Sstas
1114226031Sstas
1115226031Sstas
1116226031Sstas
1117226031Sstas
1118226031Sstas
1119226031Sstas
1120226031Sstas
1121226031Sstas
1122226031SstasCostello                    Standards Track                    [Page 20]
1123226031Sstas
1124226031SstasRFC 3492                     IDNA Punycode                    March 2003
1125226031Sstas
1126226031Sstas
1127226031Sstas9. References
1128226031Sstas
1129226031Sstas9.1 Normative References
1130226031Sstas
1131226031Sstas   [RFC2119]    Bradner, S., "Key words for use in RFCs to Indicate
1132226031Sstas                Requirement Levels", BCP 14, RFC 2119, March 1997.
1133226031Sstas
1134226031Sstas9.2 Informative References
1135226031Sstas
1136226031Sstas   [RFC952]     Harrenstien, K., Stahl, M. and E. Feinler, "DOD Internet
1137226031Sstas                Host Table Specification", RFC 952, October 1985.
1138226031Sstas
1139226031Sstas   [RFC1034]    Mockapetris, P., "Domain Names - Concepts and
1140226031Sstas                Facilities", STD 13, RFC 1034, November 1987.
1141226031Sstas
1142226031Sstas   [IDNA]       Faltstrom, P., Hoffman, P. and A. Costello,
1143226031Sstas                "Internationalizing Domain Names in Applications
1144226031Sstas                (IDNA)", RFC 3490, March 2003.
1145226031Sstas
1146226031Sstas   [NAMEPREP]   Hoffman, P. and  M. Blanchet, "Nameprep: A Stringprep
1147226031Sstas                Profile for Internationalized Domain Names (IDN)", RFC
1148226031Sstas                3491, March 2003.
1149226031Sstas
1150226031Sstas   [ASCII]      Cerf, V., "ASCII format for Network Interchange", RFC
1151226031Sstas                20, October 1969.
1152226031Sstas
1153226031Sstas   [PROVINCIAL] Kaplan, M., "The 'anyone can be provincial!' page",
1154226031Sstas                http://www.trigeminal.com/samples/provincial.html.
1155226031Sstas
1156226031Sstas   [UNICODE]    The Unicode Consortium, "The Unicode Standard",
1157226031Sstas                http://www.unicode.org/unicode/standard/standard.html.
1158226031Sstas
1159226031Sstas
1160226031Sstas
1161226031Sstas
1162226031Sstas
1163226031Sstas
1164226031Sstas
1165226031Sstas
1166226031Sstas
1167226031Sstas
1168226031Sstas
1169226031Sstas
1170226031Sstas
1171226031Sstas
1172226031Sstas
1173226031Sstas
1174226031Sstas
1175226031Sstas
1176226031Sstas
1177226031Sstas
1178226031SstasCostello                    Standards Track                    [Page 21]
1179226031Sstas
1180226031SstasRFC 3492                     IDNA Punycode                    March 2003
1181226031Sstas
1182226031Sstas
1183226031SstasA. Mixed-case annotation
1184226031Sstas
1185226031Sstas   In order to use Punycode to represent case-insensitive strings,
1186226031Sstas   higher layers need to case-fold the strings prior to Punycode
1187226031Sstas   encoding.  The encoded string can use mixed case as an annotation
1188226031Sstas   telling how to convert the folded string into a mixed-case string for
1189226031Sstas   display purposes.  Note, however, that mixed-case annotation is not
1190226031Sstas   used by the ToASCII and ToUnicode operations specified in [IDNA], and
1191226031Sstas   therefore implementors of IDNA can disregard this appendix.
1192226031Sstas
1193226031Sstas   Basic code points can use mixed case directly, because the decoder
1194226031Sstas   copies them verbatim, leaving lowercase code points lowercase, and
1195226031Sstas   leaving uppercase code points uppercase.  Each non-basic code point
1196226031Sstas   is represented by a delta, which is represented by a sequence of
1197226031Sstas   basic code points, the last of which provides the annotation.  If it
1198226031Sstas   is uppercase, it is a suggestion to map the non-basic code point to
1199226031Sstas   uppercase (if possible); if it is lowercase, it is a suggestion to
1200226031Sstas   map the non-basic code point to lowercase (if possible).
1201226031Sstas
1202226031Sstas   These annotations do not alter the code points returned by decoders;
1203226031Sstas   the annotations are returned separately, for the caller to use or
1204226031Sstas   ignore.  Encoders can accept annotations in addition to code points,
1205226031Sstas   but the annotations do not alter the output, except to influence the
1206226031Sstas   uppercase/lowercase form of ASCII letters.
1207226031Sstas
1208226031Sstas   Punycode encoders and decoders need not support these annotations,
1209226031Sstas   and higher layers need not use them.
1210226031Sstas
1211226031SstasB. Disclaimer and license
1212226031Sstas
1213226031Sstas   Regarding this entire document or any portion of it (including the
1214226031Sstas   pseudocode and C code), the author makes no guarantees and is not
1215226031Sstas   responsible for any damage resulting from its use.  The author grants
1216226031Sstas   irrevocable permission to anyone to use, modify, and distribute it in
1217226031Sstas   any way that does not diminish the rights of anyone else to use,
1218226031Sstas   modify, and distribute it, provided that redistributed derivative
1219226031Sstas   works do not contain misleading author or version information.
1220226031Sstas   Derivative works need not be licensed under similar terms.
1221226031Sstas
1222226031Sstas
1223226031Sstas
1224226031Sstas
1225226031Sstas
1226226031Sstas
1227226031Sstas
1228226031Sstas
1229226031Sstas
1230226031Sstas
1231226031Sstas
1232226031Sstas
1233226031Sstas
1234226031SstasCostello                    Standards Track                    [Page 22]
1235226031Sstas
1236226031SstasRFC 3492                     IDNA Punycode                    March 2003
1237226031Sstas
1238226031Sstas
1239226031SstasC. Punycode sample implementation
1240226031Sstas
1241226031Sstas/*
1242226031Sstaspunycode.c from RFC 3492
1243226031Sstashttp://www.nicemice.net/idn/
1244226031SstasAdam M. Costello
1245226031Sstashttp://www.nicemice.net/amc/
1246226031Sstas
1247226031SstasThis is ANSI C code (C89) implementing Punycode (RFC 3492).
1248226031Sstas
1249226031Sstas*/
1250226031Sstas
1251226031Sstas
1252226031Sstas/************************************************************/
1253226031Sstas/* Public interface (would normally go in its own .h file): */
1254226031Sstas
1255226031Sstas#include <limits.h>
1256226031Sstas
1257226031Sstasenum punycode_status {
1258226031Sstas  punycode_success,
1259226031Sstas  punycode_bad_input,   /* Input is invalid.                       */
1260226031Sstas  punycode_big_output,  /* Output would exceed the space provided. */
1261226031Sstas  punycode_overflow     /* Input needs wider integers to process.  */
1262226031Sstas};
1263226031Sstas
1264226031Sstas#if UINT_MAX >= (1 << 26) - 1
1265226031Sstastypedef unsigned int punycode_uint;
1266226031Sstas#else
1267226031Sstastypedef unsigned long punycode_uint;
1268226031Sstas#endif
1269226031Sstas
1270226031Sstasenum punycode_status punycode_encode(
1271226031Sstas  punycode_uint input_length,
1272226031Sstas  const punycode_uint input[],
1273226031Sstas  const unsigned char case_flags[],
1274226031Sstas  punycode_uint *output_length,
1275226031Sstas  char output[] );
1276226031Sstas
1277226031Sstas    /* punycode_encode() converts Unicode to Punycode.  The input     */
1278226031Sstas    /* is represented as an array of Unicode code points (not code    */
1279226031Sstas    /* units; surrogate pairs are not allowed), and the output        */
1280226031Sstas    /* will be represented as an array of ASCII code points.  The     */
1281226031Sstas    /* output string is *not* null-terminated; it will contain        */
1282226031Sstas    /* zeros if and only if the input contains zeros.  (Of course     */
1283226031Sstas    /* the caller can leave room for a terminator and add one if      */
1284226031Sstas    /* needed.)  The input_length is the number of code points in     */
1285226031Sstas    /* the input.  The output_length is an in/out argument: the       */
1286226031Sstas    /* caller passes in the maximum number of code points that it     */
1287226031Sstas
1288226031Sstas
1289226031Sstas
1290226031SstasCostello                    Standards Track                    [Page 23]
1291226031Sstas
1292226031SstasRFC 3492                     IDNA Punycode                    March 2003
1293226031Sstas
1294226031Sstas
1295226031Sstas    /* can receive, and on successful return it will contain the      */
1296226031Sstas    /* number of code points actually output.  The case_flags array   */
1297226031Sstas    /* holds input_length boolean values, where nonzero suggests that */
1298226031Sstas    /* the corresponding Unicode character be forced to uppercase     */
1299226031Sstas    /* after being decoded (if possible), and zero suggests that      */
1300226031Sstas    /* it be forced to lowercase (if possible).  ASCII code points    */
1301226031Sstas    /* are encoded literally, except that ASCII letters are forced    */
1302226031Sstas    /* to uppercase or lowercase according to the corresponding       */
1303226031Sstas    /* uppercase flags.  If case_flags is a null pointer then ASCII   */
1304226031Sstas    /* letters are left as they are, and other code points are        */
1305226031Sstas    /* treated as if their uppercase flags were zero.  The return     */
1306226031Sstas    /* value can be any of the punycode_status values defined above   */
1307226031Sstas    /* except punycode_bad_input; if not punycode_success, then       */
1308226031Sstas    /* output_size and output might contain garbage.                  */
1309226031Sstas
1310226031Sstasenum punycode_status punycode_decode(
1311226031Sstas  punycode_uint input_length,
1312226031Sstas  const char input[],
1313226031Sstas  punycode_uint *output_length,
1314226031Sstas  punycode_uint output[],
1315226031Sstas  unsigned char case_flags[] );
1316226031Sstas
1317226031Sstas    /* punycode_decode() converts Punycode to Unicode.  The input is  */
1318226031Sstas    /* represented as an array of ASCII code points, and the output   */
1319226031Sstas    /* will be represented as an array of Unicode code points.  The   */
1320226031Sstas    /* input_length is the number of code points in the input.  The   */
1321226031Sstas    /* output_length is an in/out argument: the caller passes in      */
1322226031Sstas    /* the maximum number of code points that it can receive, and     */
1323226031Sstas    /* on successful return it will contain the actual number of      */
1324226031Sstas    /* code points output.  The case_flags array needs room for at    */
1325226031Sstas    /* least output_length values, or it can be a null pointer if the */
1326226031Sstas    /* case information is not needed.  A nonzero flag suggests that  */
1327226031Sstas    /* the corresponding Unicode character be forced to uppercase     */
1328226031Sstas    /* by the caller (if possible), while zero suggests that it be    */
1329226031Sstas    /* forced to lowercase (if possible).  ASCII code points are      */
1330226031Sstas    /* output already in the proper case, but their flags will be set */
1331226031Sstas    /* appropriately so that applying the flags would be harmless.    */
1332226031Sstas    /* The return value can be any of the punycode_status values      */
1333226031Sstas    /* defined above; if not punycode_success, then output_length,    */
1334226031Sstas    /* output, and case_flags might contain garbage.  On success, the */
1335226031Sstas    /* decoder will never need to write an output_length greater than */
1336226031Sstas    /* input_length, because of how the encoding is defined.          */
1337226031Sstas
1338226031Sstas/**********************************************************/
1339226031Sstas/* Implementation (would normally go in its own .c file): */
1340226031Sstas
1341226031Sstas#include <string.h>
1342226031Sstas
1343226031Sstas
1344226031Sstas
1345226031Sstas
1346226031SstasCostello                    Standards Track                    [Page 24]
1347226031Sstas
1348226031SstasRFC 3492                     IDNA Punycode                    March 2003
1349226031Sstas
1350226031Sstas
1351226031Sstas/*** Bootstring parameters for Punycode ***/
1352226031Sstas
1353226031Sstasenum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700,
1354226031Sstas       initial_bias = 72, initial_n = 0x80, delimiter = 0x2D };
1355226031Sstas
1356226031Sstas/* basic(cp) tests whether cp is a basic code point: */
1357226031Sstas#define basic(cp) ((punycode_uint)(cp) < 0x80)
1358226031Sstas
1359226031Sstas/* delim(cp) tests whether cp is a delimiter: */
1360226031Sstas#define delim(cp) ((cp) == delimiter)
1361226031Sstas
1362226031Sstas/* decode_digit(cp) returns the numeric value of a basic code */
1363226031Sstas/* point (for use in representing integers) in the range 0 to */
1364226031Sstas/* base-1, or base if cp is does not represent a value.       */
1365226031Sstas
1366226031Sstasstatic punycode_uint decode_digit(punycode_uint cp)
1367226031Sstas{
1368226031Sstas  return  cp - 48 < 10 ? cp - 22 :  cp - 65 < 26 ? cp - 65 :
1369226031Sstas          cp - 97 < 26 ? cp - 97 :  base;
1370226031Sstas}
1371226031Sstas
1372226031Sstas/* encode_digit(d,flag) returns the basic code point whose value      */
1373226031Sstas/* (when used for representing integers) is d, which needs to be in   */
1374226031Sstas/* the range 0 to base-1.  The lowercase form is used unless flag is  */
1375226031Sstas/* nonzero, in which case the uppercase form is used.  The behavior   */
1376226031Sstas/* is undefined if flag is nonzero and digit d has no uppercase form. */
1377226031Sstas
1378226031Sstasstatic char encode_digit(punycode_uint d, int flag)
1379226031Sstas{
1380226031Sstas  return d + 22 + 75 * (d < 26) - ((flag != 0) << 5);
1381226031Sstas  /*  0..25 map to ASCII a..z or A..Z */
1382226031Sstas  /* 26..35 map to ASCII 0..9         */
1383226031Sstas}
1384226031Sstas
1385226031Sstas/* flagged(bcp) tests whether a basic code point is flagged */
1386226031Sstas/* (uppercase).  The behavior is undefined if bcp is not a  */
1387226031Sstas/* basic code point.                                        */
1388226031Sstas
1389226031Sstas#define flagged(bcp) ((punycode_uint)(bcp) - 65 < 26)
1390226031Sstas
1391226031Sstas/* encode_basic(bcp,flag) forces a basic code point to lowercase */
1392226031Sstas/* if flag is zero, uppercase if flag is nonzero, and returns    */
1393226031Sstas/* the resulting code point.  The code point is unchanged if it  */
1394226031Sstas/* is caseless.  The behavior is undefined if bcp is not a basic */
1395226031Sstas/* code point.                                                   */
1396226031Sstas
1397226031Sstasstatic char encode_basic(punycode_uint bcp, int flag)
1398226031Sstas{
1399226031Sstas
1400226031Sstas
1401226031Sstas
1402226031SstasCostello                    Standards Track                    [Page 25]
1403226031Sstas
1404226031SstasRFC 3492                     IDNA Punycode                    March 2003
1405226031Sstas
1406226031Sstas
1407226031Sstas  bcp -= (bcp - 97 < 26) << 5;
1408226031Sstas  return bcp + ((!flag && (bcp - 65 < 26)) << 5);
1409226031Sstas}
1410226031Sstas
1411226031Sstas/*** Platform-specific constants ***/
1412226031Sstas
1413226031Sstas/* maxint is the maximum value of a punycode_uint variable: */
1414226031Sstasstatic const punycode_uint maxint = -1;
1415226031Sstas/* Because maxint is unsigned, -1 becomes the maximum value. */
1416226031Sstas
1417226031Sstas/*** Bias adaptation function ***/
1418226031Sstas
1419226031Sstasstatic punycode_uint adapt(
1420226031Sstas  punycode_uint delta, punycode_uint numpoints, int firsttime )
1421226031Sstas{
1422226031Sstas  punycode_uint k;
1423226031Sstas
1424226031Sstas  delta = firsttime ? delta / damp : delta >> 1;
1425226031Sstas  /* delta >> 1 is a faster way of doing delta / 2 */
1426226031Sstas  delta += delta / numpoints;
1427226031Sstas
1428226031Sstas  for (k = 0;  delta > ((base - tmin) * tmax) / 2;  k += base) {
1429226031Sstas    delta /= base - tmin;
1430226031Sstas  }
1431226031Sstas
1432226031Sstas  return k + (base - tmin + 1) * delta / (delta + skew);
1433226031Sstas}
1434226031Sstas
1435226031Sstas/*** Main encode function ***/
1436226031Sstas
1437226031Sstasenum punycode_status punycode_encode(
1438226031Sstas  punycode_uint input_length,
1439226031Sstas  const punycode_uint input[],
1440226031Sstas  const unsigned char case_flags[],
1441226031Sstas  punycode_uint *output_length,
1442226031Sstas  char output[] )
1443226031Sstas{
1444226031Sstas  punycode_uint n, delta, h, b, out, max_out, bias, j, m, q, k, t;
1445226031Sstas
1446226031Sstas  /* Initialize the state: */
1447226031Sstas
1448226031Sstas  n = initial_n;
1449226031Sstas  delta = out = 0;
1450226031Sstas  max_out = *output_length;
1451226031Sstas  bias = initial_bias;
1452226031Sstas
1453226031Sstas  /* Handle the basic code points: */
1454226031Sstas
1455226031Sstas
1456226031Sstas
1457226031Sstas
1458226031SstasCostello                    Standards Track                    [Page 26]
1459226031Sstas
1460226031SstasRFC 3492                     IDNA Punycode                    March 2003
1461226031Sstas
1462226031Sstas
1463226031Sstas  for (j = 0;  j < input_length;  ++j) {
1464226031Sstas    if (basic(input[j])) {
1465226031Sstas      if (max_out - out < 2) return punycode_big_output;
1466226031Sstas      output[out++] =
1467226031Sstas        case_flags ?  encode_basic(input[j], case_flags[j]) : input[j];
1468226031Sstas    }
1469226031Sstas    /* else if (input[j] < n) return punycode_bad_input; */
1470226031Sstas    /* (not needed for Punycode with unsigned code points) */
1471226031Sstas  }
1472226031Sstas
1473226031Sstas  h = b = out;
1474226031Sstas
1475226031Sstas  /* h is the number of code points that have been handled, b is the  */
1476226031Sstas  /* number of basic code points, and out is the number of characters */
1477226031Sstas  /* that have been output.                                           */
1478226031Sstas
1479226031Sstas  if (b > 0) output[out++] = delimiter;
1480226031Sstas
1481226031Sstas  /* Main encoding loop: */
1482226031Sstas
1483226031Sstas  while (h < input_length) {
1484226031Sstas    /* All non-basic code points < n have been     */
1485226031Sstas    /* handled already.  Find the next larger one: */
1486226031Sstas
1487226031Sstas    for (m = maxint, j = 0;  j < input_length;  ++j) {
1488226031Sstas      /* if (basic(input[j])) continue; */
1489226031Sstas      /* (not needed for Punycode) */
1490226031Sstas      if (input[j] >= n && input[j] < m) m = input[j];
1491226031Sstas    }
1492226031Sstas
1493226031Sstas    /* Increase delta enough to advance the decoder's    */
1494226031Sstas    /* <n,i> state to <m,0>, but guard against overflow: */
1495226031Sstas
1496226031Sstas    if (m - n > (maxint - delta) / (h + 1)) return punycode_overflow;
1497226031Sstas    delta += (m - n) * (h + 1);
1498226031Sstas    n = m;
1499226031Sstas
1500226031Sstas    for (j = 0;  j < input_length;  ++j) {
1501226031Sstas      /* Punycode does not need to check whether input[j] is basic: */
1502226031Sstas      if (input[j] < n /* || basic(input[j]) */ ) {
1503226031Sstas        if (++delta == 0) return punycode_overflow;
1504226031Sstas      }
1505226031Sstas
1506226031Sstas      if (input[j] == n) {
1507226031Sstas        /* Represent delta as a generalized variable-length integer: */
1508226031Sstas
1509226031Sstas        for (q = delta, k = base;  ;  k += base) {
1510226031Sstas          if (out >= max_out) return punycode_big_output;
1511226031Sstas
1512226031Sstas
1513226031Sstas
1514226031SstasCostello                    Standards Track                    [Page 27]
1515226031Sstas
1516226031SstasRFC 3492                     IDNA Punycode                    March 2003
1517226031Sstas
1518226031Sstas
1519226031Sstas          t = k <= bias /* + tmin */ ? tmin :     /* +tmin not needed */
1520226031Sstas              k >= bias + tmax ? tmax : k - bias;
1521226031Sstas          if (q < t) break;
1522226031Sstas          output[out++] = encode_digit(t + (q - t) % (base - t), 0);
1523226031Sstas          q = (q - t) / (base - t);
1524226031Sstas        }
1525226031Sstas
1526226031Sstas        output[out++] = encode_digit(q, case_flags && case_flags[j]);
1527226031Sstas        bias = adapt(delta, h + 1, h == b);
1528226031Sstas        delta = 0;
1529226031Sstas        ++h;
1530226031Sstas      }
1531226031Sstas    }
1532226031Sstas
1533226031Sstas    ++delta, ++n;
1534226031Sstas  }
1535226031Sstas
1536226031Sstas  *output_length = out;
1537226031Sstas  return punycode_success;
1538226031Sstas}
1539226031Sstas
1540226031Sstas/*** Main decode function ***/
1541226031Sstas
1542226031Sstasenum punycode_status punycode_decode(
1543226031Sstas  punycode_uint input_length,
1544226031Sstas  const char input[],
1545226031Sstas  punycode_uint *output_length,
1546226031Sstas  punycode_uint output[],
1547226031Sstas  unsigned char case_flags[] )
1548226031Sstas{
1549226031Sstas  punycode_uint n, out, i, max_out, bias,
1550226031Sstas                 b, j, in, oldi, w, k, digit, t;
1551226031Sstas
1552226031Sstas  /* Initialize the state: */
1553226031Sstas
1554226031Sstas  n = initial_n;
1555226031Sstas  out = i = 0;
1556226031Sstas  max_out = *output_length;
1557226031Sstas  bias = initial_bias;
1558226031Sstas
1559226031Sstas  /* Handle the basic code points:  Let b be the number of input code */
1560226031Sstas  /* points before the last delimiter, or 0 if there is none, then    */
1561226031Sstas  /* copy the first b code points to the output.                      */
1562226031Sstas
1563226031Sstas  for (b = j = 0;  j < input_length;  ++j) if (delim(input[j])) b = j;
1564226031Sstas  if (b > max_out) return punycode_big_output;
1565226031Sstas
1566226031Sstas  for (j = 0;  j < b;  ++j) {
1567226031Sstas
1568226031Sstas
1569226031Sstas
1570226031SstasCostello                    Standards Track                    [Page 28]
1571226031Sstas
1572226031SstasRFC 3492                     IDNA Punycode                    March 2003
1573226031Sstas
1574226031Sstas
1575226031Sstas    if (case_flags) case_flags[out] = flagged(input[j]);
1576226031Sstas    if (!basic(input[j])) return punycode_bad_input;
1577226031Sstas    output[out++] = input[j];
1578226031Sstas  }
1579226031Sstas
1580226031Sstas  /* Main decoding loop:  Start just after the last delimiter if any  */
1581226031Sstas  /* basic code points were copied; start at the beginning otherwise. */
1582226031Sstas
1583226031Sstas  for (in = b > 0 ? b + 1 : 0;  in < input_length;  ++out) {
1584226031Sstas
1585226031Sstas    /* in is the index of the next character to be consumed, and */
1586226031Sstas    /* out is the number of code points in the output array.     */
1587226031Sstas
1588226031Sstas    /* Decode a generalized variable-length integer into delta,  */
1589226031Sstas    /* which gets added to i.  The overflow checking is easier   */
1590226031Sstas    /* if we increase i as we go, then subtract off its starting */
1591226031Sstas    /* value at the end to obtain delta.                         */
1592226031Sstas
1593226031Sstas    for (oldi = i, w = 1, k = base;  ;  k += base) {
1594226031Sstas      if (in >= input_length) return punycode_bad_input;
1595226031Sstas      digit = decode_digit(input[in++]);
1596226031Sstas      if (digit >= base) return punycode_bad_input;
1597226031Sstas      if (digit > (maxint - i) / w) return punycode_overflow;
1598226031Sstas      i += digit * w;
1599226031Sstas      t = k <= bias /* + tmin */ ? tmin :     /* +tmin not needed */
1600226031Sstas          k >= bias + tmax ? tmax : k - bias;
1601226031Sstas      if (digit < t) break;
1602226031Sstas      if (w > maxint / (base - t)) return punycode_overflow;
1603226031Sstas      w *= (base - t);
1604226031Sstas    }
1605226031Sstas
1606226031Sstas    bias = adapt(i - oldi, out + 1, oldi == 0);
1607226031Sstas
1608226031Sstas    /* i was supposed to wrap around from out+1 to 0,   */
1609226031Sstas    /* incrementing n each time, so we'll fix that now: */
1610226031Sstas
1611226031Sstas    if (i / (out + 1) > maxint - n) return punycode_overflow;
1612226031Sstas    n += i / (out + 1);
1613226031Sstas    i %= (out + 1);
1614226031Sstas
1615226031Sstas    /* Insert n at position i of the output: */
1616226031Sstas
1617226031Sstas    /* not needed for Punycode: */
1618226031Sstas    /* if (decode_digit(n) <= base) return punycode_invalid_input; */
1619226031Sstas    if (out >= max_out) return punycode_big_output;
1620226031Sstas
1621226031Sstas    if (case_flags) {
1622226031Sstas      memmove(case_flags + i + 1, case_flags + i, out - i);
1623226031Sstas
1624226031Sstas
1625226031Sstas
1626226031SstasCostello                    Standards Track                    [Page 29]
1627226031Sstas
1628226031SstasRFC 3492                     IDNA Punycode                    March 2003
1629226031Sstas
1630226031Sstas
1631226031Sstas      /* Case of last character determines uppercase flag: */
1632226031Sstas      case_flags[i] = flagged(input[in - 1]);
1633226031Sstas    }
1634226031Sstas
1635226031Sstas    memmove(output + i + 1, output + i, (out - i) * sizeof *output);
1636226031Sstas    output[i++] = n;
1637226031Sstas  }
1638226031Sstas
1639226031Sstas  *output_length = out;
1640226031Sstas  return punycode_success;
1641226031Sstas}
1642226031Sstas
1643226031Sstas/******************************************************************/
1644226031Sstas/* Wrapper for testing (would normally go in a separate .c file): */
1645226031Sstas
1646226031Sstas#include <assert.h>
1647226031Sstas#include <stdio.h>
1648226031Sstas#include <stdlib.h>
1649226031Sstas#include <string.h>
1650226031Sstas
1651226031Sstas/* For testing, we'll just set some compile-time limits rather than */
1652226031Sstas/* use malloc(), and set a compile-time option rather than using a  */
1653226031Sstas/* command-line option.                                             */
1654226031Sstas
1655226031Sstasenum {
1656226031Sstas  unicode_max_length = 256,
1657226031Sstas  ace_max_length = 256
1658226031Sstas};
1659226031Sstas
1660226031Sstasstatic void usage(char **argv)
1661226031Sstas{
1662226031Sstas  fprintf(stderr,
1663226031Sstas    "\n"
1664226031Sstas    "%s -e reads code points and writes a Punycode string.\n"
1665226031Sstas    "%s -d reads a Punycode string and writes code points.\n"
1666226031Sstas    "\n"
1667226031Sstas    "Input and output are plain text in the native character set.\n"
1668226031Sstas    "Code points are in the form u+hex separated by whitespace.\n"
1669226031Sstas    "Although the specification allows Punycode strings to contain\n"
1670226031Sstas    "any characters from the ASCII repertoire, this test code\n"
1671226031Sstas    "supports only the printable characters, and needs the Punycode\n"
1672226031Sstas    "string to be followed by a newline.\n"
1673226031Sstas    "The case of the u in u+hex is the force-to-uppercase flag.\n"
1674226031Sstas    , argv[0], argv[0]);
1675226031Sstas  exit(EXIT_FAILURE);
1676226031Sstas}
1677226031Sstas
1678226031Sstasstatic void fail(const char *msg)
1679226031Sstas
1680226031Sstas
1681226031Sstas
1682226031SstasCostello                    Standards Track                    [Page 30]
1683226031Sstas
1684226031SstasRFC 3492                     IDNA Punycode                    March 2003
1685226031Sstas
1686226031Sstas
1687226031Sstas{
1688226031Sstas  fputs(msg,stderr);
1689226031Sstas  exit(EXIT_FAILURE);
1690226031Sstas}
1691226031Sstas
1692226031Sstasstatic const char too_big[] =
1693226031Sstas  "input or output is too large, recompile with larger limits\n";
1694226031Sstasstatic const char invalid_input[] = "invalid input\n";
1695226031Sstasstatic const char overflow[] = "arithmetic overflow\n";
1696226031Sstasstatic const char io_error[] = "I/O error\n";
1697226031Sstas
1698226031Sstas/* The following string is used to convert printable */
1699226031Sstas/* characters between ASCII and the native charset:  */
1700226031Sstas
1701226031Sstasstatic const char print_ascii[] =
1702226031Sstas  "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
1703226031Sstas  "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
1704226031Sstas  " !\"#$%&'()*+,-./"
1705226031Sstas  "0123456789:;<=>?"
1706226031Sstas  "@ABCDEFGHIJKLMNO"
1707226031Sstas  "PQRSTUVWXYZ[\\]^_"
1708226031Sstas  "`abcdefghijklmno"
1709226031Sstas  "pqrstuvwxyz{|}~\n";
1710226031Sstas
1711226031Sstasint main(int argc, char **argv)
1712226031Sstas{
1713226031Sstas  enum punycode_status status;
1714226031Sstas  int r;
1715226031Sstas  unsigned int input_length, output_length, j;
1716226031Sstas  unsigned char case_flags[unicode_max_length];
1717226031Sstas
1718226031Sstas  if (argc != 2) usage(argv);
1719226031Sstas  if (argv[1][0] != '-') usage(argv);
1720226031Sstas  if (argv[1][2] != 0) usage(argv);
1721226031Sstas
1722226031Sstas  if (argv[1][1] == 'e') {
1723226031Sstas    punycode_uint input[unicode_max_length];
1724226031Sstas    unsigned long codept;
1725226031Sstas    char output[ace_max_length+1], uplus[3];
1726226031Sstas    int c;
1727226031Sstas
1728226031Sstas    /* Read the input code points: */
1729226031Sstas
1730226031Sstas    input_length = 0;
1731226031Sstas
1732226031Sstas    for (;;) {
1733226031Sstas      r = scanf("%2s%lx", uplus, &codept);
1734226031Sstas      if (ferror(stdin)) fail(io_error);
1735226031Sstas
1736226031Sstas
1737226031Sstas
1738226031SstasCostello                    Standards Track                    [Page 31]
1739226031Sstas
1740226031SstasRFC 3492                     IDNA Punycode                    March 2003
1741226031Sstas
1742226031Sstas
1743226031Sstas      if (r == EOF || r == 0) break;
1744226031Sstas
1745226031Sstas      if (r != 2 || uplus[1] != '+' || codept > (punycode_uint)-1) {
1746226031Sstas        fail(invalid_input);
1747226031Sstas      }
1748226031Sstas
1749226031Sstas      if (input_length == unicode_max_length) fail(too_big);
1750226031Sstas
1751226031Sstas      if (uplus[0] == 'u') case_flags[input_length] = 0;
1752226031Sstas      else if (uplus[0] == 'U') case_flags[input_length] = 1;
1753226031Sstas      else fail(invalid_input);
1754226031Sstas
1755226031Sstas      input[input_length++] = codept;
1756226031Sstas    }
1757226031Sstas
1758226031Sstas    /* Encode: */
1759226031Sstas
1760226031Sstas    output_length = ace_max_length;
1761226031Sstas    status = punycode_encode(input_length, input, case_flags,
1762226031Sstas                             &output_length, output);
1763226031Sstas    if (status == punycode_bad_input) fail(invalid_input);
1764226031Sstas    if (status == punycode_big_output) fail(too_big);
1765226031Sstas    if (status == punycode_overflow) fail(overflow);
1766226031Sstas    assert(status == punycode_success);
1767226031Sstas
1768226031Sstas    /* Convert to native charset and output: */
1769226031Sstas
1770226031Sstas    for (j = 0;  j < output_length;  ++j) {
1771226031Sstas      c = output[j];
1772226031Sstas      assert(c >= 0 && c <= 127);
1773226031Sstas      if (print_ascii[c] == 0) fail(invalid_input);
1774226031Sstas      output[j] = print_ascii[c];
1775226031Sstas    }
1776226031Sstas
1777226031Sstas    output[j] = 0;
1778226031Sstas    r = puts(output);
1779226031Sstas    if (r == EOF) fail(io_error);
1780226031Sstas    return EXIT_SUCCESS;
1781226031Sstas  }
1782226031Sstas
1783226031Sstas  if (argv[1][1] == 'd') {
1784226031Sstas    char input[ace_max_length+2], *p, *pp;
1785226031Sstas    punycode_uint output[unicode_max_length];
1786226031Sstas
1787226031Sstas    /* Read the Punycode input string and convert to ASCII: */
1788226031Sstas
1789226031Sstas    fgets(input, ace_max_length+2, stdin);
1790226031Sstas    if (ferror(stdin)) fail(io_error);
1791226031Sstas
1792226031Sstas
1793226031Sstas
1794226031SstasCostello                    Standards Track                    [Page 32]
1795226031Sstas
1796226031SstasRFC 3492                     IDNA Punycode                    March 2003
1797226031Sstas
1798226031Sstas
1799226031Sstas    if (feof(stdin)) fail(invalid_input);
1800226031Sstas    input_length = strlen(input) - 1;
1801226031Sstas    if (input[input_length] != '\n') fail(too_big);
1802226031Sstas    input[input_length] = 0;
1803226031Sstas
1804226031Sstas    for (p = input;  *p != 0;  ++p) {
1805226031Sstas      pp = strchr(print_ascii, *p);
1806226031Sstas      if (pp == 0) fail(invalid_input);
1807226031Sstas      *p = pp - print_ascii;
1808226031Sstas    }
1809226031Sstas
1810226031Sstas    /* Decode: */
1811226031Sstas
1812226031Sstas    output_length = unicode_max_length;
1813226031Sstas    status = punycode_decode(input_length, input, &output_length,
1814226031Sstas                             output, case_flags);
1815226031Sstas    if (status == punycode_bad_input) fail(invalid_input);
1816226031Sstas    if (status == punycode_big_output) fail(too_big);
1817226031Sstas    if (status == punycode_overflow) fail(overflow);
1818226031Sstas    assert(status == punycode_success);
1819226031Sstas
1820226031Sstas    /* Output the result: */
1821226031Sstas
1822226031Sstas    for (j = 0;  j < output_length;  ++j) {
1823226031Sstas      r = printf("%s+%04lX\n",
1824226031Sstas                 case_flags[j] ? "U" : "u",
1825226031Sstas                 (unsigned long) output[j] );
1826226031Sstas      if (r < 0) fail(io_error);
1827226031Sstas    }
1828226031Sstas
1829226031Sstas    return EXIT_SUCCESS;
1830226031Sstas  }
1831226031Sstas
1832226031Sstas  usage(argv);
1833226031Sstas  return EXIT_SUCCESS;  /* not reached, but quiets compiler warning */
1834226031Sstas}
1835226031Sstas
1836226031Sstas
1837226031Sstas
1838226031Sstas
1839226031Sstas
1840226031Sstas
1841226031Sstas
1842226031Sstas
1843226031Sstas
1844226031Sstas
1845226031Sstas
1846226031Sstas
1847226031Sstas
1848226031Sstas
1849226031Sstas
1850226031SstasCostello                    Standards Track                    [Page 33]
1851226031Sstas
1852226031SstasRFC 3492                     IDNA Punycode                    March 2003
1853226031Sstas
1854226031Sstas
1855226031SstasAuthor's Address
1856226031Sstas
1857226031Sstas   Adam M. Costello
1858226031Sstas   University of California, Berkeley
1859226031Sstas   http://www.nicemice.net/amc/
1860226031Sstas
1861226031Sstas
1862226031Sstas
1863226031Sstas
1864226031Sstas
1865226031Sstas
1866226031Sstas
1867226031Sstas
1868226031Sstas
1869226031Sstas
1870226031Sstas
1871226031Sstas
1872226031Sstas
1873226031Sstas
1874226031Sstas
1875226031Sstas
1876226031Sstas
1877226031Sstas
1878226031Sstas
1879226031Sstas
1880226031Sstas
1881226031Sstas
1882226031Sstas
1883226031Sstas
1884226031Sstas
1885226031Sstas
1886226031Sstas
1887226031Sstas
1888226031Sstas
1889226031Sstas
1890226031Sstas
1891226031Sstas
1892226031Sstas
1893226031Sstas
1894226031Sstas
1895226031Sstas
1896226031Sstas
1897226031Sstas
1898226031Sstas
1899226031Sstas
1900226031Sstas
1901226031Sstas
1902226031Sstas
1903226031Sstas
1904226031Sstas
1905226031Sstas
1906226031SstasCostello                    Standards Track                    [Page 34]
1907226031Sstas
1908226031SstasRFC 3492                     IDNA Punycode                    March 2003
1909226031Sstas
1910226031Sstas
1911226031SstasFull Copyright Statement
1912226031Sstas
1913226031Sstas   Copyright (C) The Internet Society (2003).  All Rights Reserved.
1914226031Sstas
1915226031Sstas   This document and translations of it may be copied and furnished to
1916226031Sstas   others, and derivative works that comment on or otherwise explain it
1917226031Sstas   or assist in its implementation may be prepared, copied, published
1918226031Sstas   and distributed, in whole or in part, without restriction of any
1919226031Sstas   kind, provided that the above copyright notice and this paragraph are
1920226031Sstas   included on all such copies and derivative works.  However, this
1921226031Sstas   document itself may not be modified in any way, such as by removing
1922226031Sstas   the copyright notice or references to the Internet Society or other
1923226031Sstas   Internet organizations, except as needed for the purpose of
1924226031Sstas   developing Internet standards in which case the procedures for
1925226031Sstas   copyrights defined in the Internet Standards process must be
1926226031Sstas   followed, or as required to translate it into languages other than
1927226031Sstas   English.
1928226031Sstas
1929226031Sstas   The limited permissions granted above are perpetual and will not be
1930226031Sstas   revoked by the Internet Society or its successors or assigns.
1931226031Sstas
1932226031Sstas   This document and the information contained herein is provided on an
1933226031Sstas   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
1934226031Sstas   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
1935226031Sstas   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
1936226031Sstas   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
1937226031Sstas   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
1938226031Sstas
1939226031SstasAcknowledgement
1940226031Sstas
1941226031Sstas   Funding for the RFC Editor function is currently provided by the
1942226031Sstas   Internet Society.
1943226031Sstas
1944226031Sstas
1945226031Sstas
1946226031Sstas
1947226031Sstas
1948226031Sstas
1949226031Sstas
1950226031Sstas
1951226031Sstas
1952226031Sstas
1953226031Sstas
1954226031Sstas
1955226031Sstas
1956226031Sstas
1957226031Sstas
1958226031Sstas
1959226031Sstas
1960226031Sstas
1961226031Sstas
1962226031SstasCostello                    Standards Track                    [Page 35]
1963226031Sstas
1964