1226031Sstas 2226031Sstas 3226031Sstas 4226031Sstas 5226031Sstas 6226031Sstas 7226031SstasNetwork Working Group A. Costello 8226031SstasRequest for Comments: 3492 Univ. of California, Berkeley 9226031SstasCategory: Standards Track March 2003 10226031Sstas 11226031Sstas 12226031Sstas Punycode: A Bootstring encoding of Unicode 13226031Sstas for Internationalized Domain Names in Applications (IDNA) 14226031Sstas 15226031SstasStatus of this Memo 16226031Sstas 17226031Sstas This document specifies an Internet standards track protocol for the 18226031Sstas Internet community, and requests discussion and suggestions for 19226031Sstas improvements. Please refer to the current edition of the "Internet 20226031Sstas Official Protocol Standards" (STD 1) for the standardization state 21226031Sstas and status of this protocol. Distribution of this memo is unlimited. 22226031Sstas 23226031SstasCopyright Notice 24226031Sstas 25226031Sstas Copyright (C) The Internet Society (2003). All Rights Reserved. 26226031Sstas 27226031SstasAbstract 28226031Sstas 29226031Sstas Punycode is a simple and efficient transfer encoding syntax designed 30226031Sstas for use with Internationalized Domain Names in Applications (IDNA). 31226031Sstas It uniquely and reversibly transforms a Unicode string into an ASCII 32226031Sstas string. ASCII characters in the Unicode string are represented 33226031Sstas literally, and non-ASCII characters are represented by ASCII 34226031Sstas characters that are allowed in host name labels (letters, digits, and 35226031Sstas hyphens). This document defines a general algorithm called 36226031Sstas Bootstring that allows a string of basic code points to uniquely 37226031Sstas represent any string of code points drawn from a larger set. 38226031Sstas Punycode is an instance of Bootstring that uses particular parameter 39226031Sstas values specified by this document, appropriate for IDNA. 40226031Sstas 41226031SstasTable of Contents 42226031Sstas 43226031Sstas 1. Introduction...............................................2 44226031Sstas 1.1 Features..............................................2 45226031Sstas 1.2 Interaction of protocol parts.........................3 46226031Sstas 2. Terminology................................................3 47226031Sstas 3. Bootstring description.....................................4 48226031Sstas 3.1 Basic code point segregation..........................4 49226031Sstas 3.2 Insertion unsort coding...............................4 50226031Sstas 3.3 Generalized variable-length integers..................5 51226031Sstas 3.4 Bias adaptation.......................................7 52226031Sstas 4. Bootstring parameters......................................8 53226031Sstas 5. Parameter values for Punycode..............................8 54226031Sstas 6. Bootstring algorithms......................................9 55226031Sstas 56226031Sstas 57226031Sstas 58226031SstasCostello Standards Track [Page 1] 59226031Sstas 60226031SstasRFC 3492 IDNA Punycode March 2003 61226031Sstas 62226031Sstas 63226031Sstas 6.1 Bias adaptation function.............................10 64226031Sstas 6.2 Decoding procedure...................................11 65226031Sstas 6.3 Encoding procedure...................................12 66226031Sstas 6.4 Overflow handling....................................13 67226031Sstas 7. Punycode examples.........................................14 68226031Sstas 7.1 Sample strings.......................................14 69226031Sstas 7.2 Decoding traces......................................17 70226031Sstas 7.3 Encoding traces......................................19 71226031Sstas 8. Security Considerations...................................20 72226031Sstas 9. References................................................21 73226031Sstas 9.1 Normative References.................................21 74226031Sstas 9.2 Informative References...............................21 75226031Sstas A. Mixed-case annotation.....................................22 76226031Sstas B. Disclaimer and license....................................22 77226031Sstas C. Punycode sample implementation............................23 78226031Sstas Author's Address.............................................34 79226031Sstas Full Copyright Statement.....................................35 80226031Sstas 81226031Sstas1. Introduction 82226031Sstas 83226031Sstas [IDNA] describes an architecture for supporting internationalized 84226031Sstas domain names. Labels containing non-ASCII characters can be 85226031Sstas represented by ACE labels, which begin with a special ACE prefix and 86226031Sstas contain only ASCII characters. The remainder of the label after the 87226031Sstas prefix is a Punycode encoding of a Unicode string satisfying certain 88226031Sstas constraints. For the details of the prefix and constraints, see 89226031Sstas [IDNA] and [NAMEPREP]. 90226031Sstas 91226031Sstas Punycode is an instance of a more general algorithm called 92226031Sstas Bootstring, which allows strings composed from a small set of "basic" 93226031Sstas code points to uniquely represent any string of code points drawn 94226031Sstas from a larger set. Punycode is Bootstring with particular parameter 95226031Sstas values appropriate for IDNA. 96226031Sstas 97226031Sstas1.1 Features 98226031Sstas 99226031Sstas Bootstring has been designed to have the following features: 100226031Sstas 101226031Sstas * Completeness: Every extended string (sequence of arbitrary code 102226031Sstas points) can be represented by a basic string (sequence of basic 103226031Sstas code points). Restrictions on what strings are allowed, and on 104226031Sstas length, can be imposed by higher layers. 105226031Sstas 106226031Sstas * Uniqueness: There is at most one basic string that represents a 107226031Sstas given extended string. 108226031Sstas 109226031Sstas * Reversibility: Any extended string mapped to a basic string can 110226031Sstas be recovered from that basic string. 111226031Sstas 112226031Sstas 113226031Sstas 114226031SstasCostello Standards Track [Page 2] 115226031Sstas 116226031SstasRFC 3492 IDNA Punycode March 2003 117226031Sstas 118226031Sstas 119226031Sstas * Efficient encoding: The ratio of basic string length to extended 120226031Sstas string length is small. This is important in the context of 121226031Sstas domain names because RFC 1034 [RFC1034] restricts the length of a 122226031Sstas domain label to 63 characters. 123226031Sstas 124226031Sstas * Simplicity: The encoding and decoding algorithms are reasonably 125226031Sstas simple to implement. The goals of efficiency and simplicity are 126226031Sstas at odds; Bootstring aims at a good balance between them. 127226031Sstas 128226031Sstas * Readability: Basic code points appearing in the extended string 129226031Sstas are represented as themselves in the basic string (although the 130226031Sstas main purpose is to improve efficiency, not readability). 131226031Sstas 132226031Sstas Punycode can also support an additional feature that is not used by 133226031Sstas the ToASCII and ToUnicode operations of [IDNA]. When extended 134226031Sstas strings are case-folded prior to encoding, the basic string can use 135226031Sstas mixed case to tell how to convert the folded string into a mixed-case 136226031Sstas string. See appendix A "Mixed-case annotation". 137226031Sstas 138226031Sstas1.2 Interaction of protocol parts 139226031Sstas 140226031Sstas Punycode is used by the IDNA protocol [IDNA] for converting domain 141226031Sstas labels into ASCII; it is not designed for any other purpose. It is 142226031Sstas explicitly not designed for processing arbitrary free text. 143226031Sstas 144226031Sstas2. Terminology 145226031Sstas 146226031Sstas The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 147226031Sstas "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 148226031Sstas document are to be interpreted as described in BCP 14, RFC 2119 149226031Sstas [RFC2119]. 150226031Sstas 151226031Sstas A code point is an integral value associated with a character in a 152226031Sstas coded character set. 153226031Sstas 154226031Sstas As in the Unicode Standard [UNICODE], Unicode code points are denoted 155226031Sstas by "U+" followed by four to six hexadecimal digits, while a range of 156226031Sstas code points is denoted by two hexadecimal numbers separated by "..", 157226031Sstas with no prefixes. 158226031Sstas 159226031Sstas The operators div and mod perform integer division; (x div y) is the 160226031Sstas quotient of x divided by y, discarding the remainder, and (x mod y) 161226031Sstas is the remainder, so (x div y) * y + (x mod y) == x. Bootstring uses 162226031Sstas these operators only with nonnegative operands, so the quotient and 163226031Sstas remainder are always nonnegative. 164226031Sstas 165226031Sstas The break statement jumps out of the innermost loop (as in C). 166226031Sstas 167226031Sstas 168226031Sstas 169226031Sstas 170226031SstasCostello Standards Track [Page 3] 171226031Sstas 172226031SstasRFC 3492 IDNA Punycode March 2003 173226031Sstas 174226031Sstas 175226031Sstas An overflow is an attempt to compute a value that exceeds the maximum 176226031Sstas value of an integer variable. 177226031Sstas 178226031Sstas3. Bootstring description 179226031Sstas 180226031Sstas Bootstring represents an arbitrary sequence of code points (the 181226031Sstas "extended string") as a sequence of basic code points (the "basic 182226031Sstas string"). This section describes the representation. Section 6 183226031Sstas "Bootstring algorithms" presents the algorithms as pseudocode. 184226031Sstas Sections 7.1 "Decoding traces" and 7.2 "Encoding traces" trace the 185226031Sstas algorithms for sample inputs. 186226031Sstas 187226031Sstas The following sections describe the four techniques used in 188226031Sstas Bootstring. "Basic code point segregation" is a very simple and 189226031Sstas efficient encoding for basic code points occurring in the extended 190226031Sstas string: they are simply copied all at once. "Insertion unsort 191226031Sstas coding" encodes the non-basic code points as deltas, and processes 192226031Sstas the code points in numerical order rather than in order of 193226031Sstas appearance, which typically results in smaller deltas. The deltas 194226031Sstas are represented as "generalized variable-length integers", which use 195226031Sstas basic code points to represent nonnegative integers. The parameters 196226031Sstas of this integer representation are dynamically adjusted using "bias 197226031Sstas adaptation", to improve efficiency when consecutive deltas have 198226031Sstas similar magnitudes. 199226031Sstas 200226031Sstas3.1 Basic code point segregation 201226031Sstas 202226031Sstas All basic code points appearing in the extended string are 203226031Sstas represented literally at the beginning of the basic string, in their 204226031Sstas original order, followed by a delimiter if (and only if) the number 205226031Sstas of basic code points is nonzero. The delimiter is a particular basic 206226031Sstas code point, which never appears in the remainder of the basic string. 207226031Sstas The decoder can therefore find the end of the literal portion (if 208226031Sstas there is one) by scanning for the last delimiter. 209226031Sstas 210226031Sstas3.2 Insertion unsort coding 211226031Sstas 212226031Sstas The remainder of the basic string (after the last delimiter if there 213226031Sstas is one) represents a sequence of nonnegative integral deltas as 214226031Sstas generalized variable-length integers, described in section 3.3. The 215226031Sstas meaning of the deltas is best understood in terms of the decoder. 216226031Sstas 217226031Sstas The decoder builds the extended string incrementally. Initially, the 218226031Sstas extended string is a copy of the literal portion of the basic string 219226031Sstas (excluding the last delimiter). The decoder inserts non-basic code 220226031Sstas points, one for each delta, into the extended string, ultimately 221226031Sstas arriving at the final decoded string. 222226031Sstas 223226031Sstas 224226031Sstas 225226031Sstas 226226031SstasCostello Standards Track [Page 4] 227226031Sstas 228226031SstasRFC 3492 IDNA Punycode March 2003 229226031Sstas 230226031Sstas 231226031Sstas At the heart of this process is a state machine with two state 232226031Sstas variables: an index i and a counter n. The index i refers to a 233226031Sstas position in the extended string; it ranges from 0 (the first 234226031Sstas position) to the current length of the extended string (which refers 235226031Sstas to a potential position beyond the current end). If the current 236226031Sstas state is <n,i>, the next state is <n,i+1> if i is less than the 237226031Sstas length of the extended string, or <n+1,0> if i equals the length of 238226031Sstas the extended string. In other words, each state change causes i to 239226031Sstas increment, wrapping around to zero if necessary, and n counts the 240226031Sstas number of wrap-arounds. 241226031Sstas 242226031Sstas Notice that the state always advances monotonically (there is no way 243226031Sstas for the decoder to return to an earlier state). At each state, an 244226031Sstas insertion is either performed or not performed. At most one 245226031Sstas insertion is performed in a given state. An insertion inserts the 246226031Sstas value of n at position i in the extended string. The deltas are a 247226031Sstas run-length encoding of this sequence of events: they are the lengths 248226031Sstas of the runs of non-insertion states preceeding the insertion states. 249226031Sstas Hence, for each delta, the decoder performs delta state changes, then 250226031Sstas an insertion, and then one more state change. (An implementation 251226031Sstas need not perform each state change individually, but can instead use 252226031Sstas division and remainder calculations to compute the next insertion 253226031Sstas state directly.) It is an error if the inserted code point is a 254226031Sstas basic code point (because basic code points were supposed to be 255226031Sstas segregated as described in section 3.1). 256226031Sstas 257226031Sstas The encoder's main task is to derive the sequence of deltas that will 258226031Sstas cause the decoder to construct the desired string. It can do this by 259226031Sstas repeatedly scanning the extended string for the next code point that 260226031Sstas the decoder would need to insert, and counting the number of state 261226031Sstas changes the decoder would need to perform, mindful of the fact that 262226031Sstas the decoder's extended string will include only those code points 263226031Sstas that have already been inserted. Section 6.3 "Encoding procedure" 264226031Sstas gives a precise algorithm. 265226031Sstas 266226031Sstas3.3 Generalized variable-length integers 267226031Sstas 268226031Sstas In a conventional integer representation the base is the number of 269226031Sstas distinct symbols for digits, whose values are 0 through base-1. Let 270226031Sstas digit_0 denote the least significant digit, digit_1 the next least 271226031Sstas significant, and so on. The value represented is the sum over j of 272226031Sstas digit_j * w(j), where w(j) = base^j is the weight (scale factor) for 273226031Sstas position j. For example, in the base 8 integer 437, the digits are 274226031Sstas 7, 3, and 4, and the weights are 1, 8, and 64, so the value is 7 + 275226031Sstas 3*8 + 4*64 = 287. This representation has two disadvantages: First, 276226031Sstas there are multiple encodings of each value (because there can be 277226031Sstas extra zeros in the most significant positions), which is inconvenient 278226031Sstas 279226031Sstas 280226031Sstas 281226031Sstas 282226031SstasCostello Standards Track [Page 5] 283226031Sstas 284226031SstasRFC 3492 IDNA Punycode March 2003 285226031Sstas 286226031Sstas 287226031Sstas when unique encodings are needed. Second, the integer is not self- 288226031Sstas delimiting, so if multiple integers are concatenated the boundaries 289226031Sstas between them are lost. 290226031Sstas 291226031Sstas The generalized variable-length representation solves these two 292226031Sstas problems. The digit values are still 0 through base-1, but now the 293226031Sstas integer is self-delimiting by means of thresholds t(j), each of which 294226031Sstas is in the range 0 through base-1. Exactly one digit, the most 295226031Sstas significant, satisfies digit_j < t(j). Therefore, if several 296226031Sstas integers are concatenated, it is easy to separate them, starting with 297226031Sstas the first if they are little-endian (least significant digit first), 298226031Sstas or starting with the last if they are big-endian (most significant 299226031Sstas digit first). As before, the value is the sum over j of digit_j * 300226031Sstas w(j), but the weights are different: 301226031Sstas 302226031Sstas w(0) = 1 303226031Sstas w(j) = w(j-1) * (base - t(j-1)) for j > 0 304226031Sstas 305226031Sstas For example, consider the little-endian sequence of base 8 digits 306226031Sstas 734251... Suppose the thresholds are 2, 3, 5, 5, 5, 5... This 307226031Sstas implies that the weights are 1, 1*(8-2) = 6, 6*(8-3) = 30, 30*(8-5) = 308226031Sstas 90, 90*(8-5) = 270, and so on. 7 is not less than 2, and 3 is not 309226031Sstas less than 3, but 4 is less than 5, so 4 is the last digit. The value 310226031Sstas of 734 is 7*1 + 3*6 + 4*30 = 145. The next integer is 251, with 311226031Sstas value 2*1 + 5*6 + 1*30 = 62. Decoding this representation is very 312226031Sstas similar to decoding a conventional integer: Start with a current 313226031Sstas value of N = 0 and a weight w = 1. Fetch the next digit d and 314226031Sstas increase N by d * w. If d is less than the current threshold (t) 315226031Sstas then stop, otherwise increase w by a factor of (base - t), update t 316226031Sstas for the next position, and repeat. 317226031Sstas 318226031Sstas Encoding this representation is similar to encoding a conventional 319226031Sstas integer: If N < t then output one digit for N and stop, otherwise 320226031Sstas output the digit for t + ((N - t) mod (base - t)), then replace N 321226031Sstas with (N - t) div (base - t), update t for the next position, and 322226031Sstas repeat. 323226031Sstas 324226031Sstas For any particular set of values of t(j), there is exactly one 325226031Sstas generalized variable-length representation of each nonnegative 326226031Sstas integral value. 327226031Sstas 328226031Sstas Bootstring uses little-endian ordering so that the deltas can be 329226031Sstas separated starting with the first. The t(j) values are defined in 330226031Sstas terms of the constants base, tmin, and tmax, and a state variable 331226031Sstas called bias: 332226031Sstas 333226031Sstas t(j) = base * (j + 1) - bias, 334226031Sstas clamped to the range tmin through tmax 335226031Sstas 336226031Sstas 337226031Sstas 338226031SstasCostello Standards Track [Page 6] 339226031Sstas 340226031SstasRFC 3492 IDNA Punycode March 2003 341226031Sstas 342226031Sstas 343226031Sstas The clamping means that if the formula yields a value less than tmin 344226031Sstas or greater than tmax, then t(j) = tmin or tmax, respectively. (In 345226031Sstas the pseudocode in section 6 "Bootstring algorithms", the expression 346226031Sstas base * (j + 1) is denoted by k for performance reasons.) These t(j) 347226031Sstas values cause the representation to favor integers within a particular 348226031Sstas range determined by the bias. 349226031Sstas 350226031Sstas3.4 Bias adaptation 351226031Sstas 352226031Sstas After each delta is encoded or decoded, bias is set for the next 353226031Sstas delta as follows: 354226031Sstas 355226031Sstas 1. Delta is scaled in order to avoid overflow in the next step: 356226031Sstas 357226031Sstas let delta = delta div 2 358226031Sstas 359226031Sstas But when this is the very first delta, the divisor is not 2, but 360226031Sstas instead a constant called damp. This compensates for the fact 361226031Sstas that the second delta is usually much smaller than the first. 362226031Sstas 363226031Sstas 2. Delta is increased to compensate for the fact that the next delta 364226031Sstas will be inserting into a longer string: 365226031Sstas 366226031Sstas let delta = delta + (delta div numpoints) 367226031Sstas 368226031Sstas numpoints is the total number of code points encoded/decoded so 369226031Sstas far (including the one corresponding to this delta itself, and 370226031Sstas including the basic code points). 371226031Sstas 372226031Sstas 3. Delta is repeatedly divided until it falls within a threshold, to 373226031Sstas predict the minimum number of digits needed to represent the next 374226031Sstas delta: 375226031Sstas 376226031Sstas while delta > ((base - tmin) * tmax) div 2 377226031Sstas do let delta = delta div (base - tmin) 378226031Sstas 379226031Sstas 4. The bias is set: 380226031Sstas 381226031Sstas let bias = 382226031Sstas (base * the number of divisions performed in step 3) + 383226031Sstas (((base - tmin + 1) * delta) div (delta + skew)) 384226031Sstas 385226031Sstas The motivation for this procedure is that the current delta 386226031Sstas provides a hint about the likely size of the next delta, and so 387226031Sstas t(j) is set to tmax for the more significant digits starting with 388226031Sstas the one expected to be last, tmin for the less significant digits 389226031Sstas up through the one expected to be third-last, and somewhere 390226031Sstas between tmin and tmax for the digit expected to be second-last 391226031Sstas 392226031Sstas 393226031Sstas 394226031SstasCostello Standards Track [Page 7] 395226031Sstas 396226031SstasRFC 3492 IDNA Punycode March 2003 397226031Sstas 398226031Sstas 399226031Sstas (balancing the hope of the expected-last digit being unnecessary 400226031Sstas against the danger of it being insufficient). 401226031Sstas 402226031Sstas4. Bootstring parameters 403226031Sstas 404226031Sstas Given a set of basic code points, one needs to be designated as the 405226031Sstas delimiter. The base cannot be greater than the number of 406226031Sstas distinguishable basic code points remaining. The digit-values in the 407226031Sstas range 0 through base-1 need to be associated with distinct non- 408226031Sstas delimiter basic code points. In some cases multiple code points need 409226031Sstas to have the same digit-value; for example, uppercase and lowercase 410226031Sstas versions of the same letter need to be equivalent if basic strings 411226031Sstas are case-insensitive. 412226031Sstas 413226031Sstas The initial value of n cannot be greater than the minimum non-basic 414226031Sstas code point that could appear in extended strings. 415226031Sstas 416226031Sstas The remaining five parameters (tmin, tmax, skew, damp, and the 417226031Sstas initial value of bias) need to satisfy the following constraints: 418226031Sstas 419226031Sstas 0 <= tmin <= tmax <= base-1 420226031Sstas skew >= 1 421226031Sstas damp >= 2 422226031Sstas initial_bias mod base <= base - tmin 423226031Sstas 424226031Sstas Provided the constraints are satisfied, these five parameters affect 425226031Sstas efficiency but not correctness. They are best chosen empirically. 426226031Sstas 427226031Sstas If support for mixed-case annotation is desired (see appendix A), 428226031Sstas make sure that the code points corresponding to 0 through tmax-1 all 429226031Sstas have both uppercase and lowercase forms. 430226031Sstas 431226031Sstas5. Parameter values for Punycode 432226031Sstas 433226031Sstas Punycode uses the following Bootstring parameter values: 434226031Sstas 435226031Sstas base = 36 436226031Sstas tmin = 1 437226031Sstas tmax = 26 438226031Sstas skew = 38 439226031Sstas damp = 700 440226031Sstas initial_bias = 72 441226031Sstas initial_n = 128 = 0x80 442226031Sstas 443226031Sstas Although the only restriction Punycode imposes on the input integers 444226031Sstas is that they be nonnegative, these parameters are especially designed 445226031Sstas to work well with Unicode [UNICODE] code points, which are integers 446226031Sstas in the range 0..10FFFF (but not D800..DFFF, which are reserved for 447226031Sstas 448226031Sstas 449226031Sstas 450226031SstasCostello Standards Track [Page 8] 451226031Sstas 452226031SstasRFC 3492 IDNA Punycode March 2003 453226031Sstas 454226031Sstas 455226031Sstas use by the UTF-16 encoding of Unicode). The basic code points are 456226031Sstas the ASCII [ASCII] code points (0..7F), of which U+002D (-) is the 457226031Sstas delimiter, and some of the others have digit-values as follows: 458226031Sstas 459226031Sstas code points digit-values 460226031Sstas ------------ ---------------------- 461226031Sstas 41..5A (A-Z) = 0 to 25, respectively 462226031Sstas 61..7A (a-z) = 0 to 25, respectively 463226031Sstas 30..39 (0-9) = 26 to 35, respectively 464226031Sstas 465226031Sstas Using hyphen-minus as the delimiter implies that the encoded string 466226031Sstas can end with a hyphen-minus only if the Unicode string consists 467226031Sstas entirely of basic code points, but IDNA forbids such strings from 468226031Sstas being encoded. The encoded string can begin with a hyphen-minus, but 469226031Sstas IDNA prepends a prefix. Therefore IDNA using Punycode conforms to 470226031Sstas the RFC 952 rule that host name labels neither begin nor end with a 471226031Sstas hyphen-minus [RFC952]. 472226031Sstas 473226031Sstas A decoder MUST recognize the letters in both uppercase and lowercase 474226031Sstas forms (including mixtures of both forms). An encoder SHOULD output 475226031Sstas only uppercase forms or only lowercase forms, unless it uses mixed- 476226031Sstas case annotation (see appendix A). 477226031Sstas 478226031Sstas Presumably most users will not manually write or type encoded strings 479226031Sstas (as opposed to cutting and pasting them), but those who do will need 480226031Sstas to be alert to the potential visual ambiguity between the following 481226031Sstas sets of characters: 482226031Sstas 483226031Sstas G 6 484226031Sstas I l 1 485226031Sstas O 0 486226031Sstas S 5 487226031Sstas U V 488226031Sstas Z 2 489226031Sstas 490226031Sstas Such ambiguities are usually resolved by context, but in a Punycode 491226031Sstas encoded string there is no context apparent to humans. 492226031Sstas 493226031Sstas6. Bootstring algorithms 494226031Sstas 495226031Sstas Some parts of the pseudocode can be omitted if the parameters satisfy 496226031Sstas certain conditions (for which Punycode qualifies). These parts are 497226031Sstas enclosed in {braces}, and notes immediately following the pseudocode 498226031Sstas explain the conditions under which they can be omitted. 499226031Sstas 500226031Sstas 501226031Sstas 502226031Sstas 503226031Sstas 504226031Sstas 505226031Sstas 506226031SstasCostello Standards Track [Page 9] 507226031Sstas 508226031SstasRFC 3492 IDNA Punycode March 2003 509226031Sstas 510226031Sstas 511226031Sstas Formally, code points are integers, and hence the pseudocode assumes 512226031Sstas that arithmetic operations can be performed directly on code points. 513226031Sstas In some programming languages, explicit conversion between code 514226031Sstas points and integers might be necessary. 515226031Sstas 516226031Sstas6.1 Bias adaptation function 517226031Sstas 518226031Sstas function adapt(delta,numpoints,firsttime): 519226031Sstas if firsttime then let delta = delta div damp 520226031Sstas else let delta = delta div 2 521226031Sstas let delta = delta + (delta div numpoints) 522226031Sstas let k = 0 523226031Sstas while delta > ((base - tmin) * tmax) div 2 do begin 524226031Sstas let delta = delta div (base - tmin) 525226031Sstas let k = k + base 526226031Sstas end 527226031Sstas return k + (((base - tmin + 1) * delta) div (delta + skew)) 528226031Sstas 529226031Sstas It does not matter whether the modifications to delta and k inside 530226031Sstas adapt() affect variables of the same name inside the 531226031Sstas encoding/decoding procedures, because after calling adapt() the 532226031Sstas caller does not read those variables before overwriting them. 533226031Sstas 534226031Sstas 535226031Sstas 536226031Sstas 537226031Sstas 538226031Sstas 539226031Sstas 540226031Sstas 541226031Sstas 542226031Sstas 543226031Sstas 544226031Sstas 545226031Sstas 546226031Sstas 547226031Sstas 548226031Sstas 549226031Sstas 550226031Sstas 551226031Sstas 552226031Sstas 553226031Sstas 554226031Sstas 555226031Sstas 556226031Sstas 557226031Sstas 558226031Sstas 559226031Sstas 560226031Sstas 561226031Sstas 562226031SstasCostello Standards Track [Page 10] 563226031Sstas 564226031SstasRFC 3492 IDNA Punycode March 2003 565226031Sstas 566226031Sstas 567226031Sstas6.2 Decoding procedure 568226031Sstas 569226031Sstas let n = initial_n 570226031Sstas let i = 0 571226031Sstas let bias = initial_bias 572226031Sstas let output = an empty string indexed from 0 573226031Sstas consume all code points before the last delimiter (if there is one) 574226031Sstas and copy them to output, fail on any non-basic code point 575226031Sstas if more than zero code points were consumed then consume one more 576226031Sstas (which will be the last delimiter) 577226031Sstas while the input is not exhausted do begin 578226031Sstas let oldi = i 579226031Sstas let w = 1 580226031Sstas for k = base to infinity in steps of base do begin 581226031Sstas consume a code point, or fail if there was none to consume 582226031Sstas let digit = the code point's digit-value, fail if it has none 583226031Sstas let i = i + digit * w, fail on overflow 584226031Sstas let t = tmin if k <= bias {+ tmin}, or 585226031Sstas tmax if k >= bias + tmax, or k - bias otherwise 586226031Sstas if digit < t then break 587226031Sstas let w = w * (base - t), fail on overflow 588226031Sstas end 589226031Sstas let bias = adapt(i - oldi, length(output) + 1, test oldi is 0?) 590226031Sstas let n = n + i div (length(output) + 1), fail on overflow 591226031Sstas let i = i mod (length(output) + 1) 592226031Sstas {if n is a basic code point then fail} 593226031Sstas insert n into output at position i 594226031Sstas increment i 595226031Sstas end 596226031Sstas 597226031Sstas The full statement enclosed in braces (checking whether n is a basic 598226031Sstas code point) can be omitted if initial_n exceeds all basic code points 599226031Sstas (which is true for Punycode), because n is never less than initial_n. 600226031Sstas 601226031Sstas In the assignment of t, where t is clamped to the range tmin through 602226031Sstas tmax, "+ tmin" can always be omitted. This makes the clamping 603226031Sstas calculation incorrect when bias < k < bias + tmin, but that cannot 604226031Sstas happen because of the way bias is computed and because of the 605226031Sstas constraints on the parameters. 606226031Sstas 607226031Sstas Because the decoder state can only advance monotonically, and there 608226031Sstas is only one representation of any delta, there is therefore only one 609226031Sstas encoded string that can represent a given sequence of integers. The 610226031Sstas only error conditions are invalid code points, unexpected end-of- 611226031Sstas input, overflow, and basic code points encoded using deltas instead 612226031Sstas of appearing literally. If the decoder fails on these errors as 613226031Sstas shown above, then it cannot produce the same output for two distinct 614226031Sstas inputs. Without this property it would have been necessary to re- 615226031Sstas 616226031Sstas 617226031Sstas 618226031SstasCostello Standards Track [Page 11] 619226031Sstas 620226031SstasRFC 3492 IDNA Punycode March 2003 621226031Sstas 622226031Sstas 623226031Sstas encode the output and verify that it matches the input in order to 624226031Sstas guarantee the uniqueness of the encoding. 625226031Sstas 626226031Sstas6.3 Encoding procedure 627226031Sstas 628226031Sstas let n = initial_n 629226031Sstas let delta = 0 630226031Sstas let bias = initial_bias 631226031Sstas let h = b = the number of basic code points in the input 632226031Sstas copy them to the output in order, followed by a delimiter if b > 0 633226031Sstas {if the input contains a non-basic code point < n then fail} 634226031Sstas while h < length(input) do begin 635226031Sstas let m = the minimum {non-basic} code point >= n in the input 636226031Sstas let delta = delta + (m - n) * (h + 1), fail on overflow 637226031Sstas let n = m 638226031Sstas for each code point c in the input (in order) do begin 639226031Sstas if c < n {or c is basic} then increment delta, fail on overflow 640226031Sstas if c == n then begin 641226031Sstas let q = delta 642226031Sstas for k = base to infinity in steps of base do begin 643226031Sstas let t = tmin if k <= bias {+ tmin}, or 644226031Sstas tmax if k >= bias + tmax, or k - bias otherwise 645226031Sstas if q < t then break 646226031Sstas output the code point for digit t + ((q - t) mod (base - t)) 647226031Sstas let q = (q - t) div (base - t) 648226031Sstas end 649226031Sstas output the code point for digit q 650226031Sstas let bias = adapt(delta, h + 1, test h equals b?) 651226031Sstas let delta = 0 652226031Sstas increment h 653226031Sstas end 654226031Sstas end 655226031Sstas increment delta and n 656226031Sstas end 657226031Sstas 658226031Sstas The full statement enclosed in braces (checking whether the input 659226031Sstas contains a non-basic code point less than n) can be omitted if all 660226031Sstas code points less than initial_n are basic code points (which is true 661226031Sstas for Punycode if code points are unsigned). 662226031Sstas 663226031Sstas The brace-enclosed conditions "non-basic" and "or c is basic" can be 664226031Sstas omitted if initial_n exceeds all basic code points (which is true for 665226031Sstas Punycode), because the code point being tested is never less than 666226031Sstas initial_n. 667226031Sstas 668226031Sstas In the assignment of t, where t is clamped to the range tmin through 669226031Sstas tmax, "+ tmin" can always be omitted. This makes the clamping 670226031Sstas calculation incorrect when bias < k < bias + tmin, but that cannot 671226031Sstas 672226031Sstas 673226031Sstas 674226031SstasCostello Standards Track [Page 12] 675226031Sstas 676226031SstasRFC 3492 IDNA Punycode March 2003 677226031Sstas 678226031Sstas 679226031Sstas happen because of the way bias is computed and because of the 680226031Sstas constraints on the parameters. 681226031Sstas 682226031Sstas The checks for overflow are necessary to avoid producing invalid 683226031Sstas output when the input contains very large values or is very long. 684226031Sstas 685226031Sstas The increment of delta at the bottom of the outer loop cannot 686226031Sstas overflow because delta < length(input) before the increment, and 687226031Sstas length(input) is already assumed to be representable. The increment 688226031Sstas of n could overflow, but only if h == length(input), in which case 689226031Sstas the procedure is finished anyway. 690226031Sstas 691226031Sstas6.4 Overflow handling 692226031Sstas 693226031Sstas For IDNA, 26-bit unsigned integers are sufficient to handle all valid 694226031Sstas IDNA labels without overflow, because any string that needed a 27-bit 695226031Sstas delta would have to exceed either the code point limit (0..10FFFF) or 696226031Sstas the label length limit (63 characters). However, overflow handling 697226031Sstas is necessary because the inputs are not necessarily valid IDNA 698226031Sstas labels. 699226031Sstas 700226031Sstas If the programming language does not provide overflow detection, the 701226031Sstas following technique can be used. Suppose A, B, and C are 702226031Sstas representable nonnegative integers and C is nonzero. Then A + B 703226031Sstas overflows if and only if B > maxint - A, and A + (B * C) overflows if 704226031Sstas and only if B > (maxint - A) div C, where maxint is the greatest 705226031Sstas integer for which maxint + 1 cannot be represented. Refer to 706226031Sstas appendix C "Punycode sample implementation" for demonstrations of 707226031Sstas this technique in the C language. 708226031Sstas 709226031Sstas The decoding and encoding algorithms shown in sections 6.2 and 6.3 710226031Sstas handle overflow by detecting it whenever it happens. Another 711226031Sstas approach is to enforce limits on the inputs that prevent overflow 712226031Sstas from happening. For example, if the encoder were to verify that no 713226031Sstas input code points exceed M and that the input length does not exceed 714226031Sstas L, then no delta could ever exceed (M - initial_n) * (L + 1), and 715226031Sstas hence no overflow could occur if integer variables were capable of 716226031Sstas representing values that large. This prevention approach would 717226031Sstas impose more restrictions on the input than the detection approach 718226031Sstas does, but might be considered simpler in some programming languages. 719226031Sstas 720226031Sstas In theory, the decoder could use an analogous approach, limiting the 721226031Sstas number of digits in a variable-length integer (that is, limiting the 722226031Sstas number of iterations in the innermost loop). However, the number of 723226031Sstas digits that suffice to represent a given delta can sometimes 724226031Sstas represent much larger deltas (because of the adaptation), and hence 725226031Sstas this approach would probably need integers wider than 32 bits. 726226031Sstas 727226031Sstas 728226031Sstas 729226031Sstas 730226031SstasCostello Standards Track [Page 13] 731226031Sstas 732226031SstasRFC 3492 IDNA Punycode March 2003 733226031Sstas 734226031Sstas 735226031Sstas Yet another approach for the decoder is to allow overflow to occur, 736226031Sstas but to check the final output string by re-encoding it and comparing 737226031Sstas to the decoder input. If and only if they do not match (using a 738226031Sstas case-insensitive ASCII comparison) overflow has occurred. This 739226031Sstas delayed-detection approach would not impose any more restrictions on 740226031Sstas the input than the immediate-detection approach does, and might be 741226031Sstas considered simpler in some programming languages. 742226031Sstas 743226031Sstas In fact, if the decoder is used only inside the IDNA ToUnicode 744226031Sstas operation [IDNA], then it need not check for overflow at all, because 745226031Sstas ToUnicode performs a higher level re-encoding and comparison, and a 746226031Sstas mismatch has the same consequence as if the Punycode decoder had 747226031Sstas failed. 748226031Sstas 749226031Sstas7. Punycode examples 750226031Sstas 751226031Sstas7.1 Sample strings 752226031Sstas 753226031Sstas In the Punycode encodings below, the ACE prefix is not shown. 754226031Sstas Backslashes show where line breaks have been inserted in strings too 755226031Sstas long for one line. 756226031Sstas 757226031Sstas The first several examples are all translations of the sentence "Why 758226031Sstas can't they just speak in <language>?" (courtesy of Michael Kaplan's 759226031Sstas "provincial" page [PROVINCIAL]). Word breaks and punctuation have 760226031Sstas been removed, as is often done in domain names. 761226031Sstas 762226031Sstas (A) Arabic (Egyptian): 763226031Sstas u+0644 u+064A u+0647 u+0645 u+0627 u+0628 u+062A u+0643 u+0644 764226031Sstas u+0645 u+0648 u+0634 u+0639 u+0631 u+0628 u+064A u+061F 765226031Sstas Punycode: egbpdaj6bu4bxfgehfvwxn 766226031Sstas 767226031Sstas (B) Chinese (simplified): 768226031Sstas u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587 769226031Sstas Punycode: ihqwcrb4cv8a8dqg056pqjye 770226031Sstas 771226031Sstas (C) Chinese (traditional): 772226031Sstas u+4ED6 u+5011 u+7232 u+4EC0 u+9EBD u+4E0D u+8AAA u+4E2D u+6587 773226031Sstas Punycode: ihqwctvzc91f659drss3x8bo0yb 774226031Sstas 775226031Sstas (D) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky 776226031Sstas U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074 777226031Sstas u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D 778226031Sstas u+0065 u+0073 u+006B u+0079 779226031Sstas Punycode: Proprostnemluvesky-uyb24dma41a 780226031Sstas 781226031Sstas 782226031Sstas 783226031Sstas 784226031Sstas 785226031Sstas 786226031SstasCostello Standards Track [Page 14] 787226031Sstas 788226031SstasRFC 3492 IDNA Punycode March 2003 789226031Sstas 790226031Sstas 791226031Sstas (E) Hebrew: 792226031Sstas u+05DC u+05DE u+05D4 u+05D4 u+05DD u+05E4 u+05E9 u+05D5 u+05D8 793226031Sstas u+05DC u+05D0 u+05DE u+05D3 u+05D1 u+05E8 u+05D9 u+05DD u+05E2 794226031Sstas u+05D1 u+05E8 u+05D9 u+05EA 795226031Sstas Punycode: 4dbcagdahymbxekheh6e0a7fei0b 796226031Sstas 797226031Sstas (F) Hindi (Devanagari): 798226031Sstas u+092F u+0939 u+0932 u+094B u+0917 u+0939 u+093F u+0928 u+094D 799226031Sstas u+0926 u+0940 u+0915 u+094D u+092F u+094B u+0902 u+0928 u+0939 800226031Sstas u+0940 u+0902 u+092C u+094B u+0932 u+0938 u+0915 u+0924 u+0947 801226031Sstas u+0939 u+0948 u+0902 802226031Sstas Punycode: i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd 803226031Sstas 804226031Sstas (G) Japanese (kanji and hiragana): 805226031Sstas u+306A u+305C u+307F u+3093 u+306A u+65E5 u+672C u+8A9E u+3092 806226031Sstas u+8A71 u+3057 u+3066 u+304F u+308C u+306A u+3044 u+306E u+304B 807226031Sstas Punycode: n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa 808226031Sstas 809226031Sstas (H) Korean (Hangul syllables): 810226031Sstas u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4 u+C774 811226031Sstas u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C u+B2E4 u+BA74 812226031Sstas u+C5BC u+B9C8 u+B098 u+C88B u+C744 u+AE4C 813226031Sstas Punycode: 989aomsvi5e83db1d2a355cv1e0vak1dwrv93d5xbh15a0dt30a5j\ 814226031Sstas psd879ccm6fea98c 815226031Sstas 816226031Sstas (I) Russian (Cyrillic): 817226031Sstas U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E 818226031Sstas u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440 819226031Sstas u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A 820226031Sstas u+0438 821226031Sstas Punycode: b1abfaaepdrnnbgefbaDotcwatmq2g4l 822226031Sstas 823226031Sstas (J) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol 824226031Sstas U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070 825226031Sstas u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070 826226031Sstas u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061 827226031Sstas u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070 828226031Sstas u+0061 u+00F1 u+006F u+006C 829226031Sstas Punycode: PorqunopuedensimplementehablarenEspaol-fmd56a 830226031Sstas 831226031Sstas (K) Vietnamese: 832226031Sstas T<adotbelow>isaoh<odotbelow>kh<ocirc>ngth<ecirchookabove>ch\ 833226031Sstas <ihookabove>n<oacute>iti<ecircacute>ngVi<ecircdotbelow>t 834226031Sstas U+0054 u+1EA1 u+0069 u+0073 u+0061 u+006F u+0068 u+1ECD u+006B 835226031Sstas u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+1EC3 u+0063 u+0068 836226031Sstas u+1EC9 u+006E u+00F3 u+0069 u+0074 u+0069 u+1EBF u+006E u+0067 837226031Sstas U+0056 u+0069 u+1EC7 u+0074 838226031Sstas Punycode: TisaohkhngthchnitingVit-kjcr8268qyxafd2f1b9g 839226031Sstas 840226031Sstas 841226031Sstas 842226031SstasCostello Standards Track [Page 15] 843226031Sstas 844226031SstasRFC 3492 IDNA Punycode March 2003 845226031Sstas 846226031Sstas 847226031Sstas The next several examples are all names of Japanese music artists, 848226031Sstas song titles, and TV programs, just because the author happens to have 849226031Sstas them handy (but Japanese is useful for providing examples of single- 850226031Sstas row text, two-row text, ideographic text, and various mixtures 851226031Sstas thereof). 852226031Sstas 853226031Sstas (L) 3<nen>B<gumi><kinpachi><sensei> 854226031Sstas u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F 855226031Sstas Punycode: 3B-ww4c5e180e575a65lsy2b 856226031Sstas 857226031Sstas (M) <amuro><namie>-with-SUPER-MONKEYS 858226031Sstas u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074 859226031Sstas u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D 860226031Sstas U+004F U+004E U+004B U+0045 U+0059 U+0053 861226031Sstas Punycode: -with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n 862226031Sstas 863226031Sstas (N) Hello-Another-Way-<sorezore><no><basho> 864226031Sstas U+0048 u+0065 u+006C u+006C u+006F u+002D U+0041 u+006E u+006F 865226031Sstas u+0074 u+0068 u+0065 u+0072 u+002D U+0057 u+0061 u+0079 u+002D 866226031Sstas u+305D u+308C u+305E u+308C u+306E u+5834 u+6240 867226031Sstas Punycode: Hello-Another-Way--fc4qua05auwb3674vfr0b 868226031Sstas 869226031Sstas (O) <hitotsu><yane><no><shita>2 870226031Sstas u+3072 u+3068 u+3064 u+5C4B u+6839 u+306E u+4E0B u+0032 871226031Sstas Punycode: 2-u9tlzr9756bt3uc0v 872226031Sstas 873226031Sstas (P) Maji<de>Koi<suru>5<byou><mae> 874226031Sstas U+004D u+0061 u+006A u+0069 u+3067 U+004B u+006F u+0069 u+3059 875226031Sstas u+308B u+0035 u+79D2 u+524D 876226031Sstas Punycode: MajiKoi5-783gue6qz075azm5e 877226031Sstas 878226031Sstas (Q) <pafii>de<runba> 879226031Sstas u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0 880226031Sstas Punycode: de-jg4avhby1noc0d 881226031Sstas 882226031Sstas (R) <sono><supiido><de> 883226031Sstas u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067 884226031Sstas Punycode: d9juau41awczczp 885226031Sstas 886226031Sstas The last example is an ASCII string that breaks the existing rules 887226031Sstas for host name labels. (It is not a realistic example for IDNA, 888226031Sstas because IDNA never encodes pure ASCII labels.) 889226031Sstas 890226031Sstas (S) -> $1.00 <- 891226031Sstas u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020 892226031Sstas u+003C u+002D 893226031Sstas Punycode: -> $1.00 <-- 894226031Sstas 895226031Sstas 896226031Sstas 897226031Sstas 898226031SstasCostello Standards Track [Page 16] 899226031Sstas 900226031SstasRFC 3492 IDNA Punycode March 2003 901226031Sstas 902226031Sstas 903226031Sstas7.2 Decoding traces 904226031Sstas 905226031Sstas In the following traces, the evolving state of the decoder is shown 906226031Sstas as a sequence of hexadecimal values, representing the code points in 907226031Sstas the extended string. An asterisk appears just after the most 908226031Sstas recently inserted code point, indicating both n (the value preceeding 909226031Sstas the asterisk) and i (the position of the value just after the 910226031Sstas asterisk). Other numerical values are decimal. 911226031Sstas 912226031Sstas Decoding trace of example B from section 7.1: 913226031Sstas 914226031Sstas n is 128, i is 0, bias is 72 915226031Sstas input is "ihqwcrb4cv8a8dqg056pqjye" 916226031Sstas there is no delimiter, so extended string starts empty 917226031Sstas delta "ihq" decodes to 19853 918226031Sstas bias becomes 21 919226031Sstas 4E0D * 920226031Sstas delta "wc" decodes to 64 921226031Sstas bias becomes 20 922226031Sstas 4E0D 4E2D * 923226031Sstas delta "rb" decodes to 37 924226031Sstas bias becomes 13 925226031Sstas 4E3A * 4E0D 4E2D 926226031Sstas delta "4c" decodes to 56 927226031Sstas bias becomes 17 928226031Sstas 4E3A 4E48 * 4E0D 4E2D 929226031Sstas delta "v8a" decodes to 599 930226031Sstas bias becomes 32 931226031Sstas 4E3A 4EC0 * 4E48 4E0D 4E2D 932226031Sstas delta "8d" decodes to 130 933226031Sstas bias becomes 23 934226031Sstas 4ED6 * 4E3A 4EC0 4E48 4E0D 4E2D 935226031Sstas delta "qg" decodes to 154 936226031Sstas bias becomes 25 937226031Sstas 4ED6 4EEC * 4E3A 4EC0 4E48 4E0D 4E2D 938226031Sstas delta "056p" decodes to 46301 939226031Sstas bias becomes 84 940226031Sstas 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 4E2D 6587 * 941226031Sstas delta "qjye" decodes to 88531 942226031Sstas bias becomes 90 943226031Sstas 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 * 4E2D 6587 944226031Sstas 945226031Sstas 946226031Sstas 947226031Sstas 948226031Sstas 949226031Sstas 950226031Sstas 951226031Sstas 952226031Sstas 953226031Sstas 954226031SstasCostello Standards Track [Page 17] 955226031Sstas 956226031SstasRFC 3492 IDNA Punycode March 2003 957226031Sstas 958226031Sstas 959226031Sstas Decoding trace of example L from section 7.1: 960226031Sstas 961226031Sstas n is 128, i is 0, bias is 72 962226031Sstas input is "3B-ww4c5e180e575a65lsy2b" 963226031Sstas literal portion is "3B-", so extended string starts as: 964226031Sstas 0033 0042 965226031Sstas delta "ww4c" decodes to 62042 966226031Sstas bias becomes 27 967226031Sstas 0033 0042 5148 * 968226031Sstas delta "5e" decodes to 139 969226031Sstas bias becomes 24 970226031Sstas 0033 0042 516B * 5148 971226031Sstas delta "180e" decodes to 16683 972226031Sstas bias becomes 67 973226031Sstas 0033 5E74 * 0042 516B 5148 974226031Sstas delta "575a" decodes to 34821 975226031Sstas bias becomes 82 976226031Sstas 0033 5E74 0042 516B 5148 751F * 977226031Sstas delta "65l" decodes to 14592 978226031Sstas bias becomes 67 979226031Sstas 0033 5E74 0042 7D44 * 516B 5148 751F 980226031Sstas delta "sy2b" decodes to 42088 981226031Sstas bias becomes 84 982226031Sstas 0033 5E74 0042 7D44 91D1 * 516B 5148 751F 983226031Sstas 984226031Sstas 985226031Sstas 986226031Sstas 987226031Sstas 988226031Sstas 989226031Sstas 990226031Sstas 991226031Sstas 992226031Sstas 993226031Sstas 994226031Sstas 995226031Sstas 996226031Sstas 997226031Sstas 998226031Sstas 999226031Sstas 1000226031Sstas 1001226031Sstas 1002226031Sstas 1003226031Sstas 1004226031Sstas 1005226031Sstas 1006226031Sstas 1007226031Sstas 1008226031Sstas 1009226031Sstas 1010226031SstasCostello Standards Track [Page 18] 1011226031Sstas 1012226031SstasRFC 3492 IDNA Punycode March 2003 1013226031Sstas 1014226031Sstas 1015226031Sstas7.3 Encoding traces 1016226031Sstas 1017226031Sstas In the following traces, code point values are hexadecimal, while 1018226031Sstas other numerical values are decimal. 1019226031Sstas 1020226031Sstas Encoding trace of example B from section 7.1: 1021226031Sstas 1022226031Sstas bias is 72 1023226031Sstas input is: 1024226031Sstas 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 4E2D 6587 1025226031Sstas there are no basic code points, so no literal portion 1026226031Sstas next code point to insert is 4E0D 1027226031Sstas needed delta is 19853, encodes as "ihq" 1028226031Sstas bias becomes 21 1029226031Sstas next code point to insert is 4E2D 1030226031Sstas needed delta is 64, encodes as "wc" 1031226031Sstas bias becomes 20 1032226031Sstas next code point to insert is 4E3A 1033226031Sstas needed delta is 37, encodes as "rb" 1034226031Sstas bias becomes 13 1035226031Sstas next code point to insert is 4E48 1036226031Sstas needed delta is 56, encodes as "4c" 1037226031Sstas bias becomes 17 1038226031Sstas next code point to insert is 4EC0 1039226031Sstas needed delta is 599, encodes as "v8a" 1040226031Sstas bias becomes 32 1041226031Sstas next code point to insert is 4ED6 1042226031Sstas needed delta is 130, encodes as "8d" 1043226031Sstas bias becomes 23 1044226031Sstas next code point to insert is 4EEC 1045226031Sstas needed delta is 154, encodes as "qg" 1046226031Sstas bias becomes 25 1047226031Sstas next code point to insert is 6587 1048226031Sstas needed delta is 46301, encodes as "056p" 1049226031Sstas bias becomes 84 1050226031Sstas next code point to insert is 8BF4 1051226031Sstas needed delta is 88531, encodes as "qjye" 1052226031Sstas bias becomes 90 1053226031Sstas output is "ihqwcrb4cv8a8dqg056pqjye" 1054226031Sstas 1055226031Sstas 1056226031Sstas 1057226031Sstas 1058226031Sstas 1059226031Sstas 1060226031Sstas 1061226031Sstas 1062226031Sstas 1063226031Sstas 1064226031Sstas 1065226031Sstas 1066226031SstasCostello Standards Track [Page 19] 1067226031Sstas 1068226031SstasRFC 3492 IDNA Punycode March 2003 1069226031Sstas 1070226031Sstas 1071226031Sstas Encoding trace of example L from section 7.1: 1072226031Sstas 1073226031Sstas bias is 72 1074226031Sstas input is: 1075226031Sstas 0033 5E74 0042 7D44 91D1 516B 5148 751F 1076226031Sstas basic code points (0033, 0042) are copied to literal portion: "3B-" 1077226031Sstas next code point to insert is 5148 1078226031Sstas needed delta is 62042, encodes as "ww4c" 1079226031Sstas bias becomes 27 1080226031Sstas next code point to insert is 516B 1081226031Sstas needed delta is 139, encodes as "5e" 1082226031Sstas bias becomes 24 1083226031Sstas next code point to insert is 5E74 1084226031Sstas needed delta is 16683, encodes as "180e" 1085226031Sstas bias becomes 67 1086226031Sstas next code point to insert is 751F 1087226031Sstas needed delta is 34821, encodes as "575a" 1088226031Sstas bias becomes 82 1089226031Sstas next code point to insert is 7D44 1090226031Sstas needed delta is 14592, encodes as "65l" 1091226031Sstas bias becomes 67 1092226031Sstas next code point to insert is 91D1 1093226031Sstas needed delta is 42088, encodes as "sy2b" 1094226031Sstas bias becomes 84 1095226031Sstas output is "3B-ww4c5e180e575a65lsy2b" 1096226031Sstas 1097226031Sstas8. Security Considerations 1098226031Sstas 1099226031Sstas Users expect each domain name in DNS to be controlled by a single 1100226031Sstas authority. If a Unicode string intended for use as a domain label 1101226031Sstas could map to multiple ACE labels, then an internationalized domain 1102226031Sstas name could map to multiple ASCII domain names, each controlled by a 1103226031Sstas different authority, some of which could be spoofs that hijack 1104226031Sstas service requests intended for another. Therefore Punycode is 1105226031Sstas designed so that each Unicode string has a unique encoding. 1106226031Sstas 1107226031Sstas However, there can still be multiple Unicode representations of the 1108226031Sstas "same" text, for various definitions of "same". This problem is 1109226031Sstas addressed to some extent by the Unicode standard under the topic of 1110226031Sstas canonicalization, and this work is leveraged for domain names by 1111226031Sstas Nameprep [NAMEPREP]. 1112226031Sstas 1113226031Sstas 1114226031Sstas 1115226031Sstas 1116226031Sstas 1117226031Sstas 1118226031Sstas 1119226031Sstas 1120226031Sstas 1121226031Sstas 1122226031SstasCostello Standards Track [Page 20] 1123226031Sstas 1124226031SstasRFC 3492 IDNA Punycode March 2003 1125226031Sstas 1126226031Sstas 1127226031Sstas9. References 1128226031Sstas 1129226031Sstas9.1 Normative References 1130226031Sstas 1131226031Sstas [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1132226031Sstas Requirement Levels", BCP 14, RFC 2119, March 1997. 1133226031Sstas 1134226031Sstas9.2 Informative References 1135226031Sstas 1136226031Sstas [RFC952] Harrenstien, K., Stahl, M. and E. Feinler, "DOD Internet 1137226031Sstas Host Table Specification", RFC 952, October 1985. 1138226031Sstas 1139226031Sstas [RFC1034] Mockapetris, P., "Domain Names - Concepts and 1140226031Sstas Facilities", STD 13, RFC 1034, November 1987. 1141226031Sstas 1142226031Sstas [IDNA] Faltstrom, P., Hoffman, P. and A. Costello, 1143226031Sstas "Internationalizing Domain Names in Applications 1144226031Sstas (IDNA)", RFC 3490, March 2003. 1145226031Sstas 1146226031Sstas [NAMEPREP] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep 1147226031Sstas Profile for Internationalized Domain Names (IDN)", RFC 1148226031Sstas 3491, March 2003. 1149226031Sstas 1150226031Sstas [ASCII] Cerf, V., "ASCII format for Network Interchange", RFC 1151226031Sstas 20, October 1969. 1152226031Sstas 1153226031Sstas [PROVINCIAL] Kaplan, M., "The 'anyone can be provincial!' page", 1154226031Sstas http://www.trigeminal.com/samples/provincial.html. 1155226031Sstas 1156226031Sstas [UNICODE] The Unicode Consortium, "The Unicode Standard", 1157226031Sstas http://www.unicode.org/unicode/standard/standard.html. 1158226031Sstas 1159226031Sstas 1160226031Sstas 1161226031Sstas 1162226031Sstas 1163226031Sstas 1164226031Sstas 1165226031Sstas 1166226031Sstas 1167226031Sstas 1168226031Sstas 1169226031Sstas 1170226031Sstas 1171226031Sstas 1172226031Sstas 1173226031Sstas 1174226031Sstas 1175226031Sstas 1176226031Sstas 1177226031Sstas 1178226031SstasCostello Standards Track [Page 21] 1179226031Sstas 1180226031SstasRFC 3492 IDNA Punycode March 2003 1181226031Sstas 1182226031Sstas 1183226031SstasA. Mixed-case annotation 1184226031Sstas 1185226031Sstas In order to use Punycode to represent case-insensitive strings, 1186226031Sstas higher layers need to case-fold the strings prior to Punycode 1187226031Sstas encoding. The encoded string can use mixed case as an annotation 1188226031Sstas telling how to convert the folded string into a mixed-case string for 1189226031Sstas display purposes. Note, however, that mixed-case annotation is not 1190226031Sstas used by the ToASCII and ToUnicode operations specified in [IDNA], and 1191226031Sstas therefore implementors of IDNA can disregard this appendix. 1192226031Sstas 1193226031Sstas Basic code points can use mixed case directly, because the decoder 1194226031Sstas copies them verbatim, leaving lowercase code points lowercase, and 1195226031Sstas leaving uppercase code points uppercase. Each non-basic code point 1196226031Sstas is represented by a delta, which is represented by a sequence of 1197226031Sstas basic code points, the last of which provides the annotation. If it 1198226031Sstas is uppercase, it is a suggestion to map the non-basic code point to 1199226031Sstas uppercase (if possible); if it is lowercase, it is a suggestion to 1200226031Sstas map the non-basic code point to lowercase (if possible). 1201226031Sstas 1202226031Sstas These annotations do not alter the code points returned by decoders; 1203226031Sstas the annotations are returned separately, for the caller to use or 1204226031Sstas ignore. Encoders can accept annotations in addition to code points, 1205226031Sstas but the annotations do not alter the output, except to influence the 1206226031Sstas uppercase/lowercase form of ASCII letters. 1207226031Sstas 1208226031Sstas Punycode encoders and decoders need not support these annotations, 1209226031Sstas and higher layers need not use them. 1210226031Sstas 1211226031SstasB. Disclaimer and license 1212226031Sstas 1213226031Sstas Regarding this entire document or any portion of it (including the 1214226031Sstas pseudocode and C code), the author makes no guarantees and is not 1215226031Sstas responsible for any damage resulting from its use. The author grants 1216226031Sstas irrevocable permission to anyone to use, modify, and distribute it in 1217226031Sstas any way that does not diminish the rights of anyone else to use, 1218226031Sstas modify, and distribute it, provided that redistributed derivative 1219226031Sstas works do not contain misleading author or version information. 1220226031Sstas Derivative works need not be licensed under similar terms. 1221226031Sstas 1222226031Sstas 1223226031Sstas 1224226031Sstas 1225226031Sstas 1226226031Sstas 1227226031Sstas 1228226031Sstas 1229226031Sstas 1230226031Sstas 1231226031Sstas 1232226031Sstas 1233226031Sstas 1234226031SstasCostello Standards Track [Page 22] 1235226031Sstas 1236226031SstasRFC 3492 IDNA Punycode March 2003 1237226031Sstas 1238226031Sstas 1239226031SstasC. Punycode sample implementation 1240226031Sstas 1241226031Sstas/* 1242226031Sstaspunycode.c from RFC 3492 1243226031Sstashttp://www.nicemice.net/idn/ 1244226031SstasAdam M. Costello 1245226031Sstashttp://www.nicemice.net/amc/ 1246226031Sstas 1247226031SstasThis is ANSI C code (C89) implementing Punycode (RFC 3492). 1248226031Sstas 1249226031Sstas*/ 1250226031Sstas 1251226031Sstas 1252226031Sstas/************************************************************/ 1253226031Sstas/* Public interface (would normally go in its own .h file): */ 1254226031Sstas 1255226031Sstas#include <limits.h> 1256226031Sstas 1257226031Sstasenum punycode_status { 1258226031Sstas punycode_success, 1259226031Sstas punycode_bad_input, /* Input is invalid. */ 1260226031Sstas punycode_big_output, /* Output would exceed the space provided. */ 1261226031Sstas punycode_overflow /* Input needs wider integers to process. */ 1262226031Sstas}; 1263226031Sstas 1264226031Sstas#if UINT_MAX >= (1 << 26) - 1 1265226031Sstastypedef unsigned int punycode_uint; 1266226031Sstas#else 1267226031Sstastypedef unsigned long punycode_uint; 1268226031Sstas#endif 1269226031Sstas 1270226031Sstasenum punycode_status punycode_encode( 1271226031Sstas punycode_uint input_length, 1272226031Sstas const punycode_uint input[], 1273226031Sstas const unsigned char case_flags[], 1274226031Sstas punycode_uint *output_length, 1275226031Sstas char output[] ); 1276226031Sstas 1277226031Sstas /* punycode_encode() converts Unicode to Punycode. The input */ 1278226031Sstas /* is represented as an array of Unicode code points (not code */ 1279226031Sstas /* units; surrogate pairs are not allowed), and the output */ 1280226031Sstas /* will be represented as an array of ASCII code points. The */ 1281226031Sstas /* output string is *not* null-terminated; it will contain */ 1282226031Sstas /* zeros if and only if the input contains zeros. (Of course */ 1283226031Sstas /* the caller can leave room for a terminator and add one if */ 1284226031Sstas /* needed.) The input_length is the number of code points in */ 1285226031Sstas /* the input. The output_length is an in/out argument: the */ 1286226031Sstas /* caller passes in the maximum number of code points that it */ 1287226031Sstas 1288226031Sstas 1289226031Sstas 1290226031SstasCostello Standards Track [Page 23] 1291226031Sstas 1292226031SstasRFC 3492 IDNA Punycode March 2003 1293226031Sstas 1294226031Sstas 1295226031Sstas /* can receive, and on successful return it will contain the */ 1296226031Sstas /* number of code points actually output. The case_flags array */ 1297226031Sstas /* holds input_length boolean values, where nonzero suggests that */ 1298226031Sstas /* the corresponding Unicode character be forced to uppercase */ 1299226031Sstas /* after being decoded (if possible), and zero suggests that */ 1300226031Sstas /* it be forced to lowercase (if possible). ASCII code points */ 1301226031Sstas /* are encoded literally, except that ASCII letters are forced */ 1302226031Sstas /* to uppercase or lowercase according to the corresponding */ 1303226031Sstas /* uppercase flags. If case_flags is a null pointer then ASCII */ 1304226031Sstas /* letters are left as they are, and other code points are */ 1305226031Sstas /* treated as if their uppercase flags were zero. The return */ 1306226031Sstas /* value can be any of the punycode_status values defined above */ 1307226031Sstas /* except punycode_bad_input; if not punycode_success, then */ 1308226031Sstas /* output_size and output might contain garbage. */ 1309226031Sstas 1310226031Sstasenum punycode_status punycode_decode( 1311226031Sstas punycode_uint input_length, 1312226031Sstas const char input[], 1313226031Sstas punycode_uint *output_length, 1314226031Sstas punycode_uint output[], 1315226031Sstas unsigned char case_flags[] ); 1316226031Sstas 1317226031Sstas /* punycode_decode() converts Punycode to Unicode. The input is */ 1318226031Sstas /* represented as an array of ASCII code points, and the output */ 1319226031Sstas /* will be represented as an array of Unicode code points. The */ 1320226031Sstas /* input_length is the number of code points in the input. The */ 1321226031Sstas /* output_length is an in/out argument: the caller passes in */ 1322226031Sstas /* the maximum number of code points that it can receive, and */ 1323226031Sstas /* on successful return it will contain the actual number of */ 1324226031Sstas /* code points output. The case_flags array needs room for at */ 1325226031Sstas /* least output_length values, or it can be a null pointer if the */ 1326226031Sstas /* case information is not needed. A nonzero flag suggests that */ 1327226031Sstas /* the corresponding Unicode character be forced to uppercase */ 1328226031Sstas /* by the caller (if possible), while zero suggests that it be */ 1329226031Sstas /* forced to lowercase (if possible). ASCII code points are */ 1330226031Sstas /* output already in the proper case, but their flags will be set */ 1331226031Sstas /* appropriately so that applying the flags would be harmless. */ 1332226031Sstas /* The return value can be any of the punycode_status values */ 1333226031Sstas /* defined above; if not punycode_success, then output_length, */ 1334226031Sstas /* output, and case_flags might contain garbage. On success, the */ 1335226031Sstas /* decoder will never need to write an output_length greater than */ 1336226031Sstas /* input_length, because of how the encoding is defined. */ 1337226031Sstas 1338226031Sstas/**********************************************************/ 1339226031Sstas/* Implementation (would normally go in its own .c file): */ 1340226031Sstas 1341226031Sstas#include <string.h> 1342226031Sstas 1343226031Sstas 1344226031Sstas 1345226031Sstas 1346226031SstasCostello Standards Track [Page 24] 1347226031Sstas 1348226031SstasRFC 3492 IDNA Punycode March 2003 1349226031Sstas 1350226031Sstas 1351226031Sstas/*** Bootstring parameters for Punycode ***/ 1352226031Sstas 1353226031Sstasenum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700, 1354226031Sstas initial_bias = 72, initial_n = 0x80, delimiter = 0x2D }; 1355226031Sstas 1356226031Sstas/* basic(cp) tests whether cp is a basic code point: */ 1357226031Sstas#define basic(cp) ((punycode_uint)(cp) < 0x80) 1358226031Sstas 1359226031Sstas/* delim(cp) tests whether cp is a delimiter: */ 1360226031Sstas#define delim(cp) ((cp) == delimiter) 1361226031Sstas 1362226031Sstas/* decode_digit(cp) returns the numeric value of a basic code */ 1363226031Sstas/* point (for use in representing integers) in the range 0 to */ 1364226031Sstas/* base-1, or base if cp is does not represent a value. */ 1365226031Sstas 1366226031Sstasstatic punycode_uint decode_digit(punycode_uint cp) 1367226031Sstas{ 1368226031Sstas return cp - 48 < 10 ? cp - 22 : cp - 65 < 26 ? cp - 65 : 1369226031Sstas cp - 97 < 26 ? cp - 97 : base; 1370226031Sstas} 1371226031Sstas 1372226031Sstas/* encode_digit(d,flag) returns the basic code point whose value */ 1373226031Sstas/* (when used for representing integers) is d, which needs to be in */ 1374226031Sstas/* the range 0 to base-1. The lowercase form is used unless flag is */ 1375226031Sstas/* nonzero, in which case the uppercase form is used. The behavior */ 1376226031Sstas/* is undefined if flag is nonzero and digit d has no uppercase form. */ 1377226031Sstas 1378226031Sstasstatic char encode_digit(punycode_uint d, int flag) 1379226031Sstas{ 1380226031Sstas return d + 22 + 75 * (d < 26) - ((flag != 0) << 5); 1381226031Sstas /* 0..25 map to ASCII a..z or A..Z */ 1382226031Sstas /* 26..35 map to ASCII 0..9 */ 1383226031Sstas} 1384226031Sstas 1385226031Sstas/* flagged(bcp) tests whether a basic code point is flagged */ 1386226031Sstas/* (uppercase). The behavior is undefined if bcp is not a */ 1387226031Sstas/* basic code point. */ 1388226031Sstas 1389226031Sstas#define flagged(bcp) ((punycode_uint)(bcp) - 65 < 26) 1390226031Sstas 1391226031Sstas/* encode_basic(bcp,flag) forces a basic code point to lowercase */ 1392226031Sstas/* if flag is zero, uppercase if flag is nonzero, and returns */ 1393226031Sstas/* the resulting code point. The code point is unchanged if it */ 1394226031Sstas/* is caseless. The behavior is undefined if bcp is not a basic */ 1395226031Sstas/* code point. */ 1396226031Sstas 1397226031Sstasstatic char encode_basic(punycode_uint bcp, int flag) 1398226031Sstas{ 1399226031Sstas 1400226031Sstas 1401226031Sstas 1402226031SstasCostello Standards Track [Page 25] 1403226031Sstas 1404226031SstasRFC 3492 IDNA Punycode March 2003 1405226031Sstas 1406226031Sstas 1407226031Sstas bcp -= (bcp - 97 < 26) << 5; 1408226031Sstas return bcp + ((!flag && (bcp - 65 < 26)) << 5); 1409226031Sstas} 1410226031Sstas 1411226031Sstas/*** Platform-specific constants ***/ 1412226031Sstas 1413226031Sstas/* maxint is the maximum value of a punycode_uint variable: */ 1414226031Sstasstatic const punycode_uint maxint = -1; 1415226031Sstas/* Because maxint is unsigned, -1 becomes the maximum value. */ 1416226031Sstas 1417226031Sstas/*** Bias adaptation function ***/ 1418226031Sstas 1419226031Sstasstatic punycode_uint adapt( 1420226031Sstas punycode_uint delta, punycode_uint numpoints, int firsttime ) 1421226031Sstas{ 1422226031Sstas punycode_uint k; 1423226031Sstas 1424226031Sstas delta = firsttime ? delta / damp : delta >> 1; 1425226031Sstas /* delta >> 1 is a faster way of doing delta / 2 */ 1426226031Sstas delta += delta / numpoints; 1427226031Sstas 1428226031Sstas for (k = 0; delta > ((base - tmin) * tmax) / 2; k += base) { 1429226031Sstas delta /= base - tmin; 1430226031Sstas } 1431226031Sstas 1432226031Sstas return k + (base - tmin + 1) * delta / (delta + skew); 1433226031Sstas} 1434226031Sstas 1435226031Sstas/*** Main encode function ***/ 1436226031Sstas 1437226031Sstasenum punycode_status punycode_encode( 1438226031Sstas punycode_uint input_length, 1439226031Sstas const punycode_uint input[], 1440226031Sstas const unsigned char case_flags[], 1441226031Sstas punycode_uint *output_length, 1442226031Sstas char output[] ) 1443226031Sstas{ 1444226031Sstas punycode_uint n, delta, h, b, out, max_out, bias, j, m, q, k, t; 1445226031Sstas 1446226031Sstas /* Initialize the state: */ 1447226031Sstas 1448226031Sstas n = initial_n; 1449226031Sstas delta = out = 0; 1450226031Sstas max_out = *output_length; 1451226031Sstas bias = initial_bias; 1452226031Sstas 1453226031Sstas /* Handle the basic code points: */ 1454226031Sstas 1455226031Sstas 1456226031Sstas 1457226031Sstas 1458226031SstasCostello Standards Track [Page 26] 1459226031Sstas 1460226031SstasRFC 3492 IDNA Punycode March 2003 1461226031Sstas 1462226031Sstas 1463226031Sstas for (j = 0; j < input_length; ++j) { 1464226031Sstas if (basic(input[j])) { 1465226031Sstas if (max_out - out < 2) return punycode_big_output; 1466226031Sstas output[out++] = 1467226031Sstas case_flags ? encode_basic(input[j], case_flags[j]) : input[j]; 1468226031Sstas } 1469226031Sstas /* else if (input[j] < n) return punycode_bad_input; */ 1470226031Sstas /* (not needed for Punycode with unsigned code points) */ 1471226031Sstas } 1472226031Sstas 1473226031Sstas h = b = out; 1474226031Sstas 1475226031Sstas /* h is the number of code points that have been handled, b is the */ 1476226031Sstas /* number of basic code points, and out is the number of characters */ 1477226031Sstas /* that have been output. */ 1478226031Sstas 1479226031Sstas if (b > 0) output[out++] = delimiter; 1480226031Sstas 1481226031Sstas /* Main encoding loop: */ 1482226031Sstas 1483226031Sstas while (h < input_length) { 1484226031Sstas /* All non-basic code points < n have been */ 1485226031Sstas /* handled already. Find the next larger one: */ 1486226031Sstas 1487226031Sstas for (m = maxint, j = 0; j < input_length; ++j) { 1488226031Sstas /* if (basic(input[j])) continue; */ 1489226031Sstas /* (not needed for Punycode) */ 1490226031Sstas if (input[j] >= n && input[j] < m) m = input[j]; 1491226031Sstas } 1492226031Sstas 1493226031Sstas /* Increase delta enough to advance the decoder's */ 1494226031Sstas /* <n,i> state to <m,0>, but guard against overflow: */ 1495226031Sstas 1496226031Sstas if (m - n > (maxint - delta) / (h + 1)) return punycode_overflow; 1497226031Sstas delta += (m - n) * (h + 1); 1498226031Sstas n = m; 1499226031Sstas 1500226031Sstas for (j = 0; j < input_length; ++j) { 1501226031Sstas /* Punycode does not need to check whether input[j] is basic: */ 1502226031Sstas if (input[j] < n /* || basic(input[j]) */ ) { 1503226031Sstas if (++delta == 0) return punycode_overflow; 1504226031Sstas } 1505226031Sstas 1506226031Sstas if (input[j] == n) { 1507226031Sstas /* Represent delta as a generalized variable-length integer: */ 1508226031Sstas 1509226031Sstas for (q = delta, k = base; ; k += base) { 1510226031Sstas if (out >= max_out) return punycode_big_output; 1511226031Sstas 1512226031Sstas 1513226031Sstas 1514226031SstasCostello Standards Track [Page 27] 1515226031Sstas 1516226031SstasRFC 3492 IDNA Punycode March 2003 1517226031Sstas 1518226031Sstas 1519226031Sstas t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */ 1520226031Sstas k >= bias + tmax ? tmax : k - bias; 1521226031Sstas if (q < t) break; 1522226031Sstas output[out++] = encode_digit(t + (q - t) % (base - t), 0); 1523226031Sstas q = (q - t) / (base - t); 1524226031Sstas } 1525226031Sstas 1526226031Sstas output[out++] = encode_digit(q, case_flags && case_flags[j]); 1527226031Sstas bias = adapt(delta, h + 1, h == b); 1528226031Sstas delta = 0; 1529226031Sstas ++h; 1530226031Sstas } 1531226031Sstas } 1532226031Sstas 1533226031Sstas ++delta, ++n; 1534226031Sstas } 1535226031Sstas 1536226031Sstas *output_length = out; 1537226031Sstas return punycode_success; 1538226031Sstas} 1539226031Sstas 1540226031Sstas/*** Main decode function ***/ 1541226031Sstas 1542226031Sstasenum punycode_status punycode_decode( 1543226031Sstas punycode_uint input_length, 1544226031Sstas const char input[], 1545226031Sstas punycode_uint *output_length, 1546226031Sstas punycode_uint output[], 1547226031Sstas unsigned char case_flags[] ) 1548226031Sstas{ 1549226031Sstas punycode_uint n, out, i, max_out, bias, 1550226031Sstas b, j, in, oldi, w, k, digit, t; 1551226031Sstas 1552226031Sstas /* Initialize the state: */ 1553226031Sstas 1554226031Sstas n = initial_n; 1555226031Sstas out = i = 0; 1556226031Sstas max_out = *output_length; 1557226031Sstas bias = initial_bias; 1558226031Sstas 1559226031Sstas /* Handle the basic code points: Let b be the number of input code */ 1560226031Sstas /* points before the last delimiter, or 0 if there is none, then */ 1561226031Sstas /* copy the first b code points to the output. */ 1562226031Sstas 1563226031Sstas for (b = j = 0; j < input_length; ++j) if (delim(input[j])) b = j; 1564226031Sstas if (b > max_out) return punycode_big_output; 1565226031Sstas 1566226031Sstas for (j = 0; j < b; ++j) { 1567226031Sstas 1568226031Sstas 1569226031Sstas 1570226031SstasCostello Standards Track [Page 28] 1571226031Sstas 1572226031SstasRFC 3492 IDNA Punycode March 2003 1573226031Sstas 1574226031Sstas 1575226031Sstas if (case_flags) case_flags[out] = flagged(input[j]); 1576226031Sstas if (!basic(input[j])) return punycode_bad_input; 1577226031Sstas output[out++] = input[j]; 1578226031Sstas } 1579226031Sstas 1580226031Sstas /* Main decoding loop: Start just after the last delimiter if any */ 1581226031Sstas /* basic code points were copied; start at the beginning otherwise. */ 1582226031Sstas 1583226031Sstas for (in = b > 0 ? b + 1 : 0; in < input_length; ++out) { 1584226031Sstas 1585226031Sstas /* in is the index of the next character to be consumed, and */ 1586226031Sstas /* out is the number of code points in the output array. */ 1587226031Sstas 1588226031Sstas /* Decode a generalized variable-length integer into delta, */ 1589226031Sstas /* which gets added to i. The overflow checking is easier */ 1590226031Sstas /* if we increase i as we go, then subtract off its starting */ 1591226031Sstas /* value at the end to obtain delta. */ 1592226031Sstas 1593226031Sstas for (oldi = i, w = 1, k = base; ; k += base) { 1594226031Sstas if (in >= input_length) return punycode_bad_input; 1595226031Sstas digit = decode_digit(input[in++]); 1596226031Sstas if (digit >= base) return punycode_bad_input; 1597226031Sstas if (digit > (maxint - i) / w) return punycode_overflow; 1598226031Sstas i += digit * w; 1599226031Sstas t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */ 1600226031Sstas k >= bias + tmax ? tmax : k - bias; 1601226031Sstas if (digit < t) break; 1602226031Sstas if (w > maxint / (base - t)) return punycode_overflow; 1603226031Sstas w *= (base - t); 1604226031Sstas } 1605226031Sstas 1606226031Sstas bias = adapt(i - oldi, out + 1, oldi == 0); 1607226031Sstas 1608226031Sstas /* i was supposed to wrap around from out+1 to 0, */ 1609226031Sstas /* incrementing n each time, so we'll fix that now: */ 1610226031Sstas 1611226031Sstas if (i / (out + 1) > maxint - n) return punycode_overflow; 1612226031Sstas n += i / (out + 1); 1613226031Sstas i %= (out + 1); 1614226031Sstas 1615226031Sstas /* Insert n at position i of the output: */ 1616226031Sstas 1617226031Sstas /* not needed for Punycode: */ 1618226031Sstas /* if (decode_digit(n) <= base) return punycode_invalid_input; */ 1619226031Sstas if (out >= max_out) return punycode_big_output; 1620226031Sstas 1621226031Sstas if (case_flags) { 1622226031Sstas memmove(case_flags + i + 1, case_flags + i, out - i); 1623226031Sstas 1624226031Sstas 1625226031Sstas 1626226031SstasCostello Standards Track [Page 29] 1627226031Sstas 1628226031SstasRFC 3492 IDNA Punycode March 2003 1629226031Sstas 1630226031Sstas 1631226031Sstas /* Case of last character determines uppercase flag: */ 1632226031Sstas case_flags[i] = flagged(input[in - 1]); 1633226031Sstas } 1634226031Sstas 1635226031Sstas memmove(output + i + 1, output + i, (out - i) * sizeof *output); 1636226031Sstas output[i++] = n; 1637226031Sstas } 1638226031Sstas 1639226031Sstas *output_length = out; 1640226031Sstas return punycode_success; 1641226031Sstas} 1642226031Sstas 1643226031Sstas/******************************************************************/ 1644226031Sstas/* Wrapper for testing (would normally go in a separate .c file): */ 1645226031Sstas 1646226031Sstas#include <assert.h> 1647226031Sstas#include <stdio.h> 1648226031Sstas#include <stdlib.h> 1649226031Sstas#include <string.h> 1650226031Sstas 1651226031Sstas/* For testing, we'll just set some compile-time limits rather than */ 1652226031Sstas/* use malloc(), and set a compile-time option rather than using a */ 1653226031Sstas/* command-line option. */ 1654226031Sstas 1655226031Sstasenum { 1656226031Sstas unicode_max_length = 256, 1657226031Sstas ace_max_length = 256 1658226031Sstas}; 1659226031Sstas 1660226031Sstasstatic void usage(char **argv) 1661226031Sstas{ 1662226031Sstas fprintf(stderr, 1663226031Sstas "\n" 1664226031Sstas "%s -e reads code points and writes a Punycode string.\n" 1665226031Sstas "%s -d reads a Punycode string and writes code points.\n" 1666226031Sstas "\n" 1667226031Sstas "Input and output are plain text in the native character set.\n" 1668226031Sstas "Code points are in the form u+hex separated by whitespace.\n" 1669226031Sstas "Although the specification allows Punycode strings to contain\n" 1670226031Sstas "any characters from the ASCII repertoire, this test code\n" 1671226031Sstas "supports only the printable characters, and needs the Punycode\n" 1672226031Sstas "string to be followed by a newline.\n" 1673226031Sstas "The case of the u in u+hex is the force-to-uppercase flag.\n" 1674226031Sstas , argv[0], argv[0]); 1675226031Sstas exit(EXIT_FAILURE); 1676226031Sstas} 1677226031Sstas 1678226031Sstasstatic void fail(const char *msg) 1679226031Sstas 1680226031Sstas 1681226031Sstas 1682226031SstasCostello Standards Track [Page 30] 1683226031Sstas 1684226031SstasRFC 3492 IDNA Punycode March 2003 1685226031Sstas 1686226031Sstas 1687226031Sstas{ 1688226031Sstas fputs(msg,stderr); 1689226031Sstas exit(EXIT_FAILURE); 1690226031Sstas} 1691226031Sstas 1692226031Sstasstatic const char too_big[] = 1693226031Sstas "input or output is too large, recompile with larger limits\n"; 1694226031Sstasstatic const char invalid_input[] = "invalid input\n"; 1695226031Sstasstatic const char overflow[] = "arithmetic overflow\n"; 1696226031Sstasstatic const char io_error[] = "I/O error\n"; 1697226031Sstas 1698226031Sstas/* The following string is used to convert printable */ 1699226031Sstas/* characters between ASCII and the native charset: */ 1700226031Sstas 1701226031Sstasstatic const char print_ascii[] = 1702226031Sstas "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n" 1703226031Sstas "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n" 1704226031Sstas " !\"#$%&'()*+,-./" 1705226031Sstas "0123456789:;<=>?" 1706226031Sstas "@ABCDEFGHIJKLMNO" 1707226031Sstas "PQRSTUVWXYZ[\\]^_" 1708226031Sstas "`abcdefghijklmno" 1709226031Sstas "pqrstuvwxyz{|}~\n"; 1710226031Sstas 1711226031Sstasint main(int argc, char **argv) 1712226031Sstas{ 1713226031Sstas enum punycode_status status; 1714226031Sstas int r; 1715226031Sstas unsigned int input_length, output_length, j; 1716226031Sstas unsigned char case_flags[unicode_max_length]; 1717226031Sstas 1718226031Sstas if (argc != 2) usage(argv); 1719226031Sstas if (argv[1][0] != '-') usage(argv); 1720226031Sstas if (argv[1][2] != 0) usage(argv); 1721226031Sstas 1722226031Sstas if (argv[1][1] == 'e') { 1723226031Sstas punycode_uint input[unicode_max_length]; 1724226031Sstas unsigned long codept; 1725226031Sstas char output[ace_max_length+1], uplus[3]; 1726226031Sstas int c; 1727226031Sstas 1728226031Sstas /* Read the input code points: */ 1729226031Sstas 1730226031Sstas input_length = 0; 1731226031Sstas 1732226031Sstas for (;;) { 1733226031Sstas r = scanf("%2s%lx", uplus, &codept); 1734226031Sstas if (ferror(stdin)) fail(io_error); 1735226031Sstas 1736226031Sstas 1737226031Sstas 1738226031SstasCostello Standards Track [Page 31] 1739226031Sstas 1740226031SstasRFC 3492 IDNA Punycode March 2003 1741226031Sstas 1742226031Sstas 1743226031Sstas if (r == EOF || r == 0) break; 1744226031Sstas 1745226031Sstas if (r != 2 || uplus[1] != '+' || codept > (punycode_uint)-1) { 1746226031Sstas fail(invalid_input); 1747226031Sstas } 1748226031Sstas 1749226031Sstas if (input_length == unicode_max_length) fail(too_big); 1750226031Sstas 1751226031Sstas if (uplus[0] == 'u') case_flags[input_length] = 0; 1752226031Sstas else if (uplus[0] == 'U') case_flags[input_length] = 1; 1753226031Sstas else fail(invalid_input); 1754226031Sstas 1755226031Sstas input[input_length++] = codept; 1756226031Sstas } 1757226031Sstas 1758226031Sstas /* Encode: */ 1759226031Sstas 1760226031Sstas output_length = ace_max_length; 1761226031Sstas status = punycode_encode(input_length, input, case_flags, 1762226031Sstas &output_length, output); 1763226031Sstas if (status == punycode_bad_input) fail(invalid_input); 1764226031Sstas if (status == punycode_big_output) fail(too_big); 1765226031Sstas if (status == punycode_overflow) fail(overflow); 1766226031Sstas assert(status == punycode_success); 1767226031Sstas 1768226031Sstas /* Convert to native charset and output: */ 1769226031Sstas 1770226031Sstas for (j = 0; j < output_length; ++j) { 1771226031Sstas c = output[j]; 1772226031Sstas assert(c >= 0 && c <= 127); 1773226031Sstas if (print_ascii[c] == 0) fail(invalid_input); 1774226031Sstas output[j] = print_ascii[c]; 1775226031Sstas } 1776226031Sstas 1777226031Sstas output[j] = 0; 1778226031Sstas r = puts(output); 1779226031Sstas if (r == EOF) fail(io_error); 1780226031Sstas return EXIT_SUCCESS; 1781226031Sstas } 1782226031Sstas 1783226031Sstas if (argv[1][1] == 'd') { 1784226031Sstas char input[ace_max_length+2], *p, *pp; 1785226031Sstas punycode_uint output[unicode_max_length]; 1786226031Sstas 1787226031Sstas /* Read the Punycode input string and convert to ASCII: */ 1788226031Sstas 1789226031Sstas fgets(input, ace_max_length+2, stdin); 1790226031Sstas if (ferror(stdin)) fail(io_error); 1791226031Sstas 1792226031Sstas 1793226031Sstas 1794226031SstasCostello Standards Track [Page 32] 1795226031Sstas 1796226031SstasRFC 3492 IDNA Punycode March 2003 1797226031Sstas 1798226031Sstas 1799226031Sstas if (feof(stdin)) fail(invalid_input); 1800226031Sstas input_length = strlen(input) - 1; 1801226031Sstas if (input[input_length] != '\n') fail(too_big); 1802226031Sstas input[input_length] = 0; 1803226031Sstas 1804226031Sstas for (p = input; *p != 0; ++p) { 1805226031Sstas pp = strchr(print_ascii, *p); 1806226031Sstas if (pp == 0) fail(invalid_input); 1807226031Sstas *p = pp - print_ascii; 1808226031Sstas } 1809226031Sstas 1810226031Sstas /* Decode: */ 1811226031Sstas 1812226031Sstas output_length = unicode_max_length; 1813226031Sstas status = punycode_decode(input_length, input, &output_length, 1814226031Sstas output, case_flags); 1815226031Sstas if (status == punycode_bad_input) fail(invalid_input); 1816226031Sstas if (status == punycode_big_output) fail(too_big); 1817226031Sstas if (status == punycode_overflow) fail(overflow); 1818226031Sstas assert(status == punycode_success); 1819226031Sstas 1820226031Sstas /* Output the result: */ 1821226031Sstas 1822226031Sstas for (j = 0; j < output_length; ++j) { 1823226031Sstas r = printf("%s+%04lX\n", 1824226031Sstas case_flags[j] ? "U" : "u", 1825226031Sstas (unsigned long) output[j] ); 1826226031Sstas if (r < 0) fail(io_error); 1827226031Sstas } 1828226031Sstas 1829226031Sstas return EXIT_SUCCESS; 1830226031Sstas } 1831226031Sstas 1832226031Sstas usage(argv); 1833226031Sstas return EXIT_SUCCESS; /* not reached, but quiets compiler warning */ 1834226031Sstas} 1835226031Sstas 1836226031Sstas 1837226031Sstas 1838226031Sstas 1839226031Sstas 1840226031Sstas 1841226031Sstas 1842226031Sstas 1843226031Sstas 1844226031Sstas 1845226031Sstas 1846226031Sstas 1847226031Sstas 1848226031Sstas 1849226031Sstas 1850226031SstasCostello Standards Track [Page 33] 1851226031Sstas 1852226031SstasRFC 3492 IDNA Punycode March 2003 1853226031Sstas 1854226031Sstas 1855226031SstasAuthor's Address 1856226031Sstas 1857226031Sstas Adam M. Costello 1858226031Sstas University of California, Berkeley 1859226031Sstas http://www.nicemice.net/amc/ 1860226031Sstas 1861226031Sstas 1862226031Sstas 1863226031Sstas 1864226031Sstas 1865226031Sstas 1866226031Sstas 1867226031Sstas 1868226031Sstas 1869226031Sstas 1870226031Sstas 1871226031Sstas 1872226031Sstas 1873226031Sstas 1874226031Sstas 1875226031Sstas 1876226031Sstas 1877226031Sstas 1878226031Sstas 1879226031Sstas 1880226031Sstas 1881226031Sstas 1882226031Sstas 1883226031Sstas 1884226031Sstas 1885226031Sstas 1886226031Sstas 1887226031Sstas 1888226031Sstas 1889226031Sstas 1890226031Sstas 1891226031Sstas 1892226031Sstas 1893226031Sstas 1894226031Sstas 1895226031Sstas 1896226031Sstas 1897226031Sstas 1898226031Sstas 1899226031Sstas 1900226031Sstas 1901226031Sstas 1902226031Sstas 1903226031Sstas 1904226031Sstas 1905226031Sstas 1906226031SstasCostello Standards Track [Page 34] 1907226031Sstas 1908226031SstasRFC 3492 IDNA Punycode March 2003 1909226031Sstas 1910226031Sstas 1911226031SstasFull Copyright Statement 1912226031Sstas 1913226031Sstas Copyright (C) The Internet Society (2003). All Rights Reserved. 1914226031Sstas 1915226031Sstas This document and translations of it may be copied and furnished to 1916226031Sstas others, and derivative works that comment on or otherwise explain it 1917226031Sstas or assist in its implementation may be prepared, copied, published 1918226031Sstas and distributed, in whole or in part, without restriction of any 1919226031Sstas kind, provided that the above copyright notice and this paragraph are 1920226031Sstas included on all such copies and derivative works. However, this 1921226031Sstas document itself may not be modified in any way, such as by removing 1922226031Sstas the copyright notice or references to the Internet Society or other 1923226031Sstas Internet organizations, except as needed for the purpose of 1924226031Sstas developing Internet standards in which case the procedures for 1925226031Sstas copyrights defined in the Internet Standards process must be 1926226031Sstas followed, or as required to translate it into languages other than 1927226031Sstas English. 1928226031Sstas 1929226031Sstas The limited permissions granted above are perpetual and will not be 1930226031Sstas revoked by the Internet Society or its successors or assigns. 1931226031Sstas 1932226031Sstas This document and the information contained herein is provided on an 1933226031Sstas "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING 1934226031Sstas TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING 1935226031Sstas BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION 1936226031Sstas HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF 1937226031Sstas MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 1938226031Sstas 1939226031SstasAcknowledgement 1940226031Sstas 1941226031Sstas Funding for the RFC Editor function is currently provided by the 1942226031Sstas Internet Society. 1943226031Sstas 1944226031Sstas 1945226031Sstas 1946226031Sstas 1947226031Sstas 1948226031Sstas 1949226031Sstas 1950226031Sstas 1951226031Sstas 1952226031Sstas 1953226031Sstas 1954226031Sstas 1955226031Sstas 1956226031Sstas 1957226031Sstas 1958226031Sstas 1959226031Sstas 1960226031Sstas 1961226031Sstas 1962226031SstasCostello Standards Track [Page 35] 1963226031Sstas 1964