[comment {-*- text -*-}]
[section {PE serialization format}]

Here we specify the format used by the Parser Tools to serialize
Parsing Expressions as immutable values for transport, comparison,
etc.

[para]

We distinguish between [term regular] and [term canonical]
serializations.

While a parsing expression may have more than one regular
serialization, exactly one of them is [term canonical].

[list_begin definitions][comment {-- serializations --}]
[def {Regular serialization}]

[list_begin definitions][comment {-- regular points --}]

[def [const {Atomic Parsing Expressions}]]
[list_begin enumerated][comment {-- atomic points --}]

[enum]
The string [const epsilon] is an atomic parsing expression. It matches
the empty string.

[enum]
The string [const dot] is an atomic parsing expression. It matches
any character.

[enum]
The string [const alnum] is an atomic parsing expression. It matches
any Unicode alphabetic or digit character. This is a custom extension
of PEs based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const alpha] is an atomic parsing expression. It matches
any Unicode alphabetic character. This is a custom extension of PEs
based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const ascii] is an atomic parsing expression. It matches
any Unicode character below U+0080. This is a custom extension of PEs
based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const control] is an atomic parsing expression. It matches
any Unicode control character. This is a custom extension of PEs based
on Tcl's builtin command [cmd {string is}].

[enum]
The string [const digit] is an atomic parsing expression. It matches
any Unicode digit character. Note that this includes characters
outside of the [lb]0..9[rb] range. This is a custom extension of PEs
based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const graph] is an atomic parsing expression. It matches
any Unicode printing character, except for space. This is a custom
extension of PEs based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const lower] is an atomic parsing expression. It matches
any Unicode lower-case alphabetic character. This is a custom extension
of PEs based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const print] is an atomic parsing expression. It matches
any Unicode printing character, including space. This is a custom
extension of PEs based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const punct] is an atomic parsing expression. It matches
any Unicode punctuation character. This is a custom extension of PEs
based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const space] is an atomic parsing expression. It matches
any Unicode space character. This is a custom extension of PEs based
on Tcl's builtin command [cmd {string is}].

[enum]
The string [const upper] is an atomic parsing expression. It matches
any Unicode upper-case alphabetic character. This is a custom extension
of PEs based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const wordchar] is an atomic parsing expression. It
matches any Unicode word character, that is, any alphanumeric character
(see [const alnum]) and any connector punctuation character (e.g.
underscore). This is a custom extension of PEs based on Tcl's builtin
command [cmd {string is}].

[enum]
The string [const xdigit] is an atomic parsing expression. It matches
any hexadecimal digit character. This is a custom extension of PEs
based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const ddigit] is an atomic parsing expression. It matches
any decimal digit character. This is a custom extension of PEs based
on Tcl's builtin command [cmd regexp].

[enum]
The expression
    [lb]list t [var x][rb]
is an atomic parsing expression. It matches the terminal string [var x].

[enum]
The expression
    [lb]list n [var A][rb]
is an atomic parsing expression. It matches the nonterminal [var A].

[list_end][comment {-- atomic points --}]

[def [const {Combined Parsing Expressions}]]
[list_begin enumerated][comment {-- combined points --}]

[enum]
For parsing expressions [var e1], [var e2], ... the result of

    [lb]list / [var e1] [var e2] ... [rb]

is a parsing expression as well.

This is the [term {ordered choice}], aka [term {prioritized choice}].

[enum]
For parsing expressions [var e1], [var e2], ... the result of

    [lb]list x [var e1] [var e2] ... [rb]

is a parsing expression as well.

This is the [term {sequence}].

[enum]
For a parsing expression [var e] the result of

    [lb]list * [var e][rb]

is a parsing expression as well.

This is the [term {kleene closure}], describing zero or more
repetitions.

[enum]
For a parsing expression [var e] the result of

    [lb]list + [var e][rb]

is a parsing expression as well.

This is the [term {positive kleene closure}], describing one or more
repetitions.

[enum]
For a parsing expression [var e] the result of

    [lb]list & [var e][rb]

is a parsing expression as well.

This is the [term {and lookahead predicate}].

[enum]
For a parsing expression [var e] the result of

    [lb]list ! [var e][rb]

is a parsing expression as well.

This is the [term {not lookahead predicate}].

[enum]
For a parsing expression [var e] the result of

    [lb]list ? [var e][rb]

is a parsing expression as well.

This is the [term {optional input}].

[list_end][comment {-- combined points --}]
[list_end][comment {-- regular points --}]

[def {Canonical serialization}]

The canonical serialization of a parsing expression has the format
specified in the previous item, and additionally satisfies the
constraints below, which make it unique among all the possible
serializations of this parsing expression.

[list_begin enumerated][comment {-- canonical points --}]
[enum]

The string representation of the value is the canonical representation
of a pure Tcl list, i.e. it does not contain superfluous whitespace.

[enum]

Terminals are [emph not] encoded as ranges (where start and end of the
range are identical).

[comment {
    Thinking about this I am not sure if that was a good move.
    There are a lot more equivalent encodings around than just
    the one I used above. Examples

        {x {t a} {t b} {t c} {t d}}
        {x {x {t a} {t b}} {x {t c} {t d}}}
        {x {x {t a} {t b} {t c} {t d}}}

    etc. Having the t/.. equivalence added it can now be argued
    that we should handle these as well. Which essentially
    amounts to a whole-sale system to simplify parsing
    expressions. This moves expression equality from intensional
    to extensional, or as near as is possible.

    The only counter-argument I have is that the t/.. equivalence
    is restricted to leaves of the tree, or alternatively, to
    terminal symbol operators.
}]

[list_end][comment {-- canonical points --}]
[list_end][comment {-- serializations --}]
[para]

[subsection Example]

Assuming the parsing expression shown on the right-hand side of the
rule

[para]
[include ../example/expr_pe.inc]
[para]

then its canonical serialization (except for whitespace) is

[para]
[include ../example/expr_pe_serial.inc]
[para]
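The rules for regular serializations given in this section can be
checked mechanically. The following is a minimal sketch in plain Tcl
(8.5 or later), not part of the Parser Tools API: the procedures
[cmd pe_valid] and [cmd pe_canon] are hypothetical names introduced
here for illustration only. [cmd pe_valid] tests a value against the
atomic and combined forms listed above. [cmd pe_canon] rebuilds a
valid serialization with the list commands, which removes superfluous
whitespace (the first canonical constraint); it does not attempt to
handle the constraint on terminals and ranges.

[para]
[example {
# Hypothetical helper: is $pe a regular PE serialization?
proc pe_valid {pe} {
    # Atomic expressions which are plain strings.
    set atoms {epsilon dot alnum alpha ascii control digit graph lower
               print punct space upper wordchar xdigit ddigit}
    if {$pe in $atoms} { return 1 }

    # Everything else must be a list of operator plus arguments.
    if {[catch {llength $pe} n] || $n < 2} { return 0 }
    set rest [lassign $pe op]

    switch -exact -- $op {
        t - n {
            # Terminal and nonterminal: exactly one argument.
            return [expr {$n == 2}]
        }
        / - x {
            # Choice and sequence: one or more sub-expressions.
            foreach e $rest {
                if {![pe_valid $e]} { return 0 }
            }
            return 1
        }
        * - + - & - ! - ? {
            # Unary operators: exactly one sub-expression.
            return [expr {$n == 2 && [pe_valid [lindex $rest 0]]}]
        }
    }
    return 0
}

# Hypothetical helper: strip superfluous whitespace from a valid
# serialization by rebuilding it with list commands.
proc pe_canon {pe} {
    if {[llength $pe] == 1} { return [lindex $pe 0] }
    set rest [lassign $pe op]
    if {$op in {t n}} { return [list $op [lindex $rest 0]] }
    set out [list $op]
    foreach e $rest { lappend out [pe_canon $e] }
    return $out
}

puts [pe_valid {x {t a} {* {n A}}}]   ;# prints 1
puts [pe_valid {* {t a} {t b}}]       ;# prints 0, * takes one argument
puts [pe_canon {x   {t a}  {*  {n A}}}]
}]
[para]

Comparing the output of [cmd pe_canon] for two valid serializations
with [cmd {string equal}] then gives a whitespace-insensitive
comparison, under the assumptions stated above.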