[comment {-*- text -*-}]
[section {PE serialization format}]

Here we specify the format used by the Parser Tools to serialize
Parsing Expressions as immutable values for transport, comparison,
etc.

[para]

We distinguish between [term regular] and [term canonical]
serializations.

While a parsing expression may have more than one regular
serialization, exactly one of them is [term canonical].

[list_begin definitions][comment {-- serializations --}]
[def {Regular serialization}]

[list_begin definitions][comment {-- regular points --}]

[def [const {Atomic Parsing Expressions}]]
[list_begin enumerated][comment {-- atomic points --}]

[enum]
The string [const epsilon] is an atomic parsing expression. It matches
the empty string.

[enum]
The string [const dot] is an atomic parsing expression. It matches
any character.

[enum]
The string [const alnum] is an atomic parsing expression. It matches
any Unicode alphabet or digit character. This is a custom extension of
PEs based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const alpha] is an atomic parsing expression. It matches
any Unicode alphabet character. This is a custom extension of PEs
based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const ascii] is an atomic parsing expression. It matches
any Unicode character below U+0080. This is a custom extension of PEs
based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const control] is an atomic parsing expression. It matches
any Unicode control character. This is a custom extension of PEs based
on Tcl's builtin command [cmd {string is}].

[enum]
The string [const digit] is an atomic parsing expression. It matches
any Unicode digit character. Note that this includes characters
outside of the [lb]0..9[rb] range. This is a custom extension of PEs
based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const graph] is an atomic parsing expression. It matches
any Unicode printing character, except for space. This is a custom
extension of PEs based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const lower] is an atomic parsing expression. It matches
any Unicode lower-case alphabet character. This is a custom extension
of PEs based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const print] is an atomic parsing expression. It matches
any Unicode printing character, including space. This is a custom
extension of PEs based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const punct] is an atomic parsing expression. It matches
any Unicode punctuation character. This is a custom extension of PEs
based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const space] is an atomic parsing expression. It matches
any Unicode space character. This is a custom extension of PEs based
on Tcl's builtin command [cmd {string is}].

[enum]
The string [const upper] is an atomic parsing expression. It matches
any Unicode upper-case alphabet character. This is a custom extension
of PEs based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const wordchar] is an atomic parsing expression. It
matches any Unicode word character. This is any alphanumeric character
(see alnum), and any connector punctuation characters (e.g.
underscore). This is a custom extension of PEs based on Tcl's builtin
command [cmd {string is}].

[enum]
The string [const xdigit] is an atomic parsing expression. It matches
any hexadecimal digit character. This is a custom extension of PEs
based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const ddigit] is an atomic parsing expression. It matches
any decimal digit character. This is a custom extension of PEs based
on Tcl's builtin command [cmd regexp].

[enum]
The expression
    [lb]list t [var x][rb]
is an atomic parsing expression. It matches the terminal string [var x].

[enum]
The expression
    [lb]list n [var A][rb]
is an atomic parsing expression. It matches the nonterminal [var A].

[list_end][comment {-- atomic points --}]

[def [const {Combined Parsing Expressions}]]
[list_begin enumerated][comment {-- combined points --}]

[enum]
For parsing expressions [var e1], [var e2], ... the result of

    [lb]list / [var e1] [var e2] ... [rb]

is a parsing expression as well.

This is the [term {ordered choice}], aka [term {prioritized choice}].

[enum]
For parsing expressions [var e1], [var e2], ... the result of

    [lb]list x [var e1] [var e2] ... [rb]

is a parsing expression as well.

This is the [term {sequence}].

[enum]
For a parsing expression [var e] the result of

    [lb]list * [var e][rb]

is a parsing expression as well.

This is the [term {Kleene closure}], describing zero or more
repetitions.

[enum]
For a parsing expression [var e] the result of

    [lb]list + [var e][rb]

is a parsing expression as well.

This is the [term {positive Kleene closure}], describing one or more
repetitions.

[enum]
For a parsing expression [var e] the result of

    [lb]list & [var e][rb]

is a parsing expression as well.

This is the [term {and lookahead predicate}].

[enum]
For a parsing expression [var e] the result of

    [lb]list ! [var e][rb]

is a parsing expression as well.

This is the [term {not lookahead predicate}].


[enum]
For a parsing expression [var e] the result of

    [lb]list ? [var e][rb]

is a parsing expression as well.

This is the [term {optional input}].


[list_end][comment {-- combined points --}]
[list_end][comment {-- regular points --}]
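[para]
As an illustration of the forms above, a regular serialization can be
assembled with Tcl's builtin command [cmd list]. This is a minimal
sketch only; the expression, the nonterminal [const B], and the
variable names are hypothetical and not taken from the Parser Tools
sources.

[example {
# Hypothetical expression: 'a' (B / digit)* !.
# (terminal a, then any number of B-or-digit, then end of input)
set rep [list * [list / [list n B] digit]]
set eof [list ! dot]
set pe  [list x [list t a] $rep $eof]

puts $pe
# prints: x {t a} {* {/ {n B} digit}} {! dot}
}]

Building the value through [cmd list] lets Tcl generate the string
representation, so no superfluous whitespace is introduced by hand.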

[def {Canonical serialization}]

The canonical serialization of a parsing expression has the format as
specified in the previous item, and then additionally satisfies the
constraints below, which make it unique among all the possible
serializations of this parsing expression.

[list_begin enumerated][comment {-- canonical points --}]
[enum]

The string representation of the value is the canonical representation
of a pure Tcl list. I.e. it does not contain superfluous whitespace.

[enum]

Terminals are [emph not] encoded as ranges whose start and end are
identical.

[comment {
	 Thinking about this I am not sure if that was a good move.
	 There are a lot more equivalent encodings around than just
	 the one I used above. Examples:

	 	 {x {t a} {t b} {t c} {t d}}
	 	 {x {x {t a} {t b}} {x {t c} {t d}}}
	 	 {x {x {t a} {t b} {t c} {t d}}}

	 etc. Having the t/.. equivalence added it can now be argued
	 that we should handle these as well. Which essentially
	 amounts to a wholesale system to simplify parsing
	 expressions. This moves expression equality from intensional
	 to extensional, or as near as is possible.

	 The only counter-argument I have is that the t/.. equivalence
	 is restricted to leaves of the tree, or alternatively, to
	 terminal symbol operators.
}]

[list_end][comment {-- canonical points --}]
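[para]
The whitespace constraint can be met by rebuilding a regular
serialization with [cmd list]. The following is a sketch under stated
assumptions, not the Parser Tools implementation: the procedure name
[cmd pe_normalize] is hypothetical, and only the forms listed in this
section are handled.

[example {
# Hypothetical helper: rebuild a serialization so that its string
# representation is the canonical form of a pure Tcl list.
proc pe_normalize {pe} {
    # Atomic string forms (epsilon, dot, alnum, ...) are single words.
    if {[llength $pe] == 1} { return [lindex $pe 0] }
    set op [lindex $pe 0]
    switch -exact -- $op {
        t - n {
            # The operand is a plain string, keep it untouched.
            return [list $op [lindex $pe 1]]
        }
        default {
            # Operators: normalize every operand recursively.
            set res [list $op]
            foreach child [lrange $pe 1 end] {
                lappend res [pe_normalize $child]
            }
            return $res
        }
    }
}

set a {x {t a} {* {n B}}}
set b "x   {t a}  { * {n  B} }"
puts [string equal $a [pe_normalize $b]]
# prints: 1
}]

Two serializations that differ only in whitespace compare equal as
strings after this normalization. The sketch does not address the
constraint on terminals.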
[list_end][comment {-- serializations --}]
[para]

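The rules for regular serializations translate fairly directly into a
recursive check. The sketch below assumes Tcl 8.5 or later (for
[cmd lassign] and the [const in] operator); the procedure name
[cmd pe_valid] is hypothetical and the code is an illustration, not
part of the Parser Tools.

[example {
# Hypothetical helper: return 1 if the value has the shape of a regular
# serialization as described in this section, 0 otherwise.
proc pe_valid {pe} {
    set atoms {
        epsilon dot alnum alpha ascii control digit graph lower
        print punct space upper wordchar xdigit ddigit
    }
    # The value must be a well-formed Tcl list.
    if {[catch {llength $pe} n]} { return 0 }
    if {$n == 1} { return [expr {[lindex $pe 0] in $atoms}] }
    set operands [lassign $pe op]
    switch -exact -- $op {
        t - n {
            # Exactly one operand, an arbitrary string.
            return [expr {$n == 2}]
        }
        / - x {
            # Any number of operands, each a parsing expression.
            foreach e $operands {
                if {![pe_valid $e]} { return 0 }
            }
            return 1
        }
        * - + - & - ! - ? {
            # Exactly one operand, itself a parsing expression.
            return [expr {$n == 2 && [pe_valid [lindex $operands 0]]}]
        }
    }
    return 0
}

puts [pe_valid {x {t a} {* {n B}}}]
puts [pe_valid {x {q a}}]
}]

The first call reports a valid serialization, the second an invalid
one, because [const q] is not among the operators listed above.

[para]
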
[subsection Example]

For the parsing expression shown on the right-hand side of the
rule

[para]
[include ../example/expr_pe.inc]
[para]

the canonical serialization (except for whitespace) is

[para]
[include ../example/expr_pe_serial.inc]
[para]
