NameDateSize

..25-Jul-20196

.gitignoreH A D25-Jul-201911

charsetScript.smlH A D12-Sep-20197 KiB

codegen/H25-Jul-20197

DFA_Codegen.sigH A D25-Jul-2019333

DFA_Codegen.smlH A D20-Nov-20198.3 KiB

even.predicateH A D25-Jul-20195.1 KiB

HolmakefileH A D20-Nov-2019338

lexgen/H25-Jul-201911

READMEH A D05-Oct-20197.4 KiB

regexp2dfa.smlH A D20-Nov-20193.8 KiB

regexp_compilerScript.smlH A D07-Jul-202082.8 KiB

regexp_mapScript.smlH A D07-Jul-202021 KiB

Regexp_Match.sigH A D25-Jul-2019508

Regexp_Match.smlH A D25-Jul-201921.4 KiB

Regexp_Numerics.sigH A D25-Jul-20192.8 KiB

Regexp_Numerics.smlH A D05-Oct-201917.3 KiB

regexp_parserScript.smlH A D12-Sep-20198.1 KiB

Regexp_Type.sigH A D25-Jul-20192.2 KiB

Regexp_Type.smlH A D12-Sep-201927.5 KiB

regexpLib.sigH A D05-Oct-2019556

regexpLib.smlH A D20-Nov-201916.5 KiB

regexpMisc.sigH A D05-Oct-2019685

regexpMisc.smlH A D05-Oct-20191.6 KiB

regexpScript.smlH A D07-Jul-202072.5 KiB

regexpSyntax.sigH A D25-Jul-20191.6 KiB

regexpSyntax.smlH A D25-Jul-20199.1 KiB

regular-play/H14-Nov-201910

test.holH A D05-Oct-20199.6 KiB

test.mlH A D05-Oct-201921.8 KiB

vec_mapScript.smlH A D07-Jul-20203.4 KiB

README

1`Overview
2========
3
4The code in this directory generates DFAs from regular expressions
5using the technique of Brzozowski derivatives. There is a full formal
6definition and correctness proof of a regexp compiler, along with some
7ML support code.  In particular, there are two mechanisms---meeting a
8single interface---available for compiling regexps to DFAs.
9
10  1. HOL proof
11  2. SML code with no proofs generated
12
13Both are available through loading regexpLib into HOL4. There is also
14a standalone executable "regexp2dfa" that generates state machines to
15std_out in a variety of programming languages.
16
17
18Concrete syntax of regular expressions
19=======================================
20
21See test.ml and test.hol for examples.
22
23  [...] = character set
24  [^...] = complement char set
25  .  = any char
26  \d = [0-9]
27  \w = [a-zA-Z0-9_]
28  \s = whitespace = [ \n\r\t\f]  (* Note the space character! *)
29  \c = escape c, for c in {\,`,.,^,$,*,+,?,|,~,{,},[,],(,)}
30  \ddd = ASCII character given in decimal representation (\000 to \255)
31  \i{lo,hi} = Interval of numbers between lo and hi inclusive. Details of endianess and
32     the representation of signed numbers are set to defaults which can be
33     changed by Regexp_Type.{get,set}_intervalFn.
34
35  r|s = alternation
36  r&s = intersection
37  r* = Kleene star
38  r+ = rr*
39  r? = "" | r
40  r{n} = r^n
41  r{m,n} = r^m | r^{m+1} | ... | r^n (m<=n)
42  r{m,} = r{m}r*
43  r{,n} = r{0,n}
44  rs  = concatenation
45  ~r = complement
46  (r) = grouping
47
48Note: For now, as traditional, spaces are significant in regexp
49concrete syntax.
50
51Note: The regexp parser (implemented in Regexp_Type) is external to
52HOL and produces elements of the ML regexp datatype. Maps between the
53ML regexp datatype and HOL terms of type regexp may be found in
54regexpSyntax.
55
56
57
58Directory structure and content
59===============================
60
61The following are the important ML structures and HOL theories.
62
63Regexp_Type
64
65   * SML definition of the type of regular expressions. This lies at
66     the base of both "proof" and "code" paths.
67
68   * The size of the alphabet (default 256) can be set here. Note that
69     large alphabets like Unicode will need a different approach. Also
70     note that the interval regexp support assumes that the alphabet
71     size is 256.
72
73   * There is a regexp parser implementation (entrypoints:
74     fromSubstring, fromString, fromQuote). See above for the concrete
75     syntax. The files test.ml and test.hol have examples. There is
76     also a prettyprinter pp_regexp and associated print_regexp.
77
78Regexp_Match.
79
80   * SML implementation of the regexp match compiler. Does not
81     generate proofs. Is close to the algorithms in regexpTheory and
82     regexp_compilerTheory, having been derived from them.
83
84   * A version that only runs the search and doesn't post-process to
85     obtain the DFA is in the function domBrz. This is useful for
86     performance debugging, and for visualizing the workings of the
87     algorithm. Visualization is controlled by the "regexp-compiler"
88     trace (default is "on").
89
90Regexp_Numerics.
91
92   * Provides support for encoding numbers into strings in various ways
93     (unsigned, twos-complement, zig-zag, and sign-magnitude encodings are
94     supported).
95
96DFA_Codegen.
97
98   * Program support for generating state machines in Ada, C, Java, SML.
99
100charset{Script,Theory}.
101
102   * The type of regular expressions uses character sets. Since there
103     is more than one way to implement this, we have broken it out
104     into a separate theory.
105
106regexp{Script,Theory}.
107
108   * HOL theory of regexps. Much of the Regexp_Match code has been
109     hand-translated from this theory.
110
111eq_cmp_bmap{Script,Theory}
112
113   * Extension to the balanced_map theory, to support the common case
114     where the set of keys mapping to an element is a singleton. This
115     is used to relate balanced_mapTheory to finite_mapTheory.
116
117vec_map{Script,Theory}
118
119   * Support theory for translating maps into vector format.
120
121regexp_compiler{Script,Theory}.
122
123   * HOL theory proving correctness of derivative-based compiler
124    from regexps to DFAs.
125
126regexpSyntax.
127
128    * Support for manipulating regexp abstract syntax trees in HOL.
129      Also maps between Regexp_Type.regexp and the HOL regexp type.
130
131regexpLib.
132
133   * HOL evaluator for the regexp compilation and matching functions
134     defined in regexpTheory.
135
136   * A general matcher which allows choosing deduction or SML.
137
138test.{hol,ml}
139
140   * Examples of the application of the compiler/DFA evaluator, with,
141     and without, proofs.
142
143even.predicate
144
145   * Example showing showing how arithmetic predicates can be realized
146     by regexps.
147
148regexp2dfa
149
150   * Command-line tool for generating programs (Ada,C,Java,ML) from
151     regexps. See codegen/gen_*_files for invocation examples.
152
153codegen
154
155   * Directory of shell scripts demonstrating generation of code from
156     regexps.  Contents: gen_ada_files, gen__c_files, ...
157
158
159Status.
160========
161
162Jan 31. 2017.
163
164* Added some more optimizations to Regexp_Match that aren't in the
165  corresponding theories. See regexp_compare and build_or. See also
166  the notes in the implementation of domBrz supporting more efficient
167  handling of regexps of the form \w{1,n} (or \d{1,n}), which if left
168  alone, expose an exponential blow-up. The blow-up is not totally
169  defeated, but has been pushed much further away.
170
171* Proved some theorems in regexpScript.sml relating various representations
172  of Sigma*
173
174* Regexp_Type : exploring different ways of generating packed intervals. Not
175  in its final form, or thoroughly tested.
176
177Feb 13. 2017.
178
179* Moved formalization of charsets to be word64#word64#word64#word64. This
180  will support CakeML translation. However, the implementation of the
181  SML regexp compiler in Regexp_Match still uses IntInf as the representation
182  of charsets. It might be interesting to compare the two representations for
183  efficiency, but that's future work.
184
185* Incorporated in-logic PEG parser for regexps (regexp_parserTheory).
186  This was renamed from reSyntax to be in line with the existing naming
187  scheme (pre-pending things with "regexp").
188
189* Revised packed interval syntax \p{e_1 ... e_n} to get rid of superfluous
190  commas between elements, and to incorporate support for padding.
191
192July 1, 2017
193
194* Converted ML implementation of charsets to
195  word64#word64#word64#word64 in order to better align with the HOL
196  formalization.
197
198* Replaced regexp2{c,java,ml} by an all-inclusive regexp2dfa. Added options
199  for theorem generation and Ada. The usage message looks like
200
201    regexp2dfa:
202    Usage: regexp2dfa [-dfagen (HOL | SML)] (Ada | C | Java | ML | Thm) <name> '<regexp>'
203
204* Updated syntax for some examples.
205
206* codegen/gen_files split out into separate file generators for each language.
207
208July 1, 2019
209
210* Expansion of support for numeric encodings in regexps in Regexp_Numerics.
211
212* Removal of existing existing numeric regexp stuff in Regexp_Type. It
213  was incomplete and overly specialized. Padding has been removed.
214
215* Codegen now available from DFA_Codegen structure, as well as regexp2dfa.
216
217Sept 27, 2019
218
219* Extensive optimizations to speed up dfa generation by proof. Charset
220  membership and union are memoized and also the outgoing transitions
221  of a regexp are memoized in order to prevent them from being
222  generated twice (once when running dom_Brz_alt and once when running
223  exec_Brz). Smart derivatives of "Or" regexps are also dealt with
224  specially.  More can certainly still be done but at least now some
225  useful examples will go through in finite time.
226
227* The regexp_compset() has been removed and replaced by gen_dfa_conv.
228