1theory Foundations
2imports Introduction
3begin
4
5section \<open>Code generation foundations \label{sec:foundations}\<close>
6
7subsection \<open>Code generator architecture \label{sec:architecture}\<close>
8
9text \<open>
10  The code generator is actually a framework consisting of different
11  components which can be customised individually.
12
13  Conceptually all components operate on Isabelle's logic framework
14  Pure.  Practically, the object logic HOL
15  provides the necessary facilities to make use of the code generator,
16  mainly since it is an extension of Isabelle/Pure.
17
18  The constellation of the different components is visualized in the
19  following picture.
20
21  \begin{figure}[h]
22    \def\sys#1{\emph{#1}}
23    \begin{tikzpicture}[x = 4cm, y = 1cm]
24      \tikzstyle positive=[color = black, fill = white];
25      \tikzstyle negative=[color = white, fill = black];
26      \tikzstyle entity=[rounded corners, draw, thick];
27      \tikzstyle process=[ellipse, draw, thick];
28      \tikzstyle arrow=[-stealth, semithick];
29      \node (spec) at (0, 3) [entity, positive] {specification tools};
30      \node (user) at (1, 3) [entity, positive] {user proofs};
31      \node (spec_user_join) at (0.5, 3) [shape=coordinate] {};
32      \node (raw) at (0.5, 4) [entity, positive] {raw code equations};
33      \node (pre) at (1.5, 4) [process, positive] {preprocessing};
34      \node (eqn) at (2.5, 4) [entity, positive] {code equations};
35      \node (iml) at (0.5, 0) [entity, positive] {intermediate program};
36      \node (seri) at (1.5, 0) [process, positive] {serialisation};
37      \node (SML) at (2.5, 3) [entity, positive] {\sys{SML}};
38      \node (OCaml) at (2.5, 2) [entity, positive] {\sys{OCaml}};
39      \node (Haskell) at (2.5, 1) [entity, positive] {\sys{Haskell}};
40      \node (Scala) at (2.5, 0) [entity, positive] {\sys{Scala}};
41      \draw [semithick] (spec) -- (spec_user_join);
42      \draw [semithick] (user) -- (spec_user_join);
43      \draw [-diamond, semithick] (spec_user_join) -- (raw);
44      \draw [arrow] (raw) -- (pre);
45      \draw [arrow] (pre) -- (eqn);
46      \draw [arrow] (eqn) -- node (transl) [process, positive] {translation} (iml);
47      \draw [arrow] (iml) -- (seri);
48      \draw [arrow] (seri) -- (SML);
49      \draw [arrow] (seri) -- (OCaml);
50      \draw [arrow] (seri) -- (Haskell);
51      \draw [arrow] (seri) -- (Scala);
52    \end{tikzpicture}
53    \caption{Code generator architecture}
54    \label{fig:arch}
55  \end{figure}
56
57  \noindent Central to code generation is the notion of \emph{code
58  equations}.  A code equation as a first approximation is a theorem
59  of the form @{text "f t\<^sub>1 t\<^sub>2 \<dots> t\<^sub>n \<equiv> t"} (an equation headed by a
60  constant @{text f} with arguments @{text "t\<^sub>1 t\<^sub>2 \<dots> t\<^sub>n"} and right
61  hand side @{text t}).
62
63  \begin{itemize}
64
65    \item Starting point of code generation is a collection of (raw)
66      code equations in a theory. It is not relevant where they stem
67      from, but typically they were either produced by specification
68      tools or proved explicitly by the user.
69      
70    \item These raw code equations can be subjected to theorem
71      transformations.  This \qn{preprocessor} (see
72      \secref{sec:preproc}) can apply the full expressiveness of
73      ML-based theorem transformations to code generation.  The result
74      of preprocessing is a structured collection of code equations.
75
76    \item These code equations are \qn{translated} to a program in an
77      abstract intermediate language.  Think of it as a kind of
78      \qt{Mini-Haskell} with four \qn{statements}: @{text data} (for
79      datatypes), @{text fun} (stemming from code equations), also
80      @{text class} and @{text inst} (for type classes).
81
82    \item Finally, the abstract program is \qn{serialised} into
83      concrete source code of a target language.  This step only
84      produces concrete syntax but does not change the program in
85      essence; all conceptual transformations occur in the translation
86      step.
87
88  \end{itemize}
89
90  \noindent From these steps, only the last two are carried out
91  outside the logic; by keeping this layer as thin as possible, the
92  amount of code to trust is kept to a minimum.
93\<close>
94
95
96subsection \<open>The pre- and postprocessor \label{sec:preproc}\<close>
97
98text \<open>
99  Before selected function theorems are turned into abstract code, a
100  chain of definitional transformation steps is carried out:
101  \emph{preprocessing}.  The preprocessor consists of two
102  components: a \emph{simpset} and \emph{function transformers}.
103
104  The preprocessor simpset has a disparate brother, the
105  \emph{postprocessor simpset}.
106  In the theory-to-code scenario depicted in the picture 
107  above, it plays no role.  But if generated code is used
108  to evaluate expressions (cf.~\secref{sec:evaluation}), the
109  postprocessor simpset is applied to the resulting expression before this
110  is turned back.
111
112  The pre- and postprocessor \emph{simpsets} can apply the
113  full generality of the Isabelle
114  simplifier.  Due to the interpretation of theorems as code
115  equations, rewrites are applied to the right hand side and the
116  arguments of the left hand side of an equation, but never to the
117  constant heading the left hand side.
118
119  Pre- and postprocessor can be setup to transfer between
120  expressions suitable for logical reasoning and expressions 
121  suitable for execution.  As example, take list membership; logically
122  it is expressed as @{term "x \<in> set xs"}.  But for execution
123  the intermediate set is not desirable.  Hence the following
124  specification:
125\<close>
126
127definition %quote member :: "'a list \<Rightarrow> 'a \<Rightarrow> bool"
128where
129  [code_abbrev]: "member xs x \<longleftrightarrow> x \<in> set xs"
130
131text \<open>
132  \noindent The \emph{@{attribute code_abbrev} attribute} declares
133  its theorem a rewrite rule for the postprocessor and the symmetric
134  of its theorem as rewrite rule for the preprocessor.  Together,
135  this has the effect that expressions @{thm (rhs) member_def [no_vars]}
136  are replaced by @{thm (lhs) member_def [no_vars]} in generated code, but
137  are turned back into @{thm (rhs) member_def [no_vars]} if generated
138  code is used for evaluation.
139
140  Rewrite rules for pre- or postprocessor may be
141  declared independently using \emph{@{attribute code_unfold}}
142  or \emph{@{attribute code_post}} respectively.
143
144  \emph{Function transformers} provide a very general
145  interface, transforming a list of function theorems to another list
146  of function theorems, provided that neither the heading constant nor
147  its type change.  The @{term "0::nat"} / @{const Suc} pattern
148  used in theory @{text Code_Abstract_Nat} (see \secref{abstract_nat})
149  uses this interface.
150
151  \noindent The current setup of the pre- and postprocessor may be inspected
152  using the @{command_def print_codeproc} command.  @{command_def
153  code_thms} (see \secref{sec:equations}) provides a convenient
154  mechanism to inspect the impact of a preprocessor setup on code
155  equations.  Attribute @{attribute code_preproc_trace} allows
156  for low-level tracing:
157\<close>
158
159declare %quote [[code_preproc_trace]]
160
161declare %quote [[code_preproc_trace only: distinct remdups]]
162
163declare %quote [[code_preproc_trace off]]
164
165
166subsection \<open>Understanding code equations \label{sec:equations}\<close>
167
168text \<open>
169  As told in \secref{sec:principle}, the notion of code equations is
170  vital to code generation.  Indeed most problems which occur in
171  practice can be resolved by an inspection of the underlying code
172  equations.
173
174  It is possible to exchange the default code equations for constants
175  by explicitly proving alternative ones:
176\<close>
177
178lemma %quote [code]:
179  "dequeue (AQueue xs []) =
180     (if xs = [] then (None, AQueue [] [])
181       else dequeue (AQueue [] (rev xs)))"
182  "dequeue (AQueue xs (y # ys)) =
183     (Some y, AQueue xs ys)"
184  by (cases xs, simp_all) (cases "rev xs", simp_all)
185
186text \<open>
187  \noindent The annotation @{text "[code]"} is an @{text attribute}
188  which states that the given theorems should be considered as code
189  equations for a @{text fun} statement -- the corresponding constant
190  is determined syntactically.  The resulting code:
191\<close>
192
193text %quotetypewriter \<open>
194  @{code_stmts dequeue (consts) dequeue (Haskell)}
195\<close>
196
197text \<open>
198  \noindent You may note that the equality test @{term "xs = []"} has
199  been replaced by the predicate @{term "List.null xs"}.  This is due
200  to the default setup of the \qn{preprocessor}.
201
202  This possibility to select arbitrary code equations is the key
203  technique for program and datatype refinement (see
204  \secref{sec:refinement}).
205
206  Due to the preprocessor, there is the distinction of raw code
207  equations (before preprocessing) and code equations (after
208  preprocessing).
209
210  The first can be listed (among other data) using the @{command_def
211  print_codesetup} command.
212
213  The code equations after preprocessing are already are blueprint of
214  the generated program and can be inspected using the @{command
215  code_thms} command:
216\<close>
217
218code_thms %quote dequeue
219
220text \<open>
221  \noindent This prints a table with the code equations for @{const
222  dequeue}, including \emph{all} code equations those equations depend
223  on recursively.  These dependencies themselves can be visualized using
224  the @{command_def code_deps} command.
225\<close>
226
227
228subsection \<open>Equality\<close>
229
230text \<open>
231  Implementation of equality deserves some attention.  Here an example
232  function involving polymorphic equality:
233\<close>
234
235primrec %quote collect_duplicates :: "'a list \<Rightarrow> 'a list \<Rightarrow> 'a list \<Rightarrow> 'a list" where
236  "collect_duplicates xs ys [] = xs"
237| "collect_duplicates xs ys (z#zs) = (if z \<in> set xs
238    then if z \<in> set ys
239      then collect_duplicates xs ys zs
240      else collect_duplicates xs (z#ys) zs
241    else collect_duplicates (z#xs) (z#ys) zs)"
242
243text \<open>
244  \noindent During preprocessing, the membership test is rewritten,
245  resulting in @{const List.member}, which itself performs an explicit
246  equality check, as can be seen in the corresponding @{text SML} code:
247\<close>
248
249text %quotetypewriter \<open>
250  @{code_stmts collect_duplicates (SML)}
251\<close>
252
253text \<open>
254  \noindent Obviously, polymorphic equality is implemented the Haskell
255  way using a type class.  How is this achieved?  HOL introduces an
256  explicit class @{class equal} with a corresponding operation @{const
257  HOL.equal} such that @{thm equal [no_vars]}.  The preprocessing
258  framework does the rest by propagating the @{class equal} constraints
259  through all dependent code equations.  For datatypes, instances of
260  @{class equal} are implicitly derived when possible.  For other types,
261  you may instantiate @{text equal} manually like any other type class.
262\<close>
263
264
265subsection \<open>Explicit partiality \label{sec:partiality}\<close>
266
267text \<open>
268  Partiality usually enters the game by partial patterns, as
269  in the following example, again for amortised queues:
270\<close>
271
272definition %quote strict_dequeue :: "'a queue \<Rightarrow> 'a \<times> 'a queue" where
273  "strict_dequeue q = (case dequeue q
274    of (Some x, q') \<Rightarrow> (x, q'))"
275
276lemma %quote strict_dequeue_AQueue [code]:
277  "strict_dequeue (AQueue xs (y # ys)) = (y, AQueue xs ys)"
278  "strict_dequeue (AQueue xs []) =
279    (case rev xs of y # ys \<Rightarrow> (y, AQueue [] ys))"
280  by (simp_all add: strict_dequeue_def) (cases xs, simp_all split: list.split)
281
282text \<open>
283  \noindent In the corresponding code, there is no equation
284  for the pattern @{term "AQueue [] []"}:
285\<close>
286
287text %quotetypewriter \<open>
288  @{code_stmts strict_dequeue (consts) strict_dequeue (Haskell)}
289\<close>
290
291text \<open>
292  \noindent In some cases it is desirable to state this
293  pseudo-\qt{partiality} more explicitly, e.g.~as follows:
294\<close>
295
296axiomatization %quote empty_queue :: 'a
297
298definition %quote strict_dequeue' :: "'a queue \<Rightarrow> 'a \<times> 'a queue" where
299  "strict_dequeue' q = (case dequeue q of (Some x, q') \<Rightarrow> (x, q')
300    | _ \<Rightarrow> empty_queue)"
301
302lemma %quote strict_dequeue'_AQueue [code]:
303  "strict_dequeue' (AQueue xs []) = (if xs = [] then empty_queue
304     else strict_dequeue' (AQueue [] (rev xs)))"
305  "strict_dequeue' (AQueue xs (y # ys)) =
306     (y, AQueue xs ys)"
307  by (simp_all add: strict_dequeue'_def split: list.splits)
308
309text \<open>
310  Observe that on the right hand side of the definition of @{const
311  "strict_dequeue'"}, the unspecified constant @{const empty_queue} occurs.
312  An attempt to generate code for @{const strict_dequeue'} would
313  make the code generator complain that @{const empty_queue} has
314  no associated code equations.  In most situations unimplemented
315  constants indeed indicated a broken program; however such
316  constants can also be thought of as function definitions which always fail,
317  since there is never a successful pattern match on the left hand
318  side.  In order to categorise a constant into that category
319  explicitly, use the @{attribute code} attribute with
320  @{text abort}:
321\<close>
322
323declare %quote [[code abort: empty_queue]]
324
325text \<open>
326  \noindent Then the code generator will just insert an error or
327  exception at the appropriate position:
328\<close>
329
330text %quotetypewriter \<open>
331  @{code_stmts strict_dequeue' (consts) empty_queue strict_dequeue' (Haskell)}
332\<close>
333
334text \<open>
335  \noindent This feature however is rarely needed in practice.  Note
336  that the HOL default setup already includes
337\<close>
338
339declare %quote [[code abort: undefined]]
340
341text \<open>
342  \noindent -- hence @{const undefined} can always be used in such
343  situations.
344\<close>
345
346
347subsection \<open>If something goes utterly wrong \label{sec:utterly_wrong}\<close>
348
349text \<open>
350  Under certain circumstances, the code generator fails to produce
351  code entirely.  To debug these, the following hints may prove
352  helpful:
353
354  \begin{description}
355
356    \ditem{\emph{Check with a different target language}.}  Sometimes
357      the situation gets more clear if you switch to another target
358      language; the code generated there might give some hints what
359      prevents the code generator to produce code for the desired
360      language.
361
362    \ditem{\emph{Inspect code equations}.}  Code equations are the central
363      carrier of code generation.  Most problems occurring while generating
364      code can be traced to single equations which are printed as part of
365      the error message.  A closer inspection of those may offer the key
366      for solving issues (cf.~\secref{sec:equations}).
367
368    \ditem{\emph{Inspect preprocessor setup}.}  The preprocessor might
369      transform code equations unexpectedly; to understand an
370      inspection of its setup is necessary (cf.~\secref{sec:preproc}).
371
372    \ditem{\emph{Generate exceptions}.}  If the code generator
373      complains about missing code equations, in can be helpful to
374      implement the offending constants as exceptions
375      (cf.~\secref{sec:partiality}); this allows at least for a formal
376      generation of code, whose inspection may then give clues what is
377      wrong.
378
379    \ditem{\emph{Remove offending code equations}.}  If code
380      generation is prevented by just a single equation, this can be
381      removed (cf.~\secref{sec:equations}) to allow formal code
382      generation, whose result in turn can be used to trace the
383      problem.  The most prominent case here are mismatches in type
384      class signatures (\qt{wellsortedness error}).
385
386  \end{description}
387\<close>
388
389end
390