1%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2% Copyright (c) 2013, ETH Zurich.
3% All rights reserved.
4%
5% This file is distributed under the terms in the attached LICENSE file.
6% If you do not find this file, copies can be found by writing to:
7% ETH Zurich D-INFK, Universitaetstrasse 6, CH-8092 Zurich. Attn: Systems Group.
8%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
9
10\documentclass[a4paper,twoside]{report}
11
12\usepackage{bftn}
13\usepackage{booktabs}
14\usepackage{hyperref}
15\usepackage{hyphenat}
16\usepackage{listings}
17\usepackage{makeidx}
18\usepackage{natbib}
19\usepackage{xspace}
20
21\def\chapterautorefname{Chapter}
22\def\sectionautorefname{Section}
23\def\subsectionautorefname{Section}
24\def\subsubsectionautorefname{Section}
25\def\tableautorefname{Table}
26\def\qemu{QEMU\xspace}
27
28\lstdefinelanguage{armasm}{
29    numbers=left,
30    numberstyle=\tiny,
31    numbersep=5pt,
32    basicstyle=\ttfamily\small
33}
34\lstset{language=armasm}
35
36\title{Barrelfish on ARMv8}
37\author{David Cock}
38
39\tnnumber{022}  % give the number of the tech report
40\tnkey{ARMv8} % Short title, will appear in footer
41
42\begin{document}
43\maketitle
44
45\begin{versionhistory}
46\vhEntry{1.0}{11.04.2016}{DC}{Initial version}
47\end{versionhistory}
48
49\chapter{Summary}
50
51Barrelfish now supports ARMv7 and ARMv8 as primary platforms, and we have
52discontinued support for all older architecture revisions (ARMv5, ARMv6). The
53current Barrelfish release contains a port to a simulated ARMv8 environment,
54derived from the existing ARMv7 codebase and running under GEM5, with generous
55contributions from HP Research.
56
57Simultaneously, we are undertaking a clean-slate redesign of the CPU driver
58for ARMv8, as it presents a number of novel features, and greatly improved
59platform standardisation (\autoref{s:sbsa}), that should allow for a much
cleaner and simpler implementation. This redesigned CPU driver will form the
basis for ongoing research into large-scale non-cache-coherent systems using
62ARMv8 cores. This document presents the new CPU driver design
63(\autoref{c:design}), briefly covering those features of ARMv8 of greatest
64relevance (\autoref{c:background}), and discusses a number of technical
65challenges presented by the new architecture (\autoref{c:tech}).
66
67\chapter{Background}\label{c:background}
68
69The Barrelfish research operating system is a vehicle for research into
70software support for likely future architectures, where large numbers of
71non-coherent (or weakly-coherent) heterogeneous processor cores are assembled
72into a single large-scale system. As such, support for a common non-x86
73architecture has always been part of the project, beginning with the ARMv5
74(XScale) port, which permitted the embedded processor on a network interface
75card to be integrated as a first-class part of the system, with its own CPU
76driver. We have also actively maintained an ARMv7 port, to the OMAP4460
77processor on the Pandaboard ES, which we use as a teaching platform in the
78Advanced Operating Systems course at ETH Z\"urich. These ports are described
79more fully in the accompanying technical report, \citet{btn017-arm}.
80
81\section{The ARMv8 Architecture}
82
83The ARMv8 architecture is quite a radical departure from previous versions,
84and represents the culmination of a trend that has been developing for quite
85some time. While the first wave of ARM-based microservers, based on the 32-bit
86ARMv7 architecture, was largely a commercial failure, it's clear that ARM is
87now actively targeting the server market, where Intel currently has near-total
88dominance.
89
90ARMv8 discards some long-standing features of the ARM instruction set:
91universal conditional execution, multiple loads/stores, and the program
92counter as a general-purpose register. These most likely caused difficulty in
93scaling the processor pipeline to high clock rates, and we present some
94consequences of their loss in \autoref{s:threads} and \autoref{s:traps}. The
95instruction-set changes, however challenging to the systems programmer, are
96ultimately of little consequence compared to the consolidation of the ARM
97ecosystem into a serious server platform. The two features of most interest at
98this stage in the design process are the standardisation of hardware features
99and memory maps, and of the boot process.
100
101\subsection{ARM Server Base System Architecture}\label{s:sbsa}
102
103ARM has long been criticised by systems programmers for its highly fragmented,
104and non-uniform programming interface. Linux, in particular, has struggled for
105years with supporting the great multiplicity of ARM platforms.  The principal
106reason for this is the lack of any concept of a \emph{platform}: a set of
107assumptions (available hardware, memory map, etc.), that programmers can rely
108on when initialising and managing a system. The Linux source tree famously
109contained a vastly greater amount of code in the ARM platform support
110subtrees, than that for x86.
111
112The relative standardisation of the x86 platform is largely a historical
113accident, due to the rapid proliferation of PC/AT clones in the early 1980s.
114The x86 platform thus contains layers of ossified legacy interfaces, necessary
115to ensure broad cross and backward compatibility.  ARM's business model, on
116the contrary, has long emphasised the specialisation of implementations: an
117ARM licensee would take their ARM-designed CPU core, and integrate it
themselves into a complex SoC (system on a chip), with their own specialised,
119proprietary interfaces. The upsides of this were the possibility to highly
120optimise a particular design, and no requirement on ARM itself to maintain a
121coherent platform.
122
123While ARM's customisable platform worked well for embedded devices, and scaled
124reasonably well to relatively powerful smartphones, it's a disaster for
125producing high-quality systems software, able to execute on a broad range of
126hardware from competing vendors: exactly what a competitive server platform
127requires. ARM clearly know this, and since 2014 have published the Server Base
128System Architecture \citep{arm:sbsa}. To the extent that manufacturers adhere
129to these guidelines, our job as systems programmers is significantly simpler:
130it should be possible to write a single set of initialisation and
131configuration code for ARMv8, that will run on any SBSA-compliant system, much
132as we already do for x86-64.
133
134\begin{table}
135\begin{center}
136\begin{tabular}{lllp{6cm}}
137\toprule
138Supplier & Processor & Name & \\
139\midrule
140APM & APM883208 & Mustang  & 1P 8-core X-Gene 1 with serial trace. \\
141\addlinespace[2pt]
142    & APM883408-X2 & X-C2  & 1P 8-core X-Gene 2. \\
143\addlinespace[2pt]
144Cavium & CN8890 & StratusX & 1P 48-core ThunderX. \\
145\addlinespace[2pt]
146       &        & Cirrus   & 2P 48-core ThunderX. \\
147\addlinespace[2pt]
148ARM & AEM & Fixed Virtual Platform & The \emph{architectural envelope model}
149covers the range of behaviour permitted by ARMv8. Bare-metal debug. \\
150\addlinespace[2pt]
151    &     & Foundation Platform & Freely available, compatible with FVP. \\
152\bottomrule
153\end{tabular}
154\end{center}
155\caption{ARMv8 platforms of interest}\label{t:platforms}
156\end{table}
157
158Our target platforms listed in \autoref{t:platforms} all support SBSA to some
159extent, and absent any compelling reason, we will only support SBSA-compliant
160platforms.
161
162\subsection{UEFI}\label{s:uefi}
163
164One aspect of the SBSA which eases portability is the specification, for the
165first time, of a boot process for ARM systems. ARM has specified that SBSA
166systems must support UEFI \citep{uefi} (the unified extensible firmware
167interface). UEFI is a descendant of the EFI specification, developed by Intel
168for the Itanium project. While Itanium is no longer a platform of any great
169commercial interest, UEFI support is now widespread in the x86-64 market.
170UEFI, in turn, specifies the use of ACPI \citep{acpi} (the advanced
171configuration and power interface) for platform discovery and control.
172
173Supporting ACPI and UEFI requires a one-off investment of effort to design a
174new boot and configuration subsystem, but should pay off in the long term, as
175ports to new ARM boards will no longer require extensive manual configuration.
176The code should also be largely reusable for x86 UEFI systems. Our new UEFI
177bootloader is described in \autoref{s:hagfish}.
178
179\section{A Direct Port from ARMv7}
180
181As already described, the Barrelfish release current at time of writing
182includes an initial ARMv8 port to the GEM5 simulator. This port contains code
183generously contributed by HP Research.
184
185Being developed from the existing codebase, this ARMv8 port follows the
186structure of the existing ARMv7 code closely. While it is highly useful to
187have a running port, we are nevertheless continuing with a significant
188redesign of the CPU driver, as significant improvements and simplifications
189will be possible, once we no longer need to follow the existing structure,
190originally developed for a significantly different platform.
191
192The GEM5 simulator's model of an ARMv8 platform is relatively primitive, and
193does not conform to modern platform conventions, for example placing RAM at
194address \texttt{0}, rather than \texttt{0x80000000} as mandated by the SBSA.
195For this reason, in addition to better integration with ARM debugging tools,
196we have switched to the ARM Fixed Virtual Platform as our default simulation
197environment, with the Foundation Platform supported as a freely available
198simulator.
199
200\section{Registers}
201
202\subsection{General-purpose Registers}
203
In total there are 31 general-purpose registers (\texttt{r0-r30}), each 64
bits wide (\autoref{tab:registers}), plus a stack pointer. The 64-bit registers
are usually referred to by the names \texttt{x0-x30}, and their lower 32 bits
as \texttt{w0-w30}. The additional stack pointer register, \texttt{SP}, can be
accessed only by a restricted set of instructions.
209
210\begin{table}[!h]
211    \begin{center}
212
213    \begin{tabular}{lll}
214        \textbf{Register} & \textbf{Special} & \textbf{Description} \\
215        \hline
216        \texttt{X0-X7}   & Caller-save & function call arguments and return value 
217        \\
218        \texttt{X8}      & & indirect result e.g. location of large return value 
219                             (struct) \\
220        \texttt{X9-X15}  & Caller-save& temporary registers  \\
221        \texttt{X16}     & IP0 & The first intra-procedure-call scratch 
222        register\footnote{can be used by call veneers and PLT code; at other 
223        times may be used as a temporary register. Same for X17}\\
224        \texttt{X17}     & IP1 & The second intra-procedure-call temporary 
225        register  \\
226        \texttt{X18}     & & The Platform Register (TLS), if needed; otherwise a 
227        temporary register. \\
        \texttt{X19-X28} & Callee-save & need to be preserved and restored when
                           modified\\
230        \texttt{X29}     & FP & frame pointer \\
231        \texttt{X30}     & LR & link register \\
        \texttt{SP}      & & stack pointer (encoded as register 31, like \texttt{XZR}) \\
233        
234    \end{tabular}
235    \caption{ARMv8 General purpose Registers}
236    \label{tab:registers}
237        \end{center}
238\end{table}
239
240\paragraph{Procedure call}
241\begin{itemize}
    \item The registers \texttt{x19-x28} and \texttt{SP} are callee-saved and
          hence must be preserved by the called subroutine. All 64 bits have to
          be preserved even when executing in the 32-bit mode.
    \item The registers \texttt{x0-x7} and \texttt{x9-x15} are caller-saved.
    \item During procedure calls the registers \texttt{x16}, \texttt{x17},
          \texttt{x29} and \texttt{x30} have special roles, i.e.\ they store
          relevant addresses such as the return address.
    \item Arguments for calls are passed in the registers \texttt{x0-x7},
          \texttt{v0-v7} for floats/SIMD, and on the stack.
251\end{itemize}
252
\paragraph{Indirect result} Register \texttt{x8} is used when returning a large
value, such as that of a function declared as
\texttt{struct mystruct foo(int arg);}. It holds the address at which the
caller expects the result to be stored.
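
As a minimal illustration in C, consider the function below. The structure
layout and function body are hypothetical, chosen only so that the type is
larger than 16 bytes and is not a floating-point aggregate. Under the AArch64
procedure call standard such a value is returned through memory: the caller
reserves space for the result and passes its address in \texttt{x8}, and the
callee writes the structure there.

\begin{lstlisting}[language=C]
#include <stdint.h>

/* Hypothetical 32-byte structure: too large to be returned in x0/x1. */
struct mystruct {
    uint64_t a, b, c, d;
};

/* On AArch64 the caller passes the address of the result buffer in x8.
 * At the C level this is an ordinary return by value. */
struct mystruct foo(int arg)
{
    struct mystruct r = { (uint64_t)arg, 2, 3, 4 };
    return r;
}
\end{lstlisting}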
255
256
257\paragraph{Platform specific} The use of register \texttt{x18} is platform 
258specific and needs to be defined by the platform ABI. This register can hold
259inter-procedural state such as the thread context. 
260
\paragraph{Linker} The registers \texttt{IP0} and \texttt{IP1} can be used by the
linker as scratch registers, or to hold intermediate values between subroutine
calls.
264
265\subsection{SIMD and Floating point}
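There are 32 registers for use by floating-point and SIMD operations. The name
used for a register depends on the size of the access: for example,
\texttt{s0}, \texttt{d0} and \texttt{q0} refer to the low 32 bits, the low 64
bits, and all 128 bits of \texttt{v0} respectively.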
268
269\begin{table}[!h]
270    \begin{center}
271        
272        \begin{tabular}{ll}
273            \textbf{Register} & \textbf{Description} \\
274            \hline
275            \texttt{v0-v7}   & function call arguments, intermediate values and 
276                               return value, caller save registers \\
277            \texttt{v8-v15}  & Callee-save registers. They need to be 
278            preserved\\
279            \texttt{v16-v31} & Caller-save registers
280        \end{tabular}
        \caption{ARMv8 SIMD and floating-point registers}
        \label{tab:simdregisters}
283    \end{center}
284\end{table}
285
286\chapter{Design and Implementation}\label{c:design}
287
288
289
290
291\section{Redesigning the CPU Driver}
292
293Given that ARMv8 is a significantly different platform to ARMv7, and that the
294ARMv7 codebase carries a significant legacy, reaching right back to ARMv5, we
are pursuing a substantial redesign of the CPU driver. Taking advantage of the
296standardisation of the hardware platform mandated by the SBSA
297(\autoref{s:sbsa}), and the facilities provided by UEFI (\autoref{s:uefi}), in
298addition to a relatively unrestricted virtual address space, we are able to
299significantly reduce the complexity of the CPU driver.  In this section we
300describe the updated design, and our progress on its implementation, while the
301UEFI interface (Hagfish) is described separately, in \autoref{s:hagfish}.
302
303\paragraph{Terminology}
304In the interest of clarity, in the discussion that follows, we use a few terms
305with precise intent:
306\begin{description}
307\item[shall]
308    indicates features or characteristics of the design to which the
309    Barrelfish implementation must conform.
310\item[should]
311    indicates features which should be supported if at all possible.
312\item[initially]
313    indicates features which will be provided from the outset in the
314    Barrelfish implementation.
315\item[eventually]
316    indicates features which will be provided later in the Barrelfish
317    implementation, and which the initial design will aim to facilitate.
318\end{description}
319
320\subsection{Goals}
321
322Our goal is to provide a reference design for the CPU driver and user-space
323execution environment for Barrelfish on an ARMv8 core, in order to understand
324both positive and negative implications of the architecture for a multikernel
system.  The design \textbf{should} be applicable to any ARMv8 core with
326virtualisation (\texttt{EL2}) support.
327
328\textbf{Initially}, our hardware development platform is the APM X-Gene 1,
329using the Mustang Development Board. We are using the Mustang principally as
330it was relatively easily available, as well as being a comparatively complex
331and powerful CPU. The ThunderX platform from Cavium is very interesting for
Barrelfish, as it ties together a large number (48) of less-powerful (2-issue) cores.
333We do not have the resources to develop for two platforms simultaneously, but
334we hope to \textbf{eventually} add support for the ThunderX.
335
336Our target simulation environment is the ARM Fixed Virtual Platform, and the
337Foundation Platform. These models are supplied by ARM. The Foundation Platform
338is freely available, and will be the default supported simulation platform for
339the public Barrelfish tree, while we will use the FVP internally to allow
340bare-metal debugging. Future support for \qemu is desirable, to the extent that
341it models a compatible system --- GEM5, which the ARMv7 port targets,
342currently does not.
343
344\textbf{Initially}, the design will support running both the CPU driver and
345user-space processes in AArch64 mode without support for virtualisation.
346\textbf{Eventually} the design will support running the CPU driver in AArch64
347mode, and user-space processes in both AArch64 and AArch32 modes without
348virtualisation, and virtual machines in AArch64 mode. We will only support
349virtualisation on ARMv8.1 or later platforms, that support the VHE extensions,
350as described in \autoref{s:layout}.
351
352\subsection{Processor Modes and Virtualisation}
353
354Where possible, we will keep the virtualisation model similar to that on
355Barrelfish/x86. In particular, it \textbf{should} be possible to implement
356native applications, fully virtualised (e.g. Linux) VMs, and VM-level
357applications e.g. Arrakis \citep{peter:osdi14}.
358
359ARMv8 has a somewhat different virtualisation model to x86, and different
360again from the ARMv7 virtualisation extensions. Rather than having exception
361levels (rings) duplicated between guest and host, ARMv8 provides 4 exception
362levels (ELs):
363
364\begin{itemize}
365\item \texttt{EL0} is unprivileged --- user applications.
366\item \texttt{EL1} is privileged --- OS kernel.
367\item \texttt{EL2} is hypervisor state.
368\item \texttt{EL3} is for switching between secure and non-secure (TrustZone)
369                   modes. The X-Gene 1 does not implement \texttt{EL3}, and it
370                   is currently not of interest for Barrelfish.
371\end{itemize}
372
373Explicit traps (syscalls/hypercalls) target only the next level up:
374\texttt{EL0} can call \texttt{EL1} using \texttt{svc} (syscall), and
375\texttt{EL1} can call \texttt{EL2} using \texttt{hvc} (hypercall), but
376\texttt{EL0} cannot directly call \texttt{EL2}, unless \texttt{EL1} is
377completely disabled.  Exceptions return to the caller's exception level.
378
379ELs \textbf{shall} be distributed as follows: The CPU driver \textbf{shall}
380exist at both \texttt{EL1} and \texttt{EL2}, and take both syscalls
381(\texttt{svc}, from \texttt{EL0} applications) and hypercalls (\texttt{hvc},
382from \texttt{EL1} applications). The system \textbf{shall} support
383applications both at \texttt{EL0}, and at \texttt{EL1} (e.g.  Arrakis, VMs).
384Most code paths \textbf{should} be identical, as most CPU driver operations do
385not depend on \texttt{EL2} privileges.  Hypercalls from \texttt{EL0}
386\textbf{shall} be chained via \texttt{EL1} (with appropriate permission
387checks).
388
389\texttt{EL1} apps such as Arrakis, and paravirtualised VMs using hypercalls
390know that they are being virtualised, and will use \texttt{hvc} explicitly.
391Fully-virtualised \texttt{EL1} VMs do not make hypercalls.
392
393ARMv8 implements two-level address translation: VA (virtual address) to IPA
394(intermediate physical address), and IPA to PA (physical address).
395\texttt{EL1} guests \textbf{shall} be isolated at the L1 translation layer,
396and by trapping all accesses to system control registers.
397
398\subsection{Virtual Address Space Layout}\label{s:layout}
399
ARMv8 has an effective 48-bit virtual address space. At the lowest exception
levels (\texttt{EL0}, Barrelfish user space, and \texttt{EL1}, the Barrelfish
CPU driver), the hardware supports two `windows' of up to 48 bits (256TB) each
in a 64-bit space: one at the bottom, and one at the top.  Each region has its
own translation table base register (\texttt{TTBR0} \& \texttt{TTBR1}).
\texttt{TTBR0} is used at \texttt{EL0}, and \texttt{TTBR1} at \texttt{EL1}.
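
The hardware chooses between the two translation table base registers by
inspecting the upper bits of the virtual address. As an illustration, the
following C sketch classifies an address by the window it falls into. It
assumes that both windows are configured to their full 48-bit size and that
top-byte-ignore is disabled; the names \texttt{classify\_va} and
\texttt{va\_region} are hypothetical, chosen for this example only.

\begin{lstlisting}[language=C]
#include <stdint.h>

/* Assumption: both windows use the full 48 bits (T0SZ = T1SZ = 16). */
#define VA_BITS 48

enum va_region { VA_TTBR0, VA_TTBR1, VA_INVALID };

/* Classify a 64-bit virtual address by its upper (64 - VA_BITS) bits:
 * all zeroes selects TTBR0, all ones selects TTBR1, and any other
 * pattern lies in the unmapped hole between the two windows. */
static enum va_region classify_va(uint64_t va)
{
    uint64_t top = va >> VA_BITS;
    uint64_t all_ones = (UINT64_C(1) << (64 - VA_BITS)) - 1;

    if (top == 0) {
        return VA_TTBR0;
    }
    if (top == all_ones) {
        return VA_TTBR1;
    }
    return VA_INVALID;
}
\end{lstlisting}

Under these assumptions, \texttt{0x0000ffffffffffff} is the highest address in
the lower window, and \texttt{0xffff000000000000} the lowest address in the
upper window.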
406
407In the initial ARMv8 specification, this split address space was not
408implemented at \texttt{EL2}, which would require a separate CPU driver
409instance for virtualisation, and hypercalls (e.g. for Arrakis). ARMv8.1
410introduced the virtualisation host extensions (VHE) which, among other things,
411extends the split address space to \texttt{EL2}. As this provides a cleaner
412implementation model, and to avoid having to support a now-deprecated
413interface, virtualisation will \textbf{only} be supported on ARMv8.1 and
414later. This means that we will not support virtualisation on the X-Gene 1.
415Both the simulation environment (FVP/FP) and, seemingly, the ThunderX chips,
416support VHE.
417
418The CPU driver \textbf{shall} use \texttt{TTBR1} to provide a complete
419physical window. The ARMv8 CPU driver \textbf{shall not} dynamically map
420device memory into its own window (as the ARMv7 CPU driver does) --- the few
421memory-mapped devices required will be statically mapped on boot, with
422appropriate memory attributes. All physical addresses, RAM and device,
423\textbf{shall} be accessible at a static, standard offset (the base of the
424\texttt{TTBR1} region).
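
With such a window, converting between a physical address and the address at
which the CPU driver can access it is a single addition or subtraction. The
following is a minimal sketch in C, assuming a 48-bit \texttt{TTBR1} region
(and hence a window base of \texttt{0xffff000000000000}); the macro and helper
names are hypothetical.

\begin{lstlisting}[language=C]
#include <stdint.h>

/* Assumption: the TTBR1 window is 48 bits wide, so its base, and hence
 * the offset of the physical window, is 0xffff000000000000. */
#define KERNEL_WINDOW_BASE UINT64_C(0xffff000000000000)

/* Hypothetical helper: physical address to its alias in the CPU
 * driver's static window. */
static inline uint64_t phys_to_kernel(uint64_t pa)
{
    return KERNEL_WINDOW_BASE + pa;
}

/* Hypothetical helper: the inverse translation. */
static inline uint64_t kernel_to_phys(uint64_t kva)
{
    return kva - KERNEL_WINDOW_BASE;
}
\end{lstlisting}

Because the offset is fixed, neither direction requires a table lookup, and
the same helpers apply to RAM and to device addresses.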
425
User-level page tables will \textbf{initially} be limited to a 4kiB translation
427granularity.  \textbf{Eventually} user-level page tables \textbf{should} have
428access to all page-table formats and page sizes, as is the case in the current
429Barrelfish x86 implementation.
430
431\subsection{Address Space, Context, and Thread Identifiers}
432
433ARMv8 also provides address-space identifiers (ASIDs) in the TLB to avoid
434flushing the translation cache on a context switch.
435
436ARMv8 ASIDs (referred to in ARM documentation as context IDs) are
437architecturally allowed to be either 8 or 16 bits, although the SBSA
438specifies that they must be at least 16. Relying on the SBSA platform will
439allow us to avoid multiplexing IDs among active processes, on any
440reasonably-sized system.  Managing the reuse of context IDs can be left to
441user-level code, and does not need to be on the critical path of a context
442switch. The CPU driver need only ensure that every allocated dispatcher has a
443unique ASID, which is loaded into the \texttt{ContextID} register on dispatch.
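
One simple way to guarantee uniqueness is a bitmap with one bit per ASID. The
following C sketch assumes 16-bit ASIDs, as required by the SBSA. The function
names are hypothetical, and the sketch takes no position on whether such an
allocator would live in the CPU driver or at user level.

\begin{lstlisting}[language=C]
#include <stdbool.h>
#include <stdint.h>

#define ASID_BITS  16
#define ASID_COUNT (1u << ASID_BITS)

/* One bit per ASID; a set bit means the ASID is in use. */
static uint64_t asid_bitmap[ASID_COUNT / 64];

/* Allocate the lowest free ASID; returns false if all are in use. */
static bool asid_alloc(uint16_t *asid)
{
    for (uint32_t i = 0; i < ASID_COUNT; i++) {
        uint64_t mask = UINT64_C(1) << (i % 64);
        if (!(asid_bitmap[i / 64] & mask)) {
            asid_bitmap[i / 64] |= mask;
            *asid = (uint16_t)i;
            return true;
        }
    }
    return false;
}

/* Release an ASID. The caller must first invalidate any TLB entries
 * tagged with it, so that a later owner does not see stale mappings. */
static void asid_free(uint16_t asid)
{
    asid_bitmap[asid / 64] &= ~(UINT64_C(1) << (asid % 64));
}
\end{lstlisting}

The linear scan is acceptable here because allocation happens when a
dispatcher is created, not on the context-switch path.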
444
445The value in the \texttt{ContextID} register is also checked against the
446hardware breakpoint and watchpoint registers, in generating debug exceptions.
Therefore, it \textbf{shall} be possible for authorised user-level code to
load the Context ID for a given dispatcher into a breakpoint register; this
\textbf{may} be an invocation on the dispatcher capability.
450
451\begin{table}
452\begin{center}
453\begin{tabular}{ll}
454\texttt{tpidrro\_el0} & EL0 Read-Only Software Thread ID Register \\
455\texttt{tpidr\_el0} & EL0 Read/Write Software Thread ID Register \\
456\texttt{tpidr\_el1} & EL1 Read/Write Software Thread ID Register \\
457\texttt{tpidr\_el2} & EL2 Read/Write Software Thread ID Register \\
458\texttt{tpidr\_el3} & EL3 Read/Write Software Thread ID Register \\
459\end{tabular}
460\end{center}
461\caption{Thread ID registers in ARMv8}
462\label{t:threadid}
463\end{table}
464
465In addition to the \texttt{ContextID} register, used to tag TLB entries, ARMv8
466also provides a set of thread ID registers with no architecturally-defined
467semantics, as listed in \autoref{t:threadid}. The client-writeable
468\texttt{tpidr\_el0} and \texttt{tpidr\_el1} \textbf{shall} have no CPU
469driver-defined purpose, but \textbf{shall} be saved and restored in a
470dispatcher's trap frame, to allow their use as thread-local storage (TLS).
471Recall that the Barrelfish CPU driver has no awareness of threads, which are
472implemented purely at user level.
473
474To implement the upcall/dispatch mechanism of Barrelfish, the CPU driver and
475the user-level dispatcher need to share a certain amount of state --- the
476user-visible portion of the dispatcher control block, which contains the trap
477frames, and the disabled flag (used to achieve atomic dispatch). The address
478of this structure needs to be known to both the CPU driver, and to user-level
479code, and moreover be efficiently-accessible, as the CPU driver needs to find
480the trap frame on the critical path of system calls and exceptions. This
481pointer also needs to be trustworthy, from the CPU driver's perspective, and
482thus cannot be directly modifiable by user-level code.
483
484The x86-32, x86-64, and ARMv7 CPU drivers all store the address of the running
485dispatcher's shared segment at a fixed known address, \texttt{dcb\_current},
486which is loaded by the trap handler. At user level, on x86 this address is
487held in a \emph{segment register} (\texttt{fs} on x86-64, and \texttt{gs} on
488x86-32), while on ARMv7 we sacrifice a general-purpose register (\texttt{r9})
489for this purpose. Using the \texttt{tpidrro\_el0} register to hold the address
490of the current dispatcher structure will allow us to avoid both a memory load
491on the fast path, and sacrificing a register in user-level code, thus
492\texttt{tpidrro\_el0} \textbf{shall} hold the address of the currently-running
493dispatcher.
494
495\subsection{Instruction Sets}
496
ARMv8 supports both AArch64, and legacy ARM/Thumb (renamed AArch32). Switching
execution mode is only possible when switching exception level, i.e.\ on a trap
or return, and the mode can only be changed from the higher exception level. Thus,
500\texttt{EL2} can set execution mode for \texttt{EL1}, and \texttt{EL1} for
501\texttt{EL0}. There is no way for a program to change its own execution mode.
502If \texttt{ELn} is in AArch64, then \texttt{EL(n-1)} can be in either AArch64
503or AArch32. If \texttt{ELn} is in AArch32, all lower ELs must also be AArch32.
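
The constraint between adjacent exception levels can be expressed as a simple
check. The following C sketch is purely illustrative, and its type and
function names are hypothetical: it accepts an assignment of execution modes
to exception levels only if no AArch64 level sits below an AArch32 level.

\begin{lstlisting}[language=C]
#include <stdbool.h>

enum exec_state { AARCH64, AARCH32 };

/* state[n] is the execution mode of ELn, for n = 0..top_el.
 * If ELn is AArch32, then EL(n-1) must be AArch32 as well. */
static bool exec_states_valid(const enum exec_state state[], int top_el)
{
    for (int n = top_el; n > 0; n--) {
        if (state[n] == AARCH32 && state[n - 1] != AARCH32) {
            return false;
        }
    }
    return true;
}
\end{lstlisting}

For example, an AArch64 \texttt{EL2} and \texttt{EL1} with an AArch32
\texttt{EL0} passes the check, while an AArch32 \texttt{EL1} with an AArch64
\texttt{EL0} does not.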
504
505The CPU driver \textbf{shall} execute in AArch64.
506
507\textbf{Initially}, the CPU driver will enforce that all directly-scheduled
508threads also use AArch64, by controlling all downward EL transitions. An
509\texttt{EL1} client (such as Arrakis or a full virtual machine) may execute
510its own \texttt{EL0} clients in AArch32 (and there is no way to prevent this).
511However, all transitions into the CPU driver (\texttt{svc}, \texttt{hvc} or
512exception) must come from a direct client of the CPU driver, and thus from
513AArch64. The syscall ABI \textbf{shall} be AArch64.
514
515\textbf{Eventually}, Barrelfish \textbf{should} also support the execution of
516AArch32 dispatcher processes, by marking each dispatcher with a flag
517indicating the instruction set to be used (much as is already done with
518VM/non-VM mode in the Arrakis CPU driver).
519
520\subsection{User-Space Access to Architectural Functions}
521
522Generally, anything that can be safely exported, \textbf{should} be made
available outside of the CPU driver, preferably as a memory-mapped interface,
524at 4kiB granularity. The SBSA mandates that devices be present at addresses
525that can be individually mapped, thus this should not be a problem.
526
527\subsection{Cache Management}
528
529ARMv8 has moved most cache and TLB management from the system control
530coprocessor (cp15), into the core ISA. Several cache operations
531(invalidate/clean by VA) are executable at \texttt{EL0}, and thus no kernel
532interface is required. The system must take into account that user-directed
533flushes may have occurred, or may occur concurrently with any memory
534operation.
535
536\subsection{Performance Monitors}
537
538Performance monitors \textbf{should} be exposed, if it can be done safely.
539
540\subsection{Debugging}
541
542Self-hosted debug \textbf{should} be exposed, if it can be done safely. This
543is under active development.
544
545\subsection{Booting}
546
547Platform support i.e.~a standard set of peripherals, and a defined boot
548process, has improved dramatically on ARM, as it has been repositioned as a
549server platform. UEFI and ACPI support are widespread, including on the
Mustang development board. We will assume support for UEFI booting, and make
use of ACPI data where available.
552
553The Barrelfish CPU driver and initial image \textbf{shall} be loaded and
554executed by a UEFI shim, which will pass through all UEFI-supplied
555information, such as ACPI tables, and be able to interpret a Barrelfish
556Multiboot image.  This shim, or second-stage bootloader, is called Hagfish,
557and is described in \autoref{s:hagfish}.
558
559\subsection{Interrupts}
560
561ARMv8 interrupt handling is not substantially different from the existing
562architectures and platforms supported by Barrelfish. While a redesign of the
563Barrelfish interrupt system is under way (to use capabilities to grant access
564to receive interrupts), we do not anticipate ARMv8 to impose any particular
565challenges.
566
567The ARMv8 systems we \textbf{initially} target all use minor variations on the
568ARM Generic Interrupt Controller (GIC) design, already supported in
569Barrelfish. We currently have support for version 2 of the GIC, with which
570later implementations are backward-compatible. We will \textbf{eventually}
571support GICv3, the current specification at time of writing.
572
573\subsection{Inter-Domain Communication}
574
575User-level communication between cache-coherent cores in Barrelfish for ARMv8
is likely to be the same as with ARMv7 and x86, and we expect the existing
577User-level Message-Passing over Cache-Coherence (UMP-CC) interconnect driver
578to work unmodified.
579
580Between dispatchers on the same core, however, the different register set on
581the ARMv8 is likely to result in a very different Local Message Passing (LMP)
582interconnect driver---this is always an architecture-specific part of the CPU
583driver. In practice, its design will be closely tied to the context switch and
584upcall dispatch code.
585
586\chapter{Booting}\label{c:booting}
587
588Booting ARM systems has always been difficult to do in a standard way,
589and ARMv8 systems are no exception.  Barrelfish uses one of two
590methods of booting an initial ARMv8 core, depending on whether the
hardware platform supports UEFI~\citep{uefi} or U-Boot.  If a platform
592supports neither, more work will be required to boot the board.
593
If a board has a full UEFI implementation (such as TianoCore), you can use
Hagfish (\autoref{s:hagfish}) to individually load the modules needed to
boot Barrelfish and set up the initial CPU/MMU environment before
entering the CPU driver proper.
598
599Note that U-Boot also claims to support UEFI.  However, in practice it
600supports a small subset of UEFI functionality sufficient to boot
601\texttt{grub} or the Linux kernel as an EFI binary.   If your board
boots via U-Boot, you should use the minimal EFI
bootloader (\autoref{s:uboot}), which loads a single multiboot image into
memory and sets up an environment similar to that provided by Hagfish.
605
606\section{Hagfish}\label{s:hagfish}
607
608The Barrelfish/ARMv8 UEFI loader prototype is called Hagfish\footnote{A
609hagfish is a basal chordate i.e. something like the ancestor of all fishes.}.
610Hagfish is a second-stage bootloader for Barrelfish on UEFI platforms,
611initially the ARMv8 server platform.  Hagfish is loaded as a UEFI application,
612and uses the large set of supplied services to do as much of the one-time
613(boot core) setup that the CPU driver needs as is reasonably possible. More
614specifically, Hagfish:
615
616\begin{itemize}
617\item Is loaded over BOOTP/PXE.
618\item Reuses the PXE environment to load a menu.lst-style configuration.
619\item Loads the kernel image and the initial applications, as directed, and
620builds a Multiboot image.
621\item Allocates and builds the CPU driver's page tables.
622\item Activates the initial page table, and allocates a stack.
623\end{itemize}
624
625\subsection{Why Another Bootloader?}
626
627The ARMv8 machines that we're porting to are different to both existing ARM
628boards, and to x86. They have a full pre-boot environment, unlike most
629embedded boards, but it's not a PC-style BIOS. The ARM Server Base Boot
630Requirements specify UEFI. Moreover, there is no mainline support from GNU
631GRUB for the ARMv8 architecture, so no matter what, we need some amount of
632fresh code.
633
634Given that we had to write at least a shim loader, and keeping in mind that
635UEFI is multi-platform (and becoming more and more common in the x86 world),
636we're taking the opportunity to simplify the initial boot process within the
637CPU driver by moving the once-only initialisation into the bootloader. In
638particular, while running under UEFI boot services, we have memory allocation
639available for free, e.g. for the initial page tables. By moving ELF loading
640and relocation code into the bootloader, we can eliminate the need to relocate
641running code, and can cut down (hopefully eliminate) special-case code for
642booting the initial core. Subsequent cores can rely on user-level Coreboot
643code to relocate them, and to construct their page tables.
644
645\subsection{Assumptions and Requirements}
646
647Hagfish is (initially at least) intended to support development work on
648AArch64 server-style hardware and, as such, makes the following assumptions:
649
650\begin{itemize}
651\item 64-bit architecture, using ELF binaries. Porting to 32-bit architectures
652wouldn't be hard, if it were ever necessary (probably not).
653\item PXE/BOOTP/TFTP available for booting. Hagfish expects to load its
654configuration, and any binaries needed, using the same PXE context with which
655it was booted. Changing this to boot from a local device (e.g. HDD) wouldn't
656be hard, as the UEFI \texttt{LoadFile} interface abstracts from the hardware.
657\end{itemize}
658
659\subsection{Boot Process}
660
661In detail, Hagfish currently boots as follows:
662
663\begin{enumerate}
664\item \texttt{Hagfish.efi} is loaded over PXE by UEFI, and is executed at a
665runtime-allocated address, with translation (MMU) and caching enabled.
666\item Hagfish queries EFI for the PXE protocol instance used to load it, and
667squirrels away the current network configuration.
668\item Hagfish loads the file \texttt{hagfish.A.B.C.D.cfg} from the TFTP server
669root (where \texttt{A.B.C.D} is the IP address on the interface that ran PXE).
670\item Hagfish parses its configuration, which is essentially a GRUB menu.lst,
671and loads the kernel image and any additional modules specified therein. All
672ELF images are loaded into page-aligned regions of type
673\texttt{EfiBarrelfishELFData}.
674\item Hagfish queries UEFI for the system memory map, then allocates and
initialises the initial page tables for the CPU driver (mapping all occupied
676physical addresses, within the \texttt{TTBR1} window, see \autoref{s:layout}).
677The frames holding these tables are marked with the EFI memory type\\
\texttt{EfiBarrelfishBootPageTable}, allocated from the OS-specific range
679(\texttt{0x80000000}--\texttt{0x8fffffff}). All memory allocated by Hagfish on
680behalf of the CPU driver is page-aligned, and tagged with an OS-specific type,
681to allow EFI and Hagfish regions to be safely reclaimed.
682\item Hagfish builds a Multiboot 2 information structure, containing as much
683information as it can get from EFI, including:
684    \begin{itemize}
685    \item ACPI 1.0 and 2.0 tables.
686    \item The EFI memory map (including Hagfish's custom-tagged regions).
687    \item Network configuration (the saved DHCP ack packet).
688    \item The kernel command line.
689    \item All loaded modules.
690    \item The kernel's ELF section headers.
691    \end{itemize}
692\item Hagfish allocates a page-aligned kernel stack (type
693\texttt{EfiBarrelfishCPUDriverStack}), of the size specified in the
694configuration.
695\item Hagfish terminates EFI boot services (calls \texttt{ExitBootServices}),
696activates the CPU driver page table, switches to the kernel stack, and jumps
697into the relocated CPU driver image.
698\end{enumerate}
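
The framing of the Multiboot 2 information structure built above can be
illustrated with a short model. The following Python sketch is not Hagfish
code. It assumes only the generic Multiboot 2 framing: an 8-byte header
holding the total size, followed by tags that each start with a 32-bit type
and a 32-bit size and are padded to 8-byte alignment, ending with a tag of
type 0. The helper names \texttt{pack\_tag}, \texttt{build\_info} and
\texttt{walk\_tags} are hypothetical, and only a command-line tag (type 1) is
packed; the tags that Hagfish actually emits may use Barrelfish-specific
types that are not modelled here.

\begin{lstlisting}
import struct

TAG_END = 0      # terminating tag
TAG_CMDLINE = 1  # boot command line (NUL-terminated string)

def pack_tag(tag_type, payload):
    """Pack one tag: type, size, payload, padded to 8 bytes."""
    size = 8 + len(payload)        # size excludes the padding
    padding = (-size) % 8
    header = struct.pack("<II", tag_type, size)
    return header + payload + b"\0" * padding

def build_info(tags):
    """Build an information structure from (type, payload) pairs."""
    body = b"".join(pack_tag(t, p) for t, p in tags)
    body += pack_tag(TAG_END, b"")
    total_size = 8 + len(body)
    return struct.pack("<II", total_size, 0) + body

def walk_tags(info):
    """Yield (type, payload) for every tag before the end tag."""
    total_size, _reserved = struct.unpack_from("<II", info, 0)
    offset = 8
    while offset < total_size:
        tag_type, size = struct.unpack_from("<II", info, offset)
        if tag_type == TAG_END:
            break
        yield tag_type, info[offset + 8:offset + size]
        offset += (size + 7) & ~7  # next tag is 8-byte aligned
\end{lstlisting}

A consumer such as the CPU driver would walk the structure in the same way,
skipping any tag type that it does not recognise.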
699
\subsection{Post-Boot State}
701
702When the CPU driver on the boot core begins executing, it can assume the
703following:
704
705\begin{itemize}
706\item The MMU is configured with all RAM and I/O regions mapped via
707\texttt{TTBR1}.
708\item The CPU driver's code and data are both fully relocated into one or more
709distinct 4kiB-aligned regions.
710\item The stack pointer is at the top of a distinct 4kiB-aligned region of at
711least the requested size.
712\item The first argument register holds the Multiboot 2 magic value.
713\item The second holds a pointer to a Multiboot 2 information structure, in
714its own distinct 4kiB-aligned region.
715\item The console device is configured.
716\item Only one core is enabled.
717\item The Multiboot structure contains at least:
718    \begin{itemize}
719    \item The final EFI memory map, with all areas allocated by Hagfish to
720    hold data passed to the CPU driver marked with OS-specific types, all of
    which refer to non-overlapping 4kiB-aligned regions:
722        \begin{description}
723        \item[\ttfamily EfiBarrelfishCPUDriver]
724        The currently-executing CPU driver's text and data segments.
725        \item[\ttfamily EfiBarrelfishCPUDriverStack]
726        The CPU driver's stack.
727        \item[\ttfamily EfiBarrelfishMultibootData]
728        The Multiboot structure.
729        \item[\ttfamily EfiBarrelfishELFData]
730        The unrelocated ELF image for a boot-time module (including that for
731        the CPU driver itself), as loaded over TFTP.
732        \item[\ttfamily EfiBarrelfishBootPageTable]
733        The currently-active page tables.
734        \end{description}
735    \item The CPU driver (kernel) command line.
736    \item A copy of the last DHCP Ack packet.
737    \item A copy of the section headers from the CPU driver's ELF image.
738    \item Module descriptions for the CPU driver and all other boot modules.
739    \item If UEFI provided an ACPI root table, the Multiboot structure
740    contains a pointer to it.
741    \end{itemize}
742\end{itemize}
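
The alignment and non-overlap guarantees listed above are easy to check
mechanically. The following Python sketch shows one way to do so for a list
of regions taken from the memory map. It assumes that each region is
described by a base address and a count of 4kiB pages, as in an EFI memory
descriptor. The function name \texttt{check\_regions} and all addresses in
the example are hypothetical.

\begin{lstlisting}
PAGE_SIZE = 4096  # EFI memory descriptors count 4 KiB pages

def check_regions(regions):
    """Validate (type_name, base_address, page_count) triples.

    Raises ValueError if a region is not page-aligned or if two
    regions overlap; returns the regions sorted by base address
    as (start, end, type_name) triples.
    """
    spans = []
    for name, base, pages in regions:
        if base % PAGE_SIZE != 0:
            raise ValueError("%s: base %#x is unaligned" % (name, base))
        spans.append((base, base + pages * PAGE_SIZE, name))
    spans.sort()
    # Once sorted, any overlap shows up between neighbours.
    for (_, end0, name0), (start1, _, name1) in zip(spans, spans[1:]):
        if start1 < end0:
            raise ValueError("%s overlaps %s" % (name0, name1))
    return spans

# Hypothetical addresses, for illustration only.
EXAMPLE_REGIONS = [
    ("EfiBarrelfishCPUDriver",      0x80000000, 64),
    ("EfiBarrelfishCPUDriverStack", 0x80040000, 2),
    ("EfiBarrelfishMultibootData",  0x80042000, 1),
]
\end{lstlisting}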
743
744\subsection{Configuration}
745
746Hagfish configures itself by loading a file whose path is generated from its
747assigned IP address. Thus if your development machine receives the address
748192.168.1.100, Hagfish will load the file\\
749\texttt{hagfish.192.168.1.100.cfg}
750from the same TFTP server used to load it. The format is intended to be as
751close as practical to that of an old-style GRUB menu.lst file. The example
752configuration in \autoref{f:hag_config} loads
753\texttt{/armv8/sbin/cpu\_apm88xxxx} as the CPU driver, with arguments
754\texttt{loglevel=3}, and an 8192B (2-page) stack.
755
756\begin{figure}[htb]
757\begin{center}
758\begin{lstlisting}
759kernel /armv8/sbin/cpu_apm88xxxx loglevel=3
760stack 8192
761module /armv8/sbin/cpu_apm88xxxx
762module /armv8/sbin/init
763
764# Domains spawned by init
765module /armv8/sbin/mem_serv
766module /armv8/sbin/monitor
767
768# Special boot time domains spawned by monitor
769module /armv8/sbin/chips boot
770module /armv8/sbin/ramfsd boot
771module /armv8/sbin/skb boot
772module /armv8/sbin/kaluga boot
773module /armv8/sbin/spawnd boot bootarm=0
774module /armv8/sbin/startd boot
775
776# General user domains
777module /armv8/sbin/serial auto portbase=2
778module /armv8/sbin/fish nospawn
779module /armv8/sbin/angler serial0.terminal xterm
780
781module /armv8/sbin/memtest
782
783module /armv8/sbin/corectrl auto
784module /armv8/sbin/usb_manager auto
785module /armv8/sbin/usb_keyboard auto
786module /armv8/sbin/sdma auto
787\end{lstlisting}
788\end{center}
789\caption{Hagfish configuration file}
790\label{f:hag_config}
791\end{figure}
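
To make the format concrete, the following Python sketch derives the
configuration file name from an IP address and parses the three directives
used in \autoref{f:hag_config}. It is an illustration rather than the parser
that Hagfish uses: the function names \texttt{config\_name} and
\texttt{parse\_config} are hypothetical, and the handling of blank lines and
of comment lines starting with \texttt{\#} is an assumption based only on
the example shown.

\begin{lstlisting}
def config_name(ip_address):
    """File name requested from the TFTP server root."""
    return "hagfish.%s.cfg" % ip_address

def parse_config(text):
    """Parse 'kernel', 'stack' and 'module' directives."""
    config = {"kernel": None, "args": "", "stack": None, "modules": []}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        keyword, _, rest = line.partition(" ")
        path, _, args = rest.strip().partition(" ")
        if keyword == "kernel":
            config["kernel"] = path
            config["args"] = args.strip()
        elif keyword == "stack":
            config["stack"] = int(rest)  # stack size in bytes
        elif keyword == "module":
            config["modules"].append((path, args.strip()))
        else:
            raise ValueError("unknown directive: " + keyword)
    return config
\end{lstlisting}

Applied to the example configuration, this yields a stack size of 8192
bytes, which corresponds to two 4kiB pages.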
792
793
794
795\subsection{Booting with Hagfish in \qemu}\label{c:qemu}
796
797When booting a \qemu image for 64-bit ARM, a number of options are
798available (see \texttt{make help-boot}).  Building a boot image for
799\qemu with ARMv8 will typically result in a file in the build directory
called \texttt{armv8\_<core\_type>\_qemu\_image}.  This is a disk image which can be
801read by Hagfish through EFI calls.
802
803Booting this with a boot target from \texttt{make} will run the
804following:
805\begin{lstlisting}
  srcdir/tools/qemu-wrapper.sh \
     --image armv8_<core_type>_qemu_image \
     --arch armv8 \
     --bios ../git/barrelfish/tools/hagfish/QEMU_EFI.fd
810\end{lstlisting}
811
This wrapper script is complex, but reasonably well documented (use
\texttt{--help}).  It will invoke \qemu as follows:
814\begin{lstlisting}
  qemu-system-aarch64 \
  -m 1024 \
  -cpu cortex-a57 \
  -M virt \
  -d guest_errors \
  -M gic_version=3 \
  -smp 1 \
  -bios ../git/barrelfish/tools/hagfish/QEMU_EFI.fd \
  -device virtio-blk-device,drive=image \
  -drive if=none,id=image,file=armv8_<core_type>_qemu_image,format=raw \
  -nographic
826\end{lstlisting}
827
828
Note that for this script to work, you need to have \texttt{mtools}
(the MS-DOS file system manipulation tools) installed, since they are
used to prepare the \texttt{armv8\_<core\_type>\_qemu\_image} file.
832
More specifically, the \texttt{armv8\_<core\_type>\_qemu\_image} file is generated
by \texttt{tools/harness/efiimage.py}.  This creates an EFI file
system image out of the plain Barrelfish binaries built in
\texttt{\textit{builddir}/armv8/sbin}, plus the Hagfish EFI image we
regularly use for real hardware.  The \texttt{QEMU\_EFI.fd} file is
the UEFI firmware image built for \qemu.
839
840\section{Booting from U-Boot}\label{s:uboot}
841
Where a full UEFI environment is not available, it is possible to boot
Barrelfish from U-Boot~\cite{uboot}.  We boot Barrelfish from U-Boot
using U-Boot's limited EFI support: a build-time tool
(\texttt{armv8\_bootimage}) builds a single binary which only requires
the minimal EFI environment provided by U-Boot.  This binary contains
a loader (\texttt{efi\_loader}) which sets up the rest of the image as
a multiboot image in memory before starting the CPU driver.
849
850\subsection{Booting in \qemu with U-Boot}
851
A ``platform'' target such as \texttt{QEMU\_UBoot} builds such an image
for \qemu, and the \texttt{qemu-wrapper.sh} script can then be invoked to
use U-Boot instead of Hagfish:
855
856\begin{lstlisting}
  srcdir/tools/qemu-wrapper.sh \
  --image armv8_a57_qemu_image.efi \
  --arch armv8 \
  --uboot-img srcdir/tools/qemu-armv8-uboot.bin
861\end{lstlisting}
862
This invokes \qemu as follows:
864
865\begin{lstlisting}
  qemu-system-aarch64 \
  -m 1024 \
  -cpu cortex-a57 \
  -M virt \
  -d guest_errors \
  -M gic_version=3 \
  -smp 1 \
  -bios srcdir/tools/qemu-armv8-uboot.bin \
  -device loader,addr=0x50000000,file=armv8_a57_qemu_image.efi \
  -nographic
877
As you can see, the U-Boot binary is given as the BIOS, and the minimal
879EFI image with the complete set of multiboot modules compiled in is
880pre-loaded into memory when \qemu starts.
881
882\chapter{Technical Observations}\label{c:tech}
883
884\section{User-Space Threading}\label{s:threads}
885
886\begin{figure}[htb]
887\begin{center}
888\begin{minipage}[t]{0.3\textwidth}
889\begin{lstlisting}
890clrex
891/* Restore CPSR */
892ldr r0, [r1], #4
893msr cpsr, r0
894/* Restore registers */
895ldmia r1, {r0-r15}
896\end{lstlisting}
897\end{minipage}
898\hspace{2cm}
899\begin{minipage}[t]{0.5\textwidth}
900\begin{lstlisting}
901/* Restore PSTATE, load resume
902 * address into x18 */
903ldp x18, x2, [x1, #(PC_REG * 8)]
904/* Set only NZCV. */
905and x2, x2, #0xf0000000
906msr nzcv, x2
907/* Restore the stack pointer and x30. */
908ldp x30, x2, [x1, #(30 * 8)]
909mov sp, x2
910/* Restore everything else. */
911ldp x28, x29, [x1, #(28 * 8)]
912ldp x26, x27, [x1, #(26 * 8)]
913ldp x24, x25, [x1, #(24 * 8)]
914ldp x22, x23, [x1, #(22 * 8)]
915ldp x20, x21, [x1, #(20 * 8)]
916/* n.b. don't reload x18 */
917ldr      x19, [x1, #(19 * 8)]
918ldp x16, x17, [x1, #(16 * 8)]
919ldp x14, x15, [x1, #(14 * 8)]
920ldp x12, x13, [x1, #(12 * 8)]
921ldp x10, x11, [x1, #(10 * 8)]
922ldp  x8,  x9, [x1, #( 8 * 8)]
923ldp  x6,  x7, [x1, #( 6 * 8)]
924ldp  x4,  x5, [x1, #( 4 * 8)]
925ldp  x2,  x3, [x1, #( 2 * 8)]
926/* n.b. this clobbers x0&x1 */
927ldp  x0,  x1, [x1, #( 0 * 8)]
928/* Return to the thread. */
929br x18
930\end{lstlisting}
931\end{minipage}
932\end{center}
933\caption{\texttt{disp\_resume\_context} on ARMv7 (left) and ARMv8 (right)}
934\label{f:disp_resume}
935\end{figure}
936
937The ARMv8 architecture is in some ways an improvement, and in other ways
938problematic, for the sort of user-level threading implemented in Barrelfish,
939via \emph{scheduler activations}. Under this scheme, the kernel (in Barrelfish
940terms, the \emph{CPU driver}), does not schedule threads directly, but instead
941exposes all scheduling-relevant events via \emph{upcalls} to predefined
942user-level handlers (in Barrelfish, the \emph{dispatcher}), which then
943implements thread scheduling (or something else entirely), as it sees fit.
944This differs from the behaviour of a system such as UNIX, which only ever
945restores a user-level execution context simultaneously with dropping from a
946privileged to an unprivileged execution level.
947
948Processor architectures are, understandably, designed with common software in
mind. Thus, the primitives available for restoring an execution context (i.e.\
the register state) are often tied closely to those for changing privilege level. A
951common design (which ARMv8 also implements) is the \emph{exception return},
952where privileged code can atomically drop its privilege, and jump to a
953user-level execution address. In ARMv8, the \texttt{eret} instruction
954atomically updates the program state (PSTATE, most importantly the privilege
955level bits), and branches to the address held in the \emph{exception link
956register}, \texttt{elr}.
957
958In implementing user-level threading, we're not concerned with privilege
959levels, but the lack of some equivalent of \texttt{elr} is frustrating. Not
960only does \texttt{eret} provide an atomic update of the program counter and
961the program state, it does so without modifying any general-purpose register.
Replicating this behaviour at \texttt{EL0}, where \texttt{eret} is unavailable,
is problematic. ARMv8 differs from ARMv7, in that the program counter can no
964longer be the target of a load instruction, but can only be loaded via a
965general-purpose register.
966
967Specifically, the only PC-modifying instructions (other than \texttt{eret})
968are PC-relative branches (which are useless in this scenario) and
969branch-to-register (of which \texttt{br}, \texttt{blr} and \texttt{ret} are
970all special encodings). Since ARMv8 has also removed the \texttt{ldm} (load
971multiple) instruction, there is no way to load the program counter with an
972arbitrary value (the thread's restart address), without overwriting one of the
general-purpose registers. We cannot restore the thread's value for that register
\emph{before} we branch to it, as we'd overwrite the return address, and we
obviously can't do so afterwards, as the thread likely has no idea that it's
been interrupted. The only alternatives are to trampoline through kernel mode in
order to use \texttt{eret} (which would eliminate the speed benefit of
user-level threading), or to reserve a general-purpose register for use by the
dispatcher. Neither option is appealing, but we went with the second option,
980reserving \texttt{x18}, reasoning that with 31 general-purpose registers
981available, the loss of one isn't a huge penalty. Register \texttt{x18} is
982explicitly marked as the \emph{platform register} in the AArch64 ABI
983\citep{arm:aa64pcs}, for such a purpose.
984
985Future revisions of the ARM architecture could prevent this issue in a number
986of ways: allowing the use of \texttt{eret} at \texttt{EL0} or providing an
987equivalent functionality (specifically a non-general-purpose register such as
988\texttt{elr}, that doesn't need to be restored); or alternatively, adding
989indirect jumps (load to PC) back to the instruction set.
990
991\autoref{f:disp_resume} compares the user-level thread resume code for the
992Barrelfish dispatcher (function \texttt{disp\_resume}) for ARMv7 and ARMv8
993side-by-side. The effect of removing the load-multiple instructions, and
direct-to-SP loads, on code density is clearly visible: everything on lines
8--29 for ARMv8 corresponds to the single \texttt{ldmia} instruction on line
6 for ARMv7; one instruction is now 18, on the thread-switch critical path!
Note also, on line 17, that the ARMv8 code does not restore the thread's
\texttt{x18}, but instead uses it to hold the branch address for use on line
29. The only improvement on ARMv8 is that the \texttt{clrex} (clear exclusive
1000monitor) instruction is no longer required, as the monitor is cleared on
1001returning from the kernel. Note also that the usual method to efficiently load
1002multiple registers, using 16-word SIMD (NEON) loads, isn't available, as
1003there's no guarantee that the SIMD extensions are enabled on this dispatcher,
1004and we cannot handle a fault in this code.
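
The register save area that this code reads can be modelled in a few lines.
The following Python sketch is an illustration only. The slot indices are
inferred from the listings in \autoref{f:disp_resume} and
\autoref{f:sync_el0}, on the assumption that both use the same layout:
\texttt{x0}--\texttt{x30} in slots 0--30, the stack pointer in slot 31, the
resume address in slot 32 and the saved program state in slot 33. The value
given for \texttt{PC\_REG} is therefore an inference, and the names
\texttt{SP\_REG} and \texttt{SPSR\_REG} are hypothetical. The sketch also
regenerates the instructions of the restore path to confirm the count of 18.

\begin{lstlisting}
WORD = 8                # bytes per saved 64-bit register

# Slot indices; x0..x30 occupy slots 0..30.
SP_REG = 31             # hypothetical name
PC_REG = 32             # inferred value, not taken from the source
SPSR_REG = 33           # hypothetical name

NZCV_MASK = 0xF0000000  # N, Z, C and V are bits 31..28

def slot_offset(index):
    """Byte offset of a slot within the save area."""
    return index * WORD

def nzcv_only(saved_pstate):
    """Keep only the condition flags, as the 'and' does."""
    return saved_pstate & NZCV_MASK

def resume_sequence():
    """Instructions from 'ldp x30' to 'br x18', in order."""
    pair = "ldp x%d, x%d, [x1, #%d]"
    seq = ["ldp x30, x2, [x1, #%d]" % slot_offset(30), "mov sp, x2"]
    for low in range(28, 18, -2):   # x28/x29 down to x20/x21
        seq.append(pair % (low, low + 1, slot_offset(low)))
    # x18 holds the resume address, so only x19 is reloaded here.
    seq.append("ldr x19, [x1, #%d]" % slot_offset(19))
    for low in range(16, -1, -2):   # x16/x17 down to x0/x1
        seq.append(pair % (low, low + 1, slot_offset(low)))
    seq.append("br x18")
    return seq
\end{lstlisting}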
1005
1006\section{Trap Handling}\label{s:traps}
1007
1008\begin{figure}
1009\begin{lstlisting}
1010el0_aarch64_sync:
1011    msr daifset, #3 /* IRQ and FIQ masked, Debug and Abort enabled. */
1012
1013    stp x11, x12, [sp, #-(2 * 8)]!
1014    stp x9,  x10, [sp, #-(2 * 8)]!
1015
1016    mrs x10, tpidr_el1
1017    mrs x9, elr_el1
1018
1019    ldp x11, x12, [x10, #OFFSETOF_DISP_CRIT_PC_LOW]
1020    cmp x11, x9
1021    ccmp x12, x9, #0, ls
1022    ldr w11, [x10, #OFFSETOF_DISP_DISABLED]
1023    ccmp x11, xzr, #0, ls
1024    /* NE <-> (low <= PC && PC < high) || disabled != 0 */
1025
1026    mrs x11, esr_el1  /* Exception Syndrome Register */
1027    lsr x11, x11, #26 /* Exception Class field is bits [31:26] */
1028
1029    b.ne el0_sync_disabled
1030
1031    add x10, x10, #OFFSETOF_DISP_ENABLED_AREA
1032
1033save_syscall_context:
1034    str x7,       [x10, #(7 * 8)]
1035
1036    stp x19, x20, [x10, #(19 * 8)]
1037    stp x21, x22, [x10, #(21 * 8)]
1038    stp x23, x24, [x10, #(23 * 8)]
1039    stp x25, x26, [x10, #(25 * 8)]
1040    stp x27, x28, [x10, #(27 * 8)]
1041    stp x29, x30, [x10, #(29 * 8)] /* FP & LR */
1042
1043    mrs x20, sp_el0
1044    stp x20, x9, [x10, #(31 * 8)]
1045
1046    mrs x19, spsr_el1
1047    str x19, [x10, #(33 * 8)]
1048
    cmp x11, #0x15 /* EC 0x15: SVC executed in AArch64 state */
1050    b.ne el0_abort_enabled
1051
1052    add sp, sp, #(4 * 8)
1053
1054    mov x7, x10
1055
1056    b sys_syscall
1057\end{lstlisting}
1058\caption{BF/ARMv8 synchronous exception handler}
1059\label{f:sync_el0}
1060\end{figure}
1061
\autoref{f:sync_el0} shows the CPU driver exception stub for a synchronous
exception taken from \texttt{EL0}. This exception class includes system calls,
breakpoints, and page faults on both code and data. The effect of the loss of
store multiple instructions is again visible, for example on lines 27--32.
Although not as severe as in the case of the user-level thread restore in
\autoref{s:threads}, the extra instructions required do constrain us somewhat,
as each trap handler is limited to 128 bytes, or 32 instructions, before
branching to another code block.
1070
1071We were able to squeeze the necessary code into the space available, including
1072the optimised test for a disabled dispatcher at lines 10--14, but only by
1073splitting the page fault handler (\texttt{el0\_abort\_enabled}) into a
1074separate subroutine, incurring an unnecessary branch. A more significant
1075annoyance is that system calls (\texttt{svc} and \texttt{hvc}) are routed to
1076the same exception vector as page faults (aborts).  The effect of this is that
1077we are forced to spill registers to the stack (\texttt{x9}--\texttt{x12} on
1078lines 4--5), even on the system call fast path, as we need at least one
1079register to check the exception syndrome (\texttt{esr\_el1}) to distinguish
1080aborts (where we must preserve all registers) from system calls (where we
1081could immediately begin using the caller-saved registers). Note that the code
1082on lines 27--32 only needs to stack the callee-saved registers, and leaves the
1083system call arguments in \texttt{x0}--\texttt{x7}, to be read as required by
1084\texttt{sys\_syscall} (in C).
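
The syndrome test performed by the stub amounts to a shift and a compare.
The following Python sketch models it, together with the size limit on each
vector entry. It is a model for illustration, not CPU driver code: the
function and constant names are hypothetical, and the explicit 6-bit mask is
added for clarity (the stub itself shifts without masking).

\begin{lstlisting}
EC_SHIFT = 26                # exception class is ESR bits [31:26]
EC_MASK = 0x3F
EC_SVC_AARCH64 = 0x15        # SVC executed in AArch64 state
EC_DATA_ABORT_LOWER = 0x24   # data abort from a lower exception level

VECTOR_SLOT_BYTES = 128      # space available per vector entry
INSTRUCTION_BYTES = 4        # fixed-width A64 instructions

def exception_class(esr):
    """Extract the exception class from a syndrome value."""
    return (esr >> EC_SHIFT) & EC_MASK

def is_syscall(esr):
    """True on the system call path, False for aborts and others."""
    return exception_class(esr) == EC_SVC_AARCH64

def slot_capacity():
    """Instructions that fit in one vector entry."""
    return VECTOR_SLOT_BYTES // INSTRUCTION_BYTES
\end{lstlisting}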
1085
1086This sort of mismatch between the exception-handling interface of the CPU
1087architecture, and what is required for really high-performance systems code is
unfortunately extremely common. Unnecessary overheads, such as the additional
stacked registers here, hurt the performance of highly-componentised systems
such as Barrelfish, which rely on frequently crossing protection domains.
1091
1092The relatively well-compressed boolean arithmetic on lines 10--14 demonstrates
1093that, even with the loss of ARM's fully-conditional instructions, the
1094conditional compares which remain are still relatively powerful.
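
To see why the three-instruction compare chain yields the stated condition,
it helps to simulate the flags. The following Python sketch models
\texttt{cmp} and \texttt{ccmp} on 64-bit values, together with the
\texttt{ls} and \texttt{ne} conditions, and replays the sequence from
\autoref{f:sync_el0}. It is a model written for this explanation, not code
from the CPU driver, and all function names are hypothetical.

\begin{lstlisting}
MASK64 = (1 << 64) - 1

def cmp_flags(a, b):
    """Flags (N, Z, C, V) set by 'cmp a, b' on 64-bit values."""
    a &= MASK64
    b &= MASK64
    result = (a - b) & MASK64
    n = result >> 63
    z = int(result == 0)
    c = int(a >= b)  # carry set means no borrow
    v = int((a >> 63) != (b >> 63) and (result >> 63) != (a >> 63))
    return (n, z, c, v)

def cond_ls(flags):
    """Unsigned lower or same: C clear or Z set."""
    _, z, c, _ = flags
    return c == 0 or z == 1

def cond_ne(flags):
    """Not equal: Z clear."""
    return flags[1] == 0

def ccmp(a, b, nzcv, cond_holds):
    """Compare if the condition holds, else load the immediate."""
    if cond_holds:
        return cmp_flags(a, b)
    return ((nzcv >> 3) & 1, (nzcv >> 2) & 1, (nzcv >> 1) & 1, nzcv & 1)

def treated_as_disabled(crit_low, crit_high, pc, disabled):
    flags = cmp_flags(crit_low, pc)                 # cmp  x11, x9
    flags = ccmp(crit_high, pc, 0, cond_ls(flags))  # ccmp x12, x9, #0, ls
    flags = ccmp(disabled, 0, 0, cond_ls(flags))    # ccmp x11, xzr, #0, ls
    return cond_ne(flags)                           # b.ne taken?
\end{lstlisting}

The key observation is that the immediate flag value 0 satisfies \texttt{ls}
(carry clear) as well as \texttt{ne} (zero clear). A program counter below
the critical section therefore still reaches the final comparison against
the disabled flag, while one inside the section skips it and leaves
\texttt{ne} set.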
1095
1096\section{Cache Coherence}
1097
1098One aspect of the ARM architecture that is of particular interest for the
1099Barrelfish project, but which we have not yet explored in depth, is the
1100configurable cache coherency and fine-grained cache management operations
1101available. Any virtual mapping on a recent ARM architecture, including both
1102ARMv7 and ARMv8, can be tagged with various cacheability properties: inner
1103(L1), outer (L2+, usually), write-back or write-through. Combined with the
1104explicit flush operations at cache-line granularity, able to target either PoU
1105(point of unification, where data and instruction caches merge) or PoC (point
1106of coherency, typically RAM), a multi-core, multi-socket ARMv8 system would
1107make a very interesting testbed for investigating efficient cache management
1108and communication primitives for future partially-coherent architectures.
1109Indeed, the latest revision of the ARMv8 specification, ARMv8.2, introduced
1110flush to PoP, or \emph{point of persistence} --- perhaps in response to
1111interest from well-known systems integration firms investigating large
1112persistent memories.
1113
1114The design presented in this report is intended to expose as much control over
1115the caching hierarchy as possible to user-level code, to provide a platform
1116for future research.
1117
1118\bibliographystyle{plainnat}
1119\bibliography{defs,barrelfish}
1120
1121\end{document}
1122