1%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2% Copyright (c) 2013, ETH Zurich.
3% All rights reserved.
4%
5% This file is distributed under the terms in the attached LICENSE file.
6% If you do not find this file, copies can be found by writing to:
7% ETH Zurich D-INFK, Universitaetstr. 6, CH-8092 Zurich. Attn: Systems Group.
8%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
9
10\documentclass[a4paper,twoside]{report}
11
12\usepackage{bftn}
13\usepackage{booktabs}
14\usepackage{hyperref}
15\usepackage{hyphenat}
16\usepackage{listings}
17\usepackage{makeidx}
18\usepackage{natbib}
19
20\def\chapterautorefname{Chapter}
21\def\sectionautorefname{Section}
22\def\subsectionautorefname{Section}
23\def\subsubsectionautorefname{Section}
24\def\tableautorefname{Table}
25
26\lstdefinelanguage{armasm}{
27    numbers=left,
28    numberstyle=\tiny,
29    numbersep=5pt,
30    basicstyle=\ttfamily\small
31}
32\lstset{language=armasm}
33
34\title{Barrelfish on ARMv8}
35\author{David Cock}
36
37\tnnumber{022}  % give the number of the tech report
38\tnkey{ARMv8} % Short title, will appear in footer
39
40\begin{document}
41\maketitle
42
43\begin{versionhistory}
44\vhEntry{1.0}{11.04.2016}{DC}{Initial version}
45\end{versionhistory}
46
47\chapter{Summary}
48
49Barrelfish now supports ARMv7 and ARMv8 as primary platforms, and we have
50discontinued support for all older architecture revisions (ARMv5, ARMv6). The
51current Barrelfish release contains a port to a simulated ARMv8 environment,
52derived from the existing ARMv7 codebase and running under GEM5, with generous
53contributions from HP Research.
54
55Simultaneously, we are undertaking a clean-slate redesign of the CPU driver
56for ARMv8, as it presents a number of novel features, and greatly improved
57platform standardisation (\autoref{s:sbsa}), that should allow for a much
cleaner and simpler implementation. This redesigned CPU driver will form the
basis for ongoing research into large-scale non-cache-coherent systems using
60ARMv8 cores. This document presents the new CPU driver design
61(\autoref{c:design}), briefly covering those features of ARMv8 of greatest
62relevance (\autoref{c:background}), and discusses a number of technical
63challenges presented by the new architecture (\autoref{c:tech}).
64
65\chapter{Background}\label{c:background}
66
67The Barrelfish research operating system is a vehicle for research into
68software support for likely future architectures, where large numbers of
69non-coherent (or weakly-coherent) heterogeneous processor cores are assembled
70into a single large-scale system. As such, support for a common non-x86
71architecture has always been part of the project, beginning with the ARMv5
72(XScale) port, which permitted the embedded processor on a network interface
73card to be integrated as a first-class part of the system, with its own CPU
74driver. We have also actively maintained an ARMv7 port, to the OMAP4460
75processor on the Pandaboard ES, which we use as a teaching platform in the
76Advanced Operating Systems course at ETH Z\"urich. These ports are described
77more fully in the accompanying technical report, \citet{btn017-arm}.
78
79\section{The ARMv8 Architecture}
80
81The ARMv8 architecture is quite a radical departure from previous versions,
82and represents the culmination of a trend that has been developing for quite
83some time. While the first wave of ARM-based microservers, based on the 32-bit
84ARMv7 architecture, was largely a commercial failure, it's clear that ARM is
85now actively targeting the server market, where Intel currently has near-total
86dominance.
87
88ARMv8 discards some long-standing features of the ARM instruction set:
89universal conditional execution, multiple loads/stores, and the program
90counter as a general-purpose register. These most likely caused difficulty in
91scaling the processor pipeline to high clock rates, and we present some
92consequences of their loss in \autoref{s:threads} and \autoref{s:traps}. The
93instruction-set changes, however challenging to the systems programmer, are
94ultimately of little consequence compared to the consolidation of the ARM
95ecosystem into a serious server platform. The two features of most interest at
96this stage in the design process are the standardisation of hardware features
97and memory maps, and of the boot process.
98
99\subsection{ARM Server Base System Architecture}\label{s:sbsa}
100
ARM has long been criticised by systems programmers for its highly fragmented
and non-uniform programming interface. Linux, in particular, has struggled for
103years with supporting the great multiplicity of ARM platforms.  The principal
104reason for this is the lack of any concept of a \emph{platform}: a set of
105assumptions (available hardware, memory map, etc.), that programmers can rely
on when initialising and managing a system. The Linux source tree famously
contained vastly more code in the ARM platform-support subtrees than in those
for x86.
109
110The relative standardisation of the x86 platform is largely a historical
111accident, due to the rapid proliferation of PC/AT clones in the early 1980s.
112The x86 platform thus contains layers of ossified legacy interfaces, necessary
113to ensure broad cross and backward compatibility.  ARM's business model, on
114the contrary, has long emphasised the specialisation of implementations: an
ARM licensee would take their ARM-designed CPU core, and integrate it
themselves into a complex SoC (system on a chip), with their own specialised,
proprietary interfaces. The upsides of this were the possibility to highly
118optimise a particular design, and no requirement on ARM itself to maintain a
119coherent platform.
120
121While ARM's customisable platform worked well for embedded devices, and scaled
122reasonably well to relatively powerful smartphones, it's a disaster for
123producing high-quality systems software, able to execute on a broad range of
124hardware from competing vendors: exactly what a competitive server platform
125requires. ARM clearly know this, and since 2014 have published the Server Base
126System Architecture \citep{arm:sbsa}. To the extent that manufacturers adhere
127to these guidelines, our job as systems programmers is significantly simpler:
128it should be possible to write a single set of initialisation and
129configuration code for ARMv8, that will run on any SBSA-compliant system, much
130as we already do for x86-64.
131
132\begin{table}
133\begin{center}
134\begin{tabular}{lllp{6cm}}
135\toprule
136Supplier & Processor & Name & \\
137\midrule
138APM & APM883208 & Mustang  & 1P 8-core X-Gene 1 with serial trace. \\
139\addlinespace[2pt]
140    & APM883408-X2 & X-C2  & 1P 8-core X-Gene 2. \\
141\addlinespace[2pt]
142Cavium & CN8890 & StratusX & 1P 48-core ThunderX. \\
143\addlinespace[2pt]
144       &        & Cirrus   & 2P 48-core ThunderX. \\
145\addlinespace[2pt]
146ARM & AEM & Fixed Virtual Platform & The \emph{architectural envelope model}
147covers the range of behaviour permitted by ARMv8. Bare-metal debug. \\
148\addlinespace[2pt]
149    &     & Foundation Platform & Freely available, compatible with FVP. \\
150\bottomrule
151\end{tabular}
152\end{center}
153\caption{ARMv8 platforms of interest}\label{t:platforms}
154\end{table}
155
156Our target platforms listed in \autoref{t:platforms} all support SBSA to some
157extent, and absent any compelling reason, we will only support SBSA-compliant
158platforms.
159
160\subsection{UEFI}\label{s:uefi}
161
162One aspect of the SBSA which eases portability is the specification, for the
163first time, of a boot process for ARM systems. ARM has specified that SBSA
164systems must support UEFI \citep{uefi} (the unified extensible firmware
165interface). UEFI is a descendant of the EFI specification, developed by Intel
166for the Itanium project. While Itanium is no longer a platform of any great
167commercial interest, UEFI support is now widespread in the x86-64 market.
168UEFI, in turn, specifies the use of ACPI \citep{acpi} (the advanced
169configuration and power interface) for platform discovery and control.
170
171Supporting ACPI and UEFI requires a one-off investment of effort to design a
172new boot and configuration subsystem, but should pay off in the long term, as
173ports to new ARM boards will no longer require extensive manual configuration.
174The code should also be largely reusable for x86 UEFI systems. Our new UEFI
175bootloader is described in \autoref{s:hagfish}.
176
177\section{A Direct Port from ARMv7}
178
179As already described, the Barrelfish release current at time of writing
180includes an initial ARMv8 port to the GEM5 simulator. This port contains code
181generously contributed by HP Research.
182
Being developed from the existing codebase, this ARMv8 port follows the
structure of the existing ARMv7 code closely. While it is highly useful to
have a running port, we are nevertheless continuing with a substantial
redesign of the CPU driver, as considerable improvements and simplifications
will be possible once we no longer need to follow the existing structure,
originally developed for a very different platform.
189
190The GEM5 simulator's model of an ARMv8 platform is relatively primitive, and
191does not conform to modern platform conventions, for example placing RAM at
192address \texttt{0}, rather than \texttt{0x80000000} as mandated by the SBSA.
193For this reason, in addition to better integration with ARM debugging tools,
194we have switched to the ARM Fixed Virtual Platform as our default simulation
195environment, with the Foundation Platform supported as a freely available
196simulator.
197
198\section{Registers}
199
200\subsection{General-purpose Registers}
201
There are 31 general-purpose registers (\texttt{r0-r30}), each 64 bits wide
(\autoref{tab:registers}), plus a dedicated stack pointer. The full 64-bit
registers are usually referred to by the names \texttt{x0-x30}, and their
lower 32 bits as \texttt{w0-w30}. The additional stack pointer register,
\texttt{SP}, can only be accessed by a restricted set of instructions.
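
As a brief illustration of the two register views (a sketch of our own, not
taken from the Barrelfish sources), a write to a \texttt{w} register clears
the upper 32 bits of the corresponding \texttt{x} register:

\begin{lstlisting}
        mov     x0, #-1             // x0 = 0xffffffffffffffff
        add     w0, w0, #1          // 32-bit add: w0 wraps around to 0,
                                    // upper half is zeroed, so x0 = 0
        add     x1, x0, #1          // 64-bit add: x1 = 1
\end{lstlisting}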
207
208\begin{table}[!h]
209    \begin{center}
210
211    \begin{tabular}{lll}
212        \textbf{Register} & \textbf{Special} & \textbf{Description} \\
213        \hline
214        \texttt{X0-X7}   & Caller-save & function call arguments and return value 
215        \\
216        \texttt{X8}      & & indirect result e.g. location of large return value 
217                             (struct) \\
218        \texttt{X9-X15}  & Caller-save& temporary registers  \\
219        \texttt{X16}     & IP0 & The first intra-procedure-call scratch 
220        register\footnote{can be used by call veneers and PLT code; at other 
221        times may be used as a temporary register. Same for X17}\\
222        \texttt{X17}     & IP1 & The second intra-procedure-call temporary 
223        register  \\
224        \texttt{X18}     & & The Platform Register (TLS), if needed; otherwise a 
225        temporary register. \\
        \texttt{X19-X28} & Callee-save & need to be preserved and restored when 
                           modified\\
228        \texttt{X29}     & FP & frame pointer \\
229        \texttt{X30}     & LR & link register \\
        \texttt{SP}      & & stack pointer (shares register number 31 with the
                             zero register, \texttt{XZR}) \\
231        
232    \end{tabular}
233    \caption{ARMv8 General purpose Registers}
234    \label{tab:registers}
235        \end{center}
236\end{table}
237
238\paragraph{Procedure call}
239\begin{itemize}
    \item The registers \texttt{x19-x28} and \texttt{SP} are callee-saved and 
          hence must be preserved by the called subroutine. All 64 bits have to 
          be preserved, even when executing in 32-bit mode.
    \item The registers \texttt{x0-x7} and \texttt{x9-x15} are caller-saved.
    \item During procedure calls the registers \texttt{x16}, \texttt{x17}, 
          \texttt{x29} and \texttt{x30} have special roles, i.e.\ they store 
          relevant addresses such as the return address.
    \item Arguments for calls are passed in the registers \texttt{x0-x7}
          (\texttt{v0-v7} for floating-point and SIMD values), with any
          remaining arguments passed on the stack.
249\end{itemize}
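
The following sketch shows how these rules typically appear in a non-leaf
function. Both routines and their names (\texttt{helper},
\texttt{sum\_plus\_double}) are hypothetical, chosen only for illustration.
The function keeps a value live across a call in the callee-saved
\texttt{x19}, which it must therefore save and restore itself:

\begin{lstlisting}
        .text
helper:                                 // hypothetical leaf: returns 2 * x0
        add     x0, x0, x0
        ret

        .global sum_plus_double
// x0 = a, x1 = b; returns a + 2 * (a + b)
sum_plus_double:
        stp     x29, x30, [sp, #-32]!   // push frame record: FP and LR
        mov     x29, sp                 // establish the new frame pointer
        str     x19, [sp, #16]          // callee-saved, so save before use
        mov     x19, x0                 // keep a live across the call
        add     x0, x0, x1              // argument: a + b
        bl      helper                  // may clobber x0-x18; sets x30
        add     x0, x0, x19             // result is returned in x0
        ldr     x19, [sp, #16]          // restore the callee-saved register
        ldp     x29, x30, [sp], #32     // pop the frame record
        ret                             // branch to the address in x30
\end{lstlisting}

The frame is 32 bytes, rather than 24, so that \texttt{SP} stays 16-byte
aligned, as the architecture requires whenever it is used as a base address.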
250
\paragraph{Indirect result} Register \texttt{x8} is used when a function
returns a large value, such as that declared by
\texttt{struct mystruct foo(int arg);}: the caller passes the address of the
memory that will receive the result in \texttt{x8}.
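
A possible call sequence for such a function is sketched below, under the
assumption (ours) that \texttt{struct mystruct} occupies 32 bytes, which is
large enough that it is returned through memory rather than in registers.
The caller is assumed to have already saved its own link register:

\begin{lstlisting}
        sub     sp, sp, #32         // reserve 32 bytes for the result
        mov     x8, sp              // x8 = address of the result buffer
        mov     w0, #42             // int arg, passed in w0
        bl      foo                 // foo stores its result through x8
        ldr     x1, [sp]            // first 8 bytes of the returned struct
        add     sp, sp, #32         // release the buffer
\end{lstlisting}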
253
254
255\paragraph{Platform specific} The use of register \texttt{x18} is platform 
256specific and needs to be defined by the platform ABI. This register can hold
257inter-procedural state such as the thread context. 
258
\paragraph{Linker} The registers \texttt{IP0} and \texttt{IP1} can be used by
the linker as scratch registers, or to hold intermediate values between
subroutine calls.
262
263\subsection{SIMD and Floating point}
There are 32 registers for use by floating-point and SIMD operations. The name
used to refer to a register depends on the size of the operation.
266
267\begin{table}[!h]
268    \begin{center}
269        
270        \begin{tabular}{ll}
271            \textbf{Register} & \textbf{Description} \\
272            \hline
273            \texttt{v0-v7}   & function call arguments, intermediate values and 
274                               return value, caller save registers \\
            \texttt{v8-v15}  & Callee-save registers; only the low 64 bits of
            each need to be preserved \\
277            \texttt{v16-v31} & Caller-save registers
278        \end{tabular}
        \caption{ARMv8 SIMD and floating-point registers}
        \label{tab:simdregisters}
281    \end{center}
282\end{table}
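
To illustrate the size-dependent naming, the same 32 registers can be
addressed as \texttt{b}, \texttt{h}, \texttt{s}, \texttt{d} or \texttt{q}
(8, 16, 32, 64 or 128 bits), or as vectors via the \texttt{v} names. The
instructions below are an arbitrary selection, shown only as a sketch:

\begin{lstlisting}
        fadd    s0, s1, s2          // scalar single precision (32-bit view)
        fadd    d0, d1, d2          // scalar double precision (64-bit view)
        fadd    v0.4s, v1.4s, v2.4s // vector: four 32-bit lanes of the
                                    // full 128-bit registers
        ldr     q3, [x0]            // 128-bit load into register 3
\end{lstlisting}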
283
284\chapter{Design and Implementation}\label{c:design}
285
286
287
288
289\section{Redesigning the CPU Driver}
290
Given that ARMv8 is a significantly different platform to ARMv7, and that the
ARMv7 codebase carries a significant legacy, reaching right back to ARMv5, we
are pursuing a substantial redesign of the CPU driver. Taking advantage of the
294standardisation of the hardware platform mandated by the SBSA
295(\autoref{s:sbsa}), and the facilities provided by UEFI (\autoref{s:uefi}), in
296addition to a relatively unrestricted virtual address space, we are able to
297significantly reduce the complexity of the CPU driver.  In this section we
298describe the updated design, and our progress on its implementation, while the
299UEFI interface (Hagfish) is described separately, in \autoref{s:hagfish}.
300
301\paragraph{Terminology}
302In the interest of clarity, in the discussion that follows, we use a few terms
303with precise intent:
304\begin{description}
305\item[shall]
306    indicates features or characteristics of the design to which the
307    Barrelfish implementation must conform.
308\item[should]
309    indicates features which should be supported if at all possible.
310\item[initially]
311    indicates features which will be provided from the outset in the
312    Barrelfish implementation.
313\item[eventually]
314    indicates features which will be provided later in the Barrelfish
315    implementation, and which the initial design will aim to facilitate.
316\end{description}
317
318\subsection{Goals}
319
320Our goal is to provide a reference design for the CPU driver and user-space
321execution environment for Barrelfish on an ARMv8 core, in order to understand
322both positive and negative implications of the architecture for a multikernel
323system.  The design \textbf{should} be applicable to any ARMv8 with
324virtualisation (\texttt{EL2}) support.
325
326\textbf{Initially}, our hardware development platform is the APM X-Gene 1,
327using the Mustang Development Board. We are using the Mustang principally as
328it was relatively easily available, as well as being a comparatively complex
and powerful CPU. The ThunderX platform from Cavium is very interesting for
Barrelfish, as it ties together a large number (48) of less-powerful (2-issue)
cores.
331We do not have the resources to develop for two platforms simultaneously, but
332we hope to \textbf{eventually} add support for the ThunderX.
333
334Our target simulation environment is the ARM Fixed Virtual Platform, and the
335Foundation Platform. These models are supplied by ARM. The Foundation Platform
336is freely available, and will be the default supported simulation platform for
337the public Barrelfish tree, while we will use the FVP internally to allow
bare-metal debugging. Future support for QEMU is desirable, to the extent that
it models a compatible system; GEM5, which the ARMv7 port targets, currently
does not.
341
342\textbf{Initially}, the design will support running both the CPU driver and
343user-space processes in AArch64 mode without support for virtualisation.
344\textbf{Eventually} the design will support running the CPU driver in AArch64
345mode, and user-space processes in both AArch64 and AArch32 modes without
346virtualisation, and virtual machines in AArch64 mode. We will only support
347virtualisation on ARMv8.1 or later platforms, that support the VHE extensions,
348as described in \autoref{s:layout}.
349
350\subsection{Processor Modes and Virtualisation}
351
352Where possible, we will keep the virtualisation model similar to that on
353Barrelfish/x86. In particular, it \textbf{should} be possible to implement
354native applications, fully virtualised (e.g. Linux) VMs, and VM-level
355applications e.g. Arrakis \citep{peter:osdi14}.
356
357ARMv8 has a somewhat different virtualisation model to x86, and different
358again from the ARMv7 virtualisation extensions. Rather than having exception
359levels (rings) duplicated between guest and host, ARMv8 provides 4 exception
360levels (ELs):
361
362\begin{itemize}
363\item \texttt{EL0} is unprivileged --- user applications.
364\item \texttt{EL1} is privileged --- OS kernel.
365\item \texttt{EL2} is hypervisor state.
366\item \texttt{EL3} is for switching between secure and non-secure (TrustZone)
367                   modes. The X-Gene 1 does not implement \texttt{EL3}, and it
368                   is currently not of interest for Barrelfish.
369\end{itemize}
370
371Explicit traps (syscalls/hypercalls) target only the next level up:
372\texttt{EL0} can call \texttt{EL1} using \texttt{svc} (syscall), and
373\texttt{EL1} can call \texttt{EL2} using \texttt{hvc} (hypercall), but
374\texttt{EL0} cannot directly call \texttt{EL2}, unless \texttt{EL1} is
375completely disabled.  Exceptions return to the caller's exception level.
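
In assembly, the two explicit traps look as follows. This is only a sketch:
passing a call number in \texttt{x0} is a hypothetical convention, not the
Barrelfish system call ABI. The 16-bit immediate is not interpreted by the
hardware, but is reported to the handler in the exception syndrome register.

\begin{lstlisting}
        mov     x0, #1              // hypothetical call number, in x0
        svc     #0                  // from EL0: synchronous exception to EL1
        // ... and, from code running at EL1:
        hvc     #0                  // from EL1: synchronous exception to EL2
\end{lstlisting}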
376
377ELs \textbf{shall} be distributed as follows: The CPU driver \textbf{shall}
378exist at both \texttt{EL1} and \texttt{EL2}, and take both syscalls
379(\texttt{svc}, from \texttt{EL0} applications) and hypercalls (\texttt{hvc},
380from \texttt{EL1} applications). The system \textbf{shall} support
381applications both at \texttt{EL0}, and at \texttt{EL1} (e.g.  Arrakis, VMs).
382Most code paths \textbf{should} be identical, as most CPU driver operations do
383not depend on \texttt{EL2} privileges.  Hypercalls from \texttt{EL0}
384\textbf{shall} be chained via \texttt{EL1} (with appropriate permission
385checks).
386
387\texttt{EL1} apps such as Arrakis, and paravirtualised VMs using hypercalls
388know that they are being virtualised, and will use \texttt{hvc} explicitly.
389Fully-virtualised \texttt{EL1} VMs do not make hypercalls.
390
391ARMv8 implements two-level address translation: VA (virtual address) to IPA
392(intermediate physical address), and IPA to PA (physical address).
393\texttt{EL1} guests \textbf{shall} be isolated at the L1 translation layer,
394and by trapping all accesses to system control registers.
395
396\subsection{Virtual Address Space Layout}\label{s:layout}
397
ARMv8 has an effective 48-bit virtual address space. At the lowest exception
levels (\texttt{EL0}, the Barrelfish user level, and \texttt{EL1}, the
Barrelfish CPU driver), the hardware supports two windows of up to 48 bits
(256\,TiB) each in a 64-bit space: one at the bottom, and one at the top. Each
region has its own translation table base register (\texttt{TTBR0} \&
\texttt{TTBR1}), selected by the upper bits of the virtual address. The
\texttt{TTBR0} window is used at \texttt{EL0}, and the \texttt{TTBR1} window
at \texttt{EL1}.
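
As a worked example, assuming both windows are configured to their full
48-bit size, each covers $2^{48}\,\mathrm{B} = 2^{8} \times
2^{40}\,\mathrm{B} = 256\,\mathrm{TiB}$, and the two occupy the extremes of
the 64-bit space:
\[
\begin{array}{ll}
\mbox{lower window (\texttt{TTBR0}):} &
  \mbox{\texttt{0x0000000000000000}--\texttt{0x0000ffffffffffff}} \\
\mbox{upper window (\texttt{TTBR1}):} &
  \mbox{\texttt{0xffff000000000000}--\texttt{0xffffffffffffffff}}
\end{array}
\]
Addresses between the two windows are not covered by either table, and any
access to them faults.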
404
405In the initial ARMv8 specification, this split address space was not
406implemented at \texttt{EL2}, which would require a separate CPU driver
407instance for virtualisation, and hypercalls (e.g. for Arrakis). ARMv8.1
introduced the virtualisation host extensions (VHE) which, among other things,
extend the split address space to \texttt{EL2}. As this provides a cleaner
410implementation model, and to avoid having to support a now-deprecated
411interface, virtualisation will \textbf{only} be supported on ARMv8.1 and
412later. This means that we will not support virtualisation on the X-Gene 1.
413Both the simulation environment (FVP/FP) and, seemingly, the ThunderX chips,
414support VHE.
415
416The CPU driver \textbf{shall} use \texttt{TTBR1} to provide a complete
417physical window. The ARMv8 CPU driver \textbf{shall not} dynamically map
418device memory into its own window (as the ARMv7 CPU driver does) --- the few
419memory-mapped devices required will be statically mapped on boot, with
420appropriate memory attributes. All physical addresses, RAM and device,
421\textbf{shall} be accessible at a static, standard offset (the base of the
422\texttt{TTBR1} region).
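
A minimal sketch of the resulting address arithmetic follows. It assumes that
the upper window is configured to its full 48-bit size, so that its base is
\texttt{0xffff000000000000}; the constant actually used by the CPU driver may
differ. As physical addresses lie below $2^{48}$, and the base has no bits
set in that range, the offset can be applied and removed with a single
logical instruction each:

\begin{lstlisting}
        // x0 = physical address, below 2^48
        orr     x1, x0, #0xffff000000000000   // CPU driver virtual address
        // and back again:
        and     x2, x1, #0x0000ffffffffffff   // recovers the physical address
\end{lstlisting}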
423
User-level page tables will \textbf{initially} be limited to a 4\,KiB
translation granule.  \textbf{Eventually} user-level page tables
\textbf{should} have
426access to all page-table formats and page sizes, as is the case in the current
427Barrelfish x86 implementation.
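
With a 4\,KiB granule and a 48-bit virtual address, a translation walks four
levels of table, each indexed by 9 bits of the address, leaving a 12-bit page
offset ($4 \times 9 + 12 = 48$). As an illustrative sketch (the register
assignment is arbitrary), the indices can be extracted as follows:

\begin{lstlisting}
        // x0 = virtual address
        ubfx    x1, x0, #39, #9     // level 0 index: VA[47:39]
        ubfx    x2, x0, #30, #9     // level 1 index: VA[38:30]
        ubfx    x3, x0, #21, #9     // level 2 index: VA[29:21]
        ubfx    x4, x0, #12, #9     // level 3 index: VA[20:12]
        and     x5, x0, #0xfff      // offset within the 4 KiB page
\end{lstlisting}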
428
429\subsection{Address Space, Context, and Thread Identifiers}
430
431ARMv8 also provides address-space identifiers (ASIDs) in the TLB to avoid
432flushing the translation cache on a context switch.
433
434ARMv8 ASIDs (referred to in ARM documentation as context IDs) are
435architecturally allowed to be either 8 or 16 bits, although the SBSA
436specifies that they must be at least 16. Relying on the SBSA platform will
437allow us to avoid multiplexing IDs among active processes, on any
438reasonably-sized system.  Managing the reuse of context IDs can be left to
439user-level code, and does not need to be on the critical path of a context
440switch. The CPU driver need only ensure that every allocated dispatcher has a
441unique ASID, which is loaded into the \texttt{ContextID} register on dispatch.
442
443The value in the \texttt{ContextID} register is also checked against the
444hardware breakpoint and watchpoint registers, in generating debug exceptions.
Therefore, it \textbf{shall} be possible for authorised user-level code to
load the Context ID for a given dispatcher into a breakpoint register; this
\textbf{may} be an invocation on the dispatcher capability.
448
449\begin{table}
450\begin{center}
451\begin{tabular}{ll}
452\texttt{tpidrro\_el0} & EL0 Read-Only Software Thread ID Register \\
453\texttt{tpidr\_el0} & EL0 Read/Write Software Thread ID Register \\
454\texttt{tpidr\_el1} & EL1 Read/Write Software Thread ID Register \\
455\texttt{tpidr\_el2} & EL2 Read/Write Software Thread ID Register \\
456\texttt{tpidr\_el3} & EL3 Read/Write Software Thread ID Register \\
457\end{tabular}
458\end{center}
459\caption{Thread ID registers in ARMv8}
460\label{t:threadid}
461\end{table}
462
463In addition to the \texttt{ContextID} register, used to tag TLB entries, ARMv8
464also provides a set of thread ID registers with no architecturally-defined
465semantics, as listed in \autoref{t:threadid}. The client-writeable
466\texttt{tpidr\_el0} and \texttt{tpidr\_el1} \textbf{shall} have no CPU
467driver-defined purpose, but \textbf{shall} be saved and restored in a
468dispatcher's trap frame, to allow their use as thread-local storage (TLS).
469Recall that the Barrelfish CPU driver has no awareness of threads, which are
470implemented purely at user level.
471
472To implement the upcall/dispatch mechanism of Barrelfish, the CPU driver and
473the user-level dispatcher need to share a certain amount of state --- the
474user-visible portion of the dispatcher control block, which contains the trap
475frames, and the disabled flag (used to achieve atomic dispatch). The address
476of this structure needs to be known to both the CPU driver, and to user-level
477code, and moreover be efficiently-accessible, as the CPU driver needs to find
478the trap frame on the critical path of system calls and exceptions. This
479pointer also needs to be trustworthy, from the CPU driver's perspective, and
480thus cannot be directly modifiable by user-level code.
481
482The x86-32, x86-64, and ARMv7 CPU drivers all store the address of the running
483dispatcher's shared segment at a fixed known address, \texttt{dcb\_current},
484which is loaded by the trap handler. At user level, on x86 this address is
485held in a \emph{segment register} (\texttt{fs} on x86-64, and \texttt{gs} on
486x86-32), while on ARMv7 we sacrifice a general-purpose register (\texttt{r9})
487for this purpose. Using the \texttt{tpidrro\_el0} register to hold the address
488of the current dispatcher structure will allow us to avoid both a memory load
489on the fast path, and sacrificing a register in user-level code, thus
490\texttt{tpidrro\_el0} \textbf{shall} hold the address of the currently-running
491dispatcher.
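
A sketch of how the two sides might use this register is given below; the
register assignment is illustrative only. The register is writeable at
\texttt{EL1} but read-only at \texttt{EL0}, which is what makes the pointer
trustworthy from the CPU driver's perspective:

\begin{lstlisting}
        // CPU driver (EL1), on dispatch: x0 = address of the dispatcher
        msr     tpidrro_el0, x0
        // user level (EL0): fetch the pointer without a memory load
        mrs     x1, tpidrro_el0
        // an EL0 write (msr tpidrro_el0, x1) would instead trap as undefined
\end{lstlisting}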
492
493\subsection{Instruction Sets}
494
ARMv8 supports both AArch64, and legacy ARM/Thumb (renamed AArch32). The
execution state can only change on a switch of exception level, i.e.\ on a
trap or return, and is controlled from the higher exception level. Thus,
\texttt{EL2} can set the execution state for \texttt{EL1}, and \texttt{EL1}
for \texttt{EL0}. There is no way for a program to change its own execution
state.
500If \texttt{ELn} is in AArch64, then \texttt{EL(n-1)} can be in either AArch64
501or AArch32. If \texttt{ELn} is in AArch32, all lower ELs must also be AArch32.
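
For example, code at \texttt{EL1} selects the instruction set for
\texttt{EL0} through the saved program status register that is consumed by
the exception return. The following is a sketch only, with a hypothetical
entry point held in \texttt{x1}:

\begin{lstlisting}
        // x1 = hypothetical user entry point
        mov     x0, #0              // SPSR: M[4]=0 (AArch64), M[3:0]=0 (EL0t)
        msr     spsr_el1, x0        // DAIF also clear: interrupts unmasked
        msr     elr_el1, x1         // address at which EL0 resumes
        eret                        // drop to EL0 in AArch64
\end{lstlisting}

Setting \texttt{M[4]} instead (a value of \texttt{0x10}) would request an
AArch32 user-mode return, on implementations that support AArch32 at
\texttt{EL0}.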
502
503The CPU driver \textbf{shall} execute in AArch64.
504
505\textbf{Initially}, the CPU driver will enforce that all directly-scheduled
506threads also use AArch64, by controlling all downward EL transitions. An
507\texttt{EL1} client (such as Arrakis or a full virtual machine) may execute
508its own \texttt{EL0} clients in AArch32 (and there is no way to prevent this).
509However, all transitions into the CPU driver (\texttt{svc}, \texttt{hvc} or
510exception) must come from a direct client of the CPU driver, and thus from
511AArch64. The syscall ABI \textbf{shall} be AArch64.
512
513\textbf{Eventually}, Barrelfish \textbf{should} also support the execution of
514AArch32 dispatcher processes, by marking each dispatcher with a flag
515indicating the instruction set to be used (much as is already done with
516VM/non-VM mode in the Arrakis CPU driver).
517
518\subsection{User-Space Access to Architectural Functions}
519
Generally, anything that can be safely exported \textbf{should} be made
available outside of the CPU driver, preferably as a memory-mapped interface,
at 4\,KiB granularity. The SBSA mandates that devices be present at addresses
that can be individually mapped, so this should not be a problem.
524
525\subsection{Cache Management}
526
527ARMv8 has moved most cache and TLB management from the system control
528coprocessor (cp15), into the core ISA. Several cache operations
529(invalidate/clean by VA) are executable at \texttt{EL0}, and thus no kernel
530interface is required. The system must take into account that user-directed
531flushes may have occurred, or may occur concurrently with any memory
532operation.
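
As an example of such user-level maintenance, the sequence below makes a
newly written instruction visible to instruction fetch. It is a sketch for a
single cache line, and assumes that the CPU driver has enabled these
operations at \texttt{EL0} (the \texttt{UCI} bit of \texttt{SCTLR\_EL1});
otherwise they trap:

\begin{lstlisting}
        // x0 = address of a freshly written instruction
        dc      cvau, x0            // clean D-cache line to point of unification
        dsb     ish                 // wait for the clean to complete
        ic      ivau, x0            // invalidate the I-cache line
        dsb     ish                 // wait for the invalidate to complete
        isb                         // discard already-fetched instructions
\end{lstlisting}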
533
534\subsection{Performance Monitors}
535
536Performance monitors \textbf{should} be exposed, if it can be done safely.
537
538\subsection{Debugging}
539
540Self-hosted debug \textbf{should} be exposed, if it can be done safely. This
541is under active development.
542
543\subsection{Booting}
544
Platform support, i.e.~a standard set of peripherals and a defined boot
process, has improved dramatically on ARM, as it has been repositioned as a
server platform. UEFI and ACPI support are widespread, including on the
Mustang development board. We will assume support for UEFI booting, and make
use of ACPI data where available.
550
551The Barrelfish CPU driver and initial image \textbf{shall} be loaded and
552executed by a UEFI shim, which will pass through all UEFI-supplied
553information, such as ACPI tables, and be able to interpret a Barrelfish
554Multiboot image.  This shim, or second-stage bootloader, is called Hagfish,
555and is described in \autoref{s:hagfish}.
556
557\subsection{Interrupts}
558
559ARMv8 interrupt handling is not substantially different from the existing
560architectures and platforms supported by Barrelfish. While a redesign of the
561Barrelfish interrupt system is under way (to use capabilities to grant access
to receive interrupts), we do not expect ARMv8 to pose any particular
challenges.
564
565The ARMv8 systems we \textbf{initially} target all use minor variations on the
566ARM Generic Interrupt Controller (GIC) design, already supported in
567Barrelfish. We currently have support for version 2 of the GIC, with which
568later implementations are backward-compatible. We will \textbf{eventually}
569support GICv3, the current specification at time of writing.
570
571\subsection{Inter-Domain Communication}
572
User-level communication between cache-coherent cores in Barrelfish for ARMv8
is likely to be the same as with ARMv7 and x86, and we expect the existing
575User-level Message-Passing over Cache-Coherence (UMP-CC) interconnect driver
576to work unmodified.
577
578Between dispatchers on the same core, however, the different register set on
579the ARMv8 is likely to result in a very different Local Message Passing (LMP)
580interconnect driver---this is always an architecture-specific part of the CPU
581driver. In practice, its design will be closely tied to the context switch and
582upcall dispatch code.
583
584\section{Hagfish}\label{s:hagfish}
585
586The Barrelfish/ARMv8 UEFI loader prototype is called Hagfish\footnote{A
587hagfish is a basal chordate i.e. something like the ancestor of all fishes.}.
588Hagfish is a second-stage bootloader for Barrelfish on UEFI platforms,
589initially the ARMv8 server platform.  Hagfish is loaded as a UEFI application,
590and uses the large set of supplied services to do as much of the one-time
591(boot core) setup that the CPU driver needs as is reasonably possible. More
592specifically, Hagfish:
593
594\begin{itemize}
595\item Is loaded over BOOTP/PXE.
596\item Reuses the PXE environment to load a menu.lst-style configuration.
597\item Loads the kernel image and the initial applications, as directed, and
598builds a Multiboot image.
599\item Allocates and builds the CPU driver's page tables.
600\item Activates the initial page table, and allocates a stack.
601\end{itemize}
602
603\subsection{Why Another Bootloader?}
604
605The ARMv8 machines that we're porting to are different to both existing ARM
606boards, and to x86. They have a full pre-boot environment, unlike most
607embedded boards, but it's not a PC-style BIOS. The ARM Server Base Boot
608Requirements specify UEFI. Moreover, there is no mainline support from GNU
609GRUB for the ARMv8 architecture, so no matter what, we need some amount of
610fresh code.
611
612Given that we had to write at least a shim loader, and keeping in mind that
613UEFI is multi-platform (and becoming more and more common in the x86 world),
614we're taking the opportunity to simplify the initial boot process within the
615CPU driver by moving the once-only initialisation into the bootloader. In
616particular, while running under UEFI boot services, we have memory allocation
617available for free, e.g. for the initial page tables. By moving ELF loading
618and relocation code into the bootloader, we can eliminate the need to relocate
619running code, and can cut down (hopefully eliminate) special-case code for
620booting the initial core. Subsequent cores can rely on user-level Coreboot
621code to relocate them, and to construct their page tables.
622
623\subsection{Assumptions and Requirements}
624
625Hagfish is (initially at least) intended to support development work on
626AArch64 server-style hardware and, as such, makes the following assumptions:
627
628\begin{itemize}
629\item 64-bit architecture, using ELF binaries. Porting to 32-bit architectures
630wouldn't be hard, if it were ever necessary (probably not).
631\item PXE/BOOTP/TFTP available for booting. Hagfish expects to load its
632configuration, and any binaries needed, using the same PXE context with which
633it was booted. Changing this to boot from a local device (e.g. HDD) wouldn't
634be hard, as the UEFI \texttt{LoadFile} interface abstracts from the hardware.
635\end{itemize}
636
637\subsection{Boot Process}
638
639In detail, Hagfish currently boots as follows:
640
641\begin{enumerate}
642\item \texttt{Hagfish.efi} is loaded over PXE by UEFI, and is executed at a
643runtime-allocated address, with translation (MMU) and caching enabled.
644\item Hagfish queries EFI for the PXE protocol instance used to load it, and
645squirrels away the current network configuration.
646\item Hagfish loads the file \texttt{hagfish.A.B.C.D.cfg} from the TFTP server
647root (where \texttt{A.B.C.D} is the IP address on the interface that ran PXE).
648\item Hagfish parses its configuration, which is essentially a GRUB menu.lst,
649and loads the kernel image and any additional modules specified therein. All
650ELF images are loaded into page-aligned regions of type
651\texttt{EfiBarrelfishELFData}.
652\item Hagfish queries UEFI for the system memory map, then allocates and
initialises the initial page tables for the CPU driver (mapping all occupied
654physical addresses, within the \texttt{TTBR1} window, see \autoref{s:layout}).
655The frames holding these tables are marked with the EFI memory type\\
\texttt{EfiBarrelfishBootPageTable}, allocated from the OS-specific range
657(\texttt{0x80000000}--\texttt{0x8fffffff}). All memory allocated by Hagfish on
658behalf of the CPU driver is page-aligned, and tagged with an OS-specific type,
659to allow EFI and Hagfish regions to be safely reclaimed.
660\item Hagfish builds a Multiboot 2 information structure, containing as much
661information as it can get from EFI, including:
662    \begin{itemize}
663    \item ACPI 1.0 and 2.0 tables.
664    \item The EFI memory map (including Hagfish's custom-tagged regions).
665    \item Network configuration (the saved DHCP ack packet).
666    \item The kernel command line.
667    \item All loaded modules.
668    \item The kernel's ELF section headers.
669    \end{itemize}
670\item Hagfish allocates a page-aligned kernel stack (type
671\texttt{EfiBarrelfishCPUDriverStack}), of the size specified in the
672configuration.
673\item Hagfish terminates EFI boot services (calls \texttt{ExitBootServices}),
674activates the CPU driver page table, switches to the kernel stack, and jumps
675into the relocated CPU driver image.
676\end{enumerate}
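
The page alignment and OS-specific tagging described above can be sketched in
C as follows. This is an illustration only: the helper and macro names are
hypothetical, the 4kiB page size matches the alignment stated in this report,
and the range bounds are simply those quoted in the list above. In a UEFI
application, values like these would be passed as the \texttt{MemoryType} and
\texttt{Pages} arguments of the \texttt{AllocatePages} boot service.

\begin{lstlisting}
#include <stddef.h>
#include <stdint.h>

#define BF_PAGE_SIZE  4096u        /* 4kiB pages, as in the text */
#define OS_TYPE_FIRST 0x80000000u  /* range bounds quoted above  */
#define OS_TYPE_LAST  0x8fffffffu

/* Number of whole pages needed to hold `bytes` bytes. */
static size_t pages_for(size_t bytes)
{
    return (bytes + BF_PAGE_SIZE - 1) / BF_PAGE_SIZE;
}

/* Non-zero if `type` lies in the OS-specific type range. */
static int is_os_specific_type(uint32_t type)
{
    return type >= OS_TYPE_FIRST && type <= OS_TYPE_LAST;
}
\end{lstlisting}

For example, an 8192-byte stack needs \texttt{pages\_for(8192)}, i.e.\ two
pages.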
677
\subsection{Post-Boot State}
679
680When the CPU driver on the boot core begins executing, it can assume the
681following:
682
683\begin{itemize}
684\item The MMU is configured with all RAM and I/O regions mapped via
685\texttt{TTBR1}.
686\item The CPU driver's code and data are both fully relocated into one or more
687distinct 4kiB-aligned regions.
688\item The stack pointer is at the top of a distinct 4kiB-aligned region of at
689least the requested size.
690\item The first argument register holds the Multiboot 2 magic value.
691\item The second holds a pointer to a Multiboot 2 information structure, in
692its own distinct 4kiB-aligned region.
693\item The console device is configured.
694\item Only one core is enabled.
695\item The Multiboot structure contains at least:
696    \begin{itemize}
697    \item The final EFI memory map, with all areas allocated by Hagfish to
698    hold data passed to the CPU driver marked with OS-specific types, all of
    which refer to non-overlapping 4kiB-aligned regions:
700        \begin{description}
701        \item[\ttfamily EfiBarrelfishCPUDriver]
702        The currently-executing CPU driver's text and data segments.
703        \item[\ttfamily EfiBarrelfishCPUDriverStack]
704        The CPU driver's stack.
705        \item[\ttfamily EfiBarrelfishMultibootData]
706        The Multiboot structure.
707        \item[\ttfamily EfiBarrelfishELFData]
708        The unrelocated ELF image for a boot-time module (including that for
709        the CPU driver itself), as loaded over TFTP.
710        \item[\ttfamily EfiBarrelfishBootPageTable]
711        The currently-active page tables.
712        \end{description}
713    \item The CPU driver (kernel) command line.
714    \item A copy of the last DHCP Ack packet.
715    \item A copy of the section headers from the CPU driver's ELF image.
716    \item Module descriptions for the CPU driver and all other boot modules.
717    \item If UEFI provided an ACPI root table, the Multiboot structure
718    contains a pointer to it.
719    \end{itemize}
720\end{itemize}
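
A CPU driver entry point might begin by checking the guarantees about its two
arguments before trusting anything else. The sketch below is an assumption
about how such a check could look, not the actual Barrelfish code:
\texttt{boot\_args\_ok} is a hypothetical name, and the magic constant is the
bootloader value defined by the Multiboot 2 specification.

\begin{lstlisting}
#include <stdint.h>

#define MULTIBOOT2_BOOTLOADER_MAGIC 0x36d76289u
#define BF_PAGE_SIZE                4096u

/* Returns 1 if the boot arguments look sane, 0 otherwise.
 * magic: value of the first argument register.
 * info:  value of the second (address of the Multiboot 2
 *        information structure). */
static int boot_args_ok(uint64_t magic, uint64_t info)
{
    if (magic != MULTIBOOT2_BOOTLOADER_MAGIC)
        return 0;   /* not started by a Multiboot 2 loader */
    if (info == 0)
        return 0;   /* no information structure            */
    if (info % BF_PAGE_SIZE != 0)
        return 0;   /* structure must be 4kiB-aligned      */
    return 1;
}
\end{lstlisting}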
721
722\subsection{Configuration}
723
724Hagfish configures itself by loading a file whose path is generated from its
725assigned IP address. Thus if your development machine receives the address
726192.168.1.100, Hagfish will load the file\\
727\texttt{hagfish.192.168.1.100.cfg}
728from the same TFTP server used to load it. The format is intended to be as
729close as practical to that of an old-style GRUB menu.lst file. The example
730configuration in \autoref{f:hag_config} loads
731\texttt{/armv8/sbin/cpu\_apm88xxxx} as the CPU driver, with arguments
732\texttt{loglevel=3}, and an 8192B (2-page) stack.
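
The file name derivation is simple enough to sketch in a few lines of C. The
function name \texttt{hagfish\_cfg\_name} is hypothetical, and the sketch uses
the hosted C library's \texttt{snprintf} for brevity, which a real UEFI
application would have to replace with a firmware-provided equivalent.

\begin{lstlisting}
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Write "hagfish.A.B.C.D.cfg" into buf for the IPv4 address
 * ip[0..3]. Returns 0 on success, -1 if buf is too small. */
static int hagfish_cfg_name(const uint8_t ip[4],
                            char *buf, size_t len)
{
    int n = snprintf(buf, len, "hagfish.%u.%u.%u.%u.cfg",
                     (unsigned)ip[0], (unsigned)ip[1],
                     (unsigned)ip[2], (unsigned)ip[3]);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
\end{lstlisting}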
733
734\begin{figure}[htb]
735\begin{center}
736\begin{lstlisting}
737kernel /armv8/sbin/cpu_apm88xxxx loglevel=3
738stack 8192
739module /armv8/sbin/cpu_apm88xxxx
740module /armv8/sbin/init
741
742# Domains spawned by init
743module /armv8/sbin/mem_serv
744module /armv8/sbin/monitor
745
746# Special boot time domains spawned by monitor
747module /armv8/sbin/chips boot
748module /armv8/sbin/ramfsd boot
749module /armv8/sbin/skb boot
750module /armv8/sbin/kaluga boot
751module /armv8/sbin/spawnd boot bootarm=0
752module /armv8/sbin/startd boot
753
754# General user domains
755module /armv8/sbin/serial auto portbase=2
756module /armv8/sbin/fish nospawn
757module /armv8/sbin/angler serial0.terminal xterm
758
759module /armv8/sbin/memtest
760
761module /armv8/sbin/corectrl auto
762module /armv8/sbin/usb_manager auto
763module /armv8/sbin/usb_keyboard auto
764module /armv8/sbin/sdma auto
765\end{lstlisting}
766\end{center}
767\caption{Hagfish configuration file}
768\label{f:hag_config}
769\end{figure}
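
A loader for this format only needs to recognise three keywords and skip
comments and blank lines. The following C sketch classifies a single line; the
type and function names are hypothetical, and the real Hagfish parser may well
be structured differently.

\begin{lstlisting}
#include <stddef.h>
#include <string.h>

enum cfg_kind {
    CFG_BLANK,    /* empty line or '#' comment */
    CFG_KERNEL, CFG_STACK, CFG_MODULE,
    CFG_UNKNOWN
};

/* Classify one line. On return, *rest points at the text after
 * the keyword (path and arguments, or the stack size). */
static enum cfg_kind cfg_classify(const char *line,
                                  const char **rest)
{
    static const struct {
        const char *word;
        enum cfg_kind kind;
    } kw[] = {
        { "kernel", CFG_KERNEL },
        { "stack",  CFG_STACK  },
        { "module", CFG_MODULE },
    };

    line += strspn(line, " \t");   /* skip leading blanks */
    *rest = line;
    if (*line == '\0' || *line == '\n' ||
        *line == '\r' || *line == '#')
        return CFG_BLANK;

    for (size_t i = 0; i < sizeof kw / sizeof kw[0]; i++) {
        size_t n = strlen(kw[i].word);
        /* The keyword must be followed by a blank. */
        if (strncmp(line, kw[i].word, n) == 0 &&
            (line[n] == ' ' || line[n] == '\t')) {
            *rest = line + n + strspn(line + n, " \t");
            return kw[i].kind;
        }
    }
    return CFG_UNKNOWN;
}
\end{lstlisting}

For a \texttt{stack} line, the remaining text can then be converted with
\texttt{strtoul}; for \texttt{kernel} and \texttt{module} lines it holds the
path followed by the command-line arguments.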
770
771\chapter{Technical Observations}\label{c:tech}
772
773\section{User-Space Threading}\label{s:threads}
774
775\begin{figure}[htb]
776\begin{center}
777\begin{minipage}[t]{0.3\textwidth}
778\begin{lstlisting}
779clrex
780/* Restore CPSR */
781ldr r0, [r1], #4
782msr cpsr, r0
783/* Restore registers */
784ldmia r1, {r0-r15}
785\end{lstlisting}
786\end{minipage}
787\hspace{2cm}
788\begin{minipage}[t]{0.5\textwidth}
789\begin{lstlisting}
790/* Restore PSTATE, load resume
791 * address into x18 */
792ldp x18, x2, [x1, #(PC_REG * 8)]
793/* Set only NZCV. */
794and x2, x2, #0xf0000000
795msr nzcv, x2
796/* Restore the stack pointer and x30. */
797ldp x30, x2, [x1, #(30 * 8)]
798mov sp, x2
799/* Restore everything else. */
800ldp x28, x29, [x1, #(28 * 8)]
801ldp x26, x27, [x1, #(26 * 8)]
802ldp x24, x25, [x1, #(24 * 8)]
803ldp x22, x23, [x1, #(22 * 8)]
804ldp x20, x21, [x1, #(20 * 8)]
805/* n.b. don't reload x18 */
806ldr      x19, [x1, #(19 * 8)]
807ldp x16, x17, [x1, #(16 * 8)]
808ldp x14, x15, [x1, #(14 * 8)]
809ldp x12, x13, [x1, #(12 * 8)]
810ldp x10, x11, [x1, #(10 * 8)]
811ldp  x8,  x9, [x1, #( 8 * 8)]
812ldp  x6,  x7, [x1, #( 6 * 8)]
813ldp  x4,  x5, [x1, #( 4 * 8)]
814ldp  x2,  x3, [x1, #( 2 * 8)]
815/* n.b. this clobbers x0&x1 */
816ldp  x0,  x1, [x1, #( 0 * 8)]
817/* Return to the thread. */
818br x18
819\end{lstlisting}
820\end{minipage}
821\end{center}
822\caption{\texttt{disp\_resume\_context} on ARMv7 (left) and ARMv8 (right)}
823\label{f:disp_resume}
824\end{figure}
825
826The ARMv8 architecture is in some ways an improvement, and in other ways
827problematic, for the sort of user-level threading implemented in Barrelfish,
828via \emph{scheduler activations}. Under this scheme, the kernel (in Barrelfish
829terms, the \emph{CPU driver}), does not schedule threads directly, but instead
830exposes all scheduling-relevant events via \emph{upcalls} to predefined
831user-level handlers (in Barrelfish, the \emph{dispatcher}), which then
832implements thread scheduling (or something else entirely), as it sees fit.
833This differs from the behaviour of a system such as UNIX, which only ever
834restores a user-level execution context simultaneously with dropping from a
835privileged to an unprivileged execution level.
836
837Processor architectures are, understandably, designed with common software in
mind. Thus, the primitives available for restoring an execution context,
i.e.\ register state, are often tied closely to those for changing privilege
level. A
840common design (which ARMv8 also implements) is the \emph{exception return},
841where privileged code can atomically drop its privilege, and jump to a
842user-level execution address. In ARMv8, the \texttt{eret} instruction
843atomically updates the program state (PSTATE, most importantly the privilege
844level bits), and branches to the address held in the \emph{exception link
845register}, \texttt{elr}.
846
847In implementing user-level threading, we're not concerned with privilege
848levels, but the lack of some equivalent of \texttt{elr} is frustrating. Not
849only does \texttt{eret} provide an atomic update of the program counter and
850the program state, it does so without modifying any general-purpose register.
Replicating this behaviour at \texttt{EL0}, where \texttt{eret} is unavailable,
is problematic. ARMv8 differs from ARMv7, in that the program counter can no
853longer be the target of a load instruction, but can only be loaded via a
854general-purpose register.
855
856Specifically, the only PC-modifying instructions (other than \texttt{eret})
857are PC-relative branches (which are useless in this scenario) and
858branch-to-register (of which \texttt{br}, \texttt{blr} and \texttt{ret} are
859all special encodings). Since ARMv8 has also removed the \texttt{ldm} (load
860multiple) instruction, there is no way to load the program counter with an
861arbitrary value (the thread's restart address), without overwriting one of the
general-purpose registers. We cannot restore the thread's value for that
register \emph{before} we branch, as we'd overwrite the resume address, and we
obviously can't do so afterwards, as the thread likely has no idea that it's
been interrupted. The only alternatives are to trampoline through kernel mode
in order to use \texttt{eret} (which would eliminate the speed benefit of
user-level threading), or to reserve a general-purpose register for use by the
dispatcher. Neither option is appealing, but we went with the second,
869reserving \texttt{x18}, reasoning that with 31 general-purpose registers
870available, the loss of one isn't a huge penalty. Register \texttt{x18} is
871explicitly marked as the \emph{platform register} in the AArch64 ABI
872\citep{arm:aa64pcs}, for such a purpose.
873
874Future revisions of the ARM architecture could prevent this issue in a number
875of ways: allowing the use of \texttt{eret} at \texttt{EL0} or providing an
876equivalent functionality (specifically a non-general-purpose register such as
877\texttt{elr}, that doesn't need to be restored); or alternatively, adding
878indirect jumps (load to PC) back to the instruction set.
879
880\autoref{f:disp_resume} compares the user-level thread resume code for the
881Barrelfish dispatcher (function \texttt{disp\_resume}) for ARMv7 and ARMv8
882side-by-side. The effect of removing the load-multiple instructions, and
883direct-to-SP loads, on code density is clearly visible: everything on lines
8--29 for ARMv8 corresponds to the single \texttt{ldmia} instruction on line
6 for ARMv7. One instruction is now 18, on the thread-switch critical path!
Note also, on line 17, that the ARMv8 code does not restore the thread's
\texttt{x18}, but instead uses it to hold the branch address for use on line
29. The only improvement on ARMv8 is that the \texttt{clrex} (clear exclusive
889monitor) instruction is no longer required, as the monitor is cleared on
890returning from the kernel. Note also that the usual method to efficiently load
891multiple registers, using 16-word SIMD (NEON) loads, isn't available, as
892there's no guarantee that the SIMD extensions are enabled on this dispatcher,
893and we cannot handle a fault in this code.
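
The offsets used in \autoref{f:disp_resume} imply a register save area laid
out roughly as in the C sketch below. The structure and field names are
hypothetical, and the value of \texttt{PC\_REG} is inferred from the exception
stub in \autoref{f:sync_el0}, which stores the stack pointer and the resume
address as a pair at word offset 31.

\begin{lstlisting}
#include <stddef.h>
#include <stdint.h>

#define PC_REG 32  /* inferred: the word after the saved SP */

/* Hypothetical save area matching the listing offsets. */
struct save_area {
    uint64_t x[31]; /* x0..x30; the x18 slot is not reloaded */
    uint64_t sp;    /* word 31, loaded together with x30     */
    uint64_t pc;    /* word 32, moved into x18 for the br    */
    uint64_t spsr;  /* word 33, only NZCV is restored at EL0 */
};

_Static_assert(offsetof(struct save_area, sp) == 31 * 8,
               "sp must follow x30");
_Static_assert(offsetof(struct save_area, pc) == PC_REG * 8,
               "pc must sit at PC_REG");
_Static_assert(offsetof(struct save_area, spsr)
                   == (PC_REG + 1) * 8,
               "spsr must follow pc");
\end{lstlisting}

Keeping the resume address and the saved program state in adjacent words is
what allows the ARMv8 code to fetch both with a single \texttt{ldp}.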
894
895\section{Trap Handling}\label{s:traps}
896
897\begin{figure}
898\begin{lstlisting}
899el0_aarch64_sync:
900    msr daifset, #3 /* IRQ and FIQ masked, Debug and Abort enabled. */
901
902    stp x11, x12, [sp, #-(2 * 8)]!
903    stp x9,  x10, [sp, #-(2 * 8)]!
904
905    mrs x10, tpidr_el1
906    mrs x9, elr_el1
907
908    ldp x11, x12, [x10, #OFFSETOF_DISP_CRIT_PC_LOW]
909    cmp x11, x9
910    ccmp x12, x9, #0, ls
911    ldr w11, [x10, #OFFSETOF_DISP_DISABLED]
912    ccmp x11, xzr, #0, ls
913    /* NE <-> (low <= PC && PC < high) || disabled != 0 */
914
915    mrs x11, esr_el1  /* Exception Syndrome Register */
916    lsr x11, x11, #26 /* Exception Class field is bits [31:26] */
917
918    b.ne el0_sync_disabled
919
920    add x10, x10, #OFFSETOF_DISP_ENABLED_AREA
921
922save_syscall_context:
923    str x7,       [x10, #(7 * 8)]
924
925    stp x19, x20, [x10, #(19 * 8)]
926    stp x21, x22, [x10, #(21 * 8)]
927    stp x23, x24, [x10, #(23 * 8)]
928    stp x25, x26, [x10, #(25 * 8)]
929    stp x27, x28, [x10, #(27 * 8)]
930    stp x29, x30, [x10, #(29 * 8)] /* FP & LR */
931
932    mrs x20, sp_el0
933    stp x20, x9, [x10, #(31 * 8)]
934
935    mrs x19, spsr_el1
936    str x19, [x10, #(33 * 8)]
937
938    cmp x11, #0x15 /* SVC or HVC from AArch64 EL0 */
939    b.ne el0_abort_enabled
940
941    add sp, sp, #(4 * 8)
942
943    mov x7, x10
944
945    b sys_syscall
946\end{lstlisting}
947\caption{BF/ARMv8 synchronous exception handler}
948\label{f:sync_el0}
949\end{figure}
950
\autoref{f:sync_el0} shows the CPU driver exception stub, for a synchronous
exception from \texttt{EL0}. This exception class includes system calls,
breakpoints, and page faults on both code and data. The effect of the loss of
store-multiple instructions is again visible, for example on lines 27--32.
Although not as severe as in the case of the user-level thread restore in
\autoref{s:threads}, the extra instructions required do constrain us somewhat,
as each trap handler is limited to 128 bytes, or 32 instructions, before
branching to another code block.
959
960We were able to squeeze the necessary code into the space available, including
961the optimised test for a disabled dispatcher at lines 10--14, but only by
962splitting the page fault handler (\texttt{el0\_abort\_enabled}) into a
963separate subroutine, incurring an unnecessary branch. A more significant
964annoyance is that system calls (\texttt{svc} and \texttt{hvc}) are routed to
965the same exception vector as page faults (aborts).  The effect of this is that
966we are forced to spill registers to the stack (\texttt{x9}--\texttt{x12} on
967lines 4--5), even on the system call fast path, as we need at least one
968register to check the exception syndrome (\texttt{esr\_el1}) to distinguish
969aborts (where we must preserve all registers) from system calls (where we
970could immediately begin using the caller-saved registers). Note that the code
971on lines 27--32 only needs to stack the callee-saved registers, and leaves the
972system call arguments in \texttt{x0}--\texttt{x7}, to be read as required by
973\texttt{sys\_syscall} (in C).
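
The dispatch on exception class performed by the stub can be expressed in C as
below. The macro and function names are hypothetical; the field position is
that given in the comments of \autoref{f:sync_el0}, and the class values are
those defined for \texttt{ESR\_EL1} in the ARMv8 architecture.

\begin{lstlisting}
#include <stdint.h>

#define ESR_EC_SHIFT      26     /* EC field is bits [31:26] */
#define ESR_EC_MASK       0x3fu
#define EC_SVC_AARCH64    0x15u  /* SVC executed in AArch64  */
#define EC_INSN_ABORT_LOW 0x20u  /* instruction abort, EL0   */
#define EC_DATA_ABORT_LOW 0x24u  /* data abort, EL0          */

/* Extract the exception class from a syndrome value. */
static unsigned esr_exception_class(uint64_t esr)
{
    return (unsigned)((esr >> ESR_EC_SHIFT) & ESR_EC_MASK);
}

/* Non-zero if the syndrome describes a system call, i.e. the
 * case in which caller-saved registers need not be kept. */
static int esr_is_syscall(uint64_t esr)
{
    return esr_exception_class(esr) == EC_SVC_AARCH64;
}
\end{lstlisting}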
974
975This sort of mismatch between the exception-handling interface of the CPU
976architecture, and what is required for really high-performance systems code is
unfortunately extremely common. Unnecessary overheads, such as the additional
stacked registers here, hurt the performance of highly-componentised systems,
such as Barrelfish, which rely on frequently crossing protection domains.
980
981The relatively well-compressed boolean arithmetic on lines 10--14 demonstrates
982that, even with the loss of ARM's fully-conditional instructions, the
983conditional compares which remain are still relatively powerful.
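
For comparison, the condition computed by the compare chain in
\autoref{f:sync_el0} can be written in C as follows. The structure is a
hypothetical view of the three dispatcher fields that the stub reads, not the
actual Barrelfish dispatcher layout.

\begin{lstlisting}
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of the fields read by the stub. The two
 * bounds are adjacent, as the stub loads them as a pair. */
struct disp_view {
    uint64_t crit_pc_low;   /* start of critical section      */
    uint64_t crit_pc_high;  /* end (exclusive) of the section */
    uint32_t disabled;      /* non-zero if upcalls disabled   */
};

/* True if the dispatcher must be treated as disabled:
 * (low <= pc && pc < high) || disabled != 0 */
static bool disp_is_disabled(const struct disp_view *d,
                             uint64_t pc)
{
    bool in_critical = d->crit_pc_low <= pc
                    && pc < d->crit_pc_high;
    return in_critical || d->disabled != 0;
}
\end{lstlisting}

The assembly evaluates the same expression with one \texttt{cmp} and two
\texttt{ccmp} instructions, without any intermediate branches.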
984
985\section{Cache Coherence}
986
987One aspect of the ARM architecture that is of particular interest for the
988Barrelfish project, but which we have not yet explored in depth, is the
989configurable cache coherency and fine-grained cache management operations
990available. Any virtual mapping on a recent ARM architecture, including both
991ARMv7 and ARMv8, can be tagged with various cacheability properties: inner
992(L1), outer (L2+, usually), write-back or write-through. Combined with the
993explicit flush operations at cache-line granularity, able to target either PoU
994(point of unification, where data and instruction caches merge) or PoC (point
995of coherency, typically RAM), a multi-core, multi-socket ARMv8 system would
996make a very interesting testbed for investigating efficient cache management
997and communication primitives for future partially-coherent architectures.
998Indeed, the latest revision of the ARMv8 specification, ARMv8.2, introduced
999flush to PoP, or \emph{point of persistence} --- perhaps in response to
1000interest from well-known systems integration firms investigating large
1001persistent memories.
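
As an illustration of line-granularity maintenance, the following C sketch
walks the cache lines covering a buffer and applies an operation to each. It
makes several assumptions: a 64-byte line (real code would read the line size
from \texttt{CTR\_EL0}), hypothetical function names, and GCC-style inline
assembly for the AArch64 case. Passing a null operation performs a dry run
that only counts lines.

\begin{lstlisting}
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumed; see CTR_EL0 on real hardware */

typedef void (*line_op_t)(uintptr_t va);

/* Apply `op` to every cache line overlapping [addr, addr+len).
 * Returns the number of lines visited. Address wrap-around is
 * not handled in this sketch. */
static size_t for_each_line(uintptr_t addr, size_t len,
                            line_op_t op)
{
    if (len == 0)
        return 0;
    uintptr_t line = addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end  = addr + len;
    size_t n = 0;
    for (; line < end; line += CACHE_LINE, n++) {
        if (op)
            op(line);
    }
    return n;
}

#if defined(__aarch64__)
/* Clean one line by virtual address to the point of coherency. */
static void clean_line_to_poc(uintptr_t va)
{
    __asm__ volatile("dc cvac, %0" : : "r"(va) : "memory");
}
#endif
\end{lstlisting}

Substituting \texttt{dc cvau} for \texttt{dc cvac} would target the point of
unification instead. In either case, a \texttt{dsb} barrier is needed after
the loop before the maintenance can be assumed complete.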
1002
1003The design presented in this report is intended to expose as much control over
1004the caching hierarchy as possible to user-level code, to provide a platform
1005for future research.
1006
1007\bibliographystyle{plainnat}
1008\bibliography{defs,barrelfish}
1009
1010\end{document}
1011