%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Copyright (c) 2013, ETH Zurich.
% All rights reserved.
%
% This file is distributed under the terms in the attached LICENSE file.
% If you do not find this file, copies can be found by writing to:
% ETH Zurich D-INFK, Universitaetstrasse 6, CH-8092 Zurich. Attn: Systems Group.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\documentclass[a4paper,twoside]{report}

\usepackage{bftn}
\usepackage{booktabs}
\usepackage{hyperref}
\usepackage{hyphenat}
\usepackage{listings}
\usepackage{makeidx}
\usepackage{natbib}
\usepackage{xspace}

\def\chapterautorefname{Chapter}
\def\sectionautorefname{Section}
\def\subsectionautorefname{Section}
\def\subsubsectionautorefname{Section}
\def\tableautorefname{Table}
\def\qemu{QEMU\xspace}

\lstdefinelanguage{armasm}{
  numbers=left,
  numberstyle=\tiny,
  numbersep=5pt,
  basicstyle=\ttfamily\small
}
\lstset{language=armasm}

\title{Barrelfish on ARMv8}
\author{David Cock}

\tnnumber{022}  % give the number of the tech report
\tnkey{ARMv8} % Short title, will appear in footer

\begin{document}
\maketitle

\begin{versionhistory}
\vhEntry{1.0}{11.04.2016}{DC}{Initial version}
\end{versionhistory}

\chapter{Summary}

Barrelfish now supports ARMv7 and ARMv8 as primary platforms, and we have
discontinued support for all older architecture revisions (ARMv5, ARMv6). The
current Barrelfish release contains a port to a simulated ARMv8 environment,
derived from the existing ARMv7 codebase and running under GEM5, with generous
contributions from HP Research.

Simultaneously, we are undertaking a clean-slate redesign of the CPU driver
for ARMv8, as it presents a number of novel features, and greatly improved
platform standardisation (\autoref{s:sbsa}), that should allow for a much
cleaner and simpler implementation.
This redesigned CPU driver will form the
basis for ongoing research into large-scale non-cache-coherent systems using
ARMv8 cores. This document presents the new CPU driver design
(\autoref{c:design}), briefly covering those features of ARMv8 of greatest
relevance (\autoref{c:background}), and discusses a number of technical
challenges presented by the new architecture (\autoref{c:tech}).

\chapter{Background}\label{c:background}

The Barrelfish research operating system is a vehicle for research into
software support for likely future architectures, where large numbers of
non-coherent (or weakly-coherent) heterogeneous processor cores are assembled
into a single large-scale system. As such, support for a common non-x86
architecture has always been part of the project, beginning with the ARMv5
(XScale) port, which permitted the embedded processor on a network interface
card to be integrated as a first-class part of the system, with its own CPU
driver. We have also actively maintained an ARMv7 port, to the OMAP4460
processor on the Pandaboard ES, which we use as a teaching platform in the
Advanced Operating Systems course at ETH Z\"urich. These ports are described
more fully in the accompanying technical report, \citet{btn017-arm}.

\section{The ARMv8 Architecture}

The ARMv8 architecture is quite a radical departure from previous versions,
and represents the culmination of a trend that has been developing for quite
some time. While the first wave of ARM-based microservers, based on the 32-bit
ARMv7 architecture, was largely a commercial failure, it's clear that ARM is
now actively targeting the server market, where Intel currently has near-total
dominance.

ARMv8 discards some long-standing features of the ARM instruction set:
universal conditional execution, multiple loads/stores, and the program
counter as a general-purpose register.
These most likely caused difficulty in
scaling the processor pipeline to high clock rates, and we present some
consequences of their loss in \autoref{s:threads} and \autoref{s:traps}. The
instruction-set changes, however challenging to the systems programmer, are
ultimately of little consequence compared to the consolidation of the ARM
ecosystem into a serious server platform. The two features of most interest at
this stage in the design process are the standardisation of hardware features
and memory maps, and of the boot process.

\subsection{ARM Server Base System Architecture}\label{s:sbsa}

ARM has long been criticised by systems programmers for its highly fragmented
and non-uniform programming interface. Linux, in particular, has struggled for
years with supporting the great multiplicity of ARM platforms. The principal
reason for this is the lack of any concept of a \emph{platform}: a set of
assumptions (available hardware, memory map, etc.) that programmers can rely
on when initialising and managing a system. The Linux source tree famously
contained a vastly greater amount of code in the ARM platform support
subtrees than in that for x86.

The relative standardisation of the x86 platform is largely a historical
accident, due to the rapid proliferation of PC/AT clones in the early 1980s.
The x86 platform thus contains layers of ossified legacy interfaces, necessary
to ensure broad cross- and backward-compatibility. ARM's business model, on
the contrary, has long emphasised the specialisation of implementations: an
ARM licensee would take their ARM-designed CPU core, and integrate it
themselves into a complex SoC (system on a chip), with their own specialised,
proprietary interfaces. The upsides of this were the possibility to highly
optimise a particular design, and no requirement on ARM itself to maintain a
coherent platform.

While ARM's customisable platform worked well for embedded devices, and scaled
reasonably well to relatively powerful smartphones, it's a disaster for
producing high-quality systems software, able to execute on a broad range of
hardware from competing vendors: exactly what a competitive server platform
requires. ARM clearly know this, and since 2014 have published the Server Base
System Architecture \citep{arm:sbsa}. To the extent that manufacturers adhere
to these guidelines, our job as systems programmers is significantly simpler:
it should be possible to write a single set of initialisation and
configuration code for ARMv8, that will run on any SBSA-compliant system, much
as we already do for x86-64.

\begin{table}
\begin{center}
\begin{tabular}{lllp{6cm}}
\toprule
Supplier & Processor & Name & Description \\
\midrule
APM & APM883208 & Mustang & 1P 8-core X-Gene 1 with serial trace. \\
\addlinespace[2pt]
    & APM883408-X2 & X-C2 & 1P 8-core X-Gene 2. \\
\addlinespace[2pt]
Cavium & CN8890 & StratusX & 1P 48-core ThunderX. \\
\addlinespace[2pt]
    & & Cirrus & 2P 48-core ThunderX. \\
\addlinespace[2pt]
ARM & AEM & Fixed Virtual Platform & The \emph{architectural envelope model}
covers the range of behaviour permitted by ARMv8. Bare-metal debug. \\
\addlinespace[2pt]
    & & Foundation Platform & Freely available, compatible with FVP. \\
\bottomrule
\end{tabular}
\end{center}
\caption{ARMv8 platforms of interest}\label{t:platforms}
\end{table}

Our target platforms listed in \autoref{t:platforms} all support SBSA to some
extent, and absent any compelling reason, we will only support SBSA-compliant
platforms.

\subsection{UEFI}\label{s:uefi}

One aspect of the SBSA which eases portability is the specification, for the
first time, of a boot process for ARM systems.
ARM has specified that SBSA
systems must support UEFI \citep{uefi} (the unified extensible firmware
interface). UEFI is a descendant of the EFI specification, developed by Intel
for the Itanium project. While Itanium is no longer a platform of any great
commercial interest, UEFI support is now widespread in the x86-64 market.
UEFI, in turn, specifies the use of ACPI \citep{acpi} (the advanced
configuration and power interface) for platform discovery and control.

Supporting ACPI and UEFI requires a one-off investment of effort to design a
new boot and configuration subsystem, but should pay off in the long term, as
ports to new ARM boards will no longer require extensive manual configuration.
The code should also be largely reusable for x86 UEFI systems. Our new UEFI
bootloader is described in \autoref{s:hagfish}.

\section{A Direct Port from ARMv7}

As already described, the Barrelfish release current at time of writing
includes an initial ARMv8 port to the GEM5 simulator. This port contains code
generously contributed by HP Research.

Being developed from the existing codebase, this ARMv8 port follows the
structure of the existing ARMv7 code closely. While it is highly useful to
have a running port, we are nevertheless continuing with a significant
redesign of the CPU driver, as significant improvements and simplifications
will be possible, once we no longer need to follow the existing structure,
originally developed for a significantly different platform.

The GEM5 simulator's model of an ARMv8 platform is relatively primitive, and
does not conform to modern platform conventions, for example placing RAM at
address \texttt{0}, rather than \texttt{0x80000000} as mandated by the SBSA.
For this reason, in addition to better integration with ARM debugging tools,
we have switched to the ARM Fixed Virtual Platform as our default simulation
environment, with the Foundation Platform supported as a freely available
simulator.

\section{Registers}

\subsection{General-purpose Registers}

In total there are 31+1 general-purpose registers: \texttt{r0-r30}, each
64 bits wide, plus the stack pointer (\autoref{tab:registers}). They are
usually referred to by the names \texttt{x0-x30}. The lower 32 bits of each
register are referred to as \texttt{w0-w30}. The additional stack pointer
register, \texttt{SP}, can be accessed by only a restricted set of
instructions.

\begin{table}[!h]
  \begin{center}

  \begin{tabular}{lll}
    \textbf{Register} & \textbf{Special} & \textbf{Description} \\
    \hline
    \texttt{X0-X7} & Caller-save & function call arguments and return value
    \\
    \texttt{X8} & & indirect result e.g. location of large return value
    (struct) \\
    \texttt{X9-X15} & Caller-save& temporary registers \\
    \texttt{X16} & IP0 & The first intra-procedure-call scratch
    register\footnote{can be used by call veneers and PLT code; at other
    times may be used as a temporary register. Same for X17}\\
    \texttt{X17} & IP1 & The second intra-procedure-call temporary
    register \\
    \texttt{X18} & & The Platform Register (TLS), if needed; otherwise a
    temporary register. \\
    \texttt{X19-X28} & Callee-save & need to be preserved and restored when
    modified\\
    \texttt{X29} & FP & frame pointer \\
    \texttt{X30} & LR & link register \\
    \texttt{SP} & & stack pointer (XZR) \\

  \end{tabular}
  \caption{ARMv8 General-purpose Registers}
  \label{tab:registers}
  \end{center}
\end{table}

\paragraph{Procedure call}
\begin{itemize}
  \item The registers \texttt{x19-x28} and \texttt{SP} are callee-saved and
  hence must be preserved by the called subroutine.
All 64 bits have to
  be preserved even when executing in the 32-bit mode.
  \item The registers \texttt{x0-x7} and \texttt{x9-x15} are caller-saved.
  \item During procedure calls the registers \texttt{x16}, \texttt{x17},
  \texttt{x29} and \texttt{x30} have special roles, i.e.\ they store
  relevant addresses such as the return address.
  \item Arguments are passed in the registers \texttt{x0-x7}, in
  \texttt{v0-v7} for floating-point/SIMD values, and otherwise on the stack.
\end{itemize}

\paragraph{Indirect result} Register \texttt{x8} is used when returning a
large value, such as that declared by this function:
\texttt{struct mystruct foo(int arg);}.


\paragraph{Platform specific} The use of register \texttt{x18} is platform
specific and needs to be defined by the platform ABI. This register can hold
inter-procedural state such as the thread context.

\paragraph{Linker} The registers \texttt{IP0} and \texttt{IP1} can be used by the
linker as scratch registers or to hold intermediate values between subroutine
calls.

\subsection{SIMD and Floating point}
There are 32 registers to be used by floating point and SIMD operations. The
name used for each register depends on the size of the operation.

\begin{table}[!h]
  \begin{center}

  \begin{tabular}{ll}
    \textbf{Register} & \textbf{Description} \\
    \hline
    \texttt{v0-v7} & function call arguments, intermediate values and
    return value, caller save registers \\
    \texttt{v8-v15} & Callee-save registers.
They need to be
    preserved\\
    \texttt{v16-v31} & Caller-save registers
  \end{tabular}
  \caption{ARMv8 SIMD and Floating-point Registers}
  \label{tab:simdregisters}
  \end{center}
\end{table}

\chapter{Design and Implementation}\label{c:design}




\section{Redesigning the CPU Driver}

Given that ARMv8 is a significantly different platform to ARMv7, and that the
ARMv7 codebase carries a significant legacy, reaching right back to ARMv5, we
are pursuing a substantial redesign of the CPU driver. Taking advantage of the
standardisation of the hardware platform mandated by the SBSA
(\autoref{s:sbsa}), and the facilities provided by UEFI (\autoref{s:uefi}), in
addition to a relatively unrestricted virtual address space, we are able to
significantly reduce the complexity of the CPU driver. In this section we
describe the updated design, and our progress on its implementation, while the
UEFI interface (Hagfish) is described separately, in \autoref{s:hagfish}.

\paragraph{Terminology}
In the interest of clarity, in the discussion that follows, we use a few terms
with precise intent:
\begin{description}
\item[shall]
    indicates features or characteristics of the design to which the
    Barrelfish implementation must conform.
\item[should]
    indicates features which should be supported if at all possible.
\item[initially]
    indicates features which will be provided from the outset in the
    Barrelfish implementation.
\item[eventually]
    indicates features which will be provided later in the Barrelfish
    implementation, and which the initial design will aim to facilitate.
\end{description}

\subsection{Goals}

Our goal is to provide a reference design for the CPU driver and user-space
execution environment for Barrelfish on an ARMv8 core, in order to understand
both positive and negative implications of the architecture for a multikernel
system.
The design \textbf{should} be applicable to any ARMv8 core with
virtualisation (\texttt{EL2}) support.

\textbf{Initially}, our hardware development platform is the APM X-Gene 1,
using the Mustang Development Board. We are using the Mustang principally as
it was relatively easily available, as well as being a comparatively complex
and powerful CPU. The ThunderX platform from Cavium is very interesting for
Barrelfish, as it ties together a large number (48) of less-powerful (2-issue)
cores. We do not have the resources to develop for two platforms
simultaneously, but we hope to \textbf{eventually} add support for the
ThunderX.

Our target simulation environment is the ARM Fixed Virtual Platform, and the
Foundation Platform. These models are supplied by ARM. The Foundation Platform
is freely available, and will be the default supported simulation platform for
the public Barrelfish tree, while we will use the FVP internally to allow
bare-metal debugging. Future support for \qemu is desirable, to the extent that
it models a compatible system --- GEM5, which the ARMv7 port targets,
currently does not.

\textbf{Initially}, the design will support running both the CPU driver and
user-space processes in AArch64 mode without support for virtualisation.
\textbf{Eventually} the design will support running the CPU driver in AArch64
mode, and user-space processes in both AArch64 and AArch32 modes without
virtualisation, and virtual machines in AArch64 mode. We will only support
virtualisation on ARMv8.1 or later platforms that support the VHE extensions,
as described in \autoref{s:layout}.

\subsection{Processor Modes and Virtualisation}

Where possible, we will keep the virtualisation model similar to that on
Barrelfish/x86. In particular, it \textbf{should} be possible to implement
native applications, fully virtualised (e.g. Linux) VMs, and VM-level
applications e.g.
Arrakis \citep{peter:osdi14}.

ARMv8 has a somewhat different virtualisation model to x86, and different
again from the ARMv7 virtualisation extensions. Rather than having exception
levels (rings) duplicated between guest and host, ARMv8 provides 4 exception
levels (ELs):

\begin{itemize}
\item \texttt{EL0} is unprivileged --- user applications.
\item \texttt{EL1} is privileged --- OS kernel.
\item \texttt{EL2} is hypervisor state.
\item \texttt{EL3} is for switching between secure and non-secure (TrustZone)
    modes. The X-Gene 1 does not implement \texttt{EL3}, and it
    is currently not of interest for Barrelfish.
\end{itemize}

Explicit traps (syscalls/hypercalls) target only the next level up:
\texttt{EL0} can call \texttt{EL1} using \texttt{svc} (syscall), and
\texttt{EL1} can call \texttt{EL2} using \texttt{hvc} (hypercall), but
\texttt{EL0} cannot directly call \texttt{EL2}, unless \texttt{EL1} is
completely disabled. Exceptions return to the caller's exception level.

ELs \textbf{shall} be distributed as follows: The CPU driver \textbf{shall}
exist at both \texttt{EL1} and \texttt{EL2}, and take both syscalls
(\texttt{svc}, from \texttt{EL0} applications) and hypercalls (\texttt{hvc},
from \texttt{EL1} applications). The system \textbf{shall} support
applications both at \texttt{EL0}, and at \texttt{EL1} (e.g. Arrakis, VMs).
Most code paths \textbf{should} be identical, as most CPU driver operations do
not depend on \texttt{EL2} privileges. Hypercalls from \texttt{EL0}
\textbf{shall} be chained via \texttt{EL1} (with appropriate permission
checks).

\texttt{EL1} apps such as Arrakis, and paravirtualised VMs using hypercalls
know that they are being virtualised, and will use \texttt{hvc} explicitly.
Fully-virtualised \texttt{EL1} VMs do not make hypercalls.
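
The trap rules described above can be summarised in a few lines of C. This is
only an illustrative sketch of the model as stated in this section, not code
from the Barrelfish tree: the names \texttt{exception\_level},
\texttt{trap\_insn} and \texttt{trap\_target} are hypothetical, and the case
in which \texttt{EL1} is completely disabled is deliberately left out.

\begin{lstlisting}[language=C]
#include <stdbool.h>

/* Exception levels, numbered as in the text. */
enum exception_level { EL0 = 0, EL1 = 1, EL2 = 2, EL3 = 3 };

/* The two explicit trap instructions discussed in the text. */
enum trap_insn { TRAP_SVC, TRAP_HVC };

/*
 * Hypothetical helper: reports whether code at 'from' can issue 'insn'
 * directly under the model described above and, if so, stores the
 * exception level that takes the trap in '*target'.
 */
static bool trap_target(enum exception_level from, enum trap_insn insn,
                        enum exception_level *target)
{
    if (insn == TRAP_SVC && from == EL0) {
        *target = EL1;          /* syscall: EL0 -> EL1 */
        return true;
    }
    if (insn == TRAP_HVC && from == EL1) {
        *target = EL2;          /* hypercall: EL1 -> EL2 */
        return true;
    }
    /* Anything else, e.g. a hypercall from EL0, must be chained. */
    return false;
}
\end{lstlisting}

A caller that gets \texttt{false} for a hypercall from \texttt{EL0} would, in
this model, issue an \texttt{svc} instead and rely on the \texttt{EL1} handler
to check permissions and forward the request.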

ARMv8 implements two-level address translation: VA (virtual address) to IPA
(intermediate physical address), and IPA to PA (physical address).
\texttt{EL1} guests \textbf{shall} be isolated at the L1 translation layer,
and by trapping all accesses to system control registers.

\subsection{Virtual Address Space Layout}\label{s:layout}

ARMv8 has an effective 48-bit virtual address space. At the lowest exception
levels (0 --- BF user \& 1 --- BF CPU driver), the hardware supports two
`windows' of up to 48 bits (256TB) each in a 64-bit space: one at the bottom,
and one at the top. Each region has its own translation table base register
(\texttt{TTBR0} \& \texttt{TTBR1}). \texttt{TTBR0} is used at \texttt{EL0},
and \texttt{TTBR1} at \texttt{EL1}.

In the initial ARMv8 specification, this split address space was not
implemented at \texttt{EL2}, which would require a separate CPU driver
instance for virtualisation, and hypercalls (e.g. for Arrakis). ARMv8.1
introduced the virtualisation host extensions (VHE) which, among other things,
extend the split address space to \texttt{EL2}. As this provides a cleaner
implementation model, and to avoid having to support a now-deprecated
interface, virtualisation will \textbf{only} be supported on ARMv8.1 and
later. This means that we will not support virtualisation on the X-Gene 1.
Both the simulation environment (FVP/FP) and, seemingly, the ThunderX chips,
support VHE.

The CPU driver \textbf{shall} use \texttt{TTBR1} to provide a complete
physical window. The ARMv8 CPU driver \textbf{shall not} dynamically map
device memory into its own window (as the ARMv7 CPU driver does) --- the few
memory-mapped devices required will be statically mapped on boot, with
appropriate memory attributes. All physical addresses, RAM and device,
\textbf{shall} be accessible at a static, standard offset (the base of the
\texttt{TTBR1} region).
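
With a static physical window, translating between physical addresses and CPU
driver virtual addresses reduces to adding or subtracting a constant. The
sketch below illustrates this under two assumptions that are not fixed by the
text: the \texttt{TTBR1} region is configured to its full 48 bits, so that it
begins at \texttt{0xffff000000000000}, and physical addresses fit in 48 bits.
The macro and function names are hypothetical.

\begin{lstlisting}[language=C]
#include <stdbool.h>
#include <stdint.h>

/* Assumption: a full 48-bit TTBR1 region, which starts at this address. */
#define KERNEL_WINDOW_BASE UINT64_C(0xffff000000000000)
/* Assumption: physical addresses fit in 48 bits. */
#define PHYS_ADDR_LIMIT    (UINT64_C(1) << 48)

/* True if 'pa' can be reached through the physical window. */
static inline bool phys_addr_in_window(uint64_t pa)
{
    return pa < PHYS_ADDR_LIMIT;
}

/* Physical address -> CPU driver virtual address (static offset). */
static inline uint64_t phys_to_kernel_virt(uint64_t pa)
{
    return KERNEL_WINDOW_BASE + pa;
}

/* CPU driver virtual address -> physical address. */
static inline uint64_t kernel_virt_to_phys(uint64_t va)
{
    return va - KERNEL_WINDOW_BASE;
}
\end{lstlisting}

Under these assumptions, RAM at physical address \texttt{0x80000000} would be
visible to the CPU driver at \texttt{0xffff000080000000}, and no per-device
mapping calls are needed after boot.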

User-level page tables will \textbf{initially} be limited to a 4kiB translation
granularity. \textbf{Eventually} user-level page tables \textbf{should} have
access to all page-table formats and page sizes, as is the case in the current
Barrelfish x86 implementation.

\subsection{Address Space, Context, and Thread Identifiers}

ARMv8 also provides address-space identifiers (ASIDs) in the TLB to avoid
flushing the translation cache on a context switch.

ARMv8 ASIDs (referred to in ARM documentation as context IDs) are
architecturally allowed to be either 8 or 16 bits, although the SBSA
specifies that they must be at least 16. Relying on the SBSA platform will
allow us to avoid multiplexing IDs among active processes, on any
reasonably-sized system. Managing the reuse of context IDs can be left to
user-level code, and does not need to be on the critical path of a context
switch. The CPU driver need only ensure that every allocated dispatcher has a
unique ASID, which is loaded into the \texttt{ContextID} register on dispatch.

The value in the \texttt{ContextID} register is also checked against the
hardware breakpoint and watchpoint registers, in generating debug exceptions.
Therefore, it \textbf{shall} be possible for authorised user-level code to
load the Context ID for a given dispatcher into a breakpoint register --- this
\textbf{may} be an invocation on the dispatcher capability.
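
The uniqueness requirement on ASIDs can be met with very little state. The
following sketch is one possible approach, assuming 16-bit ASIDs as required
by the SBSA and assuming that identifiers are simply handed out in increasing
order and never recycled by the CPU driver; the type \texttt{asid\_pool} and
the function \texttt{asid\_alloc} are hypothetical names, not existing
Barrelfish interfaces.

\begin{lstlisting}[language=C]
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 16-bit ASIDs, the minimum width required by the SBSA. */
#define ASID_BITS  16
#define ASID_COUNT (UINT32_C(1) << ASID_BITS)

/* Hypothetical pool: 'next' is the lowest ASID not yet handed out. */
struct asid_pool {
    uint32_t next;
};

/*
 * Hands out a fresh ASID, or fails once all 2^16 values are in use.
 * Reuse of released ASIDs is left to higher-level policy, as in the text.
 */
static bool asid_alloc(struct asid_pool *pool, uint16_t *asid)
{
    if (pool->next >= ASID_COUNT) {
        return false;
    }
    *asid = (uint16_t)pool->next++;
    return true;
}
\end{lstlisting}

Because allocation happens when a dispatcher is created rather than when it is
dispatched, nothing in this sketch sits on the context-switch path.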

\begin{table}
\begin{center}
\begin{tabular}{ll}
\texttt{tpidrro\_el0} & EL0 Read-Only Software Thread ID Register \\
\texttt{tpidr\_el0} & EL0 Read/Write Software Thread ID Register \\
\texttt{tpidr\_el1} & EL1 Read/Write Software Thread ID Register \\
\texttt{tpidr\_el2} & EL2 Read/Write Software Thread ID Register \\
\texttt{tpidr\_el3} & EL3 Read/Write Software Thread ID Register \\
\end{tabular}
\end{center}
\caption{Thread ID registers in ARMv8}
\label{t:threadid}
\end{table}

In addition to the \texttt{ContextID} register, used to tag TLB entries, ARMv8
also provides a set of thread ID registers with no architecturally-defined
semantics, as listed in \autoref{t:threadid}. The client-writeable
\texttt{tpidr\_el0} and \texttt{tpidr\_el1} \textbf{shall} have no CPU
driver-defined purpose, but \textbf{shall} be saved and restored in a
dispatcher's trap frame, to allow their use as thread-local storage (TLS).
Recall that the Barrelfish CPU driver has no awareness of threads, which are
implemented purely at user level.

To implement the upcall/dispatch mechanism of Barrelfish, the CPU driver and
the user-level dispatcher need to share a certain amount of state --- the
user-visible portion of the dispatcher control block, which contains the trap
frames, and the disabled flag (used to achieve atomic dispatch). The address
of this structure needs to be known to both the CPU driver, and to user-level
code, and moreover be efficiently accessible, as the CPU driver needs to find
the trap frame on the critical path of system calls and exceptions. This
pointer also needs to be trustworthy, from the CPU driver's perspective, and
thus cannot be directly modifiable by user-level code.
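
As an illustration of the shared state just described, the sketch below shows
one possible shape for the user-visible part of a dispatcher, with the thread
ID registers included in the trap frame. The layout, the field names, and the
use of two save areas selected by the disabled flag are assumptions made for
this example; they are not taken from the Barrelfish sources.

\begin{lstlisting}[language=C]
#include <stdint.h>

/* Hypothetical trap frame: register state saved on entry to the CPU driver. */
struct trap_frame {
    uint64_t x[31];      /* general-purpose registers x0-x30 */
    uint64_t sp;         /* stack pointer */
    uint64_t pc;         /* address at which to resume */
    uint64_t spsr;       /* saved processor state */
    uint64_t tpidr_el0;  /* saved and restored so that clients */
    uint64_t tpidr_el1;  /*   may use them for thread-local storage */
};

/* Hypothetical user-visible portion of the dispatcher control block. */
struct dispatcher_shared {
    uint32_t disabled;                     /* non-zero while disabled */
    struct trap_frame enabled_save_area;   /* used if trapped while enabled */
    struct trap_frame disabled_save_area;  /* used if trapped while disabled */
};

/* Selects the frame into which the trap handler should save state. */
static struct trap_frame *trap_save_area(struct dispatcher_shared *disp)
{
    return disp->disabled ? &disp->disabled_save_area
                          : &disp->enabled_save_area;
}
\end{lstlisting}

The trap handler needs only the address of this structure and a single load of
the disabled flag to decide where to save state, which is why that address has
to be both cheap to obtain and protected from user-level modification.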

The x86-32, x86-64, and ARMv7 CPU drivers all store the address of the running
dispatcher's shared segment at a fixed known address, \texttt{dcb\_current},
which is loaded by the trap handler. At user level, on x86 this address is
held in a \emph{segment register} (\texttt{fs} on x86-64, and \texttt{gs} on
x86-32), while on ARMv7 we sacrifice a general-purpose register (\texttt{r9})
for this purpose. Using the \texttt{tpidrro\_el0} register to hold the address
of the current dispatcher structure will allow us to avoid both a memory load
on the fast path, and sacrificing a register in user-level code, thus
\texttt{tpidrro\_el0} \textbf{shall} hold the address of the currently-running
dispatcher.

\subsection{Instruction Sets}

ARMv8 supports both AArch64, and legacy ARM/Thumb (renamed AArch32). The
execution mode can change only on a change of exception level, i.e. on a trap
or return, and it is always set from the higher exception level. Thus,
\texttt{EL2} can set execution mode for \texttt{EL1}, and \texttt{EL1} for
\texttt{EL0}. There is no way for a program to change its own execution mode.
If \texttt{ELn} is in AArch64, then \texttt{EL(n-1)} can be in either AArch64
or AArch32. If \texttt{ELn} is in AArch32, all lower ELs must also be AArch32.

The CPU driver \textbf{shall} execute in AArch64.

\textbf{Initially}, the CPU driver will enforce that all directly-scheduled
threads also use AArch64, by controlling all downward EL transitions. An
\texttt{EL1} client (such as Arrakis or a full virtual machine) may execute
its own \texttt{EL0} clients in AArch32 (and there is no way to prevent this).
However, all transitions into the CPU driver (\texttt{svc}, \texttt{hvc} or
exception) must come from a direct client of the CPU driver, and thus from
AArch64. The syscall ABI \textbf{shall} be AArch64.
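
The constraint on execution modes stated above is easy to check mechanically.
The function below is a small sketch of that rule only; the names
\texttt{exec\_state} and \texttt{exec\_states\_valid} are hypothetical and the
code is not part of the CPU driver.

\begin{lstlisting}[language=C]
#include <stdbool.h>

enum exec_state { AARCH32, AARCH64 };

/*
 * state[n] is the execution state of ELn, for n = 0 .. levels-1.
 * The assignment is valid if no AArch64 level sits directly below an
 * AArch32 level; checking adjacent pairs is enough, since it forces
 * every level below an AArch32 level to be AArch32 as well.
 */
static bool exec_states_valid(const enum exec_state *state, int levels)
{
    for (int n = 1; n < levels; n++) {
        if (state[n] == AARCH32 && state[n - 1] == AARCH64) {
            return false;
        }
    }
    return true;
}
\end{lstlisting}

For example, an AArch64 CPU driver at \texttt{EL2} with an AArch64 guest at
\texttt{EL1} running AArch32 code at \texttt{EL0} passes this check, whereas
AArch64 code at \texttt{EL0} beneath an AArch32 \texttt{EL1} does not.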

\textbf{Eventually}, Barrelfish \textbf{should} also support the execution of
AArch32 dispatcher processes, by marking each dispatcher with a flag
indicating the instruction set to be used (much as is already done with
VM/non-VM mode in the Arrakis CPU driver).

\subsection{User-Space Access to Architectural Functions}

Generally, anything that can be safely exported \textbf{should} be made
available outside of the CPU driver, preferably as a memory-mapped interface,
at 4kiB granularity. The SBSA mandates that devices be present at addresses
that can be individually mapped, thus this should not be a problem.

\subsection{Cache Management}

ARMv8 has moved most cache and TLB management from the system control
coprocessor (cp15), into the core ISA. Several cache operations
(invalidate/clean by VA) are executable at \texttt{EL0}, and thus no kernel
interface is required. The system must take into account that user-directed
flushes may have occurred, or may occur concurrently with any memory
operation.

\subsection{Performance Monitors}

Performance monitors \textbf{should} be exposed, if it can be done safely.

\subsection{Debugging}

Self-hosted debug \textbf{should} be exposed, if it can be done safely. This
is under active development.

\subsection{Booting}

Platform support i.e.~a standard set of peripherals, and a defined boot
process, has improved dramatically on ARM, as it has been repositioned as a
server platform. UEFI and ACPI support are widespread, including on the
Mustang development board. We will assume support for UEFI booting, and make
use of ACPI data where available.

The Barrelfish CPU driver and initial image \textbf{shall} be loaded and
executed by a UEFI shim, which will pass through all UEFI-supplied
information, such as ACPI tables, and be able to interpret a Barrelfish
Multiboot image.
This shim, or second-stage bootloader, is called Hagfish,
and is described in \autoref{s:hagfish}.

\subsection{Interrupts}

ARMv8 interrupt handling is not substantially different from the existing
architectures and platforms supported by Barrelfish. While a redesign of the
Barrelfish interrupt system is under way (to use capabilities to grant access
to receive interrupts), we do not anticipate ARMv8 to impose any particular
challenges.

The ARMv8 systems we \textbf{initially} target all use minor variations on the
ARM Generic Interrupt Controller (GIC) design, already supported in
Barrelfish. We currently have support for version 2 of the GIC, with which
later implementations are backward-compatible. We will \textbf{eventually}
support GICv3, the current specification at time of writing.

\subsection{Inter-Domain Communication}

User-level communication between cache-coherent cores in Barrelfish for ARMv8
is likely to be the same as with ARMv7 and x86, and we expect the existing
User-level Message-Passing over Cache-Coherence (UMP-CC) interconnect driver
to work unmodified.

Between dispatchers on the same core, however, the different register set on
the ARMv8 is likely to result in a very different Local Message Passing (LMP)
interconnect driver---this is always an architecture-specific part of the CPU
driver. In practice, its design will be closely tied to the context switch and
upcall dispatch code.

\chapter{Booting}\label{c:booting}

Booting ARM systems has always been difficult to do in a standard way,
and ARMv8 systems are no exception. Barrelfish uses one of two
methods of booting an initial ARMv8 core, depending on whether the
hardware platform supports UEFI~\citep{uefi} or U-Boot. If a platform
supports neither, more work will be required to boot the board.

If a board has full support for UEFI (such as TianoCore), you can use
Hagfish (\autoref{s:hagfish}) to individually load the modules needed to
boot Barrelfish and set up the initial CPU/MMU environment before
entering the CPU driver proper.

Note that U-Boot also claims to support UEFI. However, in practice it
supports a small subset of UEFI functionality sufficient to boot
\texttt{grub} or the Linux kernel as an EFI binary. If your board
boots via U-Boot, you should use the minimal EFI
bootloader (\autoref{s:uboot}), which loads a single multiboot image into
memory and sets up the environment similarly to Hagfish.

\section{Hagfish}\label{s:hagfish}

The Barrelfish/ARMv8 UEFI loader prototype is called Hagfish\footnote{A
hagfish is a basal chordate i.e. something like the ancestor of all fishes.}.
Hagfish is a second-stage bootloader for Barrelfish on UEFI platforms,
initially the ARMv8 server platform. Hagfish is loaded as a UEFI application,
and uses the large set of supplied services to do as much of the one-time
(boot core) setup that the CPU driver needs as is reasonably possible. More
specifically, Hagfish:

\begin{itemize}
\item Is loaded over BOOTP/PXE.
\item Reuses the PXE environment to load a menu.lst-style configuration.
\item Loads the kernel image and the initial applications, as directed, and
builds a Multiboot image.
\item Allocates and builds the CPU driver's page tables.
\item Activates the initial page table, and allocates a stack.
\end{itemize}

\subsection{Why Another Bootloader?}

The ARMv8 machines that we're porting to are different to both existing ARM
boards, and to x86. They have a full pre-boot environment, unlike most
embedded boards, but it's not a PC-style BIOS. The ARM Server Base Boot
Requirements specify UEFI.
Moreover, there is no mainline support from GNU
GRUB for the ARMv8 architecture, so no matter what, we need some amount of
fresh code.

Given that we had to write at least a shim loader, and keeping in mind that
UEFI is multi-platform (and becoming more and more common in the x86 world),
we're taking the opportunity to simplify the initial boot process within the
CPU driver by moving the once-only initialisation into the bootloader. In
particular, while running under UEFI boot services, we have memory allocation
available for free, e.g. for the initial page tables. By moving ELF loading
and relocation code into the bootloader, we can eliminate the need to relocate
running code, and can cut down (hopefully eliminate) special-case code for
booting the initial core. Subsequent cores can rely on user-level Coreboot
code to relocate them, and to construct their page tables.

\subsection{Assumptions and Requirements}

Hagfish is (initially at least) intended to support development work on
AArch64 server-style hardware and, as such, makes the following assumptions:

\begin{itemize}
\item 64-bit architecture, using ELF binaries. Porting to 32-bit architectures
wouldn't be hard, if it were ever necessary (probably not).
\item PXE/BOOTP/TFTP available for booting. Hagfish expects to load its
configuration, and any binaries needed, using the same PXE context with which
it was booted. Changing this to boot from a local device (e.g. HDD) wouldn't
be hard, as the UEFI \texttt{LoadFile} interface abstracts from the hardware.
\end{itemize}

\subsection{Boot Process}

In detail, Hagfish currently boots as follows:

\begin{enumerate}
\item \texttt{Hagfish.efi} is loaded over PXE by UEFI, and is executed at a
runtime-allocated address, with translation (MMU) and caching enabled.
\item Hagfish queries EFI for the PXE protocol instance used to load it, and
saves the current network configuration.
\item Hagfish loads the file \texttt{hagfish.A.B.C.D.cfg} from the TFTP server
root (where \texttt{A.B.C.D} is the IP address on the interface that ran PXE).
\item Hagfish parses its configuration, which is essentially a GRUB menu.lst,
and loads the kernel image and any additional modules specified therein. All
ELF images are loaded into page-aligned regions of type
\texttt{EfiBarrelfishELFData}.
\item Hagfish queries UEFI for the system memory map, then allocates and
initialises the initial page tables for the CPU driver (mapping all occupied
physical addresses, within the \texttt{TTBR1} window, see \autoref{s:layout}).
The frames holding these tables are marked with the EFI memory type\\
\texttt{EfiBarrelfishBootPageTable}, allocated from the OS-specific range
(\texttt{0x80000000}--\texttt{0x8fffffff}). All memory allocated by Hagfish on
behalf of the CPU driver is page-aligned, and tagged with an OS-specific type,
to allow EFI and Hagfish regions to be safely reclaimed.
\item Hagfish builds a Multiboot 2 information structure, containing as much
information as it can get from EFI, including:
  \begin{itemize}
  \item ACPI 1.0 and 2.0 tables.
  \item The EFI memory map (including Hagfish's custom-tagged regions).
  \item Network configuration (the saved DHCP ack packet).
  \item The kernel command line.
  \item All loaded modules.
  \item The kernel's ELF section headers.
  \end{itemize}
\item Hagfish allocates a page-aligned kernel stack (type
\texttt{EfiBarrelfishCPUDriverStack}), of the size specified in the
configuration.
\item Hagfish terminates EFI boot services (calls \texttt{ExitBootServices}),
activates the CPU driver page table, switches to the kernel stack, and jumps
into the relocated CPU driver image.
\end{enumerate}

\subsection{Post-Boot State}

When the CPU driver on the boot core begins executing, it can assume the
following:

\begin{itemize}
\item The MMU is configured with all RAM and I/O regions mapped via
\texttt{TTBR1}.
\item The CPU driver's code and data are both fully relocated into one or more
distinct 4kiB-aligned regions.
\item The stack pointer is at the top of a distinct 4kiB-aligned region of at
least the requested size.
\item The first argument register holds the Multiboot 2 magic value.
\item The second holds a pointer to a Multiboot 2 information structure, in
its own distinct 4kiB-aligned region.
\item The console device is configured.
\item Only one core is enabled.
\item The Multiboot structure contains at least:
  \begin{itemize}
  \item The final EFI memory map, with all areas allocated by Hagfish to
  hold data passed to the CPU driver marked with OS-specific types, all of
  which refer to non-overlapping 4kiB-aligned regions:
    \begin{description}
    \item[\ttfamily EfiBarrelfishCPUDriver]
    The currently-executing CPU driver's text and data segments.
    \item[\ttfamily EfiBarrelfishCPUDriverStack]
    The CPU driver's stack.
    \item[\ttfamily EfiBarrelfishMultibootData]
    The Multiboot structure.
    \item[\ttfamily EfiBarrelfishELFData]
    The unrelocated ELF image for a boot-time module (including that for
    the CPU driver itself), as loaded over TFTP.
    \item[\ttfamily EfiBarrelfishBootPageTable]
    The currently-active page tables.
    \end{description}
  \item The CPU driver (kernel) command line.
  \item A copy of the last DHCP Ack packet.
  \item A copy of the section headers from the CPU driver's ELF image.
  \item Module descriptions for the CPU driver and all other boot modules.
  \item If UEFI provided an ACPI root table, the Multiboot structure
  contains a pointer to it.
  \end{itemize}
\end{itemize}

\subsection{Configuration}

Hagfish configures itself by loading a file whose path is generated from its
assigned IP address. Thus if your development machine receives the address
192.168.1.100, Hagfish will load the file\\
\texttt{hagfish.192.168.1.100.cfg}
from the same TFTP server used to load it. The format is intended to be as
close as practical to that of an old-style GRUB menu.lst file. The example
configuration in \autoref{f:hag_config} loads
\texttt{/armv8/sbin/cpu\_apm88xxxx} as the CPU driver, with arguments
\texttt{loglevel=3}, and an 8192B (2-page) stack.

\begin{figure}[htb]
\begin{center}
\begin{lstlisting}
kernel /armv8/sbin/cpu_apm88xxxx loglevel=3
stack 8192
module /armv8/sbin/cpu_apm88xxxx
module /armv8/sbin/init

# Domains spawned by init
module /armv8/sbin/mem_serv
module /armv8/sbin/monitor

# Special boot time domains spawned by monitor
module /armv8/sbin/chips boot
module /armv8/sbin/ramfsd boot
module /armv8/sbin/skb boot
module /armv8/sbin/kaluga boot
module /armv8/sbin/spawnd boot bootarm=0
module /armv8/sbin/startd boot

# General user domains
module /armv8/sbin/serial auto portbase=2
module /armv8/sbin/fish nospawn
module /armv8/sbin/angler serial0.terminal xterm

module /armv8/sbin/memtest

module /armv8/sbin/corectrl auto
module /armv8/sbin/usb_manager auto
module /armv8/sbin/usb_keyboard auto
module /armv8/sbin/sdma auto
\end{lstlisting}
\end{center}
\caption{Hagfish configuration file}
\label{f:hag_config}
\end{figure}

\subsection{Booting with Hagfish in \qemu}\label{c:qemu}

When booting a \qemu image for 64-bit ARM, a number of options are
available (see \texttt{make help-boot}). Building a boot image for
\qemu with ARMv8 will typically result in a file in the build directory
called \texttt{armv8\_<core\_type>\_qemu\_image}.
This is a disk image which can be
read by Hagfish through EFI calls.

Booting this with a boot target from \texttt{make} will run the
following:
\begin{lstlisting}
  srcdir/tools/qemu-wrapper.sh \
      --image armv8_<core_type>_qemu_image \
      --arch armv8 \
      --bios ../git/barrelfish/tools/hagfish/QEMU_EFI.fd
\end{lstlisting}

This wrapper script is complex, but reasonably well documented (use
\texttt{--help}). It will invoke \qemu as follows:
\begin{lstlisting}
  qemu-system-aarch64 \
      -m 1024 \
      -cpu cortex-a57 \
      -M virt \
      -d guest_errors \
      -M gic_version=3 \
      -smp 1 \
      -bios ../git/barrelfish/tools/hagfish/QEMU_EFI.fd \
      -device virtio-blk-device,drive=image \
      -drive if=none,id=image,file=armv8_<core_type>_qemu_image,format=raw \
      -nographic
\end{lstlisting}

Note that for this script to work, you need to have \texttt{mtools}
(the MS-DOS file system manipulation tools) installed, since they are
used to prepare the \texttt{armv8\_<core\_type>\_qemu\_image} file.

More specifically, the \texttt{armv8\_<core\_type>\_qemu\_image} file is
generated by \texttt{tools/harness/efiimage.py}. This creates an EFI file
system image out of the plain Barrelfish binaries built in
\texttt{\textit{builddir}/armv8/sbin}, plus the Hagfish EFI image we
regularly use for real hardware. The \texttt{QEMU\_EFI.fd} file is
the UEFI runtime built for \qemu.

\section{Booting from U-Boot}\label{s:uboot}

Where a full UEFI environment is not available, it is possible to boot
Barrelfish from U-Boot~\cite{uboot}. We boot Barrelfish from U-Boot
using U-Boot's limited EFI support: a build-time tool
(\texttt{armv8\_bootimage}) builds a single binary which only requires
the minimal EFI environment provided by U-Boot.
This binary contains
a loader (\texttt{efi\_loader}) which sets up the rest of the image as
a multiboot image in memory before starting the CPU driver.

\subsection{Booting in \qemu with U-Boot}

A ``platform'' target such as \texttt{QEMU\_UBoot} builds such an image
for \qemu, and the \texttt{qemu-wrapper.sh} script can be invoked to
use U-Boot instead of Hagfish:

\begin{lstlisting}
  srcdir/tools/qemu-wrapper.sh \
      --image armv8_a57_qemu_image.efi \
      --arch armv8 \
      --uboot-img srcdir/tools/qemu-armv8-uboot.bin
\end{lstlisting}

This invokes \qemu as follows:

\begin{lstlisting}
  qemu-system-aarch64 \
      -m 1024 \
      -cpu cortex-a57 \
      -M virt \
      -d guest_errors \
      -M gic_version=3 \
      -smp 1 \
      -bios srcdir/tools/qemu-armv8-uboot.bin \
      -device loader,addr=0x50000000,file=armv8_a57_qemu_image.efi \
      -nographic
\end{lstlisting}

As you can see, the U-Boot binary is given as the BIOS, and the minimal
EFI image, with the complete set of multiboot modules compiled in, is
pre-loaded into memory when \qemu starts.

\chapter{Technical Observations}\label{c:tech}

\section{User-Space Threading}\label{s:threads}

\begin{figure}[htb]
\begin{center}
\begin{minipage}[t]{0.3\textwidth}
\begin{lstlisting}
clrex
/* Restore CPSR */
ldr r0, [r1], #4
msr cpsr, r0
/* Restore registers */
ldmia r1, {r0-r15}
\end{lstlisting}
\end{minipage}
\hspace{2cm}
\begin{minipage}[t]{0.5\textwidth}
\begin{lstlisting}
/* Restore PSTATE, load resume
 * address into x18 */
ldp x18, x2, [x1, #(PC_REG * 8)]
/* Set only NZCV. */
and x2, x2, #0xf0000000
msr nzcv, x2
/* Restore the stack pointer and x30. */
ldp x30, x2, [x1, #(30 * 8)]
mov sp, x2
/* Restore everything else. */
ldp x28, x29, [x1, #(28 * 8)]
ldp x26, x27, [x1, #(26 * 8)]
ldp x24, x25, [x1, #(24 * 8)]
ldp x22, x23, [x1, #(22 * 8)]
ldp x20, x21, [x1, #(20 * 8)]
/* n.b. don't reload x18 */
ldr x19, [x1, #(19 * 8)]
ldp x16, x17, [x1, #(16 * 8)]
ldp x14, x15, [x1, #(14 * 8)]
ldp x12, x13, [x1, #(12 * 8)]
ldp x10, x11, [x1, #(10 * 8)]
ldp x8, x9, [x1, #( 8 * 8)]
ldp x6, x7, [x1, #( 6 * 8)]
ldp x4, x5, [x1, #( 4 * 8)]
ldp x2, x3, [x1, #( 2 * 8)]
/* n.b. this clobbers x0&x1 */
ldp x0, x1, [x1, #( 0 * 8)]
/* Return to the thread. */
br x18
\end{lstlisting}
\end{minipage}
\end{center}
\caption{\texttt{disp\_resume\_context} on ARMv7 (left) and ARMv8 (right)}
\label{f:disp_resume}
\end{figure}

The ARMv8 architecture is in some ways an improvement, and in other ways
problematic, for the sort of user-level threading implemented in Barrelfish,
via \emph{scheduler activations}. Under this scheme, the kernel (in Barrelfish
terms, the \emph{CPU driver}) does not schedule threads directly, but instead
exposes all scheduling-relevant events via \emph{upcalls} to predefined
user-level handlers (in Barrelfish, the \emph{dispatcher}), which then
implement thread scheduling (or something else entirely), as they see fit.
This differs from the behaviour of a system such as UNIX, which only ever
restores a user-level execution context simultaneously with dropping from a
privileged to an unprivileged execution level.

Processor architectures are, understandably, designed with common software in
mind. Thus, the primitives available for restoring an execution context
(i.e.\ register state) are often tied closely to those for changing privilege
level. A common design (which ARMv8 also implements) is the \emph{exception
return}, where privileged code can atomically drop its privilege, and jump to
a user-level execution address.
In ARMv8, the \texttt{eret} instruction
atomically updates the program state (PSTATE, most importantly the privilege
level bits), and branches to the address held in the \emph{exception link
register}, \texttt{elr}.

In implementing user-level threading, we're not concerned with privilege
levels, but the lack of some equivalent of \texttt{elr} is frustrating. Not
only does \texttt{eret} provide an atomic update of the program counter and
the program state, it does so without modifying any general-purpose register.
Replicating this behaviour at \texttt{EL0}, where \texttt{eret} is
unavailable, is problematic. ARMv8 differs from ARMv7, in that the program
counter can no longer be the target of a load instruction, but can only be
loaded via a general-purpose register.

Specifically, the only PC-modifying instructions (other than \texttt{eret})
are PC-relative branches (which are useless in this scenario) and
branch-to-register (of which \texttt{br}, \texttt{blr} and \texttt{ret} are
all special encodings). Since ARMv8 has also removed the \texttt{ldm} (load
multiple) instruction, there is no way to load the program counter with an
arbitrary value (the thread's restart address), without overwriting one of the
general-purpose registers. We cannot restore the thread's register value
\emph{before} we branch to it, as we'd overwrite the return address, and we
obviously can't do so afterwards, as the thread likely has no idea that it's
been interrupted. The only alternatives are to trampoline through kernel mode
in order to use \texttt{eret} (which would eliminate the speed benefit of
user-level threading), or to reserve a general-purpose register for use by the
dispatcher. Neither option is appealing, but we went with the second option,
reserving \texttt{x18}, reasoning that with 31 general-purpose registers
available, the loss of one isn't a huge penalty.
Register \texttt{x18} is
explicitly marked as the \emph{platform register} in the AArch64 ABI
\citep{arm:aa64pcs}, for such a purpose.

Future revisions of the ARM architecture could prevent this issue in a number
of ways: allowing the use of \texttt{eret} at \texttt{EL0} or providing an
equivalent functionality (specifically a non-general-purpose register such as
\texttt{elr}, that doesn't need to be restored); or alternatively, adding
indirect jumps (load to PC) back to the instruction set.

\autoref{f:disp_resume} compares the user-level thread resume code for the
Barrelfish dispatcher (function \texttt{disp\_resume}) for ARMv7 and ARMv8
side-by-side. The effect of removing the load-multiple instructions, and
direct-to-SP loads, on code density is clearly visible: everything on lines
8--29 for ARMv8 corresponds to the single \texttt{ldmia} instruction on line
6 for ARMv7. One instruction is now 18, on the thread-switch critical path!
Note also, on line 17, that the ARMv8 code does not restore the thread's
\texttt{x18}, but instead uses it to hold the branch address for use on line
29. The only improvement on ARMv8 is that the \texttt{clrex} (clear exclusive
monitor) instruction is no longer required, as the monitor is cleared on
returning from the kernel. Note also that the usual method to efficiently load
multiple registers, using 16-word SIMD (NEON) loads, isn't available, as
there's no guarantee that the SIMD extensions are enabled on this dispatcher,
and we cannot handle a fault in this code.

\section{Trap Handling}\label{s:traps}

\begin{figure}
\begin{lstlisting}
el0_aarch64_sync:
    msr daifset, #3 /* IRQ and FIQ masked, Debug and Abort enabled. */

    stp x11, x12, [sp, #-(2 * 8)]!
    stp x9, x10, [sp, #-(2 * 8)]!

    mrs x10, tpidr_el1
    mrs x9, elr_el1

    ldp x11, x12, [x10, #OFFSETOF_DISP_CRIT_PC_LOW]
    cmp x11, x9
    ccmp x12, x9, #0, ls
    ldr w11, [x10, #OFFSETOF_DISP_DISABLED]
    ccmp x11, xzr, #0, ls
    /* NE <-> (low <= PC && PC < high) || disabled != 0 */

    mrs x11, esr_el1 /* Exception Syndrome Register */
    lsr x11, x11, #26 /* Exception Class field is bits [31:26] */

    b.ne el0_sync_disabled

    add x10, x10, #OFFSETOF_DISP_ENABLED_AREA

save_syscall_context:
    str x7, [x10, #(7 * 8)]

    stp x19, x20, [x10, #(19 * 8)]
    stp x21, x22, [x10, #(21 * 8)]
    stp x23, x24, [x10, #(23 * 8)]
    stp x25, x26, [x10, #(25 * 8)]
    stp x27, x28, [x10, #(27 * 8)]
    stp x29, x30, [x10, #(29 * 8)] /* FP & LR */

    mrs x20, sp_el0
    stp x20, x9, [x10, #(31 * 8)]

    mrs x19, spsr_el1
    str x19, [x10, #(33 * 8)]

    cmp x11, #0x15 /* SVC or HVC from AArch64 EL0 */
    b.ne el0_abort_enabled

    add sp, sp, #(4 * 8)

    mov x7, x10

    b sys_syscall
\end{lstlisting}
\caption{BF/ARMv8 synchronous exception handler}
\label{f:sync_el0}
\end{figure}

\autoref{f:sync_el0} shows the CPU driver exception stub, for a synchronous
abort from \texttt{EL0}. This exception class includes system calls,
breakpoints, and page faults on both code and data. The effect of the loss of
store-multiple instructions is again visible, for example on lines 27--32.
Although not as severe as in the case of the user-level thread restore in
\autoref{s:threads}, the extra instructions required do constrain us somewhat,
as each trap handler is limited to 128 bytes, or 32 instructions, before
branching to another code block.
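
The decision logic of the stub in \autoref{f:sync_el0} can be replayed outside
the CPU driver to check that the flag manipulation matches the comment in the
listing. The following is a minimal sketch in Python, for illustration only and
not part of Barrelfish: all function and constant names are hypothetical, and
the model covers only the \texttt{cmp}/\texttt{ccmp} sequence and the
extraction of the exception class from \texttt{esr\_el1}.

\begin{lstlisting}[language=Python,numbers=none]
MASK64 = (1 << 64) - 1
NZCV_CLEAR = (0, 0, 0, 0)  # the '#0' immediate of both ccmp instructions
EC_SVC_AARCH64 = 0x15      # value tested by 'cmp x11, #0x15' in the stub


def exception_class(esr):
    """Extract the Exception Class field, bits [31:26] of ESR_EL1."""
    return (esr >> 26) & 0x3F


def cmp_flags(a, b):
    """Model 'cmp a, b': the (N, Z, C, V) flags of the 64-bit a - b."""
    a &= MASK64
    b &= MASK64
    result = (a - b) & MASK64
    n = result >> 63
    z = int(result == 0)
    c = int(a >= b)                     # C set means no unsigned borrow
    v = ((a ^ b) & (a ^ result)) >> 63  # signed overflow
    return (n, z, c, v)


def ccmp(a, b, nzcv, cond):
    """Model 'ccmp a, b, #nzcv, cond': compare if cond holds, else set nzcv."""
    return cmp_flags(a, b) if cond else nzcv


def ls(flags):
    """LS (unsigned lower or same): C clear or Z set."""
    _, z, c, _ = flags
    return c == 0 or z == 1


def ne(flags):
    """NE (not equal): Z clear."""
    return flags[1] == 0


def takes_disabled_path(pc, crit_low, crit_high, disabled):
    """Replay the flag-setting chain; True means 'b.ne' is taken."""
    flags = cmp_flags(crit_low, pc)                     # cmp  x11, x9
    flags = ccmp(crit_high, pc, NZCV_CLEAR, ls(flags))  # ccmp x12, x9, #0, ls
    flags = ccmp(disabled, 0, NZCV_CLEAR, ls(flags))    # ccmp x11, xzr, #0, ls
    return ne(flags)
\end{lstlisting}

Under this model, \texttt{takes\_disabled\_path(pc, low, high, disabled)}
agrees with the expression \texttt{(low <= pc < high) or disabled != 0} for
unsigned inputs, which is the condition stated in the comment on line 15 of
the listing. The intervening \texttt{mrs} and \texttt{lsr} instructions do not
alter the flags, so they are omitted from the replay.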

We were able to squeeze the necessary code into the space available, including
the optimised test for a disabled dispatcher at lines 10--14, but only by
splitting the page fault handler (\texttt{el0\_abort\_enabled}) into a
separate subroutine, incurring an unnecessary branch. A more significant
annoyance is that system calls (\texttt{svc} and \texttt{hvc}) are routed to
the same exception vector as page faults (aborts). The effect of this is that
we are forced to spill registers to the stack (\texttt{x9}--\texttt{x12} on
lines 4--5), even on the system call fast path, as we need at least one
register to check the exception syndrome (\texttt{esr\_el1}) to distinguish
aborts (where we must preserve all registers) from system calls (where we
could immediately begin using the caller-saved registers). Note that the code
on lines 27--32 only needs to stack the callee-saved registers, and leaves the
system call arguments in \texttt{x0}--\texttt{x7}, to be read as required by
\texttt{sys\_syscall} (in C).

This sort of mismatch between the exception-handling interface of the CPU
architecture, and what is required for really high-performance systems code, is
unfortunately extremely common. Unnecessary overheads, such as the additional
stacked registers here, hurt the performance of highly-componentised systems,
such as Barrelfish, which rely on frequently crossing protection domains.

The relatively well-compressed boolean arithmetic on lines 10--14 demonstrates
that, even with the loss of ARM's fully-conditional instructions, the
conditional compares which remain are still relatively powerful.

\section{Cache Coherence}

One aspect of the ARM architecture that is of particular interest for the
Barrelfish project, but which we have not yet explored in depth, is the
configurable cache coherency and fine-grained cache management operations
available.
Any virtual mapping on a recent ARM architecture, including both
ARMv7 and ARMv8, can be tagged with various cacheability properties: inner
(L1), outer (L2+, usually), write-back or write-through. Combined with the
explicit flush operations at cache-line granularity, able to target either PoU
(point of unification, where data and instruction caches merge) or PoC (point
of coherency, typically RAM), a multi-core, multi-socket ARMv8 system would
make a very interesting testbed for investigating efficient cache management
and communication primitives for future partially-coherent architectures.
Indeed, the latest revision of the ARMv8 specification, ARMv8.2, introduced
flush to PoP, or \emph{point of persistence}, perhaps in response to
interest from well-known systems integration firms investigating large
persistent memories.

The design presented in this report is intended to expose as much control over
the caching hierarchy as possible to user-level code, to provide a platform
for future research.

\bibliographystyle{plainnat}
\bibliography{defs,barrelfish}

\end{document}