%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Copyright (c) 2013, ETH Zurich.
% All rights reserved.
%
% This file is distributed under the terms in the attached LICENSE file.
% If you do not find this file, copies can be found by writing to:
% ETH Zurich D-INFK, Universitaetstr. 6, CH-8092 Zurich. Attn: Systems Group.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\documentclass[a4paper,twoside]{report}

\usepackage{bftn}
\usepackage{booktabs}
\usepackage{hyperref}
\usepackage{hyphenat}
\usepackage{listings}
\usepackage{makeidx}
\usepackage{natbib}

\def\chapterautorefname{Chapter}
\def\sectionautorefname{Section}
\def\subsectionautorefname{Section}
\def\subsubsectionautorefname{Section}
\def\tableautorefname{Table}

\lstdefinelanguage{armasm}{
  numbers=left,
  numberstyle=\tiny,
  numbersep=5pt,
  basicstyle=\ttfamily\small
}
\lstset{language=armasm}

\title{Barrelfish on ARMv8}
\author{David Cock}

\tnnumber{022} % give the number of the tech report
\tnkey{ARMv8} % Short title, will appear in footer

\begin{document}
\maketitle

\begin{versionhistory}
\vhEntry{1.0}{11.04.2016}{DC}{Initial version}
\end{versionhistory}

\chapter{Summary}

Barrelfish now supports ARMv7 and ARMv8 as primary platforms, and we have
discontinued support for all older architecture revisions (ARMv5, ARMv6). The
current Barrelfish release contains a port to a simulated ARMv8 environment,
derived from the existing ARMv7 codebase and running under GEM5, with generous
contributions from HP Research.

Simultaneously, we are undertaking a clean-slate redesign of the CPU driver
for ARMv8, as it presents a number of novel features, and greatly improved
platform standardisation (\autoref{s:sbsa}), that should allow for a much
cleaner and simpler implementation.
This redesigned CPU driver will form the
basis for ongoing research into large-scale non-cache-coherent systems using
ARMv8 cores. This document presents the new CPU driver design
(\autoref{c:design}), briefly covering those features of ARMv8 of greatest
relevance (\autoref{c:background}), and discusses a number of technical
challenges presented by the new architecture (\autoref{c:tech}).

\chapter{Background}\label{c:background}

The Barrelfish research operating system is a vehicle for research into
software support for likely future architectures, where large numbers of
non-coherent (or weakly-coherent) heterogeneous processor cores are assembled
into a single large-scale system. As such, support for a common non-x86
architecture has always been part of the project, beginning with the ARMv5
(XScale) port, which permitted the embedded processor on a network interface
card to be integrated as a first-class part of the system, with its own CPU
driver. We have also actively maintained an ARMv7 port, to the OMAP4460
processor on the Pandaboard ES, which we use as a teaching platform in the
Advanced Operating Systems course at ETH Z\"urich. These ports are described
more fully in the accompanying technical report, \citet{btn017-arm}.

\section{The ARMv8 Architecture}

The ARMv8 architecture is quite a radical departure from previous versions,
and represents the culmination of a trend that has been developing for quite
some time. While the first wave of ARM-based microservers, based on the 32-bit
ARMv7 architecture, was largely a commercial failure, it's clear that ARM is
now actively targeting the server market, where Intel currently has near-total
dominance.

ARMv8 discards some long-standing features of the ARM instruction set:
universal conditional execution, multiple loads/stores, and the program
counter as a general-purpose register.
These most likely caused difficulty in
scaling the processor pipeline to high clock rates, and we present some
consequences of their loss in \autoref{s:threads} and \autoref{s:traps}. The
instruction-set changes, however challenging to the systems programmer, are
ultimately of little consequence compared to the consolidation of the ARM
ecosystem into a serious server platform. The two features of most interest at
this stage in the design process are the standardisation of hardware features
and memory maps, and of the boot process.

\subsection{ARM Server Base System Architecture}\label{s:sbsa}

ARM has long been criticised by systems programmers for its highly fragmented
and non-uniform programming interface. Linux, in particular, has struggled for
years with supporting the great multiplicity of ARM platforms. The principal
reason for this is the lack of any concept of a \emph{platform}: a set of
assumptions (available hardware, memory map, etc.) that programmers can rely
on when initialising and managing a system. The Linux source tree famously
contained vastly more code in the ARM platform-support subtrees than in those
for x86.

The relative standardisation of the x86 platform is largely a historical
accident, due to the rapid proliferation of PC/AT clones during the 1980s.
The x86 platform thus contains layers of ossified legacy interfaces, necessary
to ensure broad cross- and backward compatibility. ARM's business model, on
the contrary, has long emphasised the specialisation of implementations: an
ARM licensee would take their ARM-designed CPU core, and integrate it
themselves into a complex SoC (system on a chip), with their own specialised,
proprietary interfaces. The upsides of this were the ability to highly
optimise a particular design, and no requirement on ARM itself to maintain a
coherent platform.

While ARM's customisable platform worked well for embedded devices, and scaled
reasonably well to relatively powerful smartphones, it's a disaster for
producing high-quality systems software, able to execute on a broad range of
hardware from competing vendors: exactly what a competitive server platform
requires. ARM clearly know this, and since 2014 have published the Server Base
System Architecture \citep{arm:sbsa}. To the extent that manufacturers adhere
to these guidelines, our job as systems programmers is significantly simpler:
it should be possible to write a single set of initialisation and
configuration code for ARMv8, that will run on any SBSA-compliant system, much
as we already do for x86-64.

\begin{table}
\begin{center}
\begin{tabular}{lllp{6cm}}
\toprule
Supplier & Processor & Name & Notes \\
\midrule
APM & APM883208 & Mustang & 1P 8-core X-Gene 1 with serial trace. \\
\addlinespace[2pt]
 & APM883408-X2 & X-C2 & 1P 8-core X-Gene 2. \\
\addlinespace[2pt]
Cavium & CN8890 & StratusX & 1P 48-core ThunderX. \\
\addlinespace[2pt]
 & & Cirrus & 2P 48-core ThunderX. \\
\addlinespace[2pt]
ARM & AEM & Fixed Virtual Platform & The \emph{architectural envelope model}
covers the range of behaviour permitted by ARMv8. Bare-metal debug. \\
\addlinespace[2pt]
 & & Foundation Platform & Freely available, compatible with FVP. \\
\bottomrule
\end{tabular}
\end{center}
\caption{ARMv8 platforms of interest}\label{t:platforms}
\end{table}

Our target platforms, listed in \autoref{t:platforms}, all support SBSA to some
extent, and absent any compelling reason, we will only support SBSA-compliant
platforms.

\subsection{UEFI}\label{s:uefi}

One aspect of the SBSA which eases portability is the specification, for the
first time, of a boot process for ARM systems.
ARM has specified that SBSA
systems must support UEFI \citep{uefi} (the unified extensible firmware
interface). UEFI is a descendant of the EFI specification, developed by Intel
for the Itanium project. While Itanium is no longer a platform of any great
commercial interest, UEFI support is now widespread in the x86-64 market.
UEFI, in turn, specifies the use of ACPI \citep{acpi} (the advanced
configuration and power interface) for platform discovery and control.

Supporting ACPI and UEFI requires a one-off investment of effort to design a
new boot and configuration subsystem, but should pay off in the long term, as
ports to new ARM boards will no longer require extensive manual configuration.
The code should also be largely reusable for x86 UEFI systems. Our new UEFI
bootloader is described in \autoref{s:hagfish}.

\section{A Direct Port from ARMv7}

As already described, the Barrelfish release current at time of writing
includes an initial ARMv8 port to the GEM5 simulator. This port contains code
generously contributed by HP Research.

Being developed from the existing codebase, this ARMv8 port follows the
structure of the existing ARMv7 code closely. While it is highly useful to
have a running port, we are nevertheless continuing with a substantial
redesign of the CPU driver, as considerable improvements and simplifications
become possible once we no longer need to follow the existing structure,
which was originally developed for a quite different platform.

The GEM5 simulator's model of an ARMv8 platform is relatively primitive, and
does not conform to modern platform conventions, for example placing RAM at
address \texttt{0}, rather than \texttt{0x80000000} as mandated by the SBSA.
For this reason, and for better integration with ARM debugging tools,
we have switched to the ARM Fixed Virtual Platform as our default simulation
environment, with the Foundation Platform supported as a freely available
simulator.

\section{Registers}

\subsection{General-purpose Registers}

In total there are 31+1 general-purpose registers (\texttt{r0-r30} plus
\texttt{SP}), each 64 bits wide (\autoref{tab:registers}). They are usually
referred to by the names \texttt{x0-x30}. The lower 32 bits of each register
are accessed as \texttt{w0-w30}. The additional stack pointer register,
\texttt{SP}, can be accessed only by a restricted set of instructions.

\begin{table}[!h]
 \begin{center}

 \begin{tabular}{lll}
 \textbf{Register} & \textbf{Special} & \textbf{Description} \\
 \hline
 \texttt{X0-X7} & Caller-save & function call arguments and return value
 \\
 \texttt{X8} & & indirect result e.g. location of large return value
 (struct) \\
 \texttt{X9-X15} & Caller-save & temporary registers \\
 \texttt{X16} & IP0 & The first intra-procedure-call scratch
 register\footnote{can be used by call veneers and PLT code; at other
 times may be used as a temporary register. Same for X17}\\
 \texttt{X17} & IP1 & The second intra-procedure-call temporary
 register \\
 \texttt{X18} & & The Platform Register (TLS), if needed; otherwise a
 temporary register. \\
 \texttt{X19-X28} & Callee-save & need to be preserved and restored when
 modified\\
 \texttt{X29} & FP & frame pointer \\
 \texttt{X30} & LR & link register \\
 \texttt{SP} & & stack pointer (shares register encoding 31 with the zero
 register \texttt{XZR}) \\

 \end{tabular}
 \caption{ARMv8 General purpose Registers}
 \label{tab:registers}
 \end{center}
\end{table}

\paragraph{Procedure call}
\begin{itemize}
 \item The registers \texttt{x19-x28} and \texttt{SP} are callee-saved and
 hence must be preserved by the called subroutine.
 All 64 bits have to
 be preserved even when executing in 32-bit mode.
 \item The registers \texttt{x0-x7} and \texttt{x9-x15} are caller-saved.
 \item During procedure calls the registers \texttt{x16}, \texttt{x17},
 \texttt{x29} and \texttt{x30} have special roles, i.e.\ they store
 relevant addresses such as the return address.
 \item Arguments for calls are passed in the registers \texttt{x0-x7},
 in \texttt{v0-v7} for floats/SIMD, and on the stack.
\end{itemize}

\paragraph{Indirect result} Register \texttt{x8} is used when returning a
large value, such as that of a function declared as
\texttt{struct mystruct foo(int arg);}. It holds the location at which the
result is to be stored.


\paragraph{Platform specific} The use of register \texttt{x18} is platform
specific and needs to be defined by the platform ABI. This register can hold
inter-procedural state such as the thread context.

\paragraph{Linker} The registers \texttt{IP0} and \texttt{IP1} can be used by the
linker as scratch registers, or to hold intermediate values between subroutine
calls.

\subsection{SIMD and Floating Point}
There are 32 registers to be used by floating-point and SIMD operations. The
names of these registers change, depending on the size of the operation.

\begin{table}[!h]
 \begin{center}

 \begin{tabular}{ll}
 \textbf{Register} & \textbf{Description} \\
 \hline
 \texttt{v0-v7} & function call arguments, intermediate values and
 return value, caller-save registers \\
 \texttt{v8-v15} & Callee-save registers.
 They need to be
 preserved\\
 \texttt{v16-v31} & Caller-save registers
 \end{tabular}
 \caption{ARMv8 SIMD and floating-point registers}
 \label{tab:simdregisters}
 \end{center}
\end{table}

\chapter{Design and Implementation}\label{c:design}

\section{Redesigning the CPU Driver}

Given that ARMv8 is a significantly different platform to ARMv7, and that the
ARMv7 codebase carries a significant legacy, reaching right back to ARMv5, we
are pursuing a substantial redesign of the CPU driver. Taking advantage of the
standardisation of the hardware platform mandated by the SBSA
(\autoref{s:sbsa}), and the facilities provided by UEFI (\autoref{s:uefi}), in
addition to a relatively unrestricted virtual address space, we are able to
significantly reduce the complexity of the CPU driver. In this section we
describe the updated design, and our progress on its implementation, while the
UEFI interface (Hagfish) is described separately, in \autoref{s:hagfish}.

\paragraph{Terminology}
In the interest of clarity, in the discussion that follows, we use a few terms
with precise intent:
\begin{description}
\item[shall]
 indicates features or characteristics of the design to which the
 Barrelfish implementation must conform.
\item[should]
 indicates features which should be supported if at all possible.
\item[initially]
 indicates features which will be provided from the outset in the
 Barrelfish implementation.
\item[eventually]
 indicates features which will be provided later in the Barrelfish
 implementation, and which the initial design will aim to facilitate.
\end{description}

\subsection{Goals}

Our goal is to provide a reference design for the CPU driver and user-space
execution environment for Barrelfish on an ARMv8 core, in order to understand
both positive and negative implications of the architecture for a multikernel
system.
The design \textbf{should} be applicable to any ARMv8 core with
virtualisation (\texttt{EL2}) support.

\textbf{Initially}, our hardware development platform is the APM X-Gene 1,
using the Mustang Development Board. We are using the Mustang principally as
it was relatively easily available, as well as being a comparatively complex
and powerful CPU. The ThunderX platform from Cavium is very interesting for
Barrelfish, as it ties together a large number (48) of less-powerful (2-issue)
cores. We do not have the resources to develop for two platforms
simultaneously, but we hope to \textbf{eventually} add support for the
ThunderX.

Our target simulation environment is the ARM Fixed Virtual Platform, and the
Foundation Platform. These models are supplied by ARM. The Foundation Platform
is freely available, and will be the default supported simulation platform for
the public Barrelfish tree, while we will use the FVP internally to allow
bare-metal debugging. Future support for QEMU is desirable, to the extent that
it models a compatible system --- GEM5, which the ARMv7 port targets,
currently does not.

\textbf{Initially}, the design will support running both the CPU driver and
user-space processes in AArch64 mode without support for virtualisation.
\textbf{Eventually} the design will support running the CPU driver in AArch64
mode, and user-space processes in both AArch64 and AArch32 modes without
virtualisation, and virtual machines in AArch64 mode. We will only support
virtualisation on ARMv8.1 or later platforms that support VHE,
as described in \autoref{s:layout}.

\subsection{Processor Modes and Virtualisation}

Where possible, we will keep the virtualisation model similar to that on
Barrelfish/x86. In particular, it \textbf{should} be possible to implement
native applications, fully virtualised (e.g. Linux) VMs, and VM-level
applications e.g.
Arrakis \citep{peter:osdi14}.

ARMv8 has a somewhat different virtualisation model to x86, and different
again from the ARMv7 virtualisation extensions. Rather than having exception
levels (rings) duplicated between guest and host, ARMv8 provides 4 exception
levels (ELs):

\begin{itemize}
\item \texttt{EL0} is unprivileged --- user applications.
\item \texttt{EL1} is privileged --- OS kernel.
\item \texttt{EL2} is hypervisor state.
\item \texttt{EL3} is for switching between secure and non-secure (TrustZone)
 modes. The X-Gene 1 does not implement \texttt{EL3}, and it
 is currently not of interest for Barrelfish.
\end{itemize}

Explicit traps (syscalls/hypercalls) target only the next level up:
\texttt{EL0} can call \texttt{EL1} using \texttt{svc} (syscall), and
\texttt{EL1} can call \texttt{EL2} using \texttt{hvc} (hypercall), but
\texttt{EL0} cannot directly call \texttt{EL2}, unless \texttt{EL1} is
completely disabled. Exceptions return to the caller's exception level.

ELs \textbf{shall} be distributed as follows: The CPU driver \textbf{shall}
exist at both \texttt{EL1} and \texttt{EL2}, and take both syscalls
(\texttt{svc}, from \texttt{EL0} applications) and hypercalls (\texttt{hvc},
from \texttt{EL1} applications). The system \textbf{shall} support
applications both at \texttt{EL0}, and at \texttt{EL1} (e.g. Arrakis, VMs).
Most code paths \textbf{should} be identical, as most CPU driver operations do
not depend on \texttt{EL2} privileges. Hypercalls from \texttt{EL0}
\textbf{shall} be chained via \texttt{EL1} (with appropriate permission
checks).

\texttt{EL1} apps such as Arrakis, and paravirtualised VMs using hypercalls
know that they are being virtualised, and will use \texttt{hvc} explicitly.
Fully-virtualised \texttt{EL1} VMs do not make hypercalls.
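
The trap routing just described can be summarised as a small executable model.
The following Python sketch is purely illustrative: the names
\texttt{trap\_target} and \texttt{hypercall\_path}, and the encoding of
exception levels as small integers, are assumptions of this example and do not
appear in the Barrelfish sources. It models only the two explicit traps
discussed above, and omits the permission check performed at \texttt{EL1}.

\begin{lstlisting}[language=Python]
def trap_target(current_el, insn):
    """Return the exception level that receives an explicit trap.

    Only the two cases from the text are modelled: svc issued at EL0
    and hvc issued at EL1. Everything else is rejected.
    """
    if insn == "svc" and current_el == 0:
        return 1  # syscall: EL0 -> EL1
    if insn == "hvc" and current_el == 1:
        return 2  # hypercall: EL1 -> EL2
    raise ValueError("%s from EL%d is not modelled" % (insn, current_el))


def hypercall_path(caller_el):
    """List the exception levels a hypercall request passes through."""
    if caller_el == 1:
        # An EL1 client (e.g. Arrakis) issues hvc directly.
        return [1, trap_target(1, "hvc")]
    if caller_el == 0:
        # EL0 cannot reach EL2 directly: it makes a syscall, and the
        # EL1 code forwards the request with a hypercall of its own.
        el1 = trap_target(0, "svc")
        return [0, el1, trap_target(el1, "hvc")]
    raise ValueError("no hypercall path from EL%d" % caller_el)
\end{lstlisting}

In this model a hypercall from \texttt{EL0} passes through the levels
\texttt{[0, 1, 2]}, while one from an \texttt{EL1} client passes through
\texttt{[1, 2]}.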

ARMv8 implements two-stage address translation: VA (virtual address) to IPA
(intermediate physical address), and IPA to PA (physical address).
\texttt{EL1} guests \textbf{shall} be isolated at the L1 translation layer,
and by trapping all accesses to system control registers.

\subsection{Virtual Address Space Layout}\label{s:layout}

ARMv8 has an effective 48-bit virtual address space. At the lowest exception
levels (\texttt{EL0} for Barrelfish user code and \texttt{EL1} for the
Barrelfish CPU driver), the hardware supports two `windows' of up to 48 bits
(256TB) each in a 64-bit space: one at the bottom, and one at the
top. Each region has its own translation table base register (\texttt{TTBR0}
\& \texttt{TTBR1}). \texttt{TTBR0} is used at \texttt{EL0}, and \texttt{TTBR1}
at \texttt{EL1}.

In the initial ARMv8 specification, this split address space was not
implemented at \texttt{EL2}, which would require a separate CPU driver
instance for virtualisation, and hypercalls (e.g. for Arrakis). ARMv8.1
introduced the virtualisation host extensions (VHE) which, among other things,
extend the split address space to \texttt{EL2}. As this provides a cleaner
implementation model, and to avoid having to support a now-deprecated
interface, virtualisation will \textbf{only} be supported on ARMv8.1 and
later. This means that we will not support virtualisation on the X-Gene 1.
Both the simulation environment (FVP/FP) and, seemingly, the ThunderX chips,
support VHE.

The CPU driver \textbf{shall} use \texttt{TTBR1} to provide a complete
physical window. The ARMv8 CPU driver \textbf{shall not} dynamically map
device memory into its own window (as the ARMv7 CPU driver does) --- the few
memory-mapped devices required will be statically mapped on boot, with
appropriate memory attributes. All physical addresses, RAM and device,
\textbf{shall} be accessible at a static, standard offset (the base of the
\texttt{TTBR1} region).
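
The address arithmetic implied by such a physical window is simple enough to
state as a short model. The Python sketch below assumes that the
\texttt{TTBR1} region uses the full 48 bits, so that it begins at
\texttt{0xffff000000000000}; the names \texttt{TTBR1\_BASE},
\texttt{phys\_to\_virt} and \texttt{virt\_to\_phys} are hypothetical and chosen
for this example only.

\begin{lstlisting}[language=Python]
VA_BITS = 48                          # assumed: full 48-bit window
WINDOW_SIZE = 1 << VA_BITS            # 256TB
TTBR1_BASE = (1 << 64) - WINDOW_SIZE  # lowest address of the upper window


def phys_to_virt(pa):
    """Map a physical address to its alias in the TTBR1 window."""
    if not 0 <= pa < WINDOW_SIZE:
        raise ValueError("physical address outside the window")
    return TTBR1_BASE + pa


def virt_to_phys(va):
    """Inverse mapping, valid only for addresses inside the window."""
    if not TTBR1_BASE <= va < (1 << 64):
        raise ValueError("virtual address outside the TTBR1 window")
    return va - TTBR1_BASE


if __name__ == "__main__":
    # RAM at physical 0x80000000 appears at a fixed offset.
    print(hex(phys_to_virt(0x80000000)))
\end{lstlisting}

Because the offset is static, the translation in either direction is a single
addition or subtraction, and needs no lookup in the CPU driver.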

User-level page tables will \textbf{initially} be limited to a 4kiB translation
granule. \textbf{Eventually} user-level page tables \textbf{should} have
access to all page-table formats and page sizes, as is the case in the current
Barrelfish x86 implementation.

\subsection{Address Space, Context, and Thread Identifiers}

ARMv8 also provides address-space identifiers (ASIDs) in the TLB to avoid
flushing the translation cache on a context switch.

ARMv8 ASIDs (referred to in ARM documentation as context IDs) are
architecturally allowed to be either 8 or 16 bits, although the SBSA
specifies that they must be at least 16 bits. Relying on the SBSA platform will
allow us to avoid multiplexing IDs among active processes, on any
reasonably-sized system. Managing the reuse of context IDs can be left to
user-level code, and does not need to be on the critical path of a context
switch. The CPU driver need only ensure that every allocated dispatcher has a
unique ASID, which is loaded into the \texttt{ContextID} register on dispatch.

The value in the \texttt{ContextID} register is also checked against the
hardware breakpoint and watchpoint registers, in generating debug exceptions.
Therefore, it \textbf{shall} be possible for authorised user-level code to
load the Context ID for a given dispatcher into a breakpoint register --- this
may be an invocation on the dispatcher capability.

\begin{table}
\begin{center}
\begin{tabular}{ll}
\texttt{tpidrro\_el0} & EL0 Read-Only Software Thread ID Register \\
\texttt{tpidr\_el0} & EL0 Read/Write Software Thread ID Register \\
\texttt{tpidr\_el1} & EL1 Read/Write Software Thread ID Register \\
\texttt{tpidr\_el2} & EL2 Read/Write Software Thread ID Register \\
\texttt{tpidr\_el3} & EL3 Read/Write Software Thread ID Register \\
\end{tabular}
\end{center}
\caption{Thread ID registers in ARMv8}
\label{t:threadid}
\end{table}

In addition to the \texttt{ContextID} register, used to tag TLB entries, ARMv8
also provides a set of thread ID registers with no architecturally-defined
semantics, as listed in \autoref{t:threadid}. The client-writeable
\texttt{tpidr\_el0} and \texttt{tpidr\_el1} \textbf{shall} have no CPU
driver-defined purpose, but \textbf{shall} be saved and restored in a
dispatcher's trap frame, to allow their use as thread-local storage (TLS).
Recall that the Barrelfish CPU driver has no awareness of threads, which are
implemented purely at user level.

To implement the upcall/dispatch mechanism of Barrelfish, the CPU driver and
the user-level dispatcher need to share a certain amount of state --- the
user-visible portion of the dispatcher control block, which contains the trap
frames, and the disabled flag (used to achieve atomic dispatch). The address
of this structure needs to be known to both the CPU driver, and to user-level
code, and moreover be efficiently accessible, as the CPU driver needs to find
the trap frame on the critical path of system calls and exceptions. This
pointer also needs to be trustworthy, from the CPU driver's perspective, and
thus cannot be directly modifiable by user-level code.

The x86-32, x86-64, and ARMv7 CPU drivers all store the address of the running
dispatcher's shared segment at a fixed known address, \texttt{dcb\_current},
which is loaded by the trap handler. At user level, on x86 this address is
held in a \emph{segment register} (\texttt{fs} on x86-64, and \texttt{gs} on
x86-32), while on ARMv7 we sacrifice a general-purpose register (\texttt{r9})
for this purpose. Using the \texttt{tpidrro\_el0} register to hold the address
of the current dispatcher structure will allow us to avoid both a memory load
on the fast path, and sacrificing a register in user-level code. Thus,
\texttt{tpidrro\_el0} \textbf{shall} hold the address of the currently-running
dispatcher.

\subsection{Instruction Sets}

ARMv8 supports both AArch64, and legacy ARM/Thumb (renamed AArch32). Switching
execution mode is only possible when changing exception level, i.e.\ on a trap
or return, and the mode can only be set from the higher exception level. Thus,
\texttt{EL2} can set execution mode for \texttt{EL1}, and \texttt{EL1} for
\texttt{EL0}. There is no way for a program to change its own execution mode.
If \texttt{ELn} is in AArch64, then \texttt{EL(n-1)} can be in either AArch64
or AArch32. If \texttt{ELn} is in AArch32, all lower ELs must also be AArch32.

The CPU driver \textbf{shall} execute in AArch64.

\textbf{Initially}, the CPU driver will enforce that all directly-scheduled
threads also use AArch64, by controlling all downward EL transitions. An
\texttt{EL1} client (such as Arrakis or a full virtual machine) may execute
its own \texttt{EL0} clients in AArch32 (and there is no way to prevent this).
However, all transitions into the CPU driver (\texttt{svc}, \texttt{hvc} or
exception) must come from a direct client of the CPU driver, and thus from
AArch64. The syscall ABI \textbf{shall} be AArch64.
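
The constraint on execution modes across exception levels, as stated in this
section, amounts to a simple monotonicity check. The following Python sketch
expresses it; the function name \texttt{valid\_execution\_modes} and the
string encoding of the two modes are assumptions made for this illustration.

\begin{lstlisting}[language=Python]
def valid_execution_modes(modes):
    """Check a per-EL assignment of execution modes.

    modes[n] is "aarch64" or "aarch32" for ELn, lowest level first.
    Rule from the text: once some ELn is AArch32, every lower
    exception level must be AArch32 as well.
    """
    seen_aarch32 = False
    for mode in reversed(modes):  # walk from the highest EL downwards
        if mode not in ("aarch64", "aarch32"):
            raise ValueError("unknown execution mode: %r" % (mode,))
        if seen_aarch32 and mode == "aarch64":
            return False  # AArch64 below an AArch32 level
        if mode == "aarch32":
            seen_aarch32 = True
    return True
\end{lstlisting}

For example, an AArch64 CPU driver at \texttt{EL1} and \texttt{EL2} with an
AArch32 application at \texttt{EL0} is accepted, whereas an AArch64
application beneath an AArch32 \texttt{EL1} is rejected.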

\textbf{Eventually}, Barrelfish \textbf{should} also support the execution of
AArch32 dispatcher processes, by marking each dispatcher with a flag
indicating the instruction set to be used (much as is already done with
VM/non-VM mode in the Arrakis CPU driver).

\subsection{User-Space Access to Architectural Functions}

Generally, anything that can be safely exported \textbf{should} be made
available outside of the CPU driver, preferably as a memory-mapped interface,
at 4kiB granularity. The SBSA mandates that devices be present at addresses
that can be individually mapped, thus this should not be a problem.

\subsection{Cache Management}

ARMv8 has moved most cache and TLB management from the system control
coprocessor (cp15), into the core ISA. Several cache operations
(invalidate/clean by VA) are executable at \texttt{EL0}, and thus no kernel
interface is required. The system must take into account that user-directed
flushes may have occurred, or may occur concurrently with any memory
operation.

\subsection{Performance Monitors}

Performance monitors \textbf{should} be exposed, if it can be done safely.

\subsection{Debugging}

Self-hosted debug \textbf{should} be exposed, if it can be done safely. This
is under active development.

\subsection{Booting}

Platform support i.e.~a standard set of peripherals, and a defined boot
process, has improved dramatically on ARM, as it has been repositioned as a
server platform. UEFI and ACPI support are widespread, including on the
Mustang development board. We will assume support for UEFI booting, and make
use of ACPI data, where available.

The Barrelfish CPU driver and initial image \textbf{shall} be loaded and
executed by a UEFI shim, which will pass through all UEFI-supplied
information, such as ACPI tables, and be able to interpret a Barrelfish
Multiboot image.
This shim, or second-stage bootloader, is called Hagfish,
and is described in \autoref{s:hagfish}.

\subsection{Interrupts}

ARMv8 interrupt handling is not substantially different from the existing
architectures and platforms supported by Barrelfish. While a redesign of the
Barrelfish interrupt system is under way (to use capabilities to grant access
to receive interrupts), we do not expect ARMv8 to pose any particular
challenges.

The ARMv8 systems we \textbf{initially} target all use minor variations on the
ARM Generic Interrupt Controller (GIC) design, already supported in
Barrelfish. We currently have support for version 2 of the GIC, with which
later implementations are backward-compatible. We will \textbf{eventually}
support GICv3, the current specification at time of writing.

\subsection{Inter-Domain Communication}

User-level communication between cache-coherent cores in Barrelfish for ARMv8
is likely to be the same as with ARMv7 and x86, and we expect the existing
User-level Message-Passing over Cache-Coherence (UMP-CC) interconnect driver
to work unmodified.

Between dispatchers on the same core, however, the different register set on
the ARMv8 is likely to result in a very different Local Message Passing (LMP)
interconnect driver---this is always an architecture-specific part of the CPU
driver. In practice, its design will be closely tied to the context switch and
upcall dispatch code.

\section{Hagfish}\label{s:hagfish}

The Barrelfish/ARMv8 UEFI loader prototype is called Hagfish\footnote{A
hagfish is a basal chordate i.e. something like the ancestor of all fishes.}.
Hagfish is a second-stage bootloader for Barrelfish on UEFI platforms,
initially the ARMv8 server platform.
Hagfish is loaded as a UEFI application,
and uses the large set of supplied services to do as much of the one-time
(boot core) setup that the CPU driver needs as is reasonably possible. More
specifically, Hagfish:

\begin{itemize}
\item Is loaded over BOOTP/PXE.
\item Reuses the PXE environment to load a menu.lst-style configuration.
\item Loads the kernel image and the initial applications, as directed, and
builds a Multiboot image.
\item Allocates and builds the CPU driver's page tables.
\item Activates the initial page table, and allocates a stack.
\end{itemize}

\subsection{Why Another Bootloader?}

The ARMv8 machines that we're porting to differ from both existing ARM
boards and x86. They have a full pre-boot environment, unlike most
embedded boards, but it's not a PC-style BIOS. The ARM Server Base Boot
Requirements specify UEFI. Moreover, there is no mainline support from GNU
GRUB for the ARMv8 architecture, so no matter what, we need some amount of
fresh code.

Given that we had to write at least a shim loader, and keeping in mind that
UEFI is multi-platform (and becoming more and more common in the x86 world),
we're taking the opportunity to simplify the initial boot process within the
CPU driver by moving the once-only initialisation into the bootloader. In
particular, while running under UEFI boot services, we have memory allocation
available for free, e.g. for the initial page tables. By moving ELF loading
and relocation code into the bootloader, we can eliminate the need to relocate
running code, and can cut down (hopefully eliminate) special-case code for
booting the initial core. Subsequent cores can rely on user-level Coreboot
code to relocate them, and to construct their page tables.

\subsection{Assumptions and Requirements}

Hagfish is (initially at least) intended to support development work on
AArch64 server-style hardware and, as such, makes the following assumptions:

\begin{itemize}
\item 64-bit architecture, using ELF binaries. Porting to 32-bit architectures
wouldn't be hard, if it were ever necessary (probably not).
\item PXE/BOOTP/TFTP available for booting. Hagfish expects to load its
configuration, and any binaries needed, using the same PXE context with which
it was booted. Changing this to boot from a local device (e.g. HDD) wouldn't
be hard, as the UEFI \texttt{LoadFile} interface abstracts from the hardware.
\end{itemize}

\subsection{Boot Process}

In detail, Hagfish currently boots as follows:

\begin{enumerate}
\item \texttt{Hagfish.efi} is loaded over PXE by UEFI, and is executed at a
runtime-allocated address, with translation (MMU) and caching enabled.
\item Hagfish queries EFI for the PXE protocol instance used to load it, and
squirrels away the current network configuration.
\item Hagfish loads the file \texttt{hagfish.A.B.C.D.cfg} from the TFTP server
root (where \texttt{A.B.C.D} is the IP address on the interface that ran PXE).
\item Hagfish parses its configuration, which is essentially a GRUB menu.lst,
and loads the kernel image and any additional modules specified therein. All
ELF images are loaded into page-aligned regions of type
\texttt{EfiBarrelfishELFData}.
\item Hagfish queries UEFI for the system memory map, then allocates and
initialises the initial page tables for the CPU driver (mapping all occupied
physical addresses, within the \texttt{TTBR1} window, see \autoref{s:layout}).
The frames holding these tables are marked with the EFI memory type\\
\texttt{EfiBarrelfishBootPagetable}, allocated from the OS-specific range
(\texttt{0x80000000}--\texttt{0x8fffffff}).
All memory allocated by Hagfish on
behalf of the CPU driver is page-aligned, and tagged with an OS-specific type,
to allow EFI and Hagfish regions to be safely reclaimed.
\item Hagfish builds a Multiboot 2 information structure, containing as much
information as it can get from EFI, including:
  \begin{itemize}
  \item ACPI 1.0 and 2.0 tables.
  \item The EFI memory map (including Hagfish's custom-tagged regions).
  \item Network configuration (the saved DHCP ack packet).
  \item The kernel command line.
  \item All loaded modules.
  \item The kernel's ELF section headers.
  \end{itemize}
\item Hagfish allocates a page-aligned kernel stack (type
\texttt{EfiBarrelfishCPUDriverStack}), of the size specified in the
configuration.
\item Hagfish terminates EFI boot services (calls \texttt{ExitBootServices}),
activates the CPU driver page table, switches to the kernel stack, and jumps
into the relocated CPU driver image.
\end{enumerate}

\subsection{Post-Boot State}

When the CPU driver on the boot core begins executing, it can assume the
following:

\begin{itemize}
\item The MMU is configured with all RAM and I/O regions mapped via
\texttt{TTBR1}.
\item The CPU driver's code and data are both fully relocated into one or more
distinct 4KiB-aligned regions.
\item The stack pointer is at the top of a distinct 4KiB-aligned region of at
least the requested size.
\item The first argument register holds the Multiboot 2 magic value.
\item The second holds a pointer to a Multiboot 2 information structure, in
its own distinct 4KiB-aligned region.
\item The console device is configured.
\item Only one core is enabled.
\item The Multiboot structure contains at least:
  \begin{itemize}
  \item The final EFI memory map, with all areas allocated by Hagfish to
  hold data passed to the CPU driver marked with OS-specific types, all of
  which refer to non-overlapping 4KiB-aligned regions:
    \begin{description}
    \item[\ttfamily EfiBarrelfishCPUDriver]
      The currently-executing CPU driver's text and data segments.
    \item[\ttfamily EfiBarrelfishCPUDriverStack]
      The CPU driver's stack.
    \item[\ttfamily EfiBarrelfishMultibootData]
      The Multiboot structure.
    \item[\ttfamily EfiBarrelfishELFData]
      The unrelocated ELF image for a boot-time module (including that for
      the CPU driver itself), as loaded over TFTP.
    \item[\ttfamily EfiBarrelfishBootPageTable]
      The currently-active page tables.
    \end{description}
  \item The CPU driver (kernel) command line.
  \item A copy of the last DHCP Ack packet.
  \item A copy of the section headers from the CPU driver's ELF image.
  \item Module descriptions for the CPU driver and all other boot modules.
  \item If UEFI provided an ACPI root table, the Multiboot structure
  contains a pointer to it.
  \end{itemize}
\end{itemize}

\subsection{Configuration}

Hagfish configures itself by loading a file whose path is generated from its
assigned IP address. Thus, if your development machine receives the address
192.168.1.100, Hagfish will load the file\\
\texttt{hagfish.192.168.1.100.cfg}
from the same TFTP server used to load it. The format is intended to be as
close as practical to that of an old-style GRUB menu.lst file. The example
configuration in \autoref{f:hag_config} loads
\texttt{/armv8/sbin/cpu\_apm88xxxx} as the CPU driver, with arguments
\texttt{loglevel=3}, and an 8192B (2-page) stack.
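
As a rough illustration of the format, a minimal configuration might look like
the sketch below. It uses only the directives that appear in
\autoref{f:hag_config}; the comments are an interpretation of those
directives, and the choice of modules is illustrative rather than a tested
boot set.

\begin{lstlisting}
# CPU driver image, followed by its command line
kernel /armv8/sbin/cpu_apm88xxxx loglevel=3
# Kernel stack size in bytes (two 4KiB pages)
stack 8192
# Boot modules, fetched over the same TFTP session
module /armv8/sbin/cpu_apm88xxxx
module /armv8/sbin/init
\end{lstlisting}

For a machine assigned the address 192.168.1.100, this file would be served
as \texttt{hagfish.192.168.1.100.cfg} from the TFTP server root.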

\begin{figure}[htb]
\begin{center}
\begin{lstlisting}
kernel /armv8/sbin/cpu_apm88xxxx loglevel=3
stack 8192
module /armv8/sbin/cpu_apm88xxxx
module /armv8/sbin/init

# Domains spawned by init
module /armv8/sbin/mem_serv
module /armv8/sbin/monitor

# Special boot time domains spawned by monitor
module /armv8/sbin/chips boot
module /armv8/sbin/ramfsd boot
module /armv8/sbin/skb boot
module /armv8/sbin/kaluga boot
module /armv8/sbin/spawnd boot bootarm=0
module /armv8/sbin/startd boot

# General user domains
module /armv8/sbin/serial auto portbase=2
module /armv8/sbin/fish nospawn
module /armv8/sbin/angler serial0.terminal xterm

module /armv8/sbin/memtest

module /armv8/sbin/corectrl auto
module /armv8/sbin/usb_manager auto
module /armv8/sbin/usb_keyboard auto
module /armv8/sbin/sdma auto
\end{lstlisting}
\end{center}
\caption{Hagfish configuration file}
\label{f:hag_config}
\end{figure}

\chapter{Technical Observations}\label{c:tech}

\section{User-Space Threading}\label{s:threads}

\begin{figure}[htb]
\begin{center}
\begin{minipage}[t]{0.3\textwidth}
\begin{lstlisting}
clrex
/* Restore CPSR */
ldr r0, [r1], #4
msr cpsr, r0
/* Restore registers */
ldmia r1, {r0-r15}
\end{lstlisting}
\end{minipage}
\hspace{2cm}
\begin{minipage}[t]{0.5\textwidth}
\begin{lstlisting}
/* Restore PSTATE, load resume
 * address into x18 */
ldp x18, x2, [x1, #(PC_REG * 8)]
/* Set only NZCV. */
and x2, x2, #0xf0000000
msr nzcv, x2
/* Restore the stack pointer and x30. */
ldp x30, x2, [x1, #(30 * 8)]
mov sp, x2
/* Restore everything else. */
ldp x28, x29, [x1, #(28 * 8)]
ldp x26, x27, [x1, #(26 * 8)]
ldp x24, x25, [x1, #(24 * 8)]
ldp x22, x23, [x1, #(22 * 8)]
ldp x20, x21, [x1, #(20 * 8)]
/* n.b. don't reload x18 */
ldr x19, [x1, #(19 * 8)]
ldp x16, x17, [x1, #(16 * 8)]
ldp x14, x15, [x1, #(14 * 8)]
ldp x12, x13, [x1, #(12 * 8)]
ldp x10, x11, [x1, #(10 * 8)]
ldp x8, x9, [x1, #( 8 * 8)]
ldp x6, x7, [x1, #( 6 * 8)]
ldp x4, x5, [x1, #( 4 * 8)]
ldp x2, x3, [x1, #( 2 * 8)]
/* n.b. this clobbers x0&x1 */
ldp x0, x1, [x1, #( 0 * 8)]
/* Return to the thread. */
br x18
\end{lstlisting}
\end{minipage}
\end{center}
\caption{\texttt{disp\_resume\_context} on ARMv7 (left) and ARMv8 (right)}
\label{f:disp_resume}
\end{figure}

The ARMv8 architecture is in some ways an improvement, and in other ways
problematic, for the sort of user-level threading implemented in Barrelfish,
via \emph{scheduler activations}. Under this scheme, the kernel (in Barrelfish
terms, the \emph{CPU driver}) does not schedule threads directly, but instead
exposes all scheduling-relevant events via \emph{upcalls} to predefined
user-level handlers (in Barrelfish, the \emph{dispatcher}), which then
implements thread scheduling (or something else entirely), as it sees fit.
This differs from the behaviour of a system such as UNIX, which only ever
restores a user-level execution context simultaneously with dropping from a
privileged to an unprivileged execution level.

Processor architectures are, understandably, designed with common software in
mind. Thus, the primitives available for restoring an execution context (i.e.\
register state) are often tied closely to those for changing privilege level.
A common design (which ARMv8 also implements) is the \emph{exception return},
where privileged code can atomically drop its privilege, and jump to a
user-level execution address. In ARMv8, the \texttt{eret} instruction
atomically updates the program state (PSTATE, most importantly the privilege
level bits), and branches to the address held in the \emph{exception link
register}, \texttt{elr}.
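
As a minimal sketch of this mechanism, privileged code running at
\texttt{EL1} might resume a user context as follows. The choice of
\texttt{x0} and \texttt{x1} as scratch registers is an assumption made for
illustration, not the Barrelfish CPU driver's actual convention.

\begin{lstlisting}
/* Assumed: x0 = resume address, x1 = saved program state */
msr elr_el1, x0    /* address that eret will branch to */
msr spsr_el1, x1   /* PSTATE to install, incl. target level */
/* ... restore all general-purpose registers here,
 *     x0 and x1 included ... */
eret               /* install PSTATE and branch, atomically */
\end{lstlisting}

Because the resume address travels in \texttt{elr\_el1} rather than in a
general-purpose register, every general-purpose register can hold the
resumed context's own value by the time \texttt{eret} executes.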

In implementing user-level threading, we're not concerned with privilege
levels, but the lack of some equivalent of \texttt{elr} is frustrating. Not
only does \texttt{eret} provide an atomic update of the program counter and
the program state, it does so without modifying any general-purpose register.
Replicating this behaviour at \texttt{EL0}, where \texttt{eret} is
unavailable, is problematic. ARMv8 differs from ARMv7, in that the program
counter can no longer be the target of a load instruction, but can only be
loaded via a general-purpose register.

Specifically, the only PC-modifying instructions (other than \texttt{eret})
are PC-relative branches (which are useless in this scenario) and
branch-to-register (of which \texttt{br}, \texttt{blr} and \texttt{ret} are
all special encodings). Since ARMv8 has also removed the \texttt{ldm} (load
multiple) instruction, there is no way to load the program counter with an
arbitrary value (the thread's restart address), without overwriting one of the
general-purpose registers. We cannot restore the thread's register value
\emph{before} we branch to it, as we'd overwrite the return address, and we
obviously can't do so afterwards, as the thread likely has no idea that it's
been interrupted. The only alternatives are to trampoline through kernel mode
in order to use \texttt{eret} (which would eliminate the speed benefit of
user-level threading), or to reserve a general-purpose register for use by the
dispatcher. Neither option is appealing, but we went with the second option,
reserving \texttt{x18}, reasoning that with 31 general-purpose registers
available, the loss of one isn't a huge penalty. Register \texttt{x18} is
explicitly marked as the \emph{platform register} in the AArch64 ABI
\citep{arm:aa64pcs}, for such a purpose.
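
Reserving \texttt{x18} only works if compiled user code never allocates it.
As a sketch, assuming a GCC or Clang AArch64 toolchain, the
\texttt{-ffixed-x18} option tells the compiler to leave the register alone.
The toolchain prefix and file names below are placeholders, and whether the
Barrelfish build passes this flag is an assumption rather than something
stated in this report.

\begin{lstlisting}
aarch64-linux-gnu-gcc -ffixed-x18 -c thread.c -o thread.o
\end{lstlisting}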

Future revisions of the ARM architecture could prevent this issue in a number
of ways: allowing the use of \texttt{eret} at \texttt{EL0} or providing an
equivalent functionality (specifically a non-general-purpose register such as
\texttt{elr}, that doesn't need to be restored); or alternatively, adding
indirect jumps (load to PC) back to the instruction set.

\autoref{f:disp_resume} compares the user-level thread resume code for the
Barrelfish dispatcher (function \texttt{disp\_resume}) for ARMv7 and ARMv8
side-by-side. The effect of removing the load-multiple instructions, and
direct-to-SP loads, on code density is clearly visible: everything on lines
8--29 for ARMv8 corresponds to the single \texttt{ldmia} instruction on line
6 for ARMv7: one instruction is now 18, on the thread-switch critical path!
Note also, on line 17, that the ARMv8 code does not restore the thread's
\texttt{x18}, but instead uses it to hold the branch address for use on line
29. The only improvement on ARMv8 is that the \texttt{clrex} (clear exclusive
monitor) instruction is no longer required, as the monitor is cleared on
returning from the kernel. Note also that the usual method to efficiently load
multiple registers, using 16-word SIMD (NEON) loads, isn't available, as
there's no guarantee that the SIMD extensions are enabled on this dispatcher,
and we cannot handle a fault in this code.

\section{Trap Handling}\label{s:traps}

\begin{figure}
\begin{lstlisting}
el0_aarch64_sync:
    msr daifset, #3 /* IRQ and FIQ masked, Debug and Abort enabled. */

    stp x11, x12, [sp, #-(2 * 8)]!
    stp x9, x10, [sp, #-(2 * 8)]!

    mrs x10, tpidr_el1
    mrs x9, elr_el1

    ldp x11, x12, [x10, #OFFSETOF_DISP_CRIT_PC_LOW]
    cmp x11, x9
    ccmp x12, x9, #0, ls
    ldr w11, [x10, #OFFSETOF_DISP_DISABLED]
    ccmp x11, xzr, #0, ls
    /* NE <-> (low <= PC && PC < high) || disabled != 0 */

    mrs x11, esr_el1 /* Exception Syndrome Register */
    lsr x11, x11, #26 /* Exception Class field is bits [31:26] */

    b.ne el0_sync_disabled

    add x10, x10, #OFFSETOF_DISP_ENABLED_AREA

save_syscall_context:
    str x7, [x10, #(7 * 8)]

    stp x19, x20, [x10, #(19 * 8)]
    stp x21, x22, [x10, #(21 * 8)]
    stp x23, x24, [x10, #(23 * 8)]
    stp x25, x26, [x10, #(25 * 8)]
    stp x27, x28, [x10, #(27 * 8)]
    stp x29, x30, [x10, #(29 * 8)] /* FP & LR */

    mrs x20, sp_el0
    stp x20, x9, [x10, #(31 * 8)]

    mrs x19, spsr_el1
    str x19, [x10, #(33 * 8)]

    cmp x11, #0x15 /* SVC or HVC from AArch64 EL0 */
    b.ne el0_abort_enabled

    add sp, sp, #(4 * 8)

    mov x7, x10

    b sys_syscall
\end{lstlisting}
\caption{BF/ARMv8 synchronous exception handler}
\label{f:sync_el0}
\end{figure}

\autoref{f:sync_el0} shows the CPU driver exception stub, for a synchronous
abort from \texttt{EL0}. This exception class includes system calls,
breakpoints, and page faults on both code and data. The effect of the loss of
store-multiple instructions is again visible, for example on lines 27--32.
Although not as severe as in the case of the user-level thread restore in
\autoref{s:threads}, the extra instructions required do constrain us somewhat,
as each trap handler is limited to 128 bytes, or 32 instructions, before
branching to another code block.
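
The 128-byte limit follows from the layout of the ARMv8 exception vector
table: sixteen slots spaced \texttt{0x80} bytes apart in a 2KiB-aligned table,
with synchronous exceptions from AArch64 \texttt{EL0} at offset
\texttt{0x400}. The sketch below, in GNU assembler syntax, shows one way to
lay this out. The table label, the continuation label and the placeholder
slot bodies are hypothetical; only the name \texttt{el0\_aarch64\_sync} is
taken from \autoref{f:sync_el0}.

\begin{lstlisting}
    .balign 0x800              /* table is 2KiB-aligned */
vector_table_sketch:           /* hypothetical label */
    /* 0x000-0x3ff: eight slots for exceptions
     * taken from the current exception level */
    .org vector_table_sketch + 0x400
el0_aarch64_sync:              /* sync, lower EL, AArch64 */
    /* ... at most 32 instructions ... */
    b   el0_sync_continue      /* spill out of the slot */
    .org vector_table_sketch + 0x480
el0_aarch64_irq:               /* next slot, 0x80 later */
    b   .                      /* placeholder */
    .org vector_table_sketch + 0x800
el0_sync_continue:             /* hypothetical continuation */
    b   .                      /* placeholder */
\end{lstlisting}

Since \texttt{.org} cannot move the location counter backwards, the assembler
rejects a handler that overflows its slot, which turns the size limit into a
build-time check.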

We were able to squeeze the necessary code into the space available, including
the optimised test for a disabled dispatcher at lines 10--14, but only by
splitting the page fault handler (\texttt{el0\_abort\_enabled}) into a
separate subroutine, incurring an unnecessary branch. A more significant
annoyance is that system calls (\texttt{svc} and \texttt{hvc}) are routed to
the same exception vector as page faults (aborts). The effect of this is that
we are forced to spill registers to the stack (\texttt{x9}--\texttt{x12} on
lines 4--5), even on the system call fast path, as we need at least one
register to check the exception syndrome (\texttt{esr\_el1}) to distinguish
aborts (where we must preserve all registers) from system calls (where we
could immediately begin using the caller-saved registers). Note that the code
on lines 27--32 only needs to stack the callee-saved registers, and leaves the
system call arguments in \texttt{x0}--\texttt{x7}, to be read as required by
\texttt{sys\_syscall} (in C).

This sort of mismatch between the exception-handling interface of the CPU
architecture, and what is required for really high-performance systems code,
is unfortunately extremely common. Unnecessary overheads, such as the
additional stacked registers here, hurt the performance of
highly-componentised systems, such as Barrelfish, which rely on frequently
crossing protection domains.

The relatively well-compressed boolean arithmetic on lines 10--14 demonstrates
that, even with the loss of ARM's fully-conditional instructions, the
conditional compares which remain are still relatively powerful.

\section{Cache Coherence}

One aspect of the ARM architecture that is of particular interest for the
Barrelfish project, but which we have not yet explored in depth, is the
configurable cache coherency and fine-grained cache management operations
available.
Any virtual mapping on a recent ARM architecture, including both
ARMv7 and ARMv8, can be tagged with various cacheability properties: inner
(L1), outer (L2+, usually), write-back or write-through. Combined with the
explicit flush operations at cache-line granularity, able to target either PoU
(point of unification, where data and instruction caches merge) or PoC (point
of coherency, typically RAM), a multi-core, multi-socket ARMv8 system would
make a very interesting testbed for investigating efficient cache management
and communication primitives for future partially-coherent architectures.
Indeed, the latest revision of the ARMv8 specification, ARMv8.2, introduced
flush to PoP, or \emph{point of persistence}, perhaps in response to
interest from well-known systems integration firms investigating large
persistent memories.

The design presented in this report is intended to expose as much control over
the caching hierarchy as possible to user-level code, to provide a platform
for future research.

\bibliographystyle{plainnat}
\bibliography{defs,barrelfish}

\end{document}