%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Copyright (c) 2013-2016, ETH Zurich.
% All rights reserved.
%
% This file is distributed under the terms in the attached LICENSE file.
% If you do not find this file, copies can be found by writing to:
% ETH Zurich D-INFK, Universitaetstrasse 6, CH-8092 Zurich. Attn: Systems Group.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\documentclass[a4paper,twoside]{report} % for a report (default)

\usepackage{bftn} % You need this
\usepackage{multirow}
\usepackage{listings}
\usepackage{color}
\usepackage{xspace}

\title{Barrelfish on ARMv7-A}   % title of report
\author{Simon Gerber \and Stefan Kaestle \and Timothy Roscoe \and
  Pravin Shinde \and Gerd Zellweger}
\tnnumber{017}  % give the number of the tech report
\tnkey{ARMv7-A} % Short title, will appear in footer

% \date{Month Year} % Not needed - will be taken from version history

\newcommand{\todo}[1]{\note{\textbf{TODO:} #1}}

\begin{document}
\maketitle

\newcommand{\code}[1]{{\lstinline!#1!}}
\newcommand{\file}[1]{{\lstinline!#1!}}
\newcommand{\mode}[1]{\texttt{#1} mode\xspace}

%configure listings properly
\lstset{%
  basicstyle=\small\ttfamily,
  escapechar=@
}

%
% Include version history first
%
\begin{versionhistory}
\vhEntry{0.1}{05.12.2013}{SK}{Initial version}
\vhEntry{0.2}{08.12.2015}{TR}{Rewritten for new ARMv7 code}
\vhEntry{1.0}{31.05.2016}{TR}{Newly-factored ARMv7 platform support}
\end{versionhistory}

% \intro{Abstract}		% Insert abstract here
% \intro{Acknowledgements}	% Uncomment (if needed) for acknowledgements
\tableofcontents		% Uncomment (if needed) for final draft
% \listoffigures		% Uncomment (if needed) for final draft
% \listoftables			% Uncomment (if needed) for final draft

\lstset{
  language=C,
  basicstyle=\ttfamily \small,
  flexiblecolumns=false,
  basewidth={0.5em,0.45em},
  boxpos=t,
}

\newcommand{\eclipse}{ECL\textsuperscript{i}PS\textsuperscript{e}\xspace}
\newcommand{\codesize}{\scriptsize}
\newcommand{\note}[1]{[\textcolor{red}{\emph{#1}}]}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Introduction}

This document describes the state of support for ARMv7-A processors in
Barrelfish.

ARM hardware is highly diverse, and has evolved over time.  As a
research OS, Barrelfish focusses ARM support on a small number of
platforms based on wide availability, ease of maintenance, and
research interest.   However, since management of hardware complexity
and diversity is also a research goal of the Barrelfish project, we
aim to make it easy to add new ARM-based platforms with a mixture of
traditional and non-traditional engineering techniques. 

The principal processors with 32-bit ARM support in Barrelfish at present are
ARMv7-A (Cortex A-series), in particular the Cortex A9. 

Past support for older ARM 32-bit architectures in Barrelfish included:
\begin{itemize}
\item ARMv7m (Cortex M-series), in particular the Cortex M3. 
\item ARMv5 processors, in particular the Intel iXP2800 network
  processor (which uses an XScale core). 
\item ARMv6 (ARM11MP) processors running under simulation in
  \file{qemu}. 
\end{itemize}

The main 32-bit ARM-based systems we target at present are:
\begin{itemize}
\item The Texas Instruments OMAP4460 SoC used in the Pandaboard ES
  platform. 
\item The ARM VExpress\_EMM board, under emulation in the GEM5
  simulator. 
\end{itemize}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Compilation}
\label{sec:armcompile}

Building Barrelfish with ARMv7 is straightforward; detailed
requirements for packages are described in the latest README file.

Compiling ARM support in Barrelfish requires a cross-compilation
toolchain on the programmers \code{PATH}.  For ARMv7 support we
track the GNU toolchain shipped with Ubuntu LTS  (14.04.3 at time of
writing). 

Once you have the right tools, run hake with the correct options,
e.g.:
\begin{lstlisting}
$ cd /build/barrelfish
$ /git/barrelfish/hake/hake.sh -a armv7 -s /git/barrelfish 
...
$
\end{lstlisting}

After running \code{hake} with appropriate architecture support
(i.e. use \code{-a armv7}), you can ask the Makefile what platforms it
supports:

\begin{lstlisting}
$ make help-platforms
------------------------------------------------------------------
Platforms supported by this Makefile.  Use 'make <platform name>':
 (these are the platforms available with your architecture choices)

 Documentation:
	 Documentation for Barrelfish
 PandaboardES:
	 Standard Pandaboard ES build image and modules
 ARMv7-GEM5:
	 GEM5 emulator for ARM Cortex-A series multicore processors
------------------------------------------------------------------
$ 
\end{lstlisting}

Then build:

\begin{lstlisting}
$ make -j 8 PandaboardES
\end{lstlisting}

\section{Building for GEM5}

To boot Barrelfish in GEM5, in addition to the previous steps you
will need a supported version of GEM5.  The GEM5 website
(\url{gem5.org}) has comprehensive information. 

Unfortunately, different
versions of GEM5 manifest different subtle bugs when emulating ARM
systems.  We recommend revision 0fea324c832c of GEM5 at present;
please let us know if you find a more recent version that works well. 

To fetch and build GEM5 on Ubuntu LTS:

\begin{lstlisting}
$ sudo apt-get install scons swig python-dev libgoogle-perftools-dev m4 protobuf-compiler libprotobuf-dev
$ hg clone http://repo.gem5.org/gem5 -r 0fea324c832c gem5
adding changesets
adding manifests
adding file changes
added 9356 changesets with 53499 changes to 6576 files
updating to branch default
3269 files updated, 0 files merged, 0 files removed, 0 files unresolved
$ cd ./gem5 
$ scons build/ARM/gem5.fast
...

$
\end{lstlisting}

GEM5 is a large system and may take some time to build.  In addition,
you may have to install minor fixes to ensure compilation (I had to
add some initializers to \file{mem/ruby/network/orion/Wire.cc}, for
example). 

After the compilation of GEM5 is finished, add the binary to your PATH.

Now, build Barrelfish like this:
\begin{lstlisting}
$ make -j 8 ARMv7-GEM5
\end{lstlisting}

It's a good idea to set \code{armv7_platform} in
\file{<build_dir>/hake/Config.hs} to \texttt{gem5} in order to enable
the cache quirk workarounds for GEM5 and proper offsets for the
platform simulated by GEM5.

You can also build Barrelfish and boot inside GEM5 in a single step:

\begin{lstlisting}
$ make help-boot
------------------------------------------------------------------
Boot instructions supported by this Makefile.  Use 'make <boot name>':
 (these are the targets available with your architecture choices)

 gem5_armv7:
	 Boot an ARMv7a multicore image in GEM5
 gem5_armv7_detailed:
	 Boot an ARMv7a multicore image in GEM5 using a detailed CPU model
$ make gem5_armv7
...
\end{lstlisting}

To get the output of Barrelfish you should:
\begin{lstlisting}
$ telnet localhost 3456
\end{lstlisting}

GEM5 is a highly configurable simulator.  You can print the supported
options of the GEM5 script as follows:

\begin{lstlisting}
$ gem5.fast gem5/gem5script.py -h
\end{lstlisting}

Note that if you boot using \code{make arm_gem5_detailed} rather than
\code{make arm_gem5}, the simulation takes a long time (depending on
your machine up to an hour just to boot Barrelfish).
 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Hardware assumptions and limitations}

The current state of ARMv7 support in Barrelfish makes a number of
assumptions about the underlying hardware platform, and also imposes
some limitations.  We discuss these here.

\section{No support for Large Physical Address Extensions}

The current Barrelfish design does not support LPAE for 32-bit ARM
processors.  Instead, it assumes a 32-bit physical address space.
Supporting LPAE would require changes to the paging code, but would
also require a mechanism to address user memory from the kernel
effectively (see below). 

\section{Physical RAM starts at 2GB}

Within the 32-bit physical address space, RAM is assumed to start at
the 2GB boundary (i.e. \code{0x80000000}).  This is the
architectural recommendation for Cortex-A series processors, and we
have yet to encounter non-LPAE ARMv7-A hardware which does not do
this.  Changing this assumption in the code should be possible, but in
practice is likely to be dominated by the other limitations mentioned
here. 

\section{Physical RAM is limited to 1GB}

The Barrelfish ARMv7 CPU drivers can handle up to 1GB RAM,
contiguously situated in the physical address space starting at 2GB.
This limit could be raised by half a Gigabyte or so, at the cost of
space for mapping kernel devices.  In practice, the CPU does not need
to map many kernel devices since most drivers run in user space on
Barrelfish.   Consequently, the allocation of the top 2GB of the
virtual address space betwen 1-1 mapped RAM and kernel hardware
devices could easily be moved. 

However, it remains that the total RAM visible to the CPU \emph{plus}
the mappings for any devices needed by the CPU driver must fit into
the top 2GB of the address space (mapped by the TTBR1 register).   

In particular, the CPU driver assumes that all physical RAM is mapped
1-1, and relies on this when performing capability invocations.   If
the system had more RAM that could be mapped 1-1 into kernel virtual
address space, we would need a method for the CPU driver to quickly
access arbitrary physical addresses, entailing some kind of paging
system.  


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Organization of the address space}

Like many other popular operating systems, Barrelfish employs a memory
split. The idea behind a memory split is to separate kernel code from
user space code in the virtual address space. This allows the kernel
to be mapped in every virtual address space of each user space
program, which is necessary to allow user space code to access kernel
features through the system call interface. If the kernel was not
mapped into the virtual address space of each program, it would be
impossible to jump to kernel code without switching the virtual
address space. 

\begin{figure}[htb]
  \centering
  \includegraphics[width=8cm]{figures/virtual_addressing.pdf}
  \caption{Barrelfish virtual address space layout for ARMv7-A}
  \label{fig:memory_layout}
\end{figure}

Additionally ARMv7-A provides two translation table
base registers, TTBR0 and TTBR1. We configure the system to use
TTBR0 for address translations of virtual addresses below 2GB and
TTBR1 for virtual address above 2GB. This saves us the explicit
mapping of the kernel pages into every L1 page table of each process.
Even though the kernel is mapped to each virtual address space, it is
invisible for the user space program. Accessing memory, which belongs
to the kernel, leads to a pagefault. Since many mappings can point to
the same physical memory, memory usage is not increased by this
technique.

Figure~\ref{fig:memory_layout} shows the memory layout of the complete
virtual address space of a single ARMv7-A core running Barrelfish. 

We have a memory split at 2GB, where everything upwards is only
accessible in privileged modes and the lower 2GB of memory is
accessible for user space programs. 

The kernel runs out of the kernel virtual address space where system
RAM is mapped 1-1; in the region between \texttt{0x80000000} and
\texttt{0xC0000000} RAM is mapped directly physical-to-virtual.

The L1 page table of the kernel address space is located inside the
data segment of the kernel right after the
kernel and naturally aligned to 16KB.  

We map the whole available physical memory into the kernel’s virtual
address space using ``sections'' (1MB large pages), obviating the need
for a kernel L2 page table. 

Above \texttt{0xC0000000}, the CPU driver maps regions of physical
memory corresponding to hardware devices it needs to directly access
(typically the UARTs, interrupt controller, timers, Snoop Control
Unit, and a few others).  These are also mapped using sections.
Virtual address regions are allocated in 1MB increments (the size of a
section mapping) working down from the top section, which is used to
map the area of RAM containing the CPU driver's exception vectors. 

Below the \texttt{0x80000000}, all mappings are handled by TTBR0 and
changed on every context switch.  At startup, the kernel uses another
page table (also 16kB-aligned and located inside its data segment) to
map low memory virtual-to-physical as well, as a way to access
hardware devices in this region before the rest of the system has come
up.  However, after the early stages of bootstrap this table is no
longer used. 

Instead, TTBR0 is always loaded with the address of a user domain's
hardware page table and changes on a context switch.  TTBR1 does not
change, ensuring the kernel mappings are static after boot.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Boot sequence}

\section{BSP (initial) core}

\begin{enumerate}
\item \file{boot.S:start} is called by the bootloader.  It
  sets the processor \mode{System}, sets up the (single) kernel
  stack, the global object table pointer, and jumps to
  \file{arch_init}.
\item \file{init.c:arch_init} is called with a single
  argument: the address of the multiboot info block.  It first
  initializes the serial console \file{serial_early_init} and
  checks to see if this is the BSP.  If so, it calls
  \file{bsp_init}. 
\item \file{init.c:bsp_init} reads information from the multiboot
  info into the global data structure, initializing it.  It also
  resets global spinlocks, and sizes RAM (though this information is
  not yet used).  It returns.
\item \file{init.c:arch_init} continues by initialzing paging,
  calling:
\item \file{paging.c:paging_init} populates the two initial page
  tables (one for each base register).  The kernel (upper) page table
  is initialized to map 1GB of RAM at 0x80000000, and the exception
  vectors at the top of memory.  The initial user (lower) page table
  is set to map the lower 2GB of the physical address space 1-1 to
  enable early device access.  The MMU is then enabled.
\item \file{init.c:arch_init} continues with the MMU enabled by
  jumping at: 
\item \file{init.c:arch_init_2} which initializes exceptions,
  relocating the current KCB, parses the command line arguments, and
  re-initializes the serial ports so that the UART hardware is now
  mapped correctly into kernel address space with a section mapping. 

  It then initializes the GIC, the Snoop Control Unit, the Global
  Timer, and the Time Slice Counter.  Cycle counter access from
  \mode{User} is enabled, and the coreboot spawn handler set up.  It
  then calls:

\item \file{startup_arch.c:arm_kernel_startup} which initializes
  a simple memory allocator from the global structure, allocates the
  a new KCB, and calls:

\item \file{startup_arch.c:spawn_bsp_init} which creates the
  initial kernel data structures for spawning the init process.  It
  also creates the initial capabilities for init to use to allocate
  memory, and returns. 

\item \file{startup_arch.c:arm_kernel_startup} continues
  but calling \code{dispatch} on the init DCB, and we are now up and
  running. 
\end{enumerate}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Exception code paths}

ARMv7-A exceptions are initialized in
\file{exceptions.S:exceptions_init}, which for some reason is
written in assembly.  It assumes the core is running in \mode{System}. 

There is a 256-byte statically-allocated stack for each exception
mode, and an 8kB stack used for subsequently calling into C in
\mode{System}, all defined in \file{exceptions.S}. 

Most exception handlers in the vector table start by checking whether
the processor was in \mode{User} or not when the trap happened.  In most
cases, if the processor was not in \mode{User}, the result is that
\mode{System} is entered and the processor jumps to
\file{exn.c:fatal_kernel_fault}, which panics.   Exceptions to this
rule are noted below. 

For exceptions taken while the processor is in \mode{User}, the
address of the current (user space) dispatcher is loaded (macro \\
\code{get_dispatcher_shared_arm}), and a check is made to see if the
dispatcher is ``enabled'' (in other words, whether the dispatcher
should be upcalled when next dispatched).   

This latter check is performed by the macro \code{disp_is_disabled},
and returns non-zero if:
\begin{enumerate}
\item The \code{disabled} value in the dispatcher (at offset
  \code{OFFSET_OF_DISP_DISABLED}) is non-zero, \emph{or}
\item The PC lies between the two values in the dispatcher with
  offsets \code{OFFSETOF_DISP_CRIT_PC_LOW} and
  \code{OFFSETOF_DISP_CRIT_PC_HIGH}\footnote{A trick suggested by
    Justin Cappos to allow an atomic resume of a user-level thread
    without entering the kernel}. 
\end{enumerate}

Depending on this, context is saved in a different area of the
dispatcher, \mode{System} is entered, and a call is made to C code as
noted below. 

Taking each exception in turn:

\section{Reset exception}

This is vector 0x00, and is not used. 

\section{Undefined Instruction exception}

This is vector offset 0x04, and is referred to as 
\code{ARM_EVECTOR_UNDEF} in the source.   
The processor enters \code{undef_handler} in \mode{Undefined}.
Context is saved in either the \texttt{ENABLED} or \texttt{TRAP} area.
C is entered at \code{exn.c:handle_user_undef}. 

\section{Supervisor call (software interrupt)}

This is vector offset 0x08, and referred to as
\code{ARM_EVECTOR_SWI} in the source. 
The processor enters \code{swi_handler} in \mode{Supervisor}.

If the syscall was issued from user space, context is saved in either
the \code{ENABLED} or \code{DISABLED} area.  C is entered at
\code{syscall.c:sys_syscall}.  

If the syscall was issued from kernel space, no context is saved and
C is entered at \code{syscall.c:sys_syscall_kernel}. 

\section{Prefetch Abort exception}

This is vector offset 0x0C, and referred to as
\code{ARM_EVECTOR_PABT} in the source. 
The processor enters \code{pabt_handler} in \mode{Abort}.

Context is saved in either the \texttt{ENABLED} or \texttt{TRAP} area.
C is entered at \code{exn.c:handle_user_page_fault}. 

\section{Data Abort exception}

This is vector offset 0x10, and referred to as
\code{ARM_EVECTOR_DABT} in the source. 
The processor enters \code{dabt_handler} in \mode{Abort}.

Context is saved in either the \texttt{ENABLED} or \texttt{TRAP} area.
C is entered at \code{exn.c:handle_user_page_fault} with the faulting
address in \code{r0}. 

\section{Hyp Trap, or Hyp mode entry}

This is vector offset 0x14, and is not used in Barrelfish.

\section{IRQ interrupt}

This is vector offset 0x18, and referred to as
\code{ARM_EVECTOR_IRQ} in the source. 
The processor enters \code{irq_handler} in \mode{IRQ}. 

If the syscall was issued from user space, context is saved in either
the \code{ENABLED} or \code{DISABLED} area.  C is entered at
\code{exn.c:handle_irq}.  

If the syscall was issued from kernel space, context is saved in
\code{irq_save_area}, \mode{System} is entered, and C is called at
\code{exn.c:handle_irq}. 

\section{Fast interrupt}

This is vector offset 0x1C, and referred to as
\code{ARM_EVECTOR_FIQ} in the source. 
The processor enters \code{fiq_handler} in \mode{FIQ}. 

If the syscall was issued from user space, context is saved in either
the \code{ENABLED} or \code{DISABLED} area.  C is entered at
\code{exn.c:handle_irq} (as for IRQ).

If the syscall was issued from kernel space, context is saved in
\code{irq_save_area}, \mode{System} is entered, and C is called at
\code{exn.c:handle_irq} (as for IRQ).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{The Dispatch mechanism}

Each time a CPU driver decides to switch to running a domain, it
dispatches the domain in one of two ways:

\begin{description}
\item[RESUME], also known as ``disabled'': in this mode, the domain is
  resumed exactly where it was preempted before, much as in operating
  systems like Unix. 
\item[UPCALL], also known as ``enabled'': as with Scheduler
  Activations, the domain is upcalled at a fixed address with a new
  context on a small, dedicated stack.  The context of the
  previously-running thread in teh domain is available to be resumed
  in user space, if the user-level scheduler (also known as the
  activation handler) decides to. 
\end{description}

Which one of these happens depends on the state of the domain.  

When a domain is running in user space (i.e. the kernel is \emph{not}
executing) the domain is in one of two states, indicated by a
combination of:
\begin{itemize}
\item the \code{disabled} field of the \code{struct
  dispatcher_shared_generic} structure,
\item the current program counter,
\item the \code{crit_pc_low} and \code{crit_pc_high} fields of the \code{struct
  dispatcher_shared_generic} structure.
\end{itemize}

Note that all of these values can be written by the user program. 

Specifically, the domain is in \code{RESUME} state \emph{iff}:
\begin{enumerate}
\item \code{disabled} is \code{true}, \emph{or}
\item the current program counter lies between \code{crit_pc_low} and
  \code{crit_pc_high} 
\end{enumerate}

Otherwise, it is in state \code{UPCALL}.  

Once the kernel is entered, the \code{disabled} flag of the domain's
\code{struct dcb} structure (as opposed to the \code{struct
  dispatcher_shared_generic}) is updated to reflect the state of the
preempted domain. 


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Key data structures}

\begin{itemize}
\item \code{struct dcb}: in \code{kernel/include/dispatch.h};
  the main domain control block.   

  \code{dcb_current} is a global pointer in the CPU driver that points
  to the current DCB. 
  
  If \code{dp} is of type \core{struct dcb *}, then
  \code{dp->disabled} is a flag which is 1 if the current DCB has
  activations disabled (i.e. it should be resumed when next scheduled
  to run) and 0 otherwise (in which case it should be upcalled) - the
  analogy is with enabling and disabling interrupts.   The flag is set
  on entry to the kernel.

\item \code{struct dispatcher_shared_generic}: in
  \code{include/barrelfish_kpi/dispatcher_shared.h}: the
  architecture-independent part of the a dispatcher, the user-space
  datastructure corresponding to a DCB.   This is the first struct in
  architecture-dependent variants, such as \code{struct
    dispatcher_shared_arm}.

  If \code{dp} is of type \core{struct dispatcher_shared_generic *}, then
  \code{dp->disabled} is a flag which is 1 if the current DCB has

\end{itemize}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Hardware abstraction layers}

Barrelfish distinguishes between:
\begin{itemize}
\item General code
\item Architecture-specific code (e.g. ARMv7-A code)
\item Platform-specific code (e.g. code for the OMAP4460 SoC)
\end{itemize}

Since most Barrelfish device drivers run in userspace, the difference
between ``platform'' as a chip (such as the OMAP4460) and ``platform''
as a board or complete machine (such as the PandaBoard ES) are
relatively unimportant inside the CPU driver, since most of the
platform-specific CPU driver code is actually specific to a chip or
SoC.

Barrelfish CPU driver source code for ARMv7-A systems therefore
consists of the following categories:
\begin{itemize}
\item Portable, architecture-independent code.
\item ARMv7-A-specific code which common to all ARMv7-A platforms
\item Code for particular devices or macrocells which are only used on
  ARMv7-A, but might appear on multiple ARMv7-A platforms.
\item Platform-specific code. 
\end{itemize}

We restrict platform-specific code to a single source file, which
roughly corresponds to ARM's concept of an ``integrator'', and acts as
a compilation-time indirection layer between commmon ARMv7-A-specific
code and individual device and macrocell drivers.  

\section{The ARMv7-A HAL}

Platform code for a Barrelfish ARMv7-A CPU driver must implement the
following interfaces:

\begin{description}
\item[serial.h]: Low-level drivers for a multiple UART devices.
\item[spinlock.h]: Some number of static spinlocks, used for
  coordinating access to e.g. serial devices between CPU drivers on
  different cores. 
\end{description}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Code organization}

The variety of ARM platforms make organizing source trees to maximise
code reuse across different platforms a challenge. 

Barrelfish distinguishes between \emph{Architectures}, which are
typically processor architectures like ``ARMv7-A'', and \emph{Platforms},
which are complete system targets, like ``PandaBoard-ES''. 

Code and headers specific to a particular architecture are found in
the source tree is various subdirectories of the form
\file{../arch/armv7/}.  

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Versatile Express platform}

%--------------------------------------------------
\chapter{GEM5 specifics}

The GEM5~\cite{gem5:sigarch11} simulator combines the best aspects of
the M5~\cite{m5:micro06} and GEMS~\cite{gems:sigarch05}
simulators. With its flexible and highly modular design, GEM5 allows
the simulation of a wide range of systems. GEM5 supports a wide range
of ISAs like x86, SPARC, Alpha and, in our case most importantly,
ARM. In the following we will list some features of GEM5.

GEM5 supports four different CPU models: AtomicSimple, TimingSimple,
In-Order and O3. The first two are simple one-cycle-per-instruction
CPU models. The difference between the two lies in the way they handle
memory accesses. The AtomicSimple model completes all memory accesses
immediately, whereas the TimingSimple CPU models the timing of memory
accesses. Due to their simplicity, the simulation speed is far above
the other two models.  The InOrder CPU models an in-order pipeline and
focuses on timing and simulation accuracy. The pipeline can be
configured to model different numbers of stages and hardware threads.
The O3 CPU models a pipelined, out-of-order and possibly superscalar
CPU model. It simulates dependencies between instructions, memory
accesses, pipeline stages and functional units. With a load/store
queue and reorder buffer its possible to simulate superscalar
architectures as well as multiple hardware threads.

The GEM5 simulator provides a tight integration of Python into the
simulator. Python is mainly used for system configuration. Every
simulated building block of a system is implemented in C++ but are
also reflected as a Python class and derive from a single superclass
SimObject. This provides a very flexible way of system construction
and allows to tailor nearly every aspect of the system to our needs.
Python is also used to control the simulation, taking and restoring
snapshots as well as all the command line processing.

We use a VExpress\_EMM based system to run Barrelfish. The number of
cores can be passed as an argument to the GEM5 script. Cores are
clocked at 1 GHz and main memory is 64 MB starting at 2 GB.

\section{Boot process: first (bootstrap) core}

% Source: Samuel's thesis, 4.1.1

This section gives a high-level overview of the boot up process of the
Barrelfish
kernel on ARMv7-a. In subsequent sections we will go more into details
involved
in the single steps.
\begin{enumerate}
\item Setup kernel stack and ensure privileged mode
\item Allocate L1 page table for kernel
\item Create necessary mappings for address translation
\item Set translation table base register (TTBR) and domain
  permissions
\item Activate MMU, relocate program counter and stack pointer
\item Invalidate TLB, setup arguments for first C-function arch init
\item Setup exception handling
\item Map the available physical memory in the kernel L1 page table
\item Parse command line and set corresponding variables
\item Initialize devices
\item Create a physical memory map for the available memory
\item Check ramdisk for errors
\item Initialize and switch to init’s address space
\item Load init image from ramdisk into memory
\item Load and create capabilities for modules defined by menu.lst
\item Start timer for scheduling
\item Schedule init and switch to user space
\item init brings up the monitor and mem serv
\item monitor spawns ramfsd, skb and all the other modules
\end{enumerate}

\section{Boot process: subsequent cores}

% Source: Samuel, 4.2.2

The boot up protocol for the multi-core port differs in various ways
from the boot up procedure of our previous single-core port. We
therefore include this revised overview here. The first core is called
the bootstrap processor and every subsequent core is called an
application processor On bootstrap processor:

\begin{enumerate}
\item Pass argument from bootloader to first C-function arch
  init 18
\item Make multiboot information passed by bootloader globally
  available
\item Create 1:1 mapping of address space and alias the same region at
  high memory
\item Configure and activate MMU
\item Relocate kernel image to high memory
\item Reset mapping, only map in the physical memory aliased at high
  memory
\item Parse command line and set corresponding variables
\item Initialize devices
\item Initialize and switch to init’s address space
\item Load init image into memory
\item Create capabilities for modules defined by the multiboot info
\item Schedule init and switch to user space
\item init brings up the monitor and mem serv
\item monitor spawns ramfsd, skb and all the other modules
\item spawnd parses its cmd line and tells the monitor to bring up a
  new core
\item monitor setups inter-monitor communication channel
\item monitor allocates memory for new kernel and remote monitor
\item monitor loads kernel image and relocates it to destination
  address
\item monitor setups boot information for new kernel
\item spawnd issues syscall to start new core
\item Kernel writes entry address for new core into SYSFLAG registers
\item Kernel raises software interrupt to start new core
\item Kernel spins on pseudo-lock until other kernel releases it
\item repeat steps 15 to 23 for each application processor
\end{enumerate}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{OMAP44xx platform}

% Source: Claudio 3.1

The OMAP4460 is a system on a chip (SoC) by Texas Instruments,
intended for use in consumer devices like smartphones and tablet
computers. It contains:

\begin{itemize}
\item A dual core ARM Cortex-A9 processor
\item Two ARM Cortex-M3 processors
\item A hardware spinlock module
\item A mailbox module
\item Many devices to process media input and output
\end{itemize}

The intention is that the Cortex-A9 will be running a general purpose
operating system, while the Cortex-M3 processors will only be running
a real-time operating system to control the imaging subsystem.

The processor configuration in the OMAP4460 is somewhat
unconventional; for example, the Cortex-M3 processors share a
custom MMU with page faults handled by code running on the Cortex-A9
processors and hence are constrained to run in the same virtual
address at all times.  They are also not cache-coherent with the
Cortex-A9 cores. 

\section{Compiling and booting}

To compile Barrelfish for the Pandaboard, first configure your
toolchain as described in Section~\ref{sec:armcompile}. Then execute: 

\begin{lstlisting}
cd @\shell@SRC
mkdir build
cd build
../hake/hake.sh -a armv7 -s ../
make pandaboard_image
\end{lstlisting}

The resulting image can be booted on the Pandaboard over the USB OTG
connector using the standard \texttt{usbboot} utility.  It will
generate console output on the Pandaboard's serial connector.

\section{Booting the second OMAP A9 core}

% source: AOS m6

Here is a brief overview of how the bootstrapping process for the second core
works: it waits for a signal from the BSP core (an interrupt), and when this
signal is received, the application core will read an address from a well-
defined register and start executing the code from this address.

To boot the second core, one can write the address of
a function to the register and send the inter-processor
interrupt. Following are some pointers to the documentation to help
understand the bootstrapping process in more detail:

\begin{itemize}
\item Section 27.4.4 in the OMAP44xx manual talks about the boot process for
  application cores.
\item Pages 1144 \textit{ff.} in the OMAP44xx manual have the register
  layout for the registers that are used in the boot process of the
  second core. 
\end{itemize}

Note that the Barrelfish codebase distinguishes between the BSP (bootstrap)
processor and APP (application) processors. This distinction and naming
originates from Intel x86 support where the BIOS will choose a
distinguished BSP processor at start-up and the OS 
is responsible for starting the rest of the processors (the APP
processors). Although it works somewhat differently on 
ARM, the naming convention is applicable here as well.

Note also that the second core will start working with the MMU
disabled, so is running in physical address space.  The bootstrapping
code sets up a stack, initial page tables and an initial Barrelfish
dispatcher.

\section{Physical address space}

At present, a temporary limitation in the core boot protocol means
that running Barrelfish on both A9 cores requires static partitioning of
the available RAM into two halves, with an independent memory server
running on each core.   This is will fixed in a subsequent release. 

\section{Interconnect driver}\label{sec:interconnect}

Communication between A9 cores on the OMAP processor is performed
using a variant of the CC-UMP interconnect driver, modified for the
32-byte cache line size of the ARMv7 architecture.  A notification
driver for inter-processor interrupts exists. 

The OMAP4460 also has mailbox hardware which can be used by both the
A9 and M3 cores.  Barrelfish support for this hardware is in
progress. 

\section{M3 cores}

Barrelfish also has rudimentary support for running on both the A9 and
M3 cores.  This is limited by the requirement that the M3 cores must
run in the same virtual address space, and do not have a way to
automatically change address space on a kernel trap.  For this reason,
we only execute on a single M3 core at present. 

Before the Cortex-M3 can start executing code, the following steps
have to be taken by the Cortex-A9:

\begin{enumerate}
\item Power on the Cortex-M3 subsystem
\item Activate the Cortex-M3 subsystem clock
\item Load the image to be executed into memory
\item Enable the L2 MMU
\item Set up mappings for the loaded image in the L2 MMU (can be
  written directly into the TLB)
\item Write the first two entries of the vectortable (initial sp and
  reset vector)
\item Take the Cortex-M3 out of reset
\end{enumerate}

It is important to note that the Cortex-M3 is in a virtual address
space from the very beginning, reading the vector table at virtual
address 0. Inserting a 1:1 mapping for the kernel image greatly
simplifies the bootstrapping of memory management on the Cortex-M3
once it is running, because it needs to know the physical address of
the page tables it sets up.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bibliographystyle{abbrv}
\bibliography{defs,barrelfish}

\end{document}

\end{document}