%
% Copyright 2014, General Dynamics C4 Systems
%
% This software may be distributed and modified according to the terms of
% the GNU General Public License version 2. Note that NO WARRANTY is provided.
% See "LICENSE_GPLv2.txt" for details.
%
% @TAG(GD_GPL)
%

\chapter{\label{ch:threads}Threads and Execution}

\section{Threads}
\label{sec:threads}

seL4 provides threads to represent an execution context and manage
processor time. A thread is represented in seL4 by its thread control block
object (\obj{TCB}). Each \obj{TCB} has an associated CSpace (see
\autoref{ch:cspace}) and VSpace (see \autoref{ch:vspace}) which
may be shared with other threads. A \obj{TCB} may also have an IPC buffer
(see  \autoref{ch:ipc}), which is used to pass extra arguments during IPC
or kernel object invocation that do not fit in the architecture-defined message
registers. While it is not compulsory that a thread has an IPC buffer,
it will not be able to perform most kernel invocations, as they require
cap transfer.
Each thread belongs to exactly one security domain (see
\autoref{sec:domains}).
%FIXME: there is much more information held in the TCB!

\subsection{Thread Creation}
\label{sec:thread_creation}

Like other objects, \obj{TCB}s are created with the
\apifunc{seL4\_Untyped\_Retype}{untyped_retype} method (see
\autoref{sec:kernmemalloc}). A newly created thread is initially inactive. It
is configured by setting its CSpace and VSpace with the
\apifunc{seL4\_TCB\_SetSpace}{tcb_setspace}
or \apifunc{seL4\_TCB\_Configure}{tcb_configure} methods and then calling
\apifunc{seL4\_TCB\_WriteRegisters}{tcb_writeregisters} with an initial stack pointer and instruction
pointer. The thread can then be activated either by setting the
\texttt{resume\_target} parameter in the \apifunc{seL4\_TCB\_WriteRegisters}{tcb_writeregisters} invocation to true
or by seperately calling the \apifunc{seL4\_TCB\_Resume}{tcb_resume} method. In multicore machines, the thread
would be running on the same CPU which originally created the \obj{TCB}. However, it could be migrated to other CPUs
by calling \apifunc{seL4\_TCB\_SetAffinity}{tcb_setaffinity}.

\subsection{Thread Deactivation}
\label{sec:thread_deactivation}

The \apifunc{seL4\_TCB\_Suspend}{tcb_suspend} method deactivates a thread.
Suspended threads can later be resumed.
Their suspended state can be retrieved with the
\apifunc{seL4\_TCB\_ReadRegisters}{tcb_readregisters} and
\apifunc{seL4\_TCB\_CopyRegisters}{tcb_copyregisters} methods.
They can also be reconfigured and
reused or left suspended indefinitely if not needed. Threads will be
automatically suspended when the last capability to their \obj{TCB} is
deleted.
% an example of which is demonstrated in \nameref{ex:second_thread}.

\subsection{Scheduling}
\label{sec:sched}

seL4 uses a preemptive round-robin scheduler with 256 priority levels.
All threads have a maximum controlled priority (MCP) and a priority, the latter being the effective
priority of the thread.
When a thread modifies a another threads priority (including itself) it must provide a
thread capability from which to use the MCP from. Threads can only set priorities and MCPs
to be less than or equal to the provided thread's MCP.
The initial task starts with an MCP and priority as the highest priority in the system (\texttt{seL4\_MaxPrio}).
Thread priority and MCP can be set with \apifunc{seL4\_TCB\_SetPriority}{tcb_setpriority}
and \apifunc{seL4\_TCB\_SetMCPriority}{tcb_setmcpriority} methods.

\subsection{Exceptions}

Each thread has an associated exception-handler endpoint. If the thread
causes an exception, the kernel creates an IPC message with the relevant
details and sends this to the endpoint. This
thread can then take the appropriate action. Fault IPC messages are
described in \autoref{sec:faults}.

In order to enable exception handlers, a capability to the exception-handler
endpoint must exist in the CSpace of the thread that generates the exception.
The exception-handler
endpoint can be set with the \apifunc{seL4\_TCB\_SetSpace}{tcb_setspace} or
\apifunc{seL4\_TCB\_Configure}{tcb_configure} method.
With these methods, a capability address for the exception handler can be associated with a thread.
This address is then used to lookup the handler endpoint when an exception is generated.
Note, however, that these methods make no attempt to check whether an endpoint capability exists at the specified
address in the CSpace of the thread. The capability is only looked up
when an exception actually happens and if the lookup fails then no
exception message is delivered and the thread is suspended indefinitely.

The exception endpoint must have send and grant rights. Replying to the
exception message restarts the thread. For certain exception types, the contents of
the reply message may be used to set the values in the registers of the
thread being restarted.
See \autoref{sec:faults} for details.


\subsection{Message Layout of the Read-/Write-Registers Methods}
\label{sec:read_write_registers}

The registers of a thread can be read and written with the
\apifunc{seL4\_TCB\_ReadRegisters}{tcb_readregisters} and \apifunc{seL4\_TCB\_WriteRegisters}{tcb_writeregisters} methods.
For some registers, the kernel will silently mask certain bits or ranges of bits off, and force them to contain certain
values to ensure that they cannot be maliciously set to values that would compromise the running system, or to respect
values that the architecture specifications have mandated to be certain values.
The register contents are transferred via the IPC buffer.

\section{Faults}
\label{sec:faults}

A thread's actions may result in a fault. Faults are delivered to the
thread's exception handler so that it can take the appropriate action.
The fault type is specified in the message label and is one of:
seL4\_Fault\_CapFault, seL4\_Fault\_VMFault, seL4\_Fault\_UnknownSyscall, seL4\_Fault\_UserException, seL4\_Fault\_DebugException,
or seL4\_Fault\_NullFault (indicating no fault occurred and this is a normal IPC message).

Fault are delivered in such a way as to imitate a Call from the faulting
thread. This means that to send a fault message the fault endpoint
must have both write and grant permissions.

\subsection{Capability Faults}

Capability faults may occur in two places. Firstly, a capability fault
can occur when lookup of a capability referenced by a
\apifunc{seL4\_Call}{sel4_call} or \apifunc{seL4\_Send}{sel4_send} system call
failed (\apifunc{seL4\_NBSend}{sel4_nbsend} calls on
invalid capabilities silently fail). In this case, the capability
on which the fault occurred may be the capability being invoked or an
extra capability passed in the \texttt{caps} field in the IPC buffer.

Secondly, a capability fault can occur when \apifunc{seL4\_Recv}{sel4_recv} or \apifunc{seL4\_NBRecv}{sel4_nbrecv}
is called on a capability that does not exist, is not an endpoint or notification capability or does not have
receive permissions.

Replying to the fault IPC will restart the faulting thread. The contents of the
IPC message are given in \autoref{tbl:ipc_contents}.\\

\begin{table}[htb]
\noindent\begin{tabularx}{\textwidth}{XX}
\toprule
    \textbf{Meaning} & \textbf{ IPC buffer location} \\
\midrule
    Address at which to restart execution & \ipcbloc{seL4\_CapFault\_IP} \\
    Capability address & \ipcbloc{seL4\_CapFault\_Addr} \\
In receive phase (1 if the fault happened during a receive system call, 0
    otherwise) & \ipcbloc{seL4\_CapFault\_InRecvPhase} \\
Lookup failure description. As described in \autoref{sec:lookup_fail_desc} &
    \ipcbloc{seL4\_CapFault\_LookupFailureType} \\
\bottomrule
\end{tabularx}
\caption{\label{tbl:ipc_contents}Contents of an IPC message.}
\end{table}

\subsection{Unknown Syscall}
\label{sec:unknown-syscall}

This fault occurs when a thread executes a system call with a syscall
number that is unknown to seL4.
The register set
of the faulting thread is passed to the thread's exception handler so that it
may, for example, emulate the system call if a thread is being
virtualised.

Replying to the fault IPC allows the thread to be restarted
and/or the thread's register set to be modified. If the reply has
a label of zero, the thread will be restarted. Additionally, if the
message length is non-zero, the faulting thread's register set will be
updated. In this case, the number of
registers updated is controlled with the length field of the message
tag.

\subsection{User Exception}

User exceptions are used to deliver architecture-defined exceptions. For
example, such an exception could occur if a user thread attempted to
divide a number by zero.

Replying to the fault IPC allows the thread to be restarted
and/or the thread's register set to be modified. If the reply has
a label of zero, the thread will be restarted. Additionally, if the
message length is non-zero, the faulting thread's register set will be
updated. In this case, the number of
registers updated is controlled with the length field of the message
tag.

\subsection{Debug Exception: Breakpoints and Watchpoints}
\label{sec:debug_exceptions}

Debug exceptions are used to deliver trace and debug related events to threads.
Breakpoints, watchpoints, trace-events and instruction-performance sampling
events are examples. These events are supported for userspace threads when the kernel
is configured to include them (when CONFIG\_HARDWARE\_DEBUG\_API is set). The hardware
debugging extensions API is supported on the following subset of the platforms that the
kernel has been ported to:

\begin{itemize}
\item PC99: IA-32 and x86\_64
\item Sabrelite (i.MX6)
\item Jetson TegraK1
\item HiSilicon Hikey
\item Raspberry Pi 3
\item Odroid-X (Exynos4)
\item Xilinx zynq7000
\end{itemize}

Information on the available hardware debugging resources is presented in the form of the following constants:

\begin{description}
\item[seL4\_NumHWBreakpoints]: Defines the total number of hardware break
registers available, of all types available on the hardware platform. On the ARM
Cortex A7 for example, there are 6 exclusive instruction breakpoint registers,
and 4 exclusive data watchpoint registers, for a total of 10 monitor registers.
On this platform therefore, \texttt{seL4\_NumHWBreakpoints} is defined as 10.
The instruction breakpoint registers will always be assigned the lower API-IDs,
and the data watchpoints will always be assigned following them.

Additionally, \texttt{seL4\_NumExclusiveBreakpoints}, \texttt{seL4\_NumExclusiveWatchpoints}
and \texttt{seL4\_NumDualFunctionMonitors}
are defined for each target platform to reflect the number of available
hardware breakpoints/watchpoints of a certain type.

\item[seL4\_NumExclusiveBreakpoints]: Defines the number of hardware registers
capable of generating a fault \textbf{only} on instruction execution. Currently this will be
set only on ARM platforms. The API-ID of the first exclusive breakpoint is given
in \texttt{seL4\_FirstBreakpoint}. If there are no instruction-break exclusive
registers, \texttt{seL4\_NumExclusiveBreakpoints} will be set to \texttt{0} and
\texttt{seL4\_FirstBreakpoint} will be set to -1.

\item[seL4\_NumExclusiveWatchpoints]: Defines the number of hardware registers
capable of generating a fault \textbf{only} on data access. Currently this will be set only
on ARM platforms. The API-ID of the first exclusive watchpoint is given
in \texttt{seL4\_FirstWatchpoint}. If there are no data-break exclusive
registers, \texttt{seL4\_NumExclusiveWatchpoints} will be set to \texttt{0} and
\texttt{seL4\_FirstWatchpoint} will be set to -1.

\item[seL4\_NumDualFunctionMonitors]: Defines the number of hardware registers
capable of generating a fault on either type of access -- i.e, the register
supports both instruction and data breaks. Currently this will be set only on
x86 platforms. The API-ID of the first dual-function monitor is given
in \texttt{seL4\_FirstDualFunctionMonitor}. If there are no dual-function break
registers, \texttt{seL4\_NumDualFunctionMonitors} will be set to \texttt{0} and
\texttt{seL4\_FirstDualFunctionMonitor} will be set to -1.

\end{description}

\begin{table}[h]
\begin{tabularx}{\textwidth}{XXX}
\toprule
\textbf{Value sent} & \textbf{IPC buffer location} \\
\midrule
\reg{Breakpoint instruction address} & \ipcbloc{IPCBuffer[0]} \\
\reg{Exception reason} & \ipcbloc{IPCBuffer[1]} \\
\reg{Watchpoint data access address} & \ipcbloc{IPCBuffer[2]} \\
\reg{Register API-ID} & \ipcbloc{IPCBuffer[3]} \\
\bottomrule
\end{tabularx}
\caption{\label{tbl:debug_exception_result}Debug fault message layout. The
register API-ID is not returned in the fault message from the kernel on
single-step faults.}
\end{table}

\subsection{Debug Exception: Single-stepping}
\label{sec:single_stepping_debug_exception}

The kernel provides support for the use of hardware single-stepping of userspace
threads when configured to do so (when CONFIG\_HARDWARE\_DEBUG\_API is set). To
this end it exposes the invocation, \texttt{seL4\_TCB\_ConfigureSingleStepping}.

The caller is expected to select an API-ID that corresponds to
an instruction breakpoint, to use when setting up the single-stepping
functionality (i.e, API-ID from 0 to \texttt{seL4\_NumExclusiveBreakpoints} - 1).
However, not all hardware platforms require an actual hardware breakpoint
register to provide single-stepping functionality. If the caller's hardware platform requires the
use of a hardware breakpoint register, it will use the breakpoint register given to it in \texttt{bp\_num},
and return \texttt{true} in \texttt{bp\_was\_consumed}. If the underlying platform does not need a hardware
breakpoint to provide single-stepping, seL4 will return \texttt{false} in \texttt{bp\_was\_consumed} and
leave \texttt{bp\_num} unchanged.

If \texttt{bp\_was\_consumed} is \texttt{true}, the caller should not
attempt to re-configure \texttt{bp\_num} for Breakpoint or Watchpoint usage until
the caller has disabled single-stepping and released that register, via a subsequent
call to \texttt{seL4\_TCB\_ConfigureSingleStepping}, or a fault-reply with
\texttt{n\_instr} being 0. Setting \texttt{num\_instructions} to \texttt{0}
\textbf{disables single stepping}.

On architectures that require an actual hardware registers to be configured for
single-stepping functionality, seL4 will restrict the number of registers that
can be configured as single-steppers, to one at any given time. The register that
is currently configured (if any) for single-stepping will be the implicit
\texttt{bp\_num} argument in a single-step debug fault reply.

The kernel's single-stepping, also supports skipping a certain number of
instructions before delivering the single-step fault message. \texttt{Num\_instructions}
should be set to \texttt{1} when single-stepping, or any non-zero integer value to skip that many
instructions before resuming single-stepping. This skip-count can also be set in
the fault-reply to a single-step debug fault.

\begin{table}[h]
\begin{tabularx}{\textwidth}{XXX}
\toprule
\textbf{Value sent} & \textbf{Register set by reply} & \textbf{IPC buffer location} \\
\midrule
\reg{Breakpoint instruction address} & \texttt{num\_instructions} to skip & \ipcbloc{IPCBuffer[0]} \\
\reg{Exception reason} & --- & \ipcbloc{IPCBuffer[1]} \\
\bottomrule
\end{tabularx}
\caption{\label{tbl:single_step_exception_result}Single-step fault message layout.}
\end{table}

\subsection{VM Fault}
\label{sec:vm-fault}

The thread caused a page fault. Replying to the fault IPC will restart
the thread. The contents of the IPC message are given below.\\

\begin{table}[htb]
\begin{tabularx}{\textwidth}{XXX}
\toprule
\textbf{Meaning} & \textbf{IPC buffer location} \\
\midrule
    Program counter to restart execution at. & \ipcbloc{seL4\_VMFault\_IP} \\
Address that caused the fault. & \ipcbloc{seL4\_VMFault\_SP} \\
    Instruction fault (1 if the fault was caused by an instruction fetch). & \ipcbloc{seL4\_VMFault\_PrefetchFault}  \\
Fault status register (FSR). Contains information about the cause of the fault. Architecture dependent. & \ipcbloc{seL4\_VMFault\_FSR} \\
\bottomrule
\end{tabularx}
\caption{\label{tbl:vm_fault_result_arm} VM Fault outcome on all architectures.}
\end{table}

\section{Domains}
\label{sec:domains}

Domains are used to isolate independent subsystems, so as to limit
information flow between them.
The kernel switches between domains according to a fixed, time-triggered
schedule.
The fixed schedule is compiled into the kernel via the constant
\texttt{CONFIG\_NUM\_DOMAINS} and the global variable \texttt{ksDomSchedule}.

A thread belongs to exactly one domain, and will only run when that domain
is active.
The \apifunc{seL4\_DomainSet\_Set}{domainset_set} method changes the domain
of a thread.
The caller must possess a \obj{Domain} cap and the thread's \obj{TCB} cap.
The initial thread starts with a \obj{Domain} cap (see
\autoref{sec:messageinfo}).

\section{Virtualisation}
\label{sec:virt}

Hardware execution virtualisation is supported on specific arm and x86 platforms. The interface is exposed through a series
of kernel objects, invocations and syscalls that allow the user to take advantage of hardware
virtualisation features.

Hardware virtualisation allows for a thread to perform instructions and operations as if it were
running at a higher privilege level. As higher privilege levels typically have access to
additional machine registers and other pieces of state a \obj{VCPU} object is introduced to act
as storage for this state. For simplicity we refer to this virtualised higher privileged level as
'guest mode'. \obj{VCPU}s are bound in a one-to-one relationship with a \obj{TCB} in order
to provide a thread with this ability to run in higher privilege mode. See the section on
ARM or x86 for more precise details.

\obj{VCPU} objects also have additional, architecture specific, invocations for manipulating
the additional state or other virtualisation controls provided by the hardware. Binding of
a \obj{VCPU} to a \obj{TCB} is done by an invocation on the \obj{VCPU} only, and not the \obj{TCB}.

The provided objects and invocations are, generally speaking, the thinnest possible shim over
the underlying hardware primitives and operations. As a result an in depth familiarity with
the underlying architecture specific harwdare mechanisms is required to use these objects, and
such familiarity is therefore assumed in description.

\subsection{ARM}

When a \obj{TCB} has a bound \obj{VCPU} it is allowed to have the mode portion of the
\texttt{cpsr} register set to values other than \texttt{user}. Specifically it may have any value other than
\texttt{hypervisor}.

TODO: this section needs more detail

\subsection{x86}

A \obj{TCB} with a bound \obj{VCPU} has two execution modes; one is the original thread just as
if there was no bound \obj{VCPU}, and the other is the guest mode execution using the
\obj{VCPU}. Switching from regular execution mode into the guest execution mode is
done by using the \apifunc{seL4\_VMEnter}{sel4_vmenter} syscall. Executing this syscall causes the thread, whenever
it is scheduled thereafter, to execute using the higher privlidged mode controlled by the \obj{VCPU}.
Should the guest execution mode generate any kind of fault, or if a message arrives
on the \obj{TCB}s bound notification, the \obj{TCB} will be switched back to regular mode
and the \apifunc{seL4\_VMEnter}{sel4_vmenter} syscall will return with a message indicating the reason for return.

\obj{VCPU} state and execution is controlled through the \apifunc{seL4\_VCPU\_ReadVMCS}{x86_vcpu_readvmcs}
and \apifunc{seL4\_VCPU\_WriteVMCS}{x86_vcpu_writevmcs} invocations.
These are very thin wrappers around the hardware \texttt{vmread} and \texttt{vmwrite} instructions and the kernel
merely does enough validation on the parameters to ensure the \obj{VCPU} is not configured
to run in such a way as to violate any kernel properties. For example, it is not possible to
disable the use of External Interrupt Exiting, as this would prevent the kernel from receiving
timer interrupts and allow the thread to monopolise CPU time.

Memory access of the guest execution mode is controlled by requiring the use of Extended
Page Tables (EPT). A series of EPT related paging structure objects (\obj{EPTPML4}, \obj{EPTPDPT}, \obj{EPTPD}, \obj{EPTPT})
exist and are manipulated in exactly the same manner as the objects for the regular virtual
address space. Once constructed a \obj{TCB} can be given an \obj{EPTPML4} as an EPT root with \apifunc{seL4\_TCB\_SetEPTRoot}{x86_set_eptroot},
which serves as the vspace root when executing in guest mode, with the vspace root set
with \apifunc{seL4\_TCB\_SetSPace}{tcb_setspace} or \apifunc{seL4\_TCB\_Configure}{tcb_configure}
continuing to provide translation when the TCB is executing in its normal mode.

Direct access to I/O ports can be given to the privlidged execution mode through the
\apifunc{seL4\_X86\_VCPU\_EnableIOPort}{x86_vcpu_enableioport} invocation and allows the provided I/O port capability to be
linked to the VCPU, and a subset of its I/O port range to be made accessible to the \obj{VCPU}.
Linking means that an I/O port capability can only be used in a single \apifunc{seL4\_X86\_VCPU\_EnableIOPort}{x86_vcpu_enableioport}
invocation and a second invocation will undo the previous one. The link also means that
if the I/O port capability is deleted for any reason the access will be correspondingly removed
from the \obj{VCPU}.