1%
2% Copyright 2014, General Dynamics C4 Systems
3%
4% This software may be distributed and modified according to the terms of
5% the GNU General Public License version 2. Note that NO WARRANTY is provided.
6% See "LICENSE_GPLv2.txt" for details.
7%
8% @TAG(GD_GPL)
9%
10
11\chapter{\label{ch:threads}Threads and Execution}
12
13\section{Threads}
14\label{sec:threads}
15
16seL4 provides threads to represent an execution context and manage
17processor time. A thread is represented in seL4 by its thread control block
18object (\obj{TCB}). Each \obj{TCB} has an associated CSpace (see
19\autoref{ch:cspace}) and VSpace (see \autoref{ch:vspace}) which
20may be shared with other threads. A \obj{TCB} may also have an IPC buffer
21(see  \autoref{ch:ipc}), which is used to pass extra arguments during IPC
22or kernel object invocation that do not fit in the architecture-defined message
23registers. While it is not compulsory that a thread has an IPC buffer,
24it will not be able to perform most kernel invocations, as they require
25cap transfer.
26Each thread belongs to exactly one security domain (see
27\autoref{sec:domains}).
28%FIXME: there is much more information held in the TCB!
29
30\subsection{Thread Creation}
31\label{sec:thread_creation}
32
33Like other objects, \obj{TCB}s are created with the
34\apifunc{seL4\_Untyped\_Retype}{untyped_retype} method (see
35\autoref{sec:kernmemalloc}). A newly created thread is initially inactive. It
36is configured by setting its CSpace and VSpace with the
37\apifunc{seL4\_TCB\_SetSpace}{tcb_setspace}
38or \apifunc{seL4\_TCB\_Configure}{tcb_configure} methods and then calling
39\apifunc{seL4\_TCB\_WriteRegisters}{tcb_writeregisters} with an initial stack pointer and instruction
40pointer. The thread can then be activated either by setting the
41\texttt{resume\_target} parameter in the \apifunc{seL4\_TCB\_WriteRegisters}{tcb_writeregisters} invocation to true
42or by seperately calling the \apifunc{seL4\_TCB\_Resume}{tcb_resume} method. In multicore machines, the thread
43would be running on the same CPU which originally created the \obj{TCB}. However, it could be migrated to other CPUs
44by calling \apifunc{seL4\_TCB\_SetAffinity}{tcb_setaffinity}.
45
46\subsection{Thread Deactivation}
47\label{sec:thread_deactivation}
48
49The \apifunc{seL4\_TCB\_Suspend}{tcb_suspend} method deactivates a thread.
50Suspended threads can later be resumed.
51Their suspended state can be retrieved with the
52\apifunc{seL4\_TCB\_ReadRegisters}{tcb_readregisters} and
53\apifunc{seL4\_TCB\_CopyRegisters}{tcb_copyregisters} methods.
54They can also be reconfigured and
55reused or left suspended indefinitely if not needed. Threads will be
56automatically suspended when the last capability to their \obj{TCB} is
57deleted.
58% an example of which is demonstrated in \nameref{ex:second_thread}.
59
60\subsection{Scheduling}
61\label{sec:sched}
62
63seL4 uses a preemptive round-robin scheduler with 256 priority levels.
64All threads have a maximum controlled priority (MCP) and a priority, the latter being the effective
65priority of the thread.
66When a thread modifies a another threads priority (including itself) it must provide a
67thread capability from which to use the MCP from. Threads can only set priorities and MCPs
68to be less than or equal to the provided thread's MCP.
69The initial task starts with an MCP and priority as the highest priority in the system (\texttt{seL4\_MaxPrio}).
70Thread priority and MCP can be set with \apifunc{seL4\_TCB\_SetPriority}{tcb_setpriority}
71and \apifunc{seL4\_TCB\_SetMCPriority}{tcb_setmcpriority} methods.
72
73\subsection{Exceptions}
74
75Each thread has an associated exception-handler endpoint. If the thread
76causes an exception, the kernel creates an IPC message with the relevant
77details and sends this to the endpoint. This
78thread can then take the appropriate action. Fault IPC messages are
79described in \autoref{sec:faults}.
80
81In order to enable exception handlers, a capability to the exception-handler
82endpoint must exist in the CSpace of the thread that generates the exception.
83The exception-handler
84endpoint can be set with the \apifunc{seL4\_TCB\_SetSpace}{tcb_setspace} or
85\apifunc{seL4\_TCB\_Configure}{tcb_configure} method.
86With these methods, a capability address for the exception handler can be associated with a thread.
87This address is then used to lookup the handler endpoint when an exception is generated.
88Note, however, that these methods make no attempt to check whether an endpoint capability exists at the specified
89address in the CSpace of the thread. The capability is only looked up
90when an exception actually happens and if the lookup fails then no
91exception message is delivered and the thread is suspended indefinitely.
92
93The exception endpoint must have send and grant rights. Replying to the
94exception message restarts the thread. For certain exception types, the contents of
95the reply message may be used to set the values in the registers of the
96thread being restarted.
97See \autoref{sec:faults} for details.
98
99
100
101\subsection{Message Layout of the Read-/Write-Registers Methods}
102\label{sec:read_write_registers}
103
104The registers of a thread can be read and written with the
105\apifunc{seL4\_TCB\_ReadRegisters}{tcb_readregisters} and \apifunc{seL4\_TCB\_WriteRegisters}{tcb_writeregisters} methods.
106For some registers, the kernel will silently mask certain bits or ranges of bits off, and force them to contain certain
107values to ensure that they cannot be maliciously set to values that would compromise the running system, or to respect
108values that the architecture specifications have mandated to be certain values.
109The register contents are transferred via the IPC buffer.
110
111\section{Faults}
112\label{sec:faults}
113
114A thread's actions may result in a fault. Faults are delivered to the
115thread's exception handler so that it can take the appropriate action.
116The fault type is specified in the message label and is one of:
117seL4\_Fault\_CapFault, seL4\_Fault\_VMFault, seL4\_Fault\_UnknownSyscall, seL4\_Fault\_UserException, seL4\_Fault\_DebugException,
118or seL4\_Fault\_NullFault (indicating no fault occurred and this is a normal IPC message).
119
120Fault are delivered in such a way as to imitate a Call from the faulting
121thread. This means that to send a fault message the fault endpoint
122must have both write and grant permissions.
123
124\subsection{Capability Faults}
125
126Capability faults may occur in two places. Firstly, a capability fault
127can occur when lookup of a capability referenced by a
128\apifunc{seL4\_Call}{sel4_call} or \apifunc{seL4\_Send}{sel4_send} system call
129failed (\apifunc{seL4\_NBSend}{sel4_nbsend} calls on
130invalid capabilities silently fail). In this case, the capability
131on which the fault occurred may be the capability being invoked or an
132extra capability passed in the \texttt{caps} field in the IPC buffer.
133
134Secondly, a capability fault can occur when \apifunc{seL4\_Recv}{sel4_recv} or \apifunc{seL4\_NBRecv}{sel4_nbrecv}
135is called on a capability that does not exist, is not an endpoint or notification capability or does not have
136receive permissions.
137
138Replying to the fault IPC will restart the faulting thread. The contents of the
139IPC message are given in \autoref{tbl:ipc_contents}.\\
140
141\begin{table}[htb]
142\noindent\begin{tabularx}{\textwidth}{XX}
143\toprule
144    \textbf{Meaning} & \textbf{ IPC buffer location} \\
145\midrule
146    Address at which to restart execution & \ipcbloc{seL4\_CapFault\_IP} \\
147    Capability address & \ipcbloc{seL4\_CapFault\_Addr} \\
148In receive phase (1 if the fault happened during a receive system call, 0
149    otherwise) & \ipcbloc{seL4\_CapFault\_InRecvPhase} \\
150Lookup failure description. As described in \autoref{sec:lookup_fail_desc} &
151    \ipcbloc{seL4\_CapFault\_LookupFailureType} \\
152\bottomrule
153\end{tabularx}
154\caption{\label{tbl:ipc_contents}Contents of an IPC message.}
155\end{table}
156
157\subsection{Unknown Syscall}
158\label{sec:unknown-syscall}
159
160This fault occurs when a thread executes a system call with a syscall
161number that is unknown to seL4.
162The register set
163of the faulting thread is passed to the thread's exception handler so that it
164may, for example, emulate the system call if a thread is being
165virtualised.
166
167Replying to the fault IPC allows the thread to be restarted
168and/or the thread's register set to be modified. If the reply has
169a label of zero, the thread will be restarted. Additionally, if the
170message length is non-zero, the faulting thread's register set will be
171updated. In this case, the number of
172registers updated is controlled with the length field of the message
173tag.
174
175\subsection{User Exception}
176
177User exceptions are used to deliver architecture-defined exceptions. For
178example, such an exception could occur if a user thread attempted to
179divide a number by zero.
180
181Replying to the fault IPC allows the thread to be restarted
182and/or the thread's register set to be modified. If the reply has
183a label of zero, the thread will be restarted. Additionally, if the
184message length is non-zero, the faulting thread's register set will be
185updated. In this case, the number of
186registers updated is controlled with the length field of the message
187tag.
188
189\subsection{Debug Exception: Breakpoints and Watchpoints}
190\label{sec:debug_exceptions}
191
192Debug exceptions are used to deliver trace and debug related events to threads.
193Breakpoints, watchpoints, trace-events and instruction-performance sampling
194events are examples. These events are supported for userspace threads when the kernel
195is configured to include them (when CONFIG\_HARDWARE\_DEBUG\_API is set). The hardware
196debugging extensions API is supported on the following subset of the platforms that the
197kernel has been ported to:
198
199\begin{itemize}
200\item PC99: IA-32 and x86\_64
201\item Sabrelite (i.MX6)
202\item Jetson TegraK1
203\item HiSilicon Hikey
204\item Raspberry Pi 3
205\item Odroid-X (Exynos4)
206\item Xilinx zynq7000
207\end{itemize}
208
209Information on the available hardware debugging resources is presented in the form of the following constants:
210
211\begin{description}
212\item[seL4\_NumHWBreakpoints]: Defines the total number of hardware break
213registers available, of all types available on the hardware platform. On the ARM
214Cortex A7 for example, there are 6 exclusive instruction breakpoint registers,
215and 4 exclusive data watchpoint registers, for a total of 10 monitor registers.
216On this platform therefore, \texttt{seL4\_NumHWBreakpoints} is defined as 10.
217The instruction breakpoint registers will always be assigned the lower API-IDs,
218and the data watchpoints will always be assigned following them.
219
220Additionally, \texttt{seL4\_NumExclusiveBreakpoints}, \texttt{seL4\_NumExclusiveWatchpoints}
221and \texttt{seL4\_NumDualFunctionMonitors}
222are defined for each target platform to reflect the number of available
223hardware breakpoints/watchpoints of a certain type.
224
225\item[seL4\_NumExclusiveBreakpoints]: Defines the number of hardware registers
226capable of generating a fault \textbf{only} on instruction execution. Currently this will be
227set only on ARM platforms. The API-ID of the first exclusive breakpoint is given
228in \texttt{seL4\_FirstBreakpoint}. If there are no instruction-break exclusive
229registers, \texttt{seL4\_NumExclusiveBreakpoints} will be set to \texttt{0} and
230\texttt{seL4\_FirstBreakpoint} will be set to -1.
231
232\item[seL4\_NumExclusiveWatchpoints]: Defines the number of hardware registers
233capable of generating a fault \textbf{only} on data access. Currently this will be set only
234on ARM platforms. The API-ID of the first exclusive watchpoint is given
235in \texttt{seL4\_FirstWatchpoint}. If there are no data-break exclusive
236registers, \texttt{seL4\_NumExclusiveWatchpoints} will be set to \texttt{0} and
237\texttt{seL4\_FirstWatchpoint} will be set to -1.
238
239\item[seL4\_NumDualFunctionMonitors]: Defines the number of hardware registers
240capable of generating a fault on either type of access -- i.e, the register
241supports both instruction and data breaks. Currently this will be set only on
242x86 platforms. The API-ID of the first dual-function monitor is given
243in \texttt{seL4\_FirstDualFunctionMonitor}. If there are no dual-function break
244registers, \texttt{seL4\_NumDualFunctionMonitors} will be set to \texttt{0} and
245\texttt{seL4\_FirstDualFunctionMonitor} will be set to -1.
246
247\end{description}
248
249\begin{table}[h]
250\begin{tabularx}{\textwidth}{XXX}
251\toprule
252\textbf{Value sent} & \textbf{IPC buffer location} \\
253\midrule
254\reg{Breakpoint instruction address} & \ipcbloc{IPCBuffer[0]} \\
255\reg{Exception reason} & \ipcbloc{IPCBuffer[1]} \\
256\reg{Watchpoint data access address} & \ipcbloc{IPCBuffer[2]} \\
257\reg{Register API-ID} & \ipcbloc{IPCBuffer[3]} \\
258\bottomrule
259\end{tabularx}
260\caption{\label{tbl:debug_exception_result}Debug fault message layout. The
261register API-ID is not returned in the fault message from the kernel on
262single-step faults.}
263\end{table}
264
265\subsection{Debug Exception: Single-stepping}
266\label{sec:single_stepping_debug_exception}
267
268The kernel provides support for the use of hardware single-stepping of userspace
269threads when configured to do so (when CONFIG\_HARDWARE\_DEBUG\_API is set). To
270this end it exposes the invocation, \texttt{seL4\_TCB\_ConfigureSingleStepping}.
271
272The caller is expected to select an API-ID that corresponds to
273an instruction breakpoint, to use when setting up the single-stepping
274functionality (i.e, API-ID from 0 to \texttt{seL4\_NumExclusiveBreakpoints} - 1).
275However, not all hardware platforms require an actual hardware breakpoint
276register to provide single-stepping functionality. If the caller's hardware platform requires the
277use of a hardware breakpoint register, it will use the breakpoint register given to it in \texttt{bp\_num},
278and return \texttt{true} in \texttt{bp\_was\_consumed}. If the underlying platform does not need a hardware
279breakpoint to provide single-stepping, seL4 will return \texttt{false} in \texttt{bp\_was\_consumed} and
280leave \texttt{bp\_num} unchanged.
281
282If \texttt{bp\_was\_consumed} is \texttt{true}, the caller should not
283attempt to re-configure \texttt{bp\_num} for Breakpoint or Watchpoint usage until
284the caller has disabled single-stepping and released that register, via a subsequent
285call to \texttt{seL4\_TCB\_ConfigureSingleStepping}, or a fault-reply with
286\texttt{n\_instr} being 0. Setting \texttt{num\_instructions} to \texttt{0}
287\textbf{disables single stepping}.
288
289On architectures that require an actual hardware registers to be configured for
290single-stepping functionality, seL4 will restrict the number of registers that
291can be configured as single-steppers, to one at any given time. The register that
292is currently configured (if any) for single-stepping will be the implicit
293\texttt{bp\_num} argument in a single-step debug fault reply.
294
295The kernel's single-stepping, also supports skipping a certain number of
296instructions before delivering the single-step fault message. \texttt{Num\_instructions}
297should be set to \texttt{1} when single-stepping, or any non-zero integer value to skip that many
298instructions before resuming single-stepping. This skip-count can also be set in
299the fault-reply to a single-step debug fault.
300
301\begin{table}[h]
302\begin{tabularx}{\textwidth}{XXX}
303\toprule
304\textbf{Value sent} & \textbf{Register set by reply} & \textbf{IPC buffer location} \\
305\midrule
306\reg{Breakpoint instruction address} & \texttt{num\_instructions} to skip & \ipcbloc{IPCBuffer[0]} \\
307\reg{Exception reason} & --- & \ipcbloc{IPCBuffer[1]} \\
308\bottomrule
309\end{tabularx}
310\caption{\label{tbl:single_step_exception_result}Single-step fault message layout.}
311\end{table}
312
313\subsection{VM Fault}
314\label{sec:vm-fault}
315
316The thread caused a page fault. Replying to the fault IPC will restart
317the thread. The contents of the IPC message are given below.\\
318
319\begin{table}[htb]
320\begin{tabularx}{\textwidth}{XXX}
321\toprule
322\textbf{Meaning} & \textbf{IPC buffer location} \\
323\midrule
324    Program counter to restart execution at. & \ipcbloc{seL4\_VMFault\_IP} \\
325Address that caused the fault. & \ipcbloc{seL4\_VMFault\_SP} \\
326    Instruction fault (1 if the fault was caused by an instruction fetch). & \ipcbloc{seL4\_VMFault\_PrefetchFault}  \\
327Fault status register (FSR). Contains information about the cause of the fault. Architecture dependent. & \ipcbloc{seL4\_VMFault\_FSR} \\
328\bottomrule
329\end{tabularx}
330\caption{\label{tbl:vm_fault_result_arm} VM Fault outcome on all architectures.}
331\end{table}
332
333\section{Domains}
334\label{sec:domains}
335
336Domains are used to isolate independent subsystems, so as to limit
337information flow between them.
338The kernel switches between domains according to a fixed, time-triggered
339schedule.
340The fixed schedule is compiled into the kernel via the constant
341\texttt{CONFIG\_NUM\_DOMAINS} and the global variable \texttt{ksDomSchedule}.
342
343A thread belongs to exactly one domain, and will only run when that domain
344is active.
345The \apifunc{seL4\_DomainSet\_Set}{domainset_set} method changes the domain
346of a thread.
347The caller must possess a \obj{Domain} cap and the thread's \obj{TCB} cap.
348The initial thread starts with a \obj{Domain} cap (see
349\autoref{sec:messageinfo}).
350
351\section{Virtualisation}
352\label{sec:virt}
353
354Hardware execution virtualisation is supported on specific arm and x86 platforms. The interface is exposed through a series
355of kernel objects, invocations and syscalls that allow the user to take advantage of hardware
356virtualisation features.
357
358Hardware virtualisation allows for a thread to perform instructions and operations as if it were
359running at a higher privilege level. As higher privilege levels typically have access to
360additional machine registers and other pieces of state a \obj{VCPU} object is introduced to act
361as storage for this state. For simplicity we refer to this virtualised higher privileged level as
362'guest mode'. \obj{VCPU}s are bound in a one-to-one relationship with a \obj{TCB} in order
363to provide a thread with this ability to run in higher privilege mode. See the section on
364ARM or x86 for more precise details.
365
366\obj{VCPU} objects also have additional, architecture specific, invocations for manipulating
367the additional state or other virtualisation controls provided by the hardware. Binding of
368a \obj{VCPU} to a \obj{TCB} is done by an invocation on the \obj{VCPU} only, and not the \obj{TCB}.
369
370The provided objects and invocations are, generally speaking, the thinnest possible shim over
371the underlying hardware primitives and operations. As a result an in depth familiarity with
372the underlying architecture specific harwdare mechanisms is required to use these objects, and
373such familiarity is therefore assumed in description.
374
375\subsection{ARM}
376
377When a \obj{TCB} has a bound \obj{VCPU} it is allowed to have the mode portion of the
378\texttt{cpsr} register set to values other than \texttt{user}. Specifically it may have any value other than
379\texttt{hypervisor}.
380
381TODO: this section needs more detail
382
383\subsection{x86}
384
385A \obj{TCB} with a bound \obj{VCPU} has two execution modes; one is the original thread just as
386if there was no bound \obj{VCPU}, and the other is the guest mode execution using the
387\obj{VCPU}. Switching from regular execution mode into the guest execution mode is
388done by using the \apifunc{seL4\_VMEnter}{sel4_vmenter} syscall. Executing this syscall causes the thread, whenever
389it is scheduled thereafter, to execute using the higher privlidged mode controlled by the \obj{VCPU}.
390Should the guest execution mode generate any kind of fault, or if a message arrives
391on the \obj{TCB}s bound notification, the \obj{TCB} will be switched back to regular mode
392and the \apifunc{seL4\_VMEnter}{sel4_vmenter} syscall will return with a message indicating the reason for return.
393
394\obj{VCPU} state and execution is controlled through the \apifunc{seL4\_VCPU\_ReadVMCS}{x86_vcpu_readvmcs}
395and \apifunc{seL4\_VCPU\_WriteVMCS}{x86_vcpu_writevmcs} invocations.
396These are very thin wrappers around the hardware \texttt{vmread} and \texttt{vmwrite} instructions and the kernel
397merely does enough validation on the parameters to ensure the \obj{VCPU} is not configured
398to run in such a way as to violate any kernel properties. For example, it is not possible to
399disable the use of External Interrupt Exiting, as this would prevent the kernel from receiving
400timer interrupts and allow the thread to monopolise CPU time.
401
402Memory access of the guest execution mode is controlled by requiring the use of Extended
403Page Tables (EPT). A series of EPT related paging structure objects (\obj{EPTPML4}, \obj{EPTPDPT}, \obj{EPTPD}, \obj{EPTPT})
404exist and are manipulated in exactly the same manner as the objects for the regular virtual
405address space. Once constructed a \obj{TCB} can be given an \obj{EPTPML4} as an EPT root with \apifunc{seL4\_TCB\_SetEPTRoot}{x86_set_eptroot},
406which serves as the vspace root when executing in guest mode, with the vspace root set
407with \apifunc{seL4\_TCB\_SetSPace}{tcb_setspace} or \apifunc{seL4\_TCB\_Configure}{tcb_configure}
408continuing to provide translation when the TCB is executing in its normal mode.
409
410Direct access to I/O ports can be given to the privlidged execution mode through the
411\apifunc{seL4\_X86\_VCPU\_EnableIOPort}{x86_vcpu_enableioport} invocation and allows the provided I/O port capability to be
412linked to the VCPU, and a subset of its I/O port range to be made accessible to the \obj{VCPU}.
413Linking means that an I/O port capability can only be used in a single \apifunc{seL4\_X86\_VCPU\_EnableIOPort}{x86_vcpu_enableioport}
414invocation and a second invocation will undo the previous one. The link also means that
415if the I/O port capability is deleted for any reason the access will be correspondingly removed
416from the \obj{VCPU}.
417