1% 2% Copyright 2014, General Dynamics C4 Systems 3% 4% This software may be distributed and modified according to the terms of 5% the GNU General Public License version 2. Note that NO WARRANTY is provided. 6% See "LICENSE_GPLv2.txt" for details. 7% 8% @TAG(GD_GPL) 9% 10 11\chapter{\label{ch:threads}Threads and Execution} 12 13\section{Threads} 14\label{sec:threads} 15 16seL4 provides threads to represent an execution context and manage 17processor time. A thread is represented in seL4 by its thread control block 18object (\obj{TCB}). Each \obj{TCB} has an associated CSpace (see 19\autoref{ch:cspace}) and VSpace (see \autoref{ch:vspace}) which 20may be shared with other threads. A \obj{TCB} may also have an IPC buffer 21(see \autoref{ch:ipc}), which is used to pass extra arguments during IPC 22or kernel object invocation that do not fit in the architecture-defined message 23registers. While it is not compulsory that a thread has an IPC buffer, 24it will not be able to perform most kernel invocations, as they require 25cap transfer. 26Each thread belongs to exactly one security domain (see 27\autoref{sec:domains}). 28%FIXME: there is much more information held in the TCB! 29 30\subsection{Thread Creation} 31\label{sec:thread_creation} 32 33Like other objects, \obj{TCB}s are created with the 34\apifunc{seL4\_Untyped\_Retype}{untyped_retype} method (see 35\autoref{sec:kernmemalloc}). A newly created thread is initially inactive. It 36is configured by setting its CSpace and VSpace with the 37\apifunc{seL4\_TCB\_SetSpace}{tcb_setspace} 38or \apifunc{seL4\_TCB\_Configure}{tcb_configure} methods and then calling 39\apifunc{seL4\_TCB\_WriteRegisters}{tcb_writeregisters} with an initial stack pointer and instruction 40pointer. The thread can then be activated either by setting the 41\texttt{resume\_target} parameter in the \apifunc{seL4\_TCB\_WriteRegisters}{tcb_writeregisters} invocation to true 42or by seperately calling the \apifunc{seL4\_TCB\_Resume}{tcb_resume} method. In multicore machines, the thread 43would be running on the same CPU which originally created the \obj{TCB}. However, it could be migrated to other CPUs 44by calling \apifunc{seL4\_TCB\_SetAffinity}{tcb_setaffinity}. 45 46\subsection{Thread Deactivation} 47\label{sec:thread_deactivation} 48 49The \apifunc{seL4\_TCB\_Suspend}{tcb_suspend} method deactivates a thread. 50Suspended threads can later be resumed. 51Their suspended state can be retrieved with the 52\apifunc{seL4\_TCB\_ReadRegisters}{tcb_readregisters} and 53\apifunc{seL4\_TCB\_CopyRegisters}{tcb_copyregisters} methods. 54They can also be reconfigured and 55reused or left suspended indefinitely if not needed. Threads will be 56automatically suspended when the last capability to their \obj{TCB} is 57deleted. 58% an example of which is demonstrated in \nameref{ex:second_thread}. 59 60\subsection{Scheduling} 61\label{sec:sched} 62 63seL4 uses a preemptive round-robin scheduler with 256 priority levels. 64All threads have a maximum controlled priority (MCP) and a priority, the latter being the effective 65priority of the thread. 66When a thread modifies a another threads priority (including itself) it must provide a 67thread capability from which to use the MCP from. Threads can only set priorities and MCPs 68to be less than or equal to the provided thread's MCP. 69The initial task starts with an MCP and priority as the highest priority in the system (\texttt{seL4\_MaxPrio}). 70Thread priority and MCP can be set with \apifunc{seL4\_TCB\_SetPriority}{tcb_setpriority} 71and \apifunc{seL4\_TCB\_SetMCPriority}{tcb_setmcpriority} methods. 72 73\subsection{Exceptions} 74 75Each thread has an associated exception-handler endpoint. If the thread 76causes an exception, the kernel creates an IPC message with the relevant 77details and sends this to the endpoint. This 78thread can then take the appropriate action. Fault IPC messages are 79described in \autoref{sec:faults}. 80 81In order to enable exception handlers, a capability to the exception-handler 82endpoint must exist in the CSpace of the thread that generates the exception. 83The exception-handler 84endpoint can be set with the \apifunc{seL4\_TCB\_SetSpace}{tcb_setspace} or 85\apifunc{seL4\_TCB\_Configure}{tcb_configure} method. 86With these methods, a capability address for the exception handler can be associated with a thread. 87This address is then used to lookup the handler endpoint when an exception is generated. 88Note, however, that these methods make no attempt to check whether an endpoint capability exists at the specified 89address in the CSpace of the thread. The capability is only looked up 90when an exception actually happens and if the lookup fails then no 91exception message is delivered and the thread is suspended indefinitely. 92 93The exception endpoint must have send and grant rights. Replying to the 94exception message restarts the thread. For certain exception types, the contents of 95the reply message may be used to set the values in the registers of the 96thread being restarted. 97See \autoref{sec:faults} for details. 98 99 100 101\subsection{Message Layout of the Read-/Write-Registers Methods} 102\label{sec:read_write_registers} 103 104The registers of a thread can be read and written with the 105\apifunc{seL4\_TCB\_ReadRegisters}{tcb_readregisters} and \apifunc{seL4\_TCB\_WriteRegisters}{tcb_writeregisters} methods. 106For some registers, the kernel will silently mask certain bits or ranges of bits off, and force them to contain certain 107values to ensure that they cannot be maliciously set to values that would compromise the running system, or to respect 108values that the architecture specifications have mandated to be certain values. 109The register contents are transferred via the IPC buffer. 110 111\section{Faults} 112\label{sec:faults} 113 114A thread's actions may result in a fault. Faults are delivered to the 115thread's exception handler so that it can take the appropriate action. 116The fault type is specified in the message label and is one of: 117seL4\_Fault\_CapFault, seL4\_Fault\_VMFault, seL4\_Fault\_UnknownSyscall, seL4\_Fault\_UserException, seL4\_Fault\_DebugException, 118or seL4\_Fault\_NullFault (indicating no fault occurred and this is a normal IPC message). 119 120Fault are delivered in such a way as to imitate a Call from the faulting 121thread. This means that to send a fault message the fault endpoint 122must have both write and grant permissions. 123 124\subsection{Capability Faults} 125 126Capability faults may occur in two places. Firstly, a capability fault 127can occur when lookup of a capability referenced by a 128\apifunc{seL4\_Call}{sel4_call} or \apifunc{seL4\_Send}{sel4_send} system call 129failed (\apifunc{seL4\_NBSend}{sel4_nbsend} calls on 130invalid capabilities silently fail). In this case, the capability 131on which the fault occurred may be the capability being invoked or an 132extra capability passed in the \texttt{caps} field in the IPC buffer. 133 134Secondly, a capability fault can occur when \apifunc{seL4\_Recv}{sel4_recv} or \apifunc{seL4\_NBRecv}{sel4_nbrecv} 135is called on a capability that does not exist, is not an endpoint or notification capability or does not have 136receive permissions. 137 138Replying to the fault IPC will restart the faulting thread. The contents of the 139IPC message are given in \autoref{tbl:ipc_contents}.\\ 140 141\begin{table}[htb] 142\noindent\begin{tabularx}{\textwidth}{XX} 143\toprule 144 \textbf{Meaning} & \textbf{ IPC buffer location} \\ 145\midrule 146 Address at which to restart execution & \ipcbloc{seL4\_CapFault\_IP} \\ 147 Capability address & \ipcbloc{seL4\_CapFault\_Addr} \\ 148In receive phase (1 if the fault happened during a receive system call, 0 149 otherwise) & \ipcbloc{seL4\_CapFault\_InRecvPhase} \\ 150Lookup failure description. As described in \autoref{sec:lookup_fail_desc} & 151 \ipcbloc{seL4\_CapFault\_LookupFailureType} \\ 152\bottomrule 153\end{tabularx} 154\caption{\label{tbl:ipc_contents}Contents of an IPC message.} 155\end{table} 156 157\subsection{Unknown Syscall} 158\label{sec:unknown-syscall} 159 160This fault occurs when a thread executes a system call with a syscall 161number that is unknown to seL4. 162The register set 163of the faulting thread is passed to the thread's exception handler so that it 164may, for example, emulate the system call if a thread is being 165virtualised. 166 167Replying to the fault IPC allows the thread to be restarted 168and/or the thread's register set to be modified. If the reply has 169a label of zero, the thread will be restarted. Additionally, if the 170message length is non-zero, the faulting thread's register set will be 171updated. In this case, the number of 172registers updated is controlled with the length field of the message 173tag. 174 175\subsection{User Exception} 176 177User exceptions are used to deliver architecture-defined exceptions. For 178example, such an exception could occur if a user thread attempted to 179divide a number by zero. 180 181Replying to the fault IPC allows the thread to be restarted 182and/or the thread's register set to be modified. If the reply has 183a label of zero, the thread will be restarted. Additionally, if the 184message length is non-zero, the faulting thread's register set will be 185updated. In this case, the number of 186registers updated is controlled with the length field of the message 187tag. 188 189\subsection{Debug Exception: Breakpoints and Watchpoints} 190\label{sec:debug_exceptions} 191 192Debug exceptions are used to deliver trace and debug related events to threads. 193Breakpoints, watchpoints, trace-events and instruction-performance sampling 194events are examples. These events are supported for userspace threads when the kernel 195is configured to include them (when CONFIG\_HARDWARE\_DEBUG\_API is set). The hardware 196debugging extensions API is supported on the following subset of the platforms that the 197kernel has been ported to: 198 199\begin{itemize} 200\item PC99: IA-32 and x86\_64 201\item Sabrelite (i.MX6) 202\item Jetson TegraK1 203\item HiSilicon Hikey 204\item Raspberry Pi 3 205\item Odroid-X (Exynos4) 206\item Xilinx zynq7000 207\end{itemize} 208 209Information on the available hardware debugging resources is presented in the form of the following constants: 210 211\begin{description} 212\item[seL4\_NumHWBreakpoints]: Defines the total number of hardware break 213registers available, of all types available on the hardware platform. On the ARM 214Cortex A7 for example, there are 6 exclusive instruction breakpoint registers, 215and 4 exclusive data watchpoint registers, for a total of 10 monitor registers. 216On this platform therefore, \texttt{seL4\_NumHWBreakpoints} is defined as 10. 217The instruction breakpoint registers will always be assigned the lower API-IDs, 218and the data watchpoints will always be assigned following them. 219 220Additionally, \texttt{seL4\_NumExclusiveBreakpoints}, \texttt{seL4\_NumExclusiveWatchpoints} 221and \texttt{seL4\_NumDualFunctionMonitors} 222are defined for each target platform to reflect the number of available 223hardware breakpoints/watchpoints of a certain type. 224 225\item[seL4\_NumExclusiveBreakpoints]: Defines the number of hardware registers 226capable of generating a fault \textbf{only} on instruction execution. Currently this will be 227set only on ARM platforms. The API-ID of the first exclusive breakpoint is given 228in \texttt{seL4\_FirstBreakpoint}. If there are no instruction-break exclusive 229registers, \texttt{seL4\_NumExclusiveBreakpoints} will be set to \texttt{0} and 230\texttt{seL4\_FirstBreakpoint} will be set to -1. 231 232\item[seL4\_NumExclusiveWatchpoints]: Defines the number of hardware registers 233capable of generating a fault \textbf{only} on data access. Currently this will be set only 234on ARM platforms. The API-ID of the first exclusive watchpoint is given 235in \texttt{seL4\_FirstWatchpoint}. If there are no data-break exclusive 236registers, \texttt{seL4\_NumExclusiveWatchpoints} will be set to \texttt{0} and 237\texttt{seL4\_FirstWatchpoint} will be set to -1. 238 239\item[seL4\_NumDualFunctionMonitors]: Defines the number of hardware registers 240capable of generating a fault on either type of access -- i.e, the register 241supports both instruction and data breaks. Currently this will be set only on 242x86 platforms. The API-ID of the first dual-function monitor is given 243in \texttt{seL4\_FirstDualFunctionMonitor}. If there are no dual-function break 244registers, \texttt{seL4\_NumDualFunctionMonitors} will be set to \texttt{0} and 245\texttt{seL4\_FirstDualFunctionMonitor} will be set to -1. 246 247\end{description} 248 249\begin{table}[h] 250\begin{tabularx}{\textwidth}{XXX} 251\toprule 252\textbf{Value sent} & \textbf{IPC buffer location} \\ 253\midrule 254\reg{Breakpoint instruction address} & \ipcbloc{IPCBuffer[0]} \\ 255\reg{Exception reason} & \ipcbloc{IPCBuffer[1]} \\ 256\reg{Watchpoint data access address} & \ipcbloc{IPCBuffer[2]} \\ 257\reg{Register API-ID} & \ipcbloc{IPCBuffer[3]} \\ 258\bottomrule 259\end{tabularx} 260\caption{\label{tbl:debug_exception_result}Debug fault message layout. The 261register API-ID is not returned in the fault message from the kernel on 262single-step faults.} 263\end{table} 264 265\subsection{Debug Exception: Single-stepping} 266\label{sec:single_stepping_debug_exception} 267 268The kernel provides support for the use of hardware single-stepping of userspace 269threads when configured to do so (when CONFIG\_HARDWARE\_DEBUG\_API is set). To 270this end it exposes the invocation, \texttt{seL4\_TCB\_ConfigureSingleStepping}. 271 272The caller is expected to select an API-ID that corresponds to 273an instruction breakpoint, to use when setting up the single-stepping 274functionality (i.e, API-ID from 0 to \texttt{seL4\_NumExclusiveBreakpoints} - 1). 275However, not all hardware platforms require an actual hardware breakpoint 276register to provide single-stepping functionality. If the caller's hardware platform requires the 277use of a hardware breakpoint register, it will use the breakpoint register given to it in \texttt{bp\_num}, 278and return \texttt{true} in \texttt{bp\_was\_consumed}. If the underlying platform does not need a hardware 279breakpoint to provide single-stepping, seL4 will return \texttt{false} in \texttt{bp\_was\_consumed} and 280leave \texttt{bp\_num} unchanged. 281 282If \texttt{bp\_was\_consumed} is \texttt{true}, the caller should not 283attempt to re-configure \texttt{bp\_num} for Breakpoint or Watchpoint usage until 284the caller has disabled single-stepping and released that register, via a subsequent 285call to \texttt{seL4\_TCB\_ConfigureSingleStepping}, or a fault-reply with 286\texttt{n\_instr} being 0. Setting \texttt{num\_instructions} to \texttt{0} 287\textbf{disables single stepping}. 288 289On architectures that require an actual hardware registers to be configured for 290single-stepping functionality, seL4 will restrict the number of registers that 291can be configured as single-steppers, to one at any given time. The register that 292is currently configured (if any) for single-stepping will be the implicit 293\texttt{bp\_num} argument in a single-step debug fault reply. 294 295The kernel's single-stepping, also supports skipping a certain number of 296instructions before delivering the single-step fault message. \texttt{Num\_instructions} 297should be set to \texttt{1} when single-stepping, or any non-zero integer value to skip that many 298instructions before resuming single-stepping. This skip-count can also be set in 299the fault-reply to a single-step debug fault. 300 301\begin{table}[h] 302\begin{tabularx}{\textwidth}{XXX} 303\toprule 304\textbf{Value sent} & \textbf{Register set by reply} & \textbf{IPC buffer location} \\ 305\midrule 306\reg{Breakpoint instruction address} & \texttt{num\_instructions} to skip & \ipcbloc{IPCBuffer[0]} \\ 307\reg{Exception reason} & --- & \ipcbloc{IPCBuffer[1]} \\ 308\bottomrule 309\end{tabularx} 310\caption{\label{tbl:single_step_exception_result}Single-step fault message layout.} 311\end{table} 312 313\subsection{VM Fault} 314\label{sec:vm-fault} 315 316The thread caused a page fault. Replying to the fault IPC will restart 317the thread. The contents of the IPC message are given below.\\ 318 319\begin{table}[htb] 320\begin{tabularx}{\textwidth}{XXX} 321\toprule 322\textbf{Meaning} & \textbf{IPC buffer location} \\ 323\midrule 324 Program counter to restart execution at. & \ipcbloc{seL4\_VMFault\_IP} \\ 325Address that caused the fault. & \ipcbloc{seL4\_VMFault\_SP} \\ 326 Instruction fault (1 if the fault was caused by an instruction fetch). & \ipcbloc{seL4\_VMFault\_PrefetchFault} \\ 327Fault status register (FSR). Contains information about the cause of the fault. Architecture dependent. & \ipcbloc{seL4\_VMFault\_FSR} \\ 328\bottomrule 329\end{tabularx} 330\caption{\label{tbl:vm_fault_result_arm} VM Fault outcome on all architectures.} 331\end{table} 332 333\section{Domains} 334\label{sec:domains} 335 336Domains are used to isolate independent subsystems, so as to limit 337information flow between them. 338The kernel switches between domains according to a fixed, time-triggered 339schedule. 340The fixed schedule is compiled into the kernel via the constant 341\texttt{CONFIG\_NUM\_DOMAINS} and the global variable \texttt{ksDomSchedule}. 342 343A thread belongs to exactly one domain, and will only run when that domain 344is active. 345The \apifunc{seL4\_DomainSet\_Set}{domainset_set} method changes the domain 346of a thread. 347The caller must possess a \obj{Domain} cap and the thread's \obj{TCB} cap. 348The initial thread starts with a \obj{Domain} cap (see 349\autoref{sec:messageinfo}). 350 351\section{Virtualisation} 352\label{sec:virt} 353 354Hardware execution virtualisation is supported on specific arm and x86 platforms. The interface is exposed through a series 355of kernel objects, invocations and syscalls that allow the user to take advantage of hardware 356virtualisation features. 357 358Hardware virtualisation allows for a thread to perform instructions and operations as if it were 359running at a higher privilege level. As higher privilege levels typically have access to 360additional machine registers and other pieces of state a \obj{VCPU} object is introduced to act 361as storage for this state. For simplicity we refer to this virtualised higher privileged level as 362'guest mode'. \obj{VCPU}s are bound in a one-to-one relationship with a \obj{TCB} in order 363to provide a thread with this ability to run in higher privilege mode. See the section on 364ARM or x86 for more precise details. 365 366\obj{VCPU} objects also have additional, architecture specific, invocations for manipulating 367the additional state or other virtualisation controls provided by the hardware. Binding of 368a \obj{VCPU} to a \obj{TCB} is done by an invocation on the \obj{VCPU} only, and not the \obj{TCB}. 369 370The provided objects and invocations are, generally speaking, the thinnest possible shim over 371the underlying hardware primitives and operations. As a result an in depth familiarity with 372the underlying architecture specific harwdare mechanisms is required to use these objects, and 373such familiarity is therefore assumed in description. 374 375\subsection{ARM} 376 377When a \obj{TCB} has a bound \obj{VCPU} it is allowed to have the mode portion of the 378\texttt{cpsr} register set to values other than \texttt{user}. Specifically it may have any value other than 379\texttt{hypervisor}. 380 381TODO: this section needs more detail 382 383\subsection{x86} 384 385A \obj{TCB} with a bound \obj{VCPU} has two execution modes; one is the original thread just as 386if there was no bound \obj{VCPU}, and the other is the guest mode execution using the 387\obj{VCPU}. Switching from regular execution mode into the guest execution mode is 388done by using the \apifunc{seL4\_VMEnter}{sel4_vmenter} syscall. Executing this syscall causes the thread, whenever 389it is scheduled thereafter, to execute using the higher privlidged mode controlled by the \obj{VCPU}. 390Should the guest execution mode generate any kind of fault, or if a message arrives 391on the \obj{TCB}s bound notification, the \obj{TCB} will be switched back to regular mode 392and the \apifunc{seL4\_VMEnter}{sel4_vmenter} syscall will return with a message indicating the reason for return. 393 394\obj{VCPU} state and execution is controlled through the \apifunc{seL4\_VCPU\_ReadVMCS}{x86_vcpu_readvmcs} 395and \apifunc{seL4\_VCPU\_WriteVMCS}{x86_vcpu_writevmcs} invocations. 396These are very thin wrappers around the hardware \texttt{vmread} and \texttt{vmwrite} instructions and the kernel 397merely does enough validation on the parameters to ensure the \obj{VCPU} is not configured 398to run in such a way as to violate any kernel properties. For example, it is not possible to 399disable the use of External Interrupt Exiting, as this would prevent the kernel from receiving 400timer interrupts and allow the thread to monopolise CPU time. 401 402Memory access of the guest execution mode is controlled by requiring the use of Extended 403Page Tables (EPT). A series of EPT related paging structure objects (\obj{EPTPML4}, \obj{EPTPDPT}, \obj{EPTPD}, \obj{EPTPT}) 404exist and are manipulated in exactly the same manner as the objects for the regular virtual 405address space. Once constructed a \obj{TCB} can be given an \obj{EPTPML4} as an EPT root with \apifunc{seL4\_TCB\_SetEPTRoot}{x86_set_eptroot}, 406which serves as the vspace root when executing in guest mode, with the vspace root set 407with \apifunc{seL4\_TCB\_SetSPace}{tcb_setspace} or \apifunc{seL4\_TCB\_Configure}{tcb_configure} 408continuing to provide translation when the TCB is executing in its normal mode. 409 410Direct access to I/O ports can be given to the privlidged execution mode through the 411\apifunc{seL4\_X86\_VCPU\_EnableIOPort}{x86_vcpu_enableioport} invocation and allows the provided I/O port capability to be 412linked to the VCPU, and a subset of its I/O port range to be made accessible to the \obj{VCPU}. 413Linking means that an I/O port capability can only be used in a single \apifunc{seL4\_X86\_VCPU\_EnableIOPort}{x86_vcpu_enableioport} 414invocation and a second invocation will undo the previous one. The link also means that 415if the I/O port capability is deleted for any reason the access will be correspondingly removed 416from the \obj{VCPU}. 417