%
% Copyright 2014, General Dynamics C4 Systems
%
% SPDX-License-Identifier: GPL-2.0-only
%

\chapter{\label{ch:threads}Threads and Execution}

\section{Threads}
\label{sec:threads}

seL4 provides threads to represent an execution context. On MCS configurations
of the kernel, scheduling contexts are used to manage processor time. Without
MCS, processor time is also represented by the thread abstraction.
A thread is represented in seL4 by its thread control block object
(\obj{TCB}).

With MCS, a scheduling context is represented by a scheduling context object
(\obj{SCO}), and threads cannot run unless they are bound to, or receive, a
scheduling context.

\subsection{Thread control blocks}

Each \obj{TCB} has an associated CSpace (see \autoref{ch:cspace}) and VSpace
(see \autoref{ch:vspace}), which may be shared with other threads. A \obj{TCB}
may also have an IPC buffer (see \autoref{ch:ipc}), which is used to pass
extra arguments during IPC or kernel object invocation that do not fit in the
architecture-defined message registers. While it is not compulsory for a
thread to have an IPC buffer, without one it cannot perform most kernel
invocations, as they require cap transfer.
Each thread belongs to exactly one security domain (see \autoref{sec:domains}).
%FIXME: there is much more information held in the TCB!

\subsection{Thread Creation}
\label{sec:thread_creation}

Like other objects, \obj{TCB}s are created with the
\apifunc{seL4\_Untyped\_Retype}{untyped_retype} method (see
\autoref{sec:kernmemalloc}). A newly created thread is initially inactive. It
is configured by setting its CSpace and VSpace with the
\apifunc{seL4\_TCB\_SetSpace}{tcb_setspace}
or \apifunc{seL4\_TCB\_Configure}{tcb_configure} methods and then calling
\apifunc{seL4\_TCB\_WriteRegisters}{tcb_writeregisters} with an initial stack
pointer and instruction pointer.
The thread can then be activated either by setting the
\texttt{resume\_target} parameter in the
\apifunc{seL4\_TCB\_WriteRegisters}{tcb_writeregisters} invocation to true
or by separately calling the \apifunc{seL4\_TCB\_Resume}{tcb_resume} method.
Both of these methods place the thread in a runnable state.

On the master kernel, this will result in the thread immediately being added
to the scheduler. On the MCS kernel, the thread will only begin running if it
has a scheduling context object.

In an SMP configuration of the kernel, the thread will resume on the core
corresponding to the affinity of the thread. On the master kernel, this is set
using \apifunc{seL4\_TCB\_SetAffinity}{tcb_setaffinity}, while on the MCS
kernel the affinity is derived from the scheduling context object.

\subsection{Thread Deactivation}
\label{sec:thread_deactivation}

The \apifunc{seL4\_TCB\_Suspend}{tcb_suspend} method deactivates a thread.
Suspended threads can later be resumed.
Their suspended state can be retrieved with the
\apifunc{seL4\_TCB\_ReadRegisters}{tcb_readregisters} and
\apifunc{seL4\_TCB\_CopyRegisters}{tcb_copyregisters} methods.
They can also be reconfigured and reused, or left suspended indefinitely if
not needed. Threads will be automatically suspended when the last capability
to their \obj{TCB} is deleted.
% an example of which is demonstrated in \nameref{ex:second_thread}.

\subsection{Scheduling}
\label{sec:sched}

seL4 uses a preemptive, tickless scheduler with 256 priority levels (0--255).
All threads have a maximum controlled priority (MCP) and a priority, the
latter being the effective priority of the thread.
When a thread modifies the priority of another thread (or of itself), it must
provide a thread capability whose MCP is used as the authority for the change.
Threads can only set priorities and MCPs
to be less than or equal to the provided thread's MCP.
The initial task starts with both its MCP and priority set to the highest
priority in the system (\texttt{seL4\_MaxPrio}).
Thread priority and MCP can be set with the
\apifunc{seL4\_TCB\_SetSchedParams}{tcb_setschedparams},
\apifunc{seL4\_TCB\_SetPriority}{tcb_setpriority} and
\apifunc{seL4\_TCB\_SetMCPriority}{tcb_setmcpriority} methods.

Of threads eligible for scheduling, the highest priority thread in a runnable
state is chosen.

Thread priority (structure \texttt{seL4\_PrioProps\_t}) consists of two values
as follows:

\begin{description}
\item[Priority] the priority a thread will be scheduled with.
\item[Maximum controlled priority (MCP)] the highest priority a thread can set
itself or another thread to.
\end{description}

\subsection{MCS Scheduling}

This section only applies to configurations with MCS enabled, where threads
must have a scheduling context object available in order to be admitted to the
scheduler.

\subsection{Scheduling Contexts}
\label{sec:scheduling_contexts}

Access to CPU execution time is controlled through scheduling context objects.
Scheduling contexts are configured with a tuple of
\textit{budget} $(b)$ and \textit{period} $(p)$, both in microseconds, set by
\apifunc{seL4\_SchedControl\_Configure}{schedcontrol_configure} (see
\autoref{sec:sc_creation}).
The tuple $(b, p)$ forms an upper bound on the thread's execution -- the
kernel will not permit a thread to run for more than $b$ out of every $p$
microseconds. However, $\frac{b}{p}$ does not represent a lower bound on
execution, as a thread must have the highest or equal highest priority of all
runnable threads to be guaranteed to be scheduled at all, and the kernel does
not conduct an admission test. As a result, the set of all parameters is not
necessarily schedulable.
If multiple threads have available budget concurrently, they are scheduled
first-in, first-out, with round-robin scheduling applied once the budget is
expired.

A scheduling context that is eligible to be picked by the scheduler, i.e.\ one
that has budget available, is referred to as \emph{active}. Budget charging
and replenishment rules are different for round-robin and sporadic threads.
For round-robin threads, the budget is charged each time the current node's
scheduling context is changed, until it is depleted, at which point it is
refilled immediately.

Threads where $b = p$ are treated as round-robin threads, where $b$ acts as a
timeslice. Otherwise, the kernel uses \emph{sporadic servers} to enforce
temporal isolation, which enforce the property that $\frac{b}{p}$ cannot be
exceeded for all possible $p$.
In theory, sporadic servers provide temporal isolation -- preventing threads
from exceeding their allocated budget -- by using the following algorithm:

\begin{itemize}
\item When a thread starts executing at current time $T$, record $T_{s}$.
\item When a thread stops executing (blocks or is preempted), schedule a
replenishment at $T_{s} + p$ for the amount of time consumed ($T - T_{s}$) and
subtract that amount from the current replenishment being used.
\end{itemize}

seL4 implements this algorithm by maintaining an ordered list of sporadic
replenishments -- \texttt{refills} for brevity -- in each scheduling context.
Each replenishment contains a tuple of the time it is eligible for use
(\texttt{rTime}) and the amount that replenishment is for (\texttt{rAmount}).
While a thread is executing, it constantly drains the budget from the
\texttt{rAmount} at the head of the replenishment list. If the \texttt{rTime}
is in the future, the thread bound to that scheduling context is placed in a
queue of threads waiting for more budget.
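The refill bookkeeping described above can be sketched as a small C
simulation. This is an illustrative model, not the kernel's implementation:
the names \texttt{refill\_t} and \texttt{sched\_context\_t}, the fixed-size
array, and the simplified charging rule (the whole consumed amount is moved
into one new refill, and list overflow is ignored) are all assumptions made
for the sketch.

```c
#include <assert.h>

/* Illustrative sketch of sporadic-server refill bookkeeping.
 * Names and structures are invented; they are not kernel types. */
#define MAX_REFILLS 8

typedef struct {
    unsigned long r_time;   /* time this refill becomes eligible for use */
    unsigned long r_amount; /* budget this refill provides */
} refill_t;

typedef struct {
    refill_t refills[MAX_REFILLS]; /* ordered by r_time, head at index 0 */
    int count;
    unsigned long period;
} sched_context_t;

/* Charge 'used' time units of execution that began at time 'start':
 * drain the head refill and schedule a replenishment at start + period. */
static void charge(sched_context_t *sc, unsigned long start, unsigned long used)
{
    sc->refills[0].r_amount -= used;
    if (sc->refills[0].r_amount == 0) {
        /* head refill exhausted: drop it and shift the list up */
        for (int i = 1; i < sc->count; i++)
            sc->refills[i - 1] = sc->refills[i];
        sc->count--;
    }
    /* append a replenishment for the consumed time, one period later */
    sc->refills[sc->count++] = (refill_t){ start + sc->period, used };
}

/* A thread may run at time 'now' only if the head refill is available. */
static int budget_available(const sched_context_t *sc, unsigned long now)
{
    return sc->count > 0 && sc->refills[0].r_time <= now;
}
```

For example, a context with $b = 500$ and $p = 1000$ that runs for its full
budget from time 0 has no budget again until time 1000, matching the
$\frac{b}{p}$ bound described above.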

Round-robin threads are treated the same as sporadic threads except for one
case: how the budget is charged. Round-robin threads have only two refills,
both of which are always ready to be used. When a round-robin thread stops
executing, budget is moved from the head to the tail replenishment. Once the
head budget is consumed, the thread is removed from the scheduling queue for
its priority and appended at the tail.

Sporadic threads behave differently depending on the number of replenishments
available, which must be bounded. Developers have two options to configure the
size of the replenishment list:

\begin{itemize}
\item The maximum number of refills in a single scheduling context is
determined by the size of the scheduling context when created by
\apifunc{seL4\_Untyped\_Retype}{untyped_retype}.
\item A per-scheduling-context parameter, \texttt{extra\_refills}, limits the
number of refills for that specific scheduling context. This value is added to
the base value of 2 and is limited by the size of the scheduling context.
\end{itemize}

Threads that have short execution times (e.g.\ interrupt handlers) and are not
frequently preempted should have fewer refills, while longer-running threads
with large values of $b$ should have more. Threads bound to a scheduling
context with 0 extra refills will behave periodically -- tasks that use their
head replenishment, or call yield, will not be scheduled again until the start
of their next period.

Given that the number of replenishments is limited, if a node's scheduling
context changes and the outgoing scheduling context does not have enough space
to store the new replenishment, space is created by removing the current
replenishment, which can result in preemption if the next replenishment is not
yet available.
Scheduling contexts with a higher number of configured refills will consume
closer to their whole budget, as they can be preempted or switch threads more
often without filling their replenishment queue. However, the scheduling
overhead will be higher, as the replenishment list is subject to
fragmentation.

Whenever a thread is executing, it consumes the budget from its current
scheduling context. The system call \apifunc{seL4\_Yield}{sel4_yield} can be
used to sacrifice any remaining budget and block until the next replenishment
is ready to be used.

Threads can be bound to scheduling contexts using
\apifunc{seL4\_TCB\_Configure}{tcb_configure} or
\apifunc{seL4\_SchedContext\_Bind}{schedcontext_bind}. Both invocations have
the same effect, although \apifunc{seL4\_TCB\_Configure}{tcb_configure} allows
more thread fields to be set with only one kernel entry. When a thread is
bound to a scheduling context, if it is in a runnable state and the scheduling
context is active, it will be added to the scheduler.

\subsection{Passive Threads}
\label{sec:passive}

Threads can be unbound from a scheduling context with
\apifunc{seL4\_SchedContext\_UnbindObject}{schedcontext_unbindobject}. This is
distinct from suspending a thread, in that threads that are blocked waiting on
an endpoint or notification queue will remain in the queue and can still
receive messages and signals. However, the unbound thread will not be
schedulable again until it receives a scheduling context. Threads without
scheduling contexts are referred to as \emph{passive} threads, as they cannot
execute without the action of another thread.

\subsection{Scheduling Context Creation}
\label{sec:sc_creation}

Like other objects, scheduling contexts are created from untyped memory using
\apifunc{seL4\_Untyped\_Retype}{untyped_retype}. On creation, scheduling
contexts are empty, representing 0\% of CPU execution time.
To populate a scheduling context with parameters, one must invoke the
appropriate \obj{SchedControl} capability, which provides access to CPU time
management on a single node. A scheduling control cap for each node is
provided to the initial task at run time. Threads run on the node that their
scheduling context is configured for. Scheduling context parameters can then
be set and updated using
\apifunc{seL4\_SchedControl\_Configure}{schedcontrol_configure}, which allows
the budget and period to be specified.

The kernel does not conduct any schedulability tests, as task admission is
left to user-level policy and can be conducted online or offline, statically
or dynamically, or not at all.

\subsection{Scheduling Context Donation}

In addition to explicitly binding and removing scheduling contexts through
\apifunc{seL4\_SchedContext\_Bind}{schedcontext_bind} and
\apifunc{seL4\_SchedContext\_UnbindObject}{schedcontext_unbindobject},
scheduling contexts can move between threads over IPC. Scheduling contexts are
donated implicitly when the system calls \apifunc{seL4\_Call}{sel4_call} and
\apifunc{seL4\_NBSendRecv}{sel4_nbsendrecv} are used to communicate with a
passive thread. When an active thread invokes an endpoint with
\apifunc{seL4\_Call}{sel4_call} and rendezvouses with a passive thread, the
active thread's scheduling context is donated to the passive thread. The
generated reply cap ensures that the callee is merely borrowing the scheduling
context: when the reply cap is consumed by a reply message being sent, the
scheduling context will be returned to the caller. If the reply cap is revoked
while the callee holds the scheduling context, the scheduling context will
likewise be returned to the caller.
However, if a reply cap in the middle of a deep call chain is revoked, such
that the callee does not possess the scheduling context, the thread will be
removed from the call chain and the scheduling context will remain where it
is. If the receiver does not provide a reply object to track the donation in
(i.e.\ it uses \apifunc{seL4\_Wait}{sel4_wait} instead of
\apifunc{seL4\_Recv}{sel4_recv}), scheduling context donation will not occur,
but the message will still be delivered. The passive receiver will be set to
inactive, as it does not have a scheduling context.

Consider an example where thread A calls thread B, which calls thread C. If,
whilst C holds the scheduling context, B's reply cap to A is revoked, then the
scheduling context will remain with C. However, a call chain will remain
between A and C, such that if C's reply cap is revoked or invoked, the
scheduling context will return to A.

\apifunc{seL4\_NBSendRecv}{sel4_nbsendrecv} can also result in scheduling
context donation. If the non-blocking send phase of the operation results in
message delivery to a passive thread, the scheduling context will be donated
to that passive thread, and the thread making the system call becomes passive
on the receiving endpoint in the receive phase. No reply capability is
generated, so there is no guarantee that the scheduling context will return;
this increases bookkeeping complexity but allows for data-flow-like
architectures rather than remote procedure calls. Note that
\apifunc{seL4\_Call}{sel4_call} does not guarantee the return of a scheduling
context either: it is an inherently trusted operation, as the server could
never reply and so never return the scheduling context.

Scheduling contexts can also be bound to notification objects using
\apifunc{seL4\_SchedContext\_Bind}{schedcontext_bind} and unbound using
\apifunc{seL4\_SchedContext\_UnbindObject}{schedcontext_unbindobject}.
If a signal is delivered to a notification object with a passive thread
blocked waiting on it, the passive thread will receive the scheduling context
that is bound to the notification object. The scheduling context is returned
when the thread blocks on the notification object again. This feature allows
passive servers to use notification binding (see
\autoref{sec:notification-binding}).

Scheduling contexts can be unbound from all objects (notification objects, and
TCBs that are bound to or have received a scheduling context through donation)
using \apifunc{seL4\_SchedContext\_Unbind}{schedcontext_unbind}.

Passive threads will run on the CPU node that the scheduling context was
configured with, and will be migrated on IPC.

\subsection{Scheduling Algorithm}
\label{sec:mcs-sched}

Threads are only eligible for scheduling if they have an active scheduling
context. Of threads eligible for scheduling, the highest priority thread in a
runnable state is chosen.

Threads of sufficient maximum controlled priority and with possession of the
appropriate scheduling context capability can manipulate the scheduler and
implement user-level schedulers using IPC.

Scheduling contexts provide access to, and an upper bound on, CPU execution
time; however, when a thread executes is determined by thread priority.
Consequently, access to the CPU is a function of thread MCPs, scheduling
contexts and the \obj{SchedControl} capability. The kernel will enforce that
threads do not exceed the budget in their scheduling context for any given
period, and that the highest priority thread will always run; however, it is
up to the system designer to make sure the entire system is schedulable.
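The scheduling rule above (of all runnable threads holding an active
scheduling context, the highest priority wins) can be sketched in a few lines
of C. The \texttt{thread\_t} structure is invented for illustration; it is not
a kernel type, and the kernel's actual scheduler uses per-priority queues
rather than a linear scan.

```c
#include <stddef.h>

/* Illustrative sketch of the MCS scheduling rule. */
typedef struct {
    int priority;   /* 0..255, higher runs first */
    int runnable;   /* thread is in a runnable state */
    int sc_active;  /* bound scheduling context has budget available */
} thread_t;

/* Return the highest-priority eligible thread, or NULL if none exists. */
static thread_t *pick(thread_t *threads, size_t n)
{
    thread_t *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (!threads[i].runnable || !threads[i].sc_active)
            continue; /* passive or out-of-budget threads are ineligible */
        if (best == NULL || threads[i].priority > best->priority)
            best = &threads[i];
    }
    return best;
}
```

Note that a high-priority thread whose scheduling context has exhausted its
budget is skipped entirely, which is how priority and budget combine to bound
CPU access.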


\subsection{Exceptions}
\label{sec:exceptions}

Each thread has two associated exception-handler endpoints: a \emph{standard}
exception handler and a \emph{timeout} exception handler, where the latter is
MCS only. If the thread causes an exception, the kernel creates an IPC message
with the relevant details and sends this to the appropriate endpoint. The
handling thread can then take the appropriate action. Fault IPC messages are
described in \autoref{sec:faults}. Standard exception-handler endpoints can be
set with the \apifunc{seL4\_TCB\_SetSpace}{tcb_setspace} or
\apifunc{seL4\_TCB\_SetSchedParams}{tcb_setschedparams} methods, while timeout
exception handlers can be set with
\apifunc{seL4\_TCB\_SetTimeoutEndpoint}{tcb_settimeoutendpoint} (MCS only).
With these methods, a capability address for the exception handler can be
associated with a thread. This address is then used to look up the handler
endpoint, and the capability to the endpoint is installed into the thread's
kernel CNode. For threads without an exception handler, a null capability can
be used; however, the consequences differ per exception handler type. Before
raising an exception, the handler capability is validated. The kernel does not
perform another lookup, but checks that the capability is an endpoint with the
correct rights.

The exception endpoint must have Write and either Grant or GrantReply rights.
Replying to the exception message restarts the thread. For certain exception
types, the contents of the reply message may be used to set the values in the
registers of the thread being restarted. See \autoref{sec:faults} for details.

\subsubsection{Standard Exceptions}

The standard exception handler is used when a fault is triggered by a thread
which cannot be recovered without action by another thread.
For example, if a thread raises a fault due to an unmapped virtual memory
page, the thread cannot make any more progress until the page is mapped. If a
thread experiences a fault that would trigger the standard exception handler
while that handler is set to a null capability, the kernel will pause the
thread and it will not run again. This is because, without action by another
thread, standard exceptions cannot be recovered from. Consequently, threads
without standard exception handlers should be trusted not to fault at all.

Standard exception handlers can be passive, in which case they will run on the
scheduling context of the faulting thread.

\subsubsection{Timeout Exceptions (MCS Only)}
\label{sec:timeout-exceptions}

Timeout faults are raised when a thread attempts to run but has no available
budget, provided that thread has a valid timeout exception handler capability.
The handling of timeout faults is not compulsory: if a thread does not have a
timeout fault handler, a fault will not be raised and the thread will continue
running when its budget is replenished. This allows temporally sensitive
threads to handle budget overruns while other threads may ignore them.

Timeout faults are registered per thread, which means that while clients may
not have a timeout fault handler, servers may, allowing single-threaded,
time-sensitive, passive servers to use a timeout exception handler to recover
from malicious or untrusted clients whose budget expires while the server is
completing the request. Timeout fault handlers can access server reply objects
and reply with an error to the client, then reset the server to handle the
next client request.

If a reply message is sent to a nested server and a scheduling context without
available budget is returned, another timeout fault will be generated if the
nested server also has a timeout fault handler.
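The optional nature of timeout faults can be summarised as a small decision
function. This is a sketch of the policy only; the names
\texttt{expiry\_action\_t} and \texttt{tcb\_t} are invented for illustration
and do not correspond to kernel code.

```c
/* Sketch of the budget-expiry policy: a timeout fault is raised only if the
 * thread registered a timeout handler; otherwise the thread simply waits
 * until its budget is replenished. Types are illustrative only. */
typedef enum { WAIT_FOR_REPLENISH, RAISE_TIMEOUT_FAULT } expiry_action_t;

typedef struct {
    int has_timeout_handler; /* non-null timeout handler cap installed */
} tcb_t;

static expiry_action_t on_budget_expired(const tcb_t *t)
{
    return t->has_timeout_handler ? RAISE_TIMEOUT_FAULT
                                  : WAIT_FOR_REPLENISH;
}
```

This is why a passive server can install a timeout handler to recover from a
client's budget expiry while the client itself, lacking a handler, is simply
delayed until its next replenishment.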

\subsection{Message Layout of the Read-/Write-Registers Methods}
\label{sec:read_write_registers}

The registers of a thread can be read and written with the
\apifunc{seL4\_TCB\_ReadRegisters}{tcb_readregisters} and
\apifunc{seL4\_TCB\_WriteRegisters}{tcb_writeregisters} methods. For some
registers, the kernel will silently mask certain bits or ranges of bits off,
and force them to contain certain values, to ensure that they cannot be
maliciously set to values that would compromise the running system, or to
respect values that the architecture specifications have mandated to be
certain values. The register contents are transferred via the IPC buffer.

\section{Faults}
\label{sec:faults}

A thread's actions may result in a fault. Faults are delivered to the thread's
exception handler so that it can take the appropriate action. The fault type
is specified in the message label and is one of:
\texttt{seL4\_Fault\_CapFault}, \texttt{seL4\_Fault\_VMFault},
\texttt{seL4\_Fault\_UnknownSyscall}, \texttt{seL4\_Fault\_UserException},
\texttt{seL4\_Fault\_DebugException},
\texttt{seL4\_Fault\_TimeoutFault}, or \texttt{seL4\_Fault\_NullFault}
(indicating no fault occurred and this is a normal IPC message).

Faults are delivered in such a way as to imitate a Call from the faulting
thread. This means that to send a fault message, the fault endpoint must have
Write and either Grant or GrantReply permissions. If this is not the case, a
double fault happens (generally the thread is simply suspended).

\subsection{Capability Faults}

Capability faults may occur in two places. Firstly, a capability fault can
occur when lookup of a capability referenced by a
\apifunc{seL4\_Call}{sel4_call} or \apifunc{seL4\_Send}{sel4_send} system call
fails (\apifunc{seL4\_NBSend}{sel4_nbsend} calls on invalid capabilities
silently fail).
In this case, the capability on which the fault occurred may be the capability
being invoked or an extra capability passed in the \texttt{caps} field in the
IPC buffer.

Secondly, a capability fault can occur when \apifunc{seL4\_Recv}{sel4_recv} or
\apifunc{seL4\_NBRecv}{sel4_nbrecv} is called on a capability that does not
exist, is not an endpoint or notification capability, or does not have receive
permissions.

Replying to the fault IPC will restart the faulting thread. The contents of
the IPC message are given in \autoref{tbl:ipc_contents}.\\

\begin{table}[htb]
\noindent\begin{tabularx}{\textwidth}{XX}
\toprule
\textbf{Meaning} & \textbf{IPC buffer location} \\
\midrule
Address at which to restart execution & \ipcbloc{seL4\_CapFault\_IP} \\
Capability address & \ipcbloc{seL4\_CapFault\_Addr} \\
In receive phase (1 if the fault happened during a receive system call, 0
otherwise) & \ipcbloc{seL4\_CapFault\_InRecvPhase} \\
Lookup failure description, as described in \autoref{sec:lookup_fail_desc} &
\ipcbloc{seL4\_CapFault\_LookupFailureType} \\
\bottomrule
\end{tabularx}
\caption{\label{tbl:ipc_contents}Contents of a capability-fault IPC message.}
\end{table}

\subsection{Unknown Syscall}
\label{sec:unknown-syscall}

This fault occurs when a thread executes a system call with a syscall number
that is unknown to seL4. The register set of the faulting thread is passed to
the thread's exception handler so that it may, for example, emulate the system
call if a thread is being virtualised.

Replying to the fault IPC allows the thread to be restarted and/or the
thread's register set to be modified. If the reply has a label of zero, the
thread will be restarted. Additionally, if the message length is non-zero, the
faulting thread's register set will be updated.
In this case, the number of registers updated is controlled with the length
field of the message tag.

\subsection{User Exception}

User exceptions are used to deliver architecture-defined exceptions. For
example, such an exception could occur if a user thread attempted to divide a
number by zero.

Replying to the fault IPC allows the thread to be restarted and/or the
thread's register set to be modified. If the reply has a label of zero, the
thread will be restarted. Additionally, if the message length is non-zero, the
faulting thread's register set will be updated. In this case, the number of
registers updated is controlled with the length field of the message tag.

\subsection{Debug Exception: Breakpoints and Watchpoints}
\label{sec:debug_exceptions}

Debug exceptions are used to deliver trace- and debug-related events to
threads. Breakpoints, watchpoints, trace events and instruction-performance
sampling events are examples. These events are supported for userspace threads
when the kernel is configured to include them (when
\texttt{CONFIG\_HARDWARE\_DEBUG\_API} is set). The hardware debugging
extensions API is supported on the following subset of the platforms that the
kernel has been ported to:

\begin{itemize}
\item PC99: IA-32 and x86\_64
\item Sabrelite (i.MX6)
\item Jetson TegraK1
\item HiSilicon Hikey
\item Raspberry Pi 3
\item Odroid-X (Exynos4)
\item Xilinx zynq7000
\end{itemize}

Information on the available hardware debugging resources is presented in the
form of the following constants:

\begin{description}
\item[seL4\_NumHWBreakpoints]: Defines the total number of hardware break
registers available, of all types available on the hardware platform. On the
ARM Cortex-A7, for example, there are 6 exclusive instruction breakpoint
registers and 4 exclusive data watchpoint registers, for a total of 10 monitor
registers.
On this platform, therefore, \texttt{seL4\_NumHWBreakpoints} is defined as 10.
The instruction breakpoint registers will always be assigned the lower
API-IDs, and the data watchpoints will always be assigned following them.

Additionally, \texttt{seL4\_NumExclusiveBreakpoints},
\texttt{seL4\_NumExclusiveWatchpoints} and
\texttt{seL4\_NumDualFunctionMonitors} are defined for each target platform to
reflect the number of available hardware breakpoints/watchpoints of a certain
type.

\item[seL4\_NumExclusiveBreakpoints]: Defines the number of hardware registers
capable of generating a fault \textbf{only} on instruction execution.
Currently this will be set only on ARM platforms. The API-ID of the first
exclusive breakpoint is given in \texttt{seL4\_FirstBreakpoint}. If there are
no instruction-break exclusive registers,
\texttt{seL4\_NumExclusiveBreakpoints} will be set to \texttt{0} and
\texttt{seL4\_FirstBreakpoint} will be set to -1.

\item[seL4\_NumExclusiveWatchpoints]: Defines the number of hardware registers
capable of generating a fault \textbf{only} on data access. Currently this
will be set only on ARM platforms. The API-ID of the first exclusive
watchpoint is given in \texttt{seL4\_FirstWatchpoint}. If there are no
data-break exclusive registers, \texttt{seL4\_NumExclusiveWatchpoints} will be
set to \texttt{0} and \texttt{seL4\_FirstWatchpoint} will be set to -1.

\item[seL4\_NumDualFunctionMonitors]: Defines the number of hardware registers
capable of generating a fault on either type of access -- i.e.\ the register
supports both instruction and data breaks. Currently this will be set only on
x86 platforms. The API-ID of the first dual-function monitor is given in
\texttt{seL4\_FirstDualFunctionMonitor}.
If there are no dual-function break registers,
\texttt{seL4\_NumDualFunctionMonitors} will be set to \texttt{0} and
\texttt{seL4\_FirstDualFunctionMonitor} will be set to -1.

\end{description}

\begin{table}[h]
\begin{tabularx}{\textwidth}{XX}
\toprule
\textbf{Value sent} & \textbf{IPC buffer location} \\
\midrule
\reg{Breakpoint instruction address} & \ipcbloc{IPCBuffer[0]} \\
\reg{Exception reason} & \ipcbloc{IPCBuffer[1]} \\
\reg{Watchpoint data access address} & \ipcbloc{IPCBuffer[2]} \\
\reg{Register API-ID} & \ipcbloc{IPCBuffer[3]} \\
\bottomrule
\end{tabularx}
\caption{\label{tbl:debug_exception_result}Debug fault message layout. The
register API-ID is not returned in the fault message from the kernel on
single-step faults.}
\end{table}

\subsection{Debug Exception: Single-stepping}
\label{sec:single_stepping_debug_exception}

The kernel provides support for hardware single-stepping of userspace threads
when configured to do so (when \texttt{CONFIG\_HARDWARE\_DEBUG\_API} is set).
To this end it exposes the invocation
\texttt{seL4\_TCB\_ConfigureSingleStepping}.

The caller is expected to select an API-ID that corresponds to an instruction
breakpoint to use when setting up the single-stepping functionality (i.e.\ an
API-ID from 0 to \texttt{seL4\_NumExclusiveBreakpoints} - 1). However, not all
hardware platforms require an actual hardware breakpoint register to provide
single-stepping functionality. If the caller's hardware platform requires the
use of a hardware breakpoint register, the kernel will use the breakpoint
register given to it in \texttt{bp\_num}, and return \texttt{true} in
\texttt{bp\_was\_consumed}. If the underlying platform does not need a
hardware breakpoint to provide single-stepping, seL4 will return
\texttt{false} in \texttt{bp\_was\_consumed} and leave \texttt{bp\_num}
unchanged.
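The API-ID numbering that the single-stepping invocation relies on
(instruction breakpoints occupy the lowest IDs, data watchpoints follow) can
be sketched as below. The constants mirror the ARM Cortex-A7 example from the
previous section and are assumptions for illustration; real code would use the
\texttt{seL4\_NumExclusiveBreakpoints} and \texttt{seL4\_FirstBreakpoint}
constants exported by libsel4 for the target platform.

```c
/* Sketch of the API-ID layout on a platform with exclusive breakpoint and
 * watchpoint registers. The enum values mimic the ARM Cortex-A7 example
 * (6 instruction breakpoints, 4 data watchpoints); they are illustrative,
 * not the libsel4 constants themselves. */
enum {
    NUM_EXCLUSIVE_BREAKPOINTS = 6,
    NUM_EXCLUSIVE_WATCHPOINTS = 4,
    FIRST_BREAKPOINT          = 0,
    FIRST_WATCHPOINT          = FIRST_BREAKPOINT + NUM_EXCLUSIVE_BREAKPOINTS,
    NUM_HW_BREAKPOINTS        = NUM_EXCLUSIVE_BREAKPOINTS
                              + NUM_EXCLUSIVE_WATCHPOINTS,
};

/* Valid API-IDs to pass to single-stepping setup: instruction breakpoints. */
static int is_instruction_breakpoint(int api_id)
{
    return api_id >= FIRST_BREAKPOINT &&
           api_id <  FIRST_BREAKPOINT + NUM_EXCLUSIVE_BREAKPOINTS;
}

/* Data watchpoints occupy the IDs immediately after the breakpoints. */
static int is_data_watchpoint(int api_id)
{
    return api_id >= FIRST_WATCHPOINT && api_id < NUM_HW_BREAKPOINTS;
}
```

On this layout, a caller setting up single-stepping would pick an API-ID for
which \texttt{is\_instruction\_breakpoint} holds, i.e.\ 0 through 5.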

If \texttt{bp\_was\_consumed} is \texttt{true}, the caller should not attempt
to re-configure \texttt{bp\_num} for breakpoint or watchpoint usage until the
caller has disabled single-stepping and released that register, via a
subsequent call to \texttt{seL4\_TCB\_ConfigureSingleStepping}, or a
fault-reply with \texttt{n\_instr} being 0. Setting \texttt{num\_instructions}
to \texttt{0} \textbf{disables single-stepping}.

On architectures that require an actual hardware register to be configured for
single-stepping functionality, seL4 will restrict the number of registers that
can be configured as single-steppers to one at any given time. The register
that is currently configured (if any) for single-stepping will be the implicit
\texttt{bp\_num} argument in a single-step debug fault reply.

The kernel's single-stepping also supports skipping a certain number of
instructions before delivering the single-step fault message.
\texttt{num\_instructions} should be set to \texttt{1} to single-step every
instruction, or to any larger value to skip that many instructions before
resuming single-stepping. This skip-count can also be set in the fault-reply
to a single-step debug fault.

\begin{table}[h]
\begin{tabularx}{\textwidth}{XXX}
\toprule
\textbf{Value sent} & \textbf{Register set by reply} & \textbf{IPC buffer location} \\
\midrule
\reg{Breakpoint instruction address} & \texttt{num\_instructions} to skip & \ipcbloc{IPCBuffer[0]} \\
\reg{Exception reason} & --- & \ipcbloc{IPCBuffer[1]} \\
\bottomrule
\end{tabularx}
\caption{\label{tbl:single_step_exception_result}Single-step fault message layout.}
\end{table}

\subsection{Timeout Fault (MCS only)}
\label{sec:timeout-fault}

Timeout faults are raised when a thread consumes all of its budget and has a
timeout fault handler that is not a null capability.
They allow a timeout
exception handler to take some action to restore the thread, and deliver a
message containing the scheduling context data word, as well as the amount of
time consumed since the last timeout fault occurred on this scheduling context,
or since \apifunc{seL4\_SchedContext\_YieldTo}{schedcontext_yieldto} or
\apifunc{seL4\_SchedContext\_Consumed}{schedcontext_consumed} was last called.
Timeout exception handlers can reply to a timeout fault with the registers set
in the same format as outlined in \autoref{sec:read_write_registers}.

\begin{table}[htb]
\noindent\begin{tabularx}{\textwidth}{XX}
\toprule
\textbf{Meaning} & \textbf{IPC buffer location} \\
\midrule
Data word from the scheduling context object that the thread was running on
when the fault occurred. & \ipcbloc{seL4\_TimeoutFault\_Data} \\
Upper 32 bits of microseconds consumed since last reset &
\ipcbloc{seL4\_TimeoutFault\_Consumed} \\
Lower 32 bits of microseconds consumed since last reset &
\ipcbloc{seL4\_TimeoutFault\_Consumed\_LowBits} \\
\bottomrule
\end{tabularx}\\ \\
\caption{\label{tbl:tf_message_32}Timeout fault outcome on 32-bit architectures.}
\end{table}

\subsection{VM Fault}
\label{sec:vm-fault}

The thread caused a page fault. Replying to the fault IPC will restart
the thread. The contents of the IPC message are given below.\\

\begin{table}[htb]
\begin{tabularx}{\textwidth}{XX}
\toprule
\textbf{Meaning} & \textbf{IPC buffer location} \\
\midrule
Program counter to restart execution at. & \ipcbloc{seL4\_VMFault\_IP} \\
Address that caused the fault. & \ipcbloc{seL4\_VMFault\_SP} \\
Instruction fault (1 if the fault was caused by an instruction fetch). & \ipcbloc{seL4\_VMFault\_PrefetchFault} \\
Fault status register (FSR). Contains information about the cause of the fault. Architecture dependent.
& \ipcbloc{seL4\_VMFault\_FSR} \\
\bottomrule
\end{tabularx}
\caption{\label{tbl:vm_fault_result_arm}VM Fault outcome on all architectures.}
\end{table}

\section{Domains}
\label{sec:domains}

Domains are used to isolate independent subsystems, so as to limit
information flow between them.
The kernel switches between domains according to a fixed, time-triggered
schedule.
The fixed schedule is compiled into the kernel via the constant
\texttt{CONFIG\_NUM\_DOMAINS} and the global variable \texttt{ksDomSchedule}.

A thread belongs to exactly one domain, and will only run when that domain
is active.
The \apifunc{seL4\_DomainSet\_Set}{domainset_set} method changes the domain
of a thread.
The caller must possess a \obj{Domain} cap and the thread's \obj{TCB} cap.
The initial thread starts with a \obj{Domain} cap (see
\autoref{sec:messageinfo}).

\section{Virtualisation}
\label{sec:virt}

Hardware execution virtualisation is supported on specific ARM and x86 platforms. The interface is exposed through a series
of kernel objects, invocations and syscalls that allow the user to take advantage of hardware
virtualisation features.

Hardware virtualisation allows a thread to perform instructions and operations as if it were
running at a higher privilege level. As higher privilege levels typically have access to
additional machine registers and other pieces of state, a \obj{VCPU} object is introduced to act
as storage for this state. For simplicity we refer to this virtualised higher privileged level as
`guest mode'. \obj{VCPU}s are bound in a one-to-one relationship with a \obj{TCB} in order
to provide a thread with the ability to run in this higher privilege mode. See the sections on
ARM and x86 below for more precise details.
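As a hedged sketch of the one-to-one binding described above, binding a \obj{VCPU} to a thread on x86 might look like the following. The generated C name \texttt{seL4\_X86\_VCPU\_SetTCB} and the cap variables are assumptions for illustration (ARM has an analogous invocation on its \obj{VCPU} object):

```c
#include <sel4/sel4.h>

/* Hedged sketch: bind a VCPU to a thread so the thread may run in guest
 * mode. The `vcpu` and `tcb` caps are assumed to have been created from
 * untyped memory elsewhere. Note the invocation is performed on the VCPU,
 * not on the TCB. */
seL4_Error bind_vcpu(seL4_X86_VCPU vcpu, seL4_TCB tcb)
{
    return seL4_X86_VCPU_SetTCB(vcpu, tcb);
}
```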
\obj{VCPU} objects also have additional, architecture-specific invocations for manipulating
the additional state or other virtualisation controls provided by the hardware. Binding of
a \obj{VCPU} to a \obj{TCB} is done by an invocation on the \obj{VCPU} only, and not the \obj{TCB}.

The provided objects and invocations are, generally speaking, the thinnest possible shim over
the underlying hardware primitives and operations. As a result, an in-depth familiarity with
the underlying architecture-specific hardware mechanisms is required to use these objects, and
such familiarity is therefore assumed in this description.

\subsection{ARM}

When a \obj{TCB} has a bound \obj{VCPU} it is allowed to have the mode portion of the
\texttt{cpsr} register set to values other than \texttt{user}. Specifically, it may have any value other than
\texttt{hypervisor}.

TODO: this section needs more detail

\subsection{x86}

A \obj{TCB} with a bound \obj{VCPU} has two execution modes: one is the original thread, just as
if there were no bound \obj{VCPU}, and the other is guest-mode execution using the
\obj{VCPU}. Switching from regular execution mode into the guest execution mode is
done using the \apifunc{seL4\_VMEnter}{sel4_vmenter} syscall. Executing this syscall causes the thread, whenever
it is scheduled thereafter, to execute using the higher privileged mode controlled by the \obj{VCPU}.
Should the guest execution mode generate any kind of fault, or should a message arrive
on the \obj{TCB}'s bound notification, the \obj{TCB} will be switched back to regular mode
and the \apifunc{seL4\_VMEnter}{sel4_vmenter} syscall will return with a message indicating the reason for return.

\obj{VCPU} state and execution is controlled through the \apifunc{seL4\_VCPU\_ReadVMCS}{x86_vcpu_readvmcs}
and \apifunc{seL4\_VCPU\_WriteVMCS}{x86_vcpu_writevmcs} invocations.
These are very thin wrappers around the hardware \texttt{vmread} and \texttt{vmwrite} instructions; the kernel
merely does enough validation on the parameters to ensure the \obj{VCPU} is not configured
to run in such a way as to violate any kernel properties. For example, it is not possible to
disable the use of External Interrupt Exiting, as this would prevent the kernel from receiving
timer interrupts and allow the thread to monopolise CPU time.

Memory access of the guest execution mode is controlled by requiring the use of Extended
Page Tables (EPT). A series of EPT-related paging structure objects (\obj{EPTPML4}, \obj{EPTPDPT}, \obj{EPTPD}, \obj{EPTPT})
exist and are manipulated in exactly the same manner as the objects for the regular virtual
address space. Once constructed, a \obj{TCB} can be given an \obj{EPTPML4} as an EPT root with \apifunc{seL4\_TCB\_SetEPTRoot}{x86_set_eptroot},
which serves as the VSpace root when executing in guest mode; the VSpace root set
with \apifunc{seL4\_TCB\_SetSpace}{tcb_setspace} or \apifunc{seL4\_TCB\_Configure}{tcb_configure}
continues to provide translation when the \obj{TCB} is executing in its normal mode.

Direct access to I/O ports can be given to the privileged execution mode through the
\apifunc{seL4\_X86\_VCPU\_EnableIOPort}{x86_vcpu_enableioport} invocation, which links the provided I/O port capability
to the \obj{VCPU} and makes a subset of its I/O port range accessible to the \obj{VCPU}.
Linking means that an I/O port capability can only be used in a single \apifunc{seL4\_X86\_VCPU\_EnableIOPort}{x86_vcpu_enableioport}
invocation; a second invocation will undo the previous one. The link also means that
if the I/O port capability is deleted for any reason, the access will be correspondingly removed
from the \obj{VCPU}.
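As an unverified sketch of the x86 guest-execution flow described above: the generated C names (\texttt{seL4\_X86\_VCPU\_WriteVMCS}, the \texttt{seL4\_VMEnter} signature) and the VMCS field encoding \texttt{0x681E} (Intel's encoding of the guest RIP field) are assumptions for the example, not definitive API documentation:

```c
#include <sel4/sel4.h>

/* Assumed VMCS field encoding for the guest RIP (Intel SDM encoding). */
#define VMCS_GUEST_RIP 0x681E

/* Hedged sketch: set the guest's entry point via the VMCS, then enter
 * guest mode. `vcpu` is assumed to be a seL4_X86_VCPU cap already bound
 * to the calling thread's TCB, with an EPT root installed. */
void run_guest(seL4_X86_VCPU vcpu, seL4_Word guest_entry)
{
    seL4_X86_VCPU_WriteVMCS_t wr =
        seL4_X86_VCPU_WriteVMCS(vcpu, VMCS_GUEST_RIP, guest_entry);
    if (wr.error != seL4_NoError) {
        return; /* the kernel rejected the VMCS write */
    }

    for (;;) {
        seL4_Word badge;
        /* Returns when the guest faults or the bound notification fires;
         * the message registers then describe the reason for the exit. */
        seL4_VMEnter(&badge);
        /* ... inspect the returned message and handle the exit ... */
    }
}
```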