manual/parts/objects.tex

%
% Copyright 2014, General Dynamics C4 Systems
%
% SPDX-License-Identifier: GPL-2.0-only
%

\chapter{\label{ch:objects}Kernel Services and Objects}

A limited number of service primitives are provided by the microkernel;
more complex services may be implemented as applications on top of these
primitives. In this way, the functionality of the system can be extended
without increasing the code and complexity in privileged mode, while
still supporting a potentially wide number of services for varied
application domains.

Note that some services are available only when the kernel is configured for
MCS\footnote{``mixed-criticality system''} support.

The basic services seL4 provides are as follows:
\begin{description}
    \item[Threads] are an abstraction of CPU execution that supports
    running software;

    \item[Scheduling contexts] (MCS only) are an abstraction of CPU execuion time.

    \item[Address spaces] are virtual memory spaces that each contain an
    application. Applications are limited to accessing memory in their
    address space;

    \item[Inter-process communication] (IPC) via \emph{endpoints} allows
    threads to communicate using message passing;

    \item[Reply objects] (MCS only) are used to store single-use reply capabilities,
    and are provided by the receiver during message passing.

    \item[Notifications] provide a non-blocking signalling mechanism
      similar to binary semaphores;

    \item[Device primitives] allow device drivers to be implemented as
    unprivileged applications.  The kernel exports hardware device
    interrupts via IPC messages; and

    \item[Capability spaces] store capabilities (i.e., access rights) to
    kernel services along with their book-keeping information.
\end{description}

This chapter gives an overview of these services, describes how kernel
objects are accessed by userspace applications, and describes how new
objects can be created.

\section{Capability-based Access Control}
\label{sec:cap-access-control}

The seL4 microkernel provides a capability-based access-control model.
Access control governs all kernel services; in order to perform an
operation, an application must \emph{invoke} a capability in its
possession that has sufficient access rights for the requested service.
With this, the system can be configured to isolate software components from
each other, and also to enable authorised, controlled communication
between components by selectively granting specific communication
capabilities.  This enables software-component isolation with a high
degree of assurance, as only those operations explicitly authorised by
capability possession are permitted.

A capability is an unforgeable token that references a specific kernel
object (such as a thread control block) and carries access rights that
control what methods may be invoked.
Conceptually, a capability resides in an application's \emph{capability
space}; an address in this space refers to a \emph{slot} which may or
may not contain a capability.  An application may refer to
a capability---to request a kernel service, for example---using the
address of the slot holding that capability.  This means, the seL4
capability model is an instance of a \emph{segregated} (or \emph{partitioned})
capability system, where capabilities are managed by the kernel.

Capability spaces are implemented as a directed graph of kernel-managed
\emph{capability nodes} (\obj{CNode}s).  A \obj{CNode} is a table of
slots, where each slot may contain further \obj{CNode} capabilities. An
address of a capability in a capability space is the concatenation of the indices
of slots within \obj{CNodes} forming the path to the destination
slot; we discuss \obj{CNode} objects in detail in \autoref{ch:cspace}.

Capabilities can be copied and moved within capability spaces, and
also sent via IPC. This allows creation of applications with specific
access rights, the delegation of authority to another application, and
passing to an application authority to a newly created (or selected)
kernel service. Furthermore, capabilities can be \emph{minted} to
create a derived capability with a subset of the rights of the
original capability (never with more rights). A newly minted
capability can be used for partial delegation of authority.

Capabilities can also be revoked to withdraw
authority. Revocation recursively removes any capabilities that have
been derived from the original capability being revoked. The propagation of
capabilities through the system is controlled by a
\emph{take-grant}-based model~\cite{Elkaduwe_GE_08,Boyton_09}.

\section{System Calls}
\label{sec:syscalls}
\label{sec:sys_send}
\label{sec:sys_wait}
\label{sec:sys_nbwait}
\label{sec:sys_recv}
\label{sec:sys_call}
\label{sec:sys_reply}
\label{sec:sys_nbsend}
\label{sec:sys_replyrecv}
\label{sec:sys_nbrecv}
\label{sec:sys_nbsendrecv}
\label{sec:sys_nbsendwait}
\label{sec:sys_yield}

The seL4 kernel provides a message-passing service for communication between
threads. This mechanism is also used for communication with kernel-provided
services. There is a standard message format, each message containing a
number of data words and possibly some capabilities. The structure and encoding
of these messages are described in detail in \autoref{ch:ipc}.

Threads send messages by invoking capabilities within their capability space.
When an endpoint, notification or reply capability is invoked in this way, the
message will be transferred through the kernel to another thread.
When other capabilities to kernel objects are invoked, the message will be
interpreted as a method invocation in a manner specific to the type of kernel
object. For example, invoking a thread control block (TCB) capability with a
correctly formatted message will suspend the target thread.

% TODO: The \apifunc{}{} refs in the paragraphs below (up to the section on
% kernel objects) go to the API reference which is very terse and not nearly as
% useful for cross-references at this point in the manual.  We should consider
% making these prose descriptions of the system calls legitimate targets for
% references, and cross-reference to those instead.

Fundamentally, we can regard the kernel as providing three system calls:
\emph{Send}, \emph{Receive} and \emph{Yield}.  However, there are also
combinations and variants of the basic \emph{Send} and \emph{Receive} calls.  An
important variant is the \emph{Call} operation, which consists of a standard
\emph{Send} operation atomically followed by a variant of \emph{Receive} which
waits for a \emph{Reply}.  A \emph{reply} message is always delivered via a
special resource instead of using the standard IPC mechanism; see
\apifunc{seL4\_Call}{sel4_call} below for details.

Invoking methods on kernel objects other than endpoints and notifications is
done with \emph{Send} or \emph{Call}, depending on whether the invoker
wants a reply from the kernel (\emph{Call}) or not (\emph{Send}).  By using
functions provided by the libsel4 API you are guaranteed to always use the more
appropriate one.  The \emph{Yield} system call is not associated with any kernel
object and is the only operation that does not invoke a capability.  In the MCS
configuration, \emph{Wait} is a variant of \emph{Receive} that does not require
a reply object to be provided---on non-MCS configurations, \emph{Wait} is
synonymous with \emph{Receive}, because neither call takes a reply object.

The fundamental system calls are:
\begin{description}
    \item[\apifunc{seL4\_Yield}{sel4_yield}] is the only system call that does
    not require a capability to be used.  It forfeits the remainder of the
    calling thread's timeslice and causes invocation of the kernel's scheduler.
    If there are no other runnable threads with the same priority as the caller,
    the calling thread will immediately be scheduled with a fresh timeslice.  In
    the MCS configuration, this behaviour depends on the state of the scheduling
    context; see \autoref{sec:scheduling_contexts}.

    \item[\apifunc{seL4\_Send}{sel4_send}] delivers a message through the named
    capability.  If the invoked capability is an endpoint, and no receiver is
    ready to receive the message immediately, the sending thread will block
    until the message can be delivered.  No error code or response will be
    returned by the receiving object.

    \item[\apifunc{seL4\_Recv}{sel4_recv}] (``receive'') is used by a thread to
    receive messages through endpoints or notifications.  If no sender or
    notification is pending, the caller will block until a message or
    notification can be delivered.  This system call works only on
    \obj{Endpoint} or \obj{Notification} capabilities, raising a fault (see
    section \ref{sec:faults}) when attempted with other capability types.

    In the MCS configuration, \emph{Receive} takes a reply capability---a
    capability to a reply object--as a parameter.
\end{description}

% TODO: seL4_Recv() gets hyphenated as
% seL4_-
% Recv()
% here, and that's ugly.  But just suppressing hyphenation for it (with \mbox or
% \hyphenation) makes it overrun the right margin.  GBR knows an easy way to
% deal with this in groff but this is not groff...

The remaining system calls are variants and combinations of
\apifunc{seL4\_Send}{sel4_send} and \apifunc{seL4\_Recv}{sel4_recv} efficiently
accommodate common use cases in systems programming.

\begin{description}
    \item[\apifunc{seL4\_NBSend}{sel4_nbsend}] performs a polling send on an
    endpoint.  If the message cannot be delivered immediately, i.e., there is no
    receiver waiting on the destination \obj{Endpoint}, the message is silently
    dropped.  The sending thread continues execution.  As with
    \apifunc{seL4\_Send}{sel4_send}, no error code or response will be returned.

    \item[\apifunc{seL4\_NBRecv}{sel4_nbrecv}] is used by a thread to check for
    signals pending on a notification object or messages pending on an endpoint
    without blocking.  This system call works only on endpoints and notification
    object capabilities, raising a fault (see section \ref{sec:faults}) when
    attempted with other capability types.

    \item[\apifunc{seL4\_Call}{sel4_call}] combines
    \apifunc{seL4\_Send}{sel4_send} and \apifunc{seL4\_Recv}{sel4_recv} with
    some important differences.  The call blocks the sending thread until its
    message is delivered and a reply message is received.

    When invoking capabilities to kernel services other than endpoints, using
    \apifunc{seL4\_Call}{sel4_call} allows the kernel to return an error code or
    other response through the reply message.

    When the sent message is delivered to another thread via an \obj{Endpoint},
    the kernel does the same operation as \apifunc{seL4\_Send}{sel4_send}.  What
    happens next depends on the kernel configuration.  For MCS configurations,
    the kernel then updates the \emph{reply object} provided by the receiver.  A
    \emph{reply object} is a vessel for tracking reply messages, used to send
    a reply message and wake up the caller.  In non-MCS configurations, the
    kernel then deposits a special \emph{reply capability} in a dedicated slot
    in the receiver's \obj{TCB}.  This \emph{reply capability} is a single-use
    right to send a reply message and wake up the caller, meaning that the
    kernel invalidates it as soon as it has been invoked.  For both variants,
    the calling thread is blocked until a capability to the reply object is
    invoked.  For more information, see \autoref{sec:ep-cal}.

    \item[\apifunc{seL4\_Reply}{sel4_reply}] is used to respond to a
    \apifunc{seL4\_Call}{sel4_call}, by invoking the reply capability generated
    by the \apifunc{seL4\_Call}{sel4_call} system call and stored in a dedicated
    slot in the replying thread's TCB.  It has exactly the same behaviour as
    invoking the reply capability with \apifunc{seL4\_Send}{sel4_send} which is
    described in \autoref{sec:ep-cal}.

    \item[\apifunc{seL4\_ReplyRecv}{sel4_replyrecv}] combines
    \apifunc{seL4\_Reply}{sel4_reply} and \apifunc{seL4\_Recv}{sel4_recv}.  It
    exists mostly for efficiency reasons, namely the common case of replying to
    a request and waiting for the next can be performed in a single kernel
    system call instead of two.  The transition from the reply to the receive
    phase is also atomic.

    \item[\apifunc{seL4\_Wait}{sel4_wait}] works like
    \apifunc{seL4\_Recv}{sel4_recv}; on non-MCS configurations, they are in fact
    synonymous.  In the MCS configuration, \apifunc{seL4\_Wait}{sel4_wait} is
    used when no reply is expected.  Unlike \apifunc{seL4\_Recv}{sel4_recv},
    \apifunc{seL4\_Wait}{sel4_wait} takes no reply capability.

    \item[\apifunc{seL4\_NBWait}{sel4_nbwait}] (MCS only) is used by a thread to
    poll for messages through endpoints or notifications.  If no sender or
    notification is pending, the system call returns immediately.

    \item[\apifunc{seL4\_NBSendWait}{sel4_nbsendwait}] (MCS only) combines an
    \apifunc{seL4\_NBSend}{sel4_nbsend} and \apifunc{seL4\_Wait}{sel4_wait} into
    one atomic system call.

    \item[\apifunc{seL4\_NBSendRecv}{sel4_nbsendwait}] (MCS only) combines an
    \apifunc{seL4\_NBSend}{sel4_nbsend} and \apifunc{seL4\_Recv}{sel4_recv} into
    one atomic system call.
\end{description}

\section{Kernel Objects}
\label{s:sel4_internals}

In this section we give a brief overview of the kernel-implemented
object types whose instances (also simply called \emph{objects}) can be invoked by applications. The interface to these
objects forms the interface to the kernel itself. The creation and use
of kernel services is achieved by the creation,
manipulation, and combination of these kernel objects:

\begin{description}

    \item[\obj{CNodes}] (see \autoref{ch:cspace}) store capabilities, giving threads permission to
    invoke methods on particular objects.
    Each \obj{CNode} has a fixed number of slots,
    always a power of two, determined when the \obj{CNode} is created. Slots
    can be empty or contain a capability.

    \item[Thread Control Blocks] (\obj{TCB}s; see \autoref{ch:threads}) represent a thread of
    execution in seL4. Threads are the unit of execution that is
    scheduled, blocked, unblocked, etc., depending on the application's
    interaction with other threads.

   \item[Scheduling contexts] (MCS only) (\obj{SchedulingContext}s; see \autoref{ch:threads}) represent
       CPU time in seL4. Users can create scheduling contexts from untyped objects, however on
       creation scheduling contexts are \textit{empty} and do not represent any time.
       Initially, there is a capability to \obj{SchedControl} for each node, which
       allows scheduling context to be populated with parameters, which combined with priority
       control thread's access to CPU time.

    \item[\obj{Endpoints}] (see \autoref{ch:ipc}) facilitate message-passing
    communication between threads. IPC is synchronous: A thread
    trying to send or receive on an endpoint blocks until the message
    can be delivered. This means that message delivery only happens if
    a sender and a receiver rendezvous at the endpoint, and the
    kernel can deliver the message with a single copy (or without
    copying for short messages using only registers).

    A capability to an endpoint can be restricted to be
    send-only or receive-only. Additionally, \obj{Endpoint}
    capabilities can have the grant right, which allows sending
    capabilities as part of the message.

   \item[\obj{Reply objects}] (MCS only) (see \autoref{ch:ipc}) track scheduling
    context donation and provide a container for single-use reply capabilities.
    They are provided by \apifunc{seL4\_Recv}{sel4_recv}.

    \item[\obj{Notification} Objects] (see \autoref{ch:notifications})
      provide a simple signalling mechanism. A \obj{Notification}
      is a word-size array of flags, each of which behaves like a binary semaphore. Operations
      are \emph{signalling} a subset of flags in a single operation,
      polling to check any flags,
      and blocking until any are signalled. Notification capabilities
      can be signal-only or wait-only.

    \item[Virtual Address Space Objects] (see \autoref{ch:vspace})
    are used to construct a virtual
    address space (or VSpace) for one or more threads. These
    objects largely directly correspond to those of the hardware, and
    as such are architecture-dependent. The kernel also includes \obj{ASID
    Pool} and \obj{ASID Control} objects for tracking the status of
    address spaces.

    \item[Interrupt Objects] (see \autoref{ch:io}) give applications the ability to receive
    and acknowledge interrupts from hardware devices.
    Initially, there is a capability to \obj{IRQControl},
    which allows for the creation of \obj{IRQHandler} capabilities.
    An \obj{IRQHandler} capability permits the management of a specific
    interrupt source associated with a specific device.
    It is delegated to
    a device driver to access an interrupt source. The \obj{IRQHandler}
    object allows threads to wait for and acknowledge individual
    interrupts.

    \item[Untyped Memory] (see \autoref{sec:kernmemalloc}) is the foundation of memory allocation
    in the seL4 kernel.  Untyped memory capabilities have a single method
    which allows the creation of new kernel objects. If the method
    succeeds, the calling thread gains access to capabilities to the
    newly-created objects. Additionally, untyped memory objects can be
    divided into a group of smaller untyped memory objects allowing
    delegation of part (or all) of the system's memory.  We discuss
    memory management in general in the following sections.

\end{description}

\section{Kernel Memory Allocation}
\label{sec:kernmemalloc}

The seL4 microkernel does not dynamically allocate memory for kernel objects.
Instead, objects must be explicitly created from application-controlled memory
regions via \obj{Untyped Memory} capabilities.  Applications must have
explicit authority to memory (through these \obj{Untyped Memory} capabilities) in
order to create new objects, and all objects consume a fixed amount of memory once
created. These mechanisms can be used to precisely control
the specific amount of physical memory available to applications,
including being able to enforce isolation of physical memory access
between applications or a device.  There are no arbitrary resource
limits in the kernel apart from those dictated by the
hardware\footnote{The treatment of virtual ASIDs imposes a fixed number
of address spaces. This limitation is to be removed in future
versions of seL4.}, and so many denial-of-service attacks via resource
exhaustion are avoided.

At boot time, seL4 pre-allocates the memory required for the kernel
itself, including the code, data, and stack sections (seL4 is a single
kernel-stack operating system). It then creates an initial user
thread (with an appropriate address and capability space).
The kernel then hands all remaining memory to
the initial thread in the form of capabilities to \obj{Untyped Memory}, and
some additional capabilities to kernel objects that were required to
bootstrap the initial thread.  These \obj{Untyped Memory} regions can then be split into
smaller regions or other kernel objects using the
\apifunc{seL4\_Untyped\_Retype}{untyped_retype} method; the created objects are termed \emph{children} of
the original untyped memory object.

The user-level application that creates an object using \apifunc{seL4\_Untyped\_Retype}{untyped_retype}
receives full authority over the resulting object. It can then delegate
all or part of the authority it possesses over this object to one or
more of its clients.

Untyped memory objects represent two different types of memory:
general purpose memory, or device memory.
\emph{General purpose} memory can be untyped into any other object
type and used for any operation on untyped memory provided by the kernel.
\emph{Device memory} covers memory regions reserved for devices
as determined by the hardware platform, and usage of these objects
is restricted by the kernel in the following ways:

\begin{itemize}
\item Device untyped objects can only be retyped into frames or other
untyped objects; developers cannot, for example, create an endpoint from device memory.
\item Frame objects retyped from device untyped objects cannot be set as thread IPC buffers, or used
in the creation of an ASID pool
\end{itemize}

The type attribute (whether it represents \emph{general purpose} or
\emph{device} memory) of a child untyped object is inherited from its
parent untyped object. That is, any child of a device untyped will
also be a device untyped. Developers cannot change the type attribute of an untyped.

\subsection{Reusing Memory}
\label{s:memRevoke}

The model described thus far is sufficient for applications to
allocate kernel objects, distribute authority among client
applications, and obtain various kernel services provided by these
objects.  This alone is sufficient for a simple static system
configuration.

The seL4 kernel also allows \obj{Untyped Memory} regions to be reused.
Reusing a region of memory is allowed only
when there are no dangling references (i.e., capabilities) left to the
objects inside that memory.  The kernel tracks
\emph{capability derivations}, i.e., the children generated by the
methods \apifunc{seL4\_Untyped\_Retype}{untyped_retype}, \apifunc{seL4\_CNode\_Mint}{cnode_mint}, \apifunc{seL4\_CNode\_Copy}{cnode_copy}, and
\apifunc{seL4\_CNode\_Mutate}{cnode_mutate}.

The tree structure so generated is termed the \emph{capability
derivation tree} (CDT).\footnote{Although the CDT conceptually is a separate
data structure, it is implemented as part of the CNode object and so
requires no additional kernel meta-data.}  For example, when a user
creates new kernel objects by retyping untyped memory, the newly created
capabilities would be inserted into the CDT as children of the untyped
memory capability.

For each \obj{Untyped Memory} region, the kernel keeps
a \emph{watermark} recording how much of the region has previously been
allocated. Whenever a user requests the kernel to create new objects in
an untyped memory region, the kernel will carry out one of two actions:
if there are already existing objects allocated in the region, the
kernel will allocate the new objects at the current watermark level, and
increase the watermark. If all objects previously allocated in the
region have been deleted, the kernel will reset the watermark and start
allocating new objects from the beginning of the region again.

Finally, the \apifunc{seL4\_CNode\_Revoke}{cnode_revoke} method provided by \obj{CNode} objects
destroys all capabilities derived from the argument capability. Revoking
the last capability to a kernel object triggers the \emph{destroy}
operation on the now unreferenced object. This simply cleans up any in-kernel dependencies between
it, other objects and the kernel.

By calling \apifunc{seL4\_CNode\_Revoke}{cnode_revoke} on the original capability to an untyped memory
object, the user removes all of the untyped memory object's
children---that is, all capabilities pointing to objects in the untyped
memory region.  Thus, after this invocation there are no valid references
to any object within the untyped region, and the region may be safely
retyped and reused.

\subsection{Summary of Object Sizes}
\label{sec:object_sizes}

When retyping untyped memory it is useful to know how much memory the
object will require. Object sizes are defined in libsel4.

Note that \obj{CNode}s, \obj{SchedContext}s (MCS only), and \obj{Untyped Object}s
have variables sizes. When retyping untyped memory into \obj{CNode}s
or \obj{SchedContext}s, or breaking an
\obj{Untyped Object} into smaller \obj{Untyped Object}s, the
\texttt{size\_bits} argument to
\apifunc{seL4\_Untyped\_Retype}{untyped_retype} is used to specify
the size of the resulting objects.
For all other object types, the size is fixed, and the \texttt{size\_bits}
argument to \apifunc{seL4\_Untyped\_Retype}{untyped_retype} is ignored.

\begin{table}[htb]
  \renewcommand{\arraystretch}{1.5}
  \begin{tabularx}{\textwidth}{p{0.15\textwidth}XXXX}
    \toprule
    Type & Meaning of \texttt{size\_bits} & Size in Bytes  \\
    \midrule
    \obj{CNode} & $\log_2$ number of slots & $2^\texttt{size\_bits} \cdot 2^\texttt{seL4\_SlotBits}$
	  \texttt{seL4\_SlotBits} is: \newline
	  \emph{on 32-bit architectures:} 4 \newline
	  \emph{on 64-bit architectures:} 5 \\
    \obj{SchedContext} (MCS only) & $\log_2$ size in bytes & $2^\texttt{size\_bits}$ \\
    \obj{Untyped} & $\log_2$ size in bytes & $2^\texttt{size\_bits}$ \\
    \bottomrule
  \end{tabularx}
  \caption{\label{tab:objsize} Meaning of \texttt{size\_bits} for object types of
	variable size}
\end{table}

A single call to \apifunc{seL4\_Untyped\_Retype}{untyped_retype} can retype a
single \obj{Untyped Object} into multiple objects. The number of objects
to create is specified by its \texttt{num\_objects} argument. All created
objects must be of the same type, specified by the \texttt{type} argument. In
the case of variable-sized objects, each object must also be of the same size.
If the size of the memory area needed (calculated by the object size multiplied
by \texttt{num\_objects}) is greater than the remaining unallocated memory of
the \obj{Untyped Object}, an error will result.