manual/parts/objects.tex

%
% Copyright 2014, General Dynamics C4 Systems
%
% This software may be distributed and modified according to the terms of
% the GNU General Public License version 2. Note that NO WARRANTY is provided.
% See "LICENSE_GPLv2.txt" for details.
%
% @TAG(GD_GPL)
%

\chapter{\label{ch:objects}Kernel Services and Objects}

A limited number of service primitives are provided by the microkernel;
more complex services may be implemented as applications on top of these
primitives. In this way, the functionality of the system can be extended
without increasing the code and complexity in privileged mode, while
still supporting a potentially wide number of services for varied
application domains.

The basic services seL4 provides are as follows:
\begin{description}
    \item[Threads] are an abstraction of CPU execution that supports
    running software;

    \item[Address spaces] are virtual memory spaces that each contain an
    application. Applications are limited to accessing memory in their
    address space;

    \item[Inter-process communication] (IPC) via \emph{endpoints} allows
    threads to communicate using message passing;

    \item[Notifications] provide a non-blocking signalling mechanism
      similar to binary semaphores;

    \item[Device primitives] allow device drivers to be implemented as
    unprivileged applications.  The kernel exports hardware device
    interrupts via IPC messages; and

    \item[Capability spaces] store capabilities (i.e., access rights) to
    kernel services along with their book-keeping information.
\end{description}

This chapter gives an overview of these services, describes how kernel
objects are accessed by userspace applications, and describes how new
objects can be created.

\section{Capability-based Access Control}
\label{sec:cap-access-control}

The seL4 microkernel provides a capability-based access-control model.
Access control governs all kernel services; in order to perform an
operation, an application must \emph{invoke} a capability in its
possession that has sufficient access rights for the requested service.
With this, the system can be configured to isolate software components from
each other, and also to enable authorised, controlled communication
between components by selectively granting specific communication
capabilities.  This enables software-component isolation with a high
degree of assurance, as only those operations explicitly authorised by
capability possession are permitted.

A capability is an unforgeable token that references a specific kernel
object (such as a thread control block) and carries access rights that
control what methods may be invoked.
Conceptually, a capability resides in an application's \emph{capability
space}; an address in this space refers to a \emph{slot} which may or
may not contain a capability.  An application may refer to
a capability---to request a kernel service, for example---using the
address of the slot holding that capability.  This means, the seL4
capability model is an instance of a \emph{segregated} (or \emph{partitioned})
capability system, where capabilities are managed by the kernel.

Capability spaces are implemented as a directed graph of kernel-managed
\emph{capability nodes} (\obj{CNode}s).  A \obj{CNode} is a table of
slots, where each slot may contain further \obj{CNode} capabilities. An
address of a capability in a capability space is the concatenation of the indices
of slots within \obj{CNodes} forming the path to the destination
slot; we discuss \obj{CNode} objects in detail in \autoref{ch:cspace}.

Capabilities can be copied and moved within capability spaces, and
also sent via IPC. This allows creation of applications with specific
access rights, the delegation of authority to another application, and
passing to an application authority to a newly created (or selected)
kernel service. Furthermore, capabilities can be \emph{minted} to
create a derived capability with a subset of the rights of the
original capability (never with more rights). A newly minted
capability can be used for partial delegation of authority.

Capabilities can also be revoked to withdraw
authority. Revocation recursively removes any capabilities that have
been derived from the original capability being revoked. The propagation of
capabilities through the system is controlled by a
\emph{take-grant}-based model~\cite{Elkaduwe_GE_08,Boyton_09}.

\section{System Calls}
\label{sec:syscalls}
\label{sec:sys_send}
\label{sec:sys_recv}
\label{sec:sys_call}
\label{sec:sys_reply}
\label{sec:sys_nbsend}
\label{sec:sys_replyrecv}
\label{sec:sys_nbrecv}
\label{sec:sys_yield}

The seL4 kernel provides a message-passing service for communication between
threads. This mechanism is also used for communication with kernel-provided
services. There is a standard message format, each message containing a
number of data words and possibly some capabilities. The structure and encoding
of these messages are described in detail in \autoref{ch:ipc}.

Threads send messages by invoking capabilities within their capability space.
When an endpoint capability is invoked in this way, the message will be
transferred through the kernel to another thread. When capabilities to kernel
objects are invoked, the message will be interpreted as a method invocation in a
manner specific to the type of kernel object. For example, invoking a thread
control block (TCB) capability with a correctly formatted message will suspend
the target thread.

Logically, the kernel provides three system calls, \emph{Send},
\emph{Receive} and \emph{Yield}. However, there are also combinations
and variants of the basic \emph{Send} and \emph{Receive} calls, e.g.\
the \emph{Call} operation, which consists of a send followed by a
\emph{Receive} from the same object. Methods on kernel objects other
than endpoints and notifications are all mapped to \emph{Send} or
\emph{Call}, depending on whether or not the method returns a
result. The \emph{Yield} system call is not associated with any kernel
object and is the only operation that does not invoke a capability.

The complete set of system calls is:
\begin{description}
    \item[\apifunc{seL4\_Send}{sel4_send}] delivers a message through the named
    capability and the application to continue. If the invoked
    capability is an endpoint, and no receiver is ready to receive the message
    immediately, the sending thread will block until the message can be
    delivered. No error code or response will be returned by the receiving
    object.

    \item[\apifunc{seL4\_NBSend}{sel4_nbsend}] performs a polling send
      on an endpoint. It
    is similar to \apifunc{seL4\_Send}{sel4_send}, except that it is
    guaranteed not to block. If the message
    cannot be delivered immediately, i.e.\ there is no receiver waiting
    on the destination \obj{Endpoint}, the message is silently dropped. Like
    \apifunc{seL4\_Send}{sel4_send}, no error code or response will be returned.

    \item[\apifunc{seL4\_Call}{sel4_call}] combines \apifunc{seL4\_Send}{sel4_send}
      and \apifunc{seL4\_Recv}{sel4_recv}. The call
      blocks the sending thread until its message is delivered and a reply message is received. When the
    sent message is delivered to another thread (via an
    \obj{Endpoint}), the kernel adds an
    additional `\emph{reply}' capability to the message that is delivered
    to the receiver, giving the latter the right to reply to the original sender. The reply
    capability is deposited in a dedicated slot in the receiver's
    \obj{TCB}, and is a single-use right, meaning that the kernel
    invalidates it as soon as it has been invoked.

    The \apifunc{seL4\_Call}{sel4_call} operation exists not only for
    efficiency reasons (combining two operations into a single system
    call). It differs from
    \apifunc{seL4\_Send}{sel4_send} immediately followed by
    \apifunc{seL4\_Recv}{sel4_recv} in two ways:
    \begin{enumerate}
    \item the single-use reply capability is created to establish a
      reply channel with minimal trust;
    \item the transition from send to recv phase is atomic, meaning it
      cannot be preempted, and the receiver can reply without any risk
      of blocking.
    \end{enumerate}

    When invoking capabilities to kernel services, using
    \apifunc{seL4\_Call}{sel4_call} allows the kernel to return an error code
    or other response through the reply message.

    \item[\apifunc{seL4\_Recv}{sel4_recv}] is used by a thread to receive
    messages through endpoints or notifications. If no sender or
    notification is pending, the caller
    will block until a message or notification can be delivered. This system call works only on
    \obj{Endpoint} or \obj{Notification} capabilities, raising a fault (see section \ref{sec:faults}) when
    attempted with other capability types.

    \item[\apifunc{seL4\_Reply}{sel4_reply}] is used to respond to a
    \apifunc{seL4\_Call}{sel4_call}, using the reply capability generated by the
    \apifunc{seL4\_Call}{sel4_call} system call and stored in the replying
    thread's TCB. It delivers the message to the thread that invoked
    the \apifunc{seL4\_Call}{sel4_call}, waking it in
    the process.

    There is space for only one reply capability in each thread's TCB, so the
    \apifunc{seL4\_Reply}{sel4_reply} syscall can be used to reply to the most
    recent caller only. The \apifunc{seL4\_CNode\_SaveCaller}{cnode_savecaller}
    method that will be described later can be used to save the reply
    capability into regular capability space, where it can be used with
    \apifunc{seL4\_Send}{sel4_send}.

    \item[\apifunc{seL4\_ReplyRecv}{sel4_replyrecv}] combines \apifunc{seL4\_Reply}{sel4_reply} and
    \apifunc{seL4\_Recv}{sel4_recv}. It exists mostly for efficiency reasons: the common case of
    replying to a request and waiting for the next can be performed in
    a single kernel system call instead of two. The transition from
    the reply to the receive phase is also atomic.

    \item[\apifunc{seL4\_NBRecv}{sel4_nbrecv}] is used by a thread to check for
    signals pending on a notification object or messages pending on an endpoint without blocking.
    This system call works only on endpoints and notification object
    capabilities, raising a fault (see section \ref{sec:faults}) when attempted
    with other capability types.

    \item[\apifunc{seL4\_Yield}{sel4_yield}] is the only system call that does not require
    a capability to be used. It forfeits the remainder of the calling thread's
    timeslice and causes invocation of the kernel's scheduler.
    If there are no other runnable threads with the same
    priority as the caller, the calling thread will immediately be
    scheduled with a fresh timeslice.
\end{description}

\section{Kernel Objects}
\label{s:sel4_internals}

In this section we give a brief overview of the kernel-implemented
object types whose instances (also simply called \emph{objects}) can be invoked by applications. The interface to these
objects forms the interface to the kernel itself. The creation and use
of kernel services is achieved by the creation,
manipulation, and combination of these kernel objects:

\begin{description}

    \item[\obj{CNodes}] (see \autoref{ch:cspace}) store capabilities, giving threads permission to
    invoke methods on particular objects.
    Each \obj{CNode} has a fixed number of slots,
    always a power of two, determined when the \obj{CNode} is created. Slots
    can be empty or contain a capability.

    \item[Thread Control Blocks] (\obj{TCB}s; see \autoref{ch:threads}) represent a thread of
    execution in seL4. Threads are the unit of execution that is
    scheduled, blocked, unblocked, etc., depending on the application's
    interaction with other threads.

    \item[\obj{Endpoints}] (see \autoref{ch:ipc}) facilitate message-passing
    communication between threads. IPC is synchronous: A thread
    trying to send or receive on an endpoint blocks until the message
    can be delivered. This means that message delivery only happens if
    a sender and a receiver rendezvous at the endpoint, and the
    kernel can deliver the message with a single copy (or without
    copying for short messages using only registers).

    A capability to an endpoint can be restricted to be
    send-only or receive-only. Additionally, \obj{Endpoint}
    capabilities can have the grant right, which allows sending
    capabilities as part of the message.

    \item[\obj{Notification} Objects] (see \autoref{ch:notifications})
      provide a simple signalling mechanism. A \obj{Notification}
      is a word-size array of flags, each of which behaves like a binary semaphore. Operations
      are \emph{signalling} a subset of flags in a single operation,
      polling to check any flags,
      and blocking until any are signalled. Notification capabilities
      can be signal-only or wait-only.

    \item[Virtual Address Space Objects] (see \autoref{ch:vspace})
    are used to construct a virtual
    address space (or VSpace) for one or more threads. These
    objects largely directly correspond to those of the hardware, and
    as such are architecture-dependent. The kernel also includes \obj{ASID
    Pool} and \obj{ASID Control} objects for tracking the status of
    address spaces.

    \item[Interrupt Objects] (see \autoref{ch:io}) give applications the ability to receive
    and acknowledge interrupts from hardware devices.
    Initially, there is a capability to \obj{IRQControl},
    which allows for the creation of \obj{IRQHandler} capabilities.
    An \obj{IRQHandler} capability permits the management of a specific
    interrupt source associated with a specific device.
    It is delegated to
    a device driver to access an interrupt source. The \obj{IRQHandler}
    object allows threads to wait for and acknowledge individual
    interrupts.

    \item[Untyped Memory] (see \autoref{sec:kernmemalloc}) is the foundation of memory allocation
    in the seL4 kernel.  Untyped memory capabilities have a single method
    which allows the creation of new kernel objects. If the method
    succeeds, the calling thread gains access to capabilities to the
    newly-created objects. Additionally, untyped memory objects can be
    divided into a group of smaller untyped memory objects allowing
    delegation of part (or all) of the system's memory.  We discuss
    memory management in general in the following sections.

\end{description}

\section{Kernel Memory Allocation}
\label{sec:kernmemalloc}

The seL4 microkernel does not dynamically allocate memory for kernel objects.
Instead, objects must be explicitly created from application-controlled memory
regions via \obj{Untyped Memory} capabilities.  Applications must have
explicit authority to memory (through these \obj{Untyped Memory} capabilities) in
order to create new objects, and all objects consume a fixed amount of memory once
created. These mechanisms can be used to precisely control
the specific amount of physical memory available to applications,
including being able to enforce isolation of physical memory access
between applications or a device.  There are no arbitrary resource
limits in the kernel apart from those dictated by the
hardware\footnote{The treatment of virtual ASIDs imposes a fixed number
of address spaces. This limitation is to be removed in future
versions of seL4.}, and so many denial-of-service attacks via resource
exhaustion are avoided.

At boot time, seL4 pre-allocates the memory required for the kernel
itself, including the code, data, and stack sections (seL4 is a single
kernel-stack operating system). It then creates an initial user
thread (with an appropriate address and capability space).
The kernel then hands all remaining memory to
the initial thread in the form of capabilities to \obj{Untyped Memory}, and
some additional capabilities to kernel objects that were required to
bootstrap the initial thread.  These \obj{Untyped Memory} regions can then be split into
smaller regions or other kernel objects using the
\apifunc{seL4\_Untyped\_Retype}{untyped_retype} method; the created objects are termed \emph{children} of
the original untyped memory object.

The user-level application that creates an object using \apifunc{seL4\_Untyped\_Retype}{untyped_retype}
receives full authority over the resulting object. It can then delegate
all or part of the authority it possesses over this object to one or
more of its clients.

Untyped memory objects represent two different types of memory:
general purpose memory, or device memory.
\emph{General purpose} memory can be untyped into any other object
type and used for any operation on untyped memory provided by the kernel.
\emph{Device memory} covers memory regions reserved for devices
as determined by the hardware platform, and usage of these objects
is restricted by the kernel in the following ways:

\begin{itemize}
\item Device untyped objects can only be retyped into frames or other
untyped objects; developers cannot, for example, create an endpoint from device memory.
\item Frame objects retyped from device untyped objects cannot be set as thread IPC buffers, or used
in the creation of an ASID pool
\end{itemize}

The type attribute (whether it represents \emph{general purpose} or
\emph{device} memory) of a child untyped object is inherited from its
parent untyped object. That is, any child of a device untyped will
also be a device untyped. Developers cannot change the type attribute of an untyped.

\subsection{Reusing Memory}
\label{s:memRevoke}

The model described thus far is sufficient for applications to
allocate kernel objects, distribute authority among client
applications, and obtain various kernel services provided by these
objects.  This alone is sufficient for a simple static system
configuration.

The seL4 kernel also allows \obj{Untyped Memory} regions to be reused.
Reusing a region of memory is allowed only
when there are no dangling references (i.e., capabilities) left to the
objects inside that memory.  The kernel tracks
\emph{capability derivations}, i.e., the children generated by the
methods \apifunc{seL4\_Untyped\_Retype}{untyped_retype}, \apifunc{seL4\_CNode\_Mint}{cnode_mint}, \apifunc{seL4\_CNode\_Copy}{cnode_copy}, and
\apifunc{seL4\_CNode\_Mutate}{cnode_mutate}.

The tree structure so generated is termed the \emph{capability
derivation tree} (CDT).\footnote{Although the CDT conceptually is a separate
data structure, it is implemented as part of the CNode object and so
requires no additional kernel meta-data.}  For example, when a user
creates new kernel objects by retyping untyped memory, the newly created
capabilities would be inserted into the CDT as children of the untyped
memory capability.

For each \obj{Untyped Memory} region, the kernel keeps
a \emph{watermark} recording how much of the region has previously been
allocated. Whenever a user requests the kernel to create new objects in
an untyped memory region, the kernel will carry out one of two actions:
if there are already existing objects allocated in the region, the
kernel will allocate the new objects at the current watermark level, and
increase the watermark. If all objects previously allocated in the
region have been deleted, the kernel will reset the watermark and start
allocating new objects from the beginning of the region again.

Finally, the \apifunc{seL4\_CNode\_Revoke}{cnode_revoke} method provided by \obj{CNode} objects
destroys all capabilities derived from the argument capability. Revoking
the last capability to a kernel object triggers the \emph{destroy}
operation on the now unreferenced object. This simply cleans up any in-kernel dependencies between
it, other objects and the kernel.

By calling \apifunc{seL4\_CNode\_Revoke}{cnode_revoke} on the original capability to an untyped memory
object, the user removes all of the untyped memory object's
children---that is, all capabilities pointing to objects in the untyped
memory region.  Thus, after this invocation there are no valid references
to any object within the untyped region, and the region may be safely
retyped and reused.

\subsection{Summary of Object Sizes}
\label{sec:object_sizes}

When retyping untyped memory it is useful to know how much memory the
object will require. Object sizes are defined in libsel4.

Note that \obj{CNode}s and \obj{Untyped Object}s have variables sizes.
When retyping untyped memory into \obj{CNode}s or breaking an
\obj{Untyped Object} into smaller \obj{Untyped Object}s, the
\texttt{size\_bits} argument to
\apifunc{seL4\_Untyped\_Retype}{untyped_retype} is used to specify
the size of the resulting objects.
For all other object types, the size is fixed, and the \texttt{size\_bits}
argument to \apifunc{seL4\_Untyped\_Retype}{untyped_retype} is ignored.

A single call to \apifunc{seL4\_Untyped\_Retype}{untyped_retype} can retype a
single \obj{Untyped Object} into multiple objects. The number of objects
to create is specified by its \texttt{num\_objects} argument. All created
objects must be of the same type, specified by the \texttt{type} argument. In
the case of variable-sized objects, each object must also be of the same size.
If the size of the memory area needed (calculated by the object size multiplied
by \texttt{num\_objects}) is greater than the remaining unallocated memory of
the \obj{Untyped Object}, an error will result.