1%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2% Copyright (c) 2011-2017, ETH Zurich.
3% All rights reserved.
4%
5% This file is distributed under the terms in the attached LICENSE file.
6% If you do not find this file, copies can be found by writing to:
7% ETH Zurich D-INFK, Universitaetstr. 6, CH-8092 Zurich. Attn: Systems Group.
8%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
9
10\documentclass[a4paper,twoside]{report} % for a report (default)
11
12\usepackage{bftn} % You need this
13\usepackage[pdftex]{graphicx}
14\usepackage{color,listings,ctable}
15
16\title{Capability Management in Barrelfish}   % title of report
17\author{Akhilesh Singhania, Ihor Kuz, Mark Nevill, Simon Gerber}	% author
18\tnnumber{013}  % give the number of the tech report
19\tnkey{Capability Management} % Short title, will appear in footer
20
21% \date{Month Year} % Not needed - will be taken from version history
22
23\newcommand{\note}[1]{[\textcolor{red}{\textit{#1}}]}
24
25\lstdefinelanguage{Mackerel}{
26  morekeywords={datatype,device,register,regtype,constants,type,at,
27              many,edit,io,also},
28  sensitive=false,
29  morecomment=[l]{//},
30  morecomment=[s]{/*}{*/},
31  morestring=[b]",
32  showspaces=false,
33  showstringspaces=false,
34  showtabs=false,
35}
36
37\newcommand{\noarginvocation}[1]{\paragraph{#1 invocation}}
38\newcounter{invocArgCnt}
39\newenvironment{invocation}[1]{%
40  \noarginvocation{#1}
41  
42  \begin{list}{\emph{Argument~\arabic{invocArgCnt}:}}{%
43    \usecounter{invocArgCnt}%
44    \setlength{\rightmargin}{\leftmargin}%
45    \setlength{\itemsep}{0ex}%
46  }
47  \renewcommand{\arg}{\item}
48}{%
49  \end{list}
50}
51
52\begin{document}
53\maketitle
54
55%
56% Include version history first
57%
58\begin{versionhistory}
59\vhEntry{1.0}{08.03.2011}{AS}{Initial version}
60\vhEntry{2.0}{27.1.2012}{MN}{New capability system design}
61\vhEntry{2.1}{8.7.2013}{SK}{Updates}
62\vhEntry{2.2}{1.12.2013}{TR}{Fixed missing references/citations}
63\vhEntry{3.0}{2.06.2017}{SG}{Update to new CSpace design and remove outdated info}
64\end{versionhistory}
65
66% \intro{Abstract}		% Insert abstract here
67% \intro{Acknowledgements}	% Uncomment (if needed) for acknowledgements
68% \tableofcontents		% Uncomment (if needed) for final draft
69% \listoffigures		% Uncomment (if needed) for final draft
70% \listoftables			% Uncomment (if needed) for final draft
71
72\chapter{Introduction}
73
74This document discusses the state of capabilities in the Barrelfish
75operating system.
76
77Chapter \ref{chap:known_issues} lists the currently known issues with
78capability management and \ref{chap:type_system} discusses the type
79system.
80
81Chapter \ref{chap:current_state} discusses the current state of the
82implementation in Barrelfish, chapter \ref{chap:db} discusses
83different approaches for maintaining a multicore mapping database of
84capabilities, chapter \ref{chap:solutions} discusses the requirements
85from a correct solution and discusses four different solutions,
86chapter \ref{chap:implementation} discusses some Barrelfish specific
87challenges in implementing the solutions, and chapter \ref{chap:nyd}
88highlights issues not yet covered in this document.
89
90%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
91\chapter{Known Issues}\label{chap:known_issues}
92\begin{itemize}
93
94\item Kernel operations should not take longer than $O(1)$.  Some
95  capability operations are not constant time but can take $O(log n)$
96  where $n$ is the size of the mapping database. Spending so much time
97  in the kernel is detrimental for good scheduling as when in the
98  kernel, interrupts are disabled. All non constant time operations
99  should be punted to the monitor.
100
101\item When the last copy of a lmp capability or frame capability
102  backing ump channels is deleted, should it initiate a connection
103  tear down?
104
105\item Fragmentation of memory. Example: a ram capability of 4GB is
106  retyped into two capabilities of 2GB each. One of the 2GB capability
107  is deleted. The only way to get a capability to that 2GB back is to
108  also delete the other 2GB capability and then retype the 4GB
109  capability again.
110
111\item IO space can be treated in a same way as physical memory. So
112  instead of using mint operations to update ranges, we use retype
113  operations. This is useful as different IO capabilities will have
114  the ancestor, descendant, and copy relationships rather than just a
115  copy relationship. The former is richer and can offer improved
116  maintenance.
117
118\item When a new capability is introduced in a core via cross-core
119  send, the kernel must walk the entire mapping database to find the
120  appropriate place to insert the capability. The operation is
121  logarithmic time in the size of the mapping database. This
122  operation must either happen in the monitor or must be made more
123  efficient. If we move the mapping database into the monitor, we
124  think that this should at worse become a performance issue and not
125  a correctness one.
126
127\end{itemize}
128
129%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
130\input{type_system.tex}
131
132%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
133
134\chapter{Current State}\label{chap:current_state}
135
136This chapter will cover how capabilities are stored and what happens
137on capability invocation.
138
139\section{Storage}\label{sec:cspace}
140For security reasons, capabilities are stored in kernel-space and
141users are given pointers to them. Each capability is stored in two
142separate databases:
143
144\begin{itemize}
145\item Each dispatcher has an associated capability space that holds
146  all the capabilities it has.  The capability space of a dispatcher
147  is implemented using the CNode type capability. Each dispatcher is
148  associated with a ``root CNode'' that contains all capabilities the
149  dispatcher has.
150\item Each core has a mapping database that holds all the capabilities on the
151      core. The mapping database is implemented using a tree of the capabilities.
152      As discussed later in the chapter, the mapping database stores the
153      capabilities in a particular order to facilitate different capability
154      operations.
155\end{itemize}
156
157\section{Capability invocation}\label{sec:sys_invoke}
158When a dispatcher invokes a capability, it passes the kernel an
159address of the capability in the CSpace it wishes to invoke. The
160kernel locates the capability starting from the dispatcher's root
161CNode (walks the capability space), verifies that the requested
162operation can indeed be performed on the specified capability and then
163performs the operation.
164
165\section{Data structures}
166Capabilities in Barrelfish are represented by the following data
167structures:
168
169{\scriptsize
170\begin{verbatim}
171    struct mdbnode {
172      struct cte *left, *right;  // Links to the mapping database
173	  ...
174    };
175
176    struct CNode {
177      paddr_t cnode;      // Base address of CNode
178      uint8_t bits;       // Size in number of bits
179      ...
180    };
181
182    union capability_u {  // Union of all types of capabilities
183      ...
184      struct CNode cnode;
185      ...
186    };
187
188    struct capability {
189      enum objtype type;     // Type of capability
190      union capability_u u; // Union of the capability
191    };
192
193    struct cte {
194      struct capability   cap;     ///< The actual capability
195      struct mdbnode      mdbnode; ///< MDB node for the cap
196    };
197\end{verbatim}
198}
199
200A capability, \verb|cte|, consists of the actual capability represented by the
201``capability'' structure and an entry in the mapping database represented by
202the ``mdbnode'' structure. The capability structure contains the type specific
203information and the mdbnode contains pointers for the tree representing the
204mapping database.
205
206Capabilities can be looked-up in two ways.
207\begin{itemize}
208\item All capabilities on a core are stored in a mapping database. It
209  is possible to reach any capability on the core by traversing from
210  any other capability on the core.
211
212\item Capabilities are also stored in the CNode type capability. The
213  area of memory identified by the CNode structure is actually an
214  array of capabilities. Starting from the ``root CNode'' of a
215  dispatcher, it is only possible to reach any capability the
216  dispatcher holds.
217\end{itemize}
218
219\section{Terminology}
220
221This section discusses some terminology to facilitate the discussion
222of capability management.
223
224\subsection{Copy}
225
226A capability X is a copy of a capability Y if:
227
228\begin{itemize}
229\item X was copied from Y
230\item or Y was copied from X
231\item or X was copied from Z and Z was copied from Y
232\end{itemize}
233
234\subsection{Descendants}
235
236A capability X is a descendant of a capability Y if:
237
238\begin{itemize}
239\item X was retyped from Y
240
241\item or X is a descendant of Y1 and Y1 is a copy of Y
242
243\item or X is a descendant of Z and Z is a descendant of Y
244
245\item or X is a copy of X1 and X1 is a descendant of Y
246\end{itemize}
247
248\subsection{Ancestor}
249
250A is a ancestor of B if B is a descendant of A.
251
252\section{CNode invocations}
253
254Most Capabilities have type specific invocations.  Operations on the
255CNode capability modifies the capability space of the system. We
256discuss how these operations are implemented for a single core system
257here.
258
259\note{Invocations on other capability types will probably also modify
260  the capability space but alas we don't know how those will work yet.}
261
262\subsection{Retype}
263
264Retyping a capability creates one or more descendants of the
265capability. This operation will fail if the capability already has
266descendants. The descendants are inserted into a CNode specified by
267the operation and into the mapping database right after the retyped
268capability.
269
270When a dispatcher issues the retype invocation, the kernel must traverse the
271mapping database to ensure that the capability has no descendants, create the
272descendants capabilities, insert them in the specified CNode and in the mapping
273database.
274
275\subsection{Copy}
276
277Copying a capability creates a new copy of it. The kernel walks the capability
278space to find the capability to be copied, creates the copy, and inserts it
279into the specified CNode and mapping database.
280
281\subsection{Delete}
282
283Delete removes the specified capability from the CNode in which it
284resides and from the mapping database. This operation cannot fail.
285
286The kernel first walks the capability space to locate the capability
287to delete. It then walks the mapping database to check if there still
288exist copies of the deleted capability. If no copies are found, then
289it performs certain operations based on the capability type.
290
291\subsection{Revoke}
292
293Revoking a capability calls delete on all copies and descendants of
294it. When the operation returns, the capability will not have any
295copies or descendants.
296
297The kernel walks the capability space to find the specified
298capability, uses the mapping database to find all copies and
299descendants of the specified capability and deletes them.
300
301\subsection{Looking up local copies and descendants}
302
303Due to the capability ordering used by the mapping database, copies are located
304adjacent to a capability and descendants immediately thereafter. Therefore, it
305is easy to look up all related copies of a capability on the same core. This
306facilitates revocation by looking up all copies and descendants, retypes by
307checking for existing descendants, and deletes by checking for copies.
308
309The following pseudo-code looks up all descendants and copies of a
310capability given the existence of type specific is\_copy and
311is\_descendant functions.
312
313{\scriptsize
314\begin{verbatim}
315  // Traverse forward
316  mdbnode *walk = successor(cap);
317  while (walk) {
318    // Check if descendant
319    if (is_descendant(cap, walk)) {
320      // Found a descendant
321      goto increment;
322    }
323
324    // Check if copy
325    if (is_copy(cap, walk)) {
326      // Found a copy
327      goto increment;
328    }
329
330    // Cap is not a descendant or copy
331    break;
332
333  increment:
334    walk = successor(walk);
335  }
336
337  // Traverse backwards
338  mdbnode *walk = predecessor(cap);
339  while (walk) {
340    // Predecessors cannot be descendants
341
342    // Check if copy
343    if (is_copy(cap, walk)) {
344      // Found a copy
345      goto increment;
346    }
347
348    // Cap is not a copy
349    break;
350
351  increment:
352    walk = predecessor(walk);
353  }
354}
355\end{verbatim}
356}
357
358\section{Multicore extensions}
359
360The model above works for a single core system. We have already
361extended it to work on multiple cores. Here we discuss these
362extensions.
363
364We implement the multi-core extension based on the system discussed in detail
365in chapters 2 and 3 of Mark Nevill's master's thesis, ``An Evaluation of
366Capabilities for a Multikernel''~\cite{Nevill2012}.
367
368\subsection{Cross-core transfer}
369
370This is a special operation available only to the monitors. This sends
371a capability from one core to another.
372
373\section{Summary}
374
375In this chapter, we presented a background and the current state of
376capabilities management in Barrelfish. We can now discuss different
377designs for multicore capability management.
378
379
380%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
381
382\chapter{Maintaining the database}\label{chap:db}
383We consider the following approaches for managing the mapping
384database.
385
386\note{Use diagrams to better illustrate the discussion below.}
387
388\begin{itemize}
389\item \textbf{No partition:} The mapping database and capability space
390  for all cores is maintained on a single centralized
391  coordinator. Accessing either the capability space or the mapping
392  database on any core requires communication with the coordinator.
393
394\item \textbf{Partitioned data structure:} The mapping database for
395  all cores is maintained on a single centralized coordinator and the
396  capability space is maintained on the local cores. This implies that
397  the cte structure is split between the coordinator and the local
398  cores. Cores can access the capability space locally and have to
399  message the coordinator to access the mapping database. Note that
400  split ``capability'' and ``mdbnode'' structures need to maintain
401  references to each other.
402
403\item \textbf{Centrally replicated space:} The mapping database and
404  the capability space is replicated between a single coordinator and
405  the local cores. This implies that two copies of each ``cte''
406  structure exist in the system, one on the local core and one on the
407  coordinator. Both local core and the coordinator can access the
408  mapping database and the capability space.
409
410\item \textbf{Partitioned space:} The mapping database and the
411  capability space are maintained on the local cores. The entire
412  ``cte'' structure can be accessed locally but capability operations
413  will require coordination and communication with remote cores.
414
415\item \textbf{Minimal replication:} The database is partitioned
416  between local cores as in the above approach and additionally for
417  each capability the cores maintain a cache of its remote relations.
418  This is discussed in more detail in section \ref{sec:cache}.
419\end{itemize}
420
421We qualitatively compare the above approaches and why the partitioned
422space and minimal replication is the best.
423
424\section{Comparison}\label{sec:comparison}
425We compare the above approaches in this section.
426
427\subsection{Efficiency of capability invocations}
428When a dispatcher invokes a capability, the kernel has to look it up
429in the capability space starting from the dispatcher's ``root'' CNode.
430If the capability space is not local, then the core has to message the
431coordinator to perform the look up. A round trip with the coordinator
432is more expensive than a local look up and can become a scalability
433bottleneck if we use a single coordinator.
434
435The no partition approach does not maintain a local capability space
436and therefore may suffer from poor performance. All other approaches
437maintain the capability space locally.
438
439\subsection{Local operations}
440Approaches that enable more pure local operations will perform better
441as they will reduce the amount of cross-core communication. In the no
442partition, partitioned data structure, and replicated space
443approaches, no operation can be performed locally. In the partitioned
444and minimal replication approaches, certain operations such as copying
445capabilities is purely local.
446
447\begin{table*}
448\begin{center}
449\begin{tabular}{|c|c|c|c|c|c|}
450  \hline
451
452  & Cap invocation & Local operations \\
453  \hline
454
455  No partition                & - & -   \\ \hline
456  Partitioned data structure  & + & -   \\ \hline
457  Centrally replicated space  & + & -   \\ \hline
458  Partitioned space           & + & +   \\ \hline
459  Minimal replication         & + & +   \\ \hline
460
461\end{tabular}
462\end{center}
463\caption{\label{t:summary}Summary of the different approaches.}
464\end{table*}
465
466\subsection{Discussion}
467Table \ref{t:summary} summarizes our comparison of the five
468approaches. A (+) indicates that the approach performs relatively well
469on the given metric and a (-) indicates that the approach performs
470relatively poorly. Based on the results, we choose to implement the
471partitioned space and minimal replication approaches.
472
473\section{Caching}\label{sec:cache}
474The minimal replication approach is characterized as the partitioned
475approach with caching the state of remote relations. Capabilities
476without remote relations are marked as such and when performing
477operations on these no remote communication is required.
478
479When tracking remote relations, three types of relations must be
480tracked: copies, descendants, and ancestors. Tracking of remote copies
481and descendants is required so that revoke, retype, and delete
482operations can be correctly implemented. And capabilities must track
483their remote ancestors so if they are deleted, the remote ancestors
484can be informed to update the state about their remote descendants.
485
486\subsection{How to maintain the cache?}
487
488\note{Once I discuss total order broadcast with caching, this
489  discussion will be revamped.}
490
491There are two ways of maintaining the cache:
492\begin{itemize}
493\item \textbf{Single bit:} A bit each for the three types of remote
494  relations is used. The bits merely indicate the presence of remote
495  relations but provide no further information such as which cores
496  have the remote relations.
497
498\item \textbf{List:} A list each for the three types of remote
499  relations is used. The list contains exact information about which
500  cores the remote relations exist on. Uhlig et al [?]'s core mask
501  technique can be used to maintain the lists with fixed space
502  requirement.
503\end{itemize}
504
505\subsubsection{Comparison}
506\textbf{Informing existing relations:} When a capability that already
507has remote relations is sent to another core, with the single-bit
508approach, none of the existing cores with relations have to be
509informed, but with the list approach, cores with existing relations
510must be informed so they can update their lists.
511
512With both approaches when the last local copy of a capability is
513deleted, its remote relations should be informed. This is required in
514the list approach so that the cores can update their list and is
515required in the single-bit approach so that the cores can determine if
516the last remote relation has been deleted and it can mark the
517capability as not having the remote relation anymore.
518
519\textbf{Number of cores communicated:} With the single bit approach,
520all cores in the system must be contacted, while with the list
521approach, just the cores of interest must be contacted.
522
523\textbf{Discussion:} The single-bit approach will probably be more
524efficient if the system has few cores and the capabilities that have
525remote relations, tend to have them on many cores. The list approach
526will have better performance if there are lots of cores in the system
527and typically a capability only has relations on a few cores.
528
529\subsection{Where to maintain the cache?}\label{subsec:where}
530Imagine that there exists a capability with only local relations. Then
531cross-core transfer it applied to it. Should we only mark that
532capability as having a remote copy or should we mark all impacted
533capabilities accordingly? This section will discuss and compare these
534two approaches.
535
536\subsubsection{Mark all}
537When cross-core transfer is applied to a capability, it is marked has
538having remote copies and all its local relations are also marked as
539having the appropriate remote relations. Its local copies will be
540marked as having remote copies, its local descendants will be marked
541as having remote ancestors, and its local ancestors will be marked as
542having remote descendants.
543
544All capabilities maintain complete state of their remote relations at
545all times.
546
547\subsubsection{Mark just individual capability}
548When cross-core transfer is applied to a capability, it is marked has
549having remote copies and none of its local relations are updated.
550
551Capabilities do not maintain the complete state of their remote
552relations. Their local relations must be looked up to build the
553complete state of their remote relations.
554
555\subsubsection{Comparison}
556Both approaches have their specific strengths and weaknesses discussed
557here.
558
559\textbf{Cost of cross-core transferring:} When applying cross-core
560transfer to a capability, in the mark all approach, all local
561relations must be updated whereas in the mark just individual
562capability approach, the local relations do not have to be updated.
563However, the cross-core transfer operation needs to send full
564information about the state of local relations so that the newly
565created remote capability can setup its cache properly. Gathering this
566information will require accessing all local relations so the mark all
567approach represents a small in the operation that must be performed
568anyways.
569
570\textbf{Cost of building state of remote relations:} Any capability
571operation that requires information on the current state of remote
572relations will have to build the full state of remote relations for
573the capability. This is done trivially in the mark all approach as
574each capability maintains the full state. In the mark just individual
575capability approach, all local relations must be accessed to build the
576state.
577
578\textbf{Discussion:} Given that the state of remote relations is
579trivially constructed in the mark all approach and the cost of
580cross-core transfer is only marginally higher than the bare minimum
581cost already required, we conclude that the mark all approach is
582superior to the mark just individual capability approach.
583
584\subsection{Summary}\label{subsec:cache:summary}
585In summary, we have come up with two approaches for the type of cache:
586single-bit and list, and we have concluded that the mark all approach
587is superior for maintaining the cache.
588
589Maintaining a cache will require the following adjustment to the
590capability operations presented above.
591
592\begin{itemize}
593\item When cross-core transfer is applied to a capability, it is
594  marked as having remote copies. The operation sends information on
595  the state of the capability's local and remote relations.
596
597\item When a capability is created due to the cross-core receive
598  operation, it incorporates the information about relations sent with
599  the cross-core transfer operation.
600
601\item When a copy of a capability is marked as having remote
602  relations, the capability is marked as having the same remote
603  relations.
604
605\item When a descendant of a capability is marked as having remote
606  relations, the capability is marked also marked as having remote
607  relations based on the following rules:
608  \begin{itemize}
609  \item If the descendant has a remote copy, then capability has a
610    remote descendant.
611  \item If the descendant has a remote descendant, then capability has
612    a remote descendant.
613  \item If the descendant has a remote ancestor, then capability
614    either has a remote copy or an ancestor depending on the outcome
615    from the type-specific is\_descendant function.
616  \end{itemize}
617
618\item When an ancestor of a capability is marked as having remote
619  relations, the capability is marked also marked as having remote
620  relations based on the following rules:
621  \begin{itemize}
622  \item If the ancestor has a remote copy, then capability has a
623    remote ancestors.
624  \item If the ancestor has a remote descendant, then capability
625    either has a remote copy or a remote descendant depending on the
626    outcome from the type-specific is\_descendant function.
627  \item If the ancestor has a remote ancestor, then capability
628    has remote ancestors.
629  \end{itemize}
630
631\item When a capability is retyped:
632  \begin{itemize}
633  \item Its remote copies are marked as having remote descendants
634  \item The descendants are marked as having remote ancestors if the
635    capability has remote copies.
636  \end{itemize}
637
638\item When a capability is copied, the copy is marked as having the
639  same remote relations that the capability has.
640
641\item When the last copy of a capability on the core is deleted, its
642  remote copies, descendants, ancestors are informed.
643\end{itemize}
644
645We now discuss some implications of maintaining a cache.
646
647\textbf{Drawbacks:} Caching introduces the following drawbacks that
648the non-caching approach does not suffer from.
649
650\begin{itemize}
651\item The application dispatchers must first try to perform the
652  capability operations locally and if that fails, then communicate
653  with the monitors. If no caching were used, the application
654  dispatchers can always directly communicate with the monitor saving
655  one superfluous trip into the kernel.
656
657\item As discussed above, caching requires additional overhead in
658  keeping the cache consistent.
659
660\item Caching increases the space requirement for maintaining the
661  mapping database.
662\end{itemize}
663
664Caching can work if typically few capabilities are shared between
665cores and can hurt if many capabilities are shared.
666
667\textbf{Decide dynamically:} It is clear that application scenarios
668will actually dictate if caching can help and if it does, which type
669of caching helps. To support this, Barrelfish can allow applications
670to specify which type of caching is well suited for it or the OS can
671gather appropriate metrics on application execution and based on that
672dynamically update the type of caching used.  For this, we need to
673identify the cross-over point where one approach is preferred over the
674other.
675
676\section{Implementation}
677
678As summarized, the current implementation uses a partitioned database which
679caches remote relations of capabilities with three bits, one each for remote
680copies, ancestors, and descendants.
681
682Each database partition is associated with a kernel control block, and keeps
683track of all the capabilities in the CSpaces of all the dispatchers associated
684with that kernel control block.
685The database is implemented as a binary search tree.
686We chose the AA tree~\cite{Andersson1993}, which is a variation of the
687red-black tree where red nodes can only be added as right subchildren.
688This results in a greatly simplified implementation of the maintenance
689operations while conserving the red-black tree's invariants.
690
691The particular variation we implement for the capability database is an
692interval tree as described by
693Cormen et al.~\cite[section~14.3, pp~348-354]{Cormen2001} which uses the AA
694tree as its basis.
695
696It is noteworthy that we embed the tree nodes into each kernel capability
697object to avoid any dynamic allocations in the CPU driver.
698
699The choice of tree, the database implementation and its performance are more
700extensively discussed in chapter 4 of Mark Nevill's master's
701thesis~\cite{Nevill2012}.
702
703%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
704
705\chapter{Solutions}\label{chap:solutions}
706In this chapter, we discuss mechanisms for implementing the capability
707operations. In section \ref{sec:challenges}, we will first discuss the
708challenges in correctly implementing the operations by discussing how
709executing multiple related operations in parallel can lead to
710conflicts and by discussing how an intuitive solution of using
711acknowledgments fails to resolve the conflicts. We then present the
712requirements from a correct solution in section \ref{sec:requirements}
713and compare four different correct solution in section
714\ref{sec:solutions}.
715
716\section{Challenges}\label{sec:challenges}
717We first discuss the conflicts that can arise if two or more related
718operations execute in parallel and then discuss an intuitive but
719incorrect solution of using acknowledgments to resolve the conflicts.
720
721\subsection{Conflicts}\label{subsec:conflicts}
722Performing overlapping operations on related capability can lead to
723conflicts. This section discusses the different types of conflicts
724that can arise.
725
726\textbf{Conflicts between two retypes, deletes, or revokes:} If two
727different cores are trying to retype, delete, or revoke related
728capabilities, then they can conflict. If two different cores try to
729retype two copies of a capability, the correct behavior is for one to
730succeed and for one to fail. If two cores try to delete the last two
731copies of a capability, the correct behavior is for one to just delete
732locally and for the other to perform type-specific operations required
733when deleting the last copy of a capability. If two cores try to
734revoke the same capability, one should succeed, the other should fail
735and the capability it was trying to revoke should also be deleted.
736
737\textbf{Conflict between revoke and cross-core transfer:} If a core is
738trying to revoke a capability, a copy or descendant of which another
739core is trying to transfer, the revoke operation should only finish
740when the in-flight capability is also deleted.
741
742\textbf{Conflict between revoke and retype:} If a core is trying to
743revoke a capability, a copy or descendant of which another core is
744trying to retype, then the retype should either succeed and the new
745capabilities be deleted before revoke finishes or the retype should
746fail.
747
748\subsection{Non-conflicts}
749This section discusses why it is safe to perform other operations in
750parallel.
751
752\textbf{Delete and revoke:} Delete at least deletes one capability and
753potentially more. If a revoke for a related capability is issued at
754the same time, they will overlap and potentially try to perform
755redundant operations but these are safe.
756
757\textbf{Delete and transfer:} Delete will delete one capability and
758perform a check for copies. The in-flight capability already has a
759copy that can satisfy the requirements of delete. If the capability is
760also being deleted, it is the same two deletes overlapping which
761indeed is a conflict.
762
763\textbf{Retype and delete:} A delete will not conflict with retypes
764because it checks for the existence of copies which a successful
765retype cannot possibly create. A retype will not conflict with deletes
766because no matter when it performs the check for descendants, it will
767either see descendants that are about to be deleted or not see them at
768all, either of the outcomes is correct.
769
770\textbf{Retype and transfer:} Retype cannot conflict with transfers
771because in order to transfer a capability because the capability being
772transferred is a copy of an existing capability which does not change
773the existing descendants relationships.
774
775\subsection{Incorrect solution}
776Since the essential issue with the relation with cross-core transfer
777operation is that in-flight capabilities do not exist anywhere, the
778sending core can maintain a list of in-flight capabilities which are
779garbage collected when the receiver acknowledges the reception of the
780capability. This allows the sending core to include the in-flight
781capabilities in the search for copies and descendants. As shown below,
782this intuitive solution is actually incorrect.
783
784\begin{figure}[t]
785 \includegraphics[width=0.75\textwidth]{acks-problem.pdf}
786 \caption{Problem with using acknowledments}\label{fig:acks-problem}
787\end{figure}
788
789Consider the scenario shown in figure \ref{fig:acks-problem}. The
790system consists of three cores \{A, B, C\}, A is trying to revoke a
791capability and B is trying to send a copy of the capability to C. A
792sends a revoke request to B, C and waits for their acknowledgments. B
793sends the capability to C and adds it to its list of in-flight
794capabilities. C sees the revoke request first, performs the operation
795locally and replies to A. Then it sees the capability sent from B,
796inserts it locally and send an acknowledgment to B. B sees the
797acknowledgment from C first garbage collecting its list of in-flight
798capabilities and then sees the revoke request from A, performs the
799revoke locally, and replies to A. A receives replies from both B and C
800concluding the revoke operation. The revoke operation has finished and
801C still has a copy of the revoked capability.
802
803\section{Requirements from a correct solution}\label{sec:requirements}
804A correct solution will enforce the following transactional like and
805ordering properties.
806
807\subsection{Transactional properties}
808\textbf{Atomic:} If one part of a capability operation fails, then the
809entire operation must fail. For example, in a retype operation, the
810descendant capabilities can be created in parallel with the check for
811existing descendants. If existing descendants are found, then the
812newly created descendants must be removed.
813
814\textbf{Consistent:} Every capability operation must leave the system
815in a consistent state. For example, if caching is used an operation
816deletes the last copy of a capability on a core, then all its remote
817relations should be informed and updated before the operation
818finishes.
819
820\textbf{Isolation:} The data that has been modified during a
821capability operation must not be accessible by other operations till
822the operation finishes. For example, if a retype operation has eagerly
823created descendants while checking for existing descendants in
824parallel, no other capability operation should be able to access the
825descendants created eagerly till the retype operation finishes.
826
827\subsection{Ordering}
828Section \ref{subsec:conflicts} discusses the two types of conflicts
829that can arise when multiple cores try to perform operations on
830related capabilities at the time. Correctness can be ensured if the
831individual steps (messages) in operations must be executed in some
832order.  We provide some background on ordering messages before
833discussing the actual ordering requirements.
834
835\textbf{Background on ordering:} Four types of ordering are
836possible. We discuss them from the cheapest to the most expensive.
837
838\begin{itemize}
839\item No order: Messages can be delivered in any order. This cannot
840  happen on Barrelfish.
841\item Single Source FIFO (SSF): Messages from the same sender are
842  delivered in the order they were sent. This is what currently is
843  available on Barrelfish by default.
844\item Causal: Messages are delivered based on their partial order
845  given by the happens-before relationship. Vector clocks [?] can be
846  used to provide causal delivery of messages.
847\item Total order: All receivers receive messages in the same
848  order. Note that any ordering function can be used to order the
849  messages. Further, total order does not imply causal order but if
850  the ordering function respects SSF, then total order does imply
851  causal order.
852\end{itemize}
853
854\subsection{Conflicts with each-other}
855
856\begin{figure}[t]
857 \includegraphics[width=0.75\textwidth]{causal-problem.pdf}
858 \caption{Problem with using causal order
859   delivery}\label{fig:causal-problem}
860\end{figure}
861
862The minimum ordering required for resolving conflicts between retype,
863delete, and revoke operations is total order. Figure
864\ref{fig:causal-problem} illustrates the reason causal ordering does
865not work. Nodes \{A,B,C,D\} each contain copies of the same capability
866that \{A,D\} wish to retype at the same time. B receives the message
867from A before the message from D, and C receives the D's message
868before A's. The messages were delivered in causal order, C will refuse
869to A, B will refuse to D not allowing either retype operation to
870succeed. Similar examples can be constructed for the delete and revoke
871operations as well.
872
873Total order delivery will deliver the messages to B and C in the same
874order allowing just one and only one retype to succeed.
875
876\subsection{Conflict between revoke and transfer}
877Figure \ref{fig:acks-problem} illustrates the conflict between revoke
878and cross-core transfer operation. C sees the message from A before
879the message from B. Hence any message it sends to B causally depend
880upon the message from A. If this was reflected in the message C sent
881to B, then B could delay handling the message till it sees the revoke
882request from A. When B receives the revoke request from A, it will
883realize that A is trying to revoke an in-flight capability and take
884appropriate actions to resolve the issue. The delaying of the message
885can be guaranteed by enforcing causal delivery of messages.
886
887\subsection{Conflict between revoke and retype}
888The minimum ordering requirement is total. With less stricter
889ordering, different cores might see the revoke request and retype
890requests in different orders. The retyping core might delete the
891source capability but end up creating descendants later.
892
893\section{Solutions}\label{sec:solutions}
894
895We will discuss and compare the following four solutions.
896
897\begin{itemize}
898\item Locks
899\item Total order broadcast
900\item Two phase commit
901\item Sequencer
902\item Hybrid implementation with locks and 2PC
903\end{itemize}
904
905\subsection{Locks}\label{subsec:locks}
906Locks can be used to enforce the above transactional and ordering
907policies. Critical sections can be used to resolve the conflicts
908ensuring that \emph{Time-of-check-to-time-of-use} is not violated.
909
910Below, we provide pseudo-code for how capability operations will be
911implemented when using a centralized lock and not caching information
912about remote relations.
913
914\textbf{Copying a capability:} This operation is safe to be performed
915without holding a lock.
916
917\begin{verbatim}
918cap_copy(struct cte *cte) {
919  create_copies();
920}
921\end{verbatim}
922
923\textbf{Retyping a capability:} Holding the lock is required to
924prevent multiple cores from trying to retype copies of the same
925capability. Acquiring the lock is a blocking call during which the
926capability might have been deleted so it is important to check that
927the capability still exists after the lock is acquired.
928
929\begin{verbatim}
930cap_retype(struct cte *cte) {
931  errval_t err = success;
932  acquire_lock();
933  if (cap_exists(cte) == false) {
934    err = fail;
935    goto done;
936  }
937  if (has_descendants(cte) == true) {
938    err = fail;
939    goto done;
940  }
941  create_descendants(cte);
942done:
943  release_lock();
944  return err;
945}
946\end{verbatim}
947
948\textbf{Deleting a capability:} If the capability has local copies,
949the operation is purely local. Otherwise, holding the lock is required
950to ensure that if the last copy of the capability in the system is
951being deleted, then the type-specific cleanup operations are executed.
952
953\begin{verbatim}
954cap_delete(struct cte *cte) {
955  remove_cap(cte);
956
957  if (!requires_typed_ops(cte)) {
958    return success;
959  }
960  if (has_local_copies(cte)) {
961    return success;
962  }
963
964  errval_t err = success;
965  acquire_lock();
966  if (!cap_exists(cte) {
967    goto done;
968  }
969
970  if (!has_copies(cte) {
971    type_specific_ops(cte);
972  }
973
974done:
975  release_lock();
976  return err;
977}
978\end{verbatim}
979
980\textbf{Revoking a capability:} Holding the lock is required to ensure
981that if multiple cores are trying to revoke related capabilities, only
982one succeeds. This is why the code below ensures that the capability
983still exists after acquiring the lock.
984
985\begin{verbatim}
986cap_revoke(struct cte *cte) {
987  errval_t err = success;
988  acquire_lock();
989  if (!cap_exists(cte) {
990    err = fail;
991    goto done;
992  }
993
994  while(has_descendants(cte) {
995    struct cte *dest = get_next_descendant(cte);
996    remove_cap(dest);
997  }
998  while(has_copies(cte) {
999    struct cte *dest = get_next_copy(cte);
1000    remove_cap(dest);
1001  }
1002
1003done:
1004  release_lock();
1005  return err;
1006}
1007\end{verbatim}
1008
1009\textbf{Cross-core transferring a capability:} Holding the lock is
1010required to conflicts between the above capability operations and
1011cross-core transfer operation. The sender acquires the lock and sends
1012the capability. The receiver creates the capability and releases the
1013lock.
1014
1015\begin{verbatim}
1016cap_send(struct cte *cte, coreid_t dest) {
1017  acquire_lock();
1018  if (cap_exists(cte) == false) {
1019    return fail;
1020  }
1021  send_cap(cte, dest);
1022}
1023
1024cap_receive(struct cte *cte) {
1025  create_cap(cte);
1026  release_lock();
1027  return_success_to_app();
1028}
1029\end{verbatim}
1030
1031\subsubsection{Caching remote relations}
1032If remote relations are cached, then the cache has to be kept
1033consistent when capability operations change the state of
1034relations. Below we describe the modifications that will be required
1035in the above pseudo-code to keep the cache consistent. Note that our
1036cache can be seen as eager replication. As discussed in section
1037\ref{subsec:where}, we will be using the mark all approach to maintain
1038the cache.
1039
1040\textbf{Copying a capability:} When the new copy is created, its cache
1041of remote relations is set equal to the cache from the capability it
1042was copied from.
1043
1044\textbf{Retyping a capability:} If the capability does not have remote
1045copies and descendants, the operation is purely local and the lock is
1046not required. If the operation is not local, then the lock is
1047acquired, checks for existence of capability and for descendants and
1048is made, the capability is retyped, and remote relations are updated
1049based on the rules presented in section \ref{subsec:cache:summary}.
1050
1051\begin{verbatim}
1052cap_retype(struct cte *cte) {
1053  if (has_local_descendants(cte) == true) {
1054    return fail;
1055  }
1056
1057  if (!has_remote_copies(cte) && !has_remote_descendants(cte)) {
1058    create_descendants(cte);
1059    return success;
1060  }
1061
1062  if (has_remote_descendants(cte)) {
1063    return fail;
1064  }
1065
1066  errval_t err = success;
1067  acquire_lock();
1068  if (!cap_exists(cte)) {
1069    err = fail;
1070    goto done;
1071  }
1072
1073  if (has_remote_descendants(cte)) {
1074    err = fail;
1075    goto done;
1076  }
1077  if (has_local_descendants(cte)) {
1078    err = fail;
1079    goto done;
1080  }
1081
1082  create_descendants(cte);
1083  update_relations(cte);
1084
1085done:
1086  release_lock();
1087  return err;
1088}
1089\end{verbatim}
1090
1091\textbf{Deleting a capability:} If the capability has local copies or
1092no remote relations, the operation is purely local. Otherwise, the
1093lock is required and the remote relations must be updated.
1094
1095\begin{verbatim}
1096cap_delete(struct cte *cte) {
1097  remove_cap(cte);
1098
1099  if (!requires_typed_ops(cte)) {
1100    return success;
1101  }
1102  if (has_local_copies(cte)) {
1103    return success;
1104  }
1105
1106  if (!has_remote_relations(cte)) {
1107    type_specific_ops(cte);
1108    return success;
1109  }
1110
1111  acquire_lock();
1112  if (!cap_exists(cte)) {
1113    goto done;
1114  }
1115  if (!has_copies(cte) {
1116    type_specific_ops(cte);
1117  }
1118  update_remote_relations(cte);
1119  remove_cap(cte);
1120  release_lock();
1121done:
1122  return success;
1123}
1124\end{verbatim}
1125
1126\textbf{Revoking a capability:} If the capability has no remote
1127relations, the operation is purely local. Otherwise, the lock is
1128required.
1129
1130\begin{verbatim}
1131cap_revoke(struct cte *cte) {
1132  if (!has_remote_relations(cte)) {
1133    while(has_descendants(cte) == true) {
1134      struct cte *dest = get_next_descendant(cte);
1135      remove_cap(dest);
1136    }
1137    while(has_copies(cte) == true) {
1138      struct cte *dest = get_next_copy(cte);
1139      remove_cap(dest);
1140    }
1141    return success;
1142  }
1143
1144  errval_t err = success;
1145  acquire_lock();
1146  if (!cap_exists(cte)) {
1147    err = fail;
1148    goto done;
1149  }
1150
1151  while(has_descendants(cte) == true) {
1152    struct cte *dest = get_next_descendant(cte);
1153    remote_cap(dest);
1154  }
1155  while(has_copies(cte) == true) {
1156    struct cte *dest = get_next_copy(cte);
1157    remote_cap(dest);
1158  }
1159  release_lock();
1160done:
1161  return err;
1162}
1163\end{verbatim}
1164
1165\textbf{Cross-core transferring a capability:} The remote relations
1166cache on local and remote capabilities is updated as presented in
1167section \ref{subsec:cache:summary}.
1168
1169\subsubsection{Multiple locks}
1170\note{NYI}
1171If a single lock for the entire system is used, only one capability
1172operation can be performed at any given moment limiting the available
1173potential for parallelism. By using different locks for unrelated
1174capabilities, multiple capability operations can proceed in
1175parallel. Using multiple locks will increase the space requirement but
1176will also improve parallelism.
1177
1178\note{Multiple locks are not straight-forward. Ancestors reference
1179  more memory than descendants do.}
1180
1181\note{Can a copy lock for delete, a descendant lock for retype, and
1182  both for revoke work?}
1183
1184\subsection{Total order broadcast}
1185\note{Use a single data structure for all pending cap operations.}
1186
1187The required transactional and ordering guarantees can be provided by
1188ensuring that messages for related capabilities are delivered in the
1189same order on all cores. Note that causal order delivery resolves the
1190conflict between retype, delete, revoke and cross-core transfer
1191operation but not within the retype, delete, revoke operations.
1192
1193Below we present pseudo-code for how the capability operations will be
1194implemented when not caching information about remote relations.
1195
1196\textbf{Copying a capability:} This operation is safe to be performed
1197without any ordering requirements.
1198
1199\textbf{Deleting a capability:} If the capability does not require
1200type-specific operations or has local copies, the operation succeeds.
1201Or else, local state is created and a message is broadcast to all
1202cores.
1203
1204\begin{verbatim}
1205cap_delete(struct cte *cte) {
1206  if (!requires_typed_ops(cte) || has_local_copies(cte)) {
1207    remove_cap(data->cte);
1208    return success;
1209  }
1210
1211  struct delete_data *data = malloc(sizeof(struct delete_data));
1212  delete_data_initialize(data, cte);
1213  outstanding_delete_list->add(data);
1214  send_delete_request_bcast(TOTAL_ORDER, data);
1215}
1216\end{verbatim}
1217
1218If a core receives a delete request broadcast and it was the sender of
1219the message, it removes the capability and returns. If the core was
1220not the sender of the message, then it replies with the state of local
1221copies of the specified capability.
1222
1223\begin{verbatim}
1224delete_request_bcast(coreid_t from, struct delete_request *data) {
1225  if (from == my_core_id) {
1226    remove_cap(data->cte);
1227    return;
1228  }
1229
1230  send_delete_reply(from, data, has_local_copies(data->cte));
1231}
1232\end{verbatim}
1233
1234Finally, when a core receives a reply from a delete request, if a
1235remote copy is found, no type-specific cleanup is required and the
1236operation finishes. If all replies have been aggregated and no copies
1237were found, then the type-specific cleanups are performed and then the
1238operation finishes.
1239
1240\begin{verbatim}
1241delete_reply(coreid_t from, bool has_copies) {
1242  struct cte *my_data = outstanding_deletes_list->get(data);
1243  if (!my_data) {
1244    return;
1245  }
1246
1247  if (has_copies) {
1248    outstanding_deletes_list->remove(my_data);
1249    return;
1250  }
1251
1252  increment_replies(my_data);
1253  if (seen_all_replies(my_data)) {
1254    type_specific_ops(cte);
1255    outstanding_deletes_list->remove(my_data);
1256  }
1257}
1258\end{verbatim}
1259
1260Since the deleting cores remove the capability when they receive the
1261broadcast, when multiple cores are trying to delete copies, the last
1262core's broadcast will see no copies while other cores will see copies.
1263
1264\note{Malloc probably means unbounded memory requirement.}
1265
1266\textbf{Retyping a capability:} If the capability has local
1267descendants, then the operation fails. Else, a retype request
1268broadcast is sent.
1269
1270\begin{verbatim}
1271cap_retype(struct cte *cte) {
1272  if (has_local_descendants(cte)) {
1273    return fail;
1274  }
1275
1276  struct retype_data *data = malloc(sizeof(struct retype_data));
1277  retype_data_initialize(data, cte);
1278  outstanding_retypes_list->add(data);
1279  send_retype_request_bcast(data);
1280}
1281\end{verbatim}
1282
1283When a core receives a retype request broadcast and it was the sender
1284of the message, one of two things may have happened. Either the core
1285had already received a broadcast from another core trying to retype
1286the same capability in which case the receiver's operation has failed
1287and the state for the request has been removed or the core can succeed
1288in retyping the capability and updates its state accordingly. If the
1289core was not the sender of the broadcast, it sends a reply to the
1290sender with the state of its local descendants. Then the core checks
1291if it has an outstanding retype request for the capability. If it does
1292and its request has not been delivered yet, its retype fails. An error
1293is sent to the application and the state is garbage collected.
1294
1295\begin{verbatim}
1296retype_request_bcast(coreid_t from, struct retype_request *data) {
1297  if (from == my_core_id) {
1298    struct cte *my_data = outstanding_retypes_list->get(data);
1299    if (!my_data) {
1300      return;
1301    }
1302    my_data->can_succeed_flag = true;
1303    return;
1304  }
1305
1306  send_retype_reply(from, data, has_local_descendants(data->cte));
1307  struct cte *my_data = outstanding_retypes_list->get(data);
1308  if (!my_data) {
1309    return;
1310  }
1311
1312  if (!my_data->success_flag) {
1313    outstanding_retypes_list->remove(my_data);
1314    return_failure_to_app();
1315  }
1316}
1317\end{verbatim}
1318
1319When a core receives a reply to the retype request broadcast, the
1320operation may already have been failed, in which case the reply is
1321ignored. If the operation has not been canceled yet, then if the reply
1322indicates no descendants, the operation can still succeed. If all
1323replies are seen then the retype operation succeeds.
1324
1325\begin{verbatim}
1326retype_reply(struct retype_request *data, bool has_descendants) {
1327  struct cte *my_data = outstanding_retypes_list->get(data);
1328  if (!my_data) {
1329    return;
1330  }
1331
1332  if (has_descendants) {
1333    outstanding_retypes_list->remove(my_data);
1334    return_failure_to_app();
1335    return;
1336  }
1337
1338  increment_replies(my_data);
1339  if (seen_all_replies(my_data)) {
1340    create_descendants();
1341    outstanding_retypes_list->remove(my_data);
1342    return_success_to_app();
1343  }
1344}
1345\end{verbatim}
1346
1347This resolves the conflicts between two retypes. The conflict between
1348retype and revoke are discussed below.
1349
1350\note{Malloc probably means unbounded memory requirement.}
1351
1352\textbf{Revoking a capability:} The core initializes some local state
1353and then broadcasts a revoke request to all cores in the system.
1354
1355\begin{verbatim}
1356cap_revoke(struct cte *cte) {
1357  struct revoke_data *data = malloc(sizeof(struct revoke_data));
1358  revoke_data_initialize(data, cte);
1359  outstanding_revokes_list->add(data);
1360  send_revoke_request_bcast(TOTAL_ORDER, data);
1361}
1362\end{verbatim}
1363
1364When a core receives a revoke request broadcast, if the core was
1365trying to retype a related capability, then it fails the retype
1366operation. If it is not trying to revoke the capability itself, it
1367simply revokes the capability locally and sends a reply. If the core
1368is trying to revoke the same capability but it was not the sender of
1369the broadcast, then this core's revoke operation fails.
1370
1371\begin{verbatim}
1372revoke_request_bcast(coreid_t from, struct revoke_request *data) {
1373
1374  if (related_retype_in_progress(data->cte)) {
1375    outstanding_retypes_list->remove(data);
1376    return_fail_to_app();
1377  }
1378
1379  struct cte *my_data = outstanding_revokes_list->get(data);
1380  if (!my_data) {
1381    revoke_locally(data->cte);
1382    send_revoke_reply(from, data);
1383    return;
1384  }
1385
1386  if (from != my_core_id && !my_data->success_flag) {
1387    outstanding_revokes_list->remove(my_data);
1388    send_revoke_reply(from, data);
1389    return_failure_to_app();
1390    return;
1391  }
1392
1393  my_data->success_flag = true;
1394}
1395\end{verbatim}
1396
1397When a core receives a reply from the broadcast, it aggregates them
1398till it has heard back from all cores in the system and then sends a
1399success to the application.
1400
1401\begin{verbatim}
1402revoke_reply(struct revoke_request *data) {
1403  struct cte *my_data = outstanding_revokes_list->get(data);
1404  if (!my_data) {
1405    return;
1406  }
1407
1408  increment_replies(my_data);
1409  if (seen_all_replies(my_data) && !my_data->sent_transfer_cap_delete_flag) {
1410    outstanding_revokes_list->remove(my_data);
1411    return_success_to_app();
1412  }
1413}
1414\end{verbatim}
1415
1416\textbf{cross core transfer:} When sending a capability to another
1417core, the sending core creates local states and broadcasts the send
1418message to all cores in the system.
1419
1420\begin{verbatim}
1421cap_transfer(struct cte *cte, coreid_t to) {
1422  struct transfer_data *data = malloc(sizeof(struct transfer_data));
1423  transfer_data_initialize(data, cte);
1424  outstanding_transfers_list->add(data);
1425  send_transfer_request_bcast(TOTAL_ORDER, data, to);
1426}
1427\end{verbatim}
1428
1429When a core receives the capability transfer broadcast, it checks
1430against the state of current capability operations in progress and
1431takes appropriate actions if they are related.
1432
1433If a revoke operation is in progress that is trying to revoke a copy
1434or an ancestor of the capability being transferred, the broadcast will
1435indicate that the system indeed has more copies and descendant of the
1436capability being revoked which must be deleted. If the receiver was
1437also the recipient of the capability, it does not create it and
1438returns an error to the sender of the capability or else the receiver
1439sends a message to the core to which the capability is being
1440transferred to requesting it to delete the capability.
1441
1442\begin{verbatim}
1443transfer_request_bcast(coreid_t from, struct transfer_data *data, coreid_t to) {
1444  struct cte *my_data;
1445  my_data = outstanding_revokes_list->get(data);
1446  if (my_data) {
1447    if (to == my_core_id) {
1448      send_transfer_reply(from, FAILURE_REVOKED, data);
1449      return;
1450    }
1451
1452    my_data->sent_transfer_cap_delete_flag = true;
1453    send_delete_transferred_cap_request(to, data);
1454  }
1455
1456  if (to != my_core_id) {
1457    return;
1458  }
1459
1460  my_data = pending_delete_for_transferred_cap_list->get(data);
1461  if (my_data) {
1462    pending_delete_for_transferred_cap_list->remove(my_data);
1463    send_delete_transferred_cap_reply(my_data->from);
1464    send_transfer_reply(from, FAILURE_REVOKED, data);
1465  }
1466
1467  cap_create(data->cte);
1468  send_transfer_reply(from, SUCCESS);
1469}
1470\end{verbatim}
1471
1472When the sender of the capability gets a reply from the receiver, it
1473forwards the error code to the application and garbage collects its
1474state.
1475
1476\begin{verbatim}
1477transfer_reply(errval_t err, struct transfer_request *data) {
1478  my_data = outstanding_transfers_list->get(data);
1479  outstanding_transfers_list->remove(data);
1480  return_err_to_app(my_data->app, err);
1481}
1482\end{verbatim}
1483
1484If a revoke was in progress during the transfer of the capability, the
1485core revoking the capability sends a message to the receiver of the
1486transferred capability to delete it. When the receiver receives this
1487message, it may or may not have received the capability yet. If it has
1488already received the capability, it deletes or else it establishes
1489some state to delete when it is later received.
1490
1491\begin{verbatim}
1492delete_transferred_cap_request(coreid_t from, struct data *data) {
1493  if (cap_exists(data->cte)) {
1494    remove_cap(cte);
1495    send_delete_transferred_cap_reply(from);
1496    return;
1497  }
1498
1499  struct pending_delete_for_transferred_cap *my_data =
1500    malloc(sizeof(struct pending_delete_for_transferred_cap));
1501  pending_delete_for_transferred_cap_initialize(my_data);
1502  pending_delete_for_transferred_cap_list->add(my_data);
1503}
1504\end{verbatim}
1505
1506\begin{verbatim}
1507delete_transferred_cap_reply(struct data *data) {
1508  my_data = outstanding_revokes_list->get(data);
1509  if (my_data) {
1510    return;
1511  }
1512
1513  my_data->sent_transfer_cap_delete_flag = false;
1514  if (seen_all_replies(my_data)) {
1515    outstanding_revokes_list->remove(my_data);
1516    return_success_to_app();
1517  }
1518}
1519\end{verbatim}
1520
1521\note{Caching NYI}
1522
1523\subsection{Two phase commit}
1524Use two-phase commit to synchronize operations that would otherwise conflict.
1525We do not discuss this solution in detail, as we incorporate two-phase commit
1526into the delete and revoke operations for the hybrid solution discussed in
1527section~\ref{sec:sol:hybrid}.
1528
1529\subsection{Sequencer}
1530Use a sequencer. This will order whole operations. We do not discuss this
1531solution in depth, as the sequencer idea is incorporated into the hybrid
1532solution in section~\ref{sec:sol:hybrid}.
1533
1534\subsection{Hybrid}
1535\label{sec:sol:hybrid}
1536
1537We can, of course, combine facets of each of the previously discussed
1538approaches to build a hybrid approach which combines some (or all) of them.
1539
1540The solution discussed in chapters two and three of Mark Nevill's master's
1541thesis~\cite{Nevill2012}, is one such hybrid approach, which combines locking,
1542per capability sequencing, and two-phase commit.
1543Additionally, Mark's solution does not require any additional requirements on
1544Barrelfish's message channels, and works flawlessly with SSF message
1545semantics.
1546
1547We only give a brief overview of the solution here, and refer to Mark's thesis
1548which discusses the solution in depth, giving invariants, pre- and
1549post-conditions, and algorithm sketches for each operation.
1550
1551The key concept for this solution is that, for each capability, of which there
1552can exist arbitrarily many copies, one core in the system is chosen as the
1553capability's owner, or sequencer.
1554All operations on any copy of a capability that need synchronization have to
1555be proxied through the capability's owner -- this is the sequencer aspect of
1556the solution.
1557
1558We use locks to eliminate possible conflicts between operations, and merge
1559overlapping deletes and revokes to avoid deadlocks.
1560
1561Delete and revoke employ a form of two-phase commit, which is implemented as a
1562\emph{mark} and \emph{sweep} algorithm.
1563Other operations treat capabilities that have been marked for deletion as
1564already deleted, which avoids many otherwise conflicting operation sequences.
1565
1566\subsubsection{Caching}
1567Caching for this solution is implemented using a bitfield which has one bit
1568each indicating the presence of remote copies, descendants, and ancestors
1569respectively.
1570
1571\subsection{Multicast vs Broadcast}
1572To implement two-phase commit, this solution could use multicast messages to
1573all cores that have remote copies/descendants/ancestors or simply use
1574broadcast and have cores that do not have any copies reply with a success
1575reply for the mark phase.
1576
1577\subsection{Comparison}
1578\note{TODO: Compare the approaches}
1579
1580\subsection{Implementation}
1581
1582Currently, the capability operations are implemented using the hybrid
1583technique outlined above.
1584The implementation uses broadcasts to implement 2PC, because the this way, the
1585implementation does not have to keep, and update, a list of remote cores that
1586have copies, descendants or ancestors for each capability.
1587
1588%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1589
1590\chapter{Implementation details}\label{chap:implementation}
1591
1592\begin{itemize}
1593\item Using THC
1594\item Using RPC between app and monitor
1595\item Using RPC between monitors
1596\item Everything having ancestors on memserv
1597\item Optimization: monitor caches root cnodes
1598\item Bootstrap
1599\item How to put enum objtype in interface file?
1600\item Two types of ifdefs: type of caching (none, list, or bits) and
1601  type of mgmt
1602\end{itemize}
1603
1604\section{Performing capability operations}
1605If caching is not used, then the application knows for which
1606operations it must contact the monitor and does so directly without
1607requesting the kernel to try to perform the operation first.
1608
1609If caching is used, then the application should try to perform the
1610operation via the kernel first. If the capability does have remote
1611relations, the kernel returns an appropriate error to the application
1612in which case it contacts the monitor.
1613
1614\section{Sharing mdb between kernel and monitor}
1615When the application wants the monitor to perform an operation for it,
1616it passes the monitor its root CNode and all the required parameters.
1617
1618%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1619
1620\chapter{Not Yet Discussed}\label{chap:nyd}
1621
1622Things that I know that I need to discuss.
1623
1624\begin{itemize}
1625
1626\item The OS does not guarantee which order the operations will be
1627  performed in. The user must enforce the ordering herself.
1628
1629\item Partitioned approach with caching is eager replication. Consider
1630  lazy replication.
1631\end{itemize}
1632
1633\bibliographystyle{plain}
1634\bibliography{defs,barrelfish}
1635
1636\end{document}
1637