%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Copyright (c) 2011-2017, ETH Zurich.
% All rights reserved.
%
% This file is distributed under the terms in the attached LICENSE file.
% If you do not find this file, copies can be found by writing to:
% ETH Zurich D-INFK, Universitaetstr. 6, CH-8092 Zurich. Attn: Systems Group.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\documentclass[a4paper,twoside]{report} % for a report (default)

\usepackage{bftn} % You need this
\usepackage[pdftex]{graphicx}
\usepackage{color,listings,ctable}

\title{Capability Management in Barrelfish} % title of report
\author{Akhilesh Singhania, Ihor Kuz, Mark Nevill, Simon Gerber} % author
\tnnumber{013} % give the number of the tech report
\tnkey{Capability Management} % Short title, will appear in footer

% \date{Month Year} % Not needed - will be taken from version history

\newcommand{\note}[1]{[\textcolor{red}{\textit{#1}}]}

\lstdefinelanguage{Mackerel}{
  morekeywords={datatype,device,register,regtype,constants,type,at,
    many,edit,io,also},
  sensitive=false,
  morecomment=[l]{//},
  morecomment=[s]{/*}{*/},
  morestring=[b]",
  showspaces=false,
  showstringspaces=false,
  showtabs=false,
}

\newcommand{\noarginvocation}[1]{\paragraph{#1 invocation}}
\newcounter{invocArgCnt}
\newenvironment{invocation}[1]{%
  \noarginvocation{#1}

  \begin{list}{\emph{Argument~\arabic{invocArgCnt}:}}{%
    \usecounter{invocArgCnt}%
    \setlength{\rightmargin}{\leftmargin}%
    \setlength{\itemsep}{0ex}%
  }
  \renewcommand{\arg}{\item}
}{%
  \end{list}
}

\begin{document}
\maketitle

%
% Include version history first
%
\begin{versionhistory}
\vhEntry{1.0}{08.03.2011}{AS}{Initial version}
\vhEntry{2.0}{27.1.2012}{MN}{New capability system design}
\vhEntry{2.1}{8.7.2013}{SK}{Updates}
\vhEntry{2.2}{1.12.2013}{TR}{Fixed missing references/citations}
\vhEntry{3.0}{2.06.2017}{SG}{Update to new CSpace design and remove outdated info}
\end{versionhistory}

% \intro{Abstract} % Insert abstract here
% \intro{Acknowledgements} % Uncomment (if needed) for acknowledgements
% \tableofcontents % Uncomment (if needed) for final draft
% \listoffigures % Uncomment (if needed) for final draft
% \listoftables % Uncomment (if needed) for final draft

\chapter{Introduction}

This document discusses the state of capabilities in the Barrelfish
operating system.

Chapter \ref{chap:known_issues} lists the currently known issues with
capability management, and chapter \ref{chap:type_system} discusses
the type system.

Chapter \ref{chap:current_state} discusses the current state of the
implementation in Barrelfish, chapter \ref{chap:db} discusses
different approaches for maintaining a multicore mapping database of
capabilities, chapter \ref{chap:solutions} discusses the requirements
for a correct solution and compares four different solutions,
chapter \ref{chap:implementation} discusses some Barrelfish-specific
challenges in implementing the solutions, and chapter \ref{chap:nyd}
highlights issues not yet covered in this document.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Known Issues}\label{chap:known_issues}
\begin{itemize}

\item Kernel operations should not take longer than $O(1)$. Some
  capability operations are not constant time but can take $O(\log n)$,
  where $n$ is the size of the mapping database. Spending so much time
  in the kernel is detrimental to good scheduling, because interrupts
  are disabled while executing in the kernel. All non-constant-time
  operations should be punted to the monitor.

\item When the last copy of an LMP capability, or of a frame capability
  backing a UMP channel, is deleted, should it initiate a connection
  tear-down?

\item Fragmentation of memory.
  Example: a RAM capability of 4GB is
  retyped into two capabilities of 2GB each. One of the 2GB capabilities
  is deleted. The only way to get a capability to that 2GB region back is
  to also delete the other 2GB capability and then retype the 4GB
  capability again.

\item IO space can be treated in the same way as physical memory. So
  instead of using mint operations to update ranges, we use retype
  operations. This is useful as different IO capabilities will then have
  ancestor, descendant, and copy relationships rather than just a
  copy relationship. The former is richer and can offer improved
  maintenance.

\item When a new capability is introduced on a core via cross-core
  send, the kernel must walk the entire mapping database to find the
  appropriate place to insert the capability. The operation takes
  logarithmic time in the size of the mapping database. This
  operation must either happen in the monitor or must be made more
  efficient. If we move the mapping database into the monitor, we
  think that this should at worst become a performance issue and not
  a correctness one.

\end{itemize}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{type_system.tex}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Current State}\label{chap:current_state}

This chapter covers how capabilities are stored and what happens
on capability invocation.

\section{Storage}\label{sec:cspace}
For security reasons, capabilities are stored in kernel-space and
users are given references to them. Each capability is stored in two
separate databases:

\begin{itemize}
\item Each dispatcher has an associated capability space that holds
  all the capabilities it has. The capability space of a dispatcher
  is implemented using the CNode type capability.
  Each dispatcher is
  associated with a ``root CNode'' that contains all the capabilities
  the dispatcher has.
\item Each core has a mapping database that holds all the capabilities
  on the core. The mapping database is implemented as a tree of
  capabilities. As discussed later in this chapter, the mapping database
  stores the capabilities in a particular order to facilitate the
  different capability operations.
\end{itemize}

\section{Capability invocation}\label{sec:sys_invoke}
When a dispatcher invokes a capability, it passes the kernel the
address in its CSpace of the capability it wishes to invoke. The
kernel locates the capability starting from the dispatcher's root
CNode (it walks the capability space), verifies that the requested
operation can indeed be performed on the specified capability, and then
performs the operation.

\section{Data structures}
Capabilities in Barrelfish are represented by the following data
structures:

{\scriptsize
\begin{verbatim}
  struct mdbnode {
      struct cte *left, *right;  // Links to the mapping database
      ...
  };

  struct CNode {
      paddr_t cnode;             // Base address of CNode
      uint8_t bits;              // Size in number of bits
      ...
  };

  union capability_u {           // Union of all types of capabilities
      ...
      struct CNode cnode;
      ...
  };

  struct capability {
      enum objtype type;         // Type of capability
      union capability_u u;      // Union of the capability
  };

  struct cte {
      struct capability cap;     ///< The actual capability
      struct mdbnode mdbnode;    ///< MDB node for the cap
  };
\end{verbatim}
}

A capability table entry, \verb|cte|, consists of the actual capability,
represented by the ``capability'' structure, and an entry in the mapping
database, represented by the ``mdbnode'' structure.
The capability structure contains the type-specific
information and the mdbnode contains the pointers for the tree
representing the mapping database.

Capabilities can be looked up in two ways.
\begin{itemize}
\item All capabilities on a core are stored in the mapping database. It
  is possible to reach any capability on the core by traversing from
  any other capability on the core.

\item Capabilities are also stored in CNode type capabilities. The
  area of memory identified by the CNode structure is actually an
  array of capabilities. Starting from the ``root CNode'' of a
  dispatcher, it is possible to reach only the capabilities that the
  dispatcher holds.
\end{itemize}

\section{Terminology}

This section introduces some terminology to facilitate the discussion
of capability management.

\subsection{Copy}

A capability X is a copy of a capability Y if:

\begin{itemize}
\item X was copied from Y,
\item or Y was copied from X,
\item or X is a copy of Z and Z is a copy of Y.
\end{itemize}

\subsection{Descendants}

A capability X is a descendant of a capability Y if:

\begin{itemize}
\item X was retyped from Y,

\item or X is a descendant of Y1 and Y1 is a copy of Y,

\item or X is a descendant of Z and Z is a descendant of Y,

\item or X is a copy of X1 and X1 is a descendant of Y.
\end{itemize}

\subsection{Ancestor}

A is an ancestor of B if B is a descendant of A.

\section{CNode invocations}

Most capabilities have type-specific invocations. Operations on a
CNode capability modify the capability space of the system. Here we
discuss how these operations are implemented on a single-core system.
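Each of these invocations begins with the CSpace walk described in
section \ref{sec:sys_invoke}. The following is a rough, self-contained
sketch of that walk, not the actual Barrelfish code: it models a CNode
as a direct pointer to an array of slots (rather than a physical
address plus size in bits), omits all rights and type-specific checks,
and the name \verb|lookup_cap| is invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of the structures from the text: a CNode holds
 * an array of 2^bits slots, each of which is itself a cte. */
enum objtype { ObjType_Null, ObjType_CNode, ObjType_Frame };

struct cte;

struct CNode {
    struct cte *slots;   /* simplification: pointer instead of paddr_t */
    uint8_t bits;        /* number of slots is 1 << bits */
};

struct capability {
    enum objtype type;
    union { struct CNode cnode; } u;
};

struct cte {
    struct capability cap;
};

/* Resolve a capability address by consuming its bits from the most
 * significant end, descending through one CNode per iteration. */
struct cte *lookup_cap(struct cte *root, uint32_t caddr, uint8_t valid_bits)
{
    struct cte *cur = root;
    while (valid_bits > 0) {
        if (cur->cap.type != ObjType_CNode)
            return NULL;                    /* cannot descend further */
        uint8_t bits = cur->cap.u.cnode.bits;
        if (bits > valid_bits)
            return NULL;                    /* address has too few bits */
        uint32_t idx = (caddr >> (valid_bits - bits)) & ((1u << bits) - 1);
        cur = &cur->cap.u.cnode.slots[idx];
        valid_bits -= bits;
    }
    return cur;                             /* all address bits consumed */
}
```

For example, with a two-level CSpace of 4-slot CNodes, the 4-bit
address 0b0111 selects slot 1 in the root CNode and then slot 3 in the
second-level CNode.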

\note{Invocations on other capability types will probably also modify
  the capability space, but we do not yet know how those will work.}

\subsection{Retype}

Retyping a capability creates one or more descendants of the
capability. The operation fails if the capability already has
descendants. The descendants are inserted into a CNode specified by
the operation and into the mapping database right after the retyped
capability.

When a dispatcher issues the retype invocation, the kernel must traverse the
mapping database to ensure that the capability has no descendants, create the
descendant capabilities, and insert them into the specified CNode and into
the mapping database.

\subsection{Copy}

Copying a capability creates a new copy of it. The kernel walks the capability
space to find the capability to be copied, creates the copy, and inserts it
into the specified CNode and the mapping database.

\subsection{Delete}

Delete removes the specified capability from the CNode in which it
resides and from the mapping database. This operation cannot fail.

The kernel first walks the capability space to locate the capability
to delete. It then walks the mapping database to check whether copies
of the deleted capability still exist. If no copies are found, it
performs certain operations based on the capability type.

\subsection{Revoke}

Revoking a capability calls delete on all copies and descendants of
it. When the operation returns, the capability will not have any
copies or descendants.

The kernel walks the capability space to find the specified
capability, uses the mapping database to find all copies and
descendants of the specified capability, and deletes them.
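Delete and revoke rely on type-specific notions of ``copy'' and
``descendant''. As a hedged illustration only, and not Barrelfish's
actual definitions, for memory-backed capability types these relations
can be modeled as range containment; the \verb|memcap| structure and
both predicate names below are invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical memory-backed capability: a type tag plus the
 * [base, base + size) region of memory it grants authority over. */
struct memcap {
    int type;              /* e.g. RAM vs. Frame */
    uint64_t base, size;
};

/* Copies denote the same object: same type and same region. */
bool is_copy(const struct memcap *a, const struct memcap *b)
{
    return a->type == b->type && a->base == b->base && a->size == b->size;
}

/* In this simplified model a descendant covers a strict subrange of
 * its ancestor, since retype carves smaller regions out of larger
 * ones; containment that is not equality suffices here. */
bool is_descendant(const struct memcap *anc, const struct memcap *desc)
{
    return desc->base >= anc->base &&
           desc->base + desc->size <= anc->base + anc->size &&
           !is_copy(anc, desc);
}
```

Under this model, retyping a 4GB RAM capability into two 2GB halves
makes each half a descendant of the original, while a second cap with
identical type, base, and size is a copy.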

\subsection{Looking up local copies and descendants}

Due to the capability ordering used by the mapping database, copies are located
adjacent to a capability and descendants immediately thereafter. Therefore, it
is easy to look up all related copies of a capability on the same core. This
facilitates revocations by looking up all copies and descendants, retypes by
checking for existing descendants, and deletes by checking for copies.

The following pseudo-code looks up all descendants and copies of a
capability, given the existence of type-specific is\_copy and
is\_descendant functions.

{\scriptsize
\begin{verbatim}
  // Traverse forward
  mdbnode *walk = successor(cap);
  while (walk) {
      // Check if descendant
      if (is_descendant(cap, walk)) {
          // Found a descendant
          goto increment;
      }

      // Check if copy
      if (is_copy(cap, walk)) {
          // Found a copy
          goto increment;
      }

      // Cap is not a descendant or copy
      break;

   increment:
      walk = successor(walk);
  }

  // Traverse backwards
  walk = predecessor(cap);
  while (walk) {
      // Predecessors cannot be descendants

      // Check if copy
      if (is_copy(cap, walk)) {
          // Found a copy
          goto decrement;
      }

      // Cap is not a copy
      break;

   decrement:
      walk = predecessor(walk);
  }
\end{verbatim}
}

\section{Multicore extensions}

The model above works for a single-core system. We have already
extended it to work on multiple cores. Here we discuss these
extensions.

We implement the multicore extension based on the system discussed in detail
in chapters 2 and 3 of Mark Nevill's master's thesis, ``An Evaluation of
Capabilities for a Multikernel''~\cite{Nevill2012}.

\subsection{Cross-core transfer}

This is a special operation available only to the monitors. It sends
a capability from one core to another.

\section{Summary}

In this chapter, we presented background and the current state of
capability management in Barrelfish. We can now discuss different
designs for multicore capability management.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Maintaining the database}\label{chap:db}
We consider the following approaches for managing the mapping
database.

\note{Use diagrams to better illustrate the discussion below.}

\begin{itemize}
\item \textbf{No partition:} The mapping database and capability space
  for all cores are maintained on a single centralized
  coordinator. Accessing either the capability space or the mapping
  database on any core requires communication with the coordinator.

\item \textbf{Partitioned data structure:} The mapping database for
  all cores is maintained on a single centralized coordinator and the
  capability space is maintained on the local cores. This implies that
  the cte structure is split between the coordinator and the local
  cores. Cores can access the capability space locally but have to
  message the coordinator to access the mapping database. Note that
  the split ``capability'' and ``mdbnode'' structures need to maintain
  references to each other.

\item \textbf{Centrally replicated space:} The mapping database and
  the capability space are replicated between a single coordinator and
  the local cores. This implies that two copies of each ``cte''
  structure exist in the system, one on the local core and one on the
  coordinator. Both the local core and the coordinator can access the
  mapping database and the capability space.

\item \textbf{Partitioned space:} The mapping database and the
  capability space are maintained on the local cores.
  The entire
  ``cte'' structure can be accessed locally, but capability operations
  may require coordination and communication with remote cores.

\item \textbf{Minimal replication:} The database is partitioned
  between local cores as in the above approach, and additionally, for
  each capability, the cores maintain a cache of its remote relations.
  This is discussed in more detail in section \ref{sec:cache}.
\end{itemize}

We qualitatively compare the above approaches and show why partitioned
space and minimal replication are the best.

\section{Comparison}\label{sec:comparison}
We compare the above approaches in this section.

\subsection{Efficiency of capability invocations}
When a dispatcher invokes a capability, the kernel has to look it up
in the capability space starting from the dispatcher's ``root'' CNode.
If the capability space is not local, then the core has to message the
coordinator to perform the look-up. A round trip to the coordinator
is more expensive than a local look-up and can become a scalability
bottleneck if we use a single coordinator.

The no partition approach does not maintain a local capability space
and therefore may suffer from poor performance. All other approaches
maintain the capability space locally.

\subsection{Local operations}
Approaches that enable more purely local operations will perform
better, as they reduce the amount of cross-core communication. In the
no partition, partitioned data structure, and centrally replicated
space approaches, no operation can be performed locally. In the
partitioned space and minimal replication approaches, certain
operations, such as copying capabilities, are purely local.

\begin{table*}
\begin{center}
\begin{tabular}{|l|c|c|}
  \hline

  & Cap invocation & Local operations \\
  \hline

  No partition & - & - \\ \hline
  Partitioned data structure & + & - \\ \hline
  Centrally replicated space & + & - \\ \hline
  Partitioned space & + & + \\ \hline
  Minimal replication & + & + \\ \hline

\end{tabular}
\end{center}
\caption{\label{t:summary}Summary of the different approaches.}
\end{table*}

\subsection{Discussion}
Table \ref{t:summary} summarizes our comparison of the five
approaches. A (+) indicates that the approach performs relatively well
on the given metric and a (-) indicates that the approach performs
relatively poorly. Based on the results, we choose to implement the
partitioned space and minimal replication approaches.

\section{Caching}\label{sec:cache}
The minimal replication approach is the partitioned space approach
extended with a cache of the state of remote relations. Capabilities
without remote relations are marked as such, and operations on these
require no remote communication.

When tracking remote relations, three types of relations must be
tracked: copies, descendants, and ancestors. Tracking of remote copies
and descendants is required so that revoke, retype, and delete
operations can be implemented correctly. Capabilities must also track
their remote ancestors so that, when they are deleted, the remote
ancestors can be informed and update the state of their remote
descendants.

\subsection{How to maintain the cache?}

\note{Once I discuss total order broadcast with caching, this
  discussion will be revamped.}

There are two ways of maintaining the cache:
\begin{itemize}
\item \textbf{Single bit:} One bit for each of the three types of
  remote relations is used.
  The bits merely indicate the presence of remote
  relations but provide no further information, such as which cores
  hold the remote relations.

\item \textbf{List:} One list for each of the three types of remote
  relations is used. The list contains exact information about which
  cores the remote relations exist on. Uhlig et al.'s core mask
  technique~[?] can be used to maintain the lists with a fixed space
  requirement.
\end{itemize}

\subsubsection{Comparison}
\textbf{Informing existing relations:} When a capability that already
has remote relations is sent to another core, with the single-bit
approach none of the existing cores with relations have to be
informed, but with the list approach cores with existing relations
must be informed so they can update their lists.

With both approaches, when the last local copy of a capability is
deleted, its remote relations should be informed. This is required in
the list approach so that the cores can update their lists, and it is
required in the single-bit approach so that the cores can determine
whether the last remote relation has been deleted and, if so, mark the
capability as not having that remote relation anymore.

\textbf{Number of cores communicated with:} With the single-bit
approach, all cores in the system must be contacted, while with the
list approach, just the cores of interest must be contacted.

\textbf{Discussion:} The single-bit approach will probably be more
efficient if the system has few cores and the capabilities that have
remote relations tend to have them on many cores. The list approach
will perform better if there are many cores in the system and a
capability typically has relations on only a few cores.

\subsection{Where to maintain the cache?}\label{subsec:where}
Imagine that there exists a capability with only local relations, and
then cross-core transfer is applied to it.
Should we only mark that
capability as having a remote copy, or should we mark all impacted
capabilities accordingly? This section discusses and compares these
two approaches.

\subsubsection{Mark all}
When cross-core transfer is applied to a capability, it is marked as
having remote copies and all its local relations are also marked as
having the appropriate remote relations. Its local copies are
marked as having remote copies, its local descendants are marked
as having remote ancestors, and its local ancestors are marked as
having remote descendants.

All capabilities maintain the complete state of their remote relations at
all times.

\subsubsection{Mark just individual capability}
When cross-core transfer is applied to a capability, it is marked as
having remote copies and none of its local relations are updated.

Capabilities do not maintain the complete state of their remote
relations. Their local relations must be looked up to build the
complete state of their remote relations.

\subsubsection{Comparison}
Both approaches have specific strengths and weaknesses, discussed
here.

\textbf{Cost of cross-core transfer:} When applying cross-core
transfer to a capability, in the mark all approach all local
relations must be updated, whereas in the mark just individual
capability approach the local relations do not have to be updated.
However, the cross-core transfer operation needs to send full
information about the state of local relations so that the newly
created remote capability can set up its cache properly. Gathering this
information requires accessing all local relations anyway, so the mark
all approach adds only a small overhead to an operation that must be
performed regardless.

\textbf{Cost of building state of remote relations:} Any capability
operation that requires information on the current state of remote
relations has to build the full state of remote relations for the
capability. This is trivial in the mark all approach, as each
capability maintains the full state. In the mark just individual
capability approach, all local relations must be accessed to build the
state.

\textbf{Discussion:} Given that the state of remote relations is
trivially constructed in the mark all approach and the cost of
cross-core transfer is only marginally higher than the bare minimum
cost already required, we conclude that the mark all approach is
superior to the mark just individual capability approach.

\subsection{Summary}\label{subsec:cache:summary}
In summary, we have come up with two approaches for the type of cache,
single-bit and list, and we have concluded that the mark all approach
is superior for maintaining the cache.

Maintaining a cache requires the following adjustments to the
capability operations presented above.

\begin{itemize}
\item When cross-core transfer is applied to a capability, it is
  marked as having remote copies. The operation sends information on
  the state of the capability's local and remote relations.

\item When a capability is created due to the cross-core receive
  operation, it incorporates the information about relations sent with
  the cross-core transfer operation.

\item When a copy of a capability is marked as having remote
  relations, the capability is marked as having the same remote
  relations.

\item When a descendant of a capability is marked as having remote
  relations, the capability is also marked as having remote
  relations based on the following rules:
  \begin{itemize}
  \item If the descendant has a remote copy, then the capability has a
    remote descendant.
  \item If the descendant has a remote descendant, then the capability
    has a remote descendant.
  \item If the descendant has a remote ancestor, then the capability
    either has a remote copy or a remote ancestor, depending on the
    outcome of the type-specific is\_descendant function.
  \end{itemize}

\item When an ancestor of a capability is marked as having remote
  relations, the capability is also marked as having remote
  relations based on the following rules:
  \begin{itemize}
  \item If the ancestor has a remote copy, then the capability has a
    remote ancestor.
  \item If the ancestor has a remote descendant, then the capability
    either has a remote copy or a remote descendant, depending on the
    outcome of the type-specific is\_descendant function.
  \item If the ancestor has a remote ancestor, then the capability
    has a remote ancestor.
  \end{itemize}

\item When a capability is retyped:
  \begin{itemize}
  \item Its remote copies are marked as having remote descendants.
  \item The descendants are marked as having remote ancestors if the
    capability has remote copies.
  \end{itemize}

\item When a capability is copied, the copy is marked as having the
  same remote relations as the capability.

\item When the last copy of a capability on a core is deleted, its
  remote copies, descendants, and ancestors are informed.
\end{itemize}

We now discuss some implications of maintaining a cache.

\textbf{Drawbacks:} Caching introduces the following drawbacks that
the non-caching approach does not suffer from.

\begin{itemize}
\item Application dispatchers must first try to perform a capability
  operation locally and, if that fails, then communicate with the
  monitor. If no caching were used, application dispatchers could
  always directly communicate with the monitor, saving one superfluous
  trip into the kernel.

\item As discussed above, caching requires additional overhead for
  keeping the cache consistent.

\item Caching increases the space requirement for maintaining the
  mapping database.
\end{itemize}

Caching can help if typically few capabilities are shared between
cores, and can hurt if many capabilities are shared.

\textbf{Decide dynamically:} It is clear that application scenarios
will actually dictate whether caching can help and, if it does, which
type of caching helps. To support this, Barrelfish can allow
applications to specify which type of caching is well suited to them,
or the OS can gather appropriate metrics on application execution and
dynamically update the type of caching used based on those. For this,
we need to identify the cross-over point where one approach is
preferred over the other.

\section{Implementation}

As summarized, the current implementation uses a partitioned database which
caches remote relations of capabilities with three bits, one each for remote
copies, ancestors, and descendants.

Each database partition is associated with a kernel control block, and keeps
track of all the capabilities in the CSpaces of all the dispatchers associated
with that kernel control block.
The database is implemented as a binary search tree.
We chose the AA tree~\cite{Andersson1993}, a variation of the
red-black tree in which red nodes can only be added as right children.
This greatly simplifies the implementation of the maintenance
operations while preserving the red-black tree's invariants.

The particular variation we implement for the capability database is an
interval tree as described by
Cormen et al.~\cite[section~14.3, pp.~348--354]{Cormen2001}, which uses the
AA tree as its basis.

It is noteworthy that we embed the tree nodes in each kernel capability
object to avoid any dynamic allocations in the CPU driver.
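The simplification the AA tree buys comes from reducing all rebalancing
to two local transformations, skew and split. The sketch below follows
the standard description of these operations; the bare node layout is
illustrative only and is not Barrelfish's actual embedded mdbnode.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal AA tree node; in the kernel, this state would be embedded
 * in the capability object itself to avoid dynamic allocation. */
struct aa_node {
    struct aa_node *left, *right;
    int level;          /* AA level; a child at the same level is "red" */
};

/* skew: right rotation that fixes a red node hanging off the left,
 * restoring the invariant that red nodes are only right children. */
struct aa_node *skew(struct aa_node *t)
{
    if (t && t->left && t->left->level == t->level) {
        struct aa_node *l = t->left;
        t->left = l->right;
        l->right = t;
        return l;
    }
    return t;
}

/* split: left rotation plus promotion that fixes two consecutive
 * right children at the node's level (two reds in a row). */
struct aa_node *split(struct aa_node *t)
{
    if (t && t->right && t->right->right &&
        t->right->right->level == t->level) {
        struct aa_node *r = t->right;
        t->right = r->left;
        r->left = t;
        r->level++;
        return r;
    }
    return t;
}
```

Insertion and deletion then reduce to ordinary binary-search-tree
updates followed by a fixed sequence of skew and split calls on the
path back to the root, which is why the maintenance code stays small.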

The choice of tree, the database implementation, and its performance are more
extensively discussed in chapter 4 of Mark Nevill's master's
thesis~\cite{Nevill2012}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Solutions}\label{chap:solutions}
In this chapter, we discuss mechanisms for implementing the capability
operations. In section \ref{sec:challenges}, we first discuss the
challenges in correctly implementing the operations, by showing how
executing multiple related operations in parallel can lead to
conflicts and how an intuitive solution based on acknowledgments
fails to resolve the conflicts. We then present the requirements for
a correct solution in section \ref{sec:requirements} and compare four
different correct solutions in section \ref{sec:solutions}.

\section{Challenges}\label{sec:challenges}
We first discuss the conflicts that can arise if two or more related
operations execute in parallel, and then discuss an intuitive but
incorrect solution that uses acknowledgments to resolve the conflicts.

\subsection{Conflicts}\label{subsec:conflicts}
Performing overlapping operations on related capabilities can lead to
conflicts. This section discusses the different types of conflicts
that can arise.

\textbf{Conflicts between two retypes, deletes, or revokes:} If two
different cores are trying to retype, delete, or revoke related
capabilities, then they can conflict. If two different cores try to
retype two copies of a capability, the correct behavior is for one to
succeed and for the other to fail. If two cores try to delete the last
two copies of a capability, the correct behavior is for one to just
delete locally and for the other to perform the type-specific
operations required when deleting the last copy of a capability.
If two cores try to
revoke the same capability, one should succeed, and the other should
fail, with the capability it was trying to revoke deleted as well.

\textbf{Conflict between revoke and cross-core transfer:} If a core is
trying to revoke a capability, a copy or descendant of which another
core is trying to transfer, the revoke operation should only finish
when the in-flight capability has also been deleted.

\textbf{Conflict between revoke and retype:} If a core is trying to
revoke a capability, a copy or descendant of which another core is
trying to retype, then either the retype should succeed and the new
capabilities be deleted before the revoke finishes, or the retype
should fail.

\subsection{Non-conflicts}
This section discusses why it is safe to perform other operations in
parallel.

\textbf{Delete and revoke:} Delete deletes at least one capability and
potentially more. If a revoke for a related capability is issued at
the same time, the two will overlap and potentially try to perform
redundant operations, but these are safe.

\textbf{Delete and transfer:} Delete will delete one capability and
perform a check for copies. The in-flight capability already has a
copy that can satisfy the requirements of delete. If that copy is also
being deleted, this reduces to two overlapping deletes, which indeed
is a conflict.

\textbf{Retype and delete:} A delete does not conflict with retypes
because it checks for the existence of copies, which a successful
retype cannot possibly create. A retype does not conflict with deletes
because no matter when it performs the check for descendants, it will
either see descendants that are about to be deleted or not see them at
all; either outcome is correct.

\textbf{Retype and transfer:} Retype cannot conflict with transfers,
because the capability being transferred is a copy of an existing
capability, and creating a copy does not change the existing
descendant relationships.

\subsection{Incorrect solution}
The essential issue with the cross-core transfer operation is that
in-flight capabilities do not exist anywhere. The sending core can
therefore maintain a list of in-flight capabilities, which is garbage
collected when the receiver acknowledges the reception of a
capability. This allows the sending core to include the in-flight
capabilities in the search for copies and descendants. As shown below,
this intuitive solution is actually incorrect.

\begin{figure}[t]
  \includegraphics[width=0.75\textwidth]{acks-problem.pdf}
  \caption{Problem with using acknowledgments}\label{fig:acks-problem}
\end{figure}

Consider the scenario shown in figure \ref{fig:acks-problem}. The
system consists of three cores \{A, B, C\}; A is trying to revoke a
capability, and B is trying to send a copy of the capability to C. A
sends a revoke request to B and C and waits for their
acknowledgments. B sends the capability to C and adds it to its list
of in-flight capabilities. C sees the revoke request first, performs
the operation locally, and replies to A. Then it sees the capability
sent from B, inserts it locally, and sends an acknowledgment to B. B
sees the acknowledgment from C first and garbage collects its list of
in-flight capabilities; only then does it see the revoke request from
A, perform the revoke locally, and reply to A. A receives replies from
both B and C, concluding the revoke operation. The revoke operation
has finished, yet C still has a copy of the revoked capability.
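
The bookkeeping of this incorrect scheme can be sketched as follows,
again with illustrative names. The flaw is that the in-flight list
entry is removed as soon as the acknowledgment arrives, even though
the acknowledgment carries no information about which other operations
the receiver has already processed:

\begin{verbatim}
/* Sketch of the incorrect acknowledgment-based scheme.             */
cap_send_sketch(struct cte *cte, coreid_t dest) {
    in_flight_list->add(cte);     /* searched by revokes on this core */
    send_cap(cte, dest);
}

ack_received_sketch(struct cte *cte) {
    in_flight_list->remove(cte);  /* too early: the receiver may have */
}                                 /* handled a revoke before it       */
                                  /* inserted the capability          */
\end{verbatim}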

\section{Requirements for a correct solution}\label{sec:requirements}
A correct solution will enforce the following transaction-like and
ordering properties.

\subsection{Transactional properties}
\textbf{Atomic:} If one part of a capability operation fails, then the
entire operation must fail. For example, in a retype operation, the
descendant capabilities can be created in parallel with the check for
existing descendants. If existing descendants are found, then the
newly created descendants must be removed.

\textbf{Consistent:} Every capability operation must leave the system
in a consistent state. For example, if caching is used and an
operation deletes the last copy of a capability on a core, then all
its remote relations must be informed and updated before the operation
finishes.

\textbf{Isolation:} The data that has been modified during a
capability operation must not be accessible to other operations until
the operation finishes. For example, if a retype operation has eagerly
created descendants while checking for existing descendants in
parallel, no other capability operation should be able to access the
eagerly created descendants until the retype operation finishes.

\subsection{Ordering}
Section \ref{subsec:conflicts} discusses the types of conflicts that
can arise when multiple cores try to perform operations on related
capabilities at the same time. Correctness can be ensured if the
individual steps (messages) of the operations are executed in some
order. We provide some background on ordering messages before
discussing the actual ordering requirements.

\textbf{Background on ordering:} Four types of ordering are
possible. We discuss them from the cheapest to the most expensive.

\begin{itemize}
\item No order: Messages can be delivered in any order. This cannot
  happen on Barrelfish.
\item Single Source FIFO (SSF): Messages from the same sender are
  delivered in the order they were sent. This is what is currently
  available in Barrelfish by default.
\item Causal: Messages are delivered respecting the partial order
  given by the happens-before relationship. Vector clocks [?] can be
  used to provide causal delivery of messages.
\item Total order: All receivers receive messages in the same
  order. Note that any ordering function can be used to order the
  messages. Further, total order does not imply causal order, but if
  the ordering function respects SSF, then total order does imply
  causal order.
\end{itemize}

\subsection{Conflicts with each other}

\begin{figure}[t]
  \includegraphics[width=0.75\textwidth]{causal-problem.pdf}
  \caption{Problem with using causal order
    delivery}\label{fig:causal-problem}
\end{figure}

The minimum ordering required for resolving conflicts between retype,
delete, and revoke operations is total order. Figure
\ref{fig:causal-problem} illustrates why causal ordering does not
suffice. Nodes \{A,B,C,D\} each contain copies of the same capability,
which \{A,D\} wish to retype at the same time. B receives the message
from A before the message from D, and C receives D's message before
A's. Although the messages were delivered in causal order, C will
refuse A and B will refuse D, allowing neither retype operation to
succeed. Similar examples can be constructed for the delete and revoke
operations as well.

Total order delivery will deliver the messages to B and C in the same
order, allowing exactly one retype to succeed.

\subsection{Conflict between revoke and transfer}
Figure \ref{fig:acks-problem} illustrates the conflict between the
revoke and cross-core transfer operations. C sees the message from A
before the message from B.
Hence any message it sends to B causally depends upon the message from
A. If this dependency were reflected in the message C sent to B, then
B could delay handling the message until it sees the revoke request
from A. When B receives the revoke request from A, it will realize
that A is trying to revoke an in-flight capability and take
appropriate actions to resolve the issue. The delaying of the message
can be guaranteed by enforcing causal delivery of messages.

\subsection{Conflict between revoke and retype}
The minimum ordering requirement is total order. With a less strict
ordering, different cores might see the revoke request and retype
requests in different orders. The retyping core might delete the
source capability but end up creating descendants later.

\section{Solutions}\label{sec:solutions}

We will discuss and compare the following five solutions.

\begin{itemize}
\item Locks
\item Total order broadcast
\item Two phase commit
\item Sequencer
\item Hybrid implementation with locks and 2PC
\end{itemize}

\subsection{Locks}\label{subsec:locks}
Locks can be used to enforce the transactional and ordering properties
above. Critical sections resolve the conflicts by ensuring that
\emph{time-of-check-to-time-of-use} is not violated.

Below, we provide pseudo-code for how the capability operations can be
implemented when using a centralized lock and not caching information
about remote relations.

\textbf{Copying a capability:} This operation is safe to perform
without holding a lock.

\begin{verbatim}
cap_copy(struct cte *cte) {
    create_copies();
}
\end{verbatim}

\textbf{Retyping a capability:} Holding the lock is required to
prevent multiple cores from trying to retype copies of the same
capability.
Acquiring the lock is a blocking call, during which the capability
might have been deleted, so it is important to check that the
capability still exists after the lock is acquired.

\begin{verbatim}
cap_retype(struct cte *cte) {
    errval_t err = success;
    acquire_lock();
    if (cap_exists(cte) == false) {
        err = fail;
        goto done;
    }
    if (has_descendants(cte) == true) {
        err = fail;
        goto done;
    }
    create_descendants(cte);
done:
    release_lock();
    return err;
}
\end{verbatim}

\textbf{Deleting a capability:} If the capability has local copies,
the operation is purely local. Otherwise, holding the lock is required
to ensure that if the last copy of the capability in the system is
being deleted, then the type-specific cleanup operations are executed.

\begin{verbatim}
cap_delete(struct cte *cte) {
    remove_cap(cte);

    if (!requires_typed_ops(cte)) {
        return success;
    }
    if (has_local_copies(cte)) {
        return success;
    }

    errval_t err = success;
    acquire_lock();
    if (!cap_exists(cte)) {
        goto done;
    }

    if (!has_copies(cte)) {
        type_specific_ops(cte);
    }

done:
    release_lock();
    return err;
}
\end{verbatim}

\textbf{Revoking a capability:} Holding the lock is required to ensure
that if multiple cores are trying to revoke related capabilities, only
one succeeds. This is why the code below checks that the capability
still exists after acquiring the lock.

\begin{verbatim}
cap_revoke(struct cte *cte) {
    errval_t err = success;
    acquire_lock();
    if (!cap_exists(cte)) {
        err = fail;
        goto done;
    }

    while (has_descendants(cte)) {
        struct cte *dest = get_next_descendant(cte);
        remove_cap(dest);
    }
    while (has_copies(cte)) {
        struct cte *dest = get_next_copy(cte);
        remove_cap(dest);
    }

done:
    release_lock();
    return err;
}
\end{verbatim}

\textbf{Cross-core transferring a capability:} Holding the lock is
required to prevent conflicts between the above capability operations
and the cross-core transfer operation. The sender acquires the lock
and sends the capability. The receiver creates the capability and
releases the lock.

\begin{verbatim}
cap_send(struct cte *cte, coreid_t dest) {
    acquire_lock();
    if (cap_exists(cte) == false) {
        release_lock();
        return fail;
    }
    send_cap(cte, dest);
}

cap_receive(struct cte *cte) {
    create_cap(cte);
    release_lock();
    return_success_to_app();
}
\end{verbatim}

\subsubsection{Caching remote relations}
If remote relations are cached, then the cache has to be kept
consistent when capability operations change the state of the
relations. Below we describe the modifications to the above
pseudo-code that are required to keep the cache consistent. Note that
our cache can be seen as eager replication. As discussed in section
\ref{subsec:where}, we will be using the mark-all approach to maintain
the cache.

\textbf{Copying a capability:} When the new copy is created, its cache
of remote relations is set equal to the cache of the capability it was
copied from.

\textbf{Retyping a capability:} If the capability does not have remote
copies or descendants, the operation is purely local and the lock is
not required.
If the operation is not local, then the lock is acquired, the checks
for the existence of the capability and for descendants are made, the
capability is retyped, and the remote relations are updated based on
the rules presented in section \ref{subsec:cache:summary}.

\begin{verbatim}
cap_retype(struct cte *cte) {
    if (has_local_descendants(cte) == true) {
        return fail;
    }

    if (!has_remote_copies(cte) && !has_remote_descendants(cte)) {
        create_descendants(cte);
        return success;
    }

    if (has_remote_descendants(cte)) {
        return fail;
    }

    errval_t err = success;
    acquire_lock();
    if (!cap_exists(cte)) {
        err = fail;
        goto done;
    }

    if (has_remote_descendants(cte)) {
        err = fail;
        goto done;
    }
    if (has_local_descendants(cte)) {
        err = fail;
        goto done;
    }

    create_descendants(cte);
    update_relations(cte);

done:
    release_lock();
    return err;
}
\end{verbatim}

\textbf{Deleting a capability:} If the capability has local copies or
no remote relations, the operation is purely local. Otherwise, the
lock is required and the remote relations must be updated.

\begin{verbatim}
cap_delete(struct cte *cte) {
    remove_cap(cte);

    if (!requires_typed_ops(cte)) {
        return success;
    }
    if (has_local_copies(cte)) {
        return success;
    }

    if (!has_remote_relations(cte)) {
        type_specific_ops(cte);
        return success;
    }

    acquire_lock();
    if (!cap_exists(cte)) {
        goto done;
    }
    if (!has_copies(cte)) {
        type_specific_ops(cte);
    }
    update_remote_relations(cte);
done:
    release_lock();
    return success;
}
\end{verbatim}

\textbf{Revoking a capability:} If the capability has no remote
relations, the operation is purely local. Otherwise, the lock is
required.

\begin{verbatim}
cap_revoke(struct cte *cte) {
    if (!has_remote_relations(cte)) {
        while (has_descendants(cte)) {
            struct cte *dest = get_next_descendant(cte);
            remove_cap(dest);
        }
        while (has_copies(cte)) {
            struct cte *dest = get_next_copy(cte);
            remove_cap(dest);
        }
        return success;
    }

    errval_t err = success;
    acquire_lock();
    if (!cap_exists(cte)) {
        err = fail;
        goto done;
    }

    while (has_descendants(cte)) {
        struct cte *dest = get_next_descendant(cte);
        remove_cap(dest);
    }
    while (has_copies(cte)) {
        struct cte *dest = get_next_copy(cte);
        remove_cap(dest);
    }

done:
    release_lock();
    return err;
}
\end{verbatim}

\textbf{Cross-core transferring a capability:} The remote relations
cache on the local and remote capabilities is updated as presented in
section \ref{subsec:cache:summary}.

\subsubsection{Multiple locks}
\note{NYI}
If a single lock for the entire system is used, only one capability
operation can be performed at any given moment, limiting the potential
for parallelism. By using different locks for unrelated capabilities,
multiple capability operations can proceed in parallel. Using multiple
locks will increase the space requirements but will also improve
parallelism.

\note{Multiple locks are not straight-forward. Ancestors reference
  more memory than descendants do.}

\note{Can a copy lock for delete, a descendant lock for retype, and
  both for revoke work?}

\subsection{Total order broadcast}
\note{Use a single data structure for all pending cap operations.}

The required transactional and ordering guarantees can be provided by
ensuring that messages for related capabilities are delivered in the
same order on all cores.
Note that causal order delivery resolves the conflicts between the
retype, delete, and revoke operations and the cross-core transfer
operation, but not among the retype, delete, and revoke operations
themselves.

Below we present pseudo-code for how the capability operations can be
implemented when not caching information about remote relations.

\textbf{Copying a capability:} This operation is safe to perform
without any ordering requirements.

\textbf{Deleting a capability:} If the capability does not require
type-specific operations or has local copies, the operation succeeds
immediately. Otherwise, local state is created and a message is
broadcast to all cores.

\begin{verbatim}
cap_delete(struct cte *cte) {
    if (!requires_typed_ops(cte) || has_local_copies(cte)) {
        remove_cap(cte);
        return success;
    }

    struct delete_data *data = malloc(sizeof(struct delete_data));
    delete_data_initialize(data, cte);
    outstanding_deletes_list->add(data);
    send_delete_request_bcast(TOTAL_ORDER, data);
}
\end{verbatim}

If a core receives a delete request broadcast and it was the sender of
the message, it removes the capability and returns. If the core was
not the sender of the message, then it replies with the state of the
local copies of the specified capability.

\begin{verbatim}
delete_request_bcast(coreid_t from, struct delete_request *data) {
    if (from == my_core_id) {
        remove_cap(data->cte);
        return;
    }

    send_delete_reply(from, data, has_local_copies(data->cte));
}
\end{verbatim}

Finally, when a core receives a reply to a delete request: if a remote
copy is found, no type-specific cleanup is required and the operation
finishes. If all replies have been aggregated and no copies were
found, then the type-specific cleanups are performed and the operation
finishes.

\begin{verbatim}
delete_reply(coreid_t from, struct delete_request *data, bool has_copies) {
    struct delete_data *my_data = outstanding_deletes_list->get(data);
    if (!my_data) {
        return;
    }

    if (has_copies) {
        outstanding_deletes_list->remove(my_data);
        return;
    }

    increment_replies(my_data);
    if (seen_all_replies(my_data)) {
        type_specific_ops(my_data->cte);
        outstanding_deletes_list->remove(my_data);
    }
}
\end{verbatim}

Since the deleting cores remove the capability when they receive their
own broadcast, when multiple cores are trying to delete copies, the
last core's broadcast will see no copies while the other cores'
broadcasts will see copies.

\note{Malloc probably means unbounded memory requirement.}

\textbf{Retyping a capability:} If the capability has local
descendants, then the operation fails. Otherwise, a retype request
broadcast is sent.

\begin{verbatim}
cap_retype(struct cte *cte) {
    if (has_local_descendants(cte)) {
        return fail;
    }

    struct retype_data *data = malloc(sizeof(struct retype_data));
    retype_data_initialize(data, cte);
    outstanding_retypes_list->add(data);
    send_retype_request_bcast(TOTAL_ORDER, data);
}
\end{verbatim}

When a core receives a retype request broadcast and it was the sender
of the message, one of two things may have happened. Either the core
has already received a broadcast from another core trying to retype
the same capability, in which case its own operation has failed and
the state for the request has been removed, or the core can succeed in
retyping the capability and updates its state accordingly. If the core
was not the sender of the broadcast, it sends a reply to the sender
with the state of its local descendants. Then the core checks whether
it has an outstanding retype request for the capability. If it does,
and its own request has not been delivered yet, its retype fails.
An error is sent to the application and the state is garbage
collected.

\begin{verbatim}
retype_request_bcast(coreid_t from, struct retype_request *data) {
    if (from == my_core_id) {
        struct retype_data *my_data = outstanding_retypes_list->get(data);
        if (!my_data) {
            return;
        }
        my_data->success_flag = true;
        return;
    }

    send_retype_reply(from, data, has_local_descendants(data->cte));
    struct retype_data *my_data = outstanding_retypes_list->get(data);
    if (!my_data) {
        return;
    }

    if (!my_data->success_flag) {
        outstanding_retypes_list->remove(my_data);
        return_failure_to_app();
    }
}
\end{verbatim}

When a core receives a reply to the retype request broadcast, the
operation may already have failed, in which case the reply is
ignored. If the operation has not been canceled yet and the reply
indicates no descendants, the operation can still succeed. Once all
replies have been seen, the retype operation succeeds.

\begin{verbatim}
retype_reply(struct retype_request *data, bool has_descendants) {
    struct retype_data *my_data = outstanding_retypes_list->get(data);
    if (!my_data) {
        return;
    }

    if (has_descendants) {
        outstanding_retypes_list->remove(my_data);
        return_failure_to_app();
        return;
    }

    increment_replies(my_data);
    if (seen_all_replies(my_data)) {
        create_descendants(my_data->cte);
        outstanding_retypes_list->remove(my_data);
        return_success_to_app();
    }
}
\end{verbatim}

This resolves the conflicts between two retypes. The conflict between
retype and revoke is discussed below.

\note{Malloc probably means unbounded memory requirement.}

\textbf{Revoking a capability:} The core initializes some local state
and then broadcasts a revoke request to all cores in the system.

\begin{verbatim}
cap_revoke(struct cte *cte) {
    struct revoke_data *data = malloc(sizeof(struct revoke_data));
    revoke_data_initialize(data, cte);
    outstanding_revokes_list->add(data);
    send_revoke_request_bcast(TOTAL_ORDER, data);
}
\end{verbatim}

When a core receives a revoke request broadcast, if the core was
trying to retype a related capability, then it fails the retype
operation. If it is not trying to revoke the capability itself, it
simply revokes the capability locally and sends a reply. If the core
is trying to revoke the same capability but was not the sender of the
broadcast, then this core's revoke operation fails.

\begin{verbatim}
revoke_request_bcast(coreid_t from, struct revoke_request *data) {

    if (related_retype_in_progress(data->cte)) {
        outstanding_retypes_list->remove(data);
        return_failure_to_app();
    }

    struct revoke_data *my_data = outstanding_revokes_list->get(data);
    if (!my_data) {
        revoke_locally(data->cte);
        send_revoke_reply(from, data);
        return;
    }

    if (from != my_core_id && !my_data->success_flag) {
        outstanding_revokes_list->remove(my_data);
        send_revoke_reply(from, data);
        return_failure_to_app();
        return;
    }

    my_data->success_flag = true;
}
\end{verbatim}

When a core receives a reply to the broadcast, it aggregates the
replies until it has heard back from all cores in the system and then
reports success to the application.

\begin{verbatim}
revoke_reply(struct revoke_request *data) {
    struct revoke_data *my_data = outstanding_revokes_list->get(data);
    if (!my_data) {
        return;
    }

    increment_replies(my_data);
    if (seen_all_replies(my_data) && !my_data->sent_transfer_cap_delete_flag) {
        outstanding_revokes_list->remove(my_data);
        return_success_to_app();
    }
}
\end{verbatim}

\textbf{Cross-core transfer:} When sending a capability to another
core, the sending core creates local state and broadcasts the send
message to all cores in the system.

\begin{verbatim}
cap_transfer(struct cte *cte, coreid_t to) {
    struct transfer_data *data = malloc(sizeof(struct transfer_data));
    transfer_data_initialize(data, cte);
    outstanding_transfers_list->add(data);
    send_transfer_request_bcast(TOTAL_ORDER, data, to);
}
\end{verbatim}

When a core receives the capability transfer broadcast, it checks the
state of the capability operations currently in progress and takes
appropriate actions if they are related.

If a revoke operation is in progress that is trying to revoke a copy
or an ancestor of the capability being transferred, the broadcast
indicates that the system has more copies and descendants of the
capability being revoked, which must be deleted. If the receiver of
the broadcast is also the recipient of the capability, it does not
create it and returns an error to the sender of the capability;
otherwise, the receiver sends a message to the core to which the
capability is being transferred, requesting it to delete the
capability.

\begin{verbatim}
transfer_request_bcast(coreid_t from, struct transfer_data *data, coreid_t to) {
    struct cte *my_data;
    my_data = outstanding_revokes_list->get(data);
    if (my_data) {
        if (to == my_core_id) {
            send_transfer_reply(from, FAILURE_REVOKED, data);
            return;
        }

        my_data->sent_transfer_cap_delete_flag = true;
        send_delete_transferred_cap_request(to, data);
    }

    if (to != my_core_id) {
        return;
    }

    my_data = pending_delete_for_transferred_cap_list->get(data);
    if (my_data) {
        pending_delete_for_transferred_cap_list->remove(my_data);
        send_delete_transferred_cap_reply(my_data->from);
        send_transfer_reply(from, FAILURE_REVOKED, data);
        return;
    }

    cap_create(data->cte);
    send_transfer_reply(from, SUCCESS, data);
}
\end{verbatim}

When the sender of the capability gets a reply from the receiver, it
forwards the error code to the application and garbage collects its
state.

\begin{verbatim}
transfer_reply(errval_t err, struct transfer_request *data) {
    struct transfer_data *my_data = outstanding_transfers_list->get(data);
    return_err_to_app(my_data->app, err);
    outstanding_transfers_list->remove(my_data);
}
\end{verbatim}

If a revoke was in progress during the transfer of the capability, the
core revoking the capability sends a message to the receiver of the
transferred capability to delete it. When the receiver receives this
message, it may or may not have received the capability yet. If it has
already received the capability, it deletes it; otherwise, it
establishes some state so that the capability is deleted when it is
received later.

\begin{verbatim}
delete_transferred_cap_request(coreid_t from, struct data *data) {
    if (cap_exists(data->cte)) {
        remove_cap(data->cte);
        send_delete_transferred_cap_reply(from);
        return;
    }

    struct pending_delete_for_transferred_cap *my_data =
        malloc(sizeof(struct pending_delete_for_transferred_cap));
    pending_delete_for_transferred_cap_initialize(my_data, from, data);
    pending_delete_for_transferred_cap_list->add(my_data);
}
\end{verbatim}

\begin{verbatim}
delete_transferred_cap_reply(struct data *data) {
    struct revoke_data *my_data = outstanding_revokes_list->get(data);
    if (!my_data) {
        return;
    }

    my_data->sent_transfer_cap_delete_flag = false;
    if (seen_all_replies(my_data)) {
        outstanding_revokes_list->remove(my_data);
        return_success_to_app();
    }
}
\end{verbatim}

\note{Caching NYI}

\subsection{Two phase commit}
Use two-phase commit to synchronize operations that would otherwise
conflict. We do not discuss this solution in detail, as we incorporate
two-phase commit into the delete and revoke operations of the hybrid
solution discussed in section~\ref{sec:sol:hybrid}.

\subsection{Sequencer}
Use a sequencer, which orders whole operations. We do not discuss this
solution in depth, as the sequencer idea is incorporated into the
hybrid solution in section~\ref{sec:sol:hybrid}.

\subsection{Hybrid}
\label{sec:sol:hybrid}

We can, of course, combine facets of the previously discussed
approaches to build a hybrid that uses some (or all) of them.

The solution discussed in chapters two and three of Mark Nevill's
master's thesis~\cite{Nevill2012} is one such hybrid approach,
combining locking, per-capability sequencing, and two-phase commit.
Additionally, Mark's solution does not impose any additional
requirements on Barrelfish's message channels, and works with SSF
message semantics.

We only give a brief overview of the solution here and refer the
reader to Mark's thesis, which discusses the solution in depth, giving
invariants, pre- and post-conditions, and algorithm sketches for each
operation.

The key concept of this solution is that for each capability, of which
there can exist arbitrarily many copies, one core in the system is
chosen as the capability's owner, or sequencer.
All operations on any copy of a capability that need synchronization
have to be proxied through the capability's owner -- this is the
sequencer aspect of the solution.

We use locks to eliminate possible conflicts between operations, and
merge overlapping deletes and revokes to avoid deadlocks.

Delete and revoke employ a form of two-phase commit, which is
implemented as a \emph{mark} and \emph{sweep} algorithm.
Other operations treat capabilities that have been marked for deletion
as already deleted, which avoids many otherwise conflicting operation
sequences.

\subsubsection{Caching}
Caching for this solution is implemented using a bitfield with one bit
each indicating the presence of remote copies, descendants, and
ancestors, respectively.

\subsubsection{Multicast vs broadcast}
To implement two-phase commit, this solution could use multicast
messages to all cores that have remote copies, descendants, or
ancestors, or simply use broadcast and have cores that do not have any
copies reply with a success message in the mark phase.

\subsection{Comparison}
\note{TODO: Compare the approaches}

\subsection{Implementation}

Currently, the capability operations are implemented using the hybrid
technique outlined above.
The implementation uses broadcasts to implement 2PC because this way
the implementation does not have to keep and update a list of the
remote cores that have copies, descendants, or ancestors of each
capability.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Implementation details}\label{chap:implementation}

\begin{itemize}
\item Using THC
\item Using RPC between app and monitor
\item Using RPC between monitors
\item Everything having ancestors on memserv
\item Optimization: monitor caches root cnodes
\item Bootstrap
\item How to put enum objtype in interface file?
\item Two types of ifdefs: type of caching (none, list, or bits) and
  type of mgmt
\end{itemize}

\section{Performing capability operations}
If caching is not used, then the application knows for which
operations it must contact the monitor, and does so directly without
asking the kernel to try to perform the operation first.

If caching is used, then the application should try to perform the
operation via the kernel first. If the capability does have remote
relations, the kernel returns an appropriate error to the application,
in which case the application contacts the monitor.

\section{Sharing the mdb between kernel and monitor}
When the application wants the monitor to perform an operation for it,
it passes the monitor its root CNode and all the required parameters.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Not Yet Discussed}\label{chap:nyd}

Things that I know that I need to discuss.

\begin{itemize}

\item The OS does not guarantee which order the operations will be
  performed in. The user must enforce the ordering herself.

\item Partitioned approach with caching is eager replication. Consider
  lazy replication.
\end{itemize}

\bibliographystyle{plain}
\bibliography{defs,barrelfish}

\end{document}