1\documentclass[a4paper,11pt,twoside]{report} 2\usepackage{bftn} 3\usepackage{calc} 4\usepackage{hyperref} 5\usepackage{hyphenat} 6\usepackage{verbatim} 7\usepackage{xspace} 8\usepackage{pifont} 9\usepackage{pxfonts} 10\usepackage{textcomp} 11\usepackage{amsmath} 12\usepackage{listings} 13\usepackage[framemethod=default]{mdframed} 14\usepackage[shortlabels]{enumitem} 15\usepackage{parskip} 16\usepackage{xparse} 17\usepackage{natbib} 18\usepackage[all,cmtip]{xy} 19\usepackage{xcolor}% or package color 20 21\title{Device Queues in Barrelfish} 22\author{Barrelfish project} 23\tnnumber{26} 24\tnkey{DeviceQueues} 25 26\begin{document} 27 \maketitle % Uncomment for final draft 28 29 \begin{versionhistory} 30 \vhEntry{0.1}{24.10.2017}{RH}{Initial Version} 31 \end{versionhistory} 32 33 % \intro{Abstract} % Insert abstract here 34 % \intro{Acknowledgements} % Uncomment (if needed) for acknowledgements 35 \tableofcontents % Uncomment (if needed) for final draft 36 % \listoffigures % Uncomment (if needed) for final draft 37 % \listoftables % Uncomment (if needed) for final draft 38 \cleardoublepage 39 \setcounter{secnumdepth}{2} 40 41 \newcommand{\todo}[1]{[\textcolor{red}{\emph{#1}}]} 42 43 \newcommand{\virtio}{Virtio } 44 \newcommand{\devif}{Devif } 45 \newcommand{\ep}[1]{$Ep_{#1}$} 46 47\usepackage{color} 48\definecolor{bluekeywords}{rgb}{0.13,0.13,1} 49\definecolor{greencomments}{rgb}{0,0.5,0} 50\definecolor{redstrings}{rgb}{0.9,0,0} 51 52\lstdefinestyle{code} 53{language=C, 54 basicstyle=\tiny, 55 showspaces=false, 56 showtabs=false, 57 breaklines=true, 58 showstringspaces=false, 59 breakatwhitespace=true, 60 escapeinside={(*@}{@*)}, 61 commentstyle=\color{greencomments}, 62 keywordstyle=\color{bluekeywords}\bfseries, 63 stringstyle=\color{redstrings}, 64 basicstyle=\ttfamily, 65 morekeywords={uint8_t, uint16_t, uint32_t, uint64_t, size_t, bool},} 66 67 \mdfdefinestyle{defi}{% 68 linecolor=black,linewidth=2pt,% 69 frametitlerule=true,% 70 frametitlebackgroundcolor=gray!20, 71 innertopmargin=\topskip, 72 nobreak=true% prevent page breaks in the middle of mystyle 73 } 74 75 \mdfdefinestyle{inv}{% 76 linecolor=black,linewidth=2pt,% 77 topline=false, 78 leftline=false, 79 innertopmargin=\topskip, 80 } 81 82 \mdtheorem[style=defi]{defi}{Definition} 83 \mdtheorem[style=defi]{op}{Operation} 84 \mdtheorem[style=inv]{inv}{Invariant} 85 86 \chapter{Introduction} 87 In this document we describe a queue based interface that unifies communication for 88 devices like network cards and block devices but also between processes. 89 The interface should fit to as many devices as possible while still being efficient. 90 The goal is similar to \virtio \cite{virtio:sigops}, Portals \cite{portals} or MPI \cite{mpi}, 91 but we want to take a more formal 92 approach and define pieces that are ambiguous. These interface do not define a memory model 93 or what preconditions/postconditions an interface function has. 94 For example, what happens when a guest accesses a memory buffer that is handed off 95 to the host? These interfaces are clearly implementation driven where 96 we want to document the interface as clearly as possible, and see the implementation and 97 its code as two separate things. 98 99 \section{Terminology} 100 In this section, we explain the terms and the meaning of them as they are used in the following sections. 101 \begin{itemize} 102 103 \item \textbf{Region}: A Region is a chunk of memory that is registered to the \devif interface. From the memory of the region, buffers can be allocated. 104 \item \textbf{Buffer}: A buffer is a chunk of memory within a region. 105 \item \textbf{Endpoint}: An endpoint is a processes or devices. 106 \item \textbf{Ownership}: An endpoint can own a buffer and transfer ownership of a buffer to another endpoint or device. 107 If an endpoint owns a buffer, it can alter it. If an endpoint alters a buffer that is not owned, 108 the result is undefined and it is considered a bug. 109 \end{itemize} 110 111 \section{Functionality} 112 The basic functionality of our queue based device interface (from now on called \devif) 113 should be transferring ownership of buffers between two endpoints of a queue. 114 A buffer is a variable sized piece of memory within a previously to the \devif interface 115 register region of memory. We exclude managing the buffers themselves i.e. allocating 116 and deallocating buffers to keep the interface and the underlying protocol simple. 117 If we manage the variable sized buffers, we would have to implement a dynamic memory 118 allocator which increases the complexity of the \devif interface. 119 \\ 120 Another important aspect is the idea of "stacking" queues. In this manner each 121 queue on the stack can do an arbitrary transformation on the buffer that was enqueued 122 and hand it down to the queue lower in the stack. 123 124 \section{Memory Model} 125 When implementing a backend for the \devif interface, the underlying memory model 126 has to be considered. In certain cases, a write to a buffer might not have been written back 127 to memory before the buffer is processed by a device. 128 \\ 129 There are several memory models that are used in current hardware but none of them 130 are sequential consistent (SC). The memory models relax sequential consistency to allow 131 instruction reordering and other optimizations. Currently X86 and Sparc can be 132 modelled using \textit{Total Store Ordering} (TSO) ~\cite{swell:cacm2010}. TSO relaxes 133 SC so that local writes are visible locally before they are visible to all other 134 hardware threads (multi-copy atomic). In most cases this is because of a write buffer 135 that is introduced to buffer stores and the local request are satisfied by the contents 136 of the buffer. TSO is still a very strict memory model and only allows limited instruction 137 reordering. 138 \\ 139 Our aim with \devif is that the backends support the even more relaxed model of ARM 140 \cite{swell:cacm2016,arm:mm} (and IBM Power). In the memory model of ARM the processor can 141 reorder instructions very generously. Stores as well as loads and other instructions 142 (even atomic instructions) can be reordered. The goal of \devif is to support the weakest memory 143 model so it runs on the most common hardware, but can still increase the strictness 144 of the memory model to improve performance. 145 146 147 \chapter{Function Definitions and Semantics} 148 In this section we describe the functions of the \devif interface in detail. 149 Not only do we define the functions itself, 150 but the give additional information to the semantics. We use the term 151 \textit{undefined behaviour} for calls on the interface that we consider 152 bugs and that must not happen. The creation and destruction of queues 153 are device specific and are not part of the interface itself. 154 155 \section{Registering Region} 156 Adds a region of memory to the active set of this queue. 157 The queue has to be properly initialized beforehand. 158 The memory region has to be owned by the endpoint trying to register the region. 159 If the region is not owned by the endpoint, the behaviour is undefined. 160 If a region is added that is already registered or overlaps with another region, an error must be returned. 161 The returned region id must be unique for this queue and larger than or equal to 0. After the function returns, 162 buffers from the just registered region can be enqueued and the ownership 163 can be transferred. 164 \begin{figure}[h] 165 \begin{lstlisting}[style=code] 166 errval_t devq_register(struct devq *q, 167 struct capref cap, 168 regionid_t* region_id); 169 \end{lstlisting} 170 \label{lst:register} 171 \end{figure} 172 173 \subsection*{Arguments} 174 \begin{itemize} 175 \item \texttt{devq *q}: handle to the device queue. 176 \item \texttt{struct capref cap}: the capability representing the memory region to register. 177 \item \texttt{regionid\_t* region}: return pointer to the region id of the newly registered region. 178 \end{itemize} 179 \subsection*{Preconditions} 180 \begin{itemize} 181 \item The device queue is initialized 182 \item The memory has to be allocated 183 \item Memory of the region must be owned by endpoint 184 \item The region must not be currently registered 185 \item Region must not overlap with other already registered regions 186 \end{itemize} 187 \subsection*{Postconditions} 188 On success, the following conditions on the returned value hold 189 \begin{itemize} 190 \item \texttt{region\_id} must be unique for this queue and larger or equal to 0 191 \end{itemize} 192 193 \subsection*{Reasons for Failure} 194 \begin{itemize} 195 \item Register function of backend fails. 196 \item Region to register is already registered. 197 \end{itemize} 198 199 \section{Deregister Region} 200 Removes a memory region from the registered regions of a queue. 201 If a region is deregisterd that was not registered before, an error is returned. 202 To deregister a region, every buffer of the region i.e. the whole region has to be 203 owned by the endpoint making the call to the interface. If a region is deregistered and the 204 region is not fully owned by the endpoint, the behaviour is undefined. 205 \begin{figure}[h] 206 \begin{lstlisting}[style=code] 207 errval_t devq_deregister(struct devq *q, 208 regionid_t region_id, 209 struct capref* cap); 210 \end{lstlisting} 211 \label{lst:deregister} 212 \end{figure} 213 214 \subsection*{Arguments} 215 \begin{itemize} 216 \item \texttt{devq *q}: handle to the device queue 217 \item \texttt{regionid\_t region\_id}: the id of the region to deregister 218 \item \texttt{struct capref cap}: return pointer to the cap of the deregistered region 219 \end{itemize} 220 221 \subsection*{Preconditions} 222 \begin{itemize} 223 \item The device queue is initialized 224 \item Region must currently be registered (i.e. valid region id that is currently registered) 225 \end{itemize} 226 \subsubsection*{Postconditions} 227 On success, the following conditions on the returned value hold 228 \begin{itemize} 229 \item \texttt{cap} must not be NULL 230 \item \texttt{cap} is a capability referencing a memory region that was once registered. 231 \end{itemize} 232 233 \subsection*{Reasons for Failure} 234 \begin{itemize} 235 \item Backend deregister function fails. 236 \item Region was not registered beforehand. 237 \end{itemize} 238 239 \section{Enqueue} 240 Enqueues a buffer of a region for ownership transfer. The buffers offset into the memory region 241 has to be within the preregistered memory region matching the region id, otherwise an error is returned. 242 The region id provided has to be valid i.e. larger or equal than 0 and is already registered. 243 The length of the buffer must be large than 0 and must not 244 exceed the region size minus the offset of the start address of the buffer within the 245 region. The valid data offset has to be within the buffer and its length may not exceed 246 the buffers length. 247 The buffer has to be currently owned by the client i.e. a buffer 248 can not be enqueued twice without dequeuing it beforehand, otherwise the behaviour is undefined. 249 Enqueueing a buffer does not directly transfer the ownership, but the client 250 enqueueing the buffer has given up ownership on the buffer. Eventually the ownership 251 of the buffer will be transferred but there is no guarantee when this happens. 252 All the changes to the buffer have to be written back to memory and not only 253 reside in the cache. Altering a buffer that a client has no ownership over, 254 will result in undefined behaviour. Multiple buffers can be chained by using 255 the argument \texttt{misc\_flags}. When chaining multiple buffers, the last buffer 256 of the chain must have the \texttt{misc\_flags} set to \texttt{DEVQ\_FLAG\_LAST}. 257 258 259 \begin{figure}[h] 260 \begin{lstlisting}[style=code] 261 errval_t devq_enqueue(struct devq *q, 262 regionid_t region_id, 263 genoffset_t offset, 264 genoffset_t length, 265 genoffset_t valid_data, 266 genoffset_t valid_length, 267 uint64_t misc_flags); 268 \end{lstlisting} 269 \label{lst:enqueue} 270 \end{figure} 271 272 \subsection*{Arguments} 273 \begin{itemize} 274 \item \texttt{devq *q}: handle to the device queue. 275 \item \texttt{regionid\_t region\_id}: the id of a memory region the enqueued buffer belongs to. 276 \item \texttt{genoffset\_t offset}: the offset within the memory region where the buffers starts. 277 \item \texttt{genoffset\_t length}: the length of the enqueued buffer. 278 \item \texttt{genoffset\_t valid\_data}: the offset within the buffer where the valid data starts. 279 \item \texttt{genoffset\_t valid\_length}: the length of the valid data within the buffer. 280 \item \texttt{uint64\_t flags}: flags of the buffer. 281 \end{itemize} 282 283 \subsection*{Preconditions} 284 \begin{itemize} 285 \item The device queue is initialized 286 \item The region id must match with an already registered region 287 \item The buffer must be owned by the client of the interface 288 \item The offset must be within the bounds of a memory region 289 \item The length must not exceed the region size minus the offset of the buffer within the region 290 \item The valid\_data offset must be within the buffers bounds 291 \item The valid\_length must not exceed the length minus the valid\_data offset 292 \item Changes to the buffer are written back to memory 293 \end{itemize} 294 295 \subsection*{Postconditions} 296 After a successful enqueue, the following conditions hold 297 \begin{itemize} 298 \item - 299 \end{itemize} 300 301 \subsection*{Reasons for Failure} 302 \begin{itemize} 303 \item Backend enqueue function fails (e.g. when the queue is full) 304 \item Region id is not valid. 305 \item Bounds check for buffer fails 306 \item Bounds check for valid data fails 307 \end{itemize} 308 309 \section{Dequeue} 310 Dequeues a buffer from the queue. After a buffer is dequeued, the client 311 takes ownership of the buffer. As long as the client owns a buffer, the 312 client can alter the contents of this buffer. Dequeue can be called any time, 313 and even when a notification is received there is no guarantee that the queue contains any buffers 314 to receive. If there is nothing to dequeue, the call will return an error. When an endpoint 315 dequeues a buffer, it has to invalidate its cache of the received buffer when 316 the weaker memory model is assumed (ARM, IBM Power). The dequeued values 317 have to represent a valid buffer as well as point to valid data. If nothing 318 is known about the validity of the data with the buffer, the whole buffer 319 is considered as valid data. 320 321 \begin{figure}[h] 322 \begin{lstlisting}[style=code] 323 errval_t devq_dequeue(struct devq *q, 324 regionid_t* region_id, 325 genoffset_t* offset, 326 genoffset_t* length, 327 genoffset_t* valid_data, 328 genoffset_t* valid_length, 329 uint64_t* flags); 330 \end{lstlisting} 331 \label{lst:dequeue} 332 \end{figure} 333 \subsection*{Arguments} 334 \begin{itemize} 335 \item \texttt{devq *q}: handle to the device queue. 336 \item \texttt{regionid\_t* region\_id}: return pointer to the region id of the dequeued buffer. 337 \item \texttt{genoffset\_t* offset}: return pointer to the offset within the memory region where the buffers starts. 338 \item \texttt{genoffset\_t* length}: return pointer to the length of the enqueued buffer. 339 \item \texttt{genoffset\_t* valid\_data}: return pointer to the offset within the buffer where the valid data starts. 340 \item \texttt{genoffset\_t* valid\_length}: return pointer to the length of the valid data within the buffer. 341 \item \texttt{uint64\_t flags}: flags of the buffer. 342 \end{itemize} 343 \subsection*{Preconditions} 344 \begin{itemize} 345 \item The device queue is initialized 346 \end{itemize} 347 \subsection*{Postconditions} 348 The returned pointer have to contain valid information about a buffer. 349 After a successful dequeue, the following conditions on the returned values hold 350 \begin{itemize} 351 \item \texttt{region\_id} must be larger or equal to 0 352 \item \texttt{offset} is not equal to \textit{NULL} 353 \item \texttt{length} is not equal to \textit{NULL} 354 \item \texttt{valid\_data} is not equal to \textit{NULL} 355 \item \texttt{valid\_length} is not equal to \textit{NULL} 356 \item \textit{offset} does not exceed region length 357 \item \textit{length} + \textit{offset} do not exceed region length 358 \item \textit{valid\_data} does not exceed the buffer size 359 \item \textit{valid\_length} does not exceed buffer size minus valid data offset 360 \end{itemize} 361 362 \subsection*{Reasons for Failure} 363 \begin{itemize} 364 \item Backend dequeue function fails (e.g. when the queue is empty) 365 \item Buffer returned by the backend is not valid. 366 \end{itemize} 367 368 \section{Notify} 369 Notify informs the client on the other side of the queue that there might be buffers 370 buffers in the queue that are ready for processing. There is no guarantee that there is 371 actually a buffer in the queue. When a buffer is enqueued, there 372 is no guarantee to when the buffer is processed. Notify ensures, that 373 all the buffers that are enqueued to this point will eventually processed and the ownership 374 is transferred. Notify is a performance optimization mechanism and not strictly necessary. 375 376 \begin{figure}[h] 377 \begin{lstlisting}[style=code] 378 errval_t devq_notify(struct devq *q); 379 \end{lstlisting} 380 \label{lst:notify} 381 \end{figure} 382 383 \subsection*{Arguments} 384 \begin{itemize} 385 \item \texttt{devq *q}: handle to the device queue. 386 \end{itemize} 387 \subsection*{Preconditions} 388 \begin{itemize} 389 \item The device queue is initialized 390 \end{itemize} 391 \subsection*{Postconditions} 392 -- 393 \subsection*{Reasons for Failure} 394 \begin{itemize} 395 \item Backend notify function fails 396 \end{itemize} 397 398 \chapter{Formal Model} 399 In this section we model the system as a transfer of ownership of sets of addresses. We model it as a 400 transition system by first defining the agents, the structures, 401 and the operations used to do the transfer. We do not model notifications since they are strictly optional 402 and mainly a performance optimization. We abstract the buffers 403 of the ownership transfer protocol as a simple set of addresses. Consequently, we do not need to 404 define what a buffer is. This means in our model the smallest unit resource we transfer ownership 405 of is a single memory address. 406 \\ 407 \begin{defi}[Memory Address] 408 A \textbf{Memory Address} \textit{A} is an identifier (or a name) that abstracts 409 an addressable byte of a machine. 410 \end{defi} 411 412 To go any further, we have to define on what sets an agent can operate on. In the following 413 if we write about a set, this means a set of addresses i.e. if we mention set $S$ it 414 is defined as $S = \{A_0, ..., A_n\}$. 415 In our transition system we only have two agents $X$ and $Y$. 416 \\ 417 \begin{defi}[Agent State] 418 Our model consists of two Agents $X$ and $Y$. 419 The sets an Agent can operate on are $S_x$ for agent $X$ and $S_y$ for agent $Y$ and 420 each of these sets consists of memory addresses. 421 \end{defi} 422 423 The ownership in our model is transferred between the two agents $X$ and $Y$. 424 If an agent takes ownership of an address, this means that the agent can operate 425 on these addresses. For example, it can read the 426 contents of the memory address or write to the memory address. 427 \\ 428 \begin{defi}[Ownership] 429 Agent $X$ and $Y$ can take \textbf{ownership} of an address $A$ by adding the memory address to the set $S_x$ or $S_y$ respectively. 430 \[ \text{ownership}(S, A) \Leftrightarrow A \in S \] 431 where $S$ is either the state of agent $X$ or $Y$ and $A$ is any address that can be used in the transition system. 432 \end{defi} 433 434 To prevent two agents owning the same address, we define the following invariant that 435 has to hold for all operations on the transition system. 436 \\ 437 \begin{inv} 438 At any point in time an address $A$ can only be owned by one Agent at a time 439 \[ S_x \cap S_y = \emptyset \] 440 \end{inv} 441 442 With the current definitions, we could model the transfer of ownership by simply removing from one 443 set and adding it to the other ($S_x$ to $S_y$ and vice versa) but this model would not 444 entail correctly how the transfer of an address happens. An address that is enqueued might 445 not be immediately dequeued by the other agent which means that we require another mechanism: 446 the queue itself. In essence, we need to store 447 the addresses that are currently in transfer in another set associated with a queue. 448 \\ 449 \begin{defi}[Queue] 450 A bidirectional queue $Q$ between the two agents $X$ and $Y$ consists of two sets of addresses called 451 $T_{xy}$ and $T_{yx}$ where $T_{xy}$ is the set of addresses that is in transfer from agent $X$ 452 to agent $Y$ and $T_{yx}$ vice versa. 453 \[Q = (T_{xy}, T_{yx})\] 454 \end{defi} 455 456 Having defined basic agents and structures of our model, but we are missing operations 457 on the transition system to actually transfer the ownership of addresses. To transfer 458 ownership over a queue from agent $X$ to $Y$ and vice versa, we define 459 the operations \textit{enqueue} and \textit{dequeue}. 460 \\ 461 \begin{op} [Enqueue] 462 Enqueue: initiates a transfer of ownership of a set of addresses $B$ from agent $X$ to $Y$ and vice versa. 463 Enqueue removes $B$ from either agent $X$ or $Y$'s state ($S_x$ and $S_y$) and adds it to the 464 queue $Q$ ($Q.T_{xy}$ in case of agent $X$ and $Q.T_{yx}$ in case of agent $Y$). Enqueue requires a 465 set of addresses to transfer $B$. The function enqueue from agent $X$ is defined as 466 \begin{align*} enqueue_{x}(B) \quad \quad S_x := S_x - B \\ 467 T_{xy} := T_{xy} \cup B 468 \end{align*} 469 and for Agent $Y$ 470 \begin{align*} enqueue_{y}(B) \quad \quad S_y := S_y - B \\ 471 T_{yx} := T_{yx} \cup B 472 \end{align*} 473 \end{op} 474 \\ 475 \\ 476 \begin{op}[Dequeue] 477 Dequeue: completes a transfer of ownership of a set of addresses $B$ from agent $X$ to $Y$ 478 and vice versa. Dequeue removes $B$ from the queue state $Q$ ($Q.T_{xy}$ in case of $Y$ 479 and $Q.T_{yx}$ in case of $X$) and adds it to the agents state $S$ ($S_x$ in case of $X$ and 480 $S_y$ in case of $Y$). Dequeue requires the set of addresses to transfer $B$, the agents state 481 $S$ and the queue state $Q$. The function enqueue from agent $X$ is defined as 482 \begin{align*} \text{dequeue}_{x}(B) \quad \quad T_{yx} := T_{yx} - B \\ 483 S_x := S_x \cup B 484 \end{align*} 485 and for Agent $Y$ 486 \begin{align*} \text{dequeue}_{y}(B) \quad \quad T_{xy} := T_{xy} - B \\ 487 S_y := S_y \cup B 488 \end{align*} 489 \end{op} 490 \\ 491 \\ 492 Further to add and remove Addresses from the closed system, we defined \textit{register} and \textit{deregister}. 493 \begin{op} [Register] 494 Register: adds a set of addresses $R$ that are not yet in the system to the agent state $S$. 495 To be more precise $R$ is defined as \[ R \subseteq (S_x \cup S_y \cup Q.T_{xy} \cup Q.T_{yx})^c \] 496 497 The addresses in the system can be used for the transfer of ownership to a second agent. 498 The function takes the set of addresses $R$ as an argument 499 500 The behaviour of the function \textit{register} is shown below 501 $$register_{x}(R) \quad \quad S_x := S_x \cup R $$ 502 $$register_{y}(R) \quad \quad S_y := S_y \cup R $$ 503 \end{op} 504 \\ 505 \vspace{6mm} 506 \\ 507 \begin{op} [Deregister] 508 Deregister: removes a set of memory addresses $R$ from an agents 509 state $S$. After deregistering the addresses of $R$ can no longer be transfer on 510 the queue. The function takes the set of addresses $R$. The addresses 511 of $R$ will be no longer in the system i.e. 512 \[R \in (S_x \cup S_y \cup Q.T_{xy} \cup Q.T_{yx})^c \] 513 514 The behaviour of the function \textit{deregister} is shown below 515 $$deregister_{x}(R) \quad \quad S_x := S_x - R $$ 516 $$deregister_{y}(R) \quad \quad S_y := S_y - R $$ 517 \end{op} 518 \clearpage 519 The transition system that we defined up to now can be seen in the picture below. 520 521\begin{figure}[!h] 522 \begin{center} 523 \caption{Buffer transfer protocol sets and the operations} 524 \centerline{\xymatrix{ 525 &&(S_x \cup S_y \cup T_{xy} \cup T_{yx})^c 526 \ar@{<->}[rrddd]^{register_y}_{deregister_y} 527 \ar@{<->}[llddd]^{deregister_x}_{register_x} 528 \\ 529 \\ 530 &&T_{xy} \ar@{->}[rrd]^{dequeue_y} \\ 531 S_x \ar@{->}[rru]^{enqueue_x} 532 &&&&S_y \ar@{->}[lld]^{enqueue_y}\\ 533 &&T_{yx} \ar@{->}[llu]^{dequeue_x} 534 }} 535 \end{center} 536\end{figure} 537 538 The invariant we described before, only captures part of the constraint that we require so we extend it to 539 also address the sets $T_{xy}$ and $T_{yx}$. 540 541 \begin{inv} 542 At any point in time an address $A$ can only be in one set at the tame 543 \[ S_x \cap S_y \cap T_{xy} \cap T_{yx} = \emptyset \] 544 \end{inv} 545 546 We can show that this holds for all the operations we defined. We omit the proof in this document. 547 The formal specification should make the main concepts of the \devif Interface more clear and 548 remove more ambiguities of the function specifications and semantics of the previous chapter. 549 550 551\chapter{Implementation} 552 553The building blocks of the \devif interface are the different library backends for the devices 554and a small generic library implementing the bookkeeping of region and buffer ids. The backend 555libraries implement a few well defined functions that are installed as function pointers 556for the generic library during the creation of a queue. Creating and destroying a queue 557is device specific and is not part of the interface between library backends and the 558generic library. A high level overview of how to use the \devif interface is shown in Figure 559\ref{fig:overview}. The creation of a queue yields a handle that contains all the 560state for a device queue and as a first member, the general struct for the \devif 561queue state. An example of the data structures for a device and the \texttt{devq struct} 562is shown in Figures \ref{lst:sfn5122f}, \ref{lst:devq}, \ref{lst:func_p}. 563 564 565\begin{figure}[h] 566 \centering 567 \caption{High level overview} 568 \includegraphics[width=\textwidth]{pics/devif.pdf} 569 \label{fig:overview} 570\end{figure} 571 572\begin{figure}[h] 573 \caption{Device Specific Queue Struct} 574 \begin{lstlisting}[style=code] 575 struct sfn5122f_queue { 576 struct devq q; 577 union { 578 sfn5122f_q_tx_user_desc_array_t* user; 579 sfn5122f_q_tx_ker_desc_array_t* ker; 580 } tx_ring; 581 struct devq_buf* tx_bufs; 582 uint16_t tx_head; 583 uint16_t tx_tail; 584 size_t tx_size; 585 ... 586 }; 587 \end{lstlisting} 588 \label{lst:sfn5122f} 589\end{figure} 590 591\begin{figure}[h] 592 \caption{General \devif struct} 593 \begin{lstlisting}[style=code] 594 struct devq { 595 // Region management 596 struct region_pool* pool; 597 598 // Funciton pointers 599 struct devq_func_pointer f; 600 ... 601 }; 602 \end{lstlisting} 603 \label{lst:devq} 604\end{figure} 605 606\begin{figure}[H] 607 \caption{Backend Function Pointers} 608 \begin{lstlisting}[style=code] 609 typedef uint64_t genoffset_t; 610 errval_t (*devq_notify_t) (struct devq *q); 611 612 errval_t (*devq_register_t)(struct devq *q, 613 struct capref cap, 614 regionid_t region_id); 615 616 errval_t (*devq_deregister_t)(struct devq *q, 617 regionid_t region_id); 618 619 errval_t (*devq_control_t)(struct devq *q, 620 uint64_t request, 621 uint64_t value 622 uint64_t result*); 623 624 errval_t (*devq_enqueue_t)(struct devq *q, 625 regionid_t region_id, 626 genoffset_t offset, 627 genoffset_t length, 628 genoffset_t valid_data, 629 genoffset_t valid_offset, 630 uint64_t misc_flags); 631 632 errval_t (*devq_dequeue_t)(struct devq *q, 633 regionid_t* region_id, 634 genoffset_t* offset, 635 genoffset_t* length, 636 genoffset_t* valid_data, 637 genoffset_t* valid_offset, 638 uint64_t* misc_flags); 639 640 struct devq_func_pointer { 641 devq_register_t reg; 642 devq_deregister_t dereg; 643 devq_control_t ctrl; 644 devq_notify_t notify; 645 devq_enqueue_t enq; 646 devq_dequeue_t deq; 647 }; 648 \end{lstlisting} 649 \label{lst:func_p} 650\end{figure} 651 652 653The example shown in Figure \ref{lst:sfn5122f} is the device specific struct for a Solarflare NIC queue. 654The struct contains as a first member the \texttt{devq} struct (Figure \ref{lst:devq}) so it can be 655cast to a \texttt{devq} struct. The rest of the struct contains the pointers to the descriptor rings (and 656the state associated with a descriptor for the \devif interface). Further the bindings for the 657communication channel to the card driver. In Figure \ref{lst:func_p} the functions which have 658to be implemented by a library backend are shown. In certain cases some of these functions 659can be a NOP, but the function pointers still have to be set. 660 661\section{Implementation Debug Backend} 662This is a debugging interface for \devif interface that 663can be used with any existing queue. It can be stacked on top 664to check for non valid buffer enqueues/deqeues that might happen and 665that lead to undefined behaviour. With other queues, the undefined behaviour 666might go unnoticed where the debug queue certainly returns an error. 667An example of a not valid enqueue of a buffer is when the endpoint that enqueues 668the buffer does not own the buffer. 669\\ 670We keep track of the owned buffers as a list of regions which each 671contains a list of memory chunks. Each chunk specifies a offset within 672the region and its length. When a region is registered, we add one memory chunk that describes 673the whole region i.e. offset=0 length = length of region 674\\ 675If a buffer is enqueued, it has to be contained in one of these 676memory chunks (otherwise the endpoint does not own the buffer). 677The memory chunk is then altered according how the buffer is contained in the chunk. 678If it is at the beginning or end of the chunk, the offset/length of the chunk is changed accordingly 679If the buffer is in the middle of the chunk, we split the memory chunk 680into two new memory chunks that do not contain the buffer. Simply put, the list contains the 681parts of the region which this endpoint owns. 682\\ 683If a buffer is dequeued the buffer is added to the existing memory 684chunks if possible, otherwise a new memory chunk is added to the 685list of chunks. If a buffer is dequeued that is in between two 686memory chunks, the memory chunks are merged to one big chunk. 687We might fail to find the region id in our list of regions. In this 688case we add the region with the deqeued offset+length as a size. 689We can be sure that this region exists since the devq library itself 690does these checks if the region is known to the endpoint. The debugging queue 691on top of the other queue does not have allways have a consistent view of the 692registered regions (but the \devif library part does) 693\\ 694When a region is deregistered, the list of chunks has to only 695contain a single chunk that descirbes the whole region. Otherwise 696the call will fail since some of the buffers are still in use and are not 697owned by the endpoint. 698\\ 699Additionally, we added two calls to help debug errors that arise. The function signatures 700are shown below in Listing \ref{lst:debug} 701 702\begin{figure}[h] 703 \caption{Additional functions of the Debug Queue} 704 \centering 705 \begin{lstlisting}[style=code] 706 errval_t debug_dump_region(struct debug_q* que, 707 regionid_t rid); 708 709 void debug_dump_history(struct debug_q* q); 710 \end{lstlisting} 711 \label{lst:debug} 712\end{figure} 713With these two functions, the history of operations (to a certain limit) 714and the list of currently owned memory chunks can be printed on the console. 715 716\section{Implementation Solarflare Backend} 717 718The solaflare backend mainly deals with getting the resources to access hardware register and 719then handling the receive, transmit, and event queue of a VNIC. On the create call, the resources for a VNIC are allocated. 720A VNIC (Not to be confused with Virtual Functions) has three queues receive, transmit, and event queue that are represented 721in software by ringbuffers. After allocating the memory for the queues, a communication 722channel to the driver is set up and the information to set up the queue on the card is 723propagated to the driver by an RPC call. The RPC returns a capability to the hardware registers 724of the queue and a queue id. With the received cap, 725the registers of the queue can be mapped into the vspace of the current running program and as 726of that moment, the user-space program can access a hardware queue directly. The signature of the 727create call is shown in Listing \ref{lst:sfn5122f_create} 728\begin{figure}[h] 729 \caption{Additional functions of the Debug Queue} 730 \centering 731 \begin{lstlisting}[style=code] 732 errval_t sfn5122f_queue_create(struct sfn5122f_queue** q, 733 sfn5122f_event_cb_t cb, 734 bool userspace, 735 bool interrupts, 736 bool default_q); 737 \end{lstlisting} 738 \label{lst:debug} 739\end{figure} 740The event callback in the signature is called when the queue receives an interrupt. The different 741booleans are for enabling/disabling the userspace networking feature and interrupts. Further 742the default queue where all the unmatched packets (i.e. not matched by any hardware filtering) 743are received can be requested, otherwise one of the 1024 VNICs will be used. 744\\ 745When a VNIC is created, it can be instantiated with user-level networking enabled or disabled. 746If user-level networking is enabled, mappings from of buffers that can be used to send/receive 747data have to be installed on the card. When register is called, it executes an RPC to 748the card driver which adds the required mappings to a hardware register table 749(shown in Figure \ref{fig:translation}). The mappings are removed by a deregister call. Each of these 750entries represents a 4k page that can be used as buffers. \textbf{Note: the number 751table entries is limited ($\approx$140000) so registering big amount of memory with 752the solarflare card might lead to problems.} To send data using such a buffer, we need to know 753for a region id the corresponding first entry of the buffer table. Since the buffer id is the 754offset within the region, we can directly compute the buffer table entry if we store the index 755of the first entry of the region. Further with the buffer id being the offset in the region, we 756can compute offsets within a 4k buffer entry (supported by the NIC). If a buffer crosses 757a 4k buffer entry boundary, the packet has to be fragmented into two descriptors in the ringbuffers. 758\begin{figure}[h] 759 \caption{Translation from region id + buffer id to buffer table entry of the solarflare card} 760 \includegraphics[width=\textwidth]{pics/sfn5122f.pdf} 761 \label{fig:translation} 762\end{figure} 763 764After the setup, we can directly access the hardware registers for managing the VNIC. If we call 765\texttt{devq\_enqueue()} now, the region and buffer id are translated to an buffer table 766entry and an offset which can be used to build a descriptor. The \texttt{flags} of the 767\texttt{devq\_enqueue()} function are used to define if the buffer we enqueue is a receive 768or send buffer. Additionally to the descriptor, we also write the buffer id, region id and 769other information alongside the descriptor so a second translation of 770buffer table index to region and buffer id is not necessary on \texttt{devq\_dequeue()}. 771Since both \texttt{devq\_enqueue} and \texttt{devq\_dequeue} directly access hardware 772registers, \texttt{devq\_notify} is a NOP. Here we use the semantics, that enqueing/dequeueing 773on a queue is possible before receiving a notify. 774 775\section{Implementation IDC Backend} 776The IDC backend facilitates communication between two processes on either the same core 777or two different cores. The communication channel is based on Flounder interfaces. To set 778up such a channel, one of the endpoints exports the interface while the other simply connects 779to the endpoint that exports functionality. The flounder interfaces are shown in Figure \ref{lst:descq}. 780\begin{figure}[h!] 781 \caption{Flounder Interface} 782 \begin{lstlisting}[style=code] 783 interface descq "Devif communication queue" { 784 // create and destroy a queue 785 rpc create_queue(in uint32 slots, 786 in cap rx, 787 in cap tx, 788 in bool notifications, 789 out errval err); 790 791 rpc destroy_queue(out errval err); 792 793 // add a memory region to the buffer table 794 rpc register_region(in cap cap, 795 in uint32 rid, 796 out errval err); 797 798 rpc deregister_region(in uint32 rid, 799 out errval err); 800 801 rpc control(in uint64 cmd, 802 in uint64 value, 803 out errval err); 804 805 }; 806 \end{lstlisting} 807 \label{lst:descq} 808\end{figure} 809On the remote side, the functions pointers that are given during creation are called. 810Using these function pointers, two queues can be connected (e.g. solaflare queue and 811IDC queue to support software filtering). If the endpoint simply 812connects to a remote endpoint, the local endpoint sets up memory for the receive/send 813queue that reside in shared memory between the two endpoints. 814The shared memory is established by the \texttt{create\_queue} 815RPC call. There are no flounder RPC calls for enqueue and dequeue as they are handled through either 816a notification sent to the other endpoint or the remote endpoint can simply 817poll the queue by trying to dequeue a descriptor from the queue. 818 819\chapter{Interface Usage} 820The usage in both cases, if the datapath is in a single domain or across domains is the same. 821In the case of a single domain, the notify can be dropped since it is a NOP. 822In a single domain the control plane, i.e. setting up the queue and registering memory 823regions to the interface, can involve another process but not on the data plane. 824In the following we make a small example of how to use the queue interface and 825a queue backend. 826\begin{lstlisting}[style=code] 827errval_t err; 828struct queue* q; 829struct descq* queue; 830struct capref memory; 831regionid_t regid; 832 833// Create queue, specific to the queue backend. 834// In this case simple channel between two processes 835err = descq_create(&queue, notify_cb, ...); 836if (err_is_fail(err)){ 837 // error handling 838} 839 840// Cast to queue interface struct. Used for methods 841// that are not 842q = (struct queue*) queue; 843 844// Allocate memory (returns a capability to the memory) 845err = frame_alloc(&memory, MEMORY_SIZE, NULL); 846if (err_is_fail(err)){ 847 //error handling 848} 849 850// Register the memory we are going to use 851err = queue_register(q, memory, ®id); 852if (err_is_fail(err)){ 853 // error handling 854} 855 856\end{lstlisting} 857At the point of the queue creation, we still have to know what the queue represents. 858In this case we create a queue to connect two processes. Following the creation, we cast 859the device specific to a more general queue. Now, we can use the queue interface to register 860the memory from which we want to carve out our variable sized buffers. At this stage, the datapath 861is set up and buffers can be transferred. 862\begin{lstlisting}[style=code] 863// Modify buffers in some way 864// ... 865 866// Enqueue buffers (2KB size for now, but can be variable size) 867for (int i = 0; i < MEMORY_SIZE/2048; i++){ 868 err = queue_enqueue(q, regid_rx, i*2048, 2048, 0, 2048, 0); 869 if (err_is_fail(err)){ 870 // Can do a notify if the queue is full 871 872 if (err == QUEUE_ERR_QUEUE_FULL) { 873 err = queue_notify(q); 874 if (err_is_fail(err)) { 875 // error handling 876 } 877 // there was an error i.e. retry enqueueing buffer 878 i--; 879 } else { 880 // there was an error i.e. retry enqueueing buffer 881 i--; 882 } 883 } 884} 885\end{lstlisting} 886In the example above, fixed sized buffers are enqueued and transferred to another process. 887The process on the other end of the queue can dequeue the buffers at any time but the queue 888itself has a size limit (as many most other queues). When the other process does not dequeue 889from the queue, at some point the queue will be full. As an optimization, the function \texttt{notify} 890sends a message to the other process to inform it that there might be buffers in the queue. 891\\ 892The teardown of the queue is using the function \texttt{deregister} and \texttt{destroy} 893\begin{lstlisting}[style=code] 894// Make sure this process owns all the memory of the region 895// ... 896 897// Remove region from the active set 898err = queue_deregister(q, regid, &memory); 899if (err_is_fail(err)){ 900 // error handling 901} 902 903// Cleanup resources of the queue 904err = queue_destroy(q); 905if (err_is_fail(err)){ 906 // error handling 907} 908\end{lstlisting} 909Before a region can be deregistered, the process has to make sure that all the memory of the region 910is owned by the process. If the process does not fully own the region, \texttt{deregister} 911returns an error. Similar, \texttt{destroy} fails if there are still regions registered. 912\\ 913For the receiving part, the process simple has to dequeue from the queue 914\begin{lstlisting}[style=code] 915regionid_t rid; 916genoffset_t offset; 917genoffset_t length; 918genoffset_t valid_data; 919genoffset_t valid_length; 920uint64_t flags; 921 922err = queue_dequeue(q, &rid, &offset, &length, &valid_data, &valid_length, &flags); 923if (err_is_fail(err)) { 924 //error handling, 925} 926\end{lstlisting} 927Additionally in the case of the queue for inter process communication during creation 928a notification handler can be given as an argument. An example of such a handler is shown below. 929This handler is called whenever a \texttt{notify} from the other process is received. 930\begin{lstlisting}[style=code] 931static void notify_cb(void* q) 932{ 933 errval_t err = SYS_ERR_OK; 934 struct queue* queue = (struct queue*) q; 935 936 regionid_t rid; 937 genoffset_t offset; 938 genoffset_t length; 939 genoffset_t valid_data; 940 genoffset_t valid_length; 941 uint64_t flags; 942 943 while(err_is_ok(err)) { 944 err = queue_dequeue(queue, &rid, &offset, &length, &valid_data, 945 &valid_length, &flags); 946 } 947} 948\end{lstlisting} 949Some other examples are shown in Figures \ref{fig:oneproc} and \ref{fig:twoproc} 950\begin{figure}[!h] 951 \begin{center} 952 \caption{Usage example with datapath in a single domain} 953 \includegraphics[width=0.65\textwidth]{pics/oneproc.pdf} 954 \label{fig:oneproc} 955 \end{center} 956\end{figure} 957 958\begin{figure}[!h] 959 \begin{center} 960 \caption{Usage example with datapath across domains} 961 \includegraphics[width=0.65\textwidth]{pics/twoproc.pdf} 962 \label{fig:twoproc} 963 \end{center} 964\end{figure} 965 966 967\bibliographystyle{abbrv} 968\bibliography{defs,barrelfish} 969 970\end{document} 971