1\documentclass[a4paper,11pt,twoside]{report}
2\usepackage{bftn}
3\usepackage{calc}
4\usepackage{hyperref}
5\usepackage{hyphenat}
6\usepackage{verbatim}
7\usepackage{xspace}
8\usepackage{pifont}
9\usepackage{pxfonts}
10\usepackage{textcomp}
11\usepackage{amsmath}
12\usepackage{listings}
13\usepackage[framemethod=default]{mdframed}
14\usepackage[shortlabels]{enumitem}
15\usepackage{parskip}
16\usepackage{xparse}
17\usepackage{natbib}
18\usepackage[all,cmtip]{xy}
19\usepackage{xcolor}% or package color
20
21\title{Device Queues in Barrelfish}
22\author{Barrelfish project}
23\tnnumber{26}
24\tnkey{DeviceQueues}
25
26\begin{document}
27	\maketitle			% Uncomment for final draft
28	
29	\begin{versionhistory}
30		\vhEntry{0.1}{24.10.2017}{RH}{Initial Version}
31	\end{versionhistory}
32	
33	% \intro{Abstract}		% Insert abstract here
34	% \intro{Acknowledgements}	% Uncomment (if needed) for acknowledgements
35	\tableofcontents		% Uncomment (if needed) for final draft
36	% \listoffigures		% Uncomment (if needed) for final draft
37	% \listoftables			% Uncomment (if needed) for final draft
38	\cleardoublepage
39	\setcounter{secnumdepth}{2}
40
41	\newcommand{\todo}[1]{[\textcolor{red}{\emph{#1}}]}
42
43	\newcommand{\virtio}{Virtio }
44	\newcommand{\devif}{Devif }
45	\newcommand{\ep}[1]{$Ep_{#1}$}
46
47\usepackage{color}
48\definecolor{bluekeywords}{rgb}{0.13,0.13,1}
49\definecolor{greencomments}{rgb}{0,0.5,0}
50\definecolor{redstrings}{rgb}{0.9,0,0}
51
52\lstdefinestyle{code}
53{language=C,
54	basicstyle=\tiny,
55	showspaces=false,
56	showtabs=false,
57	breaklines=true,
58	showstringspaces=false,
59	breakatwhitespace=true,
60	escapeinside={(*@}{@*)},
61	commentstyle=\color{greencomments},
62	keywordstyle=\color{bluekeywords}\bfseries,
63	stringstyle=\color{redstrings},
64	basicstyle=\ttfamily,
65	morekeywords={uint8_t, uint16_t, uint32_t, uint64_t, size_t, bool},}
66	
67	\mdfdefinestyle{defi}{%
68		linecolor=black,linewidth=2pt,%
69		frametitlerule=true,%
70		frametitlebackgroundcolor=gray!20,
71		innertopmargin=\topskip,
72		nobreak=true% prevent page breaks in the middle of mystyle
73	}
74	
75	\mdfdefinestyle{inv}{%
76		linecolor=black,linewidth=2pt,%
77		topline=false,
78		leftline=false,
79		innertopmargin=\topskip,
80	}
81	
82	\mdtheorem[style=defi]{defi}{Definition}
83	\mdtheorem[style=defi]{op}{Operation}
84	\mdtheorem[style=inv]{inv}{Invariant}
85	
86	\chapter{Introduction}
87	In this document we describe a queue based interface that unifies communication for 
88	devices like network cards and block devices but also between processes. 
89	The interface should fit to as many devices as possible while still being efficient.
90	The goal is similar to \virtio \cite{virtio:sigops}, Portals \cite{portals} or MPI \cite{mpi}, 
91	but we want to take a more formal 
92	approach and define pieces that are ambiguous. These interface do not define a memory model 
93	or what preconditions/postconditions an interface function has. 
94	For example, what happens when a guest accesses a memory buffer that is handed off 
95	to the host? These interfaces are clearly implementation driven where 
96	we want to document the interface as clearly as possible, and see the implementation and
97	its code as two separate things.
98	
99	\section{Terminology}
100	In this section, we explain the terms and the meaning of them as they are used in the following sections.
101	\begin{itemize}
102		
103		\item \textbf{Region}: A Region is a chunk of memory that is registered to the \devif interface. From the memory of the region, buffers can be allocated.
104		\item \textbf{Buffer}: A buffer is a chunk of memory within a region.
105		\item \textbf{Endpoint}: An endpoint is a processes or devices. 
106		\item \textbf{Ownership}: An endpoint can own a buffer and transfer ownership of a buffer to another endpoint or device. 
107		If an endpoint owns a buffer, it can alter it. If an endpoint alters a buffer that is not owned, 
108		the result is undefined and it is considered a bug. 
109	\end{itemize}
110	
111	\section{Functionality}
112	The basic functionality of our queue based device interface (from now on called \devif) 
113	should be transferring ownership of buffers between two endpoints of a queue.  
114	A buffer is a variable sized piece of memory within a previously to the \devif interface 
115	register region of memory. We exclude managing the buffers themselves i.e. allocating 
116	and deallocating buffers to keep the interface and the underlying protocol simple. 
117	If we manage the variable sized buffers, we would have to implement a dynamic memory 
118	allocator which increases the complexity of the \devif interface. 
119	\\
120	Another important aspect is the idea of "stacking" queues. In this manner each 
121	queue on the stack can do an arbitrary transformation on the buffer that was enqueued 
122	and hand it down to the queue lower in the stack. 
123	
124	\section{Memory Model}
125	When implementing a backend for the \devif interface, the underlying memory model 
126	has to be considered. In certain cases, a write to a buffer might not have been written back
127	to memory before the buffer is processed by a device. 
128	\\
129	There are several memory models that are used in current hardware but none of them 
130	are sequential consistent (SC). The memory models relax sequential consistency to allow 
131	instruction reordering and other optimizations. Currently X86 and Sparc can be 
132	modelled using \textit{Total Store Ordering} (TSO) ~\cite{swell:cacm2010}. TSO relaxes 
133	SC so that local writes are visible locally before they are visible to all other
134	hardware threads (multi-copy atomic). In most cases this is because of a write buffer
135	that is introduced to buffer stores and the local request are satisfied by the contents 
136	of the buffer. TSO is still a very strict memory model and only allows limited instruction
137	reordering. 
138	\\
139	Our aim with \devif is that the backends support the even more relaxed model of ARM 
140	\cite{swell:cacm2016,arm:mm} (and IBM Power). In the memory model of ARM the processor can 
141	reorder instructions very generously. Stores as well as loads and other instructions 
142	(even atomic instructions) can be reordered. The goal of \devif is to support the weakest memory
143	model so it runs on the most common hardware, but can still increase the strictness
144	of the memory model to improve performance. 
145	
146
147	\chapter{Function Definitions and Semantics}
148	In this section we describe the functions of the \devif interface in detail. 
149	Not only do we define the functions itself, 
150	but the give additional information to the semantics. We use the term 
151	\textit{undefined behaviour} for calls on the interface that we consider
152	bugs and that must not happen. The creation and destruction of queues 
153	are device specific and are not part of the interface itself. 
154	
155	\section{Registering Region}
156	Adds a region of memory to the active set of this queue. 
157	The queue has to be properly initialized beforehand. 
158	The memory region has to be owned by the endpoint trying to register the region. 
159	If the region is not owned by the endpoint, the behaviour is undefined.
160	If a region is added that is already registered or overlaps with another region, an error must be returned. 
161	The returned region id must be unique for this queue and larger than or equal to 0. After the function returns, 
162	buffers from the just registered region can be enqueued and the ownership
163	can be transferred. 
164	\begin{figure}[h]
165		\begin{lstlisting}[style=code]
166        errval_t devq_register(struct devq *q,
167                               struct capref cap,
168                               regionid_t* region_id);
169		\end{lstlisting}
170		\label{lst:register}
171	\end{figure}
172	
173	\subsection*{Arguments}
174	\begin{itemize}
175		\item \texttt{devq *q}: handle to the device queue.
176		\item \texttt{struct capref cap}: the capability representing the memory region to register.
177		\item \texttt{regionid\_t* region}: return pointer to the region id of the newly registered region.
178	\end{itemize}
179	\subsection*{Preconditions}
180	\begin{itemize}
181		\item The device queue is initialized
182		\item The memory has to be allocated
183		\item Memory of the region must be owned by endpoint 
184		\item The region must not be currently registered
185		\item Region must not overlap with other already registered regions 
186	\end{itemize}
187	\subsection*{Postconditions}
188	On success, the following conditions on the returned value hold
189	\begin{itemize}
190		\item \texttt{region\_id} must be unique for this queue and larger or equal to 0
191	\end{itemize}
192	
193	\subsection*{Reasons for Failure}
194	\begin{itemize}
195		\item Register function of backend fails.
196		\item Region to register is already registered.
197	\end{itemize}
198	
199	\section{Deregister Region}
200	Removes a memory region from the registered regions of a queue. 
201	If a region is deregisterd that was not registered before, an error is returned. 
202	To deregister a region, every buffer of the region i.e. the whole region has to be 
203	owned by the endpoint making the call to the interface. If a region is deregistered and the 
204	region is not fully owned by the endpoint, the behaviour is undefined.  
205	\begin{figure}[h]
206		\begin{lstlisting}[style=code]
207        errval_t devq_deregister(struct devq *q,
208                                 regionid_t region_id,
209                                 struct capref* cap);
210		\end{lstlisting}
211		\label{lst:deregister}
212	\end{figure}
213	
214	\subsection*{Arguments}
215	\begin{itemize}
216		\item \texttt{devq *q}: handle to the device queue
217		\item \texttt{regionid\_t region\_id}: the id of the region to deregister
218		\item \texttt{struct capref cap}: return pointer to the cap of the deregistered region
219	\end{itemize}
220	
221	\subsection*{Preconditions}
222	\begin{itemize}
223		\item The device queue is initialized
224		\item Region must currently be registered (i.e. valid region id that is currently registered)
225	\end{itemize}
226	\subsubsection*{Postconditions}
227	On success, the following conditions on the returned value hold
228	\begin{itemize}
229		\item \texttt{cap} must not be NULL
230		\item \texttt{cap} is a capability referencing a memory region that was once registered.
231	\end{itemize}
232	
233	\subsection*{Reasons for Failure}
234	\begin{itemize}
235		\item Backend deregister function fails.
236		\item Region was not registered beforehand. 
237	\end{itemize}
238	
239	\section{Enqueue}
240	Enqueues a buffer of a region for ownership transfer. The buffers offset into the memory region 
241	has to be within the preregistered memory region matching the region id, otherwise an error is returned. 
242	The region id provided has to be valid i.e. larger or equal than 0 and is already registered. 
243	The length of the buffer must be large than 0 and must not
244	exceed the region size minus the offset of the start address of the buffer within the 
245	region. The valid data offset has to be within the buffer and its length may not exceed
246	the buffers length. 
247	The buffer has to be currently owned by the client i.e. a buffer 
248	can not be enqueued twice without dequeuing it beforehand, otherwise the behaviour is undefined. 
249	Enqueueing a buffer does not directly transfer the ownership, but the client
250	enqueueing the buffer has given up ownership on the buffer. Eventually the ownership
251	of the buffer will be transferred but there is no guarantee when this happens.
252	All the changes to the buffer have to be written back to memory and not only 
253	reside in the cache. Altering a buffer that a client has no ownership over, 
254	will result in undefined behaviour. Multiple buffers can be chained by using 
255	the argument \texttt{misc\_flags}. When chaining multiple buffers, the last buffer
256	of the chain must have the \texttt{misc\_flags} set to \texttt{DEVQ\_FLAG\_LAST}.
257	
258	
259	\begin{figure}[h]
260		\begin{lstlisting}[style=code]
261		errval_t devq_enqueue(struct devq *q,
262		                      regionid_t region_id,
263		                      genoffset_t offset,
264		                      genoffset_t length,
265		                      genoffset_t valid_data,
266		                      genoffset_t valid_length,
267		                      uint64_t misc_flags);
268		\end{lstlisting}
269		\label{lst:enqueue}
270	\end{figure}
271	
272	\subsection*{Arguments}
273	\begin{itemize}
274		\item \texttt{devq *q}: handle to the device queue.
275		\item \texttt{regionid\_t region\_id}: the id of a memory region the enqueued buffer belongs to.
276		\item \texttt{genoffset\_t offset}: the offset within the memory region where the buffers starts. 
277		\item \texttt{genoffset\_t length}: the length of the enqueued buffer.
278		\item \texttt{genoffset\_t valid\_data}: the offset within the buffer where the valid data starts.
279		\item \texttt{genoffset\_t valid\_length}: the length of the valid data within the buffer.
280		\item \texttt{uint64\_t flags}: flags of the buffer. 
281	\end{itemize}
282	
283	\subsection*{Preconditions}
284	\begin{itemize}
285		\item The device queue is initialized
286		\item The region id must match with an already registered region
287		\item The buffer must be owned by the client of the interface 
288		\item The offset must be within the bounds of a memory region
289		\item The length must not exceed the region size minus the offset of the buffer within the region
290		\item The valid\_data offset must be within the buffers bounds
291		\item The valid\_length must not exceed the length minus the valid\_data offset
292		\item Changes to the buffer are written back to memory
293	\end{itemize}
294	
295	\subsection*{Postconditions}
296	After a successful enqueue, the following conditions hold
297	\begin{itemize}
298		\item -
299	\end{itemize}
300	
301	\subsection*{Reasons for Failure}
302	\begin{itemize}
303		\item Backend enqueue function fails (e.g. when the queue is full)
304		\item Region id is not valid.
305		\item Bounds check for buffer fails
306		\item Bounds check for valid data fails
307	\end{itemize}
308	
309	\section{Dequeue}
310	Dequeues a buffer from the queue. After a buffer is dequeued, the client
311	takes ownership of the buffer. As long as the client owns a buffer, the
312	client can alter the contents of this buffer. Dequeue can be called any time,
313	and even when a notification is received there is no guarantee that the queue contains any buffers
314	to receive. If there is nothing to dequeue, the call will return an error. When an endpoint
315	dequeues a buffer, it has to invalidate its cache of the received buffer when
316	the weaker memory model is assumed (ARM, IBM Power). The dequeued values 
317	have to represent a valid buffer as well as point to valid data. If nothing 
318	is known about the validity of the data with the buffer, the whole buffer 
319	is considered as valid data.
320
321	\begin{figure}[h]
322		\begin{lstlisting}[style=code]
323		errval_t devq_dequeue(struct devq *q,
324		                      regionid_t* region_id,
325		                      genoffset_t* offset,
326		                      genoffset_t* length,
327		                      genoffset_t* valid_data,
328		                      genoffset_t* valid_length,
329		                      uint64_t* flags);
330		\end{lstlisting}
331		\label{lst:dequeue}
332	\end{figure}
333	\subsection*{Arguments}
334	\begin{itemize}
335		\item \texttt{devq *q}: handle to the device queue.
336		\item \texttt{regionid\_t* region\_id}: return pointer to the region id of the dequeued buffer.
337		\item \texttt{genoffset\_t* offset}: return pointer to the offset within the memory region where the buffers starts. 
338		\item \texttt{genoffset\_t* length}: return pointer to the length of the enqueued buffer.
339		\item \texttt{genoffset\_t* valid\_data}: return pointer to the offset within the buffer where the valid data starts.
340		\item \texttt{genoffset\_t* valid\_length}: return pointer to the length of the valid data within the buffer.
341		\item \texttt{uint64\_t flags}: flags of the buffer. 
342	\end{itemize}
343	\subsection*{Preconditions}
344	\begin{itemize}
345		\item The device queue is initialized
346	\end{itemize}
347	\subsection*{Postconditions}
348	The returned pointer have to contain valid information about a buffer.
349	After a successful dequeue, the following conditions on the returned values hold
350	\begin{itemize}
351		\item \texttt{region\_id} must be larger or equal to 0
352		\item \texttt{offset} is not equal to \textit{NULL}
353		\item \texttt{length} is not equal to \textit{NULL}
354		\item \texttt{valid\_data} is not equal to \textit{NULL}
355		\item \texttt{valid\_length} is not equal to \textit{NULL}
356		\item \textit{offset} does not exceed region length
357		\item \textit{length} + \textit{offset} do not exceed region length
358		\item \textit{valid\_data} does not exceed the buffer size
359		\item \textit{valid\_length} does not exceed buffer size minus valid data offset
360	\end{itemize}
361	
362	\subsection*{Reasons for Failure}
363	\begin{itemize}
364		\item Backend dequeue function fails (e.g. when the queue is empty)
365		\item Buffer returned by the backend is not valid.
366	\end{itemize}
367	
368	\section{Notify}
369	Notify informs the client on the other side of the queue that there might be buffers 
370	buffers in the queue that are ready for processing. There is no guarantee that there is 
371	actually a buffer in the queue. When a buffer is enqueued, there
372	is no guarantee to when the buffer is processed. Notify ensures, that
373	all the buffers that are enqueued to this point will eventually processed and the ownership
374	is transferred. Notify is a performance optimization mechanism and not strictly necessary.
375	
376	\begin{figure}[h]
377		\begin{lstlisting}[style=code]
378		errval_t devq_notify(struct devq *q);
379		\end{lstlisting}
380		\label{lst:notify}
381	\end{figure}
382	
383	\subsection*{Arguments}
384	\begin{itemize}
385		\item \texttt{devq *q}: handle to the device queue.
386	\end{itemize}
387	\subsection*{Preconditions}
388	\begin{itemize}
389		\item The device queue is initialized
390	\end{itemize}
391	\subsection*{Postconditions}
392	-- 
393	\subsection*{Reasons for Failure}
394	\begin{itemize}
395		\item Backend notify function fails
396	\end{itemize} 
397	
398	\chapter{Formal Model}
399	In this section we model the system as a transfer of ownership of sets of addresses. We model it as a 
400	transition system by first defining the agents, the structures, 
401	and the operations used to do the transfer. We do not model notifications since they are strictly optional
402	and mainly a performance optimization. We abstract the buffers 
403	of the ownership transfer protocol as a simple set of addresses. Consequently, we do not need to 
404	define what a buffer is. This means in our model the smallest unit resource we transfer ownership 
405	of is a single memory address. 
406	\\
407	\begin{defi}[Memory Address]
408		A \textbf{Memory Address} \textit{A} is an identifier (or a name) that abstracts 
409		an addressable byte of a machine. 
410	\end{defi}
411	
412	To go any further, we have to define on what sets an agent can operate on. In the following
413	if we write about a set, this means a set of addresses i.e. if we mention set $S$ it 
414	is defined as $S = \{A_0, ..., A_n\}$.
415    In our transition system we only have two agents $X$ and $Y$.
416	\\
417	\begin{defi}[Agent State]
418		Our model consists of two Agents $X$ and $Y$. 
419		The sets an Agent can operate on are $S_x$ for agent $X$ and $S_y$ for agent $Y$ and
420		each of these sets consists of memory addresses.
421	\end{defi}
422	
423	The ownership in our model is transferred between the two agents $X$ and $Y$. 
424	If an agent takes ownership of an address, this means that the agent can operate 
425	on these addresses. For example, it can read the 
426	contents of the memory address or write to the memory address. 
427	\\
428	\begin{defi}[Ownership]	
429		Agent $X$ and $Y$ can take \textbf{ownership} of an address $A$ by adding the memory address to the set $S_x$ or $S_y$ respectively.
430		\[ \text{ownership}(S, A) \Leftrightarrow A \in S \] 
431		where $S$ is either the state of agent $X$ or $Y$ and $A$ is any address that can be used in the transition system. 
432	\end{defi}
433	
434	To prevent two agents owning the same address, we define the following invariant that 
435	has to hold for all operations on the transition system.
436	\\
437	\begin{inv}
438		At any point in time an address $A$ can only be owned by one Agent at a time
439		\[ S_x \cap S_y = \emptyset \]
440	\end{inv}	
441	
442	With the current definitions, we could model the transfer of ownership by simply removing from one
443	set and adding it to the other ($S_x$ to $S_y$ and vice versa) but this model would not 
444	entail correctly how the transfer of an address happens. An address that is enqueued might
445	not be immediately dequeued by the other agent which means that we require another mechanism: 
446	the queue itself. In essence, we need to store
447	the addresses that are currently in transfer in another set associated with a queue.
448	\\
449	\begin{defi}[Queue]
450		A bidirectional queue $Q$ between the two agents $X$ and $Y$ consists of two sets of addresses called 
451		$T_{xy}$ and $T_{yx}$ where $T_{xy}$ is the set of addresses that is in transfer from agent $X$ 
452		to agent $Y$ and $T_{yx}$ vice versa. 
453		\[Q = (T_{xy}, T_{yx})\]
454	\end{defi}
455	
456	Having defined basic agents and structures of our model, but we are missing operations 
457	on the transition system to actually transfer the ownership of addresses. To transfer 
458	ownership over a queue from agent $X$ to $Y$ and vice versa, we define
459	the operations \textit{enqueue} and \textit{dequeue}. 
460	\\
461	\begin{op} [Enqueue]
462		Enqueue: initiates a transfer of ownership of a set of addresses $B$ from agent $X$ to $Y$ and vice versa. 
463		Enqueue removes $B$ from either agent $X$ or $Y$'s state ($S_x$ and $S_y$) and adds it to the 
464		queue $Q$ ($Q.T_{xy}$ in case of agent $X$ and $Q.T_{yx}$ in case of agent $Y$). Enqueue requires a 
465		set of addresses to transfer $B$. The function enqueue from agent $X$ is defined as
466		\begin{align*} enqueue_{x}(B) \quad \quad S_x := S_x - B \\
467		T_{xy} := T_{xy} \cup B  
468		\end{align*}
469		and for Agent $Y$
470		\begin{align*} enqueue_{y}(B) \quad \quad S_y := S_y - B \\
471		T_{yx} := T_{yx} \cup B 
472		\end{align*}
473	\end{op}
474	\\
475	\\
476	\begin{op}[Dequeue]
477		Dequeue: completes a transfer of ownership of a set of addresses $B$ from agent $X$ to $Y$ 
478		and vice versa. Dequeue removes $B$ from the queue state $Q$ ($Q.T_{xy}$ in case of $Y$ 
479		and $Q.T_{yx}$ in case of $X$) and adds it to the agents state $S$ ($S_x$ in case of $X$ and 
480		$S_y$ in case of $Y$). Dequeue requires the set of addresses to transfer $B$, the agents state 
481		$S$ and the queue state $Q$. The function enqueue from agent $X$ is defined as
482		\begin{align*} \text{dequeue}_{x}(B) \quad \quad T_{yx} := T_{yx} - B \\
483		S_x := S_x \cup B  
484		\end{align*}
485		and for Agent $Y$
486		\begin{align*} \text{dequeue}_{y}(B) \quad \quad T_{xy} := T_{xy} - B \\
487		S_y := S_y \cup B  
488		\end{align*}
489	\end{op}
490	\\
491	\\
492	Further to add and remove Addresses from the closed system, we defined \textit{register} and \textit{deregister}.
493	\begin{op} [Register]
494		Register: adds a set of addresses $R$ that are not yet in the system to the agent state $S$. 
495		To be more precise $R$ is defined as \[ R \subseteq (S_x \cup S_y \cup Q.T_{xy} \cup Q.T_{yx})^c \]
496		
497		The addresses in the system can be used for the transfer of ownership to a second agent. 
498		The function takes the set of addresses $R$ as an argument
499		
500		The behaviour of the function \textit{register} is shown below 
501		$$register_{x}(R) \quad \quad S_x := S_x \cup R $$
502		$$register_{y}(R) \quad \quad S_y := S_y \cup R $$
503	\end{op}
504	\\
505	\vspace{6mm}
506	\\
507	\begin{op} [Deregister]
508		Deregister: removes a set of memory addresses $R$ from an agents
509		state $S$. After deregistering the addresses of $R$ can no longer be transfer on
510		the queue. The function takes the set of addresses $R$. The addresses
511		of $R$ will be no longer in the system i.e. 
512		\[R \in (S_x \cup S_y \cup Q.T_{xy} \cup Q.T_{yx})^c \]
513		
514		The behaviour of the function \textit{deregister} is shown below 
515		$$deregister_{x}(R) \quad \quad S_x := S_x - R $$
516		$$deregister_{y}(R) \quad \quad S_y := S_y - R $$
517	\end{op}
518	\clearpage
519	The transition system that we defined up to now can be seen in the picture below. 
520	
521\begin{figure}[!h]
522	\begin{center}
523		\caption{Buffer transfer protocol sets and the operations}
524		\centerline{\xymatrix{
525				&&(S_x \cup S_y \cup T_{xy} \cup T_{yx})^c 
526				\ar@{<->}[rrddd]^{register_y}_{deregister_y}
527				\ar@{<->}[llddd]^{deregister_x}_{register_x}
528				\\
529				\\
530				&&T_{xy} \ar@{->}[rrd]^{dequeue_y} \\
531				S_x \ar@{->}[rru]^{enqueue_x} 
532				&&&&S_y \ar@{->}[lld]^{enqueue_y}\\
533				&&T_{yx} \ar@{->}[llu]^{dequeue_x}
534			}}
535		\end{center}
536\end{figure}
537		
538	The invariant we described before, only captures part of the constraint that we require so we extend it to
539	also address the sets $T_{xy}$ and $T_{yx}$.
540	
541	\begin{inv}
542		At any point in time an address $A$ can only be in one set at the tame
543		\[ S_x \cap S_y \cap T_{xy} \cap T_{yx} = \emptyset \]
544	\end{inv}	
545
546	We can show that this holds for all the operations we defined. We omit the proof in this document.
547	The formal specification should make the main concepts of the \devif Interface more clear and 
548	remove more ambiguities of the function specifications and semantics of the previous chapter. 
549
550
551\chapter{Implementation} 
552
553The building blocks of the \devif interface are the different library backends for the devices
554and a small generic library implementing the bookkeeping of region and buffer ids. The backend 
555libraries implement a few well defined functions that are installed as function pointers
556for the generic library during the creation of a queue. Creating and destroying a queue 
557is device specific and is not part of the interface between library backends and the 
558generic library. A high level overview of how to use the \devif interface is shown in Figure 
559\ref{fig:overview}. The creation of a queue yields a handle that contains all the
560state for a device queue and as a first member, the general struct for the \devif
561queue state. An example of the data structures for a device and the \texttt{devq struct} 
562is shown in Figures \ref{lst:sfn5122f}, \ref{lst:devq}, \ref{lst:func_p}.
563
564
565\begin{figure}[h]
566	\centering
567	\caption{High level overview}
568	\includegraphics[width=\textwidth]{pics/devif.pdf}
569	\label{fig:overview}
570\end{figure}  
571
572\begin{figure}[h]
573	\caption{Device Specific Queue Struct}
574	\begin{lstlisting}[style=code]
575    struct sfn5122f_queue {
576        struct devq q;
577        union {
578            sfn5122f_q_tx_user_desc_array_t* user;
579            sfn5122f_q_tx_ker_desc_array_t* ker;
580        } tx_ring;
581        struct devq_buf*                tx_bufs;
582        uint16_t                        tx_head;
583        uint16_t                        tx_tail;
584        size_t                          tx_size;
585        ...
586    };
587	\end{lstlisting}
588	\label{lst:sfn5122f}
589\end{figure}  
590
591\begin{figure}[h]
592	\caption{General \devif struct}
593	\begin{lstlisting}[style=code]
594    struct devq {
595        // Region management
596        struct region_pool* pool;
597        
598        // Funciton pointers
599        struct devq_func_pointer f;
600        ...
601    };
602	\end{lstlisting}
603	\label{lst:devq}
604\end{figure}  
605
606\begin{figure}[H]
607	\caption{Backend Function Pointers}
608	\begin{lstlisting}[style=code]
609    typedef uint64_t genoffset_t;
610    errval_t (*devq_notify_t) (struct devq *q);
611	
612    errval_t (*devq_register_t)(struct devq *q,
613                                struct capref cap,
614                                regionid_t region_id);
615                                
616    errval_t (*devq_deregister_t)(struct devq *q, 
617                                  regionid_t region_id);
618	
619    errval_t (*devq_control_t)(struct devq *q,
620	                           uint64_t request,
621	                           uint64_t value
622	                           uint64_t result*);
623	
624    errval_t (*devq_enqueue_t)(struct devq *q, 
625	                           regionid_t region_id,
626	                           genoffset_t offset,
627	                           genoffset_t length,
628	                           genoffset_t valid_data,
629	                           genoffset_t valid_offset,
630	                           uint64_t misc_flags);
631	
632    errval_t (*devq_dequeue_t)(struct devq *q,
633                               regionid_t* region_id,
634                               genoffset_t* offset,
635                               genoffset_t* length,
636                               genoffset_t* valid_data,
637                               genoffset_t* valid_offset,
638                               uint64_t* misc_flags);
639	
640    struct devq_func_pointer {
641        devq_register_t reg;
642        devq_deregister_t dereg;
643        devq_control_t ctrl;
644        devq_notify_t notify;
645        devq_enqueue_t enq;
646        devq_dequeue_t deq;
647    };
648	\end{lstlisting}
649	\label{lst:func_p}
650\end{figure}  
651
652
653The example shown in Figure \ref{lst:sfn5122f} is the device specific struct for a Solarflare NIC queue.
654The struct contains as a first member the \texttt{devq} struct (Figure \ref{lst:devq}) so it can be 
655cast to a \texttt{devq} struct. The rest of the struct contains the pointers to the descriptor rings (and
656the state associated with a descriptor for the \devif interface). Further the bindings for the 
657communication channel to the card driver. In Figure \ref{lst:func_p} the functions which have 
658to be implemented by a library backend are shown. In certain cases some of these functions
659can be a NOP, but the function pointers still have to be set.
660
661\section{Implementation Debug Backend}
662This is a debugging interface for \devif interface that            
663can be used with any existing queue. It can be stacked on top                
664to check for non valid buffer enqueues/deqeues that might happen and
665that lead to undefined behaviour. With other queues, the undefined behaviour
666might go unnoticed where the debug queue certainly returns an error.      
667An example of a not valid enqueue of a buffer is when the endpoint that enqueues           
668the buffer does not own the buffer.                                          
669\\                                                                     
670We keep track of the owned buffers as a list of regions which each           
671contains a list of memory chunks. Each chunk specifies a offset within 
672the region and its length. When a region is registered, we add one memory chunk that describes          
673the whole region i.e. offset=0 length = length of region                      
674\\                                                                           
675If a buffer is enqueued, it has to be contained in one of these            
676memory chunks (otherwise the endpoint does not own the buffer). 
677The memory chunk is then altered according how the buffer is contained in the chunk. 
678If it is at the beginning or end of the chunk, the offset/length of the chunk is changed accordingly      
679If the buffer is in the middle of the chunk, we split the memory chunk       
680into two new memory chunks that do not contain the buffer. Simply put, the list contains the 
681parts of the region which this endpoint owns.                   
682\\                                                                           
683If a buffer is dequeued the buffer is added to the existing memory           
684chunks if possible, otherwise a new memory chunk is added to the             
685list of chunks. If a buffer is dequeued that is in between two               
686memory chunks, the memory chunks are merged to one big chunk.                
687We might fail to find the region id in our list of regions. In this          
688case we add the region with the deqeued offset+length as a size.             
689We can be sure that this region exists since the devq library itself         
690does these checks if the region is known to the endpoint. The debugging queue 
691on top of the other queue does not have allways have a consistent view of the 
692registered regions (but the \devif library part does)         
693\\                                                                           
694When a region is deregistered, the list of chunks has to only                
695contain a single chunk that descirbes the whole region. Otherwise            
696the call will fail since some of the buffers are still in use and are not
697owned by the endpoint.   
698\\
699Additionally, we added two calls to help debug errors that arise. The function signatures
700are shown below in Listing \ref{lst:debug}
701
702\begin{figure}[h]
703	\caption{Additional functions of the Debug Queue}
704	\centering
705	\begin{lstlisting}[style=code]
706    errval_t debug_dump_region(struct debug_q* que,
707                               regionid_t rid);     
708                                          
709    void debug_dump_history(struct debug_q* q);  
710	\end{lstlisting}
711	\label{lst:debug}
712\end{figure} 
713With these two functions, the history of operations (to a certain limit) 
714and the list of currently owned memory chunks can be printed on the console.
715
716\section{Implementation Solarflare Backend}
717
718The solaflare backend mainly deals with getting the resources to access hardware register and
719then handling the receive, transmit, and event queue of a VNIC. On the create call, the resources for a VNIC are allocated. 
720A VNIC (Not to be confused with Virtual Functions) has three queues receive, transmit, and event queue that are represented 
721in software by ringbuffers. After allocating the memory for the queues, a communication 
722channel to the driver is set up and the information to set up the queue on the card is 
723propagated to the driver by an RPC call. The RPC returns a capability to the hardware registers 
724of the queue and a queue id. With the received cap, 
725the registers of the queue can be mapped into the vspace of the current running program and as 
726of that moment, the user-space program can access a hardware queue directly. The signature of the 
727create call is shown in Listing \ref{lst:sfn5122f_create}
728\begin{figure}[h]
729	\caption{Additional functions of the Debug Queue}
730	\centering
731	\begin{lstlisting}[style=code]
732     errval_t sfn5122f_queue_create(struct sfn5122f_queue** q, 
733                                    sfn5122f_event_cb_t cb,
734                                    bool userspace, 
735                                    bool interrupts, 
736                                    bool default_q);
737	\end{lstlisting}
738	\label{lst:debug}
739\end{figure} 
740The event callback in the signature is called when the queue receives an interrupt. The different
741booleans are for enabling/disabling the userspace networking feature and interrupts. Further
742the default queue where all the unmatched packets (i.e. not matched by any hardware filtering)
743are received can be requested, otherwise one of the 1024 VNICs will be used.
744\\
745When a VNIC is created, it can be instantiated with user-level networking enabled or disabled. 
746If user-level networking is enabled, mappings from of buffers that can be used to send/receive
747data have to be installed on the card. When register is called, it executes an RPC to 
748the card driver which adds the required mappings to a hardware register table 
749(shown in Figure \ref{fig:translation}). The mappings are removed by a deregister call. Each of these
750entries represents a 4k page that can be used as buffers. \textbf{Note: the number 
751table entries is limited ($\approx$140000) so registering big amount of memory with
752the solarflare card might lead to problems.} To send data using such a buffer, we need to know
753for a region id the corresponding first entry of the buffer table. Since the buffer id is the
754offset within the region, we can directly compute the buffer table entry if we store the index
755of the first entry of the region. Further with the buffer id being the offset in the region, we
756can compute offsets within a 4k buffer entry (supported by the NIC). If a buffer crosses 
757a 4k buffer entry boundary, the packet has to be fragmented into two descriptors in the ringbuffers. 
758\begin{figure}[h]
759	\caption{Translation from region id + buffer id to buffer table entry of the solarflare card}
760	\includegraphics[width=\textwidth]{pics/sfn5122f.pdf}
761	\label{fig:translation}
762\end{figure}  
763
764After the setup, we can directly access the hardware registers for managing the VNIC. If we call
765\texttt{devq\_enqueue()} now, the region and buffer id are translated to an buffer table 
766entry and an offset which can be used to build a descriptor. The \texttt{flags} of the 
767\texttt{devq\_enqueue()} function are used to define if the buffer we enqueue is a receive 
768or send buffer. Additionally to the descriptor, we also write the buffer id, region id and 
769other information alongside the descriptor so a second translation of 
770buffer table index to region and buffer id is not necessary on \texttt{devq\_dequeue()}. 
771Since both \texttt{devq\_enqueue} and \texttt{devq\_dequeue} directly access hardware 
772registers, \texttt{devq\_notify} is a NOP. Here we use the semantics, that enqueing/dequeueing 
773on a queue is possible before receiving a notify.
774
775\section{Implementation IDC Backend}
776The IDC backend facilitates communication between two processes on either the same core
777or two different cores. The communication channel is based on Flounder interfaces. To set 
778up such a channel, one of the endpoints exports the interface while the other simply connects
779to the endpoint that exports functionality. The flounder interfaces are shown in Figure \ref{lst:descq}.
780\begin{figure}[h!]
781	\caption{Flounder Interface}
782	\begin{lstlisting}[style=code]
783    interface descq "Devif communication queue" {
784        // create and destroy a queue
785        rpc create_queue(in uint32 slots, 
786                         in cap rx, 
787                         in cap tx,
788	                     in bool notifications, 
789                         out errval err);
790                     
791        rpc destroy_queue(out errval err);
792	
793        // add a memory region to the buffer table
794        rpc register_region(in cap cap, 
795                            in uint32 rid, 
796                            out errval err);
797                        
798        rpc deregister_region(in uint32 rid, 
799                              out errval err);
800	
801        rpc control(in uint64 cmd, 
802                    in uint64 value, 
803                    out errval err);
804    
805    };
806	\end{lstlisting}
807	\label{lst:descq}
808\end{figure} 
809On the remote side, the functions pointers that are given during creation are called. 
810Using these function pointers, two queues can be connected (e.g. solaflare queue and 
811IDC queue to support software filtering). If the endpoint simply 
812connects to a remote endpoint, the local endpoint sets up memory for the receive/send 
813queue that reside in shared memory between the two endpoints. 
814The shared memory is established by the \texttt{create\_queue}
815RPC call. There are no flounder RPC calls for enqueue and dequeue as they are handled through either 
816a notification sent to the other endpoint or the remote endpoint can simply 
817poll the queue by trying to dequeue a descriptor from the queue. 
818
819\chapter{Interface Usage}
820The usage in both cases, if the datapath is in a single domain or across domains is the same. 
821In the case of a single domain, the notify can be dropped since it is a NOP. 
822In a single domain the control plane, i.e. setting up the queue and registering memory 
823regions to the interface, can involve another process but not on the data plane. 
824In the following we make a small example of how to use the queue interface and
825a queue backend. 
826\begin{lstlisting}[style=code]
827errval_t err;
828struct queue* q;
829struct descq* queue;
830struct capref memory;
831regionid_t regid;
832
833// Create queue, specific to the queue backend. 
834// In this case simple channel between two processes
835err = descq_create(&queue, notify_cb, ...);
836if (err_is_fail(err)){
837    // error handling
838}
839
840// Cast to queue interface struct. Used for methods
841// that are not 
842q = (struct queue*) queue;
843
844// Allocate memory (returns a capability to the memory)
845err = frame_alloc(&memory, MEMORY_SIZE, NULL);
846if (err_is_fail(err)){
847    //error handling
848}
849
850// Register the memory we are going to use
851err = queue_register(q, memory, &regid);
852if (err_is_fail(err)){
853    // error handling
854}
855
856\end{lstlisting}
857At the point of the queue creation, we still have to know what the queue represents.
858In this case we create a queue to connect two processes. Following the creation, we cast 
859the device specific to a more general queue. Now, we can use the queue interface to register 
860the memory from which we want to carve out our variable sized buffers. At this stage, the datapath
861is set up and buffers can be transferred. 
862\begin{lstlisting}[style=code]
863// Modify buffers in some way
864//  ...
865
866// Enqueue buffers (2KB size for now, but can be variable size)
867for (int i = 0; i < MEMORY_SIZE/2048; i++){
868    err = queue_enqueue(q, regid_rx, i*2048, 2048, 0, 2048, 0);
869    if (err_is_fail(err)){
870        // Can do a notify if the queue is full
871    
872        if (err == QUEUE_ERR_QUEUE_FULL) {
873            err = queue_notify(q);
874            if (err_is_fail(err)) {
875                // error handling
876            }
877            // there was an error i.e. retry enqueueing buffer
878            i--;
879        } else {
880           // there was an error i.e. retry enqueueing buffer
881           i--;
882        }
883    }
884} 
885\end{lstlisting}
886In the example above, fixed sized buffers are enqueued and transferred to another process. 
887The process on the other end of the queue can dequeue the buffers at any time but the queue
888itself has a size limit (as many most other queues). When the other process does not dequeue
889from the queue, at some point the queue will be full. As an optimization, the function \texttt{notify}
890sends a message to the other process to inform it that there might be buffers in the queue. 
891\\
892The teardown of the queue is using the function \texttt{deregister} and \texttt{destroy}
893\begin{lstlisting}[style=code]
894// Make sure this process owns all the memory of the region
895// ...
896
897// Remove region from the active set
898err = queue_deregister(q, regid, &memory);
899if (err_is_fail(err)){
900    // error handling
901}
902
903// Cleanup resources of the queue
904err = queue_destroy(q);
905if (err_is_fail(err)){
906    // error handling
907}
908\end{lstlisting}
909Before a region can be deregistered, the process has to make sure that all the memory of the region
910is owned by the process. If the process does not fully own the region, \texttt{deregister} 
911returns an error. Similar, \texttt{destroy} fails if there are still regions registered. 
912\\
913For the receiving part, the process simple has to dequeue from the queue
914\begin{lstlisting}[style=code]
915regionid_t rid;
916genoffset_t offset;
917genoffset_t length;
918genoffset_t valid_data;
919genoffset_t valid_length;
920uint64_t flags;
921
922err = queue_dequeue(q, &rid, &offset, &length, &valid_data, &valid_length, &flags);
923if (err_is_fail(err)) {
924    //error handling, 
925}
926\end{lstlisting}
927Additionally in the case of the queue for inter process communication during creation
928a notification handler can be given as an argument. An example of such a handler is shown below.
929This handler is called whenever a \texttt{notify} from the other process is received. 
930\begin{lstlisting}[style=code]
931static void notify_cb(void* q)
932{
933    errval_t err = SYS_ERR_OK;
934    struct queue* queue = (struct queue*) q;
935
936    regionid_t rid;
937    genoffset_t offset;
938    genoffset_t length;
939    genoffset_t valid_data;
940    genoffset_t valid_length;
941    uint64_t flags;
942
943    while(err_is_ok(err)) {
944        err = queue_dequeue(queue, &rid, &offset, &length, &valid_data,
945                            &valid_length, &flags);
946    }
947}
948\end{lstlisting}
949Some other examples are shown in Figures \ref{fig:oneproc} and \ref{fig:twoproc}
950\begin{figure}[!h]
951	\begin{center}
952		\caption{Usage example with datapath in a single domain}
953		\includegraphics[width=0.65\textwidth]{pics/oneproc.pdf}
954		\label{fig:oneproc}
955	\end{center}
956\end{figure}
957
958\begin{figure}[!h]
959	\begin{center}
960		\caption{Usage example with datapath across domains}
961		\includegraphics[width=0.65\textwidth]{pics/twoproc.pdf}
962		\label{fig:twoproc}
963	\end{center}
964\end{figure}
965
966
967\bibliographystyle{abbrv}
968\bibliography{defs,barrelfish}	
969	
970\end{document}
971