\documentclass[a4paper,twoside]{report} % for a report (default)

\usepackage{bftn,color} % You need this
\usepackage{verbatim} % for comment

\title{Routing in Barrelfish}   % title of report
\author{Akhilesh Singhania, Alexander Grest}	% author
\tnnumber{006}  % give the number of the tech report
\tnkey{Routing} % Short title, will appear in footer

% \date{Month Year} % Not needed - will be taken from version history

%% \newcommand{\note}[1]{}
\newcommand{\note}[1]{[\textcolor{red}{\textit{#1}}]}

\begin{document}
\maketitle

%
% Include version history first
%
\begin{versionhistory}
\vhEntry{1.0}{09.06.2010}{AS}{Initial version}
\vhEntry{1.01}{14.06.2010}{AS}{Added discussion of design and unresolved issues}
\vhEntry{1.02}{23.07.2010}{AS}{More work on design and API}
\vhEntry{2.0}{31.05.2011}{Alexander Grest}{Multi-hop messaging}
\vhEntry{2.1}{01.12.2013}{TR}{Some explanation}
\end{versionhistory}

% \intro{Abstract}		% Insert abstract here
% \intro{Acknowledgements}	% Uncomment (if needed) for acknowledgements
% \tableofcontents		% Uncomment (if needed) for final draft
% \listoffigures		% Uncomment (if needed) for final draft
% \listoftables			% Uncomment (if needed) for final draft
\chapter{Motivation}

This technical note describes a set of design proposals (and prototype
implementation) of multi-hop message routing in Barrelfish -- how to
send messages between cores which do not have a direct connection
between them.  This arises in, for example, Ethernet communication
with a fully-distributed Barrelfish instance, or between multiple
cores spread out over a PCIe bus (as is the case with Xeon Phi or
Intel SCC).

All inter-core communication in Barrelfish is performed using explicit
messages, which are carried over ``Interconnect Drivers'' (ICDs),
specialized messaging subsystems that carry data between cores. At
present, all communication happens over direct point-to-point ICD
links. However, there are multiple motivations for extending this. In
this note, we motivate the introduction of a routing layer in
Barrelfish.


\section{Partial connectivity}
Most current multicore machines are fully connected via shared memory: any core in the system can communicate with any other core
by using shared memory. Most applications are designed accordingly. However, this assumption does not necessarily hold on modern hardware, as the following two examples illustrate:

\begin{itemize}

\item On the \emph{Intel Single Chip Cloud Computer} (SCC), the set of memory a core can access is determined by the setup of its Look Up Tables (LUTs). It is possible that these tables are set up in such a manner that
two or more cores do not have access to the same region of memory. In such cases, for these cores to communicate, messages have to be routed via another set of cores, if such a path exists.

\item If Barrelfish is operated on a cluster of machines, only the core(s) where the network stack is running have an Ethernet-based ICD link to the other machines. In order to allow every core to communicate with every other core, the available link must be multiplexed.

\end{itemize}

These are two examples of current set-ups which lead to partial connectivity. A routing layer akin to the one in IP routing will allow applications to communicate in
such environments, without having to worry about the concrete path a message takes from one core to another. The routing layer will route messages transparently.

\section{Resource usage}
ICD links require resources. In particular, a link uses the following resources:

\begin{itemize}
\item \textbf{Memory}: Any ICD link will require some memory to buffer unsent messages and messages that have been received but not yet delivered. Some links will require additional memory for the channel itself. For instance, the shared-memory UMP driver on x86 requires two pages of physical memory per binding. In general, the amount of memory required is governed by the number of messages in flight and
the number of buffered messages.

\item \textbf{CPU}: Some ICDs, such as the UMP driver on x86, require explicit polling to check for incoming messages. The more links a core has, the more time it has to spend polling.

\item \textbf{Cache}: If an ICD uses polling to check for incoming messages, the polled cache line will be placed in the cache. If the core has many ICD links, a significant part of its cache will be flushed due to polling.

\end{itemize}

The more ICD links a core has, the more of its resources will be used for them. This is detrimental to a high-performance system. By limiting the number of links, we limit the amount of resources required. One way to limit the number of links is to multiplex multiple channels over one ICD link and therefore not construct a fully connected network.

\section{Heterogeneous IDC}
Barrelfish supports various IDC mechanisms.
Different mechanisms provide different semantics and guarantees, such
as maximum frame size, notification mechanism, synchrony, etc.

An application can utilize multiple IDC mechanisms.  This can happen
if a homogeneous machine supports multiple IDC mechanisms or if the
application runs on a heterogeneous machine.  To avoid the need for
the application to understand the different semantics of all IDC
mechanisms, it can instead rely on the semantics provided by the
routing library.

The semantics provided by the routing layer are discussed in
section~\ref{sec:semantics}.

\section{Group communication}

Various parallel computing abstractions, such as barriers,
require communication among a group of threads.
When any thread enters a barrier, it waits for all other threads to enter
the barrier as well before continuing.

Various distributed communication abstractions, such as achieving consensus,
also require communication among a group of nodes.
A group of nodes that want to come to agreement on some value
need to communicate with each other.

It has been shown \cite{nishtala:optimizing-collective:hotpar09,
  barrelfish:sosp09} that even in a fully connected
machine, using some form of routing can improve the performance of group communication.
The sender sends the message to a subset of the nodes it wishes to communicate with.
The subset will in turn forward it to the remaining set of nodes.
The same work has shown that the order in which messages are sent matters as well.
The optimal route and ordering of messages is machine dependent.

If applications were written with the abstraction of a group layer,
this would give the library sufficient flexibility in
selecting the optimal route based on the machine type.


\chapter{Multi-hop messaging}

Multi-hop messaging is an important part of the routing layer. It gives applications the ability to create a logical channel between two cores that is routed over multiple nodes. This requires that available ICD links are multiplexed.

A multi-hop channel can only be set up between two dispatchers running on different cores. It always leads through the two monitors running on the dispatchers' cores. Between those two monitors the multi-hop channel can lead through an arbitrary number of additional monitors. We call all the monitors that lie on a multi-hop channel \emph{nodes}. All the nodes of a multi-hop channel must be connected by means of other ICD links (such as LMP or UMP ICD links).

Once a multi-hop channel is set up, it can be used to exchange messages between the two dispatchers. The multi-hop channel transports messages by passing them to the underlying interconnect driver on each link between the nodes of the multi-hop channel.

The multi-hop interconnect driver consists of
\begin{itemize}
\item A mechanism to set up new multi-hop channels between dispatchers addressed by end-point identifiers
\item A mechanism to send messages along a multi-hop channel
\item A mechanism to receive messages from the channel
\end{itemize}


\section{Design goals}

\subsection{Independence of the underlying interconnect driver}

The multi-hop interconnect driver is designed to be independent of the type of the underlying ICD links between the nodes on the multi-hop channel. This means that it uses the common flounder interface supported by all ICDs when interacting with the underlying ICD link and uses no ICD-specific knowledge. This design involves a performance penalty: interacting directly with the underlying ICDs instead of via the common flounder interface would certainly perform better. Nevertheless, we chose this design, as it gives us more flexibility: the multi-hop interconnect channel can run over all present and future interconnect drivers in Barrelfish, as long as they support the common flounder interface.

\subsection{Reliability}

Interconnect drivers in Barrelfish generally provide a reliable messaging service: a message is delivered only once, each message sent is eventually delivered, and its content is not corrupted. Furthermore, messages are delivered in FIFO order. The multi-hop interconnect driver is designed to provide a reliable messaging service in principle. However, contrary to the end-to-end argument, it does not provide any \emph{end-to-end} reliability, but builds on the reliability provided by the interconnect drivers of the underlying links. We accept that the multi-hop interconnect driver can fail if any of the interconnect drivers of the underlying links fails.

\subsection{Resource usage}
Because it is our goal to optimize resource usage, the multi-hop interconnect driver is designed to perform considerably better in terms of resource usage than the scenario where we only use direct point-to-point ICD links. In particular, we save memory, because the multi-hop driver has a comparably small memory footprint.

\section{Design overview}

Messaging in Barrelfish is connection-oriented: messages are passed via an explicit binding object, which encapsulates one half of a connection, and such a binding must be established in advance. Therefore, we have decided to support only connection-oriented multi-hop messaging (for now).  The multi-hop interconnect driver is designed in such a way that channel set-up is collapsed into the binding phase.

We use virtual circuit switching in order to multiplex multiple multi-hop channels over the available ICD links. Virtual circuit switching has several advantages over a packet-switched approach. It ensures that all messages take the same path and thereby ensures FIFO delivery of messages (as long as the underlying ICD links provide FIFO delivery). Moreover, it allows per-circuit state to be created on the nodes of a virtual circuit.

Each monitor maintains a forwarding table. For each multi-hop channel, entries are created in the forwarding tables at all the nodes of that channel. Messages that are sent over the channel are forwarded at each node according to its forwarding table. The entries in the forwarding tables can be seen as per-channel \emph{hard} state: they are explicitly created at channel set-up and deleted at channel tear-down. In addition to the entries in the forwarding table, per-channel state includes bindings to the neighbouring nodes on the multi-hop channel.

In addition to the forwarding table, each node maintains a routing table. The routing table is used for channel set-up: if a node receives a channel set-up request, it determines where to forward the request with the help of its routing table.

The virtual circuit switching approach would also allow us to reserve resources on the nodes for each channel. Per-channel reserved resources could include buffer space to store received but not yet forwarded messages, or bandwidth on the ICD links. This is potentially very useful for congestion and flow control. Note that we cannot simply drop messages in case of congested links, as we want to provide a reliable messaging service. As of now, we do not reserve resources on the nodes, but allocate required resources dynamically.

\begin{figure}[h]
	\begin{center}
 	\includegraphics[scale=0.7]{overview_multihop_channel.pdf}
 	\caption{Basic set-up}\label{fig:multihop-chan}
 	\end{center}
\end{figure}

\section{Additional monitor bindings}
A multi-hop channel is multiplexed over the available ICD links. However, for each multi-hop channel, there will be two additional ICD links: two additional LMP channels will be created between the client's dispatcher and the monitor running on its core, and between the service's dispatcher and the monitor on its core. LMP channels are rather cheap: they do not require polling and require only a small amount of memory. Therefore, this does not compromise our goal of optimizing resource usage. Figure~\ref{fig:multihop-chan} shows an example set-up of a multi-hop channel with the two additional LMP channels.

Those additional channels are needed to ensure that the default monitor binding is not congested or even blocked by multi-hop messages. For example, suppose that a client's dispatcher receives a lot of multi-hop messages within a short period of time and reacts to this by allocating more memory. If multi-hop messages were sent over the default monitor binding, the messages coming from the memory server would be blocked, resulting in a deadlock. By creating new monitor bindings and not using the default monitor binding, we prevent such a scenario.

\section{Virtual circuit identifiers}
\label{section:vcis}
Multi-hop messages carry a virtual circuit identifier (VCI). Virtual circuit identifiers allow nodes to identify the particular multi-hop channel a message belongs to. Each node on a multi-hop channel maintains a forwarding table, which maps VCIs to the next hop on that particular channel. A node forwards multi-hop messages based on this forwarding table. At channel end-points, a VCI identifies the binding belonging to the multi-hop channel the message was sent over. Virtual circuit identifiers are not only local to a specific link, but also to a direction on that link. Figure~\ref{fig:vci} shows an example assignment of VCIs.

We assign virtual circuit identifiers at random. At each node, we use a hash table to map virtual circuit identifiers to a pointer to the channel state. The use of a hash table allows efficient message forwarding: when a message arrives, a simple look-up in the hash table determines where to forward it. The complexity of this lookup is linear in the number of virtual circuit identifiers that map to the same hash bucket (the number of buckets in the hash table is a compile-time constant).
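
The following sketch illustrates this forwarding step. It is a minimal illustration only: the names (\texttt{forwarding\_entry}, \texttt{vci\_lookup}, etc.) are hypothetical and do not necessarily match the monitor's actual data structures.

\begin{verbatim}
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch of VCI-based forwarding (hypothetical names). */
#define VCI_BUCKETS 256  /* number of buckets: a compile-time constant */

struct forwarding_entry {
    uint64_t vci_in;                  /* VCI carried by incoming messages */
    uint64_t vci_out;                 /* VCI to use on the outgoing link  */
    struct monitor_binding *next_hop; /* binding to the neighbouring node */
    struct forwarding_entry *next;    /* chaining within a hash bucket    */
};

static struct forwarding_entry *buckets[VCI_BUCKETS];

/* Look up the per-channel state for an incoming VCI. The cost is
 * linear in the number of entries hashing to the same bucket. */
static struct forwarding_entry *vci_lookup(uint64_t vci)
{
    for (struct forwarding_entry *e = buckets[vci % VCI_BUCKETS];
         e != NULL; e = e->next) {
        if (e->vci_in == vci) {
            return e;
        }
    }
    return NULL; /* unknown VCI: no such channel on this node */
}
\end{verbatim}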

An attacker sending messages with manipulated virtual circuit identifiers may be able to send messages on channels not belonging to it. By assigning virtual circuit identifiers at random, we make it very unlikely that an attacker finds valid virtual circuit identifiers of channels not belonging to it.

This design requires that each node on a multi-hop channel tells its neighbours what virtual circuit identifier they should use for messages sent over that particular channel. This happens in the set-up phase of a multi-hop channel (see section~\ref{section: set-up}).

\begin{figure}[h]
	\begin{center}
 	\includegraphics[scale=0.68]{vcis.pdf}
 	\caption{Virtual circuit identifiers} \label{fig:vci}
 	\end{center}
\end{figure}


\section{Channel set-up}
\label{section: set-up}
If two dispatchers want to communicate with the help of the multi-hop interconnect driver, they first have to create a multi-hop channel. During channel set-up, one dispatcher must act as the client and the other as the server (however, once a channel is established, the communication process on both sides of the channel is indistinguishable).

The channel set-up process can be initiated by invoking the \texttt{multihop\_chan\_bind} function of the multi-hop interconnect driver. Note that a user does not normally interact directly with the multi-hop interconnect driver, but only via the flounder-generated stubs (see chapter~\ref{chapter: flounder integration}).


The channel set-up process works as follows:

\begin{enumerate}

\item A client dispatcher initiates the set-up process by calling the bind function of the multi-hop interconnect driver. This function forwards the bind request to the monitor running on the client dispatcher's core. The bind request includes various parameters, including the \emph{iref} of the service and the client's (ingoing) virtual circuit identifier.

\item The monitor running on the client dispatcher's core determines (from the iref) the core on which the service resides. It then forwards the bind request to another monitor, which is determined based on the routing table.

\item Monitors receiving the bind request check whether the service is running on the same core as they are. If so, they determine the local dispatcher which has exported this iref and forward the request to it. Otherwise, the bind request is forwarded to another monitor in the same way as in step two.

\item As soon as the service's dispatcher receives the bind request, it runs the user-provided connection callback. Based on the return value of this callback, it either accepts the connection or rejects it. In either case, the bind reply is sent back to the monitor.

\item The monitor proxies the bind reply back to where it received the bind request from.

\item When the client dispatcher receives the bind reply, it runs the user-provided bind callback.

\end{enumerate}
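
A rough sketch of initiating such a bind from the client side follows. The signature of \texttt{multihop\_chan\_bind} and the callback shown here are simplified assumptions, not the exact API; applications would normally trigger this indirectly through the flounder-generated bind call (see chapter~\ref{chapter: flounder integration}).

\begin{verbatim}
/* Sketch only: signatures are simplified assumptions. */
static void bind_cb(void *st, errval_t err, struct multihop_chan *chan)
{
    /* step 6: runs on the client once the bind reply has arrived */
    if (err_is_fail(err)) {
        return; /* set-up failed somewhere along the path */
    }
    /* channel established: register receive handlers, start sending */
}

static void start_bind(iref_t service_iref)
{
    /* step 1: forward a bind request, carrying the service's iref
     * and our ingoing VCI, to the monitor on our core */
    errval_t err = multihop_chan_bind(service_iref, bind_cb, NULL);
    if (err_is_fail(err)) {
        /* could not reach the local monitor */
    }
}
\end{verbatim}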

In order to support setting up connections between dispatchers, the existing messaging interfaces between dispatchers and their local monitor, and between monitors, have been extended.

As described in section~\ref{section:vcis}, it is necessary that each node on the multi-hop channel tells its neighbouring nodes what virtual circuit identifier they should use for messages sent over that particular channel. Therefore, each message contains the virtual circuit identifier of the sender.  The two response messages additionally contain the VCI of the receiver. This allows the receiver of a response message to identify the multi-hop channel the message belongs to.


\section{Message forwarding}
\label{section: message forwarding}
Once the multi-hop channel is set up, messages can be sent in both directions. A message can be sent by invoking the \texttt{multihop\_send\_message} function of the interconnect driver.  This function requires that the message payload is passed as one (char) array. If a user-defined message contains multiple arguments that are not stored in contiguous memory locations, either the user-defined message must be split into multiple multi-hop messages, or a new array must be allocated and all message arguments must be copied into the newly allocated array (see chapter~\ref{chapter: flounder integration} for a discussion).

In order to support sending messages, the existing messaging interfaces between dispatchers and their local monitor, and between monitors, have been extended. Each multi-hop message contains a VCI, a field encoding the direction of the message and the message payload (as a dynamic array). Furthermore, it contains one field encoding message flags and another field used to acknowledge received messages. Those two fields are used for flow control (see section~\ref{section: flow control}).

As a multi-hop channel allows messages to be sent in both directions, the direction field is needed to identify the direction of a particular message. Currently we assign direction ``1'' to all messages going from the dispatcher which initiated the multi-hop channel to the other dispatcher, and direction ``2'' to messages going the opposite way.

This definition of a multi-hop message is motivated by the fact that it must be possible to transport an arbitrary message within one (or more) multi-hop messages. By using a dynamic array argument for the message payload, we can transfer data of an arbitrary size in a multi-hop message.
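
Conceptually, a multi-hop message therefore carries the following fields, sketched here as a C struct. The names and field widths are illustrative; the actual wire format is defined by the flounder interfaces between dispatchers and monitors.

\begin{verbatim}
#include <stdint.h>
#include <stddef.h>

/* Conceptual layout of a multi-hop message (illustrative only). */
struct multihop_msg {
    uint64_t vci;        /* virtual circuit identifier (receiver-local) */
    uint8_t  direction;  /* 1: from channel initiator, 2: opposite way  */
    uint8_t  flags;      /* message flags, used for flow control        */
    uint32_t ack;        /* acknowledges messages previously received   */
    size_t   payload_len;
    uint8_t  payload[];  /* dynamic array: payload of arbitrary size    */
};
\end{verbatim}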

Internally, multi-hop messages are forwarded at every node of a multi-hop channel until they reach the receiver. We make sure that multi-hop messages cannot overtake other multi-hop messages at the nodes by enqueuing messages in the order they arrive and forwarding them in FIFO order.

\section{Capability forwarding}
\label{section: capabilities}
Because capabilities are maintained as references to per-core state in the CPU drivers, only the LMP interconnect driver, which traverses kernel-mode code, can directly deliver a capability along with message payload. In the multi-hop interconnect driver, capabilities travel out-of-band from other message payload.

To send a capability, the dispatcher sends a \texttt{multihop\_cap\_send} message to its local monitor, containing the capability. The monitor determines whether the capability can be sent to the remote dispatcher. In general, capabilities referring to inherently local state (such as an LMP endpoint) may not be sent, nor may capabilities that are currently being revoked. If the capability cannot be sent, a \texttt{multihop\_cap\_reply} message is sent back to the local dispatcher containing the error code. Otherwise, the capability is serialised and forwarded along the multi-hop channel.

The monitor running on the receiver's core reconstructs the capability from its serialised representation and forwards it to the destination dispatcher. This dispatcher identifies the binding to which the capability belongs and invokes a callback on that binding.

The capability sender only receives a reply message in case an error occurs. An error can occur if, for example, the capability cannot be sent or the receiver has no space left to accommodate an additional capability.
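
Sketched in C, the monitor-side decision might look as follows. The helper functions and the error code are hypothetical; only the message names \texttt{multihop\_cap\_send} and \texttt{multihop\_cap\_reply} are taken from the description above.

\begin{verbatim}
/* Sketch of monitor-side handling of multihop_cap_send
 * (all helper names and the error code are hypothetical). */
static void handle_cap_send(struct channel_state *ch, struct capref cap)
{
    if (cap_is_local_only(cap) || cap_is_being_revoked(cap)) {
        /* capability may not leave this core: answer the local
         * dispatcher with a multihop_cap_reply carrying the error */
        send_cap_reply(ch, MULTIHOP_ERR_CAP_SEND);
        return;
    }
    /* serialise the capability and forward it out-of-band along
     * the multi-hop channel towards the receiver's monitor */
    forward_serialised_cap(ch, cap);
}
\end{verbatim}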

\section{Receiving messages}
In order to receive messages sent over a multi-hop channel, message handlers must be registered with that multi-hop channel. In particular, three message handlers must be registered: one message handler for ``normal'' messages, one handler for incoming capabilities and one handler for capability reply messages (which are sent in case an error occurred while sending a capability).

Normally, it is the flounder-generated stubs for the multi-hop interconnect driver (see chapter~\ref{chapter: flounder integration}) that register those message handlers, not the application itself.

\section{Routing tables}
\label{sec: routing tables}
The routing tables are used to determine where to forward a connection set-up request. Each monitor needs its own routing table. We currently support the automatic generation of routing tables for three basic modes of routing:

\begin{enumerate}
\item \textbf{Direct}: All set-up requests are immediately forwarded to the end-receiver.

\item \textbf{Ring}: We route over all cores of a system. Core $i$ forwards a request to core $(i+1) \bmod N$, where $N$ is the number of cores (see the sketch after this list).

\item \textbf{Fat tree}: We route directly between the cores that are located on the same CPU socket. On each socket, we choose a ``leader'' and route directly between all leaders. A set-up request for a core on a different socket is always forwarded over the local leader to the leader on that socket.
\end{enumerate}
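
A minimal sketch of the ring mode's next-hop computation, assuming cores are numbered $0$ to $N-1$:

\begin{verbatim}
/* Ring routing: core i forwards a set-up request to core (i+1) mod N.
 * The number of cores N comes from the system knowledge base. */
static coreid_t ring_next_hop(coreid_t my_core, coreid_t num_cores)
{
    return (my_core + 1) % num_cores;
}
\end{verbatim}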

For the routing modes ``ring'' and ``fat tree'' we need information from the system knowledge base: we need to know the number of cores in the system for the ``ring'' routing mode. For the ``fat tree'' mode, we additionally need to know the number of cores per CPU socket (note that we assume here that sockets are numbered contiguously).

We decided that there should be no direct communication between the monitor and the system knowledge base, because the latter is not always present. On some architectures, such as Microsoft's experimental Beehive architecture or, to a certain extent, the Intel Single Chip Cloud Computer, the system knowledge base is not available. Therefore, a dependency of the monitor on the system knowledge base should be avoided.

For this reason, we decided to create a separate module, called the \emph{routing table set-up dispatcher} (RTS), that talks to the system knowledge base and to the initial monitor (the monitor that is booted first). The routing table set-up dispatcher retrieves the required information from the system knowledge base in order to construct the routing table. Once it has constructed the routing table, it sends it to the initial monitor.

The initial monitor forwards the (relevant parts of the) routing table to the other monitors once they are booted. This is necessary because we want to avoid having to create a channel between each monitor and the routing table set-up dispatcher.

It must be noted that the routing table set-up dispatcher can only generate the routing tables for the cores of a single system. It cannot handle set-ups like an Intel SCC connected to an x86 machine over a PCIe-based channel.

\section{Flow control}
\label{section: flow control}
It is possible that one dispatcher on a multi-hop channel sends at a faster rate than the receiving dispatcher can handle incoming messages and process them. Because we want to provide a reliable messaging service, we cannot just drop messages in such a case, but have to buffer them and deliver them eventually. To limit the space needed to buffer undelivered messages, we decided to implement a flow control mechanism within the multi-hop interconnect driver. The flow control mechanism allows the receiving dispatcher to control the transmission speed, so that it is not overwhelmed with messages.

We decided to use a credit-based flow control mechanism: the number of messages in flight at any given time is limited. Once a sender has reached this limit, it has to wait until it receives an acknowledgement that the receiver has processed previously sent messages. We call this limit the \emph{window size}.

The flow control mechanism is completely transparent to applications. It is entirely handled by the multi-hop interconnect driver. On each message sent by a dispatcher over a multi-hop channel, an acknowledgement for all messages previously received over this channel is piggy-backed.

If an application uses a one-way communication scheme, i.e.\ one dispatcher is always sending while the other is only receiving, it is not possible to piggy-back acknowledgements on messages sent the other way. In such a case, the multi-hop interconnect driver sends a dummy message. A dummy message contains no message payload, but acknowledges all received messages. This approach ensures that acknowledgements are piggy-backed on messages whenever possible; only when absolutely necessary is an acknowledgement sent in its own message.
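
A minimal sketch of the sender-side credit accounting, assuming a fixed window size (the names and the concrete window size are illustrative):

\begin{verbatim}
#include <stdint.h>
#include <stdbool.h>

#define WINDOW_SIZE 16  /* illustrative limit on messages in flight */

struct multihop_flow {
    uint32_t unacked;   /* messages sent but not yet acknowledged */
};

/* May the sender transmit another message right now? */
static bool can_send(struct multihop_flow *f)
{
    return f->unacked < WINDOW_SIZE;
}

static void on_send(struct multihop_flow *f)
{
    f->unacked++;
}

/* Called for the acknowledgement field piggy-backed on each received
 * message (or carried by a dummy message): acked is the number of
 * our messages the peer has processed since its last acknowledgement. */
static void on_ack(struct multihop_flow *f, uint32_t acked)
{
    f->unacked -= acked;
}
\end{verbatim}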


\chapter{Flounder support for multi-hop messaging}
\label{chapter: flounder integration}

Flounder is a stub compiler which generates stubs for defined interfaces. To support multi-hop messaging, we created a new back-end code generator for the flounder stub compiler that generates code to use the multi-hop interconnect driver.  Applications do not interact with the multi-hop interconnect driver directly, but only over the generated stubs. The stubs for the multi-hop interconnect driver have the exact same interface as stubs for other interconnect drivers. This makes application code independent of the interconnect driver used for communication.

The generated stubs can be seen as an ``adaptor'' to the multi-hop interconnect driver. They translate calls to the common flounder interface to the interface of the multi-hop interconnect driver. Supported functionality mainly includes binding, sending and receiving of multi-hop messages, and some control operations.

\section{Binding}
If two dispatchers want to communicate with the help of the multi-hop interconnect driver, they must acquire binding objects for each endpoint of the channel. In any binding attempt, one dispatcher must act as the client and the other as the service (however, once a binding is established, the communication process on both sides of the binding is indistinguishable). The binding phase is merged with channel set-up, i.e.\ a new multi-hop channel is created during the binding process.

In order to initiate a binding, a client dispatcher calls the bind function for a given interface. Because Barrelfish features multiple interconnect drivers, the interface's bind function has to decide which interconnect driver to use in order to establish the binding. Currently, it ``asks'' the different interconnect drivers to establish a binding in a predefined order (for example, the LMP driver is always first). As soon as an interconnect driver manages to establish the binding, the binding process is finished. Should one interconnect driver fail, the next one in order is tried.

If an application wants to create a new multi-hop channel, it can pass the flag \texttt{IDC\_BIND\_FLAG\_MULTIHOP} as an argument to the interface's bind function. This changes the order of the interconnect drivers: the multi-hop interconnect driver comes in second place, directly after the LMP driver. The LMP driver is first, because it is preferable to the multi-hop interconnect driver if client and service are running on the same core. If the multi-hop interconnect driver fails to establish a binding for some reason, the binding process continues as normal with the other interconnect drivers.

The result of the binding process on the client's and service's side is a binding object which is the abstract interface to the multi-hop interconnect driver for a specific interface type.


\section{Sending messages}
\label{section: flounder sending messages}
A message may be sent on the binding by calling the appropriate transmit function. We distinguish between user-defined messages and multi-hop messages. User-defined messages are those messages defined by the user in the interface. Multi-hop messages are the messages that are sent over a multi-hop channel.

As pointed out in section~\ref{section: message forwarding}, the multi-hop interconnect driver requires that the message payload is passed as one char array. If a user-defined message contains dynamic arguments (arguments whose size is only known at run-time), such as a string or a dynamic array, it is generally not possible to pass the message payload as one char array to the multi-hop interconnect driver. There are three possible approaches to send such a message:

\begin{enumerate}
\item Allocate a region of memory capable of holding all message arguments and copy the message arguments to this region. A pointer to it can then be passed to the multi-hop interconnect driver as message payload.

\item Split a user-defined message into multiple multi-hop messages. Each argument of the user-defined message is transported in its own multi-hop message.

\item Use a combination of the above approaches. For instance, all fixed-size arguments could be sent in one message, and each dynamically sized argument could be sent in an extra multi-hop message.
\end{enumerate}

In comparison to approach 1, approach 2 saves the cost of allocating a region of memory and copying all the arguments of the message to that region. In exchange, it needs to split a user-defined message and transport it via multiple multi-hop messages. The preferable approach depends on the type of messages that are sent. However, we think that the performance penalty involved in sending each message argument in its own multi-hop message is not acceptable for most message types. Therefore, the flounder-generated stubs for the multi-hop interconnect driver use approach 1. Approach 3 might be a possible performance optimization, but is currently not in use.

\subsection{Implementation}
All message arguments are copied to contiguous memory locations in order to send the whole user-defined message in one multi-hop message.
When sending a user-defined message, we first calculate the size of its payload. The size of a message's payload is only known at compile time if the message definition does not contain any dynamic arguments. Otherwise, the size of the payload has to be computed each time such a message is sent. After having computed the payload size, we allocate a memory region of that size and copy the message arguments to that region of memory. Finally, we pass a pointer to this memory region to the multi-hop interconnect driver.

We include the size of every dynamically sized argument in the message payload. This tells the receiver the size of those arguments and allows it to retrieve them from the received message payload. Currently, we use 8 bytes to transfer the size of a dynamic argument. This ensures that we do not get an overflow. We account for those size fields when calculating the size of the message payload.
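
A sketch of this marshalling for a hypothetical user-defined message with one fixed-size argument and one dynamic array; the 8-byte length prefix mirrors the description above:

\begin{verbatim}
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Marshal the hypothetical message msg(uint32 id, uint8 buf[len]) into
 * one contiguous payload: [id][len as 8 bytes][buf bytes]. The caller
 * passes the result to multihop_send_message and frees it afterwards. */
static uint8_t *marshal(uint32_t id, const uint8_t *buf, uint64_t len,
                        size_t *payload_len)
{
    *payload_len = sizeof(id) + sizeof(len) + len;
    uint8_t *payload = malloc(*payload_len);
    if (payload == NULL) {
        return NULL;
    }
    size_t pos = 0;
    memcpy(payload + pos, &id, sizeof(id));   pos += sizeof(id);
    memcpy(payload + pos, &len, sizeof(len)); pos += sizeof(len);
    memcpy(payload + pos, buf, len);          /* dynamic argument last */
    return payload;
}
\end{verbatim}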

Capabilities are never sent as message payload. They are always sent out-of-band from ``normal'' message payload. A discussion of this can be found in section~\ref{section: capabilities}.

Our implementation has one issue regarding communication in heterogeneous systems: to be consistent with the common flounder interface, we have to use a variable of type \texttt{size\_t} to represent the size of a dynamic array. The type \texttt{size\_t} is architecture dependent: on a 32-bit system it is typically 32 bits wide, and on a 64-bit system typically 64 bits wide. If a dispatcher on a 64-bit system communicates with a dispatcher on a 32-bit system, this can lead to a problem: the dispatcher on the 64-bit system can potentially send dynamic arrays that are bigger than the dispatcher on the 32-bit system can receive. This problem remains unsolved in the current version of Barrelfish.


\subsection{Send continuation}
Each transmit function takes as an argument a pointer to a continuation closure. The closure will be executed after the message has successfully been sent. If another transmit function is called on the same binding before the continuation is executed, it will return the \texttt{FLOUNDER\_ERR\_TX\_BUSY} error code, indicating that the binding is currently unable to accept another message. In this case, the user must arrange to retry the send.

The send continuation is the only way to know when a message has been sent over the multi-hop channel and it is safe to send the next message. Note that even if an application uses a ping-pong communication scheme, i.e.\ it sends a message back and forth between two dispatchers, it is not guaranteed not to get a \texttt{FLOUNDER\_ERR\_TX\_BUSY} error code, unless it serialises all sends with the continuation. This somewhat unintended behaviour is caused by the fact that the multi-hop channel internally relies on other ICD links to transport messages. The multi-hop channel itself uses send continuations on the underlying ICD links to determine when it can accept another message. Those send continuations are always executed after a message is sent. Therefore it is possible (although unlikely) that a message is sent and the reply for that message is received before the multi-hop channel can accept the next message.
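
The resulting idiom is to serialise sends by issuing the next message only from the previous message's continuation, and to retry on \texttt{FLOUNDER\_ERR\_TX\_BUSY}. A sketch, assuming a hypothetical interface with a binding type \texttt{my\_binding} and a message \texttt{mymsg}:

\begin{verbatim}
/* Send messages one at a time by chaining them through the
 * continuation (binding type and message name are hypothetical). */
static void send_next(void *arg)
{
    struct my_binding *b = arg;
    errval_t err = b->tx_vtbl.mymsg(b, MKCONT(send_next, b), 42);
    if (err_is_fail(err)) {
        if (err_no(err) == FLOUNDER_ERR_TX_BUSY) {
            /* binding cannot accept another message yet: arrange to
             * retry, e.g. when the binding signals it can send again */
        } else {
            /* a real error: report it */
        }
    }
}
\end{verbatim}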


\section{Receiving messages}
The flounder-generated stubs register a callback function with the multi-hop interconnect driver at channel set-up time in order to be notified when a message arrives. As we send a user-defined message within a single multi-hop message, we also receive a user-defined message within one multi-hop message.

Upon receiving a multi-hop message, we have to extract the original user-defined message from it and pass it on to the user-provided receive handler. It is a fatal error if a message arrives on a multi-hop channel and the receive handler function for that message type is not set.

If the user-defined message contains dynamic arguments, we have to allocate space for each of those arguments separately and copy them from the received multi-hop message. This is necessary because all dynamic message arguments are passed by reference to the user and become the property of the user. The user must be able to free those arguments separately, therefore they must be copied to separately allocated memory. Fixed-size arguments are simply passed on the stack to the user.



\chapter{Group Communication}

\section{Terminology}

\textbf{Groups:}
The set of all nodes on the machine forms the \emph{universe group}.
The set of nodes in the universe group that
wish to communicate with each other forms an \emph{application group}.
An application group is a subset of the universe group.
A subset of nodes in the application group can form a \emph{multicast group}.

It is possible to join and leave any multicast and application group.

\textbf{IDs:}
Each application group is identified by a \emph{group ID}.
The group ID in turn identifies the instance of the routing library to use.
The group ID is unique within the universe group.

Each multicast group is identified by a \emph{multicast ID}.
The multicast ID is unique within the application group.

When nodes join an application group, they are assigned a \emph{node ID}.
The node ID is unique within the application group.

Each node is also given an \emph{application broadcast ID}.
These may or may not be unique and are drawn from a set that
may just have a single element.

The union of the sets of node IDs, multicast IDs, and application broadcast IDs
is called the set of \emph{destination IDs}.
The sets of node IDs, multicast IDs, and application broadcast IDs are disjoint.

\textbf{Messaging:}
It is not possible to communicate with nodes in the universe group that
are not in the application group.

A node can send a message to another node in the application group by
sending a message to the appropriate node ID.
A node can send a message to all nodes in the application group by
sending a message to the application broadcast ID provided to it.
A node can send a message to all nodes in a multicast group by
sending a message to the multicast ID provided to it
when it joined the multicast group.

\textbf{Types of messages:}
\begin{itemize}
\item \emph{Unicast:} Send a message to a single node in the application group.
\item \emph{Broadcast:} Send a message to all nodes in the application group.
\item \emph{Multicast:} Send a message to all nodes in the multicast group.
\end{itemize}

\section{Semantics}\label{sec:semantics}

The routing layer will provide a uniform set of semantics to the
application layer regardless of the set of semantics the
IDC mechanisms below it provide.
It can provide different semantics to suit the
needs of different application scenarios.

Below, we discuss the different sets of semantics it can provide.

\subsection{Set 1: Single source FIFO}
The set of semantics provided is as follows:

\begin{itemize}
\item Reliability:
  A message is delivered only once and only if it was sent earlier.
  Each message is eventually delivered and the contents are not corrupted.
\item Single source FIFO ordering:
  If a sender sends $m$ before $m'$, then $m$ is delivered before $m'$
  at all receivers.
\item Failure:
  The routing library will not fail.
\item Payload:
  The IDC can deliver an arbitrarily sized message.
\end{itemize}

\subsection{Set 2: Causal order}
The set of semantics provided is as follows:

\begin{itemize}
\item Reliability:
  A message is delivered only once and only if it was sent earlier.
  Each message is eventually delivered and the contents are not corrupted.
\item Causal ordering:
  If the delivery of message $m$ depends upon the delivery of message $m'$ as
  per the \emph{happened before} relationship \cite{Lamport:1978:TCO:359545.359563},
  then $m$ is not delivered until $m'$ has been delivered.
\item Failure:
  The routing library will not fail.
\item Payload:
  The IDC can deliver an arbitrarily sized message.
\end{itemize}

\subsection{Set 3: Total order}
The set of semantics provided is as follows:

\begin{itemize}
\item Reliability:
  A message is delivered only once and only if it was sent earlier.
  Each message is eventually delivered and the contents are not corrupted.
\item Total order:
  All messages are delivered to all nodes in the same order.
\item Failure:
  The routing library will not fail.
\item Payload:
  The IDC can deliver an arbitrarily sized message.
\end{itemize}

It is possible to order messages using various types of ordering mechanisms.
Investigation of this remains future work.

\subsection{Additional sets}
In the future, if we choose to provide additional sets of semantics,
they will be listed here.
They could include semantics weaker than the above if the underlying IDC
mechanisms are too expensive.
Some examples are providing just reliability, or not even providing
reliability.

\section{Interface}
We discuss the interface for group management and sending/receiving of messages.

\subsection{Group management}

\textbf{Creating groups:}
Before nodes can join a group, the group needs to be created.
Any dispatcher in the system can create a new application group,
and any node in an application group can create a new multicast group
within the application group.

The library returns to the application the group ID or
multicast ID of the created group.

\textbf{Updating a group:}
A dispatcher can join any application group by calling join on
the application group ID.
A node can join any multicast group within the application group it is part of.
When the join has finished, the node gets the join callback from the library.
When a dispatcher is done joining an application group,
it can query the library for its node ID and application broadcast ID.

Similarly, a node can leave a group at any time by calling leave on the group ID.
When the leave is done, the application will get a leave callback.
A dispatcher should call leave before it exits the system.

The behavior of the group is undefined while membership is in flux:
new links are being created, old links are being torn down, and
messages may not reach their proper destination.
If such guarantees are required at all times in the application,
the application must refrain from sending messages while
group membership is in flux.

\section{Flow control}
The IDC mechanisms that the routing library operates over are asynchronous.
When a message is sent over them,
it will eventually be delivered to the receiver.
Undelivered messages are maintained in a queue of fixed length.
If the sender tries to send messages too quickly, the queue can fill up.
If the queue is full, the sender must wait
until the queue has room for more messages.
IDC mechanisms allow senders to register callbacks in such situations:
when a send fails with the transient error that
the queue is full, the sender can register
a callback which will be called when the next send should not fail.
While the sender waits for the callback, it has to hold on to the unsent message.

We discuss the simple scenario of two nodes and
then a more complex scenario of many nodes.

\subsection{Client-server model}

\begin{figure}[t]
 \includegraphics[width=\columnwidth]{client-server.pdf}
 \caption{Client-server model}\label{fig:client-server}
\end{figure}

Figure~\ref{fig:client-server} shows a simple client-server model
of two nodes that are directly connected.
The weight on an edge is the length of that link's queue.
The client sends messages to the server; the server processes them
and sends replies to the client.
It is possible that link1 becomes full, for example because
the client has not been handling replies on it.
At this point the server has some options:%

\begin{itemize}
\item Drop the message.
  The server can simply drop the message if the queue is full.
  This will result in an unreliable channel.
\item Allocate some resources and queue the message up.
  This implies an unbounded resource requirement.
\item Apply back pressure on the client.
  The server can stop processing messages
  from the client until it is able to send messages to it again.
\end{itemize}

In this model the last option works well, as it will force the client
to slow down and process replies before sending more requests.

\subsection{Multihop route}

\begin{figure}[t]
 \includegraphics[width=\columnwidth]{many-to-many.pdf}
 \caption{Example multihop route}\label{fig:many-to-many}
\end{figure}

In this scenario the problem is more complex.
Figure~\ref{fig:many-to-many} shows a group of 5 nodes;
the link weights specify the queue lengths of the links.
If node1 and node2 are sending messages to node4,
link34 will fill up before link13 or link23 does.
Node3 then cannot send additional messages to node4.
At this point, node3 has the following options:

\begin{itemize}
\item Continue to process incoming messages on link13 and link23.
  If they are destined for node4, drop them.
  This will result in an unreliable channel.
\item Continue to process incoming messages and, if they are destined for node4,
  queue them up locally.
  This implies an unbounded resource requirement.
\item Stop processing messages on link13 and link23.
  This will delay messages on those links that were not destined for node4.
  In the literature, this is called \emph{head of line blocking} \cite{cite}.
\end{itemize}

None of these solutions is particularly desirable.
Flow control in the context of different types of networks has
been studied previously.

Relevant existing work includes:
\begin{itemize}
\item Credit-based flow control:
  The endpoints dictate the maximum number of messages in flight.
\item TCP flow control
\item Ethernet flow control
\item Datacenter ethernet flow control
\item Related work in routing between sockets on a machine
\item QoS (DiffServ and IntServ).
\end{itemize}

Some applications may not be willing to pay the cost of flow control.
Further, a flow control mechanism that guarantees reliability
may not scale well with the number of nodes.
We discuss some abstract ideas we have for flow control below.

\section{Open questions}
Some open questions remain; we discuss them below.

\subsection{Reservation of resources with end-to-end flow control}
The essential idea is that when a route is established,
reservations for some number of in-flight messages are made for the route.
Even though the links might be shared,
no other routing path is allowed to use the reserved resources.
The endpoints must then limit the number of in-flight messages.
If they exceed it, the library can deliver an error at the endpoint, or
optimistically try to deliver the message and drop it if it is unable to.

For example, in figure~\ref{fig:many-to-many},
if reservations for two messages are made for the routing path
from node1 to node4, node1 and node3 will each maintain a queue of size 2.
Whenever they receive a message from the application in node1 destined for node4
and they are not able to send it on the link,
they can locally store the message.
Eventually, when the link has space in it,
they can try to send the message again.

It remains to be seen if the approach can work and scale well.
\note{Cite work on switching networks and other works that make reservations and
  give guarantees.}

This approach works when the nodes that are sharing the links
cooperate with one another.
However, if the link is shared between distrustful nodes,
then additional guarantees of fairness and freedom from starvation
may be required.

\begin{figure}[t]
 \includegraphics[width=\columnwidth]{client-monitor.pdf}
 \caption{A network of monitors with clients}\label{fig:client-monitor}
\end{figure}

Figure~\ref{fig:client-monitor} shows a network of two monitors.
Each monitor is connected to two clients.
Clients A1 and A2 are cooperating, and clients B1 and B2 are cooperating.
The clients want to send messages that must go through the monitors,
such as capability transfers.
If one pair of clients is aggressive in sending messages,
it may fill up the link between the monitors and impact the performance
of the other pair of clients.
In this scenario, the link between the monitors can be seen
as a common system resource that is multiplexed between the users.
The monitors should guarantee some fairness to the users of this link.

\subsection{Discovering node IDs}
When a set of dispatchers joins an application group,
each of them is assigned a node ID.
The nodes need some mechanism of discovering the IDs of each other
so that they can message each other.

The discovery service will be built on top of the routing library
and can be designed in multiple ways.
Nodes can send broadcasts to each other informing each other of their node IDs,
they can use the name service, etc.

\chapter{Interesting benchmarks}\label{chap:benchmarks}

Some benchmarks that validate the claims made above and show the performance of the library.

\note{Costs of one-to-many channels. One sender, multiple readers.}

\note{Comparison of routes with no forwarding and routes with forwarding.}

\note{Resource requirements for channels, memory and CPU time.}

\note{Cost of the group membership update mechanism.}

\chapter{Fun research questions}

\begin{itemize}
\item Flow control
\item Link state vs. distance vector routing
\end{itemize}

\bibliographystyle{abbrv}
\bibliography{defs,barrelfish}

\end{document}