.. SPDX-License-Identifier: GPL-2.0

=============================================
Open vSwitch datapath developer documentation
=============================================

The Open vSwitch kernel module allows flexible userspace control over
flow-level packet processing on selected network devices.  It can be
used to implement a plain Ethernet switch, network device bonding,
VLAN processing, network access control, flow-based network control,
and so on.

The kernel module implements multiple "datapaths" (analogous to
bridges), each of which can have multiple "vports" (analogous to ports
within a bridge).  Each datapath also has associated with it a "flow
table" that userspace populates with "flows" that map from keys based
on packet headers and metadata to sets of actions.  The most common
action forwards the packet to another vport; other actions are also
implemented.

When a packet arrives on a vport, the kernel module processes it by
extracting its flow key and looking it up in the flow table.  If there
is a matching flow, it executes the associated actions.  If there is
no match, it queues the packet to userspace for processing (as part of
its processing, userspace will likely set up a flow to handle further
packets of the same type entirely in-kernel).
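
The receive path described above can be modelled in a few lines.  This is
only an illustrative sketch: ``Datapath``, ``extract_flow_key`` and the
packet dictionary layout are hypothetical names chosen for the example and
do not correspond to kernel symbols.

```python
# Hypothetical model of the receive path; none of these names are kernel APIs.
def extract_flow_key(packet):
    """Build a hashable key from assumed metadata and header fields."""
    return (packet["in_port"], packet["eth_src"],
            packet["eth_dst"], packet["eth_type"])


class Datapath:
    def __init__(self):
        self.flow_table = {}  # flow key -> list of actions
        self.upcalls = []     # (packet, key) pairs queued for userspace

    def receive(self, packet):
        key = extract_flow_key(packet)
        actions = self.flow_table.get(key)
        if actions is None:
            # Miss: queue the packet together with the key that was parsed.
            self.upcalls.append((packet, key))
            return None
        return actions


dp = Datapath()
pkt = {"in_port": 1, "eth_src": "e0:91:f5:21:d0:b2",
       "eth_dst": "00:02:e3:0f:80:a4", "eth_type": 0x0800}

first = dp.receive(pkt)             # miss: queued for userspace
key = dp.upcalls[0][1]
dp.flow_table[key] = ["output:2"]   # userspace installs a flow for this key
second = dp.receive(pkt)            # hit: handled from the flow table
```

After the flow is installed, the second packet never reaches the upcall
queue, which is the "entirely in-kernel" case mentioned above.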


Flow key compatibility
----------------------

Network protocols evolve over time.  New protocols become important
and existing protocols lose their prominence.  For the Open vSwitch
kernel module to remain relevant, it must be possible for newer
versions to parse additional protocols as part of the flow key.  It
might even be desirable, someday, to drop support for parsing
protocols that have become obsolete.  Therefore, the Netlink interface
to Open vSwitch is designed to allow carefully written userspace
applications to work with any version of the flow key, past or future.

To support this forward and backward compatibility, whenever the
kernel module passes a packet to userspace, it also passes along the
flow key that it parsed from the packet.  Userspace then extracts its
own notion of a flow key from the packet and compares it against the
kernel-provided version:

    - If userspace's notion of the flow key for the packet matches the
      kernel's, then nothing special is necessary.

    - If the kernel's flow key includes more fields than the userspace
      version of the flow key, for example if the kernel decoded IPv6
      headers but userspace stopped at the Ethernet type (because it
      does not understand IPv6), then again nothing special is
      necessary.  Userspace can still set up a flow in the usual way,
      as long as it uses the kernel-provided flow key to do it.

    - If the userspace flow key includes more fields than the
      kernel's, for example if userspace decoded an IPv6 header but
      the kernel stopped at the Ethernet type, then userspace can
      forward the packet manually, without setting up a flow in the
      kernel.  This case is bad for performance because every packet
      that the kernel considers part of the flow must go to userspace,
      but the forwarding behavior is correct.  (If userspace can
      determine that the values of the extra fields would not affect
      forwarding behavior, then it could set up a flow anyway.)
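
The three cases above reduce to one question: did userspace decode any
attribute that the kernel did not?  A minimal sketch follows, assuming flow
keys are represented as dictionaries from attribute name to value.  That
representation and the ``decide`` helper are assumptions made for this
example only.

```python
def decide(kernel_key, user_key):
    """Return (decision, key_to_install) for a packet sent up by the kernel.

    Both keys are assumed to be dicts mapping attribute names to values.
    """
    extra_in_userspace = set(user_key) - set(kernel_key)
    if extra_in_userspace:
        # Userspace parsed further than the kernel: forward by hand.
        return ("forward_manually", None)
    # Same fields, or the kernel parsed further: install the kernel's key.
    return ("install", kernel_key)


eth_only = {"eth": "...", "eth_type": 0x86DD}
with_ipv6 = {"eth": "...", "eth_type": 0x86DD, "ipv6": "..."}

same = decide(eth_only, eth_only)
kernel_knows_more = decide(with_ipv6, eth_only)
user_knows_more = decide(eth_only, with_ipv6)
```

The sketch leaves out the optional refinement from the last case, where
userspace installs a flow anyway because the extra fields cannot affect
forwarding.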

How flow keys evolve over time is important to making this work, so
the following sections go into detail.


Flow key format
---------------

A flow key is passed over a Netlink socket as a sequence of Netlink
attributes.  Some attributes represent packet metadata, defined as any
information about a packet that cannot be extracted from the packet
itself, e.g. the vport on which the packet was received.  Most
attributes, however, are extracted from headers within the packet,
e.g. source and destination addresses from Ethernet, IP, or TCP
headers.

The <linux/openvswitch.h> header file defines the exact format of the
flow key attributes.  For informal explanatory purposes here, we write
them as comma-separated strings, with parentheses indicating arguments
and nesting.  For example, the following could represent a flow key
corresponding to a TCP packet that arrived on vport 1::

    in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
    eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=6, tos=0,
    frag=no), tcp(src=49163, dst=80)

Often we ellipsize arguments not important to the discussion, e.g.::

    in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)


Wildcarded flow key format
--------------------------

A wildcarded flow is described with two sequences of Netlink attributes
passed over the Netlink socket: a flow key, exactly as described above, and an
optional corresponding flow mask.

A wildcarded flow can represent a group of exact match flows. Each '1' bit
in the mask specifies an exact match with the corresponding bit in the flow key.
A '0' bit specifies a don't care bit, which will match either a '1' or '0' bit
of an incoming packet. Using wildcarded flows can improve the flow setup rate
by reducing the number of new flows that the user space program needs to process.
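
The bit-level rule above can be written as a single comparison.  The sketch
below applies it to one 32-bit field, an IPv4 address.  ``masked_match`` and
``ip`` are helper names invented for the example, and a real flow key
contains many fields rather than one integer.

```python
import ipaddress


def masked_match(packet_value, key, mask):
    """True if the packet agrees with the key on every '1' bit of the mask."""
    return (packet_value & mask) == (key & mask)


def ip(text):
    """Convert dotted-quad text to a 32-bit integer."""
    return int(ipaddress.IPv4Address(text))


key = ip("172.16.0.20")
mask = ip("255.255.0.0")  # exact match on the top 16 bits, don't care below

hit = masked_match(ip("172.16.200.7"), key, mask)
miss = masked_match(ip("172.18.0.52"), key, mask)
exact_only = masked_match(ip("172.16.200.7"), key, ip("255.255.255.255"))
```

With the wildcarded mask, a single flow covers every source address in
172.16.0.0/16.  With the all-ones mask, each distinct address would need its
own flow setup.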

Support for the mask Netlink attribute is optional for both the kernel and user
space program. The kernel can ignore the mask attribute, installing an exact
match flow, or reduce the number of don't care bits in the kernel to less than
what was specified by the user space program. In this case, variations in bits
that the kernel does not implement will simply result in additional flow setups.
The kernel module will also work with user space programs that neither support
nor supply flow mask attributes.

Since the kernel may ignore or modify wildcard bits, it can be difficult for
the userspace program to know exactly what matches are installed. There are
two possible approaches: reactively install flows as they miss the kernel
flow table (and therefore not attempt to determine wildcard changes at all)
or use the kernel's response messages to determine the installed wildcards.

When interacting with userspace, the kernel should maintain the match portion
of the key exactly as originally installed. This provides a handle to
identify the flow for all future operations. However, when reporting the
mask of an installed flow, the mask should include any restrictions imposed
by the kernel.
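
One way to picture a kernel that reduces the number of don't care bits is
to OR the requested mask with the bits it is unable to wildcard.  The sketch
below is a hypothetical model of that idea for a single 16-bit field.
``installed_mask`` and the ``wildcardable_bits`` parameter are assumptions
made for illustration and are not part of the Netlink interface.

```python
FIELD_BITS = 0xFFFF  # width of the illustrative 16-bit field


def installed_mask(requested_mask, wildcardable_bits):
    """Mask reported back under this model.

    Any bit the (hypothetical) kernel cannot wildcard is forced to
    exact match, so the reported mask never has more don't care bits
    than the requested one.
    """
    forced_exact = ~wildcardable_bits & FIELD_BITS
    return (requested_mask | forced_exact) & FIELD_BITS


requested = 0xFF00  # userspace asks to wildcard the low 8 bits

partial = installed_mask(requested, 0x000F)      # only the low 4 bits allowed
exact_only = installed_mask(requested, 0x0000)   # mask attribute ignored
unchanged = installed_mask(requested, 0xFFFF)    # full wildcard support
```

In this model the flow key itself is never altered.  Only the reported mask
changes, which matches the requirement that the key stay usable as a handle.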

The behavior when using overlapping wildcarded flows is undefined. It is the
responsibility of the user space program to ensure that any incoming packet
can match at most one flow, wildcarded or not. The current implementation
performs best-effort detection of overlapping wildcarded flows and may reject
some but not all of them. However, this behavior may change in future versions.
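
A user space program can check for overlap before installing a flow.  Two
wildcarded flows can match the same packet exactly when their keys agree on
every bit that both masks treat as exact match.  The sketch below states
that condition for single-integer keys.  It is not the kernel's best-effort
detection logic, and ``flows_overlap`` is a name invented for the example.

```python
def flows_overlap(key_a, mask_a, key_b, mask_b):
    """True if some packet could match both (key, mask) pairs."""
    both_exact = mask_a & mask_b
    return ((key_a ^ key_b) & both_exact) == 0


# 172.16.0.0/16 against the exact address 172.16.0.20: they overlap.
wide_vs_exact = flows_overlap(0xAC100000, 0xFFFF0000,
                              0xAC100014, 0xFFFFFFFF)

# 172.16.0.0/16 against 172.18.0.0/16: they are disjoint.
disjoint = flows_overlap(0xAC100000, 0xFFFF0000,
                         0xAC120000, 0xFFFF0000)
```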


Unique flow identifiers
-----------------------

An alternative to using the original match portion of a key as the handle for
flow identification is a unique flow identifier, or "UFID". UFIDs are optional
for both the kernel and user space program.

User space programs that support UFID are expected to provide it during flow
setup in addition to the flow, then refer to the flow using the UFID for all
future operations. The kernel is not required to index flows by the original
flow key if a UFID is specified.
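
The consequence of the last sentence is easiest to see in a toy flow table
that keeps two indexes.  This is a sketch under stated assumptions: the
``FlowTable`` class is hypothetical, and generating the identifier with
``uuid.uuid4()`` is just a convenient stand-in for whatever identifier
userspace chooses.

```python
import uuid


class FlowTable:
    def __init__(self):
        self.by_ufid = {}  # UFID bytes -> (flow key, actions)
        self.by_key = {}   # flow key -> actions, for flows without a UFID

    def add(self, key, actions, ufid=None):
        if ufid is not None:
            self.by_ufid[ufid] = (key, actions)
        else:
            self.by_key[key] = actions

    def get(self, key=None, ufid=None):
        if ufid is not None:
            entry = self.by_ufid.get(ufid)
            return entry[1] if entry else None
        return self.by_key.get(key)


table = FlowTable()
ufid = uuid.uuid4().bytes
key = ("in_port(1)", "eth_type(0x0800)")

table.add(key, ["output:2"], ufid=ufid)
found_by_ufid = table.get(ufid=ufid)
found_by_key = table.get(key=key)  # None: this table did not index the key
```

A program that installs a flow with a UFID should therefore keep using the
UFID and not rely on lookups by the original key.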


Basic rule for evolving flow keys
---------------------------------

Some care is needed to really maintain forward and backward
compatibility for applications that follow the rules listed under
"Flow key compatibility" above.

The basic rule is obvious::

    ==================================================================
    New network protocol support must only supplement existing flow
    key attributes.  It must not change the meaning of already defined
    flow key attributes.
    ==================================================================

This rule does have less-obvious consequences so it is worth working
through a few examples.  Suppose, for example, that the kernel module
did not already implement VLAN parsing.  Instead, it just interpreted
the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the
packet.  The flow key for any packet with an 802.1Q header would look
essentially like this, ignoring metadata::

    eth(...), eth_type(0x8100)

Naively, to add VLAN support, it makes sense to add a new "vlan" flow
key attribute to contain the VLAN tag, then continue to decode the
encapsulated headers beyond the VLAN tag using the existing field
definitions.  With this change, a TCP packet in VLAN 10 would have a
flow key much like this::

    eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)

But this change would negatively affect a userspace application that
has not been updated to understand the new "vlan" flow key attribute.
The application could, following the flow compatibility rules above,
ignore the "vlan" attribute that it does not understand and therefore
assume that the flow contained IP packets.  This is a bad assumption
(the flow only contains IP packets if one parses and skips over the
802.1Q header) and it could cause the application's behavior to change
across kernel versions even though it follows the compatibility rules.

The solution is to use a set of nested attributes.  This is, for
example, why 802.1Q support uses nested attributes.  A TCP packet in
VLAN 10 is actually expressed as::

    eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
    ip(proto=6, ...), tcp(...))

Notice how the "eth_type", "ip", and "tcp" flow key attributes are
nested inside the "encap" attribute.  Thus, an application that does
not understand the "vlan" key will not see any of those attributes
and therefore will not misinterpret them.  (Also, the outer eth_type
is still 0x8100, not changed to 0x0800.)
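
The difference between the naive and the nested layout shows up when an
older application skips the attributes it does not know.  The sketch below
models flow keys as lists of ``(name, value)`` pairs with nested lists for
``encap``.  That representation and the ``understood`` helper are
assumptions made for this example, not the Netlink wire format.

```python
def understood(attrs, known):
    """Return the attributes an application sees, skipping unknown ones.

    Skipping an unknown attribute also skips everything nested inside it.
    """
    result = {}
    for name, value in attrs:
        if name not in known:
            continue
        if isinstance(value, list):
            value = understood(value, known)
        result[name] = value
    return result


inner = [("eth_type", 0x0800), ("ip", {"proto": 6}), ("tcp", "...")]
naive_key = [("eth", "..."), ("vlan", {"vid": 10, "pcp": 0})] + inner
nested_key = [("eth", "..."), ("eth_type", 0x8100),
              ("vlan", {"vid": 10, "pcp": 0}), ("encap", inner)]

legacy = {"eth", "eth_type", "ip", "tcp"}        # predates VLAN support
vlan_aware = legacy | {"vlan", "encap"}

naive_view = understood(naive_key, legacy)       # looks like plain IP traffic
nested_view = understood(nested_key, legacy)     # stops at eth_type 0x8100
aware_view = understood(nested_key, vlan_aware)  # sees the inner headers
```

With the naive layout the legacy application believes it is looking at an
untagged IP flow.  With the nested layout it sees only ``eth`` and an
``eth_type`` of 0x8100, which is what it would have seen before VLAN support
existed.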

Handling malformed packets
--------------------------

Don't drop packets in the kernel for malformed protocol headers, bad
checksums, etc.  This would prevent userspace from implementing a
simple Ethernet switch that forwards every packet.

Instead, in such a case, include an attribute with "empty" content.
It doesn't matter if the empty content could be valid protocol values,
as long as those values are rarely seen in practice, because userspace
can always arrange for all packets with those values to be sent to
userspace and handle them individually.

For example, consider a packet that contains an IP header that
indicates protocol 6 for TCP, but which is truncated just after the IP
header, so that the TCP header is missing.  The flow key for this
packet would include a tcp attribute with all-zero src and dst, like
this::

    eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)

As another example, consider a packet with an Ethernet type of 0x8100,
indicating that a VLAN TCI should follow, but which is truncated just
after the Ethernet type.  The flow key for this packet would include
an all-zero-bits vlan and an empty encap attribute, like this::

    eth(...), eth_type(0x8100), vlan(0), encap()

Unlike a TCP packet with source and destination ports 0, an
all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka
VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan
attribute expressly to allow this situation to be distinguished.
Thus, the flow key in this second example unambiguously indicates a
missing or malformed VLAN TCI.
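
A reader of the vlan attribute can apply that distinction with one bit
test.  The sketch below assumes the usual 16-bit TCI layout (3-bit PCP, the
CFI bit at 0x1000, 12-bit VID).  ``describe_vlan`` is a hypothetical helper,
and returning ``None`` for the malformed case is a choice made only for
this example.

```python
VLAN_CFI = 0x1000  # CFI bit of the TCI, set in well-formed vlan attributes


def describe_vlan(tci):
    """Decode a vlan attribute value, or return None if the tag is missing."""
    if not tci & VLAN_CFI:
        return None  # e.g. vlan(0): missing or malformed TCI
    return {"vid": tci & 0x0FFF, "pcp": (tci >> 13) & 0x7}


truncated = describe_vlan(0x0000)                  # from vlan(0), encap()
vlan_zero = describe_vlan(VLAN_CFI)                # real tag with vid=0, pcp=0
vlan_ten = describe_vlan(VLAN_CFI | (5 << 13) | 10)
```

A genuine priority tag with VID 0 therefore stays distinguishable from a
truncated packet, even though both carry an all-zero VID and PCP.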

Other rules
-----------

The other rules for flow keys are much less subtle:

    - Duplicate attributes are not allowed at a given nesting level.

    - Ordering of attributes is not significant.

    - When the kernel sends a given flow key to userspace, it always
      composes it the same way.  This allows userspace to hash and
      compare entire flow keys that it may not be able to fully
      interpret.
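
Two of these rules translate directly into small checks.  The sketch below
reuses an assumed representation of flow keys as lists of ``(name, value)``
pairs and treats the key received from the kernel as an opaque byte string
for hashing.  ``has_duplicates`` and ``opaque_handle`` are names invented
for the example, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib


def has_duplicates(attrs):
    """True if any attribute name repeats within one nesting level."""
    names = [name for name, _ in attrs]
    if len(names) != len(set(names)):
        return True
    return any(has_duplicates(value)
               for _, value in attrs if isinstance(value, list))


def opaque_handle(raw_key):
    """Hash a flow key as raw bytes, without interpreting its contents."""
    return hashlib.sha256(raw_key).hexdigest()


# The same name at different nesting levels is not a duplicate.
nested = [("eth_type", 0x8100), ("encap", [("eth_type", 0x0800)])]
repeated = [("eth_type", 0x8100), ("eth_type", 0x0800)]

nested_ok = not has_duplicates(nested)
repeated_bad = has_duplicates(repeated)
same_handle = opaque_handle(b"\x01\x02") == opaque_handle(b"\x01\x02")
```

Hashing the raw bytes works only because the kernel always composes a given
key the same way.  Since ordering is otherwise not significant, a key built
by userspace may order its attributes differently, so its raw bytes should
not be compared against the kernel's.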