.. SPDX-License-Identifier: GPL-2.0

VMbus
=====
VMbus is a software construct provided by Hyper-V to guest VMs.  It
consists of a control path and common facilities used by synthetic
devices that Hyper-V presents to guest VMs.  The control path is
used to offer synthetic devices to the guest VM and, in some cases,
to rescind those devices.  The common facilities include software
channels for communicating between the device driver in the guest VM
and the synthetic device implementation that is part of Hyper-V, and
signaling primitives to allow Hyper-V and the guest to interrupt
each other.

VMbus is modeled in Linux as a bus, with the expected /sys/bus/vmbus
entry in a running Linux guest.  The VMbus driver (drivers/hv/vmbus_drv.c)
establishes the VMbus control path with the Hyper-V host, then
registers itself as a Linux bus driver.  It implements the standard
bus functions for adding and removing devices to/from the bus.

Most synthetic devices offered by Hyper-V have a corresponding Linux
device driver.  These devices include:

* SCSI controller
* NIC
* Graphics frame buffer
* Keyboard
* Mouse
* PCI device pass-thru
* Heartbeat
* Time Sync
* Shutdown
* Memory balloon
* Key/Value Pair (KVP) exchange with Hyper-V
* Hyper-V online backup (a.k.a. VSS)

Guest VMs may have multiple instances of the synthetic SCSI
controller, synthetic NIC, and PCI pass-thru devices.  Other
synthetic devices are limited to a single instance per VM.  Not
listed above are a small number of synthetic devices offered by
Hyper-V that are used only by Windows guests and for which Linux
does not have a driver.

Hyper-V uses the terms "VSP" and "VSC" in describing synthetic
devices.  "VSP" refers to the Hyper-V code that implements a
particular synthetic device, while "VSC" refers to the driver for
the device in the guest VM.  For example, the Linux driver for the
synthetic NIC is referred to as "netvsc" and the Linux driver for
the synthetic SCSI controller is "storvsc".  These drivers contain
functions with names like "storvsc_connect_to_vsp".
VMbus channels
--------------
An instance of a synthetic device uses VMbus channels to communicate
between the VSP and the VSC.  Channels are bi-directional and used
for passing messages.  Most synthetic devices use a single channel,
but the synthetic SCSI controller and synthetic NIC may use multiple
channels to achieve higher performance and greater parallelism.

Each channel consists of two ring buffers.  These are classic ring
buffers from a university data structures textbook.  If the read
and write pointers are equal, the ring buffer is considered to be
empty, so a full ring buffer always has at least one byte unused.
The "in" ring buffer is for messages from the Hyper-V host to the
guest, and the "out" ring buffer is for messages from the guest to
the Hyper-V host.  In Linux, the "in" and "out" designations are as
viewed by the guest side.  The ring buffers are memory that is
shared between the guest and the host, and they follow the standard
paradigm where the memory is allocated by the guest, with the list
of GPAs that make up the ring buffer communicated to the host.  Each
ring buffer consists of a header page (4 Kbytes) with the read and
write indices and some control flags, followed by the memory for the
actual ring.  The size of the ring is determined by the VSC in the
guest and is specific to each synthetic device.  The list of GPAs
making up the ring is communicated to the Hyper-V host over the
VMbus control path as a GPA Descriptor List (GPADL).  See function
vmbus_establish_gpadl().
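
For reference, the header at the start of each ring buffer has
roughly the following layout.  This is a simplified sketch of
struct hv_ring_buffer in include/linux/hyperv.h; reserved fields
and feature bits are omitted::

    struct hv_ring_buffer {
            u32 write_index;     /* offset of the next byte to write */
            u32 read_index;      /* offset of the next byte to read */
            u32 interrupt_mask;  /* reader sets to suppress interrupts */
            u32 pending_send_sz; /* flow-control hint from the writer */

            /* ... padding so ring data starts on a page boundary ... */

            u8 buffer[];         /* the ring data itself */
    };

The indices are byte offsets into buffer[], so write_index ==
read_index is exactly the "empty" condition described above.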

Each ring buffer is mapped into contiguous Linux kernel virtual
space in three parts:  1) the 4 Kbyte header page, 2) the memory
that makes up the ring itself, and 3) a second mapping of the memory
that makes up the ring itself.  Because (2) and (3) are contiguous
in kernel virtual space, the code that copies data to and from the
ring buffer need not be concerned with ring buffer wrap-around.
Once a copy operation has completed, the read or write index may
need to be reset to point back into the first mapping, but the
actual data copy does not need to be broken into two parts.  This
approach also allows complex data structures to be easily accessed
directly in the ring without handling wrap-around.
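
A minimal sketch of why this works, assuming ring_base points at
the first of the two contiguous ring mappings (the helper name and
parameters here are hypothetical; the real copy logic is in
drivers/hv/ring_buffer.c)::

    static void ring_copy_out(const u8 *ring_base, u32 ring_size,
                              u32 read_index, void *dest, u32 len)
    {
            /*
             * Even if read_index + len runs past ring_size, the
             * second mapping makes the wrapped bytes virtually
             * contiguous, so a single memcpy suffices.
             */
            memcpy(dest, ring_base + read_index, len);
    }

Afterward the caller advances the index modulo the ring size, e.g.
read_index = (read_index + len) % ring_size, so that the index
always points back into the first mapping.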

On arm64 with page sizes > 4 Kbytes, the header page must still be
passed to Hyper-V as a 4 Kbyte area.  But the memory for the actual
ring must be aligned to PAGE_SIZE and have a size that is a multiple
of PAGE_SIZE so that the duplicate mapping trick can be done.  Hence
a portion of the header page is unused and not communicated to
Hyper-V.  This case is handled by vmbus_establish_gpadl().

Hyper-V enforces a limit on the aggregate amount of guest memory
that can be shared with the host via GPADLs.  This limit ensures
that a rogue guest can't force the consumption of excessive host
resources.  For Windows Server 2019 and later, this limit is
approximately 1280 Mbytes.  For versions prior to Windows Server
2019, the limit is approximately 384 Mbytes.

VMbus messages
--------------
All VMbus messages have a standard header that includes the message
length, the offset of the message payload, some flags, and a
transactionID.  The portion of the message after the header is
unique to each VSP/VSC pair.

Messages follow one of two patterns:

* Unidirectional:  Either side sends a message and does not
  expect a response message
* Request/response:  One side (usually the guest) sends a message
  and expects a response

The transactionID (a.k.a. "requestID") is for matching requests &
responses.  Some synthetic devices allow multiple requests to be
in-flight simultaneously, so the guest specifies a transactionID
when sending a request.  Hyper-V sends back the same transactionID
in the matching response.
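
This standard header is struct vmpacket_descriptor in
include/linux/hyperv.h, which looks roughly like this::

    struct vmpacket_descriptor {
            u16 type;     /* e.g. VM_PKT_DATA_INBAND, VM_PKT_COMP */
            u16 offset8;  /* payload offset from header, in 8-byte units */
            u16 len8;     /* total packet length, in 8-byte units */
            u16 flags;    /* e.g. "completion requested" */
            u64 trans_id; /* the transactionID described above */
    } __packed;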

Messages passed between the VSP and VSC are control messages.  For
example, a message sent from the storvsc driver might be "execute
this SCSI command".  If a message also implies some data transfer
between the guest and the Hyper-V host, the actual data to be
transferred may be embedded with the control message, or it may be
specified as a separate data buffer that the Hyper-V host will
access as a DMA operation.  The former case is used when the size of
the data is small and the cost of copying the data to and from the
ring buffer is minimal.  For example, time sync messages from the
Hyper-V host to the guest contain the actual time value.  When the
data is larger, a separate data buffer is used.  In this case, the
control message contains a list of GPAs that describe the data
buffer.  For example, the storvsc driver uses this approach to
specify the data buffers to/from which disk I/O is done.

Three functions exist to send VMbus messages:

1. vmbus_sendpacket():  Control-only messages and messages with
   embedded data -- no GPAs (see the sketch after this list)
2. vmbus_sendpacket_pagebuffer(): Message with list of GPAs
   identifying data to transfer.  An offset and length are
   associated with each GPA so that multiple discontinuous areas
   of guest memory can be targeted.
3. vmbus_sendpacket_mpb_desc(): Message with list of GPAs
   identifying data to transfer.  A single offset and length are
   associated with a list of GPAs.  The GPAs must describe a
   single logical area of guest memory to be targeted.
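
As a sketch, sending a control-only request that expects a response
might look like the following.  The packet type and flag are real
constants from include/linux/hyperv.h, but the request structure
and the transactionID value are illustrative::

    struct my_vsp_request req = { .opcode = 1 }; /* hypothetical */
    int ret;

    ret = vmbus_sendpacket(channel, &req, sizeof(req),
                           42, /* transactionID chosen by the driver */
                           VM_PKT_DATA_INBAND,
                           VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
    if (ret)
            return ret; /* e.g. the "out" ring buffer is full */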

Historically, Linux guests have trusted Hyper-V to send well-formed
and valid messages, and Linux drivers for synthetic devices did not
fully validate messages.  With the introduction of processor
technologies that fully encrypt guest memory and that allow the
guest to not trust the hypervisor (AMD SEV-SNP, Intel TDX), trusting
the Hyper-V host is no longer a valid assumption.  The drivers for
VMbus synthetic devices are being updated to fully validate any
values read from memory that is shared with Hyper-V, which includes
messages from VMbus devices.  To facilitate such validation,
messages read by the guest from the "in" ring buffer are copied to a
temporary buffer that is not shared with Hyper-V.  Validation is
performed in this temporary buffer without the risk of Hyper-V
maliciously modifying the message after it is validated but before
it is used.
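
The pattern is "copy first, then validate only the private copy".
A minimal sketch, where shared_ring_ptr and max_pkt_size are
illustrative stand-ins::

    struct vmpacket_descriptor desc;

    /* Copy the packet header out of the shared "in" ring buffer... */
    memcpy(&desc, shared_ring_ptr, sizeof(desc));

    /*
     * ...then validate only the private copy, which Hyper-V cannot
     * modify after the fact.
     */
    if (desc.len8 * 8 > max_pkt_size)
            return -EINVAL; /* reject the malformed packet */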

VMbus interrupts
----------------
VMbus provides a mechanism for the guest to interrupt the host when
the guest has queued new messages in a ring buffer.  The host
expects that the guest will send an interrupt only when an "out"
ring buffer transitions from empty to non-empty.  If the guest sends
interrupts at other times, the host deems such interrupts to be
unnecessary.  If a guest sends an excessive number of unnecessary
interrupts, the host may throttle that guest by suspending its
execution for a few seconds to prevent a denial-of-service attack.
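
Because "empty" means the read and write indices are equal, the
empty-to-non-empty test reduces to comparing the host's read index
against the guest's write index as it was just before the write.
A minimal sketch (hypothetical helper; the real logic is in
drivers/hv/ring_buffer.c and must also honor the interrupt_mask
set by the host)::

    static bool host_needs_signal(u32 write_index_before_write,
                                  u32 host_read_index)
    {
            /*
             * The "out" ring was empty just before this write only
             * if the host had already consumed everything written.
             */
            return write_index_before_write == host_read_index;
    }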

Similarly, the host will interrupt the guest when it sends a new
message on the VMbus control path, or when a VMbus channel "in" ring
buffer transitions from empty to non-empty.  Each CPU in the guest
may receive VMbus interrupts, so they are best modeled as per-CPU
interrupts in Linux.  This model works well on arm64 where a single
per-CPU IRQ is allocated for VMbus.  Since x86/x64 lacks support for
per-CPU IRQs, an x86 interrupt vector is statically allocated (see
HYPERVISOR_CALLBACK_VECTOR) across all CPUs and explicitly coded to
call the VMbus interrupt service routine.  These interrupts are
visible in /proc/interrupts on the "HYP" line.

The guest CPU that a VMbus channel will interrupt is selected by the
guest when the channel is created, and the host is informed of that
selection.  VMbus devices are broadly grouped into two categories:

1. "Slow" devices that need only one VMbus channel.  These devices
   (such as keyboard, mouse, heartbeat, and timesync) generate
   relatively few interrupts.  Their VMbus channels are all
   assigned to interrupt the VMBUS_CONNECT_CPU, which is always
   CPU 0.

2. "High speed" devices that may use multiple VMbus channels for
   higher parallelism and performance.  These devices include the
   synthetic SCSI controller and synthetic NIC.  Their VMbus
   channel interrupts are assigned to CPUs that are spread out
   among the available CPUs in the VM so that interrupts on
   multiple channels can be processed in parallel (see the sketch
   after this list).
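
A highly simplified sketch of the spreading policy follows.  The
real init_vp_index() in drivers/hv/channel_mgmt.c also balances
across NUMA nodes and accounts for channels already assigned to
each CPU; the round-robin cursor here is purely illustrative::

    static u32 next_target_cpu; /* hypothetical round-robin cursor */

    static u32 pick_target_cpu(void)
    {
            u32 cpu = next_target_cpu;

            /* Rotate through the CPUs available in the VM */
            next_target_cpu = (next_target_cpu + 1) % num_online_cpus();
            return cpu;
    }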

The assignment of VMbus channel interrupts to CPUs is done in the
function init_vp_index().  This assignment is done outside of the
normal Linux interrupt affinity mechanism, so the interrupts are
neither "unmanaged" nor "managed" interrupts.

The CPU that a VMbus channel will interrupt can be seen in
/sys/bus/vmbus/devices/<deviceGUID>/channels/<channelRelID>/cpu.
When running on later versions of Hyper-V, the CPU can be changed
by writing a new value to this sysfs entry.  Because the interrupt
assignment is done outside of the normal Linux affinity mechanism,
there are no entries in /proc/irq corresponding to individual
VMbus channel interrupts.

An online CPU in a Linux guest may not be taken offline if it has
VMbus channel interrupts assigned to it.  Any such channel
interrupts must first be manually reassigned to another CPU as
described above.  When no channel interrupts are assigned to the
CPU, it can be taken offline.

When a guest CPU receives a VMbus interrupt from the host, the
function vmbus_isr() handles the interrupt.  It first checks for
channel interrupts by calling vmbus_chan_sched(), which looks at a
bitmap set up by the host to determine which channels have pending
interrupts on this CPU.  If multiple channels have pending
interrupts for this CPU, they are processed sequentially.  When all
channel interrupts have been processed, vmbus_isr() checks for and
processes any message received on the VMbus control path.
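
An illustrative decomposition of that flow (the names here are
stand-ins; see vmbus_isr() and vmbus_chan_sched() in
drivers/hv/vmbus_drv.c for the real code)::

    static void handle_vmbus_interrupt(void)
    {
            unsigned int relid;

            /*
             * Walk the host-maintained bitmap for this CPU and
             * process each channel whose bit is set, sequentially.
             */
            for_each_set_bit(relid, this_cpu_event_bitmap, MAX_RELIDS)
                    run_channel_callback(relid); /* hypothetical */

            /* Then handle any pending VMbus control path message */
            handle_control_message(); /* hypothetical */
    }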

The VMbus channel interrupt handling code is designed to work
correctly even if an interrupt is received on a CPU other than the
CPU assigned to the channel.  Specifically, the code does not use
CPU-based exclusion for correctness.  In normal operation, Hyper-V
will interrupt the assigned CPU.  But when the CPU assigned to a
channel is being changed via sysfs, the guest doesn't know exactly
when Hyper-V will make the transition.  The code must work correctly
even if there is a time lag before Hyper-V starts interrupting the
new CPU.  See comments in target_cpu_store().

VMbus device creation/deletion
------------------------------
Hyper-V and the Linux guest have a separate message-passing path
that is used for synthetic device creation and deletion. This
path does not use a VMbus channel.  See vmbus_post_msg() and
vmbus_on_msg_dpc().

The first step is for the guest to connect to the generic
Hyper-V VMbus mechanism.  As part of establishing this connection,
the guest and Hyper-V agree on a VMbus protocol version they will
use.  This negotiation allows newer Linux kernels to run on older
Hyper-V versions, and vice versa.

The guest then tells Hyper-V to "send offers".  Hyper-V sends an
offer message to the guest for each synthetic device that the VM
is configured to have. Each VMbus device type has a fixed GUID
known as the "class ID", and each VMbus device instance is also
identified by a GUID. The offer message from Hyper-V contains
both GUIDs to uniquely (within the VM) identify the device.
There is one offer message for each device instance, so a VM with
two synthetic NICs will get two offer messages with the NIC
class ID. The ordering of offer messages can vary from boot-to-boot
and must not be assumed to be consistent in Linux code. Offer
messages may also arrive long after Linux has initially booted
because Hyper-V supports adding devices, such as synthetic NICs,
to running VMs. A new offer message is processed by
vmbus_process_offer(), which indirectly invokes vmbus_add_channel_work().

Upon receipt of an offer message, the guest identifies the device
type based on the class ID, and invokes the correct driver to set up
the device.  Driver/device matching is performed using the standard
Linux mechanism.
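
For example, a VSC declares the class IDs it handles in an id_table
that the VMbus core matches against the class ID in each offer.
The sketch below is modeled on the real netvsc driver, though the
driver name and callbacks are hypothetical::

    static const struct hv_vmbus_device_id id_table[] = {
            { HV_NIC_GUID, }, /* synthetic NIC class ID */
            { },
    };
    MODULE_DEVICE_TABLE(vmbus, id_table);

    static struct hv_driver my_vsc_drv = {
            .name = "my_vsc",        /* hypothetical */
            .id_table = id_table,
            .probe = my_vsc_probe,   /* hypothetical */
            .remove = my_vsc_remove, /* hypothetical */
    };
    /* registered at init time via vmbus_driver_register(&my_vsc_drv) */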

The device driver probe function opens the primary VMbus channel to
the corresponding VSP. It allocates guest memory for the channel
ring buffers and shares the ring buffer with the Hyper-V host by
giving the host a list of GPAs for the ring buffer memory.  See
vmbus_establish_gpadl().
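
Opening the primary channel is typically a single call to
vmbus_open().  The function and its arguments are real; the ring
size, callback, and probe function shown are illustrative::

    #define MY_RING_SIZE (64 * 1024) /* illustrative ring size */

    static int my_vsc_probe(struct hv_device *dev,
                            const struct hv_vmbus_device_id *id)
    {
            /*
             * Allocates the "in" and "out" ring buffers, establishes
             * the GPADL for them, and asks the host to open the
             * channel.
             */
            return vmbus_open(dev->channel, MY_RING_SIZE, MY_RING_SIZE,
                              NULL, 0, /* no extra data in open message */
                              my_chan_callback, /* hypothetical */
                              dev);
    }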

Once the ring buffer is set up, the device driver and VSP exchange
setup messages via the primary channel.  These messages may include
negotiating the device protocol version to be used between the Linux
VSC and the VSP on the Hyper-V host.  The setup messages may also
include creating additional VMbus channels, which are somewhat
mis-named as "sub-channels" since they are functionally
equivalent to the primary channel once they are created.

Finally, the device driver may create entries in /dev as with
any device driver.

The Hyper-V host can send a "rescind" message to the guest to
remove a device that was previously offered. Linux drivers must
handle such a rescind message at any time. Rescinding a device
invokes the device driver "remove" function to cleanly shut
down the device and remove it. Once a synthetic device is
rescinded, neither Hyper-V nor Linux retains any state about
its previous existence. Such a device might be re-added later,
in which case it is treated as an entirely new device. See
vmbus_onoffer_rescind().