.. SPDX-License-Identifier: GPL-2.0

=================
KVM-specific MSRs
=================

:Author: Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010

KVM makes use of some custom MSRs to service some requests.

Custom MSRs have a range reserved for them, that goes from
0x4b564d00 to 0x4b564dff. There are MSRs outside this area,
but they are deprecated and their use is discouraged.

Custom MSR list
---------------

The currently supported custom MSRs are:

MSR_KVM_WALL_CLOCK_NEW:
	0x4b564d00

data:
	4-byte aligned physical address of a memory area which must be
	in guest RAM. This memory is expected to hold a copy of the following
	structure::

	  struct pvclock_wall_clock {
		u32   version;
		u32   sec;
		u32   nsec;
	  } __attribute__((__packed__));

	whose data will be filled in by the hypervisor. The hypervisor is only
	guaranteed to update this data at the moment of MSR write.
	Users that want to reliably query this information more than once have
	to write more than once to this MSR. Fields have the following meanings:

	version:
		guest has to check version before and after grabbing
		time information and check that they are both equal and even.
		An odd version indicates an in-progress update.

	sec:
		number of seconds for wallclock at time of boot.

	nsec:
		number of nanoseconds for wallclock at time of boot.

	In order to get the current wallclock time, the system_time from
	MSR_KVM_SYSTEM_TIME_NEW needs to be added.

	Note that although MSRs are per-CPU entities, the effect of this
	particular MSR is global.

	Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
	leaf prior to usage.
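
	A guest-side read of this structure might look like the following
	minimal sketch. It is not the kernel's implementation: uint32_t
	stands in for u32, ``wall_clock_read`` and ``pv_barrier`` are
	hypothetical names, and a compiler barrier is assumed to be
	sufficient ordering on x86::

		#include <stdint.h>

		/* Layout from the text above; uint32_t stands in for u32. */
		struct pvclock_wall_clock {
			uint32_t version;
			uint32_t sec;
			uint32_t nsec;
		} __attribute__((__packed__));

		/* Compiler barrier only (assumption: enough on x86). */
		#define pv_barrier() __asm__ __volatile__("" ::: "memory")

		/*
		 * Copy sec/nsec out of the shared area. Retry while the
		 * version is odd (update in progress) or changed during
		 * the read.
		 */
		static void wall_clock_read(const volatile struct pvclock_wall_clock *wc,
					    uint32_t *sec, uint32_t *nsec)
		{
			uint32_t before, after;

			do {
				before = wc->version;
				pv_barrier();
				*sec = wc->sec;
				*nsec = wc->nsec;
				pv_barrier();
				after = wc->version;
			} while ((before & 1) || before != after);
		}

	The values returned are the wallclock at boot; the system_time from
	MSR_KVM_SYSTEM_TIME_NEW still has to be added to get the current time.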

MSR_KVM_SYSTEM_TIME_NEW:
	0x4b564d01

data:
	4-byte aligned physical address of a memory area which must be in
	guest RAM, plus an enable bit in bit 0. This memory is expected to hold
	a copy of the following structure::

	  struct pvclock_vcpu_time_info {
		u32   version;
		u32   pad0;
		u64   tsc_timestamp;
		u64   system_time;
		u32   tsc_to_system_mul;
		s8    tsc_shift;
		u8    flags;
		u8    pad[2];
	  } __attribute__((__packed__)); /* 32 bytes */

	whose data will be filled in by the hypervisor periodically. Only one
	write, or registration, is needed for each VCPU. The interval between
	updates of this structure is arbitrary and implementation-dependent.
	The hypervisor may update this structure at any time it sees fit until
	anything with bit 0 == 0 is written to it.

	Fields have the following meanings:

	version:
		guest has to check version before and after grabbing
		time information and check that they are both equal and even.
		An odd version indicates an in-progress update.

	tsc_timestamp:
		the tsc value at the current VCPU at the time
		of the update of this structure. Guests can subtract this value
		from current tsc to derive a notion of elapsed time since the
		structure update.

	system_time:
		a host notion of monotonic time, including sleep
		time at the time this structure was last updated. Unit is
		nanoseconds.

	tsc_to_system_mul:
		multiplier to be used when converting
		tsc-related quantity to nanoseconds.

	tsc_shift:
		shift to be used when converting tsc-related
		quantity to nanoseconds. This shift will ensure that
		multiplication with tsc_to_system_mul does not overflow.
		A positive value denotes a left shift, a negative value
		a right shift.

		The conversion from tsc to nanoseconds involves an additional
		right shift by 32 bits. With this information, guests can
		derive per-CPU time by doing::

			time = (current_tsc - tsc_timestamp);
			if (tsc_shift >= 0)
				time <<= tsc_shift;
			else
				time >>= -tsc_shift;
			time = (time * tsc_to_system_mul) >> 32;
			time = time + system_time;

	flags:
		bits in this field indicate extended capabilities
		coordinated between the guest and the hypervisor. Availability
		of specific flags has to be checked in 0x40000001 cpuid leaf.
		Current flags are:

		+-----------+--------------+----------------------------------+
		| flag bit  | cpuid bit    | meaning                          |
		+-----------+--------------+----------------------------------+
		|           |              | time measures taken across       |
		|    0      |      24      | multiple cpus are guaranteed to  |
		|           |              | be monotonic                     |
		+-----------+--------------+----------------------------------+
		|           |              | guest vcpu has been paused by    |
		|    1      |     N/A      | the host                         |
		|           |              | See 4.70 in api.txt              |
		+-----------+--------------+----------------------------------+

	Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
	leaf prior to usage.
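
	The tsc-to-nanoseconds conversion described for this MSR can be
	sketched in C as follows. This is an illustration under assumptions
	rather than the kernel's code: the helper names are hypothetical,
	stdint types stand in for u32/u64/s8, and ``unsigned __int128`` (a
	GCC/Clang extension on 64-bit targets) is used so the multiplication
	cannot overflow. The version check is omitted for brevity::

		#include <stdint.h>

		struct pvclock_vcpu_time_info {
			uint32_t version;
			uint32_t pad0;
			uint64_t tsc_timestamp;
			uint64_t system_time;
			uint32_t tsc_to_system_mul;
			int8_t   tsc_shift;
			uint8_t  flags;
			uint8_t  pad[2];
		} __attribute__((__packed__));

		/* Scale a TSC delta to nanoseconds: shift, multiply, >> 32. */
		static uint64_t pvclock_scale_delta(uint64_t delta, uint32_t mul,
						    int8_t shift)
		{
			if (shift >= 0)
				delta <<= shift;
			else
				delta >>= -shift;
			/* 128-bit intermediate keeps the full product. */
			return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
		}

		/* Per-CPU time in nanoseconds for a given TSC reading. */
		static uint64_t pvclock_time_ns(const struct pvclock_vcpu_time_info *ti,
						uint64_t current_tsc)
		{
			uint64_t delta = current_tsc - ti->tsc_timestamp;

			return ti->system_time +
			       pvclock_scale_delta(delta, ti->tsc_to_system_mul,
						   ti->tsc_shift);
		}

	For instance, a multiplier of 0x80000000 represents 0.5 after the
	32-bit right shift, so with tsc_shift == 1 each TSC tick counts as one
	nanosecond.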

MSR_KVM_WALL_CLOCK:
	0x11

data and functioning:
	same as MSR_KVM_WALL_CLOCK_NEW. Use that instead.

	This MSR falls outside the reserved KVM range and may be removed in the
	future. Its usage is deprecated.

	Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
	leaf prior to usage.

MSR_KVM_SYSTEM_TIME:
	0x12

data and functioning:
	same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead.

	This MSR falls outside the reserved KVM range and may be removed in the
	future. Its usage is deprecated.

	Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
	leaf prior to usage.

	The suggested algorithm for detecting kvmclock presence is then::

		if (!kvm_para_available())    /* refer to cpuid.txt */
			return NON_PRESENT;

		flags = cpuid_eax(0x40000001);
		if (flags & (1 << 3)) {		/* bit 3: *_NEW MSRs */
			msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
			msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
			return PRESENT;
		} else if (flags & (1 << 0)) {	/* bit 0: deprecated MSRs */
			msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
			msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
			return PRESENT;
		} else {
			return NON_PRESENT;
		}

MSR_KVM_ASYNC_PF_EN:
	0x4b564d02

data:
	Asynchronous page fault (APF) control MSR.

	Bits 63-6 hold the 64-byte aligned physical address of a 64-byte
	memory area which must be in guest RAM. This memory is expected to
	hold the following structure::

	  struct kvm_vcpu_pv_apf_data {
		/* Used for 'page not present' events delivered via #PF */
		__u32 flags;

		/* Used for 'page ready' events delivered via interrupt notification */
		__u32 token;

		__u8 pad[56];
	  };

	Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
	when asynchronous page faults are enabled on the vcpu, 0 when disabled.
	Bit 1 is 1 if asynchronous page faults can be injected when the vcpu is
	in cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1
	as #PF vmexits. Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is
	present in CPUID. Bit 3 enables interrupt based delivery of 'page ready'
	events. Bit 3 can only be set if KVM_FEATURE_ASYNC_PF_INT is present in
	CPUID.

	'Page not present' events are currently always delivered as a synthetic
	#PF exception. During delivery of these events the APF CR2 register
	contains a token that will be used to notify the guest when the missing
	page becomes available. Also, to make it possible to distinguish between
	a real #PF and an APF, the first 4 bytes of the 64-byte memory location
	('flags') will be written to by the hypervisor at the time of injection.
	Only the first bit of 'flags' is currently supported; when set, it
	indicates that the guest is dealing with an asynchronous 'page not
	present' event. If 'flags' is '0' during a page fault, it is a regular
	page fault. The guest is supposed to clear 'flags' when it is done
	handling the #PF exception so the next event can be delivered.

	Note, since APF 'page not present' events use the same exception vector
	as regular page faults, the guest must reset 'flags' to '0' before it
	does something that can generate a normal page fault.

	Bytes 4-7 of the 64-byte memory location ('token') will be written to by
	the hypervisor at the time of APF 'page ready' event injection. The
	content of these bytes is a token which was previously delivered in CR2
	with the 'page not present' event. The event indicates the page is now
	available. The guest is supposed to write '0' to 'token' when it is done
	handling the 'page ready' event and to write '1' to MSR_KVM_ASYNC_PF_ACK
	after clearing the location; writing to the MSR forces KVM to re-scan
	its queue and deliver the next pending notification.

	Note, the MSR_KVM_ASYNC_PF_INT MSR, which specifies the interrupt vector
	for 'page ready' APF delivery, needs to be written to before enabling
	the APF mechanism in MSR_KVM_ASYNC_PF_EN, or interrupt #0 can get
	injected. The MSR is available if KVM_FEATURE_ASYNC_PF_INT is present in
	CPUID.

	Note, previously, 'page ready' events were delivered via the same #PF
	exception as 'page not present' events but this is now deprecated. If
	bit 3 (interrupt based delivery) is not set, APF events are not
	delivered.

	If APF is disabled while there are outstanding APFs, they will
	not be delivered.

	Currently 'page ready' APF events will always be delivered on the
	same vcpu as the 'page not present' event was, but the guest should not
	rely on that.
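
	The bit layout of the MSR value can be sketched as below. The
	``APF_*`` names and ``apf_en_value`` are hypothetical, chosen for this
	example only; the CPUID feature checks for bits 2 and 3 are left to the
	caller::

		#include <stdint.h>

		#define APF_ENABLED		(1ULL << 0)
		#define APF_SEND_AT_CPL0	(1ULL << 1)
		#define APF_PF_VMEXIT		(1ULL << 2)
		#define APF_PAGE_READY_INT	(1ULL << 3)

		/*
		 * Compose the value for MSR_KVM_ASYNC_PF_EN.
		 * Returns 0 on success, -1 if the address is not 64-byte
		 * aligned or a bit outside 3-0 is requested as a flag.
		 */
		static int apf_en_value(uint64_t gpa, uint64_t flags, uint64_t *val)
		{
			if (gpa & 0x3f)
				return -1;
			if (flags & ~0xfULL)
				return -1;
			*val = gpa | flags;
			return 0;
		}

	A guest using bit 3 would write the vector to MSR_KVM_ASYNC_PF_INT
	first and only then write the composed value to this MSR.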

MSR_KVM_STEAL_TIME:
	0x4b564d03

data:
	64-byte aligned physical address of a memory area which must be
	in guest RAM, plus an enable bit in bit 0. This memory is expected to
	hold a copy of the following structure::

	  struct kvm_steal_time {
		__u64 steal;
		__u32 version;
		__u32 flags;
		__u8  preempted;
		__u8  u8_pad[3];
		__u32 pad[11];
	  };

	whose data will be filled in by the hypervisor periodically. Only one
	write, or registration, is needed for each VCPU. The interval between
	updates of this structure is arbitrary and implementation-dependent.
	The hypervisor may update this structure at any time it sees fit until
	anything with bit 0 == 0 is written to it. The guest is required to
	make sure this structure is initialized to zero.

	Fields have the following meanings:

	version:
		a sequence counter. In other words, guest has to check
		this field before and after grabbing time information and make
		sure they are both equal and even. An odd version indicates an
		in-progress update.

	flags:
		At this point, always zero. May be used to indicate
		changes in this structure in the future.

	steal:
		the amount of time in which this vCPU did not run, in
		nanoseconds. Time during which the vcpu is idle will not be
		reported as steal time.

	preempted:
		indicates whether the vCPU that owns this struct is running
		or not. Non-zero values mean the vCPU has been preempted. Zero
		means the vCPU is not preempted. NOTE, it is always zero if
		the hypervisor doesn't support this field.
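
	Reading the structure from the guest side might be sketched as
	follows. The function names and ``pv_barrier`` are hypothetical,
	stdint types stand in for the __u* types, and a compiler barrier is
	assumed to be sufficient ordering on x86::

		#include <stdint.h>

		struct kvm_steal_time {
			uint64_t steal;
			uint32_t version;
			uint32_t flags;
			uint8_t  preempted;
			uint8_t  u8_pad[3];
			uint32_t pad[11];
		};

		/* Compiler barrier only (assumption: enough on x86). */
		#define pv_barrier() __asm__ __volatile__("" ::: "memory")

		/* Consistent snapshot of the steal counter, in nanoseconds. */
		static uint64_t steal_time_read(const volatile struct kvm_steal_time *st)
		{
			uint32_t before, after;
			uint64_t steal;

			do {
				before = st->version;
				pv_barrier();
				steal = st->steal;
				pv_barrier();
				after = st->version;
			} while ((before & 1) || before != after);

			return steal;
		}

		/* Non-zero 'preempted' means the owning vCPU was preempted. */
		static int steal_time_vcpu_preempted(const volatile struct kvm_steal_time *st)
		{
			return st->preempted != 0;
		}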

MSR_KVM_EOI_EN:
	0x4b564d04

data:
	Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
	when disabled. Bit 1 is reserved and must be zero. When PV end of
	interrupt is enabled (bit 0 set), bits 63-2 hold a 4-byte aligned
	physical address of a 4-byte memory area which must be in guest RAM and
	must be zeroed.

	The first, least significant bit of the 4-byte memory location will be
	written to by the hypervisor, typically at the time of interrupt
	injection. A value of 1 means that the guest can skip writing EOI to
	the APIC (using MSR or MMIO write); instead, it is sufficient to signal
	EOI by clearing the bit in guest memory - this location will
	later be polled by the hypervisor.
	A value of 0 means that the EOI write is required.

	It is always safe for the guest to ignore the optimization and perform
	the APIC EOI write anyway.

	The hypervisor is guaranteed to only modify this least
	significant bit while in the current VCPU context. This means that
	the guest does not need to use either lock prefix or memory ordering
	primitives to synchronise with the hypervisor.

	However, the hypervisor can set and clear this memory bit at any time.
	To make sure the hypervisor does not interrupt the guest and clear the
	least significant bit in the window between the guest testing it (to
	detect whether it can skip the EOI APIC write) and the guest clearing
	it (to signal EOI to the hypervisor), the guest must both read the
	least significant bit in the memory area and clear it using a single
	CPU instruction, such as test and clear, or compare and exchange.
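
	One way to express the combined read-and-clear in C is sketched below.
	``pv_eoi_try_signal`` is a hypothetical name, and
	``__atomic_exchange_n`` is a GCC/Clang builtin that on x86 typically
	compiles to a single xchg. This is stronger than required, since no
	lock prefix is needed, and it clears the whole 4-byte word, which
	assumes that no bit other than bit 0 is ever set::

		#include <stdint.h>

		/*
		 * Returns 1 if EOI was signalled through guest memory, so
		 * the APIC EOI write can be skipped. Returns 0 if the
		 * caller still has to perform the APIC EOI write.
		 */
		static int pv_eoi_try_signal(uint32_t *pv_eoi)
		{
			/* Read and clear in one step: no test/clear window. */
			uint32_t old = __atomic_exchange_n(pv_eoi, 0,
							   __ATOMIC_RELAXED);

			return old & 1;
		}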

MSR_KVM_POLL_CONTROL:
	0x4b564d05

	Control host-side polling.

data:
	Bit 0 enables (1) or disables (0) host-side HLT polling logic.

	KVM guests can request the host not to poll on HLT, for example if
	they are performing polling themselves.

MSR_KVM_ASYNC_PF_INT:
	0x4b564d06

data:
	Second asynchronous page fault (APF) control MSR.

	Bits 0-7: APIC vector for delivery of 'page ready' APF events.

	Bits 8-63: Reserved.

	Interrupt vector for asynchronous 'page ready' notification delivery.
	The vector has to be set up before the asynchronous page fault
	mechanism is enabled in MSR_KVM_ASYNC_PF_EN. The MSR is only available
	if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.

MSR_KVM_ASYNC_PF_ACK:
	0x4b564d07

data:
	Asynchronous page fault (APF) acknowledgment.

	When the guest is done processing a 'page ready' APF event and the
	'token' field in 'struct kvm_vcpu_pv_apf_data' is cleared, it is
	supposed to write '1' to bit 0 of the MSR. This causes the host to
	re-scan its queue and check if there are more notifications pending.
	The MSR is available if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
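
	The acknowledgment sequence might be sketched as below. All function
	names are hypothetical; ``fake_wrmsr`` only records the write and
	stands in for the privileged WRMSR instruction so the sketch can run
	in user space. Treating a token of 0 as "no event pending" is an
	assumption of this example::

		#include <stdint.h>

		#define MSR_KVM_ASYNC_PF_ACK 0x4b564d07

		struct kvm_vcpu_pv_apf_data {
			uint32_t flags;
			uint32_t token;
			uint8_t  pad[56];
		};

		typedef void (*wrmsr_fn)(uint32_t msr, uint64_t val);

		/* Stand-in for WRMSR: remembers the last write. */
		static uint32_t last_msr;
		static uint64_t last_val;

		static void fake_wrmsr(uint32_t msr, uint64_t val)
		{
			last_msr = msr;
			last_val = val;
		}

		/* Returns the consumed token, or 0 if none was pending. */
		static uint32_t apf_handle_page_ready(volatile struct kvm_vcpu_pv_apf_data *apf,
						      wrmsr_fn wrmsr)
		{
			uint32_t token = apf->token;

			if (!token)
				return 0;

			/* ... wake whatever was waiting on this token ... */

			apf->token = 0;			/* clear first */
			wrmsr(MSR_KVM_ASYNC_PF_ACK, 1);	/* then acknowledge */
			return token;
		}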

MSR_KVM_MIGRATION_CONTROL:
	0x4b564d08

data:
	This MSR is available if KVM_FEATURE_MIGRATION_CONTROL is present in
	CPUID. Bit 0 represents whether live migration of the guest is allowed.

	When a guest is started, bit 0 will be 0 if the guest has encrypted
	memory and 1 if the guest does not have encrypted memory. If the
	guest is communicating page encryption status to the host using the
	``KVM_HC_MAP_GPA_RANGE`` hypercall, it can set bit 0 in this MSR to
	allow live migration of the guest.