History log of /linux-master/arch/x86/include/asm/tdx.h
Revision Date Author Comments
# 70060463 08-Dec-2023 Kai Huang <kai.huang@intel.com>

x86/mce: Differentiate real hardware #MCs from TDX erratum ones

The first few generations of TDX hardware have an erratum. Triggering
it in Linux requires some kind of kernel bug involving relatively exotic
memory writes to TDX private memory and will manifest via
spurious-looking machine checks when reading the affected memory.

Make an effort to detect these TDX-induced machine checks and spit out
a new blurb to dmesg so folks do not think their hardware is failing.

== Background ==

Virtually all kernel memory access operations happen in full
cachelines. In practice, writing a "byte" of memory usually reads a
64-byte cacheline of memory, modifies it, then writes the whole line
back. Those operations do not trigger this problem.

This problem is triggered by "partial" writes, where a write transaction
of less than a cacheline lands at the memory controller. The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings. The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.

== Problem ==

A partial write to a TDX private memory cacheline will silently "poison"
the line. Subsequent reads will consume the poison and generate a
machine check. According to the TDX hardware spec, neither of these
things should have happened.

To add insult to injury, the Linux machine check code will present
these as a literal "Hardware Error" when they were, in fact, a
software-triggered issue.

== Solution ==

In the end, this issue is hard to trigger. Rather than do something
rash (and incomplete) like unmap TDX private memory from the direct map,
improve the machine check handler.

Currently, the #MC handler doesn't distinguish whether the memory is
TDX private memory or not; it just dumps, for instance, the message
below:

[...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134
[...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0}
...
[...] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
[...] Kernel panic - not syncing: Fatal local machine check

Which says "Hardware Error" and "Data load in unrecoverable area of
kernel".

Ideally, it would be better for the log to say "software bug around TDX
private memory" instead of "Hardware Error". But in reality real
hardware memory errors can happen, and sadly such a software-triggered
#MC cannot be distinguished from a real hardware error. Also, the error
message is parsed by the userspace tool 'mcelog', so changing the output
may break userspace.

So keep the "Hardware Error". The "Data load in unrecoverable area of
kernel" is also helpful, so keep it too.

Instead of modifying the above error log, improve it by printing an
additional TDX-related message, making the log look like:

...
[...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
[...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug.

Adding this additional message requires determining whether the memory
page is TDX private memory. There is no existing infrastructure to do
that. Add an interface to query the TDX module to fill this gap.
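
As an illustration only, a minimal sketch of how the #MC path could use
such a query interface; the helper name tdx_is_private_mem() and the
wrapper mce_tdx_hint() are assumptions based on the description above,
not necessarily the exact upstream names:

/*
 * Sketch: if an unrecoverable #MC recorded a valid physical address and
 * that address is TDX private memory, return the extra hint so the #MC
 * code can print it next to the usual "Hardware Error" lines.
 */
static const char *mce_tdx_hint(struct mce *m)
{
        /* Only meaningful if the #MC recorded a valid address. */
        if (!(m->status & MCI_STATUS_ADDRV))
                return NULL;

        /* Assumed interface: ask the TDX module if the page is private. */
        if (!tdx_is_private_mem(m->addr))
                return NULL;

        return "TDX private memory error. Possible kernel bug.";
}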

== Impact ==

This issue requires some kind of kernel bug to trigger.

TDX private memory should never be mapped UC/WC. A partial write
originating from these mappings would require *two* bugs: first mapping
the wrong page, then writing the wrong memory. It would also be
detectable using traditional memory corruption debug techniques like
DEBUG_PAGEALLOC.

MOVNTI (and friends) could cause this issue with something like a simple
buffer overrun or use-after-free on the direct map. It should also be
detectable with normal debug techniques.

The one place where this might get nasty would be if the CPU read data
then wrote back the same data. That would trigger this problem but
would not, for instance, set off mechanisms like slab redzoning because
it doesn't actually corrupt data.

With an IOMMU at least, the DMA exposure is similar to the UC/WC issue.
TDX private memory would first need to be incorrectly mapped into the
I/O space and then a later DMA to that mapping would actually cause the
poisoning event.

[ dhansen: changelog tweaks ]

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-18-dave.hansen%40intel.com


# 6162b310 08-Dec-2023 Kai Huang <kai.huang@intel.com>

x86/virt/tdx: Add skeleton to enable TDX on demand

There are essentially two steps to get the TDX module ready:
1) Get each CPU ready to run TDX
2) Set up the shared TDX module data structures

Introduce and export (to KVM) the infrastructure to do both of these
pieces at runtime.

== Per-CPU TDX Initialization ==

Track the initialization status of each CPU with a per-cpu variable.
This avoids failures in the case of KVM module reloads and handles cases
where CPUs come online later.

Generally, the per-cpu SEAMCALLs happen first. But there's actually one
global call that has to happen before _any_ others (TDH_SYS_INIT). It's
analogous to the boot CPU having to do a bit of extra work just because
it happens to be the first one. Track if _any_ CPU has done this call
and then only actually do it during the first per-cpu init.
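
A rough sketch of the per-CPU side described above; names such as
tdx_cpu_enable(), tdx_lp_initialized and try_init_module_global() follow
this description and the surrounding series, seamcall_prerr() is the
error-printing wrapper from the companion patch, and the details are
illustrative:

/* Per-cpu flag tracking whether this CPU has done TDH_SYS_LP_INIT. */
static DEFINE_PER_CPU(bool, tdx_lp_initialized);

int tdx_cpu_enable(void)
{
        struct tdx_module_args args = {};
        int ret;

        /* Nothing to do if this CPU is already ready to run TDX. */
        if (__this_cpu_read(tdx_lp_initialized))
                return 0;

        /* The one global call (TDH_SYS_INIT) must precede any per-cpu call. */
        ret = try_init_module_global();
        if (ret)
                return ret;

        /* Per-cpu initialization SEAMCALL. */
        ret = seamcall_prerr(TDH_SYS_LP_INIT, &args);
        if (ret)
                return ret;

        __this_cpu_write(tdx_lp_initialized, true);
        return 0;
}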

== Shared TDX Initialization ==

Create the global state function (tdx_enable()) as a simple placeholder.
The TODO list will be pared down as functionality is added.

Use a state machine protected by mutex to make sure the work in
tdx_enable() will only be done once. This avoids failures if the KVM
module is reloaded.
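
For example, the mutex-protected state machine could look roughly like
this (the status enum values and __tdx_enable() are illustrative):

static enum tdx_module_status_t tdx_module_status;
static DEFINE_MUTEX(tdx_module_lock);

int tdx_enable(void)
{
        int ret;

        mutex_lock(&tdx_module_lock);

        switch (tdx_module_status) {
        case TDX_MODULE_UNINITIALIZED:
                /* First caller does the real work. */
                ret = __tdx_enable();
                break;
        case TDX_MODULE_INITIALIZED:
                /* Already initialized, e.g. the KVM module was reloaded. */
                ret = 0;
                break;
        default:
                /* Failed in a previous attempt; don't retry. */
                ret = -EINVAL;
                break;
        }

        mutex_unlock(&tdx_module_lock);

        return ret;
}
EXPORT_SYMBOL_GPL(tdx_enable);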

A CPU must be made ready to run TDX before it can participate in
initializing the shared parts of the module. Any caller of tdx_enable()
needs to ensure that it can never run on a CPU which is not ready to
run TDX. It needs to be wary of CPU hotplug, preemption and the
VMX enabling state of any CPU on which it might run.

== Why runtime instead of boot time? ==

The TDX module can be initialized only once in its lifetime. Instead
of always initializing it at boot time, this implementation chooses an
"on demand" approach, deferring TDX initialization until there is a real
need (e.g. when requested by KVM). This approach has the following pros:

1) It avoids consuming the memory that must be allocated by the kernel
and given to the TDX module as metadata (~1/256th of the TDX-usable
memory), and also saves the CPU cycles of initializing the TDX module
(and the metadata) when TDX is not used at all.

2) The TDX module design allows it to be updated while the system is
running. The update procedure shares quite a few steps with this "on
demand" initialization mechanism. The hope is that much of "on demand"
mechanism can be shared with a future "update" mechanism. A boot-time
TDX module implementation would not be able to share much code with the
update mechanism.

3) Making SEAMCALL requires VMX to be enabled. Currently, only the KVM
code mucks with VMX enabling. If the TDX module were to be initialized
separately from KVM (like at boot), the boot code would need to be
taught how to muck with VMX enabling and KVM would need to be taught how
to cope with that. Making KVM itself responsible for TDX initialization
lets the rest of the kernel stay blissfully unaware of VMX.

[ dhansen: completely reorder/rewrite changelog ]

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-6-dave.hansen%40intel.com


# df01f5ae 08-Dec-2023 Kai Huang <kai.huang@intel.com>

x86/virt/tdx: Add SEAMCALL error printing for module initialization

The SEAMCALLs involved during the TDX module initialization are not
expected to fail. In fact, they are not expected to return any non-zero
code (except the "running out of entropy" error, which can already be
handled internally).

Add yet another set of SEAMCALL wrappers, which treat all non-zero
return codes as errors, to support printing SEAMCALL errors upon failure
during module initialization. Note that the TDX module initialization
doesn't use the _saved_ret() variant, thus no wrapper is added for it.

SEAMCALL assembly can also return kernel-defined error codes for three
special cases: 1) TDX isn't enabled by the BIOS; 2) the TDX module isn't
loaded; 3) the CPU isn't in VMX operation. Whether they can legally
happen depends on the caller, so leave it to the caller to print an
error message when desired.

Also convert the SEAMCALL error codes to the kernel error codes in the
new wrappers so that each SEAMCALL caller doesn't have to repeat the
conversion.
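
A rough sketch of such a wrapper, assuming the retry helper sc_retry()
(and its sc_func_t callback type) from the companion "no entropy" patch
and the kernel-defined TDX_SEAMCALL_* codes; the names sc_retry_prerr()
and seamcall_prerr() are illustrative:

static int sc_retry_prerr(sc_func_t func, u64 fn,
                          struct tdx_module_args *args)
{
        u64 sret = sc_retry(func, fn, args);

        if (sret == TDX_SUCCESS)
                return 0;

        /* Kernel-defined codes: leave printing these to the caller. */
        if (sret == TDX_SEAMCALL_VMFAILINVALID)
                return -ENODEV;         /* TDX module not loaded */
        if (sret == TDX_SEAMCALL_GP)
                return -EOPNOTSUPP;     /* TDX not enabled by the BIOS */
        if (sret == TDX_SEAMCALL_UD)
                return -EACCES;         /* CPU not in VMX operation */

        /* Anything else is unexpected during initialization: print it. */
        pr_err("SEAMCALL (0x%016llx) failed: 0x%016llx\n", fn, sret);
        return -EIO;
}

#define seamcall_prerr(_fn, _args)  sc_retry_prerr(__seamcall, (_fn), (_args))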

[ dhansen: Align the register dump with show_regs(). Zero-pad the
contents, split on two lines and use consistent spacing. ]

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-5-dave.hansen%40intel.com


# 1e66a7e2 08-Dec-2023 Kai Huang <kai.huang@intel.com>

x86/virt/tdx: Handle SEAMCALL no entropy error in common code

Some SEAMCALLs use the RDRAND hardware and can fail for the same reasons
as RDRAND. Use the kernel RDRAND retry logic for them.

There are three __seamcall*() variants. Do the SEAMCALL retry in common
code and add a wrapper for each of them.
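
Roughly, the common retry helper and the per-variant wrappers described
above look like this (a sketch based on the description;
RDRAND_RETRY_LOOPS is the kernel's existing RDRAND retry count and
TDX_RND_NO_ENTROPY is the SEAMCALL "out of entropy" status):

typedef u64 (*sc_func_t)(u64 fn, struct tdx_module_args *args);

static inline u64 sc_retry(sc_func_t func, u64 fn,
                           struct tdx_module_args *args)
{
        int retry = RDRAND_RETRY_LOOPS;
        u64 ret;

        /* Retry only while the module reports it ran out of entropy. */
        do {
                ret = func(fn, args);
        } while (ret == TDX_RND_NO_ENTROPY && --retry);

        return ret;
}

#define seamcall(_fn, _args)            sc_retry(__seamcall, (_fn), (_args))
#define seamcall_ret(_fn, _args)        sc_retry(__seamcall_ret, (_fn), (_args))
#define seamcall_saved_ret(_fn, _args)  sc_retry(__seamcall_saved_ret, (_fn), (_args))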

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-4-dave.hansen%40intel.com


# 765a0542 08-Dec-2023 Kai Huang <kai.huang@intel.com>

x86/virt/tdx: Detect TDX during kernel boot

Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks. A CPU-attested software module
called 'the TDX module' runs inside a new isolated memory range as a
trusted hypervisor to manage and run protected VMs.

Pre-TDX Intel hardware has support for a memory encryption architecture
called MKTME. The memory encryption hardware underpinning MKTME is also
used for Intel TDX. TDX ends up "stealing" some of the physical address
space from the MKTME architecture for crypto-protection to VMs. The
BIOS is responsible for partitioning the "KeyID" space between legacy
MKTME and TDX. The KeyIDs reserved for TDX are called 'TDX private
KeyIDs' or 'TDX KeyIDs' for short.

During machine boot, TDX microcode verifies that the BIOS programmed the
TDX private KeyIDs consistently and correctly across all CPU packages.
The MSRs are locked in this state after verification. This is why
MSR_IA32_MKTME_KEYID_PARTITIONING gets used for TDX enumeration: it
indicates not just that the hardware supports TDX, but that all the
boot-time security checks passed.

The TDX module is expected to be loaded by the BIOS when it enables TDX,
but the kernel needs to properly initialize it before it can be used to
create and run any TDX guests. The TDX module will be initialized by
the KVM subsystem when KVM wants to use TDX.

Detect platform TDX support by detecting TDX private KeyIDs.

The TDX module itself requires one TDX KeyID as the 'TDX global KeyID'
to protect its metadata. Each TDX guest also needs a TDX KeyID for its
own protection. Just use the first TDX KeyID as the global KeyID and
leave the rest for TDX guests. If no TDX KeyID is left for TDX guests,
disable TDX as initializing the TDX module alone is useless.
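
A sketch of the detection, assuming the documented MSR layout (lower 32
bits: number of MKTME KeyIDs, upper 32 bits: number of TDX private
KeyIDs); the helper name is illustrative:

static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
                                            u32 *nr_tdx_keyids)
{
        u32 nr_mktme_keyids, nr_keyids;
        int ret;

        /*
         * IA32_MKTME_KEYID_PARTITIONING:
         *   Bits [31:0]:  number of MKTME KeyIDs
         *   Bits [63:32]: number of TDX private KeyIDs
         */
        ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING,
                         &nr_mktme_keyids, &nr_keyids);
        if (ret || !nr_keyids)
                return -EINVAL;

        /* TDX private KeyIDs start right after the last MKTME KeyID. */
        *tdx_keyid_start = nr_mktme_keyids + 1;
        *nr_tdx_keyids = nr_keyids;

        return 0;
}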

[ dhansen: add X86_FEATURE, replace helper function ]

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-1-dave.hansen%40intel.com


# f4738f56 25-Sep-2023 Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

virt: tdx-guest: Add Quote generation support using TSM_REPORTS

In a TDX guest, the attestation process is used to verify the TDX
guest's trustworthiness to other entities before provisioning secrets to
the guest. The first step in the attestation process is TDREPORT
generation: getting the guest measurement data in the TDREPORT format,
which is then used to validate the authenticity of the TDX guest.
TDREPORT by design is integrity-protected and can only be verified on
the local machine.

To support remote verification of the TDREPORT in an SGX-based
attestation, the TDREPORT needs to be sent to the SGX Quoting Enclave
(QE) to convert it into a remotely verifiable Quote. The SGX QE by
design can only run outside of the TDX guest (i.e. in a host process or
in a normal VM), and the guest can use communication channels like vsock
or TCP/IP to send the TDREPORT to the QE. But for security reasons, the
TDX guest may not support these communication channels. To handle such
cases, TDX defines a GetQuote hypercall which can be used by the guest
to request that the host VMM communicate with the SGX QE. More details
about the GetQuote hypercall can be found in the TDX Guest-Host
Communication Interface (GHCI) for Intel TDX 1.0, section
"TDG.VP.VMCALL<GetQuote>".

Trusted Security Module (TSM) [1] exposes a common ABI for Confidential
Computing guest platforms to get the measurement data via ConfigFS.
Extend the TSM framework and add support to allow an attestation agent
to get the TDX Quote data (usage example included below).

report=/sys/kernel/config/tsm/report/report0
mkdir $report
dd if=/dev/urandom bs=64 count=1 > $report/inblob
hexdump -C $report/outblob
rmdir $report

The GetQuote TDVMCALL requires the TD guest to pass a 4K-aligned shared
buffer with the TDREPORT data as input, which is then used by the VMM to
copy back the TD Quote result after successful Quote generation. To
create the shared buffer, allocate a large enough chunk of memory and
mark it shared using set_memory_decrypted() in tdx_guest_init(). This
buffer will be reused for GetQuote requests in the TDX TSM handler.
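
As a sketch of that allocation (GET_QUOTE_BUF_SIZE and the helper name
are illustrative; the key point is the set_memory_decrypted() call that
shares the pages with the VMM):

static void *alloc_quote_buf(void)
{
        size_t len = PAGE_ALIGN(GET_QUOTE_BUF_SIZE);
        void *addr;

        addr = alloc_pages_exact(len, GFP_KERNEL | __GFP_ZERO);
        if (!addr)
                return NULL;

        /* Convert the buffer to shared (decrypted) so the VMM can reach it. */
        if (set_memory_decrypted((unsigned long)addr, len / PAGE_SIZE)) {
                free_pages_exact(addr, len);
                return NULL;
        }

        return addr;
}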

Although this method reserves a fixed chunk of memory for GetQuote
requests, such a one-time allocation can help avoid memory-fragmentation
related allocation failures later in the uptime of the guest.

Since the Quote generation process is not time-critical or frequently
used, the current version uses a polling model for Quote requests and
does not support parallel GetQuote requests.

Link: https://lore.kernel.org/lkml/169342399185.3934343.3035845348326944519.stgit@dwillia2-xfh.jf.intel.com/ [1]
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Erdem Aktas <erdemaktas@google.com>
Tested-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Tested-by: Peter Gonda <pgonda@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 7b804135 15-Aug-2023 Kai Huang <kai.huang@intel.com>

x86/virt/tdx: Make TDX_MODULE_CALL handle SEAMCALL #UD and #GP

The SEAMCALL instruction causes a #UD if the CPU isn't in VMX operation.
Currently the TDX_MODULE_CALL assembly doesn't handle #UD, so making a
SEAMCALL while VMX is disabled would cause an Oops.

Unfortunately, there are legal cases in which a SEAMCALL can be made
while VMX is disabled. For instance, VMX can be disabled due to an
emergency reboot while there are still TDX guests running.

Extend the TDX_MODULE_CALL assembly to return an error code for #UD to
handle this case gracefully, e.g., KVM can then quietly eat all SEAMCALL
errors caused by emergency reboot.

The SEAMCALL instruction also causes a #GP when TDX isn't enabled by the
BIOS. Use _ASM_EXTABLE_FAULT() to catch both exceptions with the trap
number recorded, and define two new error codes by XORing the trap
number into TDX_SW_ERROR. This opportunistically handles #GP too while
using the same simple assembly code.
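
Concretely, the two new codes look roughly like this (the trap number
occupies bits that are clear in TDX_SW_ERROR, so OR and XOR are
equivalent here):

/* SEAMCALL faulted with #GP: TDX is not enabled by the BIOS. */
#define TDX_SEAMCALL_GP         (TDX_SW_ERROR | X86_TRAP_GP)
/* SEAMCALL faulted with #UD: the CPU is not in VMX operation. */
#define TDX_SEAMCALL_UD         (TDX_SW_ERROR | X86_TRAP_UD)

A caller like KVM's emergency-reboot path can then check for
TDX_SEAMCALL_UD and quietly ignore the failure.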

A bonus is that when the kernel mistakenly calls SEAMCALL while the CPU
isn't in VMX operation, or when TDX isn't enabled by the BIOS, or when
the BIOS is buggy, the kernel gets a nicer error code rather than a less
understandable Oops.

This is basically based on Peter's code.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/de975832a367f476aab2d0eb0d9de66019a16b54.1692096753.git.kai.huang%40intel.com


# c33621b4 15-Aug-2023 Kai Huang <kai.huang@intel.com>

x86/virt/tdx: Wire up basic SEAMCALL functions

Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks. A CPU-attested software module
called 'the TDX module' runs inside a new isolated memory range as a
trusted hypervisor to manage and run protected VMs.

TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). This
mode runs only the TDX module itself or other code to load the TDX
module.

The host kernel communicates with SEAM software via a new SEAMCALL
instruction. This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead. The TDX
module establishes a new SEAMCALL ABI which allows the host to
initialize the module and to manage VMs.

The SEAMCALL ABI is very similar to the TDCALL ABI and leverages much of
the TDCALL infrastructure. Wire up the basic functions needed to make
SEAMCALLs for the basic support of running TDX guests: __seamcall(),
__seamcall_ret(), and __seamcall_saved_ret(), the latter for
TDH.VP.ENTER. None of the SEAMCALLs involved in the basic TDX support
use "callee-saved" registers as input or output, except TDH.VP.ENTER.
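
The three wrappers are plain C-callable entry points into the SEAMCALL
assembly; a sketch of the declarations, assuming they share the argument
structure used by the TDCALL wrappers:

/* Leaf number in @fn; input/output GPRs packed into @args. */
u64 __seamcall(u64 fn, struct tdx_module_args *args);
/* Also copies the module's output registers back into @args. */
u64 __seamcall_ret(u64 fn, struct tdx_module_args *args);
/* Additionally handles the callee-saved GPRs, for TDH.VP.ENTER. */
u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);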

To start supporting TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for
TDX host kernel support. Add a new Kconfig option CONFIG_INTEL_TDX_HOST
to opt in to TDX host kernel support (as distinguished from TDX guest
kernel support). So far only KVM uses TDX, so make the new config option
depend on KVM_INTEL.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Link: https://lore.kernel.org/all/4db7c3fc085e6af12acc2932294254ddb3d320b3.1692096753.git.kai.huang%40intel.com


# 75d090fd 06-Jun-2023 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/tdx: Add unaccepted memory support

Hookup TDX-specific code to accept memory.

Accepting the memory is done with ACCEPT_PAGE module call on every page
in the range. MAP_GPA hypercall is not required as the unaccepted memory
is considered private already.

Extract the part of tdx_enc_status_changed() that does memory acceptance
into a new helper. Move the helper to tdx-shared.c. It is going to be
used by both the main kernel and the decompressor.
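
A simplified sketch of the extracted helper; the real code also tries
larger (2M/1G) accept sizes, and the helper name here is illustrative.
ACCEPT_PAGE takes the guest physical address (with the page level in the
low bits, zero for 4K) as its first argument:

static bool tdx_accept_range(phys_addr_t start, phys_addr_t end)
{
        /* One ACCEPT_PAGE module call per 4K page in the range. */
        for (; start < end; start += PAGE_SIZE) {
                if (__tdx_module_call(TDX_ACCEPT_PAGE, start, 0, 0, 0, NULL))
                        return false;
        }

        return true;
}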

[ bp: Fix the INTEL_TDX_GUEST=y, KVM_GUEST=n build. ]

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230606142637.5171-10-kirill.shutemov@linux.intel.com


# ff40b576 06-Jun-2023 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub

Memory acceptance requires a hypercall and one or multiple module calls.

Make helpers for the calls available in boot stub. It has to accept
memory where kernel image and initrd are placed.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20230606142637.5171-8-kirill.shutemov@linux.intel.com


# 51acfe89 16-Nov-2022 Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

x86/tdx: Add a wrapper to get TDREPORT0 from the TDX Module

To support TDX attestation, the TDX guest driver exposes an IOCTL
interface to allow userspace to get the TDREPORT0 (a.k.a. TDREPORT
subtype 0) from the TDX module via TDG.MR.TDREPORT TDCALL.

In order to get the TDREPORT0 in the TDX guest driver, instead of using
a low-level function like __tdx_module_call(), add a
tdx_mcall_get_report0() wrapper function to handle it.
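
A sketch of the wrapper, assuming the report TDCALL takes the physical
addresses of the output TDREPORT buffer and the input REPORTDATA plus
the report subtype; the leaf macro, subtype macro and error handling are
illustrative:

int tdx_mcall_get_report0(u8 *reportdata, u8 *tdreport)
{
        u64 ret;

        /* rcx: TDREPORT output buffer, rdx: REPORTDATA, r8: subtype 0 */
        ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
                                virt_to_phys(reportdata), TDREPORT_SUBTYPE_0,
                                0, NULL);
        if (ret)
                return -EIO;

        return 0;
}
EXPORT_SYMBOL_GPL(tdx_mcall_get_report0);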

This is a preparatory patch for adding attestation support.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Wander Lairson Costa <wander@redhat.com>
Link: https://lore.kernel.org/all/20221116223820.819090-2-sathyanarayanan.kuppuswamy%40linux.intel.com


# cfb8ec7a 05-Apr-2022 Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

x86/tdx: Wire up KVM hypercalls

KVM hypercalls use the VMCALL or VMMCALL instructions. Although the ABI
is similar, those instructions no longer function for TDX guests.

Make vendor-specific TDVMCALLs instead of VMCALL. This enables TDX
guests to run with KVM acting as the hypervisor.

Among other things, the KVM hypercall is used to send IPIs.

Since the KVM driver can be built as a kernel module, export
tdx_kvm_hypercall() to make the symbols visible to kvm.ko.
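
The exported helper is essentially a thin packing of the KVM hypercall
number and arguments into the TDVMCALL register layout; a sketch,
assuming the __tdx_hypercall() interface of that era:

long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2,
                       unsigned long p3, unsigned long p4)
{
        struct tdx_hypercall_args args = {
                .r10 = nr,      /* KVM hypercall number (vendor-specific) */
                .r11 = p1,
                .r12 = p2,
                .r13 = p3,
                .r14 = p4,
        };

        return __tdx_hypercall(&args, 0);
}
EXPORT_SYMBOL_GPL(tdx_kvm_hypercall);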

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-20-kirill.shutemov@linux.intel.com


# 32e72854 05-Apr-2022 Andi Kleen <ak@linux.intel.com>

x86/tdx: Port I/O: Add early boot support

TDX guests cannot do port I/O directly. The TDX module triggers a #VE
exception to let the guest kernel emulate port I/O by converting the
instructions into TDCALLs to the host.

But before IDT handlers are set up, port I/O cannot be emulated using
normal kernel #VE handlers. To support the #VE-based emulation during
this boot window, add a minimal early #VE handler support in early
exception handlers. This is similar to what AMD SEV does. This is
mainly to support earlyprintk's serial driver, as well as potentially
the VGA driver.

The early handler only supports I/O-related #VE exceptions. Unhandled or
failed exceptions will be handled via early_fixup_exceptions() (like
normal exception failures). At runtime, I/O-related #VE exceptions (along
with other types) are handled by virt_exception_kernel().

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-19-kirill.shutemov@linux.intel.com


# 4c5b9aac 05-Apr-2022 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/boot: Port I/O: Add decompression-time support for TDX

Port I/O instructions trigger #VE in the TDX environment. In response to
the exception, the kernel emulates these instructions using hypercalls.

But during early boot, at the decompression stage, it is cumbersome to
deal with #VE. It is cleaner to go to hypercalls directly, bypassing #VE
handling.

Hook up TDX-specific port I/O helpers when booting in a TDX environment.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-17-kirill.shutemov@linux.intel.com


# 4b05f815 05-Apr-2022 Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

x86/tdx: Detect TDX at early kernel decompression time

The early decompression code does port I/O for its console output. But
handling the decompression-time port I/O demands a different approach
from normal runtime because the IDT required to support #VE-based port
I/O emulation is not yet set up. Paravirtualizing the I/O calls during
the decompression step is acceptable because the decompression code
doesn't have many call sites to I/O instructions.

To support port I/O in decompression code, TDX must be detected before
the decompression code might do port I/O. Detect whether the kernel runs
in a TDX guest.

Add an early_is_tdx_guest() interface to query the cached TDX guest
status in the decompression code.

TDX is detected with CPUID. Make cpuid_count() accessible outside
boot/cpuflags.c.

TDX detection in the main kernel is very similar. Move common bits
into <asm/shared/tdx.h>.
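
In the decompressor this boils down to a cached flag filled in by the
early detection and queried by early_is_tdx_guest(); a sketch (the flag
name is illustrative, and the detection itself reuses the CPUID
signature check shared via <asm/shared/tdx.h>):

/* Set once by the decompressor's TDX detection, queried afterwards. */
static bool tdx_guest_detected;

bool early_is_tdx_guest(void)
{
        return tdx_guest_detected;
}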

The actual port I/O paravirtualization will come later in the series.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-13-kirill.shutemov@linux.intel.com


# bfe6ed0c 05-Apr-2022 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/tdx: Add HLT support for TDX guests

The HLT instruction is a privileged instruction; executing it stops
instruction execution and places the processor in a HALT state. It is
used in the kernel for cases like reboot, the idle loop and exception
fixup handlers. For the idle case, interrupts will be enabled (using
STI) before the HLT instruction (this is also called safe_halt()).

To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].

In TDX guests, executing the HLT instruction will generate a #VE, which
is used to emulate the HLT instruction. But #VE-based emulation will not
work for the safe_halt() flavor, because it requires the STI instruction
to be executed just before the TDCALL. Since the idle loop is the only
user of the safe_halt() variant, handle it as a special case.

To avoid the *safe_halt() call in the idle function, define
tdx_guest_idle() and use it to override the "x86_idle" function pointer
for a valid TDX guest.
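
A sketch of the hypercall-based emulation under the hood; __halt() and
hcall_func() are illustrative names, and the do_sti flag corresponds to
the safe_halt() special case where STI must be issued right before the
TDCALL:

static void __halt(const bool irq_disabled, const bool do_sti)
{
        struct tdx_hypercall_args args = {
                .r10 = TDX_HYPERCALL_STANDARD,
                .r11 = hcall_func(EXIT_REASON_HLT),
                .r12 = irq_disabled,
        };

        /*
         * Emulate HLT via TDVMCALL<Instruction.HLT>. For safe_halt(),
         * ask the TDCALL path to execute STI immediately before the
         * TDCALL so no interrupt window opens up before the halt.
         */
        __tdx_hypercall(&args, do_sti ? TDX_HCALL_ISSUE_STI : 0);
}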

Alternative choices like PV ops have been considered for adding
safe_halt() support. But it was rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in TDX guest just for
safe_halt() use case is not worth the cost.

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com


# 9a22bf6d 05-Apr-2022 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/traps: Add #VE support for TDX guest

Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the
kernel:

* Specific instructions (WBINVD, for example)
* Specific MSR accesses
* Specific CPUID leaf accesses
* Access to specific guest physical addresses

Syscall entry code has a critical window where the kernel stack is not
yet set up. Any exception in this window leads to hard-to-debug issues
and can be exploited for privilege escalation. Exceptions in the NMI
entry code also cause issues. Returning from the exception handler with
IRET will re-enable NMIs, and a nested NMI will corrupt the NMI stack.

For these reasons, the kernel avoids #VEs during the syscall gap and
the NMI entry code. Entry code paths do not access TD-shared memory or
MMIO regions, and do not use #VE-triggering MSRs, instructions, or CPUID
leaves that might generate #VE. The VMM can remove memory from a TD at
any point, but access to unaccepted (or missing) private memory leads to
VM termination, not to a #VE.

Similarly to page faults and breakpoints, #VEs are allowed in NMI
handlers once the kernel is ready to deal with nested NMIs.

During #VE delivery, all interrupts, including NMIs, are blocked until
TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
the VE info.

TDGETVEINFO retrieves the #VE info from the TDX module, which also
clears the "#VE valid" flag. This must be done before anything else, as
any #VE that occurs while the valid flag is set is escalated to a #DF by
the TDX module, which will result in an oops.

Virtual NMIs are inhibited if the #VE valid flag is set. NMI will not be
delivered until TDGETVEINFO is called.
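
Roughly, the retrieval step looks like this (a sketch of the TDGETVEINFO
call; the leaf macro and the tdx_module_call() wrapper follow the series
but are illustrative here):

void tdx_get_ve_info(struct ve_info *ve)
{
        struct tdx_module_output out;

        /*
         * NMIs and interrupts are blocked until this call completes and
         * the "#VE valid" flag is cleared by the TDX module.
         */
        tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out);

        /* Transfer the output parameters */
        ve->exit_reason = out.rcx;
        ve->exit_qual   = out.rdx;
        ve->gla         = out.r8;
        ve->gpa         = out.r9;
        ve->instr_len   = lower_32_bits(out.r10);
        ve->instr_info  = upper_32_bits(out.r10);
}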

For now, convert unhandled #VEs (everything, until later in this
series) so that they appear just like a #GP by calling ve_raise_fault()
directly. The ve_raise_fault() function is similar to the #GP handler
and is responsible for sending SIGSEGV to userspace, calling die() for
kernel faults, and notifying debuggers and other die chain users.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-8-kirill.shutemov@linux.intel.com


# eb94f1b6 05-Apr-2022 Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions

Guests communicate with VMMs using hypercalls. Historically, these
were implemented using instructions that are known to cause VMEXITs,
like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer
expose the guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with the VMM, the TDX
specification defines a new instruction called TDCALL.

In a TDX-based VM, since the VMM is an untrusted entity, an intermediary
layer -- the TDX module -- facilitates secure communication between the
host and the guest. The TDX module is loaded like firmware into a
special CPU mode called SEAM. TDX guests communicate with the TDX module
using the TDCALL instruction.

A guest uses TDCALL to communicate with both the TDX module and VMM.
The value of the RAX register when executing the TDCALL instruction is
used to determine the TDCALL type. A leaf of TDCALL used to communicate
with the VMM is called TDVMCALL.

Add generic interfaces to communicate with the TDX module and VMM
(using the TDCALL instruction).

__tdx_module_call() - Used to communicate with the TDX module (via
TDCALL instruction).
__tdx_hypercall() - Used by the guest to request services from
the VMM (via TDVMCALL leaf of TDCALL).

Also define an additional wrapper, _tdx_hypercall(), which adds error
handling support for TDCALL failures.
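
For reference, a sketch of how _tdx_hypercall() wraps __tdx_hypercall()
for the standard TDVMCALL layout (R10 selects the standard hypercall
ABI, R11 carries the hypercall number, R12-R15 the arguments); the exact
argument structure is illustrative:

static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
{
        struct tdx_hypercall_args args = {
                .r10 = TDX_HYPERCALL_STANDARD,
                .r11 = fn,
                .r12 = r12,
                .r13 = r13,
                .r14 = r14,
                .r15 = r15,
        };

        /* R10 holds the VMM's return code after the TDCALL completes. */
        return __tdx_hypercall(&args, 0);
}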

The __tdx_module_call() and __tdx_hypercall() helper functions are
implemented in assembly in a .S file. The TDCALL ABI requires
shuffling arguments in and out of registers, which proved to be
awkward with inline assembly.

Just like syscalls, not all TDVMCALL use cases need to use the same
number of argument registers. The implementation here picks the current
worst-case scenario for TDCALL (4 registers). For TDCALLs with fewer
than 4 arguments, there will end up being a few superfluous (cheap)
instructions. But, this approach maximizes code reuse.

For registers used by the TDCALL instruction, please check TDX GHCI
specification, the section titled "TDCALL instruction" and "TDG.VP.VMCALL
Interface".

Based on previous patch by Sean Christopherson.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-4-kirill.shutemov@linux.intel.com


# 527a534c 05-Apr-2022 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/tdx: Provide common base for SEAMCALL and TDCALL C wrappers

Secure Arbitration Mode (SEAM) is an extension of VMX architecture. It
defines a new VMX root operation (SEAM VMX root) and a new VMX non-root
operation (SEAM VMX non-root) which are both isolated from the legacy
VMX operation where the host kernel runs.

A CPU-attested software module (called 'TDX module') runs in SEAM VMX
root to manage and protect VMs running in SEAM VMX non-root. SEAM VMX
root is also used to host another CPU-attested software module (called
'P-SEAMLDR') to load and update the TDX module.

The host kernel transitions to either the P-SEAMLDR or the TDX module
via the new SEAMCALL instruction, which is essentially a VMExit from VMX
root mode to SEAM VMX root mode. SEAMCALLs are leaf functions defined by
the P-SEAMLDR and the TDX module around the new SEAMCALL instruction.

A guest kernel can also communicate with TDX module via TDCALL
instruction.

TDCALLs and SEAMCALLs use an ABI different from the x86-64 System V ABI.
RAX is used to carry both the SEAMCALL leaf function number (input) and
the completion status (output). Additional GPRs (RCX, RDX, R8-R11) may
be used as both input and output operands in individual leaves.

TDCALL and SEAMCALL share the same ABI and require largely the same
code to pass down arguments and retrieve results.

Define an assembly macro that can be used to implement C wrapper for
both TDCALL and SEAMCALL.
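
The shared ABI shows up directly in the C-level interface built around
the macro; a sketch of the output register structure and the TDCALL
wrapper of that era (the SEAMCALL counterpart was wired up later using
the same macro):

/* Output registers returned by a TDCALL/SEAMCALL leaf to its C caller. */
struct tdx_module_output {
        u64 rcx;
        u64 rdx;
        u64 r8;
        u64 r9;
        u64 r10;
        u64 r11;
};

/* @fn goes into RAX; the completion status comes back as the return value. */
u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
                      struct tdx_module_output *out);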

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-3-kirill.shutemov@linux.intel.com


# 59bd54a8 05-Apr-2022 Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

x86/tdx: Detect running as a TDX guest in early boot

In preparation for extending the cc_platform_has() API to support TDX
guests, use the CPUID instruction to detect support for TDX guests in
the early boot code (via tdx_early_init()). Since copy_bootdata() is the
first user of the cc_platform_has() API, detect the TDX guest status
before it.

Define a synthetic feature flag (X86_FEATURE_TDX_GUEST) and set this
bit on a valid TDX guest platform.
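
Roughly, the detection checks the TDX CPUID leaf for the
"IntelTDX    " signature and forces the synthetic flag on; a sketch
(leaf number and signature as documented for TDX, the rest
illustrative):

#define TDX_CPUID_LEAF_ID       0x21
#define TDX_IDENT               "IntelTDX    "

void __init tdx_early_init(void)
{
        u32 eax, sig[3];

        /* Signature is reported in EBX/EDX/ECX of the TDX CPUID leaf. */
        cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);

        if (memcmp(TDX_IDENT, sig, sizeof(sig)))
                return;

        /* Force the synthetic flag so cc_platform_has() users see it early. */
        setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
}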

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-2-kirill.shutemov@linux.intel.com