History log of /linux-master/arch/x86/kvm/mmu/paging_tmpl.h
Revision Date Author Comments
# a130066f 13-Sep-2023 Binbin Wu <binbin.wu@linux.intel.com>

KVM: x86/mmu: Drop non-PA bits when getting GFN for guest's PGD

Drop non-PA bits when getting GFN for guest's PGD with the maximum theoretical
mask for guest MAXPHYADDR.

Do it unconditionally because it's harmless for 32-bit guests, querying 64-bit
mode would be more expensive, and for EPT the mask isn't tied to guest mode.
Using PT_BASE_ADDR_MASK would be technically wrong (PAE paging has 64-bit
elements _except_ for CR3, which has only 32 valid bits), it wouldn't matter
in practice though.

Opportunistically use GENMASK_ULL() to define __PT_BASE_ADDR_MASK.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-6-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 0e3223d8 28-Jul-2023 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Use dummy root, backed by zero page, for !visible guest roots

When attempting to allocate a shadow root for a !visible guest root gfn,
e.g. that resides in MMIO space, load a dummy root that is backed by the
zero page instead of immediately synthesizing a triple fault shutdown
(using the zero page ensures any attempt to translate memory will generate
a !PRESENT fault and thus VM-Exit).

Unless the vCPU is racing with memslot activity, KVM will inject a page
fault due to not finding a visible slot in FNAME(walk_addr_generic), i.e.
the end result is mostly same, but critically KVM will inject a fault only
*after* KVM runs the vCPU with the bogus root.

Waiting to inject a fault until after running the vCPU fixes a bug where
KVM would bail from nested VM-Enter if L1 tried to run L2 with TDP enabled
and a !visible root. Even though a bad root will *probably* lead to
shutdown, (a) it's not guaranteed and (b) the CPU won't read the
underlying memory until after VM-Enter succeeds. E.g. if L1 runs L2 with
a VMX preemption timer value of '0', then architecturally the preemption
timer VM-Exit is guaranteed to occur before the CPU executes any
instruction, i.e. before the CPU needs to translate a GPA to a HPA (so
long as there are no injected events with higher priority than the
preemption timer).

If KVM manages to get to FNAME(fetch) with a dummy root, e.g. because
userspace created a memslot between installing the dummy root and handling
the page fault, simply unload the MMU to allocate a new root and retry the
instruction. Use KVM_REQ_MMU_FREE_OBSOLETE_ROOTS to drop the root, as
invoking kvm_mmu_free_roots() while holding mmu_lock would deadlock, and
conceptually the dummy root has indeeed become obsolete. The only
difference versus existing usage of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is
that the root has become obsolete due to memslot *creation*, not memslot
deletion or movement.

Reported-by: Reima Ishii <ishiir@g.ecc.u-tokyo.ac.jp>
Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
Link: https://lore.kernel.org/r/20230729005200.1057358-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# b5b359ac 28-Jul-2023 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Disallow guest from using !visible slots for page tables

Explicitly inject a page fault if guest attempts to use a !visible gfn
as a page table. kvm_vcpu_gfn_to_hva_prot() will naturally handle the
case where there is no memslot, but doesn't catch the scenario where the
gfn points at a KVM-internal memslot.

Letting the guest backdoor its way into accessing KVM-internal memslots
isn't dangerous on its own, e.g. at worst the guest can crash itself, but
disallowing the behavior will simplify fixing how KVM handles !visible
guest root gfns (immediately synthesizing a triple fault when loading the
root is architecturally wrong).

Link: https://lore.kernel.org/r/20230729005200.1057358-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 72e2fb24 28-Jul-2023 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Bug the VM if a vCPU ends up in long mode without PAE enabled

Promote the ASSERT(), which is quite dead code in KVM, into a KVM_BUG_ON()
for KVM's sanity check that CR4.PAE=1 if the vCPU is in long mode when
performing a walk of guest page tables. The sanity is quite cheap since
neither EFER nor CR4.PAE requires a VMREAD, especially relative to the
cost of walking the guest page tables.

More importantly, the sanity check would have prevented the true badness
fixed by commit 112e66017bff ("KVM: nVMX: add missing consistency checks
for CR0 and CR4"). The missed consistency check resulted in some versions
of KVM corrupting the on-stack guest_walker structure due to KVM thinking
there are 4/5 levels of page tables, but wiring up the MMU hooks to point
at the paging32 implementation, which only allocates space for two levels
of page tables in "struct guest_walker32".

Queue a page fault for injection if the assertion fails, as both callers,
FNAME(gva_to_gpa) and FNAME(walk_addr_generic), assume that walker.fault
contains sane info on a walk failure. E.g. not populating the fault info
could result in KVM consuming and/or exposing uninitialized stack data
before the vCPU is kicked out to userspace, which doesn't happen until
KVM checks for KVM_REQ_VM_DEAD on the next enter.

Move the check below the initialization of "pte_access" so that the
aforementioned to-be-injected page fault doesn't consume uninitialized
stack data. The information _shouldn't_ reach the guest or userspace,
but there's zero downside to being paranoid in this case.

Link: https://lore.kernel.org/r/20230729004722.1056172-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 20ba462d 28-Jul-2023 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()

Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.

If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many case, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.

On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with warn_on_panic=1, i.e. will never get past the first WARN anyways.

Lastly, when an assertions fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in captruing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.

Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# a98b8894 28-Jul-2023 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Delete pgprintk() and all its usage

Delete KVM's pgprintk() and all its usage, as the code is very prone
to bitrot due to being buried behind MMU_DEBUG, and the functionality has
been rendered almost entirely obsolete by the tracepoints KVM has gained
over the years. And for the situations where the information provided by
KVM's tracepoints is insufficient, pgprintk() rarely fills in the gaps,
and is almost always far too noisy, i.e. developers end up implementing
custom prints anyways.

Link: https://lore.kernel.org/r/20230729004722.1056172-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 2fdcc1b3 21-Mar-2023 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86/mmu: Avoid indirect call for get_cr3

Most of the time, calls to get_guest_pgd result in calling
kvm_read_cr3 (the exception is only nested TDP). Hardcode
the default instead of using the get_cr3 function, avoiding
a retpoline if they are enabled.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20230322013731.102955-2-minipli@grsecurity.net
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 19ace7d6 16-Feb-2023 Lai Jiangshan <jiangshan.ljs@antgroup.com>

KVM: x86/mmu: Skip calling mmu->sync_spte() when the spte is 0

Sync the spte only when the spte is set and avoid the indirect branch.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216235321.735214-5-jiangshanlai@gmail.com
[sean: add wrapper instead of open coding each check]
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 91ca7672 16-Feb-2023 Lai Jiangshan <jiangshan.ljs@antgroup.com>

kvm: x86/mmu: Remove @no_dirty_log from FNAME(prefetch_gpte)

FNAME(prefetch_gpte) is always called with @no_dirty_log=true.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216235321.735214-4-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 9fd4a4e3 16-Feb-2023 Lai Jiangshan <jiangshan.ljs@antgroup.com>

KVM: x86/mmu: Remove FNAME(invlpg) and use FNAME(sync_spte) to update vTLB instead.

In hardware TLB, invalidating TLB entries means the translations are
removed from the TLB.

In KVM shadowed vTLB, the translations (combinations of shadow paging
and hardware TLB) are generally maintained as long as they remain "clean"
when the TLB of an address space (i.e. a PCID or all) is flushed with
the help of write-protections, sp->unsync, and kvm_sync_page(), where
"clean" in this context means that no updates to KVM's SPTEs are needed.

However, FNAME(invlpg) always zaps/removes the vTLB if the shadow page is
unsync, and thus triggers a remote flush even if the original vTLB entry
is clean, i.e. is usable as-is.

Besides this, FNAME(invlpg) is largely is a duplicate implementation of
FNAME(sync_spte) to invalidate a vTLB entry.

To address both issues, reuse FNAME(sync_spte) to share the code and
slightly modify the semantics, i.e. keep the vTLB entry if it's "clean"
and avoid remote TLB flush.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216235321.735214-3-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# ed335278 16-Feb-2023 Lai Jiangshan <jiangshan.ljs@antgroup.com>

KVM: x86/mmu: Allow the roots to be invalid in FNAME(invlpg)

Don't assume the current root to be valid, just check it and remove
the WARN().

Also move the code to check if the root is valid into FNAME(invlpg)
to simplify the code.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216235321.735214-2-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# e6722d92 16-Feb-2023 Lai Jiangshan <jiangshan.ljs@antgroup.com>

KVM: x86/mmu: Reduce the update to the spte in FNAME(sync_spte)

Sometimes when the guest updates its pagetable, it adds only new gptes
to it without changing any existed one, so there is no point to update
the sptes for these existed gptes.

Also when the sptes for these unchanged gptes are updated, the AD
bits are also removed since make_spte() is called with prefetch=true
which might result unneeded TLB flushing.

Just do nothing if the gpte's permissions are unchanged.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216154115.710033-7-jiangshanlai@gmail.com
[sean: expand comment to call out A/D bits]
Signed-off-by: Sean Christopherson <seanjc@google.com>


# c3c6c9fc5 16-Feb-2023 Lai Jiangshan <jiangshan.ljs@antgroup.com>

KVM: x86/mmu: Move the code out of FNAME(sync_page)'s loop body into mmu.c

Rename mmu->sync_page to mmu->sync_spte and move the code out
of FNAME(sync_page)'s loop body into mmu.c.

No functionalities change intended.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216154115.710033-6-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 90e44470 16-Feb-2023 Lai Jiangshan <jiangshan.ljs@antgroup.com>

KVM: x86/mmu: Move the check in FNAME(sync_page) as kvm_sync_page_check()

Prepare to check mmu->sync_page pointer before calling it.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216154115.710033-3-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 753b43c9 16-Feb-2023 Lai Jiangshan <jiangshan.ljs@antgroup.com>

KVM: x86/mmu: Use 64-bit address to invalidate to fix a subtle bug

FNAME(invlpg)() and kvm_mmu_invalidate_gva() take a gva_t, i.e. unsigned
long, as the type of the address to invalidate. On 32-bit kernels, the
upper 32 bits of the GPA will get dropped when an L2 GPA address is
invalidated in the shadowed nested TDP MMU.

Convert it to u64 to fix the problem.

Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216154115.710033-2-jiangshanlai@gmail.com
[sean: tweak changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 9a967700 02-Feb-2023 Lai Jiangshan <jiangshan.ljs@antgroup.com>

KVM: x86/mmu: Remove FNAME(is_self_change_mapping)

Drop FNAME(is_self_change_mapping) and instead rely on
kvm_mmu_hugepage_adjust() to adjust the hugepage accordingly. Prior to
commit 4cd071d13c5c ("KVM: x86/mmu: Move calls to thp_adjust() down a
level"), the hugepage adjustment was done before allocating new shadow
pages, i.e. failed to restrict the hugepage sizes if a new shadow page
resulted in account_shadowed() changing the disallowed hugepage tracking.

Removing FNAME(is_self_change_mapping) fixes a bug reported by Huang Hang
where KVM unnecessarily forces a 4KiB page. FNAME(is_self_change_mapping)
has a defect in that it blindly disables _all_ hugepage mappings rather
than trying to reduce the size of the hugepage. If the guest is writing
to a 1GiB page and the 1GiB is self-referential but a 2MiB page is not,
then KVM can and should create a 2MiB mapping.

Add a comment above the call to kvm_mmu_hugepage_adjust() to call out the
new dependency on adjusting the hugepage size after walking indirect PTEs.

Reported-by: Huang Hang <hhuang@linux.alibaba.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20221213125538.81209-1-jiangshanlai@gmail.com
[sean: rework changelog after separating out the emulator change]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230202182817.407394-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 39fda5d8 02-Feb-2023 Lai Jiangshan <jiangshan.ljs@antgroup.com>

KVM: x86/mmu: Detect write #PF to shadow pages during FNAME(fetch) walk

Move the detection of write #PF to shadow pages, i.e. a fault on a write
to a page table that is being shadowed by KVM that is used to translate
the write itself, from FNAME(is_self_change_mapping) to FNAME(fetch).
There is no need to detect the self-referential write before
kvm_faultin_pfn() as KVM does not consume EMULTYPE_WRITE_PF_TO_SP for
accesses that resolve to "error or no-slot" pfns, i.e. KVM doesn't allow
retrying MMIO accesses or writes to read-only memslots.

Detecting the EMULTYPE_WRITE_PF_TO_SP scenario in FNAME(fetch) will allow
dropping FNAME(is_self_change_mapping) entirely, as the hugepage
interaction can be deferred to kvm_mmu_hugepage_adjust().

Cc: Huang Hang <hhuang@linux.alibaba.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20221213125538.81209-1-jiangshanlai@gmail.com
[sean: split to separate patch, write changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230202182817.407394-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 258d985f 02-Feb-2023 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Use EMULTYPE flag to track write #PFs to shadow pages

Use a new EMULTYPE flag, EMULTYPE_WRITE_PF_TO_SP, to track page faults
on self-changing writes to shadowed page tables instead of propagating
that information to the emulator via a semi-persistent vCPU flag. Using
a flag in "struct kvm_vcpu_arch" is confusing, especially as implemented,
as it's not at all obvious that clearing the flag only when emulation
actually occurs is correct.

E.g. if KVM sets the flag and then retries the fault without ever getting
to the emulator, the flag will be left set for future calls into the
emulator. But because the flag is consumed if and only if both
EMULTYPE_PF and EMULTYPE_ALLOW_RETRY_PF are set, and because
EMULTYPE_ALLOW_RETRY_PF is deliberately not set for direct MMUs, emulated
MMIO, or while L2 is active, KVM avoids false positives on a stale flag
since FNAME(page_fault) is guaranteed to be run and refresh the flag
before it's ultimately consumed by the tail end of reexecute_instruction().

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230202182817.407394-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 1b2dc736 10-Oct-2022 Hou Wenlong <houwenlong.hwl@antgroup.com>

KVM: x86/mmu: Fix wrong start gfn of tlb flushing with range

When a spte is dropped, the start gfn of tlb flushing should be the gfn
of spte not the base gfn of SP which contains the spte. Also introduce a
helper function to do range-based flushing when a spte is dropped, which
would help prevent future buggy use of
kvm_flush_remote_tlbs_with_address() in such case.

Fixes: c3134ce240eed ("KVM: Replace old tlb flush function with new one to flush a specified range.")
Suggested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Link: https://lore.kernel.org/r/72ac2169a261976f00c1703e88cda676dfb960f5.1665214747.git.houwenlong.hwl@antgroup.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# c667a3ba 10-Oct-2022 Hou Wenlong <houwenlong.hwl@antgroup.com>

KVM: x86/mmu: Move round_gfn_for_level() helper into mmu_internal.h

Rounding down the GFN to a huge page size is a common pattern throughout
KVM, so move round_gfn_for_level() helper in tdp_iter.c to
mmu_internal.h for common usage. Also rename it as gfn_round_for_level()
to use gfn_* prefix and clean up the other call sites.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Link: https://lore.kernel.org/r/415c64782f27444898db650e21cf28eeb6441dfa.1665214747.git.houwenlong.hwl@antgroup.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 9e3fbdfd 05-Jan-2023 Lai Jiangshan <jiangshan.ljs@antgroup.com>

kvm: x86/mmu: Don't clear write flooding for direct SP

Although there is no harm, but there is no point to clear write
flooding for direct SP.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230105100310.6700-1-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 354c908c 21-Sep-2022 David Matlack <dmatlack@google.com>

KVM: x86/mmu: Handle no-slot faults in kvm_faultin_pfn()

Handle faults on GFNs that do not have a backing memslot in
kvm_faultin_pfn() and drop handle_abnormal_pfn(). This eliminates
duplicate code in the various page fault handlers.

Opportunistically tweak the comment about handling gfn > host.MAXPHYADDR
to reflect that the effect of returning RET_PF_EMULATE at that point is
to avoid creating an MMIO SPTE for such GFNs.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220921173546.2674386-7-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# ba6e3fe2 21-Sep-2022 David Matlack <dmatlack@google.com>

KVM: x86/mmu: Grab mmu_invalidate_seq in kvm_faultin_pfn()

Grab mmu_invalidate_seq in kvm_faultin_pfn() and stash it in struct
kvm_page_fault. The eliminates duplicate code and reduces the amount of
parameters needed for is_page_fault_stale().

Preemptively split out __kvm_faultin_pfn() to a separate function for
use in subsequent commits.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220921173546.2674386-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# f09948ec 21-Sep-2022 David Matlack <dmatlack@google.com>

KVM: x86/mmu: Handle no-slot faults in kvm_faultin_pfn()

Handle faults on GFNs that do not have a backing memslot in
kvm_faultin_pfn() and drop handle_abnormal_pfn(). This eliminates
duplicate code in the various page fault handlers.

Opportunistically tweak the comment about handling gfn > host.MAXPHYADDR
to reflect that the effect of returning RET_PF_EMULATE at that point is
to avoid creating an MMIO SPTE for such GFNs.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220921173546.2674386-7-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 90c54c19 21-Sep-2022 David Matlack <dmatlack@google.com>

KVM: x86/mmu: Grab mmu_invalidate_seq in kvm_faultin_pfn()

Grab mmu_invalidate_seq in kvm_faultin_pfn() and stash it in struct
kvm_page_fault. The eliminates duplicate code and reduces the amount of
parameters needed for is_page_fault_stale().

Preemptively split out __kvm_faultin_pfn() to a separate function for
use in subsequent commits.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220921173546.2674386-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 55c510e2 19-Oct-2022 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Rename NX huge pages fields/functions for consistency

Rename most of the variables/functions involved in the NX huge page
mitigation to provide consistency, e.g. lpage vs huge page, and NX huge
vs huge NX, and also to provide clarity, e.g. to make it obvious the flag
applies only to the NX huge page mitigation, not to any condition that
prevents creating a huge page.

Add a comment explaining what the newly named "possible_nx_huge_pages"
tracks.

Leave the nx_lpage_splits stat alone as the name is ABI and thus set in
stone.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20221019165618.927057-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 428e9216 19-Oct-2022 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked

Tag shadow pages that cannot be replaced with an NX huge page regardless
of whether or not zapping the page would allow KVM to immediately create
a huge page, e.g. because something else prevents creating a huge page.

I.e. track pages that are disallowed from being NX huge pages regardless
of whether or not the page could have been huge at the time of fault.
KVM currently tracks pages that were disallowed from being huge due to
the NX workaround if and only if the page could otherwise be huge. But
that fails to handled the scenario where whatever restriction prevented
KVM from installing a huge page goes away, e.g. if dirty logging is
disabled, the host mapping level changes, etc...

Failure to tag shadow pages appropriately could theoretically lead to
false negatives, e.g. if a fetch fault requests a small page and thus
isn't tracked, and a read/write fault later requests a huge page, KVM
will not reject the huge page as it should.

To avoid yet another flag, initialize the list_head and use list_empty()
to determine whether or not a page is on the list of NX huge pages that
should be recovered.

Note, the TDP MMU accounting is still flawed as fixing the TDP MMU is
more involved due to mmu_lock being held for read. This will be
addressed in a future commit.

Fixes: 5bcaf3e1715f ("KVM: x86/mmu: Account NX huge page disallowed iff huge page was requested")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221019165618.927057-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# b85a97b8 31-Aug-2022 Jilin Yuan <yuanjilin@cdjrlc.com>

KVM: x86/mmu: fix repeated words in comments

Delete the redundant word 'to'.

Signed-off-by: Jilin Yuan <yuanjilin@cdjrlc.com>
Link: https://lore.kernel.org/r/20220831125217.12313-1-yuanjilin@cdjrlc.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 20ec3ebd 16-Aug-2022 Chao Peng <chao.p.peng@linux.intel.com>

KVM: Rename mmu_notifier_* to mmu_invalidate_*

The motivation of this renaming is to make these variables and related
helper functions less mmu_notifier bound and can also be used for non
mmu_notifier based page invalidation. mmu_invalidate_* was chosen to
better describe the purpose of 'invalidating' a page that those
variables are used for.

- mmu_notifier_seq/range_start/range_end are renamed to
mmu_invalidate_seq/range_start/range_end.

- mmu_notifier_retry{_hva} helper functions are renamed to
mmu_invalidate_retry{_hva}.

- mmu_notifier_count is renamed to mmu_invalidate_in_progress to
avoid confusion with mn_active_invalidate_count.

- While here, also update kvm_inc/dec_notifier_count() to
kvm_mmu_invalidate_begin/end() to match the change for
mmu_notifier_count.

No functional change intended.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Message-Id: <20220816125322.1110439-3-chao.p.peng@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 79e48cec 11-Jul-2022 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Add optimized helper to retrieve an SPTE's index

Add spte_index() to dedup all the code that calculates a SPTE's index
into its parent's page table and/or spt array. Opportunistically tweak
the calculation to avoid pointer arithmetic, which is subtle (subtract in
8-byte chunks) and less performant (requires the compiler to generate the
subtraction).

Suggested-by: David Matlack <dmatlack@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220712020724.1262121-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 6e1d2a3f 01-Jul-2022 Hou Wenlong <houwenlong.hwl@antgroup.com>

KVM: x86/mmu: Replace UNMAPPED_GVA with INVALID_GPA for gva_to_gpa()

The result of gva_to_gpa() is physical address not virtual address,
it is odd that UNMAPPED_GVA macro is used as the result for physical
address. Replace UNMAPPED_GVA with INVALID_GPA and drop UNMAPPED_GVA
macro.

No functional change intended.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/6104978956449467d3c68f1ad7f2c2f6d771d0ee.1656667239.git.houwenlong.hwl@antgroup.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 0cd8dc73 22-Jun-2022 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86/mmu: pull call to drop_large_spte() into __link_shadow_page()

Before allocating a child shadow page table, all callers check
whether the parent already points to a huge page and, if so, they
drop that SPTE. This is done by drop_large_spte().

However, dropping the large SPTE is really only necessary before the
sp is installed. While the sp is returned by kvm_mmu_get_child_sp(),
installing it happens later in __link_shadow_page(). Move the call
there instead of having it in each and every caller.

To ensure that the shadow page is not linked twice if it was present,
do _not_ opportunistically make kvm_mmu_get_child_sp() idempotent:
instead, return an error value if the shadow page already existed.
This is a bit more verbose, but clearer than NULL.

Finally, now that the drop_large_spte() name is not taken anymore,
remove the two underscores in front of __drop_large_spte().

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 6a97575d 22-Jun-2022 David Matlack <dmatlack@google.com>

KVM: x86/mmu: Cache the access bits of shadowed translations

Splitting huge pages requires allocating/finding shadow pages to replace
the huge page. Shadow pages are keyed, in part, off the guest access
permissions they are shadowing. For fully direct MMUs, there is no
shadowing so the access bits in the shadow page role are always ACC_ALL.
But during shadow paging, the guest can enforce whatever access
permissions it wants.

In particular, eager page splitting needs to know the permissions to use
for the subpages, but KVM cannot retrieve them from the guest page
tables because eager page splitting does not have a vCPU. Fortunately,
the guest access permissions are easy to cache whenever page faults or
FNAME(sync_page) update the shadow page tables; this is an extension of
the existing cache of the shadowed GFNs in the gfns array of the shadow
page. The access bits only take up 3 bits, which leaves 61 bits left
over for gfns, which is more than enough.

Now that the gfns array caches more information than just GFNs, rename
it to shadowed_translation.

While here, preemptively fix up the WARN_ON() that detects gfn
mismatches in direct SPs. The WARN_ON() was paired with a
pr_err_ratelimited(), which means that users could sometimes see the
WARN without the accompanying error message. Fix this by outputting the
error message as part of the WARN splat, and opportunistically make
them WARN_ONCE() because if these ever fire, they are all but guaranteed
to fire a lot and will bring down the kernel.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-18-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 2e65e842 22-Jun-2022 David Matlack <dmatlack@google.com>

KVM: x86/mmu: Derive shadow MMU page role from parent

Instead of computing the shadow page role from scratch for every new
page, derive most of the information from the parent shadow page. This
eliminates the dependency on the vCPU root role to allocate shadow page
tables, and reduces the number of parameters to kvm_mmu_get_page().

Preemptively split out the role calculation to a separate function for
use in a following commit.

Note that when calculating the MMU root role, we can take
@role.passthrough, @role.direct, and @role.access directly from
@vcpu->arch.mmu->root_role. Only @role.level and @role.quadrant still
must be overridden for PAE page directories, when shadowing 32-bit
guest page tables with PAE page tables.

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-5-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 70e41c31 14-Jun-2022 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Use common logic for computing the 32/64-bit base PA mask

Use common logic for computing PT_BASE_ADDR_MASK for 32-bit, 64-bit, and
EPT paging. Both PAGE_MASK and the new-common logic are supsersets of
what is actually needed for 32-bit paging. PAGE_MASK sets bits 63:12 and
the former GUEST_PT64_BASE_ADDR_MASK sets bits 51:12, so regardless of
which value is used, the result will always be bits 31:12.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220614233328.3896033-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# f7384b88 14-Jun-2022 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Truncate paging32's PT_BASE_ADDR_MASK to 32 bits

Truncate paging32's PT_BASE_ADDR_MASK to a pt_element_t, i.e. to 32 bits.
Ignoring PSE huge pages, the mask is only used in conjunction with gPTEs,
which are 32 bits, and so the address is limited to bits 31:12.

PSE huge pages encoded PA bits 39:32 in PTE bits 20:13, i.e. need custom
logic to handle their funky encoding regardless of PT_BASE_ADDR_MASK.

Note, PT_LVL_OFFSET_MASK is somewhat confusing in that it computes the
offset of the _gfn_, not of the gpa, i.e. not having bits 63:32 set in
PT_BASE_ADDR_MASK is again correct.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220614233328.3896033-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# f6b8ea6d 15-Jun-2022 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86/mmu: Use common macros to compute 32/64-bit paging masks

Dedup the code for generating (most of) the per-type PT_* masks in
paging_tmpl.h. The relevant macros only vary based on the number of bits
per level, and that smidge of info is already provided in a common form
as PT_LEVEL_BITS.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220614233328.3896033-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 2ca3129e 14-Jun-2022 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Use separate namespaces for guest PTEs and shadow PTEs

Separate the macros for KVM's shadow PTEs (SPTE) from guest 64-bit PTEs
(PT64). SPTE and PT64 are _mostly_ the same, but the few differences are
quite critical, e.g. *_BASE_ADDR_MASK must differentiate between host and
guest physical address spaces, and SPTE_PERM_MASK (was PT64_PERM_MASK) is
very much specific to SPTEs.

Opportunistically (and temporarily) move most guest macros into paging.h
to clearly associate them with shadow paging, and to ensure that they're
not used as of this commit. A future patch will eliminate them entirely.

Sadly, PT32_LEVEL_BITS is left behind in mmu_internal.h because it's
needed for the quadrant calculation in kvm_mmu_get_page(). The quadrant
calculation is hot enough (when using shadow paging with 32-bit guests)
that adding a per-context helper is undesirable, and burying the
computation in paging_tmpl.h with a forward declaration isn't exactly an
improvement.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220614233328.3896033-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# b3fcdb04 14-Jun-2022 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Bury 32-bit PSE paging helpers in paging_tmpl.h

Move a handful of one-off macros and helpers for 32-bit PSE paging into
paging_tmpl.h and hide them behind "PTTYPE == 32". Under no circumstance
should anything but 32-bit shadow paging care about PSE paging.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220614233328.3896033-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 007a369f 13-Jun-2022 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Drop unused CMPXCHG macro from paging_tmpl.h

Drop the CMPXCHG macro from paging_tmpl.h, it's no longer used now that
KVM uses a common uaccess helper to do 8-byte CMPXCHG.

Fixes: f122dfe44768 ("KVM: x86: Use __try_cmpxchg_user() to update guest PTE A/D bits")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220613225723.2734132-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# b8b9156e 13-May-2022 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Comment FNAME(sync_page) to document TLB flushing logic

Add a comment to FNAME(sync_page) to explain why the TLB flushing logic
conspiculously doesn't handle the scenario of guest protections being
reduced. Specifically, if synchronizing a SPTE drops execute protections,
KVM will not emit a TLB flush, whereas dropping writable or clearing A/D
bits does trigger a flush via mmu_spte_update(). Architecturally, until
the GPTE is implicitly or explicitly flushed from the guest's perspective,
KVM is not required to flush any old, stale translations.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20220513195000.99371-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 9fb35657 13-May-2022 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Drop RWX=0 SPTEs during ept_sync_page()

All of sync_page()'s existing checks filter out only !PRESENT gPTE,
because without execute-only, all upper levels are guaranteed to be at
least READABLE. However, if EPT with execute-only support is in use by
L1, KVM can create an SPTE that is shadow-present but guest-inaccessible
(RWX=0) if the upper level combined permissions are R (or RW) and
the leaf EPTE is changed from R (or RW) to X. Because the EPTE is
considered present when viewed in isolation, and no reserved bits are set,
FNAME(prefetch_invalid_gpte) will consider the GPTE valid, and cause a
not-present SPTE to be created.

The SPTE is "correct": the guest translation is inaccessible because
the combined protections of all levels yield RWX=0, and KVM will just
redirect any vmexits to the guest. If EPT A/D bits are disabled, KVM
can mistake the SPTE for an access-tracked SPTE, but again such confusion
isn't fatal, as the "saved" protections are also RWX=0. However,
creating a useless SPTE in general means that KVM messed up something,
even if this particular goof didn't manifest as a functional bug.
So, drop SPTEs whose new protections will yield a RWX=0 SPTE, and
add a WARN in make_spte() to detect creation of SPTEs that will
result in RWX=0 protections.

Fixes: d95c55687e11 ("kvm: mmu: track read permission explicitly for shadow EPT page tables")
Cc: David Matlack <dmatlack@google.com>
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220513195000.99371-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 1075d41e 22-Apr-2022 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Expand and clean up page fault stats

Expand and clean up the page fault stats. The current stats are at best
incomplete, and at worst misleading. Differentiate between faults that
are actually fixed vs those that result in an MMIO SPTE being created,
track faults that are spurious, faults that trigger emulation, faults
that that are fixed in the fast path, and last but not least, track the
number of faults that are taken.

Note, the number of faults that require emulation for write-protected
shadow pages can roughly be calculated by subtracting the number of MMIO
SPTEs created from the overall number of faults that trigger emulation.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 5276c616 22-Apr-2022 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Add RET_PF_CONTINUE to eliminate bool+int* "returns"

Add RET_PF_CONTINUE and use it in handle_abnormal_pfn() and
kvm_faultin_pfn() to signal that the page fault handler should continue
doing its thing. Aside from being gross and inefficient, using a boolean
return to signal continue vs. stop makes it extremely difficult to add
more helpers and/or move existing code to a helper.

E.g. hypothetically, if nested MMUs were to gain a separate page fault
handler in the future, everything up to the "is self-modifying PTE" check
can be shared by all shadow MMUs, but communicating up the stack whether
to continue on or stop becomes a nightmare.

More concretely, proposed support for private guest memory ran into a
similar issue, where it'll be forced to forego a helper in order to yield
sane code: https://lore.kernel.org/all/YkJbxiL%2FAz7olWlq@google.com.

No functional change intended.

Cc: David Matlack <dmatlack@google.com>
Cc: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 84e5ffd0 20-Apr-2022 Lai Jiangshan <jiangshan.ljs@antgroup.com>

KVM: X86/MMU: Fix shadowing 5-level NPT for 4-level NPT L1 guest

When shadowing 5-level NPT for 4-level NPT L1 guest, the root_sp is
allocated with role.level = 5 and the guest pagetable's root gfn.

And root_sp->spt[0] is also allocated with the same gfn and the same
role except role.level = 4. Luckily that they are different shadow
pages, but only root_sp->spt[0] is the real translation of the guest
pagetable.

Here comes a problem:

If the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice verse)
and uses the same gfn as the root page for nested NPT before and after
switching gCR4_LA57. The host (hCR4_LA57=1) might use the same root_sp
for the guest even the guest switches gCR4_LA57. The guest will see
unexpected page mapped and L2 may exploit the bug and hurt L1. It is
lucky that the problem can't hurt L0.

And three special cases need to be handled:

The root_sp should be like role.direct=1 sometimes: its contents are
not backed by gptes, root_sp->gfns is meaningless. (For a normal high
level sp in shadow paging, sp->gfns is often unused and kept zero, but
it could be relevant and meaningful if sp->gfns is used because they
are backed by concrete gptes.)

For such root_sp in the case, root_sp is just a portal to contribute
root_sp->spt[0], and root_sp->gfns should not be used and
root_sp->spt[0] should not be dropped if gpte[0] of the guest root
pagetable is changed.

Such root_sp should not be accounted too.

So add role.passthrough to distinguish the shadow pages in the hash
when gCR4_LA57 is toggled and fix above special cases by using it in
kvm_mmu_page_{get|set}_gfn() and sp_has_gptes().

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220420131204.2850-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 4d25502a 10-Feb-2022 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86/mmu: replace root_level with cpu_role.base.level

Remove another duplicate field of struct kvm_mmu. This time it's
the root level for page table walking; the separate field is
always initialized as cpu_role.base.level, so its users can look
up the CPU mode directly instead.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 7a458f0e 14-Feb-2022 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86/mmu: remove extended bits from mmu_role, rename field

mmu_role represents the role of the root of the page tables.
It does not need any extended bits, as those govern only KVM's
page table walking; the is_* functions used for page table
walking always use the CPU role.

ext.valid is not present anymore in the MMU role, but an
all-zero MMU role is impossible because the level field is
never zero in the MMU role. So just zap the whole mmu_role
in order to force invalidation after CPUID is updated.

While making this change, which requires touching almost every
occurrence of "mmu_role", rename it to "root_role".

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# ec283cb1 10-Feb-2022 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86/mmu: remove ept_ad field

The ept_ad field is used during page walk to determine if the guest PTEs
have accessed and dirty bits. In the MMU role, the ad_disabled
bit represents whether the *shadow* PTEs have the bits, so it
would be incorrect to replace PT_HAVE_ACCESSED_DIRTY with just
!mmu->mmu_role.base.ad_disabled.

However, the similar field in the CPU mode, ad_disabled, is initialized
correctly: to the opposite value of ept_ad for shadow EPT, and zero
for non-EPT guest paging modes (which always have A/D bits). It is
therefore possible to compute PT_HAVE_ACCESSED_DIRTY from the CPU mode,
like other page-format fields; it just has to be inverted to account
for the different polarity.

In fact, now that the CPU mode is distinct from the MMU roles, it would
even be possible to remove PT_HAVE_ACCESSED_DIRTY macro altogether, and
use !mmu->cpu_role.base.ad_disabled instead. I am not doing this because
the macro has a small effect in terms of dead code elimination:

text data bss dec hex
103544 16665 112 120321 1d601 # as of this patch
103746 16665 112 120523 1d6cb # without PT_HAVE_ACCESSED_DIRTY

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# e5ed0fb0 11-Feb-2022 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86/mmu: split cpu_role from mmu_role

Snapshot the state of the processor registers that govern page walk into
a new field of struct kvm_mmu. This is a more natural representation
than having it *mostly* in mmu_role but not exclusively; the delta
right now is represented in other fields, such as root_level.

The nested MMU now has only the CPU role; and in fact the new function
kvm_calc_cpu_role is analogous to the previous kvm_calc_nested_mmu_role,
except that it has role.base.direct equal to !CR0.PG. For a walk-only
MMU, "direct" has no meaning, but we set it to !CR0.PG so that
role.ext.cr0_pg can go away in a future patch.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 25cc0565 10-Feb-2022 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86/mmu: rephrase unclear comment

If accessed bits are not supported there simple isn't any distinction
between accessed and non-accessed gPTEs, so the comment does not make
much sense. Rephrase it in terms of what happens if accessed bits
*are* supported.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# f122dfe4 01-Feb-2022 Sean Christopherson <seanjc@google.com>

KVM: x86: Use __try_cmpxchg_user() to update guest PTE A/D bits

Use the recently introduced __try_cmpxchg_user() to update guest PTE A/D
bits instead of mapping the PTE into kernel address space. The VM_PFNMAP
path is broken as it assumes that vm_pgoff is the base pfn of the mapped
VMA range, which is conceptually wrong as vm_pgoff is the offset relative
to the file and has nothing to do with the pfn. The horrific hack worked
for the original use case (backing guest memory with /dev/mem), but leads
to accessing "random" pfns for pretty much any other VM_PFNMAP case.

Fixes: bd53cb35a3e9 ("X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs")
Debugged-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Reported-by: syzbot+6cde2282daa792c49ab8@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220202004945.2540433-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# ca2a7c22 28-Mar-2022 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Derive EPT violation RWX bits from EPTE RWX bits

Derive the mask of RWX bits reported on EPT violations from the mask of
RWX bits that are shoved into EPT entries; the layout is the same, the
EPT violation bits are simply shifted by three. Use the new shift and a
slight copy-paste of the mask derivation instead of completely open
coding the same to convert between the EPT entry bits and the exit
qualification when synthesizing a nested EPT Violation.

No functional change intended.

Cc: SU Hang <darcy.sh@antgroup.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220329030108.97341-3-darcy.sh@antgroup.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# aecce510 28-Mar-2022 SU Hang <darcy.sh@antgroup.com>

KVM: VMX: replace 0x180 with EPT_VIOLATION_* definition

Using self-expressing macro definition EPT_VIOLATION_GVA_VALIDATION
and EPT_VIOLATION_GVA_TRANSLATED instead of 0x180
in FNAME(walk_addr_generic)().

Signed-off-by: SU Hang <darcy.sh@antgroup.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220329030108.97341-2-darcy.sh@antgroup.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 2a8859f3 28-Mar-2022 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86/mmu: do compare-and-exchange of gPTE via the user address

FNAME(cmpxchg_gpte) is an inefficient mess. It is at least decent if it
can go through get_user_pages_fast(), but if it cannot then it tries to
use memremap(); that is not just terribly slow, it is also wrong because
it assumes that the VM_PFNMAP VMA is contiguous.

The right way to do it would be to do the same thing as
hva_to_pfn_remapped() does since commit add6a0cd1c5b ("KVM: MMU: try to
fix up page faults before giving up", 2016-07-05), using follow_pte()
and fixup_user_fault() to determine the correct address to use for
memremap(). To do this, one could for example extract hva_to_pfn()
for use outside virt/kvm/kvm_main.c. But really there is no reason to
do that either, because there is already a perfectly valid address to
do the cmpxchg() on, only it is a userspace address. That means doing
user_access_begin()/user_access_end() and writing the code in assembly
to handle exceptions correctly. Worse, the guest PTE can be 8-byte
even on i686 so there is the extra complication of using cmpxchg8b to
account for. But at least it is an efficient mess.

(Thanks to Linus for suggesting improvement on the inline assembly).

Reported-by: Qiuhao Li <qiuhao@sysec.org>
Reported-by: Gaoning Pan <pgn@zju.edu.cn>
Reported-by: Yongkang Jia <kangel@zju.edu.cn>
Reported-by: syzbot+6cde2282daa792c49ab8@syzkaller.appspotmail.com
Debugged-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Fixes: bd53cb35a3e9 ("X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 5b22bbe7 11-Mar-2022 Lai Jiangshan <jiangshan.ljs@antgroup.com>

KVM: X86: Change the type of access u32 to u64

Change the type of access u32 to u64 for FNAME(walk_addr) and
->gva_to_gpa().

The kinds of accesses are usually combinations of UWX, and VMX/SVM's
nested paging adds a new factor of access: is it an access for a guest
page table or for a final guest physical address.

And SMAP relies a factor for supervisor access: explicit or implicit.

So @access in FNAME(walk_addr) and ->gva_to_gpa() is better to include
all these information to do the walk.

Although @access(u32) has enough bits to encode all the kinds, this
patch extends it to u64:
o Extra bits will be in the higher 32 bits, so that we can
easily obtain the traditional access mode (UWX) by converting
it to u32.
o Reuse the value for the access kind defined by SVM's nested
paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as
@error_code in kvm_handle_page_fault().

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# b9e5603c 21-Feb-2022 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: use struct kvm_mmu_root_info for mmu->root

The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info.
Use the struct to have more consistency between mmu->root and
mmu->prev_roots.

The patch is entirely search and replace except for cached_root_available,
which does not need a temporary struct kvm_mmu_root_info anymore.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 1bbc60d0 18-Feb-2022 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Remove MMU auditing

Remove mmu_audit.c and all its collateral, the auditing code has suffered
severe bitrot, ironically partly due to shadow paging being more stable
and thus not benefiting as much from auditing, but mostly due to TDP
supplanting shadow paging for non-nested guests and shadowing of nested
TDP not heavily stressing the logic that is being audited.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# c59a0f57 24-Nov-2021 Lai Jiangshan <laijs@linux.alibaba.com>

KVM: X86: Remove mmu->translate_gpa

Reduce an indirect function call (retpoline) and some intialization
code.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-4-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 1f5a21ee 24-Nov-2021 Lai Jiangshan <laijs@linux.alibaba.com>

KVM: X86: Add parameter struct kvm_mmu *mmu into mmu->gva_to_gpa()

The mmu->gva_to_gpa() has no "struct kvm_mmu *mmu", so an extra
FNAME(gva_to_gpa_nested) is needed.

Add the parameter can simplify the code. And it makes it explicit that
the walk is upon vcpu->arch.walk_mmu for gva and vcpu->arch.mmu for L2
gpa in translate_nested_gpa() via the new parameter.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# a955cad8 19-Nov-2021 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Retry page fault if root is invalidated by memslot update

Bail from the page fault handler if the root shadow page was obsoleted by
a memslot update. Do the check _after_ acuiring mmu_lock, as the TDP MMU
doesn't rely on the memslot/MMU generation, and instead relies on the
root being explicit marked invalid by kvm_mmu_zap_all_fast(), which takes
mmu_lock for write.

For the TDP MMU, inserting a SPTE into an obsolete root can leak a SP if
kvm_tdp_mmu_zap_invalidated_roots() has already zapped the SP, i.e. has
moved past the gfn associated with the SP.

For other MMUs, the resulting behavior is far more convoluted, though
unlikely to be truly problematic. Installing SPs/SPTEs into the obsolete
root isn't directly problematic, as the obsolete root will be unloaded
and dropped before the vCPU re-enters the guest. But because the legacy
MMU tracks shadow pages by their role, any SP created by the fault can
can be reused in the new post-reload root. Again, that _shouldn't_ be
problematic as any leaf child SPTEs will be created for the current/valid
memslot generation, and kvm_mmu_get_page() will not reuse child SPs from
the old generation as they will be flagged as obsolete. But, given that
continuing with the fault is pointess (the root will be unloaded), apply
the check to all MMUs.

Fixes: b7cccd397f31 ("KVM: x86/mmu: Fast invalidation for TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120045046.3940942-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 2839180c 29-Sep-2021 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86/mmu: clean up prefetch/prefault/speculative naming

"prefetch", "prefault" and "speculative" are used throughout KVM to mean
the same thing. Use a single name, standardizing on "prefetch" which
is already used by various functions such as direct_pte_prefetch,
FNAME(prefetch_gpte), FNAME(pte_prefetch), etc.

Suggested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 53597858 17-Aug-2021 David Matlack <dmatlack@google.com>

KVM: x86/mmu: Avoid memslot lookup in make_spte and mmu_try_to_unsync_pages

mmu_try_to_unsync_pages checks if page tracking is active for the given
gfn, which requires knowing the memslot. We can pass down the memslot
via make_spte to avoid this lookup.

The memslot is also handy for make_spte's marking of the gfn as dirty:
we can test whether dirty page tracking is enabled, and if so ensure that
pages are mapped as writable with 4K granularity. Apart from the warning,
no functional change is intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210813203504.2742757-7-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 8a9f566a 13-Aug-2021 David Matlack <dmatlack@google.com>

KVM: x86/mmu: Avoid memslot lookup in rmap_add

Avoid the memslot lookup in rmap_add, by passing it down from the fault
handling code to mmu_set_spte and then to rmap_add.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210813203504.2742757-6-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# a12f4381 17-Aug-2021 Paolo Bonzini <pbonzini@redhat.com>

KVM: MMU: pass struct kvm_page_fault to mmu_set_spte

mmu_set_spte is called for either PTE prefetching or page faults. The
three boolean arguments write_fault, speculative and host_writable are
always respectively false/true/true for prefetching and coming from
a struct kvm_page_fault for page faults.

Let mmu_set_spte distinguish these two situation by accepting a
possibly NULL struct kvm_page_fault argument.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 7158bee4 17-Aug-2021 Paolo Bonzini <pbonzini@redhat.com>

KVM: MMU: pass kvm_mmu_page struct to make_spte

The level and A/D bit support of the new SPTE can be found in the role,
which is stored in the kvm_mmu_page struct. This merges two arguments
into one.

For the TDP MMU, the kvm_mmu_page was not used (kvm_tdp_mmu_map does
not use it if the SPTE is already present) so we fetch it just before
calling make_spte.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# eb5cd7ff 17-Aug-2021 Paolo Bonzini <pbonzini@redhat.com>

KVM: MMU: remove unnecessary argument to mmu_set_spte

The level of the new SPTE can be found in the kvm_mmu_page struct; there
is no need to pass it down.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 4758d47e 17-Aug-2021 Paolo Bonzini <pbonzini@redhat.com>

KVM: MMU: inline set_spte in FNAME(sync_page)

Since the two callers of set_spte do different things with the results,
inlining it actually makes the code simpler to reason about. For example,
FNAME(sync_page) already has a struct kvm_mmu_page *, but set_spte had to
fish it back out of sptep's private page data.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# e710c5f6 24-Sep-2021 David Matlack <dmatlack@google.com>

KVM: x86/mmu: Pass the memslot around via struct kvm_page_fault

The memslot for the faulting gfn is used throughout the page fault
handling code, so capture it in kvm_page_fault as soon as we know the
gfn and use it in the page fault handling code that has direct access
to the kvm_page_fault struct. Replace various tests using is_noslot_pfn
with more direct tests on fault->slot being NULL.

This, in combination with the subsequent patch, improves "Populate
memory time" in dirty_log_perf_test by 5% when using the legacy MMU.
There is no discerable improvement to the performance of the TDP MMU.

No functional change intended.

Suggested-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210813203504.2742757-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# b1a429fb 06-Sep-2021 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Verify shadow walk doesn't terminate early in page faults

WARN and bail if the shadow walk for faulting in a SPTE terminates early,
i.e. doesn't reach the expected level because the walk encountered a
terminal SPTE. The shadow walks for page faults are subtle in that they
install non-leaf SPTEs (zapping leaf SPTEs if necessary!) in the loop
body, and consume the newly created non-leaf SPTE in the loop control,
e.g. __shadow_walk_next(). In other words, the walks guarantee that the
walk will stop if and only if the target level is reached by installing
non-leaf SPTEs to guarantee the walk remains valid.

Opportunistically use fault->goal-level instead of it.level in
FNAME(fetch) to further clarify that KVM always installs the leaf SPTE at
the target level.

Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210906122547.263316-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# f0066d94 06-Aug-2021 Paolo Bonzini <pbonzini@redhat.com>

KVM: MMU: change tracepoints arguments to kvm_page_fault

Pass struct kvm_page_fault to tracepoints instead of extracting the
arguments from the struct. This also lets the kvm_mmu_spte_requested
tracepoint pick the gfn directly from fault->gfn, instead of using
the address.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 536f0e6a 06-Aug-2021 Paolo Bonzini <pbonzini@redhat.com>

KVM: MMU: change disallowed_hugepage_adjust() arguments to kvm_page_fault

Pass struct kvm_page_fault to disallowed_hugepage_adjust() instead of
extracting the arguments from the struct. Tweak a bit the conditions
to avoid long lines.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 73a3c659 07-Aug-2021 Paolo Bonzini <pbonzini@redhat.com>

KVM: MMU: change kvm_mmu_hugepage_adjust() arguments to kvm_page_fault

Pass struct kvm_page_fault to kvm_mmu_hugepage_adjust() instead of
extracting the arguments from the struct; the results are also stored
in the struct, so the callers are adjusted consequently.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 9c03b182 06-Aug-2021 Paolo Bonzini <pbonzini@redhat.com>

KVM: MMU: change FNAME(fetch)() arguments to kvm_page_fault

Pass struct kvm_page_fault to FNAME(fetch)() instead of
extracting the arguments from the struct.

Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 3a13f4fe 06-Aug-2021 Paolo Bonzini <pbonzini@redhat.com>

KVM: MMU: change handle_abnormal_pfn() arguments to kvm_page_fault

Pass struct kvm_page_fault to handle_abnormal_pfn() instead of
extracting the arguments from the struct.

Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 3647cd04 07-Aug-2021 Paolo Bonzini <pbonzini@redhat.com>

KVM: MMU: change kvm_faultin_pfn() arguments to kvm_page_fault

Add fields to struct kvm_page_fault corresponding to outputs of
kvm_faultin_pfn(). For now they have to be extracted again from struct
kvm_page_fault in the subsequent steps, but this is temporary until
other functions in the chain are switched over as well.

Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# b8a5d551 06-Aug-2021 Paolo Bonzini <pbonzini@redhat.com>

KVM: MMU: change page_fault_handle_page_track() arguments to kvm_page_fault

Add fields to struct kvm_page_fault corresponding to the arguments
of page_fault_handle_page_track(). The fields are initialized in the
callers, and page_fault_handle_page_track() receives a struct
kvm_page_fault instead of having to extract the arguments out of it.

Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 4326e57e 06-Aug-2021 Paolo Bonzini <pbonzini@redhat.com>

KVM: MMU: change direct_page_fault() arguments to kvm_page_fault

Add fields to struct kvm_page_fault corresponding to
the arguments of direct_page_fault(). The fields are
initialized in the callers, and direct_page_fault()
receives a struct kvm_page_fault instead of having to
extract the arguments out of it.

Also adjust FNAME(page_fault) to store the max_level in
struct kvm_page_fault, to keep it similar to the direct
map path.

Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# c501040a 06-Aug-2021 Paolo Bonzini <pbonzini@redhat.com>

KVM: MMU: change mmu->page_fault() arguments to kvm_page_fault

Pass struct kvm_page_fault to mmu->page_fault() instead of
extracting the arguments from the struct. FNAME(page_fault) can use
the precomputed bools from the error code.

Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 3e44dce4 06-Sep-2021 Lai Jiangshan <laijs@linux.alibaba.com>

KVM: X86: Move PTE present check from loop body to __shadow_walk_next()

So far, the loop bodies already ensure the PTE is present before calling
__shadow_walk_next(): Some loop bodies simply exit with a !PRESENT
directly and some other loop bodies, i.e. FNAME(fetch) and __direct_map()
do not currently guard their walks with is_shadow_present_pte, but only
because they install present non-leaf SPTEs in the loop itself.

But checking pte present in __shadow_walk_next() (which is called from
shadow_walk_okay()) is more prudent; walking past a !PRESENT SPTE
would lead to attempting to read a the next level SPTE from a garbage
iter->shadow_addr. It also allows to remove the is_shadow_present_pte()
checks from the loop bodies.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210906122547.263316-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# cc2a8e66 17-Sep-2021 Lai Jiangshan <laijs@linux.alibaba.com>

KVM: X86: Remove FNAME(update_pte)

Its solo caller is changed to use FNAME(prefetch_gpte) directly.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-9-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# c3e5e415 17-Sep-2021 Lai Jiangshan <laijs@linux.alibaba.com>

KVM: X86: Change kvm_sync_page() to return true when remote flush is needed

Currently kvm_sync_page() returns true when there is any present spte.
But the return value is ignored in the callers.

Changing kvm_sync_page() to return true when remote flush is needed and
changing mmu->sync_page() not to directly flush can combine and reduce
remote flush requests.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-7-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 65855ed8 17-Sep-2021 Lai Jiangshan <laijs@linux.alibaba.com>

KVM: X86: Synchronize the shadow pagetable before link it

If gpte is changed from non-present to present, the guest doesn't need
to flush tlb per SDM. So the host must synchronze sp before
link it. Otherwise the guest might use a wrong mapping.

For example: the guest first changes a level-1 pagetable, and then
links its parent to a new place where the original gpte is non-present.
Finally the guest can access the remapped area without flushing
the tlb. The guest's behavior should be allowed per SDM, but the host
kvm mmu makes it wrong.

Fixes: 4731d4c7a077 ("KVM: MMU: out of sync shadow core")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# f8160295 17-Sep-2021 Lai Jiangshan <laijs@linux.alibaba.com>

KVM: X86: Fix missed remote tlb flush in rmap_write_protect()

When kvm->tlbs_dirty > 0, some rmaps might have been deleted
without flushing tlb remotely after kvm_sync_page(). If @gfn
was writable before and it's rmaps was deleted in kvm_sync_page(),
and if the tlb entry is still in a remote running VCPU, the @gfn
is not safely protected.

To fix the problem, kvm_sync_page() does the remote flush when
needed to avoid the problem.

Fixes: a4ee1ca4a36e ("KVM: MMU: delay flush all tlbs on sync_page path")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 8f32d5e5 10-Aug-2021 Maxim Levitsky <mlevitsk@redhat.com>

KVM: x86/mmu: allow kvm_faultin_pfn to return page fault handling code

This will allow it to return RET_PF_EMULATE for APIC mmio
emulation.

This code is based on a patch from Sean Christopherson:
https://lkml.org/lkml/2021/7/19/2970

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210810205251.424103-7-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 33a5c000 10-Aug-2021 Maxim Levitsky <mlevitsk@redhat.com>

KVM: x86/mmu: rename try_async_pf to kvm_faultin_pfn

try_async_pf is a wrong name for this function, since this code
is used when asynchronous page fault is not enabled as well.

This code is based on a patch from Sean Christopherson:
https://lkml.org/lkml/2021/7/19/2970

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210810205251.424103-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# fc9bf2e0 23-Jun-2021 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs

Ignore "dynamic" host adjustments to the physical address mask when
generating the masks for guest PTEs, i.e. the guest PA masks. The host
physical address space and guest physical address space are two different
beasts, e.g. even though SEV's C-bit is the same bit location for both
host and guest, disabling SME in the host (which clears shadow_me_mask)
does not affect the guest PTE->GPA "translation".

For non-SEV guests, not dropping bits is the correct behavior. Assuming
KVM and userspace correctly enumerate/configure guest MAXPHYADDR, bits
that are lost as collateral damage from memory encryption are treated as
reserved bits, i.e. KVM will never get to the point where it attempts to
generate a gfn using the affected bits. And if userspace wants to create
a bogus vCPU, then userspace gets to deal with the fallout of hardware
doing odd things with bad GPAs.

For SEV guests, not dropping the C-bit is technically wrong, but it's a
moot point because KVM can't read SEV guest's page tables in any case
since they're always encrypted. Not to mention that the current KVM code
is also broken since sme_me_mask does not have to be non-zero for SEV to
be supported by KVM. The proper fix would be to teach all of KVM to
correctly handle guest private memory, but that's a task for the future.

Fixes: d0ec49d4de90 ("kvm/x86/svm: Support Secure Memory Encryption within KVM")
Cc: stable@vger.kernel.org
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210623230552.4027702-5-seanjc@google.com>
[Use a new header instead of adding header guards to paging_tmpl.h. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 9a65d0b7 22-Jun-2021 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Get CR4.SMEP from MMU, not vCPU, in shadow page fault

Use the current MMU instead of vCPU state to query CR4.SMEP when handling
a page fault. In the nested NPT case, the current CR4.SMEP reflects L2,
whereas the page fault is shadowing L1's NPT, which uses L1's hCR4.
Practically speaking, this is a nop a NPT walks are always user faults,
i.e. this code will never be reached, but fix it up for consistency.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-54-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# fdaa2935 22-Jun-2021 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Get CR0.WP from MMU, not vCPU, in shadow page fault

Use the current MMU instead of vCPU state to query CR0.WP when handling
a page fault. In the nested NPT case, the current CR0.WP reflects L2,
whereas the page fault is shadowing L1's NPT. Practically speaking, this
is a nop a NPT walks are always user faults, but fix it up for
consistency.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-53-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 7cd138db 22-Jun-2021 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Optimize and clean up so called "last nonleaf level" logic

Drop the pre-computed last_nonleaf_level, which is arguably wrong and at
best confusing. Per the comment:

Can have large pages at levels 2..last_nonleaf_level-1.

the intent of the variable would appear to be to track what levels can
_legally_ have large pages, but that intent doesn't align with reality.
The computed value will be wrong for 5-level paging, or if 1gb pages are
not supported.

The flawed code is not a problem in practice, because except for 32-bit
PSE paging, bit 7 is reserved if large pages aren't supported at the
level. Take advantage of this invariant and simply omit the level magic
math for 64-bit page tables (including PAE).

For 32-bit paging (non-PAE), the adjustments are needed purely because
bit 7 is ignored if PSE=0. Retain that logic as is, but make
is_last_gpte() unique per PTTYPE so that the PSE check is avoided for
PAE and EPT paging. In the spirit of avoiding branches, bump the "last
nonleaf level" for 32-bit PSE paging by adding the PSE bit itself.

Note, bit 7 is ignored or has other meaning in CR3/EPTP, but despite
FNAME(walk_addr_generic) briefly grabbing CR3/EPTP in "pte", they are
not PTEs and will blow up all the other gpte helpers.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-51-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# cd628f0f 22-Jun-2021 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Use MMU's role to detect EFER.NX in guest page walk

Use the NX bit from the MMU's role instead of the MMU itself so that the
redundant, dedicated "nx" flag can be dropped.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-36-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 60667724 22-Jun-2021 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Add accessors to query mmu_role bits

Add accessors via a builder macro for all mmu_role bits that track a CR0,
CR4, or EFER bit, abstracting whether the bits are in the base or the
extended role.

Future commits will switch to using mmu_role instead of vCPU state to
configure the MMU, i.e. there are about to be a large number of users.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-26-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 2640b086 22-Jun-2021 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: WARN and zap SP when sync'ing if MMU role mismatches

When synchronizing a shadow page, WARN and zap the page if its mmu role
isn't compatible with the current MMU context, where "compatible" is an
exact match sans the bits that have no meaning in the overall MMU context
or will be explicitly overwritten during the sync. Many of the helpers
used by sync_page() are specific to the current context, updating a SMM
vs. non-SMM shadow page would use the wrong memslots, updating L1 vs. L2
PTEs might work but would be extremely bizaree, and so on and so forth.

Drop the guard with respect to 8-byte vs. 4-byte PTEs in
__kvm_sync_page(), it was made useless when kvm_mmu_get_page() stopped
trying to sync shadow pages irrespective of the current MMU context.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# ef318b9e 22-Jun-2021 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Use MMU's role to detect CR4.SMEP value in nested NPT walk

Use the MMU's role to get its effective SMEP value when injecting a fault
into the guest. When walking L1's (nested) NPT while L2 is active, vCPU
state will reflect L2, whereas NPT uses the host's (L1 in this case) CR0,
CR4, EFER, etc... If L1 and L2 have different settings for SMEP and
L1 does not have EFER.NX=1, this can result in an incorrect PFEC.FETCH
when injecting #NPF.

Fixes: e57d4a356ad3 ("KVM: Add instruction fetch checking when walking guest page table")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# b1bd5cba 02-Jun-2021 Lai Jiangshan <laijs@linux.alibaba.com>

KVM: X86: MMU: Use the correct inherited permissions to get shadow page

When computing the access permissions of a shadow page, use the effective
permissions of the walk up to that point, i.e. the logic AND of its parents'
permissions. Two guest PxE entries that point at the same table gfn need to
be shadowed with different shadow pages if their parents' permissions are
different. KVM currently uses the effective permissions of the last
non-leaf entry for all non-leaf entries. Because all non-leaf SPTEs have
full ("uwx") permissions, and the effective permissions are recorded only
in role.access and merged into the leaves, this can lead to incorrect
reuse of a shadow page and eventually to a missing guest protection page
fault.

For example, here is a shared pagetable:

pgd[] pud[] pmd[] virtual address pointers
/->pmd1(u--)->pte1(uw-)->page1 <- ptr1 (u--)
/->pud1(uw-)--->pmd2(uw-)->pte2(uw-)->page2 <- ptr2 (uw-)
pgd-| (shared pmd[] as above)
\->pud2(u--)--->pmd1(u--)->pte1(uw-)->page1 <- ptr3 (u--)
\->pmd2(uw-)->pte2(uw-)->page2 <- ptr4 (u--)

pud1 and pud2 point to the same pmd table, so:
- ptr1 and ptr3 points to the same page.
- ptr2 and ptr4 points to the same page.

(pud1 and pud2 here are pud entries, while pmd1 and pmd2 here are pmd entries)

- First, the guest reads from ptr1 first and KVM prepares a shadow
page table with role.access=u--, from ptr1's pud1 and ptr1's pmd1.
"u--" comes from the effective permissions of pgd, pud1 and
pmd1, which are stored in pt->access. "u--" is used also to get
the pagetable for pud1, instead of "uw-".

- Then the guest writes to ptr2 and KVM reuses pud1 which is present.
The hypervisor set up a shadow page for ptr2 with pt->access is "uw-"
even though the pud1 pmd (because of the incorrect argument to
kvm_mmu_get_page in the previous step) has role.access="u--".

- Then the guest reads from ptr3. The hypervisor reuses pud1's
shadow pmd for pud2, because both use "u--" for their permissions.
Thus, the shadow pmd already includes entries for both pmd1 and pmd2.

- At last, the guest writes to ptr4. This causes no vmexit or pagefault,
because pud1's shadow page structures included an "uw-" page even though
its role.access was "u--".

Any kind of shared pagetable might have the similar problem when in
virtual machine without TDP enabled if the permissions are different
from different ancestors.

In order to fix the problem, we change pt->access to be an array, and
any access in it will not include permissions ANDed from child ptes.

The test code is: https://lore.kernel.org/kvm/20210603050537.19605-1-jiangshanlai@gmail.com/
Remember to test it with TDP disabled.

The problem had existed long before the commit 41074d07c78b ("KVM: MMU:
Fix inherited permissions for emulated guest pte updates"), and it
is hard to find which is the culprit. So there is no fixes tag here.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210603052455.21023-1-jiangshanlai@gmail.com>
Cc: stable@vger.kernel.org
Fixes: cea0f0e7ea54 ("[PATCH] KVM: MMU: Shadow page table caching")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 5fc3424f 25-Feb-2021 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Make Host-writable and MMU-writable bit locations dynamic

Make the location of the HOST_WRITABLE and MMU_WRITABLE configurable for
a given KVM instance. This will allow EPT to use high available bits,
which in turn will free up bit 11 for a constant MMU_PRESENT bit.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-19-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 422e2e17 25-Feb-2021 Maxim Levitsky <mlevitsk@redhat.com>

KVM: x86: mmu: initialize fault.async_page_fault in walk_addr_generic

This field was left uninitialized by a mistake.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210225154135.405125-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 4a42d848 21-Feb-2021 David Stevens <stevensd@chromium.org>

KVM: x86/mmu: Consider the hva in mmu_notifier retry

Track the range being invalidated by mmu_notifier and skip page fault
retries if the fault address is not affected by the in-progress
invalidation. Handle concurrent invalidations by finding the minimal
range which includes all ranges being invalidated. Although the combined
range may include unrelated addresses and cannot be shrunk as individual
invalidation operations complete, it is unlikely the marginal gains of
proper range tracking are worth the additional complexity.

The primary benefit of this change is the reduction in the likelihood of
extreme latency when handing a page fault due to another thread having
been preempted while modifying host virtual addresses.

Signed-off-by: David Stevens <stevensd@chromium.org>
Message-Id: <20210222024522.1751719-3-stevensd@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 5f8a7cf2 21-Feb-2021 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Skip mmu_notifier check when handling MMIO page fault

Don't retry a page fault due to an mmu_notifier invalidation when
handling a page fault for a GPA that did not resolve to a memslot, i.e.
an MMIO page fault. Invalidations from the mmu_notifier signal a change
in a host virtual address (HVA) mapping; without a memslot, there is no
HVA and thus no possibility that the invalidation is relevant to the
page fault being handled.

Note, the MMIO vs. memslot generation checks handle the case where a
pending memslot will create a memslot overlapping the faulting GPA. The
mmu_notifier checks are orthogonal to memslot updates.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210222024522.1751719-2-stevensd@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 531810ca 02-Feb-2021 Ben Gardon <bgardon@google.com>

KVM: x86/mmu: Use an rwlock for the x86 MMU

Add a read / write lock to be used in place of the MMU spinlock on x86.
The rwlock will enable the TDP MMU to handle page faults, and other
operations in parallel in future commits.

Reviewed-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>

Message-Id: <20210202185734.1680553-19-bgardon@google.com>
[Introduce virt/kvm/mmu_lock.h - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 7d945312 14-Oct-2020 Ben Gardon <bgardon@google.com>

kvm: x86/mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg

In order to avoid creating executable hugepages in the TDP MMU PF
handler, remove the dependency between disallowed_hugepage_adjust and
the shadow_walk_iterator. This will open the function up to being used
by the TDP MMU PF handler in a future patch.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20201014182700.2888246-10-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# e88b8093 23-Sep-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Track write/user faults using bools

Use bools to track write and user faults throughout the page fault paths
and down into mmu_set_spte(). The actual usage is purely boolean, but
that's not obvious without digging into all paths as the current code
uses a mix of bools (TDP and try_async_pf) and ints (shadow paging and
mmu_set_spte()).

No true functional change intended (although the pgprintk() will now
print 0/1 instead of 0/PFERR_WRITE_MASK).

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923183735.584-9-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# dcc70651 23-Sep-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Hoist ITLB multi-hit workaround check up a level

Move the "ITLB multi-hit workaround enabled" check into the callers of
disallowed_hugepage_adjust() to make it more obvious that the helper is
specific to the workaround, and to be consistent with the accounting,
i.e. account_huge_nx_page() is called if and only if the workaround is
enabled.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923183735.584-8-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 1d4a7372 23-Sep-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Rename 'hlevel' to 'level' in FNAME(fetch)

Rename 'hlevel', which presumably stands for 'host level', to simply
'level' in FNAME(fetch). The variable hasn't tracked the host level for
quite some time.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923183735.584-7-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 5bcaf3e1 23-Sep-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Account NX huge page disallowed iff huge page was requested

Condition the accounting of a disallowed huge NX page on the original
requested level of the page being greater than the current iterator
level. This does two things: accounts the page if and only if a huge
page was actually disallowed, and accounts the shadow page if and only
if it was the level at which the huge page was disallowed. For the
latter case, the previous logic would account all shadow pages used to
create the translation for the forced small page, e.g. even PML4, which
can't be a huge page on current hardware, would be accounted as having
been a disallowed huge page when using 5-level EPT.

The overzealous accounting is purely a performance issue, i.e. the
recovery thread will spuriously zap shadow pages, but otherwise the bad
behavior is harmless.

Cc: Junaid Shahid <junaids@google.com>
Fixes: b8e8c8303ff28 ("kvm: mmu: ITLB_MULTIHIT mitigation")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923183735.584-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 3cf06612 23-Sep-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Capture requested page level before NX huge page workaround

Apply the "huge page disallowed" adjustment of the max level only after
capturing the original requested level. The requested level will be
used in a future patch to skip adding pages to the list of disallowed
huge pages if a huge page wasn't possible anyways, e.g. if the page
isn't mapped as a huge page in the host.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923183735.584-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 6c2fd34f 23-Sep-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Move "huge page disallowed" calculation into mapping helpers

Calculate huge_page_disallowed in __direct_map() and FNAME(fetch) in
preparation for reworking the calculation so that it preserves the
requested map level and eventually to avoid flagging a shadow page as
being disallowed for being used as a large/huge page when it couldn't
have been huge in the first place, e.g. because the backing page in the
host is not large.

Pass the error code into the helpers and use it to recalcuate exec and
write_fault instead adding yet more booleans to the parameters.

Opportunistically use huge_page_disallowed instead of lpage_disallowed
to match the nomenclature used within the mapping helpers (though even
they have existing inconsistencies).

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923183735.584-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 12703759 23-Sep-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Bail early from final #PF handling on spurious faults

Detect spurious page faults, e.g. page faults that occur when multiple
vCPUs simultaneously access a not-present page, and skip the SPTE write,
prefetch, and stats update for spurious faults.

Note, the performance benefits of skipping the write and prefetch are
likely negligible, and the false positive stats adjustment is probably
lost in the noise. The primary motivation is to play nice with TDX's
SEPT in the long term. SEAMCALLs (to program SEPT entries) are quite
costly, e.g. thousands of cycles, and a spurious SEPT update will result
in a SEAMCALL error (which KVM will ideally treat as fatal).

Reported-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923220425.18402-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 2de4085c 23-Sep-2020 Ben Gardon <bgardon@google.com>

KVM: x86/MMU: Recursively zap nested TDP SPs when zapping last/only parent

Recursively zap all to-be-orphaned children, unsynced or otherwise, when
zapping a shadow page for a nested TDP MMU. KVM currently only zaps the
unsynced child pages, but not the synced ones. This can create problems
over time when running many nested guests because it leaves unlinked
pages which will not be freed until the page quota is hit. With the
default page quota of 20 shadow pages per 1000 guest pages, this looks
like a memory leak and can degrade MMU performance.

In a recent benchmark, substantial performance degradation was observed:
An L1 guest was booted with 64G memory.
2G nested Windows guests were booted, 10 at a time for 20
iterations. (200 total boots)
Windows was used in this benchmark because they touch all of their
memory on startup.
By the end of the benchmark, the nested guests were taking ~10% longer
to boot. With this patch there is no degradation in boot time.
Without this patch the benchmark ends with hundreds of thousands of
stale EPT02 pages cluttering up rmaps and the page hash map. As a
result, VM shutdown is also much slower: deleting memslot 0 was
observed to take over a minute. With this patch it takes just a
few miliseconds.

Cc: Peter Shier <pshier@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923221406.16297-3-sean.j.christopherson@intel.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# ace569e0 23-Sep-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Move flush logic from mmu_page_zap_pte() to FNAME(invlpg)

Move the logic that controls whether or not FNAME(invlpg) needs to flush
fully into FNAME(invlpg) so that mmu_page_zap_pte() doesn't return a
value. This allows a future patch to redefine the return semantics for
mmu_page_zap_pte() so that it can recursively zap orphaned child shadow
pages for nested TDP MMUs.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923221406.16297-2-sean.j.christopherson@intel.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 3f649ab7 03-Jun-2020 Kees Cook <keescook@chromium.org>

treewide: Remove uninitialized_var() usage

Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>


# 378f5cd6 02-Jul-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Skip filling the gfn cache for guaranteed direct MMU topups

Don't bother filling the gfn array cache when the caller is a fully
direct MMU, i.e. won't need a gfn array for shadow pages.

Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200703023545.8771-13-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# f3747a5a 02-Jul-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Topup memory caches after walking GVA->GPA

Topup memory caches after walking the GVA->GPA translation during a
shadow page fault, there is no need to ensure the caches are full when
walking the GVA. As of commit f5a1e9f89504f ("KVM: MMU: remove call
to kvm_mmu_pte_write from walk_addr"), the FNAME(walk_addr) flow no
longer add rmaps via kvm_mmu_pte_write().

This avoids allocating memory in the case that the GVA is unmapped in
the guest, and also provides a paper trail of why/when the memory caches
need to be filled.

Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200703023545.8771-8-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 57354682 22-Jun-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Add sptep_to_sp() helper to wrap shadow page lookup

Introduce sptep_to_sp() to reduce the boilerplate code needed to get the
shadow page associated with a spte pointer, and to improve readability
as it's not immediately obvious that "page_header" is a KVM-specific
accessor for retrieving a shadow page.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200622202034.15093-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 7bd7ded6 23-Jun-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Exit to userspace on make_mmu_pages_available() error

Propagate any error returned by make_mmu_pages_available() out to
userspace instead of resuming the guest if the error occurs while
handling a page fault. Now that zapping the oldest MMU pages skips
active roots, i.e. fails if and only if there are no zappable pages,
there is no chance for a false positive, i.e. no chance of returning a
spurious error to userspace.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200623193542.7554-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 02f5fb2e 22-Jun-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Make .write_log_dirty a nested operation

Move .write_log_dirty() into kvm_x86_nested_ops to help differentiate it
from the non-nested dirty log hooks. And because it's a nested-only
operation.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200622215832.22090-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# f25a9dec 22-Jun-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Drop kvm_arch_write_log_dirty() wrapper

Drop kvm_arch_write_log_dirty() in favor of invoking .write_log_dirty()
directly from FNAME(update_accessed_dirty_bits). "kvm_arch" is usually
used for x86 functions that are invoked from generic KVM, and implies
that there are external callers, neither of which is true.

Remove the check for a non-NULL kvm_x86_ops hook as the call is wrapped
in PTTYPE_EPT and is unconditionally set by VMX.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200622215832.22090-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 2dbebf7a 22-Jun-2020 Sean Christopherson <seanjc@google.com>

KVM: nVMX: Plumb L2 GPA through to PML emulation

Explicitly pass the L2 GPA to kvm_arch_write_log_dirty(), which for all
intents and purposes is vmx_write_pml_buffer(), instead of having the
latter pull the GPA from vmcs.GUEST_PHYSICAL_ADDRESS. If the dirty bit
update is the result of KVM emulation (rare for L2), then the GPA in the
VMCS may be stale and/or hold a completely unrelated GPA.

Fixes: c5f983f6e8455 ("nVMX: Implement emulated Page Modification Logging")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200622215832.22090-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 312d16c7 22-Jun-2020 Vitaly Kuznetsov <vkuznets@redhat.com>

KVM: x86/mmu: Avoid mixing gpa_t with gfn_t in walk_addr_generic()

translate_gpa() returns a GPA, assigning it to 'real_gfn' seems obviously
wrong. There is no real issue because both 'gpa_t' and 'gfn_t' are u64 and
we don't use the value in 'real_gfn' as a GFN, we do

real_gfn = gpa_to_gfn(real_gfn);

instead. 'If you see a "buffalo" sign on an elephant's cage, do not trust
your eyes', but let's fix it for good.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200622151435.752560-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 89154dd5 08-Jun-2020 Michel Lespinasse <walken@google.com>

mmap locking API: convert mmap_sem call sites missed by coccinelle

Convert the last few remaining mmap_sem rwsem calls to use the new mmap
locking API. These were missed by coccinelle for some reason (I think
coccinelle does not support some of the preprocessor constructs in these
files ?)

[akpm@linux-foundation.org: convert linux-next leftovers]
[akpm@linux-foundation.org: more linux-next leftovers]
[akpm@linux-foundation.org: more linux-next leftovers]

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-6-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3bae0459 27-Apr-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Drop KVM's hugepage enums in favor of the kernel's enums

Replace KVM's PT_PAGE_TABLE_LEVEL, PT_DIRECTORY_LEVEL and PT_PDPE_LEVEL
with the kernel's PG_LEVEL_4K, PG_LEVEL_2M and PG_LEVEL_1G. KVM's
enums are borderline impossible to remember and result in code that is
visually difficult to audit, e.g.

if (!enable_ept)
ept_lpage_level = 0;
else if (cpu_has_vmx_ept_1g_page())
ept_lpage_level = PT_PDPE_LEVEL;
else if (cpu_has_vmx_ept_2m_page())
ept_lpage_level = PT_DIRECTORY_LEVEL;
else
ept_lpage_level = PT_PAGE_TABLE_LEVEL;

versus

if (!enable_ept)
ept_lpage_level = 0;
else if (cpu_has_vmx_ept_1g_page())
ept_lpage_level = PG_LEVEL_1G;
else if (cpu_has_vmx_ept_2m_page())
ept_lpage_level = PG_LEVEL_2M;
else
ept_lpage_level = PG_LEVEL_4K;

No functional change intended.

Suggested-by: Barret Rhoden <brho@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200428005422.4235-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# b2f432f8 27-Apr-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Tweak PSE hugepage handling to avoid 2M vs 4M conundrum

Change the PSE hugepage handling in walk_addr_generic() to fire on any
page level greater than PT_PAGE_TABLE_LEVEL, a.k.a. PG_LEVEL_4K. PSE
paging only has two levels, so "== 2" and "> 1" are functionally the
same, i.e. this is a nop.

A future patch will drop KVM's PT_*_LEVEL enums in favor of the kernel's
PG_LEVEL_* enums, at which point "walker->level == PG_LEVEL_2M" is
semantically incorrect (though still functionally ok).

No functional change intended.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200428005422.4235-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 0cd665bd 24-Mar-2020 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: cleanup kvm_inject_emulated_page_fault

To reconstruct the kvm_mmu to be used for page fault injection, we
can simply use fault->nested_page_fault. This matches how
fault->nested_page_fault is assigned in the first place by
FNAME(walk_addr_generic).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# d8dd54e0 02-Mar-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Rename kvm_mmu->get_cr3() to ->get_guest_pgd()

Rename kvm_mmu->get_cr3() to call out that it is retrieving a guest
value, as opposed to kvm_mmu->set_cr3(), which sets a host value, and to
note that it will return something other than CR3 when nested EPT is in
use. Hopefully the new name will also make it more obvious that L1's
nested_cr3 is returned in SVM's nested NPT case.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# bb1fcc70 02-Mar-2020 Sean Christopherson <seanjc@google.com>

KVM: nVMX: Allow L1 to use 5-level page walks for nested EPT

Add support for 5-level nested EPT, and advertise said support in the
EPT capabilities MSR. KVM's MMU can already handle 5-level legacy page
tables, there's no reason to force an L1 VMM to use shadow paging if it
wants to employ 5-level page tables.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# a4814443 15-Feb-2020 Al Viro <viro@zeniv.linux.org.uk>

x86 kvm page table walks: switch to explicit __get_user()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# f6ab0107 07-Feb-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Fix struct guest_walker arrays for 5-level paging

Define PT_MAX_FULL_LEVELS as PT64_ROOT_MAX_LEVEL, i.e. 5, to fix shadow
paging for 5-level guest page tables. PT_MAX_FULL_LEVELS is used to
size the arrays that track guest pages table information, i.e. using a
"max levels" of 4 causes KVM to access garbage beyond the end of an
array when querying state for level 5 entries. E.g. FNAME(gpte_changed)
will read garbage and most likely return %true for a level 5 entry,
soft-hanging the guest because FNAME(fetch) will restart the guest
instead of creating SPTEs because it thinks the guest PTE has changed.

Note, KVM doesn't yet support 5-level nested EPT, so PT_MAX_FULL_LEVELS
gets to stay "4" for the PTTYPE_EPT case.

Fixes: 855feb673640 ("KVM: MMU: Add 5 level EPT & Shadow page table support.")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 293e306e 08-Jan-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Fold max_mapping_level() into kvm_mmu_hugepage_adjust()

Fold max_mapping_level() into kvm_mmu_hugepage_adjust() now that HugeTLB
mappings are handled in kvm_mmu_hugepage_adjust(), i.e. there isn't a
need to pre-calculate the max mapping level. Co-locating all hugepage
checks eliminates a memslot lookup, at the cost of performing the
__mmu_gfn_lpage_is_disallowed() checks while holding mmu_lock.

The latency of lpage_is_disallowed() is likely negligible relative to
the rest of the code run while holding mmu_lock, and can be offset to
some extent by eliminating the mmu_gfn_lpage_is_disallowed() check in
set_spte() in a future patch. Eliminating the check in set_spte() is
made possible by performing the initial lpage_is_disallowed() checks
while holding mmu_lock.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 09c4453e 08-Jan-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Remove obsolete gfn restoration in FNAME(fetch)

Remove logic to retrieve the original gfn now that HugeTLB mappings are
are identified in FNAME(fetch), i.e. FNAME(page_fault) no longer adjusts
the level or gfn.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 83f06fa7 08-Jan-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Rely on host page tables to find HugeTLB mappings

Remove KVM's HugeTLB specific logic and instead rely on walking the host
page tables (already done for THP) to identify HugeTLB mappings.
Eliminating the HugeTLB-only logic avoids taking mmap_sem and calling
find_vma() for all hugepage compatible page faults, and simplifies KVM's
page fault code by consolidating all hugepage adjustments into a common
helper.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 17eff019 08-Jan-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Refactor THP adjust to prep for changing query

Refactor transparent_hugepage_adjust() in preparation for walking the
host page tables to identify hugepage mappings, initially for THP pages,
and eventualy for HugeTLB and DAX-backed pages as well. The latter
cases support 1gb pages, i.e. the adjustment logic needs access to the
max allowed level.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# b5c3c1b3 09-Jan-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Micro-optimize nEPT's bad memptype/XWR checks

Rework the handling of nEPT's bad memtype/XWR checks to micro-optimize
the checks as much as possible. Move the check to a separate helper,
__is_bad_mt_xwr(), which allows the guest_rsvd_check usage in
paging_tmpl.h to omit the check entirely for paging32/64 (bad_mt_xwr is
always zero for non-nEPT) while retaining the bitwise-OR of the current
code for the shadow_zero_check in walk_shadow_page_get_mmio_spte().

Add a comment for the bitwise-OR usage in the mmio spte walk to avoid
future attempts to "fix" the code, which is what prompted this
optimization in the first place[*].

Opportunistically remove the superfluous '!= 0' and parantheses, and
use BIT_ULL() instead of open coding its equivalent.

The net effect is that code generation is largely unchanged for
walk_shadow_page_get_mmio_spte(), marginally better for
ept_prefetch_invalid_gpte(), and significantly improved for
paging32/64_prefetch_invalid_gpte().

Note, walk_shadow_page_get_mmio_spte() can't use a templated version of
the memtype/XRW as it works on the host's shadow PTEs, e.g. checks that
KVM hasn't borked its EPT tables. Even if it could be templated, the
benefits of having a single implementation far outweight the few uops
that would be saved for NPT or non-TDP paging, e.g. most compilers
inline it all the way to up kvm_mmu_page_fault().

[*] https://lkml.kernel.org/r/20200108001859.25254-1-sean.j.christopherson@intel.com

Cc: Jim Mattson <jmattson@google.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# f8052a05 09-Jan-2020 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Reorder the reserved bit check in prefetch_invalid_gpte()

Move the !PRESENT and !ACCESSED checks in FNAME(prefetch_invalid_gpte)
above the call to is_rsvd_bits_set(). For a well behaved guest, the
!PRESENT and !ACCESSED are far more likely to evaluate true than the
reserved bit checks, and they do not require additional memory accesses.

Before:
Dump of assembler code for function paging32_prefetch_invalid_gpte:
0x0000000000044240 <+0>: callq 0x44245 <paging32_prefetch_invalid_gpte+5>
0x0000000000044245 <+5>: mov %rcx,%rax
0x0000000000044248 <+8>: shr $0x7,%rax
0x000000000004424c <+12>: and $0x1,%eax
0x000000000004424f <+15>: lea 0x0(,%rax,4),%r8
0x0000000000044257 <+23>: add %r8,%rax
0x000000000004425a <+26>: mov %rcx,%r8
0x000000000004425d <+29>: and 0x120(%rsi,%rax,8),%r8
0x0000000000044265 <+37>: mov 0x170(%rsi),%rax
0x000000000004426c <+44>: shr %cl,%rax
0x000000000004426f <+47>: and $0x1,%eax
0x0000000000044272 <+50>: or %rax,%r8
0x0000000000044275 <+53>: jne 0x4427c <paging32_prefetch_invalid_gpte+60>
0x0000000000044277 <+55>: test $0x1,%cl
0x000000000004427a <+58>: jne 0x4428a <paging32_prefetch_invalid_gpte+74>
0x000000000004427c <+60>: mov %rdx,%rsi
0x000000000004427f <+63>: callq 0x44080 <drop_spte>
0x0000000000044284 <+68>: mov $0x1,%eax
0x0000000000044289 <+73>: retq
0x000000000004428a <+74>: xor %eax,%eax
0x000000000004428c <+76>: and $0x20,%ecx
0x000000000004428f <+79>: jne 0x44289 <paging32_prefetch_invalid_gpte+73>
0x0000000000044291 <+81>: mov %rdx,%rsi
0x0000000000044294 <+84>: callq 0x44080 <drop_spte>
0x0000000000044299 <+89>: mov $0x1,%eax
0x000000000004429e <+94>: jmp 0x44289 <paging32_prefetch_invalid_gpte+73>
End of assembler dump.

After:
Dump of assembler code for function paging32_prefetch_invalid_gpte:
0x0000000000044240 <+0>: callq 0x44245 <paging32_prefetch_invalid_gpte+5>
0x0000000000044245 <+5>: test $0x1,%cl
0x0000000000044248 <+8>: je 0x4424f <paging32_prefetch_invalid_gpte+15>
0x000000000004424a <+10>: test $0x20,%cl
0x000000000004424d <+13>: jne 0x4425d <paging32_prefetch_invalid_gpte+29>
0x000000000004424f <+15>: mov %rdx,%rsi
0x0000000000044252 <+18>: callq 0x44080 <drop_spte>
0x0000000000044257 <+23>: mov $0x1,%eax
0x000000000004425c <+28>: retq
0x000000000004425d <+29>: mov %rcx,%rax
0x0000000000044260 <+32>: mov (%rsi),%rsi
0x0000000000044263 <+35>: shr $0x7,%rax
0x0000000000044267 <+39>: and $0x1,%eax
0x000000000004426a <+42>: lea 0x0(,%rax,4),%r8
0x0000000000044272 <+50>: add %r8,%rax
0x0000000000044275 <+53>: mov %rcx,%r8
0x0000000000044278 <+56>: and 0x120(%rsi,%rax,8),%r8
0x0000000000044280 <+64>: mov 0x170(%rsi),%rax
0x0000000000044287 <+71>: shr %cl,%rax
0x000000000004428a <+74>: and $0x1,%eax
0x000000000004428d <+77>: mov %rax,%rcx
0x0000000000044290 <+80>: xor %eax,%eax
0x0000000000044292 <+82>: or %rcx,%r8
0x0000000000044295 <+85>: je 0x4425c <paging32_prefetch_invalid_gpte+28>
0x0000000000044297 <+87>: mov %rdx,%rsi
0x000000000004429a <+90>: callq 0x44080 <drop_spte>
0x000000000004429f <+95>: mov $0x1,%eax
0x00000000000442a4 <+100>: jmp 0x4425c <paging32_prefetch_invalid_gpte+28>
End of assembler dump.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 0c7a98e3 06-Dec-2019 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: WARN on an invalid root_hpa

WARN on the existing invalid root_hpa checks in __direct_map() and
FNAME(fetch). The "legitimate" path that invalidated root_hpa in the
middle of a page fault is long since gone, i.e. it should no longer be
impossible to invalidate in the middle of a page fault[*].

The root_hpa checks were added by two related commits

989c6b34f6a94 ("KVM: MMU: handle invalid root_hpa at __direct_map")
37f6a4e237303 ("KVM: x86: handle invalid root_hpa everywhere")

to fix a bug where nested_vmx_vmexit() could be called *in the middle*
of a page fault. At the time, vmx_interrupt_allowed(), which was and
still is used by kvm_can_do_async_pf() via ->interrupt_allowed(),
directly invoked nested_vmx_vmexit() to switch from L2 to L1 to emulate
a VM-Exit on a pending interrupt. Emulating the nested VM-Exit resulted
in root_hpa being invalidated by kvm_mmu_reset_context() without
explicitly terminating the page fault.

Now that root_hpa is checked for validity by kvm_mmu_page_fault(), WARN
on an invalid root_hpa to detect any flows that reset the MMU while
handling a page fault. The broken vmx_interrupt_allowed() behavior has
long since been fixed and resetting the MMU during a page fault should
not be considered legal behavior.

[*] It's actually technically possible in FNAME(page_fault)() because it
calls inject_page_fault() when the guest translation is invalid, but
in that case the page fault handling is immediately terminated.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 4cd071d1 06-Dec-2019 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Move calls to thp_adjust() down a level

Move the calls to thp_adjust() down a level from the page fault handlers
to the map/fetch helpers and remove the page count shuffling done in
thp_adjust().

Despite holding a reference to the underlying page while processing a
page fault, the page fault flows don't actually rely on holding a
reference to the page when thp_adjust() is called. At that point, the
fault handlers hold mmu_lock, which prevents mmu_notifier from completing
any invalidations, and have verified no invalidations from mmu_notifier
have occurred since the page reference was acquired (which is done prior
to taking mmu_lock).

The kvm_release_pfn_clean()/kvm_get_pfn() dance in thp_adjust() is a
quirk that is necessitated because thp_adjust() modifies the pfn that is
consumed by its caller. Because the page fault handlers call
kvm_release_pfn_clean() on said pfn, thp_adjust() needs to transfer the
reference to the correct pfn purely for correctness when the pfn is
released.

Calling thp_adjust() from __direct_map() and FNAME(fetch) means the pfn
adjustment doesn't change the pfn as seen by the page fault handlers,
i.e. the pfn released by the page fault handlers is the same pfn that
was returned by gfn_to_pfn().

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# cbe1e6f0 06-Dec-2019 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Incorporate guest's page level into max level for shadow MMU

Restrict the max level for a shadow page based on the guest's level
instead of capping the level after the fact for host-mapped huge pages,
e.g. hugetlbfs pages. Explicitly capping the max level using the guest
mapping level also eliminates FNAME(page_fault)'s subtle dependency on
THP only supporting 2mb pages.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 39ca1ecb 06-Dec-2019 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Refactor handling of forced 4k pages in page faults

Refactor the page fault handlers and mapping_level() to track the max
allowed page level instead of only tracking if a 4k page is mandatory
due to one restriction or another. This paves the way for cleanly
consolidating tdp_page_fault() and nonpaging_page_fault(), and for
eliminating a redundant check on mmu_gfn_lpage_is_disallowed().

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 736c291c 06-Dec-2019 Sean Christopherson <seanjc@google.com>

KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM

Convert a plethora of parameters and variables in the MMU and page fault
flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM.

Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical
addresses. When TDP is enabled, the fault address is a guest physical
address and thus can be a 64-bit value, even when both KVM and its guest
are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a
64-bit field, not a natural width field.

Using a gva_t for the fault address means KVM will incorrectly drop the
upper 32-bits of the GPA. Ditto for gva_to_gpa() when it is used to
translate L2 GPAs to L1 GPAs.

Opportunistically rename variables and parameters to better reflect the
dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain
"addr" instead of "vaddr" when the address may be either a GVA or an L2
GPA. Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid
a confusing "gpa_t gva" declaration; this also sets the stage for a
future patch to combing nonpaging_page_fault() and tdp_page_fault() with
minimal churn.

Sprinkle in a few comments to document flows where an address is known
to be a GVA and thus can be safely truncated to a 32-bit value. Add
WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help
document such cases and detect bugs.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# c50d8ae3 21-Nov-2019 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: create mmu/ subdirectory

Preparatory work for shattering mmu.c into multiple files. Besides making it easier
to follow, this will also make it possible to write unit tests for various parts.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>