History log of /linux-master/arch/x86/kernel/head64.c
Revision Date Author Comments
# 9843231c 22-Mar-2024 Tom Lendacky <thomas.lendacky@amd.com>

x86/boot/64: Move 5-level paging global variable assignments back

Commit 63bed9660420 ("x86/startup_64: Defer assignment of 5-level paging
global variables") moved assignment of 5-level global variables to later
in the boot in order to avoid having to use RIP relative addressing in
order to set them. However, when running with 5-level paging and SME
active (mem_encrypt=on), the variables are needed as part of the page
table setup needed to encrypt the kernel (using pgd_none(), p4d_offset(),
etc.). Since the variables haven't been set, the page table manipulation
is done as if 4-level paging is active, causing the system to crash on
boot.

While only a subset of the assignments that were moved need to be set
early, move all of the assignments back into check_la57_support() so that
these assignments aren't spread between two locations. Instead of just
reverting the fix, this uses the new RIP_REL_REF() macro when assigning
the variables.

Fixes: 63bed9660420 ("x86/startup_64: Defer assignment of 5-level paging global variables")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/2ca419f4d0de719926fd82353f6751f717590a86.1711122067.git.thomas.lendacky@amd.com


# 4d0d7e78 22-Mar-2024 Tom Lendacky <thomas.lendacky@amd.com>

x86/boot/64: Apply encryption mask to 5-level pagetable update

When running with 5-level page tables, the kernel mapping PGD entry is
updated to point to the P4D table. The assignment uses _PAGE_TABLE_NOENC,
which, when SME is active (mem_encrypt=on), results in a page table
entry without the encryption mask set, causing the system to crash on
boot.

Change the assignment to use _PAGE_TABLE instead of _PAGE_TABLE_NOENC so
that the encryption mask is set for the PGD entry.

Fixes: 533568e06b15 ("x86/boot/64: Use RIP_REL_REF() to access early_top_pgt[]")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/8f20345cda7dbba2cf748b286e1bc00816fe649a.1711122067.git.thomas.lendacky@amd.com


# 63bed966 27-Feb-2024 Ard Biesheuvel <ardb@kernel.org>

x86/startup_64: Defer assignment of 5-level paging global variables

Assigning the 5-level paging related global variables from the earliest
C code using explicit references that use the 1:1 translation of memory
is unnecessary, as the startup code itself does not rely on them to
create the initial page tables, and this is all it should be doing. So
defer these assignments to the primary C entry code that executes via
the ordinary kernel virtual mapping.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240227151907.387873-13-ardb+git@google.com


# 11e36b0f 26-Feb-2024 Brian Gerst <brgerst@gmail.com>

x86/boot/64: Load the final kernel GDT during early boot directly, remove startup_gdt[]

Instead of loading a duplicate GDT just for early boot, load the kernel
GDT from its physical address.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20240226220544.70769-1-brgerst@gmail.com


# 533568e0 20-Feb-2024 Ard Biesheuvel <ardb@kernel.org>

x86/boot/64: Use RIP_REL_REF() to access early_top_pgt[]

early_top_pgt[] is assigned from code that executes from a 1:1 mapping
so it cannot use a plain access from C. Replace the use of
fixup_pointer() with RIP_REL_REF(), which is better and simpler.

For legibility and to align with the code that populates the lower page
table levels, statically initialize the root level page table with an
entry pointing to level3_kernel_pgt[], and overwrite it when needed to
enable 5-level paging.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240221113506.2565718-24-ardb+git@google.com


# eb54c2ae 20-Feb-2024 Ard Biesheuvel <ardb@kernel.org>

x86/boot/64: Use RIP_REL_REF() to access early page tables

The early statically allocated page tables are populated from code that
executes from a 1:1 mapping so it cannot use plain accesses from C.
Replace the use of fixup_pointer() with RIP_REL_REF(), which is better
and simpler.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240221113506.2565718-23-ardb+git@google.com


# 4f8b6cf2 20-Feb-2024 Ard Biesheuvel <ardb@kernel.org>

x86/boot/64: Use RIP_REL_REF() to access '__supported_pte_mask'

'__supported_pte_mask' is accessed from code that executes from a 1:1
mapping so it cannot use a plain access from C. Replace the use of
fixup_pointer() with RIP_REL_REF(), which is better and simpler.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240221113506.2565718-22-ardb+git@google.com


# b0fe5fb6 20-Feb-2024 Ard Biesheuvel <ardb@kernel.org>

x86/boot/64: Use RIP_REL_REF() to access early_dynamic_pgts[]

early_dynamic_pgts[] and next_early_pgt are accessed from code that
executes from a 1:1 mapping so it cannot use a plain access from C.
Replace the use of fixup_pointer() with RIP_REL_REF(), which is better
and simpler.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240221113506.2565718-21-ardb+git@google.com


# d9ec1158 20-Feb-2024 Ard Biesheuvel <ardb@kernel.org>

x86/boot/64: Use RIP_REL_REF() to assign 'phys_base'

'phys_base' is assigned from code that executes from a 1:1 mapping so it
cannot use a plain access from C. Replace the use of fixup_pointer()
with RIP_REL_REF(), which is better and simpler.

While at it, move the assignment to before the addition of the SME mask
so there is no need to subtract it again, and drop the unnecessary
addition ('phys_base' is statically initialized to 0x0)

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240221113506.2565718-20-ardb+git@google.com


# 5da79367 20-Feb-2024 Ard Biesheuvel <ardb@kernel.org>

x86/boot/64: Simplify global variable accesses in GDT/IDT programming

There are two code paths in the startup code to program an IDT: one that
runs from the 1:1 mapping and one that runs from the virtual kernel
mapping. Currently, these are strictly separate because fixup_pointer()
is used on the 1:1 path, which will produce the wrong value when used
while executing from the virtual kernel mapping.

Switch to RIP_REL_REF() so that the two code paths can be merged. Also,
move the GDT and IDT descriptors to the stack so that they can be
referenced directly, rather than via RIP_REL_REF().

Rename startup_64_setup_env() to startup_64_setup_gdt_idt() while at it,
to make the call from assembler self-documenting.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240221113506.2565718-19-ardb+git@google.com


# 3b184b71 19-Dec-2023 Vegard Nossum <vegard.nossum@oracle.com>

x86/asm: Always set A (accessed) flag in GDT descriptors

We have no known use for having the CPU track whether GDT descriptors
have been accessed or not.

Simplify the code by adding the flag to the common flags and removing
it everywhere else.

Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20231219151200.2878271-5-vegard.nossum@oracle.com


# 1445f6e1 19-Dec-2023 Vegard Nossum <vegard.nossum@oracle.com>

x86/asm: Replace magic numbers in GDT descriptors, script-generated change

Actually replace the numeric values by the new symbolic values.

I used this to find all the existing users of the GDT_ENTRY*() macros:

$ git grep -P 'GDT_ENTRY(_INIT)?\('

Some of the lines will exceed 80 characters, but some of them will be
shorter again in the next couple of patches.

Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20231219151200.2878271-4-vegard.nossum@oracle.com


# d2a285d6 17-Oct-2023 Hou Wenlong <houwenlong.hwl@antgroup.com>

x86/head/64: Move the __head definition to <asm/init.h>

Move the __head section definition to a header to widen its use.

An upcoming patch will mark the code as __head in mem_encrypt_identity.c too.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/0583f57977be184689c373fe540cbd7d85ca2047.1697525407.git.houwenlong.hwl@antgroup.com


# 7f6874ed 11-Jul-2023 Hou Wenlong <houwenlong.hwl@antgroup.com>

x86/head/64: Add missing __head annotation to startup_64_load_idt()

This function is currently only used in the head code and is only called
from startup_64_setup_env(). Although it would be inlined by the
compiler, it would be better to mark it as __head too in case it doesn't.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/efcc5b5e18af880e415d884e072bf651c1fa7c34.1689130310.git.houwenlong.hwl@antgroup.com


# dc628300 11-Jul-2023 Hou Wenlong <houwenlong.hwl@antgroup.com>

x86/head/64: Mark 'startup_gdt[]' and 'startup_gdt_descr' as __initdata

As 'startup_gdt[]' and 'startup_gdt_descr' are only used in booting,
mark them as __initdata to allow them to be freed after boot.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/c85903a7cfad37d14a7e5a4df9fc7119a3669fb3.1689130310.git.houwenlong.hwl@antgroup.com


# 9f76d606 03-Aug-2023 Wang Jinchao <wangjinchao@xfusion.com>

x86/boot: Harmonize the style of array-type parameter for fixup_pointer() calls

The usage of '&' before the array parameter is redundant because '&array'
is equivalent to 'array'. Therefore, there is no need to include '&'
before the array parameter. In fact, using '&' can cause more confusion,
especially for individuals who are not familiar with the address-of
operation for arrays. They might mistakenly believe that one is different
from the other and spend additional time realizing that they are actually
the same.

Harmonizing the style by removing the unnecessary '&' would save time for
those individuals.

Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/ZMt24BGEX9IhPSY6@fedora


# 001470fe 07-Aug-2023 Yuntao Wang <ytcoode@gmail.com>

x86/boot: Fix incorrect startup_gdt_descr.size

Since the size value is added to the base address to yield the last valid
byte address of the GDT, the current size value of startup_gdt_descr is
incorrect (too large by one), fix it.

[ mingo: This probably never mattered, because startup_gdt[] is only used
in a very controlled fashion - but make it consistent nevertheless. ]

Fixes: 866b556efa12 ("x86/head/64: Install startup GDT")
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20230807084547.217390-1-ytcoode@gmail.com


# 4208d2d7 12-Apr-2023 Josh Poimboeuf <jpoimboe@kernel.org>

x86/head: Mark *_start_kernel() __noreturn

Now that start_kernel() is __noreturn, mark its chain of callers
__noreturn.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/c2525f96b88be98ee027ee0291d58003036d4120.1681342859.git.jpoimboe@kernel.org


# 82328227 16-May-2022 Pasha Tatashin <pasha.tatashin@soleen.com>

x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros

Other architectures and the common mm/ use P*D_MASK, and P*D_SIZE.
Remove the duplicated P*D_PAGE_MASK and P*D_PAGE_SIZE which are only
used in x86/*.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/r/20220516185202.604654-1-tatashin@google.com


# 38fa5479 30-Jun-2022 Juergen Gross <jgross@suse.com>

x86: Clear .brk area at early boot

The .brk section has the same properties as .bss: it is an alloc-only
section and should be cleared before being used.

Not doing so is especially a problem for Xen PV guests, as the
hypervisor will validate page tables (check for writable page tables
and hypervisor private bits) before accepting them to be used.

Make sure .brk is initially zero by letting clear_bss() clear the brk
area, too.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220630071441.28576-3-jgross@suse.com


# 96e8fc58 30-Jun-2022 Juergen Gross <jgross@suse.com>

x86/xen: Use clear_bss() for Xen PV guests

Instead of clearing the bss area in assembly code, use the clear_bss()
function.

This requires to pass the start_info address as parameter to
xen_start_kernel() in order to avoid the xen_start_info being zeroed
again.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20220630071441.28576-2-jgross@suse.com


# 32e72854 05-Apr-2022 Andi Kleen <ak@linux.intel.com>

x86/tdx: Port I/O: Add early boot support

TDX guests cannot do port I/O directly. The TDX module triggers a #VE
exception to let the guest kernel emulate port I/O by converting them
into TDCALLs to call the host.

But before IDT handlers are set up, port I/O cannot be emulated using
normal kernel #VE handlers. To support the #VE-based emulation during
this boot window, add a minimal early #VE handler support in early
exception handlers. This is similar to what AMD SEV does. This is
mainly to support earlyprintk's serial driver, as well as potentially
the VGA driver.

The early handler only supports I/O-related #VE exceptions. Unhandled or
failed exceptions will be handled via early_fixup_exceptions() (like
normal exception failures). At runtime I/O-related #VE exceptions (along
with other types) handled by virt_exception_kernel().

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-19-kirill.shutemov@linux.intel.com


# 59bd54a8 05-Apr-2022 Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

x86/tdx: Detect running as a TDX guest in early boot

In preparation of extending cc_platform_has() API to support TDX guest,
use CPUID instruction to detect support for TDX guests in the early
boot code (via tdx_early_init()). Since copy_bootdata() is the first
user of cc_platform_has() API, detect the TDX guest status before it.

Define a synthetic feature flag (X86_FEATURE_TDX_GUEST) and set this
bit in a valid TDX guest platform.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-2-kirill.shutemov@linux.intel.com


# 469693d8 08-Feb-2022 Michael Roth <michael.roth@amd.com>

x86/head/64: Re-enable stack protection

Due to

103a4908ad4d ("x86/head/64: Disable stack protection for head$(BITS).o")

kernel/head{32,64}.c are compiled with -fno-stack-protector to allow
a call to set_bringup_idt_handler(), which would otherwise have stack
protection enabled with CONFIG_STACKPROTECTOR_STRONG.

While sufficient for that case, there may still be issues with calls to
any external functions that were compiled with stack protection enabled
that in-turn make stack-protected calls, or if the exception handlers
set up by set_bringup_idt_handler() make calls to stack-protected
functions.

Subsequent patches for SEV-SNP CPUID validation support will introduce
both such cases. Attempting to disable stack protection for everything
in scope to address that is prohibitive since much of the code, like the
SEV-ES #VC handler, is shared code that remains in use after boot and
could benefit from having stack protection enabled. Attempting to inline
calls is brittle and can quickly balloon out to library/helper code
where that's not really an option.

Instead, re-enable stack protection for head32.c/head64.c, and make the
appropriate changes to ensure the segment used for the stack canary is
initialized in advance of any stack-protected C calls.

For head64.c:

- The BSP will enter from startup_64() and call into C code
(startup_64_setup_env()) shortly after setting up the stack, which
may result in calls to stack-protected code. Set up %gs early to allow
for this safely.
- APs will enter from secondary_startup_64*(), and %gs will be set up
soon after. There is one call to C code prior to %gs being setup
(__startup_secondary_64()), but it is only to fetch 'sme_me_mask'
global, so just load 'sme_me_mask' directly instead, and remove the
now-unused __startup_secondary_64() function.

For head32.c:

- BSPs/APs will set %fs to __BOOT_DS prior to any C calls. In recent
kernels, the compiler is configured to access the stack canary at
%fs:__stack_chk_guard [1], which overlaps with the initial per-cpu
'__stack_chk_guard' variable in the initial/"master" .data..percpu
area. This is sufficient to allow access to the canary for use
during initial startup, so no changes are needed there.

[1] 3fb0fdb3bbe7 ("x86/stackprotector/32: Make the canary into a regular percpu variable")

[ bp: Massage commit message. ]

Suggested-by: Joerg Roedel <jroedel@suse.de> #for 64-bit %gs set up
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-24-brijesh.singh@amd.com


# efac0eed 08-Feb-2022 Brijesh Singh <brijesh.singh@amd.com>

x86/kernel: Mark the .bss..decrypted section as shared in the RMP table

The encryption attribute for the .bss..decrypted section is cleared in the
initial page table build. This is because the section contains the data
that need to be shared between the guest and the hypervisor.

When SEV-SNP is active, just clearing the encryption attribute in the
page table is not enough. The page state needs to be updated in the RMP
table.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-20-brijesh.singh@amd.com


# 95d33bfa 08-Feb-2022 Brijesh Singh <brijesh.singh@amd.com>

x86/sev: Register GHCB memory when SEV-SNP is active

The SEV-SNP guest is required by the GHCB spec to register the GHCB's
Guest Physical Address (GPA). This is because the hypervisor may prefer
that a guest uses a consistent and/or specific GPA for the GHCB associated
with a vCPU. For more information, see the GHCB specification section
"GHCB GPA Registration".

[ bp: Cleanup comments. ]

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-18-brijesh.singh@amd.com


# bcce8290 08-Feb-2022 Michael Roth <michael.roth@amd.com>

x86/sev: Detect/setup SEV/SME features earlier in boot

sme_enable() handles feature detection for both SEV and SME. Future
patches will also use it for SEV-SNP feature detection/setup, which
will need to be done immediately after the first #VC handler is set up.
Move it now in preparation.

Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Link: https://lore.kernel.org/r/20220307213356.2797205-9-brijesh.singh@amd.com


# 5f117033 11-Feb-2022 Marco Bonelli <marco@mebeim.net>

x86/head64: Add missing __head annotation to sme_postprocess_startup()

This function was previously part of __startup_64() which is marked
__head, and is currently only called from there. Mark it __head too.

Signed-off-by: Marco Bonelli <marco@mebeim.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220211162350.11780-1-marco@mebeim.net


# b64dfcde 17-Dec-2021 Borislav Petkov <bp@suse.de>

x86/mm: Prevent early boot triple-faults with instrumentation

Commit in Fixes added a global TLB flush on the early boot path, after
the kernel switches off of the trampoline page table.

Compiler profiling options enabled with GCOV_PROFILE add additional
measurement code on clang which needs to be initialized prior to
use. The global flush in x86_64_start_kernel() happens before those
initializations can happen, leading to accessing invalid memory.
GCOV_PROFILE builds with gcc are still ok so this is clang-specific.

The second issue this fixes is with KASAN: for a similar reason,
kasan_early_init() needs to have happened before KASAN-instrumented
functions are called.

Therefore, reorder the flush to happen after the KASAN early init
and prevent the compilers from adding profiling instrumentation to
native_write_cr4().

Fixes: f154f290855b ("x86/mm/64: Flush global TLB on boot and AP bringup")
Reported-by: "J. Bruce Fields" <bfields@fieldses.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Carel Si <beibei.si@intel.com>
Tested-by: "J. Bruce Fields" <bfields@fieldses.org>
Link: https://lore.kernel.org/r/20211209144141.GC25654@xsang-OptiPlex-9020


# f154f290 02-Dec-2021 Joerg Roedel <jroedel@suse.de>

x86/mm/64: Flush global TLB on boot and AP bringup

The AP bringup code uses the trampoline_pgd page-table which
establishes global mappings in the user range of the address space.
Flush the global TLB entries after the indentity mappings are removed so
no stale entries remain in the TLB.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211202153226.22946-3-joro@8bytes.org


# 5ed0a99b 10-Nov-2021 Borislav Petkov <bp@suse.de>

x86/head64: Carve out the guest encryption postprocessing into a helper

Carve it out so that it is abstracted out of the main boot path. All
other encrypted guest-relevant processing should be placed in there.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211110220731.2396491-7-brijesh.singh@amd.com


# e9d1d2bb 08-Sep-2021 Tom Lendacky <thomas.lendacky@amd.com>

treewide: Replace the use of mem_encrypt_active() with cc_platform_has()

Replace uses of mem_encrypt_active() with calls to cc_platform_has() with
the CC_ATTR_MEM_ENCRYPT attribute.

Remove the implementation of mem_encrypt_active() across all arches.

For s390, since the default implementation of the cc_platform_has()
matches the s390 implementation of mem_encrypt_active(), cc_platform_has()
does not need to be implemented in s390 (the config option
ARCH_HAS_CC_PLATFORM is not set).

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-9-bp@alien8.de


# e759959f 27-Apr-2021 Brijesh Singh <brijesh.singh@amd.com>

x86/sev-es: Rename sev-es.{ch} to sev.{ch}

SEV-SNP builds upon the SEV-ES functionality while adding new hardware
protection. Version 2 of the GHCB specification adds new NAE events that
are SEV-SNP specific. Rename the sev-es.{ch} to sev.{ch} so that all
SEV* functionality can be consolidated in one place.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20210427111636.1207-2-brijesh.singh@amd.com


# d9f6e12f 18-Mar-2021 Ingo Molnar <mingo@kernel.org>

x86: Fix various typos in comments

Fix ~144 single-word typos in arch/x86/ code comments.

Doing this in a single commit should reduce the churn.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-kernel@vger.kernel.org


# 61b39ad9 08-Nov-2020 Wang Qing <wangqing@vivo.com>

x86/head64: Remove duplicate include

Remove duplicate header include.

Signed-off-by: Wang Qing <wangqing@vivo.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1604893542-20961-1-git-send-email-wangqing@vivo.com


# 33def849 21-Oct-2020 Joe Perches <joe@perches.com>

treewide: Convert macro and uses of __section(foo) to __section("foo")

Use a more generic form for __section that requires quotes to avoid
complications with clang and gcc differences.

Remove the quote operator # from compiler_attributes.h __section macro.

Convert all unquoted __section(foo) uses to quoted __section("foo").
Also convert __attribute__((section("foo"))) uses to __section("foo")
even if the __attribute__ has multiple list entry forms.

Conversion done using the script at:

https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@gooogle.com>
Reviewed-by: Miguel Ojeda <ojeda@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1aa9aa8e 08-Sep-2020 Joerg Roedel <jroedel@suse.de>

x86/sev-es: Setup GHCB-based boot #VC handler

Add the infrastructure to handle #VC exceptions when the kernel runs on
virtual addresses and has mapped a GHCB. This handler will be used until
the runtime #VC handler takes over.

Since the handler runs very early, disable instrumentation for sev-es.c.

[ bp: Make vc_ghcb_invalidate() __always_inline so that it can be
inlined in noinstr functions like __sev_es_nmi_complete(). ]

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200908123816.GB3764@8bytes.org


# 74d8d9d5 08-Sep-2020 Joerg Roedel <jroedel@suse.de>

x86/sev-es: Setup an early #VC handler

Setup an early handler for #VC exceptions. There is no GHCB mapped
yet, so just re-use the vc_no_ghcb_handler(). It can only handle
CPUID exit-codes, but that should be enough to get the kernel through
verify_cpu() and __startup_64() until it runs on virtual addresses.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
[ boot failure Error: kernel_ident_mapping_init() failed. ]
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lkml.kernel.org/r/20200908123517.GA3764@8bytes.org


# 4b47cdbd 07-Sep-2020 Joerg Roedel <jroedel@suse.de>

x86/head/64: Move early exception dispatch to C code

Move the assembly coded dispatch between page-faults and all other
exceptions to C code to make it easier to maintain and extend.

Also change the return-type of early_make_pgtable() to bool and make it
static.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-36-joro@8bytes.org


# f5963ba7 07-Sep-2020 Joerg Roedel <jroedel@suse.de>

x86/head/64: Install a CPU bringup IDT

Add a separate bringup IDT for the CPU bringup code that will be used
until the kernel switches to the idt_table. There are two reasons for a
separate IDT:

1) When the idt_table is set up and the secondary CPUs are
booted, it contains entries (e.g. IST entries) which
require certain CPU state to be set up. This includes a
working TSS (for IST), MSR_GS_BASE (for stack protector) or
CR4.FSGSBASE (for paranoid_entry) path. By using a
dedicated IDT for early boot this state need not to be set
up early.

2) The idt_table is static to idt.c, so any function
using/modifying must be in idt.c too. That means that all
compiler driven instrumentation like tracing or KASAN is
also active in this code. But during early CPU bringup the
environment is not set up for this instrumentation to work
correctly.

To avoid all of these hassles and make early exception handling robust,
use a dedicated bringup IDT.

The IDT is loaded two times, first on the boot CPU while the kernel is
still running on direct mapped addresses, and again later after the
switch to kernel addresses has happened. The second IDT load happens on
the boot and secondary CPUs.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-34-joro@8bytes.org


# 866b556e 07-Sep-2020 Joerg Roedel <jroedel@suse.de>

x86/head/64: Install startup GDT

Handling exceptions during boot requires a working GDT. The kernel GDT
can't be used on the direct mapping, so load a startup GDT and setup
segments.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-30-joro@8bytes.org


# 65fddcfc 08-Jun-2020 Mike Rapoport <rppt@kernel.org>

mm: reorder includes after introduction of linux/pgtable.h

The replacement of <asm/pgrable.h> with <linux/pgtable.h> made the include
of the latter in the middle of asm includes. Fix this up with the aid of
the below script and manual adjustments here and there.

import sys
import re

if len(sys.argv) is not 3:
print "USAGE: %s <file> <header>" % (sys.argv[0])
sys.exit(1)

hdr_to_move="#include <linux/%s>" % sys.argv[2]
moved = False
in_hdrs = False

with open(sys.argv[1], "r") as f:
lines = f.readlines()
for _line in lines:
line = _line.rstrip('
')
if line == hdr_to_move:
continue
if line.startswith("#include <linux/"):
in_hdrs = True
elif not moved and in_hdrs:
moved = True
print hdr_to_move
print line

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-4-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ca5999fd 08-Jun-2020 Mike Rapoport <rppt@kernel.org>

mm: introduce include/linux/pgtable.h

The include/linux/pgtable.h is going to be the home of generic page table
manipulation functions.

Start with moving asm-generic/pgtable.h to include/linux/pgtable.h and
make the latter include asm/pgtable.h.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-3-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2aa85f24 24-Sep-2019 Steve Wahl <steve.wahl@hpe.com>

x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area

Our hardware (UV aka Superdome Flex) has address ranges marked
reserved by the BIOS. Access to these ranges is caught as an error,
causing the BIOS to halt the system.

Initial page tables mapped a large range of physical addresses that
were not checked against the list of BIOS reserved addresses, and
sometimes included reserved addresses in part of the mapped range.
Including the reserved range in the map allowed processor speculative
accesses to the reserved range, triggering a BIOS halt.

Used early in booting, the page table level2_kernel_pgt addresses 1
GiB divided into 2 MiB pages, and it was set up to linearly map a full
1 GiB of physical addresses that included the physical address range
of the kernel image, as chosen by KASLR. But this also included a
large range of unused addresses on either side of the kernel image.
And unlike the kernel image's physical address range, this extra
mapped space was not checked against the BIOS tables of usable RAM
addresses. So there were times when the addresses chosen by KASLR
would result in processor accessible mappings of BIOS reserved
physical addresses.

The kernel code did not directly access any of this extra mapped
space, but having it mapped allowed the processor to issue speculative
accesses into reserved memory, causing system halts.

This was encountered somewhat rarely on a normal system boot, and much
more often when starting the crash kernel if "crashkernel=512M,high"
was specified on the command line (this heavily restricts the physical
address of the crash kernel, in our case usually within 1 GiB of
reserved space).

The solution is to invalidate the pages of this table outside the kernel
image's space before the page table is activated. It fixes this problem
on our hardware.

[ bp: Touchups. ]

Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: dimitri.sivanich@hpe.com
Cc: Feng Tang <feng.tang@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jordan Borgner <mail@jordan-borgner.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: mike.travis@hpe.com
Cc: russ.anderson@hpe.com
Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Link: https://lkml.kernel.org/r/9c011ee51b081534a7a15065b1681d200298b530.1569358539.git.steve.wahl@hpe.com


# c1887159 20-Jun-2019 Kirill A. Shutemov <kirill@shutemov.name>

x86/boot/64: Add missing fixup_pointer() for next_early_pgt access

__startup_64() uses fixup_pointer() to access global variables in a
position-independent fashion. Access to next_early_pgt was wrapped into the
helper, but one instance in the 5-level paging branch was missed.

GCC generates a R_X86_64_PC32 PC-relative relocation for the access which
doesn't trigger the issue, but Clang emmits a R_X86_64_32S which leads to
an invalid memory access and system reboot.

Fixes: 187e91fe5e91 ("x86/boot/64/clang: Use fixup_pointer() to access 'next_early_pgt'")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Potapenko <glider@google.com>
Link: https://lkml.kernel.org/r/20190620112422.29264-1-kirill.shutemov@linux.intel.com


# 81c7ed29 20-Jun-2019 Kirill A. Shutemov <kirill@shutemov.name>

x86/boot/64: Fix crash if kernel image crosses page table boundary

A kernel which boots in 5-level paging mode crashes in a small percentage
of cases if KASLR is enabled.

This issue was tracked down to the case when the kernel image unpacks in a
way that it crosses an 1G boundary. The crash is caused by an overrun of
the PMD page table in __startup_64() and corruption of P4D page table
allocated next to it. This particular issue is not visible with 4-level
paging as P4D page tables are not used.

But the P4D and the PUD calculation have similar problems.

The PMD index calculation is wrong due to operator precedence, which fails
to confine the PMDs in the PMD array on wrap around.

The P4D calculation for 5-level paging and the PUD calculation calculate
the first index correctly, but then blindly increment it which causes the
same issue when a kernel image is located across a 512G and for 5-level
paging across a 46T boundary.

This wrap around mishandling was introduced when these parts moved from
assembly to C.

Restore it to the correct behaviour.

Fixes: c88d71508e36 ("x86/boot/64: Rewrite startup_64() in C")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190620112345.28833-1-kirill.shutemov@linux.intel.com


# 38418404 20-Nov-2018 Juergen Gross <jgross@suse.com>

x86/boot: Mostly revert commit ae7e1238e68f2a ("Add ACPI RSDP address to setup_header")

Peter Anvin pointed out that commit:

ae7e1238e68f2a ("x86/boot: Add ACPI RSDP address to setup_header")

should be reverted as setup_header should only contain items set by the
legacy BIOS.

So revert said commit. Instead of fully reverting the dependent commit
of:

e7b66d16fe4172 ("x86/acpi, x86/boot: Take RSDP address for boot params if available")

just remove the setup_header reference in order to replace it by
a boot_params in a followup patch.

Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boris.ostrovsky@oracle.com
Cc: bp@alien8.de
Cc: daniel.kiper@oracle.com
Cc: sstabellini@kernel.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/20181120072529.5489-2-jgross@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 0e96f31e 27-Oct-2018 Jordan Borgner <mail@jordan-borgner.de>

x86: Clean up 'sizeof x' => 'sizeof(x)'

"sizeof(x)" is the canonical coding style used in arch/x86 most of the time.
Fix the few places that didn't follow the convention.

(Also do some whitespace cleanups in a few places while at it.)

[ mingo: Rewrote the changelog. ]

Signed-off-by: Jordan Borgner <mail@jordan-borgner.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181028125828.7rgammkgzep2wpam@JordanDesktop
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# ae7e1238 10-Oct-2018 Juergen Gross <jgross@suse.com>

x86/boot: Add ACPI RSDP address to setup_header

Xen PVH guests receive the address of the RSDP table from Xen. In order
to support booting a Xen PVH guest via Grub2 using the standard x86
boot entry we need a way for Grub2 to pass the RSDP address to the
kernel.

For this purpose expand the struct setup_header to hold the physical
address of the RSDP address. Being zero means it isn't specified and
has to be located the legacy way (searching through low memory or
EBDA).

While documenting the new setup_header layout and protocol version
2.14 add the missing documentation of protocol version 2.13.

There are Grub2 versions in several distros with a downstream patch
violating the boot protocol by writing past the end of setup_header.
This requires another update of the boot protocol to enable the kernel
to distinguish between a specified RSDP address and one filled with
garbage by such a broken Grub2.

From protocol 2.14 on Grub2 will write the version it is supporting
(but never a higher value than found to be supported by the kernel)
ored with 0x8000 to the version field of setup_header. This enables
the kernel to know up to which field Grub2 has written information
to. All fields after that are supposed to be clobbered.

Signed-off-by: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boris.ostrovsky@oracle.com
Cc: bp@alien8.de
Cc: corbet@lwn.net
Cc: linux-doc@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/20181010061456.22238-3-jgross@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 05ab1d8a 19-Sep-2018 Feng Tang <feng.tang@intel.com>

x86/mm: Expand static page table for fixmap space

We met a kernel panic when enabling earlycon, which is due to the fixmap
address of earlycon is not statically setup.

Currently the static fixmap setup in head_64.S only covers 2M virtual
address space, while it actually could be in 4M space with different
kernel configurations, e.g. when VSYSCALL emulation is disabled.

So increase the static space to 4M for now by defining FIXMAP_PMD_NUM to 2,
and add a build time check to ensure that the fixmap is covered by the
initial static page tables.

Fixes: 1ad83c858c7d ("x86_64,vsyscall: Make vsyscall emulation configurable")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: kernel test robot <rong.a.chen@intel.com>
Reviewed-by: Juergen Gross <jgross@suse.com> (Xen parts)
Cc: H Peter Anvin <hpa@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180920025828.23699-1-feng.tang@intel.com


# b3f0907c 14-Sep-2018 Brijesh Singh <brijesh.singh@amd.com>

x86/mm: Add .bss..decrypted section to hold shared variables

kvmclock defines few static variables which are shared with the
hypervisor during the kvmclock initialization.

When SEV is active, memory is encrypted with a guest-specific key, and
if the guest OS wants to share the memory region with the hypervisor
then it must clear the C-bit before sharing it.

Currently, we use kernel_physical_mapping_init() to split large pages
before clearing the C-bit on shared pages. But it fails when called from
the kvmclock initialization (mainly because the memblock allocator is
not ready that early during boot).

Add a __bss_decrypted section attribute which can be used when defining
such shared variable. The so-defined variables will be placed in the
.bss..decrypted section. This section will be mapped with C=0 early
during boot.

The .bss..decrypted section has a big chunk of memory that may be unused
when memory encryption is not active, free it when memory encryption is
not active.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Radim Krčmář<rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Link: https://lkml.kernel.org/r/1536932759-12905-2-git-send-email-brijesh.singh@amd.com


# 51be1335 22-Jun-2018 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Revert "x86/mm: Mark __pgtable_l5_enabled __initdata"

This reverts commit e4e961e36f063484c48bed919013c106d178995d.

We need to use early version of pgtable_l5_enabled() in
early_identify_cpu() as this code runs before cpu_feature_enabled() is
usable.

But it leads to section mismatch:

cpu_init()
load_mm_ldt()
ldt_slot_va()
LDT_BASE_ADDR
LDT_PGD_ENTRY
pgtable_l5_enabled()
__pgtable_l5_enabled

__pgtable_l5_enabled marked as __initdata, but cpu_init() is not __init.

It's fixable: early code can be isolated into a separate translation unit,
but such change collides with other work in the area. That's too much
hassle to save 4 bytes of memory.

Return __pgtable_l5_enabled back to be __ro_after_init.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20180622220841.54135-2-kirill.shutemov@linux.intel.com


# e4e961e3 18-May-2018 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/mm: Mark __pgtable_l5_enabled __initdata

__pgtable_l5_enabled shouldn't be needed after system has booted.
All preparation is done. We can now mark it as __initdata.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180518103528.59260-8-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 372fddf7 18-May-2018 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/mm: Introduce the 'no5lvl' kernel parameter

This kernel parameter allows to force kernel to use 4-level paging even
if hardware and kernel support 5-level paging.

The option may be useful to work around regressions related to 5-level
paging.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180518103528.59260-5-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# ed7588d5 18-May-2018 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/mm: Stop pretending pgtable_l5_enabled is a variable

pgtable_l5_enabled is defined using cpu_feature_enabled() but we refer
to it as a variable. This is misleading.

Make pgtable_l5_enabled() a function.

We cannot literally define it as a function due to circular dependencies
between header files. Function-alike macros is close enough.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180518103528.59260-4-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# ad3fe525 18-May-2018 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/mm: Unify pgtable_l5_enabled usage in early boot code

Usually pgtable_l5_enabled is defined using cpu_feature_enabled().
cpu_feature_enabled() is not available in early boot code. We use
several different preprocessor tricks to get around it. It's messy.

Unify them all.

If cpu_feature_enabled() is not yet available, USE_EARLY_PGTABLE_L5 can
be defined before all includes. It makes pgtable_l5_enabled rely on
__pgtable_l5_enabled variable instead. This approach fits all early
users.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180518103528.59260-3-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 4a09f021 09-May-2018 Alexander Potapenko <glider@google.com>

x86/boot/64/clang: Use fixup_pointer() to access '__supported_pte_mask'

Clang builds with defconfig started crashing after the following
commit:

fb43d6cb91ef ("x86/mm: Do not auto-massage page protections")

This was caused by introducing a new global access in __startup_64().

Code in __startup_64() can be relocated during execution, but the compiler
doesn't have to generate PC-relative relocations when accessing globals
from that function. Clang actually does not generate them, which leads
to boot-time crashes. To work around this problem, every global pointer
must be adjusted using fixup_pointer().

Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: dvyukov@google.com
Cc: kirill.shutemov@linux.intel.com
Cc: linux-mm@kvack.org
Cc: md@google.com
Cc: mka@chromium.org
Fixes: fb43d6cb91ef ("x86/mm: Do not auto-massage page protections")
Link: http://lkml.kernel.org/r/20180509091822.191810-1-glider@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# fb43d6cb 06-Apr-2018 Dave Hansen <dave.hansen@linux.intel.com>

x86/mm: Do not auto-massage page protections

A PTE is constructed from a physical address and a pgprotval_t.
__PAGE_KERNEL, for instance, is a pgprot_t and must be converted
into a pgprotval_t before it can be used to create a PTE. This is
done implicitly within functions like pfn_pte() by massage_pgprot().

However, this makes it very challenging to set bits (and keep them
set) if your bit is being filtered out by massage_pgprot().

This moves the bit filtering out of pfn_pte() and friends. For
users of PAGE_KERNEL*, filtering will be done automatically inside
those macros but for users of __PAGE_KERNEL*, they need to do their
own filtering now.

Note that we also just move pfn_pte/pmd/pud() over to check_pgprot()
instead of massage_pgprot(). This way, we still *look* for
unsupported bits and properly warn about them if we find them. This
might happen if an unfiltered __PAGE_KERNEL* value was passed in,
for instance.

- printk format warning fix from: Arnd Bergmann <arnd@arndb.de>
- boot crash fix from: Tom Lendacky <thomas.lendacky@amd.com>
- crash bisected by: Mike Galbraith <efault@gmx.de>

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reported-and-fixed-by: Arnd Bergmann <arnd@arndb.de>
Fixed-by: Tom Lendacky <thomas.lendacky@amd.com>
Bisected-by: Mike Galbraith <efault@gmx.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180406205509.77E1D7F6@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 39b95522 16-Feb-2018 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/mm: Optimize boot-time paging mode switching cost

By this point we have functioning boot-time switching between 4- and
5-level paging mode. But naive approach comes with cost.

Numbers below are for kernel build, allmodconfig, 5 times.

CONFIG_X86_5LEVEL=n:

Performance counter stats for 'sh -c make -j100 -B -k >/dev/null' (5 runs):

17308719.892691 task-clock:u (msec) # 26.772 CPUs utilized ( +- 0.11% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
331,993,164 page-faults:u # 0.019 M/sec ( +- 0.01% )
43,614,978,867,455 cycles:u # 2.520 GHz ( +- 0.01% )
39,371,534,575,126 stalled-cycles-frontend:u # 90.27% frontend cycles idle ( +- 0.09% )
28,363,350,152,428 instructions:u # 0.65 insn per cycle
# 1.39 stalled cycles per insn ( +- 0.00% )
6,316,784,066,413 branches:u # 364.948 M/sec ( +- 0.00% )
250,808,144,781 branch-misses:u # 3.97% of all branches ( +- 0.01% )

646.531974142 seconds time elapsed ( +- 1.15% )

CONFIG_X86_5LEVEL=y:

Performance counter stats for 'sh -c make -j100 -B -k >/dev/null' (5 runs):

17411536.780625 task-clock:u (msec) # 26.426 CPUs utilized ( +- 0.10% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
331,868,663 page-faults:u # 0.019 M/sec ( +- 0.01% )
43,865,909,056,301 cycles:u # 2.519 GHz ( +- 0.01% )
39,740,130,365,581 stalled-cycles-frontend:u # 90.59% frontend cycles idle ( +- 0.05% )
28,363,358,997,959 instructions:u # 0.65 insn per cycle
# 1.40 stalled cycles per insn ( +- 0.00% )
6,316,784,937,460 branches:u # 362.793 M/sec ( +- 0.00% )
251,531,919,485 branch-misses:u # 3.98% of all branches ( +- 0.00% )

658.886307752 seconds time elapsed ( +- 0.92% )

The patch tries to fix the performance regression by using
cpu_feature_enabled(X86_FEATURE_LA57) instead of pgtable_l5_enabled in
all hot code paths. These will statically patch the target code for
additional performance.

CONFIG_X86_5LEVEL=y + the patch:

Performance counter stats for 'sh -c make -j100 -B -k >/dev/null' (5 runs):

17381990.268506 task-clock:u (msec) # 26.907 CPUs utilized ( +- 0.19% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
331,862,625 page-faults:u # 0.019 M/sec ( +- 0.01% )
43,697,726,320,051 cycles:u # 2.514 GHz ( +- 0.03% )
39,480,408,690,401 stalled-cycles-frontend:u # 90.35% frontend cycles idle ( +- 0.05% )
28,363,394,221,388 instructions:u # 0.65 insn per cycle
# 1.39 stalled cycles per insn ( +- 0.00% )
6,316,794,985,573 branches:u # 363.410 M/sec ( +- 0.00% )
251,013,232,547 branch-misses:u # 3.97% of all branches ( +- 0.01% )

645.991174661 seconds time elapsed ( +- 1.19% )

Unfortunately, this approach doesn't help with text size:

vmlinux.before .text size: 8190319
vmlinux.after .text size: 8200623

The .text section is increased by about 4k. Not sure if we can do anything
about this.

Signed-off-by: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180216114948.68868-4-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 6f9dd329 14-Feb-2018 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/mm: Support boot-time switching of paging modes in the early boot code

Early boot code should be able to initialize page tables for both 4- and
5-level paging modes.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180214182542.69302-7-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 9b46a051 14-Feb-2018 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/mm: Initialize vmemmap_base at boot-time

vmemmap area has different placement depending on paging mode.
Let's adjust it during early boot accodring to machine capability.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180214182542.69302-6-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# a7412546 14-Feb-2018 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/mm: Adjust vmalloc base and size at boot-time

vmalloc area has different placement and size depending on paging mode.
Let's adjust it during early boot accodring to machine capability.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180214182542.69302-5-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 4fa5662b 14-Feb-2018 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/mm: Initialize 'page_offset_base' at boot-time

For 4- and 5-level paging we have different 'page_offset_base'.
Let's initialize it at boot-time accordingly to machine capability.

We also have to split __PAGE_OFFSET_BASE into two constants -- for 4-
and 5-level paging.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180214182542.69302-4-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# b16e770b 14-Feb-2018 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/mm: Initialize 'pgdir_shift' and 'ptrs_per_p4d' at boot-time

Switching between paging modes requires the folding of the p4d page table level
when we only have 4 paging levels, which means we need to adjust 'pgdir_shift'
and 'ptrs_per_p4d' during early boot according to paging mode.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180214182542.69302-3-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 4c2b4058 14-Feb-2018 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/mm: Initialize 'pgtable_l5_enabled' at boot-time

'pgtable_l5_enabled' indicates which paging mode we are using. We need to
initialize it at boot-time according to machine capability.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180214182542.69302-2-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# c65e774f 14-Feb-2018 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/mm: Make PGDIR_SHIFT and PTRS_PER_P4D variable

For boot-time switching between 4- and 5-level paging we need to be able
to fold p4d page table level at runtime. It requires variable
PGDIR_SHIFT and PTRS_PER_P4D.

The change doesn't affect the kernel image size much:

text data bss dec hex filename
8628091 4734304 1368064 14730459 e0c4db vmlinux.before
8628393 4734340 1368064 14730797 e0c62d vmlinux.after

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180214111656.88514-7-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# e626e6bb 14-Feb-2018 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/mm: Introduce 'pgtable_l5_enabled'

The new flag would indicate what paging mode we are in.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180214111656.88514-5-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# eedb92ab 14-Feb-2018 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/mm: Make virtual memory layout dynamic for CONFIG_X86_5LEVEL=y

We need to be able to adjust virtual memory layout at runtime to be able
to switch between 4- and 5-level paging at boot-time.

KASLR already has movable __VMALLOC_BASE, __VMEMMAP_BASE and __PAGE_OFFSET.
Let's re-use it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180214111656.88514-4-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 107cd253 10-Jan-2018 Tom Lendacky <thomas.lendacky@amd.com>

x86/mm: Encrypt the initrd earlier for BSP microcode update

Currently the BSP microcode update code examines the initrd very early
in the boot process. If SME is active, the initrd is treated as being
encrypted but it has not been encrypted (in place) yet. Update the
early boot code that encrypts the kernel to also encrypt the initrd so
that early BSP microcode updates work.

Tested-by: Gabriel Craciunescu <nix.or.die@gmail.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180110192634.6026.10452.stgit@tlendack-t1.amdoffice.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# b2441318 01-Nov-2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 588787fd 28-Aug-2017 Thomas Gleixner <tglx@linutronix.de>

x86/idt: Move early IDT handler setup to IDT code

The early IDT handler setup is done in C entry code on 64-bit kernels and in
ASM entry code on 32-bit kernels.

Move the 64-bit variant to the IDT code so it can be shared with 32-bit
in the next step.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20170828064958.679561404@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 187e91fe 16-Aug-2017 Alexander Potapenko <glider@google.com>

x86/boot/64/clang: Use fixup_pointer() to access 'next_early_pgt'

__startup_64() is normally using fixup_pointer() to access globals in a
position-independent fashion. However 'next_early_pgt' was accessed
directly, which wasn't guaranteed to work.

Luckily GCC was generating a R_X86_64_PC32 PC-relative relocation for
'next_early_pgt', but Clang emitted a R_X86_64_32S, which led to
accessing invalid memory and rebooting the kernel.

Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Davidson <md@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: c88d71508e36 ("x86/boot/64: Rewrite startup_64() in C")
Link: http://lkml.kernel.org/r/20170816190808.131748-1-glider@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# aca20d54 17-Jul-2017 Tom Lendacky <thomas.lendacky@amd.com>

x86/mm: Add support to make use of Secure Memory Encryption

Add support to check if SME has been enabled and if memory encryption
should be activated (checking of command line option based on the
configuration of the default state). If memory encryption is to be
activated, then the encryption mask is set and the kernel is encrypted
"in place."

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Toshimitsu Kani <toshi.kani@hpe.com>
Cc: kasan-dev@googlegroups.com
Cc: kvm@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-efi@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/5f0da2fd4cce63f556117549e2c89c170072209f.1500319216.git.thomas.lendacky@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# b9d05200 17-Jul-2017 Tom Lendacky <thomas.lendacky@amd.com>

x86/mm: Insure that boot memory areas are mapped properly

The boot data and command line data are present in memory in a decrypted
state and are copied early in the boot process. The early page fault
support will map these areas as encrypted, so before attempting to copy
them, add decrypted mappings so the data is accessed properly when copied.

For the initrd, encrypt this data in place. Since the future mapping of
the initrd area will be mapped as encrypted the data will be accessed
properly.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Toshimitsu Kani <toshi.kani@hpe.com>
Cc: kasan-dev@googlegroups.com
Cc: kvm@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-efi@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/bb0d430b41efefd45ee515aaf0979dcfda8b6a44.1500319216.git.thomas.lendacky@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 21729f81 17-Jul-2017 Tom Lendacky <thomas.lendacky@amd.com>

x86/mm: Provide general kernel support for memory encryption

Changes to the existing page table macros will allow the SME support to
be enabled in a simple fashion with minimal changes to files that use these
macros. Since the memory encryption mask will now be part of the regular
pagetable macros, we introduce two new macros (_PAGE_TABLE_NOENC and
_KERNPG_TABLE_NOENC) to allow for early pagetable creation/initialization
without the encryption mask before SME becomes active. Two new pgprot()
macros are defined to allow setting or clearing the page encryption mask.

The FIXMAP_PAGE_NOCACHE define is introduced for use with MMIO. SME does
not support encryption for MMIO areas so this define removes the encryption
mask from the page attribute.

Two new macros are introduced (__sme_pa() / __sme_pa_nodebug()) to allow
creating a physical address with the encryption mask. These are used when
working with the cr3 register so that the PGD can be encrypted. The current
__va() macro is updated so that the virtual address is generated based off
of the physical address without the encryption mask thus allowing the same
virtual address to be generated regardless of whether encryption is enabled
for that physical location or not.

Also, an early initialization function is added for SME. If SME is active,
this function:

- Updates the early_pmd_flags so that early page faults create mappings
with the encryption mask.

- Updates the __supported_pte_mask to include the encryption mask.

- Updates the protection_map entries to include the encryption mask so
that user-space allocations will automatically have the encryption mask
applied.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Toshimitsu Kani <toshi.kani@hpe.com>
Cc: kasan-dev@googlegroups.com
Cc: kvm@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-efi@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/b36e952c4c39767ae7f0a41cf5345adf27438480.1500319216.git.thomas.lendacky@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 5868f365 17-Jul-2017 Tom Lendacky <thomas.lendacky@amd.com>

x86/mm: Add support to enable SME in early boot processing

Add support to the early boot code to use Secure Memory Encryption (SME).
Since the kernel has been loaded into memory in a decrypted state, encrypt
the kernel in place and update the early pagetables with the memory
encryption mask so that new pagetable entries will use memory encryption.

The routines to set the encryption mask and perform the encryption are
stub routines for now with functionality to be added in a later patch.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Toshimitsu Kani <toshi.kani@hpe.com>
Cc: kasan-dev@googlegroups.com
Cc: kvm@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-efi@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/e52ad781f085224bf835b3caff9aa3aee6febccb.1500319216.git.thomas.lendacky@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 26179670 16-Jun-2017 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/boot/64: Put __startup_64() into .head.text

Put __startup_64() and fixup_pointer() into .head.text section to make
sure it's always near startup_64() and always callable.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel test robot <fengguang.wu@intel.com>
Cc: wfg@linux.intel.com
Link: http://lkml.kernel.org/r/20170616113024.ajmif63cmcszry5a@black.fi.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 032370b9 06-Jun-2017 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/boot/64: Add support of additional page table level during early boot

This patch adds support for 5-level paging during early boot.
It generalizes boot for 4- and 5-level paging on 64-bit systems with
compile-time switch between them.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170606113133.22974-10-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 65ade2f8 06-Jun-2017 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/boot/64: Rename init_level4_pgt and early_level4_pgt

With CONFIG_X86_5LEVEL=y, level 4 is no longer top level of page tables.

Let's give these variable more generic names: init_top_pgt and
early_top_pgt.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170606113133.22974-9-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# c88d7150 06-Jun-2017 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/boot/64: Rewrite startup_64() in C

The patch write most of startup_64 logic in C.

This is preparation for 5-level paging enabling.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170606113133.22974-8-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 6c690ee1 12-Jun-2017 Andy Lutomirski <luto@kernel.org>

x86/mm: Split read_cr3() into read_cr3_pa() and __read_cr3()

The kernel has several code paths that read CR3. Most of them assume that
CR3 contains the PGD's physical address, whereas some of them awkwardly
use PHYSICAL_PAGE_MASK to mask off low bits.

Add explicit mask macros for CR3 and convert all of the CR3 readers.
This will keep them from breaking when PCID is enabled.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: xen-devel <xen-devel@lists.xen.org>
Link: http://lkml.kernel.org/r/883f8fb121f4616c1c1427ad87350bb2f5ffeca1.1497288170.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# be3606ff 13-Mar-2017 Andrey Ryabinin <ryabinin.a.a@gmail.com>

x86/kasan: Fix boot with KASAN=y and PROFILE_ANNOTATED_BRANCHES=y

The kernel doesn't boot with both PROFILE_ANNOTATED_BRANCHES=y and KASAN=y
options selected. With branch profiling enabled we end up calling
ftrace_likely_update() before kasan_early_init(). ftrace_likely_update() is
built with KASAN instrumentation, so calling it before kasan has been
initialized leads to crash.

Use DISABLE_BRANCH_PROFILING define to make sure that we don't call
ftrace_likely_update() from early code before kasan_early_init().

Fixes: ef7f0d6a6ca8 ("x86_64: add KASan support")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Alexander Potapenko <glider@google.com>
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: lkp@01.org
Cc: Dmitry Vyukov <dvyukov@google.com>
Link: http://lkml.kernel.org/r/20170313163337.1704-1-aryabinin@virtuozzo.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 66441bd3 27-Jan-2017 Ingo Molnar <mingo@kernel.org>

x86/boot/e820: Move asm/e820.h to asm/e820/api.h

In line with asm/e820/types.h, move the e820 API declarations to
asm/e820/api.h and update all usage sites.

This is just a mechanical, obviously correct move & replace patch,
there will be subsequent changes to clean up the code and to make
better use of the new header organization.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 007b7560 10-Aug-2016 Andy Lutomirski <luto@kernel.org>

x86/boot: Run reserve_bios_regions() after we initialize the memory map

reserve_bios_regions() is a quirk that reserves memory that we might
otherwise think is available. There's no need to run it so early,
and running it before we have the memory map initialized with its
non-quirky inputs makes it hard to make reserve_bios_regions() more
intelligent.

Move it right after we populate the memblock state.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mario Limonciello <mario_limonciello@dell.com>
Cc: Matt Fleming <mfleming@suse.de>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/59f58618911005c799c6c9979ce6ae4881d907c2.1470821230.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# edce2121 21-Jul-2016 Ingo Molnar <mingo@kernel.org>

x86/boot: Reorganize and clean up the BIOS area reservation code

So the reserve_ebda_region() code has accumulated a number of
problems over the years that make it really difficult to read
and understand:

- The calculation of 'lowmem' and 'ebda_addr' is an unnecessarily
interleaved mess of first lowmem, then ebda_addr, then lowmem tweaks...

- 'lowmem' here means 'super low mem' - i.e. 16-bit addressable memory. In other
parts of the x86 code 'lowmem' means 32-bit addressable memory... This makes it
super confusing to read.

- It does not help at all that we have various memory range markers, half of which
are 'start of range', half of which are 'end of range' - but this crucial
property is not obvious in the naming at all ... gave me a headache trying to
understand all this.

- Also, the 'ebda_addr' name sucks: it highlights that it's an address (which is
obvious, all values here are addresses!), while it does not highlight that it's
the _start_ of the EBDA region ...

- 'BIOS_LOWMEM_KILOBYTES' says a lot of things, except that this is the only value
that is a pointer to a value, not a memory range address!

- The function name itself is a misnomer: it says 'reserve_ebda_region()' while
its main purpose is to reserve all the firmware ROM typically between 640K and
1MB, while the 'EBDA' part is only a small part of that ...

- Likewise, the paravirt quirk flag name 'ebda_search' is misleading as well: this
too should be about whether to reserve firmware areas in the paravirt case.

- In fact thinking about this as 'end of RAM' is confusing: what this function
*really* wants to reserve is firmware data and code areas! Once the thinking is
inverted from a mixed 'ram' and 'reserved firmware area' notion to a pure
'reserved area' notion everything becomes a lot clearer.

To improve all this rewrite the whole code (without changing the logic):

- Firstly invert the naming from 'lowmem end' to 'BIOS reserved area start'
and propagate this concept through all the variable names and constants.

BIOS_RAM_SIZE_KB_PTR // was: BIOS_LOWMEM_KILOBYTES

BIOS_START_MIN // was: INSANE_CUTOFF

ebda_start // was: ebda_addr
bios_start // was: lowmem

BIOS_START_MAX // was: LOWMEM_CAP

- Then clean up the name of the function itself by renaming it
to reserve_bios_regions() and renaming the ::ebda_search paravirt
flag to ::reserve_bios_regions.

- Fix up all the comments (fix typos), harmonize and simplify their
formulation and remove comments that become unnecessary due to
the much better naming all around.

Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 8d152e7a 13-Apr-2016 Luis R. Rodriguez <mcgrof@kernel.org>

x86/rtc: Replace paravirt rtc check with platform legacy quirk

We have 4 types of x86 platforms that disable RTC:

* Intel MID
* Lguest - uses paravirt
* Xen dom-U - uses paravirt
* x86 on legacy systems annotated with an ACPI legacy flag

We can consolidate all of these into a platform specific legacy
quirk set early in boot through i386_start_kernel() and through
x86_64_start_reservations(). This deals with the RTC quirks which
we can rely on through the hardware subarch, the ACPI check can
be dealt with separately.

For Xen things are bit more complex given that the @X86_SUBARCH_XEN
x86_hardware_subarch is shared on for Xen which uses the PV path for
both domU and dom0. Since the semantics for differentiating between
the two are Xen specific we provide a platform helper to help override
default legacy features -- x86_platform.set_legacy_features(). Use
of this helper is highly discouraged, its only purpose should be
to account for the lack of semantics available within your given
x86_hardware_subarch.

As per 0-day, this bumps the vmlinux size using i386-tinyconfig as
follows:

TOTAL TEXT init.text x86_early_init_platform_quirks()
+70 +62 +62 +43

Only 8 bytes overhead total, as the main increase in size is
all removed via __init.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: andrew.cooper3@citrix.com
Cc: andriy.shevchenko@linux.intel.com
Cc: bigeasy@linutronix.de
Cc: boris.ostrovsky@oracle.com
Cc: david.vrabel@citrix.com
Cc: ffainelli@freebox.fr
Cc: george.dunlap@citrix.com
Cc: glin@suse.com
Cc: jlee@suse.com
Cc: josh@joshtriplett.org
Cc: julien.grall@linaro.org
Cc: konrad.wilk@oracle.com
Cc: kozerkov@parallels.com
Cc: lenb@kernel.org
Cc: lguest@lists.ozlabs.org
Cc: linux-acpi@vger.kernel.org
Cc: lv.zheng@intel.com
Cc: matt@codeblueprint.co.uk
Cc: mbizon@freebox.fr
Cc: rjw@rjwysocki.net
Cc: robert.moore@intel.com
Cc: rusty@rustcorp.com.au
Cc: tiwai@suse.de
Cc: toshi.kani@hp.com
Cc: xen-devel@lists.xensource.com
Link: http://lkml.kernel.org/r/1460592286-300-5-git-send-email-mcgrof@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# a91bbe01 09-Feb-2016 Alexander Kuleshov <kuleshovmail@gmail.com>

x86/boot: Use proper array element type in memset() size calculation

I changed open coded zeroing loops to explicit memset()s in the
following commit:

5e9ebbd87a99 ("x86/boot: Micro-optimize reset_early_page_tables()")

The base for the size argument of memset was sizeof(pud_p/pmd_p), which
are pointers - but the initialized array has pud_t/pmd_t elements.

Luckily the two types had the same size, so this did not result in any
runtime misbehavior.

Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Cc: Alexander Popov <alpopov@ptsecurity.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1455025494-4063-1-git-send-email-kuleshovmail@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 5e9ebbd8 30-Jan-2016 Alexander Kuleshov <kuleshovmail@gmail.com>

x86/boot: Micro-optimize reset_early_page_tables()

Save 25 bytes of code and make the bootup a tiny bit faster:

text data bss dec filename
9735144 4970776 15474688 30180608 vmlinux.old
9735119 4970776 15474688 30180583 vmlinux

Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Cc: Alexander Popov <alpopov@ptsecurity.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1454140872-16926-1-git-send-email-kuleshovmail@gmail.com
[ Fixed various small details. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 3fda5bb4 15-Jan-2016 Andy Shevchenko <andriy.shevchenko@linux.intel.com>

x86/platform/intel-mid: Enable 64-bit build

Intel Tangier SoC is known to have 64-bit dual core CPU. Enable
64-bit build for it.

The kernel has been tested on Intel Edison board:

Linux buildroot 4.4.0-next-20160115+ #25 SMP Fri Jan 15 22:03:19 EET 2016 x86_64 GNU/Linux

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 74
model name : Genuine Intel(R) CPU 4000 @ 500MHz
stepping : 8

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1452888668-147116-1-git-send-email-andriy.shevchenko@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 5d5aa3cf 01-Jul-2015 Alexander Popov <alpopov@ptsecurity.com>

x86/kasan: Fix KASAN shadow region page tables

Currently KASAN shadow region page tables created without
respect of physical offset (phys_base). This causes kernel halt
when phys_base is not zero.

So let's initialize KASAN shadow region page tables in
kasan_early_init() using __pa_nodebug() which considers
phys_base.

This patch also separates x86_64_start_kernel() from KASAN low
level details by moving kasan_map_early_shadow(init_level4_pgt)
into kasan_early_init().

Remove the comment before clear_bss() which stopped bringing
much profit to the code readability. Otherwise describing all
the new order dependencies would be too verbose.

Signed-off-by: Alexander Popov <alpopov@ptsecurity.com>
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: <stable@vger.kernel.org> # 4.0+
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1435828178-10975-3-git-send-email-a.ryabinin@samsung.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# d0f77d4d 01-Jul-2015 Andrey Ryabinin <ryabinin.a.a@gmail.com>

x86/init: Clear 'init_level4_pgt' earlier

Currently x86_64_start_kernel() has two KASAN related
function calls. The first call maps shadow to early_level4_pgt,
the second maps shadow to init_level4_pgt.

If we move clear_page(init_level4_pgt) earlier, we could hide
KASAN low level detail from generic x86_64 initialization code.
The next patch will do it.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: <stable@vger.kernel.org> # 4.0+
Cc: Alexander Popov <alpopov@ptsecurity.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1435828178-10975-2-git-send-email-a.ryabinin@samsung.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 425be567 22-May-2015 Andy Lutomirski <luto@kernel.org>

x86/asm/irq: Stop relying on magic JMP behavior for early_idt_handlers

The early_idt_handlers asm code generates an array of entry
points spaced nine bytes apart. It's not really clear from that
code or from the places that reference it what's going on, and
the code only works in the first place because GAS never
generates two-byte JMP instructions when jumping to global
labels.

Clean up the code to generate the correct array stride (member size)
explicitly. This should be considerably more robust against
screw-ups, as GAS will warn if a .fill directive has a negative
count. Using '. =' to advance would have been even more robust
(it would generate an actual error if it tried to move
backwards), but it would pad with nulls, confusing anyone who
tries to disassemble the code. The new scheme should be much
clearer to future readers.

While we're at it, improve the comments and rename the array and
common code.

Binutils may start relaxing jumps to non-weak labels. If so,
this change will fix our build, and we may need to backport this
change.

Before, on x86_64:

0000000000000000 <early_idt_handlers>:
0: 6a 00 pushq $0x0
2: 6a 00 pushq $0x0
4: e9 00 00 00 00 jmpq 9 <early_idt_handlers+0x9>
5: R_X86_64_PC32 early_idt_handler-0x4
...
48: 66 90 xchg %ax,%ax
4a: 6a 08 pushq $0x8
4c: e9 00 00 00 00 jmpq 51 <early_idt_handlers+0x51>
4d: R_X86_64_PC32 early_idt_handler-0x4
...
117: 6a 00 pushq $0x0
119: 6a 1f pushq $0x1f
11b: e9 00 00 00 00 jmpq 120 <early_idt_handler>
11c: R_X86_64_PC32 early_idt_handler-0x4

After:

0000000000000000 <early_idt_handler_array>:
0: 6a 00 pushq $0x0
2: 6a 00 pushq $0x0
4: e9 14 01 00 00 jmpq 11d <early_idt_handler_common>
...
48: 6a 08 pushq $0x8
4a: e9 d1 00 00 00 jmpq 120 <early_idt_handler_common>
4f: cc int3
50: cc int3
...
117: 6a 00 pushq $0x0
119: 6a 1f pushq $0x1f
11b: eb 03 jmp 120 <early_idt_handler_common>
11d: cc int3
11e: cc int3
11f: cc int3

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Binutils <binutils@sourceware.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/ac027962af343b0c599cbfcf50b945ad2ef3d7a8.1432336324.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# cdeb6048 22-May-2015 Andy Lutomirski <luto@kernel.org>

x86/asm/irq: Stop relying on magic JMP behavior for early_idt_handlers

The early_idt_handlers asm code generates an array of entry
points spaced nine bytes apart. It's not really clear from that
code or from the places that reference it what's going on, and
the code only works in the first place because GAS never
generates two-byte JMP instructions when jumping to global
labels.

Clean up the code to generate the correct array stride (member size)
explicitly. This should be considerably more robust against
screw-ups, as GAS will warn if a .fill directive has a negative
count. Using '. =' to advance would have been even more robust
(it would generate an actual error if it tried to move
backwards), but it would pad with nulls, confusing anyone who
tries to disassemble the code. The new scheme should be much
clearer to future readers.

While we're at it, improve the comments and rename the array and
common code.

Binutils may start relaxing jumps to non-weak labels. If so,
this change will fix our build, and we may need to backport this
change.

Before, on x86_64:

0000000000000000 <early_idt_handlers>:
0: 6a 00 pushq $0x0
2: 6a 00 pushq $0x0
4: e9 00 00 00 00 jmpq 9 <early_idt_handlers+0x9>
5: R_X86_64_PC32 early_idt_handler-0x4
...
48: 66 90 xchg %ax,%ax
4a: 6a 08 pushq $0x8
4c: e9 00 00 00 00 jmpq 51 <early_idt_handlers+0x51>
4d: R_X86_64_PC32 early_idt_handler-0x4
...
117: 6a 00 pushq $0x0
119: 6a 1f pushq $0x1f
11b: e9 00 00 00 00 jmpq 120 <early_idt_handler>
11c: R_X86_64_PC32 early_idt_handler-0x4

After:

0000000000000000 <early_idt_handler_array>:
0: 6a 00 pushq $0x0
2: 6a 00 pushq $0x0
4: e9 14 01 00 00 jmpq 11d <early_idt_handler_common>
...
48: 6a 08 pushq $0x8
4a: e9 d1 00 00 00 jmpq 120 <early_idt_handler_common>
4f: cc int3
50: cc int3
...
117: 6a 00 pushq $0x0
119: 6a 1f pushq $0x1f
11b: eb 03 jmp 120 <early_idt_handler_common>
11d: cc int3
11e: cc int3
11f: cc int3

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Binutils <binutils@sourceware.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/ac027962af343b0c599cbfcf50b945ad2ef3d7a8.1432336324.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 91d8f041 17-Mar-2015 Alexander Kuleshov <kuleshovmail@gmail.com>

x86/boot/64: Remove pointless early_printk() message

earlyprintk is not initialised yet by the setup_early_printk() function
so we can remove it.

Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Cc: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1426597205-5142-1-git-send-email-kuleshovmail@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# ef7f0d6a 13-Feb-2015 Andrey Ryabinin <ryabinin.a.a@gmail.com>

x86_64: add KASan support

This patch adds arch specific code for kernel address sanitizer.

16TB of virtual addressed used for shadow memory. It's located in range
[ffffec0000000000 - fffffc0000000000] between vmemmap and %esp fixup
stacks.

At early stage we map whole shadow region with zero page. Latter, after
pages mapped to direct mapping address range we unmap zero pages from
corresponding shadow (see kasan_map_shadow()) and allocate and map a real
shadow memory reusing vmemmap_populate() function.

Also replace __pa with __pa_nodebug before shadow initialized. __pa with
CONFIG_DEBUG_VIRTUAL=y make external function call (__phys_addr)
__phys_addr is instrumented, so __asan_load could be called before shadow
area initialized.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
Cc: Yuri Gribov <tetra2005@gmail.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Jim Davis <jim.epost@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1e02ce4c 24-Oct-2014 Andy Lutomirski <luto@amacapital.net>

x86: Store a per-cpu shadow copy of CR4

Context switches and TLB flushes can change individual bits of CR4.
CR4 reads take several cycles, so store a shadow copy of CR4 in a
per-cpu variable.

To avoid wasting a cache line, I added the CR4 shadow to
cpu_tlbstate, which is already touched in switch_mm. The heaviest
users of the cr4 shadow will be switch_mm and __switch_to_xtra, and
__switch_to_xtra is called shortly after switch_mm during context
switch, so the cacheline is likely to be hot.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vince Weaver <vince@deater.net>
Cc: "hillf.zj" <hillf.zj@alibaba-inc.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/3a54dd3353fffbf84804398e00dfdc5b7c1afd7d.1414190806.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# a8fe19eb 04-Jun-2014 Borislav Petkov <bp@suse.de>

kernel/printk: use symbolic defines for console loglevels

... instead of naked numbers.

Stuff in sysrq.c used to set it to 8 which is supposed to mean above
default level so set it to DEBUG instead as we're terminating/killing all
tasks and we want to be verbose there.

Also, correct the check in x86_64_start_kernel which should be >= as
we're clearly issuing the string there for all debug levels, not only
the magical 10.

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Joe Perches <joe@perches.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2605fc21 01-May-2014 Andi Kleen <ak@linux.intel.com>

asmlinkage, x86: Add explicit __visible to arch/x86/*

As requested by Linus add explicit __visible to the asmlinkage users.
This marks all functions visible to assembler.

Tree sweep for arch/x86/*

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1398984278-29319-3-git-send-email-andi@firstfloor.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 25c74b10 30-Oct-2013 Seiji Aguchi <seiji.aguchi@hds.com>

x86, trace: Register exception handler to trace IDT

This patch registers exception handlers for tracing to a trace IDT.

To implemented it in set_intr_gate(), this patch does followings.
- Register the exception handlers to
the trace IDT by prepending "trace_" to the handler's names.
- Also, newly introduce trace_page_fault() to add tracepoints
in a subsequent patch.

Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/52716DEC.5050204@hds.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# a1ed4ddf 05-Aug-2013 Andi Kleen <ak@linux.intel.com>

x86, asmlinkage: Make _*_start_kernel visible

Obviously these functions have to be visible, otherwise
the whole kernel could be optimized away.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1375740170-7446-5-git-send-email-andi@firstfloor.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 5e427ec2 20-May-2013 Linus Torvalds <torvalds@linux-foundation.org>

x86: Fix bit corruption at CPU resume time

In commit 78d77df71510 ("x86-64, init: Do not set NX bits on non-NX
capable hardware") we added the early_pmd_flags that gets the NX bit set
when a CPU supports NX. However, the new variable was marked __initdata,
because the main _use_ of this is in an __init routine.

However, the bit setting happens from secondary_startup_64(), which is
called not only at bootup, but on every secondary CPU start. Including
resuming from STR and at CPU hotplug time. So the value cannot be
__initdata.

Reported-bisected-and-tested-by: Michal Hocko <mhocko@suse.cz>
Cc: stable@vger.kernel.org # v3.9
Acked-by: Peter Anvin <hpa@linux.intel.com>
Cc: Fernando Luis Vázquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 78d77df7 02-May-2013 H. Peter Anvin <hpa@linux.intel.com>

x86-64, init: Do not set NX bits on non-NX capable hardware

During early init, we would incorrectly set the NX bit even if the NX
feature was not supported. Instead, only set this bit if NX is
actually available and enabled. We already do very early detection of
the NX bit to enable it in EFER, this simply extends this detection to
the early page table mask.

Reported-by: Fernando Luis Vázquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1367476850.5660.2.camel@nexus
Cc: <stable@vger.kernel.org> v3.9


# 8e3c2a8c 04-Mar-2013 Borislav Petkov <bp@alien8.de>

x86: Drop KERNEL_IMAGE_START

We have KERNEL_IMAGE_START and __START_KERNEL_map which both contain the
start of the kernel text mapping's virtual address. Remove the prior one
which has been replicated a lot less times around the tree.

No functionality change.

Signed-off-by: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/1362428180-8865-3-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# ac630dd9 22-Feb-2013 Linus Torvalds <torvalds@linux-foundation.org>

x86-64: don't set the early IDT to point directly to 'early_idt_handler'

The code requires the use of the proper per-exception-vector stub
functions (set up as the early_idt_handlers[] array - note the 's') that
make sure to set up the error vector number. This is true regardless of
whether CONFIG_EARLY_PRINTK is set or not.

Why? The stack offset for the comparison of __KERNEL_CS won't be right
otherwise, nor will the new check (from commit 8170e6bed465: "x86,
64bit: Use a #PF handler to materialize early mappings on demand") for
the page fault exception vector.

Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# feddc9de 21-Dec-2012 Fenghua Yu <fenghua.yu@intel.com>

x86/head64.c: Early update ucode in 64-bit

This updates ucode on BSP in 64-bit mode. Paging and virtual address are
working now.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1356075872-3054-11-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 6c902b65 24-Jan-2013 Yinghai Lu <yinghai@kernel.org>

x86: Merge early kernel reserve for 32bit and 64bit

They are the same, and we could move them out from head32/64.c to setup.c.

We are using memblock, and it could handle overlapping properly, so
we don't need to reserve some at first to hold the location, and just
need to make sure we reserve them before we are using memblock to find
free mem to use.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-32-git-send-email-yinghai@kernel.org
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# ee92d815 28-Jan-2013 Yinghai Lu <yinghai@kernel.org>

x86, boot: Support loading bzImage, boot_params and ramdisk above 4G

xloadflags bit 1 indicates that we can load the kernel and all data
structures above 4G; it is set if kernel is relocatable and 64bit.

bootloader will check if xloadflags bit 1 is set to decide if
it could load ramdisk and kernel high above 4G.

bootloader will fill value to ext_ramdisk_image/size for high 32bits
when it load ramdisk above 4G.
kernel use get_ramdisk_image/size to use ext_ramdisk_image/size to get
right positon for ramdisk.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Rob Landley <rob@landley.net>
Cc: Matt Fleming <matt.fleming@intel.com>
Cc: Gokul Caushik <caushik1@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joe Millenbach <jmillenbach@gmail.com>
Link: http://lkml.kernel.org/r/1359058816-7615-26-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# f1da834c 24-Jan-2013 Yinghai Lu <yinghai@kernel.org>

x86, boot: Add get_cmd_line_ptr()

Add an accessor function for the command line address.
Later we will add support for holding a 64-bit address via ext_cmd_line_ptr.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-17-git-send-email-yinghai@kernel.org
Cc: Gokul Caushik <caushik1@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joe Millenbach <jmillenbach@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 1b8c78be 24-Jan-2013 Yinghai Lu <yinghai@kernel.org>

x86: Merge early_reserve_initrd for 32bit and 64bit

They are the same, could move them out from head32/64.c to setup.c.

We are using memblock, and it could handle overlapping properly, so
we don't need to reserve some at first to hold the location, and just
need to make sure we reserve them before we are using memblock to find
free mem to use.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-15-git-send-email-yinghai@kernel.org
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 10054230 24-Jan-2013 Yinghai Lu <yinghai@kernel.org>

x86, 64bit: Don't set max_pfn_mapped wrong value early on native path

We are not having max_pfn_mapped set correctly until init_memory_mapping.
So don't print its initial value for 64bit

Also need to use KERNEL_IMAGE_SIZE directly for highmap cleanup.

-v2: update comments about max_pfn_mapped according to Stefano Stabellini.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-14-git-send-email-yinghai@kernel.org
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 6b9c75ac 24-Jan-2013 Yinghai Lu <yinghai@kernel.org>

x86, 64bit: #PF handler set page to cover only 2M per #PF

We only map a single 2 MiB page per #PF, even though we should be able
to do this a full gigabyte at a time with no additional memory cost.
This is a workaround for a broken AMD reference BIOS (and its
derivatives in shipping system) which maps a large chunk of memory as
WB in the MTRR system but will #MC if the processor wanders off and
tries to prefetch that memory, which can happen any time the memory is
mapped in the TLB.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-13-git-send-email-yinghai@kernel.org
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
[ hpa: rewrote the patch description ]
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 8170e6be 24-Jan-2013 H. Peter Anvin <hpa@zytor.com>

x86, 64bit: Use a #PF handler to materialize early mappings on demand

Linear mode (CR0.PG = 0) is mutually exclusive with 64-bit mode; all
64-bit code has to use page tables. This makes it awkward before we
have first set up properly all-covering page tables to access objects
that are outside the static kernel range.

So far we have dealt with that simply by mapping a fixed amount of
low memory, but that fails in at least two upcoming use cases:

1. We will support load and run kernel, struct boot_params, ramdisk,
command line, etc. above the 4 GiB mark.
2. need to access ramdisk early to get microcode to update that as
early possible.

We could use early_iomap to access them too, but it will make code to
messy and hard to be unified with 32 bit.

Hence, set up a #PF table and use a fixed number of buffers to set up
page tables on demand. If the buffers fill up then we simply flush
them and start over. These buffers are all in __initdata, so it does
not increase RAM usage at runtime.

Thus, with the help of the #PF handler, we can set the final kernel
mapping from blank, and switch to init_level4_pgt later.

During the switchover in head_64.S, before #PF handler is available,
we use three pages to handle kernel crossing 1G, 512G boundaries with
sharing page by playing games with page aliasing: the same page is
mapped twice in the higher-level tables with appropriate wraparound.
The kernel region itself will be properly mapped; other mappings may
be spurious.

early_make_pgtable is using kernel high mapping address to access pages
to set page table.

-v4: Add phys_base offset to make kexec happy, and add
init_mapping_kernel() - Yinghai
-v5: fix compiling with xen, and add back ident level3 and level2 for xen
also move back init_level4_pgt from BSS to DATA again.
because we have to clear it anyway. - Yinghai
-v6: switch to init_level4_pgt in init_mem_mapping. - Yinghai
-v7: remove not needed clear_page for init_level4_page
it is with fill 512,8,0 already in head_64.S - Yinghai
-v8: we need to keep that handler alive until init_mem_mapping and don't
let early_trap_init to trash that early #PF handler.
So split early_trap_pf_init out and move it down. - Yinghai
-v9: switchover only cover kernel space instead of 1G so could avoid
touch possible mem holes. - Yinghai
-v11: change far jmp back to far return to initial_code, that is needed
to fix failure that is reported by Konrad on AMD systems. - Yinghai

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-12-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# fa2bbce9 24-Jan-2013 Yinghai Lu <yinghai@kernel.org>

x86, 64bit: Copy struct boot_params early

We want to support struct boot_params (formerly known as the
zero-page, or real-mode data) above the 4 GiB mark. We will have #PF
handler to set page table for not accessible ram early, but want to
limit it before x86_64_start_reservations to limit the code change to
native path only.

Also we will need the ramdisk info in struct boot_params to access the microcode
blob in ramdisk in x86_64_start_kernel, so copy struct boot_params early makes
it accessing ramdisk info simple.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-9-git-send-email-yinghai@kernel.org
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 5dcd14ec 29-Jan-2013 H. Peter Anvin <hpa@linux.intel.com>

x86, boot: Sanitize boot_params if not zeroed on creation

Use the new sentinel field to detect bootloaders which fail to follow
protocol and don't initialize fields in struct boot_params that they
do not explicitly initialize to zero.

Based on an original patch and research by Yinghai Lu.
Changed by hpa to be invoked both in the decompression path and in the
kernel proper; the latter for the case where a bootloader takes over
decompression.

Originally-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-26-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# bbee3aec 19-Nov-2012 Alexander Duyck <alexander.h.duyck@intel.com>

x86: Fix warning about cast from pointer to integer of different size

This patch fixes a warning reported by the kbuild test robot where we were
casting a pointer to a physical address which represents an integer of a
different size. Per the suggestion of Peter Anvin I am replacing it and one
other spot where I made a similar cast with an unsigned long.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Link: http://lkml.kernel.org/r/20121119182927.3655.7641.stgit@ahduyck-cp1.jf.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 05a476b6 16-Nov-2012 Alexander Duyck <alexander.h.duyck@intel.com>

x86: Drop 4 unnecessary calls to __pa_symbol

While debugging the __pa_symbol inline patch I found that there were a couple
spots where __pa_symbol was used as follows:
__pa_symbol(x) - __pa_symbol(y)

The compiler had reduced them to:
x - y

Since we also support a debug case where __pa_symbol is a function call it
would probably be useful to just change the two cases I found so that they are
always just treated as "x - y". As such I am casting the values to
phys_addr_t and then doing simple subtraction so that the correct type and
value is returned.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Link: http://lkml.kernel.org/r/20121116215552.8521.68085.stgit@ahduyck-cp1.jf.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# c9b77ccb 08-May-2012 Jarkko Sakkinen <jarkko.sakkinen@intel.com>

x86, realmode: Move ACPI wakeup to unified realmode code

Migrated ACPI wakeup code to the real-mode blob.
Code existing in .x86_trampoline can be completely
removed. Static descriptor table in wakeup_asm.S is
courtesy of H. Peter Anvin.

Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@intel.com>
Link: http://lkml.kernel.org/r/1336501366-28617-7-git-send-email-jarkko.sakkinen@intel.com
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Len Brown <len.brown@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# fe091c20 08-Dec-2011 Tejun Heo <tj@kernel.org>

memblock: Kill memblock_init()

memblock_init() initializes arrays for regions and memblock itself;
however, all these can be done with struct initializers and
memblock_init() can be removed. This patch kills memblock_init() and
initializes memblock with struct initializer.

The only difference is that the first dummy entries don't have .nid
set to MAX_NUMNODES initially. This doesn't cause any behavior
difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: "H. Peter Anvin" <hpa@zytor.com>


# 24aa0788 12-Jul-2011 Tejun Heo <tj@kernel.org>

memblock, x86: Replace memblock_x86_reserve/free_range() with generic ones

Other than sanity check and debug message, the x86 specific version of
memblock reserve/free functions are simple wrappers around the generic
versions - memblock_reserve/free().

This patch adds debug messages with caller identification to the
generic versions and replaces x86 specific ones and kills them.
arch/x86/include/asm/memblock.h and arch/x86/mm/memblock.c are empty
after this change and removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1310462166-31469-14-git-send-email-tj@kernel.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# e5f15b45 18-Feb-2011 Yinghai Lu <yinghai@kernel.org>

x86: Cleanup highmap after brk is concluded

Now cleanup_highmap actually is in two steps: one is early in head64.c
and only clears above _end; a second one is in init_memory_mapping() and
tries to clean from _brk_end to _end.
It should check if those boundaries are PMD_SIZE aligned but currently
does not.
Also init_memory_mapping() is called several times for numa or memory
hotplug, so we really should not handle initial kernel mappings there.

This patch moves cleanup_highmap() down after _brk_end is settled so
we can do everything in one step.
Also we honor max_pfn_mapped in the implementation of cleanup_highmap.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
LKML-Reference: <alpine.DEB.2.00.1103171739050.3382@kaball-desktop>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# 67e87f0a 13-Oct-2010 Jeremy Fitzhardinge <jeremy@goop.org>

x86-64: Only set max_pfn_mapped to 512 MiB if we enter via head_64.S

head_64.S maps up to 512 MiB, but that is not necessarity true for
other entry paths, such as Xen.

Thus, co-locate the setting of max_pfn_mapped with the code to
actually set up the page tables in head_64.S. The 32-bit code is
already so co-located. (The Xen code already sets max_pfn_mapped
correctly for its own use case.)

-v2:

Yinghai fixed the following bug in this patch:

|
| max_pfn_mapped is in .bss section, so we need to set that
| after bss get cleared. Without that we crash on bootup.
|
| That is safe because Xen does not call x86_64_start_kernel().
|

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Fixed-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
LKML-Reference: <4CB6AB24.9020504@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# a9ce6bc1 25-Aug-2010 Yinghai Lu <yinghai@kernel.org>

x86, memblock: Replace e820_/_early string with memblock_

1.include linux/memblock.h directly. so later could reduce e820.h reference.
2 this patch is done by sed scripts mainly

-v2: use MEMBLOCK_ERROR instead of -1ULL or -1UL

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# 72d7c3b3 25-Aug-2010 Yinghai Lu <yinghai@kernel.org>

x86: Use memblock to replace early_res

1. replace find_e820_area with memblock_find_in_range
2. replace reserve_early with memblock_x86_reserve_range
3. replace free_early with memblock_x86_free_range.
4. NO_BOOTMEM will switch to use memblock too.
5. use _e820, _early wrap in the patch, in following patch, will
replace them all
6. because memblock_x86_free_range support partial free, we can remove some special care
7. Need to make sure that memblock_find_in_range() is called after memblock_x86_fill()
so adjust some calling later in setup.c::setup_arch()
-- corruption_check and mptable_update

-v2: Move reserve_brk() early
Before fill_memblock_area, to avoid overlap between brk and memblock_find_in_range()
that could happen We have more then 128 RAM entry in E820 tables, and
memblock_x86_fill() could use memblock_find_in_range() to find a new place for
memblock.memory.region array.
and We don't need to use extend_brk() after fill_memblock_area()
So move reserve_brk() early before fill_memblock_area().
-v3: Move find_smp_config early
To make sure memblock_find_in_range not find wrong place, if BIOS doesn't put mptable
in right place.
-v4: Treat RESERVED_KERN as RAM in memblock.memory. and they are already in
memblock.reserved already..
use __NOT_KEEP_MEMBLOCK to make sure memblock related code could be freed later.
-v5: Generic version __memblock_find_in_range() is going from high to low, and for 32bit
active_region for 32bit does include high pages
need to replace the limit with memblock.default_alloc_limit, aka get_max_mapped()
-v6: Use current_limit instead
-v7: check with MEMBLOCK_ERROR instead of -1ULL or -1L
-v8: Set memblock_can_resize early to handle EFI with more RAM entries
-v9: update after kmemleak changes in mainline

Suggested-by: David S. Miller <davem@davemloft.net>
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# c967da6a 28-Mar-2010 Yinghai Lu <yinghai@kernel.org>

x86: Make sure free_init_pages() frees pages on page boundary

When CONFIG_NO_BOOTMEM=y, it could use memory more effiently, or
in a more compact fashion.

Example:

Allocated new RAMDISK: 00ec2000 - 0248ce57
Move RAMDISK from 000000002ea04000 - 000000002ffcee56 to 00ec2000 - 0248ce56

The new RAMDISK's end is not page aligned.
Last page could be shared with other users.

When free_init_pages are called for initrd or .init, the page
could be freed and we could corrupt other data.

code segment in free_init_pages():

| for (; addr < end; addr += PAGE_SIZE) {
| ClearPageReserved(virt_to_page(addr));
| init_page_count(virt_to_page(addr));
| memset((void *)(addr & ~(PAGE_SIZE-1)),
| POISON_FREE_INITMEM, PAGE_SIZE);
| free_page(addr);
| totalram_pages++;
| }

last half page could be used as one whole free page.

So page align the boundaries.

-v2: make the original initramdisk to be aligned, according to
Johannes, otherwise we have the chance to lose one page.
we still need to keep initrd_end not aligned, otherwise it could
confuse decompressor.
-v3: change to WARN_ON instead, suggested by Johannes.
-v4: use PAGE_ALIGN, suggested by Johannes.
We may fix that macro name later to PAGE_ALIGN_UP, and PAGE_ALIGN_DOWN
Add comments about assuming ramdisk start is aligned
in relocate_initrd(), change to re get ramdisk_image instead of save it
to make diff smaller. Add warning for wrong range, suggested by Johannes.
-v6: remove one WARN()
We need to align beginning in free_init_pages()
do not copy more than ramdisk_size, noticed by Johannes

Reported-by: Stanislaw Gruszka <sgruszka@redhat.com>
Tested-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <1269830604-26214-3-git-send-email-yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 893f38d1 10-Dec-2009 Yinghai Lu <yinghai@kernel.org>

x86: Use find_e820() instead of hard coded trampoline address

Jens found the following crash/regression:

[ 0.000000] found SMP MP-table at [ffff8800000fdd80] fdd80
[ 0.000000] Kernel panic - not syncing: Overlapping early reservations 12-f011 MP-table mpc to 0-fff BIOS data page

and

[ 0.000000] Kernel panic - not syncing: Overlapping early reservations 12-f011 MP-table mpc to 6000-7fff TRAMPOLINE

and bisected it to b24c2a9 ("x86: Move find_smp_config()
earlier and avoid bootmem usage").

It turns out the BIOS is using the first 64k for mptable,
without reserving it.

So try to find good range for the real-mode trampoline instead of
hard coding it, in case some bios tries to use that range for sth.

Reported-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Tested-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
LKML-Reference: <4B21630A.6000308@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 47a3d5da 29-Aug-2009 Thomas Gleixner <tglx@linutronix.de>

x86: Add early platform detection

Platforms like Moorestown require early setup and want to avoid the
call to reserve_ebda_region. The x86_init override is too late when
the MRST detection happens in setup_arch. Move the default i386
x86_init overrides and the call to reserve_ebda_region into a separate
function which is called as the default of a switch case depending on
the hardware_subarch id in boot params. This allows us to add a case
for MRST and let MRST have its own early setup function.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 816c25e7 19-Aug-2009 Thomas Gleixner <tglx@linutronix.de>

x86: Add reserve_ebda_region to x86_init_ops

reserve_ebda_region needs to be called befor start_kernel. Moorestown
needs to override it. Make it a x86_init_ops function and initialize
it with the default reserve_ebda_region.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 93dbda7c 26-Feb-2009 Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

x86: add brk allocation for very, very early allocations

Impact: new interface

Add a brk()-like allocator which effectively extends the bss in order
to allow very early code to do dynamic allocations. This is better than
using statically allocated arrays for data in subsystems which may never
get used.

The space for brk allocations is in the bss ELF segment, so that the
space is mapped properly by the code which maps the kernel, and so
that bootloaders keep the space free rather than putting a ramdisk or
something into it.

The bss itself, delimited by __bss_stop, ends before the brk area
(__brk_base to __brk_limit). The kernel text, data and bss is reserved
up to __bss_stop.

Any brk-allocated data is reserved separately just before the kernel
pagetable is built, as that code allocates from unreserved spaces
in the e820 map, potentially allocating from any unused brk memory.
Ultimately any unused memory in the brk area is used in the general
kernel memory pool.

Initially the brk space is set to 1MB, which is probably much larger
than any user needs (the largest current user is i386 head_32.S's code
to build the pagetables to map the kernel, which can get fairly large
with a big kernel image and no PSE support). So long as the system
has sufficient memory for the bootloader to reserve the kernel+1MB brk,
there are no bad effects resulting from an over-large brk.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# 8ce03197 18-Jan-2009 Brian Gerst <brgerst@gmail.com>

x86: remove pda_init()

Impact: cleanup

Copy the code to cpu_init() to satisfy the requirement that the cpu
be reinitialized. Remove all other calls, since the segments are
already initialized in head_64.S.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# 004aa322 13-Jan-2009 Tejun Heo <tj@kernel.org>

x86: misc clean up after the percpu update

Do the following cleanups:

* kill x86_64_init_pda() which now is equivalent to pda_init()

* use per_cpu_offset() instead of cpu_pda() when initializing
initial_gs

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# b12d8db8 13-Jan-2009 Tejun Heo <tj@kernel.org>

x86: make pda a percpu variable

[ Based on original patch from Christoph Lameter and Mike Travis. ]

As pda is now allocated in percpu area, it can easily be made a proper
percpu variable. Make it so by defining per cpu symbol from linker
script and declaring it in C code for SMP and simply defining it for
UP. This change cleans up code and brings SMP and UP closer a bit.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 9939ddaf 13-Jan-2009 Tejun Heo <tj@kernel.org>

x86: merge 64 and 32 SMP percpu handling

Now that pda is allocated as part of percpu, percpu doesn't need to be
accessed through pda. Unify x86_64 SMP percpu access with x86_32 SMP
one. Other than the segment register, operand size and the base of
percpu symbols, they behave identical now.

This patch replaces now unnecessary pda->data_offset with a dummy
field which is necessary to keep stack_canary at its place. This
patch also moves per_cpu_offset initialization out of init_gdt() into
setup_per_cpu_areas(). Note that this change also necessitates
explicit per_cpu_offset initializations in voyager_smp.c.

With this change, x86_OP_percpu()'s are as efficient on x86_64 as on
x86_32 and also x86_64 can use assembly PER_CPU macros.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 1a51e3a0 13-Jan-2009 Tejun Heo <tj@kernel.org>

x86: fold pda into percpu area on SMP

[ Based on original patch from Christoph Lameter and Mike Travis. ]

Currently pdas and percpu areas are allocated separately. %gs points
to local pda and percpu area can be reached using pda->data_offset.
This patch folds pda into percpu area.

Due to strange gcc requirement, pda needs to be at the beginning of
the percpu area so that pda->stack_canary is at %gs:40. To achieve
this, a new percpu output section macro - PERCPU_VADDR_PREALLOC() - is
added and used to reserve pda sized chunk at the start of the percpu
area.

After this change, for boot cpu, %gs first points to pda in the
data.init area and later during setup_per_cpu_areas() gets updated to
point to the actual pda. This means that setup_per_cpu_areas() need
to reload %gs for CPU0 while clearing pda area for other cpus as cpu0
already has modified it when control reaches setup_per_cpu_areas().

This patch also removes now unnecessary get_local_pda() and its call
sites.

A lot of this patch is taken from Mike Travis' "x86_64: Fold pda into
per cpu area" patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# c8f3329a 13-Jan-2009 Tejun Heo <tj@kernel.org>

x86: use static _cpu_pda array

_cpu_pda array first uses statically allocated storage in data.init
and then switches to allocated bootmem to conserve space. However,
after folding pda area into percpu area, _cpu_pda array will be
removed completely. Drop the reallocation part to simplify the code
for soon-to-follow changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# f32ff538 13-Jan-2009 Tejun Heo <tj@kernel.org>

x86: load pointer to pda into %gs while brining up a CPU

[ Based on original patch from Christoph Lameter and Mike Travis. ]

CPU startup code in head_64.S loaded address of a zero page into %gs
for temporary use till pda is loaded but address to the actual pda is
available at the point. Load the real address directly instead.

This will help unifying percpu and pda handling later on.

This patch is mostly taken from Mike Travis' "x86_64: Fold pda into
per cpu area" patch.

Signed-off-by: Tejun Heo <tj@kernel.org>


# 3e5d8f97 13-Jan-2009 Tejun Heo <tj@kernel.org>

x86: make percpu symbols zerobased on SMP

[ Based on original patch from Christoph Lameter and Mike Travis. ]

This patch makes percpu symbols zerobased on x86_64 SMP by adding
PERCPU_VADDR() to vmlinux.lds.h which helps setting explicit vaddr on
the percpu output section and using it in vmlinux_64.lds.S. A new
PHDR is added as existing ones cannot contain sections near address
zero. PERCPU_VADDR() also adds a new symbol __per_cpu_load which
always points to the vaddr of the loaded percpu data.init region.

The following adjustments have been made to accomodate the address
change.

* code to locate percpu gdt_page in head_64.S is updated to add the
load address to the gdt_page offset.

* __per_cpu_load is used in places where access to the init data area
is necessary.

* pda->data_offset is initialized soon after C code is entered as zero
value doesn't work anymore.

This patch is mostly taken from Mike Travis' "x86_64: Base percpu
variables at zero" patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 26799a63 31-Dec-2008 Ravikiran G Thirumalai <kiran@scalex86.org>

x86: fix incorrect __read_mostly on _boot_cpu_pda

The pda rework (commit 3461b0af025251bbc6b3d56c821c6ac2de6f7209)
to remove static boot cpu pdas introduced a performance bug.

_boot_cpu_pda is the actual pda used by the boot cpu and is definitely
not "__read_mostly" and ended up polluting the read mostly section with
writes. This bug caused regression of about 8-10% on certain syscall
intensive workloads.

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Acked-by: Mike Travis <travis@sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 3e1e9002 07-Dec-2008 Rafael J. Wysocki <rjw@rjwysocki.net>

x86: change static allocation of trampoline area

Impact: fix trampoline sizing bug, save space

While debugging a suspend-to-RAM related issue it occured to me that
if the trampoline code had grown past 4 KB, we would have been
allocating too little memory for it, since the 4 KB size of the
trampoline is hardcoded into arch/x86/kernel/e820.c . Change that
by making the kernel compute the trampoline size and allocate as much
memory as necessary.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 12544697 28-Sep-2008 dcg <diegocg@gmail.com>

x86_64: be less annoying on boot, v2

Honour "quiet" boot parameter in early_printk() calls

Signed-off-by: Diego Calleja <diegocg@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# f6476774 24-Sep-2008 Bill Nottingham <notting@redhat.com>

x86_64: be less annoying on boot

Remove mostly useless message on every boot.

Signed-off-by: Bill Nottingham <notting@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 66d4bdf2 31-Jul-2008 Jan Beulich <jbeulich@novell.com>

x86-64: fix overlap of modules and fixmap areas

Plus add a build time check so this doesn't go unnoticed again.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 5b09b287 08-Jul-2008 Jeremy Fitzhardinge <jeremy@goop.org>

x86_64: add workaround for no %gs-based percpu

As a stopgap until Mike Travis's x86-64 gs-based percpu patches are
ready, provide workaround functions for x86_read/write_percpu for
Xen's use.

Specifically, this means that we can't really make use of vcpu
placement, because we can't use a single gs-based memory access to get
to vcpu fields. So disable all that for now.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 28bb2237 30-Jun-2008 Yinghai Lu <yhlu.kernel@gmail.com>

x86: move reserve_setup_data to setup.c

Ying Huang would like setup_data to be reserved, but not included in the
no save range.

Here we try to modify the e820 table to reserve that range early.
also add that in early_res in case bootloader messes up with the ramdisk.

other solution would be
1. add early_res_to_highmem...
2. early_res_to_e820...
but they could reserve another type memory wrongly, if early_res has some
resource reserved early, and not needed later, but it is not removed from
early_res in time. Like the RAMDISK (already handled).

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: andi@firstfloor.org
Tested-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# f97013fd 24-Jun-2008 Jeremy Fitzhardinge <jeremy@goop.org>

x86, 64-bit: split x86_64_start_kernel

Split x86_64_start_kernel() into two pieces:

The first essentially cleans up after head_64.S. It clears the
bss, zaps low identity mappings, sets up some early exception
handlers.

The second part preserves the boot data, reserves the kernel's
text/data/bss, pagetables and ramdisk, and then starts the kernel
proper.

This split is so that Xen can call the second part to do the set up it
needs done. It doesn't need any of the first part setups, because it
doesn't boot via head_64.S, and its redundant or actively damaging.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 5deb0b2a 12-May-2008 Mike Travis <travis@sgi.com>

x86: leave initial __cpu_pda array in place until cpus are booted

Ingo Molnar wrote:
...
> they crashed after about 3 randconfig iterations with:
>
> early res: 4 [8000-afff] PGTABLE
> early res: 5 [b000-b87f] MEMNODEMAP
> PANIC: early exception 0e rip 10:ffffffff8077a150 error 2 cr2 37
> Pid: 0, comm: swapper Not tainted 2.6.25-sched-devel.git-x86-latest.git #14
>
> Call Trace:
> [<ffffffff81466196>] early_idt_handler+0x56/0x6a
> [<ffffffff8077a150>] ? numa_set_node+0x30/0x60
> [<ffffffff8077a129>] ? numa_set_node+0x9/0x60
> [<ffffffff8147a543>] numa_init_array+0x93/0xf0
> [<ffffffff8147b039>] acpi_scan_nodes+0x3b9/0x3f0
> [<ffffffff8147a496>] numa_initmem_init+0x136/0x150
> [<ffffffff8146da5f>] setup_arch+0x48f/0x700
> [<ffffffff802566ea>] ? clockevents_register_notifier+0x3a/0x50
> [<ffffffff81466a87>] start_kernel+0xd7/0x440
> [<ffffffff81466422>] x86_64_start_kernel+0x222/0x280
...
Here's the fixup... This one should follow the previous patches.

Thanks,
Mike
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 3461b0af 12-May-2008 Mike Travis <travis@sgi.com>

x86: remove static boot_cpu_pda array v2

* Remove the boot_cpu_pda array and pointer table from the data section.
Allocate the pointer table and array during init. do_boot_cpu()
will reallocate the pda in node local memory and if the cpu is being
brought up before the bootmem array is released (after_bootmem = 0),
then it will free the initial pda. This will happen for all cpus
present at system startup.

This removes 512k + 32k bytes from the data section.

For inclusion into sched-devel/latest tree.

Based on:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
+ sched-devel/latest .../mingo/linux-2.6-sched-devel.git

Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# c45a707d 02-Jun-2008 Huang, Ying <ying.huang@intel.com>

x86: linked list of setup_data for i386

This patch adds linked list of struct setup_data supported for i386.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: andi@firstfloor.org
Cc: mingo@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 0c51a965 02-Jun-2008 Huang, Ying <ying.huang@intel.com>

x86: extract common part of head32.c and head64.c into head.c

This patch extracts the common part of head32.c and head64.c into head.c.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: andi@firstfloor.org
Cc: mingo@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 8b664aa6 27-Mar-2008 Huang, Ying <ying.huang@intel.com>

x86, boot: add linked list of struct setup_data

This patch adds a field of 64-bit physical pointer to NULL terminated
single linked list of struct setup_data to real-mode kernel
header. This is used as a more extensible boot parameters passing
mechanism.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 356fa0c6 19-Apr-2008 Akinobu Mita <akinobu.mita@gmail.com>

x86: use get_bios_ebda()

Use get_bios_ebda().

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 2b8106a0 18-Mar-2008 Yinghai Lu <yhlu.kernel@gmail.com>

x86_64: do not reserve ramdisk two times

ramdisk is reserved via reserve_early in x86_64_start_kernel,
later early_res_to_bootmem() will convert to reservation in bootmem.

so don't need to reserve that again.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 5524ea32 10-Mar-2008 Andi Kleen <andi@firstfloor.org>

x86: don't set up early exception handlers for external interrupts

All of early setup runs with interrupts disabled, so there is no
need to set up early exception handlers for vectors >= 32

This saves some minor text size.

Signed-off-by: Andi Kleen <ak@suse.de>
Cc: mingo@elte.hu
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# ecd94c08 04-Mar-2008 Alexander van Heukelum <heukelum@mailshack.com>

x86: reserve end-of-conventional-memory to 1MB, 64-bit, use paravirt_enabled

Jeremy Fitzhardinge pointed out that looking at the boot_params
struct to determine if the system is running in a paravirtual
environment is not reliable for the Xen case, currently. He also
points out that there already exists a function to determine if
the system is running in a paravirtual environment. So let's use
that instead. This gets rid of the preprocessor test too.

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 320a6b2e 01-Mar-2008 Alexander van Heukelum <heukelum@mailshack.com>

x86: reserve end-of-conventional-memory to 1MB, 64-bit

This patch is an add-on to the 64-bit ebda patch. It makes
the functions reserve_ebda_region (renamed from reserve_ebda)
and copy_e820_map equal to the 32-bit versions of the previous
patch.

Changes:

Use u64 and u32 for local variables in copy_e820_map.

The amount of conventional memory and the start of the EBDA are
detected by reading the BIOS data area directly. Paravirtual
environments do not provide this area, so we bail out early
in that case. They will just have to set up a correct memory
map to start with.

Add a safety net for zeroed out BIOS data area.

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# f6eb62b6 25-Feb-2008 Alexander van Heukelum <heukelum@mailshack.com>

x86: reserve_early end-of-conventional-memory to 1MB, 64-bit

Explicitly reserve_early the whole address range from the end of
conventional memory as reported by the bios data area up to the
1Mb mark. Regard the info retrieved from the BIOS data area with
a bit of paranoia, though, because some biosses forget to register
the EBDA correctly.

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# b4e0409a 21-Feb-2008 Ingo Molnar <mingo@elte.hu>

x86: check vmlinux limits, 64-bit

these build-time and link-time checks would have prevented the
vmlinux size regression.

Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 31eedd82 15-Feb-2008 Thomas Gleixner <tglx@linutronix.de>

x86: zap invalid and unused pmds in early boot

The early boot code maps KERNEL_TEXT_SIZE (currently 40MB) starting
from __START_KERNEL_map. The kernel itself only needs _text to _end
mapped in the high alias. On relocatible kernels the ASM setup code
adjusts the compile time created high mappings to the relocation. This
creates invalid pmd entries for negative offsets:

0xffffffff80000000 -> pmd entry: ffffffffff2001e3
It points outside of the physical address space and is marked present.

This starts at the virtual address __START_KERNEL_map and goes up to
the point where the first valid physical address (0x0) is mapped.

Zap the mappings before _text and after _end right away in early
boot. This removes also the invalid entries.

Furthermore it simplifies the range check for high aliases.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 25eff8d4 01-Feb-2008 Yinghai Lu <Yinghai.Lu@Sun.COM>

x86_64: add debug name for early_res

helps debugging problems in this rather murky area of code.

Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# cd582896 30-Jan-2008 Ingo Molnar <mingo@elte.hu>

x86: fix more non-global TLB flushes

fix more __flush_tlb() instances, out of caution.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 75175278 30-Jan-2008 Andi Kleen <ak@linux.intel.com>

x86: replace hard coded reservations in 64-bit early boot code with dynamic table

On x86-64 there are several memory allocations before bootmem. To avoid
them stomping on each other they used to be all hard coded in bad_area().
Replace this with an array that is filled as needed.

This cleans up the code considerably and allows to expand its use.

Cc: peterz@infradead.org
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 076f9776 30-Jan-2008 Ingo Molnar <mingo@elte.hu>

x86: make early printk selectable on 64-bit as well

Enable CONFIG_EMBEDDED to select CONFIG_EARLY_PRINTK on 64-bit as well.

saves ~2K:

text data bss dec hex filename
7290283 3672091 1907848 12870222 c4624e vmlinux.before
7288373 3671795 1907848 12868016 c459b0 vmlinux.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 8866cd9d 30-Jan-2008 Roland McGrath <roland@redhat.com>

x86: early_idt_handler improvements, 64-bit

It's not too pretty, but I found this made the "PANIC: early exception"
messages become much more reliably useful: 1. print the vector number,
2. print the %cs value, 3. handle error-code-pushing vs non-pushing vectors.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 9de819fe 30-Jan-2008 Yinghai Lu <Yinghai.Lu@Sun.COM>

x86: do not set boot cpu in cpu_online_map at x86_64_start_kernel()

In init/main.c boot_cpu_init() does that later.

Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# eaf76e8b 30-Jan-2008 Thomas Gleixner <tglx@linutronix.de>

x86: remove duplicate start_kernel declaration

start_kernel is already declared in a generic header file.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 718fc13b 30-Jan-2008 Thomas Gleixner <tglx@linutronix.de>

x86: move debug related declarations to kdebug.h

Move them and fixup some users.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 9d1c6e7c 19-Oct-2007 Glauber de Oliveira Costa <gcosta@redhat.com>

x86: use descriptor's functions instead of inline assembly

This patch provides a new set of functions for managing the descriptor
tables that can be used instead of putting the raw assembly in .c files.

Remodeling of store_tr() suggested by Frederik Deweerdt.

[ tglx: arch/x86 adaptation ]

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 30c82645 15-Oct-2007 H. Peter Anvin <hpa@zytor.com>

[x86] remove uses of magic macros for boot_params access

Instead of using magic macros for boot_params access, simply use the
boot_params structure.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# 835c34a1 12-Oct-2007 Dave Jones <davej@redhat.com>

Delete filenames in comments.

Since the x86 merge, lots of files that referenced their own filenames
are no longer correct. Rather than keep them up to date, just delete
them, as they add no real value.

Additionally:
- fix up comment formatting in scx200_32.c
- Remove a credit from myself in setup_64.c from a time when we had no SCM
- remove longwinded history from tsc_32.c which can be figured out from
git.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 250c2277 11-Oct-2007 Thomas Gleixner <tglx@linutronix.de>

x86_64: move kernel

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>