History log of /linux-master/arch/x86/kernel/alternative.c
Revision Date Author Comments
# 7dbbc8f5 26-Jan-2024 Yosry Ahmed <yosryahmed@google.com>

x86/mm: delete unused cpu argument to leave_mm()

The argument is unused since commit 3d28ebceaffa ("x86/mm: Rework lazy
TLB to track the actual loaded mm"), delete it.

Link: https://lkml.kernel.org/r/20240126080644.1714297-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0911b8c5 21-Nov-2023 Breno Leitao <leitao@debian.org>

x86/bugs: Rename CONFIG_RETHUNK => CONFIG_MITIGATION_RETHUNK

Step 10/10 of the namespace unification of CPU mitigations related Kconfig options.

[ mingo: Added one more case. ]

Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20231121160740.1249350-11-leitao@debian.org


# 7b75782f 21-Nov-2023 Breno Leitao <leitao@debian.org>

x86/bugs: Rename CONFIG_SLS => CONFIG_MITIGATION_SLS

Step 6/10 of the namespace unification of CPU mitigations related Kconfig options.

Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20231121160740.1249350-7-leitao@debian.org


# aefb2f2e 21-Nov-2023 Breno Leitao <leitao@debian.org>

x86/bugs: Rename CONFIG_RETPOLINE => CONFIG_MITIGATION_RETPOLINE

Step 5/10 of the namespace unification of CPU mitigations related Kconfig options.

[ mingo: Converted a few more uses in comments/messages as well. ]

Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ariel Miculas <amiculas@cisco.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20231121160740.1249350-6-leitao@debian.org


# 86ed430c 04-Dec-2023 Arnd Bergmann <arnd@arndb.de>

x86/alternatives: Move apply_relocation() out of init section

This function is now called from a few places that are no __init_or_module,
resulting a link time warning:

WARNING: modpost: vmlinux: section mismatch in reference: patch_dest+0x8a (section: .text) -> apply_relocation (section: .init.text)

Remove the annotation here.

[ mingo: Also sync up add_nop() with these changes. ]

Fixes: 17bce3b2ae2d ("x86/callthunks: Handle %rip-relative relocations in call thunk template")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20231204072856.1033621-1-arnd@kernel.org


# 6724ba89 30-Nov-2023 Ingo Molnar <mingo@kernel.org>

x86/callthunks: Mark apply_relocation() as __init_or_module

Do it like the rest of the methods using it.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20231105213731.1878100-3-ubizjak@gmail.com


# 17bce3b2 05-Nov-2023 Uros Bizjak <ubizjak@gmail.com>

x86/callthunks: Handle %rip-relative relocations in call thunk template

Contrary to alternatives, relocations are currently not supported in
call thunk templates. Re-use the existing infrastructure from
alternative.c to allow %rip-relative relocations when copying call
thunk template from its storage location.

The patch allows unification of ASM_INCREMENT_CALL_DEPTH, which already
uses PER_CPU_VAR macro, with INCREMENT_CALL_DEPTH, used in call thunk
template, which is currently limited to use absolute address.

Reuse existing relocation infrastructure from alternative.c.,
as suggested by Peter Zijlstra.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231105213731.1878100-3-ubizjak@gmail.com


# f7cfe701 09-Jan-2024 Juergen Gross <jgross@suse.com>

x86/paravirt: Make BUG_func() usable by non-GPL modules

Several inlined functions subject to paravirt patching are referencing
BUG_func() after the recent switch to the alternative patching
mechanism.

As those functions can legally be used by non-GPL modules, BUG_func()
must be usable by those modules, too. So use EXPORT_SYMBOL() when
exporting BUG_func().

Fixes: 9824b00c2b58 ("x86/paravirt: Move some functions and defines to alternative.c")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240109082232.22657-1-jgross@suse.com


# 2cd3e377 15-Dec-2023 Peter Zijlstra <peterz@infradead.org>

x86/cfi,bpf: Fix bpf_struct_ops CFI

BPF struct_ops uses __arch_prepare_bpf_trampoline() to write
trampolines for indirect function calls. These tramplines much have
matching CFI.

In order to obtain the correct CFI hash for the various methods, add a
matching structure that contains stub functions, the compiler will
generate correct CFI which we can pilfer for the trampolines.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231215092707.566977112@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# e72d88d1 15-Dec-2023 Peter Zijlstra <peterz@infradead.org>

x86/cfi,bpf: Fix bpf_callback_t CFI

Where the main BPF program is expected to match bpf_func_t,
sub-programs are expected to match bpf_callback_t.

This fixes things like:

tools/testing/selftests/bpf/progs/bloom_filter_bench.c:

bpf_for_each_map_elem(&array_map, bloom_callback, &data, 0);

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231215092707.451956710@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 4f9087f1 15-Dec-2023 Peter Zijlstra <peterz@infradead.org>

x86/cfi,bpf: Fix BPF JIT call

The current BPF call convention is __nocfi, except when it calls !JIT things,
then it calls regular C functions.

It so happens that with FineIBT the __nocfi and C calling conventions are
incompatible. Specifically __nocfi will call at func+0, while FineIBT will have
endbr-poison there, which is not a valid indirect target. Causing #CP.

Notably this only triggers on IBT enabled hardware, which is probably why this
hasn't been reported (also, most people will have JIT on anyway).

Implement proper CFI prologues for the BPF JIT codegen and drop __nocfi for
x86.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231215092707.345270396@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 54aa699e 02-Jan-2024 Bjorn Helgaas <bhelgaas@google.com>

arch/x86: Fix typos

Fix typos, most reported by "codespell arch/x86". Only touches comments,
no code changes.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20240103004011.1758650-1-helgaas@kernel.org


# 7991ed43 29-Dec-2023 Borislav Petkov (AMD) <bp@alien8.de>

x86/alternative: Correct feature bit debug output

In

https://lore.kernel.org/r/20231206110636.GBZXBVvCWj2IDjVk4c@fat_crate.local

I wanted to adjust the alternative patching debug output to the new
changes introduced by

da0fe6e68e10 ("x86/alternative: Add indirect call patching")

but removed the '*' which denotes the ->x86_capability word. The correct
output should be, for example:

[ 0.230071] SMP alternatives: feat: 11*32+15, old: (entry_SYSCALL_64_after_hwframe+0x5a/0x77 (ffffffff81c000c2) len: 16), repl: (ffffffff89ae896a, len: 5) flags: 0x0

while the incorrect one says "... 1132+15" currently.

Add back the '*'.

Fixes: da0fe6e68e10 ("x86/alternative: Add indirect call patching")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231206110636.GBZXBVvCWj2IDjVk4c@fat_crate.local


# f7af6977 09-Dec-2023 Juergen Gross <jgross@suse.com>

x86/paravirt: Remove no longer needed paravirt patching code

Now that paravirt is using the alternatives patching infrastructure,
remove the paravirt patching code.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231210062138.2417-6-jgross@suse.com


# 60bc276b 09-Dec-2023 Juergen Gross <jgross@suse.com>

x86/paravirt: Switch mixed paravirt/alternative calls to alternatives

Instead of stacking alternative and paravirt patching, use the new
ALT_FLAG_CALL flag to switch those mixed calls to pure alternative
handling.

Eliminate the need to be careful regarding the sequence of alternative
and paravirt patching.

[ bp: Touch up commit message. ]

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231210062138.2417-5-jgross@suse.com


# da0fe6e6 09-Dec-2023 Juergen Gross <jgross@suse.com>

x86/alternative: Add indirect call patching

In order to prepare replacing of paravirt patching with alternative
patching, add the capability to replace an indirect call with a direct
one.

This is done via a new flag ALT_FLAG_CALL as the target of the CALL
instruction needs to be evaluated using the value of the location
addressed by the indirect call.

For convenience, add a macro for a default CALL instruction. In case it
is being used without the new flag being set, it will result in a BUG()
when being executed. As in most cases, the feature used will be
X86_FEATURE_ALWAYS so add another macro ALT_CALL_ALWAYS usable for the
flags parameter of the ALTERNATIVE macros.

For a complete replacement, handle the special cases of calling a nop
function and an indirect call of NULL the same way as paravirt does.

[ bp: Massage commit message, fixup the debug output and clarify flow
more. ]

Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231210062138.2417-4-jgross@suse.com


# 9824b00c 29-Nov-2023 Juergen Gross <jgross@suse.com>

x86/paravirt: Move some functions and defines to alternative.c

As a preparation for replacing paravirt patching completely by
alternative patching, move some backend functions and #defines to
the alternatives code and header.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231129133332.31043-3-jgross@suse.com


# 5c22c472 09-Jun-2023 Hou Wenlong <houwenlong.hwl@antgroup.com>

x86/paravirt: Use relative reference for the original instruction offset

Similar to the alternative patching, use a relative reference for original
instruction offset rather than absolute one, which saves 8 bytes for one
PARA_SITE entry on x86_64. As a result, a R_X86_64_PC32 relocation is
generated instead of an R_X86_64_64 one, which also reduces relocation
metadata on relocatable builds. Hardcode the alignment to 4 now.

[ bp: Massage commit message. ]

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/9e6053107fbaabc0d33e5d2865c5af2c67ec9925.1686301237.git.houwenlong.hwl@antgroup.com


# 2dc41961 07-Dec-2023 Thomas Gleixner <tglx@linutronix.de>

x86/alternatives: Disable interrupts and sync when optimizing NOPs in place

apply_alternatives() treats alternatives with the ALT_FLAG_NOT flag set
special as it optimizes the existing NOPs in place.

Unfortunately, this happens with interrupts enabled and does not provide any
form of core synchronization.

So an interrupt hitting in the middle of the update and using the affected code
path will observe a half updated NOP and crash and burn. The following
3 NOP sequence was observed to expose this crash halfway reliably under QEMU
32bit:

0x90 0x90 0x90

which is replaced by the optimized 3 byte NOP:

0x8d 0x76 0x00

So an interrupt can observe:

1) 0x90 0x90 0x90 nop nop nop
2) 0x8d 0x90 0x90 undefined
3) 0x8d 0x76 0x90 lea -0x70(%esi),%esi
4) 0x8d 0x76 0x00 lea 0x0(%esi),%esi

Where only #1 and #4 are true NOPs. The same problem exists for 64bit obviously.

Disable interrupts around this NOP optimization and invoke sync_core()
before re-enabling them.

Fixes: 270a69c4485d ("x86/alternative: Support relocations in alternatives")
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/ZT6narvE%2BLxX%2B7Be@windriver.com


# 3ea1704a 07-Dec-2023 Thomas Gleixner <tglx@linutronix.de>

x86/alternatives: Sync core before enabling interrupts

text_poke_early() does:

local_irq_save(flags);
memcpy(addr, opcode, len);
local_irq_restore(flags);
sync_core();

That's not really correct because the synchronization should happen before
interrupts are re-enabled to ensure that a pending interrupt observes the
complete update of the opcodes.

It's not entirely clear whether the interrupt entry provides enough
serialization already, but moving the sync_core() invocation into interrupt
disabled region does no harm and is obviously correct.

Fixes: 6fffacb30349 ("x86/alternatives, jumplabel: Use text_poke_early() before mm_init()")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/ZT6narvE%2BLxX%2B7Be@windriver.com


# d35652a5 12-Oct-2023 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/alternatives: Disable KASAN in apply_alternatives()

Fei has reported that KASAN triggers during apply_alternatives() on
a 5-level paging machine:

BUG: KASAN: out-of-bounds in rcu_is_watching()
Read of size 4 at addr ff110003ee6419a0 by task swapper/0/0
...
__asan_load4()
rcu_is_watching()
trace_hardirqs_on()
text_poke_early()
apply_alternatives()
...

On machines with 5-level paging, cpu_feature_enabled(X86_FEATURE_LA57)
gets patched. It includes KASAN code, where KASAN_SHADOW_START depends on
__VIRTUAL_MASK_SHIFT, which is defined with cpu_feature_enabled().

KASAN gets confused when apply_alternatives() patches the
KASAN_SHADOW_START users. A test patch that makes KASAN_SHADOW_START
static, by replacing __VIRTUAL_MASK_SHIFT with 56, works around the issue.

Fix it for real by disabling KASAN while the kernel is patching alternatives.

[ mingo: updated the changelog ]

Fixes: 6657fca06e3f ("x86/mm: Allow to boot without LA57 if CONFIG_X86_5LEVEL=y")
Reported-by: Fei Yang <fei.yang@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20231012100424.1456-1-kirill.shutemov@linux.intel.com


# aee9d30b 22-Sep-2023 Peter Zijlstra <peterz@infradead.org>

x86,static_call: Fix static-call vs return-thunk

Commit

7825451fa4dc ("static_call: Add call depth tracking support")

failed to realize the problem fixed there is not specific to call depth
tracking but applies to all return-thunk uses.

Move the fix to the appropriate place and condition.

Fixes: ee88d363d156 ("x86,static_call: Use alternative RET encoding")
Reported-by: David Kaplan <David.Kaplan@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>


# 4ba89dd6 04-Sep-2023 Josh Poimboeuf <jpoimboe@kernel.org>

x86/alternatives: Remove faulty optimization

The following commit

095b8303f383 ("x86/alternative: Make custom return thunk unconditional")

made '__x86_return_thunk' a placeholder value. All code setting
X86_FEATURE_RETHUNK also changes the value of 'x86_return_thunk'. So
the optimization at the beginning of apply_returns() is dead code.

Also, before the above-mentioned commit, the optimization actually had a
bug It bypassed __static_call_fixup(), causing some raw returns to
remain unpatched in static call trampolines. Thus the 'Fixes' tag.

Fixes: d2408e043e72 ("x86/alternative: Optimize returns patching")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/16d19d2249d4485d8380fb215ffaae81e6b8119e.1693889988.git.jpoimboe@kernel.org


# 1a3e4b4d 03-Aug-2023 Arnd Bergmann <arnd@arndb.de>

x86/alternative: Add a __alt_reloc_selftest() prototype

The newly introduced selftest function causes a warning when -Wmissing-prototypes
is enabled:

arch/x86/kernel/alternative.c:1461:32: error: no previous prototype for '__alt_reloc_selftest' [-Werror=missing-prototypes]

Since it's only used locally, add the prototype directly in front of it.

Fixes: 270a69c4485d ("x86/alternative: Support relocations in alternatives")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230803082619.1369127-6-arnd@kernel.org


# 095b8303 14-Aug-2023 Peter Zijlstra <peterz@infradead.org>

x86/alternative: Make custom return thunk unconditional

There is infrastructure to rewrite return thunks to point to any
random thunk one desires, unwrap that from CALL_THUNKS, which up to
now was the sole user of that.

[ bp: Make the thunks visible on 32-bit and add ifdeffery for the
32-bit builds. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230814121148.775293785@infradead.org


# 238ec850 28-Jul-2023 Josh Poimboeuf <jpoimboe@kernel.org>

x86/srso: Fix return thunks in generated code

Set X86_FEATURE_RETHUNK when enabling the SRSO mitigation so that
generated code (e.g., ftrace, static call, eBPF) generates "jmp
__x86_return_thunk" instead of RET.

[ bp: Add a comment. ]

Fixes: fb3bd914b3ec ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>


# fb3bd914 28-Jun-2023 Borislav Petkov (AMD) <bp@alien8.de>

x86/srso: Add a Speculative RAS Overflow mitigation

Add a mitigation for the speculative return address stack overflow
vulnerability found on AMD processors.

The mitigation works by ensuring all RET instructions speculate to
a controlled location, similar to how speculation is controlled in the
retpoline sequence. To accomplish this, the __x86_return_thunk forces
the CPU to mispredict every function return using a 'safe return'
sequence.

To ensure the safety of this mitigation, the kernel must ensure that the
safe return sequence is itself free from attacker interference. In Zen3
and Zen4, this is accomplished by creating a BTB alias between the
untraining function srso_untrain_ret_alias() and the safe return
function srso_safe_ret_alias() which results in evicting a potentially
poisoned BTB entry and using that safe one for all function returns.

In older Zen1 and Zen2, this is accomplished using a reinterpretation
technique similar to Retbleed one: srso_untrain_ret() and
srso_safe_ret().

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>


# 535d0ae3 11-Jul-2023 Ingo Molnar <mingo@kernel.org>

x86/cfi: Only define poison_cfi() if CONFIG_X86_KERNEL_IBT=y

poison_cfi() was introduced in:

9831c6253ace ("x86/cfi: Extend ENDBR sealing to kCFI")

... but it's only ever used under CONFIG_X86_KERNEL_IBT=y,
and if that option is disabled, we get:

arch/x86/kernel/alternative.c:1243:13: error: ‘poison_cfi’ defined but not used [-Werror=unused-function]

Guard the definition with CONFIG_X86_KERNEL_IBT.

Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 04505bbb 15-Jun-2023 Peter Zijlstra <peterz@infradead.org>

x86/fineibt: Poison ENDBR at +0

Alyssa noticed that when building the kernel with CFI_CLANG+IBT and
booting on IBT enabled hardware to obtain FineIBT, the indirect
functions look like:

__cfi_foo:
endbr64
subl $hash, %r10d
jz 1f
ud2
nop
1:
foo:
endbr64

This is because the compiler generates code for kCFI+IBT. In that case
the caller does the hash check and will jump to +0, so there must be
an ENDBR there. The compiler doesn't know about FineIBT at all; also
it is possible to actually use kCFI+IBT when booting with 'cfi=kcfi'
on IBT enabled hardware.

Having this second ENDBR however makes it possible to elide the CFI
check. Therefore, we should poison this second ENDBR when switching to
FineIBT mode.

Fixes: 931ab63664f0 ("x86/ibt: Implement FineIBT")
Reported-by: "Milburn, Alyssa" <alyssa.milburn@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20230615193722.194131053@infradead.org


# 9831c625 21-Jun-2023 Peter Zijlstra <peterz@infradead.org>

x86/cfi: Extend ENDBR sealing to kCFI

Kees noted that IBT sealing could be extended to kCFI.

Fundamentally it is the list of functions that do not have their
address taken and are thus never called indirectly. It doesn't matter
that objtool uses IBT infrastructure to determine this list, once we
have it it can also be used to clobber kCFI hashes and avoid kCFI
indirect calls.

Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lkml.kernel.org/r/20230622144321.494426891%40infradead.org


# be0fffa5 22-Jun-2023 Peter Zijlstra <peterz@infradead.org>

x86/alternative: Rename apply_ibt_endbr()

The current name doesn't reflect what it does very well.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lkml.kernel.org/r/20230622144321.427441595%40infradead.org


# 2bd4aa93 14-Jun-2023 Peter Zijlstra <peterz@infradead.org>

x86/alternative: PAUSE is not a NOP

While chasing ghosts, I did notice that optimize_nops() was replacing
'REP NOP' aka 'PAUSE' with NOP2. This is clearly not right.

Fixes: 6c480f222128 ("x86/alternative: Rewrite optimize_nops() some")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/linux-next/20230524130104.GR83892@hirez.programming.kicks-ass.net/


# 9350a629 31-May-2023 Steven Rostedt (Google) <rostedt@goodmis.org>

x86/alternatives: Add cond_resched() to text_poke_bp_batch()

Debugging in the kernel has started slowing down the kernel by a
noticeable amount. The ftrace start up tests are triggering the softlockup
watchdog on some boxes. This is caused by the start up tests that enable
function and function graph tracing several times. Sprinkling
cond_resched() just in the start up test code was not enough to stop the
softlockup from triggering. It would sometimes trigger in the
text_poke_bp_batch() code.

When function tracing enables all functions, it will call
text_poke_queue() to queue the places that need to be patched. Every
256 entries will do a "flush" that calls text_poke_bp_batch() to do the
update of the 256 locations. As this is in a scheduleable context,
calling cond_resched() at the start of text_poke_bp_batch() will ensure
that other tasks could get a chance to run while the patching is
happening. This keeps the softlockup from triggering in the start up
tests.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230531092419.4d051374@rorschach.local.home


# 0f613bfa 05-Jun-2023 Mark Rutland <mark.rutland@arm.com>

locking/atomic: treewide: use raw_atomic*_<op>()

Now that we have raw_atomic*_<op>() definitions, there's no need to use
arch_atomic*_<op>() definitions outside of the low-level atomic
definitions.

Move treewide users of arch_atomic*_<op>() over to the equivalent
raw_atomic*_<op>().

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230605070124.3741859-19-mark.rutland@arm.com


# df25edba 15-May-2023 Peter Zijlstra <peterz@infradead.org>

x86/alternatives: Add longer 64-bit NOPs

By adding support for longer NOPs there are a few more alternatives
that can turn into a single instruction.

Add up to NOP11, the same limit where GNU as .nops also stops
generating longer nops. This is because a number of uarchs have severe
decode penalties for more than 3 prefixes.

[ bp: Sync up with the version in tools/ while at it. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230515093020.661756940@infradead.org


# d42a2a89 13-May-2023 Borislav Petkov (AMD) <bp@alien8.de>

x86/alternatives: Fix section mismatch warnings

Fix stuff like:

WARNING: modpost: vmlinux.o: section mismatch in reference: \
__optimize_nops (section: .text) -> debug_alternative (section: .init.data)

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230513160146.16039-1-bp@alien8.de


# d2408e04 12-May-2023 Borislav Petkov (AMD) <bp@alien8.de>

x86/alternative: Optimize returns patching

Instead of decoding each instruction in the return sites range only to
realize that that return site is a jump to the default return thunk
which is needed - X86_FEATURE_RETHUNK is enabled - lift that check
before the loop and get rid of that loop overhead.

Add comments about what gets patched, while at it.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230512120952.7924-1-bp@alien8.de


# b6c881b2 08-Feb-2023 Peter Zijlstra <peterz@infradead.org>

x86/alternative: Complicate optimize_nops() some more

Because:

SMP alternatives: ffffffff810026dc: [2:44) optimized NOPs: eb 2a eb 28 cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc

is quite daft, make things more complicated and have the NOP runlength
detection eat the preceding JMP if they both end at the same target.

SMP alternatives: ffffffff810026dc: [0:44) optimized NOPs: eb 2a cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230208171431.433132442@infradead.org


# 6c480f22 08-Feb-2023 Peter Zijlstra <peterz@infradead.org>

x86/alternative: Rewrite optimize_nops() some

Address two issues:

- it no longer hard requires single byte NOP runs - now it accepts any
NOP and NOPL encoded instruction (but not the more complicated 32bit
NOPs).

- it writes a single 'instruction' replacement.

Specifically, ORC unwinder relies on the tail NOP of an alternative to
be a single instruction. In particular, it relies on the inner bytes not
being executed.

Once the max supported NOP length has been reached (currently 8, could easily
be extended to 11 on x86_64), switch to JMP.d8 and INT3 padding to
achieve the same result.

Objtool uses this guarantee in the analysis of alternative/overlapping
CFI state for the ORC unwinder data. Every instruction edge gets a CFI
state and the more instructions the larger the chance of conflicts.

[ bp:
- Add a comment over add_nop() to explain why it does it this way
- Make add_nops() PARAVIRT only as it is used solely there now ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230208171431.373412974@infradead.org


# 270a69c4 08-Feb-2023 Peter Zijlstra <peterz@infradead.org>

x86/alternative: Support relocations in alternatives

A little while ago someone (Kirill) ran into the whole 'alternatives don't
do relocations nonsense' again and I got annoyed enough to actually look
at the code.

Since the whole alternative machinery already fully decodes the
instructions it is simple enough to adjust immediates and displacement
when needed. Specifically, the immediates for IP modifying instructions
(JMP, CALL, Jcc) and the displacement for RIP-relative instructions.

[ bp: Massage comment some more and get rid of third loop in
apply_relocation(). ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230208171431.313857925@infradead.org


# 6becb502 08-Feb-2023 Peter Zijlstra <peterz@infradead.org>

x86/alternative: Make debug-alternative selective

Using debug-alternative generates a *LOT* of output, extend it a bit
to select which of the many rewrites it reports on.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230208171431.253636689@infradead.org


# ac0ee0a9 23-Jan-2023 Peter Zijlstra <peterz@infradead.org>

x86/alternatives: Teach text_poke_bp() to patch Jcc.d32 instructions

In order to re-write Jcc.d32 instructions text_poke_bp() needs to be
taught about them.

The biggest hurdle is that the whole machinery is currently made for 5
byte instructions and extending this would grow struct text_poke_loc
which is currently a nice 16 bytes and used in an array.

However, since text_poke_loc contains a full copy of the (s32)
displacement, it is possible to map the Jcc.d32 2 byte opcodes to
Jcc.d8 1 byte opcode for the int3 emulation.

This then leaves the replacement bytes; fudge that by only storing the
last 5 bytes and adding the rule that 'length == 6' instruction will
be prefixed with a 0x0f byte.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20230123210607.115718513@infradead.org


# 5d1dd961 21-Dec-2022 Borislav Petkov (AMD) <bp@alien8.de>

x86/alternatives: Add alt_instr.flags

Add a struct alt_instr.flags field which will contain different flags
controlling alternatives patching behavior.

The initial idea was to be able to specify it as a separate macro
parameter but that would mean touching all possible invocations of the
alternatives macros and thus a lot of churn.

What is more, as PeterZ suggested, being able to say ALT_NOT(feature) is
very readable and explains exactly what is meant.

So make the feature field a u32 where the patching flags are the upper
u16 part of the dword quantity while the lower u16 word is the feature.

The highest feature number currently is 0x26a (i.e., word 19) so there
is plenty of space. If that becomes insufficient, the field can be
extended to u64 which will then make struct alt_instr of the nice size
of 16 bytes (14 bytes currently).

There should be no functional changes resulting from this.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/Y6RCoJEtxxZWwotd@zn.tnic


# eb7d389d 25-Oct-2022 Peter Zijlstra <peterz@infradead.org>

x86/ftrace: Remove SYSTEM_BOOTING exceptions

Now that text_poke is available before ftrace, remove the
SYSTEM_BOOTING exceptions.

Specifically, this cures a W+X case during boot.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221025201057.945960823@infradead.org


# 0c3e806e 27-Oct-2022 Peter Zijlstra <peterz@infradead.org>

x86/cfi: Add boot time hash randomization

In order to avoid known hashes (from knowing the boot image),
randomize the CFI hashes with a per-boot random seed.

Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221027092842.765195516@infradead.org


# 082c4c81 27-Oct-2022 Peter Zijlstra <peterz@infradead.org>

x86/cfi: Boot time selection of CFI scheme

Add the "cfi=" boot parameter to allow people to select a CFI scheme
at boot time. Mostly useful for development / debugging.

Requested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221027092842.699804264@infradead.org


# 931ab636 27-Oct-2022 Peter Zijlstra <peterz@infradead.org>

x86/ibt: Implement FineIBT

Implement an alternative CFI scheme that merges both the fine-grained
nature of kCFI but also takes full advantage of the coarse grained
hardware CFI as provided by IBT.

To contrast:

kCFI is a pure software CFI scheme and relies on being able to read
text -- specifically the instruction *before* the target symbol, and
does the hash validation *before* doing the call (otherwise control
flow is compromised already).

FineIBT is a software and hardware hybrid scheme; by ensuring every
branch target starts with a hash validation it is possible to place
the hash validation after the branch. This has several advantages:

o the (hash) load is avoided; no memop; no RX requirement.

o IBT WAIT-FOR-ENDBR state is a speculation stop; by placing
the hash validation in the immediate instruction after
the branch target there is a minimal speculation window
and the whole is a viable defence against SpectreBHB.

o Kees feels obliged to mention it is slightly more vulnerable
when the attacker can write code.

Obviously this patch relies on kCFI, but additionally it also relies
on the padding from the call-depth-tracking patches. It uses this
padding to place the hash-validation while the call-sites are
re-written to modify the indirect target to be 16 bytes in front of
the original target, thus hitting this new preamble.

Notably, there is no hardware that needs call-depth-tracking (Skylake)
and supports IBT (Tigerlake and onwards).

Suggested-by: Joao Moreira (Intel) <joao@overdrivepizza.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221027092842.634714496@infradead.org


# ae25e00b 25-Oct-2022 Dan Carpenter <error27@gmail.com>

x86/retpoline: Fix crash printing warning

The first argument of WARN() is a condition, so this will use "addr"
as the format string and possibly crash.

Fixes: 3b6c1747da48 ("x86/retpoline: Add SKL retthunk retpolines")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/Y1gBoUZrRK5N%2FlCB@kili/


# 3b6c1747 15-Sep-2022 Peter Zijlstra <peterz@infradead.org>

x86/retpoline: Add SKL retthunk retpolines

Ensure that retpolines do the proper call accounting so that the return
accounting works correctly.

Specifically; retpolines are used to replace both 'jmp *%reg' and
'call *%reg', however these two cases do not have the same accounting
requirements. Therefore split things up and provide two different
retpoline arrays for SKL.

The 'jmp *%reg' case needs no accounting, the
__x86_indirect_jump_thunk_array[] covers this. The retpoline is
changed to not use the return thunk; it's a simple call;ret construct.

[ strictly speaking it should do:
andq $(~0x1f), PER_CPU_VAR(__x86_call_depth)
but we can argue this can be covered by the fuzz we already have
in the accounting depth (12) vs the RSB depth (16) ]

The 'call *%reg' case does need accounting, the
__x86_indirect_call_thunk_array[] covers this. Again, this retpoline
avoids the use of the return-thunk, in this case to avoid double
accounting.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111147.996634749@infradead.org


# 770ae1b7 15-Sep-2022 Peter Zijlstra <peterz@infradead.org>

x86/returnthunk: Allow different return thunks

In preparation for call depth tracking on Intel SKL CPUs, make it possible
to patch in a SKL specific return thunk.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111147.680469665@infradead.org


# e81dc127 15-Sep-2022 Thomas Gleixner <tglx@linutronix.de>

x86/callthunks: Add call patching for call depth tracking

Mitigating the Intel SKL RSB underflow issue in software requires to
track the call depth. That is every CALL and every RET need to be
intercepted and additional code injected.

The existing retbleed mitigations already include means of redirecting
RET to __x86_return_thunk; this can be re-purposed and RET can be
redirected to another function doing RET accounting.

CALL accounting will use the function padding introduced in prior
patches. For each CALL instruction, the destination symbol's padding
is rewritten to do the accounting and the CALL instruction is adjusted
to call into the padding.

This ensures only affected CPUs pay the overhead of this accounting.
Unaffected CPUs will leave the padding unused and have their 'JMP
__x86_return_thunk' replaced with an actual 'RET' instruction.

Objtool has been modified to supply a .call_sites section that lists
all the 'CALL' instructions. Additionally the paravirt instruction
sites are iterated since they will have been patched from an indirect
call to direct calls (or direct instructions in which case it'll be
ignored).

Module handling and the actual thunk code for SKL will be added in
subsequent steps.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111147.470877038@infradead.org


# fe54d079 15-Sep-2022 Thomas Gleixner <tglx@linutronix.de>

x86/alternatives: Provide text_poke_copy_locked()

The upcoming call thunk patching must hold text_mutex and needs access to
text_poke_copy(), which takes text_mutex.

Provide a _locked postfixed variant to expose the inner workings.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111147.159977224@infradead.org


# 023e59d4 11-Oct-2022 Miaohe Lin <linmiaohe@huawei.com>

x86/alternative: Remove noinline from __ibt_endbr_seal[_end]() stubs

Due to the explicit 'noinline' GCC-7.3 is not able to optimize away the
argument setup of:

apply_ibt_endbr(__ibt_endbr_seal, __ibt_enbr_seal_end);

even when X86_KERNEL_IBT=n and the function is an empty stub, which leads
to link errors due to missing __ibt_endbr_seal* symbols:

ld: arch/x86/kernel/alternative.o: in function `alternative_instructions':
alternative.c:(.init.text+0x15d): undefined reference to `__ibt_endbr_seal_end'
ld: alternative.c:(.init.text+0x164): undefined reference to `__ibt_endbr_seal'

Remove the explicit 'noinline' to help gcc optimize them away.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221011113803.956808-1-linmiaohe@huawei.com


# 64267734 08-Nov-2022 Jiapeng Chong <jiapeng.chong@linux.alibaba.com>

x86: Fix misc small issues

Fix:

./arch/x86/kernel/traps.c: asm/proto.h is included more than once.

./arch/x86/kernel/alternative.c:1610:2-3: Unneeded semicolon.

[ bp: Merge into a single patch. ]

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/1620902768-53822-1-git-send-email-jiapeng.chong@linux.alibaba.com
Link: https://lore.kernel.org/r/20220926054628.116957-1-jiapeng.chong@linux.alibaba.com


# 8c03af3e 07-Sep-2022 Peter Zijlstra <peterz@infradead.org>

x86,retpoline: Be sure to emit INT3 after JMP *%\reg

Both AMD and Intel recommend using INT3 after an indirect JMP. Make sure
to emit one when rewriting the retpoline JMP irrespective of compiler
SLS options or even CONFIG_SLS.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Link: https://lkml.kernel.org/r/Yxm+QkFPOhrVSH6q@hirez.programming.kicks-ass.net


# efd608fa 21-Sep-2022 Nadav Amit <namit@vmware.com>

x86/alternative: Fix race in try_get_desc()

I encountered some occasional crashes of poke_int3_handler() when
kprobes are set, while accessing desc->vec.

The text poke mechanism claims to have an RCU-like behavior, but it
does not appear that there is any quiescent state to ensure that
nobody holds reference to desc. As a result, the following race
appears to be possible, which can lead to memory corruption.

CPU0 CPU1
---- ----
text_poke_bp_batch()
-> smp_store_release(&bp_desc, &desc)

[ notice that desc is on
the stack ]

poke_int3_handler()

[ int3 might be kprobe's
so sync events are do not
help ]

-> try_get_desc(descp=&bp_desc)
desc = __READ_ONCE(bp_desc)

if (!desc) [false, success]
WRITE_ONCE(bp_desc, NULL);
atomic_dec_and_test(&desc.refs)

[ success, desc space on the stack
is being reused and might have
non-zero value. ]
arch_atomic_inc_not_zero(&desc->refs)

[ might succeed since desc points to
stack memory that was freed and might
be reused. ]

Fix this issue with small backportable patch. Instead of trying to
make RCU-like behavior for bp_desc, just eliminate the unnecessary
level of indirection of bp_desc, and hold the whole descriptor as a
global. Anyhow, there is only a single descriptor at any given
moment.

Fixes: 1f676247f36a4 ("x86/alternatives: Implement a better poke_int3_handler() completion scheme")
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@kernel.org
Link: https://lkml.kernel.org/r/20220920224743.3089-1-namit@vmware.com


# 65cdf0d6 13-Jul-2022 Kees Cook <keescook@chromium.org>

x86/alternative: Report missing return thunk details

Debugging missing return thunks is easier if we can see where they're
happening.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/lkml/Ys66hwtFcGbYmoiZ@hirez.programming.kicks-ass.net/


# f43b9876 27-Jun-2022 Peter Zijlstra <peterz@infradead.org>

x86/retbleed: Add fine grained Kconfig knobs

Do fine-grained Kconfig for all the various retbleed parts.

NOTE: if your compiler doesn't support return thunks this will
silently 'upgrade' your mitigation to IBPB, you might not like this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>


# ee88d363 14-Jun-2022 Peter Zijlstra <peterz@infradead.org>

x86,static_call: Use alternative RET encoding

In addition to teaching static_call about the new way to spell 'RET',
there is an added complication in that static_call() is allowed to
rewrite text before it is known which particular spelling is required.

In order to deal with this; have a static_call specific fixup in the
apply_return() 'alternative' patching routine that will rewrite the
static_call trampoline to match the definite sequence.

This in turn creates the problem of uniquely identifying static call
trampolines. Currently trampolines are 8 bytes, the first 5 being the
jmp.d32/ret sequence and the final 3 a byte sequence that spells out
'SCT'.

This sequence is used in __static_call_validate() to ensure it is
patching a trampoline and not a random other jmp.d32. That is,
false-positives shouldn't be plenty, but aren't a big concern.

OTOH the new __static_call_fixup() must not have false-positives, and
'SCT' decodes to the somewhat weird but semi plausible sequence:

push %rbx
rex.XB push %r12

Additionally, there are SLS concerns with immediate jumps. Combined it
seems like a good moment to change the signature to a single 3 byte
trap instruction that is unique to this usage and will not ever get
generated by accident.

As such, change the signature to: '0x0f, 0xb9, 0xcc', which decodes
to:

ud1 %esp, %ecx

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>


# 15e67227 14-Jun-2022 Peter Zijlstra <peterz@infradead.org>

x86: Undo return-thunk damage

Introduce X86_FEATURE_RETHUNK for those afflicted with needing this.

[ bp: Do only INT3 padding - simpler. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>


# aadd1b67 20-May-2022 Song Liu <song@kernel.org>

x86/alternative: Introduce text_poke_set

Introduce a memset like API for text_poke. This will be used to fill the
unused RX memory with illegal instructions.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/bpf/20220520235758.1858153-3-song@kernel.org


# 03f16cd0 18-Apr-2022 Josh Poimboeuf <jpoimboe@redhat.com>

objtool: Add CONFIG_OBJTOOL

Now that stack validation is an optional feature of objtool, add
CONFIG_OBJTOOL and replace most usages of CONFIG_STACK_VALIDATION with
it.

CONFIG_STACK_VALIDATION can now be considered to be frame-pointer
specific. CONFIG_UNWINDER_ORC is already inherently valid for live
patching, so no need to "validate" it.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Link: https://lkml.kernel.org/r/939bf3d85604b2a126412bf11af6e3bd3b872bcb.1650300597.git.jpoimboe@redhat.com


# ed53a0d9 08-Mar-2022 Peter Zijlstra <peterz@infradead.org>

x86/alternative: Use .ibt_endbr_seal to seal indirect calls

Objtool's --ibt option generates .ibt_endbr_seal which lists
superfluous ENDBR instructions. That is those instructions for which
the function is never indirectly called.

Overwrite these ENDBR instructions with a NOP4 such that these
function can never be indirect called, reducing the number of viable
ENDBR targets in the kernel.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220308154319.822545231@infradead.org


# 3e3f0695 08-Mar-2022 Peter Zijlstra <peterz@infradead.org>

x86/ibt: Annotate text references

Annotate away some of the generic code references. This is things
where we take the address of a symbol for exception handling or return
addresses (eg. context switch).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220308154318.877758523@infradead.org


# 99c95c5d 08-Mar-2022 Peter Zijlstra <peterz@infradead.org>

x86/alternative: Simplify int3_selftest_ip

Similar to ibt_selftest_ip, apply the same pattern.

Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220308154318.700456643@infradead.org


# 0e06b403 04-Feb-2022 Song Liu <song@kernel.org>

x86/alternative: Introduce text_poke_copy

This will be used by BPF jit compiler to dump JITed binary to a RX huge
page, and thus allow multiple BPF programs sharing the a huge (2MB) page.

Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/bpf/20220204185742.271030-6-song@kernel.org


# d45476d9 16-Feb-2022 Peter Zijlstra (Intel) <peterz@infradead.org>

x86/speculation: Rename RETPOLINE_AMD to RETPOLINE_LFENCE

The RETPOLINE_AMD name is unfortunate since it isn't necessarily
AMD only, in fact Hygon also uses it. Furthermore it will likely be
sufficient for some Intel processors. Therefore rename the thing to
RETPOLINE_LFENCE to better describe what it is.

Add the spectre_v2=retpoline,lfence option as an alias to
spectre_v2=retpoline,amd to preserve existing setups. However, the output
of /sys/devices/system/cpu/vulnerabilities/spectre_v2 will be changed.

[ bp: Fix typos, massage. ]

Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>


# 26c44b77 04-Dec-2021 Peter Zijlstra <peterz@infradead.org>

x86/alternative: Relax text_poke_bp() constraint

Currently, text_poke_bp() is very strict to only allow patching a
single instruction; however with straight-line-speculation it will be
required to patch: ret; int3, which is two instructions.

As such, relax the constraints a little to allow int3 padding for all
instructions that do not imply the execution of the next instruction,
ie: RET, JMP.d8 and JMP.d32.

While there, rename the text_poke_loc::rel32 field to ::disp.

Note: this fills up the text_poke_loc structure which is now a round
16 bytes big.

[ bp: Put comments ontop instead of on the side. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211204134908.082342723@infradead.org


# b17c2baa 04-Dec-2021 Peter Zijlstra <peterz@infradead.org>

x86: Prepare inline-asm for straight-line-speculation

Replace all ret/retq instructions with ASM_RET in preparation of
making it more than a single instruction.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211204134907.964635458@infradead.org


# d4b5a5c9 26-Oct-2021 Peter Zijlstra <peterz@infradead.org>

x86/alternative: Add debug prints to apply_retpolines()

Make sure we can see the text changes when booting with
'debug-alternative'.

Example output:

[ ] SMP alternatives: retpoline at: __traceiter_initcall_level+0x1f/0x30 (ffffffff8100066f) len: 5 to: __x86_indirect_thunk_rax+0x0/0x20
[ ] SMP alternatives: ffffffff82603e58: [2:5) optimized NOPs: ff d0 0f 1f 00
[ ] SMP alternatives: ffffffff8100066f: orig: e8 cc 30 00 01
[ ] SMP alternatives: ffffffff8100066f: repl: ff d0 0f 1f 00

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20211026120310.422273830@infradead.org


# bbe2df3f 26-Oct-2021 Peter Zijlstra <peterz@infradead.org>

x86/alternative: Try inline spectre_v2=retpoline,amd

Try and replace retpoline thunk calls with:

LFENCE
CALL *%\reg

for spectre_v2=retpoline,amd.

Specifically, the sequence above is 5 bytes for the low 8 registers,
but 6 bytes for the high 8 registers. This means that unless the
compilers prefix stuff the call with higher registers this replacement
will fail.

Luckily GCC strongly favours RAX for the indirect calls and most (95%+
for defconfig-x86_64) will be converted. OTOH clang strongly favours
R11 and almost nothing gets converted.

Note: it will also generate a correct replacement for the Jcc.d32
case, except unless the compilers start to prefix stuff that, it'll
never fit. Specifically:

Jncc.d8 1f
LFENCE
JMP *%\reg
1:

is 7-8 bytes long, where the original instruction in unpadded form is
only 6 bytes.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20211026120310.359986601@infradead.org


# 2f0cbb2a 26-Oct-2021 Peter Zijlstra <peterz@infradead.org>

x86/alternative: Handle Jcc __x86_indirect_thunk_\reg

Handle the rare cases where the compiler (clang) does an indirect
conditional tail-call using:

Jcc __x86_indirect_thunk_\reg

For the !RETPOLINE case this can be rewritten to fit the original (6
byte) instruction like:

Jncc.d8 1f
JMP *%\reg
NOP
1:

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20211026120310.296470217@infradead.org


# 75085009 26-Oct-2021 Peter Zijlstra <peterz@infradead.org>

x86/alternative: Implement .retpoline_sites support

Rewrite retpoline thunk call sites to be indirect calls for
spectre_v2=off. This ensures spectre_v2=off is as near to a
RETPOLINE=n build as possible.

This is the replacement for objtool writing alternative entries to
ensure the same and achieves feature-parity with the previous
approach.

One noteworthy feature is that it relies on the thunks to be in
machine order to compute the register index.

Specifically, this does not yet address the Jcc __x86_indirect_thunk_*
calls generated by clang, a future patch will add this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20211026120310.232495794@infradead.org


# 7ee0e638 01-Jun-2021 Borislav Petkov <bp@suse.de>

x86/alternative: Align insn bytes vertically

For easier inspection which bytes have changed.

For example:

feat: 7*32+12, old: (__x86_indirect_thunk_r10+0x0/0x20 (ffffffff81c02480) len: 17), repl: (ffffffff897813aa, len: 17)
ffffffff81c02480: old_insn: 41 ff e2 90 90 90 90 90 90 90 90 90 90 90 90 90 90
ffffffff897813aa: rpl_insn: e8 07 00 00 00 f3 90 0f ae e8 eb f9 4c 89 14 24 c3
ffffffff81c02480: final_insn: e8 07 00 00 00 f3 90 0f ae e8 eb f9 4c 89 14 24 c3

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210601193713.16190-1-bp@alien8.de


# 64e1f587 06-May-2021 Pavel Skripkin <paskripkin@gmail.com>

x86/alternatives: Make the x86nops[] symbol static

Sparse says:

arch/x86/kernel/alternative.c:78:21: warning: symbol 'x86nops' was not declared. Should it be static?

Since x86nops[] is not used outside this file, Sparse is right and it can be made static.

Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210506190726.15575-1-paskripkin@gmail.com


# 2b31e8ed 01-Jun-2021 Borislav Petkov <bp@suse.de>

x86/alternative: Optimize single-byte NOPs at an arbitrary position

Up until now the assumption was that an alternative patching site would
have some instructions at the beginning and trailing single-byte NOPs
(0x90) padding. Therefore, the patching machinery would go and optimize
those single-byte NOPs into longer ones.

However, this assumption is broken on 32-bit when code like
hv_do_hypercall() in hyperv_init() would use the ratpoline speculation
killer CALL_NOSPEC. The 32-bit version of that macro would align certain
insns to 16 bytes, leading to the compiler issuing a one or more
single-byte NOPs, depending on the holes it needs to fill for alignment.

That would lead to the warning in optimize_nops() to fire:

------------[ cut here ]------------
Not a NOP at 0xc27fb598
WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:211 optimize_nops.isra.13

due to that function verifying whether all of the following bytes really
are single-byte NOPs.

Therefore, carve out the NOP padding into a separate function and call
it for each NOP range beginning with a single-byte NOP.

Fixes: 23c1ad538f4f ("x86/alternatives: Optimize optimize_nops()")
Reported-by: Richard Narron <richard@aaazen.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213301
Link: https://lkml.kernel.org/r/20210601212125.17145-1-bp@alien8.de


# 2f4305b1 20-Feb-2021 Nadav Amit <namit@vmware.com>

x86/mm/tlb: Privatize cpu_tlbstate

cpu_tlbstate is mostly private and only the variable is_lazy is shared.
This causes some false-sharing when TLB flushes are performed.

Break cpu_tlbstate intro cpu_tlbstate and cpu_tlbstate_shared, and mark
each one accordingly.

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20210220231712.2475218-6-namit@vmware.com


# 23c1ad53 26-Mar-2021 Peter Zijlstra <peterz@infradead.org>

x86/alternatives: Optimize optimize_nops()

Currently, optimize_nops() scans to see if the alternative starts with
NOPs. However, the emit pattern is:

141: \oldinstr
142: .skip (len-(142b-141b)), 0x90

That is, when 'oldinstr' is short, the tail is padded with NOPs. This case
never gets optimized.

Rewrite optimize_nops() to replace any trailing string of NOPs inside
the alternative to larger NOPs. Also run it irrespective of patching,
replacing NOPs in both the original and replaced code.

A direct consequence is that 'padlen' becomes superfluous, so remove it.

[ bp:
- Adjust commit message
- remove a stale comment about needing to pad
- add a comment in optimize_nops()
- exit early if the NOP verif. loop catches a mismatch - function
should not not add NOPs in that case
- fix the "optimized NOPs" offsets output ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210326151259.442992235@infradead.org


# 52fa82c2 26-Mar-2021 Peter Zijlstra <peterz@infradead.org>

x86: Add insn_decode_kernel()

Add a helper to decode kernel instructions; there's no point in
endlessly repeating those last two arguments.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210326151259.379242587@infradead.org


# a89dfde3 11-Mar-2021 Peter Zijlstra <peterz@infradead.org>

x86: Remove dynamic NOP selection

This ensures that a NOP is a NOP and not a random other instruction that
is also a NOP. It allows simplification of dynamic code patching that
wants to verify existing code before writing new instructions (ftrace,
jump_label, static_call, etc..).

Differentiating on NOPs is not a feature.

This pessimises 32bit (DONTCARE) and 32bit on 64bit CPUs (CARELESS).
32bit is not a performance target.

Everything x86_64 since AMD K10 (2007) and Intel IvyBridge (2012) is
fine with using NOPL (as opposed to prefix NOP). And per FEATURE_NOPL
being required for x86_64, all x86_64 CPUs can use NOPL. So stop
caring about NOPs, simplify things and get on with life.

[ The problem seems to be that some uarchs can only decode NOPL on a
single front-end port while others have severe decode penalties for
excessive prefixes. All modern uarchs can handle both, except Atom,
which has prefix penalties. ]

[ Also, much doubt you can actually measure any of this on normal
workloads. ]

After this, FEATURE_NOPL is unused except for required-features for
x86_64. FEATURE_K8 is only used for PTI.

[ bp: Kernel build measurements showed ~0.3s slowdown on Sandybridge
which is hardly a slowdown. Get rid of X86_FEATURE_K7, while at it. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> # bpf
Acked-by: Linus Torvalds <torvalds@linuxfoundation.org>
Link: https://lkml.kernel.org/r/20210312115749.065275711@infradead.org


# 63c66cde 06-Nov-2020 Borislav Petkov <bp@suse.de>

x86/alternative: Use insn_decode()

No functional changes, just simplification.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210304174237.31945-10-bp@alien8.de


# 054ac8ad 11-Mar-2021 Juergen Gross <jgross@suse.com>

x86/paravirt: Have only one paravirt patch function

There is no need any longer to have different paravirt patch functions
for native and Xen. Eliminate native_patch() and rename
paravirt_patch_default() to paravirt_patch().

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210311142319.4723-15-jgross@suse.com


# 4e629211 11-Mar-2021 Juergen Gross <jgross@suse.com>

x86/paravirt: Add new features for paravirt patching

For being able to switch paravirt patching from special cased custom
code sequences to ALTERNATIVE handling some X86_FEATURE_* are needed
as new features. This enables to have the standard indirect pv call
as the default code and to patch that with the non-Xen custom code
sequence via ALTERNATIVE patching later.

Make sure paravirt patching is performed before alternatives patching.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210311142319.4723-9-jgross@suse.com


# dda7bb76 11-Mar-2021 Juergen Gross <jgross@suse.com>

x86/alternative: Support not-feature

Add support for alternative patching for the case a feature is not
present on the current CPU. For users of ALTERNATIVE() and friends, an
inverted feature is specified by applying the ALT_NOT() macro to it,
e.g.:

ALTERNATIVE(old, new, ALT_NOT(feature));

Committer note:

The decision to encode the NOT-bit in the feature bit itself is because
a future change which would make objtool generate such alternative
calls, would keep the code in objtool itself fairly simple.

Also, this allows for the alternative macros to support the NOT feature
without having to change them.

Finally, the u16 cpuid member encoding the X86_FEATURE_ flags is not an
ABI so if more bits are needed, cpuid itself can be enlarged or a flags
field can be added to struct alt_instr after having considered the size
growth in either cases.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210311142319.4723-6-jgross@suse.com


# 72ebb5ff 03-Dec-2020 Qiujun Huang <hqjagain@gmail.com>

x86/alternative: Update text_poke_bp() kernel-doc comment

Update kernel-doc parameter name after

c3d6324f841b ("x86/alternatives: Teach text_poke_bp() to emulate instructions")

changed the last parameter from @handler to @emulate.

[ bp: Make commit message more precise. ]

Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201203145020.2441-1-hqjagain@gmail.com


# abee7c49 09-Oct-2020 Juergen Gross <jgross@suse.com>

x86/alternative: Don't call text_poke() in lazy TLB mode

When running in lazy TLB mode the currently active page tables might
be the ones of a previous process, e.g. when running a kernel thread.

This can be problematic in case kernel code is being modified via
text_poke() in a kernel thread, and on another processor exit_mmap()
is active for the process which was running on the first cpu before
the kernel thread.

As text_poke() is using a temporary address space and the former
address space (obtained via cpu_tlbstate.loaded_mm) is restored
afterwards, there is a race possible in case the cpu on which
exit_mmap() is running wants to make sure there are no stale
references to that address space on any cpu active (this e.g. is
required when running as a Xen PV guest, where this problem has been
observed and analyzed).

In order to avoid that, drop off TLB lazy mode before switching to the
temporary address space.

Fixes: cefa929c034eb5d ("x86/mm: Introduce temporary mm structs")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201009144225.12019-1-jgross@suse.com


# c43a43e4 18-Aug-2020 Peter Zijlstra <peterz@infradead.org>

x86/alternatives: Teach text_poke_bp() to emulate RET

Future patches will need to poke a RET instruction, provide the
infrastructure required for this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20200818135804.982214828@infradead.org


# df561f66 23-Aug-2020 Gustavo A. R. Silva <gustavoars@kernel.org>

treewide: Use fallthrough pseudo-keyword

Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>


# a6d996cb 12-Aug-2020 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

x86/alternatives: Acquire pte lock with interrupts enabled

pte lock is never acquired in-IRQ context so it does not require interrupts
to be disabled. The lock is a regular spinlock which cannot be acquired
with interrupts disabled on RT.

RT complains about pte_lock() in __text_poke() because it's invoked after
disabling interrupts.

__text_poke() has to disable interrupts as use_temporary_mm() expects
interrupts to be off because it invokes switch_mm_irqs_off() and uses
per-CPU (current active mm) data.

Move the PTE lock handling outside the interrupt disabled region.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by; Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200813105026.bvugytmsso6muljw@linutronix.de


# ca15ca40 07-Aug-2020 Mike Rapoport <rppt@kernel.org>

mm: remove unneeded includes of <asm/pgalloc.h>

Patch series "mm: cleanup usage of <asm/pgalloc.h>"

Most architectures have very similar versions of pXd_alloc_one() and
pXd_free_one() for intermediate levels of page table. These patches add
generic versions of these functions in <asm-generic/pgalloc.h> and enable
use of the generic functions where appropriate.

In addition, functions declared and defined in <asm/pgalloc.h> headers are
used mostly by core mm and early mm initialization in arch and there is no
actual reason to have the <asm/pgalloc.h> included all over the place.
The first patch in this series removes unneeded includes of
<asm/pgalloc.h>

In the end it didn't work out as neatly as I hoped and moving
pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require
unnecessary changes to arches that have custom page table allocations, so
I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
to mm/.

This patch (of 8):

In most cases <asm/pgalloc.h> header is required only for allocations of
page table memory. Most of the .c files that include that header do not
use symbols declared in <asm/pgalloc.h> and do not require that header.

As for the other header files that used to include <asm/pgalloc.h>, it is
possible to move that include into the .c file that actually uses symbols
from <asm/pgalloc.h> and drop the include from the header file.

The process was somewhat automated using

sed -i -E '/[<"]asm\/pgalloc\.h/d' \
$(grep -L -w -f /tmp/xx \
$(git grep -E -l '[<"]asm/pgalloc\.h'))

where /tmp/xx contains all the symbols defined in
arch/*/include/asm/pgalloc.h.

[rppt@linux.ibm.com: fix powerpc warning]

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9998a983 26-Jul-2020 Ricardo Neri <ricardo.neri-calderon@linux.intel.com>

x86/cpu: Relocate sync_core() to sync_core.h

Having sync_core() in processor.h is problematic since it is not possible
to check for hardware capabilities via the *cpu_has() family of macros.
The latter needs the definitions in processor.h.

It also looks more intuitive to relocate the function to sync_core.h.

This changeset does not make changes in functionality.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200727043132.15082-3-ricardo.neri-calderon@linux.intel.com


# 7f6fa101 23-Jul-2020 Ira Weiny <ira.weiny@intel.com>

x86: Correct noinstr qualifiers

The noinstr qualifier is to be specified before the return type in the
same way inline is used.

These 2 cases were missed by previous patches.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200723161405.852613-1-ira.weiny@intel.com


# 1b2e335e 15-Jun-2020 Borislav Petkov <bp@suse.de>

x86/alternatives: Add pr_fmt() to debug macros

... in order to have debug output prefixed with the pr_fmt text "SMP
alternatives:" which allows easy grepping:

$ dmesg | grep "SMP alternatives"
[ 0.167783] SMP alternatives: alt table ffffffff8272c780, -> ffffffff8272fd6e
[ 0.168620] SMP alternatives: feat: 3*32+16, old: (x86_64_start_kernel+0x37/0x73 \
(ffffffff826093f7) len: 5), repl: (ffffffff8272fd6e, len: 5), pad: 0
[ 0.170103] SMP alternatives: ffffffff826093f7: old_insn: e8 54 a8 da fe
[ 0.171184] SMP alternatives: ffffffff8272fd6e: rpl_insn: e8 cd 3e c8 fe
...

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200615175315.17301-1-bp@alien8.de


# d769811c 12-May-2020 Adrian Hunter <adrian.hunter@intel.com>

perf/x86: Add support for perf text poke event for text_poke_bp_batch() callers

Add support for perf text poke event for text_poke_bp_batch() callers. That
includes jump labels. See comments for more details.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200512121922.8997-3-adrian.hunter@intel.com


# f64366ef 20-Feb-2020 Peter Zijlstra <peterz@infradead.org>

x86/int3: Inline bsearch()

Avoid calling out to bsearch() by inlining it, for normal kernel configs
this was the last external call and poke_int3_handler() is now fully self
sufficient -- no calls to external code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135313.731774429@linutronix.de


# ef882bfe 24-Jan-2020 Peter Zijlstra <peterz@infradead.org>

x86/int3: Avoid atomic instrumentation

Use arch_atomic_*() and __READ_ONCE() to ensure nothing untoward
creeps in and ruins things.

That is; this is the INT3 text poke handler, strictly limit the code
that runs in it, lest it inadvertenly hits yet another INT3.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135313.517429268@linutronix.de


# 4979fb53 21-Jan-2020 Thomas Gleixner <tglx@linutronix.de>

x86/int3: Ensure that poke_int3_handler() is not traced

In order to ensure poke_int3_handler() is completely self contained -- this
is called while modifying other text, imagine the fun of hitting another
INT3 -- ensure that everything it uses is not traced.

The primary means here is to force inlining; bsearch() is notrace because
all of lib/ is.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135313.410702173@linutronix.de


# e31cf2f4 08-Jun-2020 Mike Rapoport <rppt@kernel.org>

mm: don't include asm/pgtable.h if linux/mm.h is already included

Patch series "mm: consolidate definitions of page table accessors", v2.

The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once. For
instance, we have 31 definition of pgd_offset() for 25 supported
architectures.

Most of these definitions are actually identical and typically it boils
down to, e.g.

static inline unsigned long pmd_index(unsigned long address)
{
return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}

static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}

These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

For architectures that really need a custom version there is always
possibility to override the generic version with the usual ifdefs magic.

These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.

This patch (of 12):

The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
functions involving page table manipulations, e.g. pte_alloc() and
pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
in the files that include <linux/mm.h>.

The include statements in such cases are remove with a simple loop:

for f in $(git grep -l "include <linux/mm.h>") ; do
sed -i -e '/include <asm\/pgtable.h>/ d' $f
done

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9020d395 21-Apr-2020 Thomas Gleixner <tglx@linutronix.de>

x86/alternatives: Move temporary_mm helpers into C

The only user of these inlines is the text poke code and this must not be
exposed to the world.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200421092559.139069561@linutronix.de


# 244febbe 03-Mar-2020 Qiujun Huang <hqjagain@gmail.com>

x86/alternatives: Mark text_poke_loc_init() static

The function is only used in this file so make it static.

[ bp: Massage. ]

Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1583253732-18988-1-git-send-email-hqjagain@gmail.com


# 3a125539 23-Jan-2020 Dave Hansen <dave.hansen@linux.intel.com>

x86/alternatives: add missing insn.h include

From: Dave Hansen <dave.hansen@linux.intel.com>

While testing my MPX removal series, Borislav noted compilation
failure with an allnoconfig build.

Turned out to be a missing include of insn.h in alternative.c.
With MPX, it got it implicitly from:

asm/mmu_context.h -> asm/mpx.h -> asm/insn.h

Fixes: c3d6324f841b ("x86/alternatives: Teach text_poke_bp() to emulate instructions")
Reported-by: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: x86@kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>


# 1f676247 10-Dec-2019 Peter Zijlstra <peterz@infradead.org>

x86/alternatives: Implement a better poke_int3_handler() completion scheme

Commit:

285a54efe386 ("x86/alternatives: Sync bp_patching update for avoiding NULL pointer exception")

added an additional text_poke_sync() IPI to text_poke_bp_batch() to
handle the rare case where another CPU is still inside an INT3 handler
while we clear the global state.

Instead of spraying IPIs around, count the active INT3 handlers and
wait for them to go away before proceeding to clear/reuse the data.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 285a54ef 26-Nov-2019 Masami Hiramatsu <mhiramat@kernel.org>

x86/alternatives: Sync bp_patching update for avoiding NULL pointer exception

ftracetest multiple_kprobes.tc testcase hits the following NULL pointer
exception:

BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD 800000007bf60067 P4D 800000007bf60067 PUD 7bf5f067 PMD 0
Oops: 0000 [#1] PREEMPT SMP PTI
RIP: 0010:poke_int3_handler+0x39/0x100
Call Trace:
<IRQ>
do_int3+0xd/0xf0
int3+0x42/0x50
RIP: 0010:sched_clock+0x6/0x10

poke_int3_handler+0x39 was alternatives:958:

static inline void *text_poke_addr(struct text_poke_loc *tp)
{
return _stext + tp->rel_addr; <------ Here is line #958
}

This seems to be caused by tp (bp_patching.vec) being NULL but
bp_patching.nr_entries != 0. There is a small chance for this
to happen, because we have no synchronization between the zeroing
of bp_patching.nr_entries and before clearing bp_patching.vec.

Steve suggested we could fix this by adding sync_core(), because int3
is done with interrupts disabled, and the on_each_cpu() requires
all CPUs to have had their interrupts enabled.

[ mingo: Edited the comments and the changelog. ]

Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Fixes: c0213b0ac03c ("x86/alternative: Batch of patch operations")
Link: https://lkml.kernel.org/r/157483421229.25881.15314414408559963162.stgit@devnote2
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 76ffa720 11-Nov-2019 Peter Zijlstra <peterz@infradead.org>

x86/alternatives: Use INT3_INSN_SIZE

Use INT3_INSN_SIZE instead of sizeof(int3).

Suggested-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191111132458.460144656@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 5c02ece8 09-Oct-2019 Peter Zijlstra <peterz@infradead.org>

x86/kprobes: Fix ordering while text-patching

Kprobes does something like:

register:
arch_arm_kprobe()
text_poke(INT3)
/* guarantees nothing, INT3 will become visible at some point, maybe */

kprobe_optimizer()
/* guarantees the bytes after INT3 are unused */
synchronize_rcu_tasks();
text_poke_bp(JMP32);
/* implies IPI-sync, kprobe really is enabled */

unregister:
__disarm_kprobe()
unoptimize_kprobe()
text_poke_bp(INT3 + tail);
/* implies IPI-sync, so tail is guaranteed visible */
arch_disarm_kprobe()
text_poke(old);
/* guarantees nothing, old will maybe become visible */

synchronize_rcu()

free-stuff

Now the problem is that on register, the synchronize_rcu_tasks() does
not imply sufficient to guarantee all CPUs have already observed INT3
(although in practice this is exceedingly unlikely not to have
happened) (similar to how MEMBARRIER_CMD_PRIVATE_EXPEDITED does not
imply MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE).

Worse, even if it did, we'd have to do 2 synchronize calls to provide
the guarantee we're looking for, the first to ensure INT3 is visible,
the second to guarantee nobody is then still using the instruction
bytes after INT3.

Similar on unregister; the synchronize_rcu() between
__unregister_kprobe_top() and __unregister_kprobe_bottom() does not
guarantee all CPUs are free of the INT3 (and observe the old text).

Therefore, sprinkle some IPI-sync love around. This guarantees that
all CPUs agree on the text and RCU once again provides the required
guaranteed.

Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191111132458.162172862@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 4531ef6a 08-Oct-2019 Peter Zijlstra <peterz@infradead.org>

x86/alternative: Shrink text_poke_loc

Employ the fact that all text must be within a s32 displacement of one
another to shrink the text_poke_loc::addr field. Make it relative to
_stext.

This then shrinks struct text_poke_loc to 16 bytes, and consequently
increases TP_VEC_MAX from 170 to 256.

Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191111132458.047052889@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 97e6c977 08-Oct-2019 Peter Zijlstra <peterz@infradead.org>

x86/alternative: Remove text_poke_loc::len

Per the BUG_ON(len != insn.length) in text_poke_loc_init(), tp->len
must indeed be the same as text_opcode_size(tp->opcode). Use this to
remove this field from the structure.

Sadly, due to 8 byte alignment, this only increases the structure
padding.

Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191111132457.989922744@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 67c1d4a2 08-Oct-2019 Peter Zijlstra <peterz@infradead.org>

x86/ftrace: Use text_gen_insn()

Replace the ftrace_code_union with the generic text_gen_insn() helper,
which does exactly this.

Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191111132457.932808000@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 254d2c04 08-Oct-2019 Peter Zijlstra <peterz@infradead.org>

x86/alternative: Add text_opcode_size()

Introduce a common helper to map *_INSN_OPCODE to *_INSN_SIZE.

Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191111132457.875666061@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 768ae440 26-Aug-2019 Peter Zijlstra <peterz@infradead.org>

x86/ftrace: Use text_poke()

Move ftrace over to using the generic x86 text_poke functions; this
avoids having a second/different copy of that code around.

This also avoids ftrace violating the (new) W^X rule and avoids
fragmenting the kernel text page-tables, due to no longer having to
toggle them RW.

Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191111132457.761255803@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 63f62add 03-Oct-2019 Peter Zijlstra <peterz@infradead.org>

x86/alternatives: Add and use text_gen_insn() helper

Provide a simple helper function to create common instruction
encodings.

Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191111132457.703538332@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 18cbc8be 26-Aug-2019 Peter Zijlstra <peterz@infradead.org>

x86/alternatives, jump_label: Provide better text_poke() batching interface

Adding another text_poke_bp_batch() user made me realize the interface
is all sorts of wrong. The text poke vector should be internal to the
implementation.

This then results in a trivial interface:

text_poke_queue() - which has the 'normal' text_poke_bp() interface
text_poke_finish() - which takes no arguments and flushes any
pending text_poke()s.

Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191111132457.646280715@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# c3d6324f 05-Jun-2019 Peter Zijlstra <peterz@infradead.org>

x86/alternatives: Teach text_poke_bp() to emulate instructions

In preparation for static_call and variable size jump_label support,
teach text_poke_bp() to emulate instructions, namely:

JMP32, JMP8, CALL, NOP2, NOP_ATOMIC5, INT3

The current text_poke_bp() takes a @handler argument which is used as
a jump target when the temporary INT3 is hit by a different CPU.

When patching CALL instructions, this doesn't work because we'd miss
the PUSH of the return address. Instead, teach poke_int3_handler() to
emulate an instruction, typically the instruction we're patching in.

This fits almost all text_poke_bp() users, except
arch_unoptimize_kprobe() which restores random text, and for that site
we have to build an explicit emulate instruction.

Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191111132457.529086974@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 8c7eebc10687af45ac8e40ad1bac0cf7893dba9f)
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 32b1cbe3 02-Sep-2019 Marco Ammon <marco.ammon@fau.de>

x86: Correct misc typos

Correct spelling typos in comments in different files under arch/x86/.

[ bp: Merge into a single patch, massage. ]

Signed-off-by: Marco Ammon <marco.ammon@fau.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pu Wen <puwen@hygon.cn>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: trivial@kernel.org
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190902102436.27396-1-marco.ammon@fau.de


# ecc60610 08-Jul-2019 Peter Zijlstra <peterz@infradead.org>

x86/alternatives: Fix int3_emulate_call() selftest stack corruption

KASAN shows the following splat during boot:

BUG: KASAN: unknown-crash in unwind_next_frame+0x3f6/0x490
Read of size 8 at addr ffffffff84007db0 by task swapper/0

CPU: 0 PID: 0 Comm: swapper Tainted: G T 5.2.0-rc6-00013-g7457c0d #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
dump_stack+0x19/0x1b
print_address_description+0x1b0/0x2b2
__kasan_report+0x10f/0x171
kasan_report+0x12/0x1c
__asan_load8+0x54/0x81
unwind_next_frame+0x3f6/0x490
unwind_next_frame+0x1b/0x23
arch_stack_walk+0x68/0xa5
stack_trace_save+0x7b/0xa0
save_trace+0x3c/0x93
mark_lock+0x1ef/0x9b1
lock_acquire+0x122/0x221
__mutex_lock+0xb6/0x731
mutex_lock_nested+0x16/0x18
_vm_unmap_aliases+0x141/0x183
vm_unmap_aliases+0x14/0x16
change_page_attr_set_clr+0x15e/0x2f2
set_memory_4k+0x2a/0x2c
check_bugs+0x11fd/0x1298
start_kernel+0x793/0x7eb
x86_64_start_reservations+0x55/0x76
x86_64_start_kernel+0x87/0xaa
secondary_startup_64+0xa4/0xb0

Memory state around the buggy address:
ffffffff84007c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1
ffffffff84007d00: f1 00 00 00 00 00 00 00 00 00 f2 f2 f2 f3 f3 f3
>ffffffff84007d80: f3 79 be 52 49 79 be 00 00 00 00 00 00 00 00 f1

It turns out that int3_selftest() is corrupting the stack. The problem is
that the KASAN-ified version of int3_magic() is much less trivial than the
C code appears. It clobbers several unexpected registers. So when the
selftest's INT3 is converted to an emulated call to int3_magic(), the
registers are clobbered and Bad Things happen when the function returns.

Fix this by converting int3_magic() to the trivial ASM function it should
be, avoiding all calling convention issues. Also add ASM_CALL_CONSTRAINT to
the INT3 ASM, since it contains a 'CALL'.

[peterz: cribbed changelog from josh]

Fixes: 7457c0da024b ("x86/alternatives: Add int3_emulate_call() selftest")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Debugged-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20190709125744.GB3402@hirez.programming.kicks-ass.net


# 7457c0da 02-May-2019 Peter Zijlstra <peterz@infradead.org>

x86/alternatives: Add int3_emulate_call() selftest

Given that the entry_*.S changes for this functionality are somewhat
tricky, make sure the paths are tested every boot, instead of on the
rare occasion when we trip an INT3 while rewriting text.

Requested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# c0213b0a 12-Jun-2019 Daniel Bristot de Oliveira <bristot@redhat.com>

x86/alternative: Batch of patch operations

Currently, the patch of an address is done in three steps:

-- Pseudo-code #1 - Current implementation ---

1) add an int3 trap to the address that will be patched
sync cores (send IPI to all other CPUs)
2) update all but the first byte of the patched range
sync cores (send IPI to all other CPUs)
3) replace the first byte (int3) by the first byte of replacing opcode
sync cores (send IPI to all other CPUs)

-- Pseudo-code #1 ---

When a static key has more than one entry, these steps are called once for
each entry. The number of IPIs then is linear with regard to the number 'n' of
entries of a key: O(n*3), which is O(n).

This algorithm works fine for the update of a single key. But we think
it is possible to optimize the case in which a static key has more than
one entry. For instance, the sched_schedstats jump label has 56 entries
in my (updated) fedora kernel, resulting in 168 IPIs for each CPU in
which the thread that is enabling the key is _not_ running.

With this patch, rather than receiving a single patch to be processed, a vector
of patches is passed, enabling the rewrite of the pseudo-code #1 in this
way:

-- Pseudo-code #2 - This patch ---
1) for each patch in the vector:
add an int3 trap to the address that will be patched

sync cores (send IPI to all other CPUs)

2) for each patch in the vector:
update all but the first byte of the patched range

sync cores (send IPI to all other CPUs)

3) for each patch in the vector:
replace the first byte (int3) by the first byte of replacing opcode

sync cores (send IPI to all other CPUs)
-- Pseudo-code #2 - This patch ---

Doing the update in this way, the number of IPI becomes O(3) with regard
to the number of keys, which is O(1).

The batch mode is done with the function text_poke_bp_batch(), that receives
two arguments: a vector of "struct text_to_poke", and the number of entries
in the vector.

The vector must be sorted by the addr field of the text_to_poke structure,
enabling the binary search of a handler in the poke_int3_handler function
(a fast path).

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/ca506ed52584c80f64de23f6f55ca288e5d079de.1560325897.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 457c8996 19-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Add SPDX license identifier for missed files

Add SPDX license identifiers to all files which:

- Have no license information of any form

- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 3950746d 25-Apr-2019 Nadav Amit <namit@vmware.com>

x86/alternatives: Add comment about module removal races

Add a comment to clarify that users of text_poke() must ensure that
no races with module removal take place.

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190426001143.4983-22-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 0a203df5 25-Apr-2019 Nadav Amit <namit@vmware.com>

x86/alternatives: Remove the return value of text_poke_*()

The return value of text_poke_early() and text_poke_bp() is useless.
Remove it.

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190426001143.4983-14-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# f2c65fb3 25-Apr-2019 Nadav Amit <namit@vmware.com>

x86/modules: Avoid breaking W^X while loading modules

When modules and BPF filters are loaded, there is a time window in
which some memory is both writable and executable. An attacker that has
already found another vulnerability (e.g., a dangling pointer) might be
able to exploit this behavior to overwrite kernel code. Prevent having
writable executable PTEs in this stage.

In addition, avoiding having W+X mappings can also slightly simplify the
patching of modules code on initialization (e.g., by alternatives and
static-key), as would be done in the next patch. This was actually the
main motivation for this patch.

To avoid having W+X mappings, set them initially as RW (NX) and after
they are set as RO set them as X as well. Setting them as executable is
done as a separate step to avoid one core in which the old PTE is cached
(hence writable), and another which sees the updated PTE (executable),
which would break the W^X protection.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lkml.kernel.org/r/20190426001143.4983-12-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# b3fd8e83 25-Apr-2019 Nadav Amit <namit@vmware.com>

x86/alternatives: Use temporary mm for text poking

text_poke() can potentially compromise security as it sets temporary
PTEs in the fixmap. These PTEs might be used to rewrite the kernel code
from other cores accidentally or maliciously, if an attacker gains the
ability to write onto kernel memory.

Moreover, since remote TLBs are not flushed after the temporary PTEs are
removed, the time-window in which the code is writable is not limited if
the fixmap PTEs - maliciously or accidentally - are cached in the TLB.
To address these potential security hazards, use a temporary mm for
patching the code.

Finally, text_poke() is also not conservative enough when mapping pages,
as it always tries to map 2 pages, even when a single one is sufficient.
So try to be more conservative, and do not map more than needed.

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190426001143.4983-8-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 4fc19708 26-Apr-2019 Nadav Amit <namit@vmware.com>

x86/alternatives: Initialize temporary mm for patching

To prevent improper use of the PTEs that are used for text patching, the
next patches will use a temporary mm struct. Initailize it by copying
the init mm.

The address that will be used for patching is taken from the lower area
that is usually used for the task memory. Doing so prevents the need to
frequently synchronize the temporary-mm (e.g., when BPF programs are
installed), since different PGDs are used for the task memory.

Finally, randomize the address of the PTEs to harden against exploits
that use these PTEs.

Suggested-by: Andy Lutomirski <luto@kernel.org>
Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: ard.biesheuvel@linaro.org
Cc: deneen.t.dock@intel.com
Cc: kernel-hardening@lists.openwall.com
Cc: kristen@linux.intel.com
Cc: linux_dti@icloud.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190426232303.28381-8-nadav.amit@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# e836673c 25-Apr-2019 Nadav Amit <namit@vmware.com>

x86/alternatives: Add text_poke_kgdb() to not assert the lock when debugging

text_mutex is currently expected to be held before text_poke() is
called, but kgdb does not take the mutex, and instead *supposedly*
ensures the lock is not taken and will not be acquired by any other core
while text_poke() is running.

The reason for the "supposedly" comment is that it is not entirely clear
that this would be the case if gdb_do_roundup is zero.

Create two wrapper functions, text_poke() and text_poke_kgdb(), which do
or do not run the lockdep assertion respectively.

While we are at it, change the return code of text_poke() to something
meaningful. One day, callers might actually respect it and the existing
BUG_ON() when patching fails could be removed. For kgdb, the return
value can actually be used.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 9222f606506c ("x86/alternatives: Lockdep-enforce text_mutex in text_poke*()")
Link: https://lkml.kernel.org/r/20190426001143.4983-2-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 46938cc8 25-Apr-2019 Ingo Molnar <mingo@kernel.org>

x86/paravirt: Rename paravirt_patch_site::instrtype to paravirt_patch_site::type

It's used as 'type' in almost every paravirt patching function, so standardize
the field name from the somewhat weird 'instrtype' name to 'type'.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 1fc654cf 25-Apr-2019 Ingo Molnar <mingo@kernel.org>

x86/paravirt: Standardize 'insn_buff' variable names

We currently have 6 (!) separate naming variants to name temporary instruction
buffers that are used for code patching:

- insnbuf
- insnbuff
- insn_buff
- insn_buffer
- ibuf
- ibuffer

These are used as local variables, percpu fields and function parameters.

Standardize all the names to a single variant: 'insn_buff'.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# c13324a5 12-Feb-2019 Masami Hiramatsu <mhiramat@kernel.org>

x86/kprobes: Prohibit probing on functions before kprobe_int3_handler()

Prohibit probing on the functions called before kprobe_int3_handler()
in do_int3(). More specifically, ftrace_int3_handler(),
poke_int3_handler(), and ist_enter(). And since rcu_nmi_enter() is
called by ist_enter(), it also should be marked as NOKPROBE_SYMBOL.

Since those are handled before kprobe_int3_handler(), probing those
functions can cause a breakpoint recursion and crash the kernel.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrea Righi <righi.andrea@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/154998793571.31052.11301258949601150994.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# c1d4e419 09-Dec-2018 Borislav Petkov <bp@suse.de>

x86/alternatives: Print containing function

... in the "debug-alternative" output so that one can find her way
easier when staring at the vmlinux disassembly.

For example:

apply_alternatives: feat: 3*32+18, old: (read_tsc+0x0/0x10 (ffffffff8101d1c0) len: 5), repl: (ffffffff824e6d33, len: 5)
^^^^^^^^^^^^^^^^^
ffffffff8101d1c0: old_insn: 0f 31 90 90 90
ffffffff824e6d33: rpl_insn: 0f ae e8 0f 31
ffffffff8101d1c0: final_insn: 0f ae e8 0f 31

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: X86 ML <x86@kernel.org>
Link: https://lkml.kernel.org/r/20181211222326.14581-3-bp@alien8.de


# c3fecca4 23-Sep-2018 Pu Wen <puwen@hygon.cn>

x86/alternative: Init ideal_nops for Hygon Dhyana

The ideal_nops for Hygon Dhyana CPU should be p6_nops.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: tglx@linutronix.de
Cc: mingo@redhat.com
Cc: hpa@zytor.com
Cc: x86@kernel.org
Cc: thomas.lendacky@amd.com
Link: https://lkml.kernel.org/r/79e76c3173716984fe5fdd4a8e2c798bf4193205.1537533369.git.puwen@hygon.cn


# 5c83511b 28-Aug-2018 Juergen Gross <jgross@suse.com>

x86/paravirt: Use a single ops structure

Instead of using six globally visible paravirt ops structures combine
them in a single structure, keeping the original structures as
sub-structures.

This avoids the need to assemble struct paravirt_patch_template at
runtime on the stack each time apply_paravirt() is being called (i.e.
when loading a module).

[ tglx: Made the struct and the initializer tabular for readability sake ]

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: xen-devel@lists.xenproject.org
Cc: virtualization@lists.linux-foundation.org
Cc: akataria@vmware.com
Cc: rusty@rustcorp.com.au
Cc: boris.ostrovsky@oracle.com
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/20180828074026.820-9-jgross@suse.com


# abc745f8 28-Aug-2018 Juergen Gross <jgross@suse.com>

x86/paravirt: Remove clobbers parameter from paravirt patch functions

The clobbers parameter from paravirt_patch_default() et al isn't used
any longer. Remove it.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: xen-devel@lists.xenproject.org
Cc: virtualization@lists.linux-foundation.org
Cc: akataria@vmware.com
Cc: rusty@rustcorp.com.au
Cc: boris.ostrovsky@oracle.com
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/20180828074026.820-7-jgross@suse.com


# 9222f606 28-Aug-2018 Jiri Kosina <jkosina@suse.cz>

x86/alternatives: Lockdep-enforce text_mutex in text_poke*()

text_poke() and text_poke_bp() must be called with text_mutex held.

Put proper lockdep anotation in place instead of just mentioning the
requirement in a comment.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1808280853520.25787@cbobk.fhfr.pm


# 6fffacb3 19-Jul-2018 Pavel Tatashin <pasha.tatashin@oracle.com>

x86/alternatives, jumplabel: Use text_poke_early() before mm_init()

It supposed to be safe to modify static branches after jump_label_init().
But, because static key modifying code eventually calls text_poke() it can
end up accessing a struct page which has not been initialized yet.

Here is how to quickly reproduce the problem. Insert code like this
into init/main.c:

| +static DEFINE_STATIC_KEY_FALSE(__test);
| asmlinkage __visible void __init start_kernel(void)
| {
| char *command_line;
|@@ -587,6 +609,10 @@ asmlinkage __visible void __init start_kernel(void)
| vfs_caches_init_early();
| sort_main_extable();
| trap_init();
|+ {
|+ static_branch_enable(&__test);
|+ WARN_ON(!static_branch_likely(&__test));
|+ }
| mm_init();

The following warnings show-up:
WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:701 text_poke+0x20d/0x230
RIP: 0010:text_poke+0x20d/0x230
Call Trace:
? text_poke_bp+0x50/0xda
? arch_jump_label_transform+0x89/0xe0
? __jump_label_update+0x78/0xb0
? static_key_enable_cpuslocked+0x4d/0x80
? static_key_enable+0x11/0x20
? start_kernel+0x23e/0x4c8
? secondary_startup_64+0xa5/0xb0

---[ end trace abdc99c031b8a90a ]---

If the code above is moved after mm_init(), no warning is shown, as struct
pages are initialized during handover from memblock.

Use text_poke_early() in static branching until early boot IRQs are enabled
and from there switch to text_poke. Also, ensure text_poke() is never
invoked when unitialized memory access may happen by using adding a
!after_bootmem assertion.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-9-pasha.tatashin@oracle.com


# 12c69f1e 30-Jan-2018 Josh Poimboeuf <jpoimboe@redhat.com>

x86/paravirt: Remove 'noreplace-paravirt' cmdline option

The 'noreplace-paravirt' option disables paravirt patching, leaving the
original pv indirect calls in place.

That's highly incompatible with retpolines, unless we want to uglify
paravirt even further and convert the paravirt calls to retpolines.

As far as I can tell, the option doesn't seem to be useful for much
other than introducing surprising corner cases and making the kernel
vulnerable to Spectre v2. It was probably a debug option from the early
paravirt days. So just remove it.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Link: https://lkml.kernel.org/r/20180131041333.2x6blhxirc2kclrq@treble


# 0e6c16c6 26-Jan-2018 Borislav Petkov <bp@suse.de>

x86/alternative: Print unadorned pointers

After commit ad67b74d2469 ("printk: hash addresses printed with %p")
pointers are being hashed when printed. However, this makes the alternative
debug output completely useless. Switch to %px in order to see the
unadorned kernel pointers.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: riel@redhat.com
Cc: ak@linux.intel.com
Cc: peterz@infradead.org
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: jikos@kernel.org
Cc: luto@amacapital.net
Cc: dave.hansen@intel.com
Cc: torvalds@linux-foundation.org
Cc: keescook@google.com
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: tim.c.chen@linux.intel.com
Cc: gregkh@linux-foundation.org
Cc: pjt@google.com
Link: https://lkml.kernel.org/r/20180126121139.31959-2-bp@alien8.de


# 612e8e93 09-Jan-2018 Borislav Petkov <bp@suse.de>

x86/alternatives: Fix optimize_nops() checking

The alternatives code checks only the first byte whether it is a NOP, but
with NOPs in front of the payload and having actual instructions after it
breaks the "optimized' test.

Make sure to scan all bytes before deciding to optimize the NOPs in there.

Reported-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: https://lkml.kernel.org/r/20180110112815.mgciyf5acwacphkq@pd.tnic


# e846d139 01-Nov-2017 Zhou Chengming <zhouchengming1@huawei.com>

kprobes, x86/alternatives: Use text_mutex to protect smp_alt_modules

We use alternatives_text_reserved() to check if the address is in
the fixed pieces of alternative reserved, but the problem is that
we don't hold the smp_alt mutex when call this function. So the list
traversal may encounter a deleted list_head if another path is doing
alternatives_smp_module_del().

One solution is that we can hold smp_alt mutex before call this
function, but the difficult point is that the callers of this
functions, arch_prepare_kprobe() and arch_prepare_optimized_kprobe(),
are called inside the text_mutex. So we must hold smp_alt mutex
before we go into these arch dependent code. But we can't now,
the smp_alt mutex is the arch dependent part, only x86 has it.
Maybe we can export another arch dependent callback to solve this.

But there is a simpler way to handle this problem. We can reuse the
text_mutex to protect smp_alt_modules instead of using another mutex.
And all the arch dependent checks of kprobes are inside the text_mutex,
so it's safe now.

Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@suse.de
Fixes: 2cfa197 "ftrace/alternatives: Introducing *_text_reserved functions"
Link: http://lkml.kernel.org/r/1509585501-79466-1-git-send-email-zhouchengming1@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 01651324 30-Jul-2017 Peter Zijlstra <peterz@infradead.org>

x86: Clarify/fix no-op barriers for text_poke_bp()

So I was looking at text_poke_bp() today and I couldn't make sense of
the barriers there.

How's for something like so?

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: masami.hiramatsu.pt@hitachi.com
Link: http://lkml.kernel.org/r/20170731102154.f57cvkjtnbmtctk6@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# fc152d22 24-May-2017 Mateusz Jurczyk <mjurczyk@google.com>

x86/alternatives: Prevent uninitialized stack byte read in apply_alternatives()

In the current form of the code, if a->replacementlen is 0, the reference
to *insnbuf for comparison touches potentially garbage memory. While it
doesn't affect the execution flow due to the subsequent a->replacementlen
comparison, it is (rightly) detected as use of uninitialized memory by a
runtime instrumentation currently under my development, and could be
detected as such by other tools in the future, too (e.g. KMSAN).

Fix the "false-positive" by reordering the conditions to first check the
replacement instruction length before referencing specific opcode bytes.

Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Link: http://lkml.kernel.org/r/20170524135500.27223-1-mjurczyk@google.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 34bfab0e 03-Dec-2016 Borislav Petkov <bp@suse.de>

x86/alternatives: Do not use sync_core() to serialize I$

We use sync_core() in the alternatives code to stop speculative
execution of prefetched instructions because we are potentially changing
them and don't want to execute stale bytes.

What it does on most machines is call CPUID which is a serializing
instruction. And that's expensive.

However, the instruction cache is serialized when we're on the local CPU
and are changing the data through the same virtual address. So then, we
don't need the serializing CPUID but a simple control flow change. Last
being accomplished with a CALL/RET which the noinline causes.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: Matthew Whitehead <tedheadster@gmail.com>
Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161203150258.vwr5zzco7ctgc4pe@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 35de5b06 26-Apr-2016 Andy Lutomirski <luto@kernel.org>

x86/asm: Stop depending on ptrace.h in alternative.h

alternative.h pulls in ptrace.h, which means that alternatives can't
be used in anything referenced from ptrace.h, which is a mess.

Break the dependency by pulling text patching helpers into their own
header.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/99b93b13f2c9eb671f5c98bba4c2cbdc061293a2.1461698311.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 66c117d7 02-Sep-2015 Thomas Gleixner <tglx@linutronix.de>

x86/alternatives: Make optimize_nops() interrupt safe and synced

Richard reported the following crash:

[ 0.036000] BUG: unable to handle kernel paging request at 55501e06
[ 0.036000] IP: [<c0aae48b>] common_interrupt+0xb/0x38
[ 0.036000] Call Trace:
[ 0.036000] [<c0409c80>] ? add_nops+0x90/0xa0
[ 0.036000] [<c040a054>] apply_alternatives+0x274/0x630

Chuck decoded:

" 0: 8d 90 90 83 04 24 lea 0x24048390(%eax),%edx
6: 80 fc 0f cmp $0xf,%ah
9: a8 0f test $0xf,%al
>> b: a0 06 1e 50 55 mov 0x55501e06,%al
10: 57 push %edi
11: 56 push %esi

Interrupt 0x30 occurred while the alternatives code was replacing the
initial 0x90,0x90,0x90 NOPs (from the ASM_CLAC macro) with the
optimized version, 0x8d,0x76,0x00. Only the first byte has been
replaced so far, and it makes a mess out of the insn decoding."

optimize_nops() is buggy in two aspects:

- It's not disabling interrupts across the modification
- It's lacking a sync_core() call

Add both.

Fixes: 4fd4b6e5537c 'x86/alternatives: Use optimized NOPs for padding'
Reported-and-tested-by: "Richard W.M. Jones" <rjones@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard W.M. Jones <rjones@redhat.com>
Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1509031232340.15006@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 5e907bb0 30-Apr-2015 Ingo Molnar <mingo@kernel.org>

x86/alternatives, x86/fpu: Add 'alternatives_patched' debug flag and use it in xsave_state()

We'd like to use xsave_state() earlier, but its SYSTEM_BOOTING check
is too imprecise.

The real condition that xsave_state() would like to check is whether
alternative XSAVE instructions were patched into the kernel image
already.

Add such a (read-mostly) debug flag and use it in xsave_state().

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# f21262b8 11-May-2015 Borislav Petkov <bp@suse.de>

x86/alternatives: Switch AMD F15h and later to the P6 NOPs

Software optimization guides for both F15h and F16h cite those
NOPs as the optimal ones. A microbenchmark confirms that
actually even older families are better with the single-insn
NOPs so switch to them for the alternatives.

Cycles count below includes the loop overhead of the measurement
but that overhead is the same with all runs.

F10h, revE:
-----------
Running NOP tests, 1000 NOPs x 1000000 repetitions

K8:
90 288.212282 cycles
66 90 288.220840 cycles
66 66 90 288.219447 cycles
66 66 66 90 288.223204 cycles
66 66 90 66 90 571.393424 cycles
66 66 90 66 66 90 571.374919 cycles
66 66 66 90 66 66 90 572.249281 cycles
66 66 66 90 66 66 66 90 571.388651 cycles

P6:
90 288.214193 cycles
66 90 288.225550 cycles
0f 1f 00 288.224441 cycles
0f 1f 40 00 288.225030 cycles
0f 1f 44 00 00 288.233558 cycles
66 0f 1f 44 00 00 324.792342 cycles
0f 1f 80 00 00 00 00 325.657462 cycles
0f 1f 84 00 00 00 00 00 430.246643 cycles

F14h:
----
Running NOP tests, 1000 NOPs x 1000000 repetitions

K8:
90 510.404890 cycles
66 90 510.432117 cycles
66 66 90 510.561858 cycles
66 66 66 90 510.541865 cycles
66 66 90 66 90 1014.192782 cycles
66 66 90 66 66 90 1014.226546 cycles
66 66 66 90 66 66 90 1014.334299 cycles
66 66 66 90 66 66 66 90 1014.381205 cycles

P6:
90 510.436710 cycles
66 90 510.448229 cycles
0f 1f 00 510.545100 cycles
0f 1f 40 00 510.502792 cycles
0f 1f 44 00 00 510.589517 cycles
66 0f 1f 44 00 00 510.611462 cycles
0f 1f 80 00 00 00 00 511.166794 cycles
0f 1f 84 00 00 00 00 00 511.651641 cycles

F15h:
-----
Running NOP tests, 1000 NOPs x 1000000 repetitions

K8:
90 243.128396 cycles
66 90 243.129883 cycles
66 66 90 243.131631 cycles
66 66 66 90 242.499324 cycles
66 66 90 66 90 481.829083 cycles
66 66 90 66 66 90 481.884413 cycles
66 66 66 90 66 66 90 481.851446 cycles
66 66 66 90 66 66 66 90 481.409220 cycles

P6:
90 243.127026 cycles
66 90 243.130711 cycles
0f 1f 00 243.122747 cycles
0f 1f 40 00 242.497617 cycles
0f 1f 44 00 00 245.354461 cycles
66 0f 1f 44 00 00 361.930417 cycles
0f 1f 80 00 00 00 00 362.844944 cycles
0f 1f 84 00 00 00 00 00 480.514948 cycles

F16h:
-----
Running NOP tests, 1000 NOPs x 1000000 repetitions

K8:
90 507.793298 cycles
66 90 507.789636 cycles
66 66 90 507.826490 cycles
66 66 66 90 507.859075 cycles
66 66 90 66 90 1008.663129 cycles
66 66 90 66 66 90 1008.696259 cycles
66 66 66 90 66 66 90 1008.692517 cycles
66 66 66 90 66 66 66 90 1008.755399 cycles

P6:
90 507.795232 cycles
66 90 507.794761 cycles
0f 1f 00 507.834901 cycles
0f 1f 40 00 507.822629 cycles
0f 1f 44 00 00 507.838493 cycles
66 0f 1f 44 00 00 507.908597 cycles
0f 1f 80 00 00 00 00 507.946417 cycles
0f 1f 84 00 00 00 00 00 507.954960 cycles

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431332153-18566-2-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 69df353f 04-Apr-2015 Borislav Petkov <bp@suse.de>

x86/alternatives: Guard NOPs optimization

Take a look at the first instruction byte before optimizing the NOP -
there might be something else there already, like the ALTERNATIVE_2()
in rdtsc_barrier() which NOPs out on AMD even though we just
patched in an MFENCE.

This happens because the alternatives sees X86_FEATURE_MFENCE_RDTSC,
AMD CPUs set it, we patch in the MFENCE and right afterwards it sees
X86_FEATURE_LFENCE_RDTSC which AMD CPUs don't set and we blindly
optimize the NOP.

Checking whether at least the first byte is 0x90 prevents that.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1428181662-18020-1-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# dbe4058a 04-Apr-2015 Borislav Petkov <bp@suse.de>

x86/alternatives: Fix ALTERNATIVE_2 padding generation properly

Quentin caught a corner case with the generation of instruction
padding in the ALTERNATIVE_2 macro: if len(orig_insn) <
len(alt1) < len(alt2), then not enough padding gets added and
that is not good(tm) as we could overwrite the beginning of the
next instruction.

Luckily, at the time of this writing, we don't have
ALTERNATIVE_2() invocations which have that problem and even if
we did, a simple fix would be to prepend the instructions with
enough prefixes so that that corner case doesn't happen.

However, best it would be if we fixed it properly. See below for
a simple, abstracted example of what we're doing.

So what we ended up doing is, we compute the

max(len(alt1), len(alt2)) - len(orig_insn)

and feed that value to the .skip gas directive. The max() cannot
have conditionals due to gas limitations, thus the fancy integer
math.

With this patch, all ALTERNATIVE_2 sites get padded correctly;
generating obscure test cases pass too:

#define alt_max_short(a, b) ((a) ^ (((a) ^ (b)) & -(-((a) < (b)))))

#define gen_skip(orig, alt1, alt2, marker) \
.skip -((alt_max_short(alt1, alt2) - (orig)) > 0) * \
(alt_max_short(alt1, alt2) - (orig)),marker

.pushsection .text, "ax"
.globl main
main:
gen_skip(1, 2, 4, 0x09)
gen_skip(4, 1, 2, 0x10)
...
.popsection

Thanks to Quentin for catching it and double-checking the fix!

Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150404133443.GE21152@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# f39b6f0e 18-Mar-2015 Andy Lutomirski <luto@kernel.org>

x86/asm/entry: Change all 'user_mode_vm()' calls to 'user_mode()'

user_mode_vm() and user_mode() are now the same. Change all callers
of user_mode_vm() to user_mode().

The next patch will remove the definition of user_mode_vm.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/43b1f57f3df70df5a08b0925897c660725015554.1426728647.git.luto@kernel.org
[ Merged to a more recent kernel. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 4fd4b6e5 10-Jan-2015 Borislav Petkov <bp@suse.de>

x86/alternatives: Use optimized NOPs for padding

Alternatives allow now for an empty old instruction. In this case we go
and pad the space with NOPs at assembly time. However, there are the
optimal, longer NOPs which should be used. Do that at patching time by
adding alt_instr.padlen-sized NOPs at the old instruction address.

Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Borislav Petkov <bp@suse.de>


# 48c7a250 05-Jan-2015 Borislav Petkov <bp@suse.de>

x86/alternatives: Make JMPs more robust

Up until now we had to pay attention to relative JMPs in alternatives
about how their relative offset gets computed so that the jump target
is still correct. Or, as it is the case for near CALLs (opcode e8), we
still have to go and readjust the offset at patching time.

What is more, the static_cpu_has_safe() facility had to forcefully
generate 5-byte JMPs since we couldn't rely on the compiler to generate
properly sized ones so we had to force the longest ones. Worse than
that, sometimes it would generate a replacement JMP which is longer than
the original one, thus overwriting the beginning of the next instruction
at patching time.

So, in order to alleviate all that and make using JMPs more
straight-forward we go and pad the original instruction in an
alternative block with NOPs at build time, should the replacement(s) be
longer. This way, alternatives users shouldn't pay special attention
so that original and replacement instruction sizes are fine but the
assembler would simply add padding where needed and not do anything
otherwise.

As a second aspect, we go and recompute JMPs at patching time so that we
can try to make 5-byte JMPs into two-byte ones if possible. If not, we
still have to recompute the offsets as the replacement JMP gets put far
away in the .altinstr_replacement section leading to a wrong offset if
copied verbatim.

For example, on a locally generated kernel image

old insn VA: 0xffffffff810014bd, CPU feat: X86_FEATURE_ALWAYS, size: 2
__switch_to:
ffffffff810014bd: eb 21 jmp ffffffff810014e0
repl insn: size: 5
ffffffff81d0b23c: e9 b1 62 2f ff jmpq ffffffff810014f2

gets corrected to a 2-byte JMP:

apply_alternatives: feat: 3*32+21, old: (ffffffff810014bd, len: 2), repl: (ffffffff81d0b23c, len: 5)
alt_insn: e9 b1 62 2f ff
recompute_jumps: next_rip: ffffffff81d0b241, tgt_rip: ffffffff810014f2, new_displ: 0x00000033, ret len: 2
converted to: eb 33 90 90 90

and a 5-byte JMP:

old insn VA: 0xffffffff81001516, CPU feat: X86_FEATURE_ALWAYS, size: 2
__switch_to:
ffffffff81001516: eb 30 jmp ffffffff81001548
repl insn: size: 5
ffffffff81d0b241: e9 10 63 2f ff jmpq ffffffff81001556

gets shortened into a two-byte one:

apply_alternatives: feat: 3*32+21, old: (ffffffff81001516, len: 2), repl: (ffffffff81d0b241, len: 5)
alt_insn: e9 10 63 2f ff
recompute_jumps: next_rip: ffffffff81d0b246, tgt_rip: ffffffff81001556, new_displ: 0x0000003e, ret len: 2
converted to: eb 3e 90 90 90

... and so on.

This leads to a net win of around

40ish replacements * 3 bytes savings =~ 120 bytes of I$

on an AMD guest which means some savings of precious instruction cache
bandwidth. The padding to the shorter 2-byte JMPs are single-byte NOPs
which on smart microarchitectures means discarding NOPs at decode time
and thus freeing up execution bandwidth.

Signed-off-by: Borislav Petkov <bp@suse.de>


# 4332195c 27-Dec-2014 Borislav Petkov <bp@suse.de>

x86/alternatives: Add instruction padding

Up until now we have always paid attention to make sure the length of
the new instruction replacing the old one is at least less or equal to
the length of the old instruction. If the new instruction is longer, at
the time it replaces the old instruction it will overwrite the beginning
of the next instruction in the kernel image and cause your pants to
catch fire.

So instead of having to pay attention, teach the alternatives framework
to pad shorter old instructions with NOPs at buildtime - but only in the
case when

len(old instruction(s)) < len(new instruction(s))

and add nothing in the >= case. (In that case we do add_nops() when
patching).

This way the alternatives user shouldn't have to care about instruction
sizes and simply use the macros.

Add asm ALTERNATIVE* flavor macros too, while at it.

Also, we need to save the pad length in a separate struct alt_instr
member for NOP optimization and the way to do that reliably is to carry
the pad length instead of trying to detect whether we're looking at
single-byte NOPs or at pathological instruction offsets like e9 90 90 90
90, for example, which is a valid instruction.

Thanks to Michael Matz for the great help with toolchain questions.

Signed-off-by: Borislav Petkov <bp@suse.de>


# db477a33 30-Dec-2014 Borislav Petkov <bp@suse.de>

x86/alternatives: Cleanup DPRINTK macro

Make it pass __func__ implicitly. Also, dump info about each replacing
we're doing. Fixup comments and style while at it.

Signed-off-by: Borislav Petkov <bp@suse.de>


# 9c54b616 17-Apr-2014 Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>

kprobes, x86: Allow kprobes on text_poke/hw_breakpoint

Allow kprobes on text_poke/hw_breakpoint because
those are not related to the critical int3-debug
recursive path of kprobes at this moment.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Link: http://lkml.kernel.org/r/20140417081807.26341.73219.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 2a929242 27-Sep-2013 Borislav Petkov <bp@suse.de>

lockdep, x86/alternatives: Drop ancient lockdep fixup message

It messes up the output of the nodes/cores bootup table and it
is obsolete anyway, see

17abecfe651c x86: fix up alternatives with lockdep enabled

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: huawei.libin@huawei.com
Cc: wangyijing@huawei.com
Cc: fenghua.yu@intel.com
Cc: guohanjun@huawei.com
Cc: paul.gortmaker@windriver.com
Link: http://lkml.kernel.org/r/20130927143442.GE4422@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 17f41571 23-Jul-2013 Jiri Kosina <jkosina@suse.cz>

kprobes/x86: Call out into INT3 handler directly instead of using notifier

In fd4363fff3d96 ("x86: Introduce int3 (breakpoint)-based
instruction patching"), the mechanism that was introduced for
notifying alternatives code from int3 exception handler that and
exception occured was die_notifier.

This is however problematic, as early code might be using jump
labels even before the notifier registration has been performed,
which will then lead to an oops due to unhandled exception. One
of such occurences has been encountered by Fengguang:

int3: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.11.0-rc1-01429-g04bf576 #8
task: ffff88000da1b040 ti: ffff88000da1c000 task.ti: ffff88000da1c000
RIP: 0010:[<ffffffff811098cc>] [<ffffffff811098cc>] ttwu_do_wakeup+0x28/0x225
RSP: 0000:ffff88000dd03f10 EFLAGS: 00000006
RAX: 0000000000000000 RBX: ffff88000dd12940 RCX: ffffffff81769c40
RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000001
RBP: ffff88000dd03f28 R08: ffffffff8176a8c0 R09: 0000000000000002
R10: ffffffff810ff484 R11: ffff88000dd129e8 R12: ffff88000dbc90c0
R13: ffff88000dbc90c0 R14: ffff88000da1dfd8 R15: ffff88000da1dfd8
FS: 0000000000000000(0000) GS:ffff88000dd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000ffffffff CR3: 0000000001c88000 CR4: 00000000000006e0
Stack:
ffff88000dd12940 ffff88000dbc90c0 ffff88000da1dfd8 ffff88000dd03f48
ffffffff81109e2b ffff88000dd12940 0000000000000000 ffff88000dd03f68
ffffffff81109e9e 0000000000000000 0000000000012940 ffff88000dd03f98
Call Trace:
<IRQ>
[<ffffffff81109e2b>] ttwu_do_activate.constprop.56+0x6d/0x79
[<ffffffff81109e9e>] sched_ttwu_pending+0x67/0x84
[<ffffffff8110c845>] scheduler_ipi+0x15a/0x2b0
[<ffffffff8104dfb4>] smp_reschedule_interrupt+0x38/0x41
[<ffffffff8173bf5d>] reschedule_interrupt+0x6d/0x80
<EOI>
[<ffffffff810ff484>] ? __atomic_notifier_call_chain+0x5/0xc1
[<ffffffff8105cc30>] ? native_safe_halt+0xd/0x16
[<ffffffff81015f10>] default_idle+0x147/0x282
[<ffffffff81017026>] arch_cpu_idle+0x3d/0x5d
[<ffffffff81127d6a>] cpu_idle_loop+0x46d/0x5db
[<ffffffff81127f5c>] cpu_startup_entry+0x84/0x84
[<ffffffff8104f4f8>] start_secondary+0x3c8/0x3d5
[...]

Fix this by directly calling poke_int3_handler() from the int3
exception handler (analogically to what ftrace has been doing
already), instead of relying on notifier, registration of which
might not have yet been finalized by the time of the first trap.

Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1307231007490.14024@pobox.suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# ea8596bb 18-Jul-2013 Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>

kprobes/x86: Remove unused text_poke_smp() and text_poke_smp_batch() functions

Since introducing the text_poke_bp() for all text_poke_smp*()
callers, text_poke_smp*() are now unused. This patch basically
reverts:

3d55cc8a058e ("x86: Add text_poke_smp for SMP cross modifying code")
7deb18dcf047 ("x86: Introduce text_poke_smp_batch() for batch-code modifying")

and related commits.

This patch also fixes a Kconfig dependency issue on STOP_MACHINE
in the case of CONFIG_SMP && !CONFIG_MODULE_UNLOAD.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jiri Kosina <jkosina@suse.cz>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Borislav Petkov <bpetkov@suse.de>
Link: http://lkml.kernel.org/r/20130718114753.26675.18714.stgit@mhiramat-M0-7522
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# fd4363ff 12-Jul-2013 Jiri Kosina <jkosina@suse.cz>

x86: Introduce int3 (breakpoint)-based instruction patching

Introduce a method for run-time instruction patching on a live SMP kernel
based on int3 breakpoint, completely avoiding the need for stop_machine().

The way this is achieved:

- add a int3 trap to the address that will be patched
- sync cores
- update all but the first byte of the patched range
- sync cores
- replace the first byte (int3) by the first byte of
replacing opcode
- sync cores

According to

http://lkml.indiana.edu/hypermail/linux/kernel/1001.1/01530.html

synchronization after replacing "all but first" instructions should not
be necessary (on Intel hardware), as the syncing after the subsequent
patching of the first byte provides enough safety.
But there's not only Intel HW out there, and we'd rather be on a safe
side.

If any CPU instruction execution would collide with the patching,
it'd be trapped by the int3 breakpoint and redirected to the provided
"handler" (which would typically mean just skipping over the patched
region, acting as "nop" has been there, in case we are doing nop -> jump
and jump -> nop transitions).

Ftrace has been using this very technique since 08d636b ("ftrace/x86:
Have arch x86_64 use breakpoints instead of stop machine") for ages
already, and jump labels are another obvious potential user of this.

Based on activities of Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
a few years ago.

Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1307121102440.29788@pobox.suse.cz
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 65fc985b 20-Mar-2013 Borislav Petkov <bp@suse.de>

x86, cpu: Expand cpufeature facility to include cpu bugs

We add another 32-bit vector at the end of the ->x86_capability
bitvector which collects bugs present in CPUs. After all, a CPU bug is a
kind of a capability, albeit a strange one.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1363788448-31325-2-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# 4b8073e4 18-Sep-2012 Peter Senna Tschudin <peter.senna@gmail.com>

arch/x86: Remove unecessary semicolons

Found by http://coccinelle.lip6.fr/

Signed-off-by: Peter Senna Tschudin <peter.senna@gmail.com>
Cc: avi@redhat.com
Cc: mtosatti@redhat.com
Cc: a.p.zijlstra@chello.nl
Cc: rusty@rustcorp.com.au
Cc: masami.hiramatsu.pt@hitachi.com
Cc: suresh.b.siddha@intel.com
Cc: joerg.roedel@amd.com
Cc: agordeev@redhat.com
Cc: yinghai@kernel.org
Cc: bhelgaas@google.com
Cc: liuj97@gmail.com
Link: http://lkml.kernel.org/r/1347986174-30287-7-git-send-email-peter.senna@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 816afe4f 06-Aug-2012 Rusty Russell <rusty@rustcorp.com.au>

x86/smp: Don't ever patch back to UP if we unplug cpus

We still patch SMP instructions to UP variants if we boot with a
single CPU, but not at any other time. In particular, not if we
unplug CPUs to return to a single cpu.

Paul McKenney points out:

mean offline overhead is 6251/48=130.2 milliseconds.

If I remove the alternatives_smp_switch() from the offline
path [...] the mean offline overhead is 550/42=13.1 milliseconds

Basically, we're never going to get those 120ms back, and the
code is pretty messy.

We get rid of:

1) The "smp-alt-once" boot option. It's actually "smp-alt-boot", the
documentation is wrong. It's now the default.

2) The skip_smp_alternatives flag used by suspend.

3) arch_disable_nonboot_cpus_begin() and arch_disable_nonboot_cpus_end()
which were only used to set this one flag.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paul.mckenney@us.ibm.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/87vcgwwive.fsf@rustcorp.com.au
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# cb09cad4 22-Aug-2012 Avi Kivity <avi@redhat.com>

x86/alternatives: Fix p6 nops on non-modular kernels

Probably a leftover from the early days of self-patching, p6nops
are marked __initconst_or_module, which causes them to be
discarded in a non-modular kernel. If something later triggers
patching, it will overwrite kernel code with garbage.

Reported-by: Tomas Racek <tracek@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Cc: Michael Tokarev <mjt@tls.msk.ru>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: qemu-devel@nongnu.org
Cc: Anthony Liguori <anthony@codemonkey.ws>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Alan Cox <alan@linux.intel.com>
Link: http://lkml.kernel.org/r/5034AE84.90708@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# d6250a3f 25-Jul-2012 Alan Cox <alan@linux.intel.com>

x86, nops: Missing break resulting in incorrect selection on Intel

The Intel case falls through into the generic case which then changes
the values. For cases like the P6 it doesn't do the right thing so
this seems to be a screwup.

Signed-off-by: Alan Cox <alan@linux.intel.com>
Link: http://lkml.kernel.org/n/tip-lww2uirad4skzjlmrm0vru8o@git.kernel.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: <stable@vger.kernel.org>


# 2f747590 07-Jun-2012 OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

x86/alternatives: Use atomic_xchg() instead atomic_dec_and_test() for stop_machine_text_poke()

stop_machine_text_poke() uses atomic_dec_and_test() to select one of
the CPUs executing that function to actually modify the code.

Since the variable is initialized to 1, subsequent CPUs will make the
variable go negative. Since going negative is uncommon/unexpected in
typical dec_and_test usage change this user to atomic_xchg().

This was found using a patch that warns on dec_and_test going
negative.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
[ Rewrote changelog ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/87zk8fgsx9.fsf@devron.myhome.or.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# c767a54b 21-May-2012 Joe Perches <joe@perches.com>

x86/debug: Add KERN_<LEVEL> to bare printks, convert printks to pr_<level>

Use a more current logging style:

- Bare printks should have a KERN_<LEVEL> for consistency's sake
- Add pr_fmt where appropriate
- Neaten some macro definitions
- Convert some Ok output to OK
- Use "%s: ", __func__ in pr_fmt for summit
- Convert some printks to pr_<level>

Message output is not identical in all cases.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: levinsasha928@gmail.com
Link: http://lkml.kernel.org/r/1337655007.24226.10.camel@joe2Laptop
[ merged two similar patches, tidied up the changelog ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 78345d2e 27-Oct-2011 Rabin Vincent <rabin@rab.in>

x86: Call stop_machine_text_poke() on all CPUs

It appears that stop_machine_text_poke() wants to be called on all CPUs,
like it's done from text_poke_smp(). Fix text_poke_smp_batch() to do
this.

Signed-off-by: Rabin Vincent <rabin@rab.in>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Jason Baron <jbaron@redhat.com>
Link: http://lkml.kernel.org/r/1319702072-32676-1-git-send-email-rabin@rab.in
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 98d0ac38 14-Jul-2011 Andy Lutomirski <luto@mit.edu>

x86-64: Move vread_tsc and vread_hpet into the vDSO

The vsyscall page now consists entirely of trap instructions.

Cc: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Link: http://lkml.kernel.org/r/637648f303f2ef93af93bae25186e9a1bea093f5.1310639973.git.luto@mit.edu
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 59e97e4d 13-Jul-2011 Andy Lutomirski <luto@mit.edu>

x86: Make alternative instruction pointers relative

This save a few bytes on x86-64 and means that future patches can
apply alternatives to unrelocated code.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
Link: http://lkml.kernel.org/r/ff64a6b9a1a3860ca4a7b8b6dc7b4754f9491cd7.1310563276.git.luto@mit.edu
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 50973133 17-May-2011 Fenghua Yu <fenghua.yu@intel.com>

x86, alternative, doc: Add comment for applying alternatives order

Some string operation functions may be patched twice, e.g. on enhanced REP MOVSB
/STOSB processors, memcpy is patched first by fast string alternative function,
then it is patched by enhanced REP MOVSB/STOSB alternative function.

Add comment for applying alternatives order to warn people who may change the
applying alternatives order for any reason.

[ Documentation-only patch ]

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1305671358-14478-4-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# d8d9766c 18-Apr-2011 H. Peter Anvin <hpa@linux.intel.com>

x86, cpu: Change NOP selection for certain Intel CPUs

Due to a decoder implementation quirk, some specific Intel CPUs
actually perform better with the "k8_nops" than with the
SDM-recommended NOPs. For runtime-selected NOPs, if we detect those
specific CPUs then use the k8_nops instead of the ones we would
normally use.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jason Baron <jbaron@redhat.com>
Link: http://lkml.kernel.org/r/1303166160-10315-4-git-send-email-hpa@linux.intel.com


# dc326fca 18-Apr-2011 H. Peter Anvin <hpa@linux.intel.com>

x86, cpu: Clean up and unify the NOP selection infrastructure

Clean up and unify the NOP selection infrastructure:

- Make the atomic 5-byte NOP a part of the selection system.
- Pick NOPs once during early boot and then be done with it.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jason Baron <jbaron@redhat.com>
Link: http://lkml.kernel.org/r/1303166160-10315-3-git-send-email-hpa@linux.intel.com


# d430d3d7 16-Mar-2011 Jason Baron <jbaron@redhat.com>

jump label: Introduce static_branch() interface

Introduce:

static __always_inline bool static_branch(struct jump_label_key *key);

instead of the old JUMP_LABEL(key, label) macro.

In this way, jump labels become really easy to use:

Define:

struct jump_label_key jump_key;

Can be used as:

if (static_branch(&jump_key))
do unlikely code

enable/disale via:

jump_label_inc(&jump_key);
jump_label_dec(&jump_key);

that's it!

For the jump labels disabled case, the static_branch() becomes an
atomic_read(), and jump_label_inc()/dec() are simply atomic_inc(),
atomic_dec() operations. We show testing results for this change below.

Thanks to H. Peter Anvin for suggesting the 'static_branch()' construct.

Since we now require a 'struct jump_label_key *key', we can store a pointer into
the jump table addresses. In this way, we can enable/disable jump labels, in
basically constant time. This change allows us to completely remove the previous
hashtable scheme. Thanks to Peter Zijlstra for this re-write.

Testing:

I ran a series of 'tbench 20' runs 5 times (with reboots) for 3
configurations, where tracepoints were disabled.

jump label configured in
avg: 815.6

jump label *not* configured in (using atomic reads)
avg: 800.1

jump label *not* configured in (regular reads)
avg: 803.4

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20110316212947.GA8792@redhat.com>
Signed-off-by: Jason Baron <jbaron@redhat.com>
Suggested-by: H. Peter Anvin <hpa@linux.intel.com>
Tested-by: David Daney <ddaney@caviumnetworks.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# 0d2eb44f 17-Mar-2011 Lucas De Marchi <lucas.de.marchi@gmail.com>

x86: Fix common misspellings

They were generated by 'codespell' and then manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: trivial@kernel.org
LKML-Reference: <1300389856-1099-3-git-send-email-lucas.demarchi@profusion.mobi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 0e00f7ae 03-Mar-2011 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

x86: stop_machine_text_poke() should issue sync_core()

Intel Archiecture Software Developer's Manual section 7.1.3 specifies that a
core serializing instruction such as "cpuid" should be executed on _each_ core
before the new instruction is made visible.

Failure to do so can lead to unspecified behavior (Intel XMC erratas include
General Protection Fault in the list), so we should avoid this at all cost.

This problem can affect modified code executed by interrupt handlers after
interrupt are re-enabled at the end of stop_machine, because no core serializing
instruction is executed between the code modification and the moment interrupts
are reenabled.

Because stop_machine_text_poke performs the text modification from the first CPU
decrementing stop_machine_first, modified code executed in thread context is
also affected by this problem. To explain why, we have to split the CPUs in two
categories: the CPU that initiates the text modification (calls text_poke_smp)
and all the others. The scheduler, executed on all other CPUs after
stop_machine, issues an "iret" core serializing instruction, and therefore
handles core serialization for all these CPUs. However, the text modification
initiator can continue its execution on the same thread and access the modified
text without any scheduler call. Given that the CPU that initiates the code
modification is not guaranteed to be the one actually performing the code
modification, it falls into the XMC errata.

Q: Isn't this executed from an IPI handler, which will return with IRET (a
serializing instruction) anyway?
A: No, now stop_machine uses per-cpu workqueue, so that handler will be
executed from worker threads. There is no iret anymore.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
LKML-Reference: <20110303160137.GB1590@Krystal>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: <stable@kernel.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# d91309f6 11-Feb-2011 Peter Zijlstra <peterz@infradead.org>

x86: Fix text_poke_smp_batch() deadlock

Fix this deadlock - we are already holding the mutex:

=======================================================
[ INFO: possible circular locking dependency detected ] 2.6.38-rc4-test+ #1
-------------------------------------------------------
bash/1850 is trying to acquire lock:
(text_mutex){+.+.+.}, at: [<ffffffff8100a9c1>] return_to_handler+0x0/0x2f

but task is already holding lock:
(smp_alt){+.+...}, at: [<ffffffff8100a9c1>] return_to_handler+0x0/0x2f

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (smp_alt){+.+...}:
[<ffffffff81082d02>] lock_acquire+0xcd/0xf8
[<ffffffff8192e119>] __mutex_lock_common+0x4c/0x339
[<ffffffff8192e4ca>] mutex_lock_nested+0x3e/0x43
[<ffffffff8101050f>] alternatives_smp_switch+0x77/0x1d8
[<ffffffff81926a6f>] do_boot_cpu+0xd7/0x762
[<ffffffff819277dd>] native_cpu_up+0xe6/0x16a
[<ffffffff81928e28>] _cpu_up+0x9d/0xee
[<ffffffff81928f4c>] cpu_up+0xd3/0xe7
[<ffffffff82268d4b>] kernel_init+0xe8/0x20a
[<ffffffff8100ba24>] kernel_thread_helper+0x4/0x10

-> #1 (cpu_hotplug.lock){+.+.+.}:
[<ffffffff81082d02>] lock_acquire+0xcd/0xf8
[<ffffffff8192e119>] __mutex_lock_common+0x4c/0x339
[<ffffffff8192e4ca>] mutex_lock_nested+0x3e/0x43
[<ffffffff810568cc>] get_online_cpus+0x41/0x55
[<ffffffff810a1348>] stop_machine+0x1e/0x3e
[<ffffffff819314c1>] text_poke_smp_batch+0x3a/0x3c
[<ffffffff81932b6c>] arch_optimize_kprobes+0x10d/0x11c
[<ffffffff81933a51>] kprobe_optimizer+0x152/0x222
[<ffffffff8106bb71>] process_one_work+0x1d3/0x335
[<ffffffff8106cfae>] worker_thread+0x104/0x1a4
[<ffffffff810707c4>] kthread+0x9d/0xa5
[<ffffffff8100ba24>] kernel_thread_helper+0x4/0x10

-> #0 (text_mutex){+.+.+.}:

other info that might help us debug this:

6 locks held by bash/1850:
#0: (&buffer->mutex){+.+.+.}, at: [<ffffffff8100a9c1>] return_to_handler+0x0/0x2f
#1: (s_active#75){.+.+.+}, at: [<ffffffff8100a9c1>] return_to_handler+0x0/0x2f
#2: (x86_cpu_hotplug_driver_mutex){+.+.+.}, at: [<ffffffff8100a9c1>] return_to_handler+0x0/0x2f
#3: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff8100a9c1>] return_to_handler+0x0/0x2f
#4: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8100a9c1>] return_to_handler+0x0/0x2f
#5: (smp_alt){+.+...}, at: [<ffffffff8100a9c1>] return_to_handler+0x0/0x2f

stack backtrace:
Pid: 1850, comm: bash Not tainted 2.6.38-rc4-test+ #1
Call Trace:

[<ffffffff81080eb2>] print_circular_bug+0xa8/0xb7
[<ffffffff8192e4ca>] mutex_lock_nested+0x3e/0x43
[<ffffffff81010302>] alternatives_smp_unlock+0x3d/0x93
[<ffffffff81010630>] alternatives_smp_switch+0x198/0x1d8
[<ffffffff8102568a>] native_cpu_die+0x65/0x95
[<ffffffff818cc4ec>] _cpu_down+0x13e/0x202
[<ffffffff8117a619>] sysfs_write_file+0x108/0x144
[<ffffffff8111f5a2>] vfs_write+0xac/0xff
[<ffffffff8111f7a9>] sys_write+0x4a/0x6e

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: mathieu.desnoyers@efficios.com
Cc: rusty@rustcorp.com.au
Cc: ananth@in.ibm.com
Cc: masami.hiramatsu.pt@hitachi.com
Cc: fweisbec@gmail.com
Cc: jbeulich@novell.com
Cc: jbaron@redhat.com
Cc: mhiramat@redhat.com
LKML-Reference: <1297458466.5226.93.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 3fb82d56 23-Nov-2010 Suresh Siddha <suresh.b.siddha@intel.com>

x86, suspend: Avoid unnecessary smp alternatives switch during suspend/resume

During suspend, we disable all the non boot cpus. And during resume we bring
them all back again. So no need to do alternatives_smp_switch() in between.

On my core 2 based laptop, this speeds up the suspend path by 15msec and the
resume path by 5 msec (suspend/resume speed up differences can be attributed
to the different P-states that the cpu is in during suspend/resume).

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <1290557500.4946.8.camel@sbsiddha-MOBL3.sc.intel.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 7deb18dc 03-Dec-2010 Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>

x86: Introduce text_poke_smp_batch() for batch-code modifying

Introduce text_poke_smp_batch(). This function modifies several
text areas with one stop_machine() on SMP. Because calling
stop_machine() is heavy task, it is better to aggregate
text_poke requests.

( Note: I've talked with Rusty about this interface, and
he would not like to expand stop_machine() interface, since
it is not for generic use. )

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: 2nddept-manager@sdl.hitachi.co.jp
LKML-Reference: <20101203095422.2961.51217.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 404ba5d7 28-Oct-2010 Jason Baron <jbaron@redhat.com>

x86, alternative: Call stop_machine_text_poke() on all cpus

Currently, text_poke_smp() passes a NULL as the third argument to
__stop_machine(), which will only run stop_machine_text_poke()
on 1 cpu. Change NULL -> cpu_online_mask, as stop_machine_text_poke()
is intended to be run on all cpus.

I actually didn't notice any problems with stop_machine_text_poke()
only being called on 1 cpu, but found this via code inspection.

Signed-off-by: Jason Baron <jbaron@redhat.com>
LKML-Reference: <20101028152026.GB2875@redhat.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 2d1d7126 27-Oct-2010 H. Peter Anvin <hpa@zytor.com>

x86, ftrace: Use safe noops, drop trap test

Always use a safe 5-byte noop sequence. Drop the trap test, since it
is known to return false negatives on some virtualization platforms on
32 bits. The resulting code is both simpler and safer.

Cc: Daniel Drake <dsd@laptop.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# 3caa3751 13-Oct-2010 Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>

x86: Use __stop_machine() in text_poke_smp()

Use __stop_machine() in text_poke_smp() because the caller
must get online_cpus before calling text_poke_smp(), but
stop_machine() do it again. We don't need it.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: 2nddept-manager@sdl.hitachi.co.jp
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
LKML-Reference: <20101014031036.4100.83989.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# bf5438fc 17-Sep-2010 Jason Baron <jbaron@redhat.com>

jump label: Base patch for jump label

base patch to implement 'jump labeling'. Based on a new 'asm goto' inline
assembly gcc mechanism, we can now branch to labels from an 'asm goto'
statment. This allows us to create a 'no-op' fastpath, which can subsequently
be patched with a jump to the slowpath code. This is useful for code which
might be rarely used, but which we'd like to be able to call, if needed.
Tracepoints are the current usecase that these are being implemented for.

Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jason Baron <jbaron@redhat.com>
LKML-Reference: <ee8b3595967989fdaf84e698dc7447d315ce972a.1284733808.git.jbaron@redhat.com>

[ cleaned up some formating ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# fa6f2cc7 17-Sep-2010 Jason Baron <jbaron@redhat.com>

jump label: Make text_poke_early() globally visible

Make text_poke_early available outside of alternative.c. The jump label
patchset wants to make use of it in order to set up the optimal no-op
sequences at run-time.

Signed-off-by: Jason Baron <jbaron@redhat.com>
LKML-Reference: <04cfddf2ba77bcabfc3e524f1849d871d6a1cf9d.1284733808.git.jbaron@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# f49aa448 17-Sep-2010 Jason Baron <jbaron@redhat.com>

jump label: Make dynamic no-op selection available outside of ftrace

Move Steve's code for finding the best 5-byte no-op from ftrace.c to
alternative.c. The idea is that other consumers (in this case jump label)
want to make use of that code.

Signed-off-by: Jason Baron <jbaron@redhat.com>
LKML-Reference: <96259ae74172dcac99c0020c249743c523a92e18.1284733808.git.jbaron@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# 3b770a21 13-Jul-2010 H. Peter Anvin <hpa@linux.intel.com>

x86, alternatives: BUG on encountering an invalid CPU feature number

Make the alternatives-patching code BUG on encountering an invalid CPU
feature number. Should have done this a long time ago.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Yinghai Lu <yinhai@kernel.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <tip-df378ccfc4dd04e263426ad805516915874774aa@git.kernel.org>


# 5967ed87 21-Apr-2010 Jan Beulich <JBeulich@novell.com>

x86-64: Reduce SMP locks table size

Reduce the SMP locks table size by using relative pointers instead of
absolute ones, thus cutting the table size by half.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <4BCF30FE020000780003B3B6@vpn.id2.novell.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# e5a11016 03-Mar-2010 Masami Hiramatsu <mhiramat@redhat.com>

x86: Issue at least one memory barrier in stop_machine_text_poke()

Fix stop_machine_text_poke() to issue smp_mb() before exiting
waiting loop, and use cpu_relax() for waiting.

Changes in v2:
- Don't use ACCESS_ONCE().

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Jason Baron <jbaron@redhat.com>
LKML-Reference: <20100304033850.3819.74590.stgit@localhost6.localdomain6>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# b3ac891b 24-Feb-2010 Luca Barbieri <luca@luca-barbieri.com>

x86: Add support for lock prefix in alternatives

The current lock prefix UP/SMP alternative code doesn't allow
LOCK_PREFIX to be used in alternatives code.

This patch solves the problem by adding a new LOCK_PREFIX_ALTERNATIVE_PATCH
macro that only records the lock prefix location but does not emit
the prefix.

The user of this macro can then start any alternative sequence with
"lock" and have it UP/SMP patched.

To make this work, the UP/SMP alternative code is changed to do the
lock/DS prefix switching only if the byte actually contains a lock or
DS prefix.

Thus, if an alternative without the "lock" is selected, it will now do
nothing instead of clobbering the code.

Changes in v2:
- Naming change
- Change label to not conflict with alternatives

Signed-off-by: Luca Barbieri <luca@luca-barbieri.com>
LKML-Reference: <1267005265-27958-2-git-send-email-luca@luca-barbieri.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# 3d55cc8a 25-Feb-2010 Masami Hiramatsu <mhiramat@redhat.com>

x86: Add text_poke_smp for SMP cross modifying code

Add generic text_poke_smp for SMP which uses stop_machine()
to synchronize modifying code.
This stop_machine() method is officially described at "7.1.3
Handling Self- and Cross-Modifying Code" on the intel's
software developer's manual 3A.

Since stop_machine() can't protect code against NMI/MCE, this
function can not modify those handlers. And also, this function
is basically for modifying multibyte-single-instruction. For
modifying multibyte-multi-instructions, we need another special
trap & detour code.

This code originaly comes from immediate values with
stop_machine() version. Thanks Jason and Mathieu!

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Anders Kaseorg <andersk@ksplice.com>
Cc: Tim Abbott <tabbott@ksplice.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
LKML-Reference: <20100225133438.6725.80273.stgit@localhost6.localdomain6>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 076dc4a6 04-Feb-2010 Masami Hiramatsu <mhiramat@redhat.com>

x86/alternatives: Fix build warning

Fixes these warnings:

arch/x86/kernel/alternative.c: In function 'alternatives_text_reserved':
arch/x86/kernel/alternative.c:402: warning: comparison of distinct pointer types lacks a cast
arch/x86/kernel/alternative.c:402: warning: comparison of distinct pointer types lacks a cast
arch/x86/kernel/alternative.c:405: warning: comparison of distinct pointer types lacks a cast
arch/x86/kernel/alternative.c:405: warning: comparison of distinct pointer types lacks a cast

Caused by:

2cfa197: ftrace/alternatives: Introducing *_text_reserved functions

Changes in v2:
- Use local variables to compare, instead of type casts.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
LKML-Reference: <20100205171647.15750.37221.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 2cfa1978 02-Feb-2010 Masami Hiramatsu <mhiramat@redhat.com>

ftrace/alternatives: Introducing *_text_reserved functions

Introducing *_text_reserved functions for checking the text
address range is partially reserved or not. This patch provides
checking routines for x86 smp alternatives and dynamic ftrace.
Since both functions modify fixed pieces of kernel text, they
should reserve and protect those from other dynamic text
modifier, like kprobes.

This will also be extended when introducing other subsystems
which modify fixed pieces of kernel text. Dynamic text modifiers
should avoid those.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: przemyslaw@pawelczyk.it
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>
Cc: Jason Baron <jbaron@redhat.com>
LKML-Reference: <20100202214911.4694.16587.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 1b1d9258 18-Dec-2009 Jan Beulich <JBeulich@novell.com>

x86-64: Modify copy_user_generic() alternatives mechanism

In order to avoid unnecessary chains of branches, rather than
implementing copy_user_generic() as a function consisting of
just a single (possibly patched) branch, instead properly deal
with patching call instructions in the alternative instructions
framework, and move the patching into the callers.

As a follow-on, one could also introduce something like
__EXPORT_SYMBOL_ALT() to avoid patching call sites in modules.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <4B2BB8180200007800026AE7@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 5367b688 09-Sep-2009 Ben Hutchings <ben@decadent.org.uk>

x86: Fix code patching for paravirt-alternatives on 486

As reported in <http://bugs.debian.org/511703> and
<http://bugs.debian.org/515982>, kernels with paravirt-alternatives
enabled crash in text_poke_early() on at least some 486-class
processors.

The problem is that text_poke_early() itself uses inline functions
affected by paravirt-alternatives and so will modify instructions that
have already been prefetched. Pentium and later processors will
invalidate the prefetched instructions in this case, but 486-class
processors do not.

Change sync_core() to limit prefetching on 486-class (and 386-class)
processors, and move the call to sync_core() above the call to the
modifiable local_irq_restore().

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
LKML-Reference: <1252547631.3423.134.camel@localhost>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# 8b5a10fc 19-Aug-2009 Jan Beulich <JBeulich@novell.com>

x86: properly annotate alternatives.c

Some of the NOPs tables aren't used on 64-bits, quite some code and
data is needed post-init for module loading only, and a couple of
functions aren't used outside that file (i.e. can be static, and don't
need to be exported).

The change to __INITDATA/__INITRODATA is needed to avoid an assembler
warning.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <4A8BC8A00200007800010823@vpn.id2.novell.com>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# 7cf49427 08-Mar-2009 Masami Hiramatsu <mhiramat@redhat.com>

x86: expand irq-off region in text_poke()

Expand irq-off region to cover fixmap using code and cache synchronizing.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
LKML-Reference: <49B54688.8090403@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 78ff7fae 06-Mar-2009 Masami Hiramatsu <mhiramat@redhat.com>

x86: implement atomic text_poke() via fixmap

Use fixmaps instead of vmap/vunmap in text_poke() for avoiding
page allocation and delayed unmapping.

At the result of above change, text_poke() becomes atomic and can be called
from stop_machine() etc.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
LKML-Reference: <49B14352.2040705@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 3945dab4 06-Mar-2009 Masami Hiramatsu <mhiramat@redhat.com>

tracing, Text Edit Lock - SMP alternatives support

Use the mutual exclusion provided by the text edit lock in alternatives code.
Since alternative_smp_* will be called from module init code, etc,
we'd better protect it from other subsystems.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
LKML-Reference: <49B14332.9030109@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 34754b69 25-Feb-2009 Peter Zijlstra <a.p.zijlstra@chello.nl>

x86: make vmap yell louder when it is used under irqs_disabled()

Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 123aa76e 12-Feb-2009 Andi Kleen <andi@firstfloor.org>

x86, mce: don't disable machine checks during code patching

Impact: low priority bug fix

This removes part of a a patch I added myself some time ago. After some
consideration the patch was a bad idea. In particular it stopped machine check
exceptions during code patching.

To quote the comment:

* MCEs only happen when something got corrupted and in this
* case we must do something about the corruption.
* Ignoring it is worse than a unlikely patching race.
* Also machine checks tend to be broadcast and if one CPU
* goes into machine check the others follow quickly, so we don't
* expect a machine check to cause undue problems during to code
* patching.

So undo the machine check related parts of
8f4e956b313dcccbc7be6f10808952345e3b638c NMIs are still disabled.

This only removes code, the only additions are a new comment.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# 649c6653 05-Oct-2008 Thomas Gleixner <tglx@linutronix.de>

x86: improve UP kernel when CPU-hotplug and SMP is enabled

num_possible_cpus() can be > 1 when disabled CPUs have been accounted.

Disabled CPUs are not in the cpu_present_map, so we can use
num_present_cpus() as a safe indicator to switch to UP alternatives.

Reported-by: Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# f31d731e 18-Aug-2008 H. Peter Anvin <hpa@zytor.com>

x86: use X86_FEATURE_NOPL in alternatives

Use X86_FEATURE_NOPL to determine if it is safe to use P6 NOPs in
alternatives. Also, replace table and loop with simple if statement.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# f1c5d30e 18-Aug-2008 H. Peter Anvin <hpa@zytor.com>

x86: use X86_FEATURE_NOPL in alternatives

Use X86_FEATURE_NOPL to determine if it is safe to use P6 NOPs in
alternatives. Also, replace table and loop with simple if statement.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# f88f07e0 14-Aug-2008 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>

x86: alternatives : fix LOCK_PREFIX race with preemptible kernel and CPU hotplug

If a kernel thread is preempted in single-cpu mode right after the NOP (nop
about to be turned into a lock prefix), then we CPU hotplug a CPU, and then the
thread is scheduled back again, a SMP-unsafe atomic operation will be used on
shared SMP variables, leading to corruption. No corruption would happen in the
reverse case : going from SMP to UP is ok because we split a bit instruction
into tiny pieces, which does not present this condition.

Changing the 0x90 (single-byte nop) currently used into a 0x3E DS segment
override prefix should fix this issue. Since the default of the atomic
instructions is to use the DS segment anyway, it should not affect the
behavior.

The exception to this are references that use ESP/RSP and EBP/RBP as
the base register (they will use the SS segment), however, in Linux
(a) DS == SS at all times, and (b) we do not distinguish between
segment violations reported as #SS as opposed to #GP, so there is no
need to disassemble the instruction to figure out the suitable segment.

This patch assumes that the 0x3E prefix will leave atomic operations as-is (thus
assuming they normally touch data in the DS segment). Since there seem to be no
obvious ill-use of other segment override prefixes for atomic operations, it
should be safe. It can be verified with a quick

grep -r LOCK_PREFIX include/asm-x86/
grep -A 1 -r LOCK_PREFIX arch/x86/

Taken from

This source :
AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and System
Instructions
States
"Instructions that Reference a Non-Stack Segment—If an instruction encoding
references any base register other than rBP or rSP, or if an instruction
contains an immediate offset, the default segment is the data segment (DS).
These instructions can use the segment-override prefix to select one of the
non-default segments, as shown in Table 1-5."

Therefore, forcing the DS segment on the atomic operations, which already use
the DS segment, should not change.

This source :
http://wiki.osdev.org/X86_Instruction_Encoding
States
"In 64-bit the CS, SS, DS and ES segment overrides are ignored."

Confirmed by "AMD 64-Bit Technology" A.7
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/x86-64_overview.pdf

"In 64-bit mode, the DS, ES, SS and CS segment-override prefixes have no effect.
These four prefixes are no longer treated as segment-override prefixes in the
context of multipleprefix rules. Instead, they are treated as null prefixes."

This patch applies to 2.6.27-rc2, but would also have to be applied to earlier
kernels (2.6.26, 2.6.25, ...).

Performance impact of the fix : tests done on "xaddq" and "xaddl" shows it
actually improves performances on Intel Xeon, AMD64, Pentium M. It does not
change the performance on Pentium II, Pentium 3 and Pentium 4.

Xeon E5405 2.0GHz :
NR_TESTS 10000000
test empty cycles : 162207948
test test 1-byte nop xadd cycles : 170755422
test test DS override prefix xadd cycles : 170000118 *
test test LOCK xadd cycles : 472012134

AMD64 2.0GHz :
NR_TESTS 10000000
test empty cycles : 146674549
test test 1-byte nop xadd cycles : 150273860
test test DS override prefix xadd cycles : 149982382 *
test test LOCK xadd cycles : 270000690

Pentium 4 3.0GHz
NR_TESTS 10000000
test empty cycles : 290001195
test test 1-byte nop xadd cycles : 310000560
test test DS override prefix xadd cycles : 310000575 *
test test LOCK xadd cycles : 1050103740

Pentium M 2.0GHz
NR_TESTS 10000000
test empty cycles : 180000523
test test 1-byte nop xadd cycles : 320000345
test test DS override prefix xadd cycles : 310000374 *
test test LOCK xadd cycles : 480000357

Pentium 3 550MHz
NR_TESTS 10000000
test empty cycles : 510000231
test test 1-byte nop xadd cycles : 620000128
test test DS override prefix xadd cycles : 620000110 *
test test LOCK xadd cycles : 800000088

Pentium II 350MHz
NR_TESTS 10000000
test empty cycles : 200833494
test test 1-byte nop xadd cycles : 340000130
test test DS override prefix xadd cycles : 340000126 *
test test LOCK xadd cycles : 530000078

Speed test modules can be found at
http://ltt.polymtl.ca/svn/trunk/tests/kernel/test-prefix-speed-32.c
http://ltt.polymtl.ca/svn/trunk/tests/kernel/test-prefix-speed.c

Macro-benchmarks

2.0GHz E5405 Core 2 dual Quad-Core Xeon

Summary

* replace smp lock prefixes with DS segment selector prefixes
no lock prefix (s) with lock prefix (s) Speedup
make -j1 kernel/ 33.94 +/- 0.07 34.91 +/- 0.27 2.8 %
hackbench 50 2.99 +/- 0.01 3.74 +/- 0.01 25.1 %

* replace smp lock prefixes with 0x90 nops
no lock prefix (s) with lock prefix (s) Speedup
make -j1 kernel/ 34.16 +/- 0.32 34.91 +/- 0.27 2.2 %
hackbench 50 3.00 +/- 0.01 3.74 +/- 0.01 24.7 %

Detail :

1 CPU, replace smp lock prefixes with DS segment selector prefixes

make -j1 kernel/

real 0m34.067s
user 0m30.630s
sys 0m2.980s

real 0m33.867s
user 0m30.582s
sys 0m3.024s

real 0m33.939s
user 0m30.738s
sys 0m2.876s

real 0m33.913s
user 0m30.806s
sys 0m2.808s

avg : 33.94s
std. dev. : 0.07s

hackbench 50

Time: 2.978
Time: 2.982
Time: 3.010
Time: 2.984
Time: 2.982

avg : 2.99
std. dev. : 0.01

1 CPU, noreplace-smp

make -j1 kernel/

real 0m35.326s
user 0m30.630s
sys 0m3.260s

real 0m34.325s
user 0m30.802s
sys 0m3.084s

real 0m35.568s
user 0m30.722s
sys 0m3.168s

real 0m34.435s
user 0m30.886s
sys 0m2.996s

avg.: 34.91s
std. dev. : 0.27s

hackbench 50

Time: 3.733
Time: 3.750
Time: 3.761
Time: 3.737
Time: 3.741

avg : 3.74
std. dev. : 0.01

1 CPU, replace smp lock prefixes with 0x90 nops

make -j1 kernel/

real 0m34.139s
user 0m30.782s
sys 0m2.820s

real 0m34.010s
user 0m30.630s
sys 0m2.976s

real 0m34.777s
user 0m30.658s
sys 0m2.916s

real 0m33.924s
user 0m30.634s
sys 0m2.924s

real 0m33.962s
user 0m30.774s
sys 0m2.800s

real 0m34.141s
user 0m30.770s
sys 0m2.828s

avg : 34.16
std. dev. : 0.32

hackbench 50

Time: 2.999
Time: 2.994
Time: 3.004
Time: 2.991
Time: 2.988

avg : 3.00
std. dev. : 0.01

I did more runs (20 runs of each) to compare the nop case to the DS
prefix case. Results in seconds. They actually does not seems to show a
significant difference.

NOP

34.155
33.955
34.012
35.299
35.679
34.141
33.995
35.016
34.254
33.957
33.957
34.008
35.013
34.494
33.893
34.295
34.314
34.854
33.991
34.132

DS

34.080
34.304
34.374
35.095
34.291
34.135
33.940
34.208
35.276
34.288
33.861
33.898
34.610
34.709
33.851
34.256
35.161
34.283
33.865
35.078

Used http://www.graphpad.com/quickcalcs/ttest1.cfm?Format=C to do the
T-test (yeah, I'm lazy) :

Group Group One (DS prefix) Group Two (nops)
Mean 34.37815 34.37070
SD 0.46108 0.51905
SEM 0.10310 0.11606
N 20 20

P value and statistical significance:
The two-tailed P value equals 0.9620
By conventional criteria, this difference is considered to be not statistically significant.

Confidence interval:
The mean of Group One minus Group Two equals 0.00745
95% confidence interval of this difference: From -0.30682 to 0.32172

Intermediate values used in calculations:
t = 0.0480
df = 38
standard error of difference = 0.155

So, unless these calculus are completely bogus, the difference between the nop
and the DS case seems not to be statistically significant.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: H. Peter Anvin <hpa@zytor.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Roland McGrath <roland@redhat.com>
CC: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
CC: Steven Rostedt <srostedt@redhat.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: David Miller <davem@davemloft.net>
CC: Ulrich Drepper <drepper@redhat.com>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Gregory Haskins <ghaskins@novell.com>
CC: Arnaldo Carvalho de Melo <acme@redhat.com>
CC: "Luis Claudio R. Goncalves" <lclaudio@uudg.org>
CC: Clark Williams <williams@redhat.com>
CC: Christoph Lameter <cl@linux-foundation.org>
CC: Andi Kleen <andi@firstfloor.org>
CC: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# 2f1dafe5 12-May-2008 Pekka Paalanen <pq@iki.fi>

x86: fix SMP alternatives: use mutex instead of spinlock, text_poke is sleepable

text_poke is sleepable.
The original fix by Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>.

Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# dfa60aba 12-May-2008 Steven Rostedt <srostedt@redhat.com>

ftrace: use nops instead of jmp

This patch patches the call to mcount with nops instead
of a jmp over the mcount call.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 00c6b2d5 25-Apr-2008 Ingo Molnar <mingo@elte.hu>

x86: harden kernel code patching

Signed-off-by: Ingo Molnar <mingo@elte.hu>


# b7b66baa 24-Apr-2008 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>

x86: clean up text_poke()

Clean up the codepath, remove alignment restrictions and do sanity
checking of the end result, to make sure we patched the right site.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 8b132ecb 27-Apr-2008 Jiri Slaby <jirislaby@kernel.org>

x86: fix text_poke()

kernel_text_address returns true even for modules which is not wanted
in text_poke. Use core_kernel_text instead.

This is a regression introduced in e587cadd8f47e202a30712e2906a65a0606d5865
which caused occasionaly crashes after suspend/resume.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
CC: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <andi@firstfloor.org>
CC: pageexec@freemail.hu
CC: H. Peter Anvin <hpa@zytor.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 15a601eb 12-Mar-2008 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>

x86: fix test_poke for vmalloced pages

* Ingo Molnar (mingo@elte.hu) wrote:
>
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
>
> > The shadow vmap for DEBUG_RODATA kernel text modification uses
> > virt_to_page to get the pages from the pointer address.
> >
> > However, I think vmalloc_to_page would be required in case the page is
> > used for modules.
> >
> > Since only the core kernel text is marked read-only, use
> > kernel_text_address() to make sure we only shadow map the core kernel
> > text, not modules.
>
> actually, i think we should mark module text readonly too.
>

Yes, but in the meantime, the x86 tree would need this patch to make
kprobes work correctly on modules.

I suspect that without this fix, with the enhanced hotplug and kprobes
patch, kprobes will use text_poke to insert breakpoints in modules
(vmalloced pages used), which will map the wrong pages and corrupt
random kernel locations instead of updating the correct page.

Work that would write protect the module pages should clearly be done,
but it can come in a later time. We have to make sure we interact
correctly with the page allocation debugging, as an example.

Here is the patch against x86.git 2.6.25-rc5 :

The shadow vmap for DEBUG_RODATA kernel text modification uses virt_to_page to
get the pages from the pointer address.

However, I think vmalloc_to_page would be required in case the page is used for
modules.

Since only the core kernel text is marked read-only, use kernel_text_address()
to make sure we only shadow map the core kernel text, not modules.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: akpm@linux-foundation.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# e587cadd 06-Mar-2008 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>

x86: enhance DEBUG_RODATA support - alternatives

Fix a memcpy that should be a text_poke (in apply_alternatives).

Use kernel_wp_save/kernel_wp_restore in text_poke to support DEBUG_RODATA
correctly and so the CPU HOTPLUG special case can be removed.

Add text_poke_early, for alternatives and paravirt boot-time and module load
time patching.

Changelog:

- Fix text_set and text_poke alignment check (mixed up bitwise and and or)
- Remove text_set
- Export add_nops, so it can be used by others.
- Document text_poke_early.
- Remove clflush, since it breaks some VIA architectures and is not strictly
necessary.
- Add kerneldoc to text_poke and text_poke_early.
- Create a second vmap instead of using the WP bit to support Xen and VMI.
- Move local_irq disable within text_poke and text_poke_early to be able to
be sleepable in these functions.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <andi@firstfloor.org>
CC: pageexec@freemail.hu
CC: H. Peter Anvin <hpa@zytor.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 77bf90ed 03-Mar-2008 Harvey Harrison <harvey.harrison@gmail.com>

x86: replace remaining __FUNCTION__ occurances

__FUNCTION__ is gcc-specific, use __func__

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# f4be31ec 09-Apr-2008 Steven Rostedt <rostedt@goodmis.org>

pop previous section in alternative.c

gcc expects all toplevel assembly to return to the original section type.
The code in alteranative.c does not do this. This caused some strange bugs
in sched-devel where code would end up in the .rodata section and when
the kernel sets the NX bit on all .rodata, the kernel would crash when
executing this code.

This patch adds a .previous marker to return the code back to the
original section.

Credit goes to Andrew Pinski for telling me it wasn't a gcc bug but a
bug in the toplevel asm code in the kernel. ;-)

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 17abecfe 30-Jan-2008 Ingo Molnar <mingo@elte.hu>

x86: fix up alternatives with lockdep enabled

An older binutils bug caused us to not fix up alternatives.
This problem involved mutex.c but we dont do lockdep section tricks
there anymore, so this workaround is moot. Keep the printk nevertheless,
just in case ... We can remove that later on.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# ca74a6f8 30-Jan-2008 Andi Kleen <ak@linux.intel.com>

x86: optimize lock prefix switching to run less frequently

On VMs implemented using JITs that cache translated code changing the lock
prefixes is a quite costly operation that forces the JIT to throw away and
retranslate a lot of code.

Previously a SMP kernel would rewrite the locks once for each CPU which
is quite unnecessary. This patch changes the code to never switch at boot in
the normal case (SMP kernel booting with >1 CPU) or only once for SMP kernel
on UP.

This makes a significant difference in boot up performance on AMD SimNow!
Also I expect it to be a little faster on native systems too because a smp
switch does a lot of text_poke()s which each synchronize the pipeline.

v1->v2: Rename max_cpus
v1->v2: Fix off by one in UP check (Thomas Gleixner)

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 53756d37 30-Jan-2008 Jeremy Fitzhardinge <jeremy@goop.org>

x86: add set/clear_cpu_cap operations

The patch to suppress bitops-related warnings added a pile of ugly
casts. Many of these were related to the management of x86 CPU
capabilities. Clean these up by adding specific set/clear_cpu_cap
macros, and use them consistently.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Andi Kleen <ak@suse.de>

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 92cb7612 19-Oct-2007 Mike Travis <travis@sgi.com>

x86: convert cpuinfo_x86 array to a per_cpu array

cpu_data is currently an array defined using NR_CPUS. This means that
we overallocate since we will rarely really use maximum configured cpus.
When NR_CPU count is raised to 4096 the size of cpu_data becomes
3,145,728 bytes.

These changes were adopted from the sparc64 (and ia64) code. An
additional field was added to cpuinfo_x86 to be a non-ambiguous cpu
index. This corresponds to the index into a cpumask_t as well as the
per_cpu index. It's used in various places like show_cpuinfo().

cpu_data is defined to be the boot_cpu_data structure for the NON-SMP
case.

Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Dmitry Torokhov <dtor@mail.ru>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Cc: Mark M. Hoffman <mhoffman@lightlink.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 32c464f5 17-Oct-2007 Jan Beulich <jbeulich@novell.com>

x86: multi-byte single instruction NOPs

Add support for and use the multi-byte NOPs recently documented to be
available on all PentiumPro and later processors.

This patch only applies cleanly on top of the "x86: misc.
constifications" patch sent earlier.

[ tglx: arch/x86 adaptation ]

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

arch/x86/kernel/alternative.c | 23 ++++++++++++++++++++++-
include/asm-x86/processor_32.h | 22 ++++++++++++++++++++++
include/asm-x86/processor_64.h | 22 ++++++++++++++++++++++
3 files changed, 66 insertions(+), 1 deletion(-)


# 121d7bf5 17-Oct-2007 Jan Beulich <jbeulich@novell.com>

x86: misc. constifications

Miscellaneous x86 stuff that can live in .rodata.

[ tglx: arch/x86 adaptation ]

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# f68fd5f4 17-Oct-2007 Fengguang Wu <wfg@mail.ustc.edu.cn>

x86: call free_init_pages() with irqs enabled in alternative_instructions()

In alternative_instructions(), call free_init_pages() with irqs enabled.

It fixes the warning message in smp_call_function*(), which should not be
called with irqs disabled.

[ 0.310000] CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
[ 0.310000] CPU: L2 Cache: 512K (64 bytes/line)
[ 0.310000] CPU 0/0 -> Node 0
[ 0.310000] SMP alternatives: switching to UP code
[ 0.310000] Freeing SMP alternatives: 25k freed
[ 0.310000] WARNING: at arch/x86_64/kernel/smp.c:397 smp_call_function_mask()
[ 0.310000]
[ 0.310000] Call Trace:
[ 0.310000] [<ffffffff8100dbde>] dump_trace+0x3ee/0x4a0
[ 0.310000] [<ffffffff8100dcd3>] show_trace+0x43/0x70
[ 0.310000] [<ffffffff8100dd15>] dump_stack+0x15/0x20
[ 0.310000] [<ffffffff8101cd44>] smp_call_function_mask+0x94/0xa0
[ 0.310000] [<ffffffff8101d0b2>] smp_call_function+0x32/0x40
[ 0.310000] [<ffffffff8104277f>] on_each_cpu+0x1f/0x50
[ 0.310000] [<ffffffff81026eac>] global_flush_tlb+0x8c/0x110
[ 0.310000] [<ffffffff81025c85>] free_init_pages+0xe5/0xf0
[ 0.310000] [<ffffffff81549b5e>] alternative_instructions+0x7e/0x150
[ 0.310000] [<ffffffff8154a2ea>] check_bugs+0x1a/0x20
[ 0.310000] [<ffffffff81540c4a>] start_kernel+0x2da/0x380
[ 0.310000] [<ffffffff81540132>] _sinittext+0x132/0x140
[ 0.310000]
[ 0.320000] ACPI: Core revision 20070126
[ 0.560000] Using local APIC timer interrupts.
[ 0.590000] Detected 62.496 MHz APIC timer.
[ 0.590000] Brought up 1 CPUs

[ tglx: arch/x86 adaptation ]

Cc: Laurent Vivier <Laurent.Vivier@bull.net>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 93b1eab3 16-Oct-2007 Jeremy Fitzhardinge <jeremy@xensource.com>

paravirt: refactor struct paravirt_ops into smaller pv_*_ops

This patch refactors the paravirt_ops structure into groups of
functionally related ops:

pv_info - random info, rather than function entrypoints
pv_init_ops - functions used at boot time (some for module_init too)
pv_misc_ops - lazy mode, which didn't fit well anywhere else
pv_time_ops - time-related functions
pv_cpu_ops - various privileged instruction ops
pv_irq_ops - operations for managing interrupt state
pv_apic_ops - APIC operations
pv_mmu_ops - operations for managing pagetables

There are several motivations for this:

1. Some of these ops will be general to all x86, and some will be
i386/x86-64 specific. This makes it easier to share common stuff
while allowing separate implementations where needed.

2. At the moment we must export all of paravirt_ops, but modules only
need selected parts of it. This allows us to export on a case by case
basis (and also choose which export license we want to apply).

3. Functional groupings make things a bit more readable.

Struct paravirt_ops is now only used as a template to generate
patch-site identifiers, and to extract function pointers for inserting
into jmp/calls when patching. It is only instantiated when needed.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@suse.de>
Cc: Zach Amsden <zach@vmware.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Anthony Liguory <aliguori@us.ibm.com>
Cc: "Glauber de Oliveira Costa" <glommer@gmail.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>


# b097976e 14-Oct-2007 Dave Jones <davej@redhat.com>

x86: fix missing include for vsyscall

> Maybe I just picked a bad time to try, but...
>
> arch/x86/kernel/alternative.c: In function 'apply_alternatives':
> arch/x86/kernel/alternative.c:191: error: 'VSYSCALL_START' undeclared (first use in this function)
> arch/x86/kernel/alternative.c:191: error: (Each undeclared identifier is reported only once
> arch/x86/kernel/alternative.c:191: error: for each function it appears in.)
> arch/x86/kernel/alternative.c:191: error: 'VSYSCALL_END' undeclared (first use in this function)
> make[1]: *** [arch/x86/kernel/alternative.o] Error 1
> make: *** [arch/x86/kernel] Error 2

Try this.

Include missing header for vsyscall.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 9a163ed8 11-Oct-2007 Thomas Gleixner <tglx@linutronix.de>

i386: move kernel

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>