Cross Reference: /linux-master/Documentation/admin-guide/sysctl/kernel.rst

History log of /linux-master/Documentation/admin-guide/sysctl/kernel.rst
Revision	Date	Author	Comments
# 19f0423f	23-Feb-2024	Huang Yiwei <quic_hyiwei@quicinc.com>	tracing: Support to dump instance traces by ftrace_dump_on_oops Currently ftrace only dumps the global trace buffer on an OOPs. For debugging a production usecase, instance trace will be helpful to check specific problems since global trace buffer may be used for other purposes. This patch extend the ftrace_dump_on_oops parameter to dump a specific or multiple trace instances: - ftrace_dump_on_oops=0: as before -- don't dump - ftrace_dump_on_oops[=1]: as before -- dump the global trace buffer on all CPUs - ftrace_dump_on_oops=2 or =orig_cpu: as before -- dump the global trace buffer on CPU that triggered the oops - ftrace_dump_on_oops=<instance_name>: new behavior -- dump the tracing instance matching <instance_name> - ftrace_dump_on_oops[=2/orig_cpu],<instance1_name>[=2/orig_cpu], <instrance2_name>[=2/orig_cpu]: new behavior -- dump the global trace buffer and multiple instance buffer on all CPUs, or only dump on CPU that triggered the oops if =2 or =orig_cpu is given Also, the sysctl node can handle the input accordingly. Link: https://lore.kernel.org/linux-trace-kernel/20240223083126.1817731-1-quic_hyiwei@quicinc.com Cc: Ross Zwisler <zwisler@google.com> Cc: <mhiramat@kernel.org> Cc: <mark.rutland@arm.com> Cc: <mcgrof@kernel.org> Cc: <keescook@chromium.org> Cc: <j.granados@samsung.com> Cc: <mathieu.desnoyers@efficios.com> Cc: <corbet@lwn.net> Signed-off-by: Huang Yiwei <quic_hyiwei@quicinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
# 2e3fc6ca	02-Feb-2024	Feng Tang <feng.tang@intel.com>	panic: add option to dump blocked tasks in panic_print For debugging kernel panics and other bugs, there is already an option of panic_print to dump all tasks' call stacks. On today's large servers running many containers, there could be thousands of tasks or more, and this will print out huge amount of call stacks, taking a lot of time (for serial console which is main target user case of panic_print). And in many cases, only those several tasks being blocked are key for the panic, so add an option to only dump blocked tasks' call stacks. [akpm@linux-foundation.org: clarify documentation a little] Link: https://lkml.kernel.org/r/20240202132042.3609657-1-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
# 9220066e	15-Jan-2024	Alexey Gladkov <legion@kernel.org>	docs: add information about ipc sysctls limitations After 25b21cb2f6d6 ("[PATCH] IPC namespace core") and 4e9823111bdc ("[PATCH] IPC namespace - shm") the shared memory page count stopped being global and started counting per ipc namespace. The documentation and shmget(2) still says that shmall is a global option. shmget(2): SHMALL System-wide limit on the total amount of shared memory, measured in units of the system page size. On Linux, this limit can be read and modified via /proc/sys/kernel/shmall. I think the changes made in 2006 should be documented. Link: https://lkml.kernel.org/r/09e99911071766958af488beb4e8a728a4f12135.1705333426.git.legion@kernel.org Signed-off-by: Alexey Gladkov <legion@kernel.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Link: https://lkml.kernel.org/r/ede20ddf7be48b93e8084c3be2e920841ee1a641.1663756794.git.legion@kernel.org Cc: Christian Brauner <brauner@kernel.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Joel Granados <joel.granados@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
# 94483490	13-Jan-2023	Ard Biesheuvel <ardb@kernel.org>	Documentation: Drop or replace remaining mentions of IA64 Drop or update mentions of IA64, as appropriate. Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
# 8f833c82	09-Oct-2023	Shrikanth Hegde <sshegde@linux.vnet.ibm.com>	sched/topology: Change behaviour of the 'sched_energy_aware' sysctl, based on the platform The 'sched_energy_aware' sysctl is available for the admin to disable/enable energy aware scheduling(EAS). EAS is enabled only if few conditions are met by the platform. They are, asymmetric CPU capacity, no SMT, schedutil CPUfreq governor, frequency invariant load tracking etc. A platform may boot without EAS capability, but could gain such capability at runtime. For example, changing/registering the cpufreq governor to schedutil. At present, though platform doesn't support EAS, this sysctl returns 1 and it ends up calling build_perf_domains on write to 1 and NOP when writing to 0. That is confusing and un-necessary. Desired behavior would be to have this sysctl to enable/disable the EAS on supported platform. On non-supported platform write to the sysctl would return not supported error and read of the sysctl would return empty. So sched_energy_aware returns empty - EAS is not possible at this moment This will include EAS capable platforms which have at least one EAS condition false during startup, e.g. not using the schedutil cpufreq governor sched_energy_aware returns 0 - EAS is supported but disabled by admin. sched_energy_aware returns 1 - EAS is supported and enabled. User can find out the reason why EAS is not possible by checking info messages. sched_is_eas_possible returns true if the platform can do EAS at this moment. Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20231009060037.170765-3-sshegde@linux.vnet.ibm.com
# 76d3ccec	21-Aug-2023	Matteo Rizzo <matteorizzo@google.com>	io_uring: add a sysctl to disable io_uring system-wide Introduce a new sysctl (io_uring_disabled) which can be either 0, 1, or 2. When 0 (the default), all processes are allowed to create io_uring instances, which is the current behavior. When 1, io_uring creation is disabled (io_uring_setup() will fail with -EPERM) for unprivileged processes not in the kernel.io_uring_group group. When 2, calls to io_uring_setup() fail with -EPERM regardless of privilege. Signed-off-by: Matteo Rizzo <matteorizzo@google.com> [JEM: modified to add io_uring_group] Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Link: https://lore.kernel.org/r/x49y1i42j1z.fsf@segfault.boston.devel.redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 57972127	02-Aug-2023	Alexandre Ghiti <alexghiti@rivosinc.com>	Documentation: admin-guide: Add riscv sysctl_perf_user_access riscv now uses this sysctl so document its usage for this architecture. Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
# e4624435	12-Jun-2023	Jonathan Corbet <corbet@lwn.net>	docs: arm64: Move arm64 documentation under Documentation/arch/ Architecture-specific documentation is being moved into Documentation/arch/ as a way of cleaning up the top-level documentation directory and making the docs hierarchy more closely match the source hierarchy. Move Documentation/arm64 into arch/ (along with the Chinese equvalent translations) and fix up documentation references. Cc: Will Deacon <will@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Hu Haowen <src.res@email.cn> Cc: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Yantengsi <siyanteng@loongson.cn> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# ff61f079	14-Mar-2023	Jonathan Corbet <corbet@lwn.net>	docs: move x86 documentation into Documentation/arch/ Move the x86 documentation under Documentation/arch/ as a way of cleaning up the top-level directory and making the structure of our docs more closely match the structure of the source directories it describes. All in-kernel references to the old paths have been updated. Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20230315211523.108836-1-corbet@lwn.net/ Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# a42aaad2	04-Jan-2023	Ricardo Ribalda <ribalda@chromium.org>	kexec: introduce sysctl parameters kexec_load_limit_* kexec allows replacing the current kernel with a different one. This is usually a source of concerns for sysadmins that want to harden a system. Linux already provides a way to disable loading new kexec kernel via kexec_load_disabled, but that control is very coard, it is all or nothing and does not make distinction between a panic kexec and a normal kexec. This patch introduces new sysctl parameters, with finer tuning to specify how many times a kexec kernel can be loaded. The sysadmin can set different limits for kexec panic and kexec reboot kernels. The value can be modified at runtime via sysctl, but only with a stricter value. With these new parameters on place, a system with loadpin and verity enabled, using the following kernel parameters: sysctl.kexec_load_limit_reboot=0 sysct.kexec_load_limit_panic=1 can have a good warranty that if initrd tries to load a panic kernel, a malitious user will have small chances to replace that kernel with a different one, even if they can trigger timeouts on the disk where the panic kernel lives. Link: https://lkml.kernel.org/r/20221114-disable-kexec-reset-v6-3-6a8531a09b9a@chromium.org Signed-off-by: Ricardo Ribalda <ribalda@chromium.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Baoquan He <bhe@redhat.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Philipp Rudo <prudo@redhat.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
# 06dcb013	04-Jan-2023	Ricardo Ribalda <ribalda@chromium.org>	Documentation: sysctl: correct kexec_load_disabled Patch series "kexec: Add new parameter to limit the access to kexec", v6. Add two parameter to specify how many times a kexec kernel can be loaded. These parameter allow hardening the system. While we are at it, fix a documentation issue and refactor some code. This patch (of 3): kexec_load_disabled affects both ``kexec_load`` and ``kexec_file_load`` syscalls. Make it explicit. Link: https://lkml.kernel.org/r/20221114-disable-kexec-reset-v6-0-6a8531a09b9a@chromium.org Link: https://lkml.kernel.org/r/20221114-disable-kexec-reset-v6-1-6a8531a09b9a@chromium.org Signed-off-by: Ricardo Ribalda <ribalda@chromium.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Baoquan He <bhe@redhat.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Guilherme G. Piccoli <gpiccoli@igalia.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Philipp Rudo <prudo@redhat.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
# 61a6fccc	10-Dec-2022	Huacai Chen <chenhuacai@kernel.org>	LoongArch: Add unaligned access support Loongson-2 series (Loongson-2K500, Loongson-2K1000) don't support unaligned access in hardware, while Loongson-3 series (Loongson-3A5000, Loongson-3C5000) are configurable whether support unaligned access in hardware. This patch add unaligned access emulation for those LoongArch processors without hardware support. Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
# 9fc9e278	17-Nov-2022	Kees Cook <keescook@chromium.org>	panic: Introduce warn_limit Like oops_limit, add warn_limit for limiting the number of warnings when panic_on_warn is not set. Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Petr Mladek <pmladek@suse.com> Cc: tangmeng <tangmeng@uniontech.com> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-doc@vger.kernel.org Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-5-keescook@chromium.org
# de92f657	02-Dec-2022	Kees Cook <keescook@chromium.org>	exit: Allow oops_limit to be disabled In preparation for keeping oops_limit logic in sync with warn_limit, have oops_limit == 0 disable checking the Oops counter. Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: linux-doc@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
# d4ccd54d	17-Nov-2022	Jann Horn <jannh@google.com>	exit: Put an upper limit on how often we can oops Many Linux systems are configured to not panic on oops; but allowing an attacker to oops the system really often can make even bugs that look completely unexploitable exploitable (like NULL dereferences and such) if each crash elevates a refcount by one or a lock is taken in read mode, and this causes a counter to eventually overflow. The most interesting counters for this are 32 bits wide (like open-coded refcounts that don't use refcount_t). (The ldsem reader count on 32-bit platforms is just 16 bits, but probably nobody cares about 32-bit platforms that much nowadays.) So let's panic the system if the kernel is constantly oopsing. The speed of oopsing 2^32 times probably depends on several factors, like how long the stack trace is and which unwinder you're using; an empirically important one is whether your console is showing a graphical environment or a text console that oopses will be printed to. In a quick single-threaded benchmark, it looks like oopsing in a vfork() child with a very short stack trace only takes ~510 microseconds per run when a graphical console is active; but switching to a text console that oopses are printed to slows it down around 87x, to ~45 milliseconds per run. (Adding more threads makes this faster, but the actual oops printing happens under &die_lock on x86, so you can maybe speed this up by a factor of around 2 and then any further improvement gets eaten up by lock contention.) It looks like it would take around 8-12 days to overflow a 32-bit counter with repeated oopsing on a multi-core X86 system running a graphical environment; both me (in an X86 VM) and Seth (with a distro kernel on normal hardware in a standard configuration) got numbers in that ballpark. 12 days aren't that short on a desktop system, and you'd likely need much longer on a typical server system (assuming that people don't run graphical desktop environments on their servers), and this is a very noisy and violent approach to exploiting the kernel; and it also seems to take orders of magnitude longer on some machines, probably because stuff like EFI pstore will slow it down a ton if that's active. Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20221107201317.324457-1-jannh@google.com Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-2-keescook@chromium.org
# 8603b6f5	03-Sep-2022	Oleksandr Natalenko <oleksandr@redhat.com>	core_pattern: add CPU specifier Statistically, in a large deployment regular segfaults may indicate a CPU issue. Currently, it is not possible to find out what CPU the segfault happened on. There are at least two attempts to improve segfault logging with this regard, but they do not help in case the logs rotate. Hence, lets make sure it is possible to permanently record a CPU the task ran on using a new core_pattern specifier. Link: https://lkml.kernel.org/r/20220903064330.20772-1-oleksandr@redhat.com Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com> Suggested-by: Renaud Métrich <rmetrich@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Grzegorz Halat <ghalat@redhat.com> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Joel Savitz <jsavitz@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Stephen Kitt <steve@sk2.org> Cc: Will Deacon <will@kernel.org> Cc: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
# 72720937	24-Oct-2022	Guilherme G. Piccoli <gpiccoli@igalia.com>	x86/split_lock: Add sysctl to control the misery mode Commit b041b525dab9 ("x86/split_lock: Make life miserable for split lockers") changed the way the split lock detector works when in "warn" mode; basically, it not only shows the warn message, but also intentionally introduces a slowdown through sleeping plus serialization mechanism on such task. Based on discussions in [0], seems the warning alone wasn't enough motivation for userspace developers to fix their applications. This slowdown is enough to totally break some proprietary (aka. unfixable) userspace[1]. Happens that originally the proposal in [0] was to add a new mode which would warns + slowdown the "split locking" task, keeping the old warn mode untouched. In the end, that idea was discarded and the regular/default "warn" mode now slows down the applications. This is quite aggressive with regards proprietary/legacy programs that basically are unable to properly run in kernel with this change. While it is understandable that a malicious application could DoS by split locking, it seems unacceptable to regress old/proprietary userspace programs through a default configuration that previously worked. An example of such breakage was reported in [1]. Add a sysctl to allow controlling the "misery mode" behavior, as per Thomas suggestion on [2]. This way, users running legacy and/or proprietary software are allowed to still execute them with a decent performance while still observing the warning messages on kernel log. [0] https://lore.kernel.org/lkml/20220217012721.9694-1-tony.luck@intel.com/ [1] https://github.com/doitsujin/dxvk/issues/2938 [2] https://lore.kernel.org/lkml/87pmf4bter.ffs@tglx/ [ dhansen: minor changelog tweaks, including clarifying the actual problem ] Fixes: b041b525dab9 ("x86/split_lock: Make life miserable for split lockers") Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Tested-by: Andre Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/all/20221024200254.635256-1-gpiccoli%40igalia.com
# aadc0cd5	29-Sep-2022	Stephen Kitt <steve@sk2.org>	docs: sysctl/fs: re-order, prettify This brings the text markup in line with sysctl/abi and sysctl/kernel: * the entries are ordered alphabetically * the table of contents is automatically generated * markup is used as appropriate for constants etc. The content isn't fully up-to-date but the obsolete entries are gone, so remove the kernel version mention. Signed-off-by: Stephen Kitt <steve@sk2.org> Link: https://lore.kernel.org/r/20220930102937.135841-6-steve@sk2.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# bfca3dd3	01-Sep-2022	Petr Vorel <pvorel@suse.cz>	kernel/utsname_sysctl.c: print kernel arch Print the machine hardware name (UTS_MACHINE) in /proc/sys/kernel/arch. This helps people who debug kernel with initramfs with minimal environment (i.e. without coreutils or even busybox) or allow to open sysfs file instead of run 'uname -m' in high level languages. Link: https://lkml.kernel.org/r/20220901194403.3819-1-pvorel@suse.cz Signed-off-by: Petr Vorel <pvorel@suse.cz> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: David Sterba <dsterba@suse.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
# c6833e10	13-Jul-2022	Huang Ying <ying.huang@intel.com>	memory tiering: rate limit NUMA migration throughput In NUMA balancing memory tiering mode, if there are hot pages in slow memory node and cold pages in fast memory node, we need to promote/demote hot/cold pages between the fast and cold memory nodes. A choice is to promote/demote as fast as possible. But the CPU cycles and memory bandwidth consumed by the high promoting/demoting throughput will hurt the latency of some workload because of accessing inflating and slow memory bandwidth contention. A way to resolve this issue is to restrict the max promoting/demoting throughput. It will take longer to finish the promoting/demoting. But the workload latency will be better. This is implemented in this patch as the page promotion rate limit mechanism. The number of the candidate pages to be promoted to the fast memory node via NUMA balancing is counted, if the count exceeds the limit specified by the users, the NUMA balancing promotion will be stopped until the next second. A new sysctl knob kernel.numa_balancing_promote_rate_limit_MBps is added for the users to specify the limit. Link: https://lkml.kernel.org/r/20220713083954.34196-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: osalvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
# 118b1366	13-Jul-2022	Laurent Dufour <ldufour@linux.ibm.com>	powerpc/pseries/mobility: set NMI watchdog factor during an LPM During an LPM, while the memory transfer is in progress on the arrival side, some latencies are generated when accessing not yet transferred pages on the arrival side. Thus, the NMI watchdog may be triggered too frequently, which increases the risk to hit an NMI interrupt in a bad place in the kernel, leading to a kernel panic. Disabling the Hard Lockup Watchdog until the memory transfer could be a too strong work around, some users would want this timeout to be eventually triggered if the system is hanging even during an LPM. Introduce a new sysctl variable nmi_watchdog_factor. It allows to apply a factor to the NMI watchdog timeout during an LPM. Just before the CPUs are stopped for the switchover sequence, the NMI watchdog timer is set to watchdog_thresh + factor% A value of 0 has no effect. The default value is 200, meaning that the NMI watchdog is set to 30s during LPM (based on a 10s watchdog_thresh value). Once the memory transfer is achieved, the factor is reset to 0. Setting this value to a high number is like disabling the NMI watchdog during an LPM. Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220713154729.80789-5-ldufour@linux.ibm.com
# 30fb8761	24-Jun-2022	Stephen Kitt <steve@sk2.org>	docs: admin-guide/sysctl: Fix rendering error Text in ``literal`` markup must be separated by word separators, so text like ``lowwater``% renders incorrectly. Add the suggested "\ " after two problematic occurrences. Signed-off-by: Stephen Kitt <steve@sk2.org> Link: https://lore.kernel.org/r/20220624110230.595740-1-steve@sk2.org [jc: tweaked to use "\ "] Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 069c4ea6	03-May-2022	Jason A. Donenfeld <Jason@zx2c4.com>	random: fix sysctl documentation nits A semicolon was missing, and the almost-alphabetical-but-not ordering was confusing, so regroup these by category instead. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
# 81c65365	24-Mar-2022	Joel Savitz <jsavitz@redhat.com>	Documentation/sysctl: document max_rcu_stall_to_panic commit dfe564045c65 ("rcu: Panic after fixed number of stalls") introduced a new systctl but no accompanying documentation. Add a simple entry to the documentation. Signed-off-by: Joel Savitz <jsavitz@redhat.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 8d470a45	23-Mar-2022	Guilherme G. Piccoli <gpiccoli@igalia.com>	panic: add option to dump all CPUs backtraces in panic_print Currently the "panic_print" parameter/sysctl allows some interesting debug information to be printed during a panic event. This is useful for example in cases the user cannot kdump due to resource limits, or if the user collects panic logs in a serial output (or pstore) and prefers a fast reboot instead of a kdump. Happens that currently there's no way to see all CPUs backtraces in a panic using "panic_print" on architectures that support that. We do have "oops_all_cpu_backtrace" sysctl, but although partially overlapping in the functionality, they are orthogonal in nature: "panic_print" is a panic tuning (and we have panics without oopses, like direct calls to panic() or maybe other paths that don't go through oops_enter() function), and the original purpose of "oops_all_cpu_backtrace" is to provide more information on oopses for cases in which the users desire to continue running the kernel even after an oops, i.e., used in non-panic scenarios. So, we hereby introduce an additional bit for "panic_print" to allow dumping the CPUs backtraces during a panic event. Link: https://lkml.kernel.org/r/20211109202848.610874-3-gpiccoli@igalia.com Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Reviewed-by: Feng Tang <feng.tang@intel.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Samuel Iglesias Gonsalvez <siglesias@igalia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# a1ff1de0	23-Mar-2022	Guilherme G. Piccoli <gpiccoli@igalia.com>	docs: sysctl/kernel: add missing bit to panic_print Patch series "Some improvements on panic_print". This is a mix of a documentation fix with some additions to the "panic_print" syscall / parameter. The goal here is being able to collect all CPUs backtraces during a panic event and also to enable "panic_print" in a kdump event - details of the reasoning and design choices in the patches. This patch (of 3): Commit de6da1e8bcf0 ("panic: add an option to replay all the printk message in buffer") added a new bit to the sysctl/kernel parameter "panic_print", but the documentation was added only in kernel-parameters.txt, not in the sysctl guide. Fix it here by adding bit 5 to sysctl admin-guide documentation. [rdunlap@infradead.org: fix table format warning] Link: https://lkml.kernel.org/r/20220109055635.6999-1-rdunlap@infradead.org Link: https://lkml.kernel.org/r/20211109202848.610874-1-gpiccoli@igalia.com Link: https://lkml.kernel.org/r/20211109202848.610874-2-gpiccoli@igalia.com Fixes: de6da1e8bcf0 ("panic: add an option to replay all the printk message in buffer") Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Reviewed-by: Feng Tang <feng.tang@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Samuel Iglesias Gonsalvez <siglesias@igalia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# c574bbe9	22-Mar-2022	Huang Ying <ying.huang@intel.com>	NUMA balancing: optimize page placement for memory tiering system With the advent of various new memory types, some machines will have multiple types of memory, e.g. DRAM and PMEM (persistent memory). The memory subsystem of these machines can be called memory tiering system, because the performance of the different types of memory are usually different. In such system, because of the memory accessing pattern changing etc, some pages in the slow memory may become hot globally. So in this patch, the NUMA balancing mechanism is enhanced to optimize the page placement among the different memory types according to hot/cold dynamically. In a typical memory tiering system, there are CPUs, fast memory and slow memory in each physical NUMA node. The CPUs and the fast memory will be put in one logical node (called fast memory node), while the slow memory will be put in another (faked) logical node (called slow memory node). That is, the fast memory is regarded as local while the slow memory is regarded as remote. So it's possible for the recently accessed pages in the slow memory node to be promoted to the fast memory node via the existing NUMA balancing mechanism. The original NUMA balancing mechanism will stop to migrate pages if the free memory of the target node becomes below the high watermark. This is a reasonable policy if there's only one memory type. But this makes the original NUMA balancing mechanism almost do not work to optimize page placement among different memory types. Details are as follows. It's the common cases that the working-set size of the workload is larger than the size of the fast memory nodes. Otherwise, it's unnecessary to use the slow memory at all. So, there are almost always no enough free pages in the fast memory nodes, so that the globally hot pages in the slow memory node cannot be promoted to the fast memory node. To solve the issue, we have 2 choices as follows, a. Ignore the free pages watermark checking when promoting hot pages from the slow memory node to the fast memory node. This will create some memory pressure in the fast memory node, thus trigger the memory reclaiming. So that, the cold pages in the fast memory node will be demoted to the slow memory node. b. Define a new watermark called wmark_promo which is higher than wmark_high, and have kswapd reclaiming pages until free pages reach such watermark. The scenario is as follows: when we want to promote hot-pages from a slow memory to a fast memory, but fast memory's free pages would go lower than high watermark with such promotion, we wake up kswapd with wmark_promo watermark in order to demote cold pages and free us up some space. So, next time we want to promote hot-pages we might have a chance of doing so. The choice "a" may create high memory pressure in the fast memory node. If the memory pressure of the workload is high, the memory pressure may become so high that the memory allocation latency of the workload is influenced, e.g. the direct reclaiming may be triggered. The choice "b" works much better at this aspect. If the memory pressure of the workload is high, the hot pages promotion will stop earlier because its allocation watermark is higher than that of the normal memory allocation. So in this patch, choice "b" is implemented. A new zone watermark (WMARK_PROMO) is added. Which is larger than the high watermark and can be controlled via watermark_scale_factor. In addition to the original page placement optimization among sockets, the NUMA balancing mechanism is extended to be used to optimize page placement according to hot/cold among different memory types. So the sysctl user space interface (numa_balancing) is extended in a backward compatible way as follow, so that the users can enable/disable these functionality individually. The sysctl is converted from a Boolean value to a bits field. The definition of the flags is, - 0: NUMA_BALANCING_DISABLED - 1: NUMA_BALANCING_NORMAL - 2: NUMA_BALANCING_MEMORY_TIERING We have tested the patch with the pmbench memory accessing benchmark with the 80:20 read/write ratio and the Gauss access address distribution on a 2 socket Intel server with Optane DC Persistent Memory Model. The test results shows that the pmbench score can improve up to 95.9%. Thanks Andrew Morton to help fix the document format error. Link: https://lkml.kernel.org/r/20220221084529.1052339-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Wei Xu <weixugc@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Feng Tang <feng.tang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 3624ba7b	09-Feb-2022	Huang Ying <ying.huang@intel.com>	sched/numa-balancing: Move some document to make it consistent with the code After commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs"), some NUMA balancing sysctls enclosed with SCHED_DEBUG has been moved to debugfs. This patch move the document for these sysctls from Documentation/admin-guide/sysctl/kernel.rst to Documentation/scheduler/sched-debug.rst to make the document consistent with the code. Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lkml.kernel.org/r/20220210052514.3038279-1-ying.huang@intel.com
# 95e6060c	10-Feb-2022	Jason A. Donenfeld <Jason@zx2c4.com>	random: remove ifdef'd out interrupt bench With tools like kbench9000 giving more finegrained responses, and this basically never having been used ever since it was initially added, let's just get rid of this. There is still work to be done on the interrupt handler, but this really isn't the way it's being developed. Cc: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
# 489c7fc4	05-Feb-2022	Jason A. Donenfeld <Jason@zx2c4.com>	random: always wake up entropy writers after extraction Now that POOL_BITS == POOL_MIN_BITS, we must unconditionally wake up entropy writers after every extraction. Therefore there's no point of write_wakeup_threshold, so we can move it to the dustbin of unused compatibility sysctls. While we're at it, we can fix a small comparison where we were waking up after <= min rather than < min. Cc: Theodore Ts'o <tytso@mit.edu> Suggested-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
# e2012600	08-Dec-2021	Rob Herring <robh@kernel.org>	arm64: perf: Add userspace counter access disable switch Like x86, some users may want to disable userspace PMU counter altogether. Add a sysctl 'perf_user_access' file to control userspace counter access. The default is '0' which is disabled. Writing '1' enables access. Note that x86 supports globally enabling user access by writing '2' to /sys/bus/event_source/devices/cpu/rdpmc. As there's not existing userspace support to worry about, this shouldn't be necessary for Arm. It could be added later if the need arises. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: linux-perf-users@vger.kernel.org Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20211208201124.310740-4-robh@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
# 0f60a29c	15-Nov-2021	Mauro Carvalho Chehab <mchehab+huawei@kernel.org>	docs: accounting: update delay-accounting.rst reference The file name: accounting/delay-accounting.rst should be, instead: Documentation/accounting/delay-accounting.rst. Also, there's no need to use doc:`foo`, as automarkup.py will automatically handle plain text mentions to Documentation/ files. So, update its cross-reference accordingly. Fixes: fcb501704554 ("delayacct: Document task_delayacct sysctl") Fixes: c3123552aad3 ("docs: accounting: convert to ReST") Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 256f7a67	28-Jun-2021	Wang Qing <wangqing@vivo.com>	doc: watchdog: modify the doc related to "watchdog/%u" "watchdog/%u" threads has be replaced by cpu_stop_work. The current description is extremely misleading. Link: https://lkml.kernel.org/r/1619687073-24686-5-git-send-email-wangqing@vivo.com Signed-off-by: Wang Qing <wangqing@vivo.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: "Guilherme G. Piccoli" <gpiccoli@canonical.com> Cc: Joe Perches <joe@perches.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Cc: Qais Yousef <qais.yousef@arm.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Santosh Sivaraj <santosh@fossix.org> Cc: Stephen Kitt <steve@sk2.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 2793e19d	16-Jun-2021	Mauro Carvalho Chehab <mchehab+huawei@kernel.org>	docs: admin-guide: sysctl: avoid using ReST :doc:`foo` markup The :doc:`foo` tag is auto-generated via automarkup.py. So, use the filename at the sources, instead of :doc:`foo`. Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Link: https://lore.kernel.org/r/12abd2290c7ebc05c89178d2556bea740bd70fac.1623824363.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# fcb50170	11-May-2021	Mel Gorman <mgorman@suse.de>	delayacct: Document task_delayacct sysctl Update sysctl/kernel.rst. Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210512114035.GH3672@suse.de
# 1e886090	20-Apr-2021	Rasmus Villemoes <linux@rasmusvillemoes.dk>	docs: admin-guide: update description for kernel.hotplug sysctl It's been a few releases since this defaulted to /sbin/hotplug. Update the text, and include pointers to the two CONFIG_UEVENT_HELPER{,_PATH} config knobs whose help text could provide more info, but also hint that the user probably doesn't need to care at all. Fixes: 7934779a69f1 ("Driver-Core: disable /sbin/hotplug by default") Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Link: https://lore.kernel.org/r/20210420120638.1104016-1-linux@rasmusvillemoes.dk Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 08389d88	11-May-2021	Daniel Borkmann <daniel@iogearbox.net>	bpf: Add kconfig knob for disabling unpriv bpf by default Add a kconfig knob which allows for unprivileged bpf to be disabled by default. If set, the knob sets /proc/sys/kernel/unprivileged_bpf_disabled to value of 2. This still allows a transition of 2 -> {0,1} through an admin. Similarly, this also still keeps 1 -> {1} behavior intact, so that once set to permanently disabled, it cannot be undone aside from a reboot. We've also added extra2 with max of 2 for the procfs handler, so that an admin still has a chance to toggle between 0 <-> 2. Either way, as an additional alternative, applications can make use of CAP_BPF that we added a while ago. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/74ec548079189e4e4dffaeb42b8987bb3c852eee.1620765074.git.daniel@iogearbox.net
# f4d3f25a	14-May-2021	Rasmus Villemoes <linux@rasmusvillemoes.dk>	docs: admin-guide: update description for kernel.modprobe sysctl When I added CONFIG_MODPROBE_PATH, I neglected to update Documentation/. It's still true that this defaults to /sbin/modprobe, but now via a level of indirection. So document that the kernel might have been built with something other than /sbin/modprobe as the initial value. Link: https://lkml.kernel.org/r/20210420125324.1246826-1-linux@rasmusvillemoes.dk Fixes: 17652f4240f7a ("modules: add CONFIG_MODPROBE_PATH") Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jessica Yu <jeyu@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 547f574f	02-Dec-2020	Mathieu Chouquet-Stringer <me@mathieu.digital>	docs: Update documentation to reflect what TAINT_CPU_OUT_OF_SPEC means Here's a patch updating the meaning of TAINT_CPU_OUT_OF_SPEC after Borislav introduced changes in a7e1f67ed29f and upcoming patches in tip. TAINT_CPU_OUT_OF_SPEC now means a bit more what it implies as the flag isn't set just because of a CPU misconfiguration or mismatch. Historically it was for SMP kernel oops on an officially SMP incapable processor but now it also covers CPUs whose MSRs have been incorrectly poked at from userspace, drivers being used on non supported architectures, broken firmware, mismatched CPUs, ... Update documentation and script to reflect that. Signed-off-by: Mathieu Chouquet-Stringer <me@mathieu.digital> Link: https://lore.kernel.org/r/20201202153244.709752-1-me@mathieu.digital Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 751d5b27	04-Dec-2020	Andrew Klychkov <andrew.a.klychkov@gmail.com>	Documentation: fix multiple typos found in the admin-guide subdirectory Fix thirty five typos in dm-integrity.rst, dm-raid.rst, dm-zoned.rst, verity.rst, writecache.rst, tsx_async_abort.rst, md.rst, bttv.rst, dvb_references.rst, frontend-cardlist.rst, gspca-cardlist.rst, ipu3.rst, remote-controller.rst, mm/index.rst, numaperf.rst, userfaultfd.rst, module-signing.rst, imx-ddr.rst, intel-speed-select.rst, intel_pstate.rst, ramoops.rst, abi.rst, kernel.rst, vm.rst Signed-off-by: Andrew Klychkov <andrew.a.klychkov@gmail.com> Link: https://lore.kernel.org/r/20201204072848.GA49895@spblnx124.lan Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# d151a23d	08-Dec-2020	Stephen Kitt <steve@sk2.org>	docs: clean up sysctl/kernel: titles, version This cleans up a few titles with extra colons, and removes the reference to kernel 2.2. The docs don't yet cover all of 5.10 or 5.11, but I think they're close enough. Most entries are documented, and have been checked against current kernels. Signed-off-by: Stephen Kitt <steve@sk2.org> Link: https://lore.kernel.org/r/20201208074922.30359-1-steve@sk2.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# f38c85f1	11-Aug-2020	Lepton Wu <ytht.net@gmail.com>	coredump: add %f for executable filename The document reads "%e" should be "executable filename" while actually it could be changed by things like pr_ctl PR_SET_NAME. People who uses "%e" in core_pattern get surprised when they find out they get thread name instead of executable filename. This is either a bug of document or a bug of code. Since the behavior of "%e" is there for long time, it could bring another surprise for users if we "fix" the code. So we just "fix" the document. And more, for users who really need the "executable filename" in core_pattern, we introduce a new "%f" for the real executable filename. We already have "%E" for executable path in kernel, so just reuse most of its code for the new added "%f" format. Signed-off-by: Lepton Wu <ytht.net@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200701031432.2978761-1-ytht.net@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 1f73d1ab	15-Jul-2020	Qais Yousef <qais.yousef@arm.com>	Documentation/sysctl: Document uclamp sysctl knobs Uclamp exposes 3 sysctl knobs: * sched_util_clamp_min * sched_util_clamp_max * sched_util_clamp_min_rt_default Document them in sysctl/kernel.rst. Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200716110347.19553-3-qais.yousef@arm.com
# ee74db08	03-Jul-2020	Randy Dunlap <rdunlap@infradead.org>	Documentation/admin-guide: sysctl/kernel: drop doubled word Drop the doubled word "set". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: linux-doc@vger.kernel.org Link: https://lore.kernel.org/r/20200704032020.21923-12-rdunlap@infradead.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 0b227076	23-Jun-2020	Stephen Kitt <steve@sk2.org>	docs: sysctl/kernel: document random This documents the random directory, based on the behaviour seen in drivers/char/random.c. Signed-off-by: Stephen Kitt <steve@sk2.org> Link: https://lore.kernel.org/r/20200623112514.10650-1-steve@sk2.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# e996919b	14-Jun-2020	Randy Dunlap <rdunlap@infradead.org>	Documentation: fix sysctl/kernel.rst heading format warnings Fix heading format warnings in admin-guide/sysctl/kernel.rst: Documentation/admin-guide/sysctl/kernel.rst:339: WARNING: Title underline too short. hung_task_all_cpu_backtrace: ================ Documentation/admin-guide/sysctl/kernel.rst:650: WARNING: Title underline too short. oops_all_cpu_backtrace: ================ Fixes: 0ec9dc9bcba0 ("kernel/hung_task.c: introduce sysctl to print all traces when a hung task is detected") Fixes: 60c958d8df9c ("panic: add sysctl to dump all CPUs backtraces on oops event") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Link: https://lore.kernel.org/r/8af1cb77-4b5a-64b9-da5d-f6a95e537f99@infradead.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 60c958d8	07-Jun-2020	Guilherme G. Piccoli <gpiccoli@canonical.com>	panic: add sysctl to dump all CPUs backtraces on oops event Usually when the kernel reaches an oops condition, it's a point of no return; in case not enough debug information is available in the kernel splat, one of the last resorts would be to collect a kernel crash dump and analyze it. The problem with this approach is that in order to collect the dump, a panic is required (to kexec-load the crash kernel). When in an environment of multiple virtual machines, users may prefer to try living with the oops, at least until being able to properly shutdown their VMs / finish their important tasks. This patch implements a way to collect a bit more debug details when an oops event is reached, by printing all the CPUs backtraces through the usage of NMIs (on architectures that support that). The sysctl added (and documented) here was called "oops_all_cpu_backtrace", and when set will (as the name suggests) dump all CPUs backtraces. Far from ideal, this may be the last option though for users that for some reason cannot panic on oops. Most of times oopses are clear enough to indicate the kernel portion that must be investigated, but in virtual environments it's possible to observe hypervisor/KVM issues that could lead to oopses shown in other guests CPUs (like virtual APIC crashes). This patch hence aims to help debug such complex issues without resorting to kdump. Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Link: http://lkml.kernel.org/r/20200327224116.21030-1-gpiccoli@canonical.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 0ec9dc9b	07-Jun-2020	Guilherme G. Piccoli <gpiccoli@canonical.com>	kernel/hung_task.c: introduce sysctl to print all traces when a hung task is detected Commit 401c636a0eeb ("kernel/hung_task.c: show all hung tasks before panic") introduced a change in that we started to show all CPUs backtraces when a hung task is detected _and_ the sysctl/kernel parameter "hung_task_panic" is set. The idea is good, because usually when observing deadlocks (that may lead to hung tasks), the culprit is another task holding a lock and not necessarily the task detected as hung. The problem with this approach is that dumping backtraces is a slightly expensive task, specially printing that on console (and specially in many CPU machines, as servers commonly found nowadays). So, users that plan to collect a kdump to investigate the hung tasks and narrow down the deadlock definitely don't need the CPUs backtrace on dmesg/console, which will delay the panic and pollute the log (crash tool would easily grab all CPUs traces with 'bt -a' command). Also, there's the reciprocal scenario: some users may be interested in seeing the CPUs backtraces but not have the system panic when a hung task is detected. The current approach hence is almost as embedding a policy in the kernel, by forcing the CPUs backtraces' dump (only) on hung_task_panic. This patch decouples the panic event on hung task from the CPUs backtraces dump, by creating (and documenting) a new sysctl called "hung_task_all_cpu_backtrace", analog to the approach taken on soft/hard lockups, that have both a panic and an "all_cpu_backtrace" sysctl to allow individual control. The new mechanism for dumping the CPUs backtraces on hung task detection respects "hung_task_warnings" by not dumping the traces in case there's no warnings left. Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Link: http://lkml.kernel.org/r/20200327223646.20779-1-gpiccoli@canonical.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# db38d5c1	07-Jun-2020	Rafael Aquini <aquini@redhat.com>	kernel: add panic_on_taint Analogously to the introduction of panic_on_warn, this patch introduces a kernel option named panic_on_taint in order to provide a simple and generic way to stop execution and catch a coredump when the kernel gets tainted by any given flag. This is useful for debugging sessions as it avoids having to rebuild the kernel to explicitly add calls to panic() into the code sites that introduce the taint flags of interest. For instance, if one is interested in proceeding with a post-mortem analysis at the point a given code path is hitting a bad page (i.e. unaccount_page_cache_page(), or slab_bug()), a coredump can be collected by rebooting the kernel with 'panic_on_taint=0x20' amended to the command line. Another, perhaps less frequent, use for this option would be as a means for assuring a security policy case where only a subset of taints, or no single taint (in paranoid mode), is allowed for the running system. The optional switch 'nousertaint' is handy in this particular scenario, as it will avoid userspace induced crashes by writes to sysctl interface /proc/sys/kernel/tainted causing false positive hits for such policies. [akpm@linux-foundation.org: tweak kernel-parameters.txt wording] Suggested-by: Qian Cai <cai@lca.pw> Signed-off-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Dave Young <dyoung@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Bunk <bunk@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: Takashi Iwai <tiwai@suse.de> Link: http://lkml.kernel.org/r/20200515175502.146720-1-aquini@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 997c798e	15-May-2020	Stephen Kitt <steve@sk2.org>	docs: sysctl/kernel: document unaligned controls This documents ignore-unaligned-usertrap, unaligned-dump-stack, and unaligned-trap, based on arch/arc/kernel/unaligned.c, arch/ia64/kernel/unaligned.c, and arch/parisc/kernel/unaligned.c. While we're at it, integrate unaligned-memory-access.txt into the docs tree. Signed-off-by: Stephen Kitt <steve@sk2.org> Link: https://lore.kernel.org/r/20200515212443.5012-1-steve@sk2.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 17444d9b	18-May-2020	Stephen Kitt <steve@sk2.org>	docs: sysctl/kernel: document ngroups_max This is a read-only export of NGROUPS_MAX. Signed-off-by: Stephen Kitt <steve@sk2.org> Link: https://lore.kernel.org/r/20200518145836.15816-1-steve@sk2.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# fdb1b5e0	18-May-2020	Jonathan Corbet <corbet@lwn.net>	Revert "docs: sysctl/kernel: document ngroups_max" This reverts commit 2f4c33063ad713e3a5b63002cf8362846e78bd71. The changes here were fine, but there's a non-documentation change to sysctl.c that makes messes elsewhere; those changes should have been done independently. Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 2f4c3306	15-May-2020	Stephen Kitt <steve@sk2.org>	docs: sysctl/kernel: document ngroups_max This is a read-only export of NGROUPS_MAX, so this patch also changes the declarations in kernel/sysctl.c to const. Signed-off-by: Stephen Kitt <steve@sk2.org> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20200515160222.7994-1-steve@sk2.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# d75829c1	29-Apr-2020	Stephen Kitt <steve@sk2.org>	docs: sysctl/kernel: document firmware_config Based on the firmware fallback mechanisms documentation and the implementation in drivers/base/firmware_loader/fallback.c. Signed-off-by: Stephen Kitt <steve@sk2.org> Link: https://lore.kernel.org/r/20200429205757.8677-2-steve@sk2.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 50cdae76	29-Apr-2020	Stephen Kitt <steve@sk2.org>	docs: sysctl/kernel: document ftrace entries Based on the ftrace documentation, the tp_printk boot parameter documentation, and the implementation in kernel/trace/trace.c. Signed-off-by: Stephen Kitt <steve@sk2.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20200429205757.8677-1-steve@sk2.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 01478b83	27-Apr-2020	Mauro Carvalho Chehab <mchehab+huawei@kernel.org>	docs: filesystems: convert devpts.txt to ReST - Add a SPDX header; - Add a document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Link: https://lore.kernel.org/r/4ac8f3a7edd4d817acf0d173ead7ef74fe010c6c.1588021877.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 6bc47621	23-Apr-2020	Stephen Kitt <steve@sk2.org>	docs: sysctl/kernel: document cad_pid Based on the implementation in kernel/sysctl.c (the proc_do_cad_pid() function), kernel/reboot.c, and include/linux/sched/signal.h. Signed-off-by: Stephen Kitt <steve@sk2.org> Link: https://lore.kernel.org/r/20200423183651.15365-1-steve@sk2.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 5d8e5aee	15-Mar-2020	Stephen Kitt <steve@sk2.org>	docs: sysctl/kernel: document BPF entries Based on the implementation in kernel/bpf/syscall.c, kernel/bpf/trampoline.c, include/linux/filter.h, and the documentation in bpftool-prog.rst. Signed-off-by: Stephen Kitt <steve@sk2.org> Link: https://lore.kernel.org/r/20200315122648.20558-1-steve@sk2.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 025b16f8	02-Apr-2020	Alexey Budankov <alexey.budankov@linux.intel.com>	doc/admin-guide: update kernel.rst with CAP_PERFMON information Update the kernel.rst documentation file with the information related to usage of CAP_PERFMON capability to secure performance monitoring and observability operations in system. Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Igor Lubashev <ilubashe@akamai.com> Cc: James Morris <jmorris@namei.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Serge Hallyn <serge@hallyn.com> Cc: Song Liu <songliubraving@fb.com> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: intel-gfx@lists.freedesktop.org Cc: linux-doc@vger.kernel.org Cc: linux-man@vger.kernel.org Cc: linux-security-module@vger.kernel.org Cc: selinux@vger.kernel.org Link: http://lore.kernel.org/lkml/84c32383-14a2-fa35-16b6-f9e59bd37240@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
# 52338dfb	14-Apr-2020	Eric Biggers <ebiggers@google.com>	docs: admin-guide: merge sections for the kernel.modprobe sysctl Documentation for the kernel.modprobe sysctl was added both by commit 0317c5371e6a ("docs: merge debugging-modules.txt into sysctl/kernel.rst") and by commit 6e7158250625 ("docs: admin-guide: document the kernel.modprobe sysctl"), resulting in the same sysctl being documented in two places. Merge these into one place. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Stephen Kitt <steve@sk2.org> Link: https://lore.kernel.org/r/20200414172430.230293-1-ebiggers@kernel.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 6e715825	10-Apr-2020	Eric Biggers <ebiggers@google.com>	docs: admin-guide: document the kernel.modprobe sysctl Document the kernel.modprobe sysctl in the same place that all the other kernel.* sysctls are documented. Make sure to mention how to use this sysctl to completely disable module autoloading, and how this sysctl relates to CONFIG_STATIC_USERMODEHELPER. [ebiggers@google.com: v5] Link: http://lkml.kernel.org/r/20200318230515.171692-4-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Jessica Yu <jeyu@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: NeilBrown <neilb@suse.com> Link: http://lkml.kernel.org/r/20200312202552.241885-4-ebiggers@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 0a07bef6	10-Mar-2020	Guilherme G. Piccoli <gpiccoli@canonical.com>	Documentation: Better document the softlockup_panic sysctl Commit 9c44bc03fff4 ("softlockup: allow panic on lockup") added the softlockup_panic sysctl, but didn't add information about it to the file Documentation/admin-guide/sysctl/kernel.rst (which in that time certainly wasn't rst and had other name!). This patch just adds the respective documentation and references it from the corresponding entry in Documentation/admin-guide/kernel-parameters.txt. This patch was strongly based on Scott Wood's commit d22881dc13b6 ("Documentation: Better document the hardlockup_panic sysctl"). Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com> Link: https://lore.kernel.org/r/20200310183649.23163-1-gpiccoli@canonical.com Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 021622df	19-Feb-2020	Stephen Kitt <steve@sk2.org>	docs: add a script to check sysctl docs This script allows sysctl documentation to be checked against the kernel source code, to identify missing or obsolete entries. Running it against 5.5 shows for example that sysctl/kernel.rst has two obsolete entries and is missing 52 entries. Signed-off-by: Stephen Kitt <steve@sk2.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 2bd49cb5	21-Feb-2020	Stephen Kitt <steve@sk2.org>	docs: sysctl/kernel: document acpi_video_flags Based on the implementation in arch/x86/kernel/acpi/sleep.c, in particular the acpi_sleep_setup() function. Signed-off-by: Stephen Kitt <steve@sk2.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 8f21f54b	18-Feb-2020	Stephen Kitt <steve@sk2.org>	docs: sysctl/kernel: remove rtsig entries These have no corresponding code in the kernel. Signed-off-by: Stephen Kitt <steve@sk2.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 404347e6	18-Feb-2020	Stephen Kitt <steve@sk2.org>	docs: document panic fully in sysctl/kernel.rst The description of panic doesn’t cover all the supported scenarios; this patch fixes that, describing the three possibilities (no reboot, immediate reboot, reboot after a delay). Based on the implementation in kernel/panic.c. Signed-off-by: Stephen Kitt <steve@sk2.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# a1ad4f15	18-Feb-2020	Stephen Kitt <steve@sk2.org>	docs: document stop-a in sysctl/kernel.rst This describes the SPARC-specific stop-a sysctl entry, which was previously listed in kernel.rst but not documented. Base on the implementation in arch/sparc/kernel/setup_{32,64}.c and kernel/panic.c. Signed-off-by: Stephen Kitt <steve@sk2.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# fa5b5264	18-Feb-2020	Stephen Kitt <steve@sk2.org>	docs: add missing IPC documentation in sysctl/kernel.rst This adds short descriptions of msgmax, msgmnb, msgmni, and shmmni, which were previously listed in kernel.rst but not described. Signed-off-by: Stephen Kitt <steve@sk2.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# a474105b	18-Feb-2020	Stephen Kitt <steve@sk2.org>	docs: drop l2cr from sysctl/kernel.rst The l2cr sysctl entry was removed in commit c2f3dabefa73 ("sysctl: kill binary sysctl KERN_PPC_L2CR"), this removes the corresponding documentation. Signed-off-by: Stephen Kitt <steve@sk2.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 0317c537	18-Feb-2020	Stephen Kitt <steve@sk2.org>	docs: merge debugging-modules.txt into sysctl/kernel.rst This fits nicely in sysctl/kernel.rst, merge it (and rephrase it) instead of linking to it. Signed-off-by: Stephen Kitt <steve@sk2.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# a3cb66a5	18-Feb-2020	Stephen Kitt <steve@sk2.org>	docs: pretty up sysctl/kernel.rst This updates sysctl/kernel.rst to use ReStructured Text more fully: * the list of files is now the table of contents (old entries with no corresponding sections are added as empty sections for now); * code references and commands are formatted as code, except for function names which end up linked to the appropriate documentation; * links are used to point to other documentation and other sections; * tables are used to make lists of values more readable (as already done for some sections); * in heavily-reworked paragraphs, sentences are wrapped individually, to make future diffs easier to read. The first mention of the kernel version is dropped. The second mention, saying that the document is accurate for 2.2, is preserved for now; I will update that once the document really is accurate for a current kernel release. Signed-off-by: Stephen Kitt <steve@sk2.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 895f2c20	12-Feb-2020	d.hatayama@fujitsu.com <d.hatayama@fujitsu.com>	docs: admin-guide: Add description of %c corename format There is somehow no description of %c corename format specifier for /proc/sys/kernel/core_pattern. The %c corename format specifier is used by user-space application such as systemd-coredump, so it should be documented. To find where %c is handled in the kernel source code, look at function format_corename() in fs/coredump.c. Signed-off-by: HATAYAMA Daisuke <d.hatayama@fujitsu.com> Link: https://lore.kernel.org/r/TYAPR01MB4014714BB2ACE425BB6EC6B7951A0@TYAPR01MB4014.jpnprd01.prod.outlook.com Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# e80d8938	31-Oct-2019	Masanari Iida <standby24x7@gmail.com>	docs: admin-guide: Remove threads-max auto-tuning Since following path was merged in 5.4-rc3, auto-tuning feature in threads-max does not exist any more. Fix the admin-guide document as is. kernel/sysctl.c: do not override max_threads provided by userspace b0f53dbc4bc4c371f38b14c391095a3bb8a0bb40 Fixes: b0f53dbc4bc4 ("kernel/sysctl.c: do not override max_threads provided by userspace") Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 73eb802a	31-Oct-2019	Masanari Iida <standby24x7@gmail.com>	docs: admin-guide: Fix min value of threads-max in kernel.rst Since following patch was merged 5.4-rc3, minimum value for threads-max changed to 1. kernel/sysctl.c: do not override max_threads provided by userspace b0f53dbc4bc4c371f38b14c391095a3bb8a0bb40 Fixes: b0f53dbc4bc4 ("kernel/sysctl.c: do not override max_threads provided by userspace") Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# ca30ad85	02-Oct-2019	Oleksandr Natalenko <oleksandr@redhat.com>	docs: admin-guide: fix printk_ratelimit explanation The printk_ratelimit value accepts seconds, not jiffies (though it is converted into jiffies internally). Update documentation to reflect this. Also, remove the statement about allowing 1 message in 5 seconds since bursts up to 10 messages are allowed by default. Finally, while we are here, mention default value for printk_ratelimit_burst too. Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
# 4f4cfa6c	27-Jun-2019	Mauro Carvalho Chehab <mchehab+samsung@kernel.org>	docs: admin-guide: add a series of orphaned documents There are lots of documents that belong to the admin-guide but are on random places (most under Documentation root dir). Move them to the admin guide. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
# 57043247	22-Apr-2019	Mauro Carvalho Chehab <mchehab+samsung@kernel.org>	docs: admin-guide: move sysctl directory to it The stuff under sysctl describes /sys interface from userspace point of view. So, add it to the admin-guide and remove the :orphan: from its index file. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>