#
1f68ce2a |
|
15-Nov-2023 |
Tony Luck <tony.luck@intel.com> |
x86/mce: Handle Intel threshold interrupt storms Add an Intel specific hook into machine_check_poll() to keep track of per-CPU, per-bank corrected error logs (with a stub for the CONFIG_MCE_INTEL=n case). When a storm is observed the rate of interrupts is reduced by setting a large threshold value for this bank in IA32_MCi_CTL2. This bank is added to the bitmap of banks for this CPU to poll. The polling rate is increased to once per second. When a storm ends reset the threshold in IA32_MCi_CTL2 back to 1, remove the bank from the bitmap for polling, and change the polling rate back to the default. If a CPU with banks in storm mode is taken offline, the new CPU that inherits ownership of those banks takes over management of storm(s) in the inherited bank(s). The cmci_discover() function was already very large. These changes pushed it well over the top. Refactor with three helper functions to bring it back under control. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231115195450.12963-4-tony.luck@intel.com
|
#
3ed57b41 |
|
15-Nov-2023 |
Tony Luck <tony.luck@intel.com> |
x86/mce: Remove old CMCI storm mitigation code When a "storm" of corrected machine check interrupts (CMCI) is detected this code mitigates by disabling CMCI interrupt signalling from all of the banks owned by the CPU that saw the storm. There are problems with this approach: 1) It is very coarse grained. In all likelihood only one of the banks was generating the interrupts, but CMCI is disabled for all. This means Linux may delay seeing and processing errors logged from other banks. 2) Although CMCI stands for Corrected Machine Check Interrupt, it is also used to signal when an uncorrected error is logged. This is a problem because these errors should be handled in a timely manner. Delete all this code in preparation for a finer grained solution. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Tested-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lore.kernel.org/r/20231115195450.12963-2-tony.luck@intel.com
|
#
1bae0cfe |
|
13-Jun-2023 |
Yazen Ghannam <yazen.ghannam@amd.com> |
x86/mce: Cleanup mce_usable_address() Move Intel-specific checks into a helper function. Explicitly use "bool" for return type. No functional change intended. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230613141142.36801-4-yazen.ghannam@amd.com
|
#
c3629dd7 |
|
19-Jul-2023 |
Borislav Petkov (AMD) <bp@alien8.de> |
x86/mce: Prevent duplicate error records A legitimate use case of the MCA infrastructure is to have the firmware log all uncorrectable errors and also, have the OS see all correctable errors. The uncorrectable, UCNA errors are usually configured to be reported through an SMI. CMCI, which is the correctable error reporting interrupt, uses SMI too and having both enabled, leads to unnecessary overhead. So what ends up happening is, people disable CMCI in the wild and leave on only the UCNA SMI. When CMCI is disabled, the MCA infrastructure resorts to polling the MCA banks. If a MCA MSR is shared between the logical threads, one error ends up getting logged multiple times as the polling runs on every logical thread. Therefore, introduce locking on the Intel side of the polling routine to prevent such duplicate error records from appearing. Based on a patch by Aristeu Rozanski <aris@ruivo.org>. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Tony Luck <tony.luck@intel.com> Acked-by: Aristeu Rozanski <aris@ruivo.org> Link: https://lore.kernel.org/r/20230515143225.GC4090740@cathedrallabs.org
|
#
0dcab41d |
|
31-Jan-2022 |
Tony Luck <tony.luck@intel.com> |
x86/cpu: Merge Intel and AMD ppin_init() functions The code to decide whether a system supports the PPIN (Protected Processor Inventory Number) MSR was cloned from the Intel implementation. Apart from the X86_FEATURE bit and the MSR numbers it is identical. Merge the two functions into common x86 code, but use x86_match_cpu() instead of the switch (c->x86_model) that was used by the old Intel code. No functional change. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220131230111.2004669-2-tony.luck@intel.com
|
#
e464121f |
|
21-Jan-2022 |
Tony Luck <tony.luck@intel.com> |
x86/cpu: Add Xeon Icelake-D to list of CPUs that support PPIN Missed adding the Icelake-D CPU to the list. It uses the same MSRs to control and read the inventory number as all the other models. Fixes: dc6b025de95b ("x86/mce: Add Xeon Icelake to list of CPUs that support PPIN") Reported-by: Ailin Xu <ailin.xu@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220121174743.1875294-2-tony.luck@intel.com
|
#
e629fc14 |
|
29-Oct-2021 |
Dave Jones <davej@codemonkey.org.uk> |
x86/mce: Add errata workaround for Skylake SKX37 Errata SKX37 is word-for-word identical to the other errata listed in this workaround. I happened to notice this after investigating a CMCI storm on a Skylake host. While I can't confirm this was the root cause, spurious corrected errors does sound like a likely suspect. Fixes: 2976908e4198 ("x86/mce: Do not log spurious corrected mce errors") Signed-off-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20211029205759.GA7385@codemonkey.org.uk
|
#
a331f5fd |
|
19-Mar-2021 |
Tony Luck <tony.luck@intel.com> |
x86/mce: Add Xeon Sapphire Rapids to list of CPUs that support PPIN New CPU model, same MSRs to control and read the inventory number. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210319173919.291428-1-tony.luck@intel.com
|
#
9223d0dc |
|
07-Jan-2021 |
Borislav Petkov <bp@suse.de> |
thermal: Move therm_throt there from x86/mce This functionality has nothing to do with MCE, move it to the thermal framework and untangle it from MCE. Requested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Link: https://lkml.kernel.org/r/20210202121003.GD18075@zn.tnic
|
#
098416e6 |
|
10-Nov-2020 |
Tony Luck <tony.luck@intel.com> |
x86/mce: Use "safe" MSR functions when enabling additional error logging Booting as a guest under KVM results in error messages about unchecked MSR access: unchecked MSR access error: RDMSR from 0x17f at rIP: 0xffffffff84483f16 (mce_intel_feature_init+0x156/0x270) because KVM doesn't provide emulation for random model specific registers. Switch to using rdmsrl_safe()/wrmsrl_safe() to avoid the message. Fixes: 68299a42f842 ("x86/mce: Enable additional error logging on certain Intel CPUs") Reported-by: Qian Cai <cai@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201111003954.GA11878@agluck-desk2.amr.corp.intel.com
|
#
68299a42 |
|
30-Oct-2020 |
Tony Luck <tony.luck@intel.com> |
x86/mce: Enable additional error logging on certain Intel CPUs The Xeon versions of Sandy Bridge, Ivy Bridge and Haswell support an optional additional error logging mode which is enabled by an MSR. Previously, this mode was enabled from the mcelog(8) tool via /dev/cpu, but userspace should not be poking at MSRs. So move the enabling into the kernel. [ bp: Correct the explanation why this is done. ] Suggested-by: Boris Petkov <bp@alien8.de> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201030190807.GA13884@agluck-desk2.amr.corp.intel.com
|
#
df561f66 |
|
23-Aug-2020 |
Gustavo A. R. Silva <gustavoars@kernel.org> |
treewide: Use fallthrough pseudo-keyword Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
|
#
59b58096 |
|
25-Feb-2020 |
Tony Luck <tony.luck@intel.com> |
x86/mce: Fix logic and comments around MSR_PPIN_CTL There are two implemented bits in the PPIN_CTL MSR: Bit 0: LockOut (R/WO) Set 1 to prevent further writes to MSR_PPIN_CTL. Bit 1: Enable_PPIN (R/W) If 1, enables MSR_PPIN to be accessible using RDMSR. If 0, an attempt to read MSR_PPIN will cause #GP. So there are four defined values: 0: PPIN is disabled, PPIN_CTL may be updated 1: PPIN is disabled. PPIN_CTL is locked against updates 2: PPIN is enabled. PPIN_CTL may be updated 3: PPIN is enabled. PPIN_CTL is locked against updates Code would only enable the X86_FEATURE_INTEL_PPIN feature for case "2". When it should have done so for both case "2" and case "3". Fix the final test to just check for the enable bit. Also fix some of the other comments in this function. Fixes: 3f5a7896a509 ("x86/mce: Include the PPIN in MCE records when available") Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200226011737.9958-1-tony.luck@intel.com
|
#
2976908e |
|
19-Feb-2020 |
Prarit Bhargava <prarit@redhat.com> |
x86/mce: Do not log spurious corrected mce errors A user has reported that they are seeing spurious corrected errors on their hardware. Intel Errata HSD131, HSM142, HSW131, and BDM48 report that "spurious corrected errors may be logged in the IA32_MC0_STATUS register with the valid field (bit 63) set, the uncorrected error field (bit 61) not set, a Model Specific Error Code (bits [31:16]) of 0x000F, and an MCA Error Code (bits [15:0]) of 0x0005." The Errata PDFs are linked in the bugzilla below. Block these spurious errors from the console and logs. [ bp: Move the intel_filter_mce() header declarations into the already existing CONFIG_X86_MCE_INTEL ifdeffery. ] Co-developed-by: Alexander Krupp <centos@akr.yagii.de> Signed-off-by: Alexander Krupp <centos@akr.yagii.de> Signed-off-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206587 Link: https://lkml.kernel.org/r/20200219131611.36816-1-prarit@redhat.com
|
#
6d527ceb |
|
20-Dec-2019 |
Sean Christopherson <seanjc@google.com> |
x86/mce: WARN once if IA32_FEAT_CTL MSR is left unlocked WARN if the IA32_FEAT_CTL MSR is somehow left unlocked now that CPU initialization unconditionally locks the MSR. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20191221044513.21680-6-sean.j.christopherson@intel.com
|
#
32ad73db |
|
20-Dec-2019 |
Sean Christopherson <seanjc@google.com> |
x86/msr-index: Clean up bit defines for IA32_FEATURE_CONTROL MSR As pointed out by Boris, the defines for bits in IA32_FEATURE_CONTROL are quite a mouthful, especially the VMX bits which must differentiate between enabling VMX inside and outside SMX (TXT) operation. Rename the MSR and its bit defines to abbreviate FEATURE_CONTROL as FEAT_CTL to make them a little friendlier on the eyes. Arguably, the MSR itself should keep the full IA32_FEATURE_CONTROL name to match Intel's SDM, but a future patch will add a dedicated Kconfig, file and functions for the MSR. Using the full name for those assets is rather unwieldy, so bite the bullet and use IA32_FEAT_CTL so that its nomenclature is consistent throughout the kernel. Opportunistically, fix a few other annoyances with the defines: - Relocate the bit defines so that they immediately follow the MSR define, e.g. aren't mistaken as belonging to MISC_FEATURE_CONTROL. - Add whitespace around the block of feature control defines to make it clear they're all related. - Use BIT() instead of manually encoding the bit shift. - Use "VMX" instead of "VMXON" to match the SDM. - Append "_ENABLED" to the LMCE (Local Machine Check Exception) bit to be consistent with the kernel's verbiage used for all other feature control bits. Note, the SDM refers to the LMCE bit as LMCE_ON, likely to differentiate it from IA32_MCG_EXT_CTL.LMCE_EN. Ignore the (literal) one-off usage of _ON, the SDM is simply "wrong". Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20191221044513.21680-2-sean.j.christopherson@intel.com
|
#
dc6b025d |
|
28-Oct-2019 |
Tony Luck <tony.luck@intel.com> |
x86/mce: Add Xeon Icelake to list of CPUs that support PPIN New CPU model, same MSRs to control and read the inventory number. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20191028163719.19708-1-tony.luck@intel.com
|
#
70f0c230 |
|
18-Sep-2019 |
Tony W Wang-oc <TonyWWang-oc@zhaoxin.com> |
x86/mce: Add Zhaoxin LMCE support Newer Zhaoxin CPUs support LMCE compatible with Intel. Add support for that. [ bp: Export functions and massage. ] Signed-off-by: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: CooperYan@zhaoxin.com Cc: DavidWang@zhaoxin.com Cc: HerryYang@zhaoxin.com Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: QiyuanWang@zhaoxin.com Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/1568787573-1297-5-git-send-email-TonyWWang-oc@zhaoxin.com
|
#
5a3d56a0 |
|
18-Sep-2019 |
Tony W Wang-oc <TonyWWang-oc@zhaoxin.com> |
x86/mce: Add Zhaoxin CMCI support All newer Zhaoxin CPUs support CMCI and are compatible with Intel's Machine-Check Architecture. Add that support for Zhaoxin CPUs. [ bp: Massage comments and export intel_init_cmci(). ] Signed-off-by: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: CooperYan@zhaoxin.com Cc: DavidWang@zhaoxin.com Cc: HerryYang@zhaoxin.com Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: QiyuanWang@zhaoxin.com Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/1568787573-1297-4-git-send-email-TonyWWang-oc@zhaoxin.com
|
#
5ebb34ed |
|
27-Aug-2019 |
Peter Zijlstra <peterz@infradead.org> |
x86/intel: Aggregate microserver naming Currently big microservers have _XEON_D while small microservers have _X, Make it uniformly: _D. for i in `git grep -l "\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*_\(X\|XEON_D\)"` do sed -i -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*ATOM.*\)_X/\1_D/g' \ -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*\)_XEON_D/\1_D/g' ${i} done Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: x86@kernel.org Cc: Dave Hansen <dave.hansen@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Link: https://lkml.kernel.org/r/20190827195122.677152989@infradead.org
|
#
21afaf18 |
|
18-Nov-2018 |
Borislav Petkov <bp@suse.de> |
x86/mce: Streamline MCE subsystem's naming Rename the containing folder to "mce" which is the most widespread name. Drop the "mce[-_]" filename prefix of some compilation units (while others don't have it). This unifies the file naming in the MCE subsystem: mce/ |-- amd.c |-- apei.c |-- core.c |-- dev-mcelog.c |-- genpool.c |-- inject.c |-- intel.c |-- internal.h |-- Makefile |-- p5.c |-- severity.c |-- therm_throt.c |-- threshold.c `-- winchip.c No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20181205141323.14995-1-bp@alien8.de
|