History log of /linux-master/drivers/ras/amd/fmpm.c
Revision Date Author Comments
# 9b195439 19-Mar-2024 Yazen Ghannam <yazen.ghannam@amd.com>

RAS/AMD/FMPM: Safely handle saved records of various sizes

Currently, the size of the locally cached FRU record structures is
based on the module parameter "max_nr_entries".

This creates issues when restoring records if a user changes the
parameter.

If the number of entries is reduced, then old, larger records will not
be restored. The opportunity to take action on the saved data is missed.
Also, new records will be created and written to storage, even as the old
records remain in storage, resulting in wasted space.

If the number of entries is increased, then the length of the old,
smaller records will not be adjusted. This causes a checksum failure
which leads to the old record being cleared from storage. Again, this
results in a missed opportunity for action on the saved data.

Allocate the temporary record with the maximum possible size based on
the current maximum number of supported entries (255). This allows the
ERST read operation to succeed if max_nr_entries has been increased.

Warn the user if a saved record exceeds the expected size and fail to
load the module. This allows the user to adjust the module parameter
without losing data or the opportunity to restore larger records.

Increase the size of a saved record up to the current max_rec_len. The
checksum will be recalculated, and the updated record will be written to
storage.
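
The three size cases described above can be sketched in user space as
follows. This is a minimal illustration, not the driver's code: the
4-byte header layout, the 8-byte entry size, the "all bytes sum to
zero" checksum rule, and every helper name are assumptions made for the
example. Only the limit of 255 entries comes from the commit message.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: bytes 0-1 = length (LE), byte 2 = checksum, byte 3 = reserved. */
#define HDR_LEN     4u
#define ENTRY_LEN   8u
#define MAX_ENTRIES 255u	/* maximum number of supported entries */
#define MAX_BUF_LEN (HDR_LEN + MAX_ENTRIES * ENTRY_LEN)

enum { REC_OK = 0, REC_TOO_BIG = -1, REC_INVALID = -2 };

static size_t rec_len_for(unsigned int nr_entries)
{
	return HDR_LEN + (size_t)nr_entries * ENTRY_LEN;
}

static size_t get_len(const uint8_t *rec)
{
	return (size_t)rec[0] | ((size_t)rec[1] << 8);
}

static void set_len(uint8_t *rec, size_t len)
{
	rec[0] = (uint8_t)(len & 0xff);
	rec[1] = (uint8_t)((len >> 8) & 0xff);
}

static uint8_t byte_sum(const uint8_t *buf, size_t len)
{
	uint8_t sum = 0;

	while (len--)
		sum = (uint8_t)(sum + *buf++);
	return sum;
}

/* Assumed rule: the checksum byte makes all bytes of the record sum to 0. */
static void set_checksum(uint8_t *rec, size_t len)
{
	rec[2] = 0;
	rec[2] = (uint8_t)(0u - byte_sum(rec, len));
}

/*
 * buf is a MAX_BUF_LEN-sized temporary buffer holding one saved record,
 * so a record of any supported size can be read into it. max_rec_len is
 * the length implied by the current max_nr_entries.
 */
static int adjust_saved_record(uint8_t *buf, size_t max_rec_len)
{
	size_t saved_len = get_len(buf);

	if (saved_len < HDR_LEN || saved_len > MAX_BUF_LEN)
		return REC_INVALID;

	/* Validate against the length the record was saved with. */
	if (byte_sum(buf, saved_len) != 0)
		return REC_INVALID;

	/* Larger than expected: the caller warns and refuses to load. */
	if (saved_len > max_rec_len)
		return REC_TOO_BIG;

	/* Smaller than expected: grow, zero the new tail, redo the checksum. */
	if (saved_len < max_rec_len) {
		memset(buf + saved_len, 0, max_rec_len - saved_len);
		set_len(buf, max_rec_len);
		set_checksum(buf, max_rec_len);
	}

	return REC_OK;
}
```

In this sketch a record saved with two entries and restored with
max_rec_len = rec_len_for(4) comes back with its length field grown and
a fresh checksum, while the same record restored with
max_rec_len = rec_len_for(1) is left untouched and reported as too big.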

Fixes: 6f15e617cc99 ("RAS: Introduce a FRU memory poison manager")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Muralidhara M K <muralidhara.mk@amd.com>
Link: https://lore.kernel.org/r/20240319113322.280096-3-yazen.ghannam@amd.com


# 4b0e527c 19-Mar-2024 Yazen Ghannam <yazen.ghannam@amd.com>

RAS/AMD/FMPM: Avoid NULL ptr deref in get_saved_records()

An old, invalid record should be cleared and skipped.

Currently, the record is cleared in ERST, but it is not skipped. This
leads to a NULL pointer dereference when attempting to copy the old
record to the new record.

Continue the loop after clearing an old, invalid record to skip it.
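
The control flow of the fix can be sketched as below. This is a
user-space illustration only: struct rec, find_slot() and restore_all()
are hypothetical stand-ins, and the "cleared" counter stands in for
clearing the record from ERST.

```c
#include <stddef.h>
#include <string.h>

#define NR_SLOTS 2

struct rec {			/* hypothetical stand-in for a FRU record */
	int fru_id;
	int payload;
};

/* Locally cached records for the FRUs present in this boot. */
static struct rec slots[NR_SLOTS] = { { .fru_id = 10 }, { .fru_id = 20 } };

/* Return the cached record matching the saved one, or NULL if there is none. */
static struct rec *find_slot(const struct rec *old)
{
	for (size_t i = 0; i < NR_SLOTS; i++)
		if (slots[i].fru_id == old->fru_id)
			return &slots[i];
	return NULL;
}

/* Returns the number of records restored; *cleared counts discarded ones. */
static int restore_all(const struct rec *saved, size_t n, int *cleared)
{
	int restored = 0;

	*cleared = 0;

	for (size_t i = 0; i < n; i++) {
		struct rec *new = find_slot(&saved[i]);

		if (!new) {
			(*cleared)++;	/* stands in for clearing the old record */
			continue;	/* skip it: the copy below must not see NULL */
		}

		memcpy(new, &saved[i], sizeof(*new));
		restored++;
	}

	return restored;
}
```

Without the continue statement, the memcpy() would run with a NULL
destination for every saved record that has no matching slot.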

Fixes: 6f15e617cc99 ("RAS: Introduce a FRU memory poison manager")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Muralidhara M K <muralidhara.mk@amd.com>
Link: https://lore.kernel.org/r/20240319113322.280096-2-yazen.ghannam@amd.com


# bd17b7c3 05-Mar-2024 Dan Carpenter <dan.carpenter@linaro.org>

RAS/AMD/FMPM: Fix off by one when unwinding on error

Decrement the index variable i before the first iteration when freeing
the remaining elements on error. Depending on where this fails it could
free something from one element beyond the end of the fru_records[]
array.
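
A minimal user-space sketch of the unwind pattern follows. The array
size, alloc_records() and its two failure-injection parameters are
hypothetical and exist only to exercise both error paths.

```c
#include <stdlib.h>

#define NR_RECORDS 4	/* hypothetical array size for this sketch */

static void *records[NR_RECORDS];

/*
 * Allocate every slot. fail_at simulates an allocation failure for that
 * slot (negative: none). late_fail simulates a failure in a later setup
 * step, after the loop has completed and i == NR_RECORDS.
 * Returns 0 on success, -1 after unwinding.
 */
static int alloc_records(int fail_at, int late_fail)
{
	int i;

	for (i = 0; i < NR_RECORDS; i++) {
		records[i] = (i == fail_at) ? NULL : malloc(16);
		if (!records[i])
			goto unwind;
	}

	if (late_fail)
		goto unwind;

	return 0;

unwind:
	/*
	 * Slot i is either unallocated or, on the late failure path, one
	 * past the end of the array. Decrement before the first iteration
	 * so that freeing starts at slot i - 1.
	 */
	while (--i >= 0) {
		free(records[i]);
		records[i] = NULL;
	}

	return -1;
}
```

A loop of the form "for (; i >= 0; i--)" would instead touch
records[NR_RECORDS] on the late failure path, which is the out-of-bounds
access described above.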

[ bp: Massage commit message. ]

Fixes: 6f15e617cc99 ("RAS: Introduce a FRU memory poison manager")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/6fdec71a-846b-4cd0-af69-e5f6cd12f4f6@moroto.mountain


# 7d19eea5 01-Mar-2024 Yazen Ghannam <yazen.ghannam@amd.com>

RAS/AMD/FMPM: Add debugfs interface to print record entries

It is helpful to see the saved record entries during run time in
human-readable format. This is useful for testing during module
development. It can also be used by system admins to quickly and easily
see the state of the system.

Provide a sequential file in debugfs to print fields of interest from
the FRU records and their entries.

Don't fail to load the module if the debugfs interface is not available.
This is a convenience feature which does not affect other module
functionality.

The new interface reads the record entries and should hold the mutex.
Expand the mutex code comment to clarify when it should be held.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240301143748.854090-4-yazen.ghannam@amd.com


# 838850c5 01-Mar-2024 Yazen Ghannam <yazen.ghannam@amd.com>

RAS/AMD/FMPM: Save SPA values

The system physical address (SPA) of an error is not a stable value. It
will change depending on the location of the memory: parts can be
swapped. And it will change depending on memory topology: NUMA nodes
and/or interleaving can be adjusted.

Therefore, the SPA value is not part of the "FRU Memory Poison" record
format. And it will not be saved to persistent storage.

However, the SPA values can be helpful during debug and for system
admins during run time.

Save the SPA values in a separate structure. This is updated when
records are restored and when new errors are saved.
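
One way to picture the split between persistent and run time data is
the sketch below. All names, sizes, and the placeholder translation are
assumptions for illustration; the real record format and address
translation are not shown here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NR_FRUS    2	/* hypothetical sizes for this sketch */
#define NR_ENTRIES 4

struct entry {
	uint64_t hw_id;		/* hypothetical stable, hardware-relative location */
};

struct fru_rec {
	uint64_t nr_entries;
	struct entry entries[NR_ENTRIES];
};

static struct fru_rec recs[NR_FRUS];			/* written to storage */
static uint64_t spa_cache[NR_FRUS][NR_ENTRIES];		/* run time only */

/* New error: the entry goes into the record, the SPA into the side table. */
static int add_entry(unsigned int fru, uint64_t hw_id, uint64_t spa)
{
	struct fru_rec *rec;

	if (fru >= NR_FRUS)
		return -1;

	rec = &recs[fru];
	if (rec->nr_entries >= NR_ENTRIES)
		return -1;

	rec->entries[rec->nr_entries].hw_id = hw_id;
	spa_cache[fru][rec->nr_entries] = spa;
	rec->nr_entries++;

	return 0;
}

/* Only the record is copied out; SPA values never reach storage. */
static size_t persist(unsigned int fru, uint8_t *out)
{
	memcpy(out, &recs[fru], sizeof(recs[fru]));
	return sizeof(recs[fru]);
}

/* Placeholder translation; a real one depends on the current topology. */
static uint64_t demo_translate(uint64_t hw_id)
{
	return hw_id + 0x100000000ULL;
}

/* Restore path: recompute SPA values for the current boot. */
static void refresh_spas(unsigned int fru, uint64_t (*translate)(uint64_t))
{
	for (uint64_t i = 0; i < recs[fru].nr_entries; i++)
		spa_cache[fru][i] = translate(recs[fru].entries[i].hw_id);
}
```

Because the SPA values live outside struct fru_rec, a change in memory
topology alters spa_cache but leaves the bytes written to storage
unchanged.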

[ bp: Make error messages more user friendly and add and correct
comments. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240301143748.854090-3-yazen.ghannam@amd.com


# 6f15e617 13-Feb-2024 Yazen Ghannam <yazen.ghannam@amd.com>

RAS: Introduce a FRU memory poison manager

Memory errors are an expected occurrence on systems with high memory
density. Generally, errors within a small number of unique physical
locations are acceptable, based on manufacturer and/or admin policy.
During run time, memory with errors may be retired so it is no longer
used by the system. This is done in mm through page poisoning, and the
effect will remain until the system is restarted.

If a memory location is consistently faulty, then the same run time
error handling may occur in the next reboot cycle, terminating jobs
because of memory that was already known to be bad. This could be
prevented if information from the previous boot were not lost.

Some add-in cards with driver-managed memory have on-board persistent
storage. Their driver saves memory error information to the persistent
storage during run time. The information is then restored after reset,
and known bad memory will be retired before the hardware is used.
A running log of bad memory locations is kept across multiple resets.

A similar solution is desirable for CPUs. However, this solution should
leverage industry-standard components as much as possible, rather than
a bespoke platform driver.

Two components are needed: a record format and a persistent storage
interface.

Implement a new module to manage the record formats on persistent
storage. Use the requirements for an AMD MI300-based system to start.
Vendor- and platform-specific details can be abstracted later as needed.

[ bp: Massage commit message and code, squash 30-ish more fixes from
Yazen and me. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Co-developed-by: <naveenkrishna.chatradhi@amd.com>
Signed-off-by: <naveenkrishna.chatradhi@amd.com>
Co-developed-by: <muralidhara.mk@amd.com>
Signed-off-by: <muralidhara.mk@amd.com>
Tested-by: <sathyapriya.k@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240214033516.1344948-3-yazen.ghannam@amd.com