History log of /linux-master/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
Revision Date Author Comments
# 90bd0147 03-Jan-2024 Candice Li <candice.li@amd.com>

drm/amdgpu: Drop unnecessary sentences about CE and deferred error.

Remove "no user action is needed" for correctable and deferred error
to avoid confusion.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# fc926fae 27-Mar-2023 YiPeng Chai <YiPeng.Chai@amd.com>

drm/amdgpu: optimize redundant code in umc_v6_7

Optimize redundant code in umc_v6_7.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# 071f526a 18-Jan-2023 Tao Zhou <tao.zhou1@amd.com>

drm/amdgpu: retire unused get_umc_v6_7_channel_index

Fix the following compile warning:

drivers/gpu/drm/amd/amdgpu/umc_v6_7.c:53:24: warning: unused function 'get_umc_v6_7_channel_index' [-Wunused-function]
static inline uint32_t get_umc_v6_7_channel_index(struct amdgpu_device *adev,
^
1 warning generated.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# 6c0ca748 14-Oct-2022 Hawking Zhang <Hawking.Zhang@amd.com>

drm/amdgpu: move convert_error_address out of umc_ras

RAS error address translation algorithm is common
across dGPU and A + A platform as along as the SOC
integrates the same generation of UMC IP.

UMC RAS is managed by x86 MCA on A + A platform,
umc_ras in GPU driver is not initialized at all on
A + A platform. In such case, any umc_ras callback
implemented for dGPU config shouldn't be invoked
from A + A specific callback.

The change moves convert_error_address out of dGPU
umc_ras structure and makes it share between A + A
and dGPU config.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# 44420ac5 26-Sep-2022 Tao Zhou <tao.zhou1@amd.com>

drm/amdgpu: define RAS convert_error_address API

Make the code reusable and remove redundant code.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# cdbb816b 26-Sep-2022 Tao Zhou <tao.zhou1@amd.com>

drm/amdgpu: remove check for CE in RAS error address query

Only RAS UE error address is queried currently, no need to check CE status.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# 3ff4ccc3 28-Sep-2022 Leo Li <sunpeng.li@amd.com>

drm/amdgpu: Fix mc_umc_status used uninitialized warning

On ChromeOS clang build, the following warning is seen:

/mnt/host/source/src/third_party/kernel/v5.15/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c:463:6: error: variable 'mc_umc_status' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
if (mca_addr == UMC_INVALID_ADDR) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/host/source/src/third_party/kernel/v5.15/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c:485:21: note: uninitialized use occurs here
if ((REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, Val) == 1 &&
^~~~~~~~~~~~~
/mnt/host/source/src/third_party/kernel/v5.15/drivers/gpu/drm/amd/amdgpu/../amdgpu/amdgpu.h:1208:5: note: expanded from macro 'REG_GET_FIELD'
(((value) & REG_FIELD_MASK(reg, field)) >> REG_FIELD_SHIFT(reg, field))
^~~~~
/mnt/host/source/src/third_party/kernel/v5.15/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c:463:2: note: remove the 'if' if its condition is always true
if (mca_addr == UMC_INVALID_ADDR) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/host/source/src/third_party/kernel/v5.15/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c:460:24: note: initialize the variable 'mc_umc_status' to silence this warning
uint64_t mc_umc_status, mc_umc_addrt0;
^
= 0
1 error generated.
make[5]: *** [/mnt/host/source/src/third_party/kernel/v5.15/scripts/Makefile.build:289: drivers/gpu/drm/amd/amdgpu/umc_v6_7.o] Error 1

Fix by initializing mc_umc_status = 0.

Fixes: 1014bd1cb32552 ("drm/amdgpu: support to convert dedicated umc mca address")
Reviewed-by: Hamza Mahfooz <hamza.mahfooz@amd.com>
Signed-off-by: Leo Li <sunpeng.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# 1014bd1c 21-Sep-2022 Tao Zhou <tao.zhou1@amd.com>

drm/amdgpu: support to convert dedicated umc mca address

Update umc error address query interface, the mca address can be read
from register or input from parameter.

TODO: define a common address conversion function to simplify the code.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# c19a5f32 21-Sep-2022 Tao Zhou <tao.zhou1@amd.com>

drm/amdgpu: export umc error address convert interface

Make it global so we can convert specific mca address.

v2: rename query_error_address_per_channel to
convert_ras_error_address

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# cbd3e844 20-May-2022 Stanley.Yang <Stanley.Yang@amd.com>

drm/amdgpu: print umc correctable error address

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# 05eee31c 07-Apr-2022 Stanley.Yang <Stanley.Yang@amd.com>

drm/amdgpu: add umc query error status function

In order to debug ras error, driver will print IPID/SYND/MISC0
register value if detect correctable or uncorrectable error.
Provide umc_query_error_status_helper function to reduce code
redundancy.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# 1ec1944e 26-Jan-2022 Stanley.Yang <Stanley.Yang@amd.com>

drm/amdgpu: print more error info

print more error info when deferred uncorrectable ras error

changed from V1:
move Defferred error msg into query uncorrectable error
count function.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# 1915a433 11-Feb-2022 Stanley.Yang <Stanley.Yang@amd.com>

drm/amdgpu: adjust register address calculation

the UMC_STATUS register is not linear, adjust offset
calculation formula to get correct address

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# 69f915cc 10-Feb-2022 Tao Zhou <tao.zhou1@amd.com>

drm/amdgpu: loose check for umc poison mode

No need to check poison setting for each channel, check for umc0
channel0 is enough.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# bee7f8d0 24-Jan-2022 Tao Zhou <tao.zhou1@amd.com>

drm/amdgpu: get hash bit for CH4 in umc channel index

On ALDEBARAN, the umc channel bits are not original values, they
are hashed.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# e63fa4dc 19-Jan-2022 Tao Zhou <tao.zhou1@amd.com>

drm/amdgpu: update algorithm of umc address conversion

On ALDEBARAN, we need to traverse all column bits higher than
BIT11(C4C3C2) in a row, the shift of R14 bit should be also taken
into account. Retire all pages we find.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# 400013b2 19-Jan-2022 Tao Zhou <tao.zhou1@amd.com>

drm/amdgpu: add umc_fill_error_record to make code more simple

Create common amdgpu_umc_fill_error_record function for all versions
of UMC and clean up related codes.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# 37ff945f 19-Jan-2022 Stanley.Yang <Stanley.Yang@amd.com>

drm/amdgpu: fix convert bad page retiremt

Pmfw read ecc info registers and store values in
eccinfo_table in the following order

umc0 ch_inst 0, 1, 2 ... 7
umc1 ch_inst 0, 1, 2 ... 7
...
umc3 ch_inst 0, 1, 2 ... 7

Driver should convert eccinfo_table_idx to channel_index according
to channel_idx_tbl.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# 5904e413 19-Jan-2022 Stanley.Yang <Stanley.Yang@amd.com>

drm/amdgpu: remove unused variable warning

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# efe17d5a 05-Jan-2022 yipechai <YiPeng.Chai@amd.com>

drm/amdgpu: Modify umc block to fit for the unified ras block data and ops

1.Modify umc block to fit for the unified ras block data and ops.
2.Change amdgpu_umc_ras_funcs to amdgpu_umc_ras, and the corresponding variable name remove _funcs suffix.
3.Remove the const flag of umc ras variable so that umc ras block can be able to be inserted into amdgpu device ras block link list.
4.Invoke amdgpu_ras_register_ras_block function to register umc ras block into amdgpu device ras block link list.
5.Remove the redundant code about umc in amdgpu_ras.c after using the unified ras block.
6.Fill unified ras block .name .block .ras_late_init and .ras_fini for all of umc versions. If .ras_late_init and .ras_fini had been defined by the selected umc version, the defined functions will take effect; if not defined, default fill them with amdgpu_umc_ras_late_init and amdgpu_umc_ras_fini.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# 8882f90a 16-Nov-2021 Stanley.Yang <Stanley.Yang@amd.com>

drm/amdgpu: add new query interface for umc block v2

add message smu to query error information

v2:
rename message_smu to ecc_info

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# aaca8c38 17-Sep-2021 Tao Zhou <tao.zhou1@amd.com>

drm/amdgpu: add poison mode query for UMC

Add ras poison mode query interface for UMC.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# 719e433e 23-Jul-2021 Mukul Joshi <mukul.joshi@amd.com>

drm/amdgpu: Fix channel_index table layout for Aldebaran

Fix the channel_index table layout to fetch the correct
channel_index when calculating physical address from
normalized address during page retirement.
Also, fix the number of UMC instances and number of channels
within each UMC instance for Aldebaran.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-By: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# 186c8a85 09-Jun-2021 John Clements <john.clements@amd.com>

drm/amdgpu: initialize umc ras function

support umc ras function initialization for aldebaran

v2: squash in compile fix

Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# 49070c4e 17-Mar-2021 Hawking Zhang <Hawking.Zhang@amd.com>

drm/amdgpu: split umc callbacks to ras and non-ras ones

umc ras is not managed by gpu driver when gpu is
connected to cpu through xgmi. split umc callbacks
into ras and non-ras ones so gpu driver only
initializes umc ras callbacks when it manages
umc ras.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# 87da0cc1 11-Mar-2021 Hawking Zhang <Hawking.Zhang@amd.com>

drm/amdgpu: implement query_ras_error_address callback

query_ras_error_address will be invoked to query bad
page address when there is poison data in HBM consumed
by GPU engines.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# 878b9e94 09-Mar-2021 Hawking Zhang <Hawking.Zhang@amd.com>

drm/amdgpu: implement umc query error count callback

umc query_ras_error_count will be invoked to query
umc correctable and uncorrectable error. It will
reset the umc ras error counter after the query.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# 3f903560 08-Mar-2021 Hawking Zhang <Hawking.Zhang@amd.com>

drm/amdgpu: add helper funtion to query umc ras error

Add helper functions to query correctable and
uncorrectable umc ras error.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


# 1696bf35 08-Mar-2021 Hawking Zhang <Hawking.Zhang@amd.com>

drm/amdgpu: create umc_v6_7_funcs for aldebaran

umc_v6_7_funcs are callbacks to support umc ras
functionalities in aldebaran

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>