#
7ec11c2f |
|
21-Feb-2024 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: Fix ineffective ras_mask settings Check amdgpu_ras_mask to fix ineffective ras_mask setting due to special asic without sram ecc enable but with poison supported. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
1b6ef74b |
|
22-Feb-2024 |
Lijo Lazar <lijo.lazar@amd.com> |
drm/amdgpu: Add fatal error detected flag For a RAS error that needs a full reset to recover, set the fatal error status. Clear the status once the device is reset. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
edfdde90 |
|
26-Jan-2024 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: disable RAS feature when fini Send RAS disable feature command in fini. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
1731ba9b |
|
29-Jan-2024 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: Update boot time errors polling sequence Update boot time errors polling sequence to align with the latest firmware change. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Frank Min <Frank.Min@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
ed1e1e42 |
|
23-Jan-2024 |
YiPeng Chai <YiPeng.Chai@amd.com> |
drm/amdgpu: Support passing poison consumption ras block to SRIOV Support passing poison consumption ras blocks to SRIOV. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
c0c48f0d |
|
23-Jan-2024 |
Yang Wang <kevinyang.wang@amd.com> |
drm/amdgpu: adjust aca init/fini sequence to match gpu reset - move aca init/fini function into ras init/fini to adapt gpu reset sequence. - add new function amdgpu_aca_reset() Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
c84a7e21 |
|
23-Jan-2024 |
Mukul Joshi <mukul.joshi@amd.com> |
drm/amdgpu: Fix module unload hang with RAS enabled The driver unload hangs because the page retirement kthread cannot be stopped as it is sleeping and waiting on page retirement event to occur. Add kthread_should_stop() to the event condition to wake up the kthread when kthread stop is called during driver unload. Fixes: 3fdcd0a31d7a ("drm/amdgpu: Prepare for asynchronous processing of umc page retirement") Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
2866a454 |
|
21-Jan-2024 |
Yang Wang <kevinyang.wang@amd.com> |
drm/amdgpu: skip call ras_late_init if ras block is not supported skip call ras_late_init callback if ras block is not supported. Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
0795b5d2 |
|
14-Jan-2024 |
YiPeng Chai <YiPeng.Chai@amd.com> |
drm/amdgpu:Support retiring multiple MCA error address pages Support retiring multiple MCA error address pages in one in-band query for umc v12_0. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
6c23f3d1 |
|
14-Jan-2024 |
YiPeng Chai <YiPeng.Chai@amd.com> |
drm/amdgpu: Use asynchronous polling to handle umc_v12_0 poisoning Use asynchronous polling to handle umc_v12_0 poisoning. v2: 1. Change function name. 2. Change the debugging information content. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
ee9c3031 |
|
17-Jan-2024 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: Fix ras features value calltrace The high three bits of ras features mask indicate socket id, it should skip to check high three bits of ras features mask before disable all ras features. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
3fdcd0a3 |
|
17-Jan-2024 |
YiPeng Chai <YiPeng.Chai@amd.com> |
drm/amdgpu: Prepare for asynchronous processing of umc page retirement Preparing for asynchronous processing of umc page retirement. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
2c7a1560 |
|
16-Jan-2024 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: Show deferred error count for UMC Show deferred error count for UMC syfs node Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
7ed97155 |
|
16-Jan-2024 |
Yang Wang <kevinyang.wang@amd.com> |
drm/amdgpu: fix UBSAN array-index-out-of-bounds for ras_block_string[] fix array index out of bounds issue for ras_block_string[] array. Fixes: 30df05fb74f6 ("drm/amdgpu: Align ras block enum with firmware") Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
4e2965bd |
|
28-Dec-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: Centralize ras cap query to amdgpu_ras_check_supported Move ras capablity check to amdgpu_ras_check_supported. Driver will query ras capablity through psp interace, or vbios interface, or specific ip callbacks. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
46e2231c |
|
03-Jan-2024 |
Candice Li <candice.li@amd.com> |
drm/amdgpu: Log deferred error separately Separate deferred error from UE and CE and log it individually. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
37973b69 |
|
01-Jan-2024 |
Yang Wang <kevinyang.wang@amd.com> |
drm/amdgpu: add aca sysfs support add aca sysfs node support Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
04c4fcd2 |
|
23-Nov-2023 |
Yang Wang <kevinyang.wang@amd.com> |
drm/amdgpu: add amdgpu ras aca query interface v1: add ACA error query interface v2: Add a new helper function to determine whether to use ACA or MCA. Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
33dcda51 |
|
24-Nov-2023 |
Yang Wang <kevinyang.wang@amd.com> |
drm/amdgpu: add ACA bank dump debugfs support add ACA bank dump debugfs support Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
cce4febb |
|
02-Jan-2024 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: Add ras helper to query boot errors v2 Add ras helper function to query boot time gpu errors. v2: use aqua_vanjaram smn addressing pattern Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Le Ma <le.ma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
73cb81dc |
|
28-Dec-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: Packed socket_id to ras feature mask Initialize RAS feature mask bit[31:29] with socket_id. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
fb1e9171 |
|
03-Jan-2024 |
Candice Li <candice.li@amd.com> |
drm/amdgpu: Support poison error injection via ras_ctrl debugfs Support poison error injection. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
90bd0147 |
|
03-Jan-2024 |
Candice Li <candice.li@amd.com> |
drm/amdgpu: Drop unnecessary sentences about CE and deferred error. Remove "no user action is needed" for correctable and deferred error to avoid confusion. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
6697dbf0 |
|
31-Dec-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
Revert "drm/amdgpu: enable mca debug mode on APU by default" Not needed any more with firmware fixes Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
b8d55a90 |
|
26-Dec-2023 |
Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> |
drm/amdgpu: Fix possible NULL dereference in amdgpu_ras_query_error_status_helper() Return invalid error code -EINVAL for invalid block id. Fixes the below: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1183 amdgpu_ras_query_error_status_helper() error: we previously assumed 'info' could be null (see line 1176) Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Tao Zhou <tao.zhou1@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
091411be |
|
19-Dec-2023 |
Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> |
drm/amdgpu: Use kzalloc instead of kmalloc+__GFP_ZERO in amdgpu_ras.c Fixes the below smatch warnings: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2543 amdgpu_ras_recovery_init() warn: Please consider using kzalloc instead of kmalloc drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2830 amdgpu_ras_init() warn: Please consider using kzalloc instead of kmalloc Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
9f91e983 |
|
12-Dec-2023 |
YiPeng Chai <YiPeng.Chai@amd.com> |
drm/amdgpu: MCA supports recording umc address information MCA supports recording umc address information. V2: Move err_addr variable from struct ras_err_node to struct ras_err_info. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
04a71f11 |
|
03-Dec-2023 |
Yang Wang <kevinyang.wang@amd.com> |
drm/amdgpu: optimize the printing order of error data sort error data list to optimize the printing order. Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f875f61b |
|
30-Nov-2023 |
Yang Wang <kevinyang.wang@amd.com> |
drm/amdgpu: enable mca debug mode on APU by default enable MCA debug mode on APU device by default. Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
201761b5 |
|
08-Nov-2023 |
Lijo Lazar <lijo.lazar@amd.com> |
drm/amdgpu: Move mca debug mode decision to ras Refactor code such that ras block decides the default mca debug mode, and not swsmu block. By default mca debug mode is set to false. v2: squash in uninitialized value fix (Alex) Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
dbf3850d |
|
03-Dec-2023 |
Yang Wang <kevinyang.wang@amd.com> |
drm/amdgpu: optimize the printing order of error data sort error data list to optimize the printing order. Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
07ee43fa |
|
12-Nov-2023 |
Yang Wang <kevinyang.wang@amd.com> |
drm/amdgpu: fix ras err_data null pointer issue in amdgpu_ras.c fix ras err_data null pointer issue in amdgpu_ras.c Fixes: 8cc0f5669eb6 ("drm/amdgpu: Support multiple error query modes") Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
4638e0c2 |
|
11-Oct-2023 |
Vitaly Prosyak <vitaly.prosyak@amd.com> |
drm/amdgpu: fix software pci_unplug on some chips When software 'pci unplug' using IGT is executed we got a sysfs directory entry is NULL for differant ras blocks like hdp, umc, etc. Before call 'sysfs_remove_file_from_group' and 'sysfs_remove_group' check that 'sd' is not NULL. [ +0.000001] RIP: 0010:sysfs_remove_group+0x83/0x90 [ +0.000002] Code: 31 c0 31 d2 31 f6 31 ff e9 9a a8 b4 00 4c 89 e7 e8 f2 a2 ff ff eb c2 49 8b 55 00 48 8b 33 48 c7 c7 80 65 94 82 e8 cd 82 bb ff <0f> 0b eb cc 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 [ +0.000001] RSP: 0018:ffffc90002067c90 EFLAGS: 00010246 [ +0.000002] RAX: 0000000000000000 RBX: ffffffff824ea180 RCX: 0000000000000000 [ +0.000001] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ +0.000001] RBP: ffffc90002067ca8 R08: 0000000000000000 R09: 0000000000000000 [ +0.000001] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ +0.000001] R13: ffff88810a395f48 R14: ffff888101aab0d0 R15: 0000000000000000 [ +0.000001] FS: 00007f5ddaa43a00(0000) GS:ffff88841e800000(0000) knlGS:0000000000000000 [ +0.000002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ +0.000001] CR2: 00007f8ffa61ba50 CR3: 0000000106432000 CR4: 0000000000350ef0 [ +0.000001] Call Trace: [ +0.000001] <TASK> [ +0.000001] ? show_regs+0x72/0x90 [ +0.000002] ? sysfs_remove_group+0x83/0x90 [ +0.000002] ? __warn+0x8d/0x160 [ +0.000001] ? sysfs_remove_group+0x83/0x90 [ +0.000001] ? report_bug+0x1bb/0x1d0 [ +0.000003] ? handle_bug+0x46/0x90 [ +0.000001] ? exc_invalid_op+0x19/0x80 [ +0.000002] ? asm_exc_invalid_op+0x1b/0x20 [ +0.000003] ? sysfs_remove_group+0x83/0x90 [ +0.000001] dpm_sysfs_remove+0x61/0x70 [ +0.000002] device_del+0xa3/0x3d0 [ +0.000002] ? ktime_get_mono_fast_ns+0x46/0xb0 [ +0.000002] device_unregister+0x18/0x70 [ +0.000001] i2c_del_adapter+0x26d/0x330 [ +0.000002] arcturus_i2c_control_fini+0x25/0x50 [amdgpu] [ +0.000236] smu_sw_fini+0x38/0x260 [amdgpu] [ +0.000241] amdgpu_device_fini_sw+0x116/0x670 [amdgpu] [ +0.000186] ? mutex_lock+0x13/0x50 [ +0.000003] amdgpu_driver_release_kms+0x16/0x40 [amdgpu] [ +0.000192] drm_minor_release+0x4f/0x80 [drm] [ +0.000025] drm_release+0xfe/0x150 [drm] [ +0.000027] __fput+0x9f/0x290 [ +0.000002] ____fput+0xe/0x20 [ +0.000002] task_work_run+0x61/0xa0 [ +0.000002] exit_to_user_mode_prepare+0x150/0x170 [ +0.000002] syscall_exit_to_user_mode+0x2a/0x50 Cc: Hawking Zhang <hawking.zhang@amd.com> Cc: Luben Tuikov <luben.tuikov@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian Koenig <christian.koenig@amd.com> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
8cc0f566 |
|
08-Nov-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: Support multiple error query modes Direct error query mode and firmware error query mode are supported for now. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
18eae367 |
|
30-Oct-2023 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: check recovery status of xgmi hive in ras_reset_error_count Handle xgmi hive case. Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
d1d4c0b7 |
|
24-Oct-2023 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: check RAS supported first in ras_reset_error_count Not all platforms support RAS. Fixes: 73582be11ac8 ("drm/amdgpu: bypass RAS error reset in some conditions") Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
73582be1 |
|
12-Oct-2023 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: bypass RAS error reset in some conditions PMFW is responsible for RAS error reset in some conditions, driver can skip the operation. v2: add check for ras->in_recovery, it's set earlier than amdgpu_in_reset. v3: fix error in gpu reset check. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f3a3bbf1 |
|
20-Oct-2023 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: enable RAS poison mode for APU Enable it by default on APU platform. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
ec3e0a91 |
|
19-Oct-2023 |
Yang Wang <kevinyang.wang@amd.com> |
drm/amdgpu: refine ras error kernel log print refine ras error kernel log to avoid user-ridden ambiguity. Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
53d4d779 |
|
19-Oct-2023 |
Yang Wang <kevinyang.wang@amd.com> |
drm/amdgpu: fix find ras error node error the origin function might return the wrong node. Fixes: 5b1270beb380 ("drm/amdgpu: add ras_err_info to identify RAS error source") Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
8096df76 |
|
11-Oct-2023 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: add set/get mca debug mode operations Record the debug mode status in RAS. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
66d64e4e |
|
19-Oct-2023 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: Enable RAS feature by default for APU Enable RAS feature by default for aqua vanjaram on apu platform. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
49c260be |
|
19-Oct-2023 |
Yang Wang <kevinyang.wang@amd.com> |
drm/amdgpu: fix typo for amdgpu ras error data print typo fix. Fixes: 5b1270beb380 ("drm/amdgpu: add ras_err_info to identify RAS error source") Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Candice Li <candice.li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
8a656611 |
|
19-Oct-2023 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: Fix delete nodes that have been relesed Fix delete nodes that it has been freed. Fixes: 5b1270beb380 ("drm/amdgpu: add ras_err_info to identify RAS error source") Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
9248462d |
|
28-Sep-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: Enable software RAS in vcn v4_0_3 Set VCN/JPEG RAS masks to enable software RAS for VCN and JPEG. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
472c5fb2 |
|
11-Oct-2023 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: define ras_reset_error_count function Make the code architecture more simple. v2: reuse ras_reset_error_count in ras_reset_error_status. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
53dd920c |
|
05-Oct-2023 |
Asad Kamal <asad.kamal@amd.com> |
drm/amdgpu : Add hive ras recovery check If one of the devices in the hive detects a fatal error, need to send ras recovery reset message to PMFW of all devices in the hive. For that add a flag in hive to indicate that it's undergoing ras recovery Signed-off-by: Asad Kamal <asad.kamal@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
5b1270be |
|
25-Sep-2023 |
Yang Wang <kevinyang.wang@amd.com> |
drm/amdgpu: add ras_err_info to identify RAS error source introduced "ras_err_info" to better identify a RAS ERROR source. NOTE: For legacy chips, keep the original RAS error print format. v1: RAS errors may come from different dies during a RAS error query, therefore, need a new data structure to identify the source of RAS ERROR. v2: - use new data structure 'amdgpu_smuio_mcm_config_info' instead of ras_err_id (in v1 patch) - refine ras error dump function name - refine ras error dump log format Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
625e5f38 |
|
02-Oct-2023 |
Asad Kamal <asad.kamal@amd.com> |
drm/amdgpu: Expose ras version & schema info Expose ras table version & schema info to sysfs v2: Updated schema to get poison support info from ras context, removed asic specific checks Signed-off-by: Asad Kamal <asad.kamal@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a83f2bf1 |
|
15-Sep-2023 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: Fix false positive error log It should first check block ras obj whether be set, it should return 0 directly if block ras obj or hw_ops is not set. If block doesn't support RAS just return 0 is fine. Changed from V1: return 0 directly if block ras obj or hw ops is not set Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
5838f74c |
|
14-Sep-2023 |
Cong Liu <liucong2@kylinos.cn> |
drm/amdgpu: fix a memory leak in amdgpu_ras_feature_enable This patch fixes a memory leak in the amdgpu_ras_feature_enable() function. The leak occurs when the function sends a command to the firmware to enable or disable a RAS feature for a GFX block. If the command fails, the kfree() function is not called to free the info memory. Fixes: 9f051d6ff13f ("drm/amdgpu: Free ras cmd input buffer properly") Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Cong Liu <liucong2@kylinos.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
4051844c |
|
05-Sep-2023 |
Yang Wang <kevinyang.wang@amd.com> |
drm/amdgpu: add amdgpu mca debug sysfs support add amdgpu mca debug sysfs support. Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
4e8303cf |
|
11-Sep-2023 |
Lijo Lazar <lijo.lazar@amd.com> |
drm/amdgpu: Use function for IP version check Use an inline function for version check. Gives more flexibility to handle any format changes. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
9f9d4651 |
|
08-Sep-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: fallback to old RAS error message for aqua_vanjaram So driver doesn't generate incorrect message until the new format is settled down for aqua_vanjaram Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
bf7aa8be |
|
29-Aug-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: Free ras cmd input buffer properly Do not access the pointer for ras input cmd buffer if it is even not allocated. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Stanley Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
ec70578c |
|
24-Aug-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: Allow issue disable gfx ras cmd to firmware Disable gfx ras command is needed in some use cases Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
80578f16 |
|
15-Aug-2023 |
YiPeng Chai <YiPeng.Chai@amd.com> |
drm/amdgpu: Enable ras for mp0 v13_0_6 sriov Enable ras for mp0 v13_0_6 sriov Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f387bb57 |
|
14-Sep-2023 |
Cong Liu <liucong2@kylinos.cn> |
drm/amdgpu: fix a memory leak in amdgpu_ras_feature_enable This patch fixes a memory leak in the amdgpu_ras_feature_enable() function. The leak occurs when the function sends a command to the firmware to enable or disable a RAS feature for a GFX block. If the command fails, the kfree() function is not called to free the info memory. Fixes: 9f051d6ff13f ("drm/amdgpu: Free ras cmd input buffer properly") Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Cong Liu <liucong2@kylinos.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
ffd6bde3 |
|
08-Sep-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: fallback to old RAS error message for aqua_vanjaram So driver doesn't generate incorrect message until the new format is settled down for aqua_vanjaram Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
9f051d6f |
|
29-Aug-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: Free ras cmd input buffer properly Do not access the pointer for ras input cmd buffer if it is even not allocated. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Stanley Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
e81c4556 |
|
15-Aug-2023 |
YiPeng Chai <YiPeng.Chai@amd.com> |
drm/amdgpu: Enable ras for mp0 v13_0_6 sriov Enable ras for mp0 v13_0_6 sriov Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
1b98a5f8 |
|
07-Aug-2023 |
YiPeng Chai <YiPeng.Chai@amd.com> |
drm/amdgpu: mode1 reset needs to recover mp1 for mp0 v13_0_10 Mode1 reset needs to recover mp1 in fatal error case for mp0 v13_0_10. v2: Define a macro to wrap psp function calls. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
8b3a7a70 |
|
09-Aug-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: Remove unnecessary ras cap check RAS global isr will only be invoked by hardware interrupt. Don't need to query ras capability in isr In addition, amdgpu_ras_interrupt_fatal_error_handler ensures the isr won't be called from guest linux side by accident. The RAS cap check in isr that introduced to fix sriov crash is not needed any more Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
bc0f8080 |
|
08-Aug-2023 |
Candice Li <candice.li@amd.com> |
drm/amdgpu: Extend poison mode check to SDMA/VCN/JPEG Treat SDMA/VCN/JPEG as RAS capable IP blocks in poison mode. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
7692e1ee |
|
08-Jun-2023 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: add RAS fatal error handler for NBIO v7.9 Register RAS fatal error interrupt and add handler. v2: only register NBIO RAS for dGPU platform. change nbio_v7_9_set_ras_controller_irq_state and nbio_v7_9_set_ras_err_event_athub_irq_state to dummy functions. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
6fc9d92c |
|
03-Jul-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: Issue ras enable_feature for gfx ip only For non-GFX IP blocks, set up ras obj if ras feature is allowed. For GFX IP blocks, force issue ras enable_feature command to firmware and only set up ras obj if ras feature is allowed Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
62c4b772 |
|
03-Jul-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: Apply poison mode check to GFX IP only For GFX IP that only supports poison consumption, GFX RAS won't be marked as enabled. i.e., hardware doesn't support gfx sram ecc. But driver still needs to issue firmware to enable poison consumption mode for GFX IP. In such case, check poison mode and treat GFX IP as RAS capable IP block. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f957138c |
|
31-Jul-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: Only create err_count sysfs when hw_op is supported Some IP blocks only support partial ras feature and don't have ras counter and/or ras error status register at all. Driver should not create err_count sysfs node for those IP blocks. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
fcb7a184 |
|
21-Jul-2023 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: Check APU flag to disable RAS Only disable RAS by default for aqua vanjaram on APU platform. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
adf64e21 |
|
18-Jul-2023 |
Mario Limonciello <mario.limonciello@amd.com> |
drm/amd: Avoid reading the VBIOS part number twice The VBIOS part number is read both in amdgpu_atom_parse() as well as in atom_get_vbios_pn() and stored twice in the `struct atom_context` structure. Remove the first unnecessary read and move the `pr_info` line from that read into the second. v2: squash in unused variable removal Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
276f6e8c |
|
13-Jul-2023 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: Disable RAS by default on APU flatform Disable RAS feature by default for aqua vanjaram on APU platform. Changed from V1: Splite Disable RAS by default on APU platform into a separated patch. Changed from V2: Avoid to modify global variable amdgpu_ras_enable. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
cb906ce3 |
|
25-Jun-2023 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: Enable aqua vanjaram RAS Enable RAS for aqua vanjaram. Changed from V1: Split the change in amdgpu_ras_asic_supported into a separated patch. Changed from V2: Avoid to modify global variable amdgpu_ras_enable. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a80fe1a6 |
|
29-Jun-2023 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: skip address adjustment for GFX RAS injection The address parameter of GFX RAS injection isn't related to XGMI node number, keep it unchanged. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Candice Li <candice.li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
2c7cd280 |
|
24-Jun-2023 |
YiPeng Chai <YiPeng.Chai@amd.com> |
drm/amdgpu: gpu recovers from fatal error in poison mode Fatal error occurs in ras poison mode, mode1 reset is used to recover gpu. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
43aedbf4 |
|
12-Jun-2023 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: Add checking mc_vram_size Do not compare injection address with mc_vram_size if mc_vram_size is zero. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
38298ce6 |
|
12-Jun-2023 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: Optimize checking ras supported Using "is_app_apu" to identify device in the native APU mode or carveout mode. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
71344a71 |
|
10-Jun-2023 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Fix usage of UMC fill record in RAS The fixed commit listed in the Fixes tag below, introduced a bug in amdgpu_ras.c::amdgpu_reserve_page_direct(), in that when introducing the new amdgpu_umc_fill_error_record() and internally in that new function the physical address (argument "uint64_t retired_page"--wrong name) is right-shifted by AMDGPU_GPU_PAGE_SHIFT. Thus, in amdgpu_reserve_page_direct() when we pass "address" to that new function, we should NOT right-shift it, since this results, erroneously, in the page address to be 0 for first 2^(2*AMDGPU_GPU_PAGE_SHIFT) memory addresses. This commit fixes this bug. Cc: Tao Zhou <tao.zhou1@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Alex Deucher <Alexander.Deucher@amd.com> Fixes: 400013b268cb ("drm/amdgpu: add umc_fill_error_record to make code more simple") Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://lore.kernel.org/r/20230610113536.10621-1-luben.tuikov@amd.com Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
740f42a2 |
|
02-Jun-2023 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Report ras_num_recs in debugfs Report the number of records stored in the RAS EEPROM table in debugfs. This can be used by user-space to calculate the capacity of the RAS EEPROM table since "bad_page_cnt_threshold" is also reported in the same place in debugfs. See commit 7fb640714547 ("drm/amdgpu: Add bad_page_cnt_threshold to debugfs"). ras_num_recs can already be inferred by dumping the RAS EEPROM table, also in the same debugfs location, see commit reference c65b0805e77919 (drm/amdgpu: RAS EEPROM table is now in debugfs, 2021-04-08). This commit makes it an integer value easily shown in a single file. Cc: Alex Deucher <Alexander.Deucher@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Tao Zhou <tao.zhou1@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: John Clements <john.clements@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://lore.kernel.org/r/20230603051043.211548-1-luben.tuikov@amd.com Acked-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
7f599fed |
|
30-May-2023 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: Add support EEPROM table v2.1 Add ras info to EEPROM table, app can analyse device ECC status without GPU driver through EEPROM table ras info. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
e3959cb5 |
|
21-Apr-2023 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: support check vcn jpeg block mask Support VCN/JPEG instance mask checking, pass logical mask directly except GFX/SDMA/VCN/JPEG blocks. Changed from V1: correct a typo Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
6c47a79b |
|
16-May-2023 |
YiPeng Chai <YiPeng.Chai@amd.com> |
drm/amdgpu: perform mode2 reset for sdma fed error on gfx v11_0_3 perform mode2 reset for sdma fed error on gfx v11_0_3. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f464c5dd |
|
20-Mar-2023 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: add check for RAS instance mask The mask is only needed to be set when RAS block instance number is more than 1 and invalid bits should be also masked out. We only check valid bits for GFX and SDMA block for now, and will add check for other RAS blocks in the future. v2: move the check under injection operation since the mask is only used by RAS error inject. v3: add valid bits handling for SDMA. v4: print message if the mask is adjusted. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
27c5f295 |
|
13-Mar-2023 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: reorganize RAS injection flow So GFX RAS injection could use default function if it doesn't define its own injection interface. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
2c22ed0b |
|
27-Feb-2023 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: add instance mask for RAS inject User can specify injected instances by the mask. For backward compatibility, the mask value is incorporated into sub block index without interface change of RAS TA. User uses logical mask and driver should convert it to physical value before sending it to RAS TA. v2: update parameter name. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
9b337b7d |
|
20-Mar-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: Adjust the sequence to query ras error info It turns out STATUS_VALID_FLAG needs to be checked ahead of any other fields. ADDRESS_VALID_FLAG and ERR_INFO_VALID_FLAG only manages ADDRESS and ERR_INFO field respectively. driver should continue poll ERR CNT field even ERR_INFO_VALD_FLAG is not set. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
8107e499 |
|
29-Jan-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: Enable persistent edc harvesting in APP APU Persistent edc harvesting is supported in APP APU Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
e53a3250 |
|
03-Feb-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: Add common helper to reset ras error Add common helper to reset ras error status. It applies to IP blocks that follow the new ras error logging register design, and need to write 0 to reset the error status. For IP blocks that don't support the new design, please still implement ip specific helper. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
322a7e00 |
|
02-Feb-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: Add common helper to query ras error (v2) Add common helper to query ras error status and log error information, including memory block id and erorr count. The helpers are applicable to IP blocks that follow the new ras error logging design. For IP blocks that don't support the new design, please still implement ip specific helper to query ras error. v2: optimize struct amdgpu_ras_err_status_reg_entry and the implementaion in helper (Lijo/Tao) Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
8eba7205 |
|
19-Apr-2023 |
Candice Li <candice.li@amd.com> |
drm/amdgpu: Drop pcie_bif ras check from fatal error handler Some ASICs support fatal error event but do not support pcie_bif ras. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
b805d8d7 |
|
13-Apr-2023 |
Jane Jian <Jane.Jian@amd.com> |
Revert "drm/amdgpu: enable ras for mp0 v13_0_10 on SRIOV" This reverts commit fe120b9f5ce873516a2604e4ff0c19084be94e8c. This patch impacts sriov multi-vf stability Signed-off-by: Jane Jian <Jane.Jian@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
58bc2a9c |
|
10-Apr-2023 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: correct ras enabled flag XGMI RAS should be according to the gmc xgmi physical nodes number, XGMI RAS should not be enabled if xgmi num_physical_nodes is zero. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
9af357bc |
|
22-Mar-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: Add fatal error handling in nbio v4_3 GPU will stop working once fatal error is detected. it will inform driver to do reset to recover from the fatal error. v2: squash in logic fix (Srinivasan) v3: squash in logic fix (Dan) Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Candice Li <candice.li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
fe120b9f |
|
02-Mar-2023 |
YiPeng Chai <YiPeng.Chai@amd.com> |
drm/amdgpu: enable ras for mp0 v13_0_10 on SRIOV Enable ras for mp0 v13_0_10 on SRIOV. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
37080887 |
|
04-Mar-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: drop ras check at asic level for new blocks amdgpu_ras_register_ras_block should always be invoked by ras_sw_init, where driver needs to check ras caps at ip level, instead of asic level. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Stanley Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
fdc94d3a |
|
13-Mar-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: Rework pcie_bif ras sw_init pcie_bif ras blocks needs to be initialized as early as possible to handle fatal error detected in hw_init phase. also align the pcie_bif ras sw_init with other ras blocks Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Stanley Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f3cbe70e |
|
21-Feb-2023 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: change default behavior of bad_page_threshold parameter Ignore ras umc bad page threshold by default, GPU initialization won't be stopped in this mode. v2: refine the description of bad_page_threshold. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
4d33e0f1 |
|
10-Feb-2023 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: exclude duplicate pages from UMC RAS UE count If a UMC bad page is reserved but not freed by an application, the application may trigger uncorrectable error repeatly by accessing the page. v2: add specific function to do the check. v3: remove duplicate pages, calculate new added bad page number. v4: reuse save_bad_pages to calculate new added bad page number. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
8f453c51 |
|
06-Jan-2023 |
YiPeng Chai <YiPeng.Chai@amd.com> |
drm/amdgpu: Adjust ras support check condition for special asic [Why]: Amdgpu ras uses amdgpu_ras_is_supported to check whether the ras block supports the ras function. amdgpu_ras_is_supported uses .ras_enabled to determine whether the ras function of the block is enabled. But for special asic with mem ecc enabled but sram ecc not enabled, some ras blocks support poison mode but their ras function is not enabled on .ras_enabled, these ras blocks will run abnormally. [How]: If the ras block is not supported on .ras_enabled but the asic supports poison mode and the ras block has ras configuration, it can be considered that the ras block supports ras function. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
8c305a3f |
|
06-Jan-2023 |
YiPeng Chai <YiPeng.Chai@amd.com> |
drm/amdgpu: Remove unnecessary ras block support check [Why]: For special asic with mem ecc enabled but sram ecc not enabled, some ras blocks can register their ras configuration to ras list, but these ras blocks are not enabled on .ras_enabled, so it can not get ras block object using amdgpu_ras_get_ras_block. [How]: Remove ras block support check. Even if the ras block checked is not in the ras list, it will return a null pointer and will have no effect. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
ac7b25d9 |
|
03-Jan-2023 |
YiPeng Chai <YiPeng.Chai@amd.com> |
drm/amdgpu: Perform gpu reset after gfx finishes processing ras poison consumption on gfx_v11_0_3 Perform gpu reset after gfx finishes processing ras poison consumption on gfx_v11_0_3. V2: Move gfx poison consumption handler from hw_ops to ip function level. V3: Adjust the calling position of amdgpu_gfx_poison_consumation_handler. V4: Since gfx v11_0_3 does not have .hw_ops instance, the .hw_ops null pointer check in amdgpu_ras_interrupt_poison_consumption_handler needs to be adjusted. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
4a1c9a44 |
|
03-Jan-2023 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: allow query error counters for specific IP block amdgpu_ras_block_late_init will be invoked in IP specific ras_late_init call as a common helper for all the IP blocks. However, when amdgpu_ras_block_late_init call amdgpu_ras_query_error_count to query ras error counters, amdgpu_ras_query_error_count queries all the IP blocks that support ras query interface. This results to wrong error counters cached in software copies when there are ras errors detected at time zero or warm reset procedure. i.e., in sdma_ras_late_init phase, it counts on sdma/mmhub errors, while, in mmhub_ras_late_init phase, it still counts on sdma/mmhub errors. The change updates amdgpu_ras_query_error_count interface to allow query specific ip error counter. It introduces a new input parameter: query_info. if query_info is NULL, it means query all the IP blocks, otherwise, only query the ip block specified by query_info. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
c26cd999 |
|
21-Dec-2022 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: remove enable ras cmd call trace [Why] [ 41.285804] RIP: 0010:amdgpu_ras_feature_enable+0x15c/0x310 [amdgpu] [ 41.285945] Code: 48 89 c1 48 c7 c2 b9 f2 88 c1 48 c7 c0 c0 f2 88 c1 49 8b 3c 24 48 0f 44 d0 48 c7 c6 98 33 80 c1 e8 5f 52 75 d9 e9 fa fe ff ff <0f> 0b e9 66 ff ff ff 48 8b 3d 86 8c 0f da ba 00 04 00 00 be c0 0d [ 41.285946] RSP: 0018:ffffbccdc72efc90 EFLAGS: 00010246 [ 41.285948] RAX: 0000000000000004 RBX: ffff931897406980 RCX: 0000000000000002 [ 41.285949] RDX: 0000000000000dc0 RSI: 0000000000000002 RDI: ffff931500042b00 [ 41.285950] RBP: ffffbccdc72efcc0 R08: 0000000000000002 R09: ffff931885b87000 [ 41.285951] R10: 0000000000ffff10 R11: 0000000000000001 R12: ffff931893e20000 [ 41.285952] R13: 0000000000000001 R14: ffff931885b87000 R15: 0000000000000000 [ 41.285953] FS: 0000000000000000(0000) GS:ffff931c6f200000(0000) knlGS:0000000000000000 [ 41.285954] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 41.285955] CR2: 000055dd6f532008 CR3: 000000061b010006 CR4: 00000000003706e0 [ 41.285956] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 41.285957] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 41.285958] Call Trace: [ 41.285959] <TASK> [ 41.285963] ? gfx_v11_0_early_init+0x250/0x250 [amdgpu] [ 41.286117] gfx_v11_0_late_init+0x8c/0xb0 [amdgpu] [ 41.286271] amdgpu_device_ip_late_init+0x8d/0x3c0 [amdgpu] [ 41.286401] amdgpu_device_init.cold+0x1677/0x1fda [amdgpu] [ 41.286616] ? pci_bus_read_config_word+0x4a/0x70 [ 41.286621] ? do_pci_enable_device+0xdb/0x110 [ 41.286625] amdgpu_driver_load_kms+0x1a/0x160 [amdgpu] [ 41.286762] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu] [ 41.286898] local_pci_probe+0x4b/0x90 [ 41.286901] work_for_cpu_fn+0x1a/0x30 [ 41.286903] process_one_work+0x22b/0x3d0 [ 41.286905] worker_thread+0x223/0x420 [ 41.286907] ? process_one_work+0x3d0/0x3d0 [ 41.286908] kthread+0x12a/0x150 [ 41.286911] ? set_kthread_struct+0x50/0x50 [ 41.286913] ret_from_fork+0x22/0x30 [How] For specific asic, only mem ecc is enabled, sram ecc is not enabled, but it still need to send ras enable cmd to gfx block to support poison mode, so add check posion mode. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
2dd9032b |
|
05-Dec-2022 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: define RAS query poison mode function 1. no need to query poison mode on SRIOV guest side, host can handle it. 2. define the function to simplify code. v2: rename amdgpu_ras_poison_mode_query to amdgpu_ras_query_poison_mode. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
3189501e |
|
05-Dec-2022 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: update VCN/JPEG RAS setting Support VCN/JPEG RAS in both bare metal and SRIOV environment. v2: update commit description. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
248c9635 |
|
07-Dec-2022 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: skip RAS error injection in SRIOV Injection on guest is not allowed. v2: return directly in SRIOV environment. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
2cffcb66 |
|
30-Nov-2022 |
ye xingchen <ye.xingchen@zte.com.cn> |
drm/amdgpu: use sysfs_emit() to instead of scnprintf() Replace the open-code with sysfs_emit() to simplify the code. Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
07615da1 |
|
20-Oct-2022 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: enable RAS for VCN/JPEG v4.0 Set support flag for VCN/JPEG 4.0. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
1a11a65d |
|
08-Nov-2022 |
YiPeng Chai <YiPeng.Chai@amd.com> |
drm/amdgpu: Enable mode-1 reset for RAS recovery in fatal error mode The patch is enabling mode-1 reset for RAS recovery in fatal error mode. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
1ed0e176 |
|
17-Oct-2022 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: remove ras_error_status parameter for UMC poison handler Make the code simpler. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
24b82292 |
|
17-Oct-2022 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: use page retirement API in MCA notifier Make the code more readable. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
6c0ca748 |
|
14-Oct-2022 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: move convert_error_address out of umc_ras RAS error address translation algorithm is common across dGPU and A + A platform as along as the SOC integrates the same generation of UMC IP. UMC RAS is managed by x86 MCA on A + A platform, umc_ras in GPU driver is not initialized at all on A + A platform. In such case, any umc_ras callback implemented for dGPU config shouldn't be invoked from A + A specific callback. The change moves convert_error_address out of dGPU umc_ras structure and makes it share between A + A and dGPU config. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Stanley Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
82835055 |
|
28-Sep-2022 |
YiPeng Chai <YiPeng.Chai@amd.com> |
drm/amdgpu: Add sriov vf ras support in amdgpu_ras_asic_supported V2: Add sriov vf ras support in amdgpu_ras_asic_supported. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
073285ef |
|
27-Sep-2022 |
YiPeng Chai <YiPeng.Chai@amd.com> |
drm/amdgpu: Enable ras support for mp0 v13_0_0 and v13_0_10 V1: Enable ras support for CHIP_IP_DISCOVERY asic type. V2: 1. Change commit comment. 2. Enable ras support for mp0 v13_0_0 and v13_0_10. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
b98a1648 |
|
12-Oct-2022 |
Victor Zhao <Victor.Zhao@amd.com> |
Revert "drm/amdgpu: let mode2 reset fallback to default when failure" This reverts commit dac6b80818ac2353631c5a33d140d8d5508e2957. This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with the original design of reset handler. Will redesign it. Fixes: dac6b80818ac23 ("drm/amdgpu: let mode2 reset fallback to default when failure") Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
3bd026c3 |
|
28-Sep-2022 |
YiPeng Chai <YiPeng.Chai@amd.com> |
drm/amdgpu: Add sriov vf ras support in amdgpu_ras_asic_supported V2: Add sriov vf ras support in amdgpu_ras_asic_supported. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
001ebcf5 |
|
27-Sep-2022 |
YiPeng Chai <YiPeng.Chai@amd.com> |
drm/amdgpu: Enable ras support for mp0 v13_0_0 and v13_0_10 V1: Enable ras support for CHIP_IP_DISCOVERY asic type. V2: 1. Change commit comment. 2. Enable ras support for mp0 v13_0_0 and v13_0_10. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a340847b |
|
12-Oct-2022 |
Victor Zhao <Victor.Zhao@amd.com> |
Revert "drm/amdgpu: let mode2 reset fallback to default when failure" This reverts commit dac6b80818ac2353631c5a33d140d8d5508e2957. This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with the original design of reset handler. Will redesign it. Fixes: dac6b80818ac23 ("drm/amdgpu: let mode2 reset fallback to default when failure") Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
38dbbfa5 |
|
29-Sep-2022 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: fix coding style issue for mca notifier Fix some issues found by checkpatch script. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
44420ac5 |
|
26-Sep-2022 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: define RAS convert_error_address API Make the code reusable and remove redundant code. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
cd4c99f1 |
|
21-Sep-2022 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: use RAS error address convert api in mca notifier Use the convert interface to simplify code. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
642c0401 |
|
07-Sep-2022 |
YiPeng Chai <YiPeng.Chai@amd.com> |
drm/amdgpu: Fixed ras warning when uninstalling amdgpu For the asic using smu v13_0_2, there is the following warning when uninstalling amdgpu: amdgpu: ras disable gfx failed poison:1 ret:-22. [Why]: For the asic using smu v13_0_2, the psp .suspend and mode1reset is called before amdgpu_ras_pre_fini during amdgpu uninstall, it has disabled all ras features and reset the psp. Since the psp is reset, calling amdgpu_ras_disable_all_features in amdgpu_ras_pre_fini to disable ras features will fail. [How]: If all ras features are disabled, amdgpu_ras_disable_all_features will not be called to disable all ras features again. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
6da15a23 |
|
07-Sep-2022 |
Candice Li <candice.li@amd.com> |
drm/amdgpu: Skip reset error status for psp v13_0_0 No need to reset error status since only umc ras supported on psp v13_0_0. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
dac6b808 |
|
27-Jul-2022 |
Victor Zhao <Victor.Zhao@amd.com> |
drm/amdgpu: let mode2 reset fallback to default when failure - introduce AMDGPU_SKIP_MODE2_RESET flag - let mode2 reset fallback to default reset method if failed v2: move this part out from the asic specific part Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
86875d55 |
|
07-Sep-2022 |
Candice Li <candice.li@amd.com> |
drm/amdgpu: Skip reset error status for psp v13_0_0 No need to reset error status since only umc ras supported on psp v13_0_0. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f1549c09 |
|
07-Jul-2022 |
Likun Gao <Likun.Gao@amd.com> |
drm/amdgpu: support reset flag set for gpu reset Move reset_context out of gpu recover function to make it configurable for different reset purpose. For the reset way of call gpu_recovery sysfs, force to use full reset method. Otherwise, try soft reset by default if the related ASIC supportted, if soft reset failed, will use full reset. Signed-off-by: Likun Gao <Likun.Gao@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
e0e146d5 |
|
04-Jul-2022 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: skip whole ras bad page framework on sriov It should not init whole ras bad page framework on sriov guest side due to it is handled on host side. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
26093ce1 |
|
23-Jun-2022 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: Only send ras feature for gfx block GFX is the only IP block that RAS TA needs to program the hardware when receiving enable_feature command. Changed from V1: remove amdgpu_ras_need_send_ras_feature inline function, use GFX RAS block check directly. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
cf727044 |
|
17-May-2022 |
Andrey Grodzovsky <andrey.grodzovsky@amd.com> |
drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover We removed the wrapper that was queueing the recover function into reset domain queue who was using this name. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
25a2b22e |
|
16-May-2022 |
Andrey Grodzovsky <andrey.grodzovsky@amd.com> |
drm/admgpu: Serialize RAS recovery work directly into reset domain queue. Save the extra usless work schedule. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
2a460963 |
|
01-Jun-2022 |
Candice Li <candice.li@amd.com> |
drm/amdgpu: Resolve RAS GFX error count issue after cold boot on Arcturus Adjust the sequence for ras late init and separate ras reset error status from query status. v2: squash in fix from Candice Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
28caf8c4 |
|
31-May-2022 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: fix ras supported check Fix aldebaran ras supported check on SRIOV guest side, the previous check conditicon block all ras feature on baremetal Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
950d6425 |
|
26-Apr-2022 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: support ras on SRIOV support umc/gfx/sdma ras on guest side Changed from V1: move sriov judgment in amdgpu_ras_interrupt_fatal_error_handler Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
b63ac5d3 |
|
09-May-2022 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: refine RAS poison consumption handler Qeury ras status before ras poison consumption handling, add more comment and log. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-and-tested-by: Mohammad Zafar Ziya <Mohammadzafar.ziya@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
36780606 |
|
09-May-2022 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: enable RAS IH for poison consumption Enable RAS IH if poison consumption handler is implemented. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Mohammad Zafar Ziya <Mohammadzafar.ziya@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
b3ef3205 |
|
22-Apr-2022 |
Haowen Bai <baihaowen@meizu.com> |
drm/amdgpu: Remove useless kfree After alloc fail, we do not need to kfree. Signed-off-by: Haowen Bai <baihaowen@meizu.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
b3c76814 |
|
19-Apr-2022 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: add RAS fatal error interrupt handler The fatal error handler is independent from general ras interrupt handler since there is no related IH ring. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
66f87949 |
|
18-Apr-2022 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: add RAS poison consumption handler (v2) Add support for general RAS poison consumption handler. v2: remove callback function for poison consumption. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
50a7d025 |
|
08-Apr-2022 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: add RAS poison creation handler (v2) Prepare for the implementation of poison consumption handler. v2: separate umc handler from poison creation. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a3d63c62 |
|
16-Mar-2022 |
Mohammad Zafar Ziya <Mohammadzafar.ziya@amd.com> |
drm/amdgpu: Add vcn and jpeg ras support flag Add vcn and jpeg ras support options V2: vcn and jpeg ras flag enabled for aldebaran asic only V3: vcn and jpeg ras flag disabled for error counter query Generic poison query interface added VCN and JPEG ras enabled based on IP version check V4: vcn and jpeg ras flag moved under ecc flag for dGPU Signed-off-by: Mohammad Zafar Ziya <Mohammadzafar.ziya@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
69691c82 |
|
03-Mar-2022 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: message smu to update bad channel info It should notice SMU to update bad channel info when detected uncorrectable error in UMC block Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
80e0c2cb |
|
17-Feb-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Remove redundant .ras_fini initialization in some ras blocks 1. Define amdgpu_ras_block_late_fini_default in amdgpu_ras.c as .ras_fini common function, which is called when .ras_fini of ras block isn't initialized. 2. Remove the code of using amdgpu_ras_block_late_fini to initialize .ras_fini in ras blocks. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
1f211a82 |
|
14-Feb-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: centrally calls the .ras_fini function of all ras blocks centrally calls the .ras_fini function of all ras blocks. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
29c9b6cd |
|
22-Feb-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Fixed warning reported by kernel test robot Fixed warning reported by kernel test robot: 1.warning: variable 'ras_obj' is used uninitialized whenever '||' condition is true. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
d41ff22a |
|
22-Feb-2022 |
Maíra Canal <maira.canal@usp.br> |
drm/amdgpu: Change amdgpu_ras_block_late_init_default function scope Turn previously global function into a static function to avoid the following Clang warning: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2459:5: warning: no previous prototype for function 'amdgpu_ras_block_late_init_default' [-Wmissing-prototypes] int amdgpu_ras_block_late_init_default(struct amdgpu_device *adev, ^ drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2459:1: note: declare 'static' if the function is not intended to be used outside of this translation unit int amdgpu_ras_block_late_init_default(struct amdgpu_device *adev, ^ static Signed-off-by: Maíra Canal <maira.canal@usp.br> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
779596ce |
|
17-Feb-2022 |
Tom Rix <trix@redhat.com> |
drm/amdgpu: fix amdgpu_ras_block_late_init error handler Clang build fails with amdgpu_ras.c:2416:7: error: variable 'ras_obj' is used uninitialized whenever 'if' condition is true if (adev->in_suspend || amdgpu_in_reset(adev)) { ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ amdgpu_ras.c:2453:6: note: uninitialized use occurs here if (ras_obj->ras_cb) ^~~~~~~ There is a logic error in the error handler's labels. ex/ The sysfs: is the last goto label in the normal code but is the middle of error handler. Rework the error handler. cleanup: is the first error, so it's handler should be last. interrupt: is the second error, it's handler is next. interrupt: handles the failure of amdgpu_ras_interrupt_add_hander() by calling amdgpu_ras_interrupt_remove_handler(). This is wrong, remove() assumes the interrupt has been setup, not torn down by add(). Change the goto label to cleanup. sysfs is the last error, it's handler should be first. sysfs: handles the failure of amdgpu_ras_sysfs_create() by calling amdgpu_ras_sysfs_remove(). But when the create() fails there is nothing added so there is nothing to remove. This error handler is not needed. Remove the error handler and change goto label to interrupt. Fixes: b293e891b057 ("drm/amdgpu: add helper function to do common ras_late_init/fini (v3)") Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
418abce2 |
|
14-Feb-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Remove redundant .ras_late_init initialization in some ras blocks 1. Define amdgpu_ras_block_late_init_default in amdgpu_ras.c as .ras_late_init common function, which is called when .ras_late_init of ras block isn't initialized. 2. Remove the code of using amdgpu_ras_block_late_init to initialize .ras_late_init in ras blocks. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
867e24ca |
|
13-Feb-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: define amdgpu_ras_late_init to call all ras blocks' .ras_late_init Define amdgpu_ras_late_init to call all ras blocks' .ras_late_init. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
563285c8 |
|
08-Feb-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Merge amdgpu_ras_late_init/amdgpu_ras_late_fini to amdgpu_ras_block_late_init/amdgpu_ras_block_late_fini 1. Merge amdgpu_ras_late_init to amdgpu_ras_block_late_init. 2. Remove amdgpu_ras_late_init since no ras block calls amdgpu_ras_late_init. 3. Merge amdgpu_ras_late_fini to amdgpu_ras_block_late_fini. 4. Remove amdgpu_ras_late_fini since no ras block calls amdgpu_ras_late_fini. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
9252d33d |
|
08-Feb-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Optimize operating sysfs and interrupt function interface in amdgpu_ras.c In order to reduce redundant struct conversion, modify operating sysfs and interrupt function interface parameters. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
80ed77f9 |
|
07-Feb-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Optimize amdgpu_nbio_ras_late_init/amdgpu_nbio_ras_fini function code Optimize amdgpu_nbio_ras_late_init/amdgpu_nbio_ras_fini function code. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
bdb3489c |
|
30-Jan-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Optimize xxx_ras_late_init/xxx_ras_late_fini for each ras block 1. Define amdgpu_ras_block_late_init to create sysfs nodes and interrupt handles. 2. Define amdgpu_ras_block_late_fini to remove sysfs nodes and interrupt handles. 3. Replace ras block variable members in struct amdgpu_ras_block_object with struct ras_common_if, which can make it easy to associate each ras block instance with each ras block functional interface. 4. Add .ras_cb to struct amdgpu_ras_block_object. 5. Change each ras block to fit for the changement of struct amdgpu_ras_block_object. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a50b0482 |
|
06-Feb-2022 |
yipechai <YiPeng.Chai@amd.com> |
Revert "drm/amdgpu: Add judgement to avoid infinite loop" The commit d5e8ff5f7b2a ("drm/amdgpu: Fixed the defect of soft lock caused by infinite loop") had fixed this defect. Revert workaround commit a2170b4af62f ("drm/amdgpu: Add judgement to avoid infinite loop"). Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
d5e8ff5f |
|
29-Jan-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Fixed the defect of soft lock caused by infinite loop 1. The infinite loop case only occurs on multiple cards support ras functions. 2. The explanation of root cause refer to commit 76641cbbf196 ("drm/amdgpu: Add judgement to avoid infinite loop"). 3. Create new node to manage each unique ras instance to guarantee each device .ras_list is completely independent. 4. Fixes: commit 7a6b8ab3231b51 ("drm/amdgpu: Unify ras block interface for each ras block"). 5. The soft locked logs are as follows: [ 262.165690] CPU: 93 PID: 758 Comm: kworker/93:1 Tainted: G OE 5.13.0-27-generic #29~20.04.1-Ubuntu [ 262.165695] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS T20200717143848 07/17/2020 [ 262.165698] Workqueue: events amdgpu_ras_do_recovery [amdgpu] [ 262.165980] RIP: 0010:amdgpu_ras_get_ras_block+0x86/0xd0 [amdgpu] [ 262.166239] Code: 68 d8 4c 8d 71 d8 48 39 c3 74 54 49 8b 45 38 48 85 c0 74 32 44 89 fa 44 89 e6 4c 89 ef e8 82 e4 9b dc 85 c0 74 3c 49 8b 46 28 <49> 8d 56 28 4d 89 f5 48 83 e8 28 48 39 d3 74 25 49 89 c6 49 8b 45 [ 262.166243] RSP: 0018:ffffac908fa87d80 EFLAGS: 00000202 [ 262.166247] RAX: ffffffffc1394248 RBX: ffff91e4ab8d6e20 RCX: ffffffffc1394248 [ 262.166249] RDX: ffff91e4aa356e20 RSI: 000000000000000e RDI: ffff91e4ab8c0000 [ 262.166252] RBP: ffffac908fa87da8 R08: 0000000000000007 R09: 0000000000000001 [ 262.166254] R10: ffff91e4930b64ec R11: 0000000000000000 R12: 000000000000000e [ 262.166256] R13: ffff91e4aa356df8 R14: ffffffffc1394320 R15: 0000000000000003 [ 262.166258] FS: 0000000000000000(0000) GS:ffff92238fb40000(0000) knlGS:0000000000000000 [ 262.166261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 262.166264] CR2: 00000001004865d0 CR3: 000000406d796000 CR4: 0000000000350ee0 [ 262.166267] Call Trace: [ 262.166272] amdgpu_ras_do_recovery+0x130/0x290 [amdgpu] [ 262.166529] ? psi_task_switch+0xd2/0x250 [ 262.166537] ? __switch_to+0x11d/0x460 [ 262.166542] ? __switch_to_asm+0x36/0x70 [ 262.166549] process_one_work+0x220/0x3c0 [ 262.166556] worker_thread+0x4d/0x3f0 [ 262.166560] ? process_one_work+0x3c0/0x3c0 [ 262.166563] kthread+0x12b/0x150 [ 262.166568] ? set_kthread_struct+0x40/0x40 [ 262.166571] ret_from_fork+0x22/0x30 Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
afa37315 |
|
03-Feb-2022 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Print once if RAS unsupported MESA polls for errors every 2-3 seconds. Printing with dev_info() causes the dmesg log to fill up with the same message, e.g, [18028.206676] amdgpu 0000:0b:00.0: amdgpu: df doesn't config ras function. Make it dev_dbg_once(), as it isn't something correctible during boot or thereafter, so printing just once is sufficient. Also sanitize the message. Cc: Alex Deucher <Alexander.Deucher@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Tao Zhou <tao.zhou1@amd.com> Cc: yipechai <YiPeng.Chai@amd.com> Fixes: 8b0fb0e967c1 ("drm/amdgpu: Modify gfx block to fit for the unified ras block data and ops") Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a2170b4a |
|
28-Jan-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Add judgement to avoid infinite loop 1. The infinite loop causing soft lock occurs on multiple amdgpu cards supporting ras feature. 2. This a workaround patch to fix 6492e1b07c03397f85bd6dc0e230ea6cd9394635. It is valid for multiple amdgpu cards of the same type. 3. The root cause is that each GPU card device has a separate .ras_list link header, but the instance and linked list node of each ras block are unique. When each device is initialized, each ras instance will repeatedly add link node to the device every time. In this way, only the .ras_list of the last initialized device is completely correct. the .ras_list->prev and .ras_list->next of the device initialzied before can still point to the correct ras instance, but the prev pointer and next pointer of the pointed ras instance both point to the last initialized device's .ras_ list instead of the beginning .ras_ list. When using list_for_each_entry_safe searches for non-existent Ras nodes on devices other than the last device, the last ras instance next pointer cannot always be equal to the beginning .ras_list, so that the loop cannot be terminated, the program enters a infinite loop. BTW: Since the data and initialization process of each card are the same, the link list between ras instances will not be destroyed every time the device is initialized. 4. The soft locked logs are as follows: [ 262.165690] CPU: 93 PID: 758 Comm: kworker/93:1 Tainted: G OE 5.13.0-27-generic #29~20.04.1-Ubuntu [ 262.165695] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS T20200717143848 07/17/2020 [ 262.165698] Workqueue: events amdgpu_ras_do_recovery [amdgpu] [ 262.165980] RIP: 0010:amdgpu_ras_get_ras_block+0x86/0xd0 [amdgpu] [ 262.166239] Code: 68 d8 4c 8d 71 d8 48 39 c3 74 54 49 8b 45 38 48 85 c0 74 32 44 89 fa 44 89 e6 4c 89 ef e8 82 e4 9b dc 85 c0 74 3c 49 8b 46 28 <49> 8d 56 28 4d 89 f5 48 83 e8 28 48 39 d3 74 25 49 89 c6 49 8b 45 [ 262.166243] RSP: 0018:ffffac908fa87d80 EFLAGS: 00000202 [ 262.166247] RAX: ffffffffc1394248 RBX: ffff91e4ab8d6e20 RCX: ffffffffc1394248 [ 262.166249] RDX: ffff91e4aa356e20 RSI: 000000000000000e RDI: ffff91e4ab8c0000 [ 262.166252] RBP: ffffac908fa87da8 R08: 0000000000000007 R09: 0000000000000001 [ 262.166254] R10: ffff91e4930b64ec R11: 0000000000000000 R12: 000000000000000e [ 262.166256] R13: ffff91e4aa356df8 R14: ffffffffc1394320 R15: 0000000000000003 [ 262.166258] FS: 0000000000000000(0000) GS:ffff92238fb40000(0000) knlGS:0000000000000000 [ 262.166261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 262.166264] CR2: 00000001004865d0 CR3: 000000406d796000 CR4: 0000000000350ee0 [ 262.166267] Call Trace: [ 262.166272] amdgpu_ras_do_recovery+0x130/0x290 [amdgpu] [ 262.166529] ? psi_task_switch+0xd2/0x250 [ 262.166537] ? __switch_to+0x11d/0x460 [ 262.166542] ? __switch_to_asm+0x36/0x70 [ 262.166549] process_one_work+0x220/0x3c0 [ 262.166556] worker_thread+0x4d/0x3f0 [ 262.166560] ? process_one_work+0x3c0/0x3c0 [ 262.166563] kthread+0x12b/0x150 [ 262.166568] ? set_kthread_struct+0x40/0x40 [ 262.166571] ret_from_fork+0x22/0x30 Fixes: 6492e1b07c0339 ("drm/amdgpu: Unify ras block interface for each ras block") Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
400013b2 |
|
19-Jan-2022 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: add umc_fill_error_record to make code more simple Create common amdgpu_umc_fill_error_record function for all versions of UMC and clean up related codes. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
e9287ef8 |
|
19-Jan-2022 |
yipechai <YiPeng.Chai@amd.com> |
Revert "drm/amdgpu: No longer insert ras blocks into ras_list if it already exists in ras_list" This reverts commit df4f0041c6ef497e598a67e367db835489162754. Xgmi ras initialization had been moved from .late_init to early_init, the defect of repeated calling amdgpu_ras_register_ras_block had been fixed, so revert this patch. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
8eb53bb2 |
|
16-Jan-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Remove repeated calls Remove repeated calls. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
b6efdb02 |
|
13-Jan-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Fix the code style warnings in amdgpu_ras Fix the code style warnings in amdgpu_ras: 1. ERROR: space required before the open parenthesis '('. 2. WARNING: line length of xxx exceeds 100 columns. 3. ERROR: "foo* bar" should be "foo *bar". 4. WARNING: unnecessary whitespace before a quoted newline. 5. WARNING: space prohibited before semicolon. 6. WARNING: suspect code indent for conditional statements. 7. WARNING: braces {} are not necessary for single statement blocks. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
79c04621 |
|
10-Jan-2022 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: handle denied inject error into critical regions v2 Changed from v1: remove unused brace Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
e3d833f4 |
|
12-Jan-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: fix compile warning for ras_block_match_default fix compile warning for ras_block_match_default Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
954ea6aa |
|
12-Jan-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Use ARRAY_SIZE to get array length Use ARRAY_SIZE to get array length. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
ab3b9de6 |
|
13-Jan-2022 |
Yang Li <yang.lee@linux.alibaba.com> |
drm/amdgpu: clean up some inconsistent indenting Eliminate the follow smatch warnings: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:3504 amdgpu_device_init() warn: inconsistent indenting drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1716 amdgpu_ras_error_status_query() warn: if statement not indented drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1058 amdgpu_ras_error_inject() warn: inconsistent indenting Reported-by: Abaci Robot <abaci@linux.alibaba.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
69f91d32 |
|
13-Jan-2022 |
Yang Li <yang.lee@linux.alibaba.com> |
drm/amdgpu: remove unneeded semicolon Eliminate the following coccicheck warning: ./drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2725:16-17: Unneeded semicolon Reported-by: Abaci Robot <abaci@linux.alibaba.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
df4f0041 |
|
11-Jan-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: No longer insert ras blocks into ras_list if it already exists in ras_list No longer insert ras blocks into ras_list if it already exists in ras_list. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
df01fe73 |
|
11-Jan-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Add ras supported check for register_ras_block Add ras supported check for register_ras_block. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
7389a5b8 |
|
05-Jan-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Removed redundant ras code Removed redundant ras code. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
22d4ba53 |
|
05-Jan-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Adjust error inject function code style in amdgpu_ras.c 1. Move xgmi special error inject function from amdgpu_ras.c to xgmi block. 2. Support to use psp_ras_trigger_error as default error inject function in amdgpu_ras.c. If .ras_error_inject isn't defined in ras block, default error inject function will take effect. v2: squash in warning fix (Alex) Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
bdc4292b |
|
05-Jan-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Modify sdma block to fit for the unified ras block data and ops 1.Modify sdma block to fit for the unified ras block data and ops. 2.Change amdgpu_sdma_ras_funcs to amdgpu_sdma_ras, and the corresponding variable name remove _funcs suffix. 3.Remove the const flag of sdma ras variable so that sdma ras block can be able to be inserted into amdgpu device ras block link list. 4.Invoke amdgpu_ras_register_ras_block function to register sdma ras block into amdgpu device ras block link list. 5.Remove the redundant code about sdma in amdgpu_ras.c after using the unified ras block. 6.Fill unified ras block .name .block .ras_late_init and .ras_fini for all of sdma versions. If .ras_late_init and .ras_fini had been defined by the selected sdma version, the defined functions will take effect; if not defined, default fill them with amdgpu_sdma_ras_late_init and amdgpu_sdma_ras_fini. v2: squash in warning fix (Alex) Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
efe17d5a |
|
05-Jan-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Modify umc block to fit for the unified ras block data and ops 1.Modify umc block to fit for the unified ras block data and ops. 2.Change amdgpu_umc_ras_funcs to amdgpu_umc_ras, and the corresponding variable name remove _funcs suffix. 3.Remove the const flag of umc ras variable so that umc ras block can be able to be inserted into amdgpu device ras block link list. 4.Invoke amdgpu_ras_register_ras_block function to register umc ras block into amdgpu device ras block link list. 5.Remove the redundant code about umc in amdgpu_ras.c after using the unified ras block. 6.Fill unified ras block .name .block .ras_late_init and .ras_fini for all of umc versions. If .ras_late_init and .ras_fini had been defined by the selected umc version, the defined functions will take effect; if not defined, default fill them with amdgpu_umc_ras_late_init and amdgpu_umc_ras_fini. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
2e54fe5d |
|
04-Jan-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Modify nbio block to fit for the unified ras block data and ops 1.Modify nbio block to fit for the unified ras block data and ops. 2.Change amdgpu_nbio_ras_funcs to amdgpu_nbio_ras, and the corresponding variable name remove _funcs suffix. 3.Remove the const flag of mmhub ras variable so that nbio ras block can be able to be inserted into amdgpu device ras block link list. 4.Invoke amdgpu_ras_register_ras_block function to register nbio ras block into amdgpu device ras block link list. 5.Remove the redundant code about nbio in amdgpu_ras.c after using the unified ras block. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
5e67bba3 |
|
04-Jan-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Modify mmhub block to fit for the unified ras block data and ops 1.Modify mmhub block to fit for the unified ras block data and ops. 2.Change amdgpu_mmhub_ras_funcs to amdgpu_mmhub_ras, and the corresponding variable name remove _funcs suffix. 3.Remove the const flag of mmhub ras variable so that mmhub ras block can be able to be inserted into amdgpu device ras block link list. 4.Invoke amdgpu_ras_register_ras_block function to register mmhub ras block into amdgpu device ras block link list. 5.Remove the redundant code about mmhub in amdgpu_ras.c after using the unified ras block. 5.Remove the redundant code about mmhub in amdgpu_ras.c after using the unified ras block. 6.Fill unified ras block .name .block .ras_late_init and .ras_fini for all of mmhub versions. If .ras_late_init and .ras_fini had been defined by the selected mmhub version, the defined functions will take effect; if not defined, default fill them with amdgpu_mmhub_ras_late_init and amdgpu_mmhub_ras_fini. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
6d76e904 |
|
04-Jan-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Modify hdp block to fit for the unified ras block data and ops 1.Modify hdp block to fit for the unified ras block data and ops. 2.Change amdgpu_hdp_ras_funcs to amdgpu_hdp_ras, and the corresponding variable name remove _funcs suffix. 3.Remove the const flag of hdp ras variable so that hdp ras block can be able to be inserted into amdgpu device ras block link list. 4.Invoke amdgpu_ras_register_ras_block function to register hdp ras block into amdgpu device ras block link list. 5.Remove the redundant code about hdp in amdgpu_ras.c after using the unified ras block. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
6c245386 |
|
04-Jan-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Modify xgmi block to fit for the unified ras block data and ops 1.Modify gmc block to fit for the unified ras block data and ops. 2.Change amdgpu_xgmi_ras_funcs to amdgpu_xgmi_ras, and the corresponding variable name remove _funcs suffix. 3.Remove the const flag of gmc ras variable so that gmc ras block can be able to be inserted into amdgpu device ras block link list. 4.Invoke amdgpu_ras_register_ras_block function to register gmc ras block into amdgpu device ras block link list. 5.Remove the redundant code about gmc in amdgpu_ras.c after using the unified ras block. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
8b0fb0e9 |
|
03-Jan-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Modify gfx block to fit for the unified ras block data and ops 1.Modify gfx block to fit for the unified ras block data and ops. 2.Change amdgpu_gfx_ras_funcs to amdgpu_gfx_ras, and the corresponding variable name remove _funcs suffix. 3.Remove the const flag of gfx ras variable so that gfx ras block can be able to be inserted into amdgpu device ras block link list. 4.Invoke amdgpu_ras_register_ras_block function to register gfx ras block into amdgpu device ras block link list. 5.Remove the redundant code about gfx in amdgpu_ras.c after using the unified ras block. 6.Fill unified ras block .name .block .ras_late_init and .ras_fini for all of gfx versions. If .ras_late_init and .ras_fini had been defined by the selected gfx version, the defined functions will take effect; if not defined, default fill with amdgpu_gfx_ras_late_init and amdgpu_gfx_ras_fini. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
7cab2124 |
|
03-Jan-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Modify the compilation failed problem when other ras blocks' .h include amdgpu_ras.h Modify the compilation failed problem when other ras blocks' .h include amdgpu_ras.h. v2: squash in forward declaration warning fix (Alex) Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
6492e1b0 |
|
03-Jan-2022 |
yipechai <YiPeng.Chai@amd.com> |
drm/amdgpu: Unify ras block interface for each ras block 1. Define unified ops interface for each block. 2. Add ras_block_match function pointer in ops interface, each ras block can customize specail match function to identify itself. 3. Add amdgpu_ras_block_match_default new function. If a ras block doesn't define .ras_block_match, default execute amdgpu_ras_block_match_default to identify this ras block. 4. Define unified basic ras block data for each ras block. 5. Create dedicated amdgpu device ras block link list to manage all of the ras blocks. 6. Add amdgpu_ras_register_ras_block new function interface for each ras block to register itself to ras controlling block. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
bc143d8b |
|
21-Nov-2021 |
Evan Quan <evan.quan@amd.com> |
drm/amd/pm: do not expose implementation details to other blocks out of power Those implementation details(whether swsmu supported, some ppt_funcs supported, accessing internal statistics ...)should be kept internally. It's not a good practice and even error prone to expose implementation details. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
ec6aae97 |
|
07-Jan-2022 |
Nirmoy Das <nirmoy.das@amd.com> |
drm/amdgpu: do not pass ttm_resource_manager to vram_mgr Do not allow exported amdgpu_vram_mgr_*() to accept any ttm_resource_manager pointer. Also there is no need to force other module to call a ttm function just to eventually call vram_mgr functions. v2: pass adev's vram_mgr instead of adev Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
b54ce6c9 |
|
06-Jan-2022 |
Jiawei Gu <Jiawei.Gu@amd.com> |
drm/amdgpu: Clear garbage data in err_data before usage Memory of err_data should be cleaned before usage when there're multiple entry in ras ih. Otherwise garbage data from last loop will be used. Signed-off-by: Jiawei Gu <Jiawei.Gu@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
91f75eb4 |
|
16-Dec-2021 |
Yazen Ghannam <yazen.ghannam@amd.com> |
x86/MCE/AMD, EDAC/mce_amd: Support non-uniform MCA bank type enumeration AMD systems currently lay out MCA bank types such that the type of bank number "i" is either the same across all CPUs or is Reserved/Read-as-Zero. For example: Bank # | CPUx | CPUy 0 LS LS 1 RAZ UMC 2 CS CS 3 SMU RAZ Future AMD systems will lay out MCA bank types such that the type of bank number "i" may be different across CPUs. For example: Bank # | CPUx | CPUy 0 LS LS 1 RAZ UMC 2 CS NBIO 3 SMU RAZ Change the structures that cache MCA bank types to be per-CPU and update smca_get_bank_type() to handle this change. Move some SMCA-specific structures to amd.c from mce.h, since they no longer need to be global. Break out the "count" for bank types from struct smca_hwid, since this should provide a per-CPU count rather than a system-wide count. Apply the "const" qualifier to the struct smca_hwid_mcatypes array. The values in this array should not change at runtime. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211216162905.4132657-3-yazen.ghannam@amd.com
|
#
929bb8e2 |
|
08-Dec-2021 |
Isabella Basso <isabbasso@riseup.net> |
drm/amdgpu: fix amdgpu_ras_mca_query_error_status scope This commit fixes the compile-time warning below: warning: no previous prototype for ‘amdgpu_ras_mca_query_error_status’ [-Wmissing-prototypes] Changes since v1: - As suggested by Alexander Deucher: 1. Make function static instead of adding prototype. Signed-off-by: Isabella Basso <isabbasso@riseup.net> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
bbe04dec |
|
07-Dec-2021 |
Isabella Basso <isabbasso@riseup.net> |
drm/amd: fix improper docstring syntax This fixes various warnings relating to erroneous docstring syntax, of which some are listed below: warning: Function parameter or member 'adev' not described in 'amdgpu_atomfirmware_ras_rom_addr' ... warning: expecting prototype for amdgpu_atpx_validate_functions(). Prototype was for amdgpu_atpx_validate() instead ... warning: Excess function parameter 'mem' description in 'amdgpu_preempt_mgr_new' ... warning: Cannot understand * @kfd_get_cu_occupancy - Collect number of waves in-flight on this device ... warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst Signed-off-by: Isabella Basso <isabbasso@riseup.net> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
655ff353 |
|
07-Dec-2021 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: enable RAS poison flag when GPU is connected to CPU The RAS poison mode is enabled by default on the platform. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
cf63b702 |
|
06-Dec-2021 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: skip umc ras error count harvest remove in recovery stat check, skip umc ras err cnt harvest in amdgpu_ras_log_on_err_counter Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
aed1faab |
|
02-Dec-2021 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: only skip get ecc info for aldebaran skip get ecc info for aldebarn through check ip version do not affect other asic type Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
bab73f09 |
|
01-Dec-2021 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: skip query ecc info in gpu recovery this is a workaround due to get ecc info failed during gpu recovery [ 700.236122] amdgpu 0000:09:00.0: amdgpu: Failed to export SMU ecc table! [ 700.236128] amdgpu 0000:09:00.0: amdgpu: GPU reset begin! [ 704.331171] amdgpu: qcm fence wait loop timeout expired [ 704.331194] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption [ 704.332445] amdgpu 0000:09:00.0: amdgpu: GPU reset begin! [ 704.332448] amdgpu 0000:09:00.0: amdgpu: Bailing on TDR for s_job:ffffffffffffffff, as another already in progress [ 704.332456] amdgpu: Pasid 0x8000 destroy queue 0 failed, ret -62 [ 710.360924] amdgpu 0000:09:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000013 SMN_C2PMSG_82:0x00000007 [ 710.360964] amdgpu 0000:09:00.0: amdgpu: Failed to disable smu features. [ 710.361002] amdgpu 0000:09:00.0: amdgpu: Fail to disable dpm features! [ 710.361014] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -62 Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
232d1d43 |
|
26-Nov-2021 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: fix disable ras feature failed when unload drvier v2 v2: still need call ras_disable_all_featrures to handle ras initilization failure case. Function amdgpu_device_fini_hw is called before amdgpu_device_fini_sw, so ras ta will unload before send ras disable command, ras dsiable operation must before hw fini. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
fdcb279d |
|
18-Nov-2021 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: query umc error info from ecc_table v2 if smu support ECCTABLE, driver can message smu to get ecc_table then query umc error info from ECCTABLE v2: optimize source code makes logical more reasonable Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
d9a69fe5 |
|
15-Nov-2021 |
Candice Li <candice.li@amd.com> |
drm/amdgpu: Add recovery_lock to save bad pages function Fix race condition failure during UMC UE injection. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
91a1a52d |
|
20-Sep-2021 |
Mukul Joshi <mukul.joshi@amd.com> |
drm/amdgpu: Fix RAS page retirement with mode2 reset on Aldebaran During mode2 reset, the GPU is temporarily removed from the mgpu_info list. As a result, page retirement fails because it cannot find the GPU in the GPU list. To fix this, create our own list of GPUs that support MCE notifier based page retirement and use that list to check if the UMC error occurred on a GPU that supports MCE notifier based page retirement. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
12b2cab7 |
|
22-Sep-2021 |
Mukul Joshi <mukul.joshi@amd.com> |
drm/amdgpu: Register MCE notifier for Aldebaran RAS On Aldebaran, GPU driver will handle bad page retirement for GPU memory even though UMC is host managed. As a result, register a bad page retirement handler on the mce notifier chain to retire bad pages on Aldebaran. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
eb601e61 |
|
29-Sep-2021 |
John Clements <john.clements@amd.com> |
drm/amdgpu: resolve RAS query bug clear error count when persistant harvesting is not enabled Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f524dd54 |
|
17-Sep-2021 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: skip umc ras irq handling in poison mode (v2) In ras poison mode, umc uncorrectable error will be ignored until the corrupted data consumed by another ras module (such as gfx, sdma). v2: update the debug message and replace dev_warn with dev_info. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
e4348849 |
|
17-Sep-2021 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: set poison supported flag for RAS (v2) Add RAS poison supported flag and tell PSP RAS TA about the info. v2: rename poison mode to poison supported, we can also disable poison mode even we support it. print value of poison supported if ras feature enablement fails. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
9080a18f |
|
15-Sep-2021 |
Candice Li <candice.li@amd.com> |
drm/amdgpu: Remove all code paths under the EAGAIN path in RAS late init All code paths under the EAGAIN path in RAS late init are unused. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
640ae42e |
|
22-Sep-2021 |
John Clements <john.clements@amd.com> |
drm/amdgpu: Updated RAS infrastructure Update RAS infrastructure to support RAS query for MCA subblocks Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
3771449b |
|
09-Sep-2021 |
John Clements <john.clements@amd.com> |
drm/amdgpu: Update RAS trigger error block support Added trigger error support for MP0/MP1/MPIO blocks Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a0a2f7bb |
|
27-Aug-2021 |
Candice Li <candice.li@amd.com> |
drm/amd/amdgpu: add mpio to ras block Add MPIO to RAS block Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
893cf382 |
|
13-Aug-2021 |
Candice Li <candice.li@amd.com> |
drm/amd/amdgpu: remove unnecessary RAS context field Delete ras_if->name in the RAS ctx structure and remove related lines. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
6457205c |
|
12-Aug-2021 |
Candice Li <candice.li@amd.com> |
drm/amd/amdgpu: consolidate PSP TA context Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
4d9f771e |
|
02-Jul-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Return error if no RAS In amdgpu_ras_query_error_count() return an error if the device doesn't support RAS. This prevents that function from having to always set the values of the integer pointers (if set), and thus prevents function side effects--always to have to set values of integers if integer pointers set, regardless of whether RAS is supported or not--with this change this side effect is mitigated. Also, if no pointers are set, don't count, since we've no way of reporting the counts. Also, give this function a kernel-doc. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Reported-by: Tom Rix <trix@redhat.com> Fixes: a46751fbcde505 ("drm/amdgpu: Fix RAS function interface") Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
1d9d2ca8 |
|
29-Apr-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Fix koops when accessing RAS EEPROM Debugfs RAS EEPROM files are available when the ASIC supports RAS, and when the debugfs is enabled, an also when "ras_enable" module parameter is set to 0. However in this case, we get a kernel oops when accessing some of the "ras_..." controls in debugfs. The reason for this is that struct amdgpu_ras::adev is unset. This commit sets it, thus enabling access to those facilities. Note that this facilitates EEPROM access and not necessarily RAS features or functionality. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
c65b0805 |
|
08-Apr-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: RAS EEPROM table is now in debugfs Add "ras_eeprom_size" file in debugfs, which reports the maximum size allocated to the RAS table in EEROM, as the number of bytes and the number of records it could store. For instance, $cat /sys/kernel/debug/dri/0/ras/ras_eeprom_size 262144 bytes or 10921 records $_ Add "ras_eeprom_table" file in debugfs, which dumps the RAS table stored EEPROM, in a formatted way. For instance, $cat ras_eeprom_table Signature Version FirstOffs Size Checksum 0x414D4452 0x00010000 0x00000014 0x000000EC 0x000000DA Index Offset ErrType Bank/CU TimeStamp Offs/Addr MemChl MCUMCID RetiredPage 0 0x00014 ue 0x00 0x00000000607608DC 0x000000000000 0x00 0x00 0x000000000000 1 0x0002C ue 0x00 0x00000000607608DC 0x000000001000 0x00 0x00 0x000000000001 2 0x00044 ue 0x00 0x00000000607608DC 0x000000002000 0x00 0x00 0x000000000002 3 0x0005C ue 0x00 0x00000000607608DC 0x000000003000 0x00 0x00 0x000000000003 4 0x00074 ue 0x00 0x00000000607608DC 0x000000004000 0x00 0x00 0x000000000004 5 0x0008C ue 0x00 0x00000000607608DC 0x000000005000 0x00 0x00 0x000000000005 6 0x000A4 ue 0x00 0x00000000607608DC 0x000000006000 0x00 0x00 0x000000000006 7 0x000BC ue 0x00 0x00000000607608DC 0x000000007000 0x00 0x00 0x000000000007 8 0x000D4 ue 0x00 0x00000000607608DD 0x000000008000 0x00 0x00 0x000000000008 $_ Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Xinhui Pan <xinhui.pan@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
63d4c081 |
|
06-Apr-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Optimize EEPROM RAS table I/O Split functionality between read and write, which simplifies the code and exposes areas of optimization and more or less complexity, and take advantage of that. Read and write the table in one go; use a separate stage to decode or encode the data, as opposed to on the fly, which keeps the I2C bus busy. Use a single read/write to read/write the table or at most two if the number of records we're reading/writing wraps around. Check the check-sum of a table in EEPROM on init. Update the checksum at the same time as when updating the table header signature, when the threshold was increased on boot. Take advantage of arithmetic modulo 256, that is, use a byte!, to greatly simplify checksum arithmetic. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
0686627b |
|
16-Jun-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Some renames Qualify with "ras_". Use kernel's own--don't redefine your own. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
e4e6a589 |
|
27-Mar-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Use explicit cardinality for clarity RAS_MAX_RECORD_NUM may mean the maximum record number, as in the maximum house number on your street, or it may mean the maximum number of records, as in the count of records, which is also a number. To make this distinction whether the number is ordinal (index) or cardinal (count), rename this macro to RAS_MAX_RECORD_COUNT. This makes it easy to understand what it refers to, especially when we compute quantities such as, how many records do we have left in the table, especially when there are so many other numbers, quantities and numerical macros around. Also rename the long, amdgpu_ras_eeprom_get_record_max_length() to the more succinct and clear, amdgpu_ras_eeprom_max_record_count(). When computing the threshold, which also deals with counts, i.e. "how many", use cardinal "max_eeprom_records_count", than the quantitative "max_eeprom_records_len". Simplify the logic here and there, as well. Cc: Guchun Chen <guchun.chen@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
cf696091 |
|
11-Mar-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Return result fix in RAS The low level EEPROM write method, doesn't return 1, but the number of bytes written. Thus do not compare to 1, instead, compare to greater than 0 for success. Other cleanup: if the lower layers returned -errno, then return that, as opposed to overwriting the error code with one-fits-all -EINVAL. For instance, some return -EAGAIN. Cc: Jean Delvare <jdelvare@suse.de> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
1fab841f |
|
11-Mar-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: RAS xfer to read/write Wrap amdgpu_ras_eeprom_xfer(..., bool write), into amdgpu_ras_eeprom_read() and amdgpu_ras_eeprom_write(), as that makes reading and understanding the code clearer. Cc: Jean Delvare <jdelvare@suse.de> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a4399657 |
|
05-Feb-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Rename misspelled function Instead of fixing the spelling in amdgpu_ras_eeprom_process_recods(), rename it to, amdgpu_ras_eeprom_xfer(), to look similar to other I2C and protocol transfer (read/write) functions. Also to keep the column span to within reason by using a shorter name. Change the "num" function parameter from "int" to "const u32" since it is the number of items (records) to xfer, i.e. their count, which cannot be a negative number. Also swap the order of parameters, keeping the pointer to records and their number next to each other, while the direction now becomes the last parameter. Cc: Jean Delvare <jdelvare@suse.de> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
43a44c53 |
|
02-Jul-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Return error if no RAS In amdgpu_ras_query_error_count() return an error if the device doesn't support RAS. This prevents that function from having to always set the values of the integer pointers (if set), and thus prevents function side effects--always to have to set values of integers if integer pointers set, regardless of whether RAS is supported or not--with this change this side effect is mitigated. Also, if no pointers are set, don't count, since we've no way of reporting the counts. Also, give this function a kernel-doc. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Reported-by: Tom Rix <trix@redhat.com> Fixes: a46751fbcde505 ("drm/amdgpu: Fix RAS function interface") Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
513befa6 |
|
11-Jun-2021 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: message smu to update hbm bad page number Use SMU to update the bad pages rather than directly accessing the EEPROM from the driver. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
e11d5e0d |
|
15-Jun-2021 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: add vega20 to ras quirk list Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a3fbb0d8 |
|
09-Jun-2021 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: use adev_to_drm macro for consistency (v2) Use adev_to_drm() to get to the drm_device pointer. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
05adfd80 |
|
21-May-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Use delayed work to collect RAS error counters On Context Query2 IOCTL return the correctable and uncorrectable errors in O(1) fashion, from cached values, and schedule a delayed work function to calculate and cache them for the next such IOCTL. v2: Cancel pending delayed work at ras_fini(). v3: Remove conditionals when dealing with delayed work manipulation as they're inherently racy. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a46751fb |
|
18-May-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Fix RAS function interface The correctable and uncorrectable errors are calculated at each invocation of this function. Therefore, it is highly inefficient to return just one of them based on a Boolean input. If the caller wants both, twice the work would be done. (And this work is O(n^3) on Vega20.) Fix this "interface" to simply return what it had calculated--both values. Let the caller choose what it wants to record, inspect, use. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
72c8c97b |
|
12-May-2021 |
Andrey Grodzovsky <andrey.grodzovsky@amd.com> |
drm/amdgpu: Split amdgpu_device_fini into early and late Some of the stuff in amdgpu_device_fini such as HW interrupts disable and pending fences finilization must be done right away on pci_remove while most of the stuff which relates to finilizing and releasing driver data structures can be kept until drm_driver.release hook is called, i.e. when the last device reference is dropped. v4: Change functions prefix early->hw and late->sw Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20210512142648.666476-3-andrey.grodzovsky@amd.com
|
#
8f6368a9 |
|
17-May-2021 |
John Clements <john.clements@amd.com> |
drm/amdgpu: Conditionally reset RAS counters on boot Only clear RAS error counters if perestent EDC harvesting is not supported Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
c666bbf0 |
|
09-May-2021 |
Dwaipayan Ray <dwaipayanray1@gmail.com> |
drm/amd/amdgpu: Fix errors in function documentation Fix a couple of syntax errors and removed one excess parameter in the function documentations which lead to kernel docs build warning. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Dwaipayan Ray <dwaipayanray1@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
7780f503 |
|
10-May-2021 |
Dennis Li <Dennis.Li@amd.com> |
drm/amdgpu: add function to clear MMEA error status for aldebaran For aldebaran, hardware will not clear error status automatically when reading error status register, insteadly driver should set clear bit of the error status register explicitly to clear error status. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
011907fd |
|
09-May-2021 |
Dennis Li <Dennis.Li@amd.com> |
drm/amdgpu: covert ras status to kernel errno The original codes use ras status and kernl errno together in the same function, which is a wrong code style. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
7ddd9770 |
|
06-May-2021 |
Oak Zeng <Oak.Zeng@amd.com> |
drm/amdgpu: Quit RAS initialization earlier if RAS is disabled If RAS is disabled through amdgpu_ras_enable kernel parameter, we should quit the RAS initialization eariler to avoid initialization of some RAS data structure such as sysfs etc. Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
ef0d7d20 |
|
04-May-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Export ras_*_enabled to debugfs Export the runtime-set "ras_hw_enabled" and "ras_enabled" to debugfs, for debugging. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
8ab0d6f0 |
|
04-May-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Rename to ras_*_enabled Rename, ras_hw_supported --> ras_hw_enabled, and ras_features --> ras_enabled, to show that ras_enabled is a subset of ras_hw_enabled, which itself is a subset of the ASIC capability. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
e509965e |
|
03-May-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Move up ras_hw_supported Move ras_hw_supported into struct amdgpu_dev. The dependency is: struct amdgpu_ras <== struct amdgpu_dev <== ASIC, read as "struct amdgpu_ras depends on struct amdgpu_dev, which depends on the hardware." This can be loosely understood as, "if RAS is supported, which is property of the ASIC (struct amdgpu_dev), then we can access struct amdgpu_ras." v2: Fix a typo: must binary AND in ternary cond in amdgpu_ras.c Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
acdae216 |
|
03-May-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Remove redundant ras->supported Remove redundant ras->supported, as this value is also stored in adev->ras_features. Use adev->ras_features, as that supercedes "ras", since the latter is its member. The dependency goes like this: ras <== adev->ras_features <== hw_supported, and is read as "ras depends on ras_features, which depends on hw_supported." The arrows show the flow of information, i.e. the dependency update. "hw_supported" should also live in "adev". Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f50160cf |
|
28-Apr-2021 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: force enable gfx ras for vega20 ws Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
1f6e8eb1 |
|
29-Apr-2021 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: enable gfx ras in aldebran by default gfx ras now can be enabled by default in aldebaran Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
78871b6c |
|
28-Apr-2021 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: enable ras error count query and reset for HDP add hdp block ras error query and reset support in amdgpu ras error count query and reset interface Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a30f1286 |
|
25-Apr-2021 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: provide socket/die id info in RAS message Add socket/die information in RAS messages for platforms that support query those information Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
502f0e28 |
|
22-Apr-2021 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: disable gfx ras by default in aldebaran aldebaran gfx ras is still under development Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
19d0dfda |
|
18-Apr-2021 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: optimize gfx ras features flag clean Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
1f0d8e37 |
|
24-Mar-2021 |
Mukul Joshi <mukul.joshi@amd.com> |
drm/amdgpu: Reset RAS error count and status regs Reset the RAS error count and error status registers after reading to prevent over reporting error counts on Aldebaran. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-By: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
6df23f4c |
|
16-Apr-2021 |
Dennis Li <Dennis.Li@amd.com> |
drm/amdgpu: fix a error injection failed issue because "sscanf(str, "retire_page")" always return 0, if application use the raw data for error injection, it always wrongly falls into "op == 3". Change to use strstr instead. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
546aa546 |
|
14-Apr-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Add double-sscanf but invert Add back the double-sscanf so that both decimal and hexadecimal values could be read in, but this time invert the scan so that hexadecimal format with a leading 0x is tried first, and if that fails, then try decimal format. Also use a logical-AND instead of nesting double if-conditional. See commit "drm/amdgpu: Fix a bug for input with double sscanf" Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
737c375b |
|
13-Apr-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Fix kernel-doc for the RAS sysfs interface Imporve the kernel-doc for the RAS sysfs interface. Fix the grammar, fix the context. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
7fb64071 |
|
13-Apr-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Add bad_page_cnt_threshold to debugfs Add bad_page_cnt_threshold to debugfs, an optional file system used for debugging, for reporting purposes only--it usually matches the size of EEPROM but may be different depending on the "bad_page_threshold" kernel module option. The "bad_page_cnt_threshold" is a dynamically computed value. It depends on three things: the VRAM size; the size of the EEPROM (or the size allocated to the RAS table therein); and the "bad_page_threshold" module parameter. It is a dynamically computed value, when the amdgpu module is run, on which further parameters and logic depend, and as such it is helpful to see the dynamically computed value in debugfs. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
80b0cd0f |
|
13-Apr-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Fix a bug in checking the result of reserve page Fix if (ret) --> if (!ret), a bug, for "retire_page", which caused the kernel to recall the method with *pos == end of file, and that bounced back with error. On the first run, we advanced *pos, but returned 0 back to fs layer, also a bug. Fix the logic of the check of the result of amdgpu_reserve_page_direct()--it is 0 on success, and non-zero on error, not the other way around. This patch fixes this bug. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
6cb7a1d4 |
|
13-Apr-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Fix a bug for input with double sscanf Remove double-sscanf to scan for %llu and 0x%llx, as that is not going to work! The %llu will consume the "0" in "0x" of your input, and the hex value you think you're entering will always be 0. That is, a valid hex value can never be consumed. On the other hand, just entering a hex number without leading 0x will either be scanned as a string and not match, for instance FAB123, or the leading decimal portion is scanned as the %llu, for instance 123FAB will be scanned as 123, which is not correct. Thus remove the first %llu scan and leave only the %llx scan, removing the leading 0x since %llx can scan either. Addresses are usually always hex values, so this suffices. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Xinhui Pan <xinhui.pan@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
cbb8f989 |
|
09-Apr-2021 |
John Clements <john.clements@amd.com> |
drm/amdgpu: page retire over debugfs mechanism added support in RAS debugfs to add bad page for isolated page retirement testing Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
134d16d5 |
|
25-Mar-2021 |
John Clements <john.clements@amd.com> |
drm/amdgpu: RAS harvest on driver load In event of RAS UE + warm reset, error counters shall be harvested and cleared on driver load Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
719a9b33 |
|
19-Mar-2021 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: split gfx callbacks into ras and non-ras ones gfx ras is only available in cerntain ip generations. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
8bc7b360 |
|
19-Mar-2021 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: split mmhub callbacks into ras and non-ras ones mmhub ras is only avaiable in cerntain mmhub ip generation. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
49070c4e |
|
17-Mar-2021 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: split umc callbacks to ras and non-ras ones umc ras is not managed by gpu driver when gpu is connected to cpu through xgmi. split umc callbacks into ras and non-ras ones so gpu driver only initializes umc ras callbacks when it manages umc ras. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
52137ca8 |
|
18-Mar-2021 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: move xgmi ras functions to xgmi_ras_funcs xgmi ras is not managed by gpu driver when gpu is connected to cpu through xgmi. move all xgmi ras functions to xgmi_ras_funcs so gpu driver only initializes xgmi ras functions when it manages xgmi ras. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
6e36f231 |
|
02-Apr-2021 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: split nbio callbacks into ras and non-ras ones nbio ras is not managed by gpu driver when gpu is connected to cpu through xgmi. split nbio callbacks into ras and non-ras ones so gpu driver only initializes nbio ras callbacks when it manages nbio ras. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
75f06251 |
|
08-Mar-2021 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: initialze ras caps per paltform config Driver only manages GFX/SDMA/MMHUB RAS in platforms that gpu node is connected to cpu through XGMI, other than that, it queries VBIOS for RAS capabilities. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f0872686 |
|
31-Mar-2021 |
Bernard Zhao <bernard@vivo.com> |
drm/amd: cleanup coding style a bit Fix patch check warning: WARNING: suspect code indent for conditional statements (8, 17) + if (obj && obj->use < 0) { + DRM_ERROR("RAS ERROR: Unbalance obj(%s) use\n", obj->head.name); WARNING: braces {} are not necessary for single statement blocks + if (obj && obj->use < 0) { + DRM_ERROR("RAS ERROR: Unbalance obj(%s) use\n", obj->head.name); + } Signed-off-by: Bernard Zhao <bernard@vivo.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
5a434527 |
|
01-Apr-2021 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: support sdma error injection Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reivewed-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
36000c7a |
|
24-Mar-2021 |
Tian Tao <tiantao6@hisilicon.com> |
drm/amdgpu: Convert sysfs sprintf/snprintf family to sysfs_emit Fix the following coccicheck warning: drivers/gpu//drm/amd/amdgpu/amdgpu_ras.c:434:9-17: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_xgmi.c:220:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_xgmi.c:249:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/df_v3_6.c:208:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_psp.c:2973:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:75:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:112:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:58:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:93:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:125:9-17: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_gtt_mgr.c:52:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_gtt_mgr.c:71:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:140:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:164:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:186:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:208:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_atombios.c:1916:8-16: WARNING: use scnprintf or sprintf Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
084e2640 |
|
11-Mar-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Fix check for RAS support Use positive logic to check for RAS support. Rename the function to actually indicate what it is testing for. Essentially, make the function a predicate with the correct name. Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
e5c04edf |
|
18-Mar-2021 |
Christian König <christian.koenig@amd.com> |
drm/amdgpu: revert "reserve backup pages for bad page retirment" As noted during the review this approach doesn't make sense at all. We should not apply any limitation on the VRAM applications can use inside the kernel. If an application or end user wants to reserve a certain amount of VRAM for bad pages handling we should do this in the upper layer. This reverts commit f89b881c81d9a6481fc17b46b351ca38f5dd6f3a. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
970fd197 |
|
10-Mar-2021 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: fix send ras disable cmd when asic not support ras cause: It is necessary to send ras disable command to ras-ta during gfx block ras later init, because the ras capability is disable read from vbios for vega20 gaming, but the ras context is released during ras init process, this will cause send ras disable command to ras-to failed. how: Delay releasing ras context, the ras context will be released after gfx block later init done. Changed from V1: move release_ras_context into ras_resume Changed from V2: check BIT(UMC) is more reasonable before access eeprom table Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
b69d5c7e |
|
09-Mar-2021 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: support query ecc cap for SIENNA_CICHLID driver needs to query umc_info_v3_3 for ecc capability in sienna_cichlid Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Likun Gao <Likun.Gao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
e5086659 |
|
09-Mar-2021 |
shaoyunl <shaoyun.liu@amd.com> |
drm/amdgpu: skip read eeprom for device that pending on XGMI reset Read eeprom through SMU doesn't works stable on XGMI reset during test. skip it for now Signed-off-by: shaoyunl <shaoyun.liu@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
761d86d3 |
|
03-Feb-2021 |
Dennis Li <Dennis.Li@amd.com> |
drm/amdgpu: harvest edc status when connected to host via xGMI When connected to a host via xGMI, system fatal errors may trigger warm reset, driver has no change to query edc status before reset. Therefore in this case, driver should harvest previous error loging registers during boot, instead of only resetting them. v2: 1. IP's ras_manager object is created when its ras feature is enabled, so change to query edc status after amdgpu_ras_late_init called 2. change to enable watchdog timer after finishing gfx edc init Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reivewed-by: Hawking Zhang <hawking.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
88f8575b |
|
05-Mar-2021 |
Dennis Li <Dennis.Li@amd.com> |
drm/amdgpu: enable watchdog feature for SQ of aldebaran SQ's watchdog timer monitors forward progress, a mask of which waves caused the watchdog timeout is recorded into ras status registers and then trigger a system fatal error event. v2: 1. change *query_timeout_status to *query_sq_timeout_status. 2. move query_sq_timeout_status into amdgpu_ras_do_recovery. 3. add module parameters to enable/disable fatal error event and modify the watchdog timer. v3: 1. remove unused parameters of *enable_watchdog_timer Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
11003c68 |
|
25-Feb-2021 |
Dennis Li <Dennis.Li@amd.com> |
drm/amdgpu: remove unnecessary reading for epprom header If the number of badpage records exceed the threshold, driver has updated both epprom header and control->tbl_hdr.header before gpu reset, therefore GPU recovery thread no need to read epprom header directly. v2: merge amdgpu_ras_check_err_threshold into amdgpu_ras_eeprom_check_err_threshold Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f89b881c |
|
22-Feb-2021 |
Dennis Li <Dennis.Li@amd.com> |
drm/amdgpu: reserve backup pages for bad page retirment To ensure user has a constant of VRAM accessible in run-time, driver reserves limit backup pages when init, and return ones when bad pages retired, to keep no change of unused memory size. v2: refine codes to calculate badpags threshold Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
ea1b8c9b |
|
16-Feb-2021 |
Nirmoy Das <nirmoy.das@amd.com> |
drm/amdgpu: mark local function as static Mark amdgpu_ras_debugfs_create_ctrl_node() as static. Fixes: eb14235668777b ("drm/amdgpu: do not keep debugfs dentry") Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
88293c03 |
|
10-Feb-2021 |
Nirmoy Das <nirmoy.das@amd.com> |
drm/amdgpu: do not keep debugfs dentry Cleanup unnecessary debugfs dentries and surrounding functions. v3: remove return value check for debugfs_create_file() v2: remove ttm_debugfs_entries array. do not init variables. Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
fe2d9f5a |
|
14-Jan-2021 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: toggle on DF Cstate after finishing xgmi injection Fixes: 5c23e9e05e42 ("drm/amdgpu: Update RAS XGMI error inject sequence") Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
732f2a30 |
|
04-Jan-2021 |
Dennis Li <Dennis.Li@amd.com> |
drm/amdgpu: fix no bad_pages issue after umc ue injection old code wrongly used the bad page status as the function return value, which cause amdgpu_ras_badpages_read always return failed. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
8a82b347 |
|
04-Jan-2021 |
Dennis Li <Dennis.Li@amd.com> |
drm/amdgpu: fix no bad_pages issue after umc ue injection old code wrongly used the bad page status as the function return value, which cause amdgpu_ras_badpages_read always return failed. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
2343e9d2 |
|
03-Dec-2020 |
Arnd Bergmann <arnd@arndb.de> |
drm/amdgpu: fix debugfs creation/removal, again There is still a warning when CONFIG_DEBUG_FS is disabled: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1145:13: error: 'amdgpu_ras_debugfs_create_ctrl_node' defined but not used [-Werror=unused-function] 1145 | static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev) Change the code again to make the compiler actually drop this code but not warn about it. Fixes: ae2bf61ff39e ("drm/amdgpu: guard ras debugfs creation/removal based on CONFIG_DEBUG_FS") Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
cedf7884 |
|
03-Dec-2020 |
Arnd Bergmann <arnd@arndb.de> |
drm/amdgpu: fix debugfs creation/removal, again There is still a warning when CONFIG_DEBUG_FS is disabled: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1145:13: error: 'amdgpu_ras_debugfs_create_ctrl_node' defined but not used [-Werror=unused-function] 1145 | static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev) Change the code again to make the compiler actually drop this code but not warn about it. Fixes: ae2bf61ff39e ("drm/amdgpu: guard ras debugfs creation/removal based on CONFIG_DEBUG_FS") Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
cd92df93 |
|
13-Nov-2020 |
Lee Jones <lee.jones@linaro.org> |
drm/amd/amdgpu/amdgpu_ras: Make local function 'amdgpu_ras_error_status_query' static Fixes the following W=1 kernel build warning(s): drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1482:6: warning: no previous prototype for ‘amdgpu_ras_error_status_query’ [-Wmissing-prototypes] Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: amd-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
32dc5348 |
|
13-Nov-2020 |
Lee Jones <lee.jones@linaro.org> |
drm/amd/amdgpu/amdgpu_ras: Remove unused function 'amdgpu_ras_error_cure' Fixes the following W=1 kernel build warning(s): drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:908:5: warning: no previous prototype for ‘amdgpu_ras_error_cure’ [-Wmissing-prototypes] Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: amd-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f3729f7b |
|
02-Nov-2020 |
Deepak R Varma <mh12gx2825@gmail.com> |
drm/amdgpu/amdgpu: improve code indentation and alignment General code indentation and alignment changes such as replace spaces by tabs or align function arguments as per the coding style guidelines. The patch corrects issues for various amdgpu_*.c files for this driver. Issue reported by checkpatch script. Signed-off-by: Deepak R Varma <mh12gx2825@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
aec576f9 |
|
27-Oct-2020 |
Tom Rix <trix@redhat.com> |
drm/amdgpu: remove unneeded semicolon A semicolon is not needed after a switch statement. Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
676deb38 |
|
22-Oct-2020 |
Dennis Li <Dennis.Li@amd.com> |
drm/amdgpu: fix the issue of reserving bad pages failed In amdgpu_ras_reset_gpu, because bad pages may not be freed, it has high probability to reserve bad pages failed. Change to reserve bad pages when freeing VRAM. v2: 1. avoid allocating the drm_mm node outside of amdgpu_vram_mgr.c 2. move bad page reserving into amdgpu_ras_add_bad_pages, if vram mgr reserve bad page failed, it will put it into pending list, otherwise put it into processed list; 3. remove amdgpu_ras_release_bad_pages, because retired page's info has been moved into amdgpu_vram_mgr v3: 1. formate code style; 2. rename amdgpu_vram_reserve_scope as amdgpu_vram_reservation; 3. rename scope_pending as reservations_pending; 4. rename scope_processed as reserved_pages; 5. change to iterate over all the pending ones and try to insert them with drm_mm_reserve_node(); v4: 1. rename amdgpu_vram_mgr_reserve_scope as amdgpu_vram_mgr_reserve_range; 2. remove unused include "amdgpu_ras.h"; 3. rename amdgpu_vram_mgr_check_and_reserve as amdgpu_vram_mgr_do_reserve; 4. refine amdgpu_vram_mgr_reserve_range to call amdgpu_vram_mgr_do_reserve. Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Hawking Zhang <hawking.zhang@amd.com> Signed-off-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Wenhui Sheng <Wenhui.Sheng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
22503d80 |
|
19-Oct-2020 |
Dennis Li <Dennis.Li@amd.com> |
drm/amdgpu: change to save bad pages in UMC error interrupt callback Instead of saving bad pages in amdgpu_ras_reset_gpu, it will reduce the unnecessary calling of amdgpu_ras_save_bad_pages. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <hawking.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
e4eeceb7 |
|
21-Oct-2020 |
John Clements <john.clements@amd.com> |
Revert drm/amdgpu: disable sienna chichlid UMC RAS This reverts commit 265c280a4807419249644156654e5c40a235ea84. Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
4d2aae33 |
|
21-Oct-2020 |
John Clements <john.clements@amd.com> |
Revert drm/amdgpu: disable sienna chichlid UMC RAS This reverts commit 265c280a4807419249644156654e5c40a235ea84. Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a069a9eb |
|
25-Sep-2020 |
Alex Deucher <alexander.deucher@amd.com> |
drm/amdgpu: fix a warning in amdgpu_ras.c (v2) drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c: In function ‘amdgpu_ras_fs_init’: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1284:2: warning: ignoring return value of ‘sysfs_create_group’, declared with attribute warn_unused_result [-Wunused-result] 1284 | sysfs_create_group(&adev->dev->kobj, &group); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ v2: just print an error for sysfs group creation failure Acked-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
c3d4d45d |
|
24-Sep-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: clean up ras sysfs creation (v2) Merge ras sysfs creation together by calling sysfs_create_group once, as sysfs_update_group may not work properly as expected. v2: improve commit message Signed-off-by: Guchun Chen <guchun.chen@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
265c280a |
|
24-Sep-2020 |
John Clements <john.clements@amd.com> |
drm/amdgpu: disable sienna chichlid UMC RAS disable UMC RAS in lieu of stability issues on certain sku Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
3f975d0f |
|
15-Sep-2020 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: update athub interrupt harvesting handle GCEA/MMHUB EA error should not result to DF freeze, this is fixed in next generation, but for some reasons the GCEA/MMHUB EA error will result to DF freeze in previous generation, diver should avoid to indicate GCEA/MMHUB EA error as hw fatal error in kernel message by read GCEA/MMHUB err status registers. Changed from V1: make query_ras_error_status function more general make read mmhub er status register more friendly Changed from V2: move ras error status query function into do_recovery workqueue Changed from V3: remove useless code from V2, print GCEA error status instance number Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
5436ab94 |
|
17-Aug-2020 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdkfd: fix set kfd node ras properties value The ctx->features are new RAS implementation which is only available for Vega20 and onwards, it is not available for vega10, vega10 should follow legacy ECC implementation. Changed from V1: wrap function to initialize kfd node properties Changed from V2: remove wrap function and SDMA SRAM ECC check Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
4a580877 |
|
23-Aug-2020 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Get DRM dev from adev by inline-f Add a static inline adev_to_drm() to obtain the DRM device pointer from an amdgpu_device pointer. Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
d95e8e97 |
|
18-Aug-2020 |
Dennis Li <Dennis.Li@amd.com> |
drm/amdgpu: refine create and release logic of hive info Change to dynamically create and release hive info object, which help driver support more hives in the future. v2: Change to save hive object pointer in adev, to avoid locking xgmi_mutex every time when calling amdgpu_get_xgmi_hive. v3: 1. Change type of hive object pointer in adev from void* to amdgpu_hive_info*. 2. remove unnecessary variable initialization. Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
53b3f8f4 |
|
19-Aug-2020 |
Dennis Li <Dennis.Li@amd.com> |
drm/amdgpu: refine codes to avoid reentering GPU recovery if other threads have holden the reset lock, recovery will fail to try_lock. Therefore we introduce atomic hive->in_reset and adev->in_gpu_reset, to avoid reentering GPU recovery. v2: drop "? true : false" in the definition of amdgpu_in_reset Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
1a68d96f |
|
13-Aug-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: fix NULL pointer access issue when unloading driver When unloading driver by "modprobe -r amdgpu", one NULL pointer dereference bug occurs in ras debugfs releasing. The cause is the duplicated debugfs_remove, as drm debugfs_root dir has been cleaned up already by drm_minor_unregister. BUG: kernel NULL pointer dereference, address: 00000000000000a0 PGD 0 P4D 0 Oops: 0002 [#1] SMP PTI CPU: 11 PID: 1526 Comm: modprobe Tainted: G OE 5.6.0-guchchen #1 Hardware name: System manufacturer System Product Name/TUF Z370-PLUS GAMING II, BIOS 0411 09/21/2018 RIP: 0010:down_write+0x15/0x40 Code: eb de e8 7e 17 72 ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 53 48 89 fb e8 92 d8 ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 13 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89 43 08 5b c3 RSP: 0018:ffffb1590386fcd0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 00000000000000a0 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffff85b2fcc2 RDI: 00000000000000a0 RBP: ffffb1590386fd30 R08: ffffffff85b2fcc2 R09: 000000000002b3c0 R10: ffff97a330618c40 R11: 00000000000005f6 R12: ffff97a3481beb40 R13: 00000000000000a0 R14: ffff97a3481beb40 R15: 0000000000000000 FS: 00007fb11a717540(0000) GS:ffff97a376cc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000a0 CR3: 00000004066d6006 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: simple_recursive_removal+0x63/0x370 ? debugfs_remove+0x60/0x60 debugfs_remove+0x40/0x60 amdgpu_ras_fini+0x82/0x230 [amdgpu] ? __kernfs_remove.part.17+0x101/0x1f0 ? kernfs_name_hash+0x12/0x80 amdgpu_device_fini+0x1c0/0x580 [amdgpu] amdgpu_driver_unload_kms+0x3e/0x70 [amdgpu] amdgpu_pci_remove+0x36/0x60 [amdgpu] pci_device_remove+0x3b/0xb0 device_release_driver_internal+0xe5/0x1c0 driver_detach+0x46/0x90 bus_remove_driver+0x58/0xd0 pci_unregister_driver+0x29/0x90 amdgpu_exit+0x11/0x25 [amdgpu] __x64_sys_delete_module+0x13d/0x210 do_syscall_64+0x5f/0x250 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
ae2bf61f |
|
13-Aug-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: guard ras debugfs creation/removal based on CONFIG_DEBUG_FS It can avoid potential build warn/error when CONFIG_DEBUG_FS is not set. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
2e2f5dd5 |
|
13-Aug-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: fix NULL pointer access issue when unloading driver When unloading driver by "modprobe -r amdgpu", one NULL pointer dereference bug occurs in ras debugfs releasing. The cause is the duplicated debugfs_remove, as drm debugfs_root dir has been cleaned up already by drm_minor_unregister. BUG: kernel NULL pointer dereference, address: 00000000000000a0 PGD 0 P4D 0 Oops: 0002 [#1] SMP PTI CPU: 11 PID: 1526 Comm: modprobe Tainted: G OE 5.6.0-guchchen #1 Hardware name: System manufacturer System Product Name/TUF Z370-PLUS GAMING II, BIOS 0411 09/21/2018 RIP: 0010:down_write+0x15/0x40 Code: eb de e8 7e 17 72 ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 53 48 89 fb e8 92 d8 ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 13 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89 43 08 5b c3 RSP: 0018:ffffb1590386fcd0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 00000000000000a0 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffff85b2fcc2 RDI: 00000000000000a0 RBP: ffffb1590386fd30 R08: ffffffff85b2fcc2 R09: 000000000002b3c0 R10: ffff97a330618c40 R11: 00000000000005f6 R12: ffff97a3481beb40 R13: 00000000000000a0 R14: ffff97a3481beb40 R15: 0000000000000000 FS: 00007fb11a717540(0000) GS:ffff97a376cc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000a0 CR3: 00000004066d6006 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: simple_recursive_removal+0x63/0x370 ? debugfs_remove+0x60/0x60 debugfs_remove+0x40/0x60 amdgpu_ras_fini+0x82/0x230 [amdgpu] ? __kernfs_remove.part.17+0x101/0x1f0 ? kernfs_name_hash+0x12/0x80 amdgpu_device_fini+0x1c0/0x580 [amdgpu] amdgpu_driver_unload_kms+0x3e/0x70 [amdgpu] amdgpu_pci_remove+0x36/0x60 [amdgpu] pci_device_remove+0x3b/0xb0 device_release_driver_internal+0xe5/0x1c0 driver_detach+0x46/0x90 bus_remove_driver+0x58/0xd0 pci_unregister_driver+0x29/0x90 amdgpu_exit+0x11/0x25 [amdgpu] __x64_sys_delete_module+0x13d/0x210 do_syscall_64+0x5f/0x250 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f1403342 |
|
12-Aug-2020 |
Christian König <christian.koenig@amd.com> |
drm/amdgpu: revert "fix system hang issue during GPU reset" The whole approach wasn't thought through till the end. We already had a reset lock like this in the past and it caused the same problems like this one. Completely revert the patch for now and add individual trylock protection to the hardware access functions as necessary. This reverts commit df9c8d1aa278c435c30a69b8f2418b4a52fcb929. Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
2f530724 |
|
11-Aug-2020 |
Nirmoy Das <nirmoy.das@amd.com> |
drm/amdgpu: pass NULL pointer instead of 0 Fixes: c030f2e4166c3f55 ("drm/amdgpu: add amdgpu_ras.c to support ras (v2)") Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
66459e1d |
|
04-Aug-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: add debugfs node to toggle ras error cnt harvest Before ras recovery is issued, user could operate this debugfs node to enable/disable the harvest of all RAS IPs' ras error count registers, which will help keep hardware's registers' status instead of cleaning up them. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f75e94d8 |
|
04-Aug-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: bypass querying ras error count registers Once ras recovery is issued by ras sync flood interrupt or ras controller interrupt, add this guard to bypass or execute ras error count register harvest of all IPs. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
ee10e06e |
|
19-Jul-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: add printing after executing page reservation to eeprom This will tell users if the faulty page has been written to external eeprom device in dmesg log. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
0ad7a64d |
|
03-Aug-2020 |
John Clements <john.clements@amd.com> |
drm/amdgpu: enable RAS support for sienna cichlid enabled GECC error injection and query support Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a219ecbb |
|
27-Jul-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: disable page reservation when amdgpu_bad_page_threshold = 0 When amdgpu_bad_page_threshold = 0, bad page reservation stuffs are skipped in either UMC ECC irq or page retirement calling of sync flood isr. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f848159b |
|
31-Jul-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: decouple sysfs creating of bad page node Bad page information should not be exposed by sysfs when bad page retirement is disabled, so decouple it from ras sysfs group creating, and add one guard before creating. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
eb0c3cd4 |
|
31-Jul-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: add one definition for RAS's sysfs/debugfs name(v2) Add one definition for the RAS module's FS name. It's used in both debugfs and sysfs cases. v2: Use static variable instead of macro definition. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
bf0b91b7 |
|
21-Jul-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: restore ras flags when user resets eeprom(v2) RAS flags needs to be cleaned as well when user requires one clean eeprom. v2: RAS flags shall be restored after eeprom reset succeeds. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
e8fbaf03 |
|
23-Jul-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: break GPU recovery once it's in bad state(v4) When GPU executes recovery and retriving bad GPU tag from external eerpom device, the recovery will be broken and error message is printed as well for user's awareness. v2: Refine warning message in threshold reaching case, and fix spelling typo. v3: Fix explicit calling of bad gpu. v4: Rename function names. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
35cd2cda |
|
23-Jul-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: skip bad page reservation once issuing from eeprom write Once the ras recovery is issued from eeprom write itself, bad page reservation should be ignored, otherwise, recursive calling of writting to eeprom would happen. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
b82e65a9 |
|
23-Jul-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: break driver init process when it's bad GPU(v5) When retrieving bad gpu tag from eeprom, GPU init should fail as the GPU needs to be retired for further check. v2: Fix spelling typo, correct the condition to detect bad gpu tag and refine error message. v3: Refine function argument name. v4: Fix missing check of returning value of i2c initialization error case. v5: Use dev_err to print PCI information in dmesg instead of DRM_ERROR. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
c84d4670 |
|
21-Jul-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: validate bad page threshold in ras(v3) Bad page threshold value should be valid in the range between -1 and max records length of eeprom. It could determine when saved bad pages exceed threshold value, and proceed corresponding actions. v2: When using the default typical value, it should be min value between typical value and eeprom max records length. v3: drop the case of setting bad_page_cnt_threshold to be 0xFFFFFFFF, as it confuses user. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
df9c8d1a |
|
08-Jul-2020 |
Dennis Li <Dennis.Li@amd.com> |
drm/amdgpu: fix system hang issue during GPU reset when GPU hang, driver has multi-paths to enter amdgpu_device_gpu_recover, the atomic adev->in_gpu_reset and hive->in_reset are used to avoid re-entering GPU recovery. During GPU reset and resume, it is unsafe that other threads access GPU, which maybe cause GPU reset failed. Therefore the new rw_semaphore adev->reset_sem is introduced, which protect GPU from being accessed by external threads during recovery. v2: 1. add rwlock for some ioctls, debugfs and file-close function. 2. change to use dqm->is_resetting and dqm_lock for protection in kfd driver. 3. remove try_lock and change adev->in_gpu_reset as atomic, to avoid re-enter GPU recovery for the same GPU hang. v3: 1. change back to use adev->reset_sem to protect kfd callback functions, because dqm_lock couldn't protect all codes, for example: free_mqd must be called outside of dqm_lock; [ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019 [ 1230.177221] Call Trace: [ 1230.178249] dump_stack+0x98/0xd5 [ 1230.179443] amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu] [ 1230.180673] gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu] [ 1230.181882] amdgpu_gart_unbind+0xa9/0xe0 [amdgpu] [ 1230.183098] amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu] [ 1230.184239] ? ttm_bo_put+0x171/0x5f0 [ttm] [ 1230.185394] ttm_tt_unbind+0x21/0x40 [ttm] [ 1230.186558] ttm_tt_destroy.part.12+0x12/0x60 [ttm] [ 1230.187707] ttm_tt_destroy+0x13/0x20 [ttm] [ 1230.188832] ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm] [ 1230.189979] ttm_bo_put+0x1be/0x5f0 [ttm] [ 1230.191230] amdgpu_bo_unref+0x1e/0x30 [amdgpu] [ 1230.192522] amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu] [ 1230.193833] free_mqd+0x25/0x40 [amdgpu] [ 1230.195143] destroy_queue_cpsch+0x1a7/0x270 [amdgpu] [ 1230.196475] pqm_destroy_queue+0x105/0x260 [amdgpu] [ 1230.197819] kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu] [ 1230.199154] kfd_ioctl+0x277/0x500 [amdgpu] [ 1230.200458] ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu] [ 1230.201656] ? tomoyo_file_ioctl+0x19/0x20 [ 1230.202831] ksys_ioctl+0x98/0xb0 [ 1230.204004] __x64_sys_ioctl+0x1a/0x20 [ 1230.205174] do_syscall_64+0x5f/0x250 [ 1230.206339] entry_SYSCALL_64_after_hwframe+0x49/0xbe 2. remove try_lock and introduce atomic hive->in_reset, to avoid re-enter GPU recovery. v4: 1. remove an unnecessary whitespace change in kfd_chardev.c 2. remove comment codes in amdgpu_device.c 3. add more detailed comment in commit message 4. define a wrap function amdgpu_in_reset v5: 1. Fix some style issues. Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com> Suggested-by: Lijo Lazar <Lijo.Lazar@amd.com> Suggested-by: Luben Tukov <luben.tuikov@amd.com> Signed-off-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
b1628425 |
|
19-Jul-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: add printing after executing page reservation to eeprom This will tell users if the faulty page has been written to external eeprom device in dmesg log. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
bb5c7235 |
|
13-Jul-2020 |
Wenhui Sheng <Wenhui.Sheng@amd.com> |
drm/amdgpu: RAS emergency restart logic refine If we are in RAS triggered situation and BACO isn't support, emergency restart is needed, and this code is only needed for some specific cases(vega20 with given smu fw version). After we add smu mode1 reset for sienna cichlid, we need to share AMD_RESET_METHOD_MODE1 with psp mode1 reset, so in amdgpu_device_gpu_recover, we need differentiate which mode1 reset we are using, then decide if it's a full reset and then decide if emergency restart is needed, the logic will become much more complex. After discussion with Hawking, move emergency restart logic to an independent function. Signed-off-by: Likun Gao <Likun.Gao@amd.com> Signed-off-by: Wenhui Sheng <Wenhui.Sheng@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f3167919 |
|
18-Jun-2020 |
Nirmoy Das <nirmoy.das@amd.com> |
drm/amdgpu: label internally used symbols as static Used sparse(make C=1) to find these loose ends. v2: removed unwanted extra line Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
9e69b1ee |
|
01-Jun-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: remove useless code in RAS Module parameter amdgpu_ras_mask has been involved in the calculation of ras support capability, so drop this redundant code. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
5e91160a |
|
01-Jun-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: fix RAS memory leak in error case RAS context memory needs to freed in failure case. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
b0d4783a |
|
22-May-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: print warning when input address is invalid This will assist debug in error injection case. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
5c23e9e0 |
|
13-May-2020 |
John Clements <john.clements@amd.com> |
drm/amdgpu: Update RAS XGMI error inject sequence Disable XGMI link power down prior to issuing a XGMI RAS error Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
7fcffecf |
|
05-May-2020 |
Arnd Bergmann <arnd@arndb.de> |
drm/amdgpu: allocate large structures dynamically After the structure was padded to 1024 bytes, it is no longer suitable for being a local variable, as the function surpasses the warning limit for 32-bit architectures: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:587:5: error: stack frame size of 1072 bytes in function 'amdgpu_ras_feature_enable' [-Werror,-Wframe-larger-than=] int amdgpu_ras_feature_enable(struct amdgpu_device *adev, ^ Use kzalloc() instead to get it from the heap. Fixes: a0d254820f43 ("drm/amdgpu: update RAS TA to Host interface") Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a200034b |
|
30-Apr-2020 |
John Clements <john.clements@amd.com> |
drm/amdgpu: update RAS error handling Parse return status from TA to determine error severity Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a891d239 |
|
21-Apr-2020 |
Dennis Li <Dennis.Li@amd.com> |
drm/amdgpu: set error query ready after all IPs late init If set error query ready in amdgpu_ras_late_init, which will cause some IP blocks aren't initialized, but their error query is ready. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
12c17b9d |
|
16-Apr-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: fix kernel page fault issue by ras recovery on sGPU When running ras uncorrectable error injection and triggering GPU reset on sGPU, below issue is observed. It's caused by the list uninitialized when accessing. [ 80.047227] BUG: unable to handle page fault for address: ffffffffc0f4f750 [ 80.047300] #PF: supervisor write access in kernel mode [ 80.047351] #PF: error_code(0x0003) - permissions violation [ 80.047404] PGD 12c20e067 P4D 12c20e067 PUD 12c210067 PMD 41c4ee067 PTE 404316061 [ 80.047477] Oops: 0003 [#1] SMP PTI [ 80.047516] CPU: 7 PID: 377 Comm: kworker/7:2 Tainted: G OE 5.4.0-rc7-guchchen #1 [ 80.047594] Hardware name: System manufacturer System Product Name/TUF Z370-PLUS GAMING II, BIOS 0411 09/21/2018 [ 80.047888] Workqueue: events amdgpu_ras_do_recovery [amdgpu] Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
6952e99c |
|
10-Apr-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: refine ras related message print Prefix ras related kernel message logging with PCI device info by replacing DRM_INFO/WARN/ERROR with dev_info/warn/err. This can clearly tell user about GPU device information where ras is. And add some other ras message printing to make it more clear and friendly as well. Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
b3dbd6d3 |
|
07-Apr-2020 |
John Clements <john.clements@amd.com> |
drm/amdgpu: resolve mGPU RAS query instability upon receiving uncorrectable error, query every GPU node for ras errors Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
0b9ebd7e |
|
07-Apr-2020 |
John Clements <john.clements@amd.com> |
drm/amdgpu: resolve mGPU RAS query instability upon receiving uncorrectable error, query every GPU node for ras errors Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a9d82d2f |
|
27-Mar-2020 |
Evan Quan <evan.quan@amd.com> |
drm/amdgpu: fix non-pointer dereference for non-RAS supported Backtrace on gpu recover test on Navi10. [ 1324.516681] RIP: 0010:amdgpu_ras_set_error_query_ready+0x15/0x20 [amdgpu] [ 1324.523778] Code: 4c 89 f7 e8 cd a2 a0 d8 e9 99 fe ff ff 45 31 ff e9 91 fe ff ff 0f 1f 44 00 00 55 48 85 ff 48 89 e5 74 0e 48 8b 87 d8 2b 01 00 <40> 88 b0 38 01 00 00 5d c3 66 90 0f 1f 44 00 00 55 31 c0 48 85 ff [ 1324.543452] RSP: 0018:ffffaa1040e4bd28 EFLAGS: 00010286 [ 1324.549025] RAX: 0000000000000000 RBX: ffff911198b20000 RCX: 0000000000000000 [ 1324.556217] RDX: 00000000000c0a01 RSI: 0000000000000000 RDI: ffff911198b20000 [ 1324.563514] RBP: ffffaa1040e4bd28 R08: 0000000000001000 R09: ffff91119d0028c0 [ 1324.570804] R10: ffffffff9a606b40 R11: 0000000000000000 R12: 0000000000000000 [ 1324.578413] R13: ffffaa1040e4bd70 R14: ffff911198b20000 R15: 0000000000000000 [ 1324.586464] FS: 00007f4441cbf540(0000) GS:ffff91119ed80000(0000) knlGS:0000000000000000 [ 1324.595434] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1324.601345] CR2: 0000000000000138 CR3: 00000003fcdf8004 CR4: 00000000003606e0 [ 1324.608694] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1324.616303] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1324.623678] Call Trace: [ 1324.626270] amdgpu_device_gpu_recover+0x6e7/0xc50 [amdgpu] [ 1324.632018] ? seq_printf+0x4e/0x70 [ 1324.636652] amdgpu_debugfs_gpu_recover+0x50/0x80 [amdgpu] [ 1324.643371] seq_read+0xda/0x420 [ 1324.647601] full_proxy_read+0x5c/0x90 [ 1324.652426] __vfs_read+0x1b/0x40 [ 1324.656734] vfs_read+0x8e/0x130 [ 1324.660981] ksys_read+0xa7/0xe0 [ 1324.665201] __x64_sys_read+0x1a/0x20 [ 1324.669907] do_syscall_64+0x57/0x1c0 [ 1324.674517] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1324.680654] RIP: 0033:0x7f44417cf081 Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
61380faa |
|
25-Mar-2020 |
John Clements <john.clements@amd.com> |
drm/amdgpu: disable ras query and iject during gpu reset added flag to ras context to indicate if ras query functionality is ready Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
43c4d576 |
|
19-Mar-2020 |
John Clements <john.clements@amd.com> |
drm/amdgpu: protect RAS sysfs during GPU reset MMHub EDC becomes dirty after BACO reset EDC registers should be cleared early on in reset phase Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
c1509f3f |
|
12-Mar-2020 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: fix warning in ras_debugfs_create_all() Fix the warning "warn: variable dereferenced before check 'obj' (see line 1131)" by removing unnecessary checks as amdgpu_ras_debugfs_create_all() is only called from amdgpu_debugfs_init() where obj member in con->head list is not NULL. Use list_for_each_entry() instead list_for_each_entry_safe() as obj do not to be freeing or removing from list during this process. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
88474cca |
|
10-Mar-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: update ras capability's query based on mem ecc configuration RAS support capability needs to be updated on top of different memeory ECC enablement, and remove redundant memory ecc check in gmc module for vega20 and arcturus. v2: check HBM ECC enablement and set ras mask accordingly. v3: avoid to invoke atomfirmware interface to query twice. Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
204eaac6 |
|
05-Mar-2020 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: call ras_debugfs_create_all in debugfs_init and remove each ras IP's own debugfs creation this is required to fix ras when the driver does not use the drm load and unload callbacks due to ordering issues with the drm device node. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f9317014 |
|
05-Mar-2020 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: add function to creat all ras debugfs node centralize all debugfs creation in one place for ras this is required to fix ras when the driver does not use the drm load and unload callbacks due to ordering issues with the drm device node. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
ec01fe2d |
|
21-Feb-2020 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: enable PCS error report on VG20 Now driver will report XGMI/WAFL PCS error through sysfs xgmi_wafl_err_count node on Vega20 Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
19744f5f |
|
24-Feb-2020 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: move get_xgmi_relative_phy_addr to amdgpu_xgmi.c centralize all the xgmi related function to amdgpu_xgmi.c Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
313c8fd3 |
|
13-Feb-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: log on non-zero error conter per IP before GPU reset Once sync flood interrupt is triggered by RAS error, before actual GPU recovery job, it's necessary to log on and print non-zero error counter, this will help user knows where the RAS error source is from quickly. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a6c44d25 |
|
16-Jan-2020 |
John Clements <john.clements@amd.com> |
drm/amdgpu: added support to get mGPU DRAM base resolves issue with RAS error injection in mGPU configuration Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
93af20f7 |
|
15-Jan-2020 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: check if driver should try recovery in ras recovery path To allow the flexibilty for user to disable gpu recovery in RAS recovery path by module parameter amdgpu_gpu_recovery Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
3e81ee9a |
|
08-Jan-2020 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: support error reporting for sdma ip block invoke sdma query_ras_error_count to get sdma single bit error count Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
46cf2fec |
|
22-Dec-2019 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: add missed return value set for error case Return value should be set when going to error handle tag for error case, this can avoid potential invalid array access by upper caller. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
374bf7bd |
|
14-Dec-2019 |
zhengbin <zhengbin13@huawei.com> |
drm/amdgpu: Remove unneeded semicolon in amdgpu_ras.c Fixes coccicheck warning: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:318:2-3: Unneeded semicolon Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
61934624 |
|
13-Dec-2019 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: drop useless BACO arg in amdgpu_ras_reset_gpu BACO reset mode strategy is determined by latter func when calling amdgpu_ras_reset_gpu. So not to confuse audience, drop it. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f2a79be1 |
|
24-Nov-2019 |
Le Ma <le.ma@amd.com> |
drm/amdgpu: export amdgpu_ras_find_obj to use externally Change it to external interface. Signed-off-by: Le Ma <le.ma@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
baaeb610 |
|
13-Nov-2019 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: enable ras capablity check on arcturus check hw ras capablity via atomfirmware Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
ef177d11 |
|
30-Oct-2019 |
Alex Deucher <alexander.deucher@amd.com> |
drm/amdgpu: Improve RAS documentation (v2) Clarify some areas, clean up formatting, add section for unrecoverable error handling. v2: fix grammatical errors Reviewed-by: Yong Zhao <yong.zhao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
bff77e86 |
|
25-Oct-2019 |
Le Ma <le.ma@amd.com> |
drm/amdgpu: bypass some cleanup work after err_event_athub (v2) PSP lost connection when err_event_athub occurs. These cleanup work can be skipped in BACO reset. v2: squash in missing include (Alex) Signed-off-by: Le Ma <le.ma@amd.com> Reviewed-by: Hawking Zhang <hawking.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
52dd95f2 |
|
21-Oct-2019 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: define macros for retire page reservation Easy for maintainance. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
c688a06b |
|
21-Oct-2019 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: refine reboot debugfs operation in ras case (v3) Ras reboot debugfs node allows user one easy control to avoid gpu recovery hang problem and directly reboot system per card basis, after ras uncorrectable error happens. However, it is one common entry, which should get rid of ras_ctrl node and remove ip dependence when inputting by user. So add one new auto_reboot node in ras debugfs dir to achieve this. v2: in commit mssage, add justification why ras reboot debugfs node is needed. v3: use debugfs_create_bool to create debugfs file for boolean value Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
ed606f8a |
|
11-Oct-2019 |
Andrey Grodzovsky <andrey.grodzovsky@amd.com> |
dmr/amdgpu: Fix crash on SRIOV for ERREVENT_ATHUB_INTERRUPT interrupt. Ignre the ERREVENT_ATHUB_INTERRUPT for systems without RAS. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-and-tested-by: Jack Zhang <Jack.Zhang1@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
6e4be987 |
|
30-Sep-2019 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: avoid ras error injection for retired page check whether a page is bad page before umc error injection, bad page should not be accessed again Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
54e9ab2e |
|
08-Oct-2019 |
Alex Deucher <alexander.deucher@amd.com> |
drm/amdgpu/ras: document the reboot ras option We recently added it, but never documented it. Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a20bfd0f |
|
08-Oct-2019 |
Alex Deucher <alexander.deucher@amd.com> |
drm/amdgpu/ras: fix typos in documentation Fix a couple of spelling typos. Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
1995b3a3 |
|
03-Oct-2019 |
Felix Kuehling <Felix.Kuehling@amd.com> |
drm/amdgpu: Fix error handling in amdgpu_ras_recovery_init Don't set a struct pointer to NULL before freeing its members. It's hard to see what's happening due to a local pointer-to-pointer data aliasing con->eh_data. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Tested-by: Philip Cox <Philip.Cox@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
0771b0bf |
|
18-Sep-2019 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: simplify the access to eeprom_control struct simplify the code of accessing to eeprom_control struct Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
d65bf1f8 |
|
12-Sep-2019 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: replace mmhub_funcs with mmhub.funcs remove mmhub_funcs in adev Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f77c7109 |
|
19-Sep-2019 |
Alex Deucher <alexander.deucher@amd.com> |
drm/amdgpu/ras: fix and update the documentation for RAS Add new sections to amdgpu.rst, fix up formatting issues, add additional documentation to each section. Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
8a3e801f |
|
17-Sep-2019 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: avoid null pointer dereference null ptr should be checked first to avoid null ptr access Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a142ba88 |
|
17-Sep-2019 |
Alex Deucher <alexander.deucher@amd.com> |
drm/amdgpu/ras: use GPU PAGE_SIZE/SHIFT for reserving pages We are reserving vram pages so they should be aligned to the GPU page size. Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
ae115c81 |
|
12-Sep-2019 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: replace DRM_ERROR with DRM_WARN in ras_reserve_bad_pages There are two cases of reserve error should be ignored: 1) a ras bad page has been allocated (used by someone); 2) a ras bad page has been reserved (duplicate error injection for one page); DRM_ERROR is unnecessary for the failure of bad page reserve Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
879e723d |
|
14-Sep-2019 |
Adam Zerella <adam.zerella@gmail.com> |
docs: drm/amdgpu: Resolve build warnings Some of the documentation formatting could be improved which will resolve some Sphinx amdgpu build warnings e.g WARNING: Unexpected indentation. WARNING: Block quote ends without a blank line; unexpected unindent. WARNING: Inline emphasis start-string without end-string. Signed-off-by: Adam Zerella <adam.zerella@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
de7b45ba |
|
13-Sep-2019 |
Christian König <christian.koenig@amd.com> |
drm/amdgpu: cleanup creating BOs at fixed location (v2) The placement is something TTM/BO internal and the RAS code should avoid touching that directly. Add a helper to create a BO at a fixed location and use that instead. v2: squash in fixes (Alex) Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
012dd14d |
|
15-Sep-2019 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: fix ras ctrl debugfs node leak Use debugfs_remove_recursive to remove the whole debugfs directory instead of removing the node one by one. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
d7bd680d |
|
10-Sep-2019 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: support pcie bif ras query and inject Call pcie bif ras query/inject in amdgpu ras. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f3170352 |
|
07-Sep-2019 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: enable error injection to XGMI block via debugfs allow inject error to XGMI block via debugfs node ras_ctrl Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
084fe13b |
|
09-Sep-2019 |
Andrey Grodzovsky <andrey.grodzovsky@amd.com> |
drm/amdgpu: Allow to reset to EERPOM table. The table grows quickly during debug/development effort when multiple RAS errors are injected. Allow to avoid this by setting table header back to empty if needed. v2: Switch to debugfs entry instead of load time parameter. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
4930aabe |
|
05-Sep-2019 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: move umc ras init to umc block move umc ras init from ras module to umc block, generic ras module should pay less attention to specific ras block. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
4d1337d2 |
|
06-Sep-2019 |
Andrey Grodzovsky <andrey.grodzovsky@amd.com> |
drm/amdgpu: Avoid RAS recovery init when no RAS support. Fixes driver load regression on APUs. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
1a6fc071 |
|
30-Aug-2019 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: move the call of ras recovery_init and bad page reserve to proper place ras recovery_init should be called after ttm init, bad page reserve should be put in front of gpu reset since i2c may be unstable during gpu reset. add cleanup for recovery_init and recovery_fini v2: add more comment and print. remove cancel_work_sync in recovery_init. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
78ad00c9 |
|
15-Aug-2019 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: Hook EEPROM table to RAS support eeprom records load and save for ras, move EEPROM records storing to bad page reserving v2: remove redundant check for con->eh_data Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
9dc23a63 |
|
12-Aug-2019 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: change ras bps type to eeprom table record structure change bps type from retired page to eeprom table record, prepare for saving umc error records to eeprom Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
d5ea093e |
|
22-Aug-2019 |
Andrey Grodzovsky <andrey.grodzovsky@amd.com> |
dmr/amdgpu: Add system auto reboot to RAS. In case of RAS error allow user configure auto system reboot through ras_ctrl. This is also part of the temproray work around for the RAS hang problem. v4: Use latest kernel API for disk sync. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
7c6e68c7 |
|
13-Sep-2019 |
Andrey Grodzovsky <andrey.grodzovsky@amd.com> |
drm/amdgpu: Avoid HW GPU reset for RAS. Problem: Under certain conditions, when some IP bocks take a RAS error, we can get into a situation where a GPU reset is not possible due to issues in RAS in SMU/PSP. Temporary fix until proper solution in PSP/SMU is ready: When uncorrectable error happens the DF will unconditionally broadcast error event packets to all its clients/slave upon receiving fatal error event and freeze all its outbound queues, err_event_athub interrupt will be triggered. In such case and we use this interrupt to issue GPU reset. THe GPU reset code is modified for such case to avoid HW reset, only stops schedulers, deatches all in progress and not yet scheduled job's fences, set error code on them and signals. Also reject any new incoming job submissions from user space. All this is done to notify the applications of the problem. v2: Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c Remove print param from amdgpu_ras_query_error_count v3: Update based on prevoius bug fixing patch to properly call amdgpu_amdkfd_pre_reset for other XGMI hive memebers. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
b293e891 |
|
29-Aug-2019 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: add helper function to do common ras_late_init/fini (v3) In late_init for ras, the helper function will be used to 1). disable ras feature if the IP block is masked as disabled 2). send enable feature command if the ip block was masked as enabled 3). create debugfs/sysfs node per IP block 4). register interrupt handler v2: check ih_info.cb to decide add interrupt handler or not v3: add ras_late_fini for cleanup all the ras fs node and remove interrupt handler Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
4e644fff |
|
05-Jun-2019 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: add ras_controller and err_event_athub interrupt support Ras controller interrupt and Ras err event athub interrupt are two dedicated interrupts for RAS support. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
64cc5414 |
|
16-Aug-2019 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: correct ras error count type Use unsigned long type for the same ras count variable. This will avoid overflow on 64 bit system. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
5212a3bd |
|
09-Aug-2019 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: remove ras block's feature status info in sysfs feature mask info is enough for rocm tool, "cat /sys/class/drm/card0/device/ras/features" will get the info like this: feature mask: 0x3ffb Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
9fb2d8de |
|
06-Aug-2019 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: support mmhub ras in amdgpu ras call mmhub ras query/inject in amdgpu ras Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
44494f96 |
|
07-Aug-2019 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: add sub block parameter in ras inject command ras sub block index could be passed from shell command Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
2a3c7ff6 |
|
05-Aug-2019 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: update ras sysfs feature info remove confused ras error type info Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
bd2280da |
|
01-Aug-2019 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: replace AMDGPU_RAS_UE with AMDGPU_RAS_SUCCESS ce can also trigger interrupt, and even both ce and ue error can be found in one ras query, distinguishing between ce and ue in interrupt handler is uncessary. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Suggested-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
51437623 |
|
29-Jul-2019 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: support ce interrupt in ras module correctable error can also trigger interrupt in some ras blocks Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
13b7c46c |
|
31-Jul-2019 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: add error address query for umc ras umc error address query can get ce/ue error address and clear error status Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
83b0582c |
|
31-Jul-2019 |
Dennis Li <Dennis.Li@amd.com> |
drm/amdgpu: support gfx ras error injection and err_cnt query check gfx error count in both ras querry function and ras interrupt handler. gfx ras is still disabled by default due to known stability issue found in gpu reset. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
7cdc2ee3 |
|
23-Jul-2019 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: remove ras_reserve_vram in ras injection error injection address is not in gpu address space Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Dennis Li <dennis.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
e1063493 |
|
22-Jul-2019 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: add check for ras error type only ue and ce errors are supported Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Dennis Li <dennis.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
cf04dfd0 |
|
22-Jul-2019 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: allow ras interrupt callback to return error data add error data as parameter for ras interrupt cb and process it Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Dennis Li <dennis.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
6f102dba |
|
22-Jul-2019 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: add support for recording ras error address more than one error address may be recorded in one query Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Dennis Li <dennis.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
045c0216 |
|
22-Jul-2019 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: switch to amdgpu_umc structure create new amdgpu_umc structure to for more umc settings in future and switch to the new structure Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <dennis.li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
05a58345 |
|
31-Jul-2019 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: add ras error count after each query (v2) v1: increase ras ce/ue error count v2: log the number of correctable and uncorrectable errors Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <dennis.li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
939e2258 |
|
17-Jul-2019 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: querry umc error count check umc error count in both ras querry function and ras interrupt handler Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <dennis.li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
7af25d5b |
|
17-Jul-2019 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: move some ras data structure to amdgpu_ras.h These are common structures that can be included by IP specific source files Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <dennis.li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
33c976c9 |
|
18-Jul-2019 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: drop ras self test this function is not needed any more. error injection is the only way to validate ras but it can't be executed in amdgpu_ras_init, where gpu is even not initialized Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a5dd40ca |
|
17-Jul-2019 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: only allow error injection to UMC IP block error injection to other IP blocks (except UMC) will be enabled until RAS feature stablize on those IP blocks Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
fb2a3607 |
|
17-Jul-2019 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: do not create ras debugfs/sysfs node for ASICs that don't have ras ability driver shouldn't init any ras debugfs/sysfs node for ASICs that don't have ras hardware ability Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
578a4daa |
|
18-Jul-2019 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: drop ras self test this function is not needed any more. error injection is the only way to validate ras but it can't be executed in amdgpu_ras_init, where gpu is even not initialized Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
29bd6508 |
|
17-Jul-2019 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: only allow error injection to UMC IP block error injection to other IP blocks (except UMC) will be enabled until RAS feature stablize on those IP blocks Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
5f872b72 |
|
17-Jul-2019 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: do not create ras debugfs/sysfs node for ASICs that don't have ras ability driver shouldn't init any ras debugfs/sysfs node for ASICs that don't have ras hardware ability Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
acb05f0a |
|
14-Jun-2019 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: Do error injection even vram reserve fails As long as the address is mapped with vram, we can do an error injection. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
450f30ea |
|
13-Jun-2019 |
Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
amdgpu: no need to check return value of debugfs_create functions When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: xinhui pan <xinhui.pan@amd.com> Cc: Evan Quan <evan.quan@amd.com> Cc: Feifei Xu <Feifei.Xu@amd.com> Cc: amd-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f867723b |
|
09-Jun-2019 |
Sam Ravnborg <sam@ravnborg.org> |
drm/amd: drop use of drmP.h in amdgpu.h Delete the unused drmP.h from amdgpu.h. Fix fallout in various files. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20190609220757.10862-5-sam@ravnborg.org
|
#
efb426d5 |
|
28-May-2019 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: ras injection use gpu address injection need a valid gpu address. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
74abc221 |
|
24-May-2019 |
Tom St Denis <tom.stdenis@amd.com> |
drm/amd/doc: Add RAS documentation to guide Acked-by: Slava Abramov <slava.abramov@amd.com> Signed-off-by: Tom St Denis <tom.stdenis@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
d6ee400e |
|
16-May-2019 |
Slava Abramov <slava.abramov@amd.com> |
drm/amdgpu: use div64_ul for 32-bit compatibility v1 v1: replace casting to unsigned long with div64_ul Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Slava Abramov <slava.abramov@amd.com> Tested-by: Slava Abramov <slava.abramov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
511fdbc3 |
|
08-May-2019 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: ras support suspend/resume add ras suspend function. rename ras_post_init to amdgpu_ras_resume. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: James Zhu <James.Zhu@amd.com> Tested-by: James Zhu <James.Zhu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
466b1793 |
|
06-May-2019 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: add badpages sysfs interafce add badpages node. it will output badpages list in format gpu pfn : gpu page size : flags example 0x00000000 : 0x00001000 : R 0x00000001 : 0x00001000 : R 0x00000002 : 0x00001000 : R 0x00000003 : 0x00001000 : R 0x00000004 : 0x00001000 : R 0x00000005 : 0x00001000 : R 0x00000006 : 0x00001000 : R 0x00000007 : 0x00001000 : P 0x00000008 : 0x00001000 : P 0x00000009 : 0x00001000 : P flags can be one of below characters R: reserved. P: pending for reserve. F: failed to reserve for some reasons. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a564808e |
|
08-May-2019 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: handle ras reset add another flag to allow IP do a gpu reset after device init. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
7af23ebe |
|
08-May-2019 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: Issue ras TA disable/enable cmd forcely on boot Check ras TA error code and return EAGAIN. Issue ras enable/disable cmd without checking currect state. Looks like ras TA will handle current state == target state case. Now driver might need do a reset to satisfy ras TA. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
77de502b |
|
08-Apr-2019 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: Introduce another ras enable function Many parts of the whole SW stack can program the ras enablement state during the boot. Now we handle that case by adding one function which check the ras flags and choose different code path. Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: xinhui pan <xinhui.pan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
191051a1 |
|
02-Apr-2019 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: Make default ras error type to none Unless IP has implemented its own ras, use ERROR_NONE as the default type. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
190211ab |
|
21-Mar-2019 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: remove per obj debugfs write there is ras_control node which can do its job. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
828cfa29 |
|
21-Mar-2019 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: Fix amdgpu ras to ta enums conversion Add helpes to transalte the two enums. And it will catch bugs easily. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
9f491d73 |
|
20-Mar-2019 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: use macro instead of enum for flags better to use macro. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
73aa8e1a |
|
18-Mar-2019 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: Fix some sanity check ras context might be NULL, so move con->h_data after check !con also fix sizeof wrong type while at it. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
289d513b |
|
05-Mar-2019 |
kbuild test robot <lkp@intel.com> |
drm/amdgpu: fix semicolon.cocci warnings drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:405:2-3: Unneeded semicolon drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:435:2-3: Unneeded semicolon Remove unneeded semicolon. Generated by: scripts/coccinelle/misc/semicolon.cocci CC: xinhui pan <xinhui.pan@amd.com> Signed-off-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
108c6a63 |
|
11-Mar-2019 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: add new ras workflow control flags add ras post init function. Do some initialization after all IP have finished their late init. Add new member flags which will control the ras work flow. For now, vbios enable ras for us on boot. That might change in the future. So there should be a flag from vbios to tell us if ras is enabled or not on boot. Looks like there is no such info now. Other bits of the flags are reserved to control other parts of ras. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
5d0f903f |
|
12-Mar-2019 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: let ras initialization a little noticeable add drm info output if ras initialized successfully. add ras atomfirmware sanity check. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
163def43 |
|
11-Mar-2019 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: Fix lockdep warning more gracely lockdep need a static key. Previously we set ignore bit to avoid the warning. Now call sysfs_attr_init to initialize the static key. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-and-Tested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
b076296b |
|
11-Mar-2019 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: Fix ras debugfs data parse Unzero char is accepted by sscanf, so when data is structure but unexpectedly return error invalid; Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
5caf466a |
|
11-Mar-2019 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: add new member hw_supported Currently, it is not clear how ras is supported. Both software and hardware can set the supported. That is confusing. Fix it by adding new member hw_supported. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
2b9505e3 |
|
11-Mar-2019 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: Fix warning when lockdep is enabled Set ignore bit to satisfy locpdep. Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
b404ae82 |
|
06-Mar-2019 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: lookup vbios table to check ecc capability Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
96ebb307 |
|
01-Mar-2019 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: add human readable debugfs control support (v2) Currently, the debugfs control node can't parse bash-like commands. Now add such support for any tester that uses scripts. v2: squash in fixes for input validation Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
36ea1bd2 |
|
31-Jan-2019 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: add debugfs ctrl node allow userspace enable/disable ras Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
c030f2e4 |
|
31-Oct-2018 |
xinhui pan <xinhui.pan@amd.com> |
drm/amdgpu: add amdgpu_ras.c to support ras (v2) add obj management. add feature control. add debugfs infrastructure. add sysfs infrastructure. add IH infrastructure. add recovery infrastructure. It is a framework. Other IPs need call amdgpu_ras_xxx function instead of psp_ras_xxx functions. v2: squash in warning fixes Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|