#
f579c06b |
|
06-Feb-2024 |
Yang Wang <kevinyang.wang@amd.com> |
drm/amdgpu: send smu rma reason event in ras eeprom driver send smu rma reason event to smu in ras eeprom driver. Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
ca0ad760 |
|
23-Nov-2023 |
Candice Li <candice.li@amd.com> |
drm/amdgpu: Update EEPROM I2C address for smu v13_0_0 Check smu v13_0_0 SKU type to select EEPROM I2C address. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
e0409021 |
|
23-Nov-2023 |
Candice Li <candice.li@amd.com> |
drm/amdgpu: Update EEPROM I2C address for smu v13_0_0 Check smu v13_0_0 SKU type to select EEPROM I2C address. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 6.1.x
|
#
2b6b29f3 |
|
30-Sep-2023 |
Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> |
drm/amdgpu: Fix complex macros error Fixes the below: ERROR: Macros with complex values should be enclosed in parentheses WARNING: macros should not use a trailing semicolon +#define amdgpu_inc_vram_lost(adev) atomic_inc(&((adev)->vram_lost_counter)); Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
8c14a67b |
|
19-Sep-2023 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: change if condition for bad channel bitmap update The amdgpu_ras_eeprom_control.bad_channel_bitmap is u32 type, but the channel index could be larger than 32. For the ASICs whose channel number is more than 32, the amdgpu_dpm_send_hbm_bad_channel_flag interface is not supported, so we simply bypass channel bitmap update under this condition. v2: replace sizeof with BITS_PER_TYPE, we should check bit number instead of byte number. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
4e8303cf |
|
11-Sep-2023 |
Lijo Lazar <lijo.lazar@amd.com> |
drm/amdgpu: Use function for IP version check Use an inline function for version check. Gives more flexibility to handle any format changes. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
46963ed5 |
|
16-Aug-2023 |
Candice Li <candice.li@amd.com> |
drm/amdgpu: Only support RAS EEPROM on dGPU platform RAS EEPROM device is only supported on dGPU platform for smu v13_0_6. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
4b721ed8 |
|
16-Aug-2023 |
Candice Li <candice.li@amd.com> |
drm/amdgpu: Only support RAS EEPROM on dGPU platform RAS EEPROM device is only supported on dGPU platform for smu v13_0_6. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
b81fde0d |
|
01-Aug-2023 |
Candice Li <candice.li@amd.com> |
drm/amdgpu: Add I2C EEPROM support on smu v13_0_6 Support I2C EEPROM on smu v13_0_6. v2: Move IP_VERSION(13, 0, 6) ahead of IP_VERSION(13, 0, 10). Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
adf64e21 |
|
18-Jul-2023 |
Mario Limonciello <mario.limonciello@amd.com> |
drm/amd: Avoid reading the VBIOS part number twice The VBIOS part number is read both in amdgpu_atom_parse() as well as in atom_get_vbios_pn() and stored twice in the `struct atom_context` structure. Remove the first unnecessary read and move the `pr_info` line from that read into the second. v2: squash in unused variable removal Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
e06da817 |
|
09-Jun-2023 |
Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> |
drm/amdgpu: Fix kdoc warning Fixes the following gcc with W=1: drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:76: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * EEPROM Table structure v1 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:98: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * EEPROM Table structrue v2.1 Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
0bc3137b |
|
01-Jun-2023 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: Set EEPROM ras info Set EEPROM ras info: rma status, health percent and bad page threshold. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
7c2551fa |
|
28-May-2023 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: Calculate EEPROM table ras info bytes sum It's more reasonable to check EEPROM table ras info bytes. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
7f599fed |
|
30-May-2023 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: Add support EEPROM table v2.1 Add ras info to EEPROM table, app can analyse device ECC status without GPU driver through EEPROM table ras info. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
65183fae |
|
30-May-2023 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: Add RAS table v2.1 macro definition Add RAS EEPROM table version 2.1 macro definition. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
71c79a19 |
|
30-May-2023 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: Rename ras table version Rename RAS_TABLE_VER to RAS_TABLE_VER_V1, move RAS_TABLE_VER_V1 from amdgpu_ras_eeprom.c to amdgpu_ras_eeprom.h. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
6246059a |
|
27-Mar-2023 |
Alex Deucher <alexander.deucher@amd.com> |
drm/amdgpu: simplify amdgpu_ras_eeprom.c All chips that support RAS also support IP discovery, so use the IP versions rather than a mix of IP versions and asic types. Checking the validity of the atom_ctx pointer is not required as the vbios is already fetched at this point. v2: add comments to id asic types based on feedback from Luben Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: Luben Tuikov <luben.tuikov@amd.com>
|
#
8782007b |
|
22-Mar-2023 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Return from switch early for EEPROM I2C address As soon as control->i2c_address is set, return; remove the "break;" from the switch--it is unnecessary. This mimics what happens when for some cases in the switch, we call helper functions with "return <helper function>". Remove final function "return true;" to indicate that the switch is final and terminal, and that there should be no code after the switch. Cc: Candice Li <candice.li@amd.com> Cc: Kent Russell <kent.russell@amd.com> Cc: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
1bb745d7 |
|
22-Mar-2023 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Remove second moot switch to set EEPROM I2C address Remove second switch since it already has its own function and case in the first switch. This also avoids requalifying the EEPROM I2C address for VEGA20, SIENNA CICHLID, and ALDEBARAN, as those have been set by the first switch and shouldn't match SMU v13.0.x. Cc: Candice Li <candice.li@amd.com> Cc: Kent Russell <kent.russell@amd.com> Cc: Alex Deucher <Alexander.Deucher@amd.com> Fixes: 158225294683 ("drm/amdgpu: Add EEPROM I2C address for smu v13_0_0") Fixes: c9bdc6c3cf39 ("drm/amdgpu: Add EEPROM I2C address support for ip discovery") Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
22106ed0 |
|
21-Feb-2023 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: add bad_page_threshold check in ras_eeprom_check_err bad_page_threshold controls page retirement behavior and it should be also checked. v2: simplify the condition of bad page handling path. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f3cbe70e |
|
21-Feb-2023 |
Tao Zhou <tao.zhou1@amd.com> |
drm/amdgpu: change default behavior of bad_page_threshold parameter Ignore ras umc bad page threshold by default, GPU initialization won't be stopped in this mode. v2: refine the description of bad_page_threshold. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
64a3dbb0 |
|
14-Nov-2022 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Add support for RAS table at 0x40000 Add support for RAS table at I2C EEPROM address of 0x40000, since on some ASICs it is not at 0, but at 0x40000. Cc: Alex Deucher <Alexander.Deucher@amd.com> Cc: Kent Russell <kent.russell@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Tested-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
3b8164f8 |
|
06-Nov-2022 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Decouple RAS EEPROM addresses from chips Abstract RAS I2C EEPROM addresses from chip names, and set their macro definition names to the address they set, not the chip they attach to. Since most chips either use I2C EEPROM address 0 or 40000h for the RAS table start offset, this leaves us with only two macro definitions as opposed to five, and removes the redundancy of four. Cc: Candice Li <candice.li@amd.com> Cc: Tao Zhou <tao.zhou1@amd.com> Cc: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
da858dea |
|
06-Nov-2022 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Remove redundant I2C EEPROM address Remove redundant EEPROM_I2C_MADDR_54H address, since we already have it represented (ARCTURUS), and since we don't include the I2C device type identifier in EEPROM memory addresses, i.e. that high up in the device abstraction--we only use EEPROM memory addresses, as memory is continuously represented by EEPROM device(s) on the I2C bus. Add a comment describing what these memory addresses are, how they come about and how they're usually extracted from the device address byte. Cc: Candice Li <candice.li@amd.com> Cc: Tao Zhou <tao.zhou1@amd.com> Cc: Alex Deucher <Alexander.Deucher@amd.com> Fixes: c9bdc6c3cf39df ("drm/amdgpu: Add EEPROM I2C address support for ip discovery") Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
c9bdc6c3 |
|
11-Oct-2022 |
Candice Li <candice.li@amd.com> |
drm/amdgpu: Add EEPROM I2C address support for ip discovery 1. Update EEPROM_I2C_MADDR_SMU_13_0_0 to EEPROM_I2C_MADDR_54H 2. Add EEPROM I2C address support for smu v13_0_0 and v13_0_10. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
bc22f8ec |
|
10-Oct-2022 |
Candice Li <candice.li@amd.com> |
drm/amdgpu: Update ras eeprom support for smu v13_0_0 and v13_0_10 Enable RAS EEPROM support for smu v13_0_0 and v13_0_10. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
15822529 |
|
07-Sep-2022 |
Candice Li <candice.li@amd.com> |
drm/amdgpu: Add EEPROM I2C address for smu v13_0_0 Set correct EEPROM I2C address for smu v13_0_0. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
69691c82 |
|
03-Mar-2022 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: message smu to update bad channel info It should notice SMU to update bad channel info when detected uncorrectable error in UMC block Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
8bbd4d83 |
|
10-Feb-2022 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: Reset OOB table error count info The OOB table error count info should be reset after reset eeprom table Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
d0fb18b5 |
|
19-Jan-2022 |
Andrey Grodzovsky <andrey.grodzovsky@amd.com> |
drm/amdgpu: Move reset sem into reset_domain We want single instance of reset sem across all reset clients because in case of XGMI we should stop access cross device MMIO because any of them could be in a reset in the moment. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://www.spinics.net/lists/amd-gfx/msg74117.html
|
#
2f60dd50 |
|
19-Jan-2022 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amd: Expose the FRU SMU I2C bus Expose both SMU I2C buses. Some boards use the same bus for both the RAS and FRU EEPROMs and others use different buses. This enables the additional I2C bus and sets the right buses to use for RAS and FRU EEPROM access. Cc: Roy Sun <Roy.Sun@amd.com> Co-developed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
68daadf3 |
|
19-Oct-2021 |
Kent Russell <kent.russell@amd.com> |
drm/amdgpu: Add kernel parameter support for ignoring bad page threshold When a GPU hits the bad_page_threshold, it will not be initialized by the amdgpu driver. This means that the table cannot be cleared, nor can information gathering be performed (getting serial number, BDF, etc). If the bad_page_threshold kernel parameter is set to -2, continue to initialize the GPU, while printing a warning to dmesg that this action has been done v2: squash in Luben's fix to restore RAS info reporting Cc: Luben Tuikov <luben.tuikov@amd.com> Cc: Mukul Joshi <Mukul.Joshi@amd.com> Signed-off-by: Kent Russell <kent.russell@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
8483fdfe |
|
12-Oct-2021 |
Kent Russell <kent.russell@amd.com> |
drm/amdgpu: Warn when bad pages approaches 90% threshold dmesg doesn't warn when the number of bad pages approaches the threshold for page retirement. WARN when the number of bad pages is at 90% or greater for easier checks and planning, instead of waiting until the GPU is full of bad pages. Cc: Luben Tuikov <luben.tuikov@amd.com> Cc: Mukul Joshi <Mukul.Joshi@amd.com> Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
dcd5ea9f |
|
19-Oct-2021 |
Kent Russell <kent.russell@amd.com> |
drm/amdgpu: Clarify error when hitting bad page threshold Change the error message when the bad_page_threshold is reached, explicitly stating that the GPU will not be initialized. Cc: Luben Tuikov <luben.tuikov@amd.com> Cc: Mukul Joshi <Mukul.Joshi@amd.com> Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
6cd1f9b4 |
|
09-Sep-2021 |
Michel Dänzer <mdaenzer@redhat.com> |
drm/amdgpu: Drop inline from amdgpu_ras_eeprom_max_record_count This was unusual; normally, inline functions are declared static as well, and defined in a header file if used by multiple compilation units. The latter would be more involved in this case, so just drop the inline declaration for now. Fixes compile failure building for ppc64le on RHEL 8: In file included from ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h:32, from ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:33: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c: In function ‘amdgpu_ras_recovery_init’: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h:90:17: error: inlining failed in call to ‘always_inline’ ‘amdgpu_ras_eeprom_max_record_count’: function body not available 90 | inline uint32_t amdgpu_ras_eeprom_max_record_count(void); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1985:34: note: called from here 1985 | max_eeprom_records_len = amdgpu_ras_eeprom_max_record_count(); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Fixes: c84d46707ebb "drm/amdgpu: validate bad page threshold in ras(v3)" Reviewed-by: Lyude Paul <lyude@redhat.com> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
114518ff |
|
09-Sep-2021 |
Michel Dänzer <mdaenzer@redhat.com> |
drm/amdgpu: Drop inline from amdgpu_ras_eeprom_max_record_count This was unusual; normally, inline functions are declared static as well, and defined in a header file if used by multiple compilation units. The latter would be more involved in this case, so just drop the inline declaration for now. Fixes compile failure building for ppc64le on RHEL 8: In file included from ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h:32, from ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:33: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c: In function ‘amdgpu_ras_recovery_init’: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h:90:17: error: inlining failed in call to ‘always_inline’ ‘amdgpu_ras_eeprom_max_record_count’: function body not available 90 | inline uint32_t amdgpu_ras_eeprom_max_record_count(void); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1985:34: note: called from here 1985 | max_eeprom_records_len = amdgpu_ras_eeprom_max_record_count(); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Fixes: c84d46707ebb "drm/amdgpu: validate bad page threshold in ras(v3)" Reviewed-by: Lyude Paul <lyude@redhat.com> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
cc947bf9 |
|
25-Aug-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Process any VBIOS RAS EEPROM address We can now process any RAS EEPROM address from VBIOS. Generalize so as to compute the top three bits of the 19-bit EEPROM address, from any byte returned as the "i2c address" from VBIOS. Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
14fb496a |
|
04-Aug-2021 |
John Clements <john.clements@amd.com> |
drm/amdgpu: set RAS EEPROM address from VBIOS update to latest atombios fw table Signed-off-by: John Clements <john.clements@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>. Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
64598e23 |
|
02-Jul-2021 |
Dan Carpenter <dan.carpenter@oracle.com> |
drm/amdgpu: return -EFAULT if copy_to_user() fails If copy_to_user() fails then this should return -EFAULT instead of -EINVAL. Fixes: c65b0805e77919 ("drm/amdgpu: RAS EEPROM table is now in debugfs") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
b8badd50 |
|
02-Jul-2021 |
Dan Carpenter <dan.carpenter@oracle.com> |
drm/amdgpu: unlock on error in amdgpu_ras_debugfs_table_read() This error path needs to unlock before returning. While we're at it, the correct error code from copy_to_user() failure is -EFAULT, not -EINVAL. Fixes: c65b0805e77919 ("drm/amdgpu: RAS EEPROM table is now in debugfs") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
3006c924 |
|
02-Jul-2021 |
Dan Carpenter <dan.carpenter@oracle.com> |
drm/amdgpu: fix a signedness bug in __verify_ras_table_checksum() If amdgpu_eeprom_read() returns a negative error code then the error handling checks: if (res < buf_size) { The problem is that "buf_size" is a u32 so negative values are type promoted to a high positive values and the condition is false. Fix this by changing the type of "buf_size" to int. Fixes: 63d4c081a556a1 ("drm/amdgpu: Optimize EEPROM RAS table I/O") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
d456f387 |
|
30-Jun-2021 |
Alex Deucher <alexander.deucher@amd.com> |
drm/amdgpu: fix 64 bit divide in eeprom code pos is 64 bits. Fixes: c65b0805e77919 ("drm/amdgpu: RAS EEPROM table is now in debugfs") Cc: luben.tuikov@amd.com Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
c65b0805 |
|
08-Apr-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: RAS EEPROM table is now in debugfs Add "ras_eeprom_size" file in debugfs, which reports the maximum size allocated to the RAS table in EEROM, as the number of bytes and the number of records it could store. For instance, $cat /sys/kernel/debug/dri/0/ras/ras_eeprom_size 262144 bytes or 10921 records $_ Add "ras_eeprom_table" file in debugfs, which dumps the RAS table stored EEPROM, in a formatted way. For instance, $cat ras_eeprom_table Signature Version FirstOffs Size Checksum 0x414D4452 0x00010000 0x00000014 0x000000EC 0x000000DA Index Offset ErrType Bank/CU TimeStamp Offs/Addr MemChl MCUMCID RetiredPage 0 0x00014 ue 0x00 0x00000000607608DC 0x000000000000 0x00 0x00 0x000000000000 1 0x0002C ue 0x00 0x00000000607608DC 0x000000001000 0x00 0x00 0x000000000001 2 0x00044 ue 0x00 0x00000000607608DC 0x000000002000 0x00 0x00 0x000000000002 3 0x0005C ue 0x00 0x00000000607608DC 0x000000003000 0x00 0x00 0x000000000003 4 0x00074 ue 0x00 0x00000000607608DC 0x000000004000 0x00 0x00 0x000000000004 5 0x0008C ue 0x00 0x00000000607608DC 0x000000005000 0x00 0x00 0x000000000005 6 0x000A4 ue 0x00 0x00000000607608DC 0x000000006000 0x00 0x00 0x000000000006 7 0x000BC ue 0x00 0x00000000607608DC 0x000000007000 0x00 0x00 0x000000000007 8 0x000D4 ue 0x00 0x00000000607608DD 0x000000008000 0x00 0x00 0x000000000008 $_ Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Xinhui Pan <xinhui.pan@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
63d4c081 |
|
06-Apr-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Optimize EEPROM RAS table I/O Split functionality between read and write, which simplifies the code and exposes areas of optimization and more or less complexity, and take advantage of that. Read and write the table in one go; use a separate stage to decode or encode the data, as opposed to on the fly, which keeps the I2C bus busy. Use a single read/write to read/write the table or at most two if the number of records we're reading/writing wraps around. Check the check-sum of a table in EEPROM on init. Update the checksum at the same time as when updating the table header signature, when the threshold was increased on boot. Take advantage of arithmetic modulo 256, that is, use a byte!, to greatly simplify checksum arithmetic. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
017dad64 |
|
16-Jun-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Get rid of test function The code is now tested from userspace. Remove already macroed out test function. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
0686627b |
|
16-Jun-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Some renames Qualify with "ras_". Use kernel's own--don't redefine your own. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
d7edde3d |
|
16-Jun-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Nerf buff buff --> buf. Essentially buffer abbreviates to buf, remove 1/2 of it, or just the iron part, as opposed to just the Er, Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
e4e6a589 |
|
27-Mar-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Use explicit cardinality for clarity RAS_MAX_RECORD_NUM may mean the maximum record number, as in the maximum house number on your street, or it may mean the maximum number of records, as in the count of records, which is also a number. To make this distinction whether the number is ordinal (index) or cardinal (count), rename this macro to RAS_MAX_RECORD_COUNT. This makes it easy to understand what it refers to, especially when we compute quantities such as, how many records do we have left in the table, especially when there are so many other numbers, quantities and numerical macros around. Also rename the long, amdgpu_ras_eeprom_get_record_max_length() to the more succinct and clear, amdgpu_ras_eeprom_max_record_count(). When computing the threshold, which also deals with counts, i.e. "how many", use cardinal "max_eeprom_records_count", than the quantitative "max_eeprom_records_len". Simplify the logic here and there, as well. Cc: Guchun Chen <guchun.chen@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
803c6ebd |
|
26-Mar-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Simplify RAS EEPROM checksum calculations Rename update_table_header() to write_table_header() as this function is actually writing it to EEPROM. Use kernel types; use u8 to carry around the checksum, in order to take advantage of arithmetic modulo 8-bits (256). Tidy up to 80 columns. When updating the checksum, just recalculate the whole thing. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
dce4400e |
|
26-Mar-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Fix amdgpu_ras_eeprom_init() No need to account for the 2 bytes of EEPROM address--this is now well abstracted away by the fixes the the lower layers. Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
cf696091 |
|
11-Mar-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Return result fix in RAS The low level EEPROM write method, doesn't return 1, but the number of bytes written. Thus do not compare to 1, instead, compare to greater than 0 for success. Other cleanup: if the lower layers returned -errno, then return that, as opposed to overwriting the error code with one-fits-all -EINVAL. For instance, some return -EAGAIN. Cc: Jean Delvare <jdelvare@suse.de> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
16ef7977 |
|
11-Mar-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: EEPROM: add explicit read and write Add explicit amdgpu_eeprom_read() and amdgpu_eeprom_write() for clarity. Cc: Jean Delvare <jdelvare@suse.de> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
1fab841f |
|
11-Mar-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: RAS xfer to read/write Wrap amdgpu_ras_eeprom_xfer(..., bool write), into amdgpu_ras_eeprom_read() and amdgpu_ras_eeprom_write(), as that makes reading and understanding the code clearer. Cc: Jean Delvare <jdelvare@suse.de> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
a4399657 |
|
05-Feb-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Rename misspelled function Instead of fixing the spelling in amdgpu_ras_eeprom_process_recods(), rename it to, amdgpu_ras_eeprom_xfer(), to look similar to other I2C and protocol transfer (read/write) functions. Also to keep the column span to within reason by using a shorter name. Change the "num" function parameter from "int" to "const u32" since it is the number of items (records) to xfer, i.e. their count, which cannot be a negative number. Also swap the order of parameters, keeping the pointer to records and their number next to each other, while the direction now becomes the last parameter. Cc: Jean Delvare <jdelvare@suse.de> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
c28aa44d |
|
04-Feb-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: RAS: EEPROM --> RAS In amdgpu_ras_eeprom.c--the interface from RAS to EEPROM, rename macros from EEPROM to RAS, to indicate that the quantities and objects are RAS specific, not EEPROM. We can decrease the RAS table, or put it in different offset of EEPROM as needed in the future. Remove EEPROM_ADDRESS_SIZE macro definition, equal to 2, from the file and calculations, as that quantity is computed and added on the stack, in the lower layer, amdgpu_eeprom_xfer(). Cc: Jean Delvare <jdelvare@suse.de> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
edb63a53 |
|
02-Feb-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: Fix wrap-around bugs in RAS Fix the size of the EEPROM from 256000 bytes to 262144 bytes (256 KiB). Fix a couple or wrap around bugs. If a valid value/address is 0 <= addr < size, the inverse of this inequality (barring negative values which make no sense here) is addr >= size. Fix this in the RAS code. Cc: Jean Delvare <jdelvare@suse.de> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
ccdfbfec |
|
02-Feb-2021 |
Luben Tuikov <luben.tuikov@amd.com> |
drm/amdgpu: RAS and FRU now use 19-bit I2C address Convert RAS and FRU code to use the 19-bit I2C memory address and remove all "slave_addr", as this is now absolved into the 19-bit address. Cc: Jean Delvare <jdelvare@suse.de> Cc: John Clements <john.clements@amd.com> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
39ed82d1 |
|
21-Jan-2021 |
Alex Deucher <alexander.deucher@amd.com> |
drm/amdgpu: i2c subsystem uses 7 bit addresses Convert from 8 bit to 7 bit. Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
|
#
24f55c05 |
|
21-Jan-2021 |
Alex Deucher <alexander.deucher@amd.com> |
drm/amdgpu/ras: switch ras eeprom handling to use generic helper Use the new helper rather than doing i2c transfers directly. Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
|
#
39932ef7 |
|
04-Aug-2021 |
John Clements <john.clements@amd.com> |
drm/amdgpu: set RAS EEPROM address from VBIOS update to latest atombios fw table [Backport to 5.14 - Alex] Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1670 Signed-off-by: John Clements <john.clements@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>. Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
|
#
52a9df81 |
|
08-Apr-2021 |
John Clements <john.clements@amd.com> |
drm/amdgpu: enable ras eeprom on aldebaran enable ras eeprom loading by default on aldebaran Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
970fd197 |
|
10-Mar-2021 |
Stanley.Yang <Stanley.Yang@amd.com> |
drm/amdgpu: fix send ras disable cmd when asic not support ras cause: It is necessary to send ras disable command to ras-ta during gfx block ras later init, because the ras capability is disable read from vbios for vega20 gaming, but the ras context is released during ras init process, this will cause send ras disable command to ras-to failed. how: Delay releasing ras context, the ras context will be released after gfx block later init done. Changed from V1: move release_ras_context into ras_resume Changed from V2: check BIT(UMC) is more reasonable before access eeprom table Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
e3c1b071 |
|
15-Feb-2021 |
shaoyunl <shaoyun.liu@amd.com> |
drm/amdgpu: Reset the devices in the XGMI hive duirng probe In passthrough configuration, hypervisior will trigger the SBR(Secondary bus reset) to the devices without sync to each other. This could cause device hang since for XGMI configuration, all the devices within the hive need to be reset at a limit time slot. This serial of patches try to solve this issue by co-operate with new SMU which will only do minimum house keeping to response the SBR request but don't do the real reset job and leave it to driver. Driver need to do the whole sw init and minimum HW init to bring up the SMU and trigger the reset(possibly BACO) on all the ASICs at the same time Signed-off-by: shaoyunl <shaoyun.liu@amd.com> Acked-by: Andrey Grodzovsky andrey.grodzovsky@amd.com Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
11003c68 |
|
25-Feb-2021 |
Dennis Li <Dennis.Li@amd.com> |
drm/amdgpu: remove unnecessary reading for epprom header If the number of badpage records exceed the threshold, driver has updated both epprom header and control->tbl_hdr.header before gpu reset, therefore GPU recovery thread no need to read epprom header directly. v2: merge amdgpu_ras_check_err_threshold into amdgpu_ras_eeprom_check_err_threshold Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
3e7bc83e |
|
04-Jan-2021 |
John Clements <john.clements@amd.com> |
drm/amdgpu: enable ras eeprom support for sienna cichlid added I2C address and asic support flag Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
3851c90b |
|
04-Jan-2021 |
John Clements <john.clements@amd.com> |
drm/amdgpu: enable ras eeprom support for sienna cichlid added I2C address and asic support flag Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
f3729f7b |
|
02-Nov-2020 |
Deepak R Varma <mh12gx2825@gmail.com> |
drm/amdgpu/amdgpu: improve code indentation and alignment General code indentation and alignment changes such as replace spaces by tabs or align function arguments as per the coding style guidelines. The patch corrects issues for various amdgpu_*.c files for this driver. Issue reported by checkpatch script. Signed-off-by: Deepak R Varma <mh12gx2825@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
5eeb4593 |
|
19-Oct-2020 |
Dennis Li <Dennis.Li@amd.com> |
drm/amdgpu: remove redundant GPU reset Because bad pages saving has been moved to UMC error interrupt callback, which will trigger a new GPU reset after saving. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <hawking.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
40e7ed97 |
|
14-Oct-2020 |
Dennis Li <Dennis.Li@amd.com> |
drm/amdgpu: protect eeprom update from GPU reset because i2c is unstable in GPU reset, driver need protect eeprom update from GPU reset, to not miss any bad page record. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
4bfb7428 |
|
03-Aug-2020 |
John Clements <john.clements@amd.com> |
drm/amdgpu: added RAS EEPROM device support check updated RAS EEPROM init/threshold sequences to check for device support Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
9b856def |
|
27-Jul-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: update eeprom once specifying one bigger threshold(v3) During driver's probe, when it hits bad gpu tag in eeprom i2c init calling(the tag was set when reported bad page reaches bad page threshold in last driver's working loop), there are some strategys to deal with the cases: 1. when the module parameter amdgpu_bad_page_threshold = 0, that means page retirement feature is disabled, so just resetting the eeprom is fine. 2. When amdgpu_bad_page_threshold is not 0, and moreover, user sets one bigger valid data in order to make current boot up succeeds, correct eeprom header tag and do not break booting. 3. For other cases, driver's probe will be broken. v2: Just update eeprom header tag instead of resetting the whole table header when user sets one bigger threshold data. v3: Use dev_info/dev_err to print PCI device information, which helps in mGPU case. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
e8fbaf03 |
|
23-Jul-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: break GPU recovery once it's in bad state(v4) When GPU executes recovery and retriving bad GPU tag from external eerpom device, the recovery will be broken and error message is printed as well for user's awareness. v2: Refine warning message in threshold reaching case, and fix spelling typo. v3: Fix explicit calling of bad gpu. v4: Rename function names. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
9c06f91f |
|
23-Jul-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: schedule ras recovery when reaching bad page threshold(v2) Once the bad page saved to eeprom reaches the configured threshold, ras recovery will be issued to notify user. v2: Fix spelling typo. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
b82e65a9 |
|
23-Jul-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: break driver init process when it's bad GPU(v5) When retrieving bad gpu tag from eeprom, GPU init should fail as the GPU needs to be retired for further check. v2: Fix spelling typo, correct the condition to detect bad gpu tag and refine error message. v3: Refine function argument name. v4: Fix missing check of returning value of i2c initialization error case. v5: Use dev_err to print PCI information in dmesg instead of DRM_ERROR. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
1d6a9d12 |
|
23-Jul-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: add bad gpu tag definition This tag will be hired for bad gpu detection in eeprom's access. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
c84d4670 |
|
21-Jul-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: validate bad page threshold in ras(v3) Bad page threshold value should be valid in the range between -1 and max records length of eeprom. It could determine when saved bad pages exceed threshold value, and proceed corresponding actions. v2: When using the default typical value, it should be min value between typical value and eeprom max records length. v3: drop the case of setting bad_page_cnt_threshold to be 0xFFFFFFFF, as it confuses user. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
9015d60c |
|
12-Mar-2020 |
Andrey Grodzovsky <andrey.grodzovsky@amd.com> |
drm/amdgpu: Move EEPROM I2C adapter to amdgpu_device Puts the i2c adapter in common place for sharing by RAS and upcoming data read from FRU EEPROM feature. v2: Move i2c adapter to amdgpu_pm and rename it. v3: Move i2c adapter init to ASIC specific code and get rid of the switch case in amdgpu_device Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
ef1caf48 |
|
25-Feb-2020 |
John Clements <john.clements@amd.com> |
drm/amdgpu: Add Arcturus D342 page retire support Check Arcturus SKU type to select I2C address of page retirement EEPROM Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
5985ebbe |
|
25-Nov-2019 |
John Clements <john.clements@amd.com> |
drm/amdgpu: Resolved offchip EEPROM I/O issue Updated target I2C address Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
8633f126 |
|
25-Nov-2019 |
John Clements <john.clements@amd.com> |
drm/amdgpu: Resolved offchip EEPROM I/O issue Updated target I2C address Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
cf52ecc8 |
|
11-Oct-2019 |
Andrey Grodzovsky <andrey.grodzovsky@amd.com> |
drm/amdgpu: Use ARCTURUS in RAS EEPROM. Add Arcturus EEPROM/I2C support in generic EEPROM code. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
5222d261 |
|
17-Sep-2019 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: remove redundant variable definition No need to define the same variables in each loop of the function. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
db338e16 |
|
12-Sep-2019 |
Andrey Grodzovsky <andrey.grodzovsky@amd.com> |
drm/amdgpu:Fix EEPROM checksum calculation. Fix checksum calculation after manually resetting the table. Unify reset and empty EEPROM init flow. Protect the table reset with lock. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
d01b400b |
|
09-Sep-2019 |
Andrey Grodzovsky <andrey.grodzovsky@amd.com> |
drm/amdgpu: Add amdgpu_ras_eeprom_reset_table This will allow to reset the table on the fly. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
4bc22340 |
|
04-Sep-2019 |
Andrey Grodzovsky <andrey.grodzovsky@amd.com> |
drm/madgpu: Fix EEPROM Checksum calculation. Fix typo which messed up the calculation. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-and-tested-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
92ead9fa |
|
28-Aug-2019 |
Colin Ian King <colin.king@canonical.com> |
drm/amdgpu: fix spelling mistake "jumpimng" -> "jumping" There is a spelling mistake in a DRM_DEBUG_DRIVER debug message. Fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
691bac9d |
|
01-May-2019 |
Andrey Grodzovsky <andrey.grodzovsky@amd.com> |
drm/amdgpu: Vega20 SMU I2C HW engine controller. Implement HW I2C enigne controller to be used by the RAS EEPROM table manager. This is based on code from ATITOOLs. v2: Rename the file and all function prefixes to smu_v11_0_i2c By Luben's observation always fill the TX fifo to full so we don't have garbadge interpreted by the slave as valid data. v3: Remove preemption disable as the HW I2C controller will not stop the clock on empty TX fifo and so it's not critical to keep not empty queue. Switch to fast mode 400 khz SCL clock for faster read and write. v5: Restore clock gating before releasing I2C bus and fix some style comments. v6: squash in warning fix, fix includes (Alex) Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Luben Tuikov <Luben.Tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
#
64f55e62 |
|
30-May-2019 |
Andrey Grodzovsky <andrey.grodzovsky@amd.com> |
drm/amdgpu: Add RAS EEPROM table. Add RAS EEPROM table manager to eanble RAS errors to be stored upon appearance and retrived on driver load. v2: Fix some prints. v3: Fix checksum calculation. Make table record and header structs packed to do correct byte value sum. Fix record crossing EEPROM page boundry. v4: Fix byte sum val calculation for record - look at sizeof(record). Fix some style comments. v5: Add description to EEPROM_TABLE_RECORD_SIZE and syntax fixes. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Luben Tuikov <Luben.Tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|