Cross Reference: /linux-master/drivers/gpu/drm/i915/i915_gpu

History log of /linux-master/drivers/gpu/drm/i915/i915_gpu_error.c
Revision	Date	Author	Comments
# 3c0fa9f4	02-Feb-2024	Ville Syrjälä <ville.syrjala@linux.intel.com>	drm/i915: Use struct resource for memory region IO as well mem->region is a struct resource, but mem->io_start and mem->io_size are not for whatever reason. Let's unify this and convert the io stuff into a struct resource as well. Should make life a little less annoying when you don't have juggle between two different approaches all the time. Mostly done using cocci (with manual tweaks at all the places where we mutate io_size by hand): @@ struct intel_memory_region M; expression START, SIZE; @@ - M->io_start = START; - M->io_size = SIZE; + M->io = DEFINE_RES_MEM(START, SIZE); @@ struct intel_memory_region M; @@ - M->io_start + M->io.start @@ struct intel_memory_region M; @@ - M.io_start + M.io.start @@ expression M; @@ - M->io_size + resource_size(&M->io) @@ expression M; @@ - M.io_size + resource_size(&M.io) Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com> Acked-by: Nirmoy Das <nirmoy.das@intel.com> Tested-by: Paz Zcharya <pazz@chromium.org> Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240202224340.30647-2-ville.syrjala@linux.intel.com
# f8e9325f	10-Nov-2023	Jani Nikula <jani.nikula@intel.com>	drm/i915: update in-source bug filing URLs The bug filing documentation has been moved from the gitlab wiki to gitlab pages at https://drm.pages.freedesktop.org/intel-docs/. Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231110114807.3455739-2-jani.nikula@intel.com
# d5818410	31-Oct-2023	Jani Nikula <jani.nikula@intel.com>	drm/i915: move gpu error sysfs to i915_gpu_error.c Hide gpu error specifics in i915_gpu_error.c. This is also cleaner wrt conditional compilation, as i915_gpu_error.c is only built with DRM_I915_CAPTURE_ERROR=y. With this, we can also make i915_first_error_state() static. Signed-off-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231031124502.1772160-3-jani.nikula@intel.com
# 4fca5198	31-Oct-2023	Jani Nikula <jani.nikula@intel.com>	drm/i915: move gpu error debugfs to i915_gpu_error.c Hide gpu error specifics in i915_gpu_error.c. This is also cleaner wrt conditional compilation, as i915_gpu_error.c is only built with DRM_I915_CAPTURE_ERROR=y. Signed-off-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231031124502.1772160-2-jani.nikula@intel.com
# 2efb81e5	31-Oct-2023	Jani Nikula <jani.nikula@intel.com>	drm/i915: make some error capture functions static Not needed outside of i915_gpu_error.c. Signed-off-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231031124502.1772160-1-jani.nikula@intel.com
# 7a61a6aa	24-Oct-2023	Jouni Högander <jouni.hogander@intel.com>	drm/i915/display: Dump also display parameters GPU error dump contained all module parameters. If we are moving display parameters to intel_display_params.[ch] they are not dumped into GPU error dump. This patch is adding moved display parameters back to GPU error dump. Display parameters are also included in i915_capabilities v2: Add parameters to i915_capabilities as well Signed-off-by: Jouni Högander <jouni.hogander@intel.com> Acked-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Luca Coelho <luciano.coelho@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231024124109.384973-3-jouni.hogander@intel.com
# 37d62359	28-Sep-2023	Nirmoy Das <nirmoy.das@intel.com>	drm/i915/mtl: Skip MCR ops for ring fault register On MTL GEN12_RING_FAULT_REG is not replicated so don't do mcr based operation for this register. v2: use MEDIA_VER() instead of GRAPHICS_VER()(Matt). v3: s/"MEDIA_VER(i915) == 13"/"MEDIA_VER(i915) >= 13"(Matt) improve comment. v4: improve the comment further(Andi) Signed-off-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230928130015.6758-4-nirmoy.das@intel.com
# b94c165e	19-Sep-2023	Clint Taylor <clinton.a.taylor@intel.com>	drm/i915/xe2lpd: Register DE_RRMR has been removed Do not read DE_RRMR register after display version 20. This register contains display state information during GFX state dumps. Bspec: 69456 Cc: Anusha Srivatsa <anusha.srivatsa@intel.com> Cc: Gustavo Sousa <gustavo.sousa@intel.com> Signed-off-by: Clint Taylor <clinton.a.taylor@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230919192128.2045154-9-lucas.demarchi@intel.com
# 8874288c	13-Sep-2023	Jouni Högander <jouni.hogander@intel.com>	drm/i915: Remove runtime suspended boolean from intel_runtime_pm struct It's not necessary to carry separate suspended status information in intel_runtime_pm struct as this information is already in underlying device structure. Remove it and use pm_runtime_suspended() to obtain suspended status information when needed. Cc: Jani Nikula <jani.nikula@intel.com> Cc: Imre Deak <imre.deak@intel.com> Signed-off-by: Jouni Högander <jouni.hogander@intel.com> Acked-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Imre Deak <imre.deak@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230913100430.3433969-1-jouni.hogander@intel.com
# 4ae7eb92	27-Jun-2023	Jani Nikula <jani.nikula@intel.com>	drm/i915: separate display info printing from the rest Add new function intel_display_device_info_print() and print the display device info there instead of intel_device_info_print(). This also fixes the display runtime info printing to use the actual runtime info instead of the static defaults. Cc: Matt Roper <matthew.d.roper@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/30d4f93c58839bc9312b43423cd43bc0ef655a35.1687878757.git.jani.nikula@intel.com
# 36dd2a6e	17-Jun-2023	Sumitra Sharma <sumitraartsy@gmail.com>	drm/i915: Replace kmap() with kmap_local_page() kmap() has been deprecated in favor of the kmap_local_page() due to high cost, restricted mapping space, the overhead of a global lock for synchronization, and making the process sleep in the absence of free slots. kmap_local_page() is faster than kmap() and offers thread-local and CPU-local mappings, can take pagefaults in a local kmap region and preserves preemption by saving the mappings of outgoing tasks and restoring those of the incoming one during a context switch. The mapping is kept thread local in the function “i915_vma_coredump_create” in i915_gpu_error.c Therefore, replace kmap() with kmap_local_page(). Suggested-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Sumitra Sharma <sumitraartsy@gmail.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230617180420.GA410966@sumitra.com [tursulin: Removed blank line within tags. Fixup commit text.]
# f8a101ff	21-Jun-2023	Matthew Wilcox (Oracle) <willy@infradead.org>	i915: convert i915_gpu_error to use a folio_batch Remove one of the last remaining users of pagevec. Link: https://lkml.kernel.org/r/20230621164557.3510324-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
# 6197cff3	18-Apr-2023	John Harrison <John.C.Harrison@Intel.com>	drm/i915: Dump error capture to kernel log This is useful for getting debug information out in certain situations, such as failing kernel selftests and CI runs that don't log error captures. It is especially useful for things like retrieving GuC logs as GuC operation can't be tracked by adding printk or ftrace entries. v2: Add CONFIG_DRM_I915_DEBUG_GEM wrapper (review feedback by Rodrigo). Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230418181744.3251240-2-John.C.Harrison@Intel.com
# 9275277d	09-May-2023	Fei Yang <fei.yang@intel.com>	drm/i915: use pat_index instead of cache_level Currently the KMD is using enum i915_cache_level to set caching policy for buffer objects. This is flaky because the PAT index which really controls the caching behavior in PTE has far more levels than what's defined in the enum. In addition, the PAT index is platform dependent, having to translate between i915_cache_level and PAT index is not reliable, and makes the code more complicated. From UMD's perspective there is also a necessity to set caching policy for performance fine tuning. It's much easier for the UMD to directly use PAT index because the behavior of each PAT index is clearly defined in Bspec. Having the abstracted i915_cache_level sitting in between would only cause more ambiguity. PAT is expected to work much like MOCS already works today, and by design userspace is expected to select the index that exactly matches the desired behavior described in the hardware specification. For these reasons this patch replaces i915_cache_level with PAT index. Also note, the cache_level is not completely removed yet, because the KMD still has the need of creating buffer objects with simple cache settings such as cached, uncached, or writethrough. For kernel objects, cache_level is used for simplicity and backward compatibility. For Pre-gen12 platforms PAT can have 1:1 mapping to i915_cache_level, so these two are interchangeable. see the use of LEGACY_CACHELEVEL. One consequence of this change is that gen8_pte_encode is no longer working for gen12 platforms due to the fact that gen12 platforms has different PAT definitions. In the meantime the mtl_pte_encode introduced specfically for MTL becomes generic for all gen12 platforms. This patch renames the MTL PTE encode function into gen12_pte_encode and apply it to all gen12. Even though this change looks unrelated, but separating them would temporarily break gen12 PTE encoding, thus squash them in one patch. Special note: this patch changes the way caching behavior is controlled in the sense that some objects are left to be managed by userspace. For such objects we need to be careful not to change the userspace settings.There are kerneldoc and comments added around obj->cache_coherent, cache_dirty, and how to bypass the checkings by i915_gem_object_has_cache_level. For full understanding, these changes need to be looked at together with the two follow-up patches, one disables the {set\|get}_caching ioctl's and the other adds set_pat extension to the GEM_CREATE uAPI. Bspec: 63019 Cc: Chris Wilson <chris.p.wilson@linux.intel.com> Signed-off-by: Fei Yang <fei.yang@intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230509165200.1740-3-fei.yang@intel.com
# 88629fee	02-May-2023	Jani Nikula <jani.nikula@intel.com>	drm/i915/error: fix i915_capture_error_state() kernel-doc drivers/gpu/drm/i915/i915_gpu_error.c:2174: warning: Function parameter or member 'dump_flags' not described in 'i915_capture_error_state' Signed-off-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20524292b002800975d82d23b5bd47da878f1733.1683041799.git.jani.nikula@intel.com
# e4730ae4	28-Apr-2023	John Harrison <John.C.Harrison@Intel.com>	drm/i915/guc: Fix error capture for virtual engines GuC based register dumps in error capture logs were basically broken for virtual engines. This can be seen in igt@gem_exec_balancer@hang: [IGT] gem_exec_balancer: starting subtest hang [drm] GPU HANG: ecode 12:4:e1524110, in gem_exec_balanc [6388] [drm] GT0: GUC: No register capture node found for 0x1005 / 0xFEDC311D [drm] GPU HANG: ecode 12:4:00000000, in gem_exec_balanc [6388] [IGT] gem_exec_balancer: exiting, ret=0 The test causes a hang on both engines of a virtual engine context. The engine instance zero hang gets a valid error capture but the non-instance-zero hang does not. Fix that by scanning through the list of pending register captures when a hang notification for a virtual engine is received. That way, the hang can be assigned to the correct physical engine prior to starting the error capture process. So later on, when the error capture handler tries to find the engine register list, it looks for one on the correct engine. Also, sneak in a missing blank line before a comment in the node search code. v2: Fix null pointer deref on non-GuC platforms. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230428185636.457407-5-John.C.Harrison@Intel.com
# c8a76df6	10-Mar-2023	John Harrison <John.C.Harrison@Intel.com>	drm/i915: Include timeline seqno in error capture The seqno value actually written out to memory is no longer in the regular HWSP. Instead, it is now in its own private timeline buffer. Thus, it is no longer visible in an error capture. So, explicitly read the value and include that in the capture. v2: %d -> %u (Alan) Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230311063714.570389-4-John.C.Harrison@Intel.com
# e7696d65	26-Jan-2023	John Harrison <John.C.Harrison@Intel.com>	drm/i915: Allow error capture of a pending request A hang situation has been observed where the only requests on the context were either completed or not yet started according to the breaadcrumbs. However, the register state claimed a batch was (maybe) in progress. So, allow capture of the pending request on the grounds that this might be better than nothing. v2: Reword 'not started' warning message (Tvrtko) Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230127002842.3169194-6-John.C.Harrison@Intel.com
# e8a3319c	26-Jan-2023	John Harrison <John.C.Harrison@Intel.com>	drm/i915: Allow error capture without a request There was a report of error captures occurring without any hung context being indicated despite the capture being initiated by a 'hung context notification' from GuC. The problem was not reproducible. However, it is possible to happen if the context in question has no active requests. For example, if the hang was in the context switch itself then the breadcrumb write would have occurred and the KMD would see an idle context. In the interests of attempting to provide as much information as possible about a hang, it seems wise to include the engine info regardless of whether a request was found or not. As opposed to just prentending there was no hang at all. So update the error capture code to always record engine information if a context is given. Which means updating record_context() to take a context instead of a request (which it only ever used to find the context anyway). And split the request agnostic parts of intel_engine_coredump_add_request() out into a seaprate function. v2: Remove a duplicate 'if' statement (Umesh) and fix a put of a null pointer. v3: Tidy up request locking code flow (Tvrtko) v4: Pull in improved info message from next patch and fix up potential leak of GuC register state (Daniele) Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> (v2) Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230127002842.3169194-5-John.C.Harrison@Intel.com
# a4be3dca	26-Jan-2023	John Harrison <John.C.Harrison@Intel.com>	drm/i915: Fix up locking around dumping requests lists The debugfs dump of requests was confused about what state requires the execlist lock versus the GuC lock. There was also a bunch of duplicated messy code between it and the error capture code. So refactor the hung request search into a re-usable function. And reduce the span of the execlist state lock to only the execlist specific code paths. In order to do that, also move the report of hold count (which is an execlist only concept) from the top level dump function to the lower level execlist specific function. Also, move the execlist specific code into the execlist source file. v2: Rename some functions and move to more appropriate files (Daniele). v3: Rename new execlist dump function (Daniele) Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset with GuC") Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Cc: Michael Cheng <michael.cheng@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Bruce Chang <yu.bruce.chang@intel.com> Cc: Alan Previn <alan.previn.teres.alexis@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230127002842.3169194-4-John.C.Harrison@Intel.com
# 3700e353	26-Jan-2023	John Harrison <John.C.Harrison@Intel.com>	drm/i915: Fix request ref counting during error capture & debugfs dump When GuC support was added to error capture, the reference counting around the request object was broken. Fix it up. The context based search manages the spinlocking around the search internally. So it needs to grab the reference count internally as well. The execlist only request based search relies on external locking, so it needs an external reference count but within the spinlock not outside it. The only other caller of the context based search is the code for dumping engine state to debugfs. That code wasn't previously getting an explicit reference at all as it does everything while holding the execlist specific spinlock. So, that needs updaing as well as that spinlock doesn't help when using GuC submission. Rather than trying to conditionally get/put depending on submission model, just change it to always do the get/put. v2: Explicitly document adding an extra blank line in some dense code (Andy Shevchenko). Fix multiple potential null pointer derefs in case of no request found (some spotted by Tvrtko, but there was more!). Also fix a leaked request in case of !started and another in __guc_reset_context now that intel_context_find_active_request is actually reference counting the returned request. v3: Add a _get suffix to intel_context_find_active_request now that it grabs a reference (Daniele). v4: Split the intel_guc_find_hung_context change to a separate patch and rename intel_context_find_active_request_get to intel_context_get_active_request (Tvrtko). v5: s/locking/reference counting/ in commit message (Tvrtko) Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset with GuC") Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset") Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Andrzej Hajda <andrzej.hajda@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Cc: Michael Cheng <michael.cheng@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com> Cc: Alan Previn <alan.previn.teres.alexis@intel.com> Cc: Bruce Chang <yu.bruce.chang@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230127002842.3169194-3-John.C.Harrison@Intel.com
# 5bc4b43d	26-Jan-2023	John Harrison <John.C.Harrison@Intel.com>	drm/i915: Fix up locking around dumping requests lists The debugfs dump of requests was confused about what state requires the execlist lock versus the GuC lock. There was also a bunch of duplicated messy code between it and the error capture code. So refactor the hung request search into a re-usable function. And reduce the span of the execlist state lock to only the execlist specific code paths. In order to do that, also move the report of hold count (which is an execlist only concept) from the top level dump function to the lower level execlist specific function. Also, move the execlist specific code into the execlist source file. v2: Rename some functions and move to more appropriate files (Daniele). v3: Rename new execlist dump function (Daniele) Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset with GuC") Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Cc: Michael Cheng <michael.cheng@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Bruce Chang <yu.bruce.chang@intel.com> Cc: Alan Previn <alan.previn.teres.alexis@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230127002842.3169194-4-John.C.Harrison@Intel.com (cherry picked from commit a4be3dca53172d9d2091e4b474fb795c81ed3d6c) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
# 86d8ddc7	26-Jan-2023	John Harrison <John.C.Harrison@Intel.com>	drm/i915: Fix request ref counting during error capture & debugfs dump When GuC support was added to error capture, the reference counting around the request object was broken. Fix it up. The context based search manages the spinlocking around the search internally. So it needs to grab the reference count internally as well. The execlist only request based search relies on external locking, so it needs an external reference count but within the spinlock not outside it. The only other caller of the context based search is the code for dumping engine state to debugfs. That code wasn't previously getting an explicit reference at all as it does everything while holding the execlist specific spinlock. So, that needs updaing as well as that spinlock doesn't help when using GuC submission. Rather than trying to conditionally get/put depending on submission model, just change it to always do the get/put. v2: Explicitly document adding an extra blank line in some dense code (Andy Shevchenko). Fix multiple potential null pointer derefs in case of no request found (some spotted by Tvrtko, but there was more!). Also fix a leaked request in case of !started and another in __guc_reset_context now that intel_context_find_active_request is actually reference counting the returned request. v3: Add a _get suffix to intel_context_find_active_request now that it grabs a reference (Daniele). v4: Split the intel_guc_find_hung_context change to a separate patch and rename intel_context_find_active_request_get to intel_context_get_active_request (Tvrtko). v5: s/locking/reference counting/ in commit message (Tvrtko) Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset with GuC") Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset") Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Andrzej Hajda <andrzej.hajda@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Cc: Michael Cheng <michael.cheng@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com> Cc: Alan Previn <alan.previn.teres.alexis@intel.com> Cc: Bruce Chang <yu.bruce.chang@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230127002842.3169194-3-John.C.Harrison@Intel.com (cherry picked from commit 3700e353781e27f1bc7222f51f2cc36cbeb9b4ec) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
# 801543b2	09-Nov-2022	Jani Nikula <jani.nikula@intel.com>	drm/i915: stop including i915_irq.h from i915_trace.h Turns out many of the files that need i915_reg.h get it implicitly via {display/intel_de.h, gt/intel_context.h} -> i915_trace.h -> i915_irq.h -> i915_reg.h. Since i915_trace.h doesn't actually need i915_irq.h, makes sense to drop it, but that requires adding quite a few new includes all over the place. Prefer including i915_reg.h where needed instead of adding another implicit include, because eventually we'll want to split up i915_reg.h and only include the specific registers at each place. Also some places actually needed i915_irq.h too. Cc: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/6e78a2e0ac1bffaf5af3b5ccc21dff05e6518cef.1668008071.git.jani.nikula@intel.com
# ab1b2d40	14-Oct-2022	Matt Roper <matthew.d.roper@intel.com>	drm/i915/xehp: Check for faults on primary GAM On Xe_HP the fault registers are now in a multicast register range. However as part of the GAM these registers follow special rules and we need only read from the "primary" GAM's instance to get the information we need. So a single intel_gt_mcr_read_any() (which will automatically steer to the primary GAM) is sufficient; we don't need to loop over each instance of the MCR register. v2: - Update more instances of fault registers. (Bala) Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221014230239.1023689-7-matthew.d.roper@intel.com
# 665ae9c9	06-Sep-2022	John Harrison <John.C.Harrison@Intel.com>	drm/i915/uc: Support for version reduced and multiple firmware files There was a misunderstanding in how firmware file compatibility should be managed within i915. This has been clarified as: i915 must support all existing firmware releases forever new minor firmware releases should replace prior versions only backwards compatibility breaking releases should be a new file This patch cleans up the single fallback file support that was added as a quick fix emergency effort. That is now removed in preference to supporting arbitrary numbers of firmware files per platform. The patch also adds support for having GuC firmware files that are named by major version only (because the major version indicates backwards breaking changes that affect the KMD) and for having HuC firmware files with no version number at all (because the KMD has no interface requirements with the HuC). For GuC, the driver will report via dmesg if the found file is older than expected. For HuC, the KMD will no longer require updating for any new HuC release so will not be able to report what the latest expected version is. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220906230147.479945-1-daniele.ceraolospurio@intel.com
# c7d3c844	19-Aug-2022	Jani Nikula <jani.nikula@intel.com>	drm/i915: combine device info printing into one We'll be moving info between static and runtime info. Combine the printing functions into one to keep the output sensible and (mostly) unchanged in the process. Signed-off-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Maarten Lankhort <maarten.lankhorst@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/8579bfe0fcc5ee8390d4cded68a0167a618097f5.1660910433.git.jani.nikula@intel.com
# c5de70f6	27-Jul-2022	John Harrison <John.C.Harrison@Intel.com>	drm/i915/guc: Record CTB info in error logs When debugging GuC communication issues, it is useful to have the CTB info available. So add the state and buffer contents to the error capture log. Also, add a sub-structure for the GuC specific error capture info as it is now becoming numerous. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220728022028.2190627-5-John.C.Harrison@Intel.com
# 368d179a	27-Jul-2022	John Harrison <John.C.Harrison@Intel.com>	drm/i915/guc: Add GuC <-> kernel time stamp translation information It is useful to be able to match GuC events to kernel events when looking at the GuC log. That requires being able to convert GuC timestamps to kernel time. So, when dumping error captures and/or GuC logs, include a stamp in both time zones plus the clock frequency. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220728022028.2190627-4-John.C.Harrison@Intel.com
# 9a92732f	01-Jul-2022	Matt Roper <matthew.d.roper@intel.com>	drm/i915/gt: Add general DSS steering iterator to intel_gt_mcr Although all DSS belong to a single pool on Xe_HP platforms (i.e., they're not organized into slices from a topology point of view), we do still need to pass 'group' and 'instance' targets when steering register accesses to a specific instance of a per-DSS multicast register. The rules for how to determine group and instance IDs (which previously used legacy terms "slice" and "subslice") varies by platform. Some platforms determine steering by gslice membership, some platforms by cslice membership, and future platforms may have other rules. Since looping over each DSS and performing steered unicast register accesses is a relatively common pattern, let's add a dedicated iteration macro to handle this (and replace the platform-specific "instdone" loop we were using previously. This will avoid the calling code needing to figure out the details about how to obtain steering IDs for a specific DSS. Most of the places where we use this new loop are in the GPU errorstate code at the moment, but we do have some additional features coming in the future that will also need to loop over each DSS and steer some register accesses accordingly. Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220701232006.1016135-1-matthew.d.roper@intel.com
# d42a738e	29-Jun-2022	Matthew Auld <matthew.auld@intel.com>	drm/i915/error: skip non-mappable pages Skip capturing any lmem pages that can't be copied using the CPU. This in now only best effort on platforms that have small BAR. Testcase: igt@gem-exec-capture@capture-invisible Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Jon Bloomfield <jon.bloomfield@intel.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Jordan Justen <jordan.l.justen@intel.com> Cc: Kenneth Graunke <kenneth@whitecape.org> Cc: Akeem G Abodunrin <akeem.g.abodunrin@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220629174350.384910-7-matthew.auld@intel.com
# a0696856	24-Jun-2022	Nirmoy Das <nirmoy.das@intel.com>	drm/i915: Fix a lockdep warning at error capture For some platfroms we use stop_machine version of gen8_ggtt_insert_page/gen8_ggtt_insert_entries to avoid a concurrent GGTT access bug but this causes a circular locking dependency warning: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ggtt->error_mutex); lock(dma_fence_map); lock(&ggtt->error_mutex); lock(cpu_hotplug_lock); Fix this by calling gen8_ggtt_insert_page/gen8_ggtt_insert_entries directly at error capture which is concurrent GGTT access safe because reset path make sure of that. v2: Fix rebase conflict and added a comment. Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5595 Reviewed-by: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com> Signed-off-by: Ramalingam C <ramalingam.c@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220624110821.29190-1-nirmoy.das@intel.com
# b729cfee	01-Jun-2022	Stuart Summers <stuart.summers@intel.com>	drm/i915: Add extra registers to GPU error dump Our internal teams have identified a few additional engine registers that are worth inspecting in error state dumps during development & debug. Let's capture and print them as part of our error dump. For simplicity we'll just dump these registers on gen11 and beyond. Most of these registers have existed since earlier platforms (e.g., gen6 or gen7) but were initially introduced only for a subset of the platforms' engines; gen11 seems to be where they became available on all engines. Signed-off-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220601210646.615946-1-matthew.d.roper@intel.com
# 39921e5f	19-May-2022	Tvrtko Ursulin <tvrtko.ursulin@intel.com>	Revert "drm/i915: Drop has_gt_uc from device info" This reverts commit 222ff6db8a0dcb86f2bb65fc8656aec635a737a6. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Acked-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Acked-by: Jani Nikula <jani.nikula@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220519090802.1294691-8-tvrtko.ursulin@linux.intel.com
# 222ff6db	05-May-2022	José Roberto de Souza <jose.souza@intel.com>	drm/i915: Drop has_gt_uc from device info No need to have this parameter in intel_device_info struct as all platforms with graphics version 9 or newer has graphics microcontroller. As a side effect of the of removal this flag, it will not be printed in dmesg during driver load anymore and developers will have to rely on to check the macro and compare with platform being used and IP versions of it. Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220505193524.276400-1-jose.souza@intel.com
# bb6287cb	01-Apr-2022	Tvrtko Ursulin <tvrtko.ursulin@intel.com>	drm/i915: Track context current active time Track context active (on hardware) status together with the start timestamp. This will be used to provide better granularity of context runtime reporting in conjunction with already tracked pphwsp accumulated runtime. The latter is only updated on context save so does not give us visibility to any currently executing work. As part of the patch the existing runtime tracking data is moved under the new ce->stats member and updated under the seqlock. This provides the ability to atomically read out accumulated plus active runtime. v2: * Rename and make __intel_context_get_active_time unlocked. v3: * Use GRAPHICS_VER. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com> # v1 Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220401142205.3123159-6-tvrtko.ursulin@linux.intel.com
# 5efde05f	30-Mar-2022	Jani Nikula <jani.nikula@intel.com>	drm/i915/dmc: abstract GPU error state dump Only intel_dmc.c should be accessing dmc details directly. Need to add an i915_error_printf() stub for CONFIG_DRM_I915_CAPTURE_ERROR=n. v2: Add the stub (kernel test robot <lkp@intel.com>) Signed-off-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> # v1 Link: https://patchwork.freedesktop.org/patch/msgid/20220330113417.220964-1-jani.nikula@intel.com
# a7f46d5b	29-Mar-2022	Tvrtko Ursulin <tvrtko.ursulin@intel.com>	drm/i915: Move intel_vtd_active and run_as_guest to i915_utils Continuation of the effort to declutter i915_drv.h. Also, component specific helpers which consult the iommu/virtualization helpers moved to respective component source/header files as appropriate. v2: * s/dev_priv/i915/ in intel_scanout_needs_vtd_wa. (Lucas) Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Acked-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220329090204.2324499-1-tvrtko.ursulin@linux.intel.com [tursulin: fixup conflict in i915_drv.h]
# a0f1f7b4	21-Mar-2022	Alan Previn <alan.previn.teres.alexis@intel.com>	drm/i915/guc: Print the GuC error capture output register list. Print the GuC captured error state register list (string names and values) when gpu_coredump_state printout is invoked via the i915 debugfs for flushing the gpu error-state that was captured prior. Since GuC could have reported multiple engine register dumps in a single notification event, parse the captured data (appearing as a stream of structures) to identify each dump as a different 'engine-capture-group-output'. Finally, for each 'engine-capture-group-output' that is found, verify if the engine register dump corresponds to the engine_coredump content that was previously populated by the i915_gpu_coredump function. That function would have copied the context's vma's including the bacth buffer during the G2H-context-reset notification that occurred earlier. Perform this verification check by comparing guc_id, lrca and engine- instance obtained from the 'engine-capture-group-output' vs a copy of that same info taken during i915_gpu_coredump. If they match, then print those vma's as well (such as the batch buffers). NOTE: the output format was verified using the gem_exec_capture IGT test. Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220321164527.2500062-14-alan.previn.teres.alexis@intel.com
# a6f0f9cf	21-Mar-2022	Alan Previn <alan.previn.teres.alexis@intel.com>	drm/i915/guc: Plumb GuC-capture into gpu_coredump Add a flags parameter through all of the coredump creation functions. Add a bitmask flag to indicate if the top level gpu_coredump event is triggered in response to a GuC context reset notification. Using that flag, ensure all coredump functions that read or print mmio-register values related to work submission or command-streamer engines are skipped and replaced with a calls guc-capture module equivalent functions to retrieve or print the register dump. While here, split out display related register reading and printing into its own function that is called agnostic to whether GuC had triggered the reset. For now, introduce an empty printing function that can filled in on a subsequent patch just to handle formatting. Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220321164527.2500062-13-alan.previn.teres.alexis@intel.com