History log of /linux-master/drivers/gpu/drm/xe/xe_devcoredump_types.h
Revision Date Author Comments
# 0eb2a18a 21-Feb-2024 Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

drm/xe: Implement VM snapshot support for BO's and userptr

Since we cannot immediately capture the BO's and userptr, perform it in
2 stages. The immediate stage takes a reference to each BO and userptr,
while a delayed worker captures the contents and then frees the
reference.

This is required because in signaling context, no locks can be taken, no
memory can be allocated, and no waits on userspace can be performed.

With the delayed worker, all of this can be performed very easily,
without having to resort to hacks.

Changes since v1:
- Fix crash on NULL captured vm.
- Use ascii85_encode to capture BO contents and save some space.
- Add length to coredump output for each captured area.
Changes since v2:
- Dump each mapping on their own line, to simplify tooling.
- Fix null pointer deref in xe_vm_snapshot_free.
Changes since v3:
- Don't add uninitialized value to snap->ofs. (Souza)
- Use kernel types for u32 and u64.
- Move snap_mutex destruction to final vm destruction. (Souza)
Changes since v4:
- Remove extra memset. (Souza)

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240221133024.898315-6-maarten.lankhorst@linux.intel.com


# be7d51c5 30-Jan-2024 José Roberto de Souza <jose.souza@intel.com>

drm/xe: Add batch buffer addresses to devcoredump

Those addresses are necessary to Mesa tools knows where in VM are the
batch buffers to parse and print instructions that are human readable.

Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Maarten Lankhorst <dev@lankhorst.se>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240130135648.30211-2-jose.souza@intel.com


# 83ef64eb 23-Jan-2024 José Roberto de Souza <jose.souza@intel.com>

drm/xe: Nuke xe from xe_devcoredump

xe is never set in xe_devcoredump but if xe_device is needed
devcoredump_to_xe_device() can be used.

Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Maarten Lankhorst <dev@lankhorst.se>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240123204454.246788-3-jose.souza@intel.com


# 9b9529ce 31-Jul-2023 Francois Dugast <francois.dugast@intel.com>

drm/xe: Rename engine to exec_queue

Engine was inappropriately used to refer to execution queues and it
also created some confusion with hardware engines. Where it applies
the exec_queue variable name is changed to q and comments are also
updated.

Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/162
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>


# 82f428b6 22-May-2023 Matthew Auld <matthew.auld@intel.com>

drm/xe: fix kernel-doc issues

drivers/gpu/drm/xe/xe_guc_submit_types.h:47: warning: cannot understand
function prototype: 'struct guc_submit_parallel_scratch '

drivers/gpu/drm/xe/xe_devcoredump_types.h:38: warning: Function
parameter or member 'ct' not described in 'xe_devcoredump_snapshot'

CI doesn't appear to be running BAT anymore, assuming this is caused by
the CI.Hooks now failing due to above warnings.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>


# 01a87f31 16-May-2023 Rodrigo Vivi <rodrigo.vivi@intel.com>

drm/xe: Add HW Engine snapshot to xe_devcoredump.

Let's continue to add our existent simple logs to devcoredump one
by one. Any format change should come on follow-up work.

v2: remove unnecessary, and now duplicated, dma_fence annotation. (Matthew)
v3: avoid for_each with faulty_engine since that can be already freed at
the time of the read/free. Instead, iterate in the full array of
hw_engines. (Kasan)

Cc: Francois Dugast <francois.dugast@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Francois Dugast <francois.dugast@intel.com>


# 3847ec03d 16-May-2023 Rodrigo Vivi <rodrigo.vivi@intel.com>

drm/xe: Add GuC Submit Engine snapshot to xe_devcoredump.

Let's start to move our existent logs to devcoredump one by
one. Any format change should come on follow-up work.

Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>


# 5ed53446 16-May-2023 Rodrigo Vivi <rodrigo.vivi@intel.com>

drm/xe: Add GuC CT snapshot to xe_devcoredump.

Let's start to move our existent logs to devcoredump one by
one. Any format change should come on follow-up work.

v2: Rebase and add the dma_fence locking annotation here.

Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>


# e7994850 18-May-2023 Rodrigo Vivi <rodrigo.vivi@intel.com>

drm/xe: Introduce the dev_coredump infrastructure.

The goal is to use devcoredump infrastructure to report error states
captured at the crash time.

The error state will contain useful information for GPU hang debug, such
as INSTDONE registers and the current buffers getting executed, as well
as any other information that helps user space and allow later replays of
the error.

The proposal here is to avoid a Xe only error_state like i915 and use
a standard dev_coredump infrastructure to expose the error state.

For our own case, the data is only useful if it is a snapshot of the
time when the GPU crash has happened, since we reset the GPU immediately
after and the registers might have changed. So the proposal here is to
have an internal snapshot to be printed out later.

Also, usually a subsequent GPU hang can be only a cause of the initial
one. So we only save the 'first' hang. The dev_coredump has a delayed
work queue where it remove the coredump and free all the data within a
few moments of the error. When that happens we also reset our capture
state and allow further snapshots.

Right now this infra only print out the time of the hang. More information
will be migrated here on subsequent work. Also, in order to organize the
dump better, the goal is to propose dev_coredump changes itself to allow
multiple files and different controls. But for now we start Xe usage of
it without any dependency on dev_coredump core changes.

v2: Add dma_fence annotation for capture that might happen during long
running. (Thomas and Matt)
Use xe->drm.primary->index on drm_info msg. (Jani)
v3: checkpatch fixes
v4: Fix building and locking issues found by Francois.
Actually let's kill all of the locking in here. gt_reset serialization
already guarantee that there will be only one capture at the same time.
Also, the devcoredump has its own locking to protect the free and reads
and drivers don't need to duplicate it.
Besides this, the dma_fence locking was pushed to a following patch
since it is not needed in this one.
Fix a use after free identified by KASAN: Do not stash the faulty_engine
since that will be freed somewhere else.
v5: Fix Uptime - ktime_get_boottime actually returns the Uptime. (Francois)

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>