History log of /linux-master/drivers/accel/habanalabs/common/device.c
Revision Date Author Comments
# fa58b594 12-Feb-2024 Ofir Bitton <obitton@habana.ai>

accel/habanalabs: modify pci health check

Today we read the PCI VENDOR-ID in order to make sure the PCI link is
healthy. Apparently the VENDOR-ID might be stored on the host, so
reading it might not actually access the PCI bus.
To make the PCI health check reliable, check the DEVICE-ID instead.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
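
As an illustration of the check described above, here is a minimal
userspace sketch. It is not the driver's code: the reader callback,
struct mock_dev and the function names are hypothetical stand-ins (in
the kernel the read would go through pci_read_config_word() with the
PCI_DEVICE_ID offset). The mock assumes a dead link reads as all-ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* Offset of the Device ID register in PCI configuration space. */
#define CFG_OFF_DEVICE_ID 0x02

/* Hypothetical stand-in for a config-space read helper. */
typedef uint16_t (*cfg_read16_fn)(void *ctx, int offset);

/* The link is considered healthy if the Device ID reads back as expected. */
static bool pci_link_is_healthy(cfg_read16_fn read16, void *ctx,
				uint16_t expected_device_id)
{
	return read16(ctx, CFG_OFF_DEVICE_ID) == expected_device_id;
}

/* Mock device used only for this sketch. */
struct mock_dev {
	uint16_t device_id;
	bool link_up;
};

/* Assumption: reads over a broken link return all-ones. */
static uint16_t mock_read16(void *ctx, int offset)
{
	struct mock_dev *dev = ctx;

	if (!dev->link_up)
		return 0xFFFF;

	return offset == CFG_OFF_DEVICE_ID ? dev->device_id : 0;
}
```

The point of the sketch is only that the compared value must come from
the device itself; a value served from a host-side copy would pass the
check even when the link is down.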


# 731d320e 01-Jan-2024 Dani Liberman <dliberman@habana.ai>

accel/habanalabs: remove call to deprecated function

In newer kernel versions, irq_set_affinity_hint() is deprecated.
Instead, use the newer version which is irq_set_affinity_and_hint().

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 246d8b6c 24-Dec-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: abort device reset for consecutive heartbeat failures

The mechanism of aborting device reset for consecutive fatal errors is
currently only for fatal errors that are reported by FW.
A non-responsive FW and consecutive heartbeat failures are also
considered fatal, so add them as well to this mechanism to avoid
recurring device resets in such a case.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# d0df8a35 14-Dec-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: fix DRAM BAR base address calculation

When the DRAM region size in the BAR is not a power of 2, calculating
the corresponding BAR base address should be done using the offset from
the DRAM start address, and not directly using the DRAM address.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
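
The difference between the two calculations can be sketched as follows.
This is an illustrative userspace example with hypothetical function
names and made-up addresses, not the driver's implementation.

```c
#include <stdint.h>

/* Rounds the absolute DRAM address down to a multiple of the BAR size.
 * Only lines up with the DRAM start when the DRAM start itself is a
 * multiple of the BAR size. */
static uint64_t bar_base_naive(uint64_t addr, uint64_t bar_size)
{
	return (addr / bar_size) * bar_size;
}

/* Rounds the offset from the DRAM start down to a multiple of the BAR
 * size, then adds the DRAM start back. */
static uint64_t bar_base_from_offset(uint64_t addr, uint64_t dram_start,
				     uint64_t bar_size)
{
	uint64_t offset = addr - dram_start;

	return dram_start + (offset / bar_size) * bar_size;
}
```

With a DRAM start of 0x1000 and a BAR size of 0x300 (not a power of 2),
address 0x1400 falls in the window starting at 0x1300 when computed from
the offset, while the naive calculation yields 0x1200, and for address
0x1000 it even yields 0xF00, which is below the DRAM start.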


# e91c37f1 21-Sep-2023 Dani Liberman <dliberman@habana.ai>

accel/habanalabs/gaudi2: add interrupt affinity for user interrupts

User interrupts are MSI-X interrupts coming from Gaudi2 that have a
specific range of IDs and are assigned for the sole use of the user
process that opened the Gaudi2 device (reminder: there can be only
a single user process running on Gaudi2 at any given time).

The interrupts are allocated and managed by the driver and therefore,
the user expects the driver to initialize them properly, which also
includes setting the affinity to the related CPU cores of the
device's NUMA node to get maximum performance.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 4b0b1fbc 20-Jul-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: set hard reset flag if graceful reset is skipped

hl_device_cond_reset() might be called with the hard reset flag unset,
because a compute reset upon device release as part of a graceful reset
is valid.
If the conditions for graceful reset are not met, hl_device_reset() will
be called for an immediate reset. In this case a compute reset is not
valid, so it will be replaced with a hard reset together with a debug
message about it.
This message might be confusing, as it implies that a compute reset was
requested when it shouldn't. To prevent this confusion, set the hard
reset flag in hl_device_cond_reset() if going to an immediate reset.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# d1958dce 30-Oct-2023 Farah Kassabri <fkassabri@habana.ai>

accel/habanalabs: fix EQ heartbeat mechanism

Stop rescheduling another heartbeat check when the EQ heartbeat check
fails, as it generates confusing logs in dmesg saying that the heartbeat
fails.

Signed-off-by: Farah Kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 42422993 29-Oct-2023 Oded Gabbay <ogabbay@kernel.org>

accel/habanalabs: add support for Gaudi2C device

A Gaudi2 with a PCI revision ID of '3' is a Gaudi2C device and should
be detected and initialized as Gaudi2.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# e8bc0c1b 29-Oct-2023 Farah Kassabri <fkassabri@habana.ai>

accel/habanalabs: add log when eq event is not received

Add an error log when no EQ event is received from FW,
to cover a scenario where FW is stuck for some reason.
In such a case the driver will receive neither the EQ error interrupt
nor the EQ heartbeat event, and will just initiate a reset without
any indication in dmesg about the reason.

Signed-off-by: Farah Kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 3652117f 22-Nov-2023 Christian Brauner <brauner@kernel.org>

eventfd: simplify eventfd_signal()

Ever since the eventfd type was introduced back in 2007 in commit
e1ad7468c77d ("signal/timer/event: eventfd core") the eventfd_signal()
function only ever passed 1 as a value for @n. There's no point in
keeping that additional argument.

Link: https://lore.kernel.org/r/20231122-vfs-eventfd-signal-v2-2-bd549b14ce0c@kernel.org
Acked-by: Xu Yilun <yilun.xu@intel.com>
Acked-by: Andrew Donnellan <ajd@linux.ibm.com> # ocxl
Acked-by: Eric Farman <farman@linux.ibm.com> # s390
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 6fc69ca8 21-Sep-2023 Oded Gabbay <ogabbay@kernel.org>

accel/habanalabs: print device name when it is removed

Notifies the user which device was removed. It is important in
a server with multiple devices.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>


# ff92d010 27-Aug-2023 Ohad Sharabi <osharabi@habana.ai>

accel/habanalabs: trace dma map sgtable

Traces the DMA [un]map_sgtable using the new traces we added.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 1157b5d6 02-Mar-2023 farah kassabri <fkassabri@habana.ai>

accel/habanalabs: optimize timestamp registration handler

Currently we use dynamic allocation inside the irq handler
in order to allocate a free node to be used for the free jobs.

This operation is expensive, especially when we deal with a large
burst of event records that get released at the same time.

The alternative is to have a pre-allocated pool of free nodes
and just fetch nodes from this pool at irq handling time instead
of allocating them.

In case the pool becomes full, the driver will fall back to
dynamic allocations.

As part of the optimization, also update the unregister flow
upon re-using a timestamp record, making the operation much
simpler and quicker. We already have the record in the registration
flow and now we just seek to re-use it with a different interrupt.
Therefore, there is no need to look for the buffer according to the
user handle.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Tomer Tayar <ttayar@habana.ai>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
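
The pool-with-fallback idea above can be sketched in userspace as
follows. All names (struct node_pool, node_get() and so on) and the pool
size are hypothetical; malloc() stands in for the dynamic allocation an
irq handler would otherwise perform.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_SIZE 4 /* arbitrary size for the sketch */

struct free_node {
	int payload;
	bool from_pool; /* tells node_put() whether free() is needed */
};

struct node_pool {
	struct free_node nodes[POOL_SIZE];
	size_t next_avail; /* index of the next unused pool entry */
};

/* Take a node from the pool; fall back to malloc() once all pool
 * entries are in use. May return NULL if the fallback fails. */
static struct free_node *node_get(struct node_pool *pool)
{
	struct free_node *node;

	if (pool->next_avail < POOL_SIZE) {
		node = &pool->nodes[pool->next_avail++];
		node->from_pool = true;
		return node;
	}

	node = malloc(sizeof(*node));
	if (node)
		node->from_pool = false;

	return node;
}

/* Only dynamically allocated nodes are freed individually. */
static void node_put(struct free_node *node)
{
	if (node && !node->from_pool)
		free(node);
}

/* Make all pool entries available again, e.g. after a burst of
 * records has been processed. */
static void pool_reset(struct node_pool *pool)
{
	pool->next_avail = 0;
}
```

The common path costs an index increment, and the allocator is only hit
when a burst exceeds the pool size.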


# 051868d9 27-Aug-2023 farah kassabri <fkassabri@habana.ai>

accel/habanalabs: prevent sending heartbeat before events are enabled

Now that the heartbeat mechanism has been expanded to also serve
as an EQ health check, we shouldn't send heartbeat messages
to FW before the driver allows events to be received from FW.

If the driver sends two heartbeats before it enables events to be
received from FW, the EQ health check will fail and reset the
device.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
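
A simplified model of this behavior is sketched below. It assumes, for
illustration only, that each heartbeat is expected to be answered by one
EQ event before the next heartbeat is sent; the state structure and
function names are hypothetical and do not mirror the driver's code.

```c
#include <stdbool.h>

struct hb_state {
	bool events_enabled;       /* driver accepts events from FW */
	bool eq_event_outstanding; /* heartbeat sent, EQ event not seen yet */
};

enum hb_result {
	HB_SKIPPED,    /* events not enabled yet, nothing sent */
	HB_SENT,       /* heartbeat sent, EQ event now expected */
	HB_EQ_FAILURE, /* previous heartbeat was never answered via EQ */
};

/* Periodic heartbeat work. */
static enum hb_result heartbeat_tick(struct hb_state *state)
{
	/* The fix: do nothing until events can actually be received. */
	if (!state->events_enabled)
		return HB_SKIPPED;

	if (state->eq_event_outstanding)
		return HB_EQ_FAILURE;

	state->eq_event_outstanding = true;
	return HB_SENT;
}

/* Called when the EQ heartbeat event arrives from FW. */
static void eq_heartbeat_event_received(struct hb_state *state)
{
	state->eq_event_outstanding = false;
}
```

Without the events_enabled guard, two ticks before events are enabled
would end in HB_EQ_FAILURE even though nothing is wrong with the EQ.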


# 7c4130e6 07-Aug-2023 farah kassabri <fkassabri@habana.ai>

accel/habanalabs/gaudi2: handle eq health heartbeat check

Add a mechanism for FW EQ health check. This will be done using two
flows: using the heartbeat mechanism, and raising a dedicated interrupt
to indicate an EQ failure such as EQ full.
This patch adds the implementation of the EQ heartbeat for the Gaudi2
ASIC.

More info about the heartbeat mechanism:
Expand the heartbeat mechanism to monitor a new event that
will be sent from FW upon receiving a heartbeat message.
That way the driver can know whether the EQ is working or not.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# e0f45280 25-Jun-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: fix inline doc typos

Fix two typos

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# ab574f6a 25-Jun-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: disable events ioctls on control device

These ioctls are not used on the control device and, for graceful reset
to work, they should run on the compute device.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# fe77368c 19-Feb-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: register compute device as an accel device

Register the compute device as an accel device, and remove the creation
of the habanalabs compute char device.

The IOCTLs in this patch are still handled by the current driver
handler. Moving to DRM IOCTL handling requires moving the IOCTLs
numbers to a specific range, so it will be handled in subsequent
patches.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# a8ab1a81 23-May-2023 Ofir Bitton <obitton@habana.ai>

accel/habanalabs: add info ioctl for engine error reports

The user gets a notification for every engine error report, but still
lacks the exact engine information. Hence, allow the user to query
for the exact engine that reported an error.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 10926f60 13-Jun-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: set default device release watchdog T/O as 30 sec

After being notified about certain errors, the user is expected to finish
the post-error actions and to release the device within some timeout,
after which the device is reset.
The default timeout value is 5 sec, which in some cases is not enough for
a user application to collect debug data.
Increase the default value to 30 sec.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 37d72439 12-Jun-2023 Oded Gabbay <ogabbay@kernel.org>

accel/habanalabs: reset device if scrubbing failed

If scrubbing memory after user released device has failed it means
the device is in a bad state and should be reset.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>


# 89803af5 12-Jun-2023 Oded Gabbay <ogabbay@kernel.org>

accel/habanalabs: remove pdev check on idle check

Our simulator supports idle check so no need anymore to check if pdev
exists.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>


# 964b1f67 06-Jun-2023 Koby Elbaz <kelbaz@habana.ai>

accel/habanalabs: rename fd_list to hpriv_list

Every time an FD is returned to the user, the driver adds
a corresponding private structure to the list.
Yet, it's still a list of private structures rather than of FDs.
Also remove an unnecessary comment.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 942f18c5 06-Jun-2023 Koby Elbaz <kelbaz@habana.ai>

accel/habanalabs: call put_pid after hpriv list is updated

Because we might still be using related resources, decrementing PID's
reference count should be done at later stages of the device release.
A good place is right after the representing private structure is
removed from LKD's list.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 2b541cf9 05-Jun-2023 Koby Elbaz <kelbaz@habana.ai>

accel/habanalabs: print return code when process termination fails

As part of driver teardown, we attempt to kill all user processes.
It shouldn't fail, but if it does we want to print the error code that
the kapi returned to us.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# e4a97d6b 29-May-2023 Koby Elbaz <kelbaz@habana.ai>

accel/habanalabs: set device status 'malfunction' while in rmmod

hl_device_status() returns the status of an acquired device.
If a device is going down (following an rmmod cmd),
it should be marked as an unusable/malfunctioning device, and
hence should not be acquired.
However, this was not the case so far: a device going down would
inaccurately return 'in reset' status, allowing the user to acquire
the device. This introduced a bug where, as part of a reset flow,
the driver could not kill processes that have not run yet. Since
those processes aren't blocked from reacquiring the device, we
eventually get a new flow of the driver attempting to kill all
processes in a list that can never really be empty.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# a35c9976 10-May-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: update pending reset flags with new reset requests

If hl_device_cond_reset() is called while a reset is already pending but
hasn't started, the reset request will be dropped.
If the flags of the new request are more severe, e.g. a hard reset while
the pending reset is a compute reset, the eventual reset won't be
suitable for the device status.

To prevent such cases, update the pending reset flags with the new
request's flags before the request is dropped.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
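
The flag merge described above can be sketched as follows. The flag
name, the state structure and request_reset() are hypothetical names
used only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define RESET_FLAG_HARD (1u << 0) /* hypothetical flag bit */

struct reset_state {
	bool pending;           /* a reset was requested but has not started */
	uint32_t pending_flags; /* flags the eventual reset will use */
};

/* Returns true if this call queued a new reset, false if a reset was
 * already pending and the request was dropped after merging its flags. */
static bool request_reset(struct reset_state *state, uint32_t flags)
{
	if (state->pending) {
		/* Keep the most severe combination for the eventual reset. */
		state->pending_flags |= flags;
		return false;
	}

	state->pending = true;
	state->pending_flags = flags;
	return true;
}
```

A compute reset request followed by a hard reset request thus ends up as
a single pending reset with the hard flag set.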


# 5d89ce6f 09-May-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: prevent immediate hard reset due to 2 adjacent H/W events

When a H/W event is received while a user is registered to events, no
immediate hard reset will happen, and instead the user will be notified
and will have some time to handle it and eventually release the
device, after which the reset will be done.
If a user, as part of the handling and as part of the cleanup steps
towards releasing the device, unregisters from receiving those events,
and at that time an adjacent H/W event is received, it will be assumed
that the user is not registered to events and thus an immediate hard
reset is required.

To prevent such an unwanted immediate reset, modify the driver to
perform it if the user is not registered to events AND we don't already
have a pending reset for a previous H/W event.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
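
The modified condition can be expressed as a small predicate. This is a
sketch with hypothetical parameter names, not the driver's code.

```c
#include <stdbool.h>

/* An immediate hard reset is needed only if nobody is registered to
 * events AND no reset is already pending for a previous H/W event. */
static bool needs_immediate_reset(bool user_registered_to_events,
				  bool reset_already_pending)
{
	return !user_registered_to_events && !reset_already_pending;
}
```

The case this commit addresses is the one where the user has already
unregistered during cleanup while a reset is pending: the predicate is
false there, so the pending graceful flow is left to complete.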


# e6f49e96 22-May-2023 Dani Liberman <dliberman@habana.ai>

accel/habanalabs: refactor error info reset

Moved error info reset code to single function for future use from
other places in the driver.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 56921023 22-May-2023 Oded Gabbay <ogabbay@kernel.org>

accel/habanalabs: remove sim code

There were a few places where simulator-only code got into upstream.
Remove those places, as they can confuse other developers.

Fixes: 2a0a839b6a28 ("habanalabs: extend fatal messages to contain PCI info")
Cc: Moti Haimovski <mhaimovski@habana.ai>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 3b9abb4f 19-Mar-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: expose debugfs files later

Currently the debugfs root folder and files for a device are created at
an early step, before the device initialization and before the char
device and sysfs files are exposed to user.
As there is no real reason not to do it together with the device
creation, postpone it to be done right afterwards.

The initialization of the debugfs entry structure is left in its
current position because it is used before creating the files.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# d8b9cea5 18-Apr-2023 Ofir Bitton <obitton@habana.ai>

accel/habanalabs: add pci health check during heartbeat

Currently upon a heartbeat failure, we don't know if the failure
is due to firmware hang or due to a bad PCI link. Hence, we
are reading a PCI config space register with a known value (vendor ID)
so we will know which of the two possibilities caused the heartbeat
failure.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 9a4e44a4 04-Dec-2022 Koby Elbaz <kelbaz@habana.ai>

accel/habanalabs: refactor abort of completions and waits

Aborting CS completions should be in command_submission.c but aborting
waiting for user interrupts should be in device.c.

This separation is also for adding more abort operations in the future.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 802f25b6 21-Mar-2023 Tal Cohen <talcohen@habana.ai>

accel/habanalabs: sync f/w events interrupt in hard reset

Receiving events from FW while the device is in hard reset causes
a warning message in the driver log. The message may point to a
problem in the driver or FW, but it can also appear as a result
of events that were sent from FW just before the hard reset.
In order to avoid receiving events from FW while the device is in reset
and is already in 'disabled' mode, sync the f/w events interrupt right
before setting the device to 'disabled'.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 3a8d7c3a 22-Mar-2023 Tal Cohen <talcohen@habana.ai>

accel/habanalabs: send disable pci when compute ctx is active

Fix an issue in hard reset flow in which the driver didn't send a
disable pci message if there was an active compute context.
In hard reset, disable pci message should be sent no matter if
a compute context exists or not.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# a855f710 21-Mar-2023 Tal Cohen <talcohen@habana.ai>

accel/habanalabs: remove duplicated disable pci msg

The disable pci message is sent in reset device. It informs the FW not
to raise more EQs. The Driver may ignore received EQs, when the device
is in disabled mode.
The duplication happens when hard reset is scheduled during compute
reset and also performs 'escalate_reset_flow'.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# 248ed9e2 23-Mar-2023 Cai Huoqing <cai.huoqing@linux.dev>

accel/habanalabs: Remove redundant pci_clear_master

Remove pci_clear_master to simplify the code,
the bus-mastering is also cleared in do_pci_disable_device,
like this:
./drivers/pci/pci.c:2197
static void do_pci_disable_device(struct pci_dev *dev)
{
	u16 pci_command;

	pci_read_config_word(dev, PCI_COMMAND, &pci_command);
	if (pci_command & PCI_COMMAND_MASTER) {
		pci_command &= ~PCI_COMMAND_MASTER;
		pci_write_config_word(dev, PCI_COMMAND, pci_command);
	}

	pcibios_disable_device(dev);
}
And dev->is_busmaster is set to 0 in pci_disable_device.

Signed-off-by: Cai Huoqing <cai.huoqing@linux.dev>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 2e8e9a89 01-Mar-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: postpone mem_mgr IDR destruction to hpriv_release()

The memory manager IDR is currently destroyed when user releases the
file descriptor.
However, at this point the user context might be still held, and memory
buffers might be still in use.
Later on, calls to release those buffers will fail due to not finding
their handles in the IDR, leading to a memory leak.
To avoid this leak, split the IDR destruction from the memory manager
fini, and postpone it to hpriv_release() when there is no user context
and no buffers are used.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 28fbc058 16-Feb-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: use scnprintf() in print_device_in_use_info()

compose_device_in_use_info() was added to handle the snprintf() return
value in a single place.
However, the buffer size in print_device_in_use_info() is set such that
it would be enough for the max possible print, so
compose_device_in_use_info() is not really needed.
Moreover, scnprintf() can be used instead of snprintf(), to save the
check of whether the return value is larger than the given size.

Cc: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
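
The difference in return values can be demonstrated in userspace.
scnprintf() is a kernel function, so the sketch below uses a
hypothetical demo_scnprintf() wrapper around vsnprintf() that mimics its
return-value semantics.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Userspace analog of the kernel's scnprintf(): returns the number of
 * characters actually stored in buf, excluding the trailing NUL, so the
 * result never exceeds size - 1. */
static int demo_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int len;

	if (size == 0)
		return 0;

	va_start(args, fmt);
	len = vsnprintf(buf, size, fmt, args);
	va_end(args);

	if (len < 0)
		return 0;

	/* vsnprintf() reports the would-be length; clamp on truncation. */
	return (size_t)len >= size ? (int)(size - 1) : len;
}
```

With an 8-byte buffer and a 10-character string, snprintf() returns 10
while the wrapper returns 7, so a running offset built from the
wrapper's return values can never point past the end of the buffer.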


# 86b74d84 14-Feb-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: assert return value of hw_fini

Since hw_fini returns an error code to indicate failure, we should
check its return value. Currently it might only fail upon soft-reset
from hl_device_reset. A later patch will add hw_fini failure in case of
polling timeout in hard-reset.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# efbd36b2 19-Feb-2023 Sagiv Ozeri <sozeri@habana.ai>

accel/habanalabs: add device id to all threads names

Compute driver thread names will start with hlX-*, where X is the
device id.
This will help distinguish them from the NIC thread names.

Signed-off-by: Sagiv Ozeri <sozeri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# a8c14f53 12-Feb-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: improve readability of engines idle mask print

Remove leading zeroes when printing the idle mask to make it clearer.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# 3a621af6 13-Feb-2023 Tom Rix <trix@redhat.com>

accel/habanalabs: set hl_capture_*_err storage-class-specifier to static

smatch reports
drivers/accel/habanalabs/common/device.c:2619:6: warning:
symbol 'hl_capture_hw_err' was not declared. Should it be static?
drivers/accel/habanalabs/common/device.c:2641:6: warning:
symbol 'hl_capture_fw_err' was not declared. Should it be static?

both are only used in device.c, so they should be static

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 4a2e9d11 01-Feb-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: don't trace cpu accessible dma alloc/free

The cpu accessible dma allocations use the gen_pool api which actually
does not allocate new memory from the system but manages memory already
allocated before. When tracing this together with real dma
allocation/free, it causes confusing logs, such as a '0' dma address and
a cpu address appearing twice.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# 1d0f9ad7 08-Feb-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: in hl_device_reset small refactor for readabilty

In the out_err flow, combine the two cases of soft-reset since
they have mostly common code. In addition, unlock reset_info.lock
after touching the reset count.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# 39ab4da9 08-Feb-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: in hl_device_reset remove 'hard_instead_of_soft'

Because this field is only used for debug print,
we can do more precise debug directly instead.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# 7810c524 08-Feb-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: tiny refactor of hl_device_reset for readability

Align assignment of reset_upon_device_release to the convention used
in this function.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# 18d13584 25-Jan-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: enable graceful reset mechanism for compute-reset

The graceful reset mechanism is currently enabled only for reset
requests that will end up with hard-reset.
In future, reset requests due to errors in some device engines, are
going to be modified to request compute-reset, as the much longer
hard-reset is not really needed there.
To allow it, enable graceful reset also for compute-reset, and reset
after user releases the device won't be escalated to hard-reset in those
cases.
If watchdog expires and user didn't release the device, hard-reset will
be initiated in any case.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# 57479adb 29-Jan-2023 Koby Elbaz <kelbaz@habana.ai>

accel/habanalabs: disable PCI when escalating compute to hard-reset

In case a compute reset has failed or a request for a hard reset has
just arrived, then we escalate current reset procedure from compute
to hard-reset.
In such a case, the FW should be aware of the updated error cause,
and if LKD is the one who performs the reset (rather than the FW),
then we ask the FW to disable PCI access.

We would also like to have relevant debug info and therefore
we print the currently escalating reset type.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# 313e9f63 10-Jan-2023 Moti Haimovski <mhaimovski@habana.ai>

accel/habanalabs: add critical-event bit in notifier

Enhance the existing user notifications by adding HW and FW critical
event bits, to be used when a HW or FW event occurs that requires
both SW abort and hard-resetting the chip.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# d43bce6e 18-Jan-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: add info when FD released while device still in use

When user closes the device file descriptor, it is checked whether the
device is still in use, and a message is printed if it is.
To make this message more informative, also print the reason
why the device is considered to be in use.
The possible reasons which are checked for now are active CS and
exported dma-buf.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# 323adae9 22-Jan-2023 Oded Gabbay <ogabbay@kernel.org>

accel/habanalabs: save class in hdev

It is more concise than to pass it to device init. Once we will add the
accel class, then we won't need to change the function signatures.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# 89859a89 22-Jan-2023 Oded Gabbay <ogabbay@kernel.org>

accel/habanalabs: split cdev creation to separate function

Move the cdev creation code from the main hdev init function to
a separate function. This will make the code more readable once we
add the accel registration code (instead/in addition to legacy
cdev).

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# 44155bb6 17-Jan-2023 Tomer Tayar <ttayar@habana.ai>

habanalabs: clear in_compute_reset when escalating to hard reset

If the device is reset upon release while the release watchdog work is
scheduled, the compute reset is replaced with a hard reset.
In this case, we need to clear the in_compute_reset indication in the
device reset information structure.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 0c93eb09 17-Jan-2023 Tomer Tayar <ttayar@habana.ai>

habanalabs: run error handling if scrub_device_mem fails after reset

If device memory scrubbing from hl_device_reset() fails, we return with
an error code but do not run the error handling code.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# a6685b57 11-Jan-2023 Koby Elbaz <kelbaz@habana.ai>

habanalabs: block soft-reset on an unusable device

A device with status malfunction indicates that it can't be used.
In such a case we do not support certain reset types, e.g.,
all kinds of soft-resets (compute reset, inference soft-reset),
and reset upon device release.

A hard-reset is the only way that an unusable device can change its
status. No other reset type may put the device into a reset
procedure, as that might ultimately cause the device to unintentionally
change its status and become operational again.

Such a scenario has recently occurred, when a user requested
a hard-reset while another heavy user workload was ongoing (reset
request is queued).
Since the workload couldn't finish within reset's timeout limits, the
reset has failed and set a device status malfunction.
Eventually, when the user released the FD, an unsuccessful soft-reset
occurred, followed by an additional hard-reset that changed the
ASIC's status back to operational.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
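
The rule described above can be sketched as a small gate function. The
enum values and the function name are hypothetical and only illustrate
the policy; they are not the driver's definitions.

```c
#include <stdbool.h>

enum dev_status {
	DEV_STATUS_OPERATIONAL,
	DEV_STATUS_IN_RESET,
	DEV_STATUS_MALFUNCTION,
};

/* Returns 0 if the requested reset may proceed, -1 if it is rejected.
 * On a malfunctioning device only a hard reset is accepted, so that
 * softer reset types cannot bring the device back to operational. */
static int reset_allowed(enum dev_status status, bool hard_reset)
{
	if (status == DEV_STATUS_MALFUNCTION && !hard_reset)
		return -1;

	return 0;
}
```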


# 2a0a839b 28-Dec-2022 Moti Haimovski <mhaimovski@habana.ai>

habanalabs: extend fatal messages to contain PCI info

This commit attaches the PCI device address to driver fatal messages
in order to ease debugging in multi-device setups.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 54fcb384 30-Nov-2022 Ohad Sharabi <osharabi@habana.ai>

habanalabs: trace LBW reads/writes

Add traces to LBW reads/writes.
This may be handy when debugging configuration failure or events when
tracking configuration flow.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 571d1a72 23-Dec-2022 Koby Elbaz <kelbaz@habana.ai>

habanalabs: protect access to dynamic mem 'user_mappings'

When the HL_INFO_USER_MAPPINGS IOCTL is called, we copy_to_user from
dynamically allocated memory - 'user_mappings'.
Since freeing/allocating it happens at runtime (upon a page fault),
it is not unlikely that it will be accessed even before being initially
allocated (i.e., accessing a NULL pointer).

The solution is to simply mark the spot when the err info has been
collected, and that way to know whether err info (either page fault
or RAZWI) is available to be read.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
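
The "mark the spot" approach can be sketched in userspace as follows.
The structure layout and names are hypothetical, memcpy() stands in for
copy_to_user(), and the locking a real driver would need is omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct page_fault_info {
	bool info_available;      /* set once error info has been collected */
	size_t num_mappings;
	const int *user_mappings; /* allocated only when a page fault occurs */
};

/* Copies up to max entries into out. Returns the number of entries
 * copied, or -1 if no error info has been captured yet. */
static long read_user_mappings(const struct page_fault_info *info,
			       int *out, size_t max)
{
	size_t count;

	/* Never touch the buffer before the info was marked available. */
	if (!info->info_available || !info->user_mappings)
		return -1;

	count = info->num_mappings < max ? info->num_mappings : max;
	memcpy(out, info->user_mappings, count * sizeof(*out));

	return (long)count;
}
```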


# 78baccbd 24-Dec-2022 Koby Elbaz <kelbaz@habana.ai>

habanalabs: refactor razwi/page-fault information structures

This refactor makes the code clearer and the new variables' names
better describe their roles.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# e2a079a2 06-Dec-2022 Tomer Tayar <ttayar@habana.ai>

habanalabs: verify that kernel CB is destroyed only once

Remove the distinction between user CB and kernel CB, and verify for
both that they are not destroyed more than once.

As a kernel CB might be taken from the pre-allocated CB pool, we need
to clear the handle destroyed indication when returning a CB to the
pool.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
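
A destroy-once guard of this kind can be sketched with C11 atomics. The
structure and function names are hypothetical, and the use of a
compare-and-exchange is an assumption made for this illustration rather
than a statement about the driver's implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct cb {
	atomic_int handle_destroyed; /* 0 = live, 1 = already destroyed */
};

/* Returns true only for the first caller; later callers see that the
 * handle was already destroyed and must not destroy it again. */
static bool cb_mark_destroyed(struct cb *cb)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&cb->handle_destroyed,
					      &expected, 1);
}

/* A CB going back to the pre-allocated pool must be destroyable again
 * the next time it is handed out, so the indication is cleared. */
static void cb_return_to_pool(struct cb *cb)
{
	atomic_store(&cb->handle_destroyed, 0);
}
```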


# e65e175b 26-Dec-2022 Oded Gabbay <ogabbay@kernel.org>

habanalabs: move driver to accel subsystem

Now that we have a subsystem for compute accelerators, move the
habanalabs driver to it.

This patch only moves the files and fixes the Makefiles. Future
patches will change the existing code to register to the accel
subsystem and expose the accel device char files instead of the
habanalabs device char files.

Update the MAINTAINERS file to reflect this change.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>