Cross Reference: /linux-master/drivers/accel/habanalabs/common/device.c

History log of /linux-master/drivers/accel/habanalabs/common/device.c
Revision	Date	Author	Comments
# fa58b594	12-Feb-2024	Ofir Bitton <obitton@habana.ai>	accel/habanalabs: modify pci health check Today we read PCI VENDOR-ID in order to make sure PCI link is healthy. Apparently the VENDOR-ID might be stored on host and hence, when we read it we might not access the PCI bus. In order to make sure PCI health check is reliable, we will start checking the DEVICE-ID instead. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 731d320e	01-Jan-2024	Dani Liberman <dliberman@habana.ai>	accel/habanalabs: remove call to deprecated function In newer kernel versions, irq_set_affinity_hint() is deprecated. Instead, use the newer version which is irq_set_affinity_and_hint(). Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 246d8b6c	24-Dec-2023	Tomer Tayar <ttayar@habana.ai>	accel/habanalabs: abort device reset for consecutive heartbeat failures The mechanism of aborting device reset for consecutive fatal errors is currently only for fatal errors that are reported by FW. A non-responsive FW and consecutive heartbeat failures is also considered fatal, so add them as well to this mechanism to avoid recurring device reset in such a case. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# d0df8a35	14-Dec-2023	Tomer Tayar <ttayar@habana.ai>	accel/habanalabs: fix DRAM BAR base address calculation When the DRAM region size in the BAR is not a power of 2, calculating the corresponding BAR base address should be done using the offset from the DRAM start address, and not using directly the DRAM address. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# e91c37f1	21-Sep-2023	Dani Liberman <dliberman@habana.ai>	accel/habanalabs/gaudi2: add interrupt affinity for user interrupts User interrupts are MSIx interrupts coming from Gaudi2, that have specific range of IDs and are assigned to the sole use of the user process that opened the Gaudi2 device (reminder: there can be only a single user process running on Gaudi2 at any given time). The interrupts are allocated and managed by the driver and therefore, the user expects the driver to initialize them properly, which also includes setting the affinity to the related CPU cores of the device's NUMA node to get maximum performance. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 4b0b1fbc	20-Jul-2023	Tomer Tayar <ttayar@habana.ai>	accel/habanalabs: set hard reset flag if graceful reset is skipped hl_device_cond_reset() might be called with the hard reset flag unset, because a compute reset upon device release as part of a graceful reset is valid. If the conditions for graceful reset are not met, hl_device_reset() will be called for an immediate reset. In this case a compute reset is not valid, so it will be replaced with a hard reset together with a debug message about it. This message might be confusing, as it implies that a compute reset was requested when it shouldn't. To prevent this confusion, set the hard reset flag in hl_device_cond_reset() if going to an immediate reset. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# d1958dce	30-Oct-2023	Farah Kassabri <fkassabri@habana.ai>	accel/habanalabs: fix EQ heartbeat mechanism Stop rescheduling another heartbeat check when EQ heartbeat check fails as it generates confusing logs in dmesg that the heartbeat fails. Signed-off-by: Farah Kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 42422993	29-Oct-2023	Oded Gabbay <ogabbay@kernel.org>	accel/habanalabs: add support for Gaudi2C device Gaudi2 with PCI revision ID with the value of '3' represents Gaudi2C device and should be detected and initialized as Gaudi2. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# e8bc0c1b	29-Oct-2023	Farah Kassabri <fkassabri@habana.ai>	accel/habanalabs: add log when eq event is not received Add error log when no eq event is received from FW, to cover a scenario when FW is stuck for some reason. In such case driver will not receive neither the eq error interrupt or the eq heartbeat event, and will just initiate a reset without indication in the dmesg about the reason. Signed-off-by: Farah Kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 3652117f	22-Nov-2023	Christian Brauner <brauner@kernel.org>	eventfd: simplify eventfd_signal() Ever since the eventfd type was introduced back in 2007 in commit e1ad7468c77d ("signal/timer/event: eventfd core") the eventfd_signal() function only ever passed 1 as a value for @n. There's no point in keeping that additional argument. Link: https://lore.kernel.org/r/20231122-vfs-eventfd-signal-v2-2-bd549b14ce0c@kernel.org Acked-by: Xu Yilun <yilun.xu@intel.com> Acked-by: Andrew Donnellan <ajd@linux.ibm.com> # ocxl Acked-by: Eric Farman <farman@linux.ibm.com> # s390 Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
# 6fc69ca8	21-Sep-2023	Oded Gabbay <ogabbay@kernel.org>	accel/habanalabs: print device name when it is removed Notifies the user which device was removed. It is important in a server with multiple devices. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>
# ff92d010	27-Aug-2023	Ohad Sharabi <osharabi@habana.ai>	accel/habanalabs: trace dma map sgtable Traces the DMA [un]map_sgtable using the new traces we added. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 1157b5d6	02-Mar-2023	farah kassabri <fkassabri@habana.ai>	accel/habanalabs: optimize timestamp registration handler Currently we use dynamic allocation inside the irq handler in order to allocate free node to be used for the free jobs. This operation is expensive, especially when we deal with large burst of events records that get released at the same time. The alternative is to have pre allocated pool of free nodes and just fetch nodes from this pool at irq handling time instead of allocating them. In case the pool becomes full, then the driver will fallback to dynamic allocations. As part of the optimization also update the unregister flow upon re-using a timestamp record, by making the operation much simpler and quicker. We already have the record in the registration flow and now we just seek to re-use with different interrupt. Therefore, no need to look for buffer according to the user handle. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Tomer Tayar <ttayar@habana.ai> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 051868d9	27-Aug-2023	farah kassabri <fkassabri@habana.ai>	accel/habanalabs: prevent sending heartbeat before events are enabled After the heartbeat mechanism is now expanded to be used also for EQ health check, we shouldn't send heartbeat messages to FW before driver allow events to be received from FW. Because if the driver will send two heartbeats before it enables events to be received from FW, then the EQ health check will fail and reset the device. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 7c4130e6	07-Aug-2023	farah kassabri <fkassabri@habana.ai>	accel/habanalabs/gaudi2: handle eq health heartbeat check Add mechanism for fw eq health check. this will be done using two flows: using the heartbeat mechanism and raising a dedicated interrupt to indicate an eq failure like EQ full. This patch will add implementation for the eq heartbeat for gaudi2 asic. More info about the heartbeat mechanism: Expand the heartbeat mechanism to monitor a new event that will be sent from FW upon receiving heartbeat message. that way driver can know that the eq is working or not. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# e0f45280	25-Jun-2023	Dafna Hirschfeld <dhirschfeld@habana.ai>	accel/habanalabs: fix inline doc typos Fix two typos Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# ab574f6a	25-Jun-2023	Dafna Hirschfeld <dhirschfeld@habana.ai>	accel/habanalabs: disable events ioctls on control device Because it is not used and also, for graceful reset to work those ioctls should run on the compute device. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# fe77368c	19-Feb-2023	Tomer Tayar <ttayar@habana.ai>	accel/habanalabs: register compute device as an accel device Register the compute device as an accel device, and remove the creation of the habanalabs compute char device. The IOCTLs in this patch are still handled by the current driver handler. Moving to DRM IOCTL handling requires moving the IOCTLs numbers to a specific range, so it will be handled in subsequent patches. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# a8ab1a81	23-May-2023	Ofir Bitton <obitton@habana.ai>	accel/habanalabs: add info ioctl for engine error reports User gets notification for every engine error report, but he still lacks the exact engine information. Hence, we allow user to query for the exact engine reported an error. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 10926f60	13-Jun-2023	Tomer Tayar <ttayar@habana.ai>	accel/habanalabs: set default device release watchdog T/O as 30 sec After being notified about certain errors, user is expected to finish his post-errors actions and to release the device within some timeout, after which is deice is being reset. The default timeout value is 5 sec, which in some case is not enough for a user application to collect debug data. Increase the default value to 30 sec. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 37d72439	12-Jun-2023	Oded Gabbay <ogabbay@kernel.org>	accel/habanalabs: reset device if scrubbing failed If scrubbing memory after user released device has failed it means the device is in a bad state and should be reset. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>
# 89803af5	12-Jun-2023	Oded Gabbay <ogabbay@kernel.org>	accel/habanalabs: remove pdev check on idle check Our simulator supports idle check so no need anymore to check if pdev exists. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>
# 964b1f67	06-Jun-2023	Koby Elbaz <kelbaz@habana.ai>	accel/habanalabs: rename fd_list to hpriv_list Every time an FD is returned to the user, the driver adds a corresponding private structure to the list. Yet, it's still a list of private structures rather than of FDs. Remove, as well, an unnecessary comment. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 942f18c5	06-Jun-2023	Koby Elbaz <kelbaz@habana.ai>	accel/habanalabs: call put_pid after hpriv list is updated Because we might still be using related resources, decrementing PID's reference count should be done at later stages of the device release. A good place is right after the representing private structure is removed from LKD's list. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 2b541cf9	05-Jun-2023	Koby Elbaz <kelbaz@habana.ai>	accel/habanalabs: print return code when process termination fails As part of driver teardown, we attempt to kill all user processes. It shouldn't fail, but if it does we want to print the error code that the kapi returned to us. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# e4a97d6b	29-May-2023	Koby Elbaz <kelbaz@habana.ai>	accel/habanalabs: set device status 'malfunction' while in rmmod hl_device_status() returns the status of an acquired device. If a device is going down (following an rmmod cmd), it should be marked as an unusable/malfunctioning device, and hence should not be acquired. However, since this was not the case so far (i.e., a device going down would inaccurately return 'in reset' status allowing the user to acquire the device) it introduced a bug where as part of a reset flow, the driver could not kill processes that have not run yet, and since those processes aren't blocked from reacquiring a device, we get eventually a new flow of a driver attempting to kill all processes in a list that can't be ever really empty. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# a35c9976	10-May-2023	Tomer Tayar <ttayar@habana.ai>	accel/habanalabs: update pending reset flags with new reset requests If hl_device_cond_reset() is called while a reset is already pending but hasn't started, the reset request will be dropped. If the flags of the new request are more severe, e.g. a hard reset while the pending reset is a compute reset, the eventual reset won't be suitable for the device status. To prevent such cases, update the pending reset flags with the new requests flags before the requests are dropped. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 5d89ce6f	09-May-2023	Tomer Tayar <ttayar@habana.ai>	accel/habanalabs: prevent immediate hard reset due to 2 adjacent H/W events When a H/W event is received while a user is registered to events, no immediate hard reset will happen, and instead the user will be notified and will have some time to handle it and eventually release the device, after which the reset will be done. If a user, as part of the handling and as part of the cleanup steps towards releasing the device, unregisters from receiving those events, and at that time an adjacent H/W event is received, it will be assumed that the user is not registered to events and thus an immediate hard reset is required. To prevent such an unwanted immediate reset, modify the driver to perform it if the user is not registered to events AND we don't already have a pending reset for a previous H/W event. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# e6f49e96	22-May-2023	Dani Liberman <dliberman@habana.ai>	accel/habanalabs: refactor error info reset Moved error info reset code to single function for future use from other places in the driver. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 56921023	22-May-2023	Oded Gabbay <ogabbay@kernel.org>	accel/habanalabs: remove sim code There were a few places where simulator only code got into the upstream. Remove those places that can confuse other developers. Fixes: 2a0a839b6a28 ("habanalabs: extend fatal messages to contain PCI info") Cc: Moti Haimovski <mhaimovski@habana.ai> Cc: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 3b9abb4f	19-Mar-2023	Tomer Tayar <ttayar@habana.ai>	accel/habanalabs: expose debugfs files later Currently the debugfs root folder and files for a device are created at an early step, before the device initialization and before the char device and sysfs files are exposed to user. As there is no real reason not to do it together with the device creation, postpone it to be done right afterwards. The initialization of the debugfs entry structure is left in its current position because it is used before creating the files. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# d8b9cea5	18-Apr-2023	Ofir Bitton <obitton@habana.ai>	accel/habanalabs: add pci health check during heartbeat Currently upon a heartbeat failure, we don't know if the failure is due to firmware hang or due to a bad PCI link. Hence, we are reading a PCI config space register with a known value (vendor ID) so we will know which of the two possibilities caused the heartbeat failure. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 9a4e44a4	04-Dec-2022	Koby Elbaz <kelbaz@habana.ai>	accel/habanalabs: refactor abort of completions and waits Aborting CS completions should be in command_submission.c but aborting waiting for user interrupts should be in device.c. This separation is also for adding more abort operations in the future. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 802f25b6	21-Mar-2023	Tal Cohen <talcohen@habana.ai>	accel/habanalabs: sync f/w events interrupt in hard reset Receiving events from FW, while the device is in hard reset, causes a warning message in Driver log. The message may point to a problem in the Driver or FW. But It also can appear as a result of events that have been sent from FW just before the hard reset. In order to avoid receiving events from FW while the device is in reset and is already in 'disabled' mode, sync the f/w events interrupt right before setting the device to 'disabled'. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 3a8d7c3a	22-Mar-2023	Tal Cohen <talcohen@habana.ai>	accel/habanalabs: send disable pci when compute ctx is active Fix an issue in hard reset flow in which the driver didn't send a disable pci message if there was an active compute context. In hard reset, disable pci message should be sent no matter if a compute context exists or not. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# a855f710	21-Mar-2023	Tal Cohen <talcohen@habana.ai>	accel/habanalabs: remove duplicated disable pci msg The disable pci message is sent in reset device. It informs the FW not to raise more EQs. The Driver may ignore received EQs, when the device is in disabled mode. The duplication happens when hard reset is scheduled during compute reset and also performs 'escalate_reset_flow'. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# 248ed9e2	23-Mar-2023	Cai Huoqing <cai.huoqing@linux.dev>	accel/habanalabs: Remove redundant pci_clear_master Remove pci_clear_master to simplify the code, the bus-mastering is also cleared in do_pci_disable_device, like this: ./drivers/pci/pci.c:2197 static void do_pci_disable_device(struct pci_dev *dev) { u16 pci_command; pci_read_config_word(dev, PCI_COMMAND, &pci_command); if (pci_command & PCI_COMMAND_MASTER) { pci_command &= ~PCI_COMMAND_MASTER; pci_write_config_word(dev, PCI_COMMAND, pci_command); } pcibios_disable_device(dev); }. And dev->is_busmaster is set to 0 in pci_disable_device. Signed-off-by: Cai Huoqing <cai.huoqing@linux.dev> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 2e8e9a89	01-Mar-2023	Tomer Tayar <ttayar@habana.ai>	accel/habanalabs: postpone mem_mgr IDR destruction to hpriv_release() The memory manager IDR is currently destroyed when user releases the file descriptor. However, at this point the user context might be still held, and memory buffers might be still in use. Later on, calls to release those buffers will fail due to not finding their handles in the IDR, leading to a memory leak. To avoid this leak, split the IDR destruction from the memory manager fini, and postpone it to hpriv_release() when there is no user context and no buffers are used. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 28fbc058	16-Feb-2023	Tomer Tayar <ttayar@habana.ai>	accel/habanalabs: use scnprintf() in print_device_in_use_info() compose_device_in_use_info() was added to handle the snprintf() return value in a single place. However, the buffer size in print_device_in_use_info() is set such that it would be enough for the max possible print, so compose_device_in_use_info() is not really needed. Moreover, scnprintf() can be used instead of snprintf(), to save the check if the return value larger than the given size. Cc: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# 86b74d84	14-Feb-2023	Dafna Hirschfeld <dhirschfeld@habana.ai>	accel/habanalabs: assert return value of hw_fini Since hw_fini return error code for failure indication, we should check its return value. Currently it might only fail upon soft-reset from hl_device_reset. Later patch will add hw_fini failure in case of polling timeout in hard-reset. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# efbd36b2	19-Feb-2023	Sagiv Ozeri <sozeri@habana.ai>	accel/habanalabs: add device id to all threads names Compute driver threads names will start with hlX-*, when X is the device id. This will help distinguish them from the NIC thread names. Signed-off-by: Sagiv Ozeri <sozeri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# a8c14f53	12-Feb-2023	Tomer Tayar <ttayar@habana.ai>	accel/habanalabs: improve readability of engines idle mask print Remove leading zeroes when printing the idle mask to make it clearer. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# 3a621af6	13-Feb-2023	Tom Rix <trix@redhat.com>	accel/habanalabs: set hl_capture_*_err storage-class-specifier to static smatch reports drivers/accel/habanalabs/common/device.c:2619:6: warning: symbol 'hl_capture_hw_err' was not declared. Should it be static? drivers/accel/habanalabs/common/device.c:2641:6: warning: symbol 'hl_capture_fw_err' was not declared. Should it be static? both are only used in device.c, so they should be static Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 4a2e9d11	01-Feb-2023	Dafna Hirschfeld <dhirschfeld@habana.ai>	accel/habanalabs: don't trace cpu accessible dma alloc/free The cpu accessible dma allocations use the gen_pool api which actually does not allocate new memory from the system but manages memory already allocated before. When tracing this together with real dma allocation/free it cause confusing logs like a '0' dma address and a cpu address appearing twice etc. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# 1d0f9ad7	08-Feb-2023	Dafna Hirschfeld <dhirschfeld@habana.ai>	accel/habanalabs: in hl_device_reset small refactor for readabilty in the out_err flow, combine the two cases of soft-reset since they have mostly common code. In addition unlock reset_info.lock after touching reset count. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# 39ab4da9	08-Feb-2023	Dafna Hirschfeld <dhirschfeld@habana.ai>	accel/habanalabs: in hl_device_reset remove 'hard_instead_of_soft' Because this field is only used for debug print, we can do more precise debug directly instead. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# 7810c524	08-Feb-2023	Dafna Hirschfeld <dhirschfeld@habana.ai>	accel/habanalabs: tiny refactor of hl_device_reset for readability Align assignment of reset_upon_device_release to the convention used in this function. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# 18d13584	25-Jan-2023	Tomer Tayar <ttayar@habana.ai>	accel/habanalabs: enable graceful reset mechanism for compute-reset The graceful reset mechanism is currently enabled only for reset requests that will end up with hard-reset. In future, reset requests due to errors in some device engines, are going to be modified to request compute-reset, as the much longer hard-reset is not really needed there. To allow it, enable graceful reset also for compute-reset, and reset after user releases the device won't be escalated to hard-reset in those cases. If watchdog expires and user didn't release the device, hard-reset will be initiated in any case. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# 57479adb	29-Jan-2023	Koby Elbaz <kelbaz@habana.ai>	accel/habanalabs: disable PCI when escalating compute to hard-reset In case a compute reset has failed or a request for a hard reset has just arrived, then we escalate current reset procedure from compute to hard-reset. In such a case, the FW should be aware of the updated error cause, and if LKD is the one who performs the reset (rather than the FW), then we ask the FW to disable PCI access. We would also like to have relevant debug info and therefore we print the currently escalating reset type. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# 313e9f63	10-Jan-2023	Moti Haimovski <mhaimovski@habana.ai>	accel/habanalabs: add critical-event bit in notifier Enhance the existing user notifications by adding a HW and FW critical event bits to be used when a HW or FW event occur that requires both SW abort and hard-resetting the chip. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# d43bce6e	18-Jan-2023	Tomer Tayar <ttayar@habana.ai>	accel/habanalabs: add info when FD released while device still in use When user closes the device file descriptor, it is checked whether the device is still in use, and a message is printed if it is. To make this message more informative, add to this print also the reason due to which the device is considered as in use. The possible reasons which are checked for now are active CS and exported dma-buf. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# 323adae9	22-Jan-2023	Oded Gabbay <ogabbay@kernel.org>	accel/habanalabs: save class in hdev It is more concise than to pass it to device init. Once we will add the accel class, then we won't need to change the function signatures. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# 89859a89	22-Jan-2023	Oded Gabbay <ogabbay@kernel.org>	accel/habanalabs: split cdev creation to separate function Move the cdev creation code from the main hdev init function to a separate function. This will make the code more readable once we add the accel registration code (instead/in addition to legacy cdev). Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# 44155bb6	17-Jan-2023	Tomer Tayar <ttayar@habana.ai>	habanalabs: clear in_compute_reset when escalating to hard reset If resetting device upon release while the release watchdog work is scheduled, the compute reset is replaced with hard reset. In this case, need to clear the in_compute_reset indication in the device reset information structure. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 0c93eb09	17-Jan-2023	Tomer Tayar <ttayar@habana.ai>	habanalabs: run error handling if scrub_device_mem fails after reset If device memory scrubbing from hl_device_reset() fails, we return with an error code but not perform error handling code. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# a6685b57	11-Jan-2023	Koby Elbaz <kelbaz@habana.ai>	habanalabs: block soft-reset on an unusable device A device with status malfunction indicates that it can't be used. In such a case we do not support certain reset types, e.g., all kinds of soft-resets (compute reset, inference soft-reset), and reset upon device release. A hard-reset is the only way that an unusable device can change its status. All other reset procedures can't put the device in a reset procedure, which might ultimately cause the device to change its status, unintentionally, to become operational again. Such a scenario has recently occurred, when a user requested a hard-reset while another heavy user workload was ongoing (reset request is queued). Since the workload couldn't finish within reset's timeout limits, the reset has failed and set a device status malfunction. Eventually, when the user released the FD, an unsuccessful soft-reset occurred, hence followed by an additional hard-reset that changed the ASICs status back to be operational. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 2a0a839b	28-Dec-2022	Moti Haimovski <mhaimovski@habana.ai>	habanalabs: extend fatal messages to contain PCI info This commit attaches the PCI device address to driver fatal messages in order to ease debugging in multi-device setups. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 54fcb384	30-Nov-2022	Ohad Sharabi <osharabi@habana.ai>	habanalabs: trace LBW reads/writes Add traces to LBW reads/writes. This may be handy when debugging configuration failure or events when tracking configuration flow. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 571d1a72	23-Dec-2022	Koby Elbaz <kelbaz@habana.ai>	habanalabs: protect access to dynamic mem 'user_mappings' When HL_INFO_USER_MAPPINGS IOCTL is called, we copy_to_user from a dynamically allocated memory - 'user_mappings'. Since freeing/allocating it happens in runtime (upon a page fault), it not unlikely to access it even before being initially allocated (i.e., accessing a NULL pointer). The solution is to simply mark the spot when the err info has been collected, and that way to know whether err info (either page fault or RAZWI) is available to be read. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 78baccbd	24-Dec-2022	Koby Elbaz <kelbaz@habana.ai>	habanalabs: refactor razwi/page-fault information structures This refactor makes the code clearer and the new variables' names better describe their roles. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# e2a079a2	06-Dec-2022	Tomer Tayar <ttayar@habana.ai>	habanalabs: verify that kernel CB is destroyed only once Remove the distinction between user CB and kernel CB, and verify for both that they are not destroyed more than once. As kernel CB might be taken from the pre-allocated CB pool, so we need to clear the handle destroyed indication when returning a CB to the pool. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# e65e175b	26-Dec-2022	Oded Gabbay <ogabbay@kernel.org>	habanalabs: move driver to accel subsystem Now that we have a subsystem for compute accelerators, move the habanalabs driver to it. This patch only moves the files and fixes the Makefiles. Future patches will change the existing code to register to the accel subsystem and expose the accel device char files instead of the habanalabs device char files. Update the MAINTAINERS file to reflect this change. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>