#
3bf6ef98 |
|
05-Feb-2024 |
Ofir Bitton <obitton@habana.ai> |
accel/habanalabs/gaudi2: drain event lacks rd/wr indication Due to a H/W issue, AXI drain event does not include a read/write indication, hence we remove this print. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
e855869b |
|
25-Jan-2024 |
Tomer Tayar <ttayar@habana.ai> |
accel/habanalabs: fix glbl error cause handling The glbl error cause handling has a wrong assumption that all error bits are consecutive. Fix the handling to check all relevant error bits per ASIC. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
c1e89ae4 |
|
18-Jan-2024 |
Tomer Tayar <ttayar@habana.ai> |
accel/habanalabs/gaudi2: check extended errors according to PCIe addr_dec interrupt info The FW interrupt info for a PCIe addr_dec event is set correctly, so check for either global errors or razwi according to the indications there. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
c14e5cd3 |
|
14-Jan-2024 |
Farah Kassabri <fkassabri@habana.ai> |
accel/habanalabs: remove hop size from asic properties The hop size related properties is a MMU properties and not asic properties. As for PMMU and HMMU we could have different sizes. Signed-off-by: Farah Kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
01f8cd0f |
|
02-Jan-2024 |
Tomer Tayar <ttayar@habana.ai> |
accel/habanalabs/gaudi2: fail memory memset when failing to copy QM packet to device gaudi2_memset_memory_chunk_using_edma_qm() calls the access_dev_mem() ASIC function, but ignores its return value. Add this missing check. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
731d320e |
|
01-Jan-2024 |
Dani Liberman <dliberman@habana.ai> |
accel/habanalabs: remove call to deprecated function In newer kernel versions, irq_set_affinity_hint() is deprecated. Instead, use the newer version which is irq_set_affinity_and_hint(). Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
f728c17f |
|
02-Nov-2023 |
Farah Kassabri <fkassabri@habana.ai> |
accel/habanalabs/gaudi2: move HMMU page tables to device memory Currently the HMMU page tables reside in the host memory, which will cause host access from the device for every page walk. This can affect PCIe bandwidth in certain scenarios. To prevent that problem, HMMU page tables will be moved to the device memory so the miss transaction will read the hops from there instead of going to the host. Signed-off-by: Farah Kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
e91c37f1 |
|
21-Sep-2023 |
Dani Liberman <dliberman@habana.ai> |
accel/habanalabs/gaudi2: add interrupt affinity for user interrupts User interrupts are MSIx interrupts coming from Gaudi2, that have specific range of IDs and are assigned to the sole use of the user process that opened the Gaudi2 device (reminder: there can be only a single user process running on Gaudi2 at any given time). The interrupts are allocated and managed by the driver and therefore, the user expects the driver to initialize them properly, which also includes setting the affinity to the related CPU cores of the device's NUMA node to get maximum performance. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
bc5f15ab |
|
29-Nov-2023 |
Tomer Tayar <ttayar@habana.ai> |
accel/habanalabs/gaudi2: avoid overriding existing undefined opcode data Part of the undefined opcode data is updated in gaudi2_handle_qman_err_generic() and some in handle_lower_qman_data_on_err(). However, the 'write_enable' flag is checked only in gaudi2_handle_qman_err_generic(), and information of more than a single error can be mixed there. Moreover, handle_lower_qman_data_on_err() is called only for the lower QMAN, so for an error in the upper QMAN there is only a partial info. Move all the data update to be done in a single place, protected by the 'write_enable' flag. As mainly the lower QMAN's info is interesting, avoid saving the partial info for the upper QMAN. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
565ee788 |
|
23-Nov-2023 |
Tomer Tayar <ttayar@habana.ai> |
accel/habanalabs/gaudi2: add zero padding when printing QM CP instruction QM instructions are in multiples of 64 bits and the command type is in the upper bits of first QWORD. To make it clearer that an undefined command is due to a type of 0x0, always print all 64 bits and add a zero padding if needed. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
5bc155cf |
|
16-Nov-2023 |
Tomer Tayar <ttayar@habana.ai> |
accel/habanalabs/gaudi2: use correct registers to dump QM CQ info The QM CQ PTR_LO/PTR_HI/TSIZE registers are for pushing a CQ entry, and although they are updated by HW even when descriptors are fetched by PQ and CB addresses are fed into CQ, the correct registers to use when dumping the CQ info are the ones with the _STS suffix. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
ae303d88 |
|
06-Nov-2023 |
Tomer Tayar <ttayar@habana.ai> |
accel/habanalabs/gaudi2: get the correct QM CQ info upon an error Upon a QM error, the address/size from both the CQ and the ARC_CQ are printed, although the instruction that led to the error was received from only one of them. Moreover, in case of a QM undefined opcode, only one of these address/size sets will be captured based on the value of ARC_CQ_PTR. However, this value can be non-zero even if currently the CQ is used, in case the CQ/ARC_CQ are alternately used. Under the assumption of having a stop-on-error configuration, modify to use CP_STS.CUR_CQ field to get the relevant CQ for the QM error. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
0ec34677 |
|
31-Oct-2023 |
Dafna Hirschfeld <dhirschfeld@habana.ai> |
accel/habanalabs/gaudi2: fix undef opcode reporting currently the undefined opcode event bit in set only for lower cp and only if 'write_enable' is true. It should be set anyway and for all streams in order to report that event to userspace. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
c6485482 |
|
15-Oct-2023 |
Tomer Tayar <ttayar@habana.ai> |
accel/habanalabs/gaudi2: assume hard-reset by FW upon PCIe AXI drain When a PCIe AXI drain event happens, it is possible that the driver cannot access the device through PCIe, and therefore cannot send a hard-reset request to FW. Starting from FW version 1.13, FW will initiate a hard-reset in such a case without waiting for a reset request from the driver. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
ff92d010 |
|
27-Aug-2023 |
Ohad Sharabi <osharabi@habana.ai> |
accel/habanalabs: trace dma map sgtable Traces the DMA [un]map_sgtable using the new traces we added. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
d7aa2948 |
|
19-Sep-2023 |
Oded Gabbay <ogabbay@kernel.org> |
accel/habanalabs: remove unused asic functions asic_dma_{un}map_single() asic-specific functions are no longer called from the common code, so delete these functions. In addition, delete the gaudi2 implementation as they are also not called. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>
|
#
674f7779 |
|
06-Sep-2023 |
Dafna Hirschfeld <dhirschfeld@habana.ai> |
accel/habanalabs: extend preboot timeout when preboot might take longer There are cases such when FW runs MBIST, that preboot is expected to take longer than the usual. In such cases the firmware reports status SECURITY_READY/IN_PREBOOT and we extend the timeout waiting for it. This is currently implemented for Gaudi2 only. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
ba24b5ec |
|
19-Jul-2023 |
farah kassabri <fkassabri@habana.ai> |
accel/habanalabs: split user interrupts pending list Currently driver maintain one list for both pending user interrupts which seeks to wait till CQ reaches it's target value and also the ones that seeks to get timestamp records when the CQ reaches it's target value. This causes delay in handling the waiters which gets higher priority than the timestamp records. In order to solve this, let's split the list into two, one for each case and each one is protected by it's own spinlock. Waiters will be handled within the interrupt context first, then the timestamp records will be set. Freeing the timestamp related memory will be handled in a workqueue. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Tomer Tayar <ttayar@habana.ai> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
764bfd13 |
|
22-Aug-2023 |
farah kassabri <fkassabri@habana.ai> |
accel/habanalabs/gaudi2: add eq health check using irq This is the second patch for applying the eq health check mechanism which will add support for the interrupt flow for gaudi2 asic. More info about the interrupt mechanism: set a dedicated msix for the eq error interrupt, and add interrupt handler for it. when FW detects some issue with EQ like EQ_FULL, it'll raise that interrupt and driver should reset the device. Driver will inform the FW which msix index to use through the already existing handshake mechanism which will send msix info message to fw. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
7c4130e6 |
|
07-Aug-2023 |
farah kassabri <fkassabri@habana.ai> |
accel/habanalabs/gaudi2: handle eq health heartbeat check Add mechanism for fw eq health check. this will be done using two flows: using the heartbeat mechanism and raising a dedicated interrupt to indicate an eq failure like EQ full. This patch will add implementation for the eq heartbeat for gaudi2 asic. More info about the heartbeat mechanism: Expand the heartbeat mechanism to monitor a new event that will be sent from FW upon receiving heartbeat message. that way driver can know that the eq is working or not. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
72bff371 |
|
22-Aug-2023 |
Moti Haimovski <mhaimovski@habana.ai> |
accel/habanalabs/gaudi2: print power-mode changes Print to kernel log any device power mode changes events reported by the FW. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
d261b0ab |
|
25-Jul-2023 |
Ofir Bitton <obitton@habana.ai> |
accel/habanalabs/gaudi2: include block id in ECC error reporting During ECC event handling, Memory wrapper id was mistakenly printed as block id. Fix the print and in addition fetch the actual block-id from firmware. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
90f3de61 |
|
04-Sep-2023 |
Christophe JAILLET <christophe.jaillet@wanadoo.fr> |
accel/habanalabs/gaudi2: Fix incorrect string length computation in gaudi2_psoc_razwi_get_engines() snprintf() returns the "number of characters which *would* be generated for the given input", not the size *really* generated. In order to avoid too large values for 'str_size' (and potential negative values for "PSOC_RAZWI_ENG_STR_SIZE - str_size") use scnprintf() instead of snprintf(). Fixes: c0e6df916050 ("accel/habanalabs: fix address decode RAZWI handling") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
a45d5cf0 |
|
25-Aug-2023 |
Justin Stitt <justinstitt@google.com> |
accel/habanalabs: refactor deprecated strncpy to strscpy_pad `strncpy` is deprecated for use on NUL-terminated destination strings [1]. We see that `prop->cpucp_info.card_name` is supposed to be NUL-terminated based on its usage within `__hwmon_device_register()` (wherein it's called "name"): | if (name && (!strlen(name) || strpbrk(name, "-* \t\n"))) | dev_warn(dev, | "hwmon: '%s' is not a valid name attribute, please fix\n", | name); A suitable replacement is `strscpy_pad` [2] due to the fact that it guarantees both NUL-termination and NUL-padding on its destination buffer. NUL-padding on `prop->cpucp_info.card_name` is not strictly necessary as `hdev->prop` is explicitly zero-initialized but should be used regardless as it gets copied out to userspace directly -- as per Kees' suggestion. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Cc: linux-hardening@vger.kernel.org Signed-off-by: Justin Stitt <justinstitt@google.com> Suggested-by: Kees Cook <keescook@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
01ab1629 |
|
04-Jul-2023 |
Igor Grinberg <igrinberg@habana.ai> |
accel/habanalabs/gaudi2: prepare to remove cpu_rst_status The soft reset has transitioned to CPUCP packet instead of plain register write and is about to be removed from the struct cpu_dyn_regs. As a preparation for removing the cpu_rst_status field from struct cpu_dyn_regs, switch to use the plain macro - this keeps the backward compatibility. Signed-off-by: Igor Grinberg <igrinberg@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
a8ab1a81 |
|
23-May-2023 |
Ofir Bitton <obitton@habana.ai> |
accel/habanalabs: add info ioctl for engine error reports User gets notification for every engine error report, but he still lacks the exact engine information. Hence, we allow user to query for the exact engine reported an error. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
fa46c7bb |
|
11-Jul-2023 |
Oded Gabbay <ogabbay@kernel.org> |
accel/habanalabs/gaudi2: fix missing check of kernel ctx If we are initializing the kernel context when we have a Gaudi2 device, we don't need to do any late initializing of that context with specific Gaudi2 code. Reviewed-by: Ofir Bitton <obitton@habana.ai> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
15c0bb16 |
|
04-Jul-2023 |
Igor Grinberg <igrinberg@habana.ai> |
accel/habanalabs/gaudi2: prepare to remove soft_rst_irq The soft reset has transitioned to CPUCP packet instead of plain register write and is about to be removed from the struct cpu_dyn_regs. As a preparation for removing the gic_host_soft_rst_irq field from struct cpu_dyn_regs, switch to use the plain macro - this keeps the backward compatibility. Signed-off-by: Igor Grinberg <igrinberg@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
43d8acce |
|
29-Mar-2023 |
Dani Liberman <dliberman@habana.ai> |
accel/habanalabs: handle arc farm razwi Implement razwi handling for arc farm and add it to arc farm sei event handler. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
f17182d0 |
|
30-May-2023 |
Ofir Bitton <obitton@habana.ai> |
accel/habanalabs: stop fetching MME SBTE error cause Because in this case we have only a single possible cause, we can safely stop fetching the cause from firmware. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
c6a4f256 |
|
15-May-2023 |
Ofir Bitton <obitton@habana.ai> |
accel/habanalabs: notify user about undefined opcode event In order for user to be aware of undefined opcode events, we must store all relevant information and notify user about the failure. The user will fetch the stored info via info ioctl. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
fac91dd5 |
|
21-May-2023 |
Ofir Bitton <obitton@habana.ai> |
accel/habanalabs: add event queue extra validation In order to increase reliability of the event queue interface, we apply to Gaudi2 the same mechanism we have in Gaudi1. The extra validation is basically checking that the received event index matches the expected index. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
8a20b381 |
|
17-May-2023 |
Ofir Bitton <obitton@habana.ai> |
accel/habanalabs: fix bug of not fetching addr_dec info addr_dec info should always be fetched, regardless of cause value. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
5d658d0c |
|
08-May-2023 |
Dani Liberman <dliberman@habana.ai> |
accel/habanalabs: mask part of hmmu page fault captured address When receiving page fault from hmmu, the captured address is scrambled both by HW and by driver. The driver part is unscrambled but the HW part isn't getting unscrambled. To avoid declaring wrong address, the HW scrambled part will be masked. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
6092cedf |
|
10-May-2023 |
Tomer Tayar <ttayar@habana.ai> |
accel/habanalabs: print qman data on error only for lower qman By default, the upper QMANs are not used, and instead engines ARCs access the lower QMANs directly. Errors for upper QMANs are therefore not expected, and the debug print of the PQ entries is not needed. Modify the QMAN debug data print on errors to include only information for the lower QMAN. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
54381ee8 |
|
10-May-2023 |
Tomer Tayar <ttayar@habana.ai> |
accel/habanalabs: use lower QM in QM errors handling The QMAN GLBL_ERR_STS_4 register has indications for errors also in the lower CQ and the ARC CQ, and not just for errors in the lower CP. Modify the relevant define/struct and the related print to use "lower QM" instead of "lower CP". Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
dcc8fa88 |
|
08-May-2023 |
Dani Liberman <dliberman@habana.ai> |
accel/habanalabs: use binning info when handling razwi When receiving sei interrupt from tpc or decoder, we need to check the binning mask because if the engine is binned, the razwi info won't be in the router of the binned engine, instead will be in the router of the substitute engine. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
b2d61fec |
|
02-May-2023 |
Koby Elbaz <kelbaz@habana.ai> |
accel/habanalabs: upon DMA errors, use FW-extracted error cause Initially, the driver used to read the error cause data directly from the ASIC. However, the FW now clears it before the driver could read it. Therefore we should use the error cause data that is extracted by the FW. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
9ec7639b |
|
15-May-2023 |
Dan Carpenter <dan.carpenter@linaro.org> |
accel/habanalabs: fix gaudi2_get_tpc_idle_status() return The gaudi2_get_tpc_idle_status() function returned the incorrect variable so it always returned true. Fixes: d85f0531b928 ("accel/habanalabs: break is_idle function into per-engine sub-routines") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
d0dcd4bb |
|
24-Apr-2023 |
Ofir Bitton <obitton@habana.ai> |
accel/habanalabs: always fetch pci addr_dec error info Due to missing indication of address decode source (LBW/HBW bus), we should always try and fetch extended information. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
7d212963 |
|
20-Apr-2023 |
Koby Elbaz <kelbaz@habana.ai> |
accel/habanalabs: fix a static warning - 'dubious: x & !y' Use a straight forward approach to get a conditional result. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
3d21ec64 |
|
17-Apr-2023 |
Dafna Hirschfeld <dhirschfeld@habana.ai> |
accel/habanalabs: add missing tpc interrupt info For some reason the last possible tpc interrupt cause in gaudi2_tpc_interrupts_cause is missing from the code. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
cc7b790d |
|
08-Feb-2023 |
Dafna Hirschfeld <dhirschfeld@habana.ai> |
accel/habanalabs: do soft-reset using cpucp packet This is done depending on the FW version. The cpucp method is preferable and saves scratchpads resource. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
a12428ac |
|
18-Apr-2023 |
Dafna Hirschfeld <dhirschfeld@habana.ai> |
accel/habanalabs: check fw version using sw version The fw inner version is less trustable, instead use the fw general sw release version. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
f9b60242 |
|
10-Apr-2023 |
Moti Haimovski <mhaimovski@habana.ai> |
accel/habanalabs: fix bug in free scratchpad memory This commit fixes a bug in Gaudi2 when freeing the scratchpad memory in case software init fails. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
1464fbd8 |
|
30-Mar-2023 |
Tal Cohen <talcohen@habana.ai> |
accel/habanalabs: ignore false positive razwi In Gaudi2 asic, PSOC RAZWI may cause in HBW or LBW. The address that caused the error is read from HW register and printed by the Driver. There are cases where the Driver receives an indication on PSOC RAZWI error but the address value is zero. In that case, the indication is a false positive. The Driver should not "count" a PSOC RAZWI event error when the caused the address is zeroed. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
31420f93 |
|
20-Mar-2023 |
Moti Haimovski <mhaimovski@habana.ai> |
accel/habanalabs: speedup h/w queues test in Gaudi2 HW queues testing at driver load and after reset takes a substantial amount of time. This commit reduces the queues test time in Gaudi2 devices by running all the tests in parallel instead of one after the other. Time measurements on tests duration shows that the new method is almost x100 faster than the serial approach. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
91204e47 |
|
28-Mar-2023 |
Dani Liberman <dliberman@habana.ai> |
accel/habanalabs: fix handling of arc farm sei event There is only single eq entry for arc farm sei event which aggregates events from the four arc farms. Fix the code to handle this event according to this behavior. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
38f3c732 |
|
28-Mar-2023 |
Ofir Bitton <obitton@habana.ai> |
accel/habanalabs: fixes for unexpected error interrupt Removing redundant asic prop variable as we don't need to expose this to common code. In addition, fix some typos. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
802f25b6 |
|
21-Mar-2023 |
Tal Cohen <talcohen@habana.ai> |
accel/habanalabs: sync f/w events interrupt in hard reset Receiving events from FW, while the device is in hard reset, causes a warning message in Driver log. The message may point to a problem in the Driver or FW. But It also can appear as a result of events that have been sent from FW just before the hard reset. In order to avoid receiving events from FW while the device is in reset and is already in 'disabled' mode, sync the f/w events interrupt right before setting the device to 'disabled'. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
82a1b48a |
|
26-Mar-2023 |
Ofir Bitton <obitton@habana.ai> |
accel/habanalabs: fix wrong reset and event flags During event handling, driver sets relevant reset and user event notifier flags. Fix few wrong flags settings. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
49fd071d |
|
26-Mar-2023 |
Ofir Bitton <obitton@habana.ai> |
accel/habanalabs: print raw binning masks in debug level There are rare cases of failures when cards are initialized due to wrong values in efuse mappings that are parsed by firmware. To help debug those cases, print (in debug level) the raw binning masks as fetched from the firmware during device initialization. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
d1943f1b |
|
15-Mar-2023 |
Ofir Bitton <obitton@habana.ai> |
accel/habanalabs: fix HBM MMU interrupt handling Current mapping between HMMU event and HMMU block is wrong. In addition the captured address in case of a page fault or an access error is scrambled, Hence we must call the descramble function. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
6306e815 |
|
23-Mar-2023 |
Dani Liberman <dliberman@habana.ai> |
accel/habanalabs: fix access error clear event The register which needs to be cleared is the valid register instead of the address. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
75b44575 |
|
16-Mar-2023 |
Ofir Bitton <obitton@habana.ai> |
accel/habanalabs: remove redundant TODOs As mmu refactor and nic resume are not relevant anymore, remove their TODO comments. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
ec484931 |
|
16-Mar-2023 |
Dani Liberman <dliberman@habana.ai> |
accel/habanalabs: change razwi handle after fw fix FW had one data route for tpc0 and tpc1 when running in secured mode and a different one when running without secured mode. After fw fixed this issue, both mode have the same data path. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
e1ef053e |
|
08-Mar-2023 |
Ofir Bitton <obitton@habana.ai> |
accel/habanalabs: add handling for unexpected user event In order for the user to be aware of unexpected events in Gaudi2 that aren't assigned to a specific engine, we are adding the handling of this dedicated interrupt. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
dc934c18 |
|
15-Mar-2023 |
Tomer Tayar <ttayar@habana.ai> |
accel/habanalabs: fix a maybe-uninitialized compilation warnings Initialize 'index' in gaudi2_handle_qman_err() and 'offset' in gaudi2_get_nic_idle_status() to avoid "maybe-uninitialized" compilation warnings. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
9669b96f |
|
15-Mar-2023 |
Dani Liberman <dliberman@habana.ai> |
accel/habanalabs: fix page fault event clear After getting page fault in gaudi2, we need to clear the valid bit instead of the address. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
958e4797 |
|
13-Mar-2023 |
Ofir Bitton <obitton@habana.ai> |
accel/habanalabs: expose rotator mask to userspace All engine masks are exposed to user, make sure user gets the correct rotator enabled mask in gaudi2. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
0e418ab7 |
|
12-Mar-2023 |
Tomer Tayar <ttayar@habana.ai> |
accel/habanalabs: remove '\n' when passing strings to gaudi2_print_event() Remove all '\n' from strings which are passed as arguments to gaudi2_print_event(), because the newline character is added internally in this function. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
af5e675f |
|
07-Mar-2023 |
Koby Elbaz <kelbaz@habana.ai> |
accel/habanalabs: return tlb inv error code upon failure Now that CQ-completion based jobs do not trigger a reset upon failure, failure of such jobs (e.g., MMU cache invalidation) should be handled by the caller itself depending on the error code returned to it. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
5d8a5f29 |
|
09-Mar-2023 |
Dafna Hirschfeld <dhirschfeld@habana.ai> |
accel/habanalabs: in {e/p}dma_core events read the err cause reg Since the err_cause register is unprivileged, we should read it from the driver instead of using the param that came from the FW. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
f8d139a7 |
|
08-Mar-2023 |
Dafna Hirschfeld <dhirschfeld@habana.ai> |
accel/habanalabs: fix use of var reset_sleep_ms - remove reset_sleep_ms arg from functions that don't use it. - move the call msleep(reset_sleep_ms) from btm poll to gaudi2_hw_fini as it is called from there already for other flow. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
077a39fa |
|
20-Feb-2023 |
Dafna Hirschfeld <dhirschfeld@habana.ai> |
accel/habanalabs: in hw_fini return error code if polling timed-out In hw_fini callback, we use either the cpucp packet method or polling a register. Currently we return error only in the case of cpucp packet failure. In this patch we also return error if polling timed out. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
8c695455 |
|
08-Mar-2023 |
Ofir Bitton <obitton@habana.ai> |
accel/habanalabs: increase reset poll timeout Due to a firmware bug we need to increase reset poll timeout or else we will timeout in secured environments. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
7c766e58 |
|
06-Mar-2023 |
Koby Elbaz <kelbaz@habana.ai> |
accel/habanalabs: do not verify engine modes after being changed Engines idle state can't always be verified between changes of engine modes (e.g., stall/halt). For example, if a CS is inflight when altering engine's mode, idle state will return NOT idle, always. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
336b78c6 |
|
05-Mar-2023 |
Oded Gabbay <ogabbay@kernel.org> |
accel/habanalabs: align to latest firmware specs Copy the most up-to-date interface files to the firmware. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>
|
#
79c16437 |
|
08-Mar-2023 |
Oded Gabbay <ogabbay@kernel.org> |
accel/habanalabs: make gaudi2_is_device_idle() static This function is only called inside gaudi2.c file. Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202303071320.X5ouBlNY-lkp@intel.com/ Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>
|
#
801507d3 |
|
19-Feb-2023 |
Dafna Hirschfeld <dhirschfeld@habana.ai> |
accel/habanalabs: move soft-reset wait to soft-reset execute We plan to do soft-reset either by mmio or by using cpucp packet depending on the FW version. We don't want to check FW version in two different places for that (execute soft-reset and wait to soft-reset) so move the waiting to gaudi2_execute_soft_reset. This also makes sense because the cpucp also does the waiting. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
f7f0085e |
|
15-Feb-2023 |
Koby Elbaz <kelbaz@habana.ai> |
accel/habanalabs: add uapi to stall/resume engine The user might want to stall/resume engines to perform power testing for various scenarios. Because our current HL_CS_FLAGS_ENGINE_CORE_COMMAND command only handles the engines' cores, we need to add another opcode for handling entire engine and not just its core. The user supplies an array, where each entry holds the engine's ID and the command to send to the engine. The size of the array is limited by the number of engines in the ASIC (only Gaudi2 is currently supported). Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
087fe7c9 |
|
01-Mar-2023 |
Dafna Hirschfeld <dhirschfeld@habana.ai> |
accel/habanalabs: unify err log of hw-fini failure in dirty state print more informative message when failing in dirty state Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
|
#
9732d5d0 |
|
21-Feb-2023 |
Koby Elbaz <kelbaz@habana.ai> |
accel/habanalabs: fix register address on PDMA/EDMA idle check The PDMA/EDMA is_idle routines didn't check the correct CORE register in order to get the accurate idle state. Moreover, it's better to make the is_idle routine more robust by adding additional checks (IS_HALTED) before announcing that the core is idle. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
75276e23 |
|
23-Feb-2023 |
Koby Elbaz <kelbaz@habana.ai> |
accel/habanalabs: verify return code after scrubbing ARCs DCCMs In case the KDMA fails scrubbing the DCCMs (following a soft-reset upon device release), the driver will only print failure until reset flow ends, rather than escalating it into a hard-reset. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
86b74d84 |
|
14-Feb-2023 |
Dafna Hirschfeld <dhirschfeld@habana.ai> |
accel/habanalabs: assert return value of hw_fini Since hw_fini return error code for failure indication, we should check its return value. Currently it might only fail upon soft-reset from hl_device_reset. Later patch will add hw_fini failure in case of polling timeout in hard-reset. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
d85f0531 |
|
19-Feb-2023 |
Koby Elbaz <kelbaz@habana.ai> |
accel/habanalabs: break is_idle function into per-engine sub-routines is_idle() was too long, so break it up for readability. In addition, we can now use the new sub-routines from other places. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
d1bae819 |
|
16-Feb-2023 |
Tomer Tayar <ttayar@habana.ai> |
accel/habanalabs: remove unneeded irq_handler variable 'irq_handler' in gaudi2_enable_msix(), is just assigned with a function name and then used when calling request_threaded_irq(). Remove the variable and use the function name directly as an argument. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
|
#
5e09ae92 |
|
08-Feb-2023 |
Dafna Hirschfeld <dhirschfeld@habana.ai> |
accel/habanalabs: change hw_fini to return int to indicate error We later use cpucp packet for soft reset which might fail so we should be able propagate the failure case. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
|
#
4713ace3 |
|
16-Jan-2023 |
Ofir Bitton <obitton@habana.ai> |
accel/habanalabs: add support for TPC assert In order to allow TPC engines to raise an assert, we must expose the relevant MSIX interrupt to the user so he will configure the engine correctly. In addition, we implement the corresponding interrupt handler that will notify the user upon such an event. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
|
#
60122358 |
|
25-Jan-2023 |
Tal Cohen <talcohen@habana.ai> |
accel/habanalabs: change user interrupt to threaded IRQ We prefer not to handle the user interrupt job inside the interrupt context. Instead, use threaded IRQ to handle the user interrupts. This will allow to avoid disabling interrupts when the user process registers for a new event and to avoid long handling inside an interrupt. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
|
#
32231b6c |
|
13-Nov-2022 |
Ohad Sharabi <osharabi@habana.ai> |
accel/habanalabs: get reset type indication from irq_map When getting an event, add the ability to deduce the reset type from the IRQ map table instead of using hard reset regardless. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
|
#
7fc0d011 |
|
22-Jan-2023 |
Ofir Bitton <obitton@habana.ai> |
accel/habanalabs: expose engine core int reg address In order for engine cores to raise interrupts towards FW, They need to know which register the event data should be written to. Hence, we forward the relevant scratchpad register received during dynamic regs handshake with FW. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
|
#
313e9f63 |
|
10-Jan-2023 |
Moti Haimovski <mhaimovski@habana.ai> |
accel/habanalabs: add critical-event bit in notifier Enhance the existing user notifications by adding a HW and FW critical event bits to be used when a HW or FW event occur that requires both SW abort and hard-resetting the chip. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
|
#
c0e6df91 |
|
17-Jan-2023 |
Dani Liberman <dliberman@habana.ai> |
accel/habanalabs: fix address decode RAZWI handling PSOC RAZWI handling code did not took into account single router that supports several initiators with different XY coordinates. Also, it ignored XY_HI coordinate. This caused 2 problems: 1. RAZWI handle ignored some initiators. 2. When getting PSOC RAZWI from some routers, there was a lot of possible engines which could have caused the RAZWI. Fixed the above issue by handling PSOC RAZWI with both low and high XY coordinates. This way driver supports all initiators and in the worst case there are not more than 2 possible engines for RAZWI. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
|
#
3822a7c4 |
|
23-Feb-2023 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...
|
#
f7d67c1c |
|
14-Jan-2023 |
Koby Elbaz <kelbaz@habana.ai> |
habanalabs/gaudi2: find decode error root cause When a decode error happens, we often don't know the exact root cause (the erroneous address that was accessed) and the exact engine that created the erroneous transaction. To find out, we need to go over all the relevant register blocks in the ASIC. Once we find the relevant engine, we print its details and the offending address. This helps tremendously when debugging an error that was created by running a user workload. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> |
#
9a7d530a |
|
16-Jan-2023 |
Ofir Bitton <obitton@habana.ai> |
habanalabs: refactor user interrupt type In order to support more user interrupt types in the future, we enumerate the user interrupt type instead of using a boolean. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> |
#
12d3ea01 |
|
15-Jan-2023 |
Dani Liberman <dliberman@habana.ai> |
habanalabs/gaudi2: fix emda range registers razwi handling Handling edma razwi is different than all other engines since edma uses sft routers. For hbw transactions sft router contain separate interface for each edma and for lbw there is common interface for both edma engines of the same dcore. To handle the razwi correctly we need to: 1. Simplify the calculation of the sft router address. 2. Add razwi handling for edma qm errors, since edma qman doesn't reports axi error response. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> |
#
43647952 |
|
10-Jan-2023 |
Dani Liberman <dliberman@habana.ai> |
habanalabs/gaudi2: print page fault axi transaction id AXI transaction id holds information about the initiator which caused the page fault. In the future it will be translated automatically by driver to an initiator name. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> |
#
c89d19f7 |
|
11-Jan-2023 |
Dani Liberman <dliberman@habana.ai> |
habanalabe/gaudi2: add cfg base when displaying razwi addresses Captured addresses of low b/w razwi information contains only the offset from the cfg base. To make it more user readable, add the cfg base to it. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> |
#
c21f9f34 |
|
05-Jan-2023 |
Dani Liberman <dliberman@habana.ai> |
habanalabs/gaudi2: read mmio razwi information In gaudi2 there night be different routers for low b/w and high b/w transactions. But in the code that collects razwi information, we used the same router for high b/w and low b/w. Fixed it by reading the information also from low b/w routers. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> |
#
eaca606e |
|
03-Jan-2023 |
Dani Liberman <dliberman@habana.ai> |
habanalabs/gaudi2: remove use of razwi info received from f/w Because f/w does not update razwi info when sending events, remove the use of it. The driver is responsible to check if razwi happened and to collect razwi data. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> |
#
200f3cf0 |
|
04-Jan-2023 |
Carmit Carmel <ccarmel@habana.ai> |
habanalabs/gaudi2: fix log for sob value overflow/underflow The value in SM_SEI_CAUSE includes the SOB index and not the SOB group index. Remove usage of log_mask in sm_sei_cause structure as it was never used. Signed-off-by: Carmit Carmel <ccarmel@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> |
#
ab509d81 |
|
02-Jan-2023 |
Ohad Sharabi <osharabi@habana.ai> |
habanalabs: add set engines masks ASIC function This function shall be used whenever components enable/binning masks should be updated. Usage is in one of the below cases: - update user (or default) component masks - update when getting the masks from FW (either CPUCP or COMMS) Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> |
#
20faaeec |
|
18-Dec-2022 |
Ohad Sharabi <osharabi@habana.ai> |
habanalabs: add uapi to flush inbound HBM transactions When doing p2p with a NIC device, the NIC needs to make sure all the writes to the HBM (through the PCI bar of the Gaudi device) were flushed. It can be done by either the NIC or the host reading through the PCI bar. To support the host side, we supply a simple uapi to perform this flush through the driver, because the user can't create such a transaction by itself (the PCI bar isn't exposed to normal users). Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> |
#
e65e175b |
|
26-Dec-2022 |
Oded Gabbay <ogabbay@kernel.org> |
habanalabs: move driver to accel subsystem Now that we have a subsystem for compute accelerators, move the habanalabs driver to it. This patch only moves the files and fixes the Makefiles. Future patches will change the existing code to register to the accel subsystem and expose the accel device char files instead of the habanalabs device char files. Update the MAINTAINERS file to reflect this change. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> |
#
f7d67c1c |
|
14-Jan-2023 |
Koby Elbaz <kelbaz@habana.ai> |
habanalabs/gaudi2: find decode error root cause When a decode error happens, we often don't know the exact root cause (the erroneous address that was accessed) and the exact engine that created the erroneous transaction. To find out, we need to go over all the relevant register blocks in the ASIC. Once we find the relevant engine, we print its details and the offending address. This helps tremendously when debugging an error that was created by running a user workload. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
9a7d530a |
|
16-Jan-2023 |
Ofir Bitton <obitton@habana.ai> |
habanalabs: refactor user interrupt type In order to support more user interrupt types in the future, we enumerate the user interrupt type instead of using a boolean. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
12d3ea01 |
|
15-Jan-2023 |
Dani Liberman <dliberman@habana.ai> |
habanalabs/gaudi2: fix emda range registers razwi handling Handling edma razwi is different than all other engines since edma uses sft routers. For hbw transactions sft router contain separate interface for each edma and for lbw there is common interface for both edma engines of the same dcore. To handle the razwi correctly we need to: 1. Simplify the calculation of the sft router address. 2. Add razwi handling for edma qm errors, since edma qman doesn't reports axi error response. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
43647952 |
|
10-Jan-2023 |
Dani Liberman <dliberman@habana.ai> |
habanalabs/gaudi2: print page fault axi transaction id AXI transaction id holds information about the initiator which caused the page fault. In the future it will be translated automatically by driver to an initiator name. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
c89d19f7 |
|
11-Jan-2023 |
Dani Liberman <dliberman@habana.ai> |
habanalabe/gaudi2: add cfg base when displaying razwi addresses Captured addresses of low b/w razwi information contains only the offset from the cfg base. To make it more user readable, add the cfg base to it. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
c21f9f34 |
|
05-Jan-2023 |
Dani Liberman <dliberman@habana.ai> |
habanalabs/gaudi2: read mmio razwi information In gaudi2 there night be different routers for low b/w and high b/w transactions. But in the code that collects razwi information, we used the same router for high b/w and low b/w. Fixed it by reading the information also from low b/w routers. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
eaca606e |
|
03-Jan-2023 |
Dani Liberman <dliberman@habana.ai> |
habanalabs/gaudi2: remove use of razwi info received from f/w Because f/w does not update razwi info when sending events, remove the use of it. The driver is responsible to check if razwi happened and to collect razwi data. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
200f3cf0 |
|
04-Jan-2023 |
Carmit Carmel <ccarmel@habana.ai> |
habanalabs/gaudi2: fix log for sob value overflow/underflow The value in SM_SEI_CAUSE includes the SOB index and not the SOB group index. Remove usage of log_mask in sm_sei_cause structure as it was never used. Signed-off-by: Carmit Carmel <ccarmel@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
ab509d81 |
|
02-Jan-2023 |
Ohad Sharabi <osharabi@habana.ai> |
habanalabs: add set engines masks ASIC function This function shall be used whenever components enable/binning masks should be updated. Usage is in one of the below cases: - update user (or default) component masks - update when getting the masks from FW (either CPUCP or COMMS) Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
20faaeec |
|
18-Dec-2022 |
Ohad Sharabi <osharabi@habana.ai> |
habanalabs: add uapi to flush inbound HBM transactions When doing p2p with a NIC device, the NIC needs to make sure all the writes to the HBM (through the PCI bar of the Gaudi device) were flushed. It can be done by either the NIC or the host reading through the PCI bar. To support the host side, we supply a simple uapi to perform this flush through the driver, because the user can't create such a transaction by itself (the PCI bar isn't exposed to normal users). Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
e65e175b |
|
26-Dec-2022 |
Oded Gabbay <ogabbay@kernel.org> |
habanalabs: move driver to accel subsystem Now that we have a subsystem for compute accelerators, move the habanalabs driver to it. This patch only moves the files and fixes the Makefiles. Future patches will change the existing code to register to the accel subsystem and expose the accel device char files instead of the habanalabs device char files. Update the MAINTAINERS file to reflect this change. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|