History log of /linux-master/drivers/accel/habanalabs/gaudi2/gaudi2.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 3bf6ef98 05-Feb-2024 Ofir Bitton <obitton@habana.ai>

accel/habanalabs/gaudi2: drain event lacks rd/wr indication

Due to a H/W issue, AXI drain event does not include a read/write
indication, hence we remove this print.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# e855869b 25-Jan-2024 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: fix glbl error cause handling

The glbl error cause handling has a wrong assumption that all error
bits are consecutive.
Fix the handling to check all relevant error bits per ASIC.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# c1e89ae4 18-Jan-2024 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs/gaudi2: check extended errors according to PCIe addr_dec interrupt info

The FW interrupt info for a PCIe addr_dec event is set correctly, so
check for either global errors or razwi according to the indications
there.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# c14e5cd3 14-Jan-2024 Farah Kassabri <fkassabri@habana.ai>

accel/habanalabs: remove hop size from asic properties

The hop size related properties is a MMU properties and not
asic properties.
As for PMMU and HMMU we could have different sizes.

Signed-off-by: Farah Kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 01f8cd0f 02-Jan-2024 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs/gaudi2: fail memory memset when failing to copy QM packet to device

gaudi2_memset_memory_chunk_using_edma_qm() calls the access_dev_mem()
ASIC function, but ignores its return value.
Add this missing check.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 731d320e 01-Jan-2024 Dani Liberman <dliberman@habana.ai>

accel/habanalabs: remove call to deprecated function

In newer kernel versions, irq_set_affinity_hint() is deprecated.
Instead, use the newer version which is irq_set_affinity_and_hint().

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# f728c17f 02-Nov-2023 Farah Kassabri <fkassabri@habana.ai>

accel/habanalabs/gaudi2: move HMMU page tables to device memory

Currently the HMMU page tables reside in the host memory,
which will cause host access from the device for every page walk.
This can affect PCIe bandwidth in certain scenarios.

To prevent that problem, HMMU page tables will be moved to the device
memory so the miss transaction will read the hops from there instead of
going to the host.

Signed-off-by: Farah Kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# e91c37f1 21-Sep-2023 Dani Liberman <dliberman@habana.ai>

accel/habanalabs/gaudi2: add interrupt affinity for user interrupts

User interrupts are MSIx interrupts coming from Gaudi2, that have
specific range of IDs and are assigned to the sole use of the user
process that opened the Gaudi2 device (reminder: there can be only
a single user process running on Gaudi2 at any given time).

The interrupts are allocated and managed by the driver and therefore,
the user expects the driver to initialize them properly, which also
includes setting the affinity to the related CPU cores of the
device's NUMA node to get maximum performance.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# bc5f15ab 29-Nov-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs/gaudi2: avoid overriding existing undefined opcode data

Part of the undefined opcode data is updated in
gaudi2_handle_qman_err_generic() and some in
handle_lower_qman_data_on_err().
However, the 'write_enable' flag is checked only in
gaudi2_handle_qman_err_generic(), and information of more than a single
error can be mixed there.

Moreover, handle_lower_qman_data_on_err() is called only for the lower
QMAN, so for an error in the upper QMAN there is only a partial info.

Move all the data update to be done in a single place, protected by the
'write_enable' flag.
As mainly the lower QMAN's info is interesting, avoid saving the partial
info for the upper QMAN.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 565ee788 23-Nov-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs/gaudi2: add zero padding when printing QM CP instruction

QM instructions are in multiples of 64 bits and the command type is in
the upper bits of first QWORD.
To make it clearer that an undefined command is due to a type of 0x0,
always print all 64 bits and add a zero padding if needed.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 5bc155cf 16-Nov-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs/gaudi2: use correct registers to dump QM CQ info

The QM CQ PTR_LO/PTR_HI/TSIZE registers are for pushing a CQ entry, and
although they are updated by HW even when descriptors are fetched by PQ
and CB addresses are fed into CQ, the correct registers to use when
dumping the CQ info are the ones with the _STS suffix.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# ae303d88 06-Nov-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs/gaudi2: get the correct QM CQ info upon an error

Upon a QM error, the address/size from both the CQ and the ARC_CQ are
printed, although the instruction that led to the error was received
from only one of them.

Moreover, in case of a QM undefined opcode, only one of these
address/size sets will be captured based on the value of ARC_CQ_PTR.
However, this value can be non-zero even if currently the CQ is used, in
case the CQ/ARC_CQ are alternately used.

Under the assumption of having a stop-on-error configuration, modify to
use CP_STS.CUR_CQ field to get the relevant CQ for the QM error.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 0ec34677 31-Oct-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs/gaudi2: fix undef opcode reporting

currently the undefined opcode event bit in set only for lower cp and
only if 'write_enable' is true. It should be set anyway and for all
streams in order to report that event to userspace.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# c6485482 15-Oct-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs/gaudi2: assume hard-reset by FW upon PCIe AXI drain

When a PCIe AXI drain event happens, it is possible that the driver
cannot access the device through PCIe, and therefore cannot send a
hard-reset request to FW.
Starting from FW version 1.13, FW will initiate a hard-reset in such
a case without waiting for a reset request from the driver.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# ff92d010 27-Aug-2023 Ohad Sharabi <osharabi@habana.ai>

accel/habanalabs: trace dma map sgtable

Traces the DMA [un]map_sgtable using the new traces we added.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# d7aa2948 19-Sep-2023 Oded Gabbay <ogabbay@kernel.org>

accel/habanalabs: remove unused asic functions

asic_dma_{un}map_single() asic-specific functions are no longer called
from the common code, so delete these functions.

In addition, delete the gaudi2 implementation as they are also not
called.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>


# 674f7779 06-Sep-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: extend preboot timeout when preboot might take longer

There are cases such when FW runs MBIST, that preboot is expected to take
longer than the usual. In such cases the firmware reports status
SECURITY_READY/IN_PREBOOT and we extend the timeout waiting for it.
This is currently implemented for Gaudi2 only.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# ba24b5ec 19-Jul-2023 farah kassabri <fkassabri@habana.ai>

accel/habanalabs: split user interrupts pending list

Currently driver maintain one list for both pending user interrupts
which seeks to wait till CQ reaches it's target value and also the ones
that seeks to get timestamp records when the CQ reaches it's target
value.
This causes delay in handling the waiters which gets higher priority
than the timestamp records.
In order to solve this, let's split the list into two,
one for each case and each one is protected by it's own spinlock.
Waiters will be handled within the interrupt context first,
then the timestamp records will be set.
Freeing the timestamp related memory will be handled in a workqueue.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Tomer Tayar <ttayar@habana.ai>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 764bfd13 22-Aug-2023 farah kassabri <fkassabri@habana.ai>

accel/habanalabs/gaudi2: add eq health check using irq

This is the second patch for applying the eq health check mechanism
which will add support for the interrupt flow for gaudi2 asic.

More info about the interrupt mechanism:
set a dedicated msix for the eq error interrupt, and add
interrupt handler for it.
when FW detects some issue with EQ like EQ_FULL, it'll
raise that interrupt and driver should reset the device.
Driver will inform the FW which msix index to use through
the already existing handshake mechanism which will
send msix info message to fw.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 7c4130e6 07-Aug-2023 farah kassabri <fkassabri@habana.ai>

accel/habanalabs/gaudi2: handle eq health heartbeat check

Add mechanism for fw eq health check. this will be done using two flows:
using the heartbeat mechanism and raising a dedicated interrupt to
indicate an eq failure like EQ full.
This patch will add implementation for the eq heartbeat for gaudi2 asic.

More info about the heartbeat mechanism:
Expand the heartbeat mechanism to monitor a new event that
will be sent from FW upon receiving heartbeat message.
that way driver can know that the eq is working or not.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 72bff371 22-Aug-2023 Moti Haimovski <mhaimovski@habana.ai>

accel/habanalabs/gaudi2: print power-mode changes

Print to kernel log any device power mode changes events reported by
the FW.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# d261b0ab 25-Jul-2023 Ofir Bitton <obitton@habana.ai>

accel/habanalabs/gaudi2: include block id in ECC error reporting

During ECC event handling, Memory wrapper id was mistakenly
printed as block id. Fix the print and in addition fetch the actual
block-id from firmware.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 90f3de61 04-Sep-2023 Christophe JAILLET <christophe.jaillet@wanadoo.fr>

accel/habanalabs/gaudi2: Fix incorrect string length computation in gaudi2_psoc_razwi_get_engines()

snprintf() returns the "number of characters which *would* be generated for
the given input", not the size *really* generated.

In order to avoid too large values for 'str_size' (and potential negative
values for "PSOC_RAZWI_ENG_STR_SIZE - str_size") use scnprintf()
instead of snprintf().

Fixes: c0e6df916050 ("accel/habanalabs: fix address decode RAZWI handling")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# a45d5cf0 25-Aug-2023 Justin Stitt <justinstitt@google.com>

accel/habanalabs: refactor deprecated strncpy to strscpy_pad

`strncpy` is deprecated for use on NUL-terminated destination strings [1].

We see that `prop->cpucp_info.card_name` is supposed to be
NUL-terminated based on its usage within `__hwmon_device_register()`
(wherein it's called "name"):
| if (name && (!strlen(name) || strpbrk(name, "-* \t\n")))
| dev_warn(dev,
| "hwmon: '%s' is not a valid name attribute, please fix\n",
| name);

A suitable replacement is `strscpy_pad` [2] due to the fact that it
guarantees both NUL-termination and NUL-padding on its destination
buffer.

NUL-padding on `prop->cpucp_info.card_name` is not strictly necessary as
`hdev->prop` is explicitly zero-initialized but should be used
regardless as it gets copied out to userspace directly -- as per Kees'
suggestion.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 01ab1629 04-Jul-2023 Igor Grinberg <igrinberg@habana.ai>

accel/habanalabs/gaudi2: prepare to remove cpu_rst_status

The soft reset has transitioned to CPUCP packet instead of plain
register write and is about to be removed from the struct cpu_dyn_regs.
As a preparation for removing the cpu_rst_status field from
struct cpu_dyn_regs, switch to use the plain macro - this keeps the
backward compatibility.

Signed-off-by: Igor Grinberg <igrinberg@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# a8ab1a81 23-May-2023 Ofir Bitton <obitton@habana.ai>

accel/habanalabs: add info ioctl for engine error reports

User gets notification for every engine error report, but he still
lacks the exact engine information. Hence, we allow user to query
for the exact engine reported an error.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# fa46c7bb 11-Jul-2023 Oded Gabbay <ogabbay@kernel.org>

accel/habanalabs/gaudi2: fix missing check of kernel ctx

If we are initializing the kernel context when we have a Gaudi2 device,
we don't need to do any late initializing of that context with
specific Gaudi2 code.

Reviewed-by: Ofir Bitton <obitton@habana.ai>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 15c0bb16 04-Jul-2023 Igor Grinberg <igrinberg@habana.ai>

accel/habanalabs/gaudi2: prepare to remove soft_rst_irq

The soft reset has transitioned to CPUCP packet instead of plain
register write and is about to be removed from the struct cpu_dyn_regs.
As a preparation for removing the gic_host_soft_rst_irq field from
struct cpu_dyn_regs, switch to use the plain macro - this keeps the
backward compatibility.

Signed-off-by: Igor Grinberg <igrinberg@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 43d8acce 29-Mar-2023 Dani Liberman <dliberman@habana.ai>

accel/habanalabs: handle arc farm razwi

Implement razwi handling for arc farm and add it to arc farm sei
event handler.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# f17182d0 30-May-2023 Ofir Bitton <obitton@habana.ai>

accel/habanalabs: stop fetching MME SBTE error cause

Because in this case we have only a single possible cause, we can
safely stop fetching the cause from firmware.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# c6a4f256 15-May-2023 Ofir Bitton <obitton@habana.ai>

accel/habanalabs: notify user about undefined opcode event

In order for user to be aware of undefined opcode events, we must
store all relevant information and notify user about the failure.
The user will fetch the stored info via info ioctl.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# fac91dd5 21-May-2023 Ofir Bitton <obitton@habana.ai>

accel/habanalabs: add event queue extra validation

In order to increase reliability of the event queue interface,
we apply to Gaudi2 the same mechanism we have in Gaudi1.
The extra validation is basically checking that the received
event index matches the expected index.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 8a20b381 17-May-2023 Ofir Bitton <obitton@habana.ai>

accel/habanalabs: fix bug of not fetching addr_dec info

addr_dec info should always be fetched, regardless of cause value.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 5d658d0c 08-May-2023 Dani Liberman <dliberman@habana.ai>

accel/habanalabs: mask part of hmmu page fault captured address

When receiving page fault from hmmu, the captured address is scrambled
both by HW and by driver. The driver part is unscrambled but the HW
part isn't getting unscrambled.
To avoid declaring wrong address, the HW scrambled part will be
masked.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 6092cedf 10-May-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: print qman data on error only for lower qman

By default, the upper QMANs are not used, and instead engines ARCs
access the lower QMANs directly.
Errors for upper QMANs are therefore not expected, and the debug print
of the PQ entries is not needed.

Modify the QMAN debug data print on errors to include only information
for the lower QMAN.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 54381ee8 10-May-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: use lower QM in QM errors handling

The QMAN GLBL_ERR_STS_4 register has indications for errors also in the
lower CQ and the ARC CQ, and not just for errors in the lower CP.
Modify the relevant define/struct and the related print to use "lower
QM" instead of "lower CP".

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# dcc8fa88 08-May-2023 Dani Liberman <dliberman@habana.ai>

accel/habanalabs: use binning info when handling razwi

When receiving sei interrupt from tpc or decoder, we need to check
the binning mask because if the engine is binned, the razwi info
won't be in the router of the binned engine, instead will be in the
router of the substitute engine.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# b2d61fec 02-May-2023 Koby Elbaz <kelbaz@habana.ai>

accel/habanalabs: upon DMA errors, use FW-extracted error cause

Initially, the driver used to read the error cause data directly from
the ASIC. However, the FW now clears it before the driver could read
it. Therefore we should use the error cause data that is extracted by
the FW.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 9ec7639b 15-May-2023 Dan Carpenter <dan.carpenter@linaro.org>

accel/habanalabs: fix gaudi2_get_tpc_idle_status() return

The gaudi2_get_tpc_idle_status() function returned the incorrect variable
so it always returned true.

Fixes: d85f0531b928 ("accel/habanalabs: break is_idle function into per-engine sub-routines")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# d0dcd4bb 24-Apr-2023 Ofir Bitton <obitton@habana.ai>

accel/habanalabs: always fetch pci addr_dec error info

Due to missing indication of address decode source (LBW/HBW bus),
we should always try and fetch extended information.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 7d212963 20-Apr-2023 Koby Elbaz <kelbaz@habana.ai>

accel/habanalabs: fix a static warning - 'dubious: x & !y'

Use a straight forward approach to get a conditional result.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 3d21ec64 17-Apr-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: add missing tpc interrupt info

For some reason the last possible tpc interrupt cause in
gaudi2_tpc_interrupts_cause is missing from the code.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# cc7b790d 08-Feb-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: do soft-reset using cpucp packet

This is done depending on the FW version. The cpucp method is
preferable and saves scratchpads resource.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# a12428ac 18-Apr-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: check fw version using sw version

The fw inner version is less trustable, instead use the fw general
sw release version.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# f9b60242 10-Apr-2023 Moti Haimovski <mhaimovski@habana.ai>

accel/habanalabs: fix bug in free scratchpad memory

This commit fixes a bug in Gaudi2 when freeing the scratchpad memory
in case software init fails.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 1464fbd8 30-Mar-2023 Tal Cohen <talcohen@habana.ai>

accel/habanalabs: ignore false positive razwi

In Gaudi2 asic, PSOC RAZWI may cause in HBW or LBW. The address that
caused the error is read from HW register and printed by the Driver.
There are cases where the Driver receives an indication on PSOC
RAZWI error but the address value is zero. In that case, the indication
is a false positive.
The Driver should not "count" a PSOC RAZWI event error when the
caused the address is zeroed.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 31420f93 20-Mar-2023 Moti Haimovski <mhaimovski@habana.ai>

accel/habanalabs: speedup h/w queues test in Gaudi2

HW queues testing at driver load and after reset takes a substantial
amount of time.
This commit reduces the queues test time in Gaudi2 devices by running
all the tests in parallel instead of one after the other.
Time measurements on tests duration shows that the new method is almost
x100 faster than the serial approach.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 91204e47 28-Mar-2023 Dani Liberman <dliberman@habana.ai>

accel/habanalabs: fix handling of arc farm sei event

There is only single eq entry for arc farm sei event which aggregates
events from the four arc farms.
Fix the code to handle this event according to this behavior.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 38f3c732 28-Mar-2023 Ofir Bitton <obitton@habana.ai>

accel/habanalabs: fixes for unexpected error interrupt

Removing redundant asic prop variable as we don't need to expose this
to common code. In addition, fix some typos.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 802f25b6 21-Mar-2023 Tal Cohen <talcohen@habana.ai>

accel/habanalabs: sync f/w events interrupt in hard reset

Receiving events from FW, while the device is in hard reset, causes
a warning message in Driver log. The message may point to a
problem in the Driver or FW. But It also can appear as a result
of events that have been sent from FW just before the hard reset.
In order to avoid receiving events from FW while the device is in reset
and is already in 'disabled' mode, sync the f/w events interrupt right
before setting the device to 'disabled'.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 82a1b48a 26-Mar-2023 Ofir Bitton <obitton@habana.ai>

accel/habanalabs: fix wrong reset and event flags

During event handling, driver sets relevant reset and user event
notifier flags. Fix few wrong flags settings.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 49fd071d 26-Mar-2023 Ofir Bitton <obitton@habana.ai>

accel/habanalabs: print raw binning masks in debug level

There are rare cases of failures when cards are initialized due to
wrong values in efuse mappings that are parsed by firmware.

To help debug those cases, print (in debug level) the raw binning masks
as fetched from the firmware during device initialization.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# d1943f1b 15-Mar-2023 Ofir Bitton <obitton@habana.ai>

accel/habanalabs: fix HBM MMU interrupt handling

Current mapping between HMMU event and HMMU block is wrong.
In addition the captured address in case of a page fault or
an access error is scrambled, Hence we must call the descramble
function.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 6306e815 23-Mar-2023 Dani Liberman <dliberman@habana.ai>

accel/habanalabs: fix access error clear event

The register which needs to be cleared is the valid register instead
of the address.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 75b44575 16-Mar-2023 Ofir Bitton <obitton@habana.ai>

accel/habanalabs: remove redundant TODOs

As mmu refactor and nic resume are not relevant anymore, remove
their TODO comments.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# ec484931 16-Mar-2023 Dani Liberman <dliberman@habana.ai>

accel/habanalabs: change razwi handle after fw fix

FW had one data route for tpc0 and tpc1 when running in secured mode
and a different one when running without secured mode. After fw fixed
this issue, both mode have the same data path.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# e1ef053e 08-Mar-2023 Ofir Bitton <obitton@habana.ai>

accel/habanalabs: add handling for unexpected user event

In order for the user to be aware of unexpected events in Gaudi2 that
aren't assigned to a specific engine, we are adding the handling of
this dedicated interrupt.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# dc934c18 15-Mar-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: fix a maybe-uninitialized compilation warnings

Initialize 'index' in gaudi2_handle_qman_err() and 'offset' in
gaudi2_get_nic_idle_status() to avoid "maybe-uninitialized" compilation
warnings.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 9669b96f 15-Mar-2023 Dani Liberman <dliberman@habana.ai>

accel/habanalabs: fix page fault event clear

After getting page fault in gaudi2, we need to clear the valid bit
instead of the address.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 958e4797 13-Mar-2023 Ofir Bitton <obitton@habana.ai>

accel/habanalabs: expose rotator mask to userspace

All engine masks are exposed to user, make sure user gets the
correct rotator enabled mask in gaudi2.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 0e418ab7 12-Mar-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: remove '\n' when passing strings to gaudi2_print_event()

Remove all '\n' from strings which are passed as arguments to
gaudi2_print_event(), because the newline character is added internally
in this function.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# af5e675f 07-Mar-2023 Koby Elbaz <kelbaz@habana.ai>

accel/habanalabs: return tlb inv error code upon failure

Now that CQ-completion based jobs do not trigger a reset upon failure,
failure of such jobs (e.g., MMU cache invalidation) should be handled
by the caller itself depending on the error code returned to it.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 5d8a5f29 09-Mar-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: in {e/p}dma_core events read the err cause reg

Since the err_cause register is unprivileged, we should read it from
the driver instead of using the param that came from the FW.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# f8d139a7 08-Mar-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: fix use of var reset_sleep_ms

- remove reset_sleep_ms arg from functions that don't use it.
- move the call msleep(reset_sleep_ms) from btm poll to gaudi2_hw_fini
as it is called from there already for other flow.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 077a39fa 20-Feb-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: in hw_fini return error code if polling timed-out

In hw_fini callback, we use either the cpucp packet method or polling a
register. Currently we return error only in the case of cpucp packet
failure. In this patch we also return error if polling timed out.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 8c695455 08-Mar-2023 Ofir Bitton <obitton@habana.ai>

accel/habanalabs: increase reset poll timeout

Due to a firmware bug we need to increase reset poll timeout
or else we will timeout in secured environments.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 7c766e58 06-Mar-2023 Koby Elbaz <kelbaz@habana.ai>

accel/habanalabs: do not verify engine modes after being changed

Engines idle state can't always be verified between changes of
engine modes (e.g., stall/halt).
For example, if a CS is inflight when altering engine's mode,
idle state will return NOT idle, always.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 336b78c6 05-Mar-2023 Oded Gabbay <ogabbay@kernel.org>

accel/habanalabs: align to latest firmware specs

Copy the most up-to-date interface files to the firmware.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>


# 79c16437 08-Mar-2023 Oded Gabbay <ogabbay@kernel.org>

accel/habanalabs: make gaudi2_is_device_idle() static

This function is only called inside gaudi2.c file.

Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202303071320.X5ouBlNY-lkp@intel.com/
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>


# 801507d3 19-Feb-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: move soft-reset wait to soft-reset execute

We plan to do soft-reset either by mmio or by using cpucp packet
depending on the FW version. We don't want to check FW version in two
different places for that (execute soft-reset and wait to soft-reset)
so move the waiting to gaudi2_execute_soft_reset. This also makes sense
because the cpucp also does the waiting.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# f7f0085e 15-Feb-2023 Koby Elbaz <kelbaz@habana.ai>

accel/habanalabs: add uapi to stall/resume engine

The user might want to stall/resume engines to perform power testing
for various scenarios. Because our current
HL_CS_FLAGS_ENGINE_CORE_COMMAND command only handles the engines' cores,
we need to add another opcode for handling entire engine and not just
its core.

The user supplies an array, where each entry holds the engine's ID and
the command to send to the engine. The size of the array is limited
by the number of engines in the ASIC (only Gaudi2 is currently
supported).

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 087fe7c9 01-Mar-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: unify err log of hw-fini failure in dirty state

print more informative message when failing in dirty state

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# 9732d5d0 21-Feb-2023 Koby Elbaz <kelbaz@habana.ai>

accel/habanalabs: fix register address on PDMA/EDMA idle check

The PDMA/EDMA is_idle routines didn't check the correct CORE register
in order to get the accurate idle state.
Moreover, it's better to make the is_idle routine more robust by adding
additional checks (IS_HALTED) before announcing that the core is idle.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 75276e23 23-Feb-2023 Koby Elbaz <kelbaz@habana.ai>

accel/habanalabs: verify return code after scrubbing ARCs DCCMs

In case the KDMA fails scrubbing the DCCMs (following a soft-reset
upon device release), the driver will only print failure until reset
flow ends, rather than escalating it into a hard-reset.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 86b74d84 14-Feb-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: assert return value of hw_fini

Since hw_fini return error code for failure indication, we should
check its return value. Currently it might only fail upon soft-reset
from hl_device_reset. Later patch will add hw_fini failure in case of
polling timeout in hard-reset.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# d85f0531 19-Feb-2023 Koby Elbaz <kelbaz@habana.ai>

accel/habanalabs: break is_idle function into per-engine sub-routines

is_idle() was too long, so break it up for readability.
In addition, we can now use the new sub-routines from other places.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# d1bae819 16-Feb-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: remove unneeded irq_handler variable

'irq_handler' in gaudi2_enable_msix(), is just assigned with a function
name and then used when calling request_threaded_irq().
Remove the variable and use the function name directly as an argument.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# 5e09ae92 08-Feb-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: change hw_fini to return int to indicate error

We later use cpucp packet for soft reset which might fail
so we should be able propagate the failure case.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# 4713ace3 16-Jan-2023 Ofir Bitton <obitton@habana.ai>

accel/habanalabs: add support for TPC assert

In order to allow TPC engines to raise an assert, we must expose
the relevant MSIX interrupt to the user so he will configure the engine
correctly. In addition, we implement the corresponding interrupt
handler that will notify the user upon such an event.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# 60122358 25-Jan-2023 Tal Cohen <talcohen@habana.ai>

accel/habanalabs: change user interrupt to threaded IRQ

We prefer not to handle the user interrupt job inside the interrupt
context. Instead, use threaded IRQ to handle the user interrupts.
This will allow to avoid disabling interrupts when the user process
registers for a new event and to avoid long handling inside an
interrupt.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# 32231b6c 13-Nov-2022 Ohad Sharabi <osharabi@habana.ai>

accel/habanalabs: get reset type indication from irq_map

When getting an event, add the ability to deduce the reset type from
the IRQ map table instead of using hard reset regardless.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# 7fc0d011 22-Jan-2023 Ofir Bitton <obitton@habana.ai>

accel/habanalabs: expose engine core int reg address

In order for engine cores to raise interrupts towards FW, They need
to know which register the event data should be written to.
Hence, we forward the relevant scratchpad register received during
dynamic regs handshake with FW.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# 313e9f63 10-Jan-2023 Moti Haimovski <mhaimovski@habana.ai>

accel/habanalabs: add critical-event bit in notifier

Enhance the existing user notifications by adding a HW and FW critical
event bits to be used when a HW or FW event occur that requires
both SW abort and hard-resetting the chip.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# c0e6df91 17-Jan-2023 Dani Liberman <dliberman@habana.ai>

accel/habanalabs: fix address decode RAZWI handling

PSOC RAZWI handling code did not took into account single router that
supports several initiators with different XY coordinates. Also, it
ignored XY_HI coordinate. This caused 2 problems:
1. RAZWI handle ignored some initiators.
2. When getting PSOC RAZWI from some routers, there was a lot of
possible engines which could have caused the RAZWI.

Fixed the above issue by handling PSOC RAZWI with both low and high
XY coordinates. This way driver supports all initiators and in
the worst case there are not more than 2 possible engines for RAZWI.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>


# 3822a7c4 23-Feb-2023 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

- Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X
bit.

- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.

- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes

- Johannes Weiner has a series ("mm: push down lock_page_memcg()")
which does perform some memcg maintenance and cleanup work.

- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter".

These filters provide users with finer-grained control over DAMOS's
actions. SeongJae has also done some DAMON cleanup work.

- Kairui Song adds a series ("Clean up and fixes for swap").

- Vernon Yang contributed the series "Clean up and refinement for maple
tree".

- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.

- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".

- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".

- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".

- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".

- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series
"mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
swap PTEs".

- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".

- Sergey Senozhatsky has improved zsmalloc's memory utilization with
his series "zsmalloc: make zspage chain size configurable".

- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings.

The previous BPF-based approach had shortcomings. See "mm: In-kernel
support for memory-deny-write-execute (MDWE)".

- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".

- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".

- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a
per-node basis. See the series "Introduce per NUMA node memory error
statistics".

- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage
during compaction".

- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".

- Christoph Hellwig has removed block_device_operations.rw_page() in
ths series "remove ->rw_page".

- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".

- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier
functions".

- Some pagemap cleanup and generalization work in Mike Rapoport's
series "mm, arch: add generic implementation of pfn_valid() for
FLATMEM" and "fixups for generic implementation of pfn_valid()"

- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".

- Jason Gunthorpe rationalized the GUP system's interface to the rest
of the kernel in the series "Simplify the external interface for
GUP".

- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the
series "mm/damon: deprecate DAMON debugfs interface".

- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.

- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".

- Arnd Bergmann has some objtool fixups in "objtool warning fixes".

* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
include/linux/migrate.h: remove unneeded externs
mm/memory_hotplug: cleanup return value handing in do_migrate_range()
mm/uffd: fix comment in handling pte markers
mm: change to return bool for isolate_movable_page()
mm: hugetlb: change to return bool for isolate_hugetlb()
mm: change to return bool for isolate_lru_page()
mm: change to return bool for folio_isolate_lru()
objtool: add UACCESS exceptions for __tsan_volatile_read/write
kmsan: disable ftrace in kmsan core code
kasan: mark addr_has_metadata __always_inline
mm: memcontrol: rename memcg_kmem_enabled()
sh: initialize max_mapnr
m68k/nommu: add missing definition of ARCH_PFN_OFFSET
mm: percpu: fix incorrect size in pcpu_obj_full_size()
maple_tree: reduce stack usage with gcc-9 and earlier
mm: page_alloc: call panic() when memoryless node allocation fails
mm: multi-gen LRU: avoid futile retries
migrate_pages: move THP/hugetlb migration support check to simplify code
migrate_pages: batch flushing TLB
migrate_pages: share more code between _unmap and _move
...


# f7d67c1c 14-Jan-2023 Koby Elbaz <kelbaz@habana.ai>

habanalabs/gaudi2: find decode error root cause

When a decode error happens, we often don't know the exact root
cause (the erroneous address that was accessed) and the exact engine
that created the erroneous transaction.

To find out, we need to go over all the relevant register blocks
in the ASIC. Once we find the relevant engine, we print its details
and the offending address.

This helps tremendously when debugging an error that was created
by running a user workload.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

# 9a7d530a 16-Jan-2023 Ofir Bitton <obitton@habana.ai>

habanalabs: refactor user interrupt type

In order to support more user interrupt types in the future, we
enumerate the user interrupt type instead of using a boolean.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

# 12d3ea01 15-Jan-2023 Dani Liberman <dliberman@habana.ai>

habanalabs/gaudi2: fix emda range registers razwi handling

Handling edma razwi is different than all other engines since edma
uses sft routers. For hbw transactions sft router contain separate
interface for each edma and for lbw there is common interface for
both edma engines of the same dcore.

To handle the razwi correctly we need to:
1. Simplify the calculation of the sft router address.
2. Add razwi handling for edma qm errors, since edma qman doesn't
reports axi error response.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

# 43647952 10-Jan-2023 Dani Liberman <dliberman@habana.ai>

habanalabs/gaudi2: print page fault axi transaction id

AXI transaction id holds information about the initiator which caused
the page fault. In the future it will be translated automatically by
driver to an initiator name.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

# c89d19f7 11-Jan-2023 Dani Liberman <dliberman@habana.ai>

habanalabe/gaudi2: add cfg base when displaying razwi addresses

Captured addresses of low b/w razwi information contains only the
offset from the cfg base. To make it more user readable, add the cfg
base to it.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

# c21f9f34 05-Jan-2023 Dani Liberman <dliberman@habana.ai>

habanalabs/gaudi2: read mmio razwi information

In gaudi2 there night be different routers for low b/w and high b/w
transactions. But in the code that collects razwi information, we used
the same router for high b/w and low b/w.

Fixed it by reading the information also from low b/w routers.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

# eaca606e 03-Jan-2023 Dani Liberman <dliberman@habana.ai>

habanalabs/gaudi2: remove use of razwi info received from f/w

Because f/w does not update razwi info when sending events, remove the
use of it.
The driver is responsible to check if razwi happened and to
collect razwi data.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

# 200f3cf0 04-Jan-2023 Carmit Carmel <ccarmel@habana.ai>

habanalabs/gaudi2: fix log for sob value overflow/underflow

The value in SM_SEI_CAUSE includes the SOB index and not the SOB group
index.
Remove usage of log_mask in sm_sei_cause structure as it was never
used.

Signed-off-by: Carmit Carmel <ccarmel@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

# ab509d81 02-Jan-2023 Ohad Sharabi <osharabi@habana.ai>

habanalabs: add set engines masks ASIC function

This function shall be used whenever components enable/binning masks
should be updated.

Usage is in one of the below cases:
- update user (or default) component masks
- update when getting the masks from FW (either CPUCP or COMMS)

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

# 20faaeec 18-Dec-2022 Ohad Sharabi <osharabi@habana.ai>

habanalabs: add uapi to flush inbound HBM transactions

When doing p2p with a NIC device, the NIC needs to make sure all the
writes to the HBM (through the PCI bar of the Gaudi device) were
flushed.

It can be done by either the NIC or the host reading through the PCI
bar.

To support the host side, we supply a simple uapi to perform this flush
through the driver, because the user can't create such a transaction
by itself (the PCI bar isn't exposed to normal users).

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

# e65e175b 26-Dec-2022 Oded Gabbay <ogabbay@kernel.org>

habanalabs: move driver to accel subsystem

Now that we have a subsystem for compute accelerators, move the
habanalabs driver to it.

This patch only moves the files and fixes the Makefiles. Future
patches will change the existing code to register to the accel
subsystem and expose the accel device char files instead of the
habanalabs device char files.

Update the MAINTAINERS file to reflect this change.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

# f7d67c1c 14-Jan-2023 Koby Elbaz <kelbaz@habana.ai>

habanalabs/gaudi2: find decode error root cause

When a decode error happens, we often don't know the exact root
cause (the erroneous address that was accessed) and the exact engine
that created the erroneous transaction.

To find out, we need to go over all the relevant register blocks
in the ASIC. Once we find the relevant engine, we print its details
and the offending address.

This helps tremendously when debugging an error that was created
by running a user workload.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 9a7d530a 16-Jan-2023 Ofir Bitton <obitton@habana.ai>

habanalabs: refactor user interrupt type

In order to support more user interrupt types in the future, we
enumerate the user interrupt type instead of using a boolean.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 12d3ea01 15-Jan-2023 Dani Liberman <dliberman@habana.ai>

habanalabs/gaudi2: fix emda range registers razwi handling

Handling edma razwi is different than all other engines since edma
uses sft routers. For hbw transactions sft router contain separate
interface for each edma and for lbw there is common interface for
both edma engines of the same dcore.

To handle the razwi correctly we need to:
1. Simplify the calculation of the sft router address.
2. Add razwi handling for edma qm errors, since edma qman doesn't
reports axi error response.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 43647952 10-Jan-2023 Dani Liberman <dliberman@habana.ai>

habanalabs/gaudi2: print page fault axi transaction id

AXI transaction id holds information about the initiator which caused
the page fault. In the future it will be translated automatically by
driver to an initiator name.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# c89d19f7 11-Jan-2023 Dani Liberman <dliberman@habana.ai>

habanalabe/gaudi2: add cfg base when displaying razwi addresses

Captured addresses of low b/w razwi information contains only the
offset from the cfg base. To make it more user readable, add the cfg
base to it.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# c21f9f34 05-Jan-2023 Dani Liberman <dliberman@habana.ai>

habanalabs/gaudi2: read mmio razwi information

In gaudi2 there night be different routers for low b/w and high b/w
transactions. But in the code that collects razwi information, we used
the same router for high b/w and low b/w.

Fixed it by reading the information also from low b/w routers.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# eaca606e 03-Jan-2023 Dani Liberman <dliberman@habana.ai>

habanalabs/gaudi2: remove use of razwi info received from f/w

Because f/w does not update razwi info when sending events, remove the
use of it.
The driver is responsible to check if razwi happened and to
collect razwi data.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 200f3cf0 04-Jan-2023 Carmit Carmel <ccarmel@habana.ai>

habanalabs/gaudi2: fix log for sob value overflow/underflow

The value in SM_SEI_CAUSE includes the SOB index and not the SOB group
index.
Remove usage of log_mask in sm_sei_cause structure as it was never
used.

Signed-off-by: Carmit Carmel <ccarmel@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# ab509d81 02-Jan-2023 Ohad Sharabi <osharabi@habana.ai>

habanalabs: add set engines masks ASIC function

This function shall be used whenever components enable/binning masks
should be updated.

Usage is in one of the below cases:
- update user (or default) component masks
- update when getting the masks from FW (either CPUCP or COMMS)

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# 20faaeec 18-Dec-2022 Ohad Sharabi <osharabi@habana.ai>

habanalabs: add uapi to flush inbound HBM transactions

When doing p2p with a NIC device, the NIC needs to make sure all the
writes to the HBM (through the PCI bar of the Gaudi device) were
flushed.

It can be done by either the NIC or the host reading through the PCI
bar.

To support the host side, we supply a simple uapi to perform this flush
through the driver, because the user can't create such a transaction
by itself (the PCI bar isn't exposed to normal users).

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


# e65e175b 26-Dec-2022 Oded Gabbay <ogabbay@kernel.org>

habanalabs: move driver to accel subsystem

Now that we have a subsystem for compute accelerators, move the
habanalabs driver to it.

This patch only moves the files and fixes the Makefiles. Future
patches will change the existing code to register to the accel
subsystem and expose the accel device char files instead of the
habanalabs device char files.

Update the MAINTAINERS file to reflect this change.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>