History log of /linux-master/drivers/net/ethernet/mellanox/mlx5/core/mlx5_irq.h
Revision Date Author Comments
# a1772de7 11-Jun-2023 Maher Sanalla <msanalla@nvidia.com>

net/mlx5: Refactor completion IRQ request/release API

Introduce a per-vector completion IRQ request API that requests a
single IRQ for a given vector index, instead of an API that requests
multiple IRQs at once. On driver load, loop over all completion
vectors and request an IRQ for each one via the newly introduced API.

Symmetrically, introduce an IRQ release API per vector. On driver
unload, loop over all vectors and release each completion IRQ via
the new per-vector API.

As IRQ vectors will be requested dynamically later in the patchset,
add a cpumask of the bound CPUs to avoid possibly mapping two IRQs of
the same device to the same CPU.
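
A minimal sketch of what such a per-vector flow could look like; the
helper names (comp_irq_request_one()/comp_irq_release_one()) and the
vector count are illustrative, not the exact driver symbols:

  /* Request one completion IRQ per vector on load; release them
   * symmetrically on unload, and unwind on mid-loop failure. */
  static int comp_irqs_request(struct mlx5_core_dev *dev, int nvec)
  {
          int err, i;

          for (i = 0; i < nvec; i++) {
                  err = comp_irq_request_one(dev, i);
                  if (err)
                          goto err_unwind;
          }
          return 0;

  err_unwind:
          while (i--)
                  comp_irq_release_one(dev, i);
          return err;
  }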

Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 9c2d0801 13-Apr-2023 Shay Drory <shayd@nvidia.com>

net/mlx5: Free irqs only on shutdown callback

Whenever a shutdown is invoked, free the IRQs only and keep the
mlx5_irq synthetic wrapper intact, in order to avoid a use-after-free
on system shutdown. A sketch of the fix's shape follows the example
trace below.

For example:
==================================================================
BUG: KASAN: use-after-free in _find_first_bit+0x66/0x80
Read of size 8 at addr ffff88823fc0d318 by task kworker/u192:0/13608

CPU: 25 PID: 13608 Comm: kworker/u192:0 Tainted: G B W O 6.1.21-cloudflare-kasan-2023.3.21 #1
Hardware name: GIGABYTE R162-R2-GEN0/MZ12-HD2-CD, BIOS R14 05/03/2021
Workqueue: mlx5e mlx5e_tx_timeout_work [mlx5_core]
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x48
print_report+0x170/0x473
? _find_first_bit+0x66/0x80
kasan_report+0xad/0x130
? _find_first_bit+0x66/0x80
_find_first_bit+0x66/0x80
mlx5e_open_channels+0x3c5/0x3a10 [mlx5_core]
? console_unlock+0x2fa/0x430
? _raw_spin_lock_irqsave+0x8d/0xf0
? _raw_spin_unlock_irqrestore+0x42/0x80
? preempt_count_add+0x7d/0x150
? __wake_up_klogd.part.0+0x7d/0xc0
? vprintk_emit+0xfe/0x2c0
? mlx5e_trigger_napi_sched+0x40/0x40 [mlx5_core]
? dev_attr_show.cold+0x35/0x35
? devlink_health_do_dump.part.0+0x174/0x340
? devlink_health_report+0x504/0x810
? mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core]
? mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core]
? process_one_work+0x680/0x1050
mlx5e_safe_switch_params+0x156/0x220 [mlx5_core]
? mlx5e_switch_priv_channels+0x310/0x310 [mlx5_core]
? mlx5_eq_poll_irq_disabled+0xb6/0x100 [mlx5_core]
mlx5e_tx_reporter_timeout_recover+0x123/0x240 [mlx5_core]
? __mutex_unlock_slowpath.constprop.0+0x2b0/0x2b0
devlink_health_reporter_recover+0xa6/0x1f0
devlink_health_report+0x2f7/0x810
? vsnprintf+0x854/0x15e0
mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core]
? mlx5e_reporter_tx_err_cqe+0x1a0/0x1a0 [mlx5_core]
? mlx5e_tx_reporter_timeout_dump+0x50/0x50 [mlx5_core]
? mlx5e_tx_reporter_dump_sq+0x260/0x260 [mlx5_core]
? newidle_balance+0x9b7/0xe30
? psi_group_change+0x6a7/0xb80
? mutex_lock+0x96/0xf0
? __mutex_lock_slowpath+0x10/0x10
mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core]
process_one_work+0x680/0x1050
worker_thread+0x5a0/0xeb0
? process_one_work+0x1050/0x1050
kthread+0x2a2/0x340
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x22/0x30
</TASK>

Freed by task 1:
kasan_save_stack+0x23/0x50
kasan_set_track+0x21/0x30
kasan_save_free_info+0x2a/0x40
____kasan_slab_free+0x169/0x1d0
slab_free_freelist_hook+0xd2/0x190
__kmem_cache_free+0x1a1/0x2f0
irq_pool_free+0x138/0x200 [mlx5_core]
mlx5_irq_table_destroy+0xf6/0x170 [mlx5_core]
mlx5_core_eq_free_irqs+0x74/0xf0 [mlx5_core]
shutdown+0x194/0x1aa [mlx5_core]
pci_device_shutdown+0x75/0x120
device_shutdown+0x35c/0x620
kernel_restart+0x60/0xa0
__do_sys_reboot+0x1cb/0x2c0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x4b/0xb5

The buggy address belongs to the object at ffff88823fc0d300
which belongs to the cache kmalloc-192 of size 192
The buggy address is located 24 bytes inside of
192-byte region [ffff88823fc0d300, ffff88823fc0d3c0)

The buggy address belongs to the physical page:
page:0000000010139587 refcount:1 mapcount:0 mapping:0000000000000000
index:0x0 pfn:0x23fc0c
head:0000000010139587 order:1 compound_mapcount:0 compound_pincount:0
flags: 0x2ffff800010200(slab|head|node=0|zone=2|lastcpupid=0x1ffff)
raw: 002ffff800010200 0000000000000000 dead000000000122 ffff88810004ca00
raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff88823fc0d200: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88823fc0d280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff88823fc0d300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88823fc0d380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
ffff88823fc0d400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
general protection fault, probably for non-canonical address
0xdffffc005c40d7ac: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: probably user-memory-access in range [0x00000002e206bd60-0x00000002e206bd67]
CPU: 25 PID: 13608 Comm: kworker/u192:0 Tainted: G B W O 6.1.21-cloudflare-kasan-2023.3.21 #1
Hardware name: GIGABYTE R162-R2-GEN0/MZ12-HD2-CD, BIOS R14 05/03/2021
Workqueue: mlx5e mlx5e_tx_timeout_work [mlx5_core]
RIP: 0010:__alloc_pages+0x141/0x5c0
Call Trace:
<TASK>
? sysvec_apic_timer_interrupt+0xa0/0xc0
? asm_sysvec_apic_timer_interrupt+0x16/0x20
? __alloc_pages_slowpath.constprop.0+0x1ec0/0x1ec0
? _raw_spin_unlock_irqrestore+0x3d/0x80
__kmalloc_large_node+0x80/0x120
? kvmalloc_node+0x4e/0x170
__kmalloc_node+0xd4/0x150
kvmalloc_node+0x4e/0x170
mlx5e_open_channels+0x631/0x3a10 [mlx5_core]
? console_unlock+0x2fa/0x430
? _raw_spin_lock_irqsave+0x8d/0xf0
? _raw_spin_unlock_irqrestore+0x42/0x80
? preempt_count_add+0x7d/0x150
? __wake_up_klogd.part.0+0x7d/0xc0
? vprintk_emit+0xfe/0x2c0
? mlx5e_trigger_napi_sched+0x40/0x40 [mlx5_core]
? dev_attr_show.cold+0x35/0x35
? devlink_health_do_dump.part.0+0x174/0x340
? devlink_health_report+0x504/0x810
? mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core]
? mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core]
? process_one_work+0x680/0x1050
mlx5e_safe_switch_params+0x156/0x220 [mlx5_core]
? mlx5e_switch_priv_channels+0x310/0x310 [mlx5_core]
? mlx5_eq_poll_irq_disabled+0xb6/0x100 [mlx5_core]
mlx5e_tx_reporter_timeout_recover+0x123/0x240 [mlx5_core]
? __mutex_unlock_slowpath.constprop.0+0x2b0/0x2b0
devlink_health_reporter_recover+0xa6/0x1f0
devlink_health_report+0x2f7/0x810
? vsnprintf+0x854/0x15e0
mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core]
? mlx5e_reporter_tx_err_cqe+0x1a0/0x1a0 [mlx5_core]
? mlx5e_tx_reporter_timeout_dump+0x50/0x50 [mlx5_core]
? mlx5e_tx_reporter_dump_sq+0x260/0x260 [mlx5_core]
? newidle_balance+0x9b7/0xe30
? psi_group_change+0x6a7/0xb80
? mutex_lock+0x96/0xf0
? __mutex_lock_slowpath+0x10/0x10
mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core]
process_one_work+0x680/0x1050
worker_thread+0x5a0/0xeb0
? process_one_work+0x1050/0x1050
kthread+0x2a2/0x340
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x22/0x30
</TASK>
---[ end trace 0000000000000000 ]---
RIP: 0010:__alloc_pages+0x141/0x5c0
Code: e0 39 a3 96 89 e9 b8 22 01 32 01 83 e1 0f 48 89 fa 01 c9 48 c1 ea
03 d3 f8 83 e0 03 89 44 24 6c 48 b8 00 00 00 00 00 fc ff df <80> 3c 02
00 0f 85 fc 03 00 00 89 e8 4a 8b 14 f5 e0 39 a3 96 4c 89
RSP: 0018:ffff888251f0f438 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 1ffff1104a3e1e8b RCX: 0000000000000000
RDX: 000000005c40d7ac RSI: 0000000000000003 RDI: 00000002e206bd60
RBP: 0000000000052dc0 R08: ffff8882b0044218 R09: ffff8882b0045e8a
R10: fffffbfff300fefc R11: ffff888167af4000 R12: 0000000000000003
R13: 0000000000000000 R14: 00000000696c7070 R15: ffff8882373f4380
FS: 0000000000000000(0000) GS:ffff88bf2be80000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005641d031eee8 CR3: 0000002e7ca14000 CR4: 0000000000350ee0
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range:
0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Fatal exception ]---
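
A hedged sketch of the shape of the fix; the helper and field names
below are illustrative:

  /* On the PCI shutdown callback, release only the kernel IRQs. The
   * mlx5_irq wrappers stay allocated, so a late user such as a queued
   * tx-timeout work cannot dereference freed memory. */
  static void irq_pool_free_irqs_only(struct mlx5_irq_pool *pool)
  {
          struct mlx5_irq *irq;
          unsigned long index;

          xa_for_each(&pool->irqs, index, irq)
                  free_irq(irq->map.virq, &irq->nh);
          /* no kfree(irq) here - wrappers are freed on remove() only */
  }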

Reported-by: Frederick Lawler <fred@cloudflare.com>
Link: https://lore.kernel.org/netdev/be5b9271-7507-19c5-ded1-fa78f1980e69@cloudflare.com
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 3354822c 31-Dec-2022 Eli Cohen <elic@nvidia.com>

net/mlx5: Use dynamic msix vectors allocation

The current implementation calculates the number and the partitioning
of the available interrupt vectors and then allocates all of them up
front.

Here, whenever dynamic MSI-X allocation is supported, we change this to
use MSI-X vectors dynamically, so a vector is actually allocated only
when needed. The current pool logic is kept in place to take care of
partitioning the vectors between the consumers and of reference
counting; however, the vectors themselves are allocated on demand.

Subsequent patches will make use of this to allocate vectors for VDPA.
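
For reference, the dynamic allocation can lean on the PCI core's
per-vector MSI-X API; a minimal sketch, assuming pdev and af_desc are
already set up:

  /* Allocate a single MSI-X vector on demand; free it when the last
   * consumer releases it. */
  struct msi_map map;

  map = pci_msix_alloc_irq_at(pdev, MSI_ANY_INDEX, &af_desc);
  if (map.index < 0)
          return map.index;       /* dynamic allocation failed */
  /* map.virq is a Linux IRQ number, usable with request_irq() */

  pci_msix_free_irq(pdev, map);   /* on release of the last reference */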

Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>


# bbac70c7 29-Dec-2022 Eli Cohen <elic@nvidia.com>

net/mlx5: Use newer affinity descriptor

Use the more refined struct irq_affinity_desc to describe the required
IRQ affinity. For async IRQs, request unmanaged affinity; for
completion queues, use managed affinity.

No functional changes are introduced. This will be used in a subsequent
patch that switches to dynamic MSI-X allocation.
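
struct irq_affinity_desc is the existing kernel type; the usage below
is an illustrative sketch of the managed/unmanaged split:

  struct irq_affinity_desc af_desc = {};

  cpumask_copy(&af_desc.mask, cpu_online_mask);
  af_desc.is_managed = 0;  /* async IRQs: unmanaged, may be rebalanced */
  /* ... */
  af_desc.is_managed = 1;  /* completion IRQs: managed, affinity fixed */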

Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>


# 061f5b23 22-Nov-2021 Shay Drory <shayd@nvidia.com>

net/mlx5: SF, Use all available cpu for setting cpu affinity

Currently, all SFs use the same CPUs. Spread the SFs over the available
CPUs in a round-robin manner in order to achieve a better distribution
of SFs across CPUs.
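
A hedged sketch of such round-robin spreading; pool->last_cpu and the
helper name are illustrative:

  /* Hand each new SF IRQ the next online CPU, wrapping at the end. */
  static int sf_next_cpu(struct mlx5_irq_pool *pool)
  {
          int cpu = cpumask_next(pool->last_cpu, cpu_online_mask);

          if (cpu >= nr_cpu_ids)
                  cpu = cpumask_first(cpu_online_mask);
          pool->last_cpu = cpu;
          return cpu;
  }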

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 79b60ca8 12-Dec-2021 Shay Drory <shayd@nvidia.com>

net/mlx5: Introduce API for bulk request and release of IRQs

Currently, IRQs are requested one by one. Balancing the spread of IRQs
among CPUs under such a scheme requires remembering the cpumask of the
CPUs already used by a given device, which would complicate the IRQ
allocation scheme in a subsequent patch.

Hence, prepare the code for bulk IRQ allocation; a sketch of the
resulting API follows below. This enables spreading IRQs among CPUs in
a subsequent patch.
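
A hedged sketch of how such a bulk API could be declared; the exact
signatures are illustrative:

  /* Request nirqs vectors in one call, so the allocator can spread
   * them over CPUs without the caller tracking a cpumask; release
   * them symmetrically. */
  int mlx5_irqs_request_vectors(struct mlx5_core_dev *dev, u16 *cpus,
                                int nirqs, struct mlx5_irq **irqs);
  void mlx5_irqs_release_vectors(struct mlx5_irq **irqs, int nirqs);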

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 424544df 23-Nov-2021 Shay Drory <shayd@nvidia.com>

net/mlx5: Split irq_pool_affinity logic to new file

The downstream patches add more functionality to irq_pool_affinity.
Move the irq_pool_affinity logic to a new file in order to ease its
coding and maintenance.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 5256a46b 14-Nov-2021 Shay Drory <shayd@nvidia.com>

net/mlx5: Introduce control IRQ request API

Currently, the IRQ layer has separate flows for control and completion
IRQs, and the distinction between them is made inside the IRQ layer.

In order to ease the coding and maintenance of the IRQ layer,
introduce a new API for requesting control IRQs -
mlx5_ctrl_irq_request(struct mlx5_core_dev *dev).
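
mlx5_ctrl_irq_request() is the entry point named above; the release
counterpart and the call site below are an illustrative sketch:

  struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev);
  void mlx5_ctrl_irq_release(struct mlx5_irq *ctrl_irq);

  /* e.g., from the async EQ setup path: */
  struct mlx5_irq *irq = mlx5_ctrl_irq_request(dev);

  if (IS_ERR(irq))
          return PTR_ERR(irq);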

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 3663ad34 19-Aug-2021 Shay Drory <shayd@nvidia.com>

net/mlx5: Shift control IRQ to the last index

The control IRQ is currently the first IRQ vector, which complicates
the handling of completion IRQs since they must be offset by one.
In the next patch there are scenarios where completion and control EQs
will share the same IRQ, for example on functions with a single IRQ. To
ease such scenarios, shift the control IRQ to the end of the IRQ array.
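
The index arithmetic boils down to the following sketch (variable
names illustrative):

  /* Before: ctrl IRQ at index 0, completion vector i at index i + 1.
   * After:  completion vector i maps directly to index i, and the
   *         ctrl IRQ takes the last slot. */
  ctrl_irq_idx = max_irqs - 1;
  comp_irq_idx = vecidx;          /* no "+ 1" offset anymore */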

Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# c36326d3 23-Feb-2021 Shay Drory <shayd@nvidia.com>

net/mlx5: Round-Robin EQs over IRQs

Whenever a user provides an affinity for an EQ creation request, map
the EQ to a matching IRQ, i.e. an IRQ with the same affinity and type
(completion/control) as the EQ being created.

This mapping is done with an aggressive dedicated-IRQ allocation
scheme, described below and sketched after this list.

First, check whether there is a matching IRQ whose min threshold is not
yet exhausted:
- min_eqs_threshold = 3 for a control EQ.
- min_eqs_threshold = 1 for a completion EQ.
If no matching IRQ is found, try to request a new IRQ. If a new IRQ
cannot be requested, reuse the least-used matching IRQ.
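
A hedged sketch of that selection order; the helper names are
illustrative:

  static struct mlx5_irq *irq_pick(struct mlx5_irq_pool *pool,
                                   const struct cpumask *affinity)
  {
          struct mlx5_irq *irq;

          /* 1: matching IRQ whose min threshold is not exhausted */
          irq = irq_find_matching_under_threshold(pool, affinity);
          if (irq)
                  return irq;
          /* 2: otherwise, try to request a new IRQ */
          irq = irq_request_new(pool, affinity);
          if (!IS_ERR(irq))
                  return irq;
          /* 3: otherwise, reuse the least-used matching IRQ */
          return irq_find_least_used(pool, affinity);
  }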

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 71e084e2 23-Feb-2021 Shay Drory <shayd@nvidia.com>

net/mlx5: Allocating a pool of MSI-X vectors for SFs

SFs (Sub Functions) currently use IRQs from the global IRQ table of
their parent Physical Function. In order to scale better, we need to
allocate more IRQs and share them between different SFs.

The driver will maintain 3 separate IRQ pools:
1. A pool that serves the PF consumers (the PF's netdev and rdma
stacks), similar to what the driver had before this patch. I.e., this
pool shares IRQs between rdma and netdev and keeps the IRQ indexes and
allocation order. The latter is important for the PF netdev rmap
(aRFS).

2. A pool of control IRQs for SFs. The size of this pool is the number
of SFs that can be created divided by SFS_PER_IRQ. This pool serves the
control-path EQs of the SFs.

3. A pool of completion data-path IRQs for SF transport queues. The
size of this pool is:
num_irqs_allocated - pf_pool_size - sf_ctrl_pool_size.
This pool serves the netdev and rdma stacks (see the sizing sketch
below). Moreover, rmap is not supported on SFs.

The sharing methodology of the SF pools is explained in the next patch.

Important note: rmap is not supported on SFs because rmap mapping
cannot function correctly for IRQs that are shared across different
core/netdev RX rings.
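
A hedged sketch of the sizing above (variable names illustrative):

  pf_pool_size      = pf_vectors;               /* pool 1 */
  sf_ctrl_pool_size = max_sfs / SFS_PER_IRQ;    /* pool 2 */
  sf_comp_pool_size = num_irqs_allocated -
                      pf_pool_size - sf_ctrl_pool_size;  /* pool 3 */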

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# fc63dd2a 23-Feb-2021 Shay Drory <shayd@nvidia.com>

net/mlx5: Change IRQ storage logic from static to dynamic

Store newly created IRQs in the xarray DB instead of a static array,
so we will be able to store only IRQs which are being used.
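
A minimal sketch using the stock xarray API, assuming the pool keeps
its IRQ wrappers keyed by vector index:

  err = xa_err(xa_store(&pool->irqs, vecidx, irq, GFP_KERNEL));
  if (err)
          return err;

  irq = xa_load(&pool->irqs, vecidx);   /* lookup */
  xa_erase(&pool->irqs, vecidx);        /* removal on release */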

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 2d74524c 23-Feb-2021 Shay Drory <shayd@nvidia.com>

net/mlx5: Moving rmap logic to EQs

IRQs are being simplified in order to ease their sharing, and any
feature-specific object is moved to an upper layer. Hence, move the
rmap object into eq_table.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# e4e3f24b 23-Feb-2021 Leon Romanovsky <leon@kernel.org>

net/mlx5: Provide cpumask at EQ creation phase

The users of EQs run their code on different CPUs and with various
affinity patterns. Move the cpumask setting close to its actual usage.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 3b43190b 06-Apr-2021 Shay Drory <shayd@nvidia.com>

net/mlx5: Introduce API for request and release IRQs

Introduce a new API that allows IRQ users to hold a pointer to a
mlx5_irq (sketched below).
At the end of this series, IRQs will be allocated on demand; hence,
this will allow us to properly manage and use IRQs.
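
A hedged sketch of the resulting ownership model; the exact signatures
are illustrative:

  /* Consumers obtain and hold a mlx5_irq pointer, and release it
   * explicitly when done. */
  struct mlx5_irq *mlx5_irq_request(struct mlx5_core_dev *dev, u16 vecidx,
                                    struct cpumask *affinity);
  void mlx5_irq_release(struct mlx5_irq *irq);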

Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>