History log of /linux-master/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
Revision Date Author Comments
# b430c1b4 12-Oct-2023 Shay Drory <shayd@nvidia.com>

net/mlx5: Replace global mlx5_intf_lock with HCA devcom component lock

mlx5_intf_lock is used to sync between LAG changes and changes to the
aux devices of its slave mlx5 core devs, which means that every time an
mlx5 core dev adds/removes aux devices, mlx5 takes this global lock, even
if LAG functionality isn't supported over the core dev.
This causes a bottleneck when probing VFs/SFs in parallel.

Hence, replace mlx5_intf_lock with HCA devcom component lock, or no
lock if LAG functionality isn't supported.
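
A minimal user-space sketch of the locking idea, written in Python rather
than kernel C. Devices that support LAG share one lock per HCA, and devices
without LAG support take no lock at all. The names Device, hca_id,
lag_supported and component_lock are hypothetical and do not correspond to
mlx5 symbols.

```python
import threading
from contextlib import nullcontext
from dataclasses import dataclass

# One lock per physical HCA instead of a single global lock.
_hca_locks = {}
_registry_guard = threading.Lock()


@dataclass(frozen=True)
class Device:
    name: str
    hca_id: str          # assumed identifier shared by functions of one HCA
    lag_supported: bool


def component_lock(dev):
    """Return the context manager to hold while changing aux devices."""
    if not dev.lag_supported:
        # No LAG on this device: nothing to synchronise with, so no lock.
        return nullcontext()
    with _registry_guard:
        return _hca_locks.setdefault(dev.hca_id, threading.Lock())


def rescan_aux_devices(dev, log):
    """Model of an aux device add/remove done under the per-HCA lock."""
    with component_lock(dev):
        log.append(f"rescan {dev.name}")
```

In this model, probing many VFs/SFs in parallel contends only with devices
on the same HCA that actually support LAG.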

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 02ceda65 13-Jun-2023 Roi Dayan <roid@nvidia.com>

net/mlx5: Use shared code for checking lag is supported

Move the shared function that checks whether lag is supported to the lag
header file.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 8ec91f5d 22-May-2023 Roi Dayan <roid@nvidia.com>

net/mlx5: Lag, Remove duplicate code checking lag is supported

Remove duplicate function for checking if device has lag support.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# c83e6ab9 06-Jun-2023 Shay Drory <shayd@nvidia.com>

net/mlx5: LAG, change mlx5_shared_fdb_supported() to static

mlx5_shared_fdb_supported() is used only in a single file. Change the
function to be static.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 27f9e0cc 05-Dec-2022 Mark Bloch <mbloch@nvidia.com>

net/mlx5: Lag, Add single RDMA device in multiport mode

In MultiPort E-Switch mode a single RDMA device is created. This device has multiple
RDMA ports that represent the uplink ports that are connected to the E-Switch.
Account for this when creating the RDMA device so it has an additional port for
the non-native uplink.

As a side effect of this patch, use shared fdb in multiport eswitch mode.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# a32327a3 28-Nov-2022 Roi Dayan <roid@nvidia.com>

net/mlx5: Lag, Control MultiPort E-Switch single FDB mode

MultiPort E-Switch builds on newer hardware's capabilities and introduces
a mode where a single E-Switch is used and all the vports and physical
ports on the NIC are connected to it.

The new mode will allow in the future a decrease in the memory used by the
driver and advanced features that aren't possible today.

This represents a big change in the current E-Switch implementation in mlx5.
Currently, by default, each E-Switch manager manages its E-Switch.
Steering rules in each E-Switch can only forward traffic to the native
physical port associated with that E-Switch. While there are ways to target
non-native physical ports, for example using a bond or via special TC
rules, none of them allows a user to configure the driver
to operate by default in such a mode, nor can the driver decide
to move to this mode by default, as it's user configuration-driven right now.

While MultiPort E-Switch single FDB mode is the preferred mode, older
generations of ConnectX hardware couldn't support this mode so it was never
implemented. Now that there is capable hardware present, start the
transition to having this mode by default.

Introduce a devlink parameter to control MultiPort E-Switch single FDB mode.
This will allow users to select this mode on their system right now
and in the future will allow the driver to move to this mode by default.

Example:
$ devlink dev param set pci/0000:00:0b.0 name esw_multiport value 1 \
cmode runtime

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 199abf33 29-Nov-2022 Roi Dayan <roid@nvidia.com>

net/mlx5: Lag, Move mpesw related definitions to mpesw.h

mpesw definitions should be in mpesw.h and not lag.h.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 0d4e8ed1 15-Aug-2022 Eli Cohen <elic@nvidia.com>

net/mlx5: Lag, avoid lockdep warnings

ldev->lock is used to serialize lag change operations. Since multiport
eswitch functionality was added, we now change the mode dynamically.
However, acquiring ldev->lock is not allowed as it could possibly lead
to a deadlock as reported by the lockdep mechanism.

[ 836.154963] WARNING: possible circular locking dependency detected
[ 836.155850] 5.19.0-rc5_net_56b7df2 #1 Not tainted
[ 836.156549] ------------------------------------------------------
[ 836.157418] handler1/12198 is trying to acquire lock:
[ 836.158178] ffff888187d52b58 (&ldev->lock){+.+.}-{3:3}, at: mlx5_lag_do_mirred+0x3b/0x70 [mlx5_core]
[ 836.159575]
[ 836.159575] but task is already holding lock:
[ 836.160474] ffff8881d4de2930 (&block->cb_lock){++++}-{3:3}, at: tc_setup_cb_add+0x5b/0x200
[ 836.161669] which lock already depends on the new lock.
[ 836.162905]
[ 836.162905] the existing dependency chain (in reverse order) is:
[ 836.164008] -> #3 (&block->cb_lock){++++}-{3:3}:
[ 836.164946] down_write+0x25/0x60
[ 836.165548] tcf_block_get_ext+0x1c6/0x5d0
[ 836.166253] ingress_init+0x74/0xa0 [sch_ingress]
[ 836.167028] qdisc_create.constprop.0+0x130/0x5e0
[ 836.167805] tc_modify_qdisc+0x481/0x9f0
[ 836.168490] rtnetlink_rcv_msg+0x16e/0x5a0
[ 836.169189] netlink_rcv_skb+0x4e/0xf0
[ 836.169861] netlink_unicast+0x190/0x250
[ 836.170543] netlink_sendmsg+0x243/0x4b0
[ 836.171226] sock_sendmsg+0x33/0x40
[ 836.171860] ____sys_sendmsg+0x1d1/0x1f0
[ 836.172535] ___sys_sendmsg+0xab/0xf0
[ 836.173183] __sys_sendmsg+0x51/0x90
[ 836.173836] do_syscall_64+0x3d/0x90
[ 836.174471] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 836.175282]

[ 836.175282] -> #2 (rtnl_mutex){+.+.}-{3:3}:
[ 836.176190] __mutex_lock+0x6b/0xf80
[ 836.176830] register_netdevice_notifier+0x21/0x120
[ 836.177631] rtnetlink_init+0x2d/0x1e9
[ 836.178289] netlink_proto_init+0x163/0x179
[ 836.178994] do_one_initcall+0x63/0x300
[ 836.179672] kernel_init_freeable+0x2cb/0x31b
[ 836.180403] kernel_init+0x17/0x140
[ 836.181035] ret_from_fork+0x1f/0x30

[ 836.181687] -> #1 (pernet_ops_rwsem){+.+.}-{3:3}:
[ 836.182628] down_write+0x25/0x60
[ 836.183235] unregister_netdevice_notifier+0x1c/0xb0
[ 836.184029] mlx5_ib_roce_cleanup+0x94/0x120 [mlx5_ib]
[ 836.184855] __mlx5_ib_remove+0x35/0x60 [mlx5_ib]
[ 836.185637] mlx5_eswitch_unregister_vport_reps+0x22f/0x440 [mlx5_core]
[ 836.186698] auxiliary_bus_remove+0x18/0x30
[ 836.187409] device_release_driver_internal+0x1f6/0x270
[ 836.188253] bus_remove_device+0xef/0x160
[ 836.188939] device_del+0x18b/0x3f0
[ 836.189562] mlx5_rescan_drivers_locked+0xd6/0x2d0 [mlx5_core]
[ 836.190516] mlx5_lag_remove_devices+0x69/0xe0 [mlx5_core]
[ 836.191414] mlx5_do_bond_work+0x441/0x620 [mlx5_core]
[ 836.192278] process_one_work+0x25c/0x590
[ 836.192963] worker_thread+0x4f/0x3d0
[ 836.193609] kthread+0xcb/0xf0
[ 836.194189] ret_from_fork+0x1f/0x30

[ 836.194826] -> #0 (&ldev->lock){+.+.}-{3:3}:
[ 836.195734] __lock_acquire+0x15b8/0x2a10
[ 836.196426] lock_acquire+0xce/0x2d0
[ 836.197057] __mutex_lock+0x6b/0xf80
[ 836.197708] mlx5_lag_do_mirred+0x3b/0x70 [mlx5_core]
[ 836.198575] tc_act_parse_mirred+0x25b/0x800 [mlx5_core]
[ 836.199467] parse_tc_actions+0x168/0x5a0 [mlx5_core]
[ 836.200340] __mlx5e_add_fdb_flow+0x263/0x480 [mlx5_core]
[ 836.201241] mlx5e_configure_flower+0x8a0/0x1820 [mlx5_core]
[ 836.202187] tc_setup_cb_add+0xd7/0x200
[ 836.202856] fl_hw_replace_filter+0x14c/0x1f0 [cls_flower]
[ 836.203739] fl_change+0xbbe/0x1730 [cls_flower]
[ 836.204501] tc_new_tfilter+0x407/0xd90
[ 836.205168] rtnetlink_rcv_msg+0x406/0x5a0
[ 836.205877] netlink_rcv_skb+0x4e/0xf0
[ 836.206535] netlink_unicast+0x190/0x250
[ 836.207217] netlink_sendmsg+0x243/0x4b0
[ 836.207915] sock_sendmsg+0x33/0x40
[ 836.208538] ____sys_sendmsg+0x1d1/0x1f0
[ 836.209219] ___sys_sendmsg+0xab/0xf0
[ 836.209878] __sys_sendmsg+0x51/0x90
[ 836.210510] do_syscall_64+0x3d/0x90
[ 836.211137] entry_SYSCALL_64_after_hwframe+0x46/0xb0

[ 836.211954] other info that might help us debug this:
[ 836.213174] Chain exists of:
[ 836.213174] &ldev->lock --> rtnl_mutex --> &block->cb_lock
[ 836.214650] Possible unsafe locking scenario:
[ 836.214650]
[ 836.215574] CPU0 CPU1
[ 836.216255] ---- ----
[ 836.216943] lock(&block->cb_lock);
[ 836.217518] lock(rtnl_mutex);
[ 836.218348] lock(&block->cb_lock);
[ 836.219212] lock(&ldev->lock);
[ 836.219758]
[ 836.219758] *** DEADLOCK ***
[ 836.219758]
[ 836.220747] 2 locks held by handler1/12198:
[ 836.221390] #0: ffff8881d4de2930 (&block->cb_lock){++++}-{3:3}, at: tc_setup_cb_add+0x5b/0x200
[ 836.222646] #1: ffff88810c9a92c0 (&esw->mode_lock){++++}-{3:3}, at: mlx5_esw_hold+0x39/0x50 [mlx5_core]

[ 836.224063] stack backtrace:
[ 836.224799] CPU: 6 PID: 12198 Comm: handler1 Not tainted 5.19.0-rc5_net_56b7df2 #1
[ 836.225923] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 836.227476] Call Trace:
[ 836.227929] <TASK>
[ 836.228332] dump_stack_lvl+0x57/0x7d
[ 836.228924] check_noncircular+0x104/0x120
[ 836.229562] __lock_acquire+0x15b8/0x2a10
[ 836.230201] lock_acquire+0xce/0x2d0
[ 836.230776] ? mlx5_lag_do_mirred+0x3b/0x70 [mlx5_core]
[ 836.231614] ? find_held_lock+0x2b/0x80
[ 836.232221] __mutex_lock+0x6b/0xf80
[ 836.232799] ? mlx5_lag_do_mirred+0x3b/0x70 [mlx5_core]
[ 836.233636] ? mlx5_lag_do_mirred+0x3b/0x70 [mlx5_core]
[ 836.234451] ? xa_load+0xc3/0x190
[ 836.234995] mlx5_lag_do_mirred+0x3b/0x70 [mlx5_core]
[ 836.235803] tc_act_parse_mirred+0x25b/0x800 [mlx5_core]
[ 836.236636] ? tc_act_can_offload_mirred+0x135/0x210 [mlx5_core]
[ 836.237550] parse_tc_actions+0x168/0x5a0 [mlx5_core]
[ 836.238364] __mlx5e_add_fdb_flow+0x263/0x480 [mlx5_core]
[ 836.239202] mlx5e_configure_flower+0x8a0/0x1820 [mlx5_core]
[ 836.240076] ? lock_acquire+0xce/0x2d0
[ 836.240668] ? tc_setup_cb_add+0x5b/0x200
[ 836.241294] tc_setup_cb_add+0xd7/0x200
[ 836.241917] fl_hw_replace_filter+0x14c/0x1f0 [cls_flower]
[ 836.242709] fl_change+0xbbe/0x1730 [cls_flower]
[ 836.243408] tc_new_tfilter+0x407/0xd90
[ 836.244043] ? tc_del_tfilter+0x880/0x880
[ 836.244672] rtnetlink_rcv_msg+0x406/0x5a0
[ 836.245310] ? netlink_deliver_tap+0x7a/0x4b0
[ 836.245991] ? if_nlmsg_stats_size+0x2b0/0x2b0
[ 836.246675] netlink_rcv_skb+0x4e/0xf0
[ 836.258046] netlink_unicast+0x190/0x250
[ 836.258669] netlink_sendmsg+0x243/0x4b0
[ 836.259288] sock_sendmsg+0x33/0x40
[ 836.259857] ____sys_sendmsg+0x1d1/0x1f0
[ 836.260473] ___sys_sendmsg+0xab/0xf0
[ 836.261064] ? lock_acquire+0xce/0x2d0
[ 836.261669] ? find_held_lock+0x2b/0x80
[ 836.262272] ? __fget_files+0xb9/0x190
[ 836.262871] ? __fget_files+0xd3/0x190
[ 836.263462] __sys_sendmsg+0x51/0x90
[ 836.264064] do_syscall_64+0x3d/0x90
[ 836.264652] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 836.265425] RIP: 0033:0x7fdbe5e2677d

[ 836.266012] Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 ba ee
ff ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 2e 00 00 00 0f
05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 44 24 08 e8 ee ee ff ff 48
[ 836.268485] RSP: 002b:00007fdbe48a75a0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
[ 836.269598] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fdbe5e2677d
[ 836.270576] RDX: 0000000000000000 RSI: 00007fdbe48a7640 RDI: 000000000000003c
[ 836.271565] RBP: 00007fdbe48a8368 R08: 0000000000000000 R09: 0000000000000000
[ 836.272546] R10: 00007fdbe48a84b0 R11: 0000000000000293 R12: 0000557bd17dc860
[ 836.273527] R13: 0000000000000000 R14: 0000557bd17dc860 R15: 00007fdbe48a7640

[ 836.274521] </TASK>

To avoid taking ldev->lock in the configure flow, we queue a work item
on the lag workqueue and wait on a completion object.

In addition, we remove the lock from mlx5_lag_do_mirred() since it is
not really protecting anything.

It should be noted that an actual deadlock has not been observed.
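
The queue-and-wait pattern described above can be sketched in user space as
follows. This is a Python model under stated assumptions, not the driver
code: lag_workqueue, ldev_lock, start_lag_worker and run_on_lag_workqueue
are hypothetical stand-ins for the lag workqueue, ldev->lock and a
completion object.

```python
import queue
import threading

lag_workqueue = queue.Queue()   # stand-in for the lag workqueue
ldev_lock = threading.Lock()    # stand-in for ldev->lock


def lag_worker():
    """Worker context: the only place that takes ldev_lock."""
    while True:
        item = lag_workqueue.get()
        if item is None:        # sentinel: stop the worker
            break
        func, result, done = item
        with ldev_lock:
            result.append(func())
        done.set()              # signal completion to the waiter


def start_lag_worker():
    worker = threading.Thread(target=lag_worker, daemon=True)
    worker.start()
    return worker


def run_on_lag_workqueue(func):
    """Called from the configure flow, which never takes ldev_lock itself."""
    result, done = [], threading.Event()
    lag_workqueue.put((func, result, done))
    done.wait()                 # block until the worker has run func
    return result[0]
```

The caller only ever blocks on the completion, so the lock ordering seen
from the configure flow no longer includes the lag lock. The sketch does not
handle a func that raises; a real implementation would need to.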

Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 1afbd1e2 27-May-2022 Liu, Changcheng <jerrliu@nvidia.com>

net/mlx5: Lag, correct get the port select mode str

mode & mode_flags are updated at the end of mlx5_activate_lag, so they
may not reflect the actual mode when mlx5_get_str_port_sel_mode() reads
them, as shown in the logic below:
mlx5_activate_lag(struct mlx5_lag *ldev,
|-- unsigned long flags = 0;
|-- err = mlx5_lag_set_flags(ldev, mode, tracker, shared_fdb, &flags);
|-- err = mlx5_create_lag(ldev, tracker, mode, flags);
|-- mlx5_get_str_port_sel_mode(ldev);
|-- ldev->mode = mode;
|-- ldev->mode_flags = flags;
Use mode & flags as parameters to get port select mode info.
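
The ordering problem can be reproduced with a small Python model. It is only
a sketch: Lag, HASH_BASED, port_sel_mode_str and the "mpesw" string are
hypothetical names chosen for illustration, not the driver's definitions.

```python
from dataclasses import dataclass

HASH_BASED = 1 << 0   # hypothetical mode flag bit


@dataclass
class Lag:
    mode: str = "none"
    mode_flags: int = 0


def port_sel_mode_str(mode, flags):
    """Derive the port select mode string from explicit parameters."""
    if mode == "mpesw":               # assumed special case
        return "mpesw"
    return "hash" if flags & HASH_BASED else "queue_affinity"


def activate_lag(ldev, mode, flags):
    # Reading the stored state here returns the previous mode's string,
    # because the state is only stored at the end of activation.
    stale = port_sel_mode_str(ldev.mode, ldev.mode_flags)
    # Passing the values being activated gives the correct string.
    fresh = port_sel_mode_str(mode, flags)
    ldev.mode, ldev.mode_flags = mode, flags   # state stored last
    return stale, fresh
```

Calling activate_lag(Lag(), "roce", HASH_BASED) in this model yields a stale
string that differs from the fresh one.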

Fixes: 94db33177819 ("net/mlx5: Support multiport eswitch mode")
Signed-off-by: Liu, Changcheng <jerrliu@nvidia.com>
Reviewed-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 4892bd98 23-May-2022 Mark Bloch <mbloch@nvidia.com>

net/mlx5: Lag, decouple FDB selection and shared FDB

Multiport eswitch is required to use native FDB selection instead of
affinity. This was achieved by passing the shared_fdb flag down
the HW lag creation path. While it did accomplish the goal of setting
FDB selection mode to native, it had the side effect of also
creating a shared FDB configuration.

This created a few issues:
- TC rules are inserted into a non active FDB, which means traffic isn't
offloaded as all traffic will reach only a single FDB.
- All wire traffic is treated as if a single physical port received it; while
this is true for a bond configuration, this shouldn't be the case for
multiport eswitch.

Create a new flag MLX5_LAG_MODE_FLAG_FDB_SEL_MODE_NATIVE
to indicate what FDB selection mode should be used.
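
A rough Python sketch of the decoupling: native FDB selection gets its own
flag bit, so requesting it no longer sets the shared FDB bit. The SHARED_FDB
member name, the bit positions and the lag_flags helper are assumptions made
for the example.

```python
from enum import IntFlag


class LagModeFlag(IntFlag):
    SHARED_FDB = 1 << 0            # assumed name and bit position
    FDB_SEL_MODE_NATIVE = 1 << 1   # models MLX5_LAG_MODE_FLAG_FDB_SEL_MODE_NATIVE


def lag_flags(mode, shared_fdb):
    """Compute mode flags; 'mode' is a plain string in this model."""
    flags = LagModeFlag(0)
    if shared_fdb:
        flags |= LagModeFlag.SHARED_FDB
    if mode == "mpesw":
        # Native FDB selection without implying a shared FDB.
        flags |= LagModeFlag.FDB_SEL_MODE_NATIVE
    return flags
```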

Fixes: 94db33177819 ("net/mlx5: Support multiport eswitch mode")
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 3008e6a0 25-May-2022 Mark Bloch <mbloch@nvidia.com>

net/mlx5: E-Switch, pair only capable devices

OFFLOADS pairing using devcom is possible only on devices
that support LAG. Filter based on lag capabilities.

This fixes an issue where mlx5_get_next_phys_dev() was
called without holding the interface lock.

This issue was found when commit
bc4c2f2e0179 ("net/mlx5: Lag, filter non compatible devices")
added an assert that verifies the interface lock is held.

WARNING: CPU: 9 PID: 1706 at drivers/net/ethernet/mellanox/mlx5/core/dev.c:642 mlx5_get_next_phys_dev+0xd2/0x100 [mlx5_core]
Modules linked in: mlx5_vdpa vringh vhost_iotlb vdpa mlx5_ib mlx5_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_umad ib_ipoib ib_cm ib_uverbs ib_core overlay fuse [last unloaded: mlx5_core]
CPU: 9 PID: 1706 Comm: devlink Not tainted 5.18.0-rc7+ #11
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:mlx5_get_next_phys_dev+0xd2/0x100 [mlx5_core]
Code: 02 00 75 48 48 8b 85 80 04 00 00 5d c3 31 c0 5d c3 be ff ff ff ff 48 c7 c7 08 41 5b a0 e8 36 87 28 e3 85 c0 0f 85 6f ff ff ff <0f> 0b e9 68 ff ff ff 48 c7 c7 0c 91 cc 84 e8 cb 36 6f e1 e9 4d ff
RSP: 0018:ffff88811bf47458 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88811b398000 RCX: 0000000000000001
RDX: 0000000080000000 RSI: ffffffffa05b4108 RDI: ffff88812daaaa78
RBP: ffff88812d050380 R08: 0000000000000001 R09: ffff88811d6b3437
R10: 0000000000000001 R11: 00000000fddd3581 R12: ffff88815238c000
R13: ffff88812d050380 R14: ffff8881018aa7e0 R15: ffff88811d6b3428
FS: 00007fc82e18ae80(0000) GS:ffff88842e080000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f9630d1b421 CR3: 0000000149802004 CR4: 0000000000370ea0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
mlx5_esw_offloads_devcom_event+0x99/0x3b0 [mlx5_core]
mlx5_devcom_send_event+0x167/0x1d0 [mlx5_core]
esw_offloads_enable+0x1153/0x1500 [mlx5_core]
? mlx5_esw_offloads_controller_valid+0x170/0x170 [mlx5_core]
? wait_for_completion_io_timeout+0x20/0x20
? mlx5_rescan_drivers_locked+0x318/0x810 [mlx5_core]
mlx5_eswitch_enable_locked+0x586/0xc50 [mlx5_core]
? mlx5_eswitch_disable_pf_vf_vports+0x1d0/0x1d0 [mlx5_core]
? mlx5_esw_try_lock+0x1b/0xb0 [mlx5_core]
? mlx5_eswitch_enable+0x270/0x270 [mlx5_core]
? __debugfs_create_file+0x260/0x3e0
mlx5_devlink_eswitch_mode_set+0x27e/0x870 [mlx5_core]
? mutex_lock_io_nested+0x12c0/0x12c0
? esw_offloads_disable+0x250/0x250 [mlx5_core]
? devlink_nl_cmd_trap_get_dumpit+0x470/0x470
? rcu_read_lock_sched_held+0x3f/0x70
devlink_nl_cmd_eswitch_set_doit+0x217/0x620

Fixes: dd3fddb82780 ("net/mlx5: E-Switch, handle devcom events only for ports on the same device")
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 94db3317 30-Jan-2022 Eli Cohen <elic@nvidia.com>

net/mlx5: Support multiport eswitch mode

Multiport eswitch mode is a LAG mode that allows adding rules that
forward traffic to a specific physical port without being affected by LAG
affinity configuration.

This mode of operation is mutually exclusive with the other LAG modes used
by multipath and bonding.

To make the transition between the modes, we maintain a counter on the
number of rules specifying one of the uplink representors as the target
of mirred egress redirect action.

An example of such rule would be:

$ tc filter add dev enp8s0f0_0 prot all root flower dst_mac \
00:11:22:33:44:55 action mirred egress redirect dev enp8s0f0

If the reference count just grows to one and LAG is not in use, we
create the LAG in multiport eswitch mode. Other mode changes are not
allowed while in this mode. When the reference count reaches zero, we
destroy the LAG and let other modes be used if needed.
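
The counter-driven transitions can be modelled in a few lines of Python.
This is a simplified sketch: the MpeswModel class and its methods are
hypothetical, and refusing a rule while another LAG mode is active is an
assumption of the model rather than a statement about the driver.

```python
class MpeswModel:
    def __init__(self):
        self.rule_count = 0    # rules redirecting to an uplink representor
        self.lag_mode = None   # None, "mpesw", or another LAG mode

    def add_rule(self):
        if self.lag_mode not in (None, "mpesw"):
            return False               # another LAG mode is in use
        self.rule_count += 1
        if self.rule_count == 1:
            self.lag_mode = "mpesw"    # first rule creates the LAG
        return True

    def del_rule(self):
        if self.rule_count == 0:
            return
        self.rule_count -= 1
        if self.rule_count == 0:
            self.lag_mode = None       # last rule gone: destroy the LAG

    def set_mode(self, mode):
        if self.lag_mode == "mpesw":
            return False               # no other mode changes while in mpesw
        self.lag_mode = mode
        return True
```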

The logic also changed such that if forwarding to some uplink destination
cannot be guaranteed, we fail the operation so the rule will eventually
be in software and not in hardware.

Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# ef9a3a4a 24-Jan-2022 Eli Cohen <elic@nvidia.com>

net/mlx5: Lag, refactor lag state machine

The LAG state machine is implemented using bit flags. However, all these bit
flags, except for MLX5_LAG_FLAG_HASH_BASED, are really mutually exclusive.

In addition, MLX5_LAG_FLAG_READY is used by bonding to mark if we have
our netdevices successfully added to lag and does not really belong in
the same flags variable as the other flags.

Rename MLX5_LAG_FLAG_READY to MLX5_LAG_FLAG_NDEVS_READY to better
reflect its purpose and put it in a new flags variable.

For the rest of the flags, we introduce a mode enum to hold the state
of the LAG.

Remove the shared fdb boolean flag from struct mlx5_lag and store this
configuration as a mode flag.

Change all flag related operations to use standard Linux APIs.
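
The shape of the refactor can be illustrated with a Python model: one enum
value for the mutually exclusive mode, one flag set for modifiers of that
mode, and a separate flag set for the netdev bookkeeping. All class and
member names below are illustrative assumptions, not the driver's
identifiers.

```python
from dataclasses import dataclass
from enum import Enum, IntFlag


class LagMode(Enum):           # mutually exclusive: exactly one at a time
    NONE = 0
    ROCE = 1
    SRIOV = 2
    MULTIPATH = 3


class LagModeFlag(IntFlag):    # modifiers of the current mode
    HASH_BASED = 1 << 0
    SHARED_FDB = 1 << 1


class LagStateFlag(IntFlag):   # bookkeeping kept apart from the mode
    NDEVS_READY = 1 << 0


@dataclass
class LagState:
    mode: LagMode = LagMode.NONE
    mode_flags: LagModeFlag = LagModeFlag(0)
    state_flags: LagStateFlag = LagStateFlag(0)


def is_active(ldev):
    """A single comparison replaces testing several mode bits."""
    return ldev.mode is not LagMode.NONE
```

With an enum, a state such as "RoCE and multipath at once" cannot be
represented, which bit flags allowed by accident.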

Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 7f46a0b7 15-Mar-2022 Mark Bloch <mbloch@nvidia.com>

net/mlx5: Lag, add debugfs to query hardware lag state

Lag state has become very complicated with many modes, flags, types and
port selection methods, and future work will add additional features.

Add a debugfs to query the current lag state. A new directory named "lag"
will be created under the mlx5 debugfs directory. As the driver has a
debugfs directory per PCI function, the location will be: <debugfs>/mlx5/<BDF>/lag

For example:
/sys/kernel/debug/mlx5/0000:08:00.0/lag

The following files are exposed:

- state: Returns "active" or "disabled". If "active" it means hardware
lag is active.

- members: Returns the BDFs of all the members of lag object.

- type: Returns the type of the lag currently configured. Valid only
if hardware lag is active.
* "roce" - Members are bare metal PFs.
* "switchdev" - Members are in switchdev mode.
* "multipath" - ECMP offloads.

- port_sel_mode: Returns the egress port selection method, valid
only if hardware lag is active.
* "queue_affinity" - Egress port is selected by
the QP/SQ affinity.
* "hash" - Egress port is selected by hash done on
each packet. Controlled by: xmit_hash_policy of the
bond device.

- flags: Returns flags that are specific per lag @type. Valid only if
hardware lag is active.
* "shared_fdb" - "on" or "off", if "on" single FDB is used.

- mapping: Returns the mapping which is used to select egress port.
Valid only if hardware lag is active.
If @port_sel_mode is "hash" returns the active egress ports.
The hash result will select only active ports.
If @port_sel_mode is "queue_affinity" returns the mapping
between the configured port affinity of the QP/SQ and actual
egress port. For example:
* 1:1 - Mapping means if the configured affinity is port 1
traffic will egress via port 1.
* 1:2 - Mapping means if the configured affinity is port 1
traffic will egress via port 2. This can happen
if port 1 is down or in active/backup mode and port 1
is backup.
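
A small helper for collecting these files could look like the sketch below.
The file names come from the list above; the read_lag_state function is
hypothetical, and skipping files whose read fails is an assumption about how
"valid only if hardware lag is active" surfaces to user space.

```python
from pathlib import Path

LAG_FILES = ("state", "members", "type", "port_sel_mode", "flags", "mapping")


def read_lag_state(lag_dir):
    """Return {file name: contents} for every lag file that can be read."""
    out = {}
    for name in LAG_FILES:
        try:
            out[name] = (Path(lag_dir) / name).read_text().strip()
        except OSError:
            # Missing or currently invalid file: leave it out of the result.
            continue
    return out
```

For the example path above this would be called as
read_lag_state("/sys/kernel/debug/mlx5/0000:08:00.0/lag"), typically as root
with debugfs mounted.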

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 352899f3 02-Mar-2022 Mark Bloch <mbloch@nvidia.com>

net/mlx5: Lag, use buckets in hash mode

When in hardware lag on a NIC with more than 2 ports, traffic needs to be
distributed between the remaining active ports when one port goes down.

For better spread in such cases instead of using 1-to-1 mapping and only
4 slots in the hash, use many.

Each port will have many slots that point to it. When a port goes down
go over all the slots that pointed to that port and spread them between
the remaining active ports. Once the port comes back restore the default
mapping.

We will have number_of_ports * MLX5_LAG_MAX_HASH_BUCKETS slots.
Each group of MLX5_LAG_MAX_HASH_BUCKETS slots belongs to a different port.
The native mapping is such that:

port 1: The first MLX5_LAG_MAX_HASH_BUCKETS slots are: [1, 1, .., 1]
which means if a packet is hashed into one of these slots it will hit the
wire via port 1.

port 2: The second MLX5_LAG_MAX_HASH_BUCKETS slots are: [2, 2, .., 2]
which means if a packet is hashed into one of these slots it will hit the
wire via port 2.

and this mapping is the same for the rest of the ports.
On a failover, let's say port 2 goes down (ports 1, 3, 4 are still up).
The new mapping for port 2 will be:

port 2: The second MLX5_LAG_MAX_HASH_BUCKETS are: [1, 3, 1, 4, .., 4]
which means the mapping was changed from the native mapping to a mapping
that consists of only the active ports.

With this, if a port goes down the traffic will be split randomly between
the remaining active ports.
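
The bucket layout and the failover remap can be sketched in Python. The
value 16 for MLX5_LAG_MAX_HASH_BUCKETS is an assumption (the log does not
state it), and native_mapping and failover_mapping are hypothetical helper
names.

```python
import random

MLX5_LAG_MAX_HASH_BUCKETS = 16   # assumed value for illustration


def native_mapping(num_ports, buckets=MLX5_LAG_MAX_HASH_BUCKETS):
    """Slots [i * buckets, (i + 1) * buckets) all point at port i + 1."""
    return [port for port in range(1, num_ports + 1) for _ in range(buckets)]


def failover_mapping(num_ports, active_ports,
                     buckets=MLX5_LAG_MAX_HASH_BUCKETS, rng=random):
    """Spread the slots of every inactive port over random active ports."""
    mapping = native_mapping(num_ports, buckets)
    if not active_ports:
        return mapping               # nothing to fail over to
    active = sorted(active_ports)
    return [port if port in active_ports else rng.choice(active)
            for port in mapping]
```

Slots that belong to ports that are still up keep their native value, so
restoring the default mapping when the port comes back only touches the
failed port's group of slots.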

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# e9d5bb51 27-Feb-2022 Mark Bloch <mbloch@nvidia.com>

net/mlx5: Lag, store number of ports inside lag object

Store the number of lag ports inside the lag object. Lag object is a single
shared object managing the lag state of multiple mlx5 devices on the same
physical HCA.

Downstream patches will allow hardware lag to be created over devices with
more than 2 ports.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# ec2fa47d 14-Dec-2021 Mark Bloch <mbloch@nvidia.com>

net/mlx5: Lag, use lag lock

Use a lag specific lock instead of depending on external locks to
synchronise the lag creation/destruction.

With this, taking E-Switch mode lock is no longer needed for syncing
lag logic.

Clean up any dead code that is left over and don't export functions that
aren't used outside the E-Switch core code.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 6cb87869 29-Nov-2021 Mark Bloch <mbloch@nvidia.com>

net/mlx5: Lag, offload active-backup drops to hardware

In active-backup mode the backup interface's packets are dropped by the
bond device. In switchdev mode, where TC rules are offloaded to the FDB,
this can lead to packets matching rules in the FDB where, without offload,
they would have been dropped before reaching the TC rules in the kernel.

Create a drop rule to make sure packets on inactive ports are dropped
before reaching the FDB.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 54493a08 12-Jan-2022 Mark Bloch <mbloch@nvidia.com>

net/mlx5: Lag, record inactive state of bond device

A bond device will drop duplicate packets (received on inactive ports)
by default. A flag (all_slaves_active) can be set to override such
behaviour. This flag is a global flag per bond device (ALB mode isn't
supported by the mlx5 driver so it can be ignored).

When a NETDEV_CHANGEUPPER / NETDEV_CHANGEINFODATA event is received, check if
there is an interface that is inactive.

Downstream patch will use this information in order to decide if a drop
rule is needed.
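
The check described above reduces to a small predicate. The following Python
model is only a sketch: the Slave class and bond_has_inactive function are
hypothetical, and the event handling itself is left out.

```python
from dataclasses import dataclass


@dataclass
class Slave:
    name: str
    active: bool   # whether the bond currently treats this port as active


def bond_has_inactive(slaves, all_slaves_active):
    """True if the bond would drop packets received on some slave."""
    if all_slaves_active:
        # The override flag makes the bond accept packets on every slave.
        return False
    return any(not s.active for s in slaves)
```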

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# da6b0bb0 18-Aug-2021 Maor Gottlieb <maorg@nvidia.com>

net/mlx5: Lag, use steering to select the affinity port in LAG

Use the steering-based solution to select the affinity port
when the LAG mode is based on hash policy and the device supports
the port selection flow table.

Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# dc48516e 17-Aug-2021 Maor Gottlieb <maorg@nvidia.com>

net/mlx5: Lag, add support to create definers for LAG

Every definer will consist of a flow table with a single hash group
with exactly two flow table entries, one for each device port.
The destination of these entries is the uplink vport according to the
port state and hash policy.
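
One way to picture the two entries is the Python model below. It is an
illustration under assumptions, not the hardware behaviour: port_up,
entry_destinations and select_uplink are hypothetical names, zlib.crc32
merely stands in for the hardware hash, and falling back to the other port
when a port is down is an assumed policy.

```python
import zlib


def entry_destinations(port_up):
    """Destinations of the two entries, one per device port (1-based)."""
    dests = []
    for port in (1, 2):
        other = 2 if port == 1 else 1
        if port_up[port] or not port_up[other]:
            dests.append(port)    # keep the native uplink
        else:
            dests.append(other)   # port is down: use the other uplink
    return dests


def select_uplink(packet_fields, port_up):
    """Pick an entry by hashing the packet fields, then follow it."""
    index = zlib.crc32(packet_fields) % 2
    return entry_destinations(port_up)[index]
```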

Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 1065e001 13-Jul-2021 Maor Gottlieb <maorg@nvidia.com>

net/mlx5: Lag, set LAG traffic type mapping

Generate a traffic type bitmap that will define which
steering objects we need to create for the steering-based
LAG.

Bits in this bitmap are set according to the LAG hash type.
In addition, have a field that indicates whether the lag is in encap
mode or not.
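
A possible shape for such a bitmap, sketched in Python. The traffic type
names, the hash type strings ("l2", "l23", "l34", "e23", "e34") and the
exact bits chosen per hash type are assumptions made for the example and
should not be read as the driver's table.

```python
from enum import IntFlag


class TrafficType(IntFlag):
    IPV4_TCP = 1 << 0
    IPV4_UDP = 1 << 1
    IPV6_TCP = 1 << 2
    IPV6_UDP = 1 << 3
    IPV4 = 1 << 4
    IPV6 = 1 << 5
    ANY = 1 << 6


def traffic_type_map(hash_type):
    """Return (bitmap, encap) for the given LAG hash type string."""
    encap = hash_type in ("e23", "e34")   # hash on encapsulated headers
    bitmap = TrafficType.ANY              # always keep a catch-all
    if hash_type in ("l23", "e23", "l34", "e34"):
        bitmap |= TrafficType.IPV4 | TrafficType.IPV6
    if hash_type in ("l34", "e34"):
        bitmap |= (TrafficType.IPV4_TCP | TrafficType.IPV4_UDP |
                   TrafficType.IPV6_TCP | TrafficType.IPV6_UDP)
    return bitmap, encap
```

Only the traffic types present in the bitmap would then need steering
objects, which is the point of computing it up front.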

Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 3d677735 05-Jul-2021 Maor Gottlieb <maorg@nvidia.com>

net/mlx5: Lag, move lag files into directory

Downstream patches add another lag related file so it makes
sense to have all the lag files in a dedicated directory.

Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>