History log of /linux-master/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
Revision Date Author Comments
# ec706a86 27-Nov-2023 Stanislav Fomichev <sdf@google.com>

net/mlx5e: Implement AF_XDP TX timestamp and checksum offload

TX timestamp:
- requires passing a clock; not sure I'm passing the correct one (from
cq->mdev), but the timestamp value looks convincing

TX checksum:
- looks like the device does packet parsing (and doesn't accept a custom
start/offset), so I'm ignoring user offsets

Cc: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20231127190319.1190813-5-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 34a79876 01-Aug-2023 Dragos Tatulea <dtatulea@nvidia.com>

net/mlx5e: XDP, Fix fifo overrun on XDP_REDIRECT

Before this fix, running high rate traffic through XDP_REDIRECT
with multibuf could overrun the fifo used to release the
xdp frames after tx completion. This resulted in corrupted data
being consumed on the free side.

The culprit was a miscalculation of the fifo size: the maximum ratio
of fifo entries to data segments was incorrect. This ratio serves to
calculate the max fifo size for a full sq where each packet uses the
worst case number of entries in the fifo.

This patch fixes the formula and names the constant. It also makes sure
that future values will use a power of 2 number of entries for the fifo
mask to work.
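
A minimal sketch of the sizing rule this implies (the ratio value and
names here are illustrative, not the driver's exact constant):

    /* Worst case: each data segment may consume up to
     * ENTRIES_PER_DS fifo entries, so size the fifo for a full SQ
     * and round up to a power of two so the fifo mask keeps
     * working (roundup_pow_of_two() is from <linux/log2.h>). */
    #define XDPI_ENTRIES_PER_DS 4 /* assumed worst-case ratio */

    static inline u32 xdpi_fifo_size(u32 max_sq_data_segs)
    {
            return roundup_pow_of_two(max_sq_data_segs *
                                      XDPI_ENTRIES_PER_DS);
    }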

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Fixes: 3f734b8c594b ("net/mlx5e: XDP, Use multiple single-entry objects in xdpi_fifo")
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 3f734b8c 17-Apr-2023 Tariq Toukan <tariqt@nvidia.com>

net/mlx5e: XDP, Use multiple single-entry objects in xdpi_fifo

Here we fix the current wi->num_pkts abuse, as it was used to indicate
multiple xdpi entries in the xdpi_fifo.

Instead, reduce mlx5e_xdp_info to the size of a single field, making it
a union of unions. Per packet, use as many instances as needed to
provide the information needed at the time of completion.

The sequence of xdpi instances pushed is well defined, derived by the
xmit_mode.
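
An illustrative shape of such an entry (field and mode names are
assumptions, not the exact driver layout):

    /* One fifo slot per piece of completion info; a packet pushes
     * as many slots as its xmit_mode requires. */
    union xdp_info_sketch {
            enum { MODE_FRAME, MODE_PAGE, MODE_XSK } mode;
            union {
                    struct xdp_frame *xdpf;
                    dma_addr_t dma_addr;
            } frame;
            union {
                    u8 num;
                    struct page *page;
            } page;
    };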

Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# eb9b9fdc 17-Apr-2023 Tariq Toukan <tariqt@nvidia.com>

net/mlx5e: Introduce extended version for mlx5e_xmit_data

Introduce struct mlx5e_xmit_data_frags to be used for non-linear xmit
buffers. Let it include the sinfo pointer.

Take one bit from the len field to indicate if the descriptor has
fragments and can be cast up into the extended version.

Zero-init to make sure has_frags, and potentially future fields, are
zero when not explicitly assigned.

Another field will be added in a downstream patch to indicate and point
to dma addresses of the different frags, for redirect-in requests.

This simplifies the parameters of the
mlx5e_xmit_xdp_frame/mlx5e_xmit_xdp_frame_mpwqe functions.
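
A sketch of the resulting layout (assumed field names; the point is the
borrowed bit and the cast-up):

    struct xmit_data_sketch {
            dma_addr_t dma_addr;
            void *data;
            u32 len : 31;
            u32 has_frags : 1;      /* bit borrowed from len */
    };

    struct xmit_data_frags_sketch {
            struct xmit_data_sketch xd; /* first, to allow cast-up */
            struct skb_shared_info *sinfo;
    };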

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# aebc62d3 17-Apr-2023 Tariq Toukan <tariqt@nvidia.com>

net/mlx5e: Move XDP struct and enum to XDP header

Move struct mlx5e_xdp_info and enum mlx5e_xdp_xmit_mode from the generic
en.h to the XDP header, where they belong.

Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9da5294e 15-Feb-2023 Tariq Toukan <tariqt@nvidia.com>

net/mlx5e: Remove redundant page argument in mlx5e_xdp_handle()

Remove the page parameter; it can be derived from the xdp_buff member
of mlx5e_xdp_buff.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# bc8d405b 19-Jan-2023 Toke Høiland-Jørgensen <toke@redhat.com>

net/mlx5e: Support RX XDP metadata

Support RX hash and timestamp metadata kfuncs. To implement them, we
need to pass the cqe pointer into the mlx5e_skb_from* functions so it
can be retrieved from the XDP ctx.
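
A hedged sketch of how such a kfunc hook can use the stored cqe (the
real hooks are registered through struct xdp_metadata_ops;
cqe_has_valid_hash() is a hypothetical helper):

    static int rx_hash_sketch(const struct xdp_md *ctx, u32 *hash)
    {
            const struct mlx5e_xdp_buff *mxbuf = (const void *)ctx;

            if (!cqe_has_valid_hash(mxbuf->cqe)) /* hypothetical */
                    return -ENODATA;
            *hash = be32_to_cpu(mxbuf->cqe->rss_hash_result);
            return 0;
    }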

Cc: Tariq Toukan <tariqt@nvidia.com>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20230119221536.3349901-17-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>


# 384a13ca 19-Jan-2023 Toke Høiland-Jørgensen <toke@redhat.com>

net/mlx5e: Introduce wrapper for xdp_buff

Preparation for implementing HW metadata kfuncs. No functional change.

Cc: Tariq Toukan <tariqt@nvidia.com>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20230119221536.3349901-16-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>


# e3c4c496 27-Sep-2022 Maxim Mikityanskiy <maximmi@nvidia.com>

net/mlx5e: Fix a typo in mlx5e_xdp_mpwqe_is_full

Fix a typo in the function name: mpqwe -> mpwqe (stands for multi-packet
work queue element).

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 08c34e95 15-Feb-2022 Maxim Mikityanskiy <maximmi@nvidia.com>

net/mlx5e: Remove MLX5E_XDP_TX_DS_COUNT

After introducing multi-buffer XDP_TX, the MLX5E_XDP_TX_DS_COUNT define
became misleading. It's no longer the DS count of an XDP_TX WQE; such a
WQE can be longer because of fragments.

As this define is only used at one place in mlx5e_open_xdpsq(), it's
also not very useful anymore. This commit removes the define and puts
the calculation of ds_count for prefilled single-fragment WQEs inline.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 39a1665d 31-Jan-2022 Maxim Mikityanskiy <maximmi@nvidia.com>

net/mlx5e: Implement sending multi buffer XDP frames

xmit_xdp_frame is extended to support sending fragmented XDP frames. The
next commit will start using this functionality.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 49529a17 28-Jan-2022 Maxim Mikityanskiy <maximmi@nvidia.com>

net/mlx5e: Move mlx5e_xdpi_fifo_push out of xmit_xdp_frame

The implementations of xmit_xdp_frame get the xdpi parameter of type
struct mlx5e_xdp_info for the sole purpose of calling
mlx5e_xdpi_fifo_push() on success.

This commit moves this call outside of xmit_xdp_frame, shifting this
responsibility to the caller. It will allow more fine-grained handling
of XDP info for cases when an xdp_frame is fragmented.
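
A sketch of the resulting caller-side pattern (argument list
simplified):

    /* Push completion info only after the frame was accepted. */
    if (sq->xmit_xdp_frame(sq, &xdptxd, 0)) {
            mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo, &xdpi);
            return true;
    }
    return false;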

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# ddc87e7d 28-Jan-2022 Maxim Mikityanskiy <maximmi@nvidia.com>

net/mlx5e: Store DMA address inside struct page

Use page_pool_set_dma_addr() to store the DMA address of a page inside
struct page, in order to avoid passing struct mlx5e_dma_info to XDP
handlers. Previously, struct mlx5e_dma_info was used to pass both the
DMA address and the page, and it worked well for the single-fragment
case.

When XDP multi buffer is in use, and a fragmented xdp_frame has to be
transmitted, the driver needs to know the DMA addresses of fragments,
however, the array of fragments in struct skb_shared_info doesn't
contain them. In order to pass the DMA addresses, the driver puts them
into struct page itself, which is accessible from the array of fragments
in struct skb_shared_info. The existing XDP handlers are modified to
remove the dependency on struct mlx5e_dma_info.
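
A sketch of the approach using the page_pool helpers (surrounding
driver code simplified):

    /* RX allocation side: remember the mapping in struct page. */
    page_pool_set_dma_addr(page, dma_addr);

    /* XDP_TX of a fragment: read the mapping back through the
     * page reachable from skb_shared_info. */
    skb_frag_t *frag = &sinfo->frags[i];
    dma_addr_t addr = page_pool_get_dma_addr(skb_frag_page(frag));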

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 064990d0 27-Jan-2022 Maxim Mikityanskiy <maximmi@nvidia.com>

net/mlx5e: Drop the len output parameter from mlx5e_xdp_handle

The len parameter of mlx5e_xdp_handle is used to output the new packet
length after XDP has processed the packet and returned XDP_PASS.
However, this value can be calculated on the caller site, as the caller
knows if it was an XDP_PASS.

This commit drops the len parameter and moves the calculation to the
caller, reducing the number of parameters passed to the function and
preparing for XDP support in non-linear legacy RQ.
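
On the caller site, the calculation is a one-liner after XDP_PASS
(sketch):

    /* The program may have grown or shrunk the packet. */
    u32 len = xdp.data_end - xdp.data;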

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# e26eceb9 19-Jan-2022 Tariq Toukan <tariqt@nvidia.com>

net/mlx5e: RX, Test the XDP program existence out of the handler

Instead of an early return inside mlx5e_xdp_handle(), let the caller
check if an XDP program is loaded. This saves a few unnecessary
function calls and calculations in the !prog case.

Performance test: single core, drop packets in iptables
Before: 3,872,504 pps
After: 3,975,628 pps (+2.66%)
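
A sketch of the caller-side check (handler arguments simplified):

    struct bpf_prog *prog = rcu_dereference(rq->xdp_prog);

    if (prog && mlx5e_xdp_handle(rq, prog, &xdp))
            return NULL; /* consumed by XDP, no SKB to build */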

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 76c31e5f 10-May-2021 Aya Levin <ayal@nvidia.com>

net/mlx5e: Use FW limitation for max MPW WQEBBs

Calculate the maximal count of MPW WQEBBs at SQ creation and store it
there. Remove MLX5E_TX_MPW_MAX_NUM_DS and MLX5E_TX_MPW_MAX_WQEBBS.
Update mlx5e_tx_mpwqe_is_full() and mlx5e_xdp_mpqwe_is_full().

Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 4d6e6b0c 01-Feb-2021 Maxim Mikityanskiy <maximmi@mellanox.com>

net/mlx5e: Replace synchronize_rcu with synchronize_net

The commit cited below switched from using napi_synchronize to
synchronize_rcu to have a guarantee that it will finish in finite time.
However, on average, synchronize_rcu takes more time than
napi_synchronize. Given that it's called multiple times per channel on
deactivation, it accumulates to a significant amount, which causes
timeouts in some applications (for example, when using bonding with
NetworkManager).

This commit replaces synchronize_rcu with synchronize_net, which is
faster when called under rtnl_lock, speeding up the described flow.
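
For reference, the core helper in net/core/dev.c is essentially:

    void synchronize_net(void)
    {
            might_sleep();
            if (rtnl_is_locked())
                    synchronize_rcu_expedited();
            else
                    synchronize_rcu();
    }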

Fixes: 9c25a22dfb00 ("net/mlx5e: Use synchronize_rcu to sync with NAPI")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 5af75c74 01-Jul-2020 Maxim Mikityanskiy <maximmi@mellanox.com>

net/mlx5e: Enhanced TX MPWQE for SKBs

This commit adds support for Enhanced TX MPWQE feature in the regular
(SKB) data path. A MPWQE (multi-packet work queue element) can serve
multiple packets, reducing the PCI bandwidth on control traffic.

Two new stats (tx*_mpwqe_blks and tx*_mpwqe_pkts) are added. The feature
is on by default and controlled by the skb_tx_mpwqe private flag.

In a MPWQE, eseg is shared among all packets, so eseg-based offloads
(IPSEC, GENEVE, checksum) run on a separate eseg that is compared to the
eseg of the current MPWQE session to decide if the new packet can be
added to the same session.

MPWQE is not compatible with certain offloads and features, such as TLS
offload, TSO, nonlinear SKBs. If such incompatible features are in use,
the driver gracefully falls back to non-MPWQE.
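
A sketch of the session-matching test (simplified; the constant is
assumed to cover the offloads-relevant prefix of the eseg):

    /* Close the session when the new packet's eseg differs in the
     * offloads-relevant prefix; the packet then opens a new one. */
    if (memcmp(&session->wqe->eth, &new_eseg, ACCEL_ESEG_LEN))
            mlx5e_tx_mpwqe_session_complete(sq);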

This change has no performance impact in TCP single stream test and
XDP_TX single stream test.

UDP pktgen, 64-byte packets, single stream, MPWQE off:
Packet rate: 16.96 Mpps (±0.12 Mpps) -> 17.01 Mpps (±0.20 Mpps)
Instructions per packet: 421 -> 429
Cycles per packet: 156 -> 161
Instructions per cycle: 2.70 -> 2.67

UDP pktgen, 64-byte packets, single stream, MPWQE on:
Packet rate: 16.96 Mpps (±0.12 Mpps) -> 20.94 Mpps (±0.33 Mpps)
Instructions per packet: 421 -> 329
Cycles per packet: 156 -> 123
Instructions per cycle: 2.70 -> 2.67

Enabling MPWQE can reduce PCI bandwidth:
PCI Gen2, pktgen at fixed rate of 36864000 pps on 24 CPU cores:
Inbound PCI utilization with MPWQE off: 80.3%
Inbound PCI utilization with MPWQE on: 59.0%
PCI Gen3, pktgen at fixed rate of 56064000 pps on 24 CPU cores:
Inbound PCI utilization with MPWQE off: 65.4%
Inbound PCI utilization with MPWQE on: 49.3%

Enabling MPWQE can also reduce CPU load, increasing the packet rate in
case of CPU bottleneck:
PCI Gen2, pktgen at full rate on 24 CPU cores:
Packet rate with MPWQE off: 37.5 Mpps
Packet rate with MPWQE on: 49.0 Mpps
PCI Gen3, pktgen at full rate on 24 CPU cores:
Packet rate with MPWQE off: 57.0 Mpps
Packet rate with MPWQE on: 66.8 Mpps

Burst size in all pktgen tests is 32.

CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64)
NIC: Mellanox ConnectX-6 Dx
GCC 10.2.0

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# b39fe61e 16-Apr-2020 Maxim Mikityanskiy <maximmi@mellanox.com>

net/mlx5e: Rename xmit-related structs to generalize them

As preparation for the upcoming TX MPWQE support for SKBs, rename struct
mlx5e_xdp_mpwqe to mlx5e_tx_mpwqe and move it above struct mlx5e_txqsq.
This structure will be reused in the regular SQ and in the regular TX
data path. Also rename mlx5e_xdp_xmit_data to mlx5e_xmit_data - it will
be used in the upcoming TX MPWQE flow.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 530d5ce2 04-Jun-2020 Maxim Mikityanskiy <maximmi@mellanox.com>

net/mlx5e: Generalize TX MPWQE checks for full session

As preparation for the upcoming TX MPWQE for SKBs, create a function
(mlx5e_tx_mpwqe_is_full) to check whether an MPWQE session is full. This
function will be shared by MPWQE code for XDP and for SKBs. Defines are
renamed and moved to make them not XDP-specific.
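
A sketch of such a check (the constant name is an assumption):

    static inline bool tx_mpwqe_is_full_sketch(struct mlx5e_tx_mpwqe *s)
    {
            /* Full once the session reached the DS budget of one
             * MPWQE (assumed constant). */
            return s->ds_count == TX_MPWQE_MAX_NUM_DS;
    }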

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 97e3afd6 09-Jul-2020 Maxim Mikityanskiy <maximmi@mellanox.com>

net/mlx5e: Unify constants for WQE_EMPTY_DS_COUNT

A constant for the number of DS in an empty WQE (i.e. a WQE without data
segments) is needed in multiple places (normal TX data path, MPWQE in
XDP), but currently we have a constant for XDP and an inline formula in
normal TX. This patch introduces a common constant.

Additionally, mlx5e_xdp_mpwqe_session_start is converted to use struct
assignment, because the code nearby is touched.
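
The shape of such a constant (a sketch; the actual name and location
may differ):

    /* DS units occupied by a WQE that carries no data segments. */
    #define TX_WQE_EMPTY_DS_COUNT \
            (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)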

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 388a2b56 30-Jul-2020 Maxim Mikityanskiy <maximmi@mellanox.com>

net/mlx5e: Small improvements for XDP TX MPWQE logic

Use MLX5E_XDP_MPW_MAX_WQEBBS to reserve space for an MPWQE, because
it's actually the maximal size an MPWQE can take.

Reorganize the logic that checks when to close the MPWQE session:

1. Put all checks into a single function.

2. When inline is on, make only one comparison - if it's false, the less
strict one will also be false. The compiler probably optimized it out
anyway, but it's clearer to also reflect it in the code.

The MLX5E_XDP_INLINE_WQE_* defines are also changed to make the
calculations more correct from the logical point of view. Though
MLX5E_XDP_INLINE_WQE_MAX_DS_CNT used to be 16 and didn't change its
value, the calculation used to be DIV_ROUND_UP(max inline packet size,
MLX5_SEND_WQE_DS), and the numerator should have included sizeof(struct
mlx5_wqe_inline_seg).
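
A sketch of the corrected form (define names assumed):

    /* The inline header itself consumes DS budget, so it belongs
     * in the numerator. */
    #define INLINE_WQE_MAX_DS_CNT \
            DIV_ROUND_UP(INLINE_WQE_SZ_THRSD + \
                         sizeof(struct mlx5_wqe_inline_seg), \
                         MLX5_SEND_WQE_DS)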

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>


# 93761ca1 30-Apr-2020 Tariq Toukan <tariqt@mellanox.com>

net/mlx5e: XDP, Avoid indirect call in TX flow

Use INDIRECT_CALL_2() helper to avoid the cost of the indirect call
when/if CONFIG_RETPOLINE=y.
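
A sketch of the call site, using the INDIRECT_CALL_2() helper from
<linux/indirect_call_wrapper.h> (argument list simplified):

    /* Try the two known implementations by direct comparison;
     * only fall back to an indirect call if neither matches. */
    INDIRECT_CALL_2(sq->xmit_xdp_frame,
                    mlx5e_xmit_xdp_frame_mpwqe, mlx5e_xmit_xdp_frame,
                    sq, &xdptxd, &xdpi);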

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>


# 39d6443c 20-May-2020 Björn Töpel <bjorn@kernel.org>

mlx5, xsk: Migrate to new MEM_TYPE_XSK_BUFF_POOL

Use the new MEM_TYPE_XSK_BUFF_POOL API in lieu of MEM_TYPE_ZERO_COPY in
mlx5e. It allows to drop a lot of code from the driver (which is now
common in AF_XDP core and was related to XSK RX frame allocation, DMA
mapping, etc.) and slightly improve performance (RX +0.8 Mpps, TX +0.4
Mpps).
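
A sketch of the RX-side registration with the new API (error handling
omitted):

    /* Frame allocation and DMA mapping now live in the AF_XDP core. */
    err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
                                     MEM_TYPE_XSK_BUFF_POOL, NULL);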

rfc->v1: Put back the sanity check for XSK params, use XSK API to get
the total headroom size. (Maxim)

v1->v2: Fix DMA address handling, set XDP metadata to invalid. (Maxim)

v2->v3: Handle frame_sz, use xsk_buff_xdp_get_frame_dma, use xsk_buff
API for DMA sync on TX, add performance numbers. (Maxim)

v3->v4: Remove unused variable num_xsk_frames. (Jakub)

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-12-bjorn.topel@gmail.com


# 5ffb4d85 30-Mar-2020 Maxim Mikityanskiy <maximmi@mellanox.com>

net/mlx5e: Calculate SQ stop room in a robust way

Currently, different formulas are used to estimate the space that may be
taken by WQEs in the SQ during a single packet transmit. This space is
called stop room, and it's checked in the end of packet transmit to find
out if the next packet could overflow the SQ. If it could, the driver
tells the kernel to stop sending next packets.

Many factors affect the stop room:

1. Padding with NOPs to avoid WQEs spanning over page boundaries.

2. Enabled and disabled offloads (TLS, upcoming MPWQE).

3. The maximum size of a WQE.

The padding is performed before every WQE if it doesn't fit the current
page.

The current formula assumes that only one padding will be required per
packet, and it doesn't take into account that the WQEs posted during the
transmission of a single packet might exceed the page size in very rare
circumstances. For example, to hit this condition with 4096-byte pages,
TLS offload will have to interrupt an almost-full MPWQE session, be in
the resync flow and try to transmit a near-maximum amount of data.

To avoid SQ overflows in such rare cases after MPWQE is added, this
patch introduces a more robust formula to estimate the stop room. The
new formula uses the fact that a WQE of size X will not require more
than X-1 WQEBBs of padding. More exact estimations are possible, but
they result in much more complex and error-prone code for little gain.

Before this patch, the TLS stop room included space for both INNOVA and
ConnectX TLS offloads that couldn't run at the same time anyway, so this
patch accounts only for the active one.
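
A sketch of the bound (the real helper carries extra sanity checks):

    /* A WQE of X WQEBBs may be preceded by at most X-1 WQEBBs of
     * NOP padding, so reserve 2*X - 1 per possible WQE. */
    static inline u16 stop_room_for_wqe(u16 wqe_size_in_wqebbs)
    {
            return wqe_size_in_wqebbs * 2 - 1;
    }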

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>


# 05dfd570 09-Apr-2020 Tariq Toukan <tariqt@mellanox.com>

net/mlx5e: Take TX WQE info structures out of general EN header

Into the txrx header file.
The mlx5e_sq_wqe_info structure describes WQE info for the ICOSQ;
rename it to better reflect this.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>


# ec9cdca0 16-Apr-2020 Maxim Mikityanskiy <maximmi@mellanox.com>

net/mlx5e: Unify reserving space for WQEs

In our fast-path design, a WQE (Work Queue Element) must not cross the
page boundary. To enforce that, for WQEs consisting of more than one BB
(Basic Block), the driver checks the available contiguous space in the
WQ in advance, and if it's not enough, it pads it with NOPs.

This patch modifies the code that calculates the position of next WQE,
considering the padding, and prepares the WQE. This code is common for
all SQ types. In this patch it's reorganized in a way that makes the
usage pattern unified for all SQ types, and makes the implementations
self-contained and look almost the same, preparing the repeating code to
further attempts to deduplicate it.

One place is left as is: mlx5e_sq_xmit and the mlx5e_fill_sq_frag_edge
call inside it, because it is special in that it may also copy the
WQE's cseg and eseg when reserving space. This will be eliminated in
one of the following patches, and this place will be converted to the
new approach, too.
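
A sketch of the unified pattern (close to the real helpers):

    pi = mlx5_wq_cyc_ctr2ix(wq, sq->pc);
    contig_wqebbs = mlx5_wq_cyc_get_contig_wqebbs(wq, pi);
    if (unlikely(contig_wqebbs < num_wqebbs)) {
            /* WQE would cross the page edge: pad with NOPs first. */
            while (contig_wqebbs--)
                    mlx5e_post_nop(wq, sq->sqn, &sq->pc);
            pi = mlx5_wq_cyc_ctr2ix(wq, sq->pc);
    }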

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>


# fed0c6cf 15-Nov-2019 Maxim Mikityanskiy <maximmi@mellanox.com>

net/mlx5e: Fetch WQE: reuse code and enforce typing

There are multiple functions mlx5{e,i}_*_fetch_wqe that contain the same
code, that is repeated, because they operate on different SQ struct
types. mlx5e_sq_fetch_wqe also returns void *, instead of the concrete
WQE type.

This commit generalizes the fetch WQE operation by putting this code
into a single function. To simplify calls of the generic function in
concrete use cases, macros are provided that substitute the right WQE
size and cast the return type.

Before this patch, fetch_wqe used to calculate pi itself, but the value
was often known to the caller. This calculation is moved outside to
eliminate this unnecessary step and prepare for the fill_frag_edge
refactoring in the next patch.
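
A sketch of the generic fetch plus one of the per-type macros (close to
the real code):

    static inline void *fetch_wqe(struct mlx5_wq_cyc *wq, u16 pi,
                                  size_t wqe_size)
    {
            void *wqe = mlx5_wq_cyc_get_wqe(wq, pi);

            memset(wqe, 0, wqe_size);
            return wqe;
    }

    #define TX_FETCH_WQE(sq, pi) \
            ((struct mlx5e_tx_wqe *)fetch_wqe(&(sq)->wq, (pi), \
                                    sizeof(struct mlx5e_tx_wqe)))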

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>


# 9cf88808 17-Dec-2019 Maxim Mikityanskiy <maximmi@mellanox.com>

net/mlx5e: Fix concurrency issues between config flow and XSK

After disabling resources necessary for XSK (the XDP program, channels,
XSK queues), use synchronize_rcu to wait until the XSK wakeup function
finishes, before freeing the resources.

Suspend XSK wakeups during switching channels. If the XDP program is
being removed, synchronize_rcu before closing the old channels to allow
XSK wakeup to complete.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191217162023.16011-3-maximmi@mellanox.com


# 7cf6f811 14-Jul-2019 Tariq Toukan <tariqt@mellanox.com>

net/mlx5e: XDP, Slight enhancement for WQE fetch function

Instead of passing an output param, let the function return the
WQE pointer.
In addition, pass &pi so it gets its value inside the function,
saving the redundant assignment that comes after it.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>


# 6c085a8a 12-May-2019 Shay Agroskin <shayag@mellanox.com>

net/mlx5e: XDP, Close TX MPWQE session when no room for inline packet left

In MPWQE mode, when transmitting packets with XDP, a packet that is smaller
than a certain size (set to 256 bytes) would be sent inline within its WQE
TX descriptor (mem-copied), in case the hardware tx queue is congested
beyond a pre-defined water-mark.

If an MPWQE cannot contain an additional inline packet, we close this
MPWQE session and send the packet inlined within the next MPWQE.
To save some MPWQE session close+open operations, we don't open MPWQE
sessions in contiguous room smaller than a certain size (set to the
HW MPWQE maximum size). If there isn't enough contiguous room in the
send queue, we fill it with NOPs and wrap the send queue index around.

This way, qualified packets are always sent inline.
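
A sketch of the decision (names simplified):

    /* No room for one more inline packet in this MPWQE: close the
     * session; the packet will go into the next one. */
    if (session->ds_count + inline_ds_count > max_ds_per_mpwqe)
            mlx5e_xdp_mpwqe_complete(sq);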

Perf tests:
Tested packet rate for UDP 64Byte multi-stream
over two dual port ConnectX-5 100Gbps NICs.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

XDP_TX:

With 24 channels:
| ------ | bounced packets | inlined packets | inline ratio |
| before | 113.6Mpps       | 96.3Mpps        | 84%          |
| after  | 115Mpps         | 99.5Mpps        | 86%          |

With one channel:

| ------ | bounced packets | inlined packets | inline ratio |
| before | 6.7Mpps         | 0pps            | 0%           |
| after  | 6.8Mpps         | 0pps            | 0%           |

As we can see, there is improvement in both inline ratio and overall
packet rate for 24 channels. Also, we see no degradation for the
one-channel case.

Signed-off-by: Shay Agroskin <shayag@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>


# 542578c6 05-Jul-2019 Tariq Toukan <tariqt@mellanox.com>

net/mlx5e: Move helper functions to a new txrx datapath header

Take datapath helper functions to a new header file en/txrx.h.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# db05815b 26-Jun-2019 Maxim Mikityanskiy <maximmi@mellanox.com>

net/mlx5e: Add XSK zero-copy support

This commit adds support for AF_XDP zero-copy RX and TX.

We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.

XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.

Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.

Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.

We create a dedicated XSK SQ in the channel. This separation has its
advantages:

1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.

2. Calculating statistics separately.

When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.

Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.

LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO, to ensure XDP can
be re-enabled at any time.

The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:

1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.

2. MTU changes while there are UMEMs registered.

Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.

The performance testing was performed on a machine with the following
configuration:

- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link

The results with retpoline disabled, single stream:

txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps

The results with retpoline enabled, single stream:

txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>


# a011b49f 26-Jun-2019 Maxim Mikityanskiy <maximmi@mellanox.com>

net/mlx5e: Consider XSK in XDP MTU limit calculation

Use the existing mlx5e_get_linear_rq_headroom function to calculate the
headroom for mlx5e_xdp_max_mtu. This function takes the XSK headroom
into consideration, which will be used in the following patches.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>


# b9673cf5 26-Jun-2019 Maxim Mikityanskiy <maximmi@mellanox.com>

net/mlx5e: Share the XDP SQ for XDP_TX between RQs

Put the XDP SQ that is used for XDP_TX into the channel. It used to be a
part of the RQ, but with the introduction of AF_XDP there will be one more
RQ that could share the same XDP SQ. This patch is a preparation for
that change.

Separate XDP_TX statistics per RQ were implemented in one of the previous
patches.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>


# d963fa15 26-Jun-2019 Maxim Mikityanskiy <maximmi@mellanox.com>

net/mlx5e: Refactor struct mlx5e_xdp_info

Currently, struct mlx5e_xdp_info has some issues that have to be cleaned
up before the upcoming AF_XDP support makes things too complicated and
messy. This structure is used both when sending the packet and on
completion. Moreover, the cleanup procedure on completion depends on the
origin of the packet (XDP_REDIRECT, XDP_TX). Adding AF_XDP support will
add new flows that use this structure even differently. To avoid
overcomplicating the code, this commit refactors the usage of this
structure in the following ways:

1. struct mlx5e_xdp_info is split into two different structures. One is
struct mlx5e_xdp_xmit_data, a transient structure that doesn't need to
be stored and is only used while sending the packet. The other is still
struct mlx5e_xdp_info that is stored in a FIFO and contains the fields
needed on completion.

2. The fields of struct mlx5e_xdp_info that are used in different flows
are put into a union. A special enum indicates the cleanup mode and
helps choose the right union member. This approach is clear and
explicit. Although it could be possible to "guess" the mode by looking
at the values of the fields and at the XDP SQ type, it wouldn't be that
clear and extendable and would require looking through the whole chain
to understand what's going on.

For the reference, there are the fields of struct mlx5e_xdp_info that
are used in different flows (including AF_XDP ones):

Packet origin | Fields used on completion | Cleanup steps
-----------------------+---------------------------+------------------
XDP_REDIRECT, | xdpf, dma_addr | DMA unmap and
XDP_TX from XSK RQ | | xdp_return_frame.
-----------------------+---------------------------+------------------
XDP_TX from regular RQ | di | Recycle page.
-----------------------+---------------------------+------------------
AF_XDP TX | (none) | Increment the
| | producer index in
| | Completion Ring.

On send, the same set of mlx5e_xdp_xmit_data fields is used in all
flows: DMA and virtual addresses and length.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>


# c2273219 14-Mar-2019 Shay Agroskin <shayag@mellanox.com>

net/mlx5e: XDP, Inline small packets into the TX MPWQE in XDP xmit flow

Upon high packet rate with multi-CPU TX workloads, many of the HCA's
resources are spent on prefetching TX descriptors, thus affecting
transmission rates.
This patch mitigates this problem by moving some workload to the
CPU, reducing the HW data prefetch overhead for small packets (<= 256B).

When forwarding packets with XDP, a packet that is smaller
than a certain size (set to ~256 bytes) is sent inline within
its WQE TX descriptor (mem-copied) when the hardware TX queue is
congested beyond a pre-defined watermark.

This is added to better utilize the HW resources (making one less
packet data prefetch) and allow better scalability, at the expense of
CPU usage (which now memcpy's the packet into the WQE).

To load-balance between HW and CPU and get the maximal packet rate, we
use watermarks to detect how congested the HW is and move the workload
back and forth between HW and CPU.
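
A sketch of the watermark hysteresis (the watermark values are
hypothetical):

    #define INLINE_WM_LOW   16  /* hypothetical */
    #define INLINE_WM_HIGH  128 /* hypothetical */

    /* outstanding = descriptors posted but not yet completed. */
    static bool xdp_get_inline_state(u16 outstanding, bool cur)
    {
            if (!cur && outstanding >= INLINE_WM_HIGH)
                    return true;  /* congested: start inlining */
            if (cur && outstanding <= INLINE_WM_LOW)
                    return false; /* drained: stop inlining */
            return cur;
    }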

Performance:
Tested packet rate for UDP 64Byte multi-stream
over two dual port ConnectX-5 100Gbps NICs.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

* Tested with hyper-threading disabled

XDP_TX:

|          | before | after   |       |
| 24 rings | 51Mpps | 116Mpps | +126% |
| 1 ring   | 12Mpps | 12Mpps  | same  |

XDP_REDIRECT:

** Below is the transmit rate, not the redirection rate
which might be larger, and is not affected by this patch.

|          | before  | after   |      |
| 32 rings | 64Mpps  | 92Mpps  | +43% |
| 1 ring   | 6.4Mpps | 6.4Mpps | same |

As we can see, the feature significantly improves scaling without
hurting single-ring performance.

Signed-off-by: Shay Agroskin <shayag@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>


# d460c271 08-Apr-2019 Maxim Mikityanskiy <maximmi@mellanox.com>

net/mlx5e: Fix the max MTU check in case of XDP

MLX5E_XDP_MAX_MTU was calculated incorrectly. It didn't account for
NET_IP_ALIGN and MLX5E_HW2SW_MTU, and it also misused MLX5_SKB_FRAG_SZ.
This commit fixes the calculations and adds a brief explanation for the
formula used.
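
A sketch close to the fixed formula (see mlx5e_xdp_max_mtu()):

    int hr = NET_IP_ALIGN + XDP_PACKET_HEADROOM;

    /* SKB_MAX_HEAD() leaves room for skb_shared_info within one
     * page; MLX5E_HW2SW_MTU() subtracts the HW-to-SW MTU overhead. */
    return MLX5E_HW2SW_MTU(params, SKB_MAX_HEAD(hr));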

Fixes: a26a5bdf3ee2d ("net/mlx5e: Restrict the combination of large MTU and XDP")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>


# 407e17b1 11-Feb-2019 Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: XDP, fix redirect resources availability check

Currently the mlx5 driver creates XDP redirect HW queues unconditionally
on netdevice open. This is great until someone starts redirecting XDP
traffic via ndo_xdp_xmit on an mlx5 device and changes the device
configuration at the same time. This might cause crashes, since the
other device's napi is not aware of the mlx5 state change (resource
unavailability).

To fix this we must synchronize with the other devices' napis on the
system. Add a new flag under mlx5e_priv to indicate whether XDP TX
resources are available, set/clear it when necessary, and use
synchronize_rcu() when the flag is turned off, so other napis are in
sync with it before we actually clean up the HW resources.

The flag is tested prior to committing to transmit in mlx5e_xdp_xmit,
and it is sufficient to determine whether it is safe to transmit or
not. The other two internal flags (MLX5E_STATE_OPENED and
MLX5E_SQ_STATE_ENABLED) become unnecessary, so they are removed from
the data path.
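
A sketch of the gating on both sides (simplified):

    /* Transmit side (mlx5e_xdp_xmit): */
    if (unlikely(!test_bit(MLX5E_STATE_XDP_TX_ENABLED, &priv->state)))
            return -ENETDOWN;

    /* Teardown side: */
    clear_bit(MLX5E_STATE_XDP_TX_ENABLED, &priv->state);
    synchronize_rcu(); /* in-flight napi cycles observe the clear */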

Fixes: 58b99ee3e3eb ("net/mlx5e: Add support for XDP_REDIRECT in device-out side")
Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>


# 5e0d2eef 21-Nov-2018 Tariq Toukan <tariqt@mellanox.com>

net/mlx5e: XDP, Support Enhanced Multi-Packet TX WQE

Add support for the HW feature of multi-packet WQE in XDP
xmit flow.

The conventional TX descriptor (WQE, Work Queue Element) serves
a single packet. Our HW has support for multi-packet WQE (MPWQE)
in which a single descriptor serves multiple TX packets.

This reduces both the PCI overhead and the CPU cycles wasted on
writing them.

In this patch we add support for the HW feature, which is supported
starting from ConnectX-5.

Performance:
Tested packet rate for UDP 64Byte multi-stream over ConnectX-5 NICs.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

XDP_TX:
We see a huge gain on single port ConnectX-5, and reach the 100 Mpps
milestone.
* Single-port HCA:
Before: 70 Mpps
After: 100 Mpps (+42.8%)

* Dual-port HCA:
Before: 51.7 Mpps
After: 57.3 Mpps (+10.8%)

* In both cases we tested traffic on one port. For now, on dual-port
HCAs we see only a small gain; we are working to overcome this
bottleneck, but for the moment we can reach the wanted numbers (as seen
on single-port HCAs) only with experimental firmware on dual-port HCAs.

XDP_REDIRECT:
Redirect from (A) ConnectX-5 to (B) ConnectX-5.
Due to a setup limitation, (A) and (B) are on different NUMA nodes,
so absolute performance numbers are not optimal.
Note:
Below is the transmit rate of (B), not the redirect rate of (A)
which is in some cases higher.

* (B) is single-port:
Before: 77 Mpps
After: 90 Mpps (+16.8%)

* (B) is dual-port:
Before: 61 Mpps
After: 72 Mpps (+18%)

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>


# fea28dd6 14-Oct-2018 Tariq Toukan <tariqt@mellanox.com>

net/mlx5e: XDP, Maintain a FIFO structure for xdp_info instances

This provides infrastructure to have multiple xdp_info instances
for the same consumer index.
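
A sketch of such a fifo (fields close to the real struct):

    struct xdp_info_fifo_sketch {
            struct mlx5e_xdp_info *xi;
            u32 *cc;  /* consumer counter */
            u32 *pc;  /* producer counter */
            u32 mask; /* size - 1; size is a power of two */
    };

    static inline void
    xdpi_fifo_push(struct xdp_info_fifo_sketch *fifo,
                   struct mlx5e_xdp_info *xi)
    {
            fifo->xi[(*fifo->pc)++ & fifo->mask] = *xi;
    }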

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>


# b8180392 21-Nov-2018 Tariq Toukan <tariqt@mellanox.com>

net/mlx5e: XDP, Replace boolean doorbell indication with segment pointer

Instead of calculating the control segment to be used upon an
XDP xmit doorbell, save it in the SQ structure.
Nullify it when there is no pending doorbell.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>


# feb2ff9d 21-Nov-2018 Tariq Toukan <tariqt@mellanox.com>

net/mlx5e: XDP, Change the XDP SQ redirect indication

Do not maintain an SQ state bit to indicate whether an
XDP SQ serves redirect operations.

Instead, rely on the fact that such an XDP SQ doesn't reside
in an RQ instance, while the others do.
This info is not known to the XDP SQ functions themselves,
and they rely on their callers to distinguish between the cases.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>


# 4fb2f516 21-Nov-2018 Tariq Toukan <tariqt@mellanox.com>

net/mlx5e: XDP, Precede XDP-related operations in RQ poll by a loaded program check

At the end of the RQ polling loop, some XDP-related operations
might be required. Before checking them one by one, check if
an XDP program is even loaded.
Combine all the checks and operations in a single function in xdp files.

This saves unnecessary checks for non-XDP flows.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>


# 58b99ee3 22-May-2018 Tariq Toukan <tariqt@mellanox.com>

net/mlx5e: Add support for XDP_REDIRECT in device-out side

Add implementation for the ndo_xdp_xmit callback.

Dedicate a new set of XDP-SQ instances to satisfy the XDP_REDIRECT
requests. These instances are totally separated from the existing
XDP-SQ objects that satisfy local XDP_TX actions.

Performance tests:

xdp_redirect_map from ConnectX-5 to ConnectX-5.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet-rate of 64B packets.

Single queue: 7 Mpps.
Multi queue: 55 Mpps.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>


# c94e4f11 15-Jul-2018 Tariq Toukan <tariqt@mellanox.com>

net/mlx5e: Make XDP xmit functions more generic

Convert the XDP xmit functions to use the generic xdp_frame API
in the XDP_TX flow.
The same functions will be used later in this series to transmit
the XDP redirect-out packets as well.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>


# 159d2131 15-Jul-2018 Tariq Toukan <tariqt@mellanox.com>

net/mlx5e: Move XDP related code into new XDP files

Take XDP code out of the general EN header and RX file into
new XDP files.

Currently, the XDP-SQ resides only within an RQ and is used from a
single flow (XDP_TX) triggered upon RX completions.
In a downstream patch, an additional type of XDP-SQ instance will be
introduced and used for the XDP_REDIRECT flow, totally unrelated to
the RX context.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>