#
0a149ab7 |
|
11-Dec-2023 |
Liang Chen <liangchen.linux@gmail.com> |
page_pool: transition to reference count management after page draining To support multiple users referencing the same fragment, 'pp_frag_count' is renamed to 'pp_ref_count', transitioning pp pages from fragment management to reference count management after draining, based on the suggestion from [1]. The idea is that the concept of fragmenting exists before the page is drained, and all related functions retain their current names. However, once the page is drained, its management shifts to being governed by 'pp_ref_count'. Therefore, all functions associated with that stage of a pp page's lifecycle are renamed. [1] http://lore.kernel.org/netdev/f71d9448-70c8-8793-dc9a-0eb48a570300@huawei.com Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://lore.kernel.org/r/20231212044614.42733-2-liangchen.linux@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
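As a rough illustration of the renamed post-drain helpers (a sketch, assuming the rename scheme described above; check include/net/page_pool/helpers.h for the exact signatures):

	/* A user done with a drained pp page drops a 'pp_ref_count' reference;
	 * the old fragment-flavored page_pool_put_defragged_page() becomes
	 * page_pool_put_unrefed_page(), while the pre-drain helper
	 * page_pool_fragment_page() keeps its name. */
	static void release_rx_page(struct page_pool *pool, struct page *page)
	{
		page_pool_put_unrefed_page(pool, page, -1, true);
	}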
#
db52aa6d |
|
04-Aug-2023 |
Tariq Toukan <tariqt@nvidia.com> |
net/mlx5e: Decouple CQ from priv Make CQ struct and methods independent of "priv", use more basic arguments instead. This will ease the transition to netdev with multiple mdevs. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
ef9369e9 |
|
29-Sep-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Fix page_pool allocation failure recovery for legacy rq

When a page allocation fails during refill in mlx5e_refill_rx_wqes, the page will be released again on the next refill call. This triggers the page_pool negative page fragment count warning below:

[  338.326070] WARNING: CPU: 4 PID: 0 at include/net/page_pool/helpers.h:130 mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
...
[  338.328993] RIP: 0010:mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
[  338.329094] Call Trace:
[  338.329097]  <IRQ>
[  338.329100]  ? __warn+0x7d/0x120
[  338.329105]  ? mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
[  338.329173]  ? report_bug+0x155/0x180
[  338.329179]  ? handle_bug+0x3c/0x60
[  338.329183]  ? exc_invalid_op+0x13/0x60
[  338.329187]  ? asm_exc_invalid_op+0x16/0x20
[  338.329192]  ? mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
[  338.329259]  mlx5e_post_rx_wqes+0x210/0x5a0 [mlx5_core]
[  338.329327]  ? mlx5e_poll_rx_cq+0x88/0x6f0 [mlx5_core]
[  338.329394]  mlx5e_napi_poll+0x127/0x6b0 [mlx5_core]
[  338.329461]  __napi_poll+0x25/0x1a0
[  338.329465]  net_rx_action+0x28a/0x300
[  338.329468]  __do_softirq+0xcd/0x279
[  338.329473]  irq_exit_rcu+0x6a/0x90
[  338.329477]  common_interrupt+0x82/0xa0
[  338.329482]  </IRQ>

This patch fixes the legacy rq case by releasing all allocated fragments and then setting the skip flag on all released fragments. It is important to note that the number of released fragments will be higher than the number of allocated fragments when an allocation error occurs.

Fixes: 3f93f82988bc ("net/mlx5e: RX, Defer page release in legacy rq for better recycling")
Tested-by: Chris Mason <clm@fb.com>
Reported-by: Chris Mason <clm@fb.com>
Closes: https://lore.kernel.org/netdev/117FF31A-7BE0-4050-B2BB-E41F224FF72F@meta.com
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
be43b748 |
|
02-Oct-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Fix page_pool allocation failure recovery for striding rq

When a page allocation fails during refill in mlx5e_post_rx_mpwqes, the page will be released again on the next refill call. This triggers the page_pool negative page fragment count warning below:

[ 2436.447717] WARNING: CPU: 1 PID: 2419 at include/net/page_pool/helpers.h:130 mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
...
[ 2436.447895] RIP: 0010:mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
[ 2436.447991] Call Trace:
[ 2436.447975]  mlx5e_post_rx_mpwqes+0x1d5/0xcf0 [mlx5_core]
[ 2436.447994]  <IRQ>
[ 2436.447996]  ? __warn+0x7d/0x120
[ 2436.448009]  ? mlx5e_handle_rx_cqe_mpwrq+0x109/0x1d0 [mlx5_core]
[ 2436.448002]  ? mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
[ 2436.448044]  ? mlx5e_poll_rx_cq+0x87/0x6e0 [mlx5_core]
[ 2436.448061]  ? report_bug+0x155/0x180
[ 2436.448065]  ? handle_bug+0x36/0x70
[ 2436.448067]  ? exc_invalid_op+0x13/0x60
[ 2436.448070]  ? asm_exc_invalid_op+0x16/0x20
[ 2436.448079]  mlx5e_napi_poll+0x122/0x6b0 [mlx5_core]
[ 2436.448077]  ? mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
[ 2436.448113]  ? generic_exec_single+0x35/0x100
[ 2436.448117]  __napi_poll+0x25/0x1a0
[ 2436.448120]  net_rx_action+0x28a/0x300
[ 2436.448122]  __do_softirq+0xcd/0x279
[ 2436.448126]  irq_exit_rcu+0x6a/0x90
[ 2436.448128]  sysvec_apic_timer_interrupt+0x6e/0x90
[ 2436.448130]  </IRQ>

This patch fixes the striding rq case by setting the skip flag on all the wqe pages that were expected to have new pages allocated.

Fixes: 4c2a13236807 ("net/mlx5e: RX, Defer page release in striding rq for better recycling")
Tested-by: Chris Mason <clm@fb.com>
Reported-by: Chris Mason <clm@fb.com>
Closes: https://lore.kernel.org/netdev/117FF31A-7BE0-4050-B2BB-E41F224FF72F@meta.com
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
a9ca9f9c |
|
04-Aug-2023 |
Yunsheng Lin <linyunsheng@huawei.com> |
page_pool: split types and declarations from page_pool.h Split types and pure function declarations from page_pool.h and add them in page_pool/types.h, so that C sources can include page_pool/helpers.h and headers should generally only include page_pool/types.h, as suggested by Jakub. Rename page_pool.h to page_pool/helpers.h to have both in one place. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/20230804180529.2483231-2-aleksander.lobakin@intel.com [Jakub: change microsoft/mana, fix kdoc paths in Documentation] Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
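In practice the split means headers and C sources pick different includes; a minimal illustration using the paths this commit introduces (the struct is illustrative):

	/* in a header: types and declarations only */
	#include <net/page_pool/types.h>

	struct my_rx_ring {
		struct page_pool *pp;	/* types.h is enough for the pointer */
	};

	/* in a .c file: the inline helpers live in the renamed header */
	#include <net/page_pool/helpers.h>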
#
33b18a0f |
|
31-Jul-2023 |
Jianbo Liu <jianbol@nvidia.com> |
net/mlx5e: Change the parameter of IPsec RX skb handle function Refactor the function to pass in reg B value only. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/3b3c53f64660d464893eaecc41298b1ce49c6baa.1690802064.git.leon@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
7abd955a |
|
31-May-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Fix page_pool page fragment tracking for XDP Currently mlx5e releases pages directly to the page_pool for XDP_TX and does page fragment counting for XDP_REDIRECT. RX pages from the page_pool are leaking on XDP_REDIRECT because the xdp core will release only one fragment out of MLX5E_PAGECNT_BIAS_MAX and subsequently the page is marked as "skip release" which avoids the driver release. A fix would be to take an extra fragment for XDP_REDIRECT and not set the "skip release" bit so that the release on the driver side can handle the remaining bias fragments. But this would be a shortsighted solution. Instead, this patch converges the two XDP paths (XDP_TX and XDP_REDIRECT) to always do fragment tracking. The "skip release" bit is no longer necessary for XDP. Fixes: 6f5742846053 ("net/mlx5e: RX, Enable skb page recycling through the page_pool") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
2e2d1965 |
|
22-May-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Fix flush and close release flow of regular rq for legacy rq Regular (non-XSK) RQs get flushed on XSK setup and re-activated on XSK close. If the same regular RQ is closed (a config change for example) soon after the XSK close, a double release occurs because the missing wqes get released a second time. Fixes: 3f93f82988bc ("net/mlx5e: RX, Defer page release in legacy rq for better recycling") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
b51f4113 |
|
10-May-2023 |
Yunsheng Lin <linyunsheng@huawei.com> |
net: introduce and use skb_frag_fill_page_desc() Most users use __skb_frag_set_page()/skb_frag_off_set()/skb_frag_size_set() to fill the page desc for a skb frag. Introduce skb_frag_fill_page_desc() to do that in one call. net/bpf/test_run.c does not call skb_frag_off_set() to set the offset; "copy_from_user(page_address(page), ...)" and 'shinfo' being part of the 'data' kzalloced in bpf_test_init() suggest that the offset is assumed to be initialized to zero, so call skb_frag_fill_page_desc() with a zero offset for this case. Also, skb_frag_set_page() is not used anymore, so remove it. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
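At a call site the change collapses three setters into one; a short sketch using the helper named above:

	#include <linux/skbuff.h>

	static void fill_first_frag(struct sk_buff *skb, struct page *page,
				    int offset, int size)
	{
		skb_frag_t *frag = &skb_shinfo(skb)->frags[0];

		/* before: __skb_frag_set_page(frag, page);
		 *         skb_frag_off_set(frag, offset);
		 *         skb_frag_size_set(frag, size);   */
		skb_frag_fill_page_desc(frag, page, offset, size);
	}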
#
40afb3b1 |
|
03-Apr-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Fix XDP_TX page release for legacy rq nonlinear case When the XDP handler marks the data for transmission (XDP_TX), it is incorrect to release the page fragment. Instead, the fragments should be marked as MLX5E_WQE_FRAG_SKIP_RELEASE because XDP will release the page directly to the page_pool (page_pool_put_defragged_page) after TX completion. The linear case already does this. This patch fixes the nonlinear part as well. Also, the loop over the fragments was incorrect: when handling pages after XDP_TX in the legacy rq nonlinear case, the loop was skipping the first wqe fragment. Fixes: 3f93f82988bc ("net/mlx5e: RX, Defer page release in legacy rq for better recycling") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
c8e90902 |
|
30-Mar-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Fix releasing page_pool pages twice for striding RQ mlx5e_free_rx_descs is responsible for calling the dealloc_wqe op which returns pages to the page_pool. This can happen during flush or close. For XSK, the regular RQ is flushed (when replaced by the XSK RQ) and also closed later. This is normally not a problem as the wqe list is empty on a second call to mlx5e_free_rx_descs. However, for striding RQ, the previously released wqes from the list will appear as missing and will be released a second time by mlx5e_free_rx_missing_descs. This patch sets the no release bits on the striding RQ wqes in the dealloc_wqe op to prevent releasing the pages a second time. Please note that the bits are set only in the control path during close and not in the data path. Fixes: 4c2a13236807 ("net/mlx5e: RX, Defer page release in striding rq for better recycling") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
f52ac702 |
|
17-Apr-2023 |
Tariq Toukan <tariqt@nvidia.com> |
net/mlx5e: RX, Add XDP multi-buffer support in Striding RQ

Here we add support for multi-buffer XDP handling in Striding RQ, which is our default out-of-the-box RQ type. Before this series, loading such an XDP program would fail, until you switch to the legacy RQ (by unsetting the rx_striding_rq priv-flag).

To overcome the lack of headroom and tailroom between the strides, we allocate a side page to be used for the descriptor (xdp_buff / skb) and the linear part. When an XDP program is attached, we structure the xdp_buff so that it contains no data in the linear part, and the whole packet resides in the fragments.

In case of XDP_PASS, where an SKB still needs to be created, we copy up to 256 bytes to its linear part, to match the current behavior, and satisfy functions that assume finding the packet headers in the SKB linear part (like eth_type_trans).

Performance testing:
Packet rate test, 64 bytes, 32 channels, MTU 9000 bytes.
CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz.
NIC: ConnectX-6 Dx, at 100 Gbps.

+----------+-------------+-------------+---------+
| Test     | Legacy RQ   | Striding RQ | Speedup |
+----------+-------------+-------------+---------+
| XDP_DROP | 101,615,544 | 117,191,020 | +15%    |
+----------+-------------+-------------+---------+
| XDP_TX   |  95,608,169 | 117,043,422 | +22%    |
+----------+-------------+-------------+---------+

Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2cb0e27d |
|
17-Apr-2023 |
Tariq Toukan <tariqt@nvidia.com> |
net/mlx5e: RX, Prepare non-linear striding RQ for XDP multi-buffer support In preparation for supporting XDP multi-buffer in striding RQ, use the xdp_buff struct to describe the packet. Make its skb_shared_info collide with the one of the allocated SKB, then add the fragments using the xdp_buff API. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
221c8c7a |
|
17-Apr-2023 |
Tariq Toukan <tariqt@nvidia.com> |
net/mlx5e: RX, Generalize mlx5e_fill_mxbuf() Make the function more generic. Let it get an additional frame_sz parameter instead of deriving it from the RQ struct. No functional change here, just a preparation for a downstream patch. Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
27602319 |
|
17-Apr-2023 |
Tariq Toukan <tariqt@nvidia.com> |
net/mlx5e: RX, Take shared info fragment addition into a function Introduce mlx5e_add_skb_shared_info_frag(), a function dedicated to adding a fragment into a struct skb_shared_info object. Use it in the Legacy RQ flow. Similar usage will be added in a downstream patch in the corresponding Striding RQ flow. Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3905f8d6 |
|
22-Feb-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Remove unnecessary recycle parameter and page_cache stats The recycle parameter used during page release is no longer necessary: the page pool can detect when the page cannot be recycled to the cache or ring without any outside hint. The page pool will also take care of cleaning up after itself once all the inflight pages have been released. So no need to explicitly release pages to the system. Remove the internal page_cache stats as the mlx5e_page_cache struct no longer exists. Delete the documentation entries along with the stats. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
cd640b05 |
|
21-Feb-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Break the wqe bulk refill in smaller chunks

To avoid overflowing the page pool's cache, don't release the whole bulk, which is usually larger than the cache refill size. Group release+alloc instead into cache refill units that allow releasing to the cache and then allocating from the cache.

A refill_unit variable is added as an iteration unit over the wqe_bulk when doing release+alloc.

For a single ring, single core, default MTU (1500) TCP stream test the number of pages allocated from the cache directly (rx_pp_recycle_cached) increases from 0% to 52%:

+---------------------------------------------+
| Page Pool stats (/sec)  | Before  | After   |
+-------------------------+---------+---------+
|rx_pp_alloc_fast         | 2145422 | 2193802 |
|rx_pp_alloc_slow         | 2       | 0       |
|rx_pp_alloc_empty        | 2       | 0       |
|rx_pp_alloc_refill       | 34059   | 16634   |
|rx_pp_alloc_waive        | 0       | 0       |
|rx_pp_recycle_cached     | 0       | 1145818 |
|rx_pp_recycle_cache_full | 0       | 0       |
|rx_pp_recycle_ring       | 2179361 | 1064616 |
|rx_pp_recycle_ring_full  | 121     | 0       |
+---------------------------------------------+

With this patch, the performance for legacy rq for the above test is back to baseline.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
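A hedged sketch of the chunking idea; release_wqes()/alloc_wqes() are hypothetical stand-ins for the driver's bulk release and allocation paths:

	/* Iterate the bulk in refill_unit-sized chunks so pages released in
	 * one chunk can be picked back up from the page_pool cache by the
	 * allocation that immediately follows, instead of overflowing it. */
	static int refill_in_chunks(struct mlx5e_rq *rq, u16 ix, int wqe_bulk,
				    int refill_unit)
	{
		int done = 0;

		while (done < wqe_bulk) {
			int n = min(refill_unit, wqe_bulk - done);

			release_wqes(rq, ix + done, n);		/* hypothetical */
			if (alloc_wqes(rq, ix + done, n) < n)	/* hypothetical */
				break;
			done += n;
		}
		return done;
	}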
#
76238d0f |
|
21-Feb-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Split off release path for xsk buffers for legacy rq Don't mix xsk buffer releases with page releases anymore. This is needed for handling of deferred page release. Add a new bulk free function for xsk buffers from wqe frags. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
3f93f829 |
|
21-Feb-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Defer page release in legacy rq for better recycling

Currently, fragmented pages from the page pool can be released in two ways:

1) In the mlx5e driver when trimming off the unused fragments AND the associated skb fragments have been released. This path allows recycling of pages to the page pool cache (allow_direct == true).
2) On the skb release path (last fragment release), which will always release pages to the page pool ring (allow_direct == false).

Whichever is releasing the last fragment will be decisive on where the page gets released: the cache or the ring. So we obviously want to maximize for doing the release from 1.

This patch does that by deferring the release of page fragments right before requesting new ones from the page pool. A flag is added to make sure that there's no release before first alloc and that XDP_TX fragments are not released prematurely.

This is a preparation patch that doesn't unlock the performance improvements yet. A followup patch will do that.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
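The deferral can be pictured as follows; this is a sketch, and the flag and helper names (FRAG_SKIP_RELEASE, release_frag_page, alloc_frag_page) are illustrative rather than the exact mlx5e symbols:

	/* Release the previous occupant of the slot only now, right before
	 * re-filling it: by this point the skb has usually been freed, so the
	 * page can recycle straight to the page_pool cache. The flag skips
	 * the release on the very first fill and for fragments handed to
	 * XDP_TX. */
	static int refill_frag(struct mlx5e_rq *rq,
			       struct mlx5e_wqe_frag_info *frag)
	{
		if (!(frag->flags & FRAG_SKIP_RELEASE))
			release_frag_page(rq, frag);	/* allow_direct == true */
		frag->flags &= ~FRAG_SKIP_RELEASE;

		return alloc_frag_page(rq, frag);
	}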
#
625dff29 |
|
21-Feb-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Change wqe last_in_page field from bool to bit flags Change the bool flag to a bitfield as we'll use it in a downstream patch in the series to add signaling about skipping a fragment release. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
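The shape of the bool-to-flags change above, sketched (member and enum names follow the series' conventions but are illustrative here):

	/* before: a single-purpose bool in the per-fragment info */
	struct wqe_frag_info_old {
		bool last_in_page;
	};

	/* after: bit flags, leaving room for the skip-release signal
	 * added by the downstream patch */
	enum wqe_frag_flag {
		WQE_FRAG_LAST_IN_PAGE,
		WQE_FRAG_SKIP_RELEASE,
	};

	struct wqe_frag_info_new {
		u8 flags;	/* tested with BIT(WQE_FRAG_...) */
	};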
#
4c2a1323 |
|
13-Feb-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Defer page release in striding rq for better recycling

Currently, for striding RQ, fragmented pages from the page pool can get released in two ways:

1) In the mlx5e driver when trimming off the unused fragments AND the associated skb fragments have been released. This path allows recycling of pages to the page pool cache (allow_direct == true).
2) On the skb release path (last fragment release), which will always release pages to the page pool ring (allow_direct == false).

Whichever is releasing the last fragment will be decisive on where the page gets released: the cache or the ring. So we obviously want to maximize for doing the release from 1.

This patch does that by deferring the release of page fragments right before requesting new ones from the page pool. Extra care needs to be taken for the corner cases:

* On first call, make sure that release is not called. The skip_release_bitmap is used for this purpose.
* On rq shutdown, make sure that all wqes that were not in the linked list are released.

For a single ring, single core, default MTU (1500) TCP stream test the number of pages allocated from the cache directly (rx_pp_recycle_cached) increases from 31% to 98%:

+----------------------------------------------+
| Page Pool stats (/sec)  | Before  | After    |
+-------------------------+---------+----------+
|rx_pp_alloc_fast         | 2137754 | 2261033  |
|rx_pp_alloc_slow         | 47      | 9        |
|rx_pp_alloc_empty        | 47      | 9        |
|rx_pp_alloc_refill       | 23230   | 819      |
|rx_pp_alloc_waive        | 0       | 0        |
|rx_pp_recycle_cached     | 672182  | 2209015  |
|rx_pp_recycle_cache_full | 1789    | 0        |
|rx_pp_recycle_ring       | 1485848 | 52259    |
|rx_pp_recycle_ring_full  | 3003    | 584      |
+----------------------------------------------+

With this patch, the performance in striding rq for the above test is back to baseline.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
38a36efc |
|
16-Feb-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Rename xdp_xmit_bitmap to a more generic name The xdp_xmit_bitmap currently serves only one purpose: to avoid releasing pages that are still in use due to XDP TX. A following patch will use this bitmap in a slightly different context but for the same purpose. So rename the bitmap to a more generic name that reflects the purpose, not the context. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
6f574284 |
|
18-Jan-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Enable skb page recycling through the page_pool

Start using the page_pool skb recycling api to recycle all pages back to the page pool and stop using atomic page reference counting.

The mlx5e driver used to manage in-flight pages using page refcounting: for each fragment there were 2 atomic write operations happening (one for building the skb and one on skb release).

The page_pool api introduced a method to track page fragments more optimally:
* The page's pp_frag_count is set to a large bias on page alloc (1 x atomic write operation).
* The driver tracks the actual page fragments in a non-atomic variable.
* When the skb is recycled, pp_frag_count is decremented (atomic write operation).
* When the page is released in the driver, the unused number of fragments (relative to the bias) is deducted from pp_frag_count (atomic write operation).
* Last page defragmentation will only be an atomic read.

So in total there are `number of fragments + 1` atomic write ops. As opposed to previously: `2 * frags` atomic write ops.

Pages are wrapped in a mlx5e_frag_page structure which also contains the number of fragments. This makes it easy to count the fragments in the driver.

This change brings performance improvements for the case when the old rx page_cache had low recycling rates due to head of queue blocking. For an iperf3 TCP test with a single stream, on a single core (iperf and receive queue running on the same core), the following improvements can be noticed:

* Striding rq:
  - before (net-next baseline): bitrate = 30.1 Gbits/sec
  - after:                      bitrate = 31.4 Gbits/sec (diff: 4.14%)
* Legacy rq:
  - before (net-next baseline): bitrate = 30.2 Gbits/sec
  - after:                      bitrate = 33.0 Gbits/sec (diff: 8.48%)

There are 2 temporary performance degradations introduced:

1) TCP streams that had a good recycling rate with the old page_cache have a degradation for both striding and linear rq. This is due to very low page pool cache recycling: the pages are released during skb recycle, which will release pages to the page pool ring for safety. The following patches in this series will tackle this problem by deferring the page release in the driver to increase the chance of having pages recycled to the cache.
2) XDP performance is now lower (4-5%) due to the higher number of atomic operations used for fragment management. But this opens the door for supporting multiple packets per page in XDP, which will bring a big gain.

Otherwise, performance is similar to baseline.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
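The accounting scheme reads naturally in code; a hedged sketch following the series' names (MLX5E_PAGECNT_BIAS_MAX, mlx5e_frag_page), not the verbatim driver source:

	struct mlx5e_frag_page {
		struct page *page;
		u16 frags;	/* driver-side, non-atomic fragment count */
	};

	/* alloc: one atomic write installs the large bias */
	static int frag_page_alloc(struct page_pool *pp,
				   struct mlx5e_frag_page *fp)
	{
		fp->page = page_pool_dev_alloc_pages(pp);
		if (!fp->page)
			return -ENOMEM;
		page_pool_fragment_page(fp->page, MLX5E_PAGECNT_BIAS_MAX);
		fp->frags = 0;
		return 0;
	}

	/* each fragment handed out: a plain non-atomic increment */
	static void frag_page_take(struct mlx5e_frag_page *fp)
	{
		fp->frags++;
	}

	/* driver release: deduct the unused part of the bias in one atomic
	 * write; skb recycling decrements the used part as frags are freed */
	static void frag_page_release(struct page_pool *pp,
				      struct mlx5e_frag_page *fp)
	{
		u16 drain = MLX5E_PAGECNT_BIAS_MAX - fp->frags;

		if (page_pool_defrag_page(fp->page, drain) == 0)
			page_pool_put_defragged_page(pp, fp->page, -1, true);
	}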
#
4a5c5e25 |
|
14-Dec-2022 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Enable dma map and sync from page_pool allocator Remove driver dma mapping and unmapping of pages. Let the page_pool api do it. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
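The handover to the page_pool is visible at pool creation time; a minimal sketch of the page_pool_params involved (field values illustrative):

	struct page_pool_params pp_params = {
		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
		.order		= 0,
		.pool_size	= pool_size,
		.nid		= node,
		.dev		= rq->pdev,		/* device the pool maps against */
		.dma_dir	= rq->buff.map_dir,	/* FROM_DEVICE, or BIDIRECTIONAL for XDP */
		.max_len	= PAGE_SIZE,		/* sync-for-device length */
	};
	struct page_pool *pp = page_pool_create(&pp_params);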
#
08c9b61b |
|
13-Dec-2022 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Remove internal page_cache

This patch removes the internal rx page_cache and uses the generic page_pool api only. It used to be that the page_pool couldn't handle all the mlx5 driver usecases, but with the introduction of skb recycling and page fragmentation in the page_pool, a full switch can now be made.

Some benefits of this transition:
* Better page recycling in the cases when the page_cache was suffering from head of queue blocking. The page_pool doesn't have this issue.
* DMA mapping/unmapping can be managed by the page_pool.
* mlx5e_rq size reduced by more than 50% due to the page_cache array being deleted.

This patch only removes the page_cache. Downstream patches will enable the required page_pool features and will add further fine-tuning.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
ca6ef9f0 |
|
14-Mar-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Store SHAMPO header pages in array Save allocated SHAMPO header pages to an array to which the mlx5e_dma_info page will point. This change is a preparation for introducing the mlx5e_frag_page structure in a downstream patch. There's no new functionality introduced. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
d39092ca |
|
29-Jan-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Remove alloc unit layout constraint for striding rq This change removes the usage of the mlx5e_alloc_unit union for striding rq. The change is more straightforward than for legacy rq as the alloc units union is already in place. This patch only moves things around: instead of an array of unions, make it a union of arrays. This has the effect that each mlx5e_mpw_info will allocate the largest possible size of the array member. For xsk this means that the array of xdp_buff pointers for the wqe will still be contiguous, but there will be some extra unused bytes at the end of the array. A further patch in the series will add the mlx5e_frag_page struct, for which the described size constraint will no longer hold. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
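The layout flip in miniature (array length and member names illustrative):

	#define UNITS_PER_WQE 64	/* illustrative */

	/* before: array of unions -- every slot is as wide as the widest member */
	union alloc_unit {
		struct page *page;
		struct xdp_buff *xsk;
	} units_old[UNITS_PER_WQE];

	/* after: union of arrays -- each member array sizes itself, and the
	 * xdp_buff pointer array stays contiguous for XSK batch allocation */
	union alloc_units {
		struct page *pages[UNITS_PER_WQE];
		struct xdp_buff *xsk_buffs[UNITS_PER_WQE];
	} units_new;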
#
8fb1814f |
|
27-Jan-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Remove alloc unit layout constraint for legacy rq

The mlx5e_alloc_unit union is conveniently used to store arrays of pointers to struct page or struct xdp_buff (for xsk). The union is currently expected to have the size of a pointer for xsk batch allocations to work. This is convenient for the current state of the code but makes it impossible to add a structure of a different size to the alloc unit. A further patch in the series will add the mlx5e_frag_page struct, for which the described size constraint will no longer hold.

This change removes the usage of the mlx5e_alloc_unit union for legacy rq:
- A union of arrays is introduced (mlx5e_alloc_units) to replace the array of unions, to allow structures of different sizes.
- Each fragment has a pointer to a unit in the mlx5e_alloc_units array.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
09df0370 |
|
27-Jan-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Remove mlx5e_alloc_unit argument in page allocation Change internal page cache and page pool api to use a struct page ** instead of a mlx5e_alloc_unit *. This is the first change in a series which is meant to remove the mlx5e_alloc_unit altogether. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
93a1ab2c |
|
17-Feb-2023 |
Paul Blakey <paulb@nvidia.com> |
net/mlx5: Refactor tc miss handling to a single function Move tc miss handling code to en_tc.c, and remove duplicate code. Signed-off-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
993fd9bd |
|
15-Feb-2023 |
Gal Pressman <gal@nvidia.com> |
net/mlx5e: RX, Remove doubtful unlikely call When building an skb in non-linear mode, it is neither likely nor unlikely that the xdp buff has fragments; it depends on the size of the packet received. Signed-off-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
|
#
9da5294e |
|
15-Feb-2023 |
Tariq Toukan <tariqt@nvidia.com> |
net/mlx5e: Remove redundant page argument in mlx5e_xdp_handle() Remove the page parameter, it can be derived from the xdp_buff member of mlx5e_xdp_buff. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
53ee9142 |
|
15-Feb-2023 |
Tariq Toukan <tariqt@nvidia.com> |
net/mlx5e: Switch to using napi_build_skb() Use napi_build_skb(), which uses NAPI percpu caches to obtain a skbuff_head, instead of in-place allocation. napi_build_skb() calls napi_skb_cache_get(), which returns a cached skb, or allocates a bulk of NAPI_SKB_CACHE_BULK (16) if the cache is empty. Performance test: TCP single stream, single ring, single core, default MTU (1500B). Before: 26.5 Gbits/sec. After: 30.1 Gbits/sec (+13.6%). Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
|
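The swap itself is one line; both functions share the same signature:

	static struct sk_buff *rx_build_skb(void *va, u32 frag_size)
	{
		/* was: build_skb(va, frag_size), which allocated the head
		 * inline; napi_build_skb() pulls one from the NAPI percpu
		 * cache instead */
		return napi_build_skb(va, frag_size);
	}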
#
bc1536f3 |
|
18-Jan-2023 |
Jiri Pirko <jiri@nvidia.com> |
net/mlx5e: Replace usage of mlx5e_devlink_get_dl_port() by netdev->devlink_port In places where the netdev pointer is available, access the related devlink_port pointer via netdev->devlink_port instead of using mlx5e_devlink_get_dl_port(), which is going to be removed. Move the SET_NETDEV_DEVLINK_PORT() call right after devlink port registration to make sure netdev->devlink_port is valid. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
bc8d405b |
|
19-Jan-2023 |
Toke Høiland-Jørgensen <toke@redhat.com> |
net/mlx5e: Support RX XDP metadata Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe pointer to the mlx5e_skb_from* functions so it can be retrieved from the XDP ctx to do this. Cc: Tariq Toukan <tariqt@nvidia.com> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Anatoly Burakov <anatoly.burakov@intel.com> Cc: Alexander Lobakin <alexandr.lobakin@intel.com> Cc: Magnus Karlsson <magnus.karlsson@gmail.com> Cc: Maryam Tahhan <mtahhan@redhat.com> Cc: xdp-hints@xdp-project.net Cc: netdev@vger.kernel.org Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20230119221536.3349901-17-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
#
384a13ca |
|
19-Jan-2023 |
Toke Høiland-Jørgensen <toke@redhat.com> |
net/mlx5e: Introduce wrapper for xdp_buff Preparation for implementing HW metadata kfuncs. No functional change. Cc: Tariq Toukan <tariqt@nvidia.com> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Anatoly Burakov <anatoly.burakov@intel.com> Cc: Alexander Lobakin <alexandr.lobakin@intel.com> Cc: Magnus Karlsson <magnus.karlsson@gmail.com> Cc: Maryam Tahhan <mtahhan@redhat.com> Cc: xdp-hints@xdp-project.net Cc: netdev@vger.kernel.org Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20230119221536.3349901-16-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
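The wrapper is a plain embedding; sketched below, with the cqe member that the follow-up metadata patch adds noted as such (exact layout per the driver source may differ):

	struct mlx5e_xdp_buff {
		struct xdp_buff xdp;	/* handlers keep passing &mxbuf->xdp */
		struct mlx5_cqe64 *cqe;	/* added by the follow-up kfuncs patch */
	};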
#
b5e23931 |
|
23-Nov-2022 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: IPoIB, Fix child PKEY interface stats on rx path The current code always does the accounting using the stats from the parent interface (linked in the rq). This doesn't work when there are child interfaces configured. Fix this behavior by always using the stats from the child interface priv. This will also work for parent-only interfaces: the child (netdev) and the parent netdev (rq->netdev) will point to the same thing. Fixes: be98737a4faa ("net/mlx5e: Use dynamic per-channel allocations in stats") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
daab2e9c |
|
31-Oct-2022 |
Tariq Toukan <tariqt@nvidia.com> |
net/mlx5: Use generic definition for UMR KLM alignment MLX5_UMR_KLM_ALIGNMENT is in units of number of entries, while MLX5_UMR_MTT_ALIGNMENT (generalized and renamed to MLX5_UMR_FLEX_ALIGNMENT) is in byte units. This is misleading and confusing. Replace this KLM definition with one based on the generic definition. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
b146658f |
|
31-Oct-2022 |
Tariq Toukan <tariqt@nvidia.com> |
net/mlx5e: Add padding when needed in UMR WQEs Per the device spec, the MTTs/KLMs list in a UMR WQE must be aligned to 64B. Per our SW design, the MTT/KLM list would need alignment only if it's too small, for example on PPC when PAGE_SIZE is 64KB, and only 4 pages are needed to cover a MPWQE of size 256KB. Padding, if needed, is taken into account when calculating the UMR WQE fields (ds_cnt and xlt_octowords); however, no entries are provided, and instead garbage is passed. No real harm though, as these parts act as gaps between the RX MPWQEs and are not used by any of them; hence, in practice, the device does not try to write any incoming packet to them. Still, prefer providing clean padding marking the end of the list, and do not map garbage into the RQ memory region. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
2c925db0 |
|
09-Feb-2021 |
Ofer Levi <oferle@nvidia.com> |
net/mlx5e: Support enhanced CQE compression

The CQE compression feature improves performance by reducing the PCI bandwidth bottleneck on CQE writes.

Enhanced CQE compression was introduced in ConnectX-6 and aims to reduce the CPU utilization of SW-side packet decompression: the ownership bit, whose rewrite is likely to cost a cache miss, is replaced by a validity byte handled solely by HW. Another advantage of the enhanced feature is that session packets are available to SW as soon as a single CQE slot is filled, instead of waiting for the session to close; this improves packet latency from NIC to host.

Performance:
Following are tested scenarios and results comparing basic and enhanced CQE compression.

Setup:
IXIA 100GbE connected directly to port 0 and port 1 of a ConnectX-6 Dx 100GbE dual port.

Case #1 RX only, single flow goes to single queue:
IRQ rate reduced by ~30%, CPU utilization improved by 2%.

Case #2 IP forwarding from port 1 to port 0, single flow goes to single queue:
Avg latency improved from 60us to 21us, frame loss improved from 0.5% to 0.0%.

Case #3 IP forwarding from port 1 to port 0, max throughput, IXIA sends 100%, 8192 UDP flows, goes to 24 queues:
Enhanced is equal or slightly better than basic.

Testing the basic compression feature with this patch shows there is no performance degradation of the basic compression feature.

Signed-off-by: Ofer Levi <oferle@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
8d4b475e |
|
03-Nov-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Fix usage of DMA sync API DMA sync functions should use the same direction that was used by DMA mapping. Use DMA_BIDIRECTIONAL for XDP_TX from regular RQ, which reuses the same mapping that was used for RX, and DMA_TO_DEVICE for XDP_TX from XSK RQ and XDP_REDIRECT, which establish a new mapping in this direction. On the RX side, use the same direction that was used when setting up the mapping (DMA_BIDIRECTIONAL for XDP, DMA_FROM_DEVICE otherwise). Also don't skip sync for device when establishing a DMA_FROM_DEVICE mapping for RX, as some architectures (ARM) may require invalidating caches before the device can use the mapping. It doesn't break the bugfix made in commit 0b7cfa4082fb ("net/mlx5e: Fix page DMA map/unmap attributes"), since the bug happened on unmap. Fixes: 0b7cfa4082fb ("net/mlx5e: Fix page DMA map/unmap attributes") Fixes: b5503b994ed5 ("net/mlx5e: XDP TX forwarding support") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
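The rule the fix enforces, spelled out in code (offsets and lengths illustrative; the APIs are the standard DMA sync calls):

	/* RX buffers are mapped with rq->buff.map_dir: DMA_BIDIRECTIONAL when
	 * XDP is on, DMA_FROM_DEVICE otherwise -- and every sync on that
	 * mapping must use the same direction: */
	dma_sync_single_range_for_cpu(rq->pdev, addr, off, len,
				      rq->buff.map_dir);

	/* XDP_TX from a regular RQ reuses the RX mapping -> DMA_BIDIRECTIONAL */
	dma_sync_single_for_device(sq->pdev, dma_addr, dma_len,
				   DMA_BIDIRECTIONAL);

	/* XDP_TX from an XSK RQ and XDP_REDIRECT create a fresh mapping in
	 * the device direction -> DMA_TO_DEVICE */
	dma_addr = dma_map_single(sq->pdev, xdpf->data, dma_len,
				  DMA_TO_DEVICE);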
#
9f123f74 |
|
01-Oct-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Improve MTT/KSM alignment Make mlx5e_mpwrq_mtts_per_wqe take into account that KSM requires smaller alignment than MTT. Ensure that there is always an even amount of MTTs in a UMR WQE, so that complete octwords are formed, and no garbage is mapped. Drop extra alignment in MLX5_MTT_OCTW that may cause setting too big ucseg->xlt_octowords, also leading to mapping garbage. Generalize some calculations by introducing the MLX5_OCTWORD constant. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
cfb4d09c |
|
01-Oct-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: xsk: Improve need_wakeup logic The XSK need_wakeup mechanism allows the driver to stop busy waiting for buffers when the fill ring is empty, yield to the application and signal that the driver needs to be woken up after the application refills the fill ring. Add protection against the race condition on the RX (refill) side: if the application refills buffers after xskrq->post_wqes is called, but before mlx5e_xsk_update_rx_wakeup, NAPI will exit, skipping taking these buffers to the hardware WQ, and the application won't wake it up again. Optimize the whole need_wakeup logic, removing unneeded flows, to compensate for this new check. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
1ca6492e |
|
01-Oct-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: xsk: Include XSK skb_from_cqe callbacks in INDIRECT_CALL XSK is a performance-critical data path. To avoid an indirect function call with a retpoline, include XSK callbacks in the INDIRECT_CALL macro, so that they are called directly in XSK flows. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
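The mechanism comes from linux/indirect_call_wrapper.h: the macro compares the function pointer against the listed candidates and calls a match directly, falling back to the indirect call only if none match. A hedged sketch with this driver's handler names (argument list abridged):

	skb = INDIRECT_CALL_3(rq->wqe.skb_from_cqe,
			      mlx5e_skb_from_cqe_linear,
			      mlx5e_skb_from_cqe_nonlinear,
			      mlx5e_xsk_skb_from_cqe_linear,	/* now direct too */
			      rq, wi, cqe_bcnt);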
#
ddb7afee |
|
30-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Optimize RQ page deallocation

mlx5e_free_rx_mpwqe loops over all pages of a MPWQE, calling mlx5e_page_release for ones that are not scheduled for XDP_TX or XDP_REDIRECT; and mlx5e_page_release checks whether it's an XSK RQ or a regular one for each page/XSK frame. This check can be moved outside the loop to reduce the number of branches.

mlx5e_free_rx_wqe loops over all fragments, calling mlx5e_page_release for the ones that are last in a page; and mlx5e_page_release checks whether it's an XSK RQ or a regular one for each fragment. Using the fact that XSK doesn't support multiple fragments, it can be optimized for both XSK and regular usages:

1. Make an early check for XSK and call its deallocator directly, saving 3 branches (loop condition, frag->last_in_page and selection of deallocator).
2. Call the regular deallocator directly in the non-XSK case, saving a branch per fragment, except the first one.

After the changes, mlx5e_page_release is removed, as there are no callers left.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
96d37d86 |
|
30-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Call mlx5e_page_release_dynamic directly where possible mlx5e_page_release calls the appropriate deallocator depending on whether it's an XSK RQ or a regular one. Some flows that call this function are not compatible with XSK, so they can call the non-XSK deallocator directly to save a branch. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
132857d9 |
|
30-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Use non-XSK page allocator in SHAMPO The SHAMPO flow is not compatible with XSK, it can call the page pool allocator directly to save a branch. mlx5e_page_alloc is removed, as it's no longer used in any flow. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
cf544517 |
|
30-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: xsk: Use xsk_buff_alloc_batch on striding RQ

XSK provides a function to allocate frames in batches for more efficient processing. This commit starts using this function on striding RQ and creates an optimized flow for XSK. A side effect is an opportunity to optimize the regular RX flow by dropping branching for XSK cases.

Performance improvement is up to 6.4% in the aligned mode and up to 7.5% in the unaligned mode.

Aligned mode, 2048-byte frames: 12.9 Mpps -> 13.8 Mpps
Aligned mode, 4096-byte frames: 11.8 Mpps -> 12.5 Mpps
Unaligned mode, 2048-byte frames: 11.9 Mpps -> 12.8 Mpps
Unaligned mode, 3072-byte frames: 11.4 Mpps -> 12.1 Mpps
Unaligned mode, 4096-byte frames: 11.0 Mpps -> 11.2 Mpps

CPU: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
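The batching API returns how many frames it managed to allocate, which may be fewer than requested; a hedged sketch of a striding-RQ fill (field names follow the driver's conventions but are illustrative):

	static int xsk_fill_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi)
	{
		u32 want = rq->mpwqe.pages_per_wqe;	/* illustrative fields */
		u32 got = xsk_buff_alloc_batch(rq->xsk_pool,
					       wi->alloc_units.xsk_buffs, want);

		return got == want ? 0 : -ENOMEM;
	}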
#
259bbc64 |
|
30-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: xsk: Use xsk_buff_alloc_batch on legacy RQ

XSK provides a function to allocate frames in batches for more efficient processing. This commit starts using this function on legacy RQ, adding a special case for XSK. The new branch introduced basically replaces the branch that was removed from the same place a few commits before.

A check is made that DMA sync is not needed, because the batching allocator falls back to returning one frame when DMA sync is needed, and this is best handled by the loop in the standard case.

Performance improvement is up to 8% in the aligned mode and up to 9% in the unaligned mode.

Aligned mode, 2048-byte frames: 12.8 Mpps -> 13.5 Mpps
Aligned mode, 4096-byte frames: 11.5 Mpps -> 12.4 Mpps
Unaligned mode, 2048-byte frames: 12.2 Mpps -> 13.4 Mpps
Unaligned mode, 3072-byte frames: 11.6 Mpps -> 12.5 Mpps
Unaligned mode, 4096-byte frames: 11.2 Mpps -> 12.2 Mpps

CPU: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
a2e5ba24 |
|
30-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: xsk: Split out WQE allocation for legacy XSK RQ Allocation of XSK frames on legacy RQ may be made more efficient with a specialized routine that relies on certain assumptions, such as: there is only one fragment, and allocation units (XSK frames) are not shared among multiple packets. It reduces the number of branches both in the XSK code and in the regular RQ, because with this approach there is only a single check whether it's an XSK or regular RQ. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
0b482232 |
|
30-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Remove the outer loop when allocating legacy RQ WQEs Legacy RQ WQEs are allocated in a loop in small batches (8 WQEs). As partial batches are allowed, there is no point to have a loop in a loop, so the outer loop is removed, and the batch size is increased up to the total number of WQEs to allocate, still not smaller than 8. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
3f5fe0b2 |
|
30-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: xsk: Use partial batches in legacy RQ with XSK The previous commit allowed allocating WQE batches in legacy RQ partially, however, XSK still checks whether there are enough frames in the fill ring. Remove this check to allow to allocate batches partially also with XSK. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
42847fed |
|
30-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Use partial batches in legacy RQ Legacy RQ allocates WQEs in batches. If the batch allocation fails, the pages of the allocated part are released. This commit changes this behavior to allow using the pages that have already been allocated. After this change, we need to be careful about indexing rq->wqe.frags[]. The WQ size is a power of two that is divisible by wqe_bulk (8), and the old code used whole bulks, which allowed using indices [8*K; 8*K+7] without overflowing. Now that the bulks may be partial, the range can start at any location (not only at 8*K), so we need to wrap them around to avoid out-of-bounds array access. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
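The wrap-around relies on the WQ size being a power of two, so a mask suffices; a minimal sketch (field names illustrative):

	static struct mlx5e_wqe_frag_info *frag_at(struct mlx5e_rq *rq, u16 ix)
	{
		u16 wrapped = ix & (rq->wqe.wq_sz - 1);	/* wq_sz is a power of two */

		return &rq->wqe.frags[wrapped];
	}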
#
2d0765f7 |
|
29-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: xsk: Remove mlx5e_xsk_page_alloc_pool mlx5e_xsk_page_alloc_pool became a thin wrapper around xsk_buff_alloc. Drop it and call xsk_buff_alloc directly. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
672db024 |
|
29-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Convert struct mlx5e_alloc_unit to a union struct mlx5e_alloc_unit consists of a single union. Convert it to a union itself to simplify casting it to struct xdp_buff *, which will be used to implement XSK batching on striding RQ. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
6bdeb963 |
|
29-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Remove DMA address from mlx5e_alloc_unit mlx5e_alloc_unit stores the DMA address and a pointer to either struct page (regular RQ) or struct xdp_buff (XSK RQ). This DMA address is redundant, because when a page or an XSK frame is allocated, the same address is also stored there. Some flows take the address from struct mlx5e_alloc_unit, and some take it from struct page or xdp_buff. This commit removes the address from struct mlx5e_alloc_unit, which makes it half the size and improves locality (this struct is used in an array), also saving on unnecessary stores to the addr field. Almost all flows know unambiguously whether the DMA address should be taken from page or from xdp_buff. The exception is the allocation flows, where a new branch appeared, which will be optimized out in the next commits. struct mlx5e_alloc_unit used to be called mlx5e_dma_info. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
79008676 |
|
29-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Rename mlx5e_dma_info to prepare for removal of DMA address The next commit will remove the DMA address from the struct currently called mlx5e_dma_info, because the same value can be retrieved with page_pool_get_dma_addr(page) in almost all cases, with the notable exception of SHAMPO (HW GRO implementation) that modifies this address on the fly, after the initial allocation. To keep the SHAMPO logic intact, struct mlx5e_dma_info remains in the SHAMPO code, consisting of addr and page (XSK is not compatible with SHAMPO). The struct used in all other places is renamed to mlx5e_alloc_unit, allowing the next commit to remove the addr field without affecting SHAMPO. The new name means "allocation unit", and it's more appropriate after the field with the DMA address gets removed. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
707f908e |
|
29-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Optimize the page cache reducing its size 2x The RX page cache stores dma_info structs that consist of a pointer to struct page and a DMA address. In fact, the DMA address is extracted from struct page using page_pool_get_dma_addr when a page is pushed to the cache. By moving this call to the point when a page is popped from the cache, we can avoid storing the DMA address in the cache, effectively halving its size without losing any functionality. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
6470d2e7 |
|
29-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: xsk: Use KSM for unaligned XSK MTTs used in striding RQ have certain alignment requirements. While it's guaranteed to work when UMR pages are aligned to the UMR page size, in practice it works when UMR pages are aligned to 8 bytes. However, that's still not enough flexibility for the unaligned mode of XSK. This patch leverages KSM to map UMR pages without alignment requirements when unaligned XSK is active. The downside is that KSM entries are twice as big as MTTs, which limits the maximum WQE size, so regular RQs and aligned XSK continue using MTTs. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
e5a3cc83 |
|
29-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Use runtime page_shift for striding RQ This commit allows striding RQ to determine MTT page size at runtime, instead of sticking to the compile-time PAGE_SIZE. This functionality will be used by a following commit that adjusts the MTT page size to the XSK frame size. Stick with PAGE_SIZE for XSK on legacy RQ, as frag_stride is not used in data path, it only helps calculate how pages are partitioned into fragments, and PAGE_SIZE will ensure each fragment starts at the beginning of a new allocation unit (XSK frame). Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
997ce6af |
|
27-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Use runtime values of striding RQ parameters in datapath Some of the parameters of striding RQ are compile-time constants, but they are going to become dynamically calculated at runtime in a following commit. This commit prepares the datapath to take cached runtime parameters, prefilled at queue creation. New fields added to struct mlx5e_rq fit into an existing 7-byte hole. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
258e655c |
|
27-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Make dma_info array dynamic in struct mlx5e_mpw_info This commit moves the dma_info array to the end of struct mlx5e_mpw_info to make it a flexible array. It also removes the intermediate struct mlx5e_umr_dma_info, which used to contain only this array. The flexibility of dma_info will allow to choose its size dynamically in a following commit. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
b7c9400c |
|
05-Sep-2022 |
Lior Nahmanson <liorna@nvidia.com> |
net/mlx5e: Implement MACsec Rx data path using MACsec skb_metadata_dst The MACsec driver needs to distinguish which offload device a MACsec packet is targeted to, in order to handle it correctly. This can be done by attaching a metadata_dst with a SCI to an SKB when there is a match on a MACsec rule. To achieve that, there is a map from fs_id to SCI, so for each RX SC there is a unique fs_id allocated when creating the RX SC. The fs_id is passed to the device driver as metadata for packets that passed Rx MACsec offload, to aid the driver in retrieving the matching SCI. Signed-off-by: Lior Nahmanson <liorna@nvidia.com> Reviewed-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0fe79f28 |
|
13-May-2022 |
Alexander Duyck <alexanderduyck@fb.com> |
net: allow gro_max_size to exceed 65536 Allow the gro_max_size to exceed a value larger than 65536. There weren't really any external limitations that prevented this other than the fact that IPv4 only supports a 16 bit length field. Since we have the option of adding a hop-by-hop header for IPv6 we can allow IPv6 to exceed this value and for IPv4 and non-TCP flows we can cap things at 65536 via a constant rather than relying on gro_max_size. [edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7c4e983c |
|
13-May-2022 |
Alexander Duyck <alexanderduyck@fb.com> |
net: allow gso_max_size to exceed 65536 The code for gso_max_size was added originally to allow for debugging and workaround of buggy devices that couldn't support TSO with blocks 64K in size. The original reason for limiting it to 64K was that this was the existing limit of the IPv4 and non-jumbogram IPv6 length fields. With the addition of Big TCP we can remove this limit and allow the value to potentially go up to UINT_MAX and instead be limited by the tso_max_size value. So in order to support this we need to go through and clean up the remaining users of the gso_max_size value so that the values will cap at 64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper limit for GSO_MAX_SIZE. v6: (edumazet) fixed a compile error if CONFIG_IPV6=n, in a new sk_trim_gso_size() helper. netif_set_tso_max_size() caps the requested TSO size with GSO_MAX_SIZE. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c6e3b421 |
|
09-Mar-2022 |
Leon Romanovsky <leon@kernel.org> |
net/mlx5: Merge various control path IPsec headers into one file The mlx5 IPsec code has logical separation between code that operates with XFRM objects (ipsec.c), HW objects (ipsec_offload.c), flow steering logic (ipsec_fs.c) and data path (ipsec_rxtx.c). Such separation makes sense for C-files, but isn't needed at all for H-files as they are included in batch anyway. Reviewed-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
84a137f0 |
|
20-Jan-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Drop error CQE handling from the XSK RX handler This commit removes the redundant check and removes the unused cqe parameter of skb_from_cqe handlers. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
16fe5a1c |
|
06-Apr-2022 |
Leon Romanovsky <leon@kernel.org> |
net/mlx5: Move IPsec file to relevant directory IPsec is part of the ethernet side of the mlx5 driver and needs to be placed in the en_accel folder. Link: https://lore.kernel.org/r/a0ca88f4d9c602c574106c0de0511803e7dcbdff.1649232994.git.leonro@nvidia.com Reviewed-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
#
7e4e8491 |
|
06-Apr-2022 |
Leon Romanovsky <leon@kernel.org> |
net/mlx5: Remove ipsec vs. ipsec offload file separation IPsec won't be initialized at all if the device doesn't support IPsec offload. This means we can combine the ipsec.c and ipsec_offload.c files into one file. Such a change will allow us to remove the ipsec_ops indirection. Link: https://lore.kernel.org/r/d0ac1fb7b14c10ae20a21ae17a393ee860c72ac3.1649232994.git.leonro@nvidia.com Reviewed-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
#
2fa33b35 |
|
06-Apr-2022 |
Leon Romanovsky <leon@kernel.org> |
net/mlx5_fpga: Drop INNOVA IPsec support Mellanox INNOVA IPsec cards reached EOL in Nov 2019 [1]. As such, the code is unmaintained, untested and not in use by any upstream/distro-oriented customers. In order to reduce code complexity, drop the kernel code. [1] https://network.nvidia.com/related-docs/eol/LCR-000535.pdf Link: https://lore.kernel.org/r/2afe88ec5020a491079eacf6fe3c89b64d65195c.1649232994.git.leonro@nvidia.com Reviewed-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
#
943aa7bd |
|
04-Apr-2022 |
Leon Romanovsky <leon@kernel.org> |
net/mlx5: Remove tls vs. ktls separation as it is the same After the removal of FPGA TLS, we can remove the tls->ktls indirection too, as it is the same thing. Link: https://lore.kernel.org/r/67e596599edcffb0de43f26551208dfd34ac777e.1649073691.git.leonro@nvidia.com Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
#
ddc87e7d |
|
28-Jan-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Store DMA address inside struct page Use page_pool_set_dma_addr() to store the DMA address of a page inside struct page, in order to avoid passing struct mlx5e_dma_info to XDP handlers. Previously, struct mlx5e_dma_info was used to pass both the DMA address and the page, and it worked well for the single-fragment case. When XDP multi buffer is in use, and a fragmented xdp_frame has to be transmitted, the driver needs to know the DMA addresses of fragments, however, the array of fragments in struct skb_shared_info doesn't contain them. In order to pass the DMA addresses, the driver puts them into struct page itself, which is accessible from the array of fragments in struct skb_shared_info. The existing XDP handlers are modified to remove the dependency on struct mlx5e_dma_info. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
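A hedged sketch of the pattern this commit describes: stash the DMA address in struct page at map time via the page_pool helper, then recover a fragment's address on the XDP transmit path straight from skb_shared_info (the function names here are illustrative):

    /* Map time: remember the DMA address inside struct page. */
    static dma_addr_t map_and_stash(struct device *dev, struct page *page)
    {
            dma_addr_t dma = dma_map_page(dev, page, 0, PAGE_SIZE,
                                          DMA_BIDIRECTIONAL);

            if (!dma_mapping_error(dev, dma))
                    page_pool_set_dma_addr(page, dma);
            return dma;
    }

    /* XDP-TX time: no mlx5e_dma_info needed to find a fragment's address. */
    static dma_addr_t frag_dma_addr(const skb_frag_t *frag)
    {
            return page_pool_get_dma_addr(skb_frag_page(frag)) +
                   skb_frag_off(frag);
    }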
|
#
ea5d49bd |
|
27-Jan-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Add XDP multi buffer support to the non-linear legacy RQ This commit adds XDP multi buffer support to the RX path in the non-linear legacy RQ mode. mlx5e_xdp_handle is called from mlx5e_skb_from_cqe_nonlinear. XDP_TX action for fragmented XDP frames is not yet supported and blocked. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
4e8231f1 |
|
26-Jan-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Prepare non-linear legacy RQ for XDP multi buffer support mlx5e_skb_from_cqe_nonlinear creates an xdp_buff first, putting the first fragment as the linear part, and the rest of fragments as fragments to struct skb_shared_info in the tailroom. Then it creates an SKB in place, based on the xdp_buff. The XDP program is not called in this commit yet. This commit contains no functional change, except the SKB is built over the whole frag_stride of the first fragment, instead of the minimal size required (headroom, data and skb_shared_info). Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
99892393 |
|
14-Feb-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Drop cqe_bcnt32 from mlx5e_skb_from_cqe_mpwrq_linear The packet size in mlx5e_skb_from_cqe_mpwrq_linear can't overflow u16, since the maximum packet size in linear striding RQ is 2^13 bytes. Drop the unneeded u32 variable. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
064990d0 |
|
27-Jan-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Drop the len output parameter from mlx5e_xdp_handle The len parameter of mlx5e_xdp_handle is used to output the new packet length after XDP has processed the packet and returned XDP_PASS. However, this value can be calculated on the caller site, as the caller knows if it was an XDP_PASS. This commit drops the len parameter and moves the calculation to the caller, reducing the number of parameters passed to the function and preparing for XDP support in non-linear legacy RQ. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
e26eceb9 |
|
19-Jan-2022 |
Tariq Toukan <tariqt@nvidia.com> |
net/mlx5e: RX, Test the XDP program existence out of the handler Instead of early return inside mlx5e_xdp_handle(), let the caller check if an XDP program is loaded. This allows saving a few unnecessary function calls and calculations in case !prog. Performance test: single core, drop packets in iptables Before: 3,872,504 pps After: 3,975,628 pps (+2.66%) Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
8d35fb57 |
|
26-Jan-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Build SKB in place over the first fragment in non-linear legacy RQ As a performance optimization and preparation to enabling XDP multi buffer on non-linear legacy RQ, build the linear part of the SKB over the first fragment, instead of allocating a new buffer and copying the first 256 bytes there. To achieve this, add headroom and tailroom to the first fragment. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
c3cce0ff |
|
26-Jan-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Add headroom only to the first fragment in legacy RQ Currently, rq->buff.headroom is applied to all fragments in legacy RQ. In the linear mode, there is a non-zero headroom, but there is only one fragment per packet. In the non-linear mode, the headroom is zero. This commit changes the logic to apply the headroom only to the first fragment. The current behavior remains the same for both linear and non-linear modes. However, it allows the next commit to enable headroom for the non-linear mode, which will be applied only to the first fragment. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
4b5fba4a |
|
19-Jan-2022 |
Tariq Toukan <tariqt@nvidia.com> |
net/mlx5e: RX, Restrict bulk size for small Striding RQs In RQs of type multi-packet WQE (Striding RQ), each WQE is relatively large (typically 256KB) but their number is relatively small (8 by default). Re-mapping the descriptors' buffers before re-posting them is done via UMR (User-Mode Memory Registration) operations. On the one hand, posting UMR WQEs in bulks reduces communication overhead with the HW and better utilizes its processing units. On the other hand, delaying the WQE repost operations for a small RQ (say, of 4 WQEs) might drastically hit its performance, causing packet drops due to no receive buffer, for a high or bursty incoming packet rate. Here we restrict the bulk size for too small RQs. Effectively, with the current constants, an RQ of size 4 (minimum allowed) would have no bulking, while larger RQs will continue working with bulks of 2. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
c1e80bf4 |
|
20-Jan-2022 |
Alex Liu <liualex@fb.com> |
net/mlx5e: Add support for using xdp->data_meta Add support for using xdp->data_meta for cross-program communication Pass "true" to the last argument of xdp_prepare_buff(). After SKB is built, call skb_metadata_set() if metadata was pushed. Signed-off-by: Alex Liu <liualex@fb.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
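The shape of the change, as a sketch (assuming the xdp_buff was prepared with meta_valid = true, as the commit says):

    /* After XDP_PASS: propagate metadata the program pushed in front of
     * xdp->data to the freshly built SKB.
     */
    static void copy_xdp_meta_to_skb(struct sk_buff *skb, struct xdp_buff *xdp)
    {
            u32 metalen = xdp->data - xdp->data_meta;

            if (metalen)
                    skb_metadata_set(skb, metalen);
    }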
|
#
76c31e5f |
|
10-May-2021 |
Aya Levin <ayal@nvidia.com> |
net/mlx5e: Use FW limitation for max MPW WQEBBs Calculate the maximal count of MPW WQEBBs at SQ creation and store it there. Remove MLX5E_TX_MPW_MAX_NUM_DS and MLX5E_TX_MPW_MAX_WQEBBS. Update mlx5e_tx_mpwqe_is_full() and mlx5e_xdp_mpwqe_is_full(). Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
7eaf1f37 |
|
31-Jan-2022 |
Tariq Toukan <tariqt@nvidia.com> |
net/mlx5e: kTLS, Use CHECKSUM_UNNECESSARY for device-offloaded packets For RX TLS device-offloaded packets, the HW spec guarantees checksum validation for the offloaded packets, but does not define whether the CQE.checksum field matches the original packet (ciphertext) or the decrypted one (plaintext). This latitude allows architectural improvements between generations of chips, resulting in different decisions regarding the value type of CQE.checksum. Hence, for these packets, the device driver should not make use of this CQE field. Here we block CHECKSUM_COMPLETE usage for RX TLS device-offloaded packets, and use CHECKSUM_UNNECESSARY instead. The value of the packet's tcp_hdr.csum is not modified by the HW; it always matches the original ciphertext. Fixes: 1182f3659357 ("net/mlx5e: kTLS, Add kTLS RX HW offload support") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
7957837b |
|
26-Jan-2022 |
Khalid Manaa <khalidm@nvidia.com> |
net/mlx5e: Fix broken SKB allocation in HW-GRO In case the HW doesn't perform header-data split, it will write the whole packet into the data buffer in the WQ. In this case, the SHAMPO CQE handler cannot use the header entry to build the SKB; instead, it should allocate new memory and build the SKB using the function mlx5e_skb_from_cqe_mpwrq_nonlinear. Fixes: f97d5c2a453e ("net/mlx5e: Add handle SHAMPO cqe support") Signed-off-by: Khalid Manaa <khalidm@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
b8d91145 |
|
26-Jan-2022 |
Khalid Manaa <khalidm@nvidia.com> |
net/mlx5e: Fix wrong calculation of header index in HW_GRO The HW doesn't wrap the CQE.shampo.header_index field according to the headers buffer size; instead, it keeps increasing it until the u16 overflows. Thus the mlx5e_handle_rx_cqe_mpwrq_shampo handler should mask the CQE header_index field to find the actual header index in the headers buffer. Fixes: f97d5c2a453e ("net/mlx5e: Add handle SHAMPO cqe support") Signed-off-by: Khalid Manaa <khalidm@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
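A sketch of the masking fix (hd_per_wq is a hypothetical name for the number of entries in the headers buffer, assumed to be a power of two):

    /* The CQE index is free-running; reduce it to a valid slot. */
    static u16 shampo_header_slot(__be16 cqe_header_index, u16 hd_per_wq)
    {
            return be16_to_cpu(cqe_header_index) & (hd_per_wq - 1);
    }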
|
#
0b7cfa40 |
|
23-Dec-2021 |
Aya Levin <ayal@nvidia.com> |
net/mlx5e: Fix page DMA map/unmap attributes The driver initiates the DMA sync itself, hence it may skip the CPU sync. Add DMA_ATTR_SKIP_CPU_SYNC as an input attribute to both dma_map_page and dma_unmap_page to avoid a redundant sync with the CPU. When forcing the device to work with SWIOTLB, the extra sync might cause data corruption: the driver unmaps the whole page while the hardware used just a part of the bounce buffer, so the sync overwrites the entire page with a bounce buffer that only partially contains real data. Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE") Fixes: db05815b36cb ("net/mlx5e: Add XSK zero-copy support") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
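A sketch of the attribute usage described above (the direction and sync range are illustrative; the point is that map and unmap must agree on skipping the CPU sync, since the driver syncs explicitly):

    dma_addr_t addr = dma_map_page_attrs(dev, page, 0, PAGE_SIZE,
                                         DMA_FROM_DEVICE,
                                         DMA_ATTR_SKIP_CPU_SYNC);
    /* ... dma_sync_single_range_for_cpu() only on the bytes the HW wrote ... */
    dma_unmap_page_attrs(dev, addr, PAGE_SIZE, DMA_FROM_DEVICE,
                         DMA_ATTR_SKIP_CPU_SYNC);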
|
#
5dd29f40 |
|
22-Dec-2021 |
Gal Pressman <gal@nvidia.com> |
net/mlx5e: Add recovery flow in case of error CQE The rep legacy RQ completion handling was missing the appropriate handling of error CQEs (dump the CQE and queue a recover work); fix it by calling trigger_report() when needed. Since all CQE handling flows do the exact same error CQE handling, extract it to a common helper function. Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
be98737a |
|
05-Dec-2021 |
Tariq Toukan <tariqt@nvidia.com> |
net/mlx5e: Use dynamic per-channel allocations in stats Make the stats array an array of pointers. This patch prepares for the next patch, where stats allocations are performed dynamically on first usage. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
3ef1f8e7 |
|
10-Nov-2021 |
Ben Ben-Ishay <benishay@nvidia.com> |
net/mlx5e: SHAMPO, clean MLX5E_MAX_KLM_PER_WQE macro This commit removes an unused variable from the MLX5E_MAX_KLM_PER_WQE macro that was introduced by commit d7b896acbdcb ("net/mlx5e: Add support to klm_umr_wqe"). Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
4721031c |
|
15-Nov-2021 |
Eric Dumazet <edumazet@google.com> |
net: move gro definitions to include/net/gro.h include/linux/netdevice.h became too big, move gro stuff into include/net/gro.h Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8c8cf038 |
|
31-Oct-2021 |
Ben Ben-Ishay <benishay@nvidia.com> |
net/mlx5e: SHAMPO, Fix constant expression result mlx5e_build_shampo_hd_umr uses the counters i and index incorrectly as unsigned, thus the error path err_unmap could get stuck in an endless loop. Change i to int to solve the first issue. Reduce the index check to solve the second issue; the caller function validates that index cannot rotate. Fixes: 64509b052525 ("net/mlx5e: Add data path for SHAMPO feature") Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
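The bug pattern in miniature (setup()/undo() are hypothetical stand-ins for the UMR mapping steps):

    static int build_umr(int n)
    {
            int i;  /* fix: was u16; for an unsigned counter, --i >= 0
                     * can never become false, so err_unmap spins forever
                     */

            for (i = 0; i < n; i++)
                    if (setup(i))
                            goto err_unmap;
            return 0;

    err_unmap:
            while (--i >= 0)        /* with int, terminates at i == -1 */
                    undo(i);
            return -ENOMEM;
    }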
|
#
28e7606f |
|
26-Oct-2021 |
Ariel Levkovich <lariel@nvidia.com> |
net/mlx5e: Refactor rx handler of representor device Move the ownership of forwarding the skb to the network stack into the tc update_skb handler, as different cases will require different handling of the skb. While the tc handler will take care of the various cases, properly handle the handover of the skb to the network stack, and free the skb, the main rx handler will be kept clean of branches and usage of flags. Signed-off-by: Ariel Levkovich <lariel@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
def09e7b |
|
13-Oct-2020 |
Khalid Manaa <khalidm@nvidia.com> |
net/mlx5e: Add HW_GRO statistics This patch adds HW_GRO counters to the RX packets statistics: - gro_match_packets: counter of received packets with the match flag set. - gro_packets: counter of packets received over the HW_GRO feature; this counter is increased by one for every received HW_GRO cqe. - gro_bytes: counter of bytes received over the HW_GRO feature; this counter is increased by the received bytes for every received HW_GRO cqe. - gro_skbs: counter of built HW_GRO skbs; increased by one when we flush a HW_GRO skb (when we call napi_gro_receive with a hw_gro skb). - gro_large_hds: counter of received packets with a large header size; in case the packet needs a new SKB, the driver will allocate a new one and will not use the headers entry to build it. Signed-off-by: Khalid Manaa <khalidm@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
92552d3a |
|
14-Sep-2020 |
Khalid Manaa <khalidm@nvidia.com> |
net/mlx5e: HW_GRO cqe handler implementation This patch updates the SHAMPO CQE handler to support HW_GRO. Changes in the SHAMPO CQE handler: - The CQE match and flush fields are used to determine whether to build a new skb using the newly received packet, or to add the received packet data to the existing RQ.hw_gro_skb; these fields are also used to determine when to flush the skb. - At the end of mlx5e_poll_rx_cq, the RQ.hw_gro_skb is flushed. Signed-off-by: Khalid Manaa <khalidm@nvidia.com> Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
64509b05 |
|
13-Sep-2020 |
Ben Ben-Ishay <benishay@nvidia.com> |
net/mlx5e: Add data path for SHAMPO feature The header buffer is used to store the headers of the rx packets. The header buffer size is deduced from the WorkQueue size plus the restriction of max packets per WorkQueueElement. This commit adds the functionality for posting/updating memory for the header buffer during the posting/updating of WQEs. Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
f97d5c2a |
|
19-May-2020 |
Khalid Manaa <khalidm@nvidia.com> |
net/mlx5e: Add handle SHAMPO cqe support This patch adds the new CQE SHAMPO fields: - flush: indicates that we must close the current session and pass the SKB to the network stack. - match: indicates that the current packet matches the opened session; the packet will be merged into the current SKB. - header_size: the size of the packet headers that were written into the headers buffer. - header_entry_index: the entry index in the headers buffer. - data_offset: the packet's data offset in the WQE. Also, a new cqe handler is added to handle SHAMPO packets: - The new handler uses the CQE SHAMPO fields to build the SKB. The CQE's flush and match fields are not used in this patch; packets are not merged in this patch. Signed-off-by: Khalid Manaa <khalidm@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
e5ca8fb0 |
|
08-Jun-2021 |
Ben Ben-Ishay <benishay@nvidia.com> |
net/mlx5e: Add control path for SHAMPO feature This commit introduces the control path infrastructure for the SHAMPO feature. SHAMPO enables packet stitching by splitting packets into header and payload; the header is placed on a dedicated buffer and the payload on the RX ring. This allows stitching the data part of a flow together continuously in the receive buffer. SHAMPO is implemented as a linked-list striding RQ feature. To support packet splitting and payload stitching: - Enlarge the ICOSQ and the corresponding CQ to support the header buffer memory regions. - Add support for creating a linked-list striding RQ with the SHAMPO feature set in the open_rq function. - Add a deallocation function and corresponding calls for the SHAMPO header buffer. - Add mlx5e_create_umr_klm_mkey to support a KLM mkey for the header buffer. - Rename mlx5e_create_umr_mkey to mlx5e_create_umr_mtt_mkey. Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
f9a10440 |
|
26-Aug-2021 |
Raed Salem <raeds@nvidia.com> |
net/mlx5e: IPSEC RX, enable checksum complete Currently, in the Rx data path, IPsec crypto offloaded packets use the csum_none flag, so checksum is handled by the stack; this naturally has some performance/cpu utilization impact on such flows. As Nvidia NICs starting from ConnectX6DX provide a checksum complete value out of the box also for such flows, there is no sense in taking the csum_none path. Furthermore, the stack (xfrm) has the means to handle checksum complete corrections for such flows, i.e. IPsec trailer removal and the consequent checksum value adjustment. Because of the above, and since ConnectX6DX is the first HW which supports IPsec crypto offload, it is safe to report checksum complete for IPsec offloaded traffic. Fixes: b2ac7541e377 ("net/mlx5e: IPsec: Add Connect-X IPsec Rx data path offload") Signed-off-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
8ec5d438 |
|
19-May-2021 |
Tariq Toukan <tariqt@nvidia.com> |
net/mlx5e: RX, Re-place page pool numa node change logic Move the logic that updates the page pool upon changes in numa node. Before this patch, the logic was placed in the RX polling function, where it was called even when there was no RX traffic, wasting cpu cycles. Here we move it to the RX post_wqes function, so it is called only when new RX descriptors are going to be allocated. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
2ef9c7c6 |
|
22-Apr-2021 |
Tariq Toukan <tariqt@nvidia.com> |
net/mlx5e: RX, Remove unnecessary check in RX CQE compression handling There are two reasons for exiting mlx5e_decompress_cqes_cont(): 1. The compression session is completed (cqd.left == 0). 2. The budget is exhausted (work_done == budget). If after calling mlx5e_decompress_cqes_cont() we have cqd.left > 0, it necessarily implies that budget is exhausted. The first part of the complex condition is covered by the second, hence we remove it here. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
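The simplification, spelled out as a sketch (field names follow the commit's description of the driver):

    /* Before: */
    if (rq->cqd.left || work_done == budget)
            goto out;

    /* After: cqd.left > 0 can only happen when the budget was exhausted,
     * so the first test is implied by the second.
     */
    if (work_done == budget)
            goto out;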
|
#
c07274ab |
|
15-Dec-2020 |
Huy Nguyen <huyn@nvidia.com> |
net/mlx5e: IPsec/rep_tc: Fix rep_tc_update_skb drops IPsec packet rep_tc copies REG_C1 to REG_B. IPsec crypto utilizes the whole REG_B register, with BIT31 as the IPsec marker. rep_tc_update_skb drops IPsec packets because it thinks REG_B contains a bad value. In a previous patch, BIT31 of REG_C1 was reserved for IPsec. Skip rep_tc_update_skb if BIT31 of REG_B is set. Signed-off-by: Huy Nguyen <huyn@nvidia.com> Signed-off-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
c27971d0 |
|
28-Oct-2020 |
Roi Dayan <roid@nvidia.com> |
net/mlx5: Move devlink port from mlx5e priv to mlx5e resources We re-use the native NIC port net device instance for the Uplink representor and the devlink port. When changing profiles, we reset the mlx5e priv, but we should still use the devlink port, so move it to mlx5e resources. Signed-off-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
e16cf9d7 |
|
17-Feb-2021 |
Tariq Toukan <tariqt@nvidia.com> |
net/mlx5e: Dump ICOSQ WQE descriptor on CQE with error events Dump the ICOSQ's WQE descriptor when a completion with error is received. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
d5dd03b2 |
|
12-Jan-2021 |
Tariq Toukan <tariqt@nvidia.com> |
net/mlx5e: RX, Mind the MPWQE gaps when calculating offsets Since the cited patch, MLX5E_REQUIRED_WQE_MTTS is not a power of two. Hence, usage of MLX5E_LOG_ALIGNED_MPWQE_PPW should be replaced, as it has lost some accuracy. Use the designated macro to calculate the number of required MTTs. This makes sure the solution in the cited patch works properly. While here, un-inline mlx5e_get_mpwqe_offset() and remove the unused RQ parameter. Fixes: c3c9402373fe ("net/mlx5e: Add resiliency in Striding RQ mode for packets larger than MTU") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
432119de |
|
12-Feb-2021 |
Aya Levin <ayal@nvidia.com> |
net/mlx5: Add cyc2time HW translation mode support Device timestamp can be in real time mode (cycles to time translation is offloaded into the Hardware). With real time mode, the HW provides a timestamp which is already translated into nanoseconds. With this mode, the driver adjusts both the HW and timecounter (to keep clock_info_page updated) using callbacks: adjfreq, adjtime and settime. HW clock modifications are done via MTUTC access reg commands. The driver is allowed to modify the HW real time clock only if the MCAM ptpcyc2realtime_modify capability is set. Add an MTUTC set function to be used for configuring the HW real time clock. Modify existing code to support both internal timer (with conversion via timecounter_cyc2time()) and real time (no conversions). Align the signatures of the helpers converting from timestamp to nanoseconds. With that, when allocating a queue, assign the corresponding callback with respect to the capability. Adjust 1PPS timestamp calculation flows based on the timestamp mode. Cyc2time offload brings two major advantages: - Improve MTAE (Max Time Absolute Error) for HW TS by up to 160 ns over a 100% loaded CPU. - Faster data-path timestamp to nanoseconds, as translation is lock-less and done in HW. In real time mode, the timestamp format is 32 high bits of seconds and 32 low bits of nanoseconds. On some flows, the driver shall convert this format into nanoseconds wall-clock with the REAL_TIME_TO_NS macro. HW supports a single clock, and it is shared by all functions on a device. In case the real time clock is used, it is recommended to use a single GM for all the device's functions. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
e4484d9d |
|
24-Jan-2021 |
Raed Salem <raeds@nvidia.com> |
net/mlx5e: Enable striding RQ for Connect-X IPsec capable devices This limitation was inherited from the previous Innova (FPGA) IPsec implementation, which uses its private set of RQ handlers that do not support striding RQ; for Connect-X this is no longer true. Fix by keeping this limitation only for Innova IPsec supporting devices; otherwise this limitation effectively and wrongly blocks striding RQs for all future Connect-X devices in all flows, even if IPsec offload is not used. Fixes: 2d64663cd559 ("net/mlx5: IPsec: Add HW crypto offload support") Signed-off-by: Raed Salem <raeds@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
a79afa78 |
|
02-Feb-2021 |
Alexander Lobakin <alobakin@pm.me> |
net: use the new dev_page_is_reusable() instead of private versions Now we can remove a bunch of identical functions from the drivers and make them use common dev_page_is_reusable(). All {,un}likely() checks are omitted since it's already present in this helper. Also update some comments near the call sites. Suggested-by: David Rientjes <rientjes@google.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
5543e989 |
|
26-Jan-2021 |
Aya Levin <ayal@nvidia.com> |
net/mlx5e: Add trap entity to ETH driver Introduce mlx5e_trap, which includes a dedicated RQ and NAPI for trapped packets. The Trap-RQ processes packets that were destined to be dropped, but for debug and visibility's sake these packets are trapped and reported to devlink. The Trap-RQ connects the HW and the driver and is not part of a channel. Expose mlx5e_create_rq() and mlx5_core_destroy_rq() as API and add dedicated RQ handlers which report trapped packets to devlink. Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
224169d2 |
|
12-Jan-2021 |
Tariq Toukan <tariqt@nvidia.com> |
net/mlx5e: IPsec, Remove unnecessary config flag usage MLX5_IPSEC_DEV() is always defined, no need to protect it under config flag CONFIG_MLX5_EN_IPSEC, especially in slow path. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Raed Salem <raeds@nvidia.com> Reviewed-by: Huy Nguyen <huyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
be9df4af |
|
22-Dec-2020 |
Lorenzo Bianconi <lorenzo@kernel.org> |
net, xdp: Introduce xdp_prepare_buff utility routine Introduce xdp_prepare_buff utility routine to initialize per-descriptor xdp_buff fields (e.g. xdp_buff pointers). Rely on xdp_prepare_buff() in all XDP capable drivers. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Shay Agroskin <shayagr@amazon.com> Acked-by: Martin Habets <habetsm.xilinx@gmail.com> Acked-by: Camelia Groza <camelia.groza@nxp.com> Acked-by: Marcin Wojtas <mw@semihalf.com> Link: https://lore.kernel.org/bpf/45f46f12295972a97da8ca01990b3e71501e9d89.1608670965.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
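Typical driver usage of this helper together with xdp_init_buff() from the next entry (variable names are illustrative):

    struct xdp_buff xdp;

    /* Once per NAPI cycle: fields constant across descriptors. */
    xdp_init_buff(&xdp, frame_sz, &rq->xdp_rxq);

    /* Per descriptor: data pointers; the last argument marks
     * xdp->data_meta as valid.
     */
    xdp_prepare_buff(&xdp, va, headroom, cqe_bcnt, true);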
|
#
43b5169d |
|
22-Dec-2020 |
Lorenzo Bianconi <lorenzo@kernel.org> |
net, xdp: Introduce xdp_init_buff utility routine Introduce xdp_init_buff utility routine to initialize xdp_buff fields const over NAPI iterations (e.g. frame_sz or rxq pointer). Rely on xdp_init_buff in all XDP capable drivers. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Shay Agroskin <shayagr@amazon.com> Acked-by: Martin Habets <habetsm.xilinx@gmail.com> Acked-by: Camelia Groza <camelia.groza@nxp.com> Acked-by: Marcin Wojtas <mw@semihalf.com> Link: https://lore.kernel.org/bpf/7f8329b6da1434dc2b05a77f2e800b29628a8913.1608670965.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
a34ffec8 |
|
31-Jan-2021 |
Maor Dickman <maord@nvidia.com> |
net/mlx5e: Release skb in case of failure in tc update skb In case of a failure in tc update skb, the packet is dropped without freeing the skb. Fix by freeing the skb upon failure in tc update skb. Fixes: d6d27782864f ("net/mlx5: E-Switch, Restore chain id on miss") Fixes: c75690972228 ("net/mlx5e: Add tc chains offload support for nic flows") Signed-off-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
fe839516 |
|
01-Dec-2020 |
YueHaibing <yuehaibing@huawei.com> |
net/mlx5e: Remove duplicated include Remove duplicated include. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
521f31af |
|
01-Dec-2020 |
Aya Levin <ayal@nvidia.com> |
net/mlx5e: Allow RQ outside of channel context In order to be able to create an RQ outside of a channel context, remove the rq->channel direct pointer. This requires adding direct pointers to the ICOSQ and priv in order to support RQs that are part of mlx5e_channel. Use channel_stats from the corresponding CQ. Signed-off-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
4d0b7ef9 |
|
01-Dec-2020 |
Aya Levin <ayal@nvidia.com> |
net/mlx5e: Allow CQ outside of channel context In order to be able to create a CQ outside of a channel context, remove the cq->channel direct pointer. This requires adding direct pointers to the channel statistics, netdevice, priv and mlx5_core in order to support CQs that are part of mlx5e_channel. In addition, parameters that were previously derived from the channel, like napi, NUMA node, channel stats and index, are now assembled in struct mlx5e_create_cq_param, which is given to mlx5e_open_cq() instead of the channel pointer. Generalizing mlx5e_open_cq() allows opening a CQ outside of channel context, which will be used in following patches in the patch-set. Signed-off-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
1a50cf9a |
|
21-Oct-2020 |
Maxim Mikityanskiy <maximmi@mellanox.com> |
net/mlx5e: Fix incorrect access of RCU-protected xdp_prog rq->xdp_prog is RCU-protected and should be accessed only with rcu_access_pointer for the NULL check in mlx5e_poll_rx_cq. rq->xdp_prog may change on the fly only from one non-NULL value to another non-NULL value, so the checks in mlx5e_xdp_handle and mlx5e_poll_rx_cq will have the same result during one NAPI cycle, meaning that no additional synchronization is needed. Fixes: fe45386a2082 ("net/mlx5e: Use RCU to protect rq->xdp_prog") Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
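The access pattern the fix enforces, as a sketch:

    /* NULL check only: no dereference, so rcu_access_pointer() suffices. */
    if (rcu_access_pointer(rq->xdp_prog)) {
            /* ... */
    }

    /* Actual use, inside the NAPI RCU read-side critical section: */
    struct bpf_prog *prog = rcu_dereference(rq->xdp_prog);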
|
#
c7569097 |
|
28-Apr-2020 |
Ariel Levkovich <lariel@mellanox.com> |
net/mlx5e: Add tc chains offload support for nic flows Allow adding nic tc flow rules with goto chain action. Connecting the nic flows to the mlx5 chains infrastructure in previous patches allows us to support the creation of chained flow tables and rules that direct to another chain for further packet processing. This is a required preparation to support CT offloads for nic tc flows. We allow the creation of 256 different chains for nic flows since we have 8 bits available for the chain restore tag in case of a miss. Signed-off-by: Ariel Levkovich <lariel@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
47c97e6b |
|
10-May-2020 |
Ron Diskin <rondi@mellanox.com> |
net/mlx5e: Fix multicast counter not up-to-date in "ip -s" Currently the FW does not generate events for counters other than error counters. Unlike ".get_ethtool_stats", ".ndo_get_stats64" (which ip -s uses) might run in atomic context, while the FW interface is non atomic. Thus, 'ip' is not allowed to issue FW commands, so it will only display cached counters in the driver. Add a SW counter (mcast_packets) in the driver to count rx multicast packets. The counter also counts broadcast packets, as we consider it a special case of multicast. Use the counter value when calling "ip -s"/"ifconfig". Fixes: f62b8bb8f2d3 ("net/mlx5: Extend mlx5_core to support ConnectX-4 Ethernet functionality") Signed-off-by: Ron Diskin <rondi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
9c25a22d |
|
11-Jun-2020 |
Maxim Mikityanskiy <maximmi@mellanox.com> |
net/mlx5e: Use synchronize_rcu to sync with NAPI As described in the previous commit, napi_synchronize doesn't quite fit the purpose when we just need to wait until the currently running NAPI quits. Its implementation waits until NAPI is not running by polling and waiting for 1ms in between. In cases where we need to deactivate one queue (e.g., recovery flows) or where we deactivate them one-by-one (deactivate channel flow), we may get stuck in napi_synchronize forever if other queues keep NAPI active, causing a soft lockup. Depending on kernel configuration (CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC), it may result in a kernel panic. To fix the issue, use synchronize_rcu to wait for NAPI to quit, and wrap the whole NAPI in rcu_read_lock. Fixes: acc6c5953af1 ("net/mlx5e: Split open/close channels to stages") Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
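The resulting pattern, sketched (napi_poll_body() is a hypothetical stand-in for the driver's poll work):

    /* Poll side: the whole NAPI cycle runs under the RCU read lock. */
    rcu_read_lock();
    work_done = napi_poll_body(napi, budget);
    rcu_read_unlock();

    /* Deactivation side: returns once all in-flight NAPI cycles have
     * finished, with no 1ms polling loop to get stuck in.
     */
    synchronize_rcu();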
|
#
b7cf0806 |
|
17-May-2020 |
Ofer Levi <oferle@mellanox.com> |
net/mlx5e: Add CQE compression support for multi-strides packets Add CQE compression support for completions of packets that span multiple strides in a Striding RQ, per the HW capability. In our memory model, we use small strides (256B as of today) for the non-linear SKB mode. This feature allows CQE compression to work also for multiple strides packets. In this case decompressing the mini CQE array will use stride index provided by HW as part of the mini CQE. Before this feature, compression was possible only for single-strided packets, i.e. for packets of size up to 256 bytes when in non-linear mode, and the index was maintained by SW. This feature is supported for ConnectX-5 and above. Feature performance test: This was whitebox-tested, we reduced the PCI speed from 125Gb/s to 62.5Gb/s to overload pci and manipulated mlx5 driver to drop incoming packets before building the SKB to achieve low cpu utilization. Outcome is low cpu utilization and bottleneck on pci only. Test setup: Server: Intel(R) Xeon(R) Silver 4108 CPU @ 1.80GHz server, 32 cores NIC: ConnectX-6 DX. Sender side generates 300 byte packets at full pci bandwidth. Receiver side configuration: Single channel, one cpu processing with one ring allocated. Cpu utilization is ~20% while pci bandwidth is fully utilized. For the generated traffic and interface MTU of 4500B (to activate the non-linear SKB mode), packet rate improvement is about 19% from ~17.6Mpps to ~21Mpps. Without this feature, counters show no CQE compression blocks for this setup, while with the feature, counters show ~20.7Mpps compressed CQEs in ~500K compression blocks. Signed-off-by: Ofer Levi <oferle@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
c4655761 |
|
28-Aug-2020 |
Magnus Karlsson <magnus.karlsson@intel.com> |
xsk: i40e: ice: ixgbe: mlx5: Rename xsk zero-copy driver interfaces Rename the AF_XDP zero-copy driver interface functions to better reflect what they do after the replacement of umems with buffer pools in the previous commit. Mostly it is about replacing the umem name from the function names with xsk_buff and also have them take the a buffer pool pointer instead of a umem. The various ring functions have also been renamed in the process so that they have the same naming convention as the internal functions in xsk_queue.h. This so that it will be clearer what they do and also for consistency. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Björn Töpel <bjorn.topel@intel.com> Link: https://lore.kernel.org/bpf/1598603189-32145-3-git-send-email-magnus.karlsson@intel.com
|
#
1742b3d5 |
|
28-Aug-2020 |
Magnus Karlsson <magnus.karlsson@intel.com> |
xsk: i40e: ice: ixgbe: mlx5: Pass buffer pool to driver instead of umem Replace the explicit umem reference passed to the driver in AF_XDP zero-copy mode with the buffer pool instead. This in preparation for extending the functionality of the zero-copy mode so that umems can be shared between queues on the same netdev and also between netdevs. In this commit, only an umem reference has been added to the buffer pool struct. But later commits will add other entities to it. These are going to be entities that are different between different queue ids and netdevs even though the umem is shared between them. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Björn Töpel <bjorn.topel@intel.com> Link: https://lore.kernel.org/bpf/1598603189-32145-2-git-send-email-magnus.karlsson@intel.com
|
#
e20f0dbf |
|
26-Aug-2020 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: RX, Add a prefetch command for small L1_CACHE_BYTES A single cacheline might not contain the packet header for small L1_CACHE_BYTES values. Use net_prefetch() as it issues an additional prefetch in this case. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
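For reference, the helper is roughly (as defined in include/linux/netdevice.h of that era):

    static inline void net_prefetch(void *p)
    {
            prefetch(p);
    #if L1_CACHE_BYTES < 128
            prefetch((u8 *)p + L1_CACHE_BYTES);
    #endif
    }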
|
#
5d0b8476 |
|
12-Jul-2020 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Use indirect call wrappers for RX post WQEs functions Use the indirect call wrapper API macros for declaration and scope of the RX post WQEs functions. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
5adf4c475 |
|
30-Apr-2020 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: RX, Re-work initialization of RX function pointers Instead of exposing the RQ datapath handlers (from en_rx.c) so that they are set in the control path (in en_main.c), wrap this logic in a single function in en_rx.c and expose it alone. Every profile will now have a pointer to the new mlx5e_rx_handlers structure, instead of directly pointing to the previously-exposed RQ handlers. This significantly improves locality and modularity of the driver, and allows many functions in en_rx.c to become static. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
2901a5c6 |
|
30-Apr-2020 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: RX, Avoid indirect call in representor CQE handling Use INDIRECT_CALL_2() helper to avoid the cost of the indirect call when/if CONFIG_RETPOLINE=y. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
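The mechanics of the wrapper, sketched with the rep handlers (the exact handler names here are illustrative):

    /* Before: always an indirect call, costly under retpolines. */
    rq->handle_rx_cqe(rq, cqe);

    /* After: compare against the two expected targets first; only an
     * unexpected handler pays for the indirect call.
     */
    INDIRECT_CALL_2(rq->handle_rx_cqe,
                    mlx5e_handle_rx_cqe_mpwrq_rep,
                    mlx5e_handle_rx_cqe_rep,
                    rq, cqe);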
|
#
b2ac7541 |
|
24-Oct-2019 |
Raed Salem <raeds@mellanox.com> |
net/mlx5e: IPsec: Add Connect-X IPsec Rx data path offload On the receive flow, inspect received packets for an IPsec offload indication using the cqe; for IPsec offloaded packets, propagate the offload status and the stack handle to the stack for further processing. Supported statuses: - Offload ok. - Authentication failure. - Bad trailer indication. Connect-X IPsec does not use mlx5e_ipsec_handle_rx_cqe. For RX-only offload, we see the BW gain. Below is the iperf3 performance report on two servers of 24 cores Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz with ConnectX6-DX. We use one thread per IPsec tunnel. --------------------------------------------------------------------- Mode | Num tunnel | BW | Send CPU util | Recv CPU util | | (Gbps) | (Average %) | (Average %) --------------------------------------------------------------------- Crypto offload | 1 | 4.6 | 4.2 | 14.5 --------------------------------------------------------------------- Crypto offload | 24 | 38 | 73 | 63 --------------------------------------------------------------------- Non-offload | 1 | 4 | 4 | 13 --------------------------------------------------------------------- Non-offload | 24 | 23 | 52 | 67 Signed-off-by: Raed Salem <raeds@mellanox.com> Reviewed-by: Boris Pismenny <borisp@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
b9961af7 |
|
30-Apr-2020 |
Aya Levin <ayal@mellanox.com> |
net/mlx5e: Remove redundant RQ state query When a CQE error is received, the driver inspects the syndrome given by the firmware. RQ recovery is initiated only as a result of a fatal syndrome, i.e. one which sets the RQ into an error state. Hence there is no need to query the RQ state at the beginning of the recovery process. Add additional debug prints before recovering. Signed-off-by: Aya Levin <ayal@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
a2907436 |
|
15-Jun-2020 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: kTLS, Improve rx handler function call Prior to this patch, the mlx5e tls rx handler was called unconditionally on all rx frames, and the decision whether a frame is a valid tls record was made inside that function. A function call can be expensive, especially at regular rx packet rates. To avoid this, check the tls validity before jumping into the tls rx handler. While at it, split the kTLS device offload rx handler from the FPGA tls rx handler using a similar method. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
|
#
0419d8c9 |
|
16-Jun-2020 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: kTLS, Add kTLS RX resync support Implement the RX resync procedure, using the TLS async resync API. The HW offload of TLS decryption in RX side might get out-of-sync due to out-of-order reception of packets. This requires SW intervention to update the HW context and get it back in-sync. Performance: CPU: Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz, 24 cores, HT off NIC: ConnectX-6 Dx 100GbE dual port Goodput (app-layer throughput) comparison: +---------------+-------+-------+---------+ | # connections | 1 | 4 | 8 | +---------------+-------+-------+---------+ | SW (Gbps) | 7.26 | 24.70 | 50.30 | +---------------+-------+-------+---------+ | HW (Gbps) | 18.50 | 64.30 | 92.90 | +---------------+-------+-------+---------+ | Speedup | 2.55x | 2.56x | 1.85x * | +---------------+-------+-------+---------+ * After linerate is reached, diff is observed in CPU util. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
1182f365 |
|
28-May-2020 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: kTLS, Add kTLS RX HW offload support Implement driver support for the kTLS RX HW offload feature. Resync support is added in a downstream patch. New offload contexts post their static/progress params WQEs over the per-channel async ICOSQ, protected under a spin-lock. The Channel/RQ is selected according to the socket's rxq index. Feature is OFF by default. Can be turned on by: $ ethtool -K <if> tls-hw-rx-offload on A new TLS-RX workqueue is used to allow asynchronous addition of steering rules, out of the NAPI context. It will be also used in a downstream patch in the resync procedure. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
56e2287b |
|
27-May-2020 |
Jesper Dangaard Brouer <brouer@redhat.com> |
mlx5: fix xdp data_meta setup in mlx5e_fill_xdp_buff The helper function xdp_set_data_meta_invalid() must be called after setting xdp->data, as it depends on it. The bug was introduced in the cited patch below, and caused the kernel to crash when using the BPF helper bpf_xdp_adjust_head() on the mlx5 driver. Fixes: 39d6443c8daf ("mlx5, xsk: Migrate to new MEM_TYPE_XSK_BUFF_POOL") Reported-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
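The ordering constraint, in miniature (xdp_set_data_meta_invalid() sets data_meta = data + 1, so it must see the final data pointer):

    /* Wrong (the bug): data_meta derived from a stale xdp->data. */
    xdp_set_data_meta_invalid(xdp);
    xdp->data = va + headroom;

    /* Right: set data first, then invalidate the metadata area. */
    xdp->data = va + headroom;
    xdp_set_data_meta_invalid(xdp);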
|
#
768c3667 |
|
12-May-2020 |
Vlad Buslov <vladbu@mellanox.com> |
net/mlx5e: Extract TC-specific code from en_rep.c to rep/tc.c As a preparation for introducing new kconfig option that controls compilation of all TC offloads code in mlx5, extract TC-specific code from en_rep.c to standalone file. This allows easily compiling out the code by only including new source in make file when corresponding kconfig is enabled instead of adding multiple ifdef blocks to en_rep. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
39d6443c |
|
20-May-2020 |
Björn Töpel <bjorn@kernel.org> |
mlx5, xsk: Migrate to new MEM_TYPE_XSK_BUFF_POOL Use the new MEM_TYPE_XSK_BUFF_POOL API in lieu of MEM_TYPE_ZERO_COPY in mlx5e. It allows to drop a lot of code from the driver (which is now common in AF_XDP core and was related to XSK RX frame allocation, DMA mapping, etc.) and slightly improve performance (RX +0.8 Mpps, TX +0.4 Mpps). rfc->v1: Put back the sanity check for XSK params, use XSK API to get the total headroom size. (Maxim) v1->v2: Fix DMA address handling, set XDP metadata to invalid. (Maxim) v2->v3: Handle frame_sz, use xsk_buff_xdp_get_frame_dma, use xsk_buff API for DMA sync on TX, add performance numbers. (Maxim) v3->v4: Remove unused variable num_xsk_frames. (Jakub) Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200520192103.355233-12-bjorn.topel@gmail.com
|
#
8b46d424 |
|
04-May-2020 |
Erez Shitrit <erezsh@mellanox.com> |
net/mlx5e: IPoIB, Drop multicast packets that this interface sent After enabling loopback packets for IPoIB, we need to drop packets that this HCA has replicated and that came back to the same interface that sent them. Fixes: 4c6c615e3f30 ("net/mlx5e: IPoIB, Add PKEY child interface nic profile") Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Reviewed-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
d628ee4f |
|
13-May-2020 |
Jesper Dangaard Brouer <brouer@redhat.com> |
mlx5: Rx queue setup time determine frame_sz for XDP The mlx5 driver has multiple memory models, which also change according to whether a XDP bpf_prog is attached. The 'rx_striding_rq' setting is adjusted via ethtool priv-flags, e.g.: # ethtool --set-priv-flags mlx5p2 rx_striding_rq off In the general case, with 4K page_size and a regular MTU packet, the frame_sz is 2048, and 4096 when XDP is enabled, in both modes. The info on the given frame size is stored differently depending on the RQ mode, encoded in a union in struct mlx5e_rq (union wqe/mpwqe). In rx striding mode, rq->mpwqe.log_stride_sz is either 11 or 12, which corresponds to 2048 or 4096 (MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ). In non-striding mode (MLX5_WQ_TYPE_CYCLIC), the frag_stride is stored in rq->wqe.info.arr[0].frag_stride for the first fragment, which is what the XDP case cares about. To reduce the effect on the fast path, this patch determines the frame_sz at setup time, to avoid determining the memory model at runtime. The variable is named frame0_sz to make it clear that this is only the frame size of the first fragment. The mlx5 driver does a DMA-sync on the XDP_TX action, but growing is safe as it has done a DMA-map on the entire PAGE_SIZE. The driver also already does an XDP length check against sq->hw_mtu on the possible XDP xmit paths mlx5e_xmit_xdp_frame() + mlx5e_xmit_xdp_frame_mpwqe(). V3+4: Change variable name first_frame_sz to frame0_sz V2: Fix that frag_size needs to be recalculated before creating the SKB. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Tariq Toukan <tariqt@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Link: https://lore.kernel.org/bpf/158945348021.97035.12295039384250022883.stgit@firesoul
|
#
28bff095 |
|
16-Dec-2019 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Enhance ICOSQ WQE info fields The same WQE opcode might be used in different ICOSQ flows and WQE types. To have a better distinguishability, replace it with an enum that better indicates the WQE type and flow it is used for. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
41a8e4eb |
|
19-Mar-2020 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Use struct assignment for WQE info updates Struct assignment looks cleaner, and implies resetting the unassigned fields to zero, instead of holding values from older ring cycles. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
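What the commit means by struct assignment, as a sketch (field and constant names are illustrative):

    /* A designated-initializer assignment zeroes every field not named,
     * so no stale state survives from a previous ring cycle.
     */
    sq->db.wqe_info[pi] = (struct mlx5e_icosq_wqe_info) {
            .wqe_type   = MLX5E_ICOSQ_WQE_UMR_RX,
            .num_wqebbs = MLX5E_UMR_WQEBBS,
    };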
|
#
ec9cdca0 |
|
16-Apr-2020 |
Maxim Mikityanskiy <maximmi@mellanox.com> |
net/mlx5e: Unify reserving space for WQEs In our fast-path design, a WQE (Work Queue Element) must not cross the page boundary. To enforce that, for WQEs consisting of more than one BB (Basic Block), the driver checks the available contiguous space in the WQ in advance, and if it's not enough, it pads it with NOPs. This patch modifies the code that calculates the position of the next WQE, considering the padding, and prepares the WQE. This code is common to all SQ types. In this patch it's reorganized in a way that makes the usage pattern unified for all SQ types, and makes the implementations self-contained and look almost the same, preparing the repeating code for further deduplication attempts. One place is left as is: mlx5e_sq_xmit and the mlx5e_fill_sq_frag_edge call inside it, because it is special in that it may also copy the WQE's cseg and eseg when reserving space. This will be eliminated in one of the following patches, and this place will be converted to the new approach, too. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
7d42c8e9 |
|
16-Apr-2020 |
Maxim Mikityanskiy <maximmi@mellanox.com> |
net/mlx5e: Rename ICOSQ WQE info struct and field Structs mlx5e_txqsq and mlx5e_xdpsq contain wqe_info arrays to store supplementary information corresponding to WQEs in the queue. Struct mlx5e_icosq also has such an array, but it's called differently - ico_wqe. This patch renames it to unify with the other SQs. In addition, rename the struct to emphasize its specific usage. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
f1b95753 |
|
09-Feb-2020 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: TX, Generalise code and usage of error CQE dump Error CQE was dumped only for TXQ SQs. Generalise the function, and add usage for error completions on ICO SQs and XDP SQs. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Aya Levin <ayal@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
e7e0004a |
|
11-Feb-2020 |
Maxim Mikityanskiy <maximmi@mellanox.com> |
net/mlx5e: Don't trigger IRQ multiple times on XSK wakeup to avoid WQ overruns XSK wakeup function triggers NAPI by posting a NOP WQE to a special XSK ICOSQ. When the application floods the driver with wakeup requests by calling sendto() in a certain pattern that ends up in mlx5e_trigger_irq, the XSK ICOSQ may overflow. Multiple NOPs are not required and won't accelerate the process, so avoid posting a second NOP if there is one already on the way. This way we also avoid increasing the queue size (which might not help anyway). Fixes: db05815b36cb ("net/mlx5e: Add XSK zero-copy support") Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
1de0306c |
|
09-Mar-2020 |
Aya Levin <ayal@mellanox.com> |
net/mlx5e: Enhance ICOSQ WQE info fields Add number of WQEBBs (WQE's Basic Block) to WQE info struct. Set the number of WQEBBs on WQE post, and increment the consumer counter (cc) on completion. In case of error completions, the cc was mistakenly not incremented, keeping a gap between cc and pc (producer counter). This failed the recovery flow on the ICOSQ from a CQE error which timed-out waiting for the cc and pc to meet. Fixes: be5323c8379f ("net/mlx5e: Report and recover from CQE error on ICOSQ") Signed-off-by: Aya Levin <ayal@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
e9c1d253 |
|
27-Jan-2020 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: RX, Use indirect calls wrapper for handling compressed completions We can avoid an indirect call per compressed completion by wrapping the completion handling call with the appropriate helper. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
b8ce9037 |
|
15-Feb-2020 |
Paul Blakey <paulb@mellanox.com> |
net/mlx5e: Restore tunnel metadata on miss In tunnel and chains setup, we decapsulate the packets on the first chain hop. If we miss on later chains, the packet will come up without the tunnel header, so it won't be taken by the tunnel device, which automatically fills the tunnel metadata, and further tc tunnel matches won't work. On miss, we get the tunnel mapping id, which was set on the chain 0 rule that decapsulated the packet. This rule matched the tunnel outer headers. From the tunnel mapping id, we get to these tunnel matches and restore the equivalent tunnel info metadata dst on the skb. We also set skb->dev to the relevant device (the tunnel device). Now further tc processing can be done on the relevant device. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
d6d27782 |
|
15-Feb-2020 |
Paul Blakey <paulb@mellanox.com> |
net/mlx5: E-Switch, Restore chain id on miss Chain ids are mapped to the lower part of reg C, and after loopback are copied to the CQE via a restore rule's flow_tag. To let tc continue in the correct chain, we find the corresponding chain id in the eswitch chain id <-> reg C mapping, and set the SKB's tc extension chain to it. That tells tc to continue processing from this set chain. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
dfd9e750 |
|
15-Feb-2020 |
Paul Blakey <paulb@mellanox.com> |
net/mlx5e: Rx, Split rep rx mpwqe handler from nic Copy the current rep mpwqe rx handler which is also used by nic profile. In the next patch, we will add rep specific logic, just for the rep profile rx handler. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
b57e66ad |
|
09-Jan-2020 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: TX, Error completion is for last WQE in batch For a cyclic work queue, when not requesting a completion per WQE, a single CQE might indicate the completion of several WQEs. However, in case some WQE in the batch causes an error, then an error completion is issued, breaking the batch, and pointing to the offending WQE in the wqe_counter field. Hence, WQE-specific error CQE handling (like printing, breaking, etc...) should be performed only for the last WQE in batch. Fixes: 130c7b46c93d ("net/mlx5e: TX, Dump WQs wqe descriptors on CQE with error events") Fixes: fd9b4be8002c ("net/mlx5e: RX, Support multiple outstanding UMR posts") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Aya Levin <ayal@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
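A sketch of the batched handling this implies: completions walk forward to the wqe_counter reported in the CQE, and the WQE-specific error dump fires only for that last, offending WQE. Types and helpers below are stand-ins, not the driver's API:

    #include <stdint.h>
    #include <stdio.h>

    struct cqe { uint16_t wqe_counter; int is_error; };

    static void free_wqe(uint16_t ci) { (void)ci; /* release WQE resources */ }

    /* One CQE may complete a whole batch of WQEs ending at wqe_counter;
     * the WQE-specific error dump runs only for the last, offending WQE.
     */
    static void handle_cqe(struct cqe *cqe, uint16_t *cc)
    {
        for (;;) {
            uint16_t ci = (*cc)++;
            int is_last = (ci == cqe->wqe_counter);

            if (cqe->is_error && is_last)
                fprintf(stderr, "error on WQE %u\n", ci);  /* dump here */
            free_wqe(ci);
            if (is_last)
                break;
        }
    }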
#
6849c6d8 |
|
19-Nov-2019 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Rx, Update page pool numa node when changed Once every napi poll cycle, check if the numa node is different from the page pool's numa id, and update it using page_pool_update_nid(). Alternatively, we could have registered an irq affinity change handler, but page_pool_update_nid() must be called from napi context anyway, so the handler won't actually help. Performance testing: XDP drop/tx rate and TCP single/multi stream, on the mlx5 driver while migrating the rx ring irq from close to far numa: the mlx5 internal page cache was locally disabled to get pure page pool results. CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz NIC: Mellanox Technologies MT27700 Family [ConnectX-4] (100G) XDP Drop/TX single core:

    NUMA  | XDP  | Before   | After
    ------|------|----------|----------
    Close | Drop | 11 Mpps  | 10.9 Mpps
    Far   | Drop | 4.4 Mpps | 5.8 Mpps
    Close | TX   | 6.5 Mpps | 6.5 Mpps
    Far   | TX   | 3.5 Mpps | 4 Mpps

Improvement is about 30% in drop packet rate and 15% in tx packet rate for the numa far test. No degradation for numa close tests. TCP single/multi cpu/stream:

    NUMA  | #cpu | Before  | After
    ------|------|---------|--------
    Close | 1    | 18 Gbps | 18 Gbps
    Far   | 1    | 15 Gbps | 18 Gbps
    Close | 12   | 80 Gbps | 80 Gbps
    Far   | 12   | 68 Gbps | 80 Gbps

In all test cases we see improvement for the far numa case, and no impact on the close numa case. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
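A standalone model of the check: only page_pool_update_nid() and numa_mem_id() come from the text; the cached-node field and the stubs are assumptions of this sketch:

    struct page_pool { int nid; };
    struct mlx5e_rq { struct page_pool *page_pool; int pool_nid; };

    /* Stand-ins for the kernel helpers named in the commit message: */
    static int numa_mem_id(void) { return 0; }
    static void page_pool_update_nid(struct page_pool *pool, int nid)
    {
        pool->nid = nid;
    }

    /* Once per NAPI poll: if irq affinity moved the ring to another
     * node, tell the pool so new pages come from local memory.
     */
    static void rq_check_numa(struct mlx5e_rq *rq)
    {
        int nid = numa_mem_id();

        if (rq->pool_nid != nid) {
            page_pool_update_nid(rq->page_pool, nid);
            rq->pool_nid = nid;
        }
    }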
#
9df86bdb |
|
16-Sep-2019 |
Maxim Mikityanskiy <maximmi@mellanox.com> |
net/mlx5e: Fix handling of compressed CQEs in case of low NAPI budget When CQE compression is enabled, compressed CQEs use the following structure: a title is followed by one or many blocks, each containing 8 mini CQEs (except the last, which may contain fewer mini CQEs). Due to NAPI budget restriction, a complete structure is not always parsed in one NAPI run, and some blocks with mini CQEs may be deferred to the next NAPI poll call - we have the mlx5e_decompress_cqes_cont call in the beginning of mlx5e_poll_rx_cq. However, if the budget is extremely low, some blocks may be left even after that, but the code that follows the mlx5e_decompress_cqes_cont call doesn't check it and assumes that a new CQE begins, which may not be the case. In such cases, random memory corruptions occur. An extremely low NAPI budget of 8 is used when busy_poll or busy_read is active. This commit adds a check to make sure that the previous compressed CQE has been completely parsed after mlx5e_decompress_cqes_cont, otherwise it prevents a new CQE from being fetched in the middle of a compressed CQE. This commit fixes random crashes in __build_skb, __page_pool_put_page and other not-related-directly places, that used to happen when both CQE compression and busy_poll/busy_read were enabled. Fixes: 7219ab34f184 ("net/mlx5e: CQE compression") Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
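The shape of the fix, as a self-contained sketch: the poll loop may only fetch a fresh CQE once no compressed session is left half-parsed. Field and helper names are invented for illustration:

    #include <stddef.h>

    struct cqe { int data; };
    struct rx_cq {
        int decmprs_left;   /* mini CQEs of the open session not yet parsed */
    };

    /* Stand-ins for the real handlers: */
    static int decompress_cqes_cont(struct rx_cq *cq, int budget);
    static struct cqe *next_cqe(struct rx_cq *cq);
    static int handle_cqe(struct rx_cq *cq, struct cqe *cqe);

    static int poll_rx_cq(struct rx_cq *cq, int budget)
    {
        int work = 0;

        /* Finish a compressed session left over from the previous poll. */
        if (cq->decmprs_left)
            work += decompress_cqes_cont(cq, budget);

        /* The fix: fetch a fresh CQE only if no session is still open;
         * with a tiny budget (e.g. busy_poll's 8) one may well remain.
         */
        while (work < budget && !cq->decmprs_left) {
            struct cqe *cqe = next_cqe(cq);

            if (!cqe)
                break;
            work += handle_cqe(cq, cqe);
        }
        return work;
    }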
#
8276ea13 |
|
26-Jun-2019 |
Aya Levin <ayal@mellanox.com> |
net/mlx5e: Report and recover from CQE with error on RQ Add support for reporting and recovering from an error completion on the RQ, by setting the queue back to ready state. Handle only errors with a syndrome indicating the RQ might enter error state and could be recovered. Signed-off-by: Aya Levin <ayal@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
0a35ab3e |
|
14-Jun-2019 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: RX, Handle CQE with error at the earliest stage Just to be aligned with the MPWQE handlers, handle an RX WQE with error for legacy RQs in the top RX handlers, just before calling skb_from_cqe(). CQE error handling will now be called at the same stage regardless of the RQ type or netdev mode (NIC, Representor, IPoIB, etc.). This will be useful for a downstream patch to improve error CQE handling. Signed-off-by: Aya Levin <ayal@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
be5323c8 |
|
25-Jun-2019 |
Aya Levin <ayal@mellanox.com> |
net/mlx5e: Report and recover from CQE error on ICOSQ Add support for reporting and recovering from an error completion on the ICOSQ. Deactivate the RQ and flush, then deactivate the ICOSQ. Set the queue back to ready state (firmware) and reset the ICOSQ and the RQ (software resources). Finally, activate the ICOSQ and the RQ. Signed-off-by: Aya Levin <ayal@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
a7bd4018 |
|
14-Aug-2019 |
Maxim Mikityanskiy <maximmi@mellanox.com> |
net/mlx5e: Add AF_XDP need_wakeup support This commit adds support for the new need_wakeup feature of AF_XDP. The applications can opt-in by using the XDP_USE_NEED_WAKEUP bind() flag. When this feature is enabled, some behavior changes: RX side: If the Fill Ring is empty, instead of busy-polling, set the flag to tell the application to kick the driver when it refills the Fill Ring. TX side: If there are pending completions or packets queued for transmission, set the flag to tell the application that it can skip the sendto() syscall and save time. The performance testing was performed on a machine with the following configuration: - 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz - Mellanox ConnectX-5 Ex with 100 Gbit/s link The results with retpoline disabled:

           | without need_wakeup  | with need_wakeup     |
           | one core | two cores | one core | two cores |
    -------|----------|-----------|----------|-----------|
    txonly | 20.1     | 33.5      | 29.0     | 34.2      |
    rxdrop | 0.065    | 14.1      | 12.0     | 14.1      |
    l2fwd  | 0.032    | 7.3       | 6.6      | 7.2       |

"One core" means the application and NAPI run on the same core. "Two cores" means they are pinned to different cores. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
8c7698d5 |
|
03-May-2019 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Rx, checksum handling refactoring Move vlan checksum fixup flow into mlx5e_skb_padding_csum(), which is supposed to fixup SKB checksum if needed. And rename mlx5e_skb_padding_csum() to mlx5e_skb_csum_fixup(). Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
db849faa |
|
03-May-2019 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Rx, Fix checksum calculation for new hardware CQE checksum full mode in new HW provides a full checksum of the rx frame, covering bytes starting from the eth protocol up to the last byte in the received frame (frame_size - ETH_HLEN), as expected by the stack. Fixing up skb->csum by the driver is not required in such a case. This fix is to avoid a wrong checksum calculation in drivers which already support the new hardware with the new checksum mode. Fixes: 85327a9c4150 ("net/mlx5: Update the list of the PCI supported devices") Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
db05815b |
|
26-Jun-2019 |
Maxim Mikityanskiy <maximmi@mellanox.com> |
net/mlx5e: Add XSK zero-copy support This commit adds support for AF_XDP zero-copy RX and TX. We create a dedicated XSK RQ inside the channel, which means that two RQs are running simultaneously: one for non-XSK traffic and the other for XSK traffic. The regular and XSK RQs use a single ID namespace split into two halves: the lower half is regular RQs, and the upper half is XSK RQs. When any zero-copy AF_XDP socket is active, changing the number of channels is not allowed, because it would break the mapping between XSK RQ IDs and channels. XSK requires different page allocation and release routines. Such functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are generic enough to be used for both regular and XSK RQs, and they use the mlx5e_page_{alloc,release} wrappers around the real allocation functions. Function pointers are not used to avoid losing the performance with retpolines. Wherever it's certain that the regular (non-XSK) page release function should be used, it's called directly. Only the stats that could be meaningful for XSK are exposed to the userspace. Those that don't take part in the XSK flow are not considered. Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ), because the newer xdpsock sample doesn't provide any Fill Ring entries at the setup stage. We create a dedicated XSK SQ in the channel. This separation has its advantages: 1. When the UMEM is closed, the XSK SQ can also be closed and stop receiving completions. If an existing SQ was used for XSK, it would continue receiving completions for the packets of the closed socket. If a new UMEM was opened at that point, it would start getting completions that don't belong to it. 2. Calculating statistics separately. When the userspace kicks the TX, the driver triggers a hardware interrupt by posting a NOP to a dedicated XSK ICO (internal control operations) SQ, in order to trigger NAPI on the right CPU core. This XSK ICO SQ is protected by a spinlock, as the userspace application may kick the TX from any core. Store the pointers to the UMEMs in the net device private context, independently from the kernel. This way the driver can distinguish between the zero-copy and non-zero-copy UMEMs. The kernel function xdp_get_umem_from_qid does not care about this difference, but the driver is only interested in zero-copy UMEMs; particularly, on cleanup it determines whether to close the XSK RQ and SQ or not by looking at the presence of the UMEM. Use state_lock to protect the access to this area of UMEM pointers. LRO isn't compatible with XDP, but there may be active UMEMs while XDP is off. If this is the case, don't allow LRO to ensure XDP can be re-enabled at any time. The validation of XSK parameters typically happens when XSK queues open. However, when the interface is down or the XDP program isn't set, it's still possible to have active AF_XDP sockets and even to open new ones, but the XSK queues will be closed. To cover these cases, perform the validation also in these flows: 1. A new UMEM is registered, but the XSK queues aren't going to be created due to a missing XDP program or the interface being down. 2. MTU changes while there are UMEMs registered. Having this early check prevents mlx5e_open_channels from failing at a later stage, where recovery is impossible and the application has no chance to handle the error, because it got the successful return value for an MTU change or XSK open operation. 
The performance testing was performed on a machine with the following configuration: - 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz - Mellanox ConnectX-5 Ex with 100 Gbit/s link The results with retpoline disabled, single stream: txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU) rxdrop: 12.2 Mpps l2fwd: 9.4 Mpps The results with retpoline enabled, single stream: txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU) rxdrop: 9.9 Mpps l2fwd: 6.8 Mpps Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
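The split ID namespace reduces to simple arithmetic; a hedged sketch with invented names:

    #include <stdbool.h>

    /* With N channels, IDs [0, N) are regular RQs and [N, 2N) are the
     * XSK RQs of the same channels.
     */
    static bool id_is_xsk(unsigned int id, unsigned int num_channels)
    {
        return id >= num_channels;
    }

    static unsigned int id_to_channel(unsigned int id, unsigned int num_channels)
    {
        return id_is_xsk(id, num_channels) ? id - num_channels : id;
    }

This mapping is also why changing the number of channels is forbidden while zero-copy sockets are active: the channel count is baked into every XSK ID.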
#
ed084fb6 |
|
26-Jun-2019 |
Maxim Mikityanskiy <maximmi@mellanox.com> |
net/mlx5e: Allow ICO SQ to be used by multiple RQs Prepare to creation of the XSK RQ, which will require posting UMRs, too. The same ICO SQ will be used for both RQs and also to trigger interrupts by posting NOPs. UMR WQEs can't be reused any more. Optimization introduced in commit ab966d7e4ff98 ("net/mlx5e: RX, Recycle buffer of UMR WQEs") is reverted. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
29b006a6 |
|
18-Jun-2019 |
Jesper Dangaard Brouer <brouer@redhat.com> |
mlx5: more strict use of page_pool API The mlx5 driver is using page_pool, but not for DMA-mapping (currently), and is a little too relaxed about returning or releasing page resources, as it is not strictly necessary when not using DMA-mappings. Since this patchset is working towards tracking page_pool resources, to know about in-flight frames on shutdown, fix the places where mlx5 leaks page_pool resources. In case of dma_mapping_error, recycle into the page_pool. In mlx5e_free_rq(), move the page_pool_destroy() call to after the mlx5e_page_release() calls, as it is more correct. In mlx5e_page_release(), when no recycle was requested, release the page from the page_pool via page_pool_release_page(). Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
55f96872 |
|
11-Jun-2019 |
Paolo Abeni <pabeni@redhat.com> |
net/mlx5e: use indirect calls wrapper for the rx packet handler We can avoid another indirect call per packet by wrapping the rx handler call with the proper helper. To ensure that even the last listed direct call experiences a measurable gain, despite the additional conditionals we must traverse before reaching it, I tested reversing the order of the listed options, with performance differences below noise level. Together with the previous indirect call patch, this gives ~6% performance improvement in raw UDP tput. v2 -> v3: - use only the direct calls always available regardless of the mlx5 build options - drop the direct call list macro, to keep the code as simple as possible for future rework v1 -> v2: - update the direct call list and use a macro to define it, as per Saeed's suggestion. An intermediate additional macro is needed to allow arg list expansion Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b3c04e83 |
|
11-Jun-2019 |
Paolo Abeni <pabeni@redhat.com> |
net/mlx5e: use indirect calls wrapper for skb allocation We can avoid an indirect call per packet by wrapping the skb creation with the appropriate helper. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fd9b4be8 |
|
26-Feb-2019 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: RX, Support multiple outstanding UMR posts The buffers mapping of the Multi-Packet WQEs (of Striding RQ) is done via UMR posts, one UMR WQE per RX MPWQE. A single MPWQE is capable of serving many incoming packets, usually larger than the budget of a single napi cycle. Hence, posting a single UMR WQE per napi cycle (and handling its completion in the next cycle) works fine in many common cases, but not always. When an XDP program is loaded, every MPWQE is capable of serving fewer packets, to satisfy the packet-per-page requirement. Thus, for the same number of packets more MPWQEs (and UMR posts) are needed (twice as many for the default MTU), giving less latency room for the UMR completions. In this patch, we add support for multiple outstanding UMR posts, to allow faster gap closure between consuming MPWQEs and reposting them back into the WQ. For better SW and HW locality, we combine the UMR posts in bulks of (at least) two. This is expected to improve packet rate in high CPU scale. Performance test: As expected, huge improvement in large-scale (48 cores). xdp_redirect_map, 64B UDP multi-stream. Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps. CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz. Before: Unstable, 7 to 30 Mpps After: Stable, at 70.5 Mpps No degradation in other tested scenarios. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
8c8811d4 |
|
30-Mar-2019 |
Or Gerlitz <ogerlitz@mellanox.com> |
Revert "net/mlx5e: Enable reporting checksum unnecessary also for L3 packets" This reverts commit b820e6fb0978f9c2ac438c199d2bb2f35950e9c9. Prior the commit we are reverting, checksum unnecessary was only set when both the L3 OK and L4 OK bits are set on the CQE. This caused packets of IP protocols such as SCTP which are not dealt by the current HW L4 parser (hence the L4 OK bit is not set, but the L4 header type none bit is set) to go through the checksum none code, where currently we wrongly report checksum unnecessary for them, a regression. Fix this by a revert. Note that on our usual track we report checksum complete, so the revert isn't expected to have any notable performance impact. Also, when we are not on the checksum complete track, the L4 protocols for which we report checksum none are not high performance ones, we will still report checksum unnecessary for UDP/TCP. Fixes: b820e6fb0978 ("net/mlx5e: Enable reporting checksum unnecessary also for L3 packets") Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reported-by: Avi Urman <aviu@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
0318a7b7 |
|
25-Mar-2019 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Rx, Check ip headers sanity In the two places where is_last_ethertype_ip is called, the caller will be looking inside the ip header. To be safe, add ip{4,6} header sanity checks, and return true only on valid ip headers, i.e.: the whole header is contained in the linear part of the skb. Note: Such a situation is very rare and hard to reproduce, since mlx5e allocates a large enough headroom to contain the largest header one can imagine. Fixes: fe1dc069990c ("net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets") Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
0aa1d186 |
|
12-Mar-2019 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Rx, Fixup skb checksum for packets with tail padding When an ethernet frame with ip payload is padded, the padding octets are not covered by the hardware checksum. Prior to the cited commit, skb checksum was forced to be CHECKSUM_NONE when padding was detected. After it, the kernel will try to trim the padding bytes and subtract their checksum from skb->csum. In this patch we fix up skb->csum for any ip packet with tail padding of any size, if any padding is found. The FCS case is just one special case of this general purpose patch; hence, it is removed. Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends") Cc: Eric Dumazet <edumazet@google.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
5d0bb3ba |
|
21-Mar-2019 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: XDP, Avoid checksum complete when XDP prog is loaded XDP programs might change packets data contents which will make the reported skb checksum (checksum complete) invalid. When XDP programs are loaded/unloaded set/clear rx RQs MLX5E_RQ_STATE_NO_CSUM_COMPLETE flag. Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support") Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
3d6f3cdf |
|
14-Jan-2019 |
Feras Daoud <ferasda@mellanox.com> |
net/mlx5e: IPoIB, Fix RX checksum statistics update Update the RX checksum only if the feature is enabled. Fixes: 9d6bd752c63c ("net/mlx5e: IPoIB, RX handler") Signed-off-by: Feras Daoud <ferasda@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
79d356ef |
|
12-Sep-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Take CQ decompress fields into a separate structure Only the Receive CQ makes use of these fields. Take them out into a separate struct and save space in the generic CQ structure. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
94816278 |
|
24-Sep-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: RX, Make sure packet header does not cross page boundary In the non-linear SKB memory scheme of Striding RQ, a packet header could cross a page boundary. This requires special care in the fast path that costs LoC, additional runtime instructions and branches. It could happen when the header (up to 256B) does not fit in a single stride. Avoid this by working with a stride size that fits the maximum possible header. Stride is increased from 64B to 256B. Performance: Tested packet rate for UDP streams, single ring, on ConnectX-5. Configuration: Set Striding RQ and LRO ON (to enable the non-linear SKB scheme). GRO OFF, early drop by TC rule. 64B: 4x worse memory utilization, no page-crosser headers - No degradation (5,887,305 pps). - The reduction in memory utilization is compensated by the saving of branch tests. 192B: 1.33x worse memory utilization, avoids page-crosser headers - Before: 5,727,252. After: 5,777,037. ~1% gain. 256B: Same memory util, no page-crossers - Before: 5,691,885. After: 5,748,007. ~1% gain. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
e8c8b53c |
|
03-Dec-2018 |
Cong Wang <xiyou.wangcong@gmail.com> |
net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames When an ethernet frame is padded to meet the minimum ethernet frame size, the padding octets are not covered by the hardware checksum. Fortunately the padding octets are usually zeros, which don't affect the checksum. However, we have a switch which pads non-zero octets, and this causes repeated kernel hardware checksum faults. Prior to commit 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE ...") skb checksum was forced to be CHECKSUM_NONE when padding was detected. After it, we need to keep skb->csum updated, like what we do for RXFCS. However, fixing up CHECKSUM_COMPLETE requires verifying and parsing IP headers; it is not worth the effort, as the packets are so small that CHECKSUM_COMPLETE can't save anything. Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends") Cc: Eric Dumazet <edumazet@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Nikola Ciprich <nikola.ciprich@linuxbox.cz> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
4fb2f516 |
|
21-Nov-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: XDP, Precede XDP-related operations in RQ poll by a loaded program check At the end of the RQ polling loop, some XDP-related operations might be required. Before checking them one by one, check if an XDP program is even loaded. Combine all the checks and operations in a single function in xdp files. This saves unnecessary checks for non-XDP flows. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
bfc69825 |
|
02-Dec-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: RX, Fix wrong early return in receive queue poll When the completion queue of the RQ is empty, do not immediately return. If left-over decompressed CQEs (from the previous cycle) were processed, we need to go to the finalization part of the poll function. The bug exists only when CQE compression is turned ON. This solves the following issue: mlx5_core 0000:82:00.1: mlx5_eq_int:544:(pid 0): CQ error on CQN 0xc08, syndrome 0x1 mlx5_core 0000:82:00.1 p4p2: mlx5e_cq_error_event: cqn=0x000c08 event=0x04 Fixes: 4b7dfc992514 ("net/mlx5e: Early-return on empty completion queues") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
6254adeb |
|
04-Dec-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5: Use helper to get CQE opcode Introduce and use a helper that extracts the opcode from a CQE (completion queue entry) structure. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
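The mlx5 CQE keeps its opcode in the high bits of the op_own byte; the helper hides that layout from callers. A sketch of its shape, with the struct reduced to the one relevant field for illustration:

    #include <stdint.h>

    /* The CQE stores its opcode in the high 4 bits of op_own. */
    struct mlx5_cqe64 { uint8_t op_own; };

    static inline uint8_t get_cqe_opcode(const struct mlx5_cqe64 *cqe)
    {
        return cqe->op_own >> 4;
    }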
#
ef6fcd45 |
|
28-Nov-2018 |
Cong Wang <xiyou.wangcong@gmail.com> |
mlx5: fix get_ip_proto() IP header is not necessarily located right after struct ethhdr, there could be multiple 802.1Q headers in between, this is why we call __vlan_get_protocol(). Fixes: fe1dc069990c ("net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets") Cc: Alaa Hleihel <alaa@mellanox.com> Cc: Or Gerlitz <ogerlitz@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0073c8f7 |
|
10-Oct-2018 |
Moshe Shemesh <moshe@mellanox.com> |
net/mlx5e: RX, verify received packet size in Linear Striding RQ In case of striding RQ, we use MPWRQ (Multi Packet WQE RQ), which means that a WQE (RX descriptor) can be used for many packets, so the WQE is much bigger than the MTU. In virtualization setups where the port mtu can be larger than the vf mtu, if a received packet is bigger than the MTU, it won't be dropped by HW for a too-small receive WQE. If we use a linear SKB in striding RQ, since each stride has room for an mtu-size payload and skb info, an oversized packet can lead to a crash by crossing the allocated page boundary upon the call to build_skb. So the driver needs to check the packet size and drop it. Introduce a new SW rx counter, rx_oversize_pkts_sw_drop, which counts the number of packets dropped by the driver for being too large. As a new field is added to the RQ struct, re-open the channels whenever this field is being used in the datapath (i.e., in the case of linear Striding RQ). Fixes: 619a8f2a42f1 ("net/mlx5e: Use linear SKB in Striding RQ") Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
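A minimal model of the added check and counter (the counter name rx_oversize_pkts_sw_drop follows the text; everything else is invented for the sketch):

    #include <stdbool.h>
    #include <stdint.h>

    struct rq_stats { uint64_t rx_oversize_pkts_sw_drop; };

    /* Drop (and count) packets whose CQE byte count exceeds the frame
     * size a linear stride was provisioned for; otherwise build_skb()
     * would run past the allocated page.
     */
    static bool rx_pkt_fits(uint32_t cqe_bcnt, uint32_t max_frame_sz,
                            struct rq_stats *stats)
    {
        if (cqe_bcnt > max_frame_sz) {
            stats->rx_oversize_pkts_sw_drop++;
            return false;               /* caller frees the WQE, no SKB */
        }
        return true;
    }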
#
d48051c5 |
|
30-Oct-2018 |
Eric Dumazet <edumazet@google.com> |
net/mlx5e: fix csum adjustments caused by RXFCS As shown by Dmitris, we need to use csum_block_add() instead of csum_add() when adding the FCS contribution to skb csum. Before 4.18 (more exactly commit 88078d98d1bb "net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends"), the whole skb csum was thrown away, so RXFCS changes were ignored. Then before commit d55bef5059dd ("net: fix pskb_trim_rcsum_slow() with odd trim offset") both mlx5 and pskb_trim_rcsum_slow() bugs were canceling each other. Now we fixed pskb_trim_rcsum_slow() we need to fix mlx5. Note that this patch also rewrites mlx5e_get_fcs() to : - Use skb_header_pointer() instead of reinventing it. - Use __get_unaligned_cpu32() to avoid possible non aligned accesses as Dmitris pointed out. Fixes: 902a545904c7 ("net/mlx5e: When RXFCS is set, add FCS data into checksum calculation") Reported-by: Paweł Staszewski <pstaszewski@itcare.pl> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Eran Ben Elisha <eranbe@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Cc: Dimitris Michailidis <dmichail@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Paweł Staszewski <pstaszewski@itcare.pl> Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com> Tested-By: Maria Pasechnik <mariap@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
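Why csum_block_add() and not csum_add(): 16-bit one's-complement sums weigh bytes by position, so a block that starts at an odd offset must have its byte lanes swapped before being merged. A simplified userspace model (an assumption-level sketch keeping sums folded to 16 bits, not the kernel's exact implementation):

    #include <stdint.h>

    static uint32_t csum_fold16(uint32_t sum)
    {
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16);   /* end-around carry */
        return sum;
    }

    static uint32_t csum_add(uint32_t csum, uint32_t addend)
    {
        return csum_fold16(csum + addend);
    }

    static uint32_t csum_block_add(uint32_t csum, uint32_t csum2, int offset)
    {
        csum2 = csum_fold16(csum2);
        if (offset & 1)                            /* odd start: swap lanes */
            csum2 = ((csum2 & 0xff) << 8) | (csum2 >> 8);
        return csum_add(csum, csum2);
    }

The FCS sits at offset skb->len within the checksummed data, which may be odd; merging it with plain csum_add() mis-weights its bytes, which is exactly the bug described above.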
#
55469bc6 |
|
25-Oct-2018 |
Eric Dumazet <edumazet@google.com> |
drivers: net: remove <net/busy_poll.h> inclusion when not needed Drivers using generic NAPI interface no longer need to include <net/busy_poll.h>, since busy polling was moved to core networking stack long ago. See commit 79e7fff47b7b ("net: remove support for per driver ndo_busy_poll()") for reference. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
37fdffb2 |
|
21-Aug-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5: WQ, fixes for fragmented WQ buffers API The mlx5e netdevice used to calculate fragment edges by a call to mlx5_wq_cyc_get_frag_size(). This calculation did not give the correct indication for queues smaller than a PAGE_SIZE (broken by default on PowerPC, where PAGE_SIZE == 64KB). Here it is replaced by the correct new calls/API. Since (TX/RX) Work Queue buffers are fragmented, here we introduce changes to the API in the core driver, so that it gets a stride index and returns the index of the last stride on the same fragment, and an additional wrapping function that returns the number of physically contiguous strides that can be written contiguously to the work queue. This obsoletes the following API functions, and their buggy usage in the EN driver: * mlx5_wq_cyc_get_frag_size() * mlx5_wq_cyc_ctr2fragix() The new API improves modularity and hides the details of such calculation for the mlx5e netdevice and mlx5_ib rdma drivers. The new calculation is also more efficient, and improves performance as follows: Packet rate test: pktgen, UDP / IPv4, 64byte, single ring, 8K ring size. Before: 16,477,619 pps After: 17,085,793 pps 3.7% improvement Fixes: 3a2f70331226 ("net/mlx5: Use order-0 allocations for all WQ types") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
b856df28 |
|
01-Jul-2018 |
Or Gerlitz <ogerlitz@mellanox.com> |
net/mlx5e: Allow reporting of checksum unnecessary Currently we practically never report checksum unnecessary, because for all IP packets we take the checksum complete path. Enable non-default runs with reporting checksum unnecessary, using an ethtool private flag. This can be useful for performance evals and other explorations. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
b820e6fb |
|
01-Jul-2018 |
Or Gerlitz <ogerlitz@mellanox.com> |
net/mlx5e: Enable reporting checksum unnecessary also for L3 packets We can report checksum unnecessary also when the L3 checksum flag on the cqe is set and there's no L4 header. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
fe1dc069 |
|
10-Jul-2018 |
Alaa Hleihel <alaa@mellanox.com> |
net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets CHECKSUM_COMPLETE is not applicable to SCTP protocol. Setting it for SCTP packets leads to CRC32c validation failure. Fixes: bbceefce9adf ("net/mlx5e: Support RX CHECKSUM_COMPLETE") Signed-off-by: Alaa Hleihel <alaa@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
f007c13d |
|
20-Mar-2018 |
Natali Shechtman <natali@mellanox.com> |
net/mlx5e: Set ECN for received packets using CQE indication In multi-host (MH) NIC scheme, a single HW port serves multiple hosts or sockets on the same host. The HW uses a mechanism in the PCIe buffer which monitors the amount of consumed PCIe buffers per host. On a certain configuration, under congestion, the HW emulates a switch doing ECN marking on packets using ECN indication on the completion descriptor (CQE). The driver needs to set the ECN bits on the packet SKB, such that the network stack can react on that, this commit does that. Signed-off-by: Natali Shechtman <natali@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
19052a3b |
|
02-Sep-2018 |
Feras Daoud <ferasda@mellanox.com> |
net/mlx5e: IPoIB, Use priv stats in completion rx flow Since the RQs are shared between all pkey interfaces, the stats should be taken from where the per-ring stats are stored instead of the parent RQ. Fixes: 4c6c615e3f30 ("net/mlx5e: IPoIB, Add PKEY child interface nic profile") Signed-off-by: Feras Daoud <ferasda@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d3398a4f |
|
16-May-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: RX, Prefetch the xdp_frame data area A loaded XDP program might write to the xdp_frame data area, prefetchw() it to avoid a potential cache miss. Performance tests: ConnectX-5, XDP_TX packet rate, single ring. CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz Before: 13,172,976 pps After: 13,456,248 pps 2% gain. Fixes: 22f453988194 ("net/mlx5e: Support XDP over Striding RQ") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
dac0d15f |
|
22-May-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Re-order fields of struct mlx5e_xdpsq In the downstream patch that adds support for XDP_REDIRECT-out, the XDP xmit frame function doesn't share the same run context as the NAPI that polls the XDP-SQ completion queue. Hence, we need to re-order the XDP-SQ fields to avoid cacheline false-sharing. Take redirect_flush and doorbell out of DB, into separate cachelines. Add a cacheline breaker within the stats struct. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
159d2131 |
|
15-Jul-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Move XDP related code into new XDP files Take XDP code out of the general EN header and RX file into new XDP files. Currently, XDP-SQ resides only within an RQ and used from a single flow (XDP_TX) triggered upon RX completions. In a downstream patch, additional type of XDP-SQ instances will be presented and used for the XDP_REDIRECT flow, totally unrelated to the RX context. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
cb5189d1 |
|
12-Jun-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Do not recycle RX pages in interface down flow Keep all page-pool recycle calls within NAPI context. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
afab995e |
|
12-Jun-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Replace call to MPWQE free with dealloc in interface down flow No need to expose the MPWQE free function to the control path. The dealloc function is already exposed; use it. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
b3ccf978 |
|
13-Jul-2018 |
Boris Pismenny <borisp@mellanox.com> |
net/mlx5e: IPsec, fix byte count in CQE This patch fixes the byte count indication in CQE for processed IPsec packets that contain a metadata header. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
00aebab2 |
|
13-Jul-2018 |
Boris Pismenny <borisp@mellanox.com> |
net/mlx5e: TLS, add Innova TLS rx data path Implement the TLS rx offload data path according to the requirements of the TLS generic NIC offload infrastructure. Special metadata ethertype is used to pass information to the hardware. When hardware loses synchronization a special resync request metadata message is used to request resync. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b71ba6b4 |
|
28-Jun-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Add counter for MPWQE filler strides Add ethtool counter to indicate the number of strides consumed by filler CQEs. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
dc983f0e |
|
04-Mar-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Add a counter for congested UMRs Add per-ring and global ethtool counters for congested UMR requests. These events indicate congestion in the UMR handlers in HW. Such an event is concluded when there's an outstanding UMR post, yet the SW has consumed at least two additional MPWQEs in the meantime. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
cbe73aae |
|
04-Mar-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Add XDP_TX completions statistics Add per-ring and global ethtool counters for XDP_TX completions. This helps us monitor and analyze XDP_TX flow performance. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
c4000283 |
|
03-Jun-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: RX, Use existing WQ local variable Local variable 'wq' already points to &sq->wq, use it. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
069d1146 |
|
02-May-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme Enhance the memory scheme of the legacy RQ, such that only order-0 pages are used. Whenever possible, prefer using a linear SKB, and build it wrapping the WQE buffer. Otherwise (for example, jumbo frames on x86), use a non-linear SKB, with as many frags as needed. In this case, multiple WQE scatter entries are used, up to a maximum of 4 frags and 10KB of MTU. This implied removing support for HW LRO in legacy RQ, as it would require a large number of page allocations and scatter entries per WQE on archs with PAGE_SIZE = 4KB, yielding bad performance. In earlier patches, we guaranteed that all completions are in-order, and that we use a cyclic WQ. This creates an opportunity for a performance optimization: the mapping between a "struct mlx5e_dma_info" and the WQEs (struct mlx5e_wqe_frag_info) pointing to it is constant across different cycles of a WQ. This allows initializing the mapping at the time of RQ creation, and not handling it in the datapath. A struct mlx5e_dma_info that is shared between different WQEs is allocated by the first WQE, and freed by the last one. This implies an important requirement: WQEs that share the same struct mlx5e_dma_info must be posted within the same NAPI. Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly point to the new struct mlx5e_dma_info, not the one that was posted (and the HW wrote to). This bulking requirement is actually good also for performance reasons, hence we extend the bulk beyond the minimal requirement above. With this memory scheme, the RQ's memory footprint is reduced by a factor of 2 on x86, and by a factor of 32 on PowerPC. The same factors apply to the number of pages in a GRO session. Performance tests: ConnectX-4, single core, single RX ring, default MTU. x86: CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz Packet rate (early drop in TC): no degradation TCP streams: ~5% improvement PowerPC: CPU: POWER8 (raw), altivec supported Packet rate (early drop in TC): 20% gain TCP streams: 25% gain Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
99cbfa93 |
|
02-Apr-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: RX, Use cyclic WQ in legacy RQ Now that LRO is not supported for Legacy RQ, there is no source of out-of-order completions in the WQ, and we can use a cyclic one. This has multiple advantages: - reduces the WQE size (smaller PCI transactions). - lower overhead in datapath (no handling of 'next' pointers). - no reserved WQE for the WQ head (was needed in the linked-list). - allows using a constant map between frag and dma_info struct, in a downstream patch. Performance tests: ConnectX-4, single core, single RX ring. Major gain in packet rate of single ring XDP drop. Bottleneck is shifted from HW (at 16Mpps) to SW (at 20Mpps). Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
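The datapath difference is essentially one line: a cyclic WQ maps a free-running counter to a slot with a mask, instead of chasing 'next' pointers. A sketch with an assumed power-of-two size:

    #include <stdint.h>

    #define WQ_SZ 256u                  /* power of two */

    /* A cyclic WQ is a plain array: a free-running counter wraps with
     * a mask. No 'next' pointers, no reserved head WQE.
     */
    static inline uint32_t wqe_index(uint32_t ctr)
    {
        return ctr & (WQ_SZ - 1);
    }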
#
422d4c40 |
|
02-Apr-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: RX, Split WQ objects for different RQ types Replace the common RQ WQ object with two separate ones for the different RQ types. This is in preparation for switching to using a cyclic WQ type in Legacy RQ. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
386471f1 |
|
16-Apr-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: RX, Dedicate a function for copying SKB header Extract the logic of copying the packet header into the SKB linear part into a generic function. The function handles copy length alignment and dma buffer sync. It is currently called only within the MPWQE flow. In a downstream patch, it will be called within the legacy RQ flow as well. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
fa698366 |
|
02-Apr-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: RX, Generalise function of SKB frag addition Rename it and pass truesize as an extra argument, as it will be used also in Legacy RQ in a downstream patch. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
75aa889f |
|
02-Apr-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: RX, Generalise name of non-linear SKB head size Make name more generic by dropping MPWRQ from it, as it will be used also in Legacy RQ in a downstream patch. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
05909bab |
|
12-Apr-2018 |
Eran Ben Elisha <eranbe@mellanox.com> |
net/mlx5e: Avoid reset netdev stats on configuration changes Move all RQ, SQ and channel counters from the channel objects into the priv structure. With this change, counters will not be reset upon channel configuration changes. Channel's statistics for SQs which are associated with TCs higher than zero will be presented in ethtool -S, only for SQs which were opened at least once since the module was loaded (regardless of their open/close current status). This is done in order to decrease the total amount of statistics presented and calculated for the common out of box use (no QoS). mlx5e_channel_stats is a compound of CH,RQ,SQs stats in order to create locality for the NAPI when handling TX and RX of the same channel. Align the new statistics struct per ring to avoid several channels update to the same cache line at the same time. Packet rate was tested, no degradation sensed. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> CC: Qing Huang <qing.huang@oracle.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
3a2f7033 |
|
03-Apr-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5: Use order-0 allocations for all WQ types Complete the transition of all WQ types to use fragmented order-0 coherent memory instead of high-order allocations. CQ-WQ already uses order-0. Here we do the same for cyclic and linked-list WQs. This allows the driver to load cleanly on systems with a highly fragmented coherent memory. Performance tests: ConnectX-5 100Gbps, CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz Packet rate of 64B packets, single transmit ring, size 8K. No degradation is sensed. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
043dc78e |
|
21-Mar-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: TX, Use actual WQE size for SQ edge fill We fill the SQ edge with NOPs to avoid WQE wraps. Here, instead of doing that in advance for the maximum possible WQE size, we do it on-demand using the actual WQE size. We re-order some parts in mlx5e_sq_xmit to finish the calculation of the WQE size (ds_cnt) before doing any writes to the WQE buffer. When the SQ work queue is fragmented (introduced in a downstream patch), dealing with WQE wraps becomes more frequent. This change would drastically reduce the overhead in this case. Performance tests: ConnectX-5 100Gbps, CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz Packet rate of 64B packets, single transmit ring, size 8K. Before: 14.9 Mpps After: 15.8 Mpps Improvement of 6%. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
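A model of on-demand edge filling: once the WQE's actual size in basic blocks is known, NOPs are posted only if the WQE would wrap past the end of the queue. Names and sizes are illustrative, not the driver's:

    #include <stdint.h>

    #define SQ_WQEBBS 1024u                     /* queue size, power of two */

    static void post_nop(uint32_t *pc) { (*pc)++; }  /* one-WQEBB NOP */

    /* Pad to the edge only when the WQE, whose true size in basic
     * blocks (derived from ds_cnt) is now known, would otherwise wrap.
     */
    static uint32_t fill_sq_edge(uint32_t *pc, uint32_t num_wqebbs)
    {
        for (;;) {
            uint32_t contig = SQ_WQEBBS - (*pc & (SQ_WQEBBS - 1));

            if (contig >= num_wqebbs)           /* fits before the edge */
                break;
            post_nop(pc);                       /* burn one slot */
        }
        return *pc & (SQ_WQEBBS - 1);           /* index where the WQE starts */
    }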
#
ddf385e3 |
|
02-May-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Use WQ API functions instead of direct fields access Use the WQ API to get the WQ size, and to map a counter into a WQ entry index. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
902a5459 |
|
01-May-2018 |
Eran Ben Elisha <eranbe@mellanox.com> |
net/mlx5e: When RXFCS is set, add FCS data into checksum calculation When the RXFCS feature is enabled, the HW does not strip the FCS data, yet it is not present in the checksum calculated by the HW. Fix that by manually calculating the FCS checksum and adding it to the SKB checksum field. Add a helper function to find the FCS data for all SKB forms (linear, one fragment or more). Fixes: 102722fc6832 ("net/mlx5e: Add support for RXFCS feature flag") Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
0e5c04f6 |
|
22-Jan-2018 |
Gal Pressman <galp@mellanox.com> |
net/mlx5e: Remove MLX5E_TEST_BIT macro MLX5E_TEST_BIT macro is the same as the already existent test_bit, remove it and replace all usages. Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
efb6d7a2 |
|
23-Nov-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Use bool as return type for mlx5e_xdp_handle Function returns boolean values, use bool instead of int. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
bd206fd5 |
|
23-Nov-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Use u8 instead of int for LRO number of segments Range of LRO number of segments fits in u8. Also, bring initialization and declaration together to save code. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
03993094 |
|
17-Apr-2018 |
Jesper Dangaard Brouer <brouer@redhat.com> |
xdp: transition into using xdp_frame for return API Changing the API xdp_return_frame() to take struct xdp_frame as argument seems like a natural choice. But there are some subtle performance details here that need extra care, which is a deliberate choice. De-referencing the xdp_frame on a remote CPU during DMA-TX completion results in the cache-line changing to "Shared" state. Later, when the page is reused for RX, this xdp_frame cache-line is written, which changes the state to "Modified". This situation already happens (naturally) for virtio_net, tun and cpumap, as the xdp_frame pointer is the queued object. In tun and cpumap, the ptr_ring is used for efficiently transferring cache-lines (with pointers) between CPUs. Thus, the only option is to de-reference the xdp_frame. It is only the ixgbe driver that had an optimization, in which it could avoid doing the de-reference of the xdp_frame. The driver already has a TX-ring queue, which (in case of remote DMA-TX completion) has to be transferred between CPUs anyhow. In this data area, we stored a struct xdp_mem_info and a data pointer, which allowed us to avoid de-referencing the xdp_frame. To compensate for this, a prefetchw is used for telling the cache coherency protocol about our access pattern. My benchmarks show that this prefetchw is enough to compensate the ixgbe driver. V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT") V8: Adjust for commit bd658dda4237 ("net/mlx5e: Separate dma base address and offset in dma_sync call") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
60bbf7ee |
|
17-Apr-2018 |
Jesper Dangaard Brouer <brouer@redhat.com> |
mlx5: use page_pool for xdp_return_frame call This patch shows how it is possible to have both the driver-local page cache, which uses an elevated refcnt for "catching"/avoiding SKB put_page from returning the page through the page allocator, and at the same time have pages returned to the page_pool from ndo_xdp_xmit DMA completion. The performance improvement for XDP_REDIRECT in this patch is really good. Especially considering that (currently) the xdp_return_frame API and page_pool_put_page() do per-frame operations of both rhashtable ID-lookup and locked return into the (page_pool) ptr_ring. (It is the plan to remove these per-frame operations in a followup patchset.) The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe, with xdp_redirect_map (using devmap). The target/maximum capability of ixgbe is 13Mpps (on this HW setup). Before this patch for mlx5, XDP redirected frames were returned via the page allocator. The single flow performance was 6Mpps, and if I started two flows the collective performance dropped to 4Mpps, because we hit the page allocator lock (further negative scaling occurs). Two test scenarios need to be covered for the xdp_return_frame API: DMA-TX completion running on same-CPU or cross-CPU free/return. Results were same-CPU=10Mpps and cross-CPU=12Mpps. This is very close to our 13Mpps max target. The reason the max target isn't reached in the cross-CPU test is likely due to RX-ring DMA unmap/map overhead (which doesn't occur in ixgbe to ixgbe testing). It is also planned to remove this unnecessary DMA unmap in a later patchset. V2: Adjustments requested by Tariq - Changed page_pool_create return codes not to return NULL, only ERR_PTR, as this simplifies err handling in drivers. - Save a branch in mlx5e_page_release - Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ V5: Updated patch desc V8: Adjust for b0cedc844c00 ("net/mlx5e: Remove rq_headroom field from params") V9: - Adjust for 121e89275471 ("net/mlx5e: Refactor RQ XDP_TX indication") - Adjust for 73281b78a37a ("net/mlx5e: Derive Striding RQ size from MTU") - Correct handling if page_pool_create fails for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ V10: Req from Tariq - Change pool_size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5168d732 |
|
17-Apr-2018 |
Jesper Dangaard Brouer <brouer@redhat.com> |
mlx5: basic XDP_REDIRECT forward support This implements basic XDP redirect support in the mlx5 driver. Notice that ndo_xdp_xmit() is NOT implemented, because that API needs some changes that this patchset is working towards. The main purpose of this patch is to have different drivers doing XDP_REDIRECT, to show how different memory models behave in a cross-driver world. Update(pre-RFCv2 Tariq): Need to DMA unmap the page before xdp_do_redirect, as the return API needed to keep it mapped does not exist yet. Update(pre-RFCv3 Saeed): Don't mix XDP_TX and XDP_REDIRECT flushing, introduce an xdpsq.db.redirect_flush boolean. V9: Adjust for commit 121e89275471 ("net/mlx5e: Refactor RQ XDP_TX indication") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ab966d7e |
|
04-Jan-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: RX, Recycle buffer of UMR WQEs Upon a new UMR post, check if the WQE buffer contains a previous UMR WQE. If so, modify the dynamic fields instead of a whole WQE overwrite. This saves a memcpy. In current setting, after 2 WQ cycles (12 UMR posts), this will always be the case. No degradation sensed. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
b8a98a4c |
|
20-Dec-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Keep single pre-initialized UMR WQE per RQ All UMR WQEs of an RQ share many common fields. We use pre-initialized structures to save calculations in datapath. One field (xlt_offset) was the only reason we saved a pre-initialized copy per WQE index. Here we remove its initialization (move its calculation to datapath), and reduce the number of copies to one-per-RQ. A very small datapath calculation is added, it occurs once per a MPWQE (i.e. once every 256KB), but reduces memory consumption and gives better cache utilization. Performance testing: Tested packet rate, no degradation sensed. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
9f9e9cd5 |
|
28-Nov-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Remove page_ref bulking in Striding RQ When many packets reside on the same page, the bulking of page_ref modifications reduces the total number of atomic operations executed. Besides the necessary 2 operations on page alloc/free, we have the following extra ops per page: - one on WQE allocation (bump refcnt to maximum possible), - zero ops for SKBs, - one on WQE free, a constant of two operations in total, no matter how many packets/SKBs actually populate the page. Without this bulking, we have: - no ops on WQE allocation or free, - one op per SKB, Comparing the two methods when PAGE_SIZE is 4K: - As mentioned above, bulking method always executes 2 operations, not more, but not less. - In the default MTU configuration (1500, stride size is 2K), the non-bulking method execute 2 ops as well. - For larger MTUs with stride size of 4K, non-bulking method executes only a single op. - For XDP (stride size of 4K, no SKBs), non-bulking method executes no ops at all! Hence, to optimize the flows with linear SKB and XDP over Striding RQ, we here remove the page_ref bulking method. Performance testing: ConnectX-5, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz. Single core packet rate (64 bytes). Early drop in TC: no degradation. XDP_DROP: before: 14,270,188 pps after: 20,503,603 pps, 43% improvement. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
22f45398 |
|
07-Feb-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Support XDP over Striding RQ Add XDP support over Striding RQ. Now that linear SKB is supported over Striding RQ, we can support XDP by setting stride size to PAGE_SIZE and headroom to XDP_PACKET_HEADROOM. Upon an MPWQE free, do not release pages that are being XDP-transmitted; they will be released upon completion. Striding RQ is capable of a higher packet-rate than conventional RQ. A performance gain is expected for all cases that had a HW packet-rate bottleneck. This is the case whenever using many flows that distribute to many cores. Performance testing: ConnectX-5, 24 rings, default MTU. CQE compression ON (to reduce completions BW in PCI). XDP_DROP packet rate:

    --------------------------------------------------
    | pkt size | XDP rate   | 100GbE linerate | pct% |
    --------------------------------------------------
    | 64byte   | 126.2 Mpps | 148.0 Mpps      | 85%  |
    | 128byte  | 80.0 Mpps  | 84.8 Mpps       | 94%  |
    | 256byte  | 42.7 Mpps  | 42.7 Mpps       | 100% |
    | 512byte  | 23.4 Mpps  | 23.4 Mpps       | 100% |
    --------------------------------------------------

Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
121e8927 |
|
12-Dec-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Refactor RQ XDP_TX indication Make the xdp_xmit indication available for Striding RQ by taking it out of the type-specific union. This refactor is a preparation for a downstream patch that adds XDP support over Striding RQ. In addition, use a bitmap instead of a boolean for possible future flags. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
619a8f2a |
|
07-Feb-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Use linear SKB in Striding RQ The current Striding RQ HW feature utilizes the RX buffers so that there is no wasted room between the strides. This maximises the memory utilization. This prevents the use of build_skb() (which requires headroom and tailroom), and demands a memcpy of the packet headers into the skb linear part. In this patch, whenever a set of conditions holds, we apply an RQ configuration that allows combining the use of linear SKB on top of a Striding RQ. To use build_skb() with Striding RQ, the following must hold: 1. packet does not cross a page boundary. 2. there is enough headroom and tailroom surrounding the packet. We can satisfy 1 and 2 by configuring: stride size = MTU + headroom + tailroom. This is possible only when: a. (MTU + headroom + tailroom) does not exceed PAGE_SIZE. b. HW LRO is turned off. Using linear SKB has many advantages: - Saves a memcpy of the headers. - No page-boundary checks in datapath. - No filler CQEs. - Significantly smaller CQ. - SKB data continuously resides in linear part, and is not split into a small amount (linear part) and a large amount (fragment). This saves datapath cycles in the driver and improves utilization of SKB fragments in GRO. - The fragments of a resulting GRO SKB follow the IP forwarding assumption of equal-size fragments. Some implementation details: HW writes the packets to the beginning of a stride, i.e. does not keep headroom. To overcome this we make sure we can extend backwards and use the last bytes of stride i-1. Extra care is needed for stride 0 as it has no preceding stride. We make sure headroom bytes are available by shifting the buffer pointer passed to HW by headroom bytes. This configuration now becomes the default, whenever capable. Of course, this implies turning LRO off. Performance testing: ConnectX-5, single core, single RX ring, default MTU. UDP packet rate, early drop in TC layer: -------------------------------------------- | pkt size | before | after | ratio | -------------------------------------------- | 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x | | 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x | | 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x | -------------------------------------------- TCP streams: ~20% gain Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
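A sketch of the eligibility test implied by conditions (a) and (b) above, with hypothetical function names; the real driver also consults device capabilities before switching modes.

#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

static size_t linear_stride_size(size_t mtu, size_t headroom, size_t tailroom)
{
    return mtu + headroom + tailroom;      /* one packet plus SKB room */
}

static bool striding_rq_linear_ok(size_t mtu, size_t headroom,
                                  size_t tailroom, bool hw_lro_enabled)
{
    /* (a) the whole stride must fit in one page; (b) LRO must be off */
    return !hw_lro_enabled &&
           linear_stride_size(mtu, headroom, tailroom) <= PAGE_SIZE;
}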
#
ea3886ca |
|
09-Jul-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Use inline MTTs in UMR WQEs When modifying the page mapping of a HW memory region (via a UMR post), post the new values inlined in WQE, instead of using a data pointer. This is a micro-optimization, inline UMR WQEs of different rings scale better in HW. In addition, this obsoletes a few control flows and helps delete ~50 LOC. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
e4d86a4a |
|
07-Jan-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Do not busy-wait for UMR completion in Striding RQ Do not busy-wait a pending UMR completion. Under high HW load, busy-waiting a delayed completion would fully utilize the CPU core and mistakenly indicate a SW bottleneck. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
18187fb2 |
|
17-Jul-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Code movements in RX UMR WQE post Gather the process of a UMR WQE post into one function, in preparation for a downstream patch that inlines the WQE data. No functional change here. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
472a1e44 |
|
12-Mar-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Save MTU in channels params Knowing the MTU is required for the RQ creation flow. By our design, the channels creation flow is totally isolated from priv/netdev, and can be completed with access to channels params and mdev. Adding the MTU to the channels params helps preserve that. In addition, we save it in the RQ to make its access faster in datapath checks. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
24fd07ab |
|
15-Feb-2018 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Use no-offset function in skb header copy In copying skb header to skb->data, replace the call to skb_copy_to_linear_data_offset() with a zero offset with the call to the no-offset function skb_copy_to_linear_data(). Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
bd658dda |
|
27-Nov-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Separate dma base address and offset in dma_sync call Pass the base dma address and offset to dma_sync_single_range_for_cpu(), instead of doing the pre-calculation. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
8babd44d |
|
19-Dec-2017 |
Gal Pressman <galp@mellanox.com> |
net/mlx5e: Fix TCP checksum in LRO buffers When receiving an LRO packet, the checksum field is set by the hardware to the checksum of the first coalesced packet. Obviously, this checksum is not valid for the merged LRO packet and should be fixed. We can use the CQE checksum which covers the checksum of the entire merged packet TCP payload to help us calculate the checksum incrementally. Tested by sending IPv4/6 traffic with LRO enabled, RX checksum disabled and watching nstat checksum error counters (in addition to the obvious bandwidth drop caused by checksum errors). This bug is usually "hidden" since LRO packets would go through the CHECKSUM_UNNECESSARY flow which does not validate the packet checksum. It's important to note that previous to this patch, LRO packets provided with CHECKSUM_UNNECESSARY are indeed packets with a correct validated checksum (even though the checksum inside the TCP header is incorrect), since the hardware LRO aggregation is terminated upon receiving a packet with bad checksum. Fixes: e586b3b0baee ("net/mlx5: Ethernet Datapath files") Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
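A hedged illustration of the flavor of arithmetic involved, not the driver's exact computation: one's-complement partial sums (e.g. the pseudo-header plus a payload sum derived from the CQE checksum) are combined and folded into the 16-bit value written back into the TCP header of the merged packet.

#include <stdint.h>

/* Fold a wide one's-complement accumulator into a 16-bit checksum. */
static uint16_t csum_fold(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Combine partial one's-complement sums into the final TCP checksum. */
uint16_t lro_tcp_csum(uint32_t pseudo_hdr_sum, uint32_t payload_sum_from_cqe)
{
    return csum_fold((uint64_t)pseudo_hdr_sum + payload_sum_from_cqe);
}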
#
388ca8be |
|
02-Jan-2018 |
Yonatan Cohen <yonatanc@mellanox.com> |
IB/mlx5: Implement fragmented completion queue (CQ) The current implementation of create CQ requires contiguous memory; such a requirement is problematic once memory is fragmented or the system is low on memory, causing failures in dma_zalloc_coherent(). This patch implements a new scheme of fragmented CQ to overcome this issue by introducing a new type, 'struct mlx5_frag_buf_ctrl', to allocate fragmented buffers rather than contiguous ones. Base the Completion Queues (CQs) on this new fragmented buffer. It fixes the following crashes: kworker/29:0: page allocation failure: order:6, mode:0x80d0 CPU: 29 PID: 8374 Comm: kworker/29:0 Tainted: G OE 3.10.0 Workqueue: ib_cm cm_work_handler [ib_cm] Call Trace: [<>] dump_stack+0x19/0x1b [<>] warn_alloc_failed+0x110/0x180 [<>] __alloc_pages_slowpath+0x6b7/0x725 [<>] __alloc_pages_nodemask+0x405/0x420 [<>] dma_generic_alloc_coherent+0x8f/0x140 [<>] x86_swiotlb_alloc_coherent+0x21/0x50 [<>] mlx5_dma_zalloc_coherent_node+0xad/0x110 [mlx5_core] [<>] ? mlx5_db_alloc_node+0x69/0x1b0 [mlx5_core] [<>] mlx5_buf_alloc_node+0x3e/0xa0 [mlx5_core] [<>] mlx5_buf_alloc+0x14/0x20 [mlx5_core] [<>] create_cq_kernel+0x90/0x1f0 [mlx5_ib] [<>] mlx5_ib_create_cq+0x3b0/0x4e0 [mlx5_ib] Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
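A sketch of the fragmented-buffer indexing such a scheme enables, with simplified types (the real 'struct mlx5_frag_buf_ctrl' carries more bookkeeping): the buffer becomes an array of page-sized fragments, and an element index is translated into a fragment plus an offset, so no high-order allocation is needed.

#include <stddef.h>
#include <stdint.h>

#define FRAG_SZ 4096u                      /* one page per fragment */

struct frag_buf {
    void  **frags;                         /* FRAG_SZ-sized allocations */
    size_t  elem_sz;                       /* e.g. sizeof one CQE */
};

/* Translate a linear element index into frag + in-frag offset.
 * Assumes elem_sz divides FRAG_SZ so elements never straddle frags. */
void *frag_buf_elem(const struct frag_buf *buf, size_t idx)
{
    size_t off = idx * buf->elem_sz;
    return (uint8_t *)buf->frags[off / FRAG_SZ] + off % FRAG_SZ;
}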
#
63a612f9 |
|
09-Jan-2018 |
Gal Pressman <galp@mellanox.com> |
net/mlx5e: Add likely to the common RX checksum flow Most of the packets return true for is_last_ethertype_ip; surround it with the likely() compiler hint. Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
36e564b7 |
|
31-Oct-2017 |
Feras Daoud <ferasda@mellanox.com> |
net/mlx5e: IPoIB, Use correct timestamp in child receive flow The current implementation takes the child timestamp object from the parent since the rq in mlx5i_complete_rx_cqe belongs to the parent. This change fixes the issue by taking the correct timestamp. Fixes: 7e7f4780c340 ("net/mlx5e: IPoIB, Use hash-table to map between QPN to child netdev") Signed-off-by: Feras Daoud <ferasda@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
cd4a87df |
|
06-Jan-2018 |
Gal Pressman <galp@mellanox.com> |
net/mlx5e: Replace WARN_ONCE with netdev_WARN_ONCE Use the more appropriate netdev_WARN_ONCE instead of WARN_ONCE macro. Signed-off-by: Gal Pressman <galp@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0ddf5432 |
|
03-Jan-2018 |
Jesper Dangaard Brouer <brouer@redhat.com> |
xdp/mlx5: setup xdp_rxq_info The mlx5 driver has a special drop-RQ queue (one per interface) that simply drops all incoming traffic. It helps the driver keep other HW objects (flow steering) alive upon down/up operations. It is temporarily pointed to by flow steering objects during the interface setup, and when the interface is down. It lacks many fields that are set in a regular RQ (for example, its state is never switched to MLX5_RQC_STATE_RDY). (Thanks to Tariq Toukan for the explanation). The XDP RX-queue info for this drop-RQ is marked as unused, which allows us to use the same takedown/free code path as other RX-queues. Driver hook points for xdp_rxq_info: * reg : mlx5e_alloc_rq() * unused: mlx5e_alloc_drop_rq() * unreg : mlx5e_free_rq() Tested on actual hardware with samples/bpf program Cc: Saeed Mahameed <saeedm@mellanox.com> Cc: Matan Barak <matanb@mellanox.com> Cc: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
2e50b261 |
|
15-Oct-2017 |
Inbar Karmy <inbark@mellanox.com> |
net/mlx5e: Set page to null in case dma mapping fails Currently, when dma mapping fails, put_page is called, but the page is not set to null. Later, in the page_reuse treatment in mlx5e_free_rx_descs(), mlx5e_page_release() is called for the second time, improperly doing dma_unmap (for a non-mapped address) and an extra put_page. Prevent this by nullifying the page pointer when dma_map fails. Fixes: accd58833237 ("net/mlx5e: Introduce RX Page-Reuse") Signed-off-by: Inbar Karmy <inbark@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Cc: kernel-team@fb.com Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
f938daee |
|
30-Aug-2017 |
Gal Pressman <galp@mellanox.com> |
net/mlx5e: CHECKSUM_COMPLETE offload for VLAN/QinQ packets When the VLAN tag is present in the packet buffer (i.e VLAN stripping disabled, QinQ) the driver will currently report CHECKSUM_UNNECESSARY. Instead of using CHECKSUM_COMPLETE offload for packets with first ethertype of IPv4/6, use it for packets with last ethertype of IPv4/6 to cover the former cases as well. The checksum field present in the CQE is calculated from the IP header until the end of the packet. When the first ethertype is different than IPv4/6 (for ex. 802.1Q VLAN) a checksum of the VLAN header/s should be added. The small header/s checksum calculation will allow us to use CHECKSUM_COMPLETE instead of CHECKSUM_UNNECESSARY. Testing bandwidth of one and 8 TCP streams to a single RQ, LRO and VLAN stripping offloads disabled: CPU: Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz NIC: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] Before: +--------------+--------------------+---------------------+----------------------+ | Traffic type | 1 Stream BW [Mbps] | 8 Streams BW [Mbps] | Checksum offload | +--------------+--------------------+---------------------+----------------------+ | Untagged | 28,247.35 | 24,716.88 | CHECKSUM_COMPLETE | | VLAN | 27,516.69 | 23,752.26 | CHECKSUM_UNNECESSARY | | QinQ | 6,961.30 | 20,667.04 | CHECKSUM_UNNECESSARY | +--------------+--------------------+---------------------+----------------------+ Now: +--------------+--------------------+---------------------+-------------------+ | Traffic type | 1 Stream BW [Mbps] | 8 Streams BW [Mbps] | Checksum offload | +--------------+--------------------+---------------------+-------------------+ | Untagged | 28,521.28 | 24,926.32 | CHECKSUM_COMPLETE | | VLAN | 27,389.37 | 23,715.34 | CHECKSUM_COMPLETE | | QinQ | 6,901.77 | 20,845.73 | CHECKSUM_COMPLETE | +--------------+--------------------+---------------------+-------------------+ No performance degradation observed. Signed-off-by: Gal Pressman <galp@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
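A hedged sketch of the header fix-up described above: since the CQE checksum starts at the IP header, the one's-complement sum of the still-present VLAN header bytes is added on top before reporting CHECKSUM_COMPLETE. Helper names are illustrative, not the driver's.

#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of a small, even-length header region. */
static uint32_t csum_partial16(const uint8_t *p, size_t len, uint32_t sum)
{
    for (size_t i = 0; i + 1 < len; i += 2) {
        sum += (uint32_t)p[i] << 8 | p[i + 1];
        sum = (sum & 0xffff) + (sum >> 16);    /* fold carries as we go */
    }
    return sum;
}

/* Extend the CQE checksum over n VLAN tags (4 bytes each: TPID + TCI). */
uint32_t csum_with_vlan(uint32_t cqe_csum, const uint8_t *vlan_hdrs,
                        unsigned int n_tags)
{
    return csum_partial16(vlan_hdrs, 4u * n_tags, cqe_csum);
}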
#
f24686e8 |
|
10-Sep-2017 |
Gal Pressman <galp@mellanox.com> |
net/mlx5e: Add VLAN offloads statistics The following counters are now exposed through ethtool -S: rx[i]_removed_vlan_packets (per channel) rx_removed_vlan_packets tx[i]_added_vlan_packets (per channel) tx_added_vlan_packets rx_removed_vlan_packets: The number of packets that had their outer VLAN header stripped to the CQE by the hardware. tx_added_vlan_packets: The number of packets that had their outer VLAN header inserted by the hardware. Signed-off-by: Gal Pressman <galp@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
7e7f4780 |
|
14-Sep-2017 |
Alex Vesker <valex@mellanox.com> |
net/mlx5e: IPoIB, Use hash-table to map between QPN to child netdev This change is needed for PKEY support, since the RQs are shared between the child interface and the parent. The parent is responsible for NAPI and the processing of RX completions. Using the dqpn in the completion descriptor, we set the corresponding child IPoIB netdevice on the SKB. The mapping between the dqpn and the netdevice is done using a hash table; each mlx5 IPoIB interface registers its mapping on creation. Signed-off-by: Alex Vesker <valex@mellanox.com> Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
|
#
7c39afb3 |
|
15-Aug-2017 |
Feras Daoud <ferasda@mellanox.com> |
net/mlx5: PTP code migration to driver core section PTP code is moved to core section of mlx5 driver in order to share it between ethernet and infiniband. This movement involves the following changes: - Change mlx5e_ prefix to be mlx5_ - Add clock structs to Core - Add clock object to mlx5_core_dev - Call Init/Uninit clock from core init/cleanup - Rename mlx5e_tstamp to be mlx5_clock Signed-off-by: Feras Daoud <ferasda@mellanox.com> Signed-off-by: Eitan Rabin <rabin@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
603e1f5b |
|
13-Sep-2017 |
Gal Pressman <galp@mellanox.com> |
net/mlx5e: Fix calculated checksum offloads counters Instead of calculating the offloads counters, count them explicitly. The calculations done for these counters would result in bugs in some cases, for example: When running TCP traffic over a VXLAN tunnel with TSO enabled the following counters would increase: tx_csum_partial: 1,333,284 tx_csum_partial_inner: 29,286 tx4_csum_partial_inner: 384 tx7_csum_partial_inner: 8 tx9_csum_partial_inner: 34 tx10_csum_partial_inner: 26,807 tx11_csum_partial_inner: 287 tx12_csum_partial_inner: 27 tx16_csum_partial_inner: 6 tx25_csum_partial_inner: 1,733 Seems like tx_csum_partial increased out of nowhere. The issue is in the following calculation in mlx5e_update_sw_counters: s->tx_csum_partial = s->tx_packets - tx_offload_none - s->tx_csum_partial_inner; While tx_packets increases by the number of GSO segments for each SKB, tx_csum_partial_inner will only increase by one, resulting in wrong tx_csum_partial counter. Fixes: bfe6d8d1d433 ("net/mlx5e: Reorganize ethtool statistics") Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
de8f3a83 |
|
24-Sep-2017 |
Daniel Borkmann <daniel@iogearbox.net> |
bpf: add meta pointer for direct access This work enables generic transfer of metadata from XDP into skb. The basic idea is that we can make use of the fact that the resulting skb must be linear and already comes with a larger headroom for supporting bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work on a similar principle and introduce a small helper bpf_xdp_adjust_meta() for adjusting a new pointer called xdp->data_meta. Thus, the packet has a flexible and programmable room for meta data, followed by the actual packet data. struct xdp_buff is therefore laid out so that we first point to data_hard_start, then data_meta directly prepended to data followed by data_end marking the end of packet. bpf_xdp_adjust_head() takes into account whether we have meta data already prepended and if so, memmove()s this along with the given offset provided there's enough room. xdp->data_meta is optional and programs are not required to use it. The rationale is that when we process the packet in XDP (e.g. as DoS filter), we can push further meta data along with it for the XDP_PASS case, and give the guarantee that a clsact ingress BPF program on the same device can pick this up for further post-processing. Since we work with skb there, we can also set skb->mark, skb->priority or other skb meta data out of BPF, thus having this scratch space generic and programmable allows for more flexibility than defining a direct 1:1 transfer of potentially new XDP members into skb (it's also more efficient as we don't need to initialize/handle each of such new members). The facility also works together with GRO aggregation. The scratch space at the head of the packet can be a multiple of 4 bytes, up to 32 bytes large. Drivers not yet supporting xdp->data_meta can simply be set up with xdp->data_meta as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out, such that the subsequent match against xdp->data for later access is guaranteed to fail. The verifier treats xdp->data_meta/xdp->data the same way as we treat xdp->data/xdp->data_end pointer comparisons. The requirement for doing the compare against xdp->data is that it hasn't been modified from its original address we got from ctx access. It may have a range marking already from prior successful xdp->data/xdp->data_end pointer comparisons though. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
70871f1e |
|
13-Jul-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Don't recycle page if moved to far NUMA Avoid recycling an RX page if it moved to another NUMA node. Add an ethtool counter to count such events. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
3b56f7b2 |
|
17-Jul-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Remove unnecessary fields in ICO SQ As of the current design, in each NAPI, only a single UMR WQE completion could be available in the completion queue of the internal control operations (ICO) send queue, in addition to nop operations that require no actions upon completion. This renders the consume index obsolete, as the wqe_counter field in the CQE is sufficient. This helps removing a memory barrier, and obsoletes the need for tracking num_wqebbs to update the consumer counter. In addition, remove other unused fields in the icosq struct: pdev, dma_fifo_pc, and prev_cc. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
7cc6d77b |
|
16-Jul-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Type-specific optimizations for RX post WQEs function Separate the RX post WQEs function of the different RQ types. This enables RQ type-specific optimizations in data-path. Poll the ICOSQ completion queue only for Striding RQ, and only when a UMR post completion could be possibly available. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
a071cb9f |
|
03-Jul-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Non-atomic RQ state indicator for UMR WQE in progress The indication for a UMR WQE in progress is needed only within the NAPI context, hence no races are possible and there is no need for atomic operations. The only place the flag is read outside of NAPI context is in the closure flow; after the RQ is disabled, the flag is no longer accessed in NAPI. Use a boolean instead of a bit in the ring state, so that its non-atomic set operations do not race with the atomic sets of the other bits. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
a1eaba4c |
|
03-Jul-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Non-atomic indicator for ring enabled state The rings enabled state change occurs in the control path only, and is always followed by a napi_synchronize(), so that following NAPIs read the new value. This read does not need to be atomic. The RQ auto-moderation bit is not set/cleared in the data-path. No need for an atomic read, a regular read operation is sufficient. At RQ creation time as well, there are no multiple threads trying to access it yet, hence a regular read can be used. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
604acb19 |
|
05-Jun-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Refactor data-path lro header function Refactor function mlx5e_lro_update_hdr() to reduce number of branches. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
4b7dfc99 |
|
19-Jun-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Early-return on empty completion queues The NAPI context handles different kinds of completion queues (RX, TX, and others). Hence, upon a poll trial, some of them might be empty. Here we early-return upon empty completion queues, as well as upon a full RX buffer, saving unnecessary logic and memory barriers. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
4cbb7558 |
|
19-Jun-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: NAPI busy-poll when UMR post is in progress If a UMR post is in progress, it means that there's a missing WQE in RQ, and that a completion will be shortly available in ICO SQ completion queue. Prefer busy-poll to handle it as soon as possible. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
4c2af5cc |
|
25-Jun-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Small enhancements for RX MPWQE allocation and free The dma offset of an MPWQE (Multi-Packet WQE) in the memory region is fixed for all rounds. Calculate it once at creation time, instead of at runtime. This also obsoletes the wqe argument in the function. In addition, optimize the dma_info iterator calculation. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
9bafe2ad |
|
25-Jun-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Use memset to init skbs_frags array to zeros In RX data-path, use memset() instead of loop assignment to init the whole skbs_frags array. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
89e89f7a |
|
02-Jul-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Replace multiplication by stride size with a shift In RX data-path, use shift operations instead of a regular multiplication by stride size, as it is a power of two. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
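The micro-optimization in isolation, assuming the log of the stride size is precomputed once at RQ creation:

#include <stdint.h>

/* stride_idx * stride_size with a power-of-two stride becomes a shift;
 * log_stride_sz would be computed once when the RQ is created. */
static inline uint32_t stride_to_offset(uint32_t stride_idx,
                                        uint8_t log_stride_sz)
{
    return stride_idx << log_stride_sz;
}

For a 2K stride (log_stride_sz = 11), this turns a per-packet multiply into a single shift in the RX hot path.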
#
b45d8b50 |
|
13-Feb-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Reorganize struct mlx5e_rq Bring fast-path fields together, and combine RX WQE mutually exclusive fields into a union. Page-reuse and XDP are mutually exclusive and cannot be used at the same time. Use a union to combine their footprints. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
0556ce72 |
|
16-Aug-2017 |
Eran Ben Elisha <eranbe@mellanox.com> |
net/mlx5e: Fix dangling page pointer on DMA mapping error Function mlx5e_dealloc_rx_wqe is using page pointer value as an indication to valid DMA mapping. In case that the mapping failed, we released the page but kept the dangling pointer. Store the page pointer only after the DMA mapping passed to avoid invalid page DMA unmap. Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE") Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
1afdb771 |
|
11-Jul-2017 |
Or Gerlitz <ogerlitz@mellanox.com> |
net/mlx5e: Place constants on the right side of comparisons To fix these checkpatch complaints: WARNING: Comparisons should place the constant on the right side of the test Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
e80541ec |
|
05-Jun-2017 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5: Add CONFIG_MLX5_ESWITCH Kconfig Allow selectively building the driver with or without sriov eswitch, VF representors and TC offloads. Also remove the need for two ndo ops structures (sriov & basic) and keep only one unified ndo ops; compile out VF SRIOV ndos when not needed (MLX5_ESWITCH=n), in which case calling those ndos on a VF netdev will return -EPERM. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Cc: Jes Sorensen <jsorensen@fb.com> Cc: kernel-team@fb.com
|
#
899a59d3 |
|
19-Jun-2017 |
Ilan Tayari <ilant@mellanox.com> |
net/mlx5e: IPSec, Add Innova IPSec offload RX data path In RX data path, the hardware prepends a special metadata ethertype which indicates that the packet underwent decryption, and the result of the authentication check. Communicate this to the stack in skb->sp. Make wqe_size large enough to account for the injected metadata. Support only Linked-list RQ type. IPSec offload RX packets may have useful CHECKSUM_COMPLETE information, which the stack may not be able to use yet. Signed-off-by: Ilan Tayari <ilant@mellanox.com> Signed-off-by: Yossi Kuperman <yossiku@mellanox.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com> Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
4b673793 |
|
18-Jun-2017 |
Ilan Tayari <ilant@mellanox.com> |
net/mlx5: Make get_cqe routine not ethernet-specific Move mlx5e_get_cqe routine to wq.h and rename it to mlx5_cqwq_get_cqe. This allows it to be used by other CQ users outside of the ethernet driver code. A later patch in this patchset will make use of it from FPGA code for the FPGA high-speed connection. Signed-off-by: Ilan Tayari <ilant@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
accd5883 |
|
29-Jan-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Introduce RX Page-Reuse Introduce a Page-Reuse mechanism in the non-Striding RQ RX datapath. A WQE (RX descriptor) buffer is a page that, in most cases, was fully wasted on a packet that is much smaller, requiring a new page for the next round. In this patch, we implement a page-reuse mechanism that resembles a `SW Striding RQ`. We allow the WQE to reuse its allocated page for as long as possible, until the page is fully consumed. In each round, the WQE is capable of receiving a packet of maximal size (MTU). Yet, upon the reception of a packet, the WQE knows the actual packet size, and consumes the exact amount of memory needed to build a linear SKB. Then, it updates the buffer pointer within the page accordingly, for the next round. The feature is mutually exclusive with XDP (packet-per-page) and LRO (session size is a power of two, needs unused page). Performance tests: iperf tcp tests show huge gain: -------------------------------------------- num streams | BW before | BW after | ratio | 1 | 22.2 | 30.9 | 1.39x | 8 | 64.2 | 93.6 | 1.46x | 64 | 56.7 | 91.4 | 1.61x | -------------------------------------------- Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
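A minimal sketch of the bookkeeping behind the mechanism, with hypothetical field names: consume exactly the room the packet needed, and keep the page for the next round only while a full MTU-sized frame still fits.

#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

struct rx_wqe_buf {
    void  *page;
    size_t offset;                     /* next free byte within the page */
};

/* Returns true when the page can be reused for the next packet. */
bool consume_and_check_reuse(struct rx_wqe_buf *b, size_t bytes_used,
                             size_t max_frame)
{
    b->offset += bytes_used;           /* exact room this linear SKB took */
    if (PAGE_SIZE - b->offset >= max_frame)
        return true;                   /* enough room left: reuse page */
    b->offset = 0;                     /* caller allocates a fresh page */
    return false;
}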
#
78aedd32 |
|
18-Jan-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Build SKB with exact frag_size Build the SKB over the receive packet instead of the whole page. Getting the SKB's linear data and shared_info closer improves locality. In addition, this opens up the possibility to make use of other parts of the page in the downstream page-reuse patch. Fixes: 1bfecfca565c ("net/mlx5e: Build RX SKB on demand") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
3844b07e |
|
01-Jun-2017 |
Feras Daoud <ferasda@mellanox.com> |
net/mlx5e: IPoIB, Add PTP support to IPoIB device driver Enable PTP for IPoIB rdma_netdev and add the ability to get the time stamping parameters using ethtool. Signed-off-by: Feras Daoud <ferasda@mellanox.com> Signed-off-by: Eitan Rabin <rabin@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
c139dbfd |
|
18-May-2017 |
Erez Shitrit <erezsh@mellanox.com> |
net/mlx5e: Use hard_mtu as part of the mlx5e_priv struct The extra MTU space kept for the HW is specific to each link type, and it is different in the mlx5e and mlx5i modules. It is now kept in the priv structures, set by the mlx5e/mlx5i driver accordingly. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
4301ba7b |
|
18-Jun-2017 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: IPoIB, Move to a separate directory IPoIB netdevice driver was only introduced in previous kernel release and it is growing in terms of features and LOC, move it to a separate directory. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
b57fe691 |
|
27-Apr-2017 |
Erez Shitrit <erezsh@mellanox.com> |
net/mlx5e: IPoIB, handle RX packet correctly An IPoIB packet contains the pseudo header area; we need to pull it prior to reset_mac_header in order to let GRO work well. In more detail: GRO checks the mac address of the newly arriving packet by comparing the hard_header_len-sized area of the current packet to the previous one in that session. Now, the driver prepares that area in the skb by allocating it from the reserved part and resetting the correct mac header to it. Fixes: 9d6bd752c63c ("net/mlx5e: IPoIB, RX handler") Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
ad78af9b |
|
15-Feb-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Use prefetchw when a write is to follow "prefetchw()" prefetches the cacheline for write. Use it for skb->data, as we will soon be copying the packet header there. Performance: Single-stream packet-rate tested with pktgen. Packets are dropped in tc level to zoom into driver data-path. Larger gain is expected for smaller packets, as less time is spent on handling SKB fragments, making the path shorter and the improvement more significant. --------------------------------------------- packet size | before | after | gain | 64B | 4,113,306 | 4,778,720 | 16% | 1024B | 3,633,819 | 3,950,593 | 8.7% | Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Cc: kernel-team@fb.com Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
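On GCC/Clang, prefetchw() boils down to the builtin below; a sketch of the pattern, issuing the write-hinted prefetch before the header copy lands in skb->data:

#include <stddef.h>
#include <string.h>

void copy_pkt_headers(void *skb_data, const void *pkt_hdr, size_t hdr_len)
{
    /* rw=1 requests the cacheline in exclusive (write) state */
    __builtin_prefetch(skb_data, 1, 3);

    /* ... other per-packet work can hide the prefetch latency ... */

    memcpy(skb_data, pkt_hdr, hdr_len);
}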
#
1d447a39 |
|
23-Apr-2017 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Extendable vport representor netdev private data Make representor netdev private data extendable by adding new struct "mlx5e_rep_priv" and use it as the rep netdev private data struct instead of directly pointing to mlx5_eswitch_rep. Added new en_rep.h header file to contain all representor related definitions and prototypes, and moved all representor specific logic into en_rep.c. Needed for downstream patches to extend representor functionality to support neighbour update. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
|
#
8bf3198a |
|
21-Apr-2017 |
Stephen Hemminger <stephen@networkplumber.org> |
mlx5: fix warning about missing prototype Fix sparse warning about missing prototypes. The rx/tx code path defines functions with prototypes in ipoib.h. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
9d6bd752 |
|
12-Apr-2017 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: IPoIB, RX handler Implement IPoIB RX SKB handler. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
be7e87f9 |
|
12-Feb-2017 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Fail safe cqe compressing/moderation mode setting Use the new fail-safe channels switch mechanism to set new CQE compressing and CQE moderation mode settings. We also move RX CQE compression modify function out of en_rx file to a more appropriate place. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
|
#
6a9764ef |
|
21-Dec-2016 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Isolate open_channels from priv->params In order to have a clean separation between channels resources creation flows and the current active mlx5e netdev parameters, make sure each resource creation function does not access priv->params, and works only on a fresh set of parameters. For this we add a new "mlx5e_params" field to the mlx5e_channels structure and use it down the road to mlx5e_open_{cq,rq,sq} and so on. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
|
#
31391048 |
|
24-Mar-2017 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Different SQ types Different SQ types (tx, xdp, ico) are growing apart, we separate them and remove unwanted parts in each one of them, to simplify data path and utilize data cache. Remove DB union from SQ structures since it is not needed anymore as we now have different SQ data type for each SQ. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
864b2d71 |
|
24-Mar-2017 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Generalize tx helper functions for different SQ types In the next patches we will introduce different SQ types; to that end, here we generalize some TX helper functions to work with more basic SQ parameters, in order to re-use them for the different SQ types. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2239185c |
|
24-Mar-2017 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Optimize XDP frame xmit XDP SQ has a fixed size WQE (MLX5E_XDP_TX_WQEBBS = 1) and only posts one kind of WQE (MLX5_OPCODE_SEND), Also we initialize SQ descriptors static fields once on open_xdpsq, rather than every time on critical path. Optimize the code in light of those facts and add a prefetch of the TX descriptor first thing in the xdp xmit function. Performance improvement: System: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz Test case Before Now improvement --------------------------------------------------------------- XDP TX (1 core) 13Mpps 13.7Mpps 5% Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
31871f87 |
|
24-Mar-2017 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Move XDP SQ instance into RQ To save many rq->channel->sq dereferences in fast-path. And rename it to xdpsq. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1c4bf940 |
|
24-Mar-2017 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Move XDP completion functions to rx file XDP code belongs to RX path, move mlx5e_poll_xdp_tx_cq and mlx5e_free_xdp_tx_descs to en_rx.c. Rename them to mlx5e_poll_xdpsq_cq and mlx5e_free_xdpsq_descs. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6982ab60 |
|
24-Mar-2017 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Xmit, no write combining mlx5e netdev Blue Flame (write combining) support demands a lot of overhead for a little latency gain for some special cases, this overhead is hurting the common case. Here we remove xmit Blue Flame support by creating all bfregs with no write combining for all SQs, and we remove a lot of BF logic and conditions from xmit data path. Simplify mlx5e_tx_notify_hw (doorbell function) by removing BF related code and by removing one memory barrier needed for WC mapped SQ doorbell buffers, which no longer exist. Performance improvement: System: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz Test case Before Now improvement --------------------------------------------------------------- TX packets (24 threads) 50Mpps 54Mpps 8% Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8ab7e2ae |
|
21-Mar-2017 |
Gal Pressman <galp@mellanox.com> |
net/mlx5e: Count LRO packets correctly RX packets statistics ('rx_packets' counter) used to count LRO packets as one, even though they contain multiple segments. This patch will increment the counter by the number of segments, and align the driver with the behavior of other drivers in the stack. Note that no information is lost in this patch due to the existence of the 'rx_lro_packets' counter. Before, ethtool showed: $ ethtool -S ens6 | egrep "rx_packets|rx_lro_packets" rx_packets: 435277 rx_lro_packets: 35847 rx_packets_phy: 1935066 Now, we will see the more logical statistics: $ ethtool -S ens6 | egrep "rx_packets|rx_lro_packets" rx_packets: 1935066 rx_lro_packets: 35847 rx_packets_phy: 1935066 Fixes: e586b3b0baee ("net/mlx5: Ethernet Datapath files") Signed-off-by: Gal Pressman <galp@mellanox.com> Cc: kernel-team@fb.com Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
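The counting change in miniature, with hypothetical names: an LRO completion credits rx_packets with its number of coalesced segments instead of 1.

struct sw_stats {
    unsigned long rx_packets;
    unsigned long rx_lro_packets;
};

void count_rx_cqe(struct sw_stats *s, unsigned int lro_num_seg)
{
    /* lro_num_seg <= 1 is assumed to mean a regular, non-coalesced packet */
    s->rx_packets += lro_num_seg ? lro_num_seg : 1;
    if (lro_num_seg > 1)
        s->rx_lro_packets++;
}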
#
36154be4 |
|
22-Feb-2017 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Fix wrong CQE decompression In cqe compression with striding RQ, the decompression of the CQE field wqe_counter was done with a wrong wraparound value. This caused handling cqes with a wrong pointer to the wqe (rx descriptor) and creating SKBs with wrong data, pointing to wrong (and already consumed) strides/pages. The CQE field wqe_counter in striding RQ holds the stride index instead of the WQE index. Hence, when decompressing a CQE, wqe_counter should have wrapped-around the number of strides in a single multi-packet WQE. We now drop this wrap-around mask entirely in CQE decompression of striding RQ. It is not needed, as in such cases the CQE compression session would break because of a different value of the wqe_id field, starting a new compression session. Tested: ethtool -K ethxx lro off/on ethtool --set-priv-flags ethxx rx_cqe_compress on super_netperf 16 {ipv4,ipv6} -t TCP_STREAM -m 50 -D verified no csum errors and no page refcount issues. Fixes: 7219ab34f184 ("net/mlx5e: CQE compression") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reported-by: Tom Herbert <tom@herbertland.com> Cc: kernel-team@fb.com Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6dc4b54e |
|
22-Feb-2017 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Update MPWQE stride size when modifying CQE compress state When the admin enables/disables cqe compression, updating mpwqe stride size is required: CQE compress ON ==> stride size = 256B CQE compress OFF ==> stride size = 64B This is already done on driver load via mlx5e_set_rq_type_params, all we need is just to call it on arbitrary admin changes of cqe compression state via priv flags or when changing timestamping state (as it is mutually exclusive with cqe compression). This bug introduces no functional damage, it only makes cqe compression occur less often, since in ConnectX4-LX CQE compression is performed only on packets smaller than stride size. Tested: ethtool --set-priv-flags ethxx rx_cqe_compress on pktgen with 64 < pkt size < 256 and netperf TCP_STREAM (IPv4/IPv6) verify `ethtool -S ethxx | grep compress` are advancing more often (rapidly) Fixes: 7219ab34f184 ("net/mlx5e: CQE compression") Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Cc: kernel-team@fb.com Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
18bcf742 |
|
22-Feb-2017 |
Mohamad Haj Yahia <mohamad@mellanox.com> |
net/mlx5e: s390 system compilation fix Add necessary headers include for s390 arch compilation. Fixes: e586b3b0baee ("net/mlx5: Ethernet Datapath files") Fixes: d605d6686dc7 ("net/mlx5e: Add support for ethtool self..") Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b70149dd |
|
06-Dec-2016 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: XDP Tx, no inline copy on ConnectX-5 ConnectX-5 and later HW generations will report min inline mode == MLX5_INLINE_MODE_NONE, which means driver is not required to copy packet headers to inline fields of TX WQE. Avoid copy to inline segment in XDP TX routine when HW inline mode doesn't require it. This will improve CPU utilization and boost XDP TX performance. Tested with xdp2 single flow: CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz HCA: Mellanox Technologies MT28800 Family [ConnectX-5 Ex] Before: 7.4Mpps After: 7.8Mpps Improvement: 5% Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
|
#
2b31f7ae |
|
28-Nov-2016 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5: TX WQE update Add new TX WQE fields for Connect-X5 vlan insertion support, type and vlan_tci, when type = MLX5_ETH_WQE_INSERT_VLAN the HW will insert the vlan and prio fields (vlan_tci) to the packet. Those bits and the inline header fields are mutually exclusive, and valid only when: MLX5_CAP_ETH(mdev, wqe_inline_mode) == MLX5_CAP_INLINE_MODE_NOT_REQUIRED and MLX5_CAP_ETH(mdev, wqe_vlan_insert), who will be set in ConnectX-5 and later HW generations. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
|
#
a67edbf4 |
|
24-Jan-2017 |
Daniel Borkmann <daniel@iogearbox.net> |
bpf: add initial bpf tracepoints This work adds a number of tracepoints to paths that are either considered slow-path or exception-like states, where monitoring or inspecting them would be desirable. For the bpf(2) syscall, tracepoints have been placed for main commands when they succeed. In the XDP case, the tracepoint is for exceptions, that is, f.e. on abnormal BPF program exit such as an unknown or XDP_ABORTED return code, or when an error occurs during the XDP_TX action and the packet could not be forwarded. Both have been split into separate event headers, and can be further extended. Worst case, if they unexpectedly should get in our way in the future, they can also be removed [1]. Of course, these tracepoints (like any other) can be analyzed by eBPF itself, etc. Example output: # ./perf record -a -e bpf:* sleep 10 # ./perf script sock_example 6197 [005] 283.980322: bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0 sock_example 6197 [005] 283.980721: bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5 sock_example 6197 [005] 283.988423: bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER sock_example 6197 [005] 283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00] [...] sock_example 6197 [005] 288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00] swapper 0 [005] 289.338243: bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER [1] https://lwn.net/Articles/705270/ Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5eb0249b |
|
10-Dec-2016 |
Shaker Daibes <shakerd@mellanox.com> |
net/mlx5e: CQE compression control code reuse This patch is intended for code reuse of mlx5e_modify_rx_cqe_compression function. Signed-off-by: Shaker Daibes <shakerd@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
e048fc50 |
|
19-Jan-2017 |
Eric Dumazet <edumazet@google.com> |
net/mlx5e: Do not recycle pages from emergency reserve A driver using dev_alloc_page() must not reuse a page allocated from emergency memory reserve. Otherwise all packets using this page will be immediately dropped, unless for very specific sockets having SOCK_MEMALLOC bit set. This issue might be hard to debug, because only a fraction of received packets would be dropped. Fixes: 4415a0319f92 ("net/mlx5e: Implement RX mapped page cache for page recycle") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
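The guard in sketch form; in the kernel the test is page_is_pfmemalloc(), here reduced to a flag on a simplified page struct:

#include <stdbool.h>

struct rx_page_info {
    void *page;
    bool  pfmemalloc;      /* served from the emergency reserve? */
};

/* Pages from the emergency reserve must not enter the recycle cache:
 * packets built on them would be dropped for normal sockets anyway. */
bool page_may_recycle(const struct rx_page_info *p)
{
    return !p->pfmemalloc;
}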
#
d8bec2b2 |
|
17-Jan-2017 |
Martin KaFai Lau <kafai@fb.com> |
net/mlx5e: Support bpf_xdp_adjust_head() This patch adds bpf_xdp_adjust_head() support to mlx5e. 1. rx_headroom is added to struct mlx5e_rq. It uses an existing 4 byte hole in the struct. 2. The adjusted data length is checked against MLX5E_XDP_MIN_INLINE and MLX5E_SW2HW_MTU(rq->netdev->mtu). 3. The macro MLX5E_SW2HW_MTU is moved from en_main.c to en.h. MLX5E_HW2SW_MTU is also moved to en.h for symmetric reason but it is not a must. v2: - Keep the xdp specific logic in mlx5e_xdp_handle() - Update dma_len after the sanity checks in mlx5e_xmit_xdp_frame() Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
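A sketch of the length sanity check from point 2, with a stand-in constant for MLX5E_XDP_MIN_INLINE and a caller-supplied HW MTU bound:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define XDP_MIN_INLINE 18u     /* stand-in for MLX5E_XDP_MIN_INLINE */

/* After bpf_xdp_adjust_head() has moved xdp->data, the frame must
 * still be long enough to inline and short enough for the wire. */
bool xdp_adjusted_len_ok(const uint8_t *data, const uint8_t *data_end,
                         uint32_t hw_mtu)
{
    size_t len = (size_t)(data_end - data);
    return len >= XDP_MIN_INLINE && len <= hw_mtu;
}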
#
c0f1147d |
|
06-Dec-2016 |
Mohamad Haj Yahia <mohamad@mellanox.com> |
net/mlx5e: Change the SQ/RQ operational state to positive logic When using the negative logic (i.e. FLUSH state), after the RQ/SQ reopen we will have a time interval in which the RQ/SQ is not really ready, yet the state indicates that it is not in FLUSH state, because the initial SQ/RQ struct memory starts as zeros. Now we changed the state to indicate whether the SQ/RQ is opened, and we will set the READY state after finishing preparing all the SQ/RQ resources. Fixes: 6e8dd6d6f4bd ("net/mlx5e: Don't wait for SQ completions on close") Fixes: f2fde18c52a7 ("net/mlx5e: Don't wait for RQ completions on close") Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b8335d91 |
|
06-Dec-2016 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Don't notify HW when filling the edge of ICO SQ We are going to do this a couple of steps ahead anyway. Fixes: d3c9bc2743dc ("net/mlx5e: Added ICO SQs") Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
366cbf2f |
|
30-Nov-2016 |
Daniel Borkmann <daniel@iogearbox.net> |
bpf, xdp: drop rcu_read_lock from bpf_prog_run_xdp and move to caller After 326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"), the rcu_read_lock() in bpf_prog_run_xdp() is superfluous, since callers need to hold rcu_read_lock() already to make sure BPF program doesn't get released in the background. Thus, drop it from bpf_prog_run_xdp(), as it can otherwise be misleading. Still keeping the bpf_prog_run_xdp() is useful as it allows for grepping in XDP supported drivers and to keep the typecheck on the context intact. For mlx4, this means we don't have a double rcu_read_lock() anymore. nfp can just make use of bpf_prog_run_xdp(), too. For qede, just move rcu_read_lock() out of the helper. When the driver gets atomic replace support, this will move to call-sites eventually. mlx5 needs actual fixing as it has the same issue as described already in 326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"), that is, we're under RCU bh at this time, BPF programs are released via call_rcu(), and call_rcu() != call_rcu_bh(), so we need to properly mark read side as programs can get xchg()'ed in mlx5e_xdp_set() without queue reset. Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9bcc8606 |
|
27-Nov-2016 |
Shaker Daibes <shakerd@mellanox.com> |
net/mlx5e: Add CQE compression user control The user can now override the automatic driver decision using the rx_cqe_compress flag, which is the preference for CQE compression. The flag is initialized with the automatic driver decision. Signed-off-by: Shaker Daibes <shakerd@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f5f82476 |
|
22-Sep-2016 |
Or Gerlitz <ogerlitz@mellanox.com> |
net/mlx5: E-Switch, Support VLAN actions in the offloads mode Many virtualization systems use a policy under which a vlan tag is pushed to packets sent by guests, and popped before the packet is forwarded to the VM. The current generation of the mlx5 HW doesn't fully support that on a per flow level. As such, we are addressing the above common use case with the SRIOV e-Switch abilities to push vlan into packets sent by VFs and pop vlan from packets forwarded to VFs. The HW can match on the correct vlan being present in packets forwarded to VFs (eSwitch steering is done before stripping the tag), so this part is offloaded as is. A common practice for vlans is to avoid both push vlan and pop vlan for inter-host VM/VM (east-west) communication because in this case, push on egress cancels out with pop on ingress. For supporting that, we use a global eswitch vlan pop policy, hence allowing guest A to communicate with both remote VM B and local VM C. This works since the HW pops the vlan only if it exists (e.g. for C --> A packets but not for B --> A packets). On the slow path, when a VF vport has an offloaded flow which involves pushing vlans, whereas another flow is not currently offloaded, the packets from the 2nd flow seen by the VF representor on the host have vlan. The VF rep driver removes such vlan before calling into the host networking stack. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8515c581 |
|
22-Sep-2016 |
Or Gerlitz <ogerlitz@mellanox.com> |
net/mlx5e: Refactor retrieval of skb from rx completion element (cqe) Factor the relevant code into a static inline helper (skb_from_cqe) doing that. Move the call to napi_gro_receive to be carried out just after mlx5e_complete_rx_cqe returns. Both changes are to be used for the VF representor as well in the next commit. This patch doesn't change any functionality. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
35b510e2 |
|
20-Sep-2016 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: XDP TX xmit more Previously we rang XDP SQ doorbell on every forwarded XDP packet. Here we introduce a xmit more like mechanism that will queue up more than one packet into SQ (up to RX napi budget) w/o notifying the hardware. Once RX napi budget is consumed and we exit napi RX loop, we will flush (doorbell) all XDP looped packets in case there are such. XDP forward packet rate: Comparing XDP with and w/o xmit more (bulk transmit): RX Cores XDP TX XDP TX (xmit more) --------------------------------------------------- 1 6.5Mpps 12.4Mpps 2 13.2Mpps 24.2Mpps 4 25.2Mpps 36.3Mpps* 8 36.3Mpps* 36.3Mpps* *My xmitter was limited to 36.3Mpps, so it is the bottleneck. It seems that receive side can handle more. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
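A sketch of the batching with a hypothetical queue type: descriptors accumulate during the RX NAPI loop, and a single doorbell covers the whole batch when the loop ends.

struct xdpsq {
    unsigned int pending;      /* descriptors posted, not yet doorbelled */
};

/* Stand-in for the MMIO doorbell write done by the real driver. */
static void ring_doorbell(struct xdpsq *sq)
{
    (void)sq;
}

void xdp_xmit_frame(struct xdpsq *sq)
{
    /* ... write the TX descriptor into the SQ ... */
    sq->pending++;             /* note: no doorbell per packet anymore */
}

void rx_napi_loop_done(struct xdpsq *sq)
{
    if (sq->pending) {
        ring_doorbell(sq);     /* one MMIO write for the whole batch */
        sq->pending = 0;
    }
}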
#
b5503b99 |
|
20-Sep-2016 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: XDP TX forwarding support Adding support for XDP_TX forwarding from xdp program. Using XDP, now user can loop packets out of the same port. We create a dedicated TX SQ for each channel that will serve XDP programs that return XDP_TX action to loop packets back to the wire directly from the channel RQ RX path. For that RX pages will now need to be mapped bi-directionally, and on XDP_TX action we will sync the page back to device then queue it into SQ for transmission. The XDP xmit frame function will report back to the RX path if the page was consumed (transmitted), if so, RX path will forget about that page as if it were released to the stack. Later on, on XDP TX completion, the page will be released back to the page cache. For simplicity this patch will hit a doorbell on every XDP TX packet. Next patch will introduce a xmit more like mechanism that will queue up more than one packet into SQ w/o notifying the hardware, once RX napi loop is done we will hit doorbell once for all XDP TX packets form the previous loop. This should drastically improve XDP TX performance. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f10b7cc7 |
|
20-Sep-2016 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Have a clear separation between different SQ types Make a clear separation between Regular SQ (TXQ) and ICO SQ creation and destruction, and unify their mutual information structures. Don't allocate redundant TXQ skb/wqe_info/dma_fifo arrays for ICO SQ. And have a different SQ edge for ICO SQ than TXQ SQ, to be more accurate. In preparation for XDP TX support. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
86994156 |
|
20-Sep-2016 |
Rana Shahout <ranas@mellanox.com> |
net/mlx5e: XDP fast RX drop bpf programs support Add support for the BPF_PROG_TYPE_PHYS_DEV hook in the mlx5e driver. When XDP is on, we make sure to change the channels RQs type to MLX5_WQ_TYPE_LINKED_LIST rather than the "striding RQ" type, to ensure "page per packet". On XDP set, we fail if HW LRO is set and request the user to turn it off. Since on ConnectX4-LX HW LRO is always on by default, this will be annoying, but we prefer not to enforce LRO off from the XDP set function. A full channels reset (close/open) is required only when setting XDP on/off. When XDP set is called just to exchange programs, we will update each RQ xdp program on the fly; for synchronization with the current data path RX activity of that RQ, we temporarily disable the RQ and ensure the RX path is not running, then quickly update and re-enable the RQ. For that we do: - rq.state = disabled - napi_synchronize - xchg(rq->xdp_prg) - rq.state = enabled - napi_schedule // Just in case we've missed an IRQ Packet rate performance testing was done with pktgen 64B packets on the TX side, and TC drop action on the RX side compared to XDP fast drop. CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz Comparison is done between: 1. Baseline, before this patch, with TC drop action 2. This patch with TC drop action 3. This patch with XDP RX fast drop RX Cores Baseline(TC drop) TC drop XDP fast Drop -------------------------------------------------------------- 1 5.3Mpps 5.3Mpps 16.5Mpps 2 10.2Mpps 10.2Mpps 31.3Mpps 4 20.5Mpps 19.9Mpps 36.3Mpps* *My xmitter was limited to 36.3Mpps, so it is the bottleneck. It seems that the receive side can handle more. Signed-off-by: Rana Shahout <ranas@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
21c59685 |
|
20-Sep-2016 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Union RQ RX info per RQ type We have two types of RX RQs, and they use two separate sets of info arrays and structures in the RX data path functions. Today those structures are mutually exclusive per RQ type, hence only one kind is allocated on RQ creation, according to the RQ type. For better cache locality and to minimize sizeof(struct mlx5e_rq), this patch defines them as a union. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
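A minimal sketch of the idea, with stand-in type names (the driver's actual members differ): since the two RQ types never coexist in one RQ, their per-WQE info arrays can share storage.

        struct wqe_info_sketch;  /* per-WQE info, linked-list RQ */
        struct mpw_info_sketch;  /* per-WQE info, striding (MPWQE) RQ */

        struct rq_sketch {
                /* ... */
                union {
                        struct wqe_info_sketch *wqe_info;   /* linked-list RQ */
                        struct mpw_info_sketch *mpwqe_info; /* striding RQ */
                };
                /* ... */
        };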
|
#
1bfecfca |
|
20-Sep-2016 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Build RX SKB on demand For the non-striding RQ configuration, before this patch we had a ring with pre-allocated SKBs and mapped the SKB->data buffers for the device. For robustness and better RX data buffer management, we now allocate a page per packet and build_skb() around it. This patch (which is a prerequisite for XDP) will actually reduce performance for normal stack usage, because we are now hitting a bottleneck in the page allocator. We use the page cache to restore, and even improve, performance in comparison to the old RX scheme. Packet rate performance testing was done with pktgen 64B packets on the xmit side and a TC ingress dropping action on the RX side. CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz Comparison is done between:
1. Baseline, before 'net/mlx5e: Build RX SKB on demand'
2. Build SKB with RX page cache (this patch)
RX Cores  Baseline   Build SKB+page-cache  Improvement
-----------------------------------------------------------
1         4.16Mpps   5.33Mpps              28%
2         7.16Mpps   10.24Mpps             43%
4         13.61Mpps  20.51Mpps             51%
8         25.32Mpps  32.00Mpps             26%
All respective cores were 100% utilized. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
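A hedged sketch of the page-per-packet completion path follows; the helper name and parameters are illustrative, not the driver's code:

        #include <linux/dma-mapping.h>
        #include <linux/skbuff.h>

        /* Sync the mapped page for the CPU and wrap it in an skb without
         * copying; frag_size covers the buffer plus skb_shared_info
         * overhead. Error handling trimmed. */
        static struct sk_buff *rx_build_skb(struct device *dev, void *va,
                                            dma_addr_t dma, u32 frag_size,
                                            u32 pkt_len)
        {
                struct sk_buff *skb;

                dma_sync_single_for_cpu(dev, dma, frag_size, DMA_FROM_DEVICE);
                skb = build_skb(va, frag_size);  /* skb wraps the page buffer */
                if (unlikely(!skb))
                        return NULL;
                skb_reserve(skb, NET_SKB_PAD);   /* headroom kept in the buffer */
                skb_put(skb, pkt_len);           /* payload length from the CQE */
                return skb;
        }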
|
#
4415a031 |
|
15-Sep-2016 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Implement RX mapped page cache for page recycle Instead of reallocating and mapping pages for the RX data-path, recycle already used pages in a per-ring cache. Performance tests: The following results were measured on a freshly booted system, giving optimal baseline performance, as high-order pages are yet to be fragmented and depleted. We ran pktgen single-stream benchmarks, with iptables-raw-drop:
Single stride, 64 bytes:
* 4,739,057 - baseline
* 4,749,550 - order0 no cache
* 4,786,899 - order0 with cache (1% gain)
Larger packets, no page cross, 1024 bytes:
* 3,982,361 - baseline
* 3,845,682 - order0 no cache
* 4,127,852 - order0 with cache (3.7% gain)
Larger packets, every 3rd packet crosses a page, 1500 bytes:
* 3,731,189 - baseline
* 3,579,414 - order0 no cache
* 3,931,708 - order0 with cache (5.4% gain)
Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
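The cache is conceptually a small FIFO of recently released pages, tried before falling back to the page allocator. An illustrative sketch, with assumed names and size:

        #include <linux/types.h>
        #include <linux/mm_types.h>
        #include <linux/page_ref.h>

        #define RX_PAGE_CACHE_SIZE 256 /* illustrative; must be a power of two */

        struct rx_page_cache {
                u32          head;
                u32          tail;
                struct page *pages[RX_PAGE_CACHE_SIZE];
        };

        static bool rx_cache_put(struct rx_page_cache *c, struct page *page)
        {
                u32 next = (c->tail + 1) & (RX_PAGE_CACHE_SIZE - 1);

                if (next == c->head)            /* cache full */
                        return false;
                c->pages[c->tail] = page;
                c->tail = next;
                return true;
        }

        static struct page *rx_cache_get(struct rx_page_cache *c)
        {
                struct page *page;

                if (c->head == c->tail)         /* cache empty */
                        return NULL;
                page = c->pages[c->head];
                if (page_ref_count(page) != 1)  /* still in flight up the stack */
                        return NULL;            /* keep it cached for later */
                c->head = (c->head + 1) & (RX_PAGE_CACHE_SIZE - 1);
                return page;
        }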
|
#
a5a0c590 |
|
15-Sep-2016 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Introduce API for RX mapped pages Manage the allocation and deallocation of mapped RX pages only through dedicated API functions. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7e426671 |
|
15-Sep-2016 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Single flow order-0 pages for Striding RQ To improve the memory consumption scheme, we omit the flow that demands and splits high-order pages in Striding RQ, and stay with a single Striding RQ flow that uses order-0 pages. Moving to fragmented memory allows the use of larger MPWQEs, which reduces the number of UMR posts and filler CQEs. Moving to a single flow allows several optimizations that improve performance, especially in production servers where we would anyway fall back to order-0 allocations:
- inline functions that were called via function pointers.
- improve the UMR post process.
This patch alone is expected to give a slight performance reduction. However, the new memory scheme makes it possible to use a page cache of a fair size that doesn't inflate the memory footprint, which will dramatically offset the reduction and even give a performance gain. Performance tests: The following results were measured on a freshly booted system, giving optimal baseline performance, as high-order pages are yet to be fragmented and depleted. We ran pktgen single-stream benchmarks, with iptables-raw-drop:
Single stride, 64 bytes:
* 4,739,057 - baseline
* 4,749,550 - this patch (no reduction)
Larger packets, no page cross, 1024 bytes:
* 3,982,361 - baseline
* 3,845,682 - this patch (3.5% reduction)
Larger packets, every 3rd packet crosses a page, 1500 bytes:
* 3,731,189 - baseline
* 3,579,414 - this patch (4% reduction)
Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)") Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cd17d230 |
|
07-Sep-2016 |
Gal Pressman <galp@mellanox.com> |
net/mlx5e: Fix parsing of vlan packets when updating lro header Previously, vlan-tagged packets were not parsed correctly and were assumed to be regular IPv4/IPv6 packets. We should check for 802.1Q/802.1ad tags and update the lro header accordingly. This fixes the use case where LRO is on and rxvlan is off (vlan stripping is off). Fixes: e586b3b0baee ('net/mlx5: Ethernet Datapath files') Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
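A hedged sketch of the VLAN-aware parsing, with an illustrative helper name: skip over an 802.1Q/802.1ad tag before locating the IP header that LRO must update.

        #include <linux/if_ether.h>
        #include <linux/if_vlan.h>

        static void *lro_network_header(struct ethhdr *eth, __be16 *proto)
        {
                void *nh = eth + 1;     /* byte right after the Ethernet header */

                *proto = eth->h_proto;
                if (*proto == htons(ETH_P_8021Q) || *proto == htons(ETH_P_8021AD)) {
                        struct vlan_hdr *vhdr = nh;

                        *proto = vhdr->h_vlan_encapsulated_proto;
                        nh = vhdr + 1;  /* IP header starts after the tag */
                }
                return nh;
        }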
|
#
8484f9ed |
|
28-Aug-2016 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Don't post fragmented MPWQE when RQ is disabled The ICO (Internal Control Operations) SQ (Send Queue) is closed/disabled after the RQ (Receive Queue). After the RQ is closed, an ICO SQ completion might post a fragmented MPWQE (Multi-Packet Work Queue Element) into that RQ. As on a regular RQ post, check whether we are allowed to post to that RQ (i.e. the RQ is enabled). Clean up any in-progress UMR MPWQE in mlx5e_free_rx_descs if needed. Fixes: bc77b240b3c5 ('net/mlx5e: Add fragmented memory support for RX multi packet WQE') Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f2fde18c |
|
28-Aug-2016 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Don't wait for RQ completions on close This will significantly reduce receive queue flush time on interface down. Instead of asking the firmware to flush the RQ (Receive Queue) via asynchronous completions when it is moved to error, we handle the RQ flush manually (mlx5e_free_rx_descs), the same as we did when an RQ flush timed out. This reduces RQ flush time and speeds up the interface down procedure (ifconfig down) from 6 sec to 0.3 sec on a 48-core system. Moved mlx5e_free_rx_descs to en_main.c, where it is needed, to keep en_rx.c free of non-critical data path code for better code locality. Fixes: 6cd392a082de ('net/mlx5e: Handle RQ flush in error cases') Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fe4c988b |
|
28-Aug-2016 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Limit UMR length to the device's limitation The ConnectX-4 UMR (User Memory Region) MTT translation table offset in the WQE is limited to U16_MAX; before this patch we ignored that limitation and requested the maximum possible UMR translation length that the netdev might need (MAX channels * MAX pages per channel). On a system with more than 32 cores, when linear WQE allocation fails, falling back to UMR WQEs would cause the RQ (Receive Queue) to get stuck. Here we limit the UMR length to min(U16_MAX, max required pages) (while considering the required alignments) at driver load. By default U16_MAX is sufficient, since the default RX ring sizes guarantee that we are in range; dynamically (on set_ringparam/set_channels) we check whether the new required UMR length (num MTTs) is still in range and, if not, fail the request. Fixes: bc77b240b3c5 ('net/mlx5e: Add fragmented memory support for RX multi packet WQE') Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
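A sketch of the load-time clamp, with illustrative names (alignment handling omitted): the MTT offset field in a UMR WQE is 16 bits wide, so the translation length is capped accordingly.

        #include <linux/kernel.h>

        static u32 max_umr_mtts(u32 max_channels, u32 max_pages_per_channel)
        {
                u32 wanted = max_channels * max_pages_per_channel;

                /* cap at what the 16-bit MTT offset can address */
                return min_t(u32, U16_MAX, wanted);
        }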
|
#
6cd392a0 |
|
30-Jun-2016 |
Daniel Jurgens <danielj@mellanox.com> |
net/mlx5e: Handle RQ flush in error cases Add a timeout to avoid an infinite loop waiting for RQs to flush. This occurs during AER/EEH, and will also happen if the device stops posting completions due to an internal error or reset, or if moving the RQ to the error state fails. Also clean up posted receive resources when closing the RQ. Fixes: f62b8bb8f2d3 ('net/mlx5: Extend mlx5_core to support ConnectX-4 Ethernet functionality') Signed-off-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bfe6d8d1 |
|
26-Jun-2016 |
Gal Pressman <galp@mellanox.com> |
net/mlx5e: Reorganize ethtool statistics Categorize and reorganize ethtool statistics counters by renaming them to "rx_*" and "tx_*" and removing redundant and duplicated counters; this way they are easier to grasp and more user friendly. Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0139aa7b |
|
19-May-2016 |
Joonsoo Kim <iamjoonsoo.kim@lge.com> |
mm: rename _count, field of the struct page, to _refcount Many developers already know that the reference-count field of struct page is _count, an atomic type. They may try to handle it directly, and this could break the purpose of the page reference count tracepoints. To prevent direct _count modification, this patch renames it to _refcount and adds a warning message in the code. After that, developers who need to handle the reference count will find that the field should not be accessed directly. [akpm@linux-foundation.org: fix comments, per Vlastimil] [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too] [sfr@canb.auug.org.au: sync ethernet driver changes] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Sunil Goutham <sgoutham@cavium.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Manish Chopra <manish.chopra@qlogic.com> Cc: Yuval Mintz <yuval.mintz@qlogic.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
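For example, callers are expected to go through the page_ref_* helpers rather than the renamed field; a minimal sketch (the helper wrapper is illustrative):

        #include <linux/page_ref.h>

        /* Use the page_ref_* accessors instead of poking page->_refcount
         * directly, so the refcount tracepoints stay accurate. */
        static bool page_is_unshared(struct page *page)
        {
                return page_ref_count(page) == 1;
        }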
|
#
d9d9f156 |
|
10-May-2016 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Expand WQE stride when CQE compression is enabled Make the MPWQE/Striding RQ default configuration dynamic rather than statically set at compile time. Now, at driver load, we set the stride size and number of strides dynamically. By default we use the same values as before, but when CQE compression is enabled, we set a larger stride size so that larger packets also benefit from CQE compression. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7219ab34 |
|
10-May-2016 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: CQE compression The CQE compression feature is meant to save PCIe bandwidth by compressing several CQEs into a smaller number of bytes on PCIe. CQE compression can be selectively enabled per CQ. It is disabled by default for now and will be enabled later on. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1b223dd3 |
|
24-Apr-2016 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Fix checksum handling for non-stripped vlan packets Now that rx-vlan offload can be disabled, packets can be received with the vlan tag not stripped, which means is_first_ethertype_ip will return false. For such packets we need to check whether the hardware reported csum OK, so that we can report CHECKSUM_UNNECESSARY for them. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
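A hedged sketch of the resulting decision, with assumed parameter names (not the driver's exact code): when the ethertype check fails for a tagged frame, fall back to the hardware's "L3/L4 csum OK" indication.

        #include <linux/skbuff.h>

        static void rx_set_csum(struct sk_buff *skb, __wsum hw_csum,
                                bool first_ethertype_ip, bool hw_l3_l4_ok)
        {
                if (first_ethertype_ip) {
                        skb->ip_summed = CHECKSUM_COMPLETE;    /* stack folds hw_csum */
                        skb->csum = hw_csum;
                } else if (hw_l3_l4_ok) {
                        skb->ip_summed = CHECKSUM_UNNECESSARY; /* HW validated it */
                } else {
                        skb->ip_summed = CHECKSUM_NONE;        /* stack verifies */
                }
        }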
|
#
54984407 |
|
20-Apr-2016 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Add ethtool counter for RX buffer allocation failures Counts the number of RX buffer allocation failures and shows it in ethtool statistics. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e20a0db3 |
|
20-Apr-2016 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Delay skb->data access Move mlx5e_handle_csum and eth_type_trans to the end of mlx5e_build_rx_skb to gain some more time before accessing skb->data, to reduce cache misses. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c5adb96f |
|
20-Apr-2016 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Use napi_alloc_skb for RX SKB allocations Instead of netdev_alloc_skb, we use the napi_alloc_skb function, which is designed to allocate skbs for RX in a channel-specific NAPI instance, and which implies the IP packet alignment. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
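A minimal usage sketch (the wrapper is illustrative, not driver code):

        #include <linux/skbuff.h>

        static struct sk_buff *rx_alloc_skb(struct napi_struct *napi,
                                            unsigned int len)
        {
                /* napi_alloc_skb() draws from the per-NAPI skb cache and
                 * reserves headroom so the IP header lands aligned. */
                return napi_alloc_skb(napi, len);
        }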
|
#
bc77b240 |
|
20-Apr-2016 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Add fragmented memory support for RX multi packet WQE If the allocation of a linear (physically contiguous) MPWQE fails, we allocate a fragmented MPWQE. This is implemented via the device's UMR (User Memory Registration), which allows registering multiple memory fragments into the ConnectX hardware as one contiguous buffer. UMR registration is an asynchronous operation and is done via ICO SQs. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
461017cb |
|
20-Apr-2016 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Support RX multi-packet WQE (Striding RQ) Introduce the multi-packet WQE (RX Work Queue Element) feature, referred to as MPWQE or Striding RQ, in which WQEs are larger and each serves multiple packets. Every WQE consists of many strides of the same size; every received packet is aligned to the beginning of a stride and is written to consecutive strides within a WQE. In the regular approach, each WQE is big enough to serve one received packet of any size up to MTU, or 64KB when device LRO is enabled, making it very wasteful for small packets or when device LRO is enabled. With its flexibility, MPWQE allows better memory utilization (implying improvements in CPU utilization and packet rate), as packets consume strides according to their size, preserving the rest of the WQE for other packets. MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 bytes
The default WQE memory footprint went from 1024*mtu (~1.5MB) to 16 * 2048 * 64 = 2MB per ring. However, HW LRO can now be supported at no additional cost in memory footprint, hence we turn it on by default and get even better performance. Performance was tested on ConnectX4-Lx 50G. To isolate the feature under test, the numbers below were measured with HW LRO turned off. We verified that performance only improves when LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes: default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64 byte), single stream:
| num packets | packets loss before | packets loss after
| 2K          | ~ 1K                | 0
| 8K          | ~ 6K                | 0
| 16K         | ~13K                | 0
| 32K         | ~28K                | 0
| 64K         | ~57K                | ~24K
As expected, the driver can now receive as many small packets (<=64B) as the total number of strides in the ring (default = 2048 * 16), vs. 1024 (the default ring size, regardless of packet size) before this feature. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Achiad Shochat <achiad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2f48af12 |
|
20-Apr-2016 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Use function pointers for RX data path handling In preparation for the Striding RQ feature, which will need its own RX handlers. This patch does not change any functionality. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Achiad Shochat <achiad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
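Conceptually, each RQ carries a handler pointer set once at creation; a sketch with stand-in names (not the driver's exact identifiers):

        struct cqe_sketch;   /* stand-in for the hardware CQE */

        struct rq_sketch {
                /* completion handler chosen per RQ type at creation time */
                void (*handle_rx_cqe)(struct rq_sketch *rq,
                                      struct cqe_sketch *cqe);
                /* ... */
        };

        /* At RQ creation, e.g.:
         *   rq->handle_rx_cqe = striding ? handle_rx_cqe_mpwqe
         *                                : handle_rx_cqe_linear;
         */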
|
#
12185a9f |
|
07-Mar-2016 |
Amir Vadai <amir@vadai.me> |
net/mlx5e: Support offload cls_flower with skbedit mark action Introduce offloading of the skbedit mark action. For example, to mark with 0x1234 all TCP (ip_proto 6) packets arriving to interface ens9:
# tc qdisc add dev ens9 ingress
# tc filter add dev ens9 protocol ip parent ffff: \
    flower ip_proto 6 \
    indev ens9 \
    action skbedit mark 0x1234
Signed-off-by: Amir Vadai <amir@vadai.me> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b081da5e |
|
29-Feb-2016 |
Gal Pressman <galp@mellanox.com> |
net/mlx5e: Add rx/tx bytes software counters Sum up rx/tx bytes in software, as we do for rx/tx packets, to be reported in an upcoming statistics fix. Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
59a7c2fd |
|
29-Feb-2016 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: Remove wrong poll CQ optimization With the MLX5E_CQ_HAS_CQES optimization flag, the following buggy flow might occur:
- Suppose RX is always busy and TX has a single packet every second.
- We poll a single TX cqe and clear its flag.
- We never arm it again, as RX is always busy.
- The TX CQ flag is never changed, and new TX cqes are not polled.
We revert this optimization. Fixes: e586b3b0baee ('net/mlx5: Ethernet Datapath files') Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5f6d12d1 |
|
22-Feb-2016 |
Matthew Finlay <matt@mellanox.com> |
net/mlx5e: Move to checksum complete Use checksum complete for all IP packets, unless they are HW LRO'ed, in which case use checksum unnecessary. Signed-off-by: Matthew Finlay <matt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ef9814de |
|
29-Dec-2015 |
Eran Ben Elisha <eranbe@mellanox.com> |
net/mlx5e: Add HW timestamping (TS) support Add support for enabling/disabling HW timestamping for incoming and/or outgoing packets. To enable/disable HW timestamping, the appropriate ioctl should be used. Currently only HWTSTAMP_FILTER_ALL/NONE and HWTSTAMP_TX_ON/OFF are supported. Make all relevant changes in the RX/TX flows to consider the TS request and plant HW timestamps into the relevant structures. Add an internal clock for converting hardware timestamps to nanoseconds. In addition, add a service task to catch internal clock overflow, to make sure timestamping stays accurate. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
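A hedged sketch of the rx_filter validation a driver performs in its SIOCSHWTSTAMP path, with an illustrative helper name; only the two supported filters are accepted:

        #include <linux/net_tstamp.h>
        #include <linux/errno.h>

        static int rx_filter_supported(const struct hwtstamp_config *cfg)
        {
                switch (cfg->rx_filter) {
                case HWTSTAMP_FILTER_NONE:  /* timestamp nothing */
                case HWTSTAMP_FILTER_ALL:   /* timestamp every packet */
                        return 0;
                default:
                        return -ERANGE;     /* unsupported filter */
                }
        }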
|
#
93f93a44 |
|
18-Nov-2015 |
Eric Dumazet <edumazet@google.com> |
net: move skb_mark_napi_id() into core networking stack We would like to automatically provide busy polling support to all NAPI drivers, without them having to implement anything. skb_mark_napi_id() can be called from napi_gro_receive() and napi_get_frags(). A few drivers are still calling skb_mark_napi_id() because they use netif_receive_skb(); they should eventually call napi_gro_receive() instead. I will leave this to the drivers' maintainers. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
44fb6fbb |
|
18-Nov-2015 |
Eric Dumazet <edumazet@google.com> |
mlx5: support napi_complete_done() A NAPI poll handler should return the number of RX packets processed, instead of 0 / budget. This allows proper busy poll accounting through the LINUX_MIB_BUSYPOLLRXPACKETS SNMP counter. napi_complete_done() allows /sys/class/net/ethX/gro_flush_timeout to be used for finer GRO aggregation control. Tested: Enabled busy polling, and checked that the TcpExtBusyPollRxPackets counter is increasing.
echo 70 >/proc/sys/net/core/busy_read
nstat >/dev/null
netperf -H target -t TCP_RR >/dev/null
nstat | grep TcpExtBusyPollRxPackets
TcpExtBusyPollRxPackets         490958             0.0
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Eli Cohen <eli@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7ae92ae5 |
|
18-Nov-2015 |
Eric Dumazet <edumazet@google.com> |
mlx5: add busy polling support It is now easy to add busy polling support to a NAPI driver, with very little impact on the normal input path. This patch serves as a reference implementation. Note: A followup patch will add proper napi_complete_done() in mlx5, so that the LINUX_MIB_BUSYPOLLRXPACKETS snmp counter is properly handled. Tested: Normal TCP_RR results without busy polling:
lpk51:~# echo 0 >/proc/sys/net/core/busy_read
lpk52:~# echo 0 >/proc/sys/net/core/busy_read
lpk51:~# ./netperf -H 192.168.4.52 -t TCP_RR -l 10
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.4.52 () port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec
16384  87380  1        1       10.00    53509.49
16384  87380
Now enable busy polling:
lpk51:~# echo 70 >/proc/sys/net/core/busy_read
lpk52:~# echo 70 >/proc/sys/net/core/busy_read
lpk51:~# ./netperf -H 192.168.4.52 -t TCP_RR -l 10
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.4.52 () port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec
16384  87380  1        1       10.00    97530.92
16384  87380
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ecf842f6 |
|
17-Aug-2015 |
David S. Miller <davem@davemloft.net> |
mlx5e: Fix sparse warnings in mlx5e_handle_csum().
>> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:173:44: sparse: incorrect type in argument 1 (different base types)
   drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:173:44:    expected restricted __sum16 [usertype] n
   drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:173:44:    got restricted __be16 [usertype] check_sum
Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bbceefce |
|
16-Aug-2015 |
Achiad Shochat <achiad@mellanox.com> |
net/mlx5e: Support RX CHECKSUM_COMPLETE Only for packets with first ethertype set to IPv4/6 for now. Signed-off-by: Achiad Shochat <achiad@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d9a40271 |
|
16-Aug-2015 |
Achiad Shochat <achiad@mellanox.com> |
net/mlx5e: HW LRO changes/fixes
- Change the maximum LRO session size from 16KB to 64KB
- Reduce the LRO session timeout from 512us to 32us, in order to reduce the TCP latency of non-LRO'ed flows.
- Fix skb_shinfo(skb)->gso_size and set skb_shinfo(skb)->gso_type.
- Fix a bug accessing an un-initialized mdev pointer.
Signed-off-by: Achiad Shochat <achiad@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
99611ba1 |
|
23-Jun-2015 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Prefetch skb data on RX Prefetch the first cache line used by the buffer pointed to by the skb linear data. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a1f5a1a8 |
|
23-Jun-2015 |
Achiad Shochat <achiad@mellanox.com> |
net/mlx5e: Pop cq outside mlx5e_get_cqe Separate mlx5e_get_cqe() from mlx5_cqwq_pop(); this helps code readability and CQ buffer management. Signed-off-by: Achiad Shochat <achiad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e3391054 |
|
23-Jun-2015 |
Achiad Shochat <achiad@mellanox.com> |
net/mlx5e: Remove mlx5e_cq.sqrq back-pointer Use container_of() instead. Signed-off-by: Achiad Shochat <achiad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
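A minimal sketch of the container_of() pattern, with a stand-in layout (the real struct members differ): with the CQ embedded in the RQ, the owning RQ can be recovered from the CQ pointer directly, so no back-pointer is needed.

        #include <linux/kernel.h>

        struct cq_sketch { int dummy; };        /* stand-in for the real CQ */
        struct rq_sketch {
                struct cq_sketch cq;            /* CQ embedded in the RQ */
        };

        static inline struct rq_sketch *rq_from_cq(struct cq_sketch *cq)
        {
                return container_of(cq, struct rq_sketch, cq);
        }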
|
#
fc11fbf9 |
|
11-Jun-2015 |
Saeed Mahameed <saeedm@mellanox.com> |
net/mlx5e: Add HW cacheline start padding Enable HW cacheline start padding and align the RX WQE size to a cacheline while considering HW start padding. Also, fix the dma_unmap call to use the correct SKB data buffer size. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e586b3b0 |
|
28-May-2015 |
Amir Vadai <amirv@mellanox.com> |
net/mlx5: Ethernet Datapath files en_[rt]x.c contains the data path code specific to tx or rx. en_txrx.c contains data path code common to both rx and tx; this is mainly napi-related code. Below are the objects used by the hardware and the driver in the data path:
Channel - one channel per IRQ. Every channel object contains:
  RQ - describes the rx queue
  TIR - one TIR (Transport Interface Receive) object per flow type. A TIR contains attributes for a type of rx flow (e.g. IPv4, IPv6 etc). A flow is defined in the Flow Table. Currently a TIR describes the RSS hash parameters, if they exist, and LRO attributes.
  SQ - describes a tx queue. There is one SQ (Send Queue) per TC (traffic class).
  TIS - there is one TIS (Transport Interface Send) per TC. It describes the TC and may later be extended to describe more transport properties.
Both RQ and SQ inherit from the object WQ (work queue). The common code describing the layout of CQEs and WQEs in memory is in the files wq.[ch]. For every channel there is one NAPI context that is used for RX and for TX. The driver uses netdev_alloc_skb() to allocate skbs. Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
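An illustrative sketch of the per-channel layout described above; member names and the TC bound are assumptions, not the driver's exact definitions:

        #include <linux/netdevice.h>

        #define MAX_TC_SKETCH 8

        struct rq_sketch;
        struct sq_sketch;

        struct channel_sketch {
                struct rq_sketch   *rq;                /* one RX queue per channel */
                struct sq_sketch   *sq[MAX_TC_SKETCH]; /* one SQ per traffic class */
                struct napi_struct  napi;              /* shared by RX and TX */
                int                 ix;                /* channel (IRQ) index */
        };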
|