#
ad2047cf |
|
24-Jan-2024 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: work on pre-XDP prog frag count Fix an OOM panic in XDP_DRV mode when a XDP program shrinks a multi-buffer packet by 4k bytes and then redirects it to an AF_XDP socket. Since support for handling multi-buffer frames was added to XDP, usage of bpf_xdp_adjust_tail() helper within XDP program can free the page that given fragment occupies and in turn decrease the fragment count within skb_shared_info that is embedded in xdp_buff struct. In current ice driver codebase, it can become problematic when page recycling logic decides not to reuse the page. In such case, __page_frag_cache_drain() is used with ice_rx_buf::pagecnt_bias that was not adjusted after refcount of page was changed by XDP prog which in turn does not drain the refcount to 0 and page is never freed. To address this, let us store the count of frags before the XDP program was executed on Rx ring struct. This will be used to compare with current frag count from skb_shared_info embedded in xdp_buff. A smaller value in the latter indicates that XDP prog freed frag(s). Then, for given delta decrement pagecnt_bias for XDP_DROP verdict. While at it, let us also handle the EOP frag within ice_set_rx_bufs_act() to make our life easier, so all of the adjustments needed to be applied against freed frags are performed in the single place. Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side") Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20240124191602.566724-5-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
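The accounting described in this commit can be sketched outside the driver. This is a simplified model and not the actual `ice_set_rx_bufs_act()` code: `struct demo_rx_buf`, `demo_fixup_bias()` and the `freed_by_xdp` field are hypothetical names, and the sketch assumes the frags an XDP program frees via tail shrinking are always the trailing ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-buffer state kept by the driver. */
struct demo_rx_buf {
	uint32_t pagecnt_bias;	/* references the driver still holds on the page */
	bool freed_by_xdp;	/* set when the XDP program released this frag */
};

/*
 * Compare the frag count stored before the XDP program ran with the count
 * read afterwards. A smaller post-XDP value means frags were freed, so the
 * bias of each freed (trailing) buffer is decremented to match the page
 * refcount. Returns the number of freed frags.
 */
static inline uint32_t demo_fixup_bias(struct demo_rx_buf *bufs,
				       uint32_t pre_frags, uint32_t post_frags)
{
	uint32_t i;

	assert(post_frags <= pre_frags);

	for (i = post_frags; i < pre_frags; i++) {
		bufs[i].pagecnt_bias--;
		bufs[i].freed_by_xdp = true;
	}

	return pre_frags - post_frags;
}
```

With four frags before the program and two after, only the last two buffers have their bias adjusted; the first two are untouched.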
#
714ed949 |
|
05-Dec-2023 |
Larysa Zaremba <larysa.zaremba@intel.com> |
ice: Implement VLAN tag hint Implement .xmo_rx_vlan_tag callback to allow XDP code to read packet's VLAN tag. At the same time, use vlan_tci instead of vlan_tag in touched code, because VLAN tag often refers to VLAN proto and VLAN TCI combined, while in the code we clearly store only VLAN TCI. Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://lore.kernel.org/r/20231205210847.28460-11-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
9031d5f4 |
|
05-Dec-2023 |
Larysa Zaremba <larysa.zaremba@intel.com> |
ice: Support HW timestamp hint Use previously refactored code and create a function that allows XDP code to read HW timestamp. Also, introduce packet context, where hints-related data will be stored. ice_xdp_buff contains only a pointer to this structure, to avoid copying it in ZC mode later in the series. HW timestamp is the first supported hint in the driver, so also add xdp_metadata_ops. Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://lore.kernel.org/r/20231205210847.28460-6-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
d951c14a |
|
05-Dec-2023 |
Larysa Zaremba <larysa.zaremba@intel.com> |
ice: Introduce ice_xdp_buff In order to use XDP hints via kfuncs we need to put RX descriptor and miscellaneous data next to xdp_buff. Same as in hints implementations in other drivers, we achieve this through putting xdp_buff into a child structure. Currently, xdp_buff is stored in the ring structure, so replace it with union that includes child structure. This way enough memory is available while existing XDP code remains isolated from hints. Minimum size of the new child structure (ice_xdp_buff) is exactly 64 bytes (single cache line). To place it at the start of a cache line, move 'next' field from CL1 to CL4, as it isn't used often. This still leaves 192 bits available in CL3 for packet context extensions. Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20231205210847.28460-5-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
0ffb08b1 |
|
21-Nov-2023 |
Jacob Keller <jacob.e.keller@intel.com> |
ice: remove ptp_tx ring parameter flag Before performing a Tx timestamp in ice_stamp(), the driver checks a ptp_tx ring variable to see if timestamping is enabled on that ring. This value is set for all rings whenever userspace configures Tx timestamping. Ostensibly this was done to avoid wasting cycles checking other fields when timestamping has not been enabled. However, for Tx timestamps we already get an individual per-SKB flag indicating whether userspace wants to request a timestamp on that packet. We do not gain much by also having a separate flag to check for whether timestamping was enabled. In fact, the driver currently fails to restore the field after a PF reset. Because of this, if a PF reset occurs, timestamps will be disabled. Since this flag doesn't add value in the hotpath, remove it and always provide a timestamp if the SKB flag has been set. A following change will fix the reset path to properly restore user timestamping configuration completely. This went unnoticed for some time because one of the most common applications using Tx timestamps, ptp4l, will reconfigure the socket as part of its fault recovery logic. Fixes: ea9b847cda64 ("ice: enable transmit timestamps for E810 devices") Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
9113302b |
|
11-May-2023 |
Jan Sokolowski <jan.sokolowski@intel.com> |
ice: Fix undersized tx_flags variable As not all ICE_TX_FLAGS_* fit in current 16-bit limited tx_flags field that was introduced in the Fixes commit, VLAN-related information would be discarded completely. As such, creating a vlan and trying to run ping through would result in no traffic passing. Fix that by refactoring tx_flags variable into flags only and a separate variable that holds VLAN ID. As there is some space left, type variable can fit between those two. Pahole reports no size change to ice_tx_buf struct. Fixes: aa1d3faf71a6 ("ice: Robustify cleaning/completing XDP Tx buffers") Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
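The truncation behind this fix is easy to reproduce in isolation. The sketch below uses hypothetical `DEMO_*` values rather than the real `ICE_TX_FLAGS_*` definitions; it only assumes, as the commit message implies, that the VLAN information was carried in bits above bit 15 of a combined flags word.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: flag bits in the low half, VLAN ID in the high half. */
#define DEMO_TX_FLAGS_HW_VLAN	(1u << 1)
#define DEMO_TX_FLAGS_VLAN_S	16

/* Pack a flag word and a 12-bit VLAN ID into one 32-bit value. */
static inline uint32_t demo_pack(uint32_t flags, uint16_t vid)
{
	assert(vid < 4096);
	return flags | ((uint32_t)vid << DEMO_TX_FLAGS_VLAN_S);
}

/* Storing the packed word in a 16-bit field silently drops the VLAN ID. */
static inline uint16_t demo_store_u16(uint32_t packed)
{
	return (uint16_t)packed;
}

/* Recover the VLAN ID from whatever survived the store. */
static inline uint16_t demo_vid(uint32_t stored)
{
	return (uint16_t)(stored >> DEMO_TX_FLAGS_VLAN_S);
}
```

Packing VLAN ID 100 gives `0x00640002`; after the 16-bit store only `0x0002` remains, so the flag survives but the VLAN ID reads back as 0. Keeping the ID in its own field, as the commit does, avoids the loss.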
#
055d0920 |
|
10-Feb-2023 |
Alexander Lobakin <alexandr.lobakin@intel.com> |
ice: Fix freeing XDP frames backed by Page Pool As already mentioned, freeing any &xdp_frame via page_frag_free() is wrong, as it assumes the frame is backed by either an order-0 page or a page with no "patrons" behind them, while in fact frames backed by Page Pool can be redirected to a device, which's driver doesn't use it. Keep storing a pointer to the raw buffer and then freeing it unconditionally via page_frag_free() for %XDP_TX frames, but introduce a separate type in the enum for frames coming through .ndo_xdp_xmit(), and free them via xdp_return_frame_bulk(). Note that saving xdpf as xdp_buff->data_hard_start is intentional and is always true when everything is configured properly. After this change, %XDP_REDIRECT from a Page Pool based driver to ice becomes zero-alloc as it should be and horrendous 3.3 Mpps / queue turn into 6.6, hehe. Let it go with no "Fixes:" tag as it spans across good 5+ commits and can't be trivially backported. Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-6-alexandr.lobakin@intel.com
|
#
aa1d3faf |
|
10-Feb-2023 |
Alexander Lobakin <alexandr.lobakin@intel.com> |
ice: Robustify cleaning/completing XDP Tx buffers When queueing frames from a Page Pool for redirecting to a device backed by the ice driver, `perf top` shows heavy load on page_alloc() and page_frag_free(), despite that on a properly working system it must be fully or at least almost zero-alloc. The problem is in fact a bit deeper and raises from how ice cleans up completed Tx buffers. The story so far: when cleaning/freeing the resources related to a particular completed Tx frame (skbs, DMA mappings etc.), ice uses some heuristics only without setting any type explicitly (except for dummy Flow Director packets, which are marked via ice_tx_buf::tx_flags). This kinda works, but only up to some point. For example, currently ice assumes that each frame coming to __ice_xmit_xdp_ring(), is backed by either plain order-0 page or plain page frag, while it may also be backed by Page Pool or any other possible memory models introduced in future. This means any &xdp_frame must be freed properly via xdp_return_frame() family with no assumptions. In order to do that, the whole heuristics must be replaced with setting the Tx buffer/frame type explicitly, just how it's always been done via an enum. Let us reuse 16 bits from ::tx_flags -- 1 bit-and instr won't hurt much -- especially given that sometimes there was a check for %ICE_TX_FLAGS_DUMMY_PKT, which is now turned from a flag to an enum member. The rest of the changes is straightforward and most of it is just a conversion to rely now on the type set in &ice_tx_buf rather than to some secondary properties. For now, no functional changes intended, the change only prepares the ground for starting freeing XDP frames properly next step. And it must be done atomically/synchronously to not break stuff. 
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-5-alexandr.lobakin@intel.com
|
#
a24b4c6e |
|
31-Jan-2023 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: xsk: Do not convert to buff to frame for XDP_TX Let us store pointer to xdp_buff that came from xsk_buff_pool on tx_buf so that it will be possible to recycle it via xsk_buff_free() on Tx cleaning side. This way it is not necessary to do expensive copy to another xdp_buff backed by a newly allocated page. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-14-maciej.fijalkowski@intel.com
|
#
f4db7b31 |
|
31-Jan-2023 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: Remove next_{dd,rs} fields from ice_tx_ring Now that both ZC and standard XDP data paths stopped using Tx logic based on next_dd and next_rs fields, we can safely remove these fields and shrink Tx ring structure. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-13-maciej.fijalkowski@intel.com
|
#
3246a107 |
|
31-Jan-2023 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
|
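The boundary problem from the commit message's example (a 4-descriptor frame starting at descriptor 62 of a 256-entry ring) can be checked with a few lines of arithmetic. This is an illustrative sketch: the helper names are hypothetical, and it assumes the old scheme set the RS bit on the last descriptor of each ring quarter (indices 63, 127, 191, 255 for 256 entries).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Old scheme: returns true if a quarter-boundary RS position lands on a
 * descriptor of the frame other than its last one, i.e. mid-frame.
 */
static inline bool demo_rs_splits_frame(uint16_t ring_count, uint16_t first,
					uint16_t nr_descs)
{
	uint16_t quarter = ring_count / 4;
	uint16_t i;

	assert(quarter > 0 && nr_descs > 0 && nr_descs <= ring_count);

	/* Skip the final descriptor: RS on the last one is legitimate. */
	for (i = 0; i + 1 < nr_descs; i++) {
		uint16_t idx = (uint16_t)((first + i) % ring_count);

		if (idx % quarter == quarter - 1)
			return true;
	}

	return false;
}

/* New scheme: RS goes on the last descriptor of the batch, wherever that is. */
static inline uint16_t demo_last_desc(uint16_t ring_count, uint16_t first,
				      uint16_t nr_descs)
{
	return (uint16_t)((first + nr_descs - 1) % ring_count);
}
```

For the frame occupying descriptors 62 to 65, the old placement hits index 63, which is the second of four descriptors. The new placement picks index 65.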
#
2fba7dc5 |
|
31-Jan-2023 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: Add support for XDP multi-buffer on Rx side Ice driver needs to be a bit reworked on Rx data path in order to support multi-buffer XDP. For skb path, it currently works in a way that Rx ring carries pointer to skb so if driver didn't manage to combine fragmented frame at current NAPI instance, it can restore the state on next instance and keep looking for last fragment (so descriptor with EOP bit set). What needs to be achieved is that xdp_buff needs to be combined in such way (linear + frags part) in the first place. Then skb will be ready to go in case of XDP_PASS or BPF program being not present on interface. If BPF program is there, it would work on multi-buffer XDP. At this point xdp_buff resides directly on Rx ring, so given the fact that skb will be built straight from xdp_buff, there will be no further need to carry skb on Rx ring. Besides removing skb pointer from Rx ring, lots of members have been moved around within ice_rx_ring. First and foremost reason was to place rx_buf with xdp_buff on the same cacheline. This means that once we touch rx_buf (which is a preceding step before touching xdp_buff), xdp_buff will already be hot in cache. Second thing was that xdp_rxq is used rather rarely and it occupies a separate cacheline, so maybe it is better to have it at the end of ice_rx_ring. Other change that affects ice_rx_ring is the introduction of ice_rx_ring::first_desc. Its purpose is twofold - first is to propagate rx_buf->act to all the parts of current xdp_buff after running XDP program, so that ice_put_rx_buf() that got moved out of the main Rx processing loop will be able to take an appropriate action on each buffer. Second is for ice_construct_skb(). ice_construct_skb() has a copybreak mechanism which had an explicit impact on xdp_buff->skb conversion in the new approach when legacy Rx flag is toggled. It works in a way that linear part is 256 bytes long, if frame is bigger than that, remaining bytes are going as a frag to skb_shared_info. 
This means while memcpying frags from xdp_buff to newly allocated skb, care needs to be taken when picking the destination frag array entry. Upon the time ice_construct_skb() is called, when dealing with fragmented frame, current rx_buf points to the *last* fragment, but copybreak needs to be done against the first one. That's where ice_rx_ring::first_desc helps. When frame building spans across NAPI polls (DD bit is not set on current descriptor and xdp->data is not NULL) with current Rx buffer handling state there might be some problems. Since calls to ice_put_rx_buf() were pulled out of the main Rx processing loop and were scoped from cached_ntc to current ntc, remember that now mentioned function relies on rx_buf->act, which is set within ice_run_xdp(). ice_run_xdp() is called when EOP bit was found, so currently we could put Rx buffer with rx_buf->act being *uninitialized*. To address this, change scoping to rely on first_desc on both boundaries instead. This also implies that cleaned_count which is used as an input to ice_alloc_rx_buffers() and tells how many new buffers should be refilled has to be adjusted. If it stayed as is, what could happen is a case where ntc would go over ntu. Therefore, remove cleaned_count altogether and use against allocing routine newly introduced ICE_RX_DESC_UNUSED() macro which is an equivalent of ICE_DESC_UNUSED() dedicated for Rx side and based on struct ice_rx_ring::first_desc instead of next_to_clean. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-11-maciej.fijalkowski@intel.com
|
#
1dc1a7e7 |
|
31-Jan-2023 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: Centrallize Rx buffer recycling Currently calls to ice_put_rx_buf() are sprinkled through ice_clean_rx_irq() - first place is for explicit flow director's descriptor handling, second is after running XDP prog and the last one is after taking care of skb. 1st callsite was actually only for ntc bump purpose, as Rx buffer to be recycled is not even passed to a function. It is possible to walk through Rx buffers processed in particular NAPI cycle by caching ntc from beginning of the ice_clean_rx_irq(). To do so, let us store XDP verdict inside ice_rx_buf, so action we need to take on will be known. For XDP prog absence, just store ICE_XDP_PASS as a verdict. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-7-maciej.fijalkowski@intel.com
|
#
ac075339 |
|
31-Jan-2023 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: Store page count inside ice_rx_buf This will allow us to avoid carrying additional auxiliary array of page counts when dealing with XDP multi buffer support. Previously combining fragmented frame to skb was not affected in the same way as XDP would be as whole frame is needed to be in place before executing XDP prog. Therefore, when going through HW Rx descriptors one-by-one, calls to ice_put_rx_buf() need to be taken *after* running XDP prog on a potentially multi buffered frame, so some additional storage of page count is needed. By adding page count to rx buf, it will make it easier to walk through processed entries at the end of rx cleaning routine and decide whether or not buffers should be recycled. While at it, bump ice_rx_buf::pagecnt_bias from u16 up to u32. It was proven many times that calculations on variables smaller than standard register size are harmful. This was also the case during experiments with embedding page count to ice_rx_buf - when this was added as u16 it had a performance impact. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-4-maciej.fijalkowski@intel.com
|
#
cb0473e0 |
|
31-Jan-2023 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: Add xdp_buff to ice_rx_ring struct In preparation for XDP multi-buffer support, let's store xdp_buff on Rx ring struct. This will allow us to combine fragmented frames across separate NAPI cycles in the same way as currently skb fragments are handled. This means that skb pointer on Rx ring will become redundant and will be removed. For now it is kept and layout of Rx ring struct was not inspected, some member movement will be needed later on so that will be the time to take care of it. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-3-maciej.fijalkowski@intel.com
|
#
c61bcebd |
|
31-Jan-2023 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: Prepare legacy-rx for upcoming XDP multi-buffer support Rx path is going to be modified in a way that fragmented frame will be gathered within xdp_buff in the first place. This approach implies that underlying buffer has to provide tailroom for skb_shared_info. This is currently the case when ring uses build_skb but not when legacy-rx knob is turned on. This case configures 2k Rx buffers and has no way to provide either headroom or tailroom - FWIW it currently has XDP_PACKET_HEADROOM which is broken and in here it is removed. 2k Rx buffers were used so driver in this setting was able to support 9k MTU as it can chain up to 5 Rx buffers. With offset configuring HW writing 2k of a data was passing the half of the page which broke the assumption of our internal page recycling tricks. Now if above got fixed and legacy-rx path would be left as is, when referring to skb_shared_info via xdp_get_shared_info_from_buff(), packet's content would be corrupted again. Hence size of Rx buffer needs to be lowered and therefore supported MTU. This operation will allow us to keep the unified data path and with 8k MTU users (if any of legacy-rx) would still be good to go. However, tendency is to drop the support for this code path at some point. Add ICE_RXBUF_1664 as vsi::rx_buf_len and ICE_MAX_FRAME_LEGACY_RX (8320) as vsi::max_frame for legacy-rx. For bigger page sizes configure 3k Rx buffers, not 2k. Since headroom support is removed, disable data_meta support on legacy-rx. When preparing XDP buff, rely on ice_rx_ring::rx_offset setting when deciding whether to support data_meta or not. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-2-maciej.fijalkowski@intel.com
|
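The numbers in this commit fit together arithmetically: five chained 1664-byte buffers give exactly the 8320-byte legacy-rx frame limit. The sketch below restates that with hypothetical `DEMO_*` names; the constants mirror the values quoted in the commit message, and the 2048-byte half-page figure is an assumption based on the 2k buffers the message mentions.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_RXBUF_1664		1664u	/* Rx buffer length quoted in the commit */
#define DEMO_MAX_CHAINED_BUFS	5u	/* buffers the driver can chain per frame */
#define DEMO_HALF_PAGE		2048u	/* assumed: half of a 4k page */

/* Largest frame that fits in nbufs chained buffers of buf_len bytes. */
static inline uint32_t demo_max_frame(uint32_t buf_len, uint32_t nbufs)
{
	return buf_len * nbufs;
}

/* Buffers needed for a frame of frame_len bytes (rounded up). */
static inline uint32_t demo_bufs_needed(uint32_t frame_len, uint32_t buf_len)
{
	assert(buf_len > 0);
	return (frame_len + buf_len - 1) / buf_len;
}
```

A 9000-byte frame would need six 1664-byte buffers, one more than can be chained, which is why the supported MTU drops. Each 2048-byte half page keeps 2048 - 1664 = 384 bytes that the data no longer occupies.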
#
288ecf49 |
|
18-Nov-2022 |
Benjamin Mikailenko <benjamin.mikailenko@intel.com> |
ice: Accumulate ring statistics over reset Resets may occur with or without user interaction. For example, a TX hang or reconfiguration of parameters will result in a reset. During reset, the VSI is freed, freeing any statistics structures inside as well. This would create an issue for the user where a reset happens in the background, statistics set to zero, and the user checks ring statistics expecting them to be populated. To ensure this doesn't happen, accumulate ring statistics over reset. Define a new ring statistics structure, ice_ring_stats. The new structure lives in the VSI's parent, preserving ring statistics when VSI is freed. 1. Define a new structure vsi_ring_stats in the PF scope 2. Allocate/free stats only during probe, unload, or change in ring size 3. Replace previous ring statistics functionality with new structure Signed-off-by: Benjamin Mikailenko <benjamin.mikailenko@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
#
dddd406d |
|
27-Jul-2022 |
Jesse Brandeburg <jesse.brandeburg@intel.com> |
ice: Implement control of FCS/CRC stripping The driver can allow the user to configure whether the CRC aka the FCS (Frame Check Sequence) is DMA'd to the host as part of the receive buffer. The driver usually wants this feature disabled so that the hardware checks the FCS and strips it in order to save PCI bandwidth. Control the reception of FCS to the host using the command: ethtool -K eth0 rx-fcs <on|off> The default shown in ethtool -k eth0 | grep fcs; should be "off", as the hardware will drop any frame with a bad checksum, and DMA of the checksum is useless overhead especially for small packets. Testing Hints: test the FCS/CRC arrives with received packets using tcpdump -nnpi eth0 -xxxx and it should show crc data as the last 4 bytes of the packet. Can also use wireshark to turn on CRC checking and check the data is correct. Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Co-developed-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Co-developed-by: Benjamin Mikailenko <benjamin.mikailenko@intel.com> Signed-off-by: Benjamin Mikailenko <benjamin.mikailenko@intel.com> Co-developed-by: Anatolii Gerasymenko <anatolii.gerasymenko@intel.com> Signed-off-by: Anatolii Gerasymenko <anatolii.gerasymenko@intel.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
#
50ae0664 |
|
13-Apr-2022 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice, xsk: Terminate Rx side of NAPI when XSK Rx queue gets full When XSK pool uses need_wakeup feature, correlate -ENOBUFS that was returned from xdp_do_redirect() with a XSK Rx queue being full. In such case, terminate the Rx processing that is being done on the current HW Rx ring and let the user space consume descriptors from XSK Rx queue so that there is room that driver can use later on. Introduce new internal return code ICE_XDP_EXIT that will indicate case described above. Note that it does not affect Tx processing that is bound to the same NAPI context, nor the other Rx rings. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220413153015.453864-6-maciej.fijalkowski@intel.com
|
#
bf13502e |
|
08-May-2022 |
Michal Wilczynski <michal.wilczynski@intel.com> |
ice: Fix interrupt moderation settings getting cleared Adaptive-rx and Adaptive-tx are interrupt moderation settings that can be enabled/disabled using ethtool: ethtool -C ethX adaptive-rx on/off adaptive-tx on/off Unfortunately those settings are getting cleared after changing number of queues, or in ethtool world 'channels': ethtool -L ethX rx 1 tx 1 Clearing was happening due to introduction of bit fields in ice_ring_container struct. This way only itr_setting bits were rebuilt during ice_vsi_rebuild_set_coalesce(). Introduce an anonymous struct of bitfields and create a union to refer to them as a single variable. This way variable can be easily saved and restored. Fixes: 61dc79ced7aa ("ice: Restore interrupt throttle settings after VSI rebuild") Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
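The union-over-bitfields pattern this fix relies on can be shown in a few lines. This is a sketch, not the driver's definition: the `demo_*` names are hypothetical and the field widths are illustrative. It needs C11 anonymous structs and unions, and relies on the compiler packing the three bitfields into one 16-bit unit, as GCC and Clang do.

```c
#include <assert.h>
#include <stdint.h>

/* Bitfields and a plain 16-bit view of the same storage. */
struct demo_ring_container {
	union {
		struct {
			uint16_t itr_setting:13;
			uint16_t itr_reserved:2;
			uint16_t itr_mode:1;
		};
		uint16_t itr_settings;	/* all of the above as one variable */
	};
};

static_assert(sizeof(struct demo_ring_container) == sizeof(uint16_t),
	      "bitfields must fit the 16-bit view");

/* Save every bitfield at once, e.g. before a rebuild. */
static inline uint16_t demo_save(const struct demo_ring_container *rc)
{
	return rc->itr_settings;
}

/* Restore every bitfield at once, so none of them is forgotten. */
static inline void demo_restore(struct demo_ring_container *rc, uint16_t saved)
{
	rc->itr_settings = saved;
}
```

Saving and restoring through the whole-word view carries every bitfield, including the mode bit that a field-by-field copy of only `itr_setting` would miss.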
#
0d54d8f7 |
|
02-Dec-2021 |
Brett Creeley <brett.creeley@intel.com> |
ice: Add hot path support for 802.1Q and 802.1ad VLAN offloads Currently the driver only supports 802.1Q VLAN insertion and stripping. However, once Double VLAN Mode (DVM) is fully supported, then both 802.1Q and 802.1ad VLAN insertion and stripping will be supported. Unfortunately the VSI context parameters only allow for one VLAN ethertype at a time for VLAN offloads so only one or the other VLAN ethertype offload can be supported at once. To support this, multiple changes are needed. Rx path changes: [1] In DVM, the Rx queue context l2tagsel field needs to be cleared so the outermost tag shows up in the l2tag2_2nd field of the Rx flex descriptor. In Single VLAN Mode (SVM), the l2tagsel field should remain 1 to support SVM configurations. [2] Modify the ice_test_staterr() function to take a __le16 instead of the ice_32b_rx_flex_desc union pointer so this function can be used for both rx_desc->wb.status_error0 and rx_desc->wb.status_error1. [3] Add the new inline function ice_get_vlan_tag_from_rx_desc() that checks if there is a VLAN tag in l2tag1 or l2tag2_2nd. [4] In ice_receive_skb(), add a check to see if NETIF_F_HW_VLAN_STAG_RX is enabled in netdev->features. If it is, then this is the VLAN ethertype that needs to be added to the stripping VLAN tag. Since ice_fix_features() prevents CTAG_RX and STAG_RX from being enabled simultaneously, the VLAN ethertype will only ever be 802.1Q or 802.1ad. Tx path changes: [1] In DVM, the VLAN tag needs to be placed in the l2tag2 field of the Tx context descriptor. The new define ICE_TX_FLAGS_HW_OUTER_SINGLE_VLAN was added to the list of tx_flags to handle this case. [2] When the stack requests the VLAN tag to be offloaded on Tx, the driver needs to set either ICE_TX_FLAGS_HW_OUTER_SINGLE_VLAN or ICE_TX_FLAGS_HW_VLAN, so the tag is inserted in l2tag2 or l2tag1 respectively. 
To determine which location to use, set a bit in the Tx ring flags field during ring allocation that can be used to determine which field to use in the Tx descriptor. In DVM, always use l2tag2, and in SVM, always use l2tag1. Signed-off-by: Brett Creeley <brett.creeley@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
#
59e92bfe |
|
25-Jan-2022 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: xsk: Borrow xdp_tx_active logic from i40e One of the things that commit 5574ff7b7b3d ("i40e: optimize AF_XDP Tx completion path") introduced was the @xdp_tx_active field. Its usage from i40e can be adjusted to ice driver and give us positive performance results. If the descriptor that @next_dd points to has been sent by HW (its DD bit is set), then we are sure that at least quarter of the ring is ready to be cleaned. If @xdp_tx_active is 0 which means that related xdp_ring is not used for XDP_{TX, REDIRECT} workloads, then we know how many XSK entries should be placed to completion queue, IOW walking through the ring can be skipped. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220125160446.78976-9-maciej.fijalkowski@intel.com
|
#
126cdfe1 |
|
25-Jan-2022 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: xsk: Improve AF_XDP ZC Tx and use batching API Apply the logic that was done for regular XDP from commit 9610bd988df9 ("ice: optimize XDP_TX workloads") to the ZC side of the driver. On top of that, introduce batching to Tx that is inspired by i40e's implementation with adjustments to the cleaning logic - take into the account NAPI budget in ice_clean_xdp_irq_zc(). Separating the stats structs onto separate cache lines seemed to improve the performance. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20220125160446.78976-8-maciej.fijalkowski@intel.com
|
#
3dd411ef |
|
25-Jan-2022 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: Make Tx threshold dependent on ring length XDP_TX workloads use a concept of Tx threshold that indicates the interval of setting RS bit on descriptors which in turn tells the HW to generate an interrupt to signal the completion of Tx on HW side. It is currently based on a constant value of 32 which might not work out well for various sizes of ring combined with for example batch size that can be set via SO_BUSY_POLL_BUDGET. Internal tests based on AF_XDP showed that most convenient setup of mentioned threshold is when it is equal to quarter of a ring length. Make use of recently introduced ICE_RING_QUARTER macro and use this value as a substitute for ICE_TX_THRESH. Align also ethtool -G callback so that next_dd/next_rs fields are up to date in terms of the ring size. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220125160446.78976-5-maciej.fijalkowski@intel.com
|
#
3876ff52 |
|
25-Jan-2022 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: xsk: Handle SW XDP ring wrap and bump tail more often Currently, if ice_clean_rx_irq_zc() processed the whole ring and next_to_use != 0, then ice_alloc_rx_buf_zc() would not refill the whole ring even if the XSK buffer pool would have enough free entries (either from fill ring or the internal recycle mechanism) - it is because ring wrap is not handled. Improve the logic in ice_alloc_rx_buf_zc() to address the problem above. Do not clamp the count of buffers that is passed to xsk_buff_alloc_batch() in case when next_to_use + buffer count >= rx_ring->count, but rather split it and have two calls to the mentioned function - one for the part up until the wrap and one for the part after the wrap. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20220125160446.78976-4-maciej.fijalkowski@intel.com
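The split around the ring wrap described in this message can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the driver code: `refill_with_wrap()` and `alloc_batch_stub()` are hypothetical names, and the stub merely stands in for a batched allocator such as xsk_buff_alloc_batch(). The caller is assumed to pass `ntu < ring_size` and `count <= ring_size`.

```c
#include <stdint.h>

/* Hypothetical stand-in for a batched allocator such as
 * xsk_buff_alloc_batch(): fills up to `max` slots and returns how many
 * it filled. This stub always succeeds and stores a dummy value.
 */
static uint32_t alloc_batch_stub(int *slots, uint32_t max)
{
	for (uint32_t i = 0; i < max; i++)
		slots[i] = 1;
	return max;
}

/* Refill `count` slots of a ring with `ring_size` entries, starting at
 * `ntu` (next_to_use). A request that reaches the end of the ring is
 * split into a part up to the wrap and a part after it, instead of
 * being clamped. Returns the new next_to_use; *filled receives the
 * number of slots that were filled.
 */
static uint32_t refill_with_wrap(int *ring, uint32_t ring_size,
				 uint32_t ntu, uint32_t count,
				 uint32_t *filled)
{
	uint32_t first = count;
	uint32_t done;

	if (ntu + count >= ring_size)
		first = ring_size - ntu;	/* part up to the wrap */

	done = alloc_batch_stub(&ring[ntu], first);
	ntu += done;
	if (ntu == ring_size)
		ntu = 0;

	/* Continue after the wrap only if the first part was fully served. */
	if (done == first && count > first) {
		uint32_t second = alloc_batch_stub(&ring[0], count - first);

		done += second;
		ntu = second;
	}

	*filled = done;
	return ntu;
}
```

With an 8-entry ring, `ntu = 6` and `count = 5`, the first call covers slots 6 and 7 and the second covers slots 0 to 2, so the whole request is served rather than being cut off at the end of the ring.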
|
#
dcbaf72a |
|
13-Dec-2021 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: xsk: fix cleaned_count setting Currently cleaned_count is initialized to ICE_DESC_UNUSED(rx_ring) and later on during the Rx processing it is incremented per each frame that driver consumed. This can result in excessive buffers requested from xsk pool based on that value. To address this, just drop cleaned_count and pass ICE_DESC_UNUSED(rx_ring) directly as a function argument to ice_alloc_rx_bufs_zc(). Idea is to ask for buffers as many as consumed. Let us also call ice_alloc_rx_bufs_zc unconditionally at the end of ice_clean_rx_irq_zc. This has been changed in that way for corresponding ice_clean_rx_irq, but not here. Fixes: 2d4238f55697 ("ice: Add support for AF_XDP") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
#
0754d65b |
|
15-Oct-2021 |
Kiran Patil <kiran.patil@intel.com> |
ice: Add infrastructure for mqprio support via ndo_setup_tc Add infrastructure required for "ndo_setup_tc:qdisc_mqprio". ice_vsi_setup is modified to configure traffic classes based on mqprio data received from the stack. This includes low-level functions to configure min, max rate-limit parameters in hardware for traffic classes. Each traffic class gets mapped to a hardware channel (VSI) which can be individually configured with different bandwidth parameters. Co-developed-by: Tarun Singh <tarun.k.singh@intel.com> Signed-off-by: Tarun Singh <tarun.k.singh@intel.com> Signed-off-by: Kiran Patil <kiran.patil@intel.com> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com> Tested-by: Bharathi Sreenivas <bharathi.sreenivas@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
#
22bf877e |
|
19-Aug-2021 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: introduce XDP_TX fallback path Under rare circumstances there might be a situation where a requirement of having XDP Tx queue per CPU could not be fulfilled and some of the Tx resources have to be shared between CPUs. This yields a need for placing accesses to xdp_ring inside a critical section protected by spinlock. These accesses happen to be in the hot path, so let's introduce the static branch that will be triggered from the control plane when driver could not provide Tx queue dedicated for XDP on each CPU. Currently, the design that has been picked is to allow any number of XDP Tx queues that is at least half of a count of CPUs that platform has. For lower number driver will bail out with a response to user that there were not enough Tx resources that would allow configuring XDP. The sharing of rings is signalled via static branch enablement which in turn indicates that lock for xdp_ring accesses needs to be taken in hot path. Approach based on static branch has no impact on performance of a non-fallback path. One thing that is needed to be mentioned is a fact that the static branch will act as a global driver switch, meaning that if one PF got out of Tx resources, then other PFs that ice driver is servicing will suffer. However, given the fact that HW that ice driver is handling has 1024 Tx queues per each PF, this is currently an unlikely scenario. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
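The queue-count policy in this message can be illustrated with a small userspace sketch. All names here (`pick_xdp_txq_mode()`, `shared_ring_index()`, the enum values) are hypothetical. The integer rounding of "half of the CPU count" and the modulo mapping of CPUs to shared rings are assumptions made for illustration; the message does not spell them out. The static branch and the spinlock themselves are kernel facilities and are not modelled.

```c
#include <stdint.h>

/* Possible outcomes when XDP asks for Tx queues (hypothetical names). */
enum xdp_txq_mode {
	XDP_TXQ_REJECT,		/* fewer than half of the CPU count: bail out */
	XDP_TXQ_SHARED,		/* rings shared between CPUs, lock needed */
	XDP_TXQ_PER_CPU,	/* one dedicated ring per CPU, no lock */
};

/* Decide the mode from the number of Tx queues that can be dedicated to
 * XDP and the number of CPUs. Assumption: "half" uses integer division.
 */
static enum xdp_txq_mode pick_xdp_txq_mode(uint32_t avail_txq,
					   uint32_t num_cpus)
{
	if (avail_txq >= num_cpus)
		return XDP_TXQ_PER_CPU;
	if (avail_txq >= num_cpus / 2)
		return XDP_TXQ_SHARED;
	return XDP_TXQ_REJECT;
}

/* Assumed mapping of a CPU to a ring when rings are shared. */
static uint32_t shared_ring_index(uint32_t cpu, uint32_t num_rings)
{
	return cpu % num_rings;
}
```

In the shared case every access to the selected ring would have to happen under that ring's lock, which is the cost the static branch keeps out of the non-fallback path.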
|
#
9610bd98 |
|
19-Aug-2021 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: optimize XDP_TX workloads Optimize Tx descriptor cleaning for XDP. Current approach doesn't really scale and chokes when multiple flows are handled. Introduce two ring fields, @next_dd and @next_rs that will keep track of descriptor that should be looked at when the need for cleaning arise and the descriptor that should have the RS bit set, respectively. Note that at this point the threshold is a constant (32), but it is something that we could make configurable. First thing is to get away from setting RS bit on each descriptor. Let's do this only once NTU is higher than the currently @next_rs value. In such case, grab the tx_desc[next_rs], set the RS bit in descriptor and advance the @next_rs by a 32. Second thing is to clean the Tx ring only when there are less than 32 free entries. For that case, look up the tx_desc[next_dd] for a DD bit. This bit is written back by HW to let the driver know that xmit was successful. It will happen only for those descriptors that had RS bit set. Clean only 32 descriptors and advance the DD bit. Actual cleaning routine is moved from ice_napi_poll() down to the ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any SKBs in there that would rely on interrupts for the cleaning. Nice side effect is that for rare case of Tx fallback path (that next patch is going to introduce) we don't have to trigger the SW irq to clean the ring. With those two concepts, ring is kept at being almost full, but it is guaranteed that driver will be able to produce Tx descriptors. This approach seems to work out well even though the Tx descriptors are produced in one-by-one manner. Test was conducted with the ice HW bombarded with packets from HW generator, configured to generate 30 flows. 
Xdp2 sample yields the following results:
<snip>
proto 17: 79973066 pkt/s
proto 17: 80018911 pkt/s
proto 17: 80004654 pkt/s
proto 17: 79992395 pkt/s
proto 17: 79975162 pkt/s
proto 17: 79955054 pkt/s
proto 17: 79869168 pkt/s
proto 17: 79823947 pkt/s
proto 17: 79636971 pkt/s
</snip>
As that sample reports the Rx'ed frames, let's look at sar output. It says that what we Rx'ed we do actually Tx, no noticeable drops.
Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
Average: ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00 0.00 0.00 38.32
with tx_busy staying calm. When compared to a state before:
Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
Average: ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00 0.00 0.00 43.64
it can be observed that the amount of txpck/s is almost doubled, meaning that the performance is improved by around 90%. All of this due to the drops in the driver, previously the tx_busy stat was bumped at a 7mpps rate. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
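The @next_rs / @next_dd bookkeeping from this message can be modelled in userspace C. This is a simplified sketch, not the driver code: the structure, the function names, the explicit `free` counter and the `hw_complete()` helper that plays the role of hardware are all hypothetical. The threshold of 32 comes from the message; the ring size of 128 is an arbitrary choice for the model. The sketch sets RS when the descriptor at next_rs has just been written, which corresponds to next_to_use moving past next_rs.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 128u	/* arbitrary size for this model */
#define TX_THRESH 32u	/* constant threshold named in the message */

struct desc {
	bool rs;	/* driver asks HW to report completion */
	bool dd;	/* HW wrote back "descriptor done" */
};

struct tx_ring {
	struct desc d[RING_SIZE];
	uint32_t ntu;		/* next_to_use */
	uint32_t next_rs;	/* next descriptor that gets the RS bit */
	uint32_t next_dd;	/* descriptor to poll for the DD bit */
	uint32_t free;		/* unused descriptors (model-only counter) */
};

static void ring_init(struct tx_ring *r)
{
	*r = (struct tx_ring){0};
	r->next_rs = TX_THRESH - 1;
	r->next_dd = TX_THRESH - 1;
	r->free = RING_SIZE;
}

/* Clean one batch of TX_THRESH descriptors if HW reported completion. */
static void ring_clean(struct tx_ring *r)
{
	if (!r->d[r->next_dd].dd)
		return;
	r->d[r->next_dd].dd = false;
	r->d[r->next_dd].rs = false;
	r->free += TX_THRESH;
	r->next_dd = (r->next_dd + TX_THRESH) % RING_SIZE;
}

/* Produce one descriptor; returns false when the ring is busy. */
static bool ring_xmit(struct tx_ring *r)
{
	uint32_t used;

	if (r->free < TX_THRESH)	/* clean only when running low */
		ring_clean(r);
	if (r->free == 0)
		return false;

	used = r->ntu;
	r->d[used].rs = false;
	r->d[used].dd = false;
	r->ntu = (r->ntu + 1) % RING_SIZE;
	r->free--;

	if (used == r->next_rs) {	/* RS once per TX_THRESH descriptors */
		r->d[used].rs = true;
		r->next_rs = (r->next_rs + TX_THRESH) % RING_SIZE;
	}
	return true;
}

/* Model of HW: completes every descriptor that carries the RS bit. */
static void hw_complete(struct tx_ring *r)
{
	for (uint32_t i = 0; i < RING_SIZE; i++)
		if (r->d[i].rs)
			r->d[i].dd = true;
}
```

In this model only one descriptor in 32 carries the RS bit, and a single DD check frees 32 entries at once, which is why the ring can stay almost full while the producer still makes progress.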
|
#
eb087cd8 |
|
19-Aug-2021 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: propagate xdp_ring onto rx_ring With rings being split, it is now convenient to introduce a pointer to XDP ring within the Rx ring. For XDP_TX workloads this means that xdp_rings array access will be skipped, which was executed per each processed frame. Also, read the XDP prog once per NAPI and if prog is present, set up the local xdp_ring pointer. Reading prog a single time was discussed in [1] with some concern raised by Toke around dispatcher handling and having the need for going through the RCU grace period in the ndo_bpf driver callback, but ice currently is tearing down NAPI instances regardless of the prog presence on VSI. Although the pointer to XDP ring introduced to Rx ring makes things a lot slimmer/simpler, I still feel that single prog read per NAPI lifetime is beneficial. Further patch that will introduce the fallback path will also get a profit from that as xdp_ring pointer will be set during the XDP rings setup. [1]: https://lore.kernel.org/bpf/87k0oseo6e.fsf@toke.dk/ Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
#
e72bba21 |
|
19-Aug-2021 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: split ice_ring onto Tx/Rx separate structs While it was convenient to have a generic ring structure that served both Tx and Rx sides, next commits are going to introduce several Tx-specific fields, so in order to avoid hurting the Rx side, let's pull out the Tx ring onto new ice_tx_ring and ice_rx_ring structs. Rx ring could be handled by the old ice_ring which would reduce the code churn within this patch, but this would make things asymmetric. Make the union out of the ring container within ice_q_vector so that it is possible to iterate over newly introduced ice_tx_ring. Remove the @size as it's only accessed from control path and it can be calculated pretty easily. Change definitions of ice_update_ring_stats and ice_fetch_u64_stats_per_ring so that they are ring agnostic and can be used for both Rx and Tx rings. Sizes of Rx and Tx ring structs are 256 and 192 bytes, respectively. In Rx ring xdp_rxq_info occupies its own cacheline, so it's the major difference now. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
#
dc23715c |
|
19-Aug-2021 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: move ice_container_type onto ice_ring_container Currently ice_container_type is scoped only for ice_ethtool.c. Next commit that will split the ice_ring struct onto Rx/Tx specific ring structs is going to also modify the type of linked list of rings that is within ice_ring_container. Therefore, the functions that are taking the ice_ring_container as an input argument will need to be aware of a ring type that will be looked up. Embed ice_container_type within ice_ring_container and initialize it properly when allocating the q_vectors. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
#
e93d1c37 |
|
19-Aug-2021 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: remove ring_active from ice_ring This field is dead and driver is not making any use of it. Simply remove it. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
#
2a87bd73 |
|
06-Aug-2021 |
Dave Ertman <david.m.ertman@intel.com> |
ice: Add DSCP support Implement code to handle submission of APP TLV's containing DSCP to TC mapping. The first such mapping received on an interface will cause that PF to switch to L3 DSCP QoS mode, apply the default config for that mode, and apply the received mapping. Only one such mapping will be allowed per DSCP value, and when the last DSCP mapping is deleted, the PF will switch back into L2 VLAN QoS mode, applying the appropriate default QoS settings. L3 DSCP QoS mode will only be allowed in SW DCBx mode, in other words, when the FW LLDP engine is disabled. Commands that break this mutual exclusivity will be blocked. Co-developed-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Signed-off-by: Dave Ertman <david.m.ertman@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
#
57f7f8b6 |
|
22-Sep-2021 |
Magnus Karlsson <magnus.karlsson@intel.com> |
ice: Use xdp_buf instead of rx_buf for xsk zero-copy In order to use the new xsk batched buffer allocation interface, a pointer to an array of struct xdp_buff pointers needs to be provided so that the function can put the result of the allocation there. In the ice driver, we already have a ring that stores pointers to xdp_buffs. This is only used for the xsk zero-copy driver and is a union with the structure that is used for the regular non zero-copy path. Unfortunately, that structure is larger than the xdp_buffs pointers which means that there will be a stride (of 20 bytes) between each xdp_buff pointer. And feeding this into the xsk_buff_alloc_batch interface will not work since it assumes a regular array of xdp_buff pointers (each 8 bytes with 0 bytes in-between them on a 64-bit system). To fix this, remove the xdp_buff pointer from the rx_buf union and move it one step higher to the union above which only has pointers to arrays in it. This solves the problem and we can directly feed the SW ring of xdp_buff pointers straight into the allocation function in the next patch when that interface is used. This will improve performance. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210922075613.12186-4-magnus.karlsson@gmail.com
|
#
ea9b847c |
|
09-Jun-2021 |
Jacob Keller <jacob.e.keller@intel.com> |
ice: enable transmit timestamps for E810 devices Add support for enabling Tx timestamp requests for outgoing packets on E810 devices. The ice hardware can support multiple outstanding Tx timestamp requests. When sending a descriptor to hardware, a Tx timestamp request is made by setting a request bit, and assigning an index that represents which Tx timestamp index to store the timestamp in. Hardware makes no effort to synchronize the index use, so it is up to software to ensure that Tx timestamp indexes are not re-used before the timestamp is reported back. To do this, introduce a Tx timestamp tracker which will keep track of currently in-use indexes. In the hot path, if a packet has a timestamp request, an index will be requested from the tracker. Unfortunately, this does require a lock as the indexes are shared across all queues on a PHY. There are not enough indexes to reliably assign only 1 to each queue. For the E810 devices, the timestamp indexes are not shared across PHYs, so each port can have its own tracking. Once hardware captures a timestamp, an interrupt is fired. In this interrupt, trigger a new work item that will figure out which timestamp was completed, and report the timestamp back to the stack. This function loops through the Tx timestamp indexes and checks whether there is now a valid timestamp. If so, it clears the PHY timestamp indication in the PHY memory, locks and removes the SKB and bit in the tracker, then reports the timestamp to the stack. It is possible in some cases that a timestamp request will be initiated but never completed. This might occur if the packet is dropped by software or hardware before it reaches the PHY. Add a task to the periodic work function that will check whether a timestamp request is more than a few seconds old. If so, the timestamp index is cleared in the PHY, and the SKB is released. 
Just as with Rx timestamps, the Tx timestamps are only 40 bits wide, and use the same overall logic for extending to 64 bits of nanoseconds. With this change, E810 devices should be able to perform basic PTP functionality. Future changes will extend the support to cover the E822-based devices. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
#
77a78115 |
|
09-Jun-2021 |
Jacob Keller <jacob.e.keller@intel.com> |
ice: enable receive hardware timestamping Add SIOCGHWTSTAMP and SIOCSHWTSTAMP ioctl handlers to respond to requests to enable timestamping support. If the request is for enabling Rx timestamps, set a bit in the Rx descriptors to indicate that receive timestamps should be reported. Hardware captures receive timestamps in the PHY which only captures part of the timer, and reports only 40 bits into the Rx descriptor. The upper 32 bits represent the contents of GLTSYN_TIME_L at the point of packet reception, while the lower 8 bits represent the upper 8 bits of GLTSYN_TIME_0. The networking and PTP stack expect 64 bit timestamps in nanoseconds. To support this, implement some logic to extend the timestamps by using the full PHC time. If the Rx timestamp was captured prior to the PHC time, then the real timestamp is
PHC - (lower_32_bits(PHC) - timestamp)
If the Rx timestamp was captured after the PHC time, then the real timestamp is
PHC + (timestamp - lower_32_bits(PHC))
These calculations are correct as long as neither the PHC timestamp nor the Rx timestamps are more than 2^32-1 nanoseconds old. Further, we can detect when the Rx timestamp is before or after the PHC as long as the PHC timestamp is no more than 2^31-1 nanoseconds old. In that case, we calculate the delta between the lower 32 bits of the PHC and the Rx timestamp. If it's larger than 2^31-1 then the Rx timestamp must have been captured in the past. If it's smaller, then the Rx timestamp must have been captured after PHC time. Add an ice_ptp_extend_32b_ts function that relies on a cached copy of the PHC time and implements this algorithm to calculate the proper upper 32bits of the Rx timestamps. Cache the PHC time periodically in all of the Rx rings. This enables each Rx ring to simply call the extension function with a recent copy of the PHC time. By ensuring that the PHC time is kept up to date periodically, we ensure this algorithm doesn't use stale data and produce incorrect results. 
To cache the time, introduce a kworker and a kwork item to periodically store the Rx time. It might seem like we should use the .do_aux_work interface of the PTP clock. This doesn't work because all PFs must cache this time, but only one PF owns the PTP clock device. Thus, the ice driver will manage its own kthread instead of relying on the PTP do_aux_work handler. With this change, the driver can now report Rx timestamps on all incoming packets. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
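The extension algorithm described in this message can be sketched in userspace C as follows. This is a minimal sketch written from the description above, not a copy of ice_ptp_extend_32b_ts(); the names `extend_32b_ts()` and `ts40_to_ns32()` are hypothetical, and the 40-bit layout (nanosecond part in the upper 32 bits) is taken from the message.

```c
#include <stdint.h>

/* Drop the low 8 bits of a 40-bit hardware timestamp, keeping the
 * 32-bit part that, per the message, mirrors the low word of the timer.
 */
static uint32_t ts40_to_ns32(uint64_t ts40)
{
	return (uint32_t)(ts40 >> 8);
}

/* Extend a 32-bit timestamp to 64 bits using a cached copy of the full
 * PHC time. The unsigned 32-bit delta decides the direction: a delta
 * above 2^31-1 means the timestamp was captured before the cached time.
 */
static uint64_t extend_32b_ts(uint64_t cached_phc_time, uint32_t in_tstamp)
{
	uint32_t phc_lo = (uint32_t)cached_phc_time;
	uint32_t delta = in_tstamp - phc_lo;	/* wraps modulo 2^32 */

	if (delta > UINT32_MAX / 2) {
		/* Captured in the past: PHC - (lower_32_bits(PHC) - ts) */
		delta = phc_lo - in_tstamp;
		return cached_phc_time - delta;
	}

	/* Captured after the cached time: PHC + (ts - lower_32_bits(PHC)) */
	return cached_phc_time + delta;
}
```

For a cached PHC time of 0x100000100, a timestamp of 0x200 extends forward to 0x100000200, while a timestamp of 0xFFFFFF00 is recognised as lying before the 32-bit rollover and extends to 0xFFFFFF00. The result is only meaningful while the cached time is recent, which is why the message has the time cached periodically.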
|
#
d59684a0 |
|
31-Mar-2021 |
Jesse Brandeburg <jesse.brandeburg@intel.com> |
ice: refactor ITR data structures Use a dedicated bitfield in order to both increase the amount of checking around the length of ITR writes as well as simplify the checks of dynamic mode. Basically unpack the "high bit means dynamic" logic into bitfields. Also, remove some unused ITR defines. Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
#
cdf1f1f1 |
|
31-Mar-2021 |
Jacob Keller <jacob.e.keller@intel.com> |
ice: replace custom AIM algorithm with kernel's DIM library The ice driver has support for adaptive interrupt moderation, an algorithm for tuning the interrupt rate dynamically. This algorithm is based on various assumptions about ring size, socket buffer size, link speed, SKB overhead, ethernet frame overhead and more. The Linux kernel has support for a dynamic interrupt moderation algorithm known as "dimlib". Replace the custom driver-specific implementation of dynamic interrupt moderation with the kernel's algorithm. The Intel hardware has a different hardware implementation than the originators of the dimlib code had to work with, which requires the driver to use a slightly different set of inputs for the actual moderation values, while getting all the advice from dimlib of better/worse, shift left or right. The change made for this implementation is to use a pair of values for each of the 5 "slots" that the dimlib moderation expects, and the driver will program those pairs when dimlib recommends a slot to use. The current implementation uses two tables, one for receive and one for transmit, and the pairs of values in each slot set the maximum delay of an interrupt and a maximum number of interrupts per second (both expressed in microseconds). There are two separate kinds of bugs fixed by using DIMLIB, one is UDP single stream send was too slow, and the other is that 8K ping-pong was going to the most aggressive moderation and has much too high latency. The overall result of using DIMLIB is that we meet or exceed our performance expectations set based on the old algorithm. Co-developed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
#
51fe27e1 |
|
25-Mar-2021 |
Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> |
ice: Remove rx_gro_dropped stat Tracking of the rx_gro_dropped statistic was removed in commit f73fc40327c0 ("ice: drop dead code in ice_receive_skb()"). Remove the associated variables and its reporting to ethtool stats. Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
#
2ec56385 |
|
02-Mar-2021 |
Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com> |
ice: handle increasing Tx or Rx ring sizes There is an issue when the Tx or Rx ring size increases using 'ethtool -L ...' where the new rings don't get the correct ITR values because when we rebuild the VSI we don't know that some of the rings may be new. Fix this by looking at the original number of rings and determining if the rings in ice_vsi_rebuild_set_coalesce() were not present in the original rings received in ice_vsi_rebuild_get_coalesce(). Also change the code to return an error if we can't allocate memory for the coalesce data in ice_vsi_rebuild(). Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
#
634da4c1 |
|
02-Mar-2021 |
Benita Bose <benita.bose@intel.com> |
ice: Add Support for XPS Enable and configure XPS. The driver code implemented sets up the Transmit Packet Steering Map, which in turn will be used by the kernel in queue selection during Tx. Signed-off-by: Benita Bose <benita.bose@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
#
f1b1f409 |
|
18-Jan-2021 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: store the result of ice_rx_offset() onto ice_ring Output of ice_rx_offset() is based on ethtool's priv flag setting, which when changed, causes PF reset (disables napi, frees irqs, loads different Rx mem model, etc.). This means that within napi its result is constant and there is no reason to call it per each processed frame. Add new 'rx_offset' field to ice_ring that is meant to hold the ice_rx_offset() result and use it within ice_clean_rx_irq(). Furthermore, use it within ice_alloc_mapped_page(). Reviewed-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
#
29b82f2a |
|
18-Jan-2021 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: move skb pointer from rx_buf to rx_ring Similar thing has been done in i40e, as there is no real need for having the sk_buff pointer in each rx_buf. Non-eop frames can be simply handled on that pointer moved upwards to rx_ring. Reviewed-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
#
1d9f7ca3 |
|
20-Nov-2020 |
Jesse Brandeburg <jesse.brandeburg@intel.com> |
ice: fix writeback enable logic The writeback enable logic was incorrectly implemented (due to misunderstanding what the side effects of the implementation would be during polling). Fix this logic issue, while implementing a new feature allowing the user to control the writeback frequency using the knobs for controlling interrupt throttling that we already have. Basically if you leave adaptive interrupts enabled, the writeback frequency will be varied even if busy_polling or if napi-poll is in use. If the interrupt rates are set to a fixed value by ethtool -C and adaptive is off, the driver will allow the user-set interrupt rate to guide how frequently the hardware will complete descriptors to the driver. Effectively the user will get a control over the hardware efficiency, allowing the choice between immediate interrupts or delayed up to a maximum of the interrupt rate, even when interrupts are disabled during polling. Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Co-developed-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Brett Creeley <brett.creeley@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
#
b50f7bca |
|
25-Sep-2020 |
Jesse Brandeburg <jesse.brandeburg@intel.com> |
intel-ethernet: clean up W=1 warnings in kdoc This takes care of all of the trivial W=1 fixes in the Intel Ethernet drivers, which allows developers and maintainers to build more of the networking tree with more complete warning checks. There are three classes of kdoc warnings fixed: - cannot understand function prototype: 'x' - Excess function parameter 'x' description in 'y' - Function parameter or member 'x' not described in 'y' All of the changes were trivial comment updates on function headers. Inspired by Lee Jones' series of wireless work to do the same. Compile tested only, and passes simple test of $ git ls-files *.[ch] | egrep drivers/net/ethernet/intel | \ xargs scripts/kernel-doc -none Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1742b3d5 |
|
28-Aug-2020 |
Magnus Karlsson <magnus.karlsson@intel.com> |
xsk: i40e: ice: ixgbe: mlx5: Pass buffer pool to driver instead of umem Replace the explicit umem reference passed to the driver in AF_XDP zero-copy mode with the buffer pool instead. This in preparation for extending the functionality of the zero-copy mode so that umems can be shared between queues on the same netdev and also between netdevs. In this commit, only an umem reference has been added to the buffer pool struct. But later commits will add other entities to it. These are going to be entities that are different between different queue ids and netdevs even though the umem is shared between them. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Björn Töpel <bjorn.topel@intel.com> Link: https://lore.kernel.org/bpf/1598603189-32145-2-git-send-email-magnus.karlsson@intel.com
|
#
a8fffd7a |
|
29-Jul-2020 |
Jesse Brandeburg <jesse.brandeburg@intel.com> |
ice: add useful statistics Display and count some useful hot-path statistics. The usefulness is as follows: - tx_restart: use to determine if the transmit ring size is too small or if the transmit interrupt rate is too low. - rx_gro_dropped: use to count drops from GRO layer, which previously were completely uncounted when occurring. - tx_busy: use to determine when the driver is miscounting number of descriptors needed for an skb. - tx_timeout: as our other drivers, count the number of times we've reset due to timeout because the kernel only prints a warning once per netdev. Several of these were already counted but not displayed. Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
#
a4c493fe |
|
29-Jul-2020 |
Jesse Brandeburg <jesse.brandeburg@intel.com> |
ice: remove page_reuse statistic The page reuse statistic wasn't even being displayed to the user, even though the driver counted it. Don't waste the struct space and hot-path cycles since the driver doesn't display it. Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
#
22bef5e7 |
|
15-May-2020 |
Jesse Brandeburg <jesse.brandeburg@intel.com> |
ice: fix signed vs unsigned comparisons Fix the remaining signed vs unsigned issues, which appear when compiling with -Werror=sign-compare. Many of these are because there is an external interface that is passing an int to us (which we can't change) but that we (rightfully) store and compare against as an unsigned in our data structures. Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Bruce Allan <bruce.w.allan@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
cac2a27c |
|
11-May-2020 |
Henry Tieman <henry.w.tieman@intel.com> |
ice: Support IPv4 Flow Director filters Support the addition and deletion of IPv4 filters. Supported fields are: src-ip, dst-ip, src-port, and dst-port Supported flow-types are: tcp4, udp4, sctp4, ip4 Example usage: ethtool -N eth0 flow-type tcp4 src-ip 192.168.0.55 dst-ip 172.16.0.55 \ src-port 16 dst-port 12 action 32 Signed-off-by: Henry Tieman <henry.w.tieman@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
148beb61 |
|
11-May-2020 |
Henry Tieman <henry.w.tieman@intel.com> |
ice: Initialize Flow Director resources Flow Director allows for redirection based on ntuple rules. Rules are programmed using the ethtool set-ntuple interface. Supported actions are redirect to queue and drop. Setup the initial framework to process Flow Director filters. Create and allocate resources to manage and program filters to the hardware. Filters are processed via a sideband interface; a control VSI is created to manage communication and process requests through the sideband. Upon allocation of resources, update the hardware tables to accept perfect filters. Signed-off-by: Henry Tieman <henry.w.tieman@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
a4e82a81 |
|
06-May-2020 |
Tony Nguyen <anthony.l.nguyen@intel.com> |
ice: Add support for tunnel offloads Create a boost TCAM entry for each tunnel port in order to get a tunnel PTYPE. Update netdev feature flags and implement the appropriate logic to get and set values for hardware offloads. Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Henry Tieman <henry.w.tieman@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
175fc430 |
|
20-May-2020 |
Björn Töpel <bjorn@kernel.org> |
ice, xsk: Migrate to new MEM_TYPE_XSK_BUFF_POOL Remove MEM_TYPE_ZERO_COPY in favor of the new MEM_TYPE_XSK_BUFF_POOL APIs. v4->v5: Fixed "warning: Excess function parameter 'alloc' description in 'ice_alloc_rx_bufs_zc'" and "warning: Excess function parameter 'xdp' description in 'ice_construct_skb_zc'". (Jakub) Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: intel-wired-lan@lists.osuosl.org Link: https://lore.kernel.org/bpf/20200520192103.355233-10-bjorn.topel@gmail.com
|
#
840f8ad0 |
|
13-Feb-2020 |
Brett Creeley <brett.creeley@intel.com> |
ice: Don't reject odd values of usecs set by user Currently if a user sets an odd [tx|rx]-usecs value through ethtool, the request is denied because the hardware is set to have an ITR granularity of 2us. This caused poor customer experience. Fix this by aligning to a register allowed value, which results in rounding down. Also, print a once per ring container type message to be clear about our intentions. Also, change the ITR_TO_REG define to be the bitwise and of the ITR setting and the ICE_ITR_MASK. This makes the purpose of ITR_TO_REG more obvious. Signed-off-by: Brett Creeley <brett.creeley@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
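The rounding described in this commit can be sketched as a single mask operation. This is a minimal illustration, not the driver's code: the `ex_` names and the mask value are assumptions chosen so that bit 0 is cleared, which matches a 2 us granularity.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the register accepts only multiples of 2 us. */
#define EX_ITR_GRAN_US 2u
/* Hypothetical mask: clears bit 0 and caps the value at 8190 us. */
#define EX_ITR_MASK 0x1FFEu

/* The mask must never let an odd value through. */
static_assert((EX_ITR_MASK & (EX_ITR_GRAN_US - 1u)) == 0,
	      "mask must clear the sub-granularity bits");

/* Round a user-supplied usecs value down to a register-allowed value. */
static inline uint16_t ex_itr_to_reg(uint16_t usecs)
{
	return (uint16_t)(usecs & EX_ITR_MASK);
}
```

With this sketch an odd request such as 5 us is accepted and stored as 4 us instead of being rejected.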
#
4ee656bb |
|
06-Feb-2020 |
Tony Nguyen <anthony.l.nguyen@intel.com> |
ice: Trivial fixes This is a collection of trivial fixes including fixing whitespace, typos, function headers, reverse Christmas tree, etc. Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
61dc79ce |
|
12-Dec-2019 |
Michal Swiatkowski <michal.swiatkowski@intel.com> |
ice: Restore interrupt throttle settings after VSI rebuild After each rebuild driver deallocates q_vectors, so the interrupt throttle rate (ITR) settings get lost. Create a function to save and restore ITR for each queue. If a user increases the number of queues, restore all the previous queue settings for each existing queue, and the additional queues will get the default setting. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
59bb0808 |
|
24-Oct-2019 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: introduce frame padding computation logic Take into account the underlying architecture specific settings and based on that calculate the possible padding that can be supplied. Typically, for x86 and standard MTU size we will end up with 192 bytes of headroom. This is the same behavior as our other drivers have and we can dedicate it for XDP purposes. Furthermore, introduce the Rx ring flag for indicating whether build_skb is used on a particular ring. Based on that, invoke the routines for padding calculation. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
7237f5b0 |
|
24-Oct-2019 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: introduce legacy Rx flag Add an ethtool "legacy-rx" priv flag for toggling the Rx path. This control knob will be mainly used for build_skb usage as well as buffer size/MTU manipulation. This prepares for adding build_skb support by taking care of how we set the values of the max_frame and rx_buf_len fields of struct ice_vsi. Specifically, in this patch the mentioned fields are set to values that will allow us to provide headroom and tailroom in-place. This can be mostly broken down as follows: - for legacy-rx "on" ethtool control knob, old behaviour is kept; - for standard 1500 MTU size configure the buffer of size 1536, as network stack is expecting the NET_SKB_PAD to be provided and NET_IP_ALIGN can have a non-zero value (these can be typically equal to 32 and 2, respectively); - for larger MTUs go with max_frame set to 9k and configure the 3k buffer in case when PAGE_SIZE of underlying arch is less than 8k; 3k buffer is implying the need for order 1 page, so that our page recycling scheme can still be applied; With that said, substitute the hardcoded ICE_RXBUF_2048 and PAGE_SIZE values in DMA API that we're making use of with rx_ring->rx_buf_len and ice_rx_pg_size(rx_ring). The latter is an introduced helper for determining the page size based on its order (which was figured out via ice_rx_pg_order). Last but not least, take care of truesize calculation. In the followup patch the headroom/tailroom computation logic will be introduced. This change aligns the buffer and frame configuration with other Intel drivers, most importantly with iavf. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
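The three buffer-sizing cases listed in this commit can be sketched as a small selection function. This is an illustration under stated assumptions, not the driver's code: all `ex_` names are hypothetical, the 2048-byte value for the legacy path is inferred from the "hardcoded ICE_RXBUF_2048" remark, and the choice for large MTUs on pages of 8k or more is a guess because the message does not state it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result type: buffer length and page order for an Rx ring. */
struct ex_rx_cfg {
	unsigned int rx_buf_len;
	unsigned int pg_order;
};

static inline struct ex_rx_cfg ex_pick_rx_cfg(bool legacy_rx,
					       unsigned int mtu,
					       unsigned long page_size)
{
	struct ex_rx_cfg cfg;

	assert(page_size > 0);

	if (legacy_rx)
		cfg.rx_buf_len = 2048;	/* assumption: the old behaviour */
	else if (mtu <= 1500)
		cfg.rx_buf_len = 1536;	/* room for NET_SKB_PAD + NET_IP_ALIGN */
	else if (page_size < 8192)
		cfg.rx_buf_len = 3072;	/* the 3k buffer for larger MTUs */
	else
		cfg.rx_buf_len = 2048;	/* assumption: not stated in the message */

	/*
	 * Assumption: a buffer larger than half a small page needs an
	 * order 1 page so that two buffers still fit for recycling.
	 */
	cfg.pg_order = (page_size < 8192 && cfg.rx_buf_len > page_size / 2) ? 1 : 0;

	return cfg;
}
```

On a 4k-page system with a 9000-byte MTU this yields a 3072-byte buffer on an order 1 page, which is the case the message singles out.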
#
2d4238f5 |
|
04-Nov-2019 |
Krzysztof Kazimierczak <krzysztof.kazimierczak@intel.com> |
ice: Add support for AF_XDP Add zero copy AF_XDP support. This patch adds zero copy support for Tx and Rx; code for zero copy is added to ice_xsk.h and ice_xsk.c. For Tx, implement ndo_xsk_wakeup. As with other drivers, reuse existing XDP Tx queues for this task, since XDP_REDIRECT guarantees mutual exclusion between different NAPI contexts based on CPU ID. In turn, a netdev can XDP_REDIRECT to another netdev with a different NAPI context, since the operation is bound to a specific core and each core has its own hardware ring. For Rx, allocate frames as MEM_TYPE_ZERO_COPY on queues on which AF_XDP is enabled. Signed-off-by: Krzysztof Kazimierczak <krzysztof.kazimierczak@intel.com> Co-developed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
0891d6d4 |
|
04-Nov-2019 |
Krzysztof Kazimierczak <krzysztof.kazimierczak@intel.com> |
ice: Move common functions to ice_txrx_lib.c In preparation of AF XDP, move functions that will be used both by skb and zero-copy paths to a new file called ice_txrx_lib.c. This allows us to avoid using ifdefs to control the staticness of said functions. Move other functions (ice_rx_csum, ice_rx_hash and ice_ptype_to_htype) called only by the moved ones to the new file as well. Signed-off-by: Krzysztof Kazimierczak <krzysztof.kazimierczak@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
efc2214b |
|
04-Nov-2019 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: Add support for XDP Add support for XDP. Implement ndo_bpf and ndo_xdp_xmit. Upon load of an XDP program, allocate additional Tx rings for dedicated XDP use. The following actions are supported: XDP_TX, XDP_DROP, XDP_REDIRECT, XDP_PASS, and XDP_ABORTED. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
eff380aa |
|
24-Oct-2019 |
Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> |
ice: Introduce ice_base.c Remove a few uses of kernel configuration flags from ice_lib.c by introducing a new source file ice_base.c. Also move corresponding function prototypes from ice_lib.h to ice_base.h and include ice_base.h where required. Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
2ab28bb0 |
|
25-Jul-2019 |
Brett Creeley <brett.creeley@intel.com> |
ice: Set WB_ON_ITR when we don't re-enable interrupts Currently when busy polling is enabled we aren't setting/enabling WB_ON_ITR in the driver. This doesn't break the driver, but it does cause issues. If we don't enable WB_ON_ITR mode we will still get write-backs from hardware during polling when a cache line has been filled, but if a cache line is not filled we will not get the write-back because WB_ON_ITR is not set. Fix this by enabling WB_ON_ITR in the driver when interrupts are disabled. Signed-off-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
0ab54c5f |
|
16-Apr-2019 |
Jesse Brandeburg <jesse.brandeburg@intel.com> |
ice: Use bitfields when possible We can use bit fields to store boolean values and when the bit fields are next to each other, the compiler will combine them (as long as the size holds enough). Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
65124bbf |
|
16-Apr-2019 |
Jesse Brandeburg <jesse.brandeburg@intel.com> |
ice: Reorganize tx_buf and ring structs Use more efficient structure ordering by using the pahole tool and a lot of code inspection to get hot cache lines to have packed data (no holes if possible) and adjacent warm data. ice_ring prior to this change: /* size: 192, cachelines: 3, members: 23 */ /* sum members: 158, holes: 4, sum holes: 12 */ /* padding: 22 */ ice_ring after this change: /* size: 192, cachelines: 3, members: 25 */ /* sum members: 162, holes: 1, sum holes: 1 */ /* padding: 29 */ ice_tx_buf prior to this change: /* size: 48, cachelines: 1, members: 7 */ /* sum members: 38, holes: 2, sum holes: 6 */ /* padding: 4 */ /* last cacheline: 48 bytes */ ice_tx_buf after this change: /* size: 40, cachelines: 1, members: 7 */ /* sum members: 38, holes: 1, sum holes: 2 */ /* last cacheline: 40 bytes */ Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
b9c8bb06 |
|
28-Feb-2019 |
Brett Creeley <brett.creeley@intel.com> |
ice: Add ability to update rx-usecs-high Currently the driver allows rx-usecs-high values to be set, but when querying the device for rx-usecs-high the value does not stick. This is because it was not yet implemented. Add code to allow the user to change rx-usecs-high and use this to set the q_vector's intrl value. Signed-off-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
5f6aa50e |
|
28-Feb-2019 |
Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> |
ice: Add priority information into VLAN header This patch introduces a new function ice_tx_prepare_vlan_flags_dcb to insert 802.1p priority information into the VLAN header Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
a629cf0a |
|
28-Feb-2019 |
Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> |
ice: Update rings based on TC information This patch adds a new function ice_vsi_cfg_dcb_rings which updates a VSI's rings based on DCB traffic class information. Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
92414f32 |
|
19-Feb-2019 |
Brett Creeley <brett.creeley@intel.com> |
ice: Update comment regarding the ITR_GRAN_S Since the driver now hard codes the ITR granularity to 2 us in the GLINT_CTL register the comment next to ITR_GRAN_S needs to be updated. Signed-off-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
8244dd2d |
|
19-Feb-2019 |
Brett Creeley <brett.creeley@intel.com> |
ice: Audit hotpath structures with pahole Currently the ice_q_vector structure and ice_ring_container structure are taking up more space than necessary due to cache alignment holes and unnecessary variables respectively. This is not helping the driver's performance. The following fixes were done to improve cache alignment, reduce wasted space, and increase performance. 1. Remove the ice_latency_range enum as it is unused. 2. Remove the latency_range variable in the ice_ring_container structure. 3. Change the size of the itr_idx in the ice_ring_container structure from an int to an u16. This reduced the size of ice_ring_container structure to 32 Bytes so it has no holes or padding. 4. Re-arrange the ice_q_vector structure using pahole to align members as best as possible in regards to 64 Byte cache line size. Signed-off-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
64a59d05 |
|
19-Feb-2019 |
Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> |
ice: Fix for adaptive interrupt moderation commit 63f545ed1285 ("ice: Add support for adaptive interrupt moderation") was meant to add support for adaptive interrupt moderation but there was an error on my part while formatting the patch, and thus only part of the patch ended up being submitted. This patch rectifies the error by adding the rest of the code. Fixes: 63f545ed1285 ("ice: Add support for adaptive interrupt moderation") Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
a65f71fe |
|
13-Feb-2019 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: map Rx buffer pages with DMA attributes Provide DMA_ATTR_WEAK_ORDERING and DMA_ATTR_SKIP_CPU_SYNC attributes to the DMA API during the mapping operations on Rx side. With this change the non-x86 platforms will be able to sync only with what is being used (2k buffer) instead of entire page. This should yield a slight performance improvement. Furthermore, DMA unmap may destroy the changes that were made to the buffer by CPU when platform is not an x86 one. DMA_ATTR_SKIP_CPU_SYNC attribute usage fixes this issue. Also add a sync_single_for_device call during the Rx buffer assignment, to make sure that the cache lines are cleared before the device attempts to write to the buffer. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
03c66a13 |
|
13-Feb-2019 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
ice: Introduce bulk update for page count {get,put}_page are atomic operations which we use for page count handling. The current logic for refcount handling is that we increment it when passing a skb with the data from the first half of page up to netstack and recycle the second half of page. This operation protects us from losing a page since the network stack can decrement the refcount of page from skb. Performance can be slightly improved by doing bulk updates of the refcount instead of updating it one by one. During the buffer initialization, maximize the page's refcount and don't allow the refcount to become less than two. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
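The bulk refcount scheme from this commit can be modelled in user space with a plain counter. This is a sketch under assumptions: the `ex_` types and helpers are hypothetical, the refcount is a non-atomic int for illustration only, and the reuse test is a simplified reading of the message rather than the driver's exact logic.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-ins for struct page and the driver's Rx buffer. */
struct ex_page { int refcount; };
struct ex_rx_buf { struct ex_page *page; unsigned short pagecnt_bias; };

/* Take all references up front with one bulk update. */
static inline void ex_rx_buf_init(struct ex_rx_buf *buf, struct ex_page *page)
{
	page->refcount = 1;			/* freshly allocated page */
	page->refcount += USHRT_MAX - 1;	/* single bulk increment */
	buf->page = page;
	buf->pagecnt_bias = USHRT_MAX;		/* references the driver still owns */
}

/* Handing a half page to the stack costs no refcount update, only a bias drop. */
static inline void ex_give_to_stack(struct ex_rx_buf *buf)
{
	assert(buf->pagecnt_bias > 0);
	buf->pagecnt_bias--;
}

/* The stack dropping its reference, as put_page() would. */
static inline void ex_stack_put_page(struct ex_page *page)
{
	page->refcount--;
}

/* Reusable only if the stack holds at most one outstanding reference. */
static inline bool ex_can_reuse(struct ex_rx_buf *buf)
{
	if (buf->page->refcount - buf->pagecnt_bias > 1)
		return false;

	/* Refill in bulk before the driver's pool of references runs out. */
	if (buf->pagecnt_bias == 1) {
		buf->page->refcount += USHRT_MAX - 1;
		buf->pagecnt_bias = USHRT_MAX;
	}
	return true;
}
```

The difference `refcount - pagecnt_bias` is the number of references held outside the driver, so the hot path touches the shared counter only on the rare bulk refill.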
#
70457520 |
|
08-Feb-2019 |
Brett Creeley <brett.creeley@intel.com> |
ice: configure GLINT_ITR to always have an ITR gran of 2 Instead of hoping that our ITR granularity will be 2 usecs, program the GLINT_CTL register to make sure the ITR granularity is always 2 usecs. Now that we know what the ITR granularity will be, get rid of the check in ice_probe() to verify our previous assumption. Signed-off-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
67fe64d7 |
|
19-Dec-2018 |
Brett Creeley <brett.creeley@intel.com> |
ice: Implement getting and setting ethtool coalesce This patch includes the following ethtool operations: 1. get_coalesce 2. set_coalesce 3. get_per_q_coalesce 4. set_per_q_coalesce Each ITR value (current_itr/target_itr) is stored on a per ice_ring_container basis. This is because each valid ice_ring_container can have 1 or more rings that are tied to the same q_vector ITR index. Signed-off-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
63f545ed |
|
19-Dec-2018 |
Brett Creeley <brett.creeley@intel.com> |
ice: Add support for adaptive interrupt moderation Currently the driver does not support adaptive/dynamic interrupt moderation. This patch adds support for this. Also, adaptive/dynamic interrupt moderation is turned on by default upon driver load. In order to support adaptive interrupt moderation, two functions were added, ice_update_itr() and ice_itr_divisor(). These are used to determine the current packet load and to determine a divisor based on link speed respectively. This patch also adds the ICE_ITR_GRAN_S define that is used in the hot-path when setting a new ITR value. The shift is used to pet two birds with one hand, set the ITR value while re-enabling the interrupt. Also, the ICE_ITR_GRAN_S is defined as 1 because the device has a ITR granularity of 2usecs. Signed-off-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
c585ea42 |
|
26-Oct-2018 |
Brett Creeley <brett.creeley@intel.com> |
ice: Fix tx_timeout in PF driver Prior to this commit the driver was running into tx_timeouts when a queue was stressed enough. This was happening because the HW tail and SW tail (NTU) were incorrectly out of sync. Consequently this was causing the HW head to collide with the HW tail, which to the hardware means that all descriptors posted for Tx have been processed. Due to the Tx logic used in the driver SW tail and HW tail are allowed to be out of sync. This is done as an optimization because it allows the driver to write HW tail as infrequently as possible, while still updating the SW tail index to keep track. However, there are situations where this results in the tail never getting updated, resulting in Tx timeouts. Tx HW tail write condition: if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) writel(sw_tail, tx_ring->tail); An issue was found in the Tx logic that was causing the aforementioned condition for updating HW tail to never happen, causing tx_timeouts. In ice_xmit_frame_ring we calculate how many descriptors we need for the Tx transaction based on the skb the kernel hands us. This is then passed into ice_maybe_stop_tx along with some extra padding to determine if we have enough descriptors available for this transaction. If we don't then we return -EBUSY to the stack, otherwise we move on and eventually prepare the Tx descriptors accordingly in ice_tx_map and set next_to_watch. In ice_tx_map we make another call to ice_maybe_stop_tx with a value of MAX_SKB_FRAGS + 4. The key here is that this value is possibly less than the value we sent in the first call to ice_maybe_stop_tx in ice_xmit_frame_ring. Now, if the number of unused descriptors is between MAX_SKB_FRAGS + 4 and the value used in the first call to ice_maybe_stop_tx in ice_xmit_frame_ring then we do not update the HW tail because of the "Tx HW tail write condition" above.
This is because in ice_maybe_stop_tx we return success from ice_maybe_stop_tx instead of calling __ice_maybe_stop_tx and subsequently calling netif_stop_subqueue, which sets the __QUEUE_STATE_DEV_XOFF bit. This bit is then checked in the "Tx HW tail write condition" by calling netif_xmit_stopped and subsequently updating HW tail if the aforementioned bit is set. In ice_clean_tx_irq, if next_to_watch is not NULL, we end up cleaning the descriptors that HW sets the DD bit on and we have the budget. The HW head will eventually run into the HW tail in response to the description in the paragraph above. The next time through ice_xmit_frame_ring we make the initial call to ice_maybe_stop_tx with another skb from the stack. This time we do not have enough descriptors available and we return NETDEV_TX_BUSY to the stack and end up setting next_to_watch to NULL. This is where we are stuck. In ice_clean_tx_irq we never clean anything because next_to_watch is always NULL and in ice_xmit_frame_ring we never update HW tail because we already return NETDEV_TX_BUSY to the stack and eventually we hit a tx_timeout. This issue was fixed by making sure that the second call to ice_maybe_stop_tx in ice_tx_map is passed a value that is >= the value that was used on the initial call to ice_maybe_stop_tx in ice_xmit_frame_ring. This was done by adding the following defines to make the logic more clear and to reduce the chance of mucking this up again: ICE_CACHE_LINE_BYTES 64 ICE_DESCS_PER_CACHE_LINE (ICE_CACHE_LINE_BYTES / \ sizeof(struct ice_tx_desc)) ICE_DESCS_FOR_CTX_DESC 1 ICE_DESCS_FOR_SKB_DATA_PTR 1 The ICE_CACHE_LINE_BYTES being 64 is an assumption being made so we don't have to figure this out on every pass through the Tx path. Instead I added a sanity check in ice_probe to verify cache line size and print a message if it's not 64 Bytes. This will make it easier to file issues if they are seen when the cache line size is not 64 Bytes when reading from the GLPCI_CNF2 register.
Signed-off-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
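The invariant behind this fix (the second descriptor check must ask for at least as many descriptors as the first) can be sketched with the defines quoted in the message. The following is an illustration under assumptions: the `ex_` names are hypothetical, the descriptor layout is assumed to be two 64-bit words, the value 17 for the fragment limit is only an example, and the way the terms are summed is a plausible reading rather than the driver's exact expressions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: a Tx descriptor is two 64-bit words (16 bytes). */
struct ex_tx_desc {
	uint64_t buf_addr;
	uint64_t cmd_type_offset_bsz;
};

#define EX_CACHE_LINE_BYTES		64u
#define EX_DESCS_PER_CACHE_LINE \
	((unsigned int)(EX_CACHE_LINE_BYTES / sizeof(struct ex_tx_desc)))
#define EX_DESCS_FOR_CTX_DESC		1u
#define EX_DESCS_FOR_SKB_DATA_PTR	1u
#define EX_MAX_SKB_FRAGS		17u	/* example value only */

/* Threshold for the first check, made before mapping the skb. */
static inline unsigned int ex_first_check(unsigned int data_descs)
{
	return data_descs + EX_DESCS_PER_CACHE_LINE + EX_DESCS_FOR_CTX_DESC;
}

/* Threshold for the second check, made after the descriptors are posted. */
static inline unsigned int ex_second_check(void)
{
	return EX_MAX_SKB_FRAGS + EX_DESCS_FOR_SKB_DATA_PTR +
	       EX_DESCS_PER_CACHE_LINE + EX_DESCS_FOR_CTX_DESC;
}

/* The "Tx HW tail write condition" quoted in the message. */
static inline bool ex_should_write_tail(bool queue_stopped, bool xmit_more)
{
	return queue_stopped || !xmit_more;
}
```

In this sketch the second threshold is never smaller than the first for any skb needing up to `EX_MAX_SKB_FRAGS + 1` data descriptors, so a queue that would fail the next first check has already been stopped and had its tail written.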
#
d2b464a7 |
|
19-Sep-2018 |
Brett Creeley <brett.creeley@intel.com> |
ice: Add more flexibility on how we assign an ITR index This issue came about when looking at the VF function ice_vc_cfg_irq_map_msg. Currently we are assigning the itr_setting value to the itr_idx received from the AVF driver, which is not correct and is not used for the VF flow anyway. Currently the only way we set the ITR index for both the PF and VF driver is by hard coding ICE_TX_ITR or ICE_RX_ITR for the ITR index on each q_vector. To fix this, add the member itr_idx in struct ice_ring_container. This can then be used to dynamically program the correct ITR index. This change also affected the PF driver so make the necessary changes there as well. Also, removed the itr_setting member in struct ice_ring because it is not being used meaningfully and is going to be removed in a future patch that includes dynamic ITR. On another note, this will be useful moving forward if we decide to split Rx/Tx rings on different q_vectors instead of sharing them as queue pairs. Signed-off-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
9e4ab4c2 |
|
19-Sep-2018 |
Brett Creeley <brett.creeley@intel.com> |
ice: Add support for dynamic interrupt moderation Currently there is no support for dynamic interrupt moderation. This patch adds some initial code to support this. The following changes were made: 1. Currently we are using multiple members to store the interrupt granularity (itr_gran_25/50/100/200). This is not necessary because we can query the device to determine what the interrupt granularity should be set to, done by a new function ice_get_itr_intrl_gran. 2. Added intrl to ice_q_vector structure to support interrupt rate limiting. 3. Added the function ice_intrl_usecs_to_reg for converting to a value in usecs that the device understands. 4. Added call to write to the GLINT_RATE register. Disable intrl by default for now. 5. Changed rx/tx_itr_setting to itr_setting because having both seems redundant because a ring is either Tx or Rx. 6. Initialize itr_setting for both Tx/Rx rings in ice_vsi_alloc_rings() Signed-off-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
b3969fd7 |
|
09-Aug-2018 |
Sudheer Mogilappagari <sudheer.mogilappagari@intel.com> |
ice: Add support for Tx hang, Tx timeout and malicious driver detection When a malicious operation is detected, the firmware triggers an interrupt, which is then picked up by the service task (specifically by ice_handle_mdd_event). A reset is scheduled if required. Tx hang detection works in a similar way, except the logic here monitors the VSI's Tx queues and tries to revive them if stalled. If the hang is not resolved, the kernel eventually calls ndo_tx_timeout, which is handled by ice_tx_timeout. Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
43f8b224 |
|
09-Aug-2018 |
Bruce Allan <bruce.w.allan@intel.com> |
ice: Change struct members from bool to u8 Recent versions of checkpatch have a new warning based on a documented preference of Linus to not use bool in structures due to wasted space and the size of bool is implementation dependent. For more information, see the email thread at https://lkml.org/lkml/2017/11/21/384. Signed-off-by: Bruce Allan <bruce.w.allan@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
d76a60ba |
|
20-Mar-2018 |
Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> |
ice: Add support for VLANs and offloads This patch adds support for VLANs. When a VLAN is created a switch filter is added to direct the VLAN traffic to the corresponding VSI. When a VLAN is deleted, the filter is deleted as well. This patch also adds support for the following hardware offloads. 1) VLAN tag insertion/stripping 2) Receive Side Scaling (RSS) 3) Tx checksum and TCP segmentation 4) Rx checksum Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
2b245cb2 |
|
20-Mar-2018 |
Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> |
ice: Implement transmit and NAPI support This patch implements ice_start_xmit (the handler for ndo_start_xmit) and related functions. ice_start_xmit ultimately calls ice_tx_map, where the Tx descriptor is built and posted to the hardware by bumping the ring tail. This patch also implements ice_napi_poll, which is invoked when there's an interrupt on the VSI's queues. The interrupt can be due to either a completed Tx or an Rx event. In case of a completed Tx/Rx event, resources are reclaimed. Additionally, in case of an Rx event, the skb is fetched and passed up to the network stack. Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
cdedef59 |
|
20-Mar-2018 |
Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> |
ice: Configure VSIs for Tx/Rx This patch configures the VSIs to be able to send and receive packets by doing the following: 1) Initialize flexible parser to extract and include certain fields in the Rx descriptor. 2) Add Tx queues by programming the Tx queue context (implemented in ice_vsi_cfg_txqs). Note that adding the queues also enables (starts) the queues. 3) Add Rx queues by programming Rx queue context (implemented in ice_vsi_cfg_rxqs). Note that this only adds queues but doesn't start them. The rings will be started by calling ice_vsi_start_rx_rings on interface up. 4) Configure interrupts for VSI queues. 5) Implement ice_open and ice_stop. Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
3a858ba3 |
|
20-Mar-2018 |
Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> |
ice: Add support for VSI allocation and deallocation This patch introduces data structures and functions to alloc/free VSIs. The driver represents a VSI using the ice_vsi structure. Some noteworthy points about VSI allocation: 1) A VSI is allocated in the firmware using the "add VSI" admin queue command (implemented as ice_aq_add_vsi). The firmware returns an identifier for the allocated VSI. The VSI context is used to program certain aspects (loopback, queue map, etc.) of the VSI's configuration. 2) A VSI is deleted using the "free VSI" admin queue command (implemented as ice_aq_free_vsi). 3) The driver represents a VSI using struct ice_vsi. This is allocated and initialized as part of the ice_vsi_alloc flow, and deallocated as part of the ice_vsi_delete flow. 4) Once the VSI is created, a netdev is allocated and associated with it. The VSI's ring and vector related data structures are also allocated and initialized. 5) A VSI's queues can either be contiguous or scattered. To do this, the driver maintains a bitmap (vsi->avail_txqs) which is kept in sync with the firmware's VSI queue allocation imap. If the VSI can't get a contiguous queue allocation, it will fall back to a scattered one. This is implemented in ice_vsi_get_qs which is called as part of the VSI setup flow. In the release flow, the VSI's queues are released and the bitmap is updated to reflect this by ice_vsi_put_qs. CC: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Acked-by: Shannon Nelson <shannon.nelson@oracle.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
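The contiguous-then-scattered queue allocation in point 5 can be sketched with a 64-bit bitmap. This is a user-space illustration with hypothetical `ex_` names and a fixed pool of 64 queues; it does not reproduce ice_vsi_get_qs or ice_vsi_put_qs.

```c
#include <assert.h>
#include <stdint.h>

#define EX_MAX_QS 64u

enum ex_q_mode { EX_Q_CONTIG, EX_Q_SCATTER, EX_Q_NONE };

/* Claim 'count' queues: try one contiguous run first, then fall back to scatter. */
static inline enum ex_q_mode ex_get_qs(uint64_t *in_use, unsigned int count,
					uint16_t *q_ids)
{
	unsigned int start, i, found = 0;

	assert(in_use && q_ids);
	if (count == 0 || count > EX_MAX_QS)
		return EX_Q_NONE;

	/* Pass 1: look for 'count' free queues in a row. */
	for (start = 0; start + count <= EX_MAX_QS; start++) {
		for (i = 0; i < count; i++)
			if ((*in_use >> (start + i)) & 1u)
				break;
		if (i == count) {
			for (i = 0; i < count; i++) {
				*in_use |= 1ULL << (start + i);
				q_ids[i] = (uint16_t)(start + i);
			}
			return EX_Q_CONTIG;
		}
	}

	/* Pass 2: collect any free queues, wherever they are. */
	for (i = 0; i < EX_MAX_QS && found < count; i++)
		if (!((*in_use >> i) & 1u))
			q_ids[found++] = (uint16_t)i;
	if (found < count)
		return EX_Q_NONE;	/* bitmap left untouched */

	for (i = 0; i < count; i++)
		*in_use |= 1ULL << q_ids[i];
	return EX_Q_SCATTER;
}

/* Release previously claimed queues and clear their bits. */
static inline void ex_put_qs(uint64_t *in_use, unsigned int count,
			     const uint16_t *q_ids)
{
	unsigned int i;

	for (i = 0; i < count; i++)
		*in_use &= ~(1ULL << q_ids[i]);
}
```

The bitmap is only modified once a full allocation is known to succeed, so a failed request leaves the pool unchanged.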
#
940b61af |
|
20-Mar-2018 |
Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> |
ice: Initialize PF and setup miscellaneous interrupt This patch continues the initialization flow as follows: 1) Allocate and initialize necessary fields (like vsi, num_alloc_vsi, irq_tracker, etc) in the ice_pf instance. 2) Setup the miscellaneous interrupt handler. This is also known as the "other interrupt causes" (OIC) handler and is used to handle non hotpath interrupts (like control queue events, link events, exceptions, etc). 3) Implement a background task to process admin queue receive (ARQ) events received by the driver. CC: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Acked-by: Shannon Nelson <shannon.nelson@oracle.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|