#
d80a5233 |
|
27-Jan-2024 |
Heiner Kallweit <hkallweit1@gmail.com> |
ethtool: replace struct ethtool_eee with a new struct ethtool_keee on kernel side In order to pass EEE link modes beyond bit 32 to userspace we have to complement the 32 bit bitmaps in struct ethtool_eee with linkmode bitmaps. Therefore, similar to ethtool_link_settings and ethtool_link_ksettings, add a struct ethtool_keee. In a first step it's an identical copy of ethtool_eee. This patch simply does a s/ethtool_eee/ethtool_keee/g for all users. No functional change intended. Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
07c42d23 |
|
23-Jan-2024 |
Breno Leitao <leitao@debian.org> |
net: fill in MODULE_DESCRIPTION()s for enetc W=1 builds now warn if module is built without a MODULE_DESCRIPTION(). Add descriptions to the NXP ENETC Ethernet driver. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20240123190332.677489-7-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
f968c564 |
|
06-Nov-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: shorten enetc_setup_xdp_prog() error message to fit NETLINK_MAX_FMTMSG_LEN NETLINK_MAX_FMTMSG_LEN is currently hardcoded to 80, and we provide an error printf-formatted string having 96 characters including the terminating \0. Assuming each %d (representing a queue) gets replaced by a number having at most 2 digits (a reasonable assumption), the final string is also 96 characters wide, which is too much. Reduce the verbiage a bit by removing some (partially) redundant words, which makes the new printf-formatted string be 73 characters wide with the trailing newline. Fixes: 800db2d125c2 ("net: enetc: ensure we always have a minimum number of TXQs for stack") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/lkml/202311061336.4dsWMT1h-lkp@intel.com/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20231106160311.616118-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
7f04bd10 |
|
08-Sep-2023 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
net: Tree wide: Replace xdp_do_flush_map() with xdp_do_flush(). xdp_do_flush_map() is deprecated and new code should use xdp_do_flush() instead. Replace xdp_do_flush_map() with xdp_do_flush(). Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Cc: Clark Wang <xiaoning.wang@nxp.com> Cc: Claudiu Manoil <claudiu.manoil@nxp.com> Cc: David Arinzon <darinzon@amazon.com> Cc: Edward Cree <ecree.xilinx@gmail.com> Cc: Felix Fietkau <nbd@nbd.name> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Cc: Jassi Brar <jaswinder.singh@linaro.org> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Cc: John Crispin <john@phrozen.org> Cc: Leon Romanovsky <leon@kernel.org> Cc: Lorenzo Bianconi <lorenzo@kernel.org> Cc: Louis Peens <louis.peens@corigine.com> Cc: Marcin Wojtas <mw@semihalf.com> Cc: Mark Lee <Mark-MC.Lee@mediatek.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: NXP Linux Team <linux-imx@nxp.com> Cc: Noam Dagan <ndagan@amazon.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Saeed Bishara <saeedb@amazon.com> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Sean Wang <sean.wang@mediatek.com> Cc: Shay Agroskin <shayagr@amazon.com> Cc: Shenwei Wang <shenwei.wang@nxp.com> Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com> Cc: Tony Nguyen <anthony.l.nguyen@intel.com> Cc: Vladimir Oltean <vladimir.oltean@nxp.com> Cc: Wei Fang <wei.fang@nxp.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Arthur Kiyanovski <akiyano@amazon.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Martin Habets <habetsm.xilinx@gmail.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20230908143215.869913-2-bigeasy@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
fa87c546 |
|
27-Jun-2023 |
Julia Lawall <Julia.Lawall@inria.fr> |
net: enetc: use vmalloc_array and vcalloc Use vmalloc_array and vcalloc to protect against multiplication overflows. The changes were done using the following Coccinelle semantic patch: // <smpl> @initialize:ocaml@ @@ let rename alloc = match alloc with "vmalloc" -> "vmalloc_array" | "vzalloc" -> "vcalloc" | _ -> failwith "unknown" @@ size_t e1,e2; constant C1, C2; expression E1, E2, COUNT, x1, x2, x3; typedef u8; typedef __u8; type t = {u8,__u8,char,unsigned char}; identifier alloc = {vmalloc,vzalloc}; fresh identifier realloc = script:ocaml(alloc) { rename alloc }; @@ ( alloc(x1*x2*x3) | alloc(C1 * C2) | alloc((sizeof(t)) * (COUNT), ...) | - alloc((e1) * (e2)) + realloc(e1, e2) | - alloc((e1) * (COUNT)) + realloc(COUNT, e1) | - alloc((E1) * (E2)) + realloc(E1, E2) ) // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Link: https://lore.kernel.org/r/20230627144339.144478-19-Julia.Lawall@inria.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
fdebd850 |
|
02-Jun-2023 |
Wei Fang <wei.fang@nxp.com> |
net: enetc: correct rx_bytes statistics of XDP The rx_bytes statistics of XDP are always zero, because rx_byte_cnt is not updated after it is initialized to 0. So fix it. Fixes: d1b15102dd16 ("net: enetc: add support for XDP_DROP and XDP_PASS") Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7190d0ff |
|
02-Jun-2023 |
Wei Fang <wei.fang@nxp.com> |
net: enetc: correct the statistics of rx bytes The rx_bytes of struct net_device_stats should count the length of ethernet frames excluding the FCS. However, there are two problems with the rx_bytes statistics of the current enetc driver. one is that the length of VLAN header is not counted if the VLAN extraction feature is enabled. The other is that the length of L2 header is not counted, because eth_type_trans() is invoked before updating rx_bytes which will subtract the length of L2 header from skb->len. BTW, the rx_bytes statistics of XDP path also have similar problem, I will fix it in another patch. Fixes: a800abd3ecb9 ("net: enetc: move skb creation into enetc_build_skb") Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5353599a |
|
29-May-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: refactor enetc_setup_tc_taprio() to have a switch/case for cmd Make enetc_setup_tc_taprio() more amenable to future extensions, like reporting statistics. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b51f4113 |
|
10-May-2023 |
Yunsheng Lin <linyunsheng@huawei.com> |
net: introduce and use skb_frag_fill_page_desc() Most users use __skb_frag_set_page()/skb_frag_off_set()/ skb_frag_size_set() to fill the page desc for a skb frag. Introduce skb_frag_fill_page_desc() to do that. net/bpf/test_run.c does not call skb_frag_off_set() to set the offset, "copy_from_user(page_address(page), ...)" and 'shinfo' being part of the 'data' kzalloced in bpf_test_init() suggest that it is assuming offset to be initialized as zero, so call skb_frag_fill_page_desc() with offset being zero for this case. Also, skb_frag_set_page() is not used anymore, so remove it. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
82714539 |
|
18-Apr-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: only commit preemptible TCs to hardware when MM TX is active This was left as TODO in commit 01e23b2b3bad ("net: enetc: add support for preemptible traffic classes") since it's relatively complicated. Where this makes a difference is with a configuration as follows: ethtool --set-mm eno0 pmac-enabled on tx-enabled on verify-enabled on Preemptible packets should only be sent when the MAC Merge TX direction becomes active (i.o.w. when the verification process succeeds, aka when the link partner confirms it can process preemptible traffic). But the tc qdisc with the preemptible traffic classes is offloaded completely asynchronously w.r.t. the MM becoming active. The ENETC manual does suggest that this should be handled in the driver: "On startup, software should wait for the verification process to complete (MMCSR[VSTS]=011) before initiating traffic". Adding the necessary logic allows future selftests to uphold the claim that an inactive or disabled MAC Merge layer should never send data packets through the pMAC. This change moves enetc_set_ptcfpr() from enetc.c to enetc_ethtool.c, where its only caller is now - enetc_mm_commit_preemptible_tcs(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
01e23b2b |
|
11-Apr-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: add support for preemptible traffic classes PFs which support the MAC Merge layer also have a set of 8 registers called "Port traffic class N frame preemption register (PTC0FPR - PTC7FPR)". Through these, a traffic class (group of TX rings of same dequeue priority) can be mapped to the eMAC or to the pMAC. There's nothing particularly spectacular here. We should probably only commit the preemptible TCs to hardware once the MAC Merge layer became active, but unlike Felix, we don't have an IRQ that notifies us of that. We'd have to sleep for up to verifyTime (127 ms) to wait for a resolution coming from the verification state machine; not only from the ndo_setup_tc() code path, but also from enetc_mm_link_state_update(). Since it's relatively complicated and has a relatively small benefit, I'm not doing it. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
50764da3 |
|
11-Apr-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: rename "mqprio" to "qopt" To gain access to the larger encapsulating structure which has the type tc_mqprio_qopt_offload, rename just the "qopt" field as "qopt". Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
1a353111 |
|
04-Feb-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: act upon the requested mqprio queue configuration Regardless of the requested queue count per traffic class, the enetc driver allocates a number of TX rings equal to the number of TCs, and hardcodes a queue configuration of "1@0 1@1 ... 1@max-tc". Other configurations are silently ignored and treated the same. Improve that by allowing what the user requests to be actually fulfilled. This allows more than one TX ring per traffic class. For example: $ tc qdisc add dev eno0 root handle 1: mqprio num_tc 4 \ map 0 0 1 1 2 2 3 3 queues 2@0 2@2 2@4 2@6 [ 146.267648] fsl_enetc 0000:00:00.0 eno0: TX ring 0 prio 0 [ 146.273451] fsl_enetc 0000:00:00.0 eno0: TX ring 1 prio 0 [ 146.283280] fsl_enetc 0000:00:00.0 eno0: TX ring 2 prio 1 [ 146.293987] fsl_enetc 0000:00:00.0 eno0: TX ring 3 prio 1 [ 146.300467] fsl_enetc 0000:00:00.0 eno0: TX ring 4 prio 2 [ 146.306866] fsl_enetc 0000:00:00.0 eno0: TX ring 5 prio 2 [ 146.313261] fsl_enetc 0000:00:00.0 eno0: TX ring 6 prio 3 [ 146.319622] fsl_enetc 0000:00:00.0 eno0: TX ring 7 prio 3 $ tc qdisc del dev eno0 root [ 178.238418] fsl_enetc 0000:00:00.0 eno0: TX ring 0 prio 0 [ 178.244369] fsl_enetc 0000:00:00.0 eno0: TX ring 1 prio 0 [ 178.251486] fsl_enetc 0000:00:00.0 eno0: TX ring 2 prio 0 [ 178.258006] fsl_enetc 0000:00:00.0 eno0: TX ring 3 prio 0 [ 178.265038] fsl_enetc 0000:00:00.0 eno0: TX ring 4 prio 0 [ 178.271557] fsl_enetc 0000:00:00.0 eno0: TX ring 5 prio 0 [ 178.277910] fsl_enetc 0000:00:00.0 eno0: TX ring 6 prio 0 [ 178.284281] fsl_enetc 0000:00:00.0 eno0: TX ring 7 prio 0 $ tc qdisc add dev eno0 root handle 1: mqprio num_tc 8 \ map 0 1 2 3 4 5 6 7 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 1 [ 186.113162] fsl_enetc 0000:00:00.0 eno0: TX ring 0 prio 0 [ 186.118764] fsl_enetc 0000:00:00.0 eno0: TX ring 1 prio 1 [ 186.124374] fsl_enetc 0000:00:00.0 eno0: TX ring 2 prio 2 [ 186.130765] fsl_enetc 0000:00:00.0 eno0: TX ring 3 prio 3 [ 186.136404] fsl_enetc 0000:00:00.0 eno0: TX ring 4 prio 4 [ 186.142049] fsl_enetc 0000:00:00.0 eno0: TX ring 5 prio 5 [ 186.147674] fsl_enetc 0000:00:00.0 eno0: TX ring 6 prio 6 [ 186.153305] fsl_enetc 0000:00:00.0 eno0: TX ring 7 prio 7 The driver used to set TC_MQPRIO_HW_OFFLOAD_TCS, near which there is this comment in the UAPI header: TC_MQPRIO_HW_OFFLOAD_TCS, /* offload TCs, no queue counts */ which is what enetc was doing up until now (and no longer is; we offload queue counts too), remove that assignment. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
735ef62c |
|
04-Feb-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: request mqprio to validate the queue counts The enetc driver does not validate the mqprio queue configuration, so it currently allows things like this: $ tc qdisc add dev swp0 root handle 1: mqprio num_tc 8 \ map 0 1 2 3 4 5 6 7 queues 3@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 1 But also things like this, completely omitting the queue configuration: $ tc qdisc add dev eno0 root handle 1: mqprio num_tc 8 \ map 0 1 2 3 4 5 6 7 hw 1 By requesting validation via the mqprio capability structure, this is no longer allowed, and we bring what is accepted by hardware in line with what is accepted by software. The check that num_tc <= real_num_tx_queues also becomes superfluous and can be dropped, because mqprio_validate_queue_counts() validates that no TXQ range exceeds real_num_tx_queues. That is a stronger check, because there is at least 1 TXQ per TC, so there are at least as many TXQs as TCs. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
800db2d1 |
|
02-Feb-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: ensure we always have a minimum number of TXQs for stack Currently it can happen that an mqprio qdisc is installed with num_tc 8, and this will reserve 8 (out of 8) TXQs for the network stack. Then we can attach an XDP program, and this will crop 2 TXQs, leaving just 6 for mqprio. That's not what the user requested, and we should fail it. On the other hand, if mqprio isn't requested, we still give the 8 TXQs to the network stack (with hashing among a single traffic class), but then, cropping 2 TXQs for XDP is fine, because the user didn't explicitly ask for any number of TXQs, so no expectations are violated. Simply put, the logic that mqprio should impose a minimum number of TXQs for the network never existed. Let's say (more or less arbitrarily) that without mqprio, the driver expects a minimum number of TXQs equal to the number of CPUs (on NXP LS1028A, that is either 1, or 2). And with mqprio, mqprio gives the minimum required number of TXQs. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
4ea1dd74 |
|
02-Feb-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: recalculate num_real_tx_queues when XDP program attaches Since the blamed net-next commit, enetc_setup_xdp_prog() no longer goes through enetc_open(), and therefore, the function which was supposed to detect whether a BPF program exists (in order to crop some TX queues from network stack usage), enetc_num_stack_tx_queues(), no longer gets called. We can move the netif_set_real_num_rx_queues() call to enetc_alloc_msix() (probe time), since it is a runtime invariant. We can do the same thing with netif_set_real_num_tx_queues(), and let enetc_reconfigure_xdp_cb() explicitly recalculate and change the number of stack TX queues. Fixes: c33bfaf91c4c ("net: enetc: set up XDP program under enetc_reconfigure()") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
46a0ecf9 |
|
02-Feb-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: allow the enetc_reconfigure() callback to fail enetc_reconfigure() was modified in commit c33bfaf91c4c ("net: enetc: set up XDP program under enetc_reconfigure()") to take an optional callback that runs while the netdev is down, but this callback currently cannot fail. Code up the error handling so that the interface is restarted with the old resources if the callback fails. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
1c81a9b3 |
|
02-Feb-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: simplify enetc_num_stack_tx_queues() We keep a pointer to the xdp_prog in the private netdev structure as well; what's replicated per RX ring is done so just for more convenient access from the NAPI poll procedure. Simplify enetc_num_stack_tx_queues() by looking at priv->xdp_prog rather than iterating through the information replicated per RX ring. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
12717dec |
|
19-Jan-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: implement software lockstep for port MAC registers Currently the enetc driver duplicates its writes to the PM0 registers also to PM1, but it doesn't do this consistently - for example we write to ENETC_PM0_MAXFRM but not to ENETC_PM1_MAXFRM. Create enetc_port_mac_wr() which writes both the PM0 and PM1 register with the same value (if frame preemption is supported on this port). Also create enetc_port_mac_rd() which reads from PM0 - the assumption being that PM1 contains just the same value. This will be necessary when we enable the MAC Merge layer properly, and the pMAC becomes operational. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
94557a9a |
|
19-Jan-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: detect frame preemption hardware capability Similar to other TSN features, query the Station Interface capability register to see whether preemption is supported on this port or not. On LS1028A, preemption is available on ports 0 and 2, but not on 1 and 3. This will allow us in the future to write the pMAC registers only on the ENETC ports where a pMAC actually exists. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e3972399 |
|
19-Jan-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: build common object files into a separate module The build system is complaining about the following: enetc.o is added to multiple modules: fsl-enetc fsl-enetc-vf enetc_cbdr.o is added to multiple modules: fsl-enetc fsl-enetc-vf enetc_ethtool.o is added to multiple modules: fsl-enetc fsl-enetc-vf Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ff58fda0 |
|
17-Jan-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: prioritize ability to go down over packet processing napi_synchronize() from enetc_stop() waits until the softirq has finished execution and no longer wants to be rescheduled. However under high traffic load, this will never happen, and the interface can never be closed. The problem is the fact that the NAPI poll routine is written to update the consumer index which makes the device want to put more buffers in the RX ring, which restarts the madness again. Browsing around, it seems that some drivers like i40e keep a bit (__I40E_VSI_DOWN) which they use as communication between the control path and the data path. But that isn't my first choice, because complications ensue - since the enetc hardirq may trigger while we are in a theoretical ENETC_DOWN state, it may happen that enetc_msix() masks it, but enetc_poll() never unmasks it. To prevent a stall in that case, one would need to schedule all NAPI instances when ENETC_DOWN gets cleared, to process what's pending. I find it more desirable for the control path - enetc_stop() - to just quiesce the RX ring and let the softirq finish what remains there, without any explicit communication, just by making hardware not provide any more packets. This seems possible with the Enable bit of the RX BD ring (RBaMR[EN]). I can't seem to find an exact definition of what this bit does, but when the RX ring is disabled, the port seems to no longer update the producer index, and not react to software updates of the consumer index. In fact, the RBaMR[EN] bit is already toggled by the driver, but too late for what we want: enetc_close() -> enetc_stop() -> napi_synchronize() -> enetc_clear_bdrs() -> enetc_clear_rxbdr() The enetc_clear_bdrs() function contains not only logic to disable the RX and TX rings, but also logic to wait for the TX ring stop being busy. We split enetc_clear_bdrs() into enetc_disable_bdrs() and enetc_wait_bdrs(). One needs to run before napi_synchronize() and the other after (NAPI also processes TX completions, so we maximize our chances of not waiting for the ENETC_TBSR_BUSY bit - unless a packet is stuck for some reason, ofc). We also split off enetc_enable_bdrs() from enetc_setup_bdrs(), and call this from the mirror position in enetc_start() compared to enetc_stop(), i.e. right before netif_tx_start_all_queues(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
c33bfaf9 |
|
17-Jan-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: set up XDP program under enetc_reconfigure() Offloading a BPF program to the RX path of the driver suffers from the same problems as the PTP reconfiguration - improper error checking can leave the driver in an invalid state, and the link on the PHY is lost. Reuse the enetc_reconfigure() procedure, but here, we need to run some code in the middle of the ring reconfiguration procedure - while the interface is still down. Introduce a callback which makes that possible. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
766338c7 |
|
17-Jan-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: rename "xdp" and "dev" in enetc_setup_bpf() Follow the convention from this driver, which is to name "struct net_device *" as "ndev", and the convention from other drivers, to name "struct netdev_bpf *" as "bpf". Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
5093406c |
|
17-Jan-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: implement ring reconfiguration procedure for PTP RX timestamping The crude enetc_stop() -> enetc_open() mechanism suffers from 2 problems: 1. improper error checking 2. it involves phylink_stop() -> phylink_start() which loses the link Right now, the driver is prepared to offer a better alternative: a ring reconfiguration procedure which takes the RX BD size (normal or extended) as argument. It allocates new resources (failing if that fails), stops the traffic, and assigns the new resources to the rings. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
598ca0d0 |
|
17-Jan-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: move phylink_start/stop out of enetc_start/stop We want to introduce a fast interface reconfiguration procedure, which involves temporarily stopping the rings. But we want enetc_start() and enetc_stop() to not restart PHY autoneg, because that can take a few seconds until it completes again. So we need part of enetc_start() and enetc_stop(), but not all of them. Move phylink_start() right next to phylink_of_phy_connect(), and phylink_stop() right next to phylink_disconnect_phy(), both still in ndo_open() and ndo_stop(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
f3ce29e1 |
|
17-Jan-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: split ring resource allocation from assignment We have a few instances in the enetc driver where the ring resources (BD ring iomem, software BD ring, software TSO headers, basically everything except RX buffers) need to be reallocated. For example, when RX timestamping is enabled, the RX BD format changes to an extended one (twice as large). Currently, this is done using a simplistic enetc_close() -> enetc_open() procedure. But this is quite crude, since it also invokes phylink_stop() -> phylink_start(), the link is lost, and a few seconds need to pass for autoneg to complete again. In fact it's bad also due to the improper (yolo) error checking. In case we fail to allocate new resources, we've already freed the old ones, so the interface is more or less stuck. To avoid that, we need a system where reconfiguration is possible in a way in which resources are allocated upfront. This means that there will be a higher memory usage temporarily, but the assignment of resources to rings can be done when both the old and new resources are still available. Introduce a struct enetc_bdr_resource which holds the resources for a ring, be it RX or TX. This structure duplicates a lot of fields from struct enetc_bdr (and access to the same fields in the ring structure was left duplicated, to not change cache characteristics in the fast path). When enetc_alloc_tx_resources() runs, it returns an array of resource elements (one per TX ring), in addition to the existing priv->tx_res. To populate priv->tx_res with that array, one must call enetc_assign_tx_resources(), and this also frees the old resources. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
d075db51 |
|
17-Jan-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: bring "bool extended" to top-level in enetc_open() Extended RX buffer descriptors are necessary if they carry RX timestamps, which will be true when PTP timestamping is enabled. Right now, the rx_ring->ext_en is set from the function that allocates ring resources (enetc_alloc_rx_resources() -> enetc_alloc_rxbdr()), and also used later, in enetc_setup_rxbdr(). It is also used in the enetc_rxbd() and enetc_rxbd_next() fast path helpers. We want to decouple resource allocation from BD ring setup, but both procedures depend on BD size (extended or not). Move the "extended" boolean to enetc_open() and pass it both to the RX allocation procedure as well as to the RX ring setup procedure. The latter will set rx_ring->ext_en from now on. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
bbd6043f |
|
17-Jan-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: drop redundant enetc_free_tx_frame() call from enetc_free_txbdr() The call path in enetc_close() is: enetc_close() -> enetc_free_rxtx_rings() -> enetc_free_tx_ring() -> enetc_free_tx_frame() -> enetc_free_tx_resources() -> enetc_free_txbdr() -> enetc_free_tx_frame() The enetc_free_tx_frame() function is written such that the second call exits without doing anything, but nonetheless, it is completely redundant. Delete it. This makes the TX teardown path more similar to the RX one, where rx_swbd freeing is done in enetc_free_rx_ring(), not in enetc_free_rxbdr(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
2c338710 |
|
17-Jan-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: rx_swbd and tx_swbd are never NULL in enetc_free_rxtx_rings() The call path in enetc_close() is: enetc_close() -> enetc_free_rxtx_rings() -> enetc_free_rx_ring() -> tests whether rx_ring->rx_swbd is NULL -> enetc_free_tx_ring() -> tests whether tx_ring->tx_swbd is NULL -> enetc_free_rx_resources() -> enetc_free_rxbdr() -> sets rxr->rx_swbd to NULL -> enetc_free_tx_resources() -> enetc_free_txbdr() -> setx txr->tx_swbd to NULL From the above, it is clear that due to the function ordering, the checks for NULL are redundant, since the software buffer descriptor arrays have not yet been set to NULL. Drop these checks. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
0d6cfd0f |
|
17-Jan-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: create enetc_dma_free_bdr() This is a refactoring change which introduces the opposite function of enetc_dma_alloc_bdr(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
fbf1cff9 |
|
17-Jan-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: set up RX ring indices from enetc_setup_rxbdr() There is only one place which needs to set up indices in the RX ring. Be consistent with what was done in the TX path and do this in enetc_setup_rxbdr(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
1cbf19c5 |
|
17-Jan-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: set next_to_clean/next_to_use just from enetc_setup_txbdr() enetc_alloc_txbdr() deals with allocating resources necessary for a TX ring to work (the array of software BDs and the array of TSO headers). The next_to_clean and next_to_use pointers are overwritten with proper values which are read from hardware here: enetc_open -> enetc_alloc_tx_resources -> enetc_alloc_txbdr -> set to zero -> enetc_setup_bdrs -> enetc_setup_txbdr -> read from hardware So their initialization with zeroes is pointless and confusing. Delete it. Consequently, since enetc_setup_txbdr() has no opposite cleanup function, also delete the resetting of these indices from enetc_free_tx_ring(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
c7030d14 |
|
04-Jan-2023 |
Lorenzo Bianconi <lorenzo@kernel.org> |
net: ethernet: enetc: do not always access skb_shared_info in the XDP path Move XDP skb_shared_info structure initialization in from enetc_map_rx_buff_to_xdp() to enetc_add_rx_buff_to_xdp() and do not always access skb_shared_info in the xdp_buff/xdp_frame since it is located in a different cacheline with respect to hard_start and data xdp pointers. Rely on XDP_FLAGS_HAS_FRAGS flag to check if it really necessary to access non-linear part of the xdp_buff/xdp_frame. Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
8feb020f |
|
04-Jan-2023 |
Lorenzo Bianconi <lorenzo@kernel.org> |
net: ethernet: enetc: unlock XDP_REDIRECT for XDP non-linear buffers Even if full XDP_REDIRECT is not supported yet for non-linear XDP buffers since we allow redirecting just into CPUMAPs, unlock XDP_REDIRECT for S/G XDP buffer and rely on XDP stack to properly take care of the frames. Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
3c463721 |
|
11-Jan-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: avoid deadlock in enetc_tx_onestep_tstamp() This lockdep splat says it better than I could: ================================ WARNING: inconsistent lock state 6.2.0-rc2-07010-ga9b9500ffaac-dirty #967 Not tainted -------------------------------- inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. kworker/1:3/179 [HC0[0]:SC0[0]:HE1:SE1] takes: ffff3ec4036ce098 (_xmit_ETHER#2){+.?.}-{3:3}, at: netif_freeze_queues+0x5c/0xc0 {IN-SOFTIRQ-W} state was registered at: _raw_spin_lock+0x5c/0xc0 sch_direct_xmit+0x148/0x37c __dev_queue_xmit+0x528/0x111c ip6_finish_output2+0x5ec/0xb7c ip6_finish_output+0x240/0x3f0 ip6_output+0x78/0x360 ndisc_send_skb+0x33c/0x85c ndisc_send_rs+0x54/0x12c addrconf_rs_timer+0x154/0x260 call_timer_fn+0xb8/0x3a0 __run_timers.part.0+0x214/0x26c run_timer_softirq+0x3c/0x74 __do_softirq+0x14c/0x5d8 ____do_softirq+0x10/0x20 call_on_irq_stack+0x2c/0x5c do_softirq_own_stack+0x1c/0x30 __irq_exit_rcu+0x168/0x1a0 irq_exit_rcu+0x10/0x40 el1_interrupt+0x38/0x64 irq event stamp: 7825 hardirqs last enabled at (7825): [<ffffdf1f7200cae4>] exit_to_kernel_mode+0x34/0x130 hardirqs last disabled at (7823): [<ffffdf1f708105f0>] __do_softirq+0x550/0x5d8 softirqs last enabled at (7824): [<ffffdf1f7081050c>] __do_softirq+0x46c/0x5d8 softirqs last disabled at (7811): [<ffffdf1f708166e0>] ____do_softirq+0x10/0x20 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(_xmit_ETHER#2); <Interrupt> lock(_xmit_ETHER#2); *** DEADLOCK *** 3 locks held by kworker/1:3/179: #0: ffff3ec400004748 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1f4/0x6c0 #1: ffff80000a0bbdc8 ((work_completion)(&priv->tx_onestep_tstamp)){+.+.}-{0:0}, at: process_one_work+0x1f4/0x6c0 #2: ffff3ec4036cd438 (&dev->tx_global_lock){+.+.}-{3:3}, at: netif_tx_lock+0x1c/0x34 Workqueue: events enetc_tx_onestep_tstamp Call trace: print_usage_bug.part.0+0x208/0x22c mark_lock+0x7f0/0x8b0 __lock_acquire+0x7c4/0x1ce0 lock_acquire.part.0+0xe0/0x220 lock_acquire+0x68/0x84 _raw_spin_lock+0x5c/0xc0 netif_freeze_queues+0x5c/0xc0 netif_tx_lock+0x24/0x34 enetc_tx_onestep_tstamp+0x20/0x100 process_one_work+0x28c/0x6c0 worker_thread+0x74/0x450 kthread+0x118/0x11c but I'll say it anyway: the enetc_tx_onestep_tstamp() work item runs in process context, therefore with softirqs enabled (i.o.w., it can be interrupted by a softirq). If we hold the netif_tx_lock() when there is an interrupt, and the NET_TX softirq then gets scheduled, this will take the netif_tx_lock() a second time and deadlock the kernel. To solve this, use netif_tx_lock_bh(), which blocks softirqs from running. Fixes: 7294380c5211 ("enetc: support PTP Sync packet one-step timestamping") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/20230112105440.1786799-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
628050ec |
|
12-Dec-2022 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: avoid buffer leaks on xdp_do_redirect() failure Before enetc_clean_rx_ring_xdp() calls xdp_do_redirect(), each software BD in the RX ring between index orig_i and i can have one of 2 refcount values on its page. We are the owner of the current buffer that is being processed, so the refcount will be at least 1. If the current owner of the buffer at the diametrically opposed index in the RX ring (i.o.w, the other half of this page) has not yet called kfree(), this page's refcount could even be 2. enetc_page_reusable() in enetc_flip_rx_buff() tests for the page refcount against 1, and [ if it's 2 ] does not attempt to reuse it. But if enetc_flip_rx_buff() is put after the xdp_do_redirect() call, the page refcount can have one of 3 values. It can also be 0, if there is no owner of the other page half, and xdp_do_redirect() for this buffer ran so far that it triggered a flush of the devmap/cpumap bulk queue, and the consumers of those bulk queues also freed the buffer, all by the time xdp_do_redirect() returns the execution back to enetc. This is the reason why enetc_flip_rx_buff() is called before xdp_do_redirect(), but there is a big flaw with that reasoning: enetc_flip_rx_buff() will set rx_swbd->page = NULL on both sides of the enetc_page_reusable() branch, and if xdp_do_redirect() returns an error, we call enetc_xdp_free(), which does not deal gracefully with that. In fact, what happens is quite special. The page refcounts start as 1. enetc_flip_rx_buff() figures they're reusable, transfers these rx_swbd->page pointers to a different rx_swbd in enetc_reuse_page(), and bumps the refcount to 2. When xdp_do_redirect() later returns an error, we call the no-op enetc_xdp_free(), but we still haven't lost the reference to that page. A copy of it is still at rx_ring->next_to_alloc, but that has refcount 2 (and there are no concurrent owners of it in flight, to drop the refcount). What really kills the system is when we'll flip the rx_swbd->page the second time around. With an updated refcount of 2, the page will not be reusable and we'll really leak it. Then enetc_new_page() will have to allocate more pages, which will then eventually leak again on further errors from xdp_do_redirect(). The problem, summarized, is that we zeroize rx_swbd->page before we're completely done with it, and this makes it impossible for the error path to do something with it. Since the packet is potentially multi-buffer and therefore the rx_swbd->page is potentially an array, manual passing of the old pointers between enetc_flip_rx_buff() and enetc_xdp_free() is a bit difficult. For the sake of going with a simple solution, we accept the possibility of racing with xdp_do_redirect(), and we move the flip procedure to execute only on the redirect success path. By racing, I mean that the page may be deemed as not reusable by enetc (having a refcount of 0), but there will be no leak in that case, either. Once we accept that, we have something better to do with buffers on XDP_REDIRECT failure. Since we haven't performed half-page flipping yet, we won't, either (and this way, we can avoid enetc_xdp_free() completely, which gives the entire page to the slab allocator). Instead, we'll call enetc_xdp_drop(), which will recycle this half of the buffer back to the RX ring. Fixes: 9d2b68cc108d ("net: enetc: add support for XDP_REDIRECT") Suggested-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20221213001908.2347046-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
290b5fe0 |
|
22-Nov-2022 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: preserve TX ring priority across reconfiguration In the blamed commit, a rudimentary reallocation procedure for RX buffer descriptors was implemented, for the situation when their format changes between normal (no PTP) and extended (PTP). enetc_hwtstamp_set() calls enetc_close() and enetc_open() in a sequence, and this sequence loses information which was previously configured in the TX BDR Mode Register, specifically via the enetc_set_bdr_prio() call. The TX ring priority is configured by tc-mqprio and tc-taprio, and affects important things for TSN such as the TX time of packets. The issue manifests itself most visibly by the fact that isochron --txtime reports premature packet transmissions when PTP is first enabled on an enetc interface. Save the TX ring priority in a new field in struct enetc_bdr (occupies a 2 byte hole on arm64) in order to make this survive a ring reconfiguration. Fixes: 434cebabd3a2 ("enetc: Add dynamic allocation of extended Rx BD rings") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/r/20221122130936.1704151-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
84ce1ca3 |
|
27-Oct-2022 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: survive memory pressure without crashing Under memory pressure, enetc_refill_rx_ring() may fail, and when called during the enetc_open() -> enetc_setup_rxbdr() procedure, this is not checked for. An extreme case of memory pressure will result in exactly zero buffers being allocated for the RX ring, and in such a case it is expected that hardware drops all RX packets due to lack of buffers. This does not happen, because the reset-default value of the consumer and produces index is 0, and this makes the ENETC think that all buffers have been initialized and that it owns them (when in reality none were). The hardware guide explains this best: | Configure the receive ring producer index register RBaPIR with a value | of 0. The producer index is initially configured by software but owned | by hardware after the ring has been enabled. Hardware increments the | index when a frame is received which may consume one or more BDs. | Hardware is not allowed to increment the producer index to match the | consumer index since it is used to indicate an empty condition. The ring | can hold at most RBLENR[LENGTH]-1 received BDs. | | Configure the receive ring consumer index register RBaCIR. The | consumer index is owned by software and updated during operation of the | of the BD ring by software, to indicate that any receive data occupied | in the BD has been processed and it has been prepared for new data. | - If consumer index and producer index are initialized to the same | value, it indicates that all BDs in the ring have been prepared and | hardware owns all of the entries. | - If consumer index is initialized to producer index plus N, it would | indicate N BDs have been prepared. Note that hardware cannot start if | only a single buffer is prepared due to the restrictions described in | (2). | - Software may write consumer index to match producer index anytime | while the ring is operational to indicate all received BDs prior have | been processed and new BDs prepared for hardware. Normally, the value of rx_ring->rcir (consumer index) is brought in sync with the rx_ring->next_to_use software index, but this only happens if page allocation ever succeeded. When PI==CI==0, the hardware appears to receive frames and write them to DMA address 0x0 (?!), then set the READY bit in the BD. The enetc_clean_rx_ring() function (and its XDP derivative) is naturally not prepared to handle such a condition. It will attempt to process those frames using the rx_swbd structure associated with index i of the RX ring, but that structure is not fully initialized (enetc_new_page() does all of that). So what happens next is undefined behavior. To operate using no buffer, we must initialize the CI to PI + 1, which will block the hardware from advancing the CI any further, and drop everything. The issue was seen while adding support for zero-copy AF_XDP sockets, where buffer memory comes from user space, which can even decide to supply no buffers at all (example: "xdpsock --txonly"). However, the bug is present also with the network stack code, even though it would take a very determined person to trigger a page allocation failure at the perfect time (a series of ifup/ifdown under memory pressure should eventually reproduce it given enough retries). Fixes: d4fd0404c1c9 ("enetc: Introduce basic PF and VF ENETC ethernet drivers") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Link: https://lore.kernel.org/r/20221027182925.3256653-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
715bf261 |
|
27-Sep-2022 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: cache accesses to &priv->si->hw The &priv->si->hw construct dereferences 2 pointers and makes lines longer than they need to be, in turn making the code harder to read. Replace &priv->si->hw accesses with a "hw" variable when there are 2 or more accesses within a function that dereference this. This includes loops, since &priv->si->hw is a loop invariant. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
b48b89f9 |
|
27-Sep-2022 |
Jakub Kicinski <kuba@kernel.org> |
net: drop the weight argument from netif_napi_add We tell driver developers to always pass NAPI_POLL_WEIGHT as the weight to netif_napi_add(). This may be confusing to newcomers, drop the weight argument, those who really need to tweak the weight can use netif_napi_add_weight(). Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for CAN Link: https://lore.kernel.org/r/20220927132753.750069-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
5641c751 |
|
16-Sep-2022 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: deny offload of tc-based TSN features on VF interfaces TSN features on the ENETC (taprio, cbs, gate, police) are configured through a mix of command BD ring messages and port registers: enetc_port_rd(), enetc_port_wr(). Port registers are a region of the ENETC memory map which are only accessible from the PCIe Physical Function. They are not accessible from the Virtual Functions. Moreover, attempting to access these registers crashes the kernel: $ echo 1 > /sys/bus/pci/devices/0000\:00\:00.0/sriov_numvfs pci 0000:00:01.0: [1957:ef00] type 00 class 0x020001 fsl_enetc_vf 0000:00:01.0: Adding to iommu group 15 fsl_enetc_vf 0000:00:01.0: enabling device (0000 -> 0002) fsl_enetc_vf 0000:00:01.0 eno0vf0: renamed from eth0 $ tc qdisc replace dev eno0vf0 root taprio num_tc 8 map 0 1 2 3 4 5 6 7 \ queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 base-time 0 \ sched-entry S 0x7f 900000 sched-entry S 0x80 100000 flags 0x2 Unable to handle kernel paging request at virtual address ffff800009551a08 Internal error: Oops: 96000007 [#1] PREEMPT SMP pc : enetc_setup_tc_taprio+0x170/0x47c lr : enetc_setup_tc_taprio+0x16c/0x47c Call trace: enetc_setup_tc_taprio+0x170/0x47c enetc_setup_tc+0x38/0x2dc taprio_change+0x43c/0x970 taprio_init+0x188/0x1e0 qdisc_create+0x114/0x470 tc_modify_qdisc+0x1fc/0x6c0 rtnetlink_rcv_msg+0x12c/0x390 Split enetc_setup_tc() into separate functions for the PF and for the VF drivers. Also remove enetc_qos.o from being included into enetc-vf.ko, since it serves absolutely no purpose there. Fixes: 34c6adf1977b ("enetc: Configure the Time-Aware Scheduler via tc-taprio offload") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20220916133209.3351399-2-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
fed38e64 |
|
16-Sep-2022 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: move enetc_set_psfp() out of the common enetc_set_features() The VF netdev driver shouldn't respond to changes in the NETIF_F_HW_TC flag; only PFs should. Moreover, TSN-specific code should go to enetc_qos.c, which should not be included in the VF driver. Fixes: 79e499829f3f ("net: enetc: add hw tc hw offload features for PSPF capability") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20220916133209.3351399-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
285e8ded |
|
10-May-2022 |
Po Liu <Po.Liu@nxp.com> |
net: enetc: count the tc-taprio window drops The enetc scheduler for IEEE 802.1Qbv has 2 options (depending on PTGCR[TG_DROP_DISABLE]) when we attempt to send an oversized packet which will never fit in its allotted time slot for its traffic class: either block the entire port due to head-of-line blocking, or drop the packet and set a bit in the writeback format of the transmit buffer descriptor, allowing other packets to be sent. We obviously choose the second option in the driver, but we do not detect the drop condition, so from the perspective of the network stack, the packet is sent and no error counter is incremented. This change checks the writeback of the TX BD when tc-taprio is enabled, and increments a specific ethtool statistics counter and a generic "tx_dropped" counter in ndo_get_stats64. Signed-off-by: Po Liu <Po.Liu@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
cfcfc8f5 |
|
09-Jan-2022 |
Christophe JAILLET <christophe.jaillet@wanadoo.fr> |
net: enetc: Remove useless DMA-32 fallback configuration As stated in [1], dma_set_mask() with a 64-bit mask never fails if dev->dma_mask is non-NULL. So, if it fails, the 32 bits case will also fail for the same reason. Simplify code and remove some dead code accordingly. [1]: https://lkml.org/lkml/2021/6/7/398 Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/dbecd4eb49a9586ee343b5473dda4b84c42112e9.1641742884.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
c8064e5b |
|
30-Nov-2021 |
Paolo Abeni <pabeni@redhat.com> |
bpf: Let bpf_warn_invalid_xdp_action() report more info In non trivial scenarios, the action id alone is not sufficient to identify the program causing the warning. Before the previous patch, the generated stack-trace pointed out at least the involved device driver. Let's additionally include the program name and id, and the relevant device name. If the user needs additional infos, he can fetch them via a kernel probe, leveraging the arguments added here. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/ddb96bb975cbfddb1546cf5da60e77d5100b533c.1638189075.git.pabeni@redhat.com
|
#
52066149 |
|
20-Oct-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: use the skb variable directly in enetc_clean_tx_ring() The code checks whether the skb had one-step TX timestamping enabled, in order to schedule the work item for emptying the priv->tx_skbs queue. That code checks for "tx_swbd->skb" directly, when we already had a skb retrieved using enetc_tx_swbd_get_skb(tx_swbd) - a TX software BD can also hold an XDP_TX packet or an XDP frame. But since the direct tx_swbd dereference is in an "if" block guarded by the non-NULL quality of "skb", accessing "tx_swbd->skb" directly is not wrong, just confusing. Just use the local variable named "skb". Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
ae77bdbc |
|
20-Oct-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: remove local "priv" variable in enetc_clean_tx_ring() The "priv" variable is needed in the "check_writeback" scope since commit d39823121911 ("enetc: add hardware timestamping support"). Since commit 7294380c5211 ("enetc: support PTP Sync packet one-step timestamping"), we also need "priv" in the larger function scope. So the local variable from the "if" block scope is not needed, and actually shadows the other one. Delete it. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
e79d8264 |
|
13-Oct-2021 |
Dan Carpenter <dan.carpenter@oracle.com> |
net: enetc: fix check for allocation failure This was supposed to be a check for if dma_alloc_coherent() failed but it has a copy and paste bug so it will not work. Fixes: fb8629e2cbfc ("net: enetc: add support for software TSO") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com> Link: https://lore.kernel.org/r/20211013080456.GC6010@kili Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
edce2a93 |
|
12-Oct-2021 |
Ioana Ciornei <ioana.ciornei@nxp.com> |
net: enetc: include ip6_checksum.h for csum_ipv6_magic For those architectures which do not define_HAVE_ARCH_IPV6_CSUM, we need to include ip6_checksum.h which provides the csum_ipv6_magic() function. Fixes: fb8629e2cbfc ("net: enetc: add support for software TSO") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20211012121358.16641-1-ioana.ciornei@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
fb8629e2 |
|
07-Oct-2021 |
Ioana Ciornei <ioana.ciornei@nxp.com> |
net: enetc: add support for software TSO This patch adds support for driver level TSO in the enetc driver using the TSO API. Beside using the usual tso_build_hdr(), tso_build_data() this specific implementation also has to compute the checksum, both IP and L4, for each resulted segment. This is because the ENETC controller does not support Tx checksum offload which is needed in order to take advantage of TSO. With the workaround for the ENETC MDIO erratum in place the Tx path of the driver is forced to lock/unlock for each skb sent. This is why, even though we are computing the checksum by hand we see the following improvement in TCP termination on the LS1028A SoC, on a single A72 core running at 1.3GHz: before: 1.63 Gbits/sec after: 2.34 Gbits/sec Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
acede3c5 |
|
07-Oct-2021 |
Ioana Ciornei <ioana.ciornei@nxp.com> |
net: enetc: declare NETIF_F_HW_CSUM and do it in software This is just a preparation patch for software TSO in the enetc driver. Unfortunately, ENETC does not support Tx checksum offload which would normally render TSO, even software, impossible. Declare NETIF_F_HW_CSUM as part of the feature set and do it at driver level using skb_csum_hwoffload_help() so that we can move forward and also add support for TSO in the next patch. Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a72691ee |
|
15-Sep-2021 |
Cai Huoqing <caihuoqing@baidu.com> |
net: enetc: Make use of the helper function dev_err_probe() When possible use dev_err_probe help to properly deal with the PROBE_DEFER error, the benefit is that DEFER issue will be logged in the devices_deferred debugfs file. And using dev_err_probe() can reduce code size, and simplify the code. Signed-off-by: Cai Huoqing <caihuoqing@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9f7afa05 |
|
17-Sep-2021 |
Claudiu Manoil <claudiu.manoil@nxp.com> |
enetc: Fix uninitialized struct dim_sample field usage The only struct dim_sample member that does not get initialized by dim_update_sample() is comp_ctr. (There is special API to initialize comp_ctr: dim_update_sample_with_comps(), and it is currently used only for RDMA.) comp_ctr is used to compute curr_stats->cmps and curr_stats->cpe_ratio (see dim_calc_stats()) which in turn are consumed by the rdma_dim_*() API. Therefore, functionally, the net_dim*() API consumers are not affected. Nevertheless, fix the computation of statistics based on an uninitialized variable, even if the mentioned statistics are not used at the moment. Fixes: ae0e6a5d1627 ("enetc: Add adaptive interrupt coalescing") Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7237a494 |
|
17-Sep-2021 |
Claudiu Manoil <claudiu.manoil@nxp.com> |
enetc: Fix illegal access when reading affinity_hint irq_set_affinity_hit() stores a reference to the cpumask_t parameter in the irq descriptor, and that reference can be accessed later from irq_affinity_hint_proc_show(). Since the cpu_mask parameter passed to irq_set_affinity_hit() has only temporary storage (it's on the stack memory), later accesses to it are illegal. Thus reads from the corresponding procfs affinity_hint file can result in paging request oops. The issue is fixed by the get_cpu_mask() helper, which provides a permanent storage for the cpumask_t parameter. Fixes: d4fd0404c1c9 ("enetc: Introduce basic PF and VF ENETC ethernet drivers") Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7ce9c3d3 |
|
23-Apr-2021 |
Yangbo Lu <yangbo.lu@nxp.com> |
enetc: fix locking for one-step timestamping packet transfer The previous patch to support PTP Sync packet one-step timestamping described one-step timestamping packet handling logic as below in commit message: - Trasmit packet immediately if no other one in transfer, or queue to skb queue if there is already one in transfer. The test_and_set_bit_lock() is used here to lock and check state. - Start a work when complete transfer on hardware, to release the bit lock and to send one skb in skb queue if has. There was not problem of the description, but there was a mistake in implementation. The locking/test_and_set_bit_lock() should be put in enetc_start_xmit() which may be called by worker, rather than in enetc_xmit(). Otherwise, the worker calling enetc_start_xmit() after bit lock released is not able to lock again for transfer. Fixes: 7294380c5211 ("enetc: support PTP Sync packet one-step timestamping") Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
24e39309 |
|
16-Apr-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: apply the MDIO workaround for XDP_REDIRECT too Described in fd5736bf9f23 ("enetc: Workaround for MDIO register access issue") is a workaround for a hardware bug that requires a register access of the MDIO controller to never happen concurrently with a register access of a port PF. To avoid that, a mutual exclusion scheme with rwlocks was implemented - the port PF accessors are the 'read' side, and the MDIO accessors are the 'write' side. When we do XDP_REDIRECT between two ENETC interfaces, all is fine because the MDIO lock is already taken from the NAPI poll loop. But when the ingress interface is not ENETC, just the egress is, the MDIO lock is not taken, so we might access the port PF registers concurrently with MDIO, which will make the link flap due to wrong values returned from the PHY. To avoid this, let's just slap an enetc_lock_mdio/enetc_unlock_mdio at the beginning and ending of enetc_xdp_xmit. The fact that the MDIO lock is designed as a rwlock is important here, because the read side is reentrant (that is one of the main reasons why we chose it). Usually, the way we benefit of its reentrancy is by running the data path concurrently on both CPUs, but in this case, we benefit from the reentrancy by taking the lock even when the lock is already taken (and that's the situation where ENETC is both the ingress and the egress interface for XDP_REDIRECT, which was fine before and still is fine now). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
92ff9a6e |
|
16-Apr-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: fix buffer leaks with XDP_TX enqueue rejections If the TX ring is congested, enetc_xdp_tx() returns false for the current XDP frame (represented as an array of software BDs). This array of software TX BDs is constructed in enetc_rx_swbd_to_xdp_tx_swbd from software BDs freshly cleaned from the RX ring. The issue is that we scrub the RX software BDs too soon, more precisely before we know that we can enqueue the TX BDs successfully into the TX ring. If we can't enqueue them (and enetc_xdp_tx returns false), we call enetc_xdp_drop which attempts to recycle the buffers held by the RX software BDs. But because we scrubbed those RX BDs already, two things happen: (a) we leak their memory (b) we populate the RX software BD ring with an all-zero rx_swbd structure, which makes the buffer refill path allocate more memory. enetc_refill_rx_ring -> if (unlikely(!rx_swbd->page)) -> enetc_new_page That is a recipe for fast OOM. Fixes: 7ed2bc80074e ("net: enetc: add support for XDP_TX") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
975acc83 |
|
16-Apr-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: handle the invalid XDP action the same way as XDP_DROP When the XDP program returns an invalid action, we should free the RX buffer. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7eab503b |
|
16-Apr-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: use dedicated TX rings for XDP It is possible for one CPU to perform TX hashing (see netdev_pick_tx) between the 8 ENETC TX rings, and the TX hashing to select TX queue 1. At the same time, it is possible for the other CPU to already use TX ring 1 for XDP (either XDP_TX or XDP_REDIRECT). Since there is no mutual exclusion between XDP and the network stack, we run into an issue because the ENETC TX procedure is not reentrant. The obvious approach would be to just make XDP take the lock of the network stack's TX queue corresponding to the ring it's about to enqueue in. For XDP_REDIRECT, this is quite straightforward, a lock at the beginning and end of enetc_xdp_xmit() should do the trick. But for XDP_TX, it's a bit more complicated. For one, we do TX batching all by ourselves for frames with the XDP_TX verdict. This is something we would like to keep the way it is, for performance reasons. But batching means that the network stack's lock should be kept from the first enqueued XDP_TX frame and until we ring the doorbell. That is mostly fine, except for cases when in the same NAPI loop we have mixed XDP_TX and XDP_REDIRECT frames. So if enetc_xdp_xmit() gets called while we are holding the lock from the RX NAPI, then bam, deadlock. The naive answer could be 'just flush the XDP_TX frames first, then release the network stack's TX queue lock, then call xdp_do_flush_map()'. But even xdp_do_redirect() is capable of flushing the batched XDP_REDIRECT frames, so unless we unlock/relock the TX queue around xdp_do_redirect(), there simply isn't any clean way to protect XDP_TX from concurrent network stack .ndo_start_xmit() on another CPU. So we need to take a different approach, and that is to reserve two rings for the sole use of XDP. We leave TX rings 0..ndev->real_num_tx_queues-1 to be handled by the network stack, and we pick them from the end of the priv->tx_ring array. We make an effort to keep the mapping done by enetc_alloc_msix() which decides which CPU handles the TX completions of which TX ring in its NAPI poll. So the XDP TX ring of CPU 0 is handled by TX ring 6, and the XDP TX ring of CPU 1 is handled by TX ring 7. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a6369fe6 |
|
16-Apr-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: remove unneeded xdp_do_flush_map() xdp_do_redirect already contains: -> dev_map_enqueue -> __xdp_enqueue -> bq_enqueue -> bq_xmit_all // if we have more than 16 frames So the logic from enetc will never be hit, because ENETC_DEFAULT_TX_WORK is 128. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8f50d8bb |
|
16-Apr-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: stop XDP NAPI processing when build_skb() fails When the code path below fails: enetc_clean_rx_ring_xdp // XDP_PASS -> enetc_build_skb -> enetc_map_rx_buff_to_skb -> build_skb enetc_clean_rx_ring_xdp will 'break', but that 'break' instruction isn't strong enough to actually break the NAPI poll loop, just the switch/case statement for XDP actions. So we increment rx_frm_cnt and go to the next frames minding our own business. Instead let's do what the skb NAPI poll function does, and break the loop now, waiting for the memory pressure to go away. Otherwise the next calls to build_skb() are likely to fail too. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
672f9a21 |
|
16-Apr-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: recycle buffers for frames with RX errors When receiving a frame with errors, currently we do nothing with it (we don't construct an skb or an xdp_buff), we just exit the NAPI poll loop. Let's put the buffer back into the RX ring (similar to XDP_DROP). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6b04830d |
|
16-Apr-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: rename the buffer reuse helpers enetc_put_xdp_buff has nothing to do with XDP, frankly, it is just a helper to populate the recycle end of the shadow RX BD ring (next_to_alloc) with a given buffer. On the other hand, enetc_put_rx_buff plays more tricks than its name would suggest. So let's rename enetc_put_rx_buff into enetc_flip_rx_buff to reflect the half-page buffer reuse tricks that it employs, and enetc_put_xdp_buff into enetc_put_rx_buff which suggests a more garden-variety operation. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e9e49ae8 |
|
16-Apr-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: remove redundant clearing of skb/xdp_frame pointer in TX conf path Later in enetc_clean_tx_ring we have: /* Scrub the swbd here so we don't have to do that * when we reuse it during xmit */ memset(tx_swbd, 0, sizeof(*tx_swbd)); So these assignments are unnecessary. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b6faf160 |
|
14-Apr-2021 |
Yangbo Lu <yangbo.lu@nxp.com> |
enetc: convert to schedule_work() Convert system_wq queue_work() to schedule_work() which is a wrapper around it, since the former is a rare construct. Fixes: 7294380c5211 ("enetc: support PTP Sync packet one-step timestamping") Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7294380c |
|
12-Apr-2021 |
Yangbo Lu <yangbo.lu@nxp.com> |
enetc: support PTP Sync packet one-step timestamping This patch is to add support for PTP Sync packet one-step timestamping. Since ENETC single-step register has to be configured dynamically per packet for correctionField offeset and UDP checksum update, current one-step timestamping packet has to be sent only when the last one completes transmitting on hardware. So, on the TX, this patch handles one-step timestamping packet as below: - Trasmit packet immediately if no other one in transfer, or queue to skb queue if there is already one in transfer. The test_and_set_bit_lock() is used here to lock and check state. - Start a work when complete transfer on hardware, to release the bit lock and to send one skb in skb queue if has. And the configuration for one-step timestamping on ENETC before transmitting is, - Set one-step timestamping flag in extension BD. - Write 30 bits current timestamp in tstamp field of extension BD. - Update PTP Sync packet originTimestamp field with current timestamp. - Configure single-step register for correctionField offeset and UDP checksum update. Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f768e751 |
|
12-Apr-2021 |
Yangbo Lu <yangbo.lu@nxp.com> |
enetc: mark TX timestamp type per skb Mark TX timestamp type per skb on skb->cb[0], instead of global variable for all skbs. This is a preparation for one step timestamp support. For one-step timestamping enablement, there will be both one-step and two-step PTP messages to transfer. And a skb queue is needed for one-step PTP messages making sure start to send current message only after the last one completed on hardware. (ENETC single-step register has to be dynamically configured per message.) So, marking TX timestamp type per skb is required. Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6c5e6b4c |
|
09-Apr-2021 |
Claudiu Manoil <claudiu.manoil@nxp.com> |
enetc: Use generic rule to map Tx rings to interrupt vectors Even if the current mapping is correct for the 1 CPU and 2 CPU cases (currently enetc is included in SoCs with up to 2 CPUs only), better use a generic rule for the mapping to cover all possible cases. The number of CPUs is the same as the number of interrupt vectors: Per device Tx rings - device_tx_ring[idx], where idx = 0..n_rings_total-1 Per interrupt vector Tx rings - int_vector[i].ring[j], where i = 0..n_int_vects-1 j = 0..n_rings_per_v-1 Mapping rule - n_rings_per_v = n_rings_total / n_int_vects for i = 0..n_int_vects - 1: for j = 0..n_rings_per_v - 1: idx = n_int_vects * j + i int_vector[i].ring[j] <- device_tx_ring[idx] Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20210409071613.28912-1-claudiu.manoil@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
a93580a0 |
|
09-Apr-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: fix TX ring interrupt storm The blamed commit introduced a bit in the TX software buffer descriptor structure for determining whether a BD is final or not; we rearm the TX interrupt vector for every frame (hence final BD) transmitted. But there is a problem with the patch: it replaced a condition whose expression is a bool which was evaluated at the beginning of the "while" loop with a bool expression that is evaluated on the spot: tx_swbd->is_eof. The problem with the latter expression is that the tx_swbd has already been incremented at that stage, so the tx_swbd->is_eof check is in fact with the _next_ software BD. Which is _not_ final. The effect is that the CPU is in 100% load with ksoftirqd because it does not acknowledge the TX interrupt, so the handler keeps getting called again and again. The fix is to restore the code structure, and keep the local bool is_eof variable, just to assign it the tx_swbd->is_eof value instead of !!tx_swbd->skb. Fixes: d504498d2eb3 ("net: enetc: add a dedicated is_eof bit in the TX software BD") Reported-by: Alex Marginean <alexandru.marginean@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Link: https://lore.kernel.org/r/20210409192759.3895104-1-olteanv@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
626b598a |
|
09-Apr-2021 |
Dan Carpenter <dan.carpenter@oracle.com> |
net: enetc: fix array underflow in error handling code This loop will try to unmap enetc_unmap_tx_buff[-1] and crash. Fixes: 9d2b68cc108d ("net: enetc: add support for XDP_REDIRECT") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/YHBHfCY/yv3EnM9z@mwanda Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
9d2b68cc |
|
31-Mar-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: add support for XDP_REDIRECT The driver implementation of the XDP_REDIRECT action reuses parts from XDP_TX, most notably the enetc_xdp_tx function which transmits an array of TX software BDs. Only this time, the buffers don't have DMA mappings, we need to create them. When a BPF program reaches the XDP_REDIRECT verdict for a frame, we can employ the same buffer reuse strategy as for the normal processing path and for XDP_PASS: we can flip to the other page half and seed that to the RX ring. Note that scatter/gather support is there, but disabled due to lack of multi-buffer support in XDP (which is added by this series): https://patchwork.kernel.org/project/netdevbpf/cover/cover.1616179034.git.lorenzo@kernel.org/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7ed2bc80 |
|
31-Mar-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: add support for XDP_TX For reflecting packets back into the interface they came from, we create an array of TX software BDs derived from the RX software BDs. Therefore, we need to extend the TX software BD structure to contain most of the stuff that's already present in the RX software BD structure, for reasons that will become evident in a moment. For a frame with the XDP_TX verdict, we don't reuse any buffer right away as we do for XDP_DROP (the same page half) or XDP_PASS (the other page half, same as the skb code path). Because the buffer transfers ownership from the RX ring to the TX ring, reusing any page half right away is very dangerous. So what we can do is we can recycle the same page half as soon as TX is complete. The code path is: enetc_poll -> enetc_clean_rx_ring_xdp -> enetc_xdp_tx -> enetc_refill_rx_ring (time passes, another MSI interrupt is raised) enetc_poll -> enetc_clean_tx_ring -> enetc_recycle_xdp_tx_buff But that creates a problem, because there is a potentially large time window between enetc_xdp_tx and enetc_recycle_xdp_tx_buff, period in which we'll have less and less RX buffers. Basically, when the ship starts sinking, the knee-jerk reaction is to let enetc_refill_rx_ring do what it does for the standard skb code path (refill every 16 consumed buffers), but that turns out to be very inefficient. The problem is that we have no rx_swbd->page at our disposal from the enetc_reuse_page path, so enetc_refill_rx_ring would have to call enetc_new_page for every buffer that we refill (if we choose to refill at this early stage). Very inefficient, it only makes the problem worse, because page allocation is an expensive process, and CPU time is exactly what we're lacking. Additionally, there is an even bigger problem: if we let enetc_refill_rx_ring top up the ring's buffers again from the RX path, remember that the buffers sent to transmission haven't disappeared anywhere. They will be eventually sent, and processed in enetc_clean_tx_ring, and an attempt will be made to recycle them. But surprise, the RX ring is already full of new buffers, because we were premature in deciding that we should refill. So not only we took the expensive decision of allocating new pages, but now we must throw away perfectly good and reusable buffers. So what we do is we implement an elastic refill mechanism, which keeps track of the number of in-flight XDP_TX buffer descriptors. We top up the RX ring only up to the total ring capacity minus the number of BDs that are in flight (because we know that those BDs will return to us eventually). The enetc driver manages 1 RX ring per CPU, and the default TX ring management is the same. So we do XDP_TX towards the TX ring of the same index, because it is affined to the same CPU. This will probably not produce great results when we have a tc-taprio/tc-mqprio qdisc on the interface, because in that case, the number of TX rings might be greater, but I didn't add any checks for that yet (mostly because I didn't know what checks to add). It should also be noted that we need to change the DMA mapping direction for RX buffers, since they may now be reflected into the TX ring of the same device. We choose to use DMA_BIDIRECTIONAL instead of unmapping and remapping as DMA_TO_DEVICE, because performance is better this way. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d1b15102 |
|
31-Mar-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: add support for XDP_DROP and XDP_PASS For the RX ring, enetc uses an allocation scheme based on pages split into two buffers, which is already very efficient in terms of preventing reallocations / maximizing reuse, so I see no reason why I would change that. +--------+--------+--------+--------+--------+--------+--------+ | | | | | | | | | half B | half B | half B | half B | half B | half B | half B | | | | | | | | | +--------+--------+--------+--------+--------+--------+--------+ | | | | | | | | | half A | half A | half A | half A | half A | half A | half A | RX ring | | | | | | | | +--------+--------+--------+--------+--------+--------+--------+ ^ ^ | | next_to_clean next_to_alloc next_to_use +--------+--------+--------+--------+--------+ | | | | | | | half B | half B | half B | half B | half B | | | | | | | +--------+--------+--------+--------+--------+--------+--------+ | | | | | | | | | half B | half B | half A | half A | half A | half A | half A | RX ring | | | | | | | | +--------+--------+--------+--------+--------+--------+--------+ | | | ^ ^ | half A | half A | | | | | | next_to_clean next_to_use +--------+--------+ ^ | next_to_alloc then when enetc_refill_rx_ring is called, whose purpose is to advance next_to_use, it sees that it can take buffers up to next_to_alloc, and it says "oh, hey, rx_swbd->page isn't NULL, I don't need to allocate one!". The only problem is that for default PAGE_SIZE values of 4096, buffer sizes are 2048 bytes. While this is enough for normal skb allocations at an MTU of 1500 bytes, for XDP it isn't, because the XDP headroom is 256 bytes, and including skb_shared_info and alignment, we end up being able to make use of only 1472 bytes, which is insufficient for the default MTU. To solve that problem, we implement scatter/gather processing in the driver, because we would really like to keep the existing allocation scheme. A packet of 1500 bytes is received in a buffer of 1472 bytes and another one of 28 bytes. Because the headroom required by XDP is different (and much larger) than the one required by the network stack, whenever a BPF program is added or deleted on the port, we drain the existing RX buffers and seed new ones with the required headroom. We also keep the required headroom in rx_ring->buffer_offset. The simplest way to implement XDP_PASS, where an skb must be created, is to create an xdp_buff based on the next_to_clean RX BDs, but not clear those BDs from the RX ring yet, just keep the original index at which the BDs for this frame started. Then, if the verdict is XDP_PASS, instead of converting the xdb_buff to an skb, we replay a call to enetc_build_skb (just as in the normal enetc_clean_rx_ring case), starting from the original BD index. We would also like to be minimally invasive to the regular RX data path, and not check whether there is a BPF program attached to the ring on every packet. So we create a separate RX ring processing function for XDP. Because we only install/remove the BPF program while the interface is down, we forgo the rcu_read_lock() in enetc_clean_rx_ring, since there shouldn't be any circumstance in which we are processing packets and there is a potentially freed BPF program attached to the RX ring. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
65d0cbb4 |
|
31-Mar-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: move up enetc_reuse_page and enetc_page_reusable For XDP_TX, we need to call enetc_reuse_page from enetc_clean_tx_ring, so we need to avoid a forward declaration. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1ee8d6f3 |
|
31-Mar-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: clean the TX software BD on the TX confirmation path With the future introduction of some new fields into enetc_tx_swbd such as is_xdp_tx, is_xdp_redirect etc, we need not only to set these bits to true from the XDP_TX/XDP_REDIRECT code path, but also to false from the old code paths. This is because TX software buffer descriptors are kept in a ring that is shadow of the hardware TX ring, so these structures keep getting reused, and there is always the possibility that when a software BD is reused (after we ran a full circle through the TX ring), the old user of the tx_swbd had set is_xdp_tx = true, and now we are sending a regular skb, which would need to set is_xdp_tx = false. To be minimally invasive to the old code paths, let's just scrub the software TX BD in the TX confirmation path (enetc_clean_tx_ring), once we know that nobody uses this software TX BD (tx_ring->next_to_clean hasn't yet been updated, and the TX paths check enetc_bd_unused which tells them if there's any more space in the TX ring for a new enqueue). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d504498d |
|
31-Mar-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: add a dedicated is_eof bit in the TX software BD In the transmit path, if we have a scatter/gather frame, it is put into multiple software buffer descriptors, the last of which has the skb pointer populated (which is necessary for rearming the TX MSI vector and for collecting the two-step TX timestamp from the TX confirmation path). At the moment, this is sufficient, but with XDP_TX, we'll need to service TX software buffer descriptors that don't have an skb pointer, however they might be final nonetheless. So add a dedicated bit for final software BDs that we populate and check explicitly. Also, we keep looking just for an skb when doing TX timestamping, because we don't want/need that for XDP. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a800abd3 |
|
31-Mar-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: move skb creation into enetc_build_skb We need to build an skb from two code paths now: from the plain RX data path and from the XDP data path when the verdict is XDP_PASS. Create a new enetc_build_skb function which contains the essential steps for building an skb based on the first and last positions of buffer descriptors within the RX ring. We also squash the enetc_process_skb function into enetc_build_skb, because what that function did wasn't very meaningful on its own. The "rx_frm_cnt++" instruction has been moved around napi_gro_receive for cosmetic reasons, to be in the same spot as rx_byte_cnt++, which itself must be before napi_gro_receive, because that's when we lose ownership of the skb. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2fa423f5 |
|
31-Mar-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: consume the error RX buffer descriptors in a dedicated function We can and should check the RX BD errors before starting to build the skb. The only apparent reason why things are done in this backwards order is to spare one call to enetc_rxbd_next. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7a5222cb |
|
10-Mar-2021 |
Vladimir Oltean <olteanv@gmail.com> |
net: enetc: make enetc_refill_rx_ring update the consumer index Since commit fd5736bf9f23 ("enetc: Workaround for MDIO register access issue"), enetc_refill_rx_ring no longer updates the RX BD ring's consumer index, that is left to be done by the caller. This has led to bugs such as the ones found in 96a5223b918c ("net: enetc: remove bogus write to SIRXIDR from enetc_setup_rxbdr") and 3a5d12c9be6f ("net: enetc: keep RX ring consumer index in sync with hardware"), so it is desirable that we move back the update of the consumer index into enetc_refill_rx_ring. The trouble with that is the different MDIO locking context for the two callers of enetc_refill_rx_ring: - enetc_clean_rx_ring runs under enetc_lock_mdio() - enetc_setup_rxbdr runs outside enetc_lock_mdio() Simplify the callers of enetc_refill_rx_ring by making enetc_setup_rxbdr explicitly take enetc_lock_mdio() around the call. It will be the only place in need of ensuring the hot accessors can be used. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0486185e |
|
10-Mar-2021 |
Vladimir Oltean <olteanv@gmail.com> |
net: enetc: remove forward declaration for enetc_map_tx_buffs There is no other reason why this forward declaration exists rather than poor ordering of the functions. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8580b3c3 |
|
10-Mar-2021 |
Vladimir Oltean <olteanv@gmail.com> |
net: enetc: remove forward-declarations of enetc_clean_{rx,tx}_ring This patch moves the NAPI enetc_poll after enetc_clean_rx_ring such that we can delete the forward declarations. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c027aa92 |
|
10-Mar-2021 |
Vladimir Oltean <olteanv@gmail.com> |
net: enetc: simplify callers of enetc_rxbd_next When we iterate through the BDs in the RX ring, the software producer index (which is already passed by value to enetc_rxbd_next) lags behind, and we end up with this funny looking "++i == rx_ring->bd_count" check so that we drag it after us. Let's pass the software producer index "i" by reference, so that enetc_rxbd_next can increment it by itself (mod rx_ring->bd_count), especially since enetc_rxbd_next has to increment the index anyway. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4b47c0b8 |
|
10-Mar-2021 |
Vladimir Oltean <olteanv@gmail.com> |
net: enetc: don't initialize unused ports from a separate code path Since commit 3222b5b613db ("net: enetc: initialize RFS/RSS memories for unused ports too") there is a requirement to initialize the memories of unused PFs too, which has left the probe path in a bit of a rough shape, because we basically have a minimal initialization path for unused PFs which is separate from the main initialization path. Now that initializing a control BD ring is as simple as calling enetc_setup_cbdr, let's move that outside of enetc_alloc_si_resources (unused PFs don't need classification rules, so no point in allocating them just to free them later). But enetc_alloc_si_resources is called both for PFs and for VFs, so now that enetc_setup_cbdr is no longer called from this common function, it means that the VF probe path needs to explicitly call enetc_setup_cbdr too. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5b4daa7f |
|
10-Mar-2021 |
Vladimir Oltean <olteanv@gmail.com> |
net: enetc: pass bd_count as an argument to enetc_setup_cbdr It makes no sense from an API perspective to first initialize some portion of struct enetc_cbdr outside enetc_setup_cbdr, then leave that function to initialize the rest. enetc_setup_cbdr should be able to perform all initialization given a zero-initialized struct enetc_cbdr. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0bfde022 |
|
10-Mar-2021 |
Vladimir Oltean <olteanv@gmail.com> |
net: enetc: squash clear_cbdr and free_cbdr into teardown_cbdr All call sites call enetc_clear_cbdr and enetc_free_cbdr one after another, so let's combine the two functions into a single method named enetc_teardown_cbdr which does both, and in the same order. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
27f9025d |
|
10-Mar-2021 |
Vladimir Oltean <olteanv@gmail.com> |
net: enetc: save the mode register address inside struct enetc_cbdr enetc_clear_cbdr depends on struct enetc_hw because it must disable the ring through a register write. We'd like to remove that dependency, so let's do what's already done with the producer and consumer indices, which is to save the iomem address in a variable kept in struct enetc_cbdr. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
24be14e3 |
|
10-Mar-2021 |
Vladimir Oltean <olteanv@gmail.com> |
net: enetc: squash enetc_alloc_cbdr and enetc_setup_cbdr enetc_alloc_cbdr and enetc_setup_cbdr are always called one after another, so we can simplify the callers and make enetc_setup_cbdr do everything that's needed. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
01121ab7 |
|
10-Mar-2021 |
Vladimir Oltean <olteanv@gmail.com> |
net: enetc: save the DMA device for enetc_free_cbdr We shouldn't need to pass the struct device *dev to enetc CBDR APIs over and over again, so save this inside struct enetc_cbdr::dma_dev and avoid calling it from the enetc_free_cbdr functions. This breaks the dependency of the cbdr API from struct enetc_si (the station interface). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
176769d1 |
|
10-Mar-2021 |
Vladimir Oltean <olteanv@gmail.com> |
net: enetc: move the CBDR API to enetc_cbdr.c Since there is a dedicated file in this driver for interacting with control BD rings, it makes sense to move these functions there. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
847cbfc0 |
|
10-Mar-2021 |
Vladimir Oltean <olteanv@gmail.com> |
net: add a helper to avoid issues with HW TX timestamping and SO_TXTIME As explained in commit 29d98f54a4fe ("net: enetc: allow hardware timestamping on TX queues with tc-etf enabled"), hardware TX timestamping requires an skb with skb->tstamp = 0. When a packet is sent with SO_TXTIME, the skb->skb_mstamp_ns corrupts the value of skb->tstamp, so the drivers need to explicitly reset skb->tstamp to zero after consuming the TX time. Create a helper named skb_txtime_consumed() which does just that. All drivers which offload TC_SETUP_QDISC_ETF should implement it, and it would make it easier to assess during review whether they do the right thing in order to be compatible with hardware timestamping or not. Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
29d98f54 |
|
07-Mar-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: allow hardware timestamping on TX queues with tc-etf enabled The txtime is passed to the driver in skb->skb_mstamp_ns, which is actually in a union with skb->tstamp (the place where software timestamps are kept). Since commit b50a5c70ffa4 ("net: allow simultaneous SW and HW transmit timestamping"), __sock_recv_timestamp has some logic for making sure that the two calls to skb_tstamp_tx: skb_tx_timestamp(skb) # Software timestamp in the driver -> skb_tstamp_tx(skb, NULL) and skb_tstamp_tx(skb, &shhwtstamps) # Hardware timestamp in the driver will both do the right thing and in a race-free manner, meaning that skb_tx_timestamp will deliver a cmsg with the software timestamp only, and skb_tstamp_tx with a non-NULL hwtstamps argument will deliver a cmsg with the hardware timestamp only. Why are races even possible? Well, because although the software timestamp skb->tstamp is private per skb, the hardware timestamp skb_hwtstamps(skb) lives in skb_shinfo(skb), an area which is shared between skbs and their clones. And skb_tstamp_tx works by cloning the packets when timestamping them, therefore attempting to perform hardware timestamping on an skb's clone will also change the hardware timestamp of the original skb. And the original skb might have been yet again cloned for software timestamping, at an earlier stage. So the logic in __sock_recv_timestamp can't be as simple as saying "does this skb have a hardware timestamp? if yes I'll send the hardware timestamp to the socket, otherwise I'll send the software timestamp", precisely because the hardware timestamp is shared. Instead, it's quite the other way around: __sock_recv_timestamp says "does this skb have a software timestamp? if yes, I'll send the software timestamp, otherwise the hardware one". This works because the software timestamp is not shared with clones. But that means we have a problem when we attempt hardware timestamping with skbs that don't have the skb->tstamp == 0. __sock_recv_timestamp will say "oh, yeah, this must be some sort of odd clone" and will not deliver the hardware timestamp to the socket. And this is exactly what is happening when we have txtime enabled on the socket: as mentioned, that is put in a union with skb->tstamp, so it is quite easy to mistake it. Do what other drivers do (intel igb/igc) and write zero to skb->tstamp before taking the hardware timestamp. It's of no use to us now (we're already on the TX confirmation path). Fixes: 0d08c9ec7d6e ("enetc: add support time specific departure base on the qos etf") Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3a5d12c9 |
|
01-Mar-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: keep RX ring consumer index in sync with hardware The RX rings have a producer index owned by hardware, where newly received frame buffers are placed, and a consumer index owned by software, where newly allocated buffers are placed, in expectation of hardware being able to place frame data in them. Hardware increments the producer index when a frame is received, however it is not allowed to increment the producer index to match the consumer index (RBCIR) since the ring can hold at most RBLENR[LENGTH]-1 received BDs. Whenever the producer index matches the value of the consumer index, the ring has no unprocessed received frames and all BDs in the ring have been initialized/prepared by software, i.e. hardware owns all BDs in the ring. The code uses the next_to_clean variable to keep track of the producer index, and the next_to_use variable to keep track of the consumer index. The RX rings are seeded from enetc_refill_rx_ring, which is called from two places: 1. initially the ring is seeded until full with enetc_bd_unused(rx_ring), i.e. with 511 buffers. This will make next_to_clean=0 and next_to_use=511: .ndo_open -> enetc_open -> enetc_setup_bdrs -> enetc_setup_rxbdr -> enetc_refill_rx_ring 2. then during the data path processing, it is refilled with 16 buffers at a time: enetc_msix -> napi_schedule -> enetc_poll -> enetc_clean_rx_ring -> enetc_refill_rx_ring There is just one problem: the initial seeding done during .ndo_open updates just the producer index (ENETC_RBPIR) with 0, and the software next_to_clean and next_to_use variables. Notably, it will not update the consumer index to make the hardware aware of the newly added buffers. Wait, what? So how does it work? Well, the reset values of the producer index and of the consumer index of a ring are both zero. As per the description in the second paragraph, it means that the ring is full of buffers waiting for hardware to put frames in them, which by coincidence is almost true, because we have in fact seeded 511 buffers into the ring. But will the hardware attempt to access the 512th entry of the ring, which has an invalid BD in it? Well, no, because in order to do that, it would have to first populate the first 511 entries, and the NAPI enetc_poll will kick in by then. Eventually, after 16 processed slots have become available in the RX ring, enetc_clean_rx_ring will call enetc_refill_rx_ring and then will [ finally ] update the consumer index with the new software next_to_use variable. From now on, the next_to_clean and next_to_use variables are in sync with the producer and consumer ring indices. So the day is saved, right? Well, not quite. Freeing the memory allocated for the rings is done in: enetc_close -> enetc_clear_bdrs -> enetc_clear_rxbdr -> this just disables the ring -> enetc_free_rxtx_rings -> enetc_free_rx_ring -> sets next_to_clean and next_to_use to 0 but again, nothing is committed to the hardware producer and consumer indices (yay!). The assumption is that the ring is disabled, so the indices don't matter anyway, and it's the responsibility of the "open" code path to set those up. .. Except that the "open" code path does not set those up properly. While initially, things almost work, during subsequent enetc_close -> enetc_open sequences, we have problems. To be precise, the enetc_open that is subsequent to enetc_close will again refill the ring with 511 entries, but it will leave the consumer index untouched. Untouched means, of course, equal to the value it had before disabling the ring and draining the old buffers in enetc_close. But as mentioned, enetc_setup_rxbdr will at least update the producer index though, through this line of code: enetc_rxbdr_wr(hw, idx, ENETC_RBPIR, 0); so at this stage we'll have: next_to_clean=0 (in hardware 0) next_to_use=511 (in hardware we'll have the refill index prior to enetc_close) Again, the next_to_clean and producer index are in sync and set to correct values, so the driver manages to limp on. Eventually, 16 ring entries will be consumed by enetc_poll, and the savior enetc_clean_rx_ring will come and call enetc_refill_rx_ring, and then update the hardware consumer ring based upon the new next_to_use. So.. it works? Well, by coincidence, it almost does, but there's a circumstance where enetc_clean_rx_ring won't be there to save us. If the previous value of the consumer index was 15, there's a problem, because the NAPI poll sequence will only issue a refill when 16 or more buffers have been consumed. It's easiest to illustrate this with an example: ip link set eno0 up ip addr add 192.168.100.1/24 dev eno0 ping 192.168.100.1 -c 20 # ping this port from another board ip link set eno0 down ip link set eno0 up ping 192.168.100.1 -c 20 # ping it again from the same other board One by one: 1. ip link set eno0 up -> calls enetc_setup_rxbdr: -> calls enetc_refill_rx_ring(511 buffers) -> next_to_clean=0 (in hw 0) -> next_to_use=511 (in hw 0) 2. ping 192.168.100.1 -c 20 # ping this port from another board enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 0 (in hw 1) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 1 (in hw 2) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 2 (in hw 3) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 3 (in hw 4) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 4 (in hw 5) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 5 (in hw 6) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=7 next_to_clean 6 (in hw 7) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=8 next_to_clean 7 (in hw 8) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=9 next_to_clean 8 (in hw 9) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=10 next_to_clean 9 (in hw 10) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=11 next_to_clean 10 (in hw 11) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=12 next_to_clean 11 (in hw 12) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=13 next_to_clean 12 (in hw 13) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=14 next_to_clean 13 (in hw 14) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=15 next_to_clean 14 (in hw 15) next_to_use 511 (in hw 0) enetc_clean_rx_ring: enetc_refill_rx_ring(16) increments next_to_use by 16 (mod 512) and writes it to hw enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=0 next_to_clean 15 (in hw 16) next_to_use 15 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 16 (in hw 17) next_to_use 15 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 17 (in hw 18) next_to_use 15 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 18 (in hw 19) next_to_use 15 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 19 (in hw 20) next_to_use 15 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 20 (in hw 21) next_to_use 15 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 21 (in hw 22) next_to_use 15 (in hw 15) 20 packets transmitted, 20 packets received, 0% packet loss 3. ip link set eno0 down enetc_free_rx_ring: next_to_clean 0 (in hw 22), next_to_use 0 (in hw 15) 4. ip link set eno0 up -> calls enetc_setup_rxbdr: -> calls enetc_refill_rx_ring(511 buffers) -> next_to_clean=0 (in hw 0) -> next_to_use=511 (in hw 15) 5. ping 192.168.100.1 -c 20 # ping it again from the same other board enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 0 (in hw 1) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 1 (in hw 2) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 2 (in hw 3) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 3 (in hw 4) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 4 (in hw 5) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 5 (in hw 6) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=7 next_to_clean 6 (in hw 7) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=8 next_to_clean 7 (in hw 8) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=9 next_to_clean 8 (in hw 9) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=10 next_to_clean 9 (in hw 10) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=11 next_to_clean 10 (in hw 11) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=12 next_to_clean 11 (in hw 12) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=13 next_to_clean 12 (in hw 13) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=14 next_to_clean 13 (in hw 14) next_to_use 511 (in hw 15) 20 packets transmitted, 12 packets received, 40% packet loss And there it dies. No enetc_refill_rx_ring (because cleaned_cnt must be equal to 15 for that to happen), no nothing. The hardware enters the condition where the producer (14) + 1 is equal to the consumer (15) index, which makes it believe it has no more free buffers to put packets in, so it starts discarding them: ip netns exec ns0 ethtool -S eno0 | grep -v ': 0' NIC statistics: Rx ring 0 discarded frames: 8 Summarized, if the interface receives between 16 and 32 (mod 512) frames and then there is a link flap, then the port will eventually die with no way to recover. If it receives less than 16 (mod 512) frames, then the initial NAPI poll [ before the link flap ] will not update the consumer index in hardware (it will remain zero) which will be ok when the buffers are later reinitialized. If more than 32 (mod 512) frames are received, the initial NAPI poll has the chance to refill the ring twice, updating the consumer index to at least 32. So after the link flap, the consumer index is still wrong, but the post-flap NAPI poll gets a chance to refill the ring once (because it passes through cleaned_cnt=15) and makes the consumer index be again back in sync with next_to_use. The solution to this problem is actually simple, we just need to write next_to_use into the hardware consumer index at enetc_open time, which always brings it back in sync after an initial buffer seeding process. The simpler thing would be to put the write to the consumer index into enetc_refill_rx_ring directly, but there are issues with the MDIO locking: in the NAPI poll code we have the enetc_lock_mdio() taken from top-level and we use the unlocked enetc_wr_reg_hot, whereas in enetc_open, the enetc_lock_mdio() is not taken at the top level, but instead by each individual enetc_wr_reg, so we are forced to put an additional enetc_wr_reg in enetc_setup_rxbdr. Better organization of the code is left as a refactoring exercise. Fixes: d4fd0404c1c9 ("enetc: Introduce basic PF and VF ENETC ethernet drivers") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
96a5223b |
|
01-Mar-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: remove bogus write to SIRXIDR from enetc_setup_rxbdr The Station Interface Receive Interrupt Detect Register (SIRXIDR) contains a 16-bit wide mask of 'interrupt detected' events for each ring associated with a port. Bit i is write-1-to-clean for RX ring i. I have no explanation whatsoever how this line of code came to be inserted in the blamed commit. I checked the downstream versions of that patch and none of them have it. The somewhat comical aspect of it is that we're writing a binary number to the SIRXIDR register, which is derived from enetc_bd_unused(rx_ring). Since the RX rings have 512 buffer descriptors, we end up writing 511 to this register, which is 0x1ff, so we are effectively clearing the 'interrupt detected' event for rings 0-8. This register is not what is used for interrupt handling though - it only provides a summary for the entire SI. The hardware provides one separate Interrupt Detect Register per RX ring, which auto-clears upon read. So there doesn't seem to be any adverse effect caused by this bogus write. There is, however, one reason why this should be handled as a bugfix: next_to_clean _should_ be committed to hardware, just not to that register, and this was obscuring the fact that it wasn't. This is fixed in the next patch, and removing the bogus line now allows the fix patch to be backported beyond that point. Fixes: fd5736bf9f23 ("enetc: Workaround for MDIO register access issue") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
827b6fd0 |
|
01-Mar-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: fix incorrect TPID when receiving 802.1ad tagged packets When the enetc ports have rx-vlan-offload enabled, they report a TPID of ETH_P_8021Q regardless of what was actually in the packet. When rx-vlan-offload is disabled, packets have the proper TPID. Fix this inconsistency by finishing the TODO left in the code. Fixes: d4fd0404c1c9 ("enetc: Introduce basic PF and VF ENETC ethernet drivers") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6d36ecdb |
|
01-Mar-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: take the MDIO lock only once per NAPI poll cycle The workaround for the ENETC MDIO erratum caused a performance degradation of 82 Kpps (seen with IP forwarding of two 1Gbps streams of 64B packets). This is due to excessive locking and unlocking in the fast path, which can be avoided. By taking the MDIO read-side lock only once per NAPI poll cycle, we are able to regain 54 Kpps (65%) of the performance hit. The rest of the performance degradation comes from the TX data path, but unfortunately it doesn't look like we can optimize that away easily, even with netdev_xmit_more(), there just isn't any skb batching done, to help with taking the MDIO lock less often than once per packet. We need to change the register accessor type for enetc_get_tx_tstamp, because it now runs under the enetc_lock_mdio as per the new call path detailed below: enetc_msix -> napi_schedule -> enetc_poll -> enetc_lock_mdio -> enetc_clean_tx_ring -> enetc_get_tx_tstamp -> enetc_clean_rx_ring -> enetc_unlock_mdio Fixes: fd5736bf9f23 ("enetc: Workaround for MDIO register access issue") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3222b5b6 |
|
01-Mar-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: initialize RFS/RSS memories for unused ports too Michael reports that since linux-next-20210211, the AER messages for ECC errors have started reappearing, and this time they can be reliably reproduced with the first ping on one of his LS1028A boards. $ ping 1[ 33.258069] pcieport 0000:00:1f.0: AER: Multiple Corrected error received: 0000:00:00.0 72.16.0.1 PING [ 33.267050] pcieport 0000:00:1f.0: AER: can't find device of ID0000 172.16.0.1 (172.16.0.1): 56 data bytes 64 bytes from 172.16.0.1: seq=0 ttl=64 time=17.124 ms 64 bytes from 172.16.0.1: seq=1 ttl=64 time=0.273 ms $ devmem 0x1f8010e10 32 0xC0000006 It isn't clear why this is necessary, but it seems that for the errors to go away, we must clear the entire RFS and RSS memory, not just for the ports in use. Sadly the code is structured in such a way that we can't have unified logic for the used and unused ports. For the minimal initialization of an unused port, we need just to enable and ioremap the PF memory space, and a control buffer descriptor ring. Unused ports must then free the CBDR because the driver will exit, but used ports can not pick up from where that code path left, since the CBDR API does not reinitialize a ring when setting it up, so its producer and consumer indices are out of sync between the software and hardware state. So a separate enetc_init_unused_port function was created, and it gets called right after the PF memory space is enabled. Fixes: 07bf34a50e32 ("net: enetc: initialize the RFS and RSS memories") Reported-by: Michael Walle <michael@walle.cc> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Michael Walle <michael@walle.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c646d10d |
|
01-Mar-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: enetc: don't overwrite the RSS indirection table when initializing After the blamed patch, all RX traffic gets hashed to CPU 0 because the hashing indirection table set up in: enetc_pf_probe -> enetc_alloc_si_resources -> enetc_configure_si -> enetc_setup_default_rss_table is overwritten later in: enetc_pf_probe -> enetc_init_port_rss_memory which zero-initializes the entire port RSS table in order to avoid ECC errors. The trouble really is that enetc_init_port_rss_memory really neads enetc_alloc_si_resources to be called, because it depends upon enetc_alloc_cbdr and enetc_setup_cbdr. But that whole enetc_configure_si thing could have been better thought out, it has nothing to do in a function called "alloc_si_resources", especially since its counterpart, "free_si_resources", does nothing to unwind the configuration of the SI. The point is, we need to pull out enetc_configure_si out of enetc_alloc_resources, and move it after enetc_init_port_rss_memory. This allows us to set up the default RSS indirection table after initializing the memory. Fixes: 07bf34a50e32 ("net: enetc: initialize the RFS and RSS memories") Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fd5736bf |
|
12-Nov-2020 |
Alex Marginean <alexandru.marginean@nxp.com> |
enetc: Workaround for MDIO register access issue Due to a hardware issue, an access to MDIO registers that is concurrent with other ENETC register accesses may lead to the MDIO access being dropped or corrupted. The workaround introduces locking for all register accesses to the ENETC register space. To reduce performance impact, a readers-writers locking scheme has been implemented. The writer in this case is the MDIO access code (irrelevant whether that MDIO access is a register read or write), and the reader is any access code to non-MDIO ENETC registers. Also, the datapath functions acquire the read lock fewer times and use _hot accessors. All the rest of the code uses the _wa accessors which lock every register access. The commit introducing MDIO support is - commit ebfcb23d62ab ("enetc: Add ENETC PF level external MDIO support") but due to subsequent refactoring this patch is applicable on top of a later commit. Fixes: 6517798dd343 ("enetc: Make MDIO accessors more generic and export to include/linux/fsl") Signed-off-by: Alex Marginean <alexandru.marginean@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Link: https://lore.kernel.org/r/20201112182608.26177-1-claudiu.manoil@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
82728b91 |
|
03-Nov-2020 |
Claudiu Manoil <claudiu.manoil@nxp.com> |
enetc: Remove Tx checksumming offload code Tx checksumming has been defeatured and completely removed from the h/w reference manual. Made a little cleanup for the TSE case as this is complementary code. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Link: https://lore.kernel.org/r/20201103140213.3294-1-claudiu.manoil@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
71b77a7a |
|
06-Oct-2020 |
Claudiu Manoil <claudiu.manoil@nxp.com> |
enetc: Migrate to PHYLINK and PCS_LYNX This is a methodical transition of the driver from phylib to phylink, following the guidelines from sfp-phylink.rst. The MAC register configurations based on interface mode were moved from the probing path to the mac_config() hook. MAC enable and disable commands (enabling Rx and Tx paths at MAC level) were also extracted and assigned to their corresponding phylink hooks. As part of the migration to phylink, the serdes configuration from the driver was offloaded to the PCS_LYNX module, introduced in commit 0da4c3d393e4 ("net: phy: add Lynx PCS module"), the PCS_LYNX module being a mandatory component required to make the enetc driver work with phylink. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Reviewed-by: Ioana Ciornei <ioana.cionei@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
215602a8 |
|
03-Aug-2020 |
Jiafei Pan <Jiafei.Pan@nxp.com> |
enetc: use napi_schedule to be compatible with PREEMPT_RT The driver calls napi_schedule_irqoff() from a context where, in RT, hardirqs are not disabled, since the IRQ handler is force-threaded. In the call path of this function, __raise_softirq_irqoff() is modifying its per-CPU mask of pending softirqs that must be processed, using or_softirq_pending(). The or_softirq_pending() function is not atomic, but since interrupts are supposed to be disabled, nobody should be preempting it, and the operation should be safe. Nonetheless, when running with hardirqs on, as in the PREEMPT_RT case, it isn't safe, and the pending softirqs mask can get corrupted, resulting in softirqs being lost and never processed. To have common code that works with PREEMPT_RT and with mainline Linux, we can use plain napi_schedule() instead. The difference is that napi_schedule() (via __napi_schedule) also calls local_irq_save, which disables hardirqs if they aren't already. But, since they already are disabled in non-RT, this means that in practice we don't see any measurable difference in throughput or latency with this patch. Signed-off-by: Jiafei Pan <Jiafei.Pan@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ae0e6a5d |
|
21-Jul-2020 |
Claudiu Manoil <claudiu.manoil@nxp.com> |
enetc: Add adaptive interrupt coalescing Use the generic dynamic interrupt moderation (dim) framework to implement adaptive interrupt coalescing on Rx. With the per-packet interrupt scheme, a high interrupt rate has been noted for moderate traffic flows leading to high CPU utilization. The 'dim' scheme implemented by the current patch addresses this issue improving CPU utilization while using minimal coalescing time thresholds in order to preserve a good latency. On the Tx side use an optimal time threshold value by default. This value has been optimized for Tx TCP streams at a rate of around 85kpps on a 1G link, at which rate half of the Tx ring size (128) gets filled in 1500 usecs. Scaling this down to 2.5G links yields the current value of 600 usecs, which is conservative and gives good enough results for 1G links too (see next). Below are some measurement results for before and after this patch (and related dependencies) basically, for a 2 ARM Cortex-A72 @1.3Ghz CPUs system (32 KB L1 data cache), using 60secs log netperf TCP stream tests @ 1Gbit link (maximum throughput): 1) 1 Rx TCP flow, both Rx and Tx processed by the same NAPI thread on the same CPU: CPU utilization int rate (ints/sec) Before: 50%-60% (over 50%) 92k After: 13%-22% 3.5k-12k Comment: Major CPU utilization improvement for a single flow Rx TCP flow (i.e. netperf -t TCP_MAERTS) on a single CPU. Usually settles under 16% for longer tests. 2) 4 Rx TCP flows + 4 Tx TCP flows (+ pings to check the latency): Total CPU utilization Total int rate (ints/sec) Before: ~80% (spikes to 90%) ~100k After: 60% (more steady) ~4k Comment: Important improvement for this load test, while the ping test outcome does not show any notable difference compared to before. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
91571081 |
|
21-Jul-2020 |
Claudiu Manoil <claudiu.manoil@nxp.com> |
enetc: Add interrupt coalescing support Enable programming of the interrupt coalescing registers and allow manual configuration of the coalescing time thresholds via ethtool. Packet thresholds have been fixed to predetermined values as there's no point in making them run-time configurable, also anticipating the dynamic interrupt moderation (DIM) algorithm which uses fixed packet thresholds as well. If the interface is up when the operation mode of traffic interrupt events is changed by the user (i.e. switching from default per-packet interrupts to coalesced interrupts), the traffic needs to be paused in the process. This patch also prepares the ground for introducing DIM on Rx. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
12460a0a |
|
21-Jul-2020 |
Claudiu Manoil <claudiu.manoil@nxp.com> |
enetc: Fix interrupt coalescing register naming Interrupt coalescing registers naming in the current revision of the Ref Man (RM) is ICR, deprecating the ICIR name used in earlier (draft) versions of the RM. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bbb96dc7 |
|
21-Jul-2020 |
Claudiu Manoil <claudiu.manoil@nxp.com> |
enetc: Factor out the traffic start/stop procedures A reliable traffic pause (and reconfiguration) procedure is needed to be able to safely make h/w configuration changes during run-time, like changing the mode in which the interrupts are operating (i.e. with or without coalescing), as opposed to making on-the-fly register updates that may be subject to h/w or s/w concurrency issues. To this end, the code responsible of the run-time device configurations that basically starts resp. stops the traffic flow through the device has been extracted from the the enetc_open/_close procedures, to the separate standalone enetc_start/_stop procedures. Traffic stop should be as graceful as possible, it lets the executing napi threads to to finish while the interrupts stay disabled. But since the napi thread will try to re-enable interrupts by clearing the device's unmask register, the enable_irq/ disable_irq API has been used to avoid this potential concurrency issue and make the traffic pause procedure more reliable. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
02293dd4 |
|
21-Jul-2020 |
Claudiu Manoil <claudiu.manoil@nxp.com> |
enetc: Refine buffer descriptor ring sizes It's time to differentiate between Rx and Tx ring sizes. Not only Tx rings are processed differently than Rx rings, but their default number also differs - i.e. up to 8 Tx rings per device (8 traffic classes) vs. 2 Rx rings (one per CPU). So let's set Tx rings sizes to half the size of the Rx rings for now, to be conservative. The default ring sizes were decreased as well (to the next lower power of 2), to reduce the memory footprint, buffering etc., since the measurements I've made so far show that the rings are very unlikely to get full. This change also anticipates the introduction of the dynamic interrupt moderation (dim) algorithm which operates on maximum packet thresholds of 256 packets for Rx and 128 packets for Tx. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0574e200 |
|
26-Jun-2020 |
Claudiu Manoil <claudiu.manoil@nxp.com> |
enetc: Fix tx rings bitmap iteration range, irq handling The rings bitmap of an interrupt vector encodes which of the device's rings were assigned to that interrupt vector. Hence the iteration range of the tx rings bitmap (for_each_set_bit()) should be the total number of Tx rings of that netdevice instead of the number of rings assigned to the interrupt vector. Since there are 2 cores, and one interrupt vector for each core, the number of rings asigned to an interrupt vector is half the number of available rings. The impact of this error is that the upper half of the tx rings could still generate interrupts during napi polling. Fixes: d4fd0404c1c9 ("enetc: Introduce basic PF and VF ENETC ethernet drivers") Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9deba33f |
|
17-Jun-2020 |
Claudiu Manoil <claudiu.manoil@nxp.com> |
enetc: Fix HW_VLAN_CTAG_TX|RX toggling VLAN tag insertion/extraction offload is correctly activated at probe time but deactivation of this feature (i.e. via ethtool) is broken. Toggling works only for Tx/Rx ring 0 of a PF, and is ignored for the other rings, including the VF rings. To fix this, the existing VLAN offload toggling code was extended to all the rings assigned to a netdevice, instead of the default ring 0 (likely a leftover from the early validation days of this feature). And the code was moved to the common set_features() function to fix toggling for the VF driver too. Fixes: d4fd0404c1c9 ("enetc: Introduce basic PF and VF ENETC ethernet drivers") Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1260e772 |
|
17-Jun-2020 |
Gustavo A. R. Silva <gustavoars@kernel.org> |
enetc: Use struct_size() helper in kzalloc() Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. This code was detected with the help of Coccinelle and, audited and fixed manually. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
888ae5a3 |
|
30-Apr-2020 |
Po Liu <Po.Liu@nxp.com> |
net: enetc: add tc flower psfp offload driver This patch is to add tc flower offload for the enetc IEEE 802.1Qci(PSFP) function. There are four main feature parts to implement the flow policing and filtering for ingress flow with IEEE 802.1Qci features. They are stream identify(this is defined in the P802.1cb exactly but needed for 802.1Qci), stream filtering, stream gate and flow metering. Each function block includes many entries by index to assign parameters. So for one frame would be filtered by stream identify first, then flow into stream filter block by the same handle between stream identify and stream filtering. Then flow into stream gate control which assigned by the stream filtering entry. And then policing by the gate and limited by the max sdu in the filter block(optional). At last, policing by the flow metering block, index choosing at the fitering block. So you can see that each entry of block may link to many upper entries since they can be assigned same index means more streams want to share the same feature in the stream filtering or stream gate or flow metering. To implement such features, each stream filtered by source/destination mac address, some stream maybe also plus the vlan id value would be treated as one flow chain. This would be identified by the chain_index which already in the tc filter concept. Driver would maintain this chain and also with gate modules. The stream filter entry create by the gate index and flow meter(optional) entry id and also one priority value. Offloading only transfer the gate action and flow filtering parameters. Driver would create (or search same gate id and flow meter id and priority) one stream filter entry to set to the hardware. So stream filtering do not need transfer by the action offloading. This architecture is same with tc filter and actions relationship. tc filter maintain the list for each flow feature by keys. And actions maintain by the action list. Below showing a example commands by tc: > tc qdisc add dev eth0 ingress > ip link set eth0 address 10:00:80:00:00:00 > tc filter add dev eth0 parent ffff: protocol ip chain 11 \ flower skip_sw dst_mac 10:00:80:00:00:00 \ action gate index 10 \ sched-entry open 200000000 1 8000000 \ sched-entry close 100000000 -1 -1 Command means to set the dst_mac 10:00:80:00:00:00 to index 11 of stream identify module. Then setting the gate index 10 of stream gate module. Keep the gate open for 200ms and limit the traffic volume to 8MB in this sched-entry. Then direct the frames to the ingress queue 1. Signed-off-by: Po Liu <Po.Liu@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
79e49982 |
|
30-Apr-2020 |
Po Liu <Po.Liu@nxp.com> |
net: enetc: add hw tc hw offload features for PSPF capability This patch is to let ethtool enable/disable the tc flower offload features. Hardware ENETC has the feature of PSFP which is for per-stream policing. When enable the tc hw offloading feature, driver would enable the IEEE 802.1Qci feature. It is only set the register enable bit for this feature not enable for any entry of per stream filtering and stream gate or stream identify but get how much capabilities for each feature. Signed-off-by: Po Liu <Po.Liu@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
434cebab |
|
10-Mar-2020 |
Claudiu Manoil <claudiu.manoil@nxp.com> |
enetc: Add dynamic allocation of extended Rx BD rings Hardware timestamping support (PTP) on Rx requires extended buffer descriptors, double the size of normal Rx descriptors. On the current controller revision only the timestamping offload requires extended Rx descriptors. Since Rx timestamping can be turned on/off at runtime, make Rx ring allocation configurable at runtime too. As a result, the static config option FSL_ENETC_HW_TIMESTAMPING can be dropped and the extended descriptors can be used only when Rx timestamping gets activated. The extension has the same size as the base descriptor, making the descriptor iterators easy to update for the extended case. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
714239ac |
|
10-Mar-2020 |
Claudiu Manoil <claudiu.manoil@nxp.com> |
enetc: Clean up Rx BD iteration Improve maintainability of the code iterating the Rx buffer descriptors to prepare it to support iterating extended Rx BD descriptors as well. Don't increment by one the h/w descriptor pointers explicitly, provide an iterator that takes care of the h/w details. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0d08c9ec |
|
01-Jan-2020 |
Po Liu <po.liu@nxp.com> |
enetc: add support time specific departure base on the qos etf ENETC implement time specific departure capability, which enables the user to specify when a frame can be transmitted. When this capability is enabled, the device will delay the transmission of the frame so that it can be transmitted at the precisely specified time. The delay departure time up to 0.5 seconds in the future. If the departure time in the transmit BD has not yet been reached, based on the current time, the packet will not be transmitted. This driver was loaded by Qos driver ETF. User could load it by tc commands. Here are the example commands: tc qdisc add dev eth0 root handle 1: mqprio \ num_tc 8 map 0 1 2 3 4 5 6 7 hw 1 tc qdisc replace dev eth0 parent 1:8 etf \ clockid CLOCK_TAI delta 30000 offload These example try to set queue mapping first and then set queue 7 with 30us ahead dequeue time. Then user send test frame should set SO_TXTIME feature for socket. There are also some limitations for this feature in hardware: - Transmit checksum offloads and time specific departure operation are mutually exclusive. - Time Aware Shaper feature (Qbv) offload and time specific departure operation are mutually exclusive. Signed-off-by: Po Liu <Po.Liu@nxp.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4caefbce |
|
09-Dec-2019 |
Michael Walle <michael@walle.cc> |
enetc: add software timestamping Provide a software TX timestamp and add it to the ethtool query interface. skb_tx_timestamp() is also needed if one would like to use PHY timestamping. Signed-off-by: Michael Walle <michael@walle.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a6a10d45 |
|
06-Dec-2019 |
Yangbo Lu <yangbo.lu@nxp.com> |
enetc: disable EEE autoneg by default The EEE support has not been enabled on ENETC, but it may connect to a PHY which supports EEE and advertises EEE by default, while its link partner also advertises EEE. If this happens, the PHY enters low power mode when the traffic rate is low and causes packet loss. This patch disables EEE advertisement by default for any PHY that ENETC connects to, to prevent the above unwanted outcome. Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c431047c |
|
24-Nov-2019 |
Po Liu <po.liu@nxp.com> |
enetc: add support Credit Based Shaper(CBS) for hardware offload The ENETC hardware support the Credit Based Shaper(CBS) which part of the IEEE-802.1Qav. The CBS driver was loaded by the sch_cbs interface when set in the QOS in the kernel. Here is an example command to set 20Mbits bandwidth in 1Gbits port for taffic class 7: tc qdisc add dev eth0 root handle 1: mqprio \ num_tc 8 map 0 1 2 3 4 5 6 7 hw 1 tc qdisc replace dev eth0 parent 1:8 cbs \ locredit -1470 hicredit 30 \ sendslope -980000 idleslope 20000 offload 1 Signed-off-by: Po Liu <Po.Liu@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
13baf667 |
|
21-Nov-2019 |
Mao Wenan <maowenan@huawei.com> |
enetc: make enetc_setup_tc_mqprio static While using ARCH=mips CROSS_COMPILE=mips-linux-gnu- command to compile, make C=2 drivers/net/ethernet/freescale/enetc/enetc.o one warning can be found: drivers/net/ethernet/freescale/enetc/enetc.c:1439:5: warning: symbol 'enetc_setup_tc_mqprio' was not declared. Should it be static? This patch make symbol enetc_setup_tc_mqprio static. Fixes: 34c6adf1977b ("enetc: Configure the Time-Aware Scheduler via tc-taprio offload") Signed-off-by: Mao Wenan <maowenan@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2e47cb41 |
|
14-Nov-2019 |
Po Liu <po.liu@nxp.com> |
enetc: update TSN Qbv PSPEED set according to adjust link speed ENETC has a register PSPEED to indicate the link speed of hardware. It is need to update accordingly. PSPEED field needs to be updated with the port speed for QBV scheduling purposes. Or else there is chance for gate slot not free by frame taking the MAC if PSPEED and phy speed not match. So update PSPEED when link adjust. This is implement by the adjust_link. Signed-off-by: Po Liu <Po.Liu@nxp.com> Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
34c6adf1 |
|
14-Nov-2019 |
Po Liu <po.liu@nxp.com> |
enetc: Configure the Time-Aware Scheduler via tc-taprio offload ENETC supports in hardware for time-based egress shaping according to IEEE 802.1Qbv. This patch implement the Qbv enablement by the hardware offload method qdisc tc-taprio method. Also update cbdr writeback to up level since control bd ring may writeback data to control bd ring. Signed-off-by: Po Liu <Po.Liu@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c55b810a |
|
07-Nov-2019 |
Michael Walle <michael@walle.cc> |
enetc: fix return value for enetc_ioctl() Return -EOPNOTSUPP instead of -EINVAL if the requested ioctl is not implemented. Signed-off-by: Michael Walle <michael@walle.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a613bafe |
|
07-Nov-2019 |
Michael Walle <michael@walle.cc> |
enetc: add ioctl() support for PHY-related ops If there is an attached PHY try to handle the requested ioctl with its handler, which allows the userspace to access PHY registers, for example. This will make mii-diag and similar tools work. Signed-off-by: Michael Walle <michael@walle.cc> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d7840976 |
|
22-Jul-2019 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
net: Use skb accessors in network drivers In preparation for unifying the skb_frag and bio_vec, use the fine accessors which already exist and use skb_frag_t instead of struct skb_frag_struct. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cbe9e835 |
|
27-May-2019 |
Camelia Groza <camelia.groza@nxp.com> |
enetc: Enable TC offloading with mqprio Add support to configure multiple prioritized TX traffic classes with mqprio. Configure one BD ring per TC for the moment, one netdev queue per TC. Signed-off-by: Camelia Groza <camelia.groza@nxp.com> Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cec4f328 |
|
26-May-2019 |
Y.b. Lu <yangbo.lu@nxp.com> |
enetc: fix le32/le16 degrading to integer warnings Fix blow sparse warning introduced by a previous patch. - restricted __le32 degrades to integer - restricted __le16 degrades to integer Fixes: d39823121911 ("enetc: add hardware timestamping support") Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d3982312 |
|
22-May-2019 |
Y.b. Lu <yangbo.lu@nxp.com> |
enetc: add hardware timestamping support This patch is to add hardware timestamping support for ENETC. On Rx, timestamping is enabled for all frames. On Tx, we only instruct the hardware to timestamp the frames marked accordingly by the stack. Because the RX BD ring dynamic allocation has not been supported and it is too expensive to use extended RX BDs if timestamping is not used, a Kconfig option is used to enable extended RX BDs in order to support hardware timestamping. This option will be removed once RX BD ring dynamic allocation is implemented. Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com> Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f4a0be84 |
|
15-May-2019 |
Claudiu Manoil <claudiu.manoil@nxp.com> |
enetc: Fix NULL dma address unmap for Tx BD extensions For the unlikely case of TxBD extensions (i.e. ptp) the driver tries to unmap the tx_swbd corresponding to the extension, which is bogus as it has no buffer attached. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bbcbf2ee |
|
28-Jan-2019 |
Stephen Rothwell <sfr@canb.auug.org.au> |
enetc: include linux/vmalloc.h for vzalloc etc Fixes: d4fd0404c1c9 ("enetc: Introduce basic PF and VF ENETC ethernet drivers") Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d382563f |
|
22-Jan-2019 |
Claudiu Manoil <claudiu.manoil@nxp.com> |
enetc: Add RFS and RSS support A ternary match table is used for RFS. If multiple entries in the table match, the entry with the lowest numerical values index is chosen as the matching entry. Entries in the table are identified using an index which takes a value from 0 to PRFSCAPR[NUM_RFS]-1 when accessed by the PSI (PF). Portions of the RFS table can be assigned to each SI by the PSI (PF) driver in PSIaRFSCFGR. Assignments are cumulative, the entries assigned to SIn start after those assigned to SIn-1. The total assignments to all SIs must be equal to or less than the number available to the port as found in PRFSCAPR. For RSS, the Toeplitz hash function used requires two inputs, a 40B random secret key that is supplied through the PRSSKR0-9 registers as well as the relevant pieces of the packet header (n-tuple). The 6 LSB bits of the hash function result will then be used as a pointer to obtain the tag referenced in the 64 entry indirection table. The result will provide a winning group which will be used to help route the received packet. Signed-off-by: Alex Marginean <alexandru.marginean@nxp.com> Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d4fd0404 |
|
22-Jan-2019 |
Claudiu Manoil <claudiu.manoil@nxp.com> |
enetc: Introduce basic PF and VF ENETC ethernet drivers ENETC is a multi-port virtualized Ethernet controller supporting GbE designs and Time-Sensitive Networking (TSN) functionality. ENETC is operating as an SR-IOV multi-PF capable Root Complex Integrated Endpoint (RCIE). As such, it contains multiple physical (PF) and virtual (VF) PCIe functions, discoverable by standard PCI Express. Introduce basic PF and VF ENETC ethernet drivers. The PF has access to the ENETC Port registers and resources and makes the required privileged configurations for the underlying VF devices. Common functionality is controlled through so called System Interface (SI) register blocks, PFs and VFs own a SI each. Though SI register blocks are almost identical, there are a few privileged SI level controls that are accessible only to PFs, and so the distinction is made between PF SIs (PSI) and VF SIs (VSI). As such, the bulk of the code, including datapath processing, basic h/w offload support and generic pci related configuration, is shared between the 2 drivers and is factored out in common source files (i.e. enetc.c). Major functionalities included (for both drivers): MSI-X support for Rx and Tx processing, assignment of Rx/Tx BD ring pairs to MSI-X entries, multi-queue support, Rx S/G (Rx frame fragmentation) and jumbo frame (up to 9600B) support, Rx paged allocation and reuse, Tx S/G support (NETIF_F_SG), Rx and Tx checksum offload, PF MAC filtering and initial control ring support, VLAN extraction/ insertion, PF Rx VLAN CTAG filtering, VF mac address config support, VF VLAN isolation support, etc. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|