#
b2657700 |
|
01-Feb-2024 |
Michal Koutný <mkoutny@suse.com> |
net/sched: Add helper macros with module names The macros are preparation for adding module aliases en mass in a separate commit. Although it would be tempting to create aliases like cls-foo for name cls_foo, this could not be used because modprobe utilities treat '-' and '_' interchangeably. In the end, the naming follows pattern of proto modules in linux/net.h. Signed-off-by: Michal Koutný <mkoutny@suse.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240201130943.19536-2-mkoutny@suse.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
b6a3c606 |
|
16-Dec-2023 |
Victor Nogueira <victor@mojatatu.com> |
net: sched: Make tc-related drop reason more flexible for remaining qdiscs Incrementing on Daniel's patch[1], make tc-related drop reason more flexible for remaining qdiscs - that is, all qdiscs aside from clsact. In essence, the drop reason will be set by cls_api and act_api in case any error occurred in the data path. With that, we can give the user more detailed information so that they can distinguish between a policy drop or an error drop. [1] https://lore.kernel.org/all/20231009092655.22025-1-daniel@iogearbox.net Signed-off-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fb278072 |
|
16-Dec-2023 |
Victor Nogueira <victor@mojatatu.com> |
net: sched: Move drop_reason to struct tc_skb_cb Move drop_reason from struct tcf_result to skb cb - more specifically to struct tc_skb_cb. With that, we'll be able to also set the drop reason for the remaining qdiscs (aside from clsact) that do not have access to tcf_result when time comes to set the skb drop reason. Signed-off-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1add9073 |
|
20-Sep-2023 |
Eric Dumazet <edumazet@google.com> |
net_sched: constify qdisc_priv() In order to propagate const qualifiers, we change qdisc_priv() to accept a possibly const argument. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
150e33e6 |
|
10-Jul-2023 |
Pedro Tammela <pctammela@mojatatu.com> |
net/sched: make psched_mtu() RTNL-less safe Eric Dumazet says[1]: ------- Speaking of psched_mtu(), I see that net/sched/sch_pie.c is using it without holding RTNL, so dev->mtu can be changed underneath. KCSAN could issue a warning. ------- Annotate dev->mtu with READ_ONCE() so KCSAN don't issue a warning. [1] https://lore.kernel.org/all/CANn89iJoJO5VtaJ-2=_d2aOQhb0Xw8iBT_Cxqp2HyuS-zj6azw@mail.gmail.com/ v1 -> v2: Fix commit message Fixes: d4b36210c2e6 ("net: pkt_sched: PIE AQM scheme") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230711021634.561598-1-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
2b84960f |
|
09-Jun-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: report class offload stats per TXQ, not per TC The taprio Qdisc creates child classes per netdev TX queue, but taprio_dump_class_stats() currently reports offload statistics per traffic class. Traffic classes are groups of TXQs sharing the same dequeue priority, so this is incorrect and we shouldn't be bundling up the TXQ stats when reporting them, as we currently do in enetc. Modify the API from taprio to drivers such that they report TXQ offload stats and not TC offload stats. There is no change in the UAPI or in the global Qdisc stats. Fixes: 6c1adb650c8d ("net/sched: taprio: add netlink reporting for offload statistics counters") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
886bc7d6 |
|
06-Jun-2023 |
Eric Dumazet <edumazet@google.com> |
net: sched: move rtm_tca_policy declaration to include file rtm_tca_policy is used from net/sched/sch_api.c and net/sched/cls_api.c, thus should be declared in an include file. This fixes the following sparse warning: net/sched/sch_api.c:1434:25: warning: symbol 'rtm_tca_policy' was not declared. Should it be static? Fixes: e331473fee3d ("net/sched: cls_api: add missing validation of netlink attributes") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6c1adb65 |
|
29-May-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: add netlink reporting for offload statistics counters Offloading drivers may report some additional statistics counters, some of them even suggested by 802.1Q, like TransmissionOverrun. In my opinion we don't have to limit ourselves to reporting counters only globally to the Qdisc/interface, especially if the device has more detailed reporting (per traffic class), since the more detailed info is valuable for debugging and can help identifying who is exceeding its time slot. But on the other hand, some devices may not be able to report both per TC and global stats. So we end up reporting both ways, and use the good old ethtool_put_stat() strategy to determine which statistics are supported by this NIC. Statistics which aren't set are simply not reported to netlink. For this reason, we need something dynamic (a nlattr nest) to be reported through TCA_STATS_APP, and not something daft like the fixed-size and inextensible struct tc_codel_xstats. A good model for xstats which are a nlattr nest rather than a fixed struct seems to be cake. # Global stats $ tc -s qdisc show dev eth0 root # Per-tc stats $ tc -s class show dev eth0 Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2d800bc5 |
|
29-May-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: replace tc_taprio_qopt_offload :: enable with a "cmd" enum Inspired from struct flow_cls_offload :: cmd, in order for taprio to be able to report statistics (which is future work), it seems that we need to drill one step further with the ndo_setup_tc(TC_SETUP_QDISC_TAPRIO) multiplexing, and pass the command as part of the common portion of the muxed structure. Since we already have an "enable" variable in tc_taprio_qopt_offload, refactor all drivers to check for "cmd" instead of "enable", and reject every other command except "replace" and "destroy" - to be future proof. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> # for lan966x Acked-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek Reviewed-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com> Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f62af20b |
|
11-Apr-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: mqprio: allow per-TC user input of FP adminStatus IEEE 802.1Q-2018 clause 6.7.2 Frame preemption specifies that each packet priority can be assigned to a "frame preemption status" value of either "express" or "preemptible". Express priorities are transmitted by the local device through the eMAC, and preemptible priorities through the pMAC (the concepts of eMAC and pMAC come from the 802.3 MAC Merge layer). The FP adminStatus is defined per packet priority, but 802.1Q clause 12.30.1.1.1 framePreemptionAdminStatus also says that: | Priorities that all map to the same traffic class should be | constrained to use the same value of preemption status. It is impossible to ignore the cognitive dissonance in the standard here, because it practically means that the FP adminStatus only takes distinct values per traffic class, even though it is defined per priority. I can see no valid use case which is prevented by having the kernel take the FP adminStatus as input per traffic class (what we do here). In addition, this also enforces the above constraint by construction. User space network managers which wish to expose FP adminStatus per priority are free to do so; they must only observe the prio_tc_map of the netdev (which presumably is also under their control, when constructing the mqprio netlink attributes). The reason for configuring frame preemption as a property of the Qdisc layer is that the information about "preemptible TCs" is closest to the place which handles the num_tc and prio_tc_map of the netdev. If the UAPI would have been any other layer, it would be unclear what to do with the FP information when num_tc collapses to 0. A key assumption is that only mqprio/taprio change the num_tc and prio_tc_map of the netdev. Not sure if that's a great assumption to make. Having FP in tc-mqprio can be seen as an implementation of the use case defined in 802.1Q Annex S.2 "Preemption used in isolation". There will be a separate implementation of FP in tc-taprio, for the other use cases. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
c54876cd |
|
11-Apr-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: pass netlink extack to mqprio and taprio offload With the multiplexed ndo_setup_tc() model which lacks a first-class struct netlink_ext_ack * argument, the only way to pass the netlink extended ACK message down to the device driver is to embed it within the offload structure. Do this for struct tc_mqprio_qopt_offload and struct tc_taprio_qopt_offload. Since struct tc_taprio_qopt_offload also contains a tc_mqprio_qopt_offload structure, and since device drivers might effectively reuse their mqprio implementation for the mqprio portion of taprio, we make taprio set the extack in both offload structures to point at the same netlink extack message. In fact, the taprio handling is a bit more tricky, for 2 reasons. First is because the offload structure has a longer lifetime than the extack structure. The driver is supposed to populate the extack synchronously from ndo_setup_tc() and leave it alone afterwards. To not have any use-after-free surprises, we zero out the extack pointer when we leave taprio_enable_offload(). The second reason is because taprio does overwrite the extack message on ndo_setup_tc() error. We need to switch to the weak form of setting an extack message, which preserves a potential message set by the driver. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
62423bd2 |
|
08-Mar-2023 |
Eric Dumazet <edumazet@google.com> |
net: sched: remove qdisc_watchdog->last_expires This field mirrors hrtimer softexpires, we can instead use the existing helpers. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230308182648.1150762-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
2f530df7 |
|
07-Feb-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: give higher priority to higher TCs in software dequeue mode Current taprio software implementation is haunted by the shadow of the igb/igc hardware model. It iterates over child qdiscs in increasing order of TXQ index, therefore giving higher xmit priority to TXQ 0 and lower to TXQ N. According to discussions with Vinicius, that is the default (perhaps even unchangeable) prioritization scheme used for the NICs that taprio was first written for (igb, igc), and we have a case of two bugs canceling out, resulting in a functional setup on igb/igc, but a less sane one on other NICs. To the best of my understanding, taprio should prioritize based on the traffic class, so it should really dequeue starting with the highest traffic class and going down from there. We get to the TXQ using the tc_to_txq[] netdev property. TXQs within the same TC have the same (strict) priority, so we should pick from them as fairly as we can. We can achieve that by implementing something very similar to q->curband from multiq_dequeue(). Since igb/igc really do have TXQ 0 of higher hardware priority than TXQ 1 etc, we need to preserve the behavior for them as well. We really have no choice, because in txtime-assist mode, taprio is essentially a software scheduler towards offloaded child tc-etf qdiscs, so the TXQ selection really does matter (not all igb TXQs support ETF/SO_TXTIME, says Kurt Kanzenbach). To preserve the behavior, we need a capability bit so that taprio can determine if it's running on igb/igc, or on something else. Because igb doesn't offload taprio at all, we can't piggyback on the qdisc_offload_query_caps() call from taprio_enable_offload(), but instead we need a separate call which is also made for software scheduling. Introduce two static keys to minimize the performance penalty on systems which only have igb/igc NICs, and on systems which only have other NICs. For mixed systems, taprio will have to dynamically check whether to dequeue using one prioritization algorithm or using the other. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
522d15ea |
|
04-Feb-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: only pass gate mask per TXQ for igc, stmmac, tsnep, am65_cpsw There are 2 classes of in-tree drivers currently: - those who act upon struct tc_taprio_sched_entry :: gate_mask as if it holds a bit mask of TXQs - those who act upon the gate_mask as if it holds a bit mask of TCs When it comes to the standard, IEEE 802.1Q-2018 does say this in the second paragraph of section 8.6.8.4 Enhancements for scheduled traffic: | A gate control list associated with each Port contains an ordered list | of gate operations. Each gate operation changes the transmission gate | state for the gate associated with each of the Port's traffic class | queues and allows associated control operations to be scheduled. In typically obtuse language, it refers to a "traffic class queue" rather than a "traffic class" or a "queue". But careful reading of 802.1Q clarifies that "traffic class" and "queue" are in fact synonymous (see 8.6.6 Queuing frames): | A queue in this context is not necessarily a single FIFO data structure. | A queue is a record of all frames of a given traffic class awaiting | transmission on a given Bridge Port. The structure of this record is not | specified. i.o.w. their definition of "queue" isn't the Linux TX queue. The gate_mask really is input into taprio via its UAPI as a mask of traffic classes, but taprio_sched_to_offload() converts it into a TXQ mask. The breakdown of drivers which handle TC_SETUP_QDISC_TAPRIO is: - hellcreek, felix, sja1105: these are DSA switches, it's not even very clear what TXQs correspond to, other than purely software constructs. Only the mqprio configuration with 8 TCs and 1 TXQ per TC makes sense. So it's fine to convert these to a gate mask per TC. - enetc: I have the hardware and can confirm that the gate mask is per TC, and affects all TXQs (BD rings) configured for that priority. - igc: in igc_save_qbv_schedule(), the gate_mask is clearly interpreted to be per-TXQ. - tsnep: Gerhard Engleder clarifies that even though this hardware supports at most 1 TXQ per TC, the TXQ indices may be different from the TC values themselves, and it is the TXQ indices that matter to this hardware. So keep it per-TXQ as well. - stmmac: I have a GMAC datasheet, and in the EST section it does specify that the gate events are per TXQ rather than per TC. - lan966x: again, this is a switch, and while not a DSA one, the way in which it implements lan966x_mqprio_add() - by only allowing num_tc == NUM_PRIO_QUEUES (8) - makes it clear to me that TXQs are a purely software construct here as well. They seem to map 1:1 with TCs. - am65_cpsw: from looking at am65_cpsw_est_set_sched_cmds(), I get the impression that the fetch_allow variable is treated like a prio_mask. This definitely sounds closer to a per-TC gate mask rather than a per-TXQ one, and TI documentation does seem to recomment an identity mapping between TCs and TXQs. However, Roger Quadros would like to do some testing before making changes, so I'm leaving this driver to operate as it did before, for now. Link with more details at the end. Based on this breakdown, we have 5 drivers with a gate mask per TC and 4 with a gate mask per TXQ. So let's make the gate mask per TXQ the opt-in and the gate mask per TC the default. Benefit from the TC_QUERY_CAPS feature that Jakub suggested we add, and query the device driver before calling the proper ndo_setup_tc(), and figure out if it expects one or the other format. Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230202003621.2679603-15-vladimir.oltean@nxp.com/#25193204 Cc: Horatiu Vultur <horatiu.vultur@microchip.com> Cc: Siddharth Vadapalli <s-vadapalli@ti.com> Cc: Roger Quadros <rogerq@kernel.org> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
09c794c0 |
|
04-Feb-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: pass mqprio queue configuration to ndo_setup_tc() The taprio qdisc does not currently pass the mqprio queue configuration down to the offloading device driver. So the driver cannot act upon the TXQ counts/offsets per TC, or upon the prio->tc map. It was probably assumed that the driver only wants to offload num_tc (see TC_MQPRIO_HW_OFFLOAD_TCS), which it can get from netdev_get_num_tc(), but there's clearly more to the mqprio configuration than that. I've considered 2 mechanisms to remedy that. First is to pass a struct tc_mqprio_qopt_offload as part of the tc_taprio_qopt_offload. The second is to make taprio actually call TC_SETUP_QDISC_MQPRIO, *in addition to* TC_SETUP_QDISC_TAPRIO. The difference is that in the first case, existing drivers (offloading or not) all ignore taprio's mqprio portion currently, whereas in the second case, we could control whether to call TC_SETUP_QDISC_MQPRIO, based on a new capability. The question is which approach would be better. I'm afraid that calling TC_SETUP_QDISC_MQPRIO unconditionally (not based on a taprio capability bit) would risk introducing regressions. For example, taprio doesn't populate (or validate) qopt->hw, as well as mqprio.flags, mqprio.shaper, mqprio.min_rate, mqprio.max_rate. In comparison, adding a capability is functionally equivalent to just passing the mqprio in a way that drivers can ignore it, except it's slightly more complicated to use it (need to set the capability). Ultimately, what made me go for the "mqprio in taprio" variant was that it's easier for offloading drivers to interpret the mqprio qopt slightly differently when it comes from taprio vs when it comes from mqprio, should that ever become necessary. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
19278d76 |
|
04-Feb-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: mqprio: allow offloading drivers to request queue count validation mqprio_parse_opt() proudly has a comment: /* If hardware offload is requested we will leave it to the device * to either populate the queue counts itself or to validate the * provided queue counts. */ Unfortunately some device drivers did not get this memo, and don't validate the queue counts, or populate them. In case drivers don't want to populate the queue counts themselves, just act upon the requested configuration, it makes sense to introduce a tc capability, and make mqprio query it, so they don't have to do the validation themselves. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9adafe2b |
|
04-Feb-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: move struct tc_mqprio_qopt_offload from pkt_cls.h to pkt_sched.h Since mqprio is a scheduler and not a classifier, move its offload structure to pkt_sched.h, where struct tc_taprio_qopt_offload also lies. Also update some header inclusions in drivers that access this structure, to the best of my abilities. Cc: Igor Russkikh <irusskikh@marvell.com> Cc: Yisen Zhuang <yisen.zhuang@huawei.com> Cc: Salil Mehta <salil.mehta@huawei.com> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Cc: Tony Nguyen <anthony.l.nguyen@intel.com> Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Horatiu Vultur <horatiu.vultur@microchip.com> Cc: Lars Povlsen <lars.povlsen@microchip.com> Cc: Steen Hegelund <Steen.Hegelund@microchip.com> Cc: Daniel Machon <daniel.machon@microchip.com> Cc: UNGLinuxDriver@microchip.com Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a54fc09e |
|
27-Sep-2022 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: allow user input of per-tc max SDU IEEE 802.1Q clause 12.29.1.1 "The queueMaxSDUTable structure and data types" and 8.6.8.4 "Enhancements for scheduled traffic" talk about the existence of a per traffic class limitation of maximum frame sizes, with a fallback on the port-based MTU. As far as I am able to understand, the 802.1Q Service Data Unit (SDU) represents the MAC Service Data Unit (MSDU, i.e. L2 payload), excluding any number of prepended VLAN headers which may be otherwise present in the MSDU. Therefore, the queueMaxSDU is directly comparable to the device MTU (1500 means L2 payload sizes are accepted, or frame sizes of 1518 octets, or 1522 plus one VLAN header). Drivers which offload this are directly responsible of translating into other units of measurement. To keep the fast path checks optimized, we keep 2 arrays in the qdisc, one for max_sdu translated into frame length (so that it's comparable to skb->len), and another for offloading and for dumping back to the user. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
aac4daa8 |
|
27-Sep-2022 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: query offload capabilities through ndo_setup_tc() When adding optional new features to Qdisc offloads, existing drivers must reject the new configuration until they are coded up to act on it. Since modifying all drivers in lockstep with the changes in the Qdisc can create problems of its own, it would be nice if there existed an automatic opt-in mechanism for offloading optional features. Jakub proposes that we multiplex one more kind of call through ndo_setup_tc(): one where the driver populates a Qdisc-specific capability structure. First user will be taprio in further changes. Here we are introducing the definitions for the base functionality. Link: https://patchwork.kernel.org/project/netdevbpf/patch/20220923163310.3192733-3-vladimir.oltean@nxp.com/ Suggested-by: Jakub Kicinski <kuba@kernel.org> Co-developed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
d7a68e56 |
|
20-Sep-2022 |
Zhengchao Shao <shaozhengchao@huawei.com> |
net/sched: sch_api: add helper for tc qdisc walker stats dump The walk implementation of most qdisc class modules is basically the same. That is, the values of count and skip are checked first. If count is greater than or equal to skip, the registered fn function is executed. Otherwise, increase the value of count. So we can reconstruct them. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
52327d2e |
|
14-Aug-2022 |
Zhengchao Shao <shaozhengchao@huawei.com> |
net: sched: remove the unused return value of unregister_qdisc Return value of unregister_qdisc is unused, remove it. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Link: https://lore.kernel.org/r/20220815030417.271894-1-shaozhengchao@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
d7be266a |
|
04-Jul-2022 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: sched: provide shim definitions for taprio_offload_{get,free} All callers of taprio_offload_get() and taprio_offload_free() prior to the blamed commit are conditionally compiled based on CONFIG_NET_SCH_TAPRIO. felix_vsc9959.c is different; it provides vsc9959_qos_port_tas_set() even when taprio is compiled out. Provide shim definitions for the functions exported by taprio so that felix_vsc9959.c is able to compile. vsc9959_qos_port_tas_set() in that case is dead code anyway, and ocelot_port->taprio remains NULL, which is fine for the rest of the logic. Fixes: 1c9017e44af2 ("net: dsa: felix: keep reference on entire tc-taprio config") Reported-by: Colin Foster <colin.foster@in-advantage.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Colin Foster <colin.foster@in-advantage.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Link: https://lore.kernel.org/r/20220704190241.1288847-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
98b60862 |
|
26-Jan-2022 |
Jakub Kicinski <kuba@kernel.org> |
net: sched: remove psched_tdiff_bounded() Not used since v3.9. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6f022c2d |
|
06-Jan-2022 |
Paul Blakey <paulb@nvidia.com> |
net: openvswitch: Fix ct_state nat flags for conns arriving from tc Netfilter conntrack maintains NAT flags per connection indicating whether NAT was configured for the connection. Openvswitch maintains NAT flags on the per packet flow key ct_state field, indicating whether NAT was actually executed on the packet. When a packet misses from tc to ovs the conntrack NAT flags are set. However, NAT was not necessarily executed on the packet because the connection's state might still be in NEW state. As such, openvswitch wrongly assumes that NAT was executed and sets an incorrect flow key NAT flags. Fix this, by flagging to openvswitch which NAT was actually done in act_ct via tc_skb_ext and tc_skb_cb to the openvswitch module, so the packet flow key NAT flags will be correctly set. Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct") Signed-off-by: Paul Blakey <paulb@nvidia.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20220106153804.26451-1-paulb@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
38495958 |
|
14-Dec-2021 |
Paul Blakey <paulb@nvidia.com> |
net/sched: flow_dissector: Fix matching on zone id for invalid conns If ct rejects a flow, it removes the conntrack info from the skb. act_ct sets the post_ct variable so the dissector will see this case as an +tracked +invalid state, but the zone id is lost with the conntrack info. To restore the zone id on such cases, set the last executed zone, via the tc control block, when passing ct, and read it back in the dissector if there is no ct info on the skb (invalid connection). Fixes: 7baf2429a1a9 ("net/sched: cls_flower add CT_FLAGS_INVALID flag support") Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
ec624fe7 |
|
14-Dec-2021 |
Paul Blakey <paulb@nvidia.com> |
net/sched: Extend qdisc control block with tc control block BPF layer extends the qdisc control block via struct bpf_skb_data_end and because of that there is no more room to add variables to the qdisc layer control block without going over the skb->cb size. Extend the qdisc control block with a tc control block, and move all tc related variables to there as a pre-step for extending the tc control block with additional members. Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
b193e15a |
|
23-Sep-2021 |
王贇 <yun.wang@linux.alibaba.com> |
net: prevent user from passing illegal stab size We observed below report when playing with netlink sock: UBSAN: shift-out-of-bounds in net/sched/sch_api.c:580:10 shift exponent 249 is too large for 32-bit type CPU: 0 PID: 685 Comm: a.out Not tainted Call Trace: dump_stack_lvl+0x8d/0xcf ubsan_epilogue+0xa/0x4e __ubsan_handle_shift_out_of_bounds+0x161/0x182 __qdisc_calculate_pkt_len+0xf0/0x190 __dev_queue_xmit+0x2ed/0x15b0 it seems like kernel won't check the stab log value passing from user, and will use the insane value later to calculate pkt_len. This patch just add a check on the size/cell_log to avoid insane calculation. Reported-by: Abaci <abaci@linux.alibaba.com> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
102b55ee |
|
13-May-2021 |
Yunsheng Lin <linyunsheng@huawei.com> |
net: sched: fix tx action rescheduling issue during deactivation Currently qdisc_run() checks the STATE_DEACTIVATED of lockless qdisc before calling __qdisc_run(), which ultimately clear the STATE_MISSED when all the skb is dequeued. If STATE_DEACTIVATED is set before clearing STATE_MISSED, there may be rescheduling of net_tx_action() at the end of qdisc_run_end(), see below: CPU0(net_tx_atcion) CPU1(__dev_xmit_skb) CPU2(dev_deactivate) . . . . set STATE_MISSED . . __netif_schedule() . . . set STATE_DEACTIVATED . . qdisc_reset() . . . .<--------------- . synchronize_net() clear __QDISC_STATE_SCHED | . . . | . . . | . some_qdisc_is_busy() . | . return *false* . | . . test STATE_DEACTIVATED | . . __qdisc_run() *not* called | . . . | . . test STATE_MISS | . . __netif_schedule()--------| . . . . . . . . __qdisc_run() is not called by net_tx_atcion() in CPU0 because CPU2 has set STATE_DEACTIVATED flag during dev_deactivate(), and STATE_MISSED is only cleared in __qdisc_run(), __netif_schedule is called at the end of qdisc_run_end(), causing tx action rescheduling problem. qdisc_run() called by net_tx_action() runs in the softirq context, which should has the same semantic as the qdisc_run() called by __dev_xmit_skb() protected by rcu_read_lock_bh(). And there is a synchronize_net() between STATE_DEACTIVATED flag being set and qdisc_reset()/some_qdisc_is_busy in dev_deactivate(), we can safely bail out for the deactived lockless qdisc in net_tx_action(), and qdisc_reset() will reset all skb not dequeued yet. So add the rcu_read_lock() explicitly to protect the qdisc_run() and do the STATE_DEACTIVATED checking in net_tx_action() before calling qdisc_run_begin(). Another option is to do the checking in the qdisc_run_end(), but it will add unnecessary overhead for non-tx_action case, because __dev_queue_xmit() will not see qdisc with STATE_DEACTIVATED after synchronize_net(), the qdisc with STATE_DEACTIVATED can only be seen by net_tx_action() because of __netif_schedule(). The STATE_DEACTIVATED checking in qdisc_run() is to avoid race between net_tx_action() and qdisc_reset(), see: commit d518d2ed8640 ("net/sched: fix race between deactivation and dequeue for NOLOCK qdisc"). As the bailout added above for deactived lockless qdisc in net_tx_action() provides better protection for the race without calling qdisc_run() at all, so remove the STATE_DEACTIVATED checking in qdisc_run(). After qdisc_reset(), there is no skb in qdisc to be dequeued, so clear the STATE_MISSED in dev_reset_queue() too. Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking") Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> V8: Clearing STATE_MISSED before calling __netif_schedule() has avoid the endless rescheduling problem, but there may still be a unnecessary rescheduling, so adjust the commit log. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
847cbfc0 |
|
10-Mar-2021 |
Vladimir Oltean <olteanv@gmail.com> |
net: add a helper to avoid issues with HW TX timestamping and SO_TXTIME As explained in commit 29d98f54a4fe ("net: enetc: allow hardware timestamping on TX queues with tc-etf enabled"), hardware TX timestamping requires an skb with skb->tstamp = 0. When a packet is sent with SO_TXTIME, the skb->skb_mstamp_ns corrupts the value of skb->tstamp, so the drivers need to explicitly reset skb->tstamp to zero after consuming the TX time. Create a helper named skb_txtime_consumed() which does just that. All drivers which offload TC_SETUP_QDISC_ETF should implement it, and it would make it easier to assess during review whether they do the right thing in order to be compatible with hardware timestamping or not. Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6e1978a9 |
|
03-Nov-2020 |
Allen Pais <apais@linux.microsoft.com> |
net: sched: convert tasklets to use new tasklet_setup() API In preparation for unconditionally passing the struct tasklet_struct pointer to all tasklet callbacks, switch to using the new tasklet_setup() and from_tasklet() to pass the tasklet pointer explicitly. Signed-off-by: Romain Perier <romain.perier@gmail.com> Signed-off-by: Allen Pais <apais@linux.microsoft.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
846e463a |
|
07-Oct-2020 |
Eric Dumazet <edumazet@google.com> |
net/sched: get rid of qdisc->padded kmalloc() of sufficiently big portion of memory is cache-aligned in regular conditions. If some debugging options are used, there is no reason qdisc structures would need 64-byte alignment if most other kernel structures are not aligned. This get rid of QDISC_ALIGN and QDISC_ALIGNTO. Addition of privdata field will help implementing the reverse of qdisc_priv() and documents where the private data is. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Allen Pais <allen.lkml@gmail.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
d7bf2ebe |
|
03-Jul-2020 |
Toke Høiland-Jørgensen <toke@redhat.com> |
sched: consistently handle layer3 header accesses in the presence of VLANs There are a couple of places in net/sched/ that check skb->protocol and act on the value there. However, in the presence of VLAN tags, the value stored in skb->protocol can be inconsistent based on whether VLAN acceleration is enabled. The commit quoted in the Fixes tag below fixed the users of skb->protocol to use a helper that will always see the VLAN ethertype. However, most of the callers don't actually handle the VLAN ethertype, but expect to find the IP header type in the protocol field. This means that things like changing the ECN field, or parsing diffserv values, stops working if there's a VLAN tag, or if there are multiple nested VLAN tags (QinQ). To fix this, change the helper to take an argument that indicates whether the caller wants to skip the VLAN tags or not. When skipping VLAN tags, we make sure to skip all of them, so behaviour is consistent even in QinQ mode. To make the helper usable from the ECN code, move it to if_vlan.h instead of pkt_sched.h. v3: - Remove empty lines - Move vlan variable definitions inside loop in skb_protocol() - Also use skb_protocol() helper in IP{,6}_ECN_decapsulate() and bpf_skb_ecn_set_ce() v2: - Use eth_type_vlan() helper in skb_protocol() - Also fix code that reads skb->protocol directly - Change a couple of 'if/else if' statements to switch constructs to avoid calling the helper twice Reported-by: Ilya Ponetayev <i.ponetaev@ndmsystems.com> Fixes: d8b9605d2697 ("net: sched: fix skb->protocol use in case of accelerated vlan path") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
efe074c2 |
|
16-Mar-2020 |
Eric Dumazet <edumazet@google.com> |
net_sched: add qdisc_watchdog_schedule_range_ns() Some packet schedulers might want to add a slack when programming hrtimers. This can reduce number of interrupts and increase batch sizes and thus give good xmit_more savings. This commit adds qdisc_watchdog_schedule_range_ns() helper, with an extra delta_ns parameter. Legacy qdisc_watchdog_schedule_n() becomes an inline passing a zero slack. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b90feaff |
|
27-Feb-2020 |
Gustavo A. R. Silva <gustavo@embeddedor.com> |
net: sched: Replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9c66d156 |
|
14-Sep-2019 |
Vinicius Costa Gomes <vinicius.gomes@intel.com> |
taprio: Add support for hardware offloading This allows taprio to offload the schedule enforcement to capable network cards, resulting in more precise windows and less CPU usage. The gate mask acts on traffic classes (groups of queues of same priority), as specified in IEEE 802.1Q-2018, and following the existing taprio and mqprio semantics. It is up to the driver to perform conversion between tc and individual netdev queues if for some reason it needs to make that distinction. Full offload is requested from the network interface by specifying "flags 2" in the tc qdisc creation command, which in turn corresponds to the TCA_TAPRIO_ATTR_FLAG_FULL_OFFLOAD bit. The important detail here is the clockid which is implicitly /dev/ptpN for full offload, and hence not configurable. A reference counting API is added to support the use case where Ethernet drivers need to keep the taprio offload structure locally (i.e. they are a multi-port switch driver, and configuring a port depends on the settings of other ports as well). The refcount_t variable is kept in a private structure (__tc_taprio_qopt_offload) and not exposed to drivers. In the future, the private structure might also be expanded with a backpointer to taprio_sched *q, to implement the notification system described in the patch (of when admin became oper, or an error occurred, etc, so the offload can be monitored with 'tc qdisc show'). Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Voon Weifeng <weifeng.voon@intel.com> Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d518d2ed |
|
11-Sep-2019 |
Paolo Abeni <pabeni@redhat.com> |
net/sched: fix race between deactivation and dequeue for NOLOCK qdisc The test implemented by some_qdisc_is_busy() is somewhat loosy for NOLOCK qdisc, as we may hit the following scenario: CPU1 CPU2 // in net_tx_action() clear_bit(__QDISC_STATE_SCHED...); // in some_qdisc_is_busy() val = (qdisc_is_running(q) || test_bit(__QDISC_STATE_SCHED, &q->state)); // here val is 0 but... qdisc_run(q) // ... CPU1 is going to run the qdisc next As a conseguence qdisc_run() in net_tx_action() can race with qdisc_reset() in dev_qdisc_reset(). Such race is not possible for !NOLOCK qdisc as both the above bit operations are under the root qdisc lock(). After commit 021a17ed796b ("pfifo_fast: drop unneeded additional lock on dequeue") the race can cause use after free and/or null ptr dereference, but the root cause is likely older. This patch addresses the issue explicitly checking for deactivation under the seqlock for NOLOCK qdisc, so that the qdisc_run() in the critical scenario becomes a no-op. Note that the enqueue() op can still execute concurrently with dev_qdisc_reset(), but that is safe due to the skb_array() locking, and we can't avoid that for NOLOCK qdiscs. Fixes: 021a17ed796b ("pfifo_fast: drop unneeded additional lock on dequeue") Reported-by: Li Shuang <shuali@redhat.com> Reported-and-tested-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3a7d0d07 |
|
24-Sep-2018 |
Vlad Buslov <vladbu@mellanox.com> |
net: sched: extend Qdisc with rcu Currently, Qdisc API functions assume that users have rtnl lock taken. To implement rtnl unlocked classifiers update interface, Qdisc API must be extended with functions that do not require rtnl lock. Extend Qdisc structure with rcu. Implement special version of put function qdisc_put_unlocked() that is called without rtnl lock taken. This function only takes rtnl lock if Qdisc reference counter reached zero and is intended to be used as optimization. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
88cab771 |
|
03-Jul-2018 |
Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> |
net/sched: Add HW offloading capability to ETF Add infra so etf qdisc supports HW offload of time-based transmission. For hw offload, the time sorted list is still used, so packets are dequeued always in order of txtime. Example: $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0 $ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 100000 \ clockid CLOCK_REALTIME In this example, the Qdisc will use HW offload for the control of the transmission time through the network adapter. The hrtimer used for packets scheduling inside the qdisc will use the clockid CLOCK_REALTIME as reference and packets leave the Qdisc "delta" (100000) nanoseconds before their transmission time. Because this will be using HW offload and since dynamic clocks are not supported by the hrtimer, the system clock and the PHC clock must be synchronized for this mode to behave as expected. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
860b642b |
|
03-Jul-2018 |
Vinicius Costa Gomes <vinicius.gomes@intel.com> |
net/sched: Allow creating a Qdisc watchdog with other clocks This adds 'qdisc_watchdog_init_clockid()' that allows a clockid to be passed, this allows other time references to be used when scheduling the Qdisc to run. Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
868717ae |
|
15-Jan-2018 |
Jakub Kicinski <kuba@kernel.org> |
net: remove prototype of qdisc_lookup_class() Looks like qdisc_lookup_class() never existed in the tree in the git era. Remove the prototype from the header. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a38a9882 |
|
19-Dec-2017 |
Alexander Aring <aring@mojatatu.com> |
net: sch: api: add extack support in qdisc_create_dflt This patch adds extack support for the function qdisc_create_dflt which is a common used function in the tc subsystem. Callers which are interested in the receiving error can assign extack to get a more detailed information why qdisc_create_dflt failed. The function qdisc_create_dflt will also call an init callback which can fail by any per-qdisc specific handling. Cc: David Ahern <dsahern@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e9bc3fa2 |
|
19-Dec-2017 |
Alexander Aring <aring@mojatatu.com> |
net: sch: api: add extack support in qdisc_get_rtab This patch adds extack support for the function qdisc_get_rtab which is a common used function in the tc subsystem. Callers which are interested in the receiving error can assign extack to get a more detailed information why qdisc_get_rtab failed. Cc: David Ahern <dsahern@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
29b86cda |
|
07-Dec-2017 |
John Fastabend <john.fastabend@gmail.com> |
net: sched: remove remaining uses for qdisc_qlen in xmit path sch_direct_xmit() uses qdisc_qlen as a return value but all call sites of the routine only check if it is zero or not. Simplify the logic so that we don't need to return an actual queue length value. This introduces a case now where sch_direct_xmit would have returned a qlen of zero but now it returns true. However in this case all call sites of sch_direct_xmit will implement a dequeue() and get a null skb and abort. This trades tracking qlen in the hotpath for an extra dequeue operation. Overall this seems to be good for performance. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6c148184 |
|
07-Dec-2017 |
John Fastabend <john.fastabend@gmail.com> |
net: sched: cleanup qdisc_run and __qdisc_run semantics Currently __qdisc_run calls qdisc_run_end() but does not call qdisc_run_begin(). This makes it hard to track pairs of qdisc_run_{begin,end} across function calls. To simplify reading these code paths this patch moves begin/end calls into qdisc_run(). Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b2441318 |
|
01-Nov-2017 |
Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
3d0bd028 |
|
16-Oct-2017 |
Vinicius Costa Gomes <vinicius.gomes@intel.com> |
net/sched: Add support for HW offloading for CBS This adds support for offloading the CBS algorithm to the controller, if supported. Drivers wanting to support CBS offload must implement the .ndo_setup_tc callback and handle the TC_SETUP_CBS (introduced here) type. Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Tested-by: Henrik Austad <henrik@austad.us> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
fa71212e |
|
19-Oct-2017 |
Jiri Pirko <jiri@mellanox.com> |
net: sched: remove unused is_classid_clsact_ingress/egress helpers These helpers are no longer in use by drivers, so remove them. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
855319be |
|
13-Oct-2017 |
Jiri Pirko <jiri@mellanox.com> |
net: sched: store net pointer in block and introduce qdisc_net helper Store net pointer in the block structure. Along the way, introduce qdisc_net helper which allows to easily obtain net pointer for qdisc instance. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
861932ec |
|
09-Aug-2017 |
Jiri Pirko <jiri@mellanox.com> |
net: sched: Add helpers to identify classids Offloading drivers need to understand what qdisc class a filter is added to. Currently they only need to identify ingress, clsact->ingress and clsact->egress. So provide these helpers. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
87d83093 |
|
17-May-2017 |
Jiri Pirko <jiri@mellanox.com> |
net: sched: move tc_classify function to cls_api.c Move tc_classify function to cls_api.c where it belongs, rename it to fit the namespace. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
49b49971 |
|
08-Mar-2017 |
Jiri Kosina <jkosina@suse.cz> |
net: sched: make default fifo qdiscs appear in the dump The original reason [1] for having hidden qdiscs (potential scalability issues in qdisc_match_from_root() with single linked list in case of large amount of qdiscs) has been invalidated by 59cc1f61f0 ("net: sched: convert qdisc linked list to hashtable"). This allows us for bringing more clarity and determinism into the dump by making default pfifo qdiscs visible. We're not turning this on by default though, at it was deemed [2] too intrusive / unnecessary change of default behavior towards userspace. Instead, TCA_DUMP_INVISIBLE netlink attribute is introduced, which allows applications to request complete qdisc hierarchy dump, including the ones that have always been implicit/invisible. Singleton noop_qdisc stays invisible, as teaching the whole infrastructure about singletons would require quite some surgery with very little gain (seeing no qdisc or seeing noop qdisc in the dump is probably setting the same user expectation). [1] http://lkml.kernel.org/r/1460732328.10638.74.camel@edumazet-glaptop3.roam.corp.google.com [2] http://lkml.kernel.org/r/20161021.105935.1907696543877061916.davem@davemloft.net Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d0a81f67 |
|
03-Nov-2016 |
Jesper Dangaard Brouer <brouer@redhat.com> |
net: make default TX queue length a defined constant The default TX queue length of Ethernet devices have been a magic constant of 1000, ever since the initial git import. Looking back in historical trees[1][2] the value used to be 100, with the same comment "Ethernet wants good queues". The commit[3] that changed this from 100 to 1000 didn't describe why, but from conversations with Robert Olsson it seems that it was changed when Ethernet devices went from 100Mbit/s to 1Gbit/s, because the link speed increased x10 the queue size were also adjusted. This value later caused much heartache for the bufferbloat community. This patch merely moves the value into a defined constant. [1] https://git.kernel.org/cgit/linux/kernel/git/davem/netdev-vger-cvs.git/ [2] https://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/ [3] https://git.kernel.org/tglx/history/c/98921832c232 Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
59cc1f61 |
|
10-Aug-2016 |
Jiri Kosina <jkosina@suse.cz> |
net: sched: convert qdisc linked list to hashtable Convert the per-device linked list into a hashtable. The primary motivation for this change is that currently, we're not tracking all the qdiscs in hierarchy (e.g. excluding default qdiscs), as the lookup performed over the linked list by qdisc_match_from_root() is rather expensive. The ultimate goal is to get rid of hidden qdiscs completely, which will bring much more determinism in user experience. Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
45f50bed |
|
10-Jun-2016 |
Eric Dumazet <edumazet@google.com> |
net_sched: remove generic throttled management __QDISC_STATE_THROTTLED bit manipulation is rather expensive for HTB and few others. I already removed it for sch_fq in commit f2600cf02b5b ("net: sched: avoid costly atomic operation in fq_dequeue()") and so far nobody complained. When one ore more packets are stuck in one or more throttled HTB class, a htb dequeue() performs two atomic operations to clear/set __QDISC_STATE_THROTTLED bit, while root qdisc lock is held. Removing this pair of atomic operations bring me a 8 % performance increase on 200 TCP_RR tests, in presence of throttled classes. This patch has no side effect, since nothing actually uses disc_is_throttled() anymore. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a9efad8b |
|
23-May-2016 |
Eric Dumazet <edumazet@google.com> |
net_sched: avoid too many hrtimer_start() calls I found a serious performance bug in packet schedulers using hrtimers. sch_htb and sch_fq are definitely impacted by this problem. We constantly rearm high resolution timers if some packets are throttled in one (or more) class, and other packets are flying through qdisc on another (non throttled) class. hrtimer_start() does not have the mod_timer() trick of doing nothing if expires value does not change : if (timer_pending(timer) && timer->expires == expires) return 1; This issue is particularly visible when multiple cpus can queue/dequeue packets on the same qdisc, as hrtimer code has to lock a remote base. I used following fix : 1) Change htb to use qdisc_watchdog_schedule_ns() instead of open-coding it. 2) Cache watchdog prior expiration. hrtimer might provide this, but I prefer to not rely on some hrtimer internal. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3b3ae880 |
|
26-Aug-2015 |
Daniel Borkmann <daniel@iogearbox.net> |
net: sched: consolidate tc_classify{,_compat} For classifiers getting invoked via tc_classify(), we always need an extra function call into tc_classify_compat(), as both are being exported as symbols and tc_classify() itself doesn't do much except handling of reclassifications when tp->classify() returned with TC_ACT_RECLASSIFY. CBQ and ATM are the only qdiscs that directly call into tc_classify_compat(), all others use tc_classify(). When tc actions are being configured out in the kernel, tc_classify() effectively does nothing besides delegating. We could spare this layer and consolidate both functions. pktgen on single CPU constantly pushing skbs directly into the netif_receive_skb() path with a dummy classifier on ingress qdisc attached, improves slightly from 22.3Mpps to 23.1Mpps. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
df8a39de |
|
13-Jan-2015 |
Jiri Pirko <jiri@resnulli.us> |
net: rename vlan_tx_* helpers since "tx" is misleading there The same macros are used for rx as well. So rename it. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d8b9605d |
|
13-Jan-2015 |
Jiri Pirko <jiri@resnulli.us> |
net: sched: fix skb->protocol use in case of accelerated vlan path tc code implicitly considers skb->protocol even in case of accelerated vlan paths and expects vlan protocol type here. However, on rx path, if the vlan header was already stripped, skb->protocol contains value of next header. Similar situation is on tx path. So for skbs that use skb->vlan_tci for tagging, use skb->vlan_proto instead. Reported-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f2600cf0 |
|
04-Oct-2014 |
Eric Dumazet <edumazet@google.com> |
net: sched: avoid costly atomic operation in fq_dequeue() Standard qdisc API to setup a timer implies an atomic operation on every packet dequeue : qdisc_unthrottled() It turns out this is not really needed for FQ, as FQ has no concept of global qdisc throttling, being a qdisc handling many different flows, some of them can be throttled, while others are not. Fix is straightforward : add a 'bool throttle' to qdisc_watchdog_schedule_ns(), and remove calls to qdisc_unthrottled() in sch_fq. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
55a93b3e |
|
03-Oct-2014 |
Eric Dumazet <edumazet@google.com> |
qdisc: validate skb without holding lock Validation of skb can be pretty expensive : GSO segmentation and/or checksum computations. We can do this without holding qdisc lock, so that other cpus can queue additional packets. Trick is that requeued packets were already validated, so we carry a boolean so that sch_direct_xmit() can validate a fresh skb list, or directly use an old one. Tested on 40Gb NIC (8 TX queues) and 200 concurrent flows, 48 threads host. Turning TSO on or off had no effect on throughput, only few more cpu cycles. Lock contention on qdisc lock disappeared. Same if disabling TX checksum offload. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d2de875c |
|
22-Aug-2014 |
Eric Dumazet <edumazet@google.com> |
net: use ktime_get_ns() and ktime_get_real_ns() helpers ktime_get_ns() replaces ktime_to_ns(ktime_get()) ktime_get_real_ns() replaces ktime_to_ns(ktime_get_real()) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6e765a00 |
|
11-Jun-2014 |
Florian Westphal <fw@strlen.de> |
net_sched: drr: warn when qdisc is not work conserving The DRR scheduler requires that items on the active list are work conserving, i.e. do not hold on to skbs for throttling purposes, etc. Attaching e.g. tbf renders DRR useless because all other classes on the active list are delayed as well. So, warn users that this configuration won't work as expected; we already do this in couple of other qdiscs, see e.g. commit b00355db3f88d96810a60011a30cfb2c3469409d ('pkt_sched: sch_hfsc: sch_htb: Add non-work-conserving warning handler') The 'const' change is needed to avoid compiler warning ("discards 'const' qualifier from pointer target type"). tested with: drr_hier() { parent=$1 classes=$2 for i in $(seq 1 $classes); do classid=$parent$(printf %x $i) tc class add dev eth0 parent $parent classid $classid drr tc qdisc add dev eth0 parent $classid tbf rate 64kbit burst 256kbit limit 64kbit done } tc qdisc add dev eth0 root handle 1: drr drr_hier 1: 32 tc filter add dev eth0 protocol all pref 1 parent 1: handle 1 flow hash keys dst perturb 1 divisor 32 Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
95dc1929 |
|
05-Dec-2013 |
Eric Dumazet <edumazet@google.com> |
pkt_sched: give visibility to mq slave qdiscs Commit 6da7c8fcbcbd ("qdisc: allow setting default queuing discipline") added the ability to change default qdisc from pfifo_fast to say fq But as most modern ethernet devices are multiqueue, we cant really see all the statistics from "tc -s qdisc show", as the default root qdisc is mq. This patch adds the calls to qdisc_list_add() to mq and mqprio Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6da7c8fc |
|
27-Aug-2013 |
stephen hemminger <stephen@networkplumber.org> |
qdisc: allow setting default queuing discipline By default, the pfifo_fast queue discipline has been used by default for all devices. But we have better choices now. This patch allow setting the default queueing discipline with sysctl. This allows easy use of better queueing disciplines on all devices without having to use tc qdisc scripts. It is intended to allow an easy path for distributions to make fq_codel or sfq the default qdisc. This patch also makes pfifo_fast more of a first class qdisc, since it is now possible to manually override the default and explicitly use pfifo_fast. The behavior for systems who do not use the sysctl is unchanged, they still get pfifo_fast Also removes leftover random # in sysctl net core. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5c15257f |
|
30-Jul-2013 |
Joe Perches <joe@perches.com> |
net: Remove extern from include/net/ scheduling prototypes There are a mix of function prototypes with and without extern in the kernel sources. Standardize on not using extern for function prototypes. Function prototypes don't need to be written with extern. extern is assumed by the compiler. Its use is as unnecessary as using auto to declare automatic/local variables in a block. Reflow modified prototypes to 80 columns. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
34c5d292 |
|
11-Feb-2013 |
Jiri Pirko <jiri@resnulli.us> |
sch_api: introduce qdisc_watchdog_schedule_ns() tbf will need to schedule watchdog in ns. No need to convert it twice. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
95c96174 |
|
14-Apr-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: cleanup unsigned to unsigned int Use of "unsigned int" is preferred to bare "unsigned" in net tree. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dc7f9f6e |
|
05-Jul-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: sched: constify tcf_proto and tc_action Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
25985edc |
|
30-Mar-2011 |
Lucas De Marchi <lucas.demarchi@profusion.mobi> |
Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
|
#
bc135b23 |
|
02-Jun-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: Define accessors to manipulate QDISC_STATE_RUNNING Define three helpers to manipulate QDISC_STATE_RUNNIG flag, that a second patch will move on another location. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5d944c64 |
|
31-Mar-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
gen_estimator: deadlock fix One of my test machine got a deadlock during "tc" sessions, adding/deleting classes & filters, using traffic estimators. After some analysis, I believe we have a potential use after free case in est_timer() : spin_lock(e->stats_lock); << HERE >> read_lock(&est_lock); if (e->bstats == NULL) << TEST >> goto skip; Test is done a bit late, because after estimator is killed, and before rcu grace period elapsed, we might already have freed/reuse memory where e->stats_locks points to (some qdisc->q.lock) A possible fix is to respect a rcu grace period at Qdisc dismantle time. On 64bit, sizeof(struct Qdisc) is exactly 192 bytes. Adding 16 bytes to it (for struct rcu_head) is a problem because it might change performance, given QDISC_ALIGNTO is 32 bytes. This is why I also change QDISC_ALIGNTO to 64 bytes, to satisfy most current alignment requirements. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
57dbb2d8 |
|
23-Jan-2010 |
Hagen Paul Pfeifer <hagen@jauu.net> |
sched: add head drop fifo queue This adds an additional queuing strategy, called pfifo_head_drop, to remove the oldest skb in the case of an overflow within the queue - the head element - instead of the last skb (tail). To remove the oldest skb in congested situations is useful for sensor network environments where newer packets reflect the superior information. Reviewed-by: Florian Westphal <fw@strlen.de> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fd2c3ef7 |
|
02-Nov-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: cleanup include/net This cleanup patch puts struct/union/enum opening braces, in first line to ease grep games. struct something { becomes : struct something { Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2fbd3da3 |
|
01-Sep-2009 |
David S. Miller <davem@davemloft.net> |
pkt_sched: Revert tasklet_hrtimer changes. These are full of unresolved problems, mainly that conversions don't work 1-1 from hrtimers to tasklet_hrtimers because unlike hrtimers tasklets can't be killed from softirq context. And when a qdisc gets reset, that's exactly what we need to do here. We'll work this out in the net-next-2.6 tree and if warranted we'll backport that work to -stable. This reverts the following 3 changesets: a2cb6a4dd470d7a64255a10b843b0d188416b78f ("pkt_sched: Fix bogon in tasklet_hrtimer changes.") 38acce2d7983632100a9ff3fd20295f6e34074a8 ("pkt_sched: Convert CBQ to tasklet_hrtimer.") ee5f9757ea17759e1ce5503bdae2b07e48e32af9 ("pkt_sched: Convert qdisc_watchdog to tasklet_hrtimer") Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ee5f9757 |
|
21-Aug-2009 |
David S. Miller <davem@davemloft.net> |
pkt_sched: Convert qdisc_watchdog to tasklet_hrtimer None of this stuff should execute in hw IRQ context, therefore use a tasklet_hrtimer so that it runs in softirq context. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Thomas Gleixner <tglx@linutronix.de>
|
#
bbd8a0d3 |
|
05-Aug-2009 |
Krishna Kumar <krkumar2@in.ibm.com> |
net: Avoid enqueuing skb for default qdiscs dev_queue_xmit enqueue's a skb and calls qdisc_run which dequeue's the skb and xmits it. In most cases, the skb that is enqueue'd is the same one that is dequeue'd (unless the queue gets stopped or multiple cpu's write to the same queue and ends in a race with qdisc_run). For default qdiscs, we can remove the redundant enqueue/dequeue and simply xmit the skb since the default qdisc is work-conserving. The patch uses a new flag - TCQ_F_CAN_BYPASS to identify the default fast queue. The controversial part of the patch is incrementing qlen when a skb is requeued - this is to avoid checks like the second line below: + } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) && >> !q->gso_skb && + !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state)) { Results of a 2 hour testing for multiple netperf sessions (1, 2, 4, 8, 12 sessions on a 4 cpu system-X). The BW numbers are aggregate Mb/s across iterations tested with this version on System-X boxes with Chelsio 10gbps cards: ---------------------------------- Size | ORG BW NEW BW | ---------------------------------- 128K | 156964 159381 | 256K | 158650 162042 | ---------------------------------- Changes from ver1: 1. Move sch_direct_xmit declaration from sch_generic.h to pkt_sched.h 2. Update qdisc basic statistics for direct xmit path. 3. Set qlen to zero in qdisc_reset. 4. Changed some function names to more meaningful ones. Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ca44d6e6 |
|
15-Jun-2009 |
Jarek Poplawski <jarkao2@gmail.com> |
pkt_sched: Rename PSCHED_US2NS and PSCHED_NS2US Let's use TICKS instead of US, so PSCHED_TICKS2NS and PSCHED_NS2TICKS (like in PSCHED_TICKS_PER_SEC already) to avoid misleading. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a4a710c4 |
|
08-Jun-2009 |
Jarek Poplawski <jarkao2@gmail.com> |
pkt_sched: Change PSCHED_SHIFT from 10 to 6 Change PSCHED_SHIFT from 10 to 6 to increase schedulers time resolution. This will increase 16x a number of (internal) ticks per nanosecond, and is needed to improve accuracy of schedulers based on rate tables, like HTB, TBF or CBQ, with rates above 100Mbit. It is assumed this change is safe for 32bit accounting of time diffs up to 2 minutes, which should be enough for common use (extremely low rate values may overflow, so get inaccurate instead). To make full use of this change an updated iproute2 will be needed. (But using older iproute2 should be safe too.) This change breaks ticks - microseconds similarity, so some minor code fixes might be needed. It is also planned to change naming adequately eg. to PSCHED_TICKS2NS() etc. in the near future. Reported-by: Antonio Almeida <vexwek@gmail.com> Tested-by: Antonio Almeida <vexwek@gmail.com> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
728bf098 |
|
08-Jun-2009 |
Jarek Poplawski <jarkao2@gmail.com> |
pkt_sched: Use PSCHED_SHIFT in PSCHED time conversion Use PSCHED_SHIFT constant instead of '10' in PSCHED_US2NS() and PSCHED_NS2US() macros to enable changing this value later. Additionally use PSCHED_SHIFT in sch_hfsc SM_SHIFT and ISM_SHIFT definitions. This part of the patch is based on feedback from Patrick McHardy <kaber@trash.net>. Reported-by: Antonio Almeida <vexwek@gmail.com> Tested-by: Antonio Almeida <vexwek@gmail.com> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b00355db |
|
01-Feb-2009 |
Jarek Poplawski <jarkao2@gmail.com> |
pkt_sched: sch_hfsc: sch_htb: Add non-work-conserving warning handler. Patrick McHardy <kaber@trash.net> suggested: > How about making this flag and the warning message (in a out-of-line > function) globally available? Other qdiscs (f.i. HFSC) can't deal with > inner non-work-conserving qdiscs as well. This patch uses qdisc->flags field of "suspected" child qdisc. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f4ab5432 |
|
23-Sep-2008 |
Jarek Poplawski <jarkao2@gmail.com> |
pkt_sched: Remove the tx queue state check in qdisc_run() The current check wrongly uses the state of one (currently the first) tx queue for all tx queues in case of non-default qdiscs. This check mainly prevented requeuing loop with __netif_schedule(), but now it's controlled inside __qdisc_run(), while dequeuing. The wrongness of this check was first noticed by Herbert Xu. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f6e0b239 |
|
22-Aug-2008 |
Jarek Poplawski <jarkao2@gmail.com> |
pkt_sched: Fix qdisc list locking Since some qdiscs call qdisc_tree_decrease_qlen() (so qdisc_lookup()) without rtnl_lock(), adding and deleting from a qdisc list needs additional locking. This patch adds global spinlock qdisc_list_lock and wrapper functions for modifying the list. It is considered as a temporary solution until hfsc_dequeue(), netem_dequeue() and tbf_dequeue() (or qdisc_tree_decrease_qlen()) are redone. With feedback from Herbert Xu and David S. Miller. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
83f36f3f |
|
13-Aug-2008 |
David S. Miller <davem@davemloft.net> |
pkt_sched: Add queue stopped test back to qdisc_run(). Based upon a bug report by Andrew Gallatin on netdev with subject "CPU utilization increased in 2.6.27rc" In commit 37437bb2e1ae8af470dfcd5b4ff454110894ccaf ("pkt_sched: Schedule qdiscs instead of netdev_queue.") the test of the queue being stopped was erroneously removed from qdisc_run(). When the TX queue of the device fills up, this omission causes lots of extraneous useless work to be queued up to softirq context, where we'll just return immediately because the device is still stuffed up. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
175f9c1b |
|
20-Jul-2008 |
Jussi Kivilinna <jussi.kivilinna@mbnet.fi> |
net_sched: Add size table for qdiscs Add size table functions for qdiscs and calculate packet size in qdisc_enqueue(). Based on patch by Patrick McHardy http://marc.info/?l=linux-netdev&m=115201979221729&w=2 Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
37437bb2 |
|
16-Jul-2008 |
David S. Miller <davem@davemloft.net> |
pkt_sched: Schedule qdiscs instead of netdev_queue. When we have shared qdiscs, packets come out of the qdiscs for multiple transmit queues. Therefore it doesn't make any sense to schedule the transmit queue when logically we cannot know ahead of time the TX queue of the SKB that the qdisc->dequeue() will give us. Just for sanity I added a BUG check to make sure we never get into a state where the noop_qdisc is scheduled. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e2627c8c |
|
16-Jul-2008 |
David S. Miller <davem@davemloft.net> |
pkt_sched: Make QDISC_RUNNING a qdisc state. Currently it is associated with a netdev_queue, but when we have qdisc sharing that no longer makes any sense. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fd2ea0a7 |
|
17-Jul-2008 |
David S. Miller <davem@davemloft.net> |
net: Use queue aware tests throughout. This effectively "flips the switch" by making the core networking and multiqueue-aware drivers use the new TX multiqueue structures. Non-multiqueue drivers need no changes. The interfaces they use such as netif_stop_queue() degenerate into an operation on TX queue zero. So everything "just works" for them. Code that really wants to do "X" to all TX queues now invokes a routine that does so, such as netif_tx_wake_all_queues(), netif_tx_stop_all_queues(), etc. pktgen and netpoll required a little bit more surgery than the others. In particular the pktgen changes, whilst functional, could be largely improved. The initial check in pktgen_xmit() will sometimes check the wrong queue, which is mostly harmless. The thing to do is probably to invoke fill_packet() earlier. The bulk of the netpoll changes is to make the code operate solely on the TX queue indicated by by the SKB queue mapping. Setting of the SKB queue mapping is entirely confined inside of net/core/dev.c:dev_pick_tx(). If we end up needing any kind of special semantics (drops, for example) it will be implemented here. Finally, we now have a "real_num_tx_queues" which is where the driver indicates how many TX queues are actually active. With IGB changes from Jeff Kirsher. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
79d16385 |
|
09-Jul-2008 |
David S. Miller <davem@davemloft.net> |
netdev: Move atomic queue state bits into netdev_queue. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
eb6aafe3 |
|
09-Jul-2008 |
David S. Miller <davem@davemloft.net> |
pkt_sched: Make qdisc_run take a netdev_queue. This allows us to use this calling convention all the way down into qdisc_restart(). Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fb0305ce |
|
06-Jul-2008 |
Patrick McHardy <kaber@trash.net> |
net-sched: consolidate default fifo qdisc setup Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1e90474c |
|
22-Jan-2008 |
Patrick McHardy <kaber@trash.net> |
[NET_SCHED]: Convert packet schedulers from rtnetlink to new netlink API Convert packet schedulers to use the netlink API. Unfortunately a gradual conversion is not possible without breaking compilation in the middle or adding lots of casts, so this patch converts them all in one step. The patch has been mostly generated automatically with some minor edits to at least allow seperate conversion of classifiers and actions. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3b04ddde |
|
09-Oct-2007 |
Stephen Hemminger <shemminger@linux-foundation.org> |
[NET]: Move hardware header operations out of netdevice. Since hardware header operations are part of the protocol class not the device instance, make them into a separate object and save memory. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
73ca4918 |
|
15-Jul-2007 |
Patrick McHardy <kaber@trash.net> |
[NET_SCHED]: act_api: qdisc internal reclassify support The behaviour of NET_CLS_POLICE for TC_POLICE_RECLASSIFY was to return it to the qdisc, which could handle it internally or ignore it. With NET_CLS_ACT however, tc_classify starts over at the first classifier and never returns it to the qdisc. This makes it impossible to support qdisc-internal reclassification, which in turn makes it impossible to remove the old NET_CLS_POLICE code without breaking compatibility since we have two qdiscs (CBQ and ATM) that support this. This patch adds a tc_classify_compat function that handles reclassification the old way and changes CBQ and ATM to use it. This again is of course not fully backwards compatible with the previous NET_CLS_ACT behaviour. Unfortunately there is no way to fully maintain compatibility *and* support qdisc internal reclassification with NET_CLS_ACT, but this seems like the better choice over keeping the two incompatible options around forever. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0463d4ae |
|
16-Apr-2007 |
Patrick McHardy <kaber@trash.net> |
[NET_SCHED]: Eliminate qdisc_tree_lock Since we're now holding the rtnl during the entire dump operation, we can remove qdisc_tree_lock, whose only purpose is to protect dump callbacks from concurrent changes to the qdisc tree. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3bebcda2 |
|
23-Mar-2007 |
Patrick McHardy <kaber@trash.net> |
[NET_SCHED]: turn PSCHED_GET_TIME into inline function Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
03cc45c0 |
|
23-Mar-2007 |
Patrick McHardy <kaber@trash.net> |
[NET_SCHED]: turn PSCHED_TDIFF_SAFE into inline function Also rename to psched_tdiff_bounded. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8edc0c31 |
|
23-Mar-2007 |
Patrick McHardy <kaber@trash.net> |
[NET_SCHED]: kill PSCHED_TDIFF Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a084980d |
|
23-Mar-2007 |
Patrick McHardy <kaber@trash.net> |
[NET_SCHED]: kill PSCHED_SET_PASTPERFECT/PSCHED_IS_PASTPERFECT Use direct assignment and comparison instead. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
104e0878 |
|
23-Mar-2007 |
Patrick McHardy <kaber@trash.net> |
[NET_SCHED]: kill PSCHED_TLESS Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7c59e25f |
|
23-Mar-2007 |
Patrick McHardy <kaber@trash.net> |
[NET_SCHED]: kill PSCHED_TADD/PSCHED_TADD2 Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
26e252df |
|
23-Mar-2007 |
Patrick McHardy <kaber@trash.net> |
[NET_SCHED]: kill PSCHED_AUDIT_TDIFF Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
00c04af9 |
|
16-Mar-2007 |
Patrick McHardy <kaber@trash.net> |
[NET_SCHED]: kill jiffie conversion macros Now that all packet schedulers have been converted to hrtimers most users of PSCHED_JIFFIE2US and PSCHED_US2JIFFIE are gone. The remaining users use it to convert external time units to packet scheduler clock ticks, so use PSCHED_TICKS_PER_SEC instead. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4179477f |
|
16-Mar-2007 |
Patrick McHardy <kaber@trash.net> |
[NET_SCHED]: Add hrtimer based qdisc watchdog Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
641b9e0e |
|
16-Mar-2007 |
Patrick McHardy <kaber@trash.net> |
[NET_SCHED]: Use ktime as clocksource Get rid of the manual clock source selection mess and use ktime. Also use a scalar representation, which allows to clean up pkt_sched.h a bit more and results in less ktime_to_ns() calls in most cases. The PSCHED_US2JIFFIE/PSCHED_JIFFIE2US macros are implemented quite inefficient by this patch, following patches will convert all qdiscs to hrtimers and get rid of them entirely. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2266d888 |
|
24-Jul-2006 |
Guillaume Chazarain <guichaz@yahoo.fr> |
[PKT_SCHED]: Fix regression in PSCHED_TADD{,2}. In PSCHED_TADD and PSCHED_TADD2, if delta is less than tv.tv_usec (so, less than USEC_PER_SEC too) then tv_res will be smaller than tv. The affectation "(tv_res).tv_usec = __delta;" is wrong. The fix is to revert to the original code before 4ee303dfeac6451b402e3d8512723d3a0f861857 and change the 'if' in 'while'. [Shuya MAEDA: "while (__delta >= USEC_PER_SEC){ ... }" instead of "while (__delta > USEC_PER_SEC){ ... }"] Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4ee303df |
|
28-Jun-2006 |
Shuya MAEDA <maeda-sxb@necst.nec.co.jp> |
[PKT_SCHED]: PSCHED_TADD() and PSCHED_TADD2() can result,tv_usec >= 1000000 Signed-off-by: Shuya MAEDA <maeda-sxb@necst.nec.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
48d83325 |
|
20-Jun-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Prevent multiple qdisc runs Having two or more qdisc_run's contend against each other is bad because it can induce packet reordering if the packets have to be requeued. It appears that this is an unintended consequence of relinquinshing the queue lock while transmitting. That in turn is needed for devices that spend a lot of time in their transmit routine. There are no advantages to be had as devices with queues are inherently single-threaded (the loopback device is not but then it doesn't have a queue). Even if you were to add a queue to a parallel virtual device (e.g., bolt a tbf filter in front of an ipip tunnel device), you would still want to process the queue in sequence to ensure that the packets are ordered correctly. The solution here is to steal a bit from net_device to prevent this. BTW, as qdisc_restart is no longer used by anyone as a module inside the kernel (IIRC it used to with netif_wake_queue), I have not exported the new __qdisc_run function. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
538e43a4 |
|
08-Jan-2006 |
Patrick McHardy <kaber@trash.net> |
[PKT_SCHED]: Use USEC_PER_SEC Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3d54b82f |
|
05-Jul-2005 |
Thomas Graf <tgraf@suug.ch> |
[PKT_SCHED]: Cleanup qdisc creation and alignment macros Adds qdisc_alloc() to share code between qdisc_create() and qdisc_create_dflt(). Hides the qdisc alignment behind macros and makes use of them. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e41a33e6 |
|
05-Jul-2005 |
Thomas Graf <tgraf@suug.ch> |
[PKT_SCHED]: Move sch_generic.c prototypes to correct header file Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9dfa277f |
|
03-May-2005 |
Patrick McHardy <kaber@trash.net> |
[PKT_SCHED]: Fix range in PSCHED_TDIFF_SAFE to 0..bound Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7b3c63ac |
|
28-Apr-2005 |
Nicolas Dichtel <nicolas.dichtel@6wind.com> |
[PKT_SCHED]: Fix range in psched_tod_diff() to 0..bound Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1da177e4 |
|
16-Apr-2005 |
Linus Torvalds <torvalds@ppc970.osdl.org> |
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!
|