History log of /linux-master/net/sched/sch_mqprio.c
Revision Date Author Comments
# 241a94ab 01-Feb-2024 Michal Koutný <mkoutny@suse.com>

net/sched: Add module aliases for cls_,sch_,act_ modules

No functional change intended, aliases will be used in followup commits.
Note for backporters: you may need to add aliases also for modules that
are already removed in mainline kernel but still in your version.

Patches were generated with the help of Coccinelle scripts like:

cat >scripts/coccinelle/misc/tcf_alias.cocci <<EOD
virtual patch
virtual report

@ haskernel @
@@

@ tcf_has_kind depends on report && haskernel @
identifier ops;
constant K;
@@

static struct tcf_proto_ops ops = {
.kind = K,
...
};
+char module_alias = K;
EOD

/usr/bin/spatch -D report --cocci-file scripts/coccinelle/misc/tcf_alias.cocci \
--dir . \
-I ./arch/x86/include -I ./arch/x86/include/generated -I ./include \
-I ./arch/x86/include/uapi -I ./arch/x86/include/generated/uapi \
-I ./include/uapi -I ./include/generated/uapi \
--include ./include/linux/compiler-version.h --include ./include/linux/kconfig.h \
--jobs 8 --chunksize 1 2>/dev/null | \
sed 's/char module_alias = "\([^"]*\)";/MODULE_ALIAS_NET_CLS("\1");/'

And analogously for:

static struct tc_action_ops ops = {
.kind = K,

static struct Qdisc_ops ops = {
.id = K,

(Someone familiar would be able to fit those into one .cocci file
without sed post processing.)

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240201130943.19536-3-mkoutny@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# f96118c5 27-Oct-2023 Victor Nogueira <victor@mojatatu.com>

net: sched: Fill in missing MODULE_DESCRIPTION for qdiscs

W=1 builds now warn if module is built without a MODULE_DESCRIPTION().

Fill in missing MODULE_DESCRIPTIONs for TC qdiscs.

Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Link: https://lore.kernel.org/r/20231027155045.46291-4-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 6c58c881 24-Jul-2023 Lin Ma <linma@zju.edu.cn>

net/sched: mqprio: Add length check for TCA_MQPRIO_{MAX/MIN}_RATE64

The nla_for_each_nested parsing in function mqprio_parse_nlattr() does
not check the length of the nested attribute. This can lead to an
out-of-attribute read and allow a malformed nlattr (e.g., length 0) to
be viewed as 8 byte integer and passed to priv->max_rate/min_rate.

This patch adds the check based on nla_len() when check the nla_type(),
which ensures that the length of these two attribute must equals
sizeof(u64).

Fixes: 4e8b86c06269 ("mqprio: Introduce new hardware offload mode and shaper in mqprio")
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Link: https://lore.kernel.org/r/20230725024227.426561-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# d636fc5d 06-Jun-2023 Eric Dumazet <edumazet@google.com>

net: sched: add rcu annotations around qdisc->qdisc_sleeping

syzbot reported a race around qdisc->qdisc_sleeping [1]

It is time we add proper annotations to reads and writes to/from
qdisc->qdisc_sleeping.

[1]
BUG: KCSAN: data-race in dev_graft_qdisc / qdisc_lookup_rcu

read to 0xffff8881286fc618 of 8 bytes by task 6928 on cpu 1:
qdisc_lookup_rcu+0x192/0x2c0 net/sched/sch_api.c:331
__tcf_qdisc_find+0x74/0x3c0 net/sched/cls_api.c:1174
tc_get_tfilter+0x18f/0x990 net/sched/cls_api.c:2547
rtnetlink_rcv_msg+0x7af/0x8c0 net/core/rtnetlink.c:6386
netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546
rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413
netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365
netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
____sys_sendmsg+0x375/0x4c0 net/socket.c:2503
___sys_sendmsg net/socket.c:2557 [inline]
__sys_sendmsg+0x1e3/0x270 net/socket.c:2586
__do_sys_sendmsg net/socket.c:2595 [inline]
__se_sys_sendmsg net/socket.c:2593 [inline]
__x64_sys_sendmsg+0x46/0x50 net/socket.c:2593
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

write to 0xffff8881286fc618 of 8 bytes by task 6912 on cpu 0:
dev_graft_qdisc+0x4f/0x80 net/sched/sch_generic.c:1115
qdisc_graft+0x7d0/0xb60 net/sched/sch_api.c:1103
tc_modify_qdisc+0x712/0xf10 net/sched/sch_api.c:1693
rtnetlink_rcv_msg+0x807/0x8c0 net/core/rtnetlink.c:6395
netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546
rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413
netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365
netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
____sys_sendmsg+0x375/0x4c0 net/socket.c:2503
___sys_sendmsg net/socket.c:2557 [inline]
__sys_sendmsg+0x1e3/0x270 net/socket.c:2586
__do_sys_sendmsg net/socket.c:2595 [inline]
__se_sys_sendmsg net/socket.c:2593 [inline]
__x64_sys_sendmsg+0x46/0x50 net/socket.c:2593
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 6912 Comm: syz-executor.5 Not tainted 6.4.0-rc3-syzkaller-00190-g0d85b27b0cc6 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/16/2023

Fixes: 3a7d0d07a386 ("net: sched: extend Qdisc with rcu")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Buslov <vladbu@nvidia.com>
Acked-by: Jamal Hadi Salim<jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f62af20b 11-Apr-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: mqprio: allow per-TC user input of FP adminStatus

IEEE 802.1Q-2018 clause 6.7.2 Frame preemption specifies that each
packet priority can be assigned to a "frame preemption status" value of
either "express" or "preemptible". Express priorities are transmitted by
the local device through the eMAC, and preemptible priorities through
the pMAC (the concepts of eMAC and pMAC come from the 802.3 MAC Merge
layer).

The FP adminStatus is defined per packet priority, but 802.1Q clause
12.30.1.1.1 framePreemptionAdminStatus also says that:

| Priorities that all map to the same traffic class should be
| constrained to use the same value of preemption status.

It is impossible to ignore the cognitive dissonance in the standard
here, because it practically means that the FP adminStatus only takes
distinct values per traffic class, even though it is defined per
priority.

I can see no valid use case which is prevented by having the kernel take
the FP adminStatus as input per traffic class (what we do here).
In addition, this also enforces the above constraint by construction.
User space network managers which wish to expose FP adminStatus per
priority are free to do so; they must only observe the prio_tc_map of
the netdev (which presumably is also under their control, when
constructing the mqprio netlink attributes).

The reason for configuring frame preemption as a property of the Qdisc
layer is that the information about "preemptible TCs" is closest to the
place which handles the num_tc and prio_tc_map of the netdev. If the
UAPI would have been any other layer, it would be unclear what to do
with the FP information when num_tc collapses to 0. A key assumption is
that only mqprio/taprio change the num_tc and prio_tc_map of the netdev.
Not sure if that's a great assumption to make.

Having FP in tc-mqprio can be seen as an implementation of the use case
defined in 802.1Q Annex S.2 "Preemption used in isolation". There will
be a separate implementation of FP in tc-taprio, for the other use
cases.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# c54876cd 11-Apr-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: pass netlink extack to mqprio and taprio offload

With the multiplexed ndo_setup_tc() model which lacks a first-class
struct netlink_ext_ack * argument, the only way to pass the netlink
extended ACK message down to the device driver is to embed it within the
offload structure.

Do this for struct tc_mqprio_qopt_offload and struct tc_taprio_qopt_offload.

Since struct tc_taprio_qopt_offload also contains a tc_mqprio_qopt_offload
structure, and since device drivers might effectively reuse their mqprio
implementation for the mqprio portion of taprio, we make taprio set the
extack in both offload structures to point at the same netlink extack
message.

In fact, the taprio handling is a bit more tricky, for 2 reasons.

First is because the offload structure has a longer lifetime than the
extack structure. The driver is supposed to populate the extack
synchronously from ndo_setup_tc() and leave it alone afterwards.
To not have any use-after-free surprises, we zero out the extack pointer
when we leave taprio_enable_offload().

The second reason is because taprio does overwrite the extack message on
ndo_setup_tc() error. We need to switch to the weak form of setting an
extack message, which preserves a potential message set by the driver.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# ab277d20 11-Apr-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: mqprio: add an extack message to mqprio_parse_opt()

Ferenc reports that a combination of poor iproute2 defaults and obscure
cases where the kernel returns -EINVAL make it difficult to understand
what is wrong with this command:

$ ip link add veth0 numtxqueues 8 numrxqueues 8 type veth peer name veth1
$ tc qdisc add dev veth0 root mqprio num_tc 8 map 0 1 2 3 4 5 6 7 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7
RTNETLINK answers: Invalid argument

Hopefully with this patch, the cause is clearer:

Error: Device does not support hardware offload.

The kernel was (and still is) rejecting this because iproute2 defaults
to "hw 1" if this command line option is not specified.

Link: https://lore.kernel.org/netdev/ede5e9a2f27bf83bfb86d3e8c4ca7b34093b99e2.camel@inf.elte.hu/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 57f21bf8 11-Apr-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: mqprio: add extack to mqprio_parse_nlattr()

Netlink attribute parsing in mqprio is a minesweeper game, with many
options having the possibility of being passed incorrectly and the user
being none the wiser.

Try to make errors less sour by giving user space some information
regarding what went wrong.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 3dd0c16e 11-Apr-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: mqprio: simplify handling of nlattr portion of TCA_OPTIONS

In commit 4e8b86c06269 ("mqprio: Introduce new hardware offload mode and
shaper in mqprio"), the TCA_OPTIONS format of mqprio was extended to
contain a fixed portion (of size NLA_ALIGN(sizeof struct tc_mqprio_qopt))
and a variable portion of other nlattrs (in the TCA_MQPRIO_* type space)
following immediately afterwards.

In commit feb2cf3dcfb9 ("net/sched: mqprio: refactor nlattr parsing to a
separate function"), we've moved the nlattr handling to a smaller
function, but yet, a small parse_attr() still remains, and the larger
mqprio_parse_nlattr() still does not have access to the beginning, and
the length, of the TCA_OPTIONS region containing these other nlattrs.

In a future change, the mqprio qdisc will need to iterate through this
nlattr region to discover other attributes, so eliminate parse_attr()
and add 2 variables in mqprio_parse_nlattr() which hold the beginning
and the length of the nlattr range.

We avoid the need to memset when nlattr_opt_len has insufficient length
by pre-initializing the table "tb".

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 8b0f2565 04-Apr-2023 Pedro Tammela <pctammela@mojatatu.com>

net/sched: sch_mqprio: use netlink payload helpers

For the sake of readability, use the netlink payload helpers from
the 'nla_get_*()' family to parse the attributes.

tdc results:
1..5
ok 1 9903 - Add mqprio Qdisc to multi-queue device (8 queues)
ok 2 453a - Delete nonexistent mqprio Qdisc
ok 3 5292 - Delete mqprio Qdisc twice
ok 4 45a9 - Add mqprio Qdisc to single-queue device
ok 5 2ba9 - Show mqprio class

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230404203449.1627033-1-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 9dd6ad67 04-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: refactor mqprio qopt reconstruction to a library function

The taprio qdisc will need to reconstruct a struct tc_mqprio_qopt from
netdev settings once more in a future patch, but this code was already
written twice, once in taprio and once in mqprio.

Refactor the code to a helper in the common mqprio library.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1dfe086d 04-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: centralize mqprio qopt validation

There is a lot of code in taprio which is "borrowed" from mqprio.
It makes sense to put a stop to the "borrowing" and start actually
reusing code.

Because taprio and mqprio are built as part of different kernel modules,
code reuse can only take place either by writing it as static inline
(limiting), putting it in sch_generic.o (not generic enough), or
creating a third auto-selectable kernel module which only holds library
code. I opted for the third variant.

In a previous change, mqprio gained support for reverse TC:TXQ mappings,
something which taprio still denies. Make taprio use the same validation
logic so that it supports this configuration as well.

The taprio code didn't enforce TXQ overlaps in txtime-assist mode and
that looks intentional, even if I've no idea why that might be. Preserve
that, but add a comment.

There isn't any dedicated MAINTAINERS entry for mqprio, so nothing to
update there.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d404959f 04-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: mqprio: add extack messages for queue count validation

To make mqprio more user-friendly, create netlink extended ack messages
which say exactly what is wrong about the queue counts. This uses the
new support for printf-formatted extack messages.

Example:

$ tc qdisc add dev eno0 root handle 1: mqprio num_tc 8 \
map 0 1 2 3 4 5 6 7 queues 3@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0
Error: sch_mqprio: TC 0 queues 3@0 overlap with TC 1 queues 1@1.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 19278d76 04-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: mqprio: allow offloading drivers to request queue count validation

mqprio_parse_opt() proudly has a comment:

/* If hardware offload is requested we will leave it to the device
* to either populate the queue counts itself or to validate the
* provided queue counts.
*/

Unfortunately some device drivers did not get this memo, and don't
validate the queue counts, or populate them.

In case drivers don't want to populate the queue counts themselves, just
act upon the requested configuration, it makes sense to introduce a tc
capability, and make mqprio query it, so they don't have to do the
validation themselves.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d7045f52 04-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: mqprio: allow reverse TC:TXQ mappings

By imposing that the last TXQ of TC i is smaller than the first TXQ of
any TC j (j := i+1 .. n), mqprio imposes a strict ordering condition for
the TXQ indices (they must increase as TCs increase).

Claudiu points out that the complexity of the TXQ count validation is
too high for this logic, i.e. instead of iterating over j, it is
sufficient that the TXQ indices of TC i and i + 1 are ordered, and that
will eventually ensure global ordering.

This is true, however it doesn't appear to me that is what the code
really intended to do. Instead, based on the comments, it just wanted to
check for overlaps (and this isn't how one does that).

So the following mqprio configuration, which I had recommended to
Vinicius more than once for igb/igc (to account for the fact that on
this hardware, lower numbered TXQs have higher dequeue priority than
higher ones):

num_tc 4 map 0 1 2 3 queues 1@3 1@2 1@1 1@0

is in fact denied today by mqprio.

The full story is that in fact, it's only denied with "hw 0"; if
hardware offloading is requested, mqprio defers TXQ range overlap
validation to the device driver (a strange decision in itself).

This is most certainly a bug, but it's not one that has any merit for
being fixed on "stable" as far as I can tell. This is because mqprio
always rejected a configuration which was in fact valid, and this has
shaped the way in which mqprio configuration scripts got built for
various hardware (see igb/igc in the link below). Therefore, one could
consider it to be merely an improvement for mqprio to allow reverse
TC:TXQ mappings.

Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230130173145.475943-9-vladimir.oltean@nxp.com/#25188310
Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230128010719.2182346-6-vladimir.oltean@nxp.com/#25186442
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5cfb45e2 04-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: mqprio: refactor offloading and unoffloading to dedicated functions

Some more logic will be added to mqprio offloading, so split that code
up from mqprio_init(), which is already large, and create a new
function, mqprio_enable_offload(), similar to taprio_enable_offload().
Also create the opposite function mqprio_disable_offload().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# feb2cf3d 04-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: mqprio: refactor nlattr parsing to a separate function

mqprio_init() is quite large and unwieldy to add more code to.
Split the netlink attribute parsing to a dedicated function.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e046fa89 20-Sep-2022 Zhengchao Shao <shaozhengchao@huawei.com>

net/sched: use tc_qdisc_stats_dump() in qdisc

use tc_qdisc_stats_dump() in qdisc.

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Tested-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 29cbcd85 16-Oct-2021 Ahmed S. Darwish <a.darwish@linutronix.de>

net: sched: Remove Qdisc::running sequence counter

The Qdisc::running sequence counter has two uses:

1. Reliably reading qdisc's tc statistics while the qdisc is running
(a seqcount read/retry loop at gnet_stats_add_basic()).

2. As a flag, indicating whether the qdisc in question is running
(without any retry loops).

For the first usage, the Qdisc::running sequence counter write section,
qdisc_run_begin() => qdisc_run_end(), covers a much wider area than what
is actually needed: the raw qdisc's bstats update. A u64_stats sync
point was thus introduced (in previous commits) inside the bstats
structure itself. A local u64_stats write section is then started and
stopped for the bstats updates.

Use that u64_stats sync point mechanism for the bstats read/retry loop
at gnet_stats_add_basic().

For the second qdisc->running usage, a __QDISC_STATE_RUNNING bit flag,
accessed with atomic bitops, is sufficient. Using a bit flag instead of
a sequence counter at qdisc_run_begin/end() and qdisc_is_running() leads
to the SMP barriers implicitly added through raw_read_seqcount() and
write_seqcount_begin/end() getting removed. All call sites have been
surveyed though, and no required ordering was identified.

Now that the qdisc->running sequence counter is no longer used, remove
it.

Note, using u64_stats implies no sequence counter protection for 64-bit
architectures. This can lead to the qdisc tc statistics "packets" vs.
"bytes" values getting out of sync on rare occasions. The individual
values will still be valid.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 50dc9a85 16-Oct-2021 Ahmed S. Darwish <a.darwish@linutronix.de>

net: sched: Merge Qdisc::bstats and Qdisc::cpu_bstats data types

The only factor differentiating per-CPU bstats data type (struct
gnet_stats_basic_cpu) from the packed non-per-CPU one (struct
gnet_stats_basic_packed) was a u64_stats sync point inside the former.
The two data types are now equivalent: earlier commits added a u64_stats
sync point to the latter.

Combine both data types into "struct gnet_stats_basic_sync". This
eliminates redundancy and simplifies the bstats read/write APIs.

Use u64_stats_t for bstats "packets" and "bytes" data types. On 64-bit
architectures, u64_stats sync points do not use sequence counter
protection.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 67c9e6270 16-Oct-2021 Ahmed S. Darwish <a.darwish@linutronix.de>

net: sched: Protect Qdisc::bstats with u64_stats

The not-per-CPU variant of qdisc tc (traffic control) statistics,
Qdisc::gnet_stats_basic_packed bstats, is protected with Qdisc::running
sequence counter.

This sequence counter is used for reliably protecting bstats reads from
parallel writes. Meanwhile, the seqcount's write section covers a much
wider area than bstats update: qdisc_run_begin() => qdisc_run_end().

That read/write section asymmetry can lead to needless retries of the
read section. To prepare for removing the Qdisc::running sequence
counter altogether, introduce a u64_stats sync point inside bstats
instead.

Modify _bstats_update() to start/end the bstats u64_stats write
section.

For bisectability, and finer commits granularity, the bstats read
section is still protected with a Qdisc::running read/retry loop and
qdisc_run_begin/end() still starts/ends that seqcount write section.
Once all call sites are modified to use _bstats_update(), the
Qdisc::running seqcount will be removed and bstats read/retry loop will
be modified to utilize the internal u64_stats sync point.

Note, using u64_stats implies no sequence counter protection for 64-bit
architectures. This can lead to the statistics "packets" vs. "bytes"
values getting out of sync on rare occasions. The individual values will
still be valid.

[bigeasy: Minor commit message edits, init all gnet_stats_basic_packed.]

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7361df46 16-Oct-2021 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

mq, mqprio: Use gnet_stats_add_queue().

gnet_stats_add_basic() and gnet_stats_add_queue() add up the statistics
so they can be used directly for both the per-CPU and global case.

gnet_stats_add_queue() copies either Qdisc's per-CPU
gnet_stats_queue::qlen or the global member. The global
gnet_stats_queue::qlen isn't touched in the per-CPU case so there is no
need to consider it in the global-case.

In the per-CPU case, the sum of global gnet_stats_queue::qlen and
the per-CPU gnet_stats_queue::qlen was assigned to sch->q.qlen and
sch->qstats.qlen. Now both fields are copied individually.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fbf307c8 16-Oct-2021 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

gen_stats: Add instead Set the value in __gnet_stats_copy_basic().

__gnet_stats_copy_basic() always assigns the value to the bstats
argument overwriting the previous value. The later added per-CPU version
always accumulated the values in the returning gnet_stats_basic_packed
argument.

Based on review there are five users of that function as of today:
- est_fetch_counters(), ___gnet_stats_copy_basic()
memsets() bstats to zero, single invocation.

- mq_dump(), mqprio_dump(), mqprio_dump_class_stats()
memsets() bstats to zero, multiple invocation but does not use the
function due to !qdisc_is_percpu_stats().

Add the values in __gnet_stats_copy_basic() instead overwriting. Rename
the function to gnet_stats_add_basic() to make it more obvious.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f7116fb4 17-Sep-2021 Jakub Kicinski <kuba@kernel.org>

net: sched: move and reuse mq_change_real_num_tx()

The code for handling active queue changes is identical
between mq and mqprio, reuse it.

Suggested-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1e080f17 13-Sep-2021 Jakub Kicinski <kuba@kernel.org>

net: sched: update default qdisc visibility after Tx queue cnt changes

mq / mqprio make the default child qdiscs visible. They only do
so for the qdiscs which are within real_num_tx_queues when the
device is registered. Depending on order of calls in the driver,
or if user space changes config via ethtool -L the number of
qdiscs visible under tc qdisc show will differ from the number
of queues. This is confusing to users and potentially to system
configuration scripts which try to make sure qdiscs have the
right parameters.

Add a new Qdisc_ops callback and make relevant qdiscs TTRT.

Note that this uncovers the "shortcut" created by
commit 1f27cde313d7 ("net: sched: use pfifo_fast for non real queues")
The default child qdiscs beyond initial real_num_tx are always
pfifo_fast, no matter what the sysfs setting is. Fixing this
gets a little tricky because we'd need to keep a reference
on whatever the default qdisc was at the time of creation.
In practice this is likely an non-issue the qdiscs likely have
to be configured to non-default settings, so whatever user space
is doing such configuration can replace the pfifos... now that
it will see them.

Reported-by: Matthew Massey <matthewmassey@fb.com>
Reviewed-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 14132690 07-Oct-2021 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

mqprio: Correct stats in mqprio_dump_class_stats().

Introduction of lockless subqueues broke the class statistics.
Before the change stats were accumulated in `bstats' and `qstats'
on the stack which was then copied to struct gnet_dump.

After the change the `bstats' and `qstats' are initialized to 0
and never updated, yet still fed to gnet_dump. The code updates
the global qdisc->cpu_bstats and qdisc->cpu_qstats instead,
clobbering them. Most likely a copy-paste error from the code in
mqprio_dump().

__gnet_stats_copy_basic() and __gnet_stats_copy_queue() accumulate
the values for per-CPU case but for global stats they overwrite
the value, so only stats from the last loop iteration / tc end up
in sch->[bq]stats.

Use the on-stack [bq]stats variables again and add the stats manually
in the global case.

Fixes: ce679e8df7ed2 ("net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprio")
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
https://lore.kernel.org/all/20211007175000.2334713-2-bigeasy@linutronix.de/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 9f104c77 06-Dec-2019 Vladyslav Tarasiuk <vladyslavt@mellanox.com>

mqprio: Fix out-of-bounds access in mqprio_dump

When user runs a command like
tc qdisc add dev eth1 root mqprio
KASAN stack-out-of-bounds warning is emitted.
Currently, NLA_ALIGN macro used in mqprio_dump provides too large
buffer size as argument for nla_put and memcpy down the call stack.
The flow looks like this:
1. nla_put expects exact object size as an argument;
2. Later it provides this size to memcpy;
3. To calculate correct padding for SKB, nla_put applies NLA_ALIGN
macro itself.

Therefore, NLA_ALIGN should not be applied to the nla_put parameter.
Otherwise it will lead to out-of-bounds memory access in memcpy.

Fixes: 4e8b86c06269 ("mqprio: Introduce new hardware offload mode and shaper in mqprio")
Signed-off-by: Vladyslav Tarasiuk <vladyslavt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2f23cd42 02-Dec-2019 Dust Li <dust.li@linux.alibaba.com>

net: sched: fix dump qlen for sch_mq/sch_mqprio with NOLOCK subqueues

sch->q.len hasn't been set if the subqueue is a NOLOCK qdisc
in mq_dump() and mqprio_dump().

Fixes: ce679e8df7ed ("net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprio")
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 14e54ab9 27-Nov-2019 Dust Li <dust.li@linux.alibaba.com>

net: sched: fix `tc -s class show` no bstats on class with nolock subqueues

When a classful qdisc's child qdisc has set the flag
TCQ_F_CPUSTATS (pfifo_fast for example), the child qdisc's
cpu_bstats should be passed to gnet_stats_copy_basic(),
but many classful qdisc didn't do that. As a result,
`tc -s class show dev DEV` always return 0 for bytes and
packets in this case.

Pass the child qdisc's cpu_bstats to gnet_stats_copy_basic()
to fix this issue.

The qstats also has this problem, but it has been fixed
in 5dd431b6b9 ("net: sched: introduce and use qstats read...")
and bstats still remains buggy.

Fixes: 22e0f8b9322c ("net: sched: make bstats per cpu and estimator RCU safe")
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d2912cb1 04-Jun-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500

Based on 2 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation #

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 4122 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 8cb08174 26-Apr-2019 Johannes Berg <johannes.berg@intel.com>

netlink: make validation more configurable for future strictness

We currently have two levels of strict validation:

1) liberal (default)
- undefined (type >= max) & NLA_UNSPEC attributes accepted
- attribute length >= expected accepted
- garbage at end of message accepted
2) strict (opt-in)
- NLA_UNSPEC attributes accepted
- attribute length >= expected accepted

Split out parsing strictness into four different options:
* TRAILING - check that there's no trailing data after parsing
attributes (in message or nested)
* MAXTYPE - reject attrs > max known type
* UNSPEC - reject attributes with NLA_UNSPEC policy entries
* STRICT_ATTRS - strictly validate attribute size

The default for future things should be *everything*.
The current *_strict() is a combination of TRAILING and MAXTYPE,
and is renamed to _deprecated_strict().
The current regular parsing has none of this, and is renamed to
*_parse_deprecated().

Additionally it allows us to selectively set one of the new flags
even on old policies. Notably, the UNSPEC flag could be useful in
this case, since it can be arranged (by filling in the policy) to
not be an incompatible userspace ABI change, but would then going
forward prevent forgetting attribute entries. Similar can apply
to the POLICY flag.

We end up with the following renames:
* nla_parse -> nla_parse_deprecated
* nla_parse_strict -> nla_parse_deprecated_strict
* nlmsg_parse -> nlmsg_parse_deprecated
* nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
* nla_parse_nested -> nla_parse_nested_deprecated
* nla_validate_nested -> nla_validate_nested_deprecated

Using spatch, of course:
@@
expression TB, MAX, HEAD, LEN, POL, EXT;
@@
-nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
+nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

@@
expression NLH, HDRLEN, TB, MAX, POL, EXT;
@@
-nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
+nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

@@
expression NLH, HDRLEN, TB, MAX, POL, EXT;
@@
-nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
+nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

@@
expression TB, MAX, NLA, POL, EXT;
@@
-nla_parse_nested(TB, MAX, NLA, POL, EXT)
+nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

@@
expression START, MAX, POL, EXT;
@@
-nla_validate_nested(START, MAX, POL, EXT)
+nla_validate_nested_deprecated(START, MAX, POL, EXT)

@@
expression NLH, HDRLEN, MAX, POL, EXT;
@@
-nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
+nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

For this patch, don't actually add the strict, non-renamed versions
yet so that it breaks compile if I get it wrong.

Also, while at it, make nla_validate and nla_parse go down to a
common __nla_validate_parse() function to avoid code duplication.

Ultimately, this allows us to have very strict validation for every
new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
next patch, while existing things will continue to work as is.

In effect then, this adds fully strict validation for any new command.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ae0be8de 26-Apr-2019 Michal Kubecek <mkubecek@suse.cz>

netlink: make nla_nest_start() add NLA_F_NESTED flag

Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
netlink based interfaces (including recently added ones) are still not
setting it in kernel generated messages. Without the flag, message parsers
not aware of attribute semantics (e.g. wireshark dissector or libmnl's
mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
the structure of their contents.

Unfortunately we cannot just add the flag everywhere as there may be
userspace applications which check nlattr::nla_type directly rather than
through a helper masking out the flags. Therefore the patch renames
nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
are rewritten to use nla_nest_start().

Except for changes in include/net/netlink.h, the patch was generated using
this semantic patch:

@@ expression E1, E2; @@
-nla_nest_start(E1, E2)
+nla_nest_start_noflag(E1, E2)

@@ expression E1, E2; @@
-nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
+nla_nest_start(E1, E2)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5dd431b6 28-Mar-2019 Paolo Abeni <pabeni@redhat.com>

net: sched: introduce and use qstats read helpers

Classful qdiscs can't access directly the child qdiscs backlog
length: if such qdisc is NOLOCK, per CPU values should be
accounted instead.

Most qdiscs no not respect the above. As a result, qstats fetching
for most classful qdisc is currently incorrect: if the child qdisc is
NOLOCK, it always reports 0 len backlog.

This change introduces a pair of helpers to safely fetch
both backlog and qlen and use them in stats class dumping
functions, fixing the above issue and cleaning a bit the code.

DRR needs also to access the child qdisc queue length, so it
needs custom handling.

Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 86bd446b 24-Sep-2018 Vlad Buslov <vladbu@mellanox.com>

net: sched: rename qdisc_destroy() to qdisc_put()

Current implementation of qdisc_destroy() decrements Qdisc reference
counter and only actually destroy Qdisc if reference counter value reached
zero. Rename qdisc_destroy() to qdisc_put() in order for it to better
describe the way in which this function currently implemented and used.

Extract code that deallocates Qdisc into new private qdisc_destroy()
function. It is intended to be shared between regular qdisc_put() and its
unlocked version that is introduced in next patch in this series.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a38a9882 19-Dec-2017 Alexander Aring <aring@mojatatu.com>

net: sch: api: add extack support in qdisc_create_dflt

This patch adds extack support for the function qdisc_create_dflt which is
a common used function in the tc subsystem. Callers which are interested
in the receiving error can assign extack to get a more detailed
information why qdisc_create_dflt failed. The function qdisc_create_dflt
will also call an init callback which can fail by any per-qdisc specific
handling.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 653d6fd6 19-Dec-2017 Alexander Aring <aring@mojatatu.com>

net: sched: sch: add extack for graft callback

This patch adds extack support for graft callback to prepare per-qdisc
specific changes for extack.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e63d7dfd 19-Dec-2017 Alexander Aring <aring@mojatatu.com>

net: sched: sch: add extack for init callback

This patch adds extack support for init callback to prepare per-qdisc
specific changes for extack.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ce679e8d 07-Dec-2017 John Fastabend <john.fastabend@gmail.com>

net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprio

The sch_mqprio qdisc creates a sub-qdisc per tx queue which are then
called independently for enqueue and dequeue operations. However
statistics are aggregated and pushed up to the "master" qdisc.

This patch adds support for any of the sub-qdiscs to be per cpu
statistic qdiscs. To handle this case add a check when calculating
stats and aggregate the per cpu stats if needed.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 575ed7d3 05-Nov-2017 Nogah Frankel <nogahf@mellanox.com>

net_sch: mqprio: Change TC_SETUP_MQPRIO to TC_SETUP_QDISC_MQPRIO

Change TC_SETUP_MQPRIO to TC_SETUP_QDISC_MQPRIO to match the new
convention.

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0f7787b4 16-Oct-2017 Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>

net/sched: Add select_queue() class_ops for mqprio

When replacing a child qdisc from mqprio, tc_modify_qdisc() must fetch
the netdev_queue pointer that the current child qdisc is associated
with before creating the new qdisc.

Currently, when using mqprio as root qdisc, the kernel will end up
getting the queue #0 pointer from the mqprio (root qdisc), which leaves
any new child qdisc with a possibly wrong netdev_queue pointer.

Implementing the Qdisc_class_ops select_queue() on mqprio fixes this
issue and avoid an inconsistent state when child qdiscs are replaced.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Tested-by: Henrik Austad <henrik@austad.us>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>


# 22ce97fe 17-Oct-2017 Colin Ian King <colin.king@canonical.com>

mqprio: fix potential null pointer dereference on opt

The pointer opt has a null check however before for this check opt is
dereferenced when len is initialized, hence we potentially have a null
pointer deference on opt. Avoid this by checking for a null opt before
dereferencing it.

Detected by CoverityScan, CID#1458234 ("Dereference before null check")

Fixes: 4e8b86c06269 ("mqprio: Introduce new hardware offload mode and shaper in mqprio")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 32302902 12-Oct-2017 Alexander Duyck <alexander.h.duyck@intel.com>

mqprio: Reserve last 32 classid values for HW traffic classes and misc IDs

This patch makes a slight tweak to mqprio in order to bring the
classid values used back in line with what is used for mq. The general idea
is to reserve values :ffe0 - :ffef to identify hardware traffic classes
normally reported via dev->num_tc. By doing this we can maintain a
consistent behavior with mq for classid where :1 - :ffdf will represent a
physical qdisc mapped onto a Tx queue represented by classid - 1, and the
traffic classes will be mapped onto a known subset of classid values
reserved for our virtual qdiscs.

Note I reserved the range from :fff0 - :ffff since this way we might be
able to reuse these classid values with clsact and ingress which would mean
that for mq, mqprio, ingress, and clsact we should be able to maintain a
similar classid layout.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4e8b86c0 07-Sep-2017 Amritha Nambiar <amritha.nambiar@intel.com>

mqprio: Introduce new hardware offload mode and shaper in mqprio

The offload types currently supported in mqprio are 0 (no offload) and
1 (offload only TCs) by setting these values for the 'hw' option. If
offloads are supported by setting the 'hw' option to 1, the default
offload mode is 'dcb' where only the TC values are offloaded to the
device. This patch introduces a new hardware offload mode called
'channel' with 'hw' set to 1 in mqprio which makes full use of the
mqprio options, the TCs, the queue configurations and the QoS parameters
for the TCs. This is achieved through a new netlink attribute for the
'mode' option which takes values such as 'dcb' (default) and 'channel'.
The 'channel' mode also supports QoS attributes for traffic class such as
minimum and maximum values for bandwidth rate limits.

This patch enables configuring additional HW shaper attributes associated
with a traffic class. Currently the shaper for bandwidth rate limiting is
supported which takes options such as minimum and maximum bandwidth rates
and are offloaded to the hardware in the 'channel' mode. The min and max
limits for bandwidth rates are provided by the user along with the TCs
and the queue configurations when creating the mqprio qdisc. The interface
can be extended to support new HW shapers in future through the 'shaper'
attribute.

Introduces a new data structure 'tc_mqprio_qopt_offload' for offloading
mqprio queue options and use this to be shared between the kernel and
device driver. This contains a copy of the existing data structure
for mqprio queue options. This new data structure can be extended when
adding new attributes for traffic class such as mode, shaper, shaper
parameters (bandwidth rate limits). The existing data structure for mqprio
queue options will be shared between the kernel and userspace.

Example:
queues 4@0 4@4 hw 1 mode channel shaper bw_rlimit\
min_rate 1Gbit 2Gbit max_rate 4Gbit 5Gbit

To dump the bandwidth rates:

qdisc mqprio 804a: root tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
queues:(0:3) (4:7)
mode:channel
shaper:bw_rlimit min_rate:1Gbit 2Gbit max_rate:4Gbit 5Gbit

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>


# 143976ce 24-Aug-2017 WANG Cong <xiyou.wangcong@gmail.com>

net_sched: remove tc class reference counting

For TC classes, their ->get() and ->put() are always paired, and the
reference counting is completely useless, because:

1) For class modification and dumping paths, we already hold RTNL lock,
so all of these ->get(),->change(),->put() are atomic.

2) For filter bindiing/unbinding, we use other reference counter than
this one, and they should have RTNL lock too.

3) For ->qlen_notify(), it is special because it is called on ->enqueue()
path, but we already hold qdisc tree lock there, and we hold this
tree lock when graft or delete the class too, so it should not be gone
or changed until we release the tree lock.

Therefore, this patch removes ->get() and ->put(), but:

1) Adds a new ->find() to find the pointer to a class by classid, no
refcnt.

2) Move the original class destroy upon the last refcnt into ->delete(),
right after releasing tree lock. This is fine because the class is
already removed from hash when holding the lock.

For those who also use ->put() as ->unbind(), just rename them to reflect
this change.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# de4784ca 07-Aug-2017 Jiri Pirko <jiri@mellanox.com>

net: sched: get rid of struct tc_to_netdev

Get rid of struct tc_to_netdev which is now just unnecessary container
and rather pass per-type structures down to drivers directly.
Along with that, consolidate the naming of per-type structure variables
in cls_*.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5fd9fc4e 07-Aug-2017 Jiri Pirko <jiri@mellanox.com>

net: sched: push cls related args into cls_common structure

As ndo_setup_tc is generic offload op for whole tc subsystem, does not
really make sense to have cls-specific args. So move them under
cls_common structurure which is embedded in all cls structs.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2572ac53 07-Aug-2017 Jiri Pirko <jiri@mellanox.com>

net: sched: make type an argument for ndo_setup_tc

Since the type is always present, push it to be a separate argument to
ndo_setup_tc. On the way, name the type enum and use it for arg type.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a5fcf8a6 06-Jun-2017 Jiri Pirko <jiri@mellanox.com>

net: propagate tc filter chain index down the ndo_setup_tc call

We need to push the chain index down to the drivers, so they have the
information to which chain the rule belongs. For now, no driver supports
multichain offload, so only chain 0 is supported. This is needed to
prevent chain squashes during offload for now. Later this will be used
to implement multichain offload.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 56f36acd 15-Mar-2017 Amritha Nambiar <amritha.nambiar@intel.com>

mqprio: Modify mqprio to pass user parameters via ndo_setup_tc.

The configurable priority to traffic class mapping and the user specified
queue ranges are used to configure the traffic class, overriding the
hardware defaults when the 'hw' option is set to 0. However, when the 'hw'
option is non-zero, the hardware QOS defaults are used.

This patch makes it so that we can pass the data the user provided to
ndo_setup_tc. This allows us to pull in the queue configuration if the
user requested it as well as any additional hardware offload type
requested by using a value other than 1 for the hw value.

Finally it also provides a means for the device driver to return the level
supported for the offload type via the qopt->hw value. Previously we were
just always assuming the value to be 1, in the future values beyond just 1
may be supported.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2026fecf 15-Mar-2017 Alexander Duyck <alexander.h.duyck@intel.com>

mqprio: Change handling of hw u8 to allow for multiple hardware offload modes

This patch is meant to allow for support of multiple hardware offload type
for a single device. There is currently no bounds checking for the hw
member of the mqprio_qopt structure. This results in us being able to pass
values from 1 to 255 with all being treated the same. On retreiving the
value it is returned as 1 for anything 1 or greater being set.

With this change we are currently adding limited bounds checking by
defining an enum and using those values to limit the reported hardware
offloads.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 49b49971 08-Mar-2017 Jiri Kosina <jkosina@suse.cz>

net: sched: make default fifo qdiscs appear in the dump

The original reason [1] for having hidden qdiscs (potential scalability
issues in qdisc_match_from_root() with single linked list in case of large
amount of qdiscs) has been invalidated by 59cc1f61f0 ("net: sched: convert
qdisc linked list to hashtable").

This allows us for bringing more clarity and determinism into the dump by
making default pfifo qdiscs visible.

We're not turning this on by default though, at it was deemed [2] too
intrusive / unnecessary change of default behavior towards userspace.
Instead, TCA_DUMP_INVISIBLE netlink attribute is introduced, which allows
applications to request complete qdisc hierarchy dump, including the
ones that have always been implicit/invisible.

Singleton noop_qdisc stays invisible, as teaching the whole infrastructure
about singletons would require quite some surgery with very little gain
(seeing no qdisc or seeing noop qdisc in the dump is probably setting
the same user expectation).

[1] http://lkml.kernel.org/r/1460732328.10638.74.camel@edumazet-glaptop3.roam.corp.google.com
[2] http://lkml.kernel.org/r/20161021.105935.1907696543877061916.davem@davemloft.net

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 87b60cfa 10-Feb-2017 Eric Dumazet <edumazet@google.com>

net_sched: fix error recovery at qdisc creation

Dmitry reported uses after free in qdisc code [1]

The problem here is that ops->init() can return an error.

qdisc_create_dflt() then call ops->destroy(),
while qdisc_create() does _not_ call it.

Four qdisc chose to call their own ops->destroy(), assuming their caller
would not.

This patch makes sure qdisc_create() calls ops->destroy()
and fixes the four qdisc to avoid double free.

[1]
BUG: KASAN: use-after-free in mq_destroy+0x242/0x290 net/sched/sch_mq.c:33 at addr ffff8801d415d440
Read of size 8 by task syz-executor2/5030
CPU: 0 PID: 5030 Comm: syz-executor2 Not tainted 4.3.5-smp-DEV #119
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
0000000000000046 ffff8801b435b870 ffffffff81bbbed4 ffff8801db000400
ffff8801d415d440 ffff8801d415dc40 ffff8801c4988510 ffff8801b435b898
ffffffff816682b1 ffff8801b435b928 ffff8801d415d440 ffff8801c49880c0
Call Trace:
[<ffffffff81bbbed4>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81bbbed4>] dump_stack+0x6c/0x98 lib/dump_stack.c:51
[<ffffffff816682b1>] kasan_object_err+0x21/0x70 mm/kasan/report.c:158
[<ffffffff81668524>] print_address_description mm/kasan/report.c:196 [inline]
[<ffffffff81668524>] kasan_report_error+0x1b4/0x4b0 mm/kasan/report.c:285
[<ffffffff81668953>] kasan_report mm/kasan/report.c:305 [inline]
[<ffffffff81668953>] __asan_report_load8_noabort+0x43/0x50 mm/kasan/report.c:326
[<ffffffff82527b02>] mq_destroy+0x242/0x290 net/sched/sch_mq.c:33
[<ffffffff82524bdd>] qdisc_destroy+0x12d/0x290 net/sched/sch_generic.c:953
[<ffffffff82524e30>] qdisc_create_dflt+0xf0/0x120 net/sched/sch_generic.c:848
[<ffffffff8252550d>] attach_default_qdiscs net/sched/sch_generic.c:1029 [inline]
[<ffffffff8252550d>] dev_activate+0x6ad/0x880 net/sched/sch_generic.c:1064
[<ffffffff824b1db1>] __dev_open+0x221/0x320 net/core/dev.c:1403
[<ffffffff824b24ce>] __dev_change_flags+0x15e/0x3e0 net/core/dev.c:6858
[<ffffffff824b27de>] dev_change_flags+0x8e/0x140 net/core/dev.c:6926
[<ffffffff824f5bf6>] dev_ifsioc+0x446/0x890 net/core/dev_ioctl.c:260
[<ffffffff824f61fa>] dev_ioctl+0x1ba/0xb80 net/core/dev_ioctl.c:546
[<ffffffff82430509>] sock_do_ioctl+0x99/0xb0 net/socket.c:879
[<ffffffff82430d30>] sock_ioctl+0x2a0/0x390 net/socket.c:958
[<ffffffff816f3b68>] vfs_ioctl fs/ioctl.c:44 [inline]
[<ffffffff816f3b68>] do_vfs_ioctl+0x8a8/0xe50 fs/ioctl.c:611
[<ffffffff816f41a4>] SYSC_ioctl fs/ioctl.c:626 [inline]
[<ffffffff816f41a4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:617
[<ffffffff8123e357>] entry_SYSCALL_64_fastpath+0x12/0x17

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 59cc1f61 10-Aug-2016 Jiri Kosina <jkosina@suse.cz>

net: sched: convert qdisc linked list to hashtable

Convert the per-device linked list into a hashtable. The primary
motivation for this change is that currently, we're not tracking all the
qdiscs in hierarchy (e.g. excluding default qdiscs), as the lookup
performed over the linked list by qdisc_match_from_root() is rather
expensive.

The ultimate goal is to get rid of hidden qdiscs completely, which will
bring much more determinism in user experience.

Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>


# edb09eb1 06-Jun-2016 Eric Dumazet <edumazet@google.com>

net: sched: do not acquire qdisc spinlock in qdisc/class stats dump

Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
agent [1] are problematic at scale :

For each qdisc/class found in the dump, we currently lock the root qdisc
spinlock in order to get stats. Sampling stats every 5 seconds from
thousands of HTB classes is a challenge when the root qdisc spinlock is
under high pressure. Not only the dumps take time, they also slow
down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases.

An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
that might need the qdisc lock in fq_codel_dump_stats() and
fq_codel_dump_class_stats()

In v2 of this patch, I now use the Qdisc running seqcount to provide
consistent reads of packets/bytes counters, regardless of 32/64 bit arches.

I also changed rate estimators to use the same infrastructure
so that they no longer need to lock root qdisc lock.

[1]
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kevin Athey <kda@google.com>
Cc: Xiaotian Pei <xiaotian@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1f27cde3 02-Mar-2016 Eric Dumazet <edumazet@google.com>

net: sched: use pfifo_fast for non real queues

Some devices declare a high number of TX queues, then set a much
lower real_num_tx_queues

This cause setups using fq_codel, sfq or fq as the default qdisc to consume
more memory than really needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 241deec9 01-Mar-2016 David S. Miller <davem@davemloft.net>

sch_mqprio: Fix build with older gcc.

CC [M] net/sched/sch_mqprio.o
net/sched/sch_mqprio.c: In function ?mqprio_init?:
net/sched/sch_mqprio.c:145: error: unknown field ?tc? specified in initializer
net/sched/sch_mqprio.c:145: warning: missing braces around initializer
net/sched/sch_mqprio.c:145: warning: (near initialization for ?tc.<anonymous>?)
make[2]: *** [net/sched/sch_mqprio.o] Error 1
make[1]: *** [net/sched] Error 2
make: *** [net] Error 2

Several people reported this, surround the unnamed union
member initialization with braces to fix.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 16e5cc64 16-Feb-2016 John Fastabend <john.fastabend@gmail.com>

net: rework setup_tc ndo op to consume general tc operand

This patch updates setup_tc so we can pass additional parameters into
the ndo op in a generic way. To do this we provide structured union
and type flag.

This lets each classifier and qdisc provide its own set of attributes
without having to add new ndo ops or grow the signature of the
callback.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e4c6734e 16-Feb-2016 John Fastabend <john.fastabend@gmail.com>

net: rework ndo tc op to consume additional qdisc handle parameter

The ndo_setup_tc() op was added to support drivers offloading tx
qdiscs however only support for mqprio was ever added. So we
only ever added support for passing the number of traffic classes
to the driver.

This patch generalizes the ndo_setup_tc op so that a handle can
be provided to indicate if the offload is for ingress or egress
or potentially even child qdiscs.

CC: Murali Karicheri <m-karicheri2@ti.com>
CC: Shradha Shah <sshah@solarflare.com>
CC: Or Gerlitz <ogerlitz@mellanox.com>
CC: Ariel Elior <ariel.elior@qlogic.com>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Bruce Allan <bruce.w.allan@intel.com>
CC: Jesse Brandeburg <jesse.brandeburg@intel.com>
CC: Don Skidmore <donald.c.skidmore@intel.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4eaf3b84 01-Dec-2015 Eric Dumazet <edumazet@google.com>

net_sched: fix qdisc_tree_decrease_qlen() races

qdisc_tree_decrease_qlen() suffers from two problems on multiqueue
devices.

One problem is that it updates sch->q.qlen and sch->qstats.drops
on the mq/mqprio root qdisc, while it should not : Daniele
reported underflows errors :
[ 681.774821] PAX: sch->q.qlen: 0 n: 1
[ 681.774825] PAX: size overflow detected in function qdisc_tree_decrease_qlen net/sched/sch_api.c:769 cicus.693_49 min, count: 72, decl: qlen; num: 0; context: sk_buff_head;
[ 681.774954] CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: G O 4.2.6.201511282239-1-grsec #1
[ 681.774955] Hardware name: ASUSTeK COMPUTER INC. X302LJ/X302LJ, BIOS X302LJ.202 03/05/2015
[ 681.774956] ffffffffa9a04863 0000000000000000 0000000000000000 ffffffffa990ff7c
[ 681.774959] ffffc90000d3bc38 ffffffffa95d2810 0000000000000007 ffffffffa991002b
[ 681.774960] ffffc90000d3bc68 ffffffffa91a44f4 0000000000000001 0000000000000001
[ 681.774962] Call Trace:
[ 681.774967] [<ffffffffa95d2810>] dump_stack+0x4c/0x7f
[ 681.774970] [<ffffffffa91a44f4>] report_size_overflow+0x34/0x50
[ 681.774972] [<ffffffffa94d17e2>] qdisc_tree_decrease_qlen+0x152/0x160
[ 681.774976] [<ffffffffc02694b1>] fq_codel_dequeue+0x7b1/0x820 [sch_fq_codel]
[ 681.774978] [<ffffffffc02680a0>] ? qdisc_peek_dequeued+0xa0/0xa0 [sch_fq_codel]
[ 681.774980] [<ffffffffa94cd92d>] __qdisc_run+0x4d/0x1d0
[ 681.774983] [<ffffffffa949b2b2>] net_tx_action+0xc2/0x160
[ 681.774985] [<ffffffffa90664c1>] __do_softirq+0xf1/0x200
[ 681.774987] [<ffffffffa90665ee>] run_ksoftirqd+0x1e/0x30
[ 681.774989] [<ffffffffa90896b0>] smpboot_thread_fn+0x150/0x260
[ 681.774991] [<ffffffffa9089560>] ? sort_range+0x40/0x40
[ 681.774992] [<ffffffffa9085fe4>] kthread+0xe4/0x100
[ 681.774994] [<ffffffffa9085f00>] ? kthread_worker_fn+0x170/0x170
[ 681.774995] [<ffffffffa95d8d1e>] ret_from_fork+0x3e/0x70

mq/mqprio have their own ways to report qlen/drops by folding stats on
all their queues, with appropriate locking.

A second problem is that qdisc_tree_decrease_qlen() calls qdisc_lookup()
without proper locking : concurrent qdisc updates could corrupt the list
that qdisc_match_from_root() parses to find a qdisc given its handle.

Fix first problem adding a TCQ_F_NOPARENT qdisc flag that
qdisc_tree_decrease_qlen() can use to abort its tree traversal,
as soon as it meets a mq/mqprio qdisc children.

Second problem can be fixed by RCU protection.
Qdisc are already freed after RCU grace period, so qdisc_list_add() and
qdisc_list_del() simply have to use appropriate rcu list variants.

A future patch will add a per struct netdev_queue list anchor, so that
qdisc_tree_decrease_qlen() can have more efficient lookups.

Reported-by: Daniele Fucini <dfucini@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <cwang@twopensource.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b0ab6f92 28-Sep-2014 John Fastabend <john.fastabend@gmail.com>

net: sched: enable per cpu qstats

After previous patches to simplify qstats the qstats can be
made per cpu with a packed union in Qdisc struct.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 64015853 28-Sep-2014 John Fastabend <john.fastabend@gmail.com>

net: sched: restrict use of qstats qlen

This removes the use of qstats->qlen variable from the classifiers
and makes it an explicit argument to gnet_stats_copy_queue().

The qlen represents the qdisc queue length and is packed into
the qstats at the last moment before passnig to user space. By
handling it explicitely we avoid, in the percpu stats case, having
to figure out which per_cpu variable to put it in.

It would probably be best to remove it from qstats completely
but qstats is a user space ABI and can't be broken. A future
patch could make an internal only qstats structure that would
avoid having to allocate an additional u32 variable on the
Qdisc struct. This would make the qstats struct 128bits instead
of 128+32.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 22e0f8b9 28-Sep-2014 John Fastabend <john.fastabend@gmail.com>

net: sched: make bstats per cpu and estimator RCU safe

In order to run qdisc's without locking statistics and estimators
need to be handled correctly.

To resolve bstats make the statistics per cpu. And because this is
only needed for qdiscs that are running without locks which is not
the case for most qdiscs in the near future only create percpu
stats when qdiscs set the TCQ_F_CPUSTATS flag.

Next because estimators use the bstats to calculate packets per
second and bytes per second the estimator code paths are updated
to use the per cpu statistics.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 46e5da40a 12-Sep-2014 John Fastabend <john.fastabend@gmail.com>

net: qdisc: use rcu prefix and silence sparse warnings

Add __rcu notation to qdisc handling by doing this we can make
smatch output more legible. And anyways some of the cases should
be using rcu_dereference() see qdisc_all_tx_empty(),
qdisc_tx_chainging(), and so on.

Also *wake_queue() API is commonly called from driver timer routines
without rcu lock or rtnl lock. So I added rcu_read_lock() blocks
around netif_wake_subqueue and netif_tx_wake_queue.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b26b0d1e 12-Sep-2014 John Fastabend <john.fastabend@gmail.com>

net: qdisc: use rcu prefix and silence sparse warnings

Add __rcu notation to qdisc handling by doing this we can make
smatch output more legible. And anyways some of the cases should
be using rcu_dereference() see qdisc_all_tx_empty(),
qdisc_tx_chainging(), and so on.

Also *wake_queue() API is commonly called from driver timer routines
without rcu lock or rtnl lock. So I added rcu_read_lock() blocks
around netif_wake_subqueue and netif_tx_wake_queue.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 95dc1929 05-Dec-2013 Eric Dumazet <edumazet@google.com>

pkt_sched: give visibility to mq slave qdiscs

Commit 6da7c8fcbcbd ("qdisc: allow setting default queuing discipline")
added the ability to change default qdisc from pfifo_fast to say fq

But as most modern ethernet devices are multiqueue, we cant really
see all the statistics from "tc -s qdisc show", as the default root
qdisc is mq.

This patch adds the calls to qdisc_list_add() to mq and mqprio

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6da7c8fc 27-Aug-2013 stephen hemminger <stephen@networkplumber.org>

qdisc: allow setting default queuing discipline

By default, the pfifo_fast queue discipline has been used by default
for all devices. But we have better choices now.

This patch allow setting the default queueing discipline with sysctl.
This allows easy use of better queueing disciplines on all devices
without having to use tc qdisc scripts. It is intended to allow
an easy path for distributions to make fq_codel or sfq the default
qdisc.

This patch also makes pfifo_fast more of a first class qdisc, since
it is now possible to manually override the default and explicitly
use pfifo_fast. The behavior for systems who do not use the sysctl
is unchanged, they still get pfifo_fast

Also removes leftover random # in sysctl net core.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1abbe139 11-Dec-2012 Eric Dumazet <edumazet@google.com>

pkt_sched: avoid requeues if possible

With BQL being deployed, we can more likely have following behavior :

We dequeue a packet from qdisc in dequeue_skb(), then we realize target
tx queue is in XOFF state in sch_direct_xmit(), and we have to hold the
skb into gso_skb for later.

This shows in stats (tc -s qdisc dev eth0) as requeues.

Problem of these requeues is that high priority packets can not be
dequeued as long as this (possibly low prio and big TSO packet) is not
removed from gso_skb.

At 1Gbps speed, a full size TSO packet is 500 us of extra latency.

In some cases, we know that all packets dequeued from a qdisc are
for a particular and known txq :

- If device is non multi queue
- For all MQ/MQPRIO slave qdiscs

This patch introduces a new qdisc flag, TCQ_F_ONETXQUEUE to mark
this capability, so that dequeue_skb() is allowed to dequeue a packet
only if the associated txq is not stopped.

This indeed reduce latencies for high prio packets (or improve fairness
with sfq/fq_codel), and almost remove qdisc 'requeues'.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1b34ec43 29-Mar-2012 David S. Miller <davem@davemloft.net>

pkt_sched: Stop using NLA_PUT*().

These macros contain a hidden goto, and are thus extremely error
prone and make code hard to audit.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 7838f2ce 21-Dec-2011 Thomas Graf <tgraf@redhat.com>

mqprio: Avoid panic if no options are provided

Userspace may not provide TCA_OPTIONS, in fact tc currently does
so not do so if no arguments are specified on the command line.
Return EINVAL instead of panicing.

Signed-off-by: Thomas Graf <tgraf@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3a9a231d 27-May-2011 Paul Gortmaker <paul.gortmaker@windriver.com>

net: Fix files explicitly needing to include module.h

With calls to modular infrastructure, these files really
needs the full module.h header. Call it out so some of the
cleanups of implicit and unrequired includes elsewhere can be
cleaned up.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>


# ea18fd95 23-Feb-2011 stephen hemminger <shemminger@vyatta.com>

mqprio: cleanups

* make qdisc_ops local
* add sparse annotation about expected unlock/unlock in dump_class_stats
* fix indentation

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ac7100ba 14-Feb-2011 Ben Hutchings <bhutchings@solarflare.com>

sch_mqprio: Always set num_tc to 0 in mqprio_destroy()

All the cleanup code in mqprio_destroy() is currently conditional on
priv->qdiscs being non-null, but that condition should only apply to
the per-queue qdisc cleanup. We should always set the number of
traffic classes back to 0 here.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>


# 144ce879 26-Jan-2011 Eric Dumazet <eric.dumazet@gmail.com>

net_sched: sch_mqprio: dont leak kernel memory

mqprio_dump() should make sure all fields of struct tc_mqprio_qopt are
initialized.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 23624935 21-Jan-2011 Eric Dumazet <eric.dumazet@gmail.com>

net_sched: TCQ_F_CAN_BYPASS generalization

Now qdisc stab is handled before TCQ_F_CAN_BYPASS test in
__dev_xmit_skb(), we can generalize TCQ_F_CAN_BYPASS to other qdiscs
than pfifo_fast : pfifo, bfifo, pfifo_head_drop and sfq

SFQ is special because it can have external classifiers, and in these
cases, we cannot bypass queue discipline (packet could be dropped by
classifier) without admin asking it, or further changes.

Its worth doing this, especially for SFQ, avoiding dirtying memory in
case no packets are already waiting in queue.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b8970f0b 17-Jan-2011 John Fastabend <john.r.fastabend@intel.com>

net_sched: implement a root container qdisc sch_mqprio

This implements a mqprio queueing discipline that by default creates
a pfifo_fast qdisc per tx queue and provides the needed configuration
interface.

Using the mqprio qdisc the number of tcs currently in use along
with the range of queues alloted to each class can be configured. By
default skbs are mapped to traffic classes using the skb priority.
This mapping is configurable.

Configurable parameters,

struct tc_mqprio_qopt {
__u8 num_tc;
__u8 prio_tc_map[TC_BITMASK + 1];
__u8 hw;
__u16 count[TC_MAX_QUEUE];
__u16 offset[TC_MAX_QUEUE];
};

Here the count/offset pairing give the queue alignment and the
prio_tc_map gives the mapping from skb->priority to tc.

The hw bit determines if the hardware should configure the count
and offset values. If the hardware bit is set then the operation
will fail if the hardware does not implement the ndo_setup_tc
operation. This is to avoid undetermined states where the hardware
may or may not control the queue mapping. Also minimal bounds
checking is done on the count/offset to verify a queue does not
exceed num_tx_queues and that queue ranges do not overlap. Otherwise
it is left to user policy or hardware configuration to create
useful mappings.

It is expected that hardware QOS schemes can be implemented by
creating appropriate mappings of queues in ndo_tc_setup().

One expected use case is drivers will use the ndo_setup_tc to map
queue ranges onto 802.1Q traffic classes. This provides a generic
mechanism to map network traffic onto these traffic classes and
removes the need for lower layer drivers to know specifics about
traffic types.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>