History log of /linux-master/net/sched/sch_taprio.c
Revision Date Author Comments
# 343041b5 11-Mar-2024 Eric Dumazet <edumazet@google.com>

net/sched: taprio: proper TCA_TAPRIO_TC_ENTRY_INDEX check

taprio_parse_tc_entry() is not correctly checking
TCA_TAPRIO_TC_ENTRY_INDEX attribute:

int tc; // Signed value

tc = nla_get_u32(tb[TCA_TAPRIO_TC_ENTRY_INDEX]);
if (tc >= TC_QOPT_MAX_QUEUE) {
NL_SET_ERR_MSG_MOD(extack, "TC entry index out of range");
return -ERANGE;
}

syzbot reported that it could fed arbitary negative values:

UBSAN: shift-out-of-bounds in net/sched/sch_taprio.c:1722:18
shift exponent -2147418108 is negative
CPU: 0 PID: 5066 Comm: syz-executor367 Not tainted 6.8.0-rc7-syzkaller-00136-gc8a5c731fd12 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106
ubsan_epilogue lib/ubsan.c:217 [inline]
__ubsan_handle_shift_out_of_bounds+0x3c7/0x420 lib/ubsan.c:386
taprio_parse_tc_entry net/sched/sch_taprio.c:1722 [inline]
taprio_parse_tc_entries net/sched/sch_taprio.c:1768 [inline]
taprio_change+0xb87/0x57d0 net/sched/sch_taprio.c:1877
taprio_init+0x9da/0xc80 net/sched/sch_taprio.c:2134
qdisc_create+0x9d4/0x1190 net/sched/sch_api.c:1355
tc_modify_qdisc+0xa26/0x1e40 net/sched/sch_api.c:1776
rtnetlink_rcv_msg+0x885/0x1040 net/core/rtnetlink.c:6617
netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2543
netlink_unicast_kernel net/netlink/af_netlink.c:1341 [inline]
netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1367
netlink_sendmsg+0xa3b/0xd70 net/netlink/af_netlink.c:1908
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:745
____sys_sendmsg+0x525/0x7d0 net/socket.c:2584
___sys_sendmsg net/socket.c:2638 [inline]
__sys_sendmsg+0x2b0/0x3a0 net/socket.c:2667
do_syscall_64+0xf9/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77
RIP: 0033:0x7f1b2dea3759
Code: 48 83 c4 28 c3 e8 d7 19 00 00 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffd4de452f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f1b2def0390 RCX: 00007f1b2dea3759
RDX: 0000000000000000 RSI: 00000000200007c0 RDI: 0000000000000004
RBP: 0000000000000003 R08: 0000555500000000 R09: 0000555500000000
R10: 0000555500000000 R11: 0000000000000246 R12: 00007ffd4de45340
R13: 00007ffd4de45310 R14: 0000000000000001 R15: 00007ffd4de45340

Fixes: a54fc09e4cba ("net/sched: taprio: allow user input of per-tc max SDU")
Reported-and-tested-by: syzbot+a340daa06412d6028918@syzkaller.appspotmail.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 241a94ab 01-Feb-2024 Michal Koutný <mkoutny@suse.com>

net/sched: Add module aliases for cls_,sch_,act_ modules

No functional change intended, aliases will be used in followup commits.
Note for backporters: you may need to add aliases also for modules that
are already removed in mainline kernel but still in your version.

Patches were generated with the help of Coccinelle scripts like:

cat >scripts/coccinelle/misc/tcf_alias.cocci <<EOD
virtual patch
virtual report

@ haskernel @
@@

@ tcf_has_kind depends on report && haskernel @
identifier ops;
constant K;
@@

static struct tcf_proto_ops ops = {
.kind = K,
...
};
+char module_alias = K;
EOD

/usr/bin/spatch -D report --cocci-file scripts/coccinelle/misc/tcf_alias.cocci \
--dir . \
-I ./arch/x86/include -I ./arch/x86/include/generated -I ./include \
-I ./arch/x86/include/uapi -I ./arch/x86/include/generated/uapi \
-I ./include/uapi -I ./include/generated/uapi \
--include ./include/linux/compiler-version.h --include ./include/linux/kconfig.h \
--jobs 8 --chunksize 1 2>/dev/null | \
sed 's/char module_alias = "\([^"]*\)";/MODULE_ALIAS_NET_CLS("\1");/'

And analogously for:

static struct tc_action_ops ops = {
.kind = K,

static struct Qdisc_ops ops = {
.id = K,

(Someone familiar would be able to fit those into one .cocci file
without sed post processing.)

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240201130943.19536-3-mkoutny@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 0efc7e54 25-Jan-2024 Alessandro Marcolini <alessandromarcolini99@gmail.com>

taprio: validate TCA_TAPRIO_ATTR_FLAGS through policy instead of open-coding

As of now, the field TCA_TAPRIO_ATTR_FLAGS is being validated by manually
checking its value, using the function taprio_flags_valid().

With this patch, the field will be validated through the netlink policy
NLA_POLICY_MASK, where the mask is defined by TAPRIO_SUPPORTED_FLAGS.
The mutual exclusivity of the two flags TCA_TAPRIO_ATTR_FLAG_FULL_OFFLOAD
and TCA_TAPRIO_ATTR_FLAG_TXTIME_ASSIST is still checked manually.

Changes since RFC:
- fixed reversed xmas tree
- use NL_SET_ERR_MSG_MOD() for both invalid configuration

Changes since v1:
- Changed NL_SET_ERR_MSG_MOD to NL_SET_ERR_MSG_ATTR when wrong flags
issued
- Changed __u32 to u32

Changes since v2:
- Added the missing parameter for NL_SET_ERR_MSG_ATTR (sorry again for
the noise)

Signed-off-by: Alessandro Marcolini <alessandromarcolini99@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f96118c5 27-Oct-2023 Victor Nogueira <victor@mojatatu.com>

net: sched: Fill in missing MODULE_DESCRIPTION for qdiscs

W=1 builds now warn if module is built without a MODULE_DESCRIPTION().

Fill in missing MODULE_DESCRIPTIONs for TC qdiscs.

Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Link: https://lore.kernel.org/r/20231027155045.46291-4-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# ea23fbd2 25-Oct-2023 Jakub Kicinski <kuba@kernel.org>

netlink: make range pointers in policies const

struct nla_policy is usually constant itself, but unless
we make the ranges inside constant we won't be able to
make range structs const. The ranges are not modified
by the core.

Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231025162204.132528-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 665338b2 07-Aug-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: dump class stats for the actual q->qdiscs[]

This makes a difference for the software scheduling mode, where
dev_queue->qdisc_sleeping is the same as the taprio root Qdisc itself,
but when we're talking about what Qdisc and stats get reported for a
traffic class, the root taprio isn't what comes to mind, but q->qdiscs[]
is.

To understand the difference, I've attempted to send 100 packets in
software mode through class 8001:5, and recorded the stats before and
after the change.

Here is before:

$ tc -s class show dev eth0
class taprio 8001:1 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:2 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:3 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:4 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:5 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:6 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:7 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:8 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0

and here is after:

class taprio 8001:1 root
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:2 root
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:3 root
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:4 root
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:5 root
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:6 root
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:7 root
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:8 root leaf 800d:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0

The most glaring (and expected) difference is that before, all class
stats reported the global stats, whereas now, they really report just
the counters for that traffic class.

Finally, Pedro Tammela points out that there is a tc selftest which
checks specifically which handle do the child Qdiscs corresponding to
each class have. That's changing here - taprio no longer reports
tcm->tcm_info as the same handle "1:" as itself (the root Qdisc), but 0
(the handle of the default pfifo child Qdiscs). Since iproute2 does not
print a child Qdisc handle of 0, adjust the test's expected output.

Link: https://lore.kernel.org/netdev/3b83fcf6-a5e8-26fb-8c8a-ec34ec4c3342@mojatatu.com/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230807193324.4128292-6-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 6e0ec800 07-Aug-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: delete misleading comment about preallocating child qdiscs

As mentioned in commit af7b29b1deaa ("Revert "net/sched: taprio: make
qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"") - unlike
mqprio, taprio doesn't use q->qdiscs[] only as a temporary transport
between Qdisc_ops :: init() and Qdisc_ops :: attach().

Delete the comment, which is just stolen from mqprio, but there, the
usage patterns are a lot different, and this is nothing but confusing.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230807193324.4128292-5-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 98766add 07-Aug-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: try again to report q->qdiscs[] to qdisc_leaf()

This is another stab at commit 1461d212ab27 ("net/sched: taprio: make
qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"), later
reverted in commit af7b29b1deaa ("Revert "net/sched: taprio: make
qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"").

I believe that the problems that caused the revert were fixed, and thus,
this change is identical to the original patch.

Its purpose is to properly reject attaching a software taprio child
qdisc to a software taprio parent. Because unoffloaded taprio currently
reports itself (the root Qdisc) as the return value from qdisc_leaf(),
then the process of attaching another taprio as child to a Qdisc class
of the root will just result in a Qdisc_ops :: change() call for the
root. Whereas that's not we want. We want Qdisc_ops :: init() to be
called for the taprio child, in order to give the taprio child a chance
to check whether its sch->parent is TC_H_ROOT or not (and reject this
configuration).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230807193324.4128292-4-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 25b0d4e4 07-Aug-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: keep child Qdisc refcount elevated at 2 in offload mode

Normally, Qdiscs have one reference on them held by their owner and one
held for each TXQ to which they are attached, however this is not the
case with the children of an offloaded taprio. Instead, the taprio qdisc
currently lives in the following fragile equilibrium.

In the software scheduling case, taprio attaches itself (the root Qdisc)
to all TXQs, thus having a refcount of 1 + the number of TX queues. In
this mode, the q->qdiscs[] children are not visible directly to the
Qdisc API. The lifetime of the Qdiscs from this private array lasts
until qdisc_destroy() -> taprio_destroy().

In the fully offloaded case, the root taprio has a refcount of 1, and
all child q->qdiscs[] also have a refcount of 1. The child q->qdiscs[]
are attached to the netdev TXQs directly and thus are visible to the
Qdisc API, however taprio loses a reference to them very early - during
qdisc_graft(parent==NULL) -> taprio_attach(). At that time, taprio frees
the q->qdiscs[] array to not leak memory, but interestingly, it does not
release a reference on these qdiscs because it doesn't effectively own
them - they are created by taprio but owned by the Qdisc core, and will
be freed by qdisc_graft(parent==NULL, new==NULL) -> qdisc_put(old) when
the Qdisc is deleted or when the child Qdisc is replaced with something
else.

My interest is to change this equilibrium such that taprio also owns a
reference on the q->qdiscs[] child Qdiscs for the lifetime of the root
Qdisc, including in full offload mode. I want this because I would like
taprio_leaf(), taprio_dump_class(), taprio_dump_class_stats() to have
insight into q->qdiscs[] for the software scheduling mode - currently
they look at dev_queue->qdisc_sleeping, which is, as mentioned, the same
as the root taprio.

The following set of changes is necessary:
- don't free q->qdiscs[] early in taprio_attach(), free it late in
taprio_destroy() for consistency with software mode. But:
- currently that's not possible, because taprio doesn't own a reference
on q->qdiscs[]. So hold that reference - once during the initial
attach() and once during subsequent graft() calls when the child is
changed.
- always keep track of the current child in q->qdiscs[], even for full
offload mode, so that we free in taprio_destroy() what we should, and
not something stale.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230807193324.4128292-3-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 09e0c3bb 07-Aug-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: don't access q->qdiscs[] in unoffloaded mode during attach()

This is a simple code transformation with no intended behavior change,
just to make it absolutely clear that q->qdiscs[] is only attached to
the child taprio classes in full offload mode.

Right now we use the q->qdiscs[] variable in taprio_attach() for
software mode too, but that is quite confusing and avoidable. We use
it only to reach the netdev TX queue, but we could as well just use
netdev_get_tx_queue() for that.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230807193324.4128292-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# e7397184 28-Jul-2023 Kuniyuki Iwashima <kuniyu@amazon.com>

net/sched: taprio: Limit TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME to INT_MAX.

syzkaller found zero division error [0] in div_s64_rem() called from
get_cycle_time_elapsed(), where sched->cycle_time is the divisor.

We have tests in parse_taprio_schedule() so that cycle_time will never
be 0, and actually cycle_time is not 0 in get_cycle_time_elapsed().

The problem is that the types of divisor are different; cycle_time is
s64, but the argument of div_s64_rem() is s32.

syzkaller fed this input and 0x100000000 is cast to s32 to be 0.

@TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME={0xc, 0x8, 0x100000000}

We use s64 for cycle_time to cast it to ktime_t, so let's keep it and
set max for cycle_time.

While at it, we prevent overflow in setup_txtime() and add another
test in parse_taprio_schedule() to check if cycle_time overflows.

Also, we add a new tdc test case for this issue.

[0]:
divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 1 PID: 103 Comm: kworker/1:3 Not tainted 6.5.0-rc1-00330-g60cc1f7d0605 #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: ipv6_addrconf addrconf_dad_work
RIP: 0010:div_s64_rem include/linux/math64.h:42 [inline]
RIP: 0010:get_cycle_time_elapsed net/sched/sch_taprio.c:223 [inline]
RIP: 0010:find_entry_to_transmit+0x252/0x7e0 net/sched/sch_taprio.c:344
Code: 3c 02 00 0f 85 5e 05 00 00 48 8b 4c 24 08 4d 8b bd 40 01 00 00 48 8b 7c 24 48 48 89 c8 4c 29 f8 48 63 f7 48 99 48 89 74 24 70 <48> f7 fe 48 29 d1 48 8d 04 0f 49 89 cc 48 89 44 24 20 49 8d 85 10
RSP: 0018:ffffc90000acf260 EFLAGS: 00010206
RAX: 177450e0347560cf RBX: 0000000000000000 RCX: 177450e0347560cf
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000100000000
RBP: 0000000000000056 R08: 0000000000000000 R09: ffffed10020a0934
R10: ffff8880105049a7 R11: ffff88806cf3a520 R12: ffff888010504800
R13: ffff88800c00d800 R14: ffff8880105049a0 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88806cf00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0edf84f0e8 CR3: 000000000d73c002 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
<TASK>
get_packet_txtime net/sched/sch_taprio.c:508 [inline]
taprio_enqueue_one+0x900/0xff0 net/sched/sch_taprio.c:577
taprio_enqueue+0x378/0xae0 net/sched/sch_taprio.c:658
dev_qdisc_enqueue+0x46/0x170 net/core/dev.c:3732
__dev_xmit_skb net/core/dev.c:3821 [inline]
__dev_queue_xmit+0x1b2f/0x3000 net/core/dev.c:4169
dev_queue_xmit include/linux/netdevice.h:3088 [inline]
neigh_resolve_output net/core/neighbour.c:1552 [inline]
neigh_resolve_output+0x4a7/0x780 net/core/neighbour.c:1532
neigh_output include/net/neighbour.h:544 [inline]
ip6_finish_output2+0x924/0x17d0 net/ipv6/ip6_output.c:135
__ip6_finish_output+0x620/0xaa0 net/ipv6/ip6_output.c:196
ip6_finish_output net/ipv6/ip6_output.c:207 [inline]
NF_HOOK_COND include/linux/netfilter.h:292 [inline]
ip6_output+0x206/0x410 net/ipv6/ip6_output.c:228
dst_output include/net/dst.h:458 [inline]
NF_HOOK.constprop.0+0xea/0x260 include/linux/netfilter.h:303
ndisc_send_skb+0x872/0xe80 net/ipv6/ndisc.c:508
ndisc_send_ns+0xb5/0x130 net/ipv6/ndisc.c:666
addrconf_dad_work+0xc14/0x13f0 net/ipv6/addrconf.c:4175
process_one_work+0x92c/0x13a0 kernel/workqueue.c:2597
worker_thread+0x60f/0x1240 kernel/workqueue.c:2748
kthread+0x2fe/0x3f0 kernel/kthread.c:389
ret_from_fork+0x2c/0x50 arch/x86/entry/entry_64.S:308
</TASK>
Modules linked in:

Fixes: 4cfd5779bd6e ("taprio: Add support for txtime-assist mode")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Co-developed-by: Eric Dumazet <edumazet@google.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2b84960f 09-Jun-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: report class offload stats per TXQ, not per TC

The taprio Qdisc creates child classes per netdev TX queue, but
taprio_dump_class_stats() currently reports offload statistics per
traffic class. Traffic classes are groups of TXQs sharing the same
dequeue priority, so this is incorrect and we shouldn't be bundling up
the TXQ stats when reporting them, as we currently do in enetc.

Modify the API from taprio to drivers such that they report TXQ offload
stats and not TC offload stats.

There is no change in the UAPI or in the global Qdisc stats.

Fixes: 6c1adb650c8d ("net/sched: taprio: add netlink reporting for offload statistics counters")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d457a0e3 08-Jun-2023 Eric Dumazet <edumazet@google.com>

net: move gso declarations and functions to their own files

Move declarations into include/net/gso.h and code into net/core/gso.c

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# be3618d9 08-Jun-2023 Zhengchao Shao <shaozhengchao@huawei.com>

net/sched: taprio: fix slab-out-of-bounds Read in taprio_dequeue_from_txq

As shown in [1], out-of-bounds access occurs in two cases:
1)when the qdisc of the taprio type is used to replace the previously
configured taprio, count and offset in tc_to_txq can be set to 0. In this
case, the value of *txq in taprio_next_tc_txq() will increases
continuously. When the number of accessed queues exceeds the number of
queues on the device, out-of-bounds access occurs.
2)When packets are dequeued, taprio can be deleted. In this case, the tc
rule of dev is cleared. The count and offset values are also set to 0. In
this case, out-of-bounds access is also caused.

Now the restriction on the queue number is added.

[1] https://groups.google.com/g/syzkaller-bugs/c/_lYOKgkBVMg
Fixes: 2f530df76c8c ("net/sched: taprio: give higher priority to higher TCs in software dequeue mode")
Reported-by: syzbot+04afcb3d2c840447559a@syzkaller.appspotmail.com
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Tested-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d636fc5d 06-Jun-2023 Eric Dumazet <edumazet@google.com>

net: sched: add rcu annotations around qdisc->qdisc_sleeping

syzbot reported a race around qdisc->qdisc_sleeping [1]

It is time we add proper annotations to reads and writes to/from
qdisc->qdisc_sleeping.

[1]
BUG: KCSAN: data-race in dev_graft_qdisc / qdisc_lookup_rcu

read to 0xffff8881286fc618 of 8 bytes by task 6928 on cpu 1:
qdisc_lookup_rcu+0x192/0x2c0 net/sched/sch_api.c:331
__tcf_qdisc_find+0x74/0x3c0 net/sched/cls_api.c:1174
tc_get_tfilter+0x18f/0x990 net/sched/cls_api.c:2547
rtnetlink_rcv_msg+0x7af/0x8c0 net/core/rtnetlink.c:6386
netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546
rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413
netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365
netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
____sys_sendmsg+0x375/0x4c0 net/socket.c:2503
___sys_sendmsg net/socket.c:2557 [inline]
__sys_sendmsg+0x1e3/0x270 net/socket.c:2586
__do_sys_sendmsg net/socket.c:2595 [inline]
__se_sys_sendmsg net/socket.c:2593 [inline]
__x64_sys_sendmsg+0x46/0x50 net/socket.c:2593
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

write to 0xffff8881286fc618 of 8 bytes by task 6912 on cpu 0:
dev_graft_qdisc+0x4f/0x80 net/sched/sch_generic.c:1115
qdisc_graft+0x7d0/0xb60 net/sched/sch_api.c:1103
tc_modify_qdisc+0x712/0xf10 net/sched/sch_api.c:1693
rtnetlink_rcv_msg+0x807/0x8c0 net/core/rtnetlink.c:6395
netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546
rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413
netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365
netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
____sys_sendmsg+0x375/0x4c0 net/socket.c:2503
___sys_sendmsg net/socket.c:2557 [inline]
__sys_sendmsg+0x1e3/0x270 net/socket.c:2586
__do_sys_sendmsg net/socket.c:2595 [inline]
__se_sys_sendmsg net/socket.c:2593 [inline]
__x64_sys_sendmsg+0x46/0x50 net/socket.c:2593
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 6912 Comm: syz-executor.5 Not tainted 6.4.0-rc3-syzkaller-00190-g0d85b27b0cc6 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/16/2023

Fixes: 3a7d0d07a386 ("net: sched: extend Qdisc with rcu")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Buslov <vladbu@nvidia.com>
Acked-by: Jamal Hadi Salim<jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6c1adb65 29-May-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: add netlink reporting for offload statistics counters

Offloading drivers may report some additional statistics counters, some
of them even suggested by 802.1Q, like TransmissionOverrun.

In my opinion we don't have to limit ourselves to reporting counters
only globally to the Qdisc/interface, especially if the device has more
detailed reporting (per traffic class), since the more detailed info is
valuable for debugging and can help identifying who is exceeding its
time slot.

But on the other hand, some devices may not be able to report both per
TC and global stats.

So we end up reporting both ways, and use the good old ethtool_put_stat()
strategy to determine which statistics are supported by this NIC.
Statistics which aren't set are simply not reported to netlink. For this
reason, we need something dynamic (a nlattr nest) to be reported through
TCA_STATS_APP, and not something daft like the fixed-size and
inextensible struct tc_codel_xstats. A good model for xstats which are a
nlattr nest rather than a fixed struct seems to be cake.

# Global stats
$ tc -s qdisc show dev eth0 root
# Per-tc stats
$ tc -s class show dev eth0

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2d800bc5 29-May-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: replace tc_taprio_qopt_offload :: enable with a "cmd" enum

Inspired from struct flow_cls_offload :: cmd, in order for taprio to be
able to report statistics (which is future work), it seems that we need
to drill one step further with the ndo_setup_tc(TC_SETUP_QDISC_TAPRIO)
multiplexing, and pass the command as part of the common portion of the
muxed structure.

Since we already have an "enable" variable in tc_taprio_qopt_offload,
refactor all drivers to check for "cmd" instead of "enable", and reject
every other command except "replace" and "destroy" - to be future proof.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> # for lan966x
Acked-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
Reviewed-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# dced11ef 29-May-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: don't overwrite "sch" variable in taprio_dump_class_stats()

In taprio_dump_class_stats() we don't need a reference to the root Qdisc
once we get the reference to the child corresponding to this traffic
class, so it's okay to overwrite "sch". But in a future patch we will
need the root Qdisc too, so create a dedicated "child" pointer variable
to hold the child reference. This also makes the code adhere to a more
conventional coding style.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a721c3e5 11-Apr-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: allow per-TC user input of FP adminStatus

This is a duplication of the FP adminStatus logic introduced for
tc-mqprio. Offloading is done through the tc_mqprio_qopt_offload
structure embedded within tc_taprio_qopt_offload. So practically, if a
device driver is written to treat the mqprio portion of taprio just like
standalone mqprio, it gets unified handling of frame preemption.

I would have reused more code with taprio, but this is mostly netlink
attribute parsing, which is hard to transform into generic code without
having something that stinks as a result. We have the same variables
with the same semantics, just different nlattr type values
(TCA_MQPRIO_TC_ENTRY=5 vs TCA_TAPRIO_ATTR_TC_ENTRY=12;
TCA_MQPRIO_TC_ENTRY_FP=2 vs TCA_TAPRIO_TC_ENTRY_FP=3, etc) and
consequently, different policies for the nest.

Every time nla_parse_nested() is called, an on-stack table "tb" of
nlattr pointers is allocated statically, up to the maximum understood
nlattr type. That array size is hardcoded as a constant, but when
transforming this into a common parsing function, it would become either
a VLA (which the Linux kernel rightfully doesn't like) or a call to the
allocator.

Having FP adminStatus in tc-taprio can be seen as addressing the 802.1Q
Annex S.3 "Scheduling and preemption used in combination, no HOLD/RELEASE"
and S.4 "Scheduling and preemption used in combination with HOLD/RELEASE"
use cases. HOLD and RELEASE events are emitted towards the underlying
MAC Merge layer when the schedule hits a Set-And-Hold-MAC or a
Set-And-Release-MAC gate operation. So within the tc-taprio UAPI space,
one can distinguish between the 2 use cases by choosing whether to use
the TC_TAPRIO_CMD_SET_AND_HOLD and TC_TAPRIO_CMD_SET_AND_RELEASE gate
operations within the schedule, or just TC_TAPRIO_CMD_SET_GATES.

A small part of the change is dedicated to refactoring the max_sdu
nlattr parsing to put all logic under the "if" that tests for presence
of that nlattr.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# c54876cd 11-Apr-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: pass netlink extack to mqprio and taprio offload

With the multiplexed ndo_setup_tc() model which lacks a first-class
struct netlink_ext_ack * argument, the only way to pass the netlink
extended ACK message down to the device driver is to embed it within the
offload structure.

Do this for struct tc_mqprio_qopt_offload and struct tc_taprio_qopt_offload.

Since struct tc_taprio_qopt_offload also contains a tc_mqprio_qopt_offload
structure, and since device drivers might effectively reuse their mqprio
implementation for the mqprio portion of taprio, we make taprio set the
extack in both offload structures to point at the same netlink extack
message.

In fact, the taprio handling is a bit more tricky, for 2 reasons.

First is because the offload structure has a longer lifetime than the
extack structure. The driver is supposed to populate the extack
synchronously from ndo_setup_tc() and leave it alone afterwards.
To not have any use-after-free surprises, we zero out the extack pointer
when we leave taprio_enable_offload().

The second reason is because taprio does overwrite the extack message on
ndo_setup_tc() error. We need to switch to the weak form of setting an
extack message, which preserves a potential message set by the driver.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 64cb6aad 15-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: dynamic max_sdu larger than the max_mtu is unlimited

It makes no sense to keep randomly large max_sdu values, especially if
larger than the device's max_mtu. These are visible in "tc qdisc show".
Such a max_sdu is practically unlimited and will cause no packets for
that traffic class to be dropped on enqueue.

Just set max_sdu_dynamic to U32_MAX, which in the logic below causes
taprio to save a max_frm_len of U32_MAX and a max_sdu presented to user
space of 0 (unlimited).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# bdf366bd 15-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: don't allow dynamic max_sdu to go negative after stab adjustment

The overhead specified in the size table comes from the user. With small
time intervals (or gates always closed), the overhead can be larger than
the max interval for that traffic class, and their difference is
negative.

What we want to happen is for max_sdu_dynamic to have the smallest
non-zero value possible (1) which means that all packets on that traffic
class are dropped on enqueue. However, since max_sdu_dynamic is u32, a
negative is represented as a large value and oversized dropping never
happens.

Use max_t with int to force a truncation of max_frm_len to no smaller
than dev->hard_header_len + 1, which in turn makes max_sdu_dynamic no
smaller than 1.

Fixes: fed87cc6718a ("net/sched: taprio: automatically calculate queueMaxSDU based on TC gate durations")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# 09dbdf28 15-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: fix calculation of maximum gate durations

taprio_calculate_gate_durations() depends on netdev_get_num_tc() and
this returns 0. So it calculates the maximum gate durations for no
traffic class.

I had tested the blamed commit only with another patch in my tree, one
which in the end I decided isn't valuable enough to submit ("net/sched:
taprio: mask off bits in gate mask that exceed number of TCs").

The problem is that having this patch threw off my testing. By moving
the netdev_set_num_tc() call earlier, we implicitly gave to
taprio_calculate_gate_durations() the information it needed.

Extract only the portion from the unsubmitted change which applies the
mqprio configuration to the netdev earlier.

Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230130173145.475943-15-vladimir.oltean@nxp.com/
Fixes: a306a90c8ffe ("net/sched: taprio: calculate tc gate durations")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# 39b02d6d 07-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: don't segment unnecessarily

Improve commit 497cc00224cf ("taprio: Handle short intervals and large
packets") to only perform segmentation when skb->len exceeds what
taprio_dequeue() expects.

In practice, this will make the biggest difference when a traffic class
gate is always open in the schedule. This is because the max_frm_len
will be U32_MAX, and such large skb->len values as Kurt reported will be
sent just fine unsegmented.

What I don't seem to know how to handle is how to make sure that the
segmented skbs themselves are smaller than the maximum frame size given
by the current queueMaxSDU[tc]. Nonetheless, we still need to drop
those, otherwise the Qdisc will hang.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2d5e8071 07-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: split segmentation logic from qdisc_enqueue()

The majority of the taprio_enqueue()'s function is spent doing TCP
segmentation, which doesn't look right to me. Compilers shouldn't have a
problem in inlining code no matter how we write it, so move the
segmentation logic to a separate function.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fed87cc6 07-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: automatically calculate queueMaxSDU based on TC gate durations

taprio today has a huge problem with small TC gate durations, because it
might accept packets in taprio_enqueue() which will never be sent by
taprio_dequeue().

Since not much infrastructure was available, a kludge was added in
commit 497cc00224cf ("taprio: Handle short intervals and large
packets"), which segmented large TCP segments, but the fact of the
matter is that the issue isn't specific to large TCP segments (and even
worse, the performance penalty in segmenting those is absolutely huge).

In commit a54fc09e4cba ("net/sched: taprio: allow user input of per-tc
max SDU"), taprio gained support for queueMaxSDU, which is precisely the
mechanism through which packets should be dropped at qdisc_enqueue() if
they cannot be sent.

After that patch, it was necessary for the user to manually limit the
maximum MTU per TC. This change adds the necessary logic for taprio to
further limit the values specified (or not specified) by the user to
some minimum values which never allow oversized packets to be sent.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a878fd46 07-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: keep the max_frm_len information inside struct sched_gate_list

I have one practical reason for doing this and one concerning correctness.

The practical reason has to do with a follow-up patch, which aims to mix
2 sources of max_sdu (one coming from the user and the other automatically
calculated based on TC gate durations @current link speed). Among those
2 sources of input, we must always select the smaller max_sdu value, but
this can change at various link speeds. So the max_sdu coming from the
user must be kept separated from the value that is operationally used
(the minimum of the 2), because otherwise we overwrite it and forget
what the user asked us to do.

To solve that, this patch proposes that struct sched_gate_list contains
the operationally active max_frm_len, and q->max_sdu contains just what
was requested by the user.

The reason having to do with correctness is based on the following
observation: the admin sched_gate_list becomes operational at a given
base_time in the future. Until then, it is inactive and applies no
shaping, all gates are open, etc. So the queueMaxSDU dropping shouldn't
apply either (this is a mechanism to ensure that packets smaller than
the largest gate duration for that TC don't hang the port; clearly it
makes little sense if the gates are always open).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a3d91b2c 07-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: warn about missing size table

Vinicius intended taprio to take the L1 overhead into account when
estimating packet transmission time through user input, specifically
through the qdisc size table (man tc-stab).

Something like this:

tc qdisc replace dev $eth root stab overhead 24 taprio \
num_tc 8 \
map 0 1 2 3 4 5 6 7 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
base-time 0 \
sched-entry S 0x7e 9000000 \
sched-entry S 0x82 1000000 \
max-sdu 0 0 0 0 0 0 0 200 \
flags 0x0 clockid CLOCK_TAI

Without the overhead being specified, transmission times will be
underestimated and will cause late transmissions. For an offloading
driver, it might even cause TX hangs if there is no open gate large
enough to send the maximum sized packets for that TC (including L1
overhead). Properly knowing the L1 overhead will ensure that we are able
to auto-calculate the queueMaxSDU per traffic class just right, and
avoid these hangs due to head-of-line blocking.

We can't make the stab mandatory due to existing setups, but we can warn
the user that it's important with a warning netlink extack.

Link: https://patchwork.kernel.org/project/netdevbpf/patch/20220505160357.298794-1-vladimir.oltean@nxp.com/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a1e6ad30 07-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: calculate guard band against actual TC gate close time

taprio_dequeue_from_txq() looks at the entry->end_time to determine
whether the skb will overrun its traffic class gate, as if at the end of
the schedule entry there surely is a "gate close" event for it. Hint:
maybe there isn't.

For each schedule entry, introduce an array of kernel times which
actually tracks when in the future will there be an *actual* gate close
event for that traffic class, and use that in the guard band overrun
calculation.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d2ad689d 07-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: calculate budgets per traffic class

Currently taprio assumes that the budget for a traffic class expires at
the end of the current interval as if the next interval contains a "gate
close" event for this traffic class.

This is, however, an unfounded assumption. Allow schedule entry
intervals to be fused together for a particular traffic class by
calculating the budget until the gate *actually* closes.

This means we need to keep budgets per traffic class, and we also need
to update the budget consumption procedure.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e5517551 07-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: rename close_time to end_time

There is a confusion in terms in taprio which makes what is called
"close_time" to be actually used for 2 things:

1. determining when an entry "closes" such that transmitted skbs are
never allowed to overrun that time (?!)
2. an aid for determining when to advance and/or restart the schedule
using the hrtimer

It makes more sense to call this so-called "close_time" "end_time",
because it's not clear at all to me what "closes". Future patches will
hopefully make better use of the term "to close".

This is an absolutely mechanical change.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a306a90c 07-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: calculate tc gate durations

Current taprio code operates on a very simplistic (and incorrect)
assumption: that egress scheduling for a traffic class can only take
place for the duration of the current interval, or i.o.w., it assumes
that at the end of each schedule entry, there is a "gate close" event
for all traffic classes.

As an example, traffic sent with the schedule below will be jumpy, even
though all 8 TC gates are open, so there is absolutely no "gate close"
event (effectively a transition from BIT(tc)==1 to BIT(tc)==0 in
consecutive schedule entries):

tc qdisc replace dev veth0 parent root taprio \
num_tc 2 \
map 0 1 \
queues 1@0 1@1 \
base-time 0 \
sched-entry S 0xff 4000000000 \
clockid CLOCK_TAI \
flags 0x0

This qdisc simply does not have what it takes in terms of logic to
*actually* compute the durations of traffic classes. Also, it does not
recognize the need to use this information on a per-traffic-class basis:
it always looks at entry->interval and entry->close_time.

This change proposes that each schedule entry has an array called
tc_gate_duration[tc]. This holds the information: "for how long will
this traffic class gate remain open, starting from *this* schedule
entry". If the traffic class gate is always open, that value is equal to
the cycle time of the schedule.

We'll also need to keep track, for the purpose of queueMaxSDU[tc]
calculation, what is the maximum time duration for a traffic class
having an open gate. This gives us directly what is the maximum sized
packet that this traffic class will have to accept. For everything else
it has to qdisc_drop() it in qdisc_enqueue().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2f530df7 07-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: give higher priority to higher TCs in software dequeue mode

Current taprio software implementation is haunted by the shadow of the
igb/igc hardware model. It iterates over child qdiscs in increasing
order of TXQ index, therefore giving higher xmit priority to TXQ 0 and
lower to TXQ N. According to discussions with Vinicius, that is the
default (perhaps even unchangeable) prioritization scheme used for the
NICs that taprio was first written for (igb, igc), and we have a case of
two bugs canceling out, resulting in a functional setup on igb/igc, but
a less sane one on other NICs.

To the best of my understanding, taprio should prioritize based on the
traffic class, so it should really dequeue starting with the highest
traffic class and going down from there. We get to the TXQ using the
tc_to_txq[] netdev property.

TXQs within the same TC have the same (strict) priority, so we should
pick from them as fairly as we can. We can achieve that by implementing
something very similar to q->curband from multiq_dequeue().

Since igb/igc really do have TXQ 0 of higher hardware priority than
TXQ 1 etc, we need to preserve the behavior for them as well. We really
have no choice, because in txtime-assist mode, taprio is essentially a
software scheduler towards offloaded child tc-etf qdiscs, so the TXQ
selection really does matter (not all igb TXQs support ETF/SO_TXTIME,
says Kurt Kanzenbach).

To preserve the behavior, we need a capability bit so that taprio can
determine if it's running on igb/igc, or on something else. Because igb
doesn't offload taprio at all, we can't piggyback on the
qdisc_offload_query_caps() call from taprio_enable_offload(), but
instead we need a separate call which is also made for software
scheduling.

Introduce two static keys to minimize the performance penalty on systems
which only have igb/igc NICs, and on systems which only have other NICs.
For mixed systems, taprio will have to dynamically check whether to
dequeue using one prioritization algorithm or using the other.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4c229427 07-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: avoid calling child->ops->dequeue(child) twice

Simplify taprio_dequeue_from_txq() by noticing that we can goto one call
earlier than the previous skb_found label. This is possible because
we've unified the treatment of the child->ops->dequeue(child) return
call, we always try other TXQs now, instead of abandoning the root
dequeue completely if we failed in the peek() case.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 92f96667 07-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: refactor one skb dequeue from TXQ to separate function

Future changes will refactor the TXQ selection procedure, and a lot of
stuff will become messy, the indentation of the bulk of the dequeue
procedure would increase, etc.

Break out the bulk of the function into a new one, which knows the TXQ
(child qdisc) we should perform a dequeue from.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1638bbbe 07-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: continue with other TXQs if one dequeue() failed

This changes the handling of an unlikely condition to not stop dequeuing
if taprio failed to dequeue the peeked skb in taprio_dequeue().

I've no idea when this can happen, but the only side effect seems to be
that the atomic_sub_return() call right above will have consumed some
budget. This isn't a big deal, since either that made us remain without
any budget (and therefore, we'd exit on the next peeked skb anyway), or
we could send some packets from other TXQs.

I'm making this change because in a future patch I'll be refactoring the
dequeue procedure to simplify it, and this corner case will have to go
away.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ecc0cc98 07-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: delete peek() implementation

There isn't any code in the network stack which calls taprio_peek().
We only see qdisc->ops->peek() being called on child qdiscs of other
classful qdiscs, never from the generic qdisc code. Whereas taprio is
never a child qdisc, it is always root.

This snippet of a comment from qdisc_peek_dequeued() seems to confirm:

/* we can reuse ->gso_skb because peek isn't called for root qdiscs */

Since I've been known to be wrong many times though, I'm not completely
removing it, but leaving a stub function in place which emits a warning.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 522d15ea 04-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: only pass gate mask per TXQ for igc, stmmac, tsnep, am65_cpsw

There are 2 classes of in-tree drivers currently:

- those who act upon struct tc_taprio_sched_entry :: gate_mask as if it
holds a bit mask of TXQs

- those who act upon the gate_mask as if it holds a bit mask of TCs

When it comes to the standard, IEEE 802.1Q-2018 does say this in the
second paragraph of section 8.6.8.4 Enhancements for scheduled traffic:

| A gate control list associated with each Port contains an ordered list
| of gate operations. Each gate operation changes the transmission gate
| state for the gate associated with each of the Port's traffic class
| queues and allows associated control operations to be scheduled.

In typically obtuse language, it refers to a "traffic class queue"
rather than a "traffic class" or a "queue". But careful reading of
802.1Q clarifies that "traffic class" and "queue" are in fact
synonymous (see 8.6.6 Queuing frames):

| A queue in this context is not necessarily a single FIFO data structure.
| A queue is a record of all frames of a given traffic class awaiting
| transmission on a given Bridge Port. The structure of this record is not
| specified.

i.o.w. their definition of "queue" isn't the Linux TX queue.

The gate_mask really is input into taprio via its UAPI as a mask of
traffic classes, but taprio_sched_to_offload() converts it into a TXQ
mask.

The breakdown of drivers which handle TC_SETUP_QDISC_TAPRIO is:

- hellcreek, felix, sja1105: these are DSA switches, it's not even very
clear what TXQs correspond to, other than purely software constructs.
Only the mqprio configuration with 8 TCs and 1 TXQ per TC makes sense.
So it's fine to convert these to a gate mask per TC.

- enetc: I have the hardware and can confirm that the gate mask is per
TC, and affects all TXQs (BD rings) configured for that priority.

- igc: in igc_save_qbv_schedule(), the gate_mask is clearly interpreted
to be per-TXQ.

- tsnep: Gerhard Engleder clarifies that even though this hardware
supports at most 1 TXQ per TC, the TXQ indices may be different from
the TC values themselves, and it is the TXQ indices that matter to
this hardware. So keep it per-TXQ as well.

- stmmac: I have a GMAC datasheet, and in the EST section it does
specify that the gate events are per TXQ rather than per TC.

- lan966x: again, this is a switch, and while not a DSA one, the way in
which it implements lan966x_mqprio_add() - by only allowing num_tc ==
NUM_PRIO_QUEUES (8) - makes it clear to me that TXQs are a purely
software construct here as well. They seem to map 1:1 with TCs.

- am65_cpsw: from looking at am65_cpsw_est_set_sched_cmds(), I get the
impression that the fetch_allow variable is treated like a prio_mask.
This definitely sounds closer to a per-TC gate mask rather than a
per-TXQ one, and TI documentation does seem to recomment an identity
mapping between TCs and TXQs. However, Roger Quadros would like to do
some testing before making changes, so I'm leaving this driver to
operate as it did before, for now. Link with more details at the end.

Based on this breakdown, we have 5 drivers with a gate mask per TC and
4 with a gate mask per TXQ. So let's make the gate mask per TXQ the
opt-in and the gate mask per TC the default.

Benefit from the TC_QUERY_CAPS feature that Jakub suggested we add, and
query the device driver before calling the proper ndo_setup_tc(), and
figure out if it expects one or the other format.

Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230202003621.2679603-15-vladimir.oltean@nxp.com/#25193204
Cc: Horatiu Vultur <horatiu.vultur@microchip.com>
Cc: Siddharth Vadapalli <s-vadapalli@ti.com>
Cc: Roger Quadros <rogerq@kernel.org>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 09c794c0 04-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: pass mqprio queue configuration to ndo_setup_tc()

The taprio qdisc does not currently pass the mqprio queue configuration
down to the offloading device driver. So the driver cannot act upon the
TXQ counts/offsets per TC, or upon the prio->tc map. It was probably
assumed that the driver only wants to offload num_tc (see
TC_MQPRIO_HW_OFFLOAD_TCS), which it can get from netdev_get_num_tc(),
but there's clearly more to the mqprio configuration than that.

I've considered 2 mechanisms to remedy that. First is to pass a struct
tc_mqprio_qopt_offload as part of the tc_taprio_qopt_offload. The second
is to make taprio actually call TC_SETUP_QDISC_MQPRIO, *in addition to*
TC_SETUP_QDISC_TAPRIO.

The difference is that in the first case, existing drivers (offloading
or not) all ignore taprio's mqprio portion currently, whereas in the
second case, we could control whether to call TC_SETUP_QDISC_MQPRIO,
based on a new capability. The question is which approach would be
better.

I'm afraid that calling TC_SETUP_QDISC_MQPRIO unconditionally (not based
on a taprio capability bit) would risk introducing regressions. For
example, taprio doesn't populate (or validate) qopt->hw, as well as
mqprio.flags, mqprio.shaper, mqprio.min_rate, mqprio.max_rate.

In comparison, adding a capability is functionally equivalent to just
passing the mqprio in a way that drivers can ignore it, except it's
slightly more complicated to use it (need to set the capability).

Ultimately, what made me go for the "mqprio in taprio" variant was that
it's easier for offloading drivers to interpret the mqprio qopt slightly
differently when it comes from taprio vs when it comes from mqprio,
should that ever become necessary.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9dd6ad67 04-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: refactor mqprio qopt reconstruction to a library function

The taprio qdisc will need to reconstruct a struct tc_mqprio_qopt from
netdev settings once more in a future patch, but this code was already
written twice, once in taprio and once in mqprio.

Refactor the code to a helper in the common mqprio library.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1dfe086d 04-Feb-2023 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: centralize mqprio qopt validation

There is a lot of code in taprio which is "borrowed" from mqprio.
It makes sense to put a stop to the "borrowing" and start actually
reusing code.

Because taprio and mqprio are built as part of different kernel modules,
code reuse can only take place either by writing it as static inline
(limiting), putting it in sch_generic.o (not generic enough), or
creating a third auto-selectable kernel module which only holds library
code. I opted for the third variant.

In a previous change, mqprio gained support for reverse TC:TXQ mappings,
something which taprio still denies. Make taprio use the same validation
logic so that it supports this configuration as well.

The taprio code didn't enforce TXQ overlaps in txtime-assist mode and
that looks intentional, even if I've no idea why that might be. Preserve
that, but add a comment.

There isn't any dedicated MAINTAINERS entry for mqprio, so nothing to
update there.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ea4fdbaa 23-Jan-2023 Eric Dumazet <edumazet@google.com>

net/sched: sch_taprio: do not schedule in taprio_reset()

As reported by syzbot and hinted by Vinicius, I should not have added
a qdisc_synchronize() call in taprio_reset()

taprio_reset() can be called with qdisc spinlock held (and BH disabled)
as shown in included syzbot report [1].

Only taprio_destroy() needed this synchronization, as explained
in the blamed commit changelog.

[1]

BUG: scheduling while atomic: syz-executor150/5091/0x00000202
2 locks held by syz-executor150/5091:
Modules linked in:
Preemption disabled at:
[<0000000000000000>] 0x0
Kernel panic - not syncing: scheduling while atomic: panic_on_warn set ...
CPU: 1 PID: 5091 Comm: syz-executor150 Not tainted 6.2.0-rc3-syzkaller-00219-g010a74f52203 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd1/0x138 lib/dump_stack.c:106
panic+0x2cc/0x626 kernel/panic.c:318
check_panic_on_warn.cold+0x19/0x35 kernel/panic.c:238
__schedule_bug.cold+0xd5/0xfe kernel/sched/core.c:5836
schedule_debug kernel/sched/core.c:5865 [inline]
__schedule+0x34e4/0x5450 kernel/sched/core.c:6500
schedule+0xde/0x1b0 kernel/sched/core.c:6682
schedule_timeout+0x14e/0x2a0 kernel/time/timer.c:2167
schedule_timeout_uninterruptible kernel/time/timer.c:2201 [inline]
msleep+0xb6/0x100 kernel/time/timer.c:2322
qdisc_synchronize include/net/sch_generic.h:1295 [inline]
taprio_reset+0x93/0x270 net/sched/sch_taprio.c:1703
qdisc_reset+0x10c/0x770 net/sched/sch_generic.c:1022
dev_reset_queue+0x92/0x130 net/sched/sch_generic.c:1285
netdev_for_each_tx_queue include/linux/netdevice.h:2464 [inline]
dev_deactivate_many+0x36d/0x9f0 net/sched/sch_generic.c:1351
dev_deactivate+0xed/0x1b0 net/sched/sch_generic.c:1374
qdisc_graft+0xe4a/0x1380 net/sched/sch_api.c:1080
tc_modify_qdisc+0xb6b/0x19a0 net/sched/sch_api.c:1689
rtnetlink_rcv_msg+0x43e/0xca0 net/core/rtnetlink.c:6141
netlink_rcv_skb+0x165/0x440 net/netlink/af_netlink.c:2564
netlink_unicast_kernel net/netlink/af_netlink.c:1330 [inline]
netlink_unicast+0x547/0x7f0 net/netlink/af_netlink.c:1356
netlink_sendmsg+0x91b/0xe10 net/netlink/af_netlink.c:1932
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg+0xd3/0x120 net/socket.c:734
____sys_sendmsg+0x712/0x8c0 net/socket.c:2476
___sys_sendmsg+0x110/0x1b0 net/socket.c:2530
__sys_sendmsg+0xf7/0x1c0 net/socket.c:2559
do_syscall_x64 arch/x86/entry/common.c:50 [inline]

Fixes: 3a415d59c1db ("net/sched: sch_taprio: fix possible use-after-free")
Link: https://lore.kernel.org/netdev/167387581653.2747.13878941339893288655.git-patchwork-notify@kernel.org/T/
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Link: https://lore.kernel.org/r/20230123084552.574396-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 3a415d59 13-Jan-2023 Eric Dumazet <edumazet@google.com>

net/sched: sch_taprio: fix possible use-after-free

syzbot reported a nasty crash [1] in net_tx_action() which
made little sense until we got a repro.

This repro installs a taprio qdisc, but providing an
invalid TCA_RATE attribute.

qdisc_create() has to destroy the just initialized
taprio qdisc, and taprio_destroy() is called.

However, the hrtimer used by taprio had already fired,
therefore advance_sched() called __netif_schedule().

Then net_tx_action was trying to use a destroyed qdisc.

We can not undo the __netif_schedule(), so we must wait
until one cpu serviced the qdisc before we can proceed.

Many thanks to Alexander Potapenko for his help.

[1]
BUG: KMSAN: uninit-value in queued_spin_trylock include/asm-generic/qspinlock.h:94 [inline]
BUG: KMSAN: uninit-value in do_raw_spin_trylock include/linux/spinlock.h:191 [inline]
BUG: KMSAN: uninit-value in __raw_spin_trylock include/linux/spinlock_api_smp.h:89 [inline]
BUG: KMSAN: uninit-value in _raw_spin_trylock+0x92/0xa0 kernel/locking/spinlock.c:138
queued_spin_trylock include/asm-generic/qspinlock.h:94 [inline]
do_raw_spin_trylock include/linux/spinlock.h:191 [inline]
__raw_spin_trylock include/linux/spinlock_api_smp.h:89 [inline]
_raw_spin_trylock+0x92/0xa0 kernel/locking/spinlock.c:138
spin_trylock include/linux/spinlock.h:359 [inline]
qdisc_run_begin include/net/sch_generic.h:187 [inline]
qdisc_run+0xee/0x540 include/net/pkt_sched.h:125
net_tx_action+0x77c/0x9a0 net/core/dev.c:5086
__do_softirq+0x1cc/0x7fb kernel/softirq.c:571
run_ksoftirqd+0x2c/0x50 kernel/softirq.c:934
smpboot_thread_fn+0x554/0x9f0 kernel/smpboot.c:164
kthread+0x31b/0x430 kernel/kthread.c:376
ret_from_fork+0x1f/0x30

Uninit was created at:
slab_post_alloc_hook mm/slab.h:732 [inline]
slab_alloc_node mm/slub.c:3258 [inline]
__kmalloc_node_track_caller+0x814/0x1250 mm/slub.c:4970
kmalloc_reserve net/core/skbuff.c:358 [inline]
__alloc_skb+0x346/0xcf0 net/core/skbuff.c:430
alloc_skb include/linux/skbuff.h:1257 [inline]
nlmsg_new include/net/netlink.h:953 [inline]
netlink_ack+0x5f3/0x12b0 net/netlink/af_netlink.c:2436
netlink_rcv_skb+0x55d/0x6c0 net/netlink/af_netlink.c:2507
rtnetlink_rcv+0x30/0x40 net/core/rtnetlink.c:6108
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0xf3b/0x1270 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x1288/0x1440 net/netlink/af_netlink.c:1921
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg net/socket.c:734 [inline]
____sys_sendmsg+0xabc/0xe90 net/socket.c:2482
___sys_sendmsg+0x2a1/0x3f0 net/socket.c:2536
__sys_sendmsg net/socket.c:2565 [inline]
__do_sys_sendmsg net/socket.c:2574 [inline]
__se_sys_sendmsg net/socket.c:2572 [inline]
__x64_sys_sendmsg+0x367/0x540 net/socket.c:2572
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

CPU: 0 PID: 13 Comm: ksoftirqd/0 Not tainted 6.0.0-rc2-syzkaller-47461-gac3859c02d7f #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/22/2022

Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# af7b29b1 04-Oct-2022 Vladimir Oltean <vladimir.oltean@nxp.com>

Revert "net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"

taprio_attach() has this logic at the end, which should have been
removed with the blamed patch (which is now being reverted):

/* access to the child qdiscs is not needed in offload mode */
if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
kfree(q->qdiscs);
q->qdiscs = NULL;
}

because otherwise, we make use of q->qdiscs[] even after this array was
deallocated, namely in taprio_leaf(). Therefore, whenever one would try
to attach a valid child qdisc to a fully offloaded taprio root, one
would immediately dereference a NULL pointer.

$ tc qdisc replace dev eno0 handle 8001: parent root taprio \
num_tc 8 \
map 0 1 2 3 4 5 6 7 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
max-sdu 0 0 0 0 0 200 0 0 \
base-time 200 \
sched-entry S 80 20000 \
sched-entry S a0 20000 \
sched-entry S 5f 60000 \
flags 2
$ max_frame_size=1500
$ data_rate_kbps=20000
$ port_transmit_rate_kbps=1000000
$ idleslope=$data_rate_kbps
$ sendslope=$(($idleslope - $port_transmit_rate_kbps))
$ locredit=$(($max_frame_size * $sendslope / $port_transmit_rate_kbps))
$ hicredit=$(($max_frame_size * $idleslope / $port_transmit_rate_kbps))
$ tc qdisc replace dev eno0 parent 8001:7 cbs \
idleslope $idleslope \
sendslope $sendslope \
hicredit $hicredit \
locredit $locredit \
offload 0

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000030
pc : taprio_leaf+0x28/0x40
lr : qdisc_leaf+0x3c/0x60
Call trace:
taprio_leaf+0x28/0x40
tc_modify_qdisc+0xf0/0x72c
rtnetlink_rcv_msg+0x12c/0x390
netlink_rcv_skb+0x5c/0x130
rtnetlink_rcv+0x1c/0x2c

The solution is not as obvious as the problem. The code which deallocates
q->qdiscs[] is in fact copied and pasted from mqprio, which also
deallocates the array in mqprio_attach() and never uses it afterwards.

Therefore, the identical cleanup logic of priv->qdiscs[] that
mqprio_destroy() has is deceptive because it will never take place at
qdisc_destroy() time, but just at raw ops->destroy() time (otherwise
said, priv->qdiscs[] do not last for the entire lifetime of the mqprio
root), but rather, this is just the twisted way in which the Qdisc API
understands error path cleanup should be done (Qdisc_ops :: destroy() is
called even when Qdisc_ops :: init() never succeeded).

Side note, in fact this is also what the comment in mqprio_init() says:

/* pre-allocate qdisc, attachment can't fail */

Or reworded, mqprio's priv->qdiscs[] scheme is only meant to serve as
data passing between Qdisc_ops :: init() and Qdisc_ops :: attach().

[ this comment was also copied and pasted into the initial taprio
commit, even though taprio_attach() came way later ]

The problem is that taprio also makes extensive use of the q->qdiscs[]
array in the software fast path (taprio_enqueue() and taprio_dequeue()),
but it does not keep a reference of its own on q->qdiscs[i] (you'd think
that since it creates these Qdiscs, it holds the reference, but nope,
this is not completely true).

To understand the difference between taprio_destroy() and mqprio_destroy()
one must look before commit 13511704f8d7 ("net: taprio offload: enforce
qdisc to netdev queue mapping"), because that just muddied the waters.

In the "original" taprio design, taprio always attached itself (the root
Qdisc) to all netdev TX queues, so that dev_qdisc_enqueue() would go
through taprio_enqueue().

It also called qdisc_refcount_inc() on itself for as many times as there
were netdev TX queues, in order to counter-balance what tc_get_qdisc()
does when destroying a Qdisc (simplified for brevity below):

if (n->nlmsg_type == RTM_DELQDISC)
err = qdisc_graft(dev, parent=NULL, new=NULL, q, extack);

qdisc_graft(where "new" is NULL so this deletes the Qdisc):

for (i = 0; i < num_q; i++) {
struct netdev_queue *dev_queue;

dev_queue = netdev_get_tx_queue(dev, i);

old = dev_graft_qdisc(dev_queue, new);
if (new && i > 0)
qdisc_refcount_inc(new);

qdisc_put(old);
~~~~~~~~~~~~~~
this decrements taprio's refcount once for each TX queue
}

notify_and_destroy(net, skb, n, classid,
rtnl_dereference(dev->qdisc), new);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
and this finally decrements it to zero,
making qdisc_put() call qdisc_destroy()

The q->qdiscs[] created using qdisc_create_dflt() (or their
replacements, if taprio_graft() was ever to get called) were then
privately freed by taprio_destroy().

This is still what is happening after commit 13511704f8d7 ("net: taprio
offload: enforce qdisc to netdev queue mapping"), but only for software
mode.

In full offload mode, the per-txq "qdisc_put(old)" calls from
qdisc_graft() now deallocate the child Qdiscs rather than decrement
taprio's refcount. So when notify_and_destroy(taprio) finally calls
taprio_destroy(), the difference is that the child Qdiscs were already
deallocated.

And this is exactly why the taprio_attach() comment "access to the child
qdiscs is not needed in offload mode" is deceptive too. Not only the
q->qdiscs[] array is not needed, but it is also necessary to get rid of
it as soon as possible, because otherwise, we will also call qdisc_put()
on the child Qdiscs in qdisc_destroy() -> taprio_destroy(), and this
will cause a nasty use-after-free/refcount-saturate/whatever.

In short, the problem is that since the blamed commit, taprio_leaf()
needs q->qdiscs[] to not be freed by taprio_attach(), while qdisc_destroy()
-> taprio_destroy() does need q->qdiscs[] to be freed by taprio_attach()
for full offload. Fixing one problem triggers the other.

All of this can be solved by making taprio keep its q->qdiscs[i] with a
refcount elevated at 2 (in offloaded mode where they are attached to the
netdev TX queues), both in taprio_attach() and in taprio_graft(). The
generic qdisc_graft() would just decrement the child qdiscs' refcounts
to 1, and taprio_destroy() would give them the final coup de grace.

However the rabbit hole of changes is getting quite deep, and the
complexity increases. The blamed commit was supposed to be a bug fix in
the first place, and the bug it addressed is not so significant so as to
justify further rework in stable trees. So I'd rather just revert it.
I don't know enough about multi-queue Qdisc design to make a proper
judgement right now regarding what is/isn't idiomatic use of Qdisc
concepts in taprio. I will try to study the problem more and come with a
different solution in net-next.

Fixes: 1461d212ab27 ("net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs")
Reported-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com>
Reported-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Link: https://lore.kernel.org/r/20221004220100.1650558-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# a54fc09e 27-Sep-2022 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: allow user input of per-tc max SDU

IEEE 802.1Q clause 12.29.1.1 "The queueMaxSDUTable structure and data
types" and 8.6.8.4 "Enhancements for scheduled traffic" talk about the
existence of a per traffic class limitation of maximum frame sizes, with
a fallback on the port-based MTU.

As far as I am able to understand, the 802.1Q Service Data Unit (SDU)
represents the MAC Service Data Unit (MSDU, i.e. L2 payload), excluding
any number of prepended VLAN headers which may be otherwise present in
the MSDU. Therefore, the queueMaxSDU is directly comparable to the
device MTU (1500 means L2 payload sizes are accepted, or frame sizes of
1518 octets, or 1522 plus one VLAN header). Drivers which offload this
are directly responsible of translating into other units of measurement.

To keep the fast path checks optimized, we keep 2 arrays in the qdisc,
one for max_sdu translated into frame length (so that it's comparable to
skb->len), and another for offloading and for dumping back to the user.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# fc4f2fd0 23-Sep-2022 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: simplify list iteration in taprio_dev_notifier()

taprio_dev_notifier() subscribes to netdev state changes in order to
determine whether interfaces which have a taprio root qdisc have changed
their link speed, so the internal calculations can be adapted properly.

The 'qdev' temporary variable serves no purpose, because we just use it
only once, and can just as well use qdisc_dev(q->root) directly (or the
"dev" that comes from the netdev notifier; this is because qdev is only
interesting if it was the subject of the state change, _and_ its root
qdisc belongs in the taprio list).

The 'found' variable also doesn't really serve too much of a purpose
either; we can just call taprio_set_picos_per_byte() within the loop,
and exit immediately afterwards.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Link: https://lore.kernel.org/r/20220923145921.3038904-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# e046fa89 20-Sep-2022 Zhengchao Shao <shaozhengchao@huawei.com>

net/sched: use tc_qdisc_stats_dump() in qdisc

use tc_qdisc_stats_dump() in qdisc.

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Tested-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# a2c2a4dd 20-Sep-2022 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: remove unnecessary taprio_list_lock

The 3 functions that want access to the taprio_list:
taprio_dev_notifier(), taprio_destroy() and taprio_init() are all called
with the rtnl_mutex held, therefore implicitly serialized with respect
to each other. A spin lock serves no purpose.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Link: https://lore.kernel.org/r/20220921095632.1379251-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 2c08a4f8 15-Sep-2022 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: replace safety precautions with comments

The WARN_ON_ONCE() checks introduced in commit 13511704f8d7 ("net:
taprio offload: enforce qdisc to netdev queue mapping") take a small
toll on performance, but otherwise, the conditions are never expected to
happen. Replace them with comments, such that the information is still
conveyed to developers.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 026de64d 15-Sep-2022 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: add extack messages in taprio_init

Stop contributing to the proverbial user unfriendliness of tc, and tell
the user what is wrong wherever possible.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 25becba6 15-Sep-2022 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: stop going through private ops for dequeue and peek

Since commit 13511704f8d7 ("net: taprio offload: enforce qdisc to netdev
queue mapping"), taprio_dequeue_soft() and taprio_peek_soft() are de
facto the only implementations for Qdisc_ops :: dequeue and Qdisc_ops ::
peek that taprio provides.

This is because in full offload mode, __dev_queue_xmit() will select a
txq->qdisc which is never root taprio qdisc. So if nothing is enqueued
in the root qdisc, it will never be run and nothing will get dequeued
from it.

Therefore, we can remove the private indirection from taprio, and always
point Qdisc_ops :: dequeue to taprio_dequeue_soft (now simply named
taprio_dequeue) and Qdisc_ops :: peek to taprio_peek_soft (now simply
named taprio_peek).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# fa65edde 15-Sep-2022 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: remove redundant FULL_OFFLOAD_IS_ENABLED check in taprio_enqueue

Since commit 13511704f8d7 ("net: taprio offload: enforce qdisc to netdev
queue mapping"), __dev_queue_xmit() will select a txq->qdisc for the
full offload case of taprio which isn't the root taprio qdisc, so
qdisc enqueues will never pass through taprio_enqueue().

That commit already introduced one safety precaution check for
FULL_OFFLOAD_IS_ENABLED(); a second one is really not needed, so
simplify the conditional for entering into the GSO segmentation logic.
Also reword the comment a little, to appear more natural after the code
change.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 9af23657 15-Sep-2022 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: use rtnl_dereference for oper and admin sched in taprio_destroy()

Sparse complains that taprio_destroy() dereferences q->oper_sched and
q->admin_sched without rcu_dereference(), since they are marked as __rcu
in the taprio private structure.

1671:28: warning: incorrect type in argument 1 (different address spaces)
1671:28: expected struct callback_head *head
1671:28: got struct callback_head [noderef] __rcu *
1674:28: warning: incorrect type in argument 1 (different address spaces)
1674:28: expected struct callback_head *head
1674:28: got struct callback_head [noderef] __rcu *

To silence that build warning, do actually use rtnl_dereference(), since
we know the rtnl_mutex is held at the time of q->destroy().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 18cdd2f0 15-Sep-2022 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: taprio_dump and taprio_change are protected by rtnl_mutex

Since the writer-side lock is taken here, we do not need to open an RCU
read-side critical section, instead we can use rtnl_dereference() to
tell lockdep we are serialized with concurrent writes.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# c8cbe123 15-Sep-2022 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: taprio_offload_config_changed() is protected by rtnl_mutex

The locking in taprio_offload_config_changed() is wrong (but also
inconsequentially so). The current_entry_lock does not serialize changes
to the admin and oper schedules, only to the current entry. In fact, the
rtnl_mutex does that, and that is taken at the time when taprio_change()
is called.

Replace the rcu_dereference_protected() method with the proper RCU
annotation, and drop the unnecessary spin lock.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# c19d893f 23-Aug-2022 Zhengchao Shao <shaozhengchao@huawei.com>

net: sched: delete duplicate cleanup of backlog and qlen

qdisc_reset() is clearing qdisc->q.qlen and qdisc->qstats.backlog
_after_ calling qdisc->ops->reset. There is no need to clear them
again in the specific reset function.

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Link: https://lore.kernel.org/r/20220824005231.345727-1-shaozhengchao@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# 1461d212 15-Sep-2022 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs

taprio can only operate as root qdisc, and to that end, there exists the
following check in taprio_init(), just as in mqprio:

if (sch->parent != TC_H_ROOT)
return -EOPNOTSUPP;

And indeed, when we try to attach taprio to an mqprio child, it fails as
expected:

$ tc qdisc add dev swp0 root handle 1: mqprio num_tc 8 \
map 0 1 2 3 4 5 6 7 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0
$ tc qdisc replace dev swp0 parent 1:2 taprio num_tc 8 \
map 0 1 2 3 4 5 6 7 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
base-time 0 sched-entry S 0x7f 990000 sched-entry S 0x80 100000 \
flags 0x0 clockid CLOCK_TAI
Error: sch_taprio: Can only be attached as root qdisc.

(extack message added by me)

But when we try to attach a taprio child to a taprio root qdisc,
surprisingly it doesn't fail:

$ tc qdisc replace dev swp0 root handle 1: taprio num_tc 8 \
map 0 1 2 3 4 5 6 7 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
base-time 0 sched-entry S 0x7f 990000 sched-entry S 0x80 100000 \
flags 0x0 clockid CLOCK_TAI
$ tc qdisc replace dev swp0 parent 1:2 taprio num_tc 8 \
map 0 1 2 3 4 5 6 7 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
base-time 0 sched-entry S 0x7f 990000 sched-entry S 0x80 100000 \
flags 0x0 clockid CLOCK_TAI

This is because tc_modify_qdisc() behaves differently when mqprio is
root, vs when taprio is root.

In the mqprio case, it finds the parent qdisc through
p = qdisc_lookup(dev, TC_H_MAJ(clid)), and then the child qdisc through
q = qdisc_leaf(p, clid). This leaf qdisc q has handle 0, so it is
ignored according to the comment right below ("It may be default qdisc,
ignore it"). As a result, tc_modify_qdisc() goes through the
qdisc_create() code path, and this gives taprio_init() a chance to check
for sch_parent != TC_H_ROOT and error out.

Whereas in the taprio case, the returned q = qdisc_leaf(p, clid) is
different. It is not the default qdisc created for each netdev queue
(both taprio and mqprio call qdisc_create_dflt() and keep them in
a private q->qdiscs[], or priv->qdiscs[], respectively). Instead, taprio
makes qdisc_leaf() return the _root_ qdisc, aka itself.

When taprio does that, tc_modify_qdisc() goes through the qdisc_change()
code path, because the qdisc layer never finds out about the child qdisc
of the root. And through the ->change() ops, taprio has no reason to
check whether its parent is root or not, just through ->init(), which is
not called.

The problem is the taprio_leaf() implementation. Even though code wise,
it does the exact same thing as mqprio_leaf() which it is copied from,
it works with different input data. This is because mqprio does not
attach itself (the root) to each device TX queue, but one of the default
qdiscs from its private array.

In fact, since commit 13511704f8d7 ("net: taprio offload: enforce qdisc
to netdev queue mapping"), taprio does this too, but just for the full
offload case. So if we tried to attach a taprio child to a fully
offloaded taprio root qdisc, it would properly fail too; just not to a
software root taprio.

To fix the problem, stop looking at the Qdisc that's attached to the TX
queue, and instead, always return the default qdiscs that we've
allocated (and to which we privately enqueue and dequeue, in software
scheduling mode).

Since Qdisc_class_ops :: leaf is only called from tc_modify_qdisc(),
the risk of unforeseen side effects introduced by this change is
minimal.

Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# db46e3a8 15-Sep-2022 Vladimir Oltean <vladimir.oltean@nxp.com>

net/sched: taprio: avoid disabling offload when it was never enabled

In an incredibly strange API design decision, qdisc->destroy() gets
called even if qdisc->init() never succeeded, not exclusively since
commit 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation"),
but apparently also earlier (in the case of qdisc_create_dflt()).

The taprio qdisc does not fully acknowledge this when it attempts full
offload, because it starts off with q->flags = TAPRIO_FLAGS_INVALID in
taprio_init(), then it replaces q->flags with TCA_TAPRIO_ATTR_FLAGS
parsed from netlink (in taprio_change(), tail called from taprio_init()).

But in taprio_destroy(), we call taprio_disable_offload(), and this
determines what to do based on FULL_OFFLOAD_IS_ENABLED(q->flags).

But looking at the implementation of FULL_OFFLOAD_IS_ENABLED()
(a bitwise check of bit 1 in q->flags), it is invalid to call this macro
on q->flags when it contains TAPRIO_FLAGS_INVALID, because that is set
to U32_MAX, and therefore FULL_OFFLOAD_IS_ENABLED() will return true on
an invalid set of flags.

As a result, it is possible to crash the kernel if user space forces an
error between setting q->flags = TAPRIO_FLAGS_INVALID, and the calling
of taprio_enable_offload(). This is because drivers do not expect the
offload to be disabled when it was never enabled.

The error that we force here is to attach taprio as a non-root qdisc,
but instead as child of an mqprio root qdisc:

$ tc qdisc add dev swp0 root handle 1: \
mqprio num_tc 8 map 0 1 2 3 4 5 6 7 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0
$ tc qdisc replace dev swp0 parent 1:1 \
taprio num_tc 8 map 0 1 2 3 4 5 6 7 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 base-time 0 \
sched-entry S 0x7f 990000 sched-entry S 0x80 100000 \
flags 0x0 clockid CLOCK_TAI
Unable to handle kernel paging request at virtual address fffffffffffffff8
[fffffffffffffff8] pgd=0000000000000000, p4d=0000000000000000
Internal error: Oops: 96000004 [#1] PREEMPT SMP
Call trace:
taprio_dump+0x27c/0x310
vsc9959_port_setup_tc+0x1f4/0x460
felix_port_setup_tc+0x24/0x3c
dsa_slave_setup_tc+0x54/0x27c
taprio_disable_offload.isra.0+0x58/0xe0
taprio_destroy+0x80/0x104
qdisc_create+0x240/0x470
tc_modify_qdisc+0x1fc/0x6b0
rtnetlink_rcv_msg+0x12c/0x390
netlink_rcv_skb+0x5c/0x130
rtnetlink_rcv+0x1c/0x2c

Fix this by keeping track of the operations we made, and undo the
offload only if we actually did it.

I've added "bool offloaded" inside a 4 byte hole between "int clockid"
and "atomic64_t picos_per_byte". Now the first cache line looks like
below:

$ pahole -C taprio_sched net/sched/sch_taprio.o
struct taprio_sched {
struct Qdisc * * qdiscs; /* 0 8 */
struct Qdisc * root; /* 8 8 */
u32 flags; /* 16 4 */
enum tk_offsets tk_offset; /* 20 4 */
int clockid; /* 24 4 */
bool offloaded; /* 28 1 */

/* XXX 3 bytes hole, try to pack */

atomic64_t picos_per_byte; /* 32 0 */

/* XXX 8 bytes hole, try to pack */

spinlock_t current_entry_lock; /* 40 0 */

/* XXX 8 bytes hole, try to pack */

struct sched_entry * current_entry; /* 48 8 */
struct sched_gate_list * oper_sched; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */

Fixes: 9c66d1564676 ("taprio: Add support for hardware offloading")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 837ced3a 28-Jun-2022 Vladimir Oltean <vladimir.oltean@nxp.com>

time64.h: consolidate uses of PSEC_PER_NSEC

Time-sensitive networking code needs to work with PTP times expressed in
nanoseconds, and with packet transmission times expressed in
picoseconds, since those would be fractional at higher than gigabit
speed when expressed in nanoseconds.

Convert the existing uses in tc-taprio and the ocelot/felix DSA driver
to a PSEC_PER_NSEC macro. This macro is placed in include/linux/time64.h
as opposed to its relatives (PSEC_PER_SEC etc) from include/vdso/time64.h
because the vDSO library does not (yet) need/use it.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> # for the vDSO parts
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# e8a64bba 08-Apr-2022 Benedikt Spranger <b.spranger@linutronix.de>

net/sched: taprio: Check if socket flags are valid

A user may set the SO_TXTIME socket option to ensure a packet is send
at a given time. The taprio scheduler has to confirm, that it is allowed
to send a packet at that given time, by a check against the packet time
schedule. The scheduler drop the packet, if the gates are closed at the
given send time.

The check, if SO_TXTIME is set, may fail since sk_flags are part of an
union and the union is used otherwise. This happen, if a socket is not
a full socket, like a request socket for example.

Add a check to verify, if the union is used for sk_flags.

Fixes: 4cfd5779bd6e ("taprio: Add support for txtime-assist mode")
Signed-off-by: Benedikt Spranger <b.spranger@linutronix.de>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6dc25401 08-Nov-2021 Eric Dumazet <edumazet@google.com>

net/sched: sch_taprio: fix undefined behavior in ktime_mono_to_any

1) if q->tk_offset == TK_OFFS_MAX, then get_tcp_tstamp() calls
ktime_mono_to_any() with out-of-bound value.

2) if q->tk_offset is changed in taprio_parse_clockid(),
taprio_get_time() might also call ktime_mono_to_any()
with out-of-bound value as sysbot found:

UBSAN: array-index-out-of-bounds in kernel/time/timekeeping.c:908:27
index 3 is out of range for type 'ktime_t *[3]'
CPU: 1 PID: 25668 Comm: kworker/u4:0 Not tainted 5.15.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: bat_events batadv_iv_send_outstanding_bat_ogm_packet
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
ubsan_epilogue+0xb/0x5a lib/ubsan.c:151
__ubsan_handle_out_of_bounds.cold+0x62/0x6c lib/ubsan.c:291
ktime_mono_to_any+0x1d4/0x1e0 kernel/time/timekeeping.c:908
get_tcp_tstamp net/sched/sch_taprio.c:322 [inline]
get_packet_txtime net/sched/sch_taprio.c:353 [inline]
taprio_enqueue_one+0x5b0/0x1460 net/sched/sch_taprio.c:420
taprio_enqueue+0x3b1/0x730 net/sched/sch_taprio.c:485
dev_qdisc_enqueue+0x40/0x300 net/core/dev.c:3785
__dev_xmit_skb net/core/dev.c:3869 [inline]
__dev_queue_xmit+0x1f6e/0x3630 net/core/dev.c:4194
batadv_send_skb_packet+0x4a9/0x5f0 net/batman-adv/send.c:108
batadv_iv_ogm_send_to_if net/batman-adv/bat_iv_ogm.c:393 [inline]
batadv_iv_ogm_emit net/batman-adv/bat_iv_ogm.c:421 [inline]
batadv_iv_send_outstanding_bat_ogm_packet+0x6d7/0x8e0 net/batman-adv/bat_iv_ogm.c:1701
process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
worker_thread+0x658/0x11f0 kernel/workqueue.c:2445
kthread+0x405/0x4f0 kernel/kthread.c:327
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

Fixes: 7ede7b03484b ("taprio: make clock reference conversions easier")
Fixes: 54002066100b ("taprio: Adjust timestamps for TCP packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vedang Patel <vedang.patel@intel.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Link: https://lore.kernel.org/r/20211108180815.1822479-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 29cbcd85 16-Oct-2021 Ahmed S. Darwish <a.darwish@linutronix.de>

net: sched: Remove Qdisc::running sequence counter

The Qdisc::running sequence counter has two uses:

1. Reliably reading qdisc's tc statistics while the qdisc is running
(a seqcount read/retry loop at gnet_stats_add_basic()).

2. As a flag, indicating whether the qdisc in question is running
(without any retry loops).

For the first usage, the Qdisc::running sequence counter write section,
qdisc_run_begin() => qdisc_run_end(), covers a much wider area than what
is actually needed: the raw qdisc's bstats update. A u64_stats sync
point was thus introduced (in previous commits) inside the bstats
structure itself. A local u64_stats write section is then started and
stopped for the bstats updates.

Use that u64_stats sync point mechanism for the bstats read/retry loop
at gnet_stats_add_basic().

For the second qdisc->running usage, a __QDISC_STATE_RUNNING bit flag,
accessed with atomic bitops, is sufficient. Using a bit flag instead of
a sequence counter at qdisc_run_begin/end() and qdisc_is_running() leads
to the SMP barriers implicitly added through raw_read_seqcount() and
write_seqcount_begin/end() getting removed. All call sites have been
surveyed though, and no required ordering was identified.

Now that the qdisc->running sequence counter is no longer used, remove
it.

Note, using u64_stats implies no sequence counter protection for 64-bit
architectures. This can lead to the qdisc tc statistics "packets" vs.
"bytes" values getting out of sync on rare occasions. The individual
values will still be valid.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a56d447f 04-Oct-2021 Eric Dumazet <edumazet@google.com>

net/sched: sch_taprio: properly cancel timer from taprio_destroy()

There is a comment in qdisc_create() about us not calling ops->reset()
in some cases.

err_out4:
/*
* Any broken qdiscs that would require a ops->reset() here?
* The qdisc was never in action so it shouldn't be necessary.
*/

As taprio sets a timer before actually receiving a packet, we need
to cancel it from ops->destroy, just in case ops->reset has not
been called.

syzbot reported:

ODEBUG: free active (active state 0) object type: hrtimer hint: advance_sched+0x0/0x9a0 arch/x86/include/asm/atomic64_64.h:22
WARNING: CPU: 0 PID: 8441 at lib/debugobjects.c:505 debug_print_object+0x16e/0x250 lib/debugobjects.c:505
Modules linked in:
CPU: 0 PID: 8441 Comm: syz-executor813 Not tainted 5.14.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:debug_print_object+0x16e/0x250 lib/debugobjects.c:505
Code: ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 af 00 00 00 48 8b 14 dd e0 d3 e3 89 4c 89 ee 48 c7 c7 e0 c7 e3 89 e8 5b 86 11 05 <0f> 0b 83 05 85 03 92 09 01 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e c3
RSP: 0018:ffffc9000130f330 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
RDX: ffff88802baeb880 RSI: ffffffff815d87b5 RDI: fffff52000261e58
RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff815d25ee R11: 0000000000000000 R12: ffffffff898dd020
R13: ffffffff89e3ce20 R14: ffffffff81653630 R15: dffffc0000000000
FS: 0000000000f0d300(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffb64b3e000 CR3: 0000000036557000 CR4: 00000000001506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
__debug_check_no_obj_freed lib/debugobjects.c:987 [inline]
debug_check_no_obj_freed+0x301/0x420 lib/debugobjects.c:1018
slab_free_hook mm/slub.c:1603 [inline]
slab_free_freelist_hook+0x171/0x240 mm/slub.c:1653
slab_free mm/slub.c:3213 [inline]
kfree+0xe4/0x540 mm/slub.c:4267
qdisc_create+0xbcf/0x1320 net/sched/sch_api.c:1299
tc_modify_qdisc+0x4c8/0x1a60 net/sched/sch_api.c:1663
rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5571
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504
netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340
netlink_sendmsg+0x86d/0xdb0 net/netlink/af_netlink.c:1929
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:724
____sys_sendmsg+0x6e8/0x810 net/socket.c:2403
___sys_sendmsg+0xf3/0x170 net/socket.c:2457
__sys_sendmsg+0xe5/0x1b0 net/socket.c:2486
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80

Fixes: 44d4775ca518 ("net/sched: sch_taprio: reset child qdiscs before freeing them")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Davide Caratti <dcaratti@redhat.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Acked-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# efe487fc 29-Aug-2021 Haimin Zhang <tcs_kernel@tencent.com>

fix array-index-out-of-bounds in taprio_change

syzbot report an array-index-out-of-bounds in taprio_change
index 16 is out of range for type '__u16 [16]'
that's because mqprio->num_tc is lager than TC_MAX_QUEUE,so we check
the return value of netdev_set_num_tc.

Reported-by: syzbot+2b3e5fb6c7ef285a94f6@syzkaller.appspotmail.com
Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ebca25ea 30-Jul-2021 Yannick Vignon <yannick.vignon@nxp.com>

net/sched: taprio: Fix init procedure

Commit 13511704f8d759 ("net: taprio offload: enforce qdisc to netdev queue mapping")
resulted in duplicate entries in the qdisc hash.
While this did not impact the overall operation of the qdisc and taprio
code paths, it did result in an infinite loop when dumping the qdisc
properties, at least on one target (NXP LS1028 ARDB).
Removing the duplicate call to qdisc_hash_add() solves the problem.

Fixes: 13511704f8d759 ("net: taprio offload: enforce qdisc to netdev queue mapping")
Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 633fa666 01-Jul-2021 Jesper Dangaard Brouer <brouer@redhat.com>

net/sched: sch_taprio: fix typo in comment

I have checked that the IEEE standard 802.1Q-2018 section 8.6.9.4.5
is called AdminGateStates.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 13511704 11-May-2021 Yannick Vignon <yannick.vignon@nxp.com>

net: taprio offload: enforce qdisc to netdev queue mapping

Even though the taprio qdisc is designed for multiqueue devices, all the
queues still point to the same top-level taprio qdisc. This works and is
probably required for software taprio, but at least with offload taprio,
it has an undesirable side effect: because the whole qdisc is run when a
packet has to be sent, it allows packets in a best-effort class to be
processed in the context of a task sending higher priority traffic. If
there are packets left in the qdisc after that first run, the NET_TX
softirq is raised and gets executed immediately in the same process
context. As with any other softirq, it runs up to 10 times and for up to
2ms, during which the calling process is waiting for the sendmsg call (or
similar) to return. In my use case, that calling process is a real-time
task scheduled to send a packet every 2ms, so the long sendmsg calls are
leading to missed timeslots.

By attaching each netdev queue to its own qdisc, as it is done with
the "classic" mq qdisc, each traffic class can be processed independently
without touching the other classes. A high-priority process can then send
packets without getting stuck in the sendmsg call anymore.

Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# faa5f5da 11-May-2021 Guenter Roeck <linux@roeck-us.net>

net/sched: taprio: Drop unnecessary NULL check after container_of

The rcu_head pointer passed to taprio_free_sched_cb is never NULL.
That means that the result of container_of() operations on it is also
never NULL, even though rcu_head is the first element of the structure
embedding it. On top of that, it is misleading to perform a NULL check
on the result of container_of() because the position of the contained
element could change, which would make the check invalid. Remove the
unnecessary NULL check.

This change was made automatically with the following Coccinelle script.

@@
type t;
identifier v;
statement s;
@@

<+...
(
t v = container_of(...);
|
v = container_of(...);
)
...
when != v
- if (\( !v \| v == NULL \) ) s
...+>

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ed8157f1 16-Apr-2021 Du Cheng <ducheng2@gmail.com>

net: sched: tapr: prevent cycle_time == 0 in parse_taprio_schedule

There is a reproducible sequence from the userland that will trigger a WARN_ON()
condition in taprio_get_start_time, which causes kernel to panic if configured
as "panic_on_warn". Catch this condition in parse_taprio_schedule to
prevent this condition.

Reported as bug on syzkaller:
https://syzkaller.appspot.com/bug?extid=d50710fd0873a9c6b40c

Reported-by: syzbot+d50710fd0873a9c6b40c@syzkaller.appspotmail.com
Signed-off-by: Du Cheng <ducheng2@gmail.com>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 497cc002 18-Mar-2021 Kurt Kanzenbach <kurt@linutronix.de>

taprio: Handle short intervals and large packets

When using short intervals e.g. below one millisecond, large packets won't be
transmitted at all. The software implementations checks whether the packet can
be fit into the remaining interval. Therefore, it takes the packet length and
the transmission speed into account. That is correct.

However, for large packets it may be that the transmission time exceeds the
interval resulting in no packet transmission. The same situation works fine with
hardware offloading applied.

The problem has been observed with the following schedule and iperf3:

|tc qdisc replace dev lan1 parent root handle 100 taprio \
| num_tc 8 \
| map 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 \
| queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
| base-time $base \
| sched-entry S 0x40 500000 \
| sched-entry S 0xbf 500000 \
| clockid CLOCK_TAI \
| flags 0x00

[...]

|root@tsn:~# iperf3 -c 192.168.2.105
|Connecting to host 192.168.2.105, port 5201
|[ 5] local 192.168.2.121 port 52610 connected to 192.168.2.105 port 5201
|[ ID] Interval Transfer Bitrate Retr Cwnd
|[ 5] 0.00-1.00 sec 45.2 KBytes 370 Kbits/sec 0 1.41 KBytes
|[ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes

After debugging, it seems that the packet length stored in the SKB is about
7000-8000 bytes. Using a 100 Mbit/s link the transmission time is about 600us
which larger than the interval of 500us.

Therefore, segment the SKB into smaller chunks if the packet is too big. This
yields similar results than the hardware offload:

|root@tsn:~# iperf3 -c 192.168.2.105
|Connecting to host 192.168.2.105, port 5201
|- - - - - - - - - - - - - - - - - - - - - - - - -
|[ ID] Interval Transfer Bitrate Retr
|[ 5] 0.00-10.00 sec 48.9 MBytes 41.0 Mbits/sec 0 sender
|[ 5] 0.00-10.02 sec 48.7 MBytes 40.7 Mbits/sec receiver

Furthermore, the segmentation can be skipped for the full offload case, as the
driver or the hardware is expected to handle this.

Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0deee7aa 18-Jan-2021 Jiapeng Zhong <abaci-bugfix@linux.alibaba.com>

taprio: boolean values to a bool variable

Fix the following coccicheck warnings:

./net/sched/sch_taprio.c:393:3-16: WARNING: Assignment of 0/1 to bool
variable.

./net/sched/sch_taprio.c:375:2-15: WARNING: Assignment of 0/1 to bool
variable.

./net/sched/sch_taprio.c:244:4-19: WARNING: Assignment of 0/1 to bool
variable.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Zhong <abaci-bugfix@linux.alibaba.com>
Link: https://lore.kernel.org/r/1610958662-71166-1-git-send-email-abaci-bugfix@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 698285da 17-Dec-2020 Davide Caratti <dcaratti@redhat.com>

net/sched: sch_taprio: ensure to reset/destroy all child qdiscs

taprio_graft() can insert a NULL element in the array of child qdiscs. As
a consquence, taprio_reset() might not reset child qdiscs completely, and
taprio_destroy() might leak resources. Fix it by ensuring that loops that
iterate over q->qdiscs[] don't end when they find the first NULL item.

Fixes: 44d4775ca518 ("net/sched: sch_taprio: reset child qdiscs before freeing them")
Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/13edef6778fef03adc751582562fba4a13e06d6a.1608240532.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 44d4775c 16-Dec-2020 Davide Caratti <dcaratti@redhat.com>

net/sched: sch_taprio: reset child qdiscs before freeing them

syzkaller shows that packets can still be dequeued while taprio_destroy()
is running. Let sch_taprio use the reset() function to cancel the advance
timer and drop all skbs from the child qdiscs.

Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
Link: https://syzkaller.appspot.com/bug?id=f362872379bf8f0017fb667c1ab158f2d1e764ae
Reported-by: syzbot+8971da381fb5a31f542d@syzkaller.appspotmail.com
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Link: https://lore.kernel.org/r/63b6d79b0e830ebb0283e020db4df3cdfdfb2b94.1608142843.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# cc69837f 20-Nov-2020 Jakub Kicinski <kuba@kernel.org>

net: don't include ethtool.h from netdevice.h

linux/netdevice.h is included in very many places, touching any
of its dependecies causes large incremental builds.

Drop the linux/ethtool.h include, linux/netdevice.h just needs
a forward declaration of struct ethtool_ops.

Fix all the places which made use of this implicit include.

Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Shannon Nelson <snelson@pensando.io>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Link: https://lore.kernel.org/r/20201120225052.1427503-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# b5b73b26 09-Sep-2020 Vinicius Costa Gomes <vinicius.gomes@intel.com>

taprio: Fix allowing too small intervals

It's possible that the user specifies an interval that couldn't allow
any packet to be transmitted. This also avoids the issue of the
hrtimer handler starving the other threads because it's running too
often.

The solution is to reject interval sizes that according to the current
link speed wouldn't allow any packet to be transmitted.

Reported-by: syzbot+8267241609ae8c23b248@syzkaller.appspotmail.com
Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 09e31cf0 25-Aug-2020 Vinicius Costa Gomes <vinicius.gomes@intel.com>

taprio: Fix using wrong queues in gate mask

Since commit 9c66d1564676 ("taprio: Add support for hardware
offloading") there's a bit of inconsistency when offloading schedules
to the hardware:

In software mode, the gate masks are specified in terms of traffic
classes, so if say "sched-entry S 03 20000", it means that the traffic
classes 0 and 1 are open for 20us; when taprio is offloaded to
hardware, the gate masks are specified in terms of hardware queues.

The idea here is to fix hardware offloading, so schedules in hardware
and software mode have the same behavior. What's needed to do is to
map traffic classes to queues when applying the offload to the driver.

Fixes: 9c66d1564676 ("taprio: Add support for hardware offloading")
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ac5c66f2 14-Jul-2020 Petr Machata <petrm@mellanox.com>

Revert "net: sched: Pass root lock to Qdisc_ops.enqueue"

This reverts commit aebe4426ccaa4838f36ea805cdf7d76503e65117.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# aebe4426 26-Jun-2020 Petr Machata <petrm@mellanox.com>

net: sched: Pass root lock to Qdisc_ops.enqueue

A following patch introduces qevents, points in qdisc algorithm where
packet can be processed by user-defined filters. Should this processing
lead to a situation where a new packet is to be enqueued on the same port,
holding the root lock would lead to deadlocks. To solve the issue, qevent
handler needs to unlock and relock the root lock when necessary.

To that end, add the root lock argument to the qdisc op enqueue, and
propagate throughout.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 11a33de2 18-Jun-2020 Gustavo A. R. Silva <gustavoars@kernel.org>

taprio: Use struct_size() in kzalloc()

Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes. Also, remove unnecessary
variable _size_.

This code was detected with the help of Coccinelle and, audited and
fixed manually.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b09fe70e 09-Mar-2020 Vinicius Costa Gomes <vinicius.gomes@intel.com>

taprio: Fix sending packets without dequeueing them

There was a bug that was causing packets to be sent to the driver
without first calling dequeue() on the "child" qdisc. And the KASAN
report below shows that sending a packet without calling dequeue()
leads to bad results.

The problem is that when checking the last qdisc "child" we do not set
the returned skb to NULL, which can cause it to be sent to the driver,
and so after the skb is sent, it may be freed, and in some situations a
reference to it may still be in the child qdisc, because it was never
dequeued.

The crash log looks like this:

[ 19.937538] ==================================================================
[ 19.938300] BUG: KASAN: use-after-free in taprio_dequeue_soft+0x620/0x780
[ 19.938968] Read of size 4 at addr ffff8881128628cc by task swapper/1/0
[ 19.939612]
[ 19.939772] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc3+ #97
[ 19.940397] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qe4
[ 19.941523] Call Trace:
[ 19.941774] <IRQ>
[ 19.941985] dump_stack+0x97/0xe0
[ 19.942323] print_address_description.constprop.0+0x3b/0x60
[ 19.942884] ? taprio_dequeue_soft+0x620/0x780
[ 19.943325] ? taprio_dequeue_soft+0x620/0x780
[ 19.943767] __kasan_report.cold+0x1a/0x32
[ 19.944173] ? taprio_dequeue_soft+0x620/0x780
[ 19.944612] kasan_report+0xe/0x20
[ 19.944954] taprio_dequeue_soft+0x620/0x780
[ 19.945380] __qdisc_run+0x164/0x18d0
[ 19.945749] net_tx_action+0x2c4/0x730
[ 19.946124] __do_softirq+0x268/0x7bc
[ 19.946491] irq_exit+0x17d/0x1b0
[ 19.946824] smp_apic_timer_interrupt+0xeb/0x380
[ 19.947280] apic_timer_interrupt+0xf/0x20
[ 19.947687] </IRQ>
[ 19.947912] RIP: 0010:default_idle+0x2d/0x2d0
[ 19.948345] Code: 00 00 41 56 41 55 65 44 8b 2d 3f 8d 7c 7c 41 54 55 53 0f 1f 44 00 00 e8 b1 b2 c5 fd e9 07 00 3
[ 19.950166] RSP: 0018:ffff88811a3efda0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
[ 19.950909] RAX: 0000000080000000 RBX: ffff88811a3a9600 RCX: ffffffff8385327e
[ 19.951608] RDX: 1ffff110234752c0 RSI: 0000000000000000 RDI: ffffffff8385262f
[ 19.952309] RBP: ffffed10234752c0 R08: 0000000000000001 R09: ffffed10234752c1
[ 19.953009] R10: ffffed10234752c0 R11: ffff88811a3a9607 R12: 0000000000000001
[ 19.953709] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[ 19.954408] ? default_idle_call+0x2e/0x70
[ 19.954816] ? default_idle+0x1f/0x2d0
[ 19.955192] default_idle_call+0x5e/0x70
[ 19.955584] do_idle+0x3d4/0x500
[ 19.955909] ? arch_cpu_idle_exit+0x40/0x40
[ 19.956325] ? _raw_spin_unlock_irqrestore+0x23/0x30
[ 19.956829] ? trace_hardirqs_on+0x30/0x160
[ 19.957242] cpu_startup_entry+0x19/0x20
[ 19.957633] start_secondary+0x2a6/0x380
[ 19.958026] ? set_cpu_sibling_map+0x18b0/0x18b0
[ 19.958486] secondary_startup_64+0xa4/0xb0
[ 19.958921]
[ 19.959078] Allocated by task 33:
[ 19.959412] save_stack+0x1b/0x80
[ 19.959747] __kasan_kmalloc.constprop.0+0xc2/0xd0
[ 19.960222] kmem_cache_alloc+0xe4/0x230
[ 19.960617] __alloc_skb+0x91/0x510
[ 19.960967] ndisc_alloc_skb+0x133/0x330
[ 19.961358] ndisc_send_ns+0x134/0x810
[ 19.961735] addrconf_dad_work+0xad5/0xf80
[ 19.962144] process_one_work+0x78e/0x13a0
[ 19.962551] worker_thread+0x8f/0xfa0
[ 19.962919] kthread+0x2ba/0x3b0
[ 19.963242] ret_from_fork+0x3a/0x50
[ 19.963596]
[ 19.963753] Freed by task 33:
[ 19.964055] save_stack+0x1b/0x80
[ 19.964386] __kasan_slab_free+0x12f/0x180
[ 19.964830] kmem_cache_free+0x80/0x290
[ 19.965231] ip6_mc_input+0x38a/0x4d0
[ 19.965617] ipv6_rcv+0x1a4/0x1d0
[ 19.965948] __netif_receive_skb_one_core+0xf2/0x180
[ 19.966437] netif_receive_skb+0x8c/0x3c0
[ 19.966846] br_handle_frame_finish+0x779/0x1310
[ 19.967302] br_handle_frame+0x42a/0x830
[ 19.967694] __netif_receive_skb_core+0xf0e/0x2a90
[ 19.968167] __netif_receive_skb_one_core+0x96/0x180
[ 19.968658] process_backlog+0x198/0x650
[ 19.969047] net_rx_action+0x2fa/0xaa0
[ 19.969420] __do_softirq+0x268/0x7bc
[ 19.969785]
[ 19.969940] The buggy address belongs to the object at ffff888112862840
[ 19.969940] which belongs to the cache skbuff_head_cache of size 224
[ 19.971202] The buggy address is located 140 bytes inside of
[ 19.971202] 224-byte region [ffff888112862840, ffff888112862920)
[ 19.972344] The buggy address belongs to the page:
[ 19.972820] page:ffffea00044a1800 refcount:1 mapcount:0 mapping:ffff88811a2bd1c0 index:0xffff8881128625c0 compo0
[ 19.973930] flags: 0x8000000000010200(slab|head)
[ 19.974388] raw: 8000000000010200 ffff88811a2ed650 ffff88811a2ed650 ffff88811a2bd1c0
[ 19.975151] raw: ffff8881128625c0 0000000000190013 00000001ffffffff 0000000000000000
[ 19.975915] page dumped because: kasan: bad access detected
[ 19.976461] page_owner tracks the page as allocated
[ 19.976946] page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NO)
[ 19.978332] prep_new_page+0x24b/0x330
[ 19.978707] get_page_from_freelist+0x2057/0x2c90
[ 19.979170] __alloc_pages_nodemask+0x218/0x590
[ 19.979619] new_slab+0x9d/0x300
[ 19.979948] ___slab_alloc.constprop.0+0x2f9/0x6f0
[ 19.980421] __slab_alloc.constprop.0+0x30/0x60
[ 19.980870] kmem_cache_alloc+0x201/0x230
[ 19.981269] __alloc_skb+0x91/0x510
[ 19.981620] alloc_skb_with_frags+0x78/0x4a0
[ 19.982043] sock_alloc_send_pskb+0x5eb/0x750
[ 19.982476] unix_stream_sendmsg+0x399/0x7f0
[ 19.982904] sock_sendmsg+0xe2/0x110
[ 19.983262] ____sys_sendmsg+0x4de/0x6d0
[ 19.983660] ___sys_sendmsg+0xe4/0x160
[ 19.984032] __sys_sendmsg+0xab/0x130
[ 19.984396] do_syscall_64+0xe7/0xae0
[ 19.984761] page last free stack trace:
[ 19.985142] __free_pages_ok+0x432/0xbc0
[ 19.985533] qlist_free_all+0x56/0xc0
[ 19.985907] quarantine_reduce+0x149/0x170
[ 19.986315] __kasan_kmalloc.constprop.0+0x9e/0xd0
[ 19.986791] kmem_cache_alloc+0xe4/0x230
[ 19.987182] prepare_creds+0x24/0x440
[ 19.987548] do_faccessat+0x80/0x590
[ 19.987906] do_syscall_64+0xe7/0xae0
[ 19.988276] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 19.988775]
[ 19.988930] Memory state around the buggy address:
[ 19.989402] ffff888112862780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 19.990111] ffff888112862800: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
[ 19.990822] >ffff888112862880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 19.991529] ^
[ 19.992081] ffff888112862900: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
[ 19.992796] ffff888112862980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
Reported-by: Michael Schmidt <michael.schmidt@eti.uni-siegen.de>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Acked-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e13aaa06 02-Mar-2020 Jakub Kicinski <kuba@kernel.org>

net: taprio: add missing attribute validation for txtime delay

Add missing attribute validation for TCA_TAPRIO_ATTR_TXTIME_DELAY
to the netlink policy.

Fixes: 4cfd5779bd6e ("taprio: Add support for txtime-assist mode")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bfabd41d 06-Feb-2020 Vinicius Costa Gomes <vinicius.gomes@intel.com>

taprio: Fix dropping packets when using taprio + ETF offloading

When using taprio offloading together with ETF offloading, configured
like this, for example:

$ tc qdisc replace dev $IFACE parent root handle 100 taprio \
num_tc 4 \
map 2 2 1 0 3 2 2 2 2 2 2 2 2 2 2 2 \
queues 1@0 1@1 1@2 1@3 \
base-time $BASE_TIME \
sched-entry S 01 1000000 \
sched-entry S 0e 1000000 \
flags 0x2

$ tc qdisc replace dev $IFACE parent 100:1 etf \
offload delta 300000 clockid CLOCK_TAI

During enqueue, it works out that the verification added for the
"txtime" assisted mode is run when using taprio + ETF offloading, the
only thing missing is initializing the 'next_txtime' of all the cycle
entries. (if we don't set 'next_txtime' all packets from SO_TXTIME
sockets are dropped)

Fixes: 4cfd5779bd6e ("taprio: Add support for txtime-assist mode")
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7c16680a 06-Feb-2020 Vinicius Costa Gomes <vinicius.gomes@intel.com>

taprio: Use taprio_reset_tc() to reset Traffic Classes configuration

When destroying the current taprio instance, which can happen when the
creation of one fails, we should reset the traffic class configuration
back to the default state.

netdev_reset_tc() is a better way because in addition to setting the
number of traffic classes to zero, it also resets the priority to
traffic classes mapping to the default value.

Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 49c684d7 06-Feb-2020 Vinicius Costa Gomes <vinicius.gomes@intel.com>

taprio: Add missing policy validation for flags

netlink policy validation for the 'flags' argument was missing.

Fixes: 4cfd5779bd6e ("taprio: Add support for txtime-assist mode")
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a9d62274 06-Feb-2020 Vinicius Costa Gomes <vinicius.gomes@intel.com>

taprio: Fix still allowing changing the flags during runtime

Because 'q->flags' starts as zero, and zero is a valid value, we
aren't able to detect the transition from zero to something else
during "runtime".

The solution is to initialize 'q->flags' with an invalid value, so we
can detect if 'q->flags' was set by the user or not.

To better solidify the behavior, 'flags' handling is moved to a
separate function. The behavior is:
- 'flags' if unspecified by the user, is assumed to be zero;
- 'flags' cannot change during "runtime" (i.e. a change() request
cannot modify it);

With this new function we can remove taprio_flags, which should reduce
the risk of future accidents.

Allowing flags to be changed was causing the following RCU stall:

[ 1730.558249] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1730.558258] rcu: 6-...0: (190 ticks this GP) idle=922/0/0x1 softirq=25580/25582 fqs=16250
[ 1730.558264] (detected by 2, t=65002 jiffies, g=33017, q=81)
[ 1730.558269] Sending NMI from CPU 2 to CPUs 6:
[ 1730.559277] NMI backtrace for cpu 6
[ 1730.559277] CPU: 6 PID: 0 Comm: swapper/6 Tainted: G E 5.5.0-rc6+ #35
[ 1730.559278] Hardware name: Gigabyte Technology Co., Ltd. Z390 AORUS ULTRA/Z390 AORUS ULTRA-CF, BIOS F7 03/14/2019
[ 1730.559278] RIP: 0010:__hrtimer_run_queues+0xe2/0x440
[ 1730.559278] Code: 48 8b 43 28 4c 89 ff 48 8b 75 c0 48 89 45 c8 e8 f4 bb 7c 00 0f 1f 44 00 00 65 8b 05 40 31 f0 68 89 c0 48 0f a3 05 3e 5c 25 01 <0f> 82 fc 01 00 00 48 8b 45 c8 48 89 df ff d0 89 45 c8 0f 1f 44 00
[ 1730.559279] RSP: 0018:ffff9970802d8f10 EFLAGS: 00000083
[ 1730.559279] RAX: 0000000000000006 RBX: ffff8b31645bff38 RCX: 0000000000000000
[ 1730.559280] RDX: 0000000000000000 RSI: ffffffff9710f2ec RDI: ffffffff978daf0e
[ 1730.559280] RBP: ffff9970802d8f68 R08: 0000000000000000 R09: 0000000000000000
[ 1730.559280] R10: 0000018336d7944e R11: 0000000000000001 R12: ffff8b316e39f9c0
[ 1730.559281] R13: ffff8b316e39f940 R14: ffff8b316e39f998 R15: ffff8b316e39f7c0
[ 1730.559281] FS: 0000000000000000(0000) GS:ffff8b316e380000(0000) knlGS:0000000000000000
[ 1730.559281] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1730.559281] CR2: 00007f1105303760 CR3: 0000000227210005 CR4: 00000000003606e0
[ 1730.559282] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1730.559282] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1730.559282] Call Trace:
[ 1730.559282] <IRQ>
[ 1730.559283] ? taprio_dequeue_soft+0x2d0/0x2d0 [sch_taprio]
[ 1730.559283] hrtimer_interrupt+0x104/0x220
[ 1730.559283] ? irqtime_account_irq+0x34/0xa0
[ 1730.559283] smp_apic_timer_interrupt+0x6d/0x230
[ 1730.559284] apic_timer_interrupt+0xf/0x20
[ 1730.559284] </IRQ>
[ 1730.559284] RIP: 0010:cpu_idle_poll+0x35/0x1a0
[ 1730.559285] Code: 88 82 ff 65 44 8b 25 12 7d 73 68 0f 1f 44 00 00 e8 90 c3 89 ff fb 65 48 8b 1c 25 c0 7e 01 00 48 8b 03 a8 08 74 0b eb 1c f3 90 <48> 8b 03 a8 08 75 13 8b 05 be a8 a8 00 85 c0 75 ed e8 75 48 84 ff
[ 1730.559285] RSP: 0018:ffff997080137ea8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[ 1730.559285] RAX: 0000000000000001 RBX: ffff8b316bc3c580 RCX: 0000000000000000
[ 1730.559286] RDX: 0000000000000001 RSI: 000000002819aad9 RDI: ffffffff978da730
[ 1730.559286] RBP: ffff997080137ec0 R08: 0000018324a6d387 R09: 0000000000000000
[ 1730.559286] R10: 0000000000000400 R11: 0000000000000001 R12: 0000000000000006
[ 1730.559286] R13: ffff8b316bc3c580 R14: 0000000000000000 R15: 0000000000000000
[ 1730.559287] ? cpu_idle_poll+0x20/0x1a0
[ 1730.559287] ? cpu_idle_poll+0x20/0x1a0
[ 1730.559287] do_idle+0x4d/0x1f0
[ 1730.559287] ? complete+0x44/0x50
[ 1730.559288] cpu_startup_entry+0x1b/0x20
[ 1730.559288] start_secondary+0x142/0x180
[ 1730.559288] secondary_startup_64+0xb6/0xc0
[ 1776.686313] nvme nvme0: I/O 96 QID 1 timeout, completion polled

Fixes: 4cfd5779bd6e ("taprio: Add support for txtime-assist mode")
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5652e63d 06-Feb-2020 Vinicius Costa Gomes <vinicius.gomes@intel.com>

taprio: Fix enabling offload with wrong number of traffic classes

If the driver implementing taprio offloading depends on the value of
the network device number of traffic classes (dev->num_tc) for
whatever reason, it was going to receive the value zero. The value was
only set after the offloading function is called.

So, moving setting the number of traffic classes to before the
offloading function is called fixes this issue. This is safe because
this only happens when taprio is instantiated (we don't allow this
configuration to be changed without first removing taprio).

Fixes: 9c66d1564676 ("taprio: Add support for hardware offloading")
Reported-by: Po Liu <po.liu@nxp.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Acked-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b5a0faa3 18-Nov-2019 Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>

taprio: don't reject same mqprio settings

The taprio qdisc allows to set mqprio setting but only once. In case
if mqprio settings are provided next time the error is returned as
it's not allowed to change traffic class mapping in-flignt and that
is normal. But if configuration is absolutely the same - no need to
return error. It allows to provide same command couple times,
changing only base time for instance, or changing only scheds maps,
but leaving mqprio setting w/o modification. It more corresponds the
message: "Changing the traffic mapping of a running schedule is not
supported", so reject mqprio if it's really changed.

Also corrected TC_BITMASK + 1 for consistency, as proposed.

Fixes: a3d43c0d56f1 ("taprio: Add support adding an admin schedule")
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Tested-by: Vladimir Oltean <olteanv@gmail.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0763b3e8 01-Nov-2019 Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>

taprio: fix panic while hw offload sched list swap

Don't swap oper and admin schedules too early, it's not correct and
causes crash.

Steps to reproduce:

1)
tc qdisc replace dev eth0 parent root handle 100 taprio \
num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
queues 1@0 1@1 1@2 \
base-time $SOME_BASE_TIME \
sched-entry S 01 80000 \
sched-entry S 02 15000 \
sched-entry S 04 40000 \
flags 2

2)
tc qdisc replace dev eth0 parent root handle 100 taprio \
base-time $SOME_BASE_TIME \
sched-entry S 01 90000 \
sched-entry S 02 20000 \
sched-entry S 04 40000 \
flags 2

3)
tc qdisc replace dev eth0 parent root handle 100 taprio \
base-time $SOME_BASE_TIME \
sched-entry S 01 150000 \
sched-entry S 02 200000 \
sched-entry S 04 40000 \
flags 2

Do 2 3 2 .. steps more times if not happens and observe:

[ 305.832319] Unable to handle kernel write to read-only memory at
virtual address ffff0000087ce7f0
[ 305.910887] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
[ 305.919306] Hardware name: Texas Instruments AM654 Base Board (DT)

[...]

[ 306.017119] x1 : ffff800848031d88 x0 : ffff800848031d80
[ 306.022422] Call trace:
[ 306.024866] taprio_free_sched_cb+0x4c/0x98
[ 306.029040] rcu_process_callbacks+0x25c/0x410
[ 306.033476] __do_softirq+0x10c/0x208
[ 306.037132] irq_exit+0xb8/0xc8
[ 306.040267] __handle_domain_irq+0x64/0xb8
[ 306.044352] gic_handle_irq+0x7c/0x178
[ 306.048092] el1_irq+0xb0/0x128
[ 306.051227] arch_cpu_idle+0x10/0x18
[ 306.054795] do_idle+0x120/0x138
[ 306.058015] cpu_startup_entry+0x20/0x28
[ 306.061931] rest_init+0xcc/0xd8
[ 306.065154] start_kernel+0x3bc/0x3e4
[ 306.068810] Code: f2fbd5b7 f2fbd5b6 d503201f f9400422 (f9000662)
[ 306.074900] ---[ end trace 96c8e2284a9d9d6e ]---
[ 306.079507] Kernel panic - not syncing: Fatal exception in interrupt
[ 306.085847] SMP: stopping secondary CPUs
[ 306.089765] Kernel Offset: disabled

Try to explain one of the possible crash cases:

The "real" admin list is assigned when admin_sched is set to
new_admin, it happens after "swap", that assigns to oper_sched NULL.
Thus if call qdisc show it can crash.

Farther, next second time, when sched list is updated, the admin_sched
is not NULL and becomes the oper_sched, previous oper_sched was NULL so
just skipped. But then admin_sched is assigned new_admin, but schedules
to free previous assigned admin_sched (that already became oper_sched).

Farther, next third time, when sched list is updated,
while one more swap, oper_sched is not null, but it was happy to be
freed already (while prev. admin update), so while try to free
oper_sched the kernel panic happens at taprio_free_sched_cb().

So, move the "swap emulation" where it should be according to function
comment from code.

Fixes: 9c66d15646760e ("taprio: Add support for hardware offloading")
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d665c128 21-Oct-2019 Yi Wang <wang.yi59@zte.com.cn>

net: sched: taprio: fix -Wmissing-prototypes warnings

We get one warnings when build kernel W=1:
net/sched/sch_taprio.c:1155:6: warning: no previous prototype for ‘taprio_offload_config_changed’ [-Wmissing-prototypes]

Make the function static to fix this.

Fixes: 9c66d1564676 ("taprio: Add support for hardware offloading")
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>


# a954380a 08-Oct-2019 Vinicius Costa Gomes <vinicius.gomes@intel.com>

net: taprio: Fix returning EINVAL when configuring without flags

When configuring a taprio instance if "flags" is not specified (or
it's zero), taprio currently replies with an "Invalid argument" error.

So, set the return value to zero after we are done with all the
checks.

Fixes: 9c66d1564676 ("taprio: Add support for hardware offloading")
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Acked-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>


# 9a9251a3 28-Sep-2019 Vladimir Oltean <olteanv@gmail.com>

net: sched: taprio: Avoid division by zero on invalid link speed

The check in taprio_set_picos_per_byte is currently not robust enough
and will trigger this division by zero, due to e.g. PHYLINK not setting
kset->base.speed when there is no PHY connected:

[ 27.109992] Division by zero in kernel.
[ 27.113842] CPU: 1 PID: 198 Comm: tc Not tainted 5.3.0-rc5-01246-gc4006b8c2637-dirty #212
[ 27.121974] Hardware name: Freescale LS1021A
[ 27.126234] [<c03132e0>] (unwind_backtrace) from [<c030d8b8>] (show_stack+0x10/0x14)
[ 27.133938] [<c030d8b8>] (show_stack) from [<c10b21b0>] (dump_stack+0xb0/0xc4)
[ 27.141124] [<c10b21b0>] (dump_stack) from [<c10af97c>] (Ldiv0_64+0x8/0x18)
[ 27.148052] [<c10af97c>] (Ldiv0_64) from [<c0700260>] (div64_u64+0xcc/0xf0)
[ 27.154978] [<c0700260>] (div64_u64) from [<c07002d0>] (div64_s64+0x4c/0x68)
[ 27.161993] [<c07002d0>] (div64_s64) from [<c0f3d890>] (taprio_set_picos_per_byte+0xe8/0xf4)
[ 27.170388] [<c0f3d890>] (taprio_set_picos_per_byte) from [<c0f3f614>] (taprio_change+0x668/0xcec)
[ 27.179302] [<c0f3f614>] (taprio_change) from [<c0f2bc24>] (qdisc_create+0x1fc/0x4f4)
[ 27.187091] [<c0f2bc24>] (qdisc_create) from [<c0f2c0c8>] (tc_modify_qdisc+0x1ac/0x6f8)
[ 27.195055] [<c0f2c0c8>] (tc_modify_qdisc) from [<c0ee9604>] (rtnetlink_rcv_msg+0x268/0x2dc)
[ 27.203449] [<c0ee9604>] (rtnetlink_rcv_msg) from [<c0f4fef0>] (netlink_rcv_skb+0xe0/0x114)
[ 27.211756] [<c0f4fef0>] (netlink_rcv_skb) from [<c0f4f6cc>] (netlink_unicast+0x1b4/0x22c)
[ 27.219977] [<c0f4f6cc>] (netlink_unicast) from [<c0f4fa84>] (netlink_sendmsg+0x284/0x340)
[ 27.228198] [<c0f4fa84>] (netlink_sendmsg) from [<c0eae5fc>] (sock_sendmsg+0x14/0x24)
[ 27.235988] [<c0eae5fc>] (sock_sendmsg) from [<c0eaedf8>] (___sys_sendmsg+0x214/0x228)
[ 27.243863] [<c0eaedf8>] (___sys_sendmsg) from [<c0eb015c>] (__sys_sendmsg+0x50/0x8c)
[ 27.251652] [<c0eb015c>] (__sys_sendmsg) from [<c0301000>] (ret_fast_syscall+0x0/0x54)
[ 27.259524] Exception stack(0xe8045fa8 to 0xe8045ff0)
[ 27.264546] 5fa0: b6f608c8 000000f8 00000003 bed7e2f0 00000000 00000000
[ 27.272681] 5fc0: b6f608c8 000000f8 004ce54c 00000128 5d3ce8c7 00000000 00000026 00505c9c
[ 27.280812] 5fe0: 00000070 bed7e298 004ddd64 b6dd1e64

Russell King points out that the ethtool API says zero is a valid return
value of __ethtool_get_link_ksettings:

* If it is enabled then they are read-only; if the link
* is up they represent the negotiated link mode; if the link is down,
* the speed is 0, %SPEED_UNKNOWN or the highest enabled speed and
* @duplex is %DUPLEX_UNKNOWN or the best enabled duplex mode.

So, it seems that taprio is not following the API... I'd suggest either
fixing taprio, or getting agreement to change the ethtool API.

The chosen path was to fix taprio.

Fixes: 7b9eba7ba0c1 ("net/sched: taprio: fix picos_per_byte miscalculation")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 68ce6688 28-Sep-2019 Vladimir Oltean <olteanv@gmail.com>

net: sched: taprio: Fix potential integer overflow in taprio_set_picos_per_byte

The speed divisor is used in a context expecting an s64, but it is
evaluated using 32-bit arithmetic.

To avoid that happening, instead of multiplying by 1,000,000 in the
first place, simplify the fraction and do a standard 32 bit division
instead.

Fixes: f04b514c0ce2 ("taprio: Set default link speed to 10 Mbps in taprio_set_picos_per_byte")
Reported-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9c66d156 14-Sep-2019 Vinicius Costa Gomes <vinicius.gomes@intel.com>

taprio: Add support for hardware offloading

This allows taprio to offload the schedule enforcement to capable
network cards, resulting in more precise windows and less CPU usage.

The gate mask acts on traffic classes (groups of queues of same
priority), as specified in IEEE 802.1Q-2018, and following the existing
taprio and mqprio semantics.
It is up to the driver to perform conversion between tc and individual
netdev queues if for some reason it needs to make that distinction.

Full offload is requested from the network interface by specifying
"flags 2" in the tc qdisc creation command, which in turn corresponds to
the TCA_TAPRIO_ATTR_FLAG_FULL_OFFLOAD bit.

The important detail here is the clockid which is implicitly /dev/ptpN
for full offload, and hence not configurable.

A reference counting API is added to support the use case where Ethernet
drivers need to keep the taprio offload structure locally (i.e. they are
a multi-port switch driver, and configuring a port depends on the
settings of other ports as well). The refcount_t variable is kept in a
private structure (__tc_taprio_qopt_offload) and not exposed to drivers.

In the future, the private structure might also be expanded with a
backpointer to taprio_sched *q, to implement the notification system
described in the patch (of when admin became oper, or an error occurred,
etc, so the offload can be monitored with 'tc qdisc show').

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f04b514c 29-Aug-2019 Vladimir Oltean <olteanv@gmail.com>

taprio: Set default link speed to 10 Mbps in taprio_set_picos_per_byte

The taprio budget needs to be adapted at runtime according to interface
link speed. But that handling is problematic.

For one thing, installing a qdisc on an interface that doesn't have
carrier is not illegal. But taprio prints the following stack trace:

[ 31.851373] ------------[ cut here ]------------
[ 31.856024] WARNING: CPU: 1 PID: 207 at net/sched/sch_taprio.c:481 taprio_dequeue+0x1a8/0x2d4
[ 31.864566] taprio: dequeue() called with unknown picos per byte.
[ 31.864570] Modules linked in:
[ 31.873701] CPU: 1 PID: 207 Comm: tc Not tainted 5.3.0-rc5-01199-g8838fe023cd6 #1689
[ 31.881398] Hardware name: Freescale LS1021A
[ 31.885661] [<c03133a4>] (unwind_backtrace) from [<c030d8cc>] (show_stack+0x10/0x14)
[ 31.893368] [<c030d8cc>] (show_stack) from [<c10ac958>] (dump_stack+0xb4/0xc8)
[ 31.900555] [<c10ac958>] (dump_stack) from [<c0349d04>] (__warn+0xe0/0xf8)
[ 31.907395] [<c0349d04>] (__warn) from [<c0349d64>] (warn_slowpath_fmt+0x48/0x6c)
[ 31.914841] [<c0349d64>] (warn_slowpath_fmt) from [<c0f38db4>] (taprio_dequeue+0x1a8/0x2d4)
[ 31.923150] [<c0f38db4>] (taprio_dequeue) from [<c0f227b0>] (__qdisc_run+0x90/0x61c)
[ 31.930856] [<c0f227b0>] (__qdisc_run) from [<c0ec82ac>] (net_tx_action+0x12c/0x2bc)
[ 31.938560] [<c0ec82ac>] (net_tx_action) from [<c0302298>] (__do_softirq+0x130/0x3c8)
[ 31.946350] [<c0302298>] (__do_softirq) from [<c03502a0>] (irq_exit+0xbc/0xd8)
[ 31.953536] [<c03502a0>] (irq_exit) from [<c03a4808>] (__handle_domain_irq+0x60/0xb4)
[ 31.961328] [<c03a4808>] (__handle_domain_irq) from [<c0754478>] (gic_handle_irq+0x58/0x9c)
[ 31.969638] [<c0754478>] (gic_handle_irq) from [<c0301a8c>] (__irq_svc+0x6c/0x90)
[ 31.977076] Exception stack(0xe8167b20 to 0xe8167b68)
[ 31.982100] 7b20: e9d4bd80 00000cc0 000000cf 00000000 e9d4bd80 c1f38958 00000cc0 c1f38960
[ 31.990234] 7b40: 00000001 000000cf 00000004 e9dc0800 00000000 e8167b70 c0f478ec c0f46d94
[ 31.998363] 7b60: 60070013 ffffffff
[ 32.001833] [<c0301a8c>] (__irq_svc) from [<c0f46d94>] (netlink_trim+0x18/0xd8)
[ 32.009104] [<c0f46d94>] (netlink_trim) from [<c0f478ec>] (netlink_broadcast_filtered+0x34/0x414)
[ 32.017930] [<c0f478ec>] (netlink_broadcast_filtered) from [<c0f47cec>] (netlink_broadcast+0x20/0x28)
[ 32.027102] [<c0f47cec>] (netlink_broadcast) from [<c0eea378>] (rtnetlink_send+0x34/0x88)
[ 32.035238] [<c0eea378>] (rtnetlink_send) from [<c0f25890>] (notify_and_destroy+0x2c/0x44)
[ 32.043461] [<c0f25890>] (notify_and_destroy) from [<c0f25e08>] (qdisc_graft+0x398/0x470)
[ 32.051595] [<c0f25e08>] (qdisc_graft) from [<c0f27a00>] (tc_modify_qdisc+0x3a4/0x724)
[ 32.059470] [<c0f27a00>] (tc_modify_qdisc) from [<c0ee4c84>] (rtnetlink_rcv_msg+0x260/0x2ec)
[ 32.067864] [<c0ee4c84>] (rtnetlink_rcv_msg) from [<c0f4a988>] (netlink_rcv_skb+0xb8/0x110)
[ 32.076172] [<c0f4a988>] (netlink_rcv_skb) from [<c0f4a170>] (netlink_unicast+0x1b4/0x22c)
[ 32.084392] [<c0f4a170>] (netlink_unicast) from [<c0f4a5e4>] (netlink_sendmsg+0x33c/0x380)
[ 32.092614] [<c0f4a5e4>] (netlink_sendmsg) from [<c0ea9f40>] (sock_sendmsg+0x14/0x24)
[ 32.100403] [<c0ea9f40>] (sock_sendmsg) from [<c0eaa780>] (___sys_sendmsg+0x214/0x228)
[ 32.108279] [<c0eaa780>] (___sys_sendmsg) from [<c0eabad0>] (__sys_sendmsg+0x50/0x8c)
[ 32.116068] [<c0eabad0>] (__sys_sendmsg) from [<c0301000>] (ret_fast_syscall+0x0/0x54)
[ 32.123938] Exception stack(0xe8167fa8 to 0xe8167ff0)
[ 32.128960] 7fa0: b6fa68c8 000000f8 00000003 bea142d0 00000000 00000000
[ 32.137093] 7fc0: b6fa68c8 000000f8 0052154c 00000128 5d6468a2 00000000 00000028 00558c9c
[ 32.145224] 7fe0: 00000070 bea14278 00530d64 b6e17e64
[ 32.150659] ---[ end trace 2139c9827c3e5177 ]---

This happens because the qdisc ->dequeue callback gets called. Which
again is not illegal, the qdisc will dequeue even when the interface is
up but doesn't have carrier (and hence SPEED_UNKNOWN), and the frames
will be dropped further down the stack in dev_direct_xmit().

And, at the end of the day, for what? For calculating the initial budget
of an interface which is non-operational at the moment and where frames
will get dropped anyway.

So if we can't figure out the link speed, default to SPEED_10 and move
along. We can also remove the runtime check now.

Cc: Leandro Dorileo <leandro.maciel.dorileo@intel.com>
Fixes: 7b9eba7ba0c1 ("net/sched: taprio: fix picos_per_byte miscalculation")
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# efb55222 29-Aug-2019 Vladimir Oltean <olteanv@gmail.com>

taprio: Fix kernel panic in taprio_destroy

taprio_init may fail earlier than this line:

list_add(&q->taprio_list, &taprio_list);

i.e. due to the net device not being multi queue.

Attempting to remove q from the global taprio_list when it is not part
of it will result in a kernel panic.

Fix it by matching list_add and list_del better to one another in the
order of operations. This way we can keep the deletion unconditional
and with lower complexity - O(1).

Cc: Leandro Dorileo <leandro.maciel.dorileo@intel.com>
Fixes: 7b9eba7ba0c1 ("net/sched: taprio: fix picos_per_byte miscalculation")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ca497fb6 08-Aug-2019 YueHaibing <yuehaibing@huawei.com>

taprio: remove unused variable 'entry_list_policy'

net/sched/sch_taprio.c:680:32: warning:
entry_list_policy defined but not used [-Wunused-const-variable=]

One of the points of commit a3d43c0d56f1 ("taprio: Add support adding
an admin schedule") is that it removes support (it now returns "not
supported") for schedules using the TCA_TAPRIO_ATTR_SCHED_SINGLE_ENTRY
attribute (which were never used), the parsing of those types of schedules
was the only user of this policy. So removing this policy should be fine.

Reported-by: Hulk Robot <hulkci@huawei.com>
Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 51650d33 06-Aug-2019 Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>

net: sched: sch_taprio: fix memleak in error path for sched list parse

In error case, all entries should be freed from the sched list
before deleting it. For simplicity use rcu way.

Fixes: 5a781ccbd19e46 ("tc: Add support for configuring the taprio scheduler")
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a5b64700 16-Jul-2019 Vedang Patel <vedang.patel@intel.com>

fix: taprio: Change type of txtime-delay parameter to u32

During the review of the iproute2 patches for txtime-assist mode, it was
pointed out that it does not make sense for the txtime-delay parameter to
be negative. So, change the type of the parameter from s32 to u32.

Fixes: 4cfd5779bd6e ("taprio: Add support for txtime-assist mode")
Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Vedang Patel <vedang.patel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 54002066 25-Jun-2019 Vedang Patel <vedang.patel@intel.com>

taprio: Adjust timestamps for TCP packets

When the taprio qdisc is running in "txtime offload" mode, it will
set the launchtime value (in skb->tstamp) for all the packets which do
not have the SO_TXTIME socket option. But, the TCP packets already have
this value set and it indicates the earliest departure time represented
in CLOCK_MONOTONIC clock.

We need to respect the timestamp set by the TCP subsystem. So, convert
this time to the clock which taprio is using and ensure that the packet
is not transmitted before the deadline set by TCP.

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7ede7b03 25-Jun-2019 Vedang Patel <vedang.patel@intel.com>

taprio: make clock reference conversions easier

Later in this series we will need to transform from
CLOCK_MONOTONIC (used in TCP) to the clock reference used in TAPRIO.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vedang Patel <vedang.patel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4cfd5779 25-Jun-2019 Vedang Patel <vedang.patel@intel.com>

taprio: Add support for txtime-assist mode

Currently, we are seeing non-critical packets being transmitted outside of
their timeslice. We can confirm that the packets are being dequeued at the
right time. So, the delay is induced in the hardware side. The most likely
reason is the hardware queues are starving the lower priority queues.

In order to improve the performance of taprio, we will be making use of the
txtime feature provided by the ETF qdisc. For all the packets which do not
have the SO_TXTIME option set, taprio will set the transmit timestamp (set
in skb->tstamp) in this mode. TAPrio Qdisc will ensure that the transmit
time for the packet is set to when the gate is open. If SO_TXTIME is set,
the TAPrio qdisc will validate whether the timestamp (in skb->tstamp)
occurs when the gate corresponding to skb's traffic class is open.

Following two parameters added to support this mode:
- flags: used to enable txtime-assist mode. Will also be used to enable
other modes (like hardware offloading) later.
- txtime-delay: This indicates the minimum time it will take for the packet
to hit the wire. This is useful in determining whether we can transmit
the packet in the remaining time if the gate corresponding to the packet is
currently open.

An example configuration for enabling txtime-assist:

tc qdisc replace dev eth0 parent root handle 100 taprio \\
num_tc 3 \\
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \\
queues 1@0 1@0 1@0 \\
base-time 1558653424279842568 \\
sched-entry S 01 300000 \\
sched-entry S 02 300000 \\
sched-entry S 04 400000 \\
flags 0x1 \\
txtime-delay 40000 \\
clockid CLOCK_TAI

tc qdisc replace dev $IFACE parent 100:1 etf skip_sock_check \\
offload delta 200000 clockid CLOCK_TAI

Note that all the traffic classes are mapped to the same queue. This is
only possible in taprio when txtime-assist is enabled. Also, note that the
ETF Qdisc is enabled with offload mode set.

In this mode, if the packet's traffic class is open and the complete packet
can be transmitted, taprio will try to transmit the packet immediately.
This will be done by setting skb->tstamp to current_time + the time delta
indicated in the txtime-delay parameter. This parameter indicates the time
taken (in software) for packet to reach the network adapter.

If the packet cannot be transmitted in the current interval or if the
packet's traffic is not currently transmitting, the skb->tstamp is set to
the next available timestamp value. This is tracked in the next_launchtime
parameter in the struct sched_entry.

The behaviour w.r.t admin and oper schedules is not changed from what is
present in software mode.

The transmit time is already known in advance. So, we do not need the HR
timers to advance the schedule and wakeup the dequeue side of taprio. So,
HR timer won't be run when this mode is enabled.

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 566af331 25-Jun-2019 Vedang Patel <vedang.patel@intel.com>

taprio: Remove inline directive

Remove inline directive from length_to_duration(). We will let the compiler
make the decisions.

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 037be037 25-Jun-2019 Vedang Patel <vedang.patel@intel.com>

taprio: calculate cycle_time when schedule is installed

cycle time for a particular schedule is calculated only when it is first
installed. So, it makes sense to just calculate it once right after the
'cycle_time' parameter has been parsed and store it in cycle_time.

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e4acf427 05-May-2019 Colin Ian King <colin.king@canonical.com>

taprio: add null check on sched_nest to avoid potential null pointer dereference

The call to nla_nest_start_noflag can return a null pointer and currently
this is not being checked and this can lead to a null pointer dereference
when the null pointer sched_nest is passed to function nla_nest_end. Fix
this by adding in a null pointer check.

Addresses-Coverity: ("Dereference null return value")
Fixes: a3d43c0d56f1 ("taprio: Add support adding an admin schedule")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c25031e9 29-Apr-2019 Vinicius Costa Gomes <vinicius.gomes@intel.com>

taprio: Add support for cycle-time-extension

IEEE 802.1Q-2018 defines the concept of a cycle-time-extension, so the
last entry of a schedule before the start of a new schedule can be
extended, so "too-short" entries can be avoided.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6ca6a665 29-Apr-2019 Vinicius Costa Gomes <vinicius.gomes@intel.com>

taprio: Add support for setting the cycle-time manually

IEEE 802.1Q-2018 defines that a the cycle-time of a schedule may be
overridden, so the schedule is truncated to a determined "width".

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a3d43c0d 29-Apr-2019 Vinicius Costa Gomes <vinicius.gomes@intel.com>

taprio: Add support adding an admin schedule

The IEEE 802.1Q-2018 defines two "types" of schedules, the "Oper" (from
operational?) and "Admin" ones. Up until now, 'taprio' only had
support for the "Oper" one, added when the qdisc is created. This adds
support for the "Admin" one, which allows the .change() operation to
be supported.

Just for clarification, some quick (and dirty) definitions, the "Oper"
schedule is the currently (as in this instant) running one, and it's
read-only. The "Admin" one is the one that the system configurator has
installed, it can be changed, and it will be "promoted" to "Oper" when
it's 'base-time' is reached.

The idea behing this patch is that calling something like the below,
(after taprio is already configured with an initial schedule):

$ tc qdisc change taprio dev IFACE parent root \
base-time X \
sched-entry <CMD> <GATES> <INTERVAL> \
...

Will cause a new admin schedule to be created and programmed to be
"promoted" to "Oper" at instant X. If an "Admin" schedule already
exists, it will be overwritten with the new parameters.

Up until now, there was some code that was added to ease the support
of changing a single entry of a schedule, but was ultimately unused.
Now, that we have support for "change" with more well thought
semantics, updating a single entry seems to be less useful.

So we remove what is in practice dead code, and return a "not
supported" error if the user tries to use it. If changing a single
entry would make the user's life easier we may ressurrect this idea,
but at this point, removing it simplifies the code.

For now, only the schedule specific bits are allowed to be added for a
new schedule, that means that 'clockid', 'num_tc', 'map' and 'queues'
cannot be modified.

Example:

$ tc qdisc change dev IFACE parent root handle 100 taprio \
base-time $BASE_TIME \
sched-entry S 00 500000 \
sched-entry S 0f 500000 \
clockid CLOCK_TAI

The only change in the netlink API introduced by this change is the
introduction of an "admin" type in the response to a dump request,
that type allows userspace to separate the "oper" schedule from the
"admin" schedule. If userspace doesn't support the "admin" type, it
will only display the "oper" schedule.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8c79f0ea 29-Apr-2019 Vinicius Costa Gomes <vinicius.gomes@intel.com>

taprio: Fix potencial use of invalid memory during dequeue()

Right now, this isn't a problem, but the next commit allows schedules
to be added during runtime. When a new schedule transitions from the
inactive to the active state ("admin" -> "oper") the previous one can
be freed, if it's freed just after the RCU read lock is released, we
may access an invalid entry.

So, we should take care to protect the dequeue() flow, so all the
places that access the entries are protected by the RCU read lock.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8cb08174 26-Apr-2019 Johannes Berg <johannes.berg@intel.com>

netlink: make validation more configurable for future strictness

We currently have two levels of strict validation:

1) liberal (default)
- undefined (type >= max) & NLA_UNSPEC attributes accepted
- attribute length >= expected accepted
- garbage at end of message accepted
2) strict (opt-in)
- NLA_UNSPEC attributes accepted
- attribute length >= expected accepted

Split out parsing strictness into four different options:
* TRAILING - check that there's no trailing data after parsing
attributes (in message or nested)
* MAXTYPE - reject attrs > max known type
* UNSPEC - reject attributes with NLA_UNSPEC policy entries
* STRICT_ATTRS - strictly validate attribute size

The default for future things should be *everything*.
The current *_strict() is a combination of TRAILING and MAXTYPE,
and is renamed to _deprecated_strict().
The current regular parsing has none of this, and is renamed to
*_parse_deprecated().

Additionally it allows us to selectively set one of the new flags
even on old policies. Notably, the UNSPEC flag could be useful in
this case, since it can be arranged (by filling in the policy) to
not be an incompatible userspace ABI change, but would then going
forward prevent forgetting attribute entries. Similar can apply
to the POLICY flag.

We end up with the following renames:
* nla_parse -> nla_parse_deprecated
* nla_parse_strict -> nla_parse_deprecated_strict
* nlmsg_parse -> nlmsg_parse_deprecated
* nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
* nla_parse_nested -> nla_parse_nested_deprecated
* nla_validate_nested -> nla_validate_nested_deprecated

Using spatch, of course:
@@
expression TB, MAX, HEAD, LEN, POL, EXT;
@@
-nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
+nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

@@
expression NLH, HDRLEN, TB, MAX, POL, EXT;
@@
-nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
+nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

@@
expression NLH, HDRLEN, TB, MAX, POL, EXT;
@@
-nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
+nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

@@
expression TB, MAX, NLA, POL, EXT;
@@
-nla_parse_nested(TB, MAX, NLA, POL, EXT)
+nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

@@
expression START, MAX, POL, EXT;
@@
-nla_validate_nested(START, MAX, POL, EXT)
+nla_validate_nested_deprecated(START, MAX, POL, EXT)

@@
expression NLH, HDRLEN, MAX, POL, EXT;
@@
-nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
+nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

For this patch, don't actually add the strict, non-renamed versions
yet so that it breaks compile if I get it wrong.

Also, while at it, make nla_validate and nla_parse go down to a
common __nla_validate_parse() function to avoid code duplication.

Ultimately, this allows us to have very strict validation for every
new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
next patch, while existing things will continue to work as is.

In effect then, this adds fully strict validation for any new command.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ae0be8de 26-Apr-2019 Michal Kubecek <mkubecek@suse.cz>

netlink: make nla_nest_start() add NLA_F_NESTED flag

Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
netlink based interfaces (including recently added ones) are still not
setting it in kernel generated messages. Without the flag, message parsers
not aware of attribute semantics (e.g. wireshark dissector or libmnl's
mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
the structure of their contents.

Unfortunately we cannot just add the flag everywhere as there may be
userspace applications which check nlattr::nla_type directly rather than
through a helper masking out the flags. Therefore the patch renames
nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
are rewritten to use nla_nest_start().

Except for changes in include/net/netlink.h, the patch was generated using
this semantic patch:

@@ expression E1, E2; @@
-nla_nest_start(E1, E2)
+nla_nest_start_noflag(E1, E2)

@@ expression E1, E2; @@
-nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
+nla_nest_start(E1, E2)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6e734c82 23-Apr-2019 Andre Guedes <andre.guedes@intel.com>

net: sched: taprio: Fix taprio_dequeue()

In case we don't have 'guard' or 'budget' to transmit the skb, we should
continue traversing the qdisc list since the remaining guard/budget
might be enough to transmit a skb from other children qdiscs.

Fixes: 5a781ccbd19e (“tc: Add support for configuring the taprio scheduler”)
Signed-off-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2684d1b7 23-Apr-2019 Andre Guedes <andre.guedes@intel.com>

net: sched: taprio: Fix taprio_peek()

While traversing taprio's children qdisc list, if the gate is closed for
a given traffic class, we should continue traversing the list since the
remaining qdiscs may have skb ready for transmission.

This patch also takes this opportunity and changes the function to use
the TAPRIO_ALL_GATES_OPEN macro instead of the magic number '-1'.

Fixes: 5a781ccbd19e (“tc: Add support for configuring the taprio scheduler”)
Signed-off-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5175aafe 23-Apr-2019 Andre Guedes <andre.guedes@intel.com>

net: sched: taprio: Remove should_restart_cycle()

The 'entry' argument from should_restart_cycle() cannot be NULL since it
is already checked by the caller so the WARN_ON() within should_
restart_cycle() could be removed. By doing that, that function becomes
a dummy wrapper on list_is_last() so this patch simply gets rid of it
and call list_is_last() within advance_sched() instead.

Signed-off-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8599099f 23-Apr-2019 Andre Guedes <andre.guedes@intel.com>

net: sched: taprio: Refactor taprio_get_start_time()

This patch does a code refactoring to taprio_get_start_time() function
to improve readability and report error properly.

If 'base' time is later than 'now', the start time is equal to 'base'
and taprio_get_start_time() is done. That's the natural case so we move
that code to the beginning of the function. Also, if 'cycle' calculation
is zero, something went really wrong with taprio and we should log that
internal error properly.

Signed-off-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 59ab87f6 23-Apr-2019 Andre Guedes <andre.guedes@intel.com>

net: sched: taprio: Remove pointless variable assigment

This patch removes a pointless variable assigment in taprio_change().
The 'err' variable is not used from this assignment to the next one so
this patch removes it.

Signed-off-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 23bddf69 17-Apr-2019 Jakub Kicinski <kuba@kernel.org>

net/sched: taprio: fix build without 64bit div

Recent changes to taprio did not use the correct div64 helpers,
leading to:

net/sched/sch_taprio.o: In function `taprio_dequeue':
sch_taprio.c:(.text+0x34a): undefined reference to `__divdi3'
net/sched/sch_taprio.o: In function `advance_sched':
sch_taprio.c:(.text+0xa0b): undefined reference to `__divdi3'
net/sched/sch_taprio.o: In function `taprio_init':
sch_taprio.c:(.text+0x1450): undefined reference to `__divdi3'
/home/jkicinski/devel/linux/Makefile:1032: recipe for target 'vmlinux' failed

Use math64 helpers.

Fixes: 7b9eba7ba0c1 ("net/sched: taprio: fix picos_per_byte miscalculation")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7b9eba7b 08-Apr-2019 Leandro Dorileo <leandro.maciel.dorileo@intel.com>

net/sched: taprio: fix picos_per_byte miscalculation

The Time Aware Priority Scheduler is heavily dependent to link speed,
it relies on it to calculate transmission bytes per cycle, we can't
properly calculate the so called budget if the device has failed
to report the link speed.

In that case we can't dequeue packets assuming a wrong budget.
This patch makes sure we fail to dequeue case:

1) __ethtool_get_link_ksettings() reports error or 2) the ethernet
driver failed to set the ksettings' speed value (setting link speed
to SPEED_UNKNOWN).

Additionally we re calculate the budget whenever the link speed is
changed.

Fixes: 5a781ccbd19e4 ("tc: Add support for configuring the taprio scheduler")
Signed-off-by: Leandro Dorileo <leandro.maciel.dorileo@intel.com>
Reviewed-by: Vedang Patel <vedang.patel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5dd431b6 28-Mar-2019 Paolo Abeni <pabeni@redhat.com>

net: sched: introduce and use qstats read helpers

Classful qdiscs can't access directly the child qdiscs backlog
length: if such qdisc is NOLOCK, per CPU values should be
accounted instead.

Most qdiscs no not respect the above. As a result, qstats fetching
for most classful qdisc is currently incorrect: if the child qdisc is
NOLOCK, it always reports 0 len backlog.

This change introduces a pair of helpers to safely fetch
both backlog and qlen and use them in stats class dumping
functions, fixing the above issue and cleaning a bit the code.

DRR needs also to access the child qdisc queue length, so it
needs custom handling.

Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5a781ccb 28-Sep-2018 Vinicius Costa Gomes <vinicius.gomes@intel.com>

tc: Add support for configuring the taprio scheduler

This traffic scheduler allows traffic classes states (transmission
allowed/not allowed, in the simplest case) to be scheduled, according
to a pre-generated time sequence. This is the basis of the IEEE
802.1Qbv specification.

Example configuration:

tc qdisc replace dev enp3s0 parent root handle 100 taprio \
num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
queues 1@0 1@1 2@2 \
base-time 1528743495910289987 \
sched-entry S 01 300000 \
sched-entry S 02 300000 \
sched-entry S 04 300000 \
clockid CLOCK_TAI

The configuration format is similar to mqprio. The main difference is
the presence of a schedule, built by multiple "sched-entry"
definitions, each entry has the following format:

sched-entry <CMD> <GATE MASK> <INTERVAL>

The only supported <CMD> is "S", which means "SetGateStates",
following the IEEE 802.1Qbv-2015 definition (Table 8-6). <GATE MASK>
is a bitmask where each bit is a associated with a traffic class, so
bit 0 (the least significant bit) being "on" means that traffic class
0 is "active" for that schedule entry. <INTERVAL> is a time duration
in nanoseconds that specifies for how long that state defined by <CMD>
and <GATE MASK> should be held before moving to the next entry.

This schedule is circular, that is, after the last entry is executed
it starts from the first one, indefinitely.

The other parameters can be defined as follows:

- base-time: specifies the instant when the schedule starts, if
'base-time' is a time in the past, the schedule will start at

base-time + (N * cycle-time)

where N is the smallest integer so the resulting time is greater
than "now", and "cycle-time" is the sum of all the intervals of the
entries in the schedule;

- clockid: specifies the reference clock to be used;

The parameters should be similar to what the IEEE 802.1Q family of
specification defines.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>