History log of /linux-master/net/sched/sch_codel.c
Revision Date Author Comments
# 32c7eec2 11-Feb-2024 Stephen Hemminger <stephen@networkplumber.org>

net: sched: codel replace GPLv2/BSD boilerplate

The prologue to codel is using BSD-3 clause and GPL-2 boiler plate
language. Replace it by using SPDX. The automated treewide scan in
commit d2912cb15bdd ("treewide: Replace GPLv2 boilerplate/reference with
SPDX - rule 500") did not pickup dual licensed code.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Dave Taht <dave.taht@gmail.com>
Link: https://lore.kernel.org/r/20240211172532.6568-1-stephen@networkplumber.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# 241a94ab 01-Feb-2024 Michal Koutný <mkoutny@suse.com>

net/sched: Add module aliases for cls_,sch_,act_ modules

No functional change intended, aliases will be used in followup commits.
Note for backporters: you may need to add aliases also for modules that
are already removed in mainline kernel but still in your version.

Patches were generated with the help of Coccinelle scripts like:

cat >scripts/coccinelle/misc/tcf_alias.cocci <<EOD
virtual patch
virtual report

@ haskernel @
@@

@ tcf_has_kind depends on report && haskernel @
identifier ops;
constant K;
@@

static struct tcf_proto_ops ops = {
.kind = K,
...
};
+char module_alias = K;
EOD

/usr/bin/spatch -D report --cocci-file scripts/coccinelle/misc/tcf_alias.cocci \
--dir . \
-I ./arch/x86/include -I ./arch/x86/include/generated -I ./include \
-I ./arch/x86/include/uapi -I ./arch/x86/include/generated/uapi \
-I ./include/uapi -I ./include/generated/uapi \
--include ./include/linux/compiler-version.h --include ./include/linux/kconfig.h \
--jobs 8 --chunksize 1 2>/dev/null | \
sed 's/char module_alias = "\([^"]*\)";/MODULE_ALIAS_NET_CLS("\1");/'

And analogously for:

static struct tc_action_ops ops = {
.kind = K,

static struct Qdisc_ops ops = {
.id = K,

(Someone familiar would be able to fit those into one .cocci file
without sed post processing.)

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240201130943.19536-3-mkoutny@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# a102c897 29-Aug-2022 Zhengchao Shao <shaozhengchao@huawei.com>

net: sched: remove redundant NULL check in change hook function

Currently, the change function can be called by two ways. The one way is
that qdisc_change() will call it. Before calling change function,
qdisc_change() ensures tca[TCA_OPTIONS] is not empty. The other way is
that .init() will call it. The opt parameter is also checked before
calling change function in .init(). Therefore, it's no need to check the
input parameter opt in change function.

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://lore.kernel.org/r/20220829071219.208646-1-shaozhengchao@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# ac5c66f2 14-Jul-2020 Petr Machata <petrm@mellanox.com>

Revert "net: sched: Pass root lock to Qdisc_ops.enqueue"

This reverts commit aebe4426ccaa4838f36ea805cdf7d76503e65117.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# aebe4426 26-Jun-2020 Petr Machata <petrm@mellanox.com>

net: sched: Pass root lock to Qdisc_ops.enqueue

A following patch introduces qevents, points in qdisc algorithm where
packet can be processed by user-defined filters. Should this processing
lead to a situation where a new packet is to be enqueued on the same port,
holding the root lock would lead to deadlocks. To solve the issue, qevent
handler needs to unlock and relock the root lock when necessary.

To that end, add the root lock argument to the qdisc op enqueue, and
propagate throughout.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 051c7b39 29-Jul-2019 Jia-Ju Bai <baijiaju1990@gmail.com>

net: sched: Fix a possible null-pointer dereference in dequeue_func()

In dequeue_func(), there is an if statement on line 74 to check whether
skb is NULL:
if (skb)

When skb is NULL, it is used on line 77:
prefetch(&skb->end);

Thus, a possible null-pointer dereference may occur.

To fix this bug, skb->end is used when skb is not NULL.

This bug is found by a static analysis tool STCheck written by us.

Fixes: 76e3cc126bb2 ("codel: Controlled Delay AQM")
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8cb08174 26-Apr-2019 Johannes Berg <johannes.berg@intel.com>

netlink: make validation more configurable for future strictness

We currently have two levels of strict validation:

1) liberal (default)
- undefined (type >= max) & NLA_UNSPEC attributes accepted
- attribute length >= expected accepted
- garbage at end of message accepted
2) strict (opt-in)
- NLA_UNSPEC attributes accepted
- attribute length >= expected accepted

Split out parsing strictness into four different options:
* TRAILING - check that there's no trailing data after parsing
attributes (in message or nested)
* MAXTYPE - reject attrs > max known type
* UNSPEC - reject attributes with NLA_UNSPEC policy entries
* STRICT_ATTRS - strictly validate attribute size

The default for future things should be *everything*.
The current *_strict() is a combination of TRAILING and MAXTYPE,
and is renamed to _deprecated_strict().
The current regular parsing has none of this, and is renamed to
*_parse_deprecated().

Additionally it allows us to selectively set one of the new flags
even on old policies. Notably, the UNSPEC flag could be useful in
this case, since it can be arranged (by filling in the policy) to
not be an incompatible userspace ABI change, but would then going
forward prevent forgetting attribute entries. Similar can apply
to the POLICY flag.

We end up with the following renames:
* nla_parse -> nla_parse_deprecated
* nla_parse_strict -> nla_parse_deprecated_strict
* nlmsg_parse -> nlmsg_parse_deprecated
* nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
* nla_parse_nested -> nla_parse_nested_deprecated
* nla_validate_nested -> nla_validate_nested_deprecated

Using spatch, of course:
@@
expression TB, MAX, HEAD, LEN, POL, EXT;
@@
-nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
+nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

@@
expression NLH, HDRLEN, TB, MAX, POL, EXT;
@@
-nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
+nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

@@
expression NLH, HDRLEN, TB, MAX, POL, EXT;
@@
-nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
+nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

@@
expression TB, MAX, NLA, POL, EXT;
@@
-nla_parse_nested(TB, MAX, NLA, POL, EXT)
+nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

@@
expression START, MAX, POL, EXT;
@@
-nla_validate_nested(START, MAX, POL, EXT)
+nla_validate_nested_deprecated(START, MAX, POL, EXT)

@@
expression NLH, HDRLEN, MAX, POL, EXT;
@@
-nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
+nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

For this patch, don't actually add the strict, non-renamed versions
yet so that it breaks compile if I get it wrong.

Also, while at it, make nla_validate and nla_parse go down to a
common __nla_validate_parse() function to avoid code duplication.

Ultimately, this allows us to have very strict validation for every
new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
next patch, while existing things will continue to work as is.

In effect then, this adds fully strict validation for any new command.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ae0be8de 26-Apr-2019 Michal Kubecek <mkubecek@suse.cz>

netlink: make nla_nest_start() add NLA_F_NESTED flag

Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
netlink based interfaces (including recently added ones) are still not
setting it in kernel generated messages. Without the flag, message parsers
not aware of attribute semantics (e.g. wireshark dissector or libmnl's
mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
the structure of their contents.

Unfortunately we cannot just add the flag everywhere as there may be
userspace applications which check nlattr::nla_type directly rather than
through a helper masking out the flags. Therefore the patch renames
nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
are rewritten to use nla_nest_start().

Except for changes in include/net/netlink.h, the patch was generated using
this semantic patch:

@@ expression E1, E2; @@
-nla_nest_start(E1, E2)
+nla_nest_start_noflag(E1, E2)

@@ expression E1, E2; @@
-nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
+nla_nest_start(E1, E2)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2030721c 19-Dec-2017 Alexander Aring <aring@mojatatu.com>

net: sched: sch: add extack for change qdisc ops

This patch adds extack support for change callback for qdisc ops
structtur to prepare per-qdisc specific changes for extack.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e63d7dfd 19-Dec-2017 Alexander Aring <aring@mojatatu.com>

net: sched: sch: add extack for init callback

This patch adds extack support for init callback to prepare per-qdisc
specific changes for extack.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fceb6435 12-Apr-2017 Johannes Berg <johannes.berg@intel.com>

netlink: pass extended ACK struct to parsing functions

Pass the new extended ACK reporting struct to all of the generic
netlink parsing functions. For now, pass NULL in almost all callers
(except for some in the core.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ed760cb8 17-Sep-2016 Florian Westphal <fw@strlen.de>

sched: replace __skb_dequeue with __qdisc_dequeue_head

After previous patch these functions are identical.
Replace __skb_dequeue in qdiscs with __qdisc_dequeue_head.

Next patch will then make __qdisc_dequeue_head handle
single-linked list instead of strcut sk_buff_head argument.

Doesn't change generated code.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 520ac30f 22-Jun-2016 Eric Dumazet <edumazet@google.com>

net_sched: drop packets after root qdisc lock is released

Qdisc performance suffers when packets are dropped at enqueue()
time because drops (kfree_skb()) are done while qdisc lock is held,
delaying a dequeue() draining the queue.

Nominal throughput can be reduced by 50 % when this happens,
at a time we would like the dequeue() to proceed as fast as possible.

Even FQ is vulnerable to this problem, while one of FQ goals was
to provide some flow isolation.

This patch adds a 'struct sk_buff **to_free' parameter to all
qdisc->enqueue(), and in qdisc_drop() helper.

I measured a performance increase of up to 12 %, but this patch
is a prereq so that future batches in enqueue() can fly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b3d7e2b2 13-Jun-2016 Eric Dumazet <edumazet@google.com>

net_sched: sch_codel: defer skb freeing in codel_change()

codel_change() can use rtnl_qdisc_drop()
to defer expensive skb freeing after locks are released.

codel_reset() already has support for deferred skb freeing
because it uses qdisc_reset_queue()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d068ca2a 22-Apr-2016 Michal Kazior <michal.kazior@tieto.com>

codel: split into multiple files

It was impossible to include codel.h for the
purpose of having access to codel_params or
codel_vars structure definitions and using them
for embedding in other more complex structures.

This splits allows codel.h itself to be treated
like any other header file while codel_qdisc.h and
codel_impl.h contain function definitions with
logic that was previously in codel.h.

This copies over copyrights and doesn't involve
code changes other than adding a few additional
include directives to net/sched/sch*codel.c.

Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 79bdc4c8 22-Apr-2016 Michal Kazior <michal.kazior@tieto.com>

codel: generalize the implementation

This strips out qdisc specific bits from the code
and makes it slightly more reusable. Codel will be
used by wireless/mac80211 in the future.

Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2ccccf5f 25-Feb-2016 WANG Cong <xiyou.wangcong@gmail.com>

net_sched: update hierarchical backlog too

When the bottom qdisc decides to, for example, drop some packet,
it calls qdisc_tree_decrease_qlen() to update the queue length
for all its ancestors, we need to update the backlog too to
keep the stats on root qdisc accurate.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 80ba92fa 08-May-2015 Eric Dumazet <edumazet@google.com>

codel: add ce_threshold attribute

For DCTCP or similar ECN based deployments on fabrics with shallow
buffers, hosts are responsible for a good part of the buffering.

This patch adds an optional ce_threshold to codel & fq_codel qdiscs,
so that DCTCP can have feedback from queuing in the host.

A DCTCP enabled egress port simply have a queue occupancy threshold
above which ECT packets get CE mark.

In codel language this translates to a sojourn time, so that one doesn't
have to worry about bytes or bandwidth but delays.

This makes the host an active participant in the health of the whole
network.

This also helps experimenting DCTCP in a setup without DCTCP compliant
fabric.

On following example, ce_threshold is set to 1ms, and we can see from
'ldelay xxx us' that TCP is not trying to go around the 5ms codel
target.

Queue has more capacity to absorb inelastic bursts (say from UDP
traffic), as queues are maintained to an optimal level.

lpaa23:~# ./tc -s -d qd sh dev eth1
qdisc mq 1: dev eth1 root
Sent 87910654696 bytes 58065331 pkt (dropped 0, overlimits 0 requeues 42961)
backlog 3108242b 364p requeues 42961
qdisc codel 8063: dev eth1 parent 1:1 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
Sent 7363778701 bytes 4863809 pkt (dropped 0, overlimits 0 requeues 5503)
rate 2348Mbit 193919pps backlog 255866b 46p requeues 5503
count 0 lastcount 0 ldelay 1.0ms drop_next 0us
maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 72384
qdisc codel 8064: dev eth1 parent 1:2 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
Sent 7636486190 bytes 5043942 pkt (dropped 0, overlimits 0 requeues 5186)
rate 2319Mbit 191538pps backlog 207418b 64p requeues 5186
count 0 lastcount 0 ldelay 694us drop_next 0us
maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 69873
qdisc codel 8065: dev eth1 parent 1:3 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
Sent 11569360142 bytes 7641602 pkt (dropped 0, overlimits 0 requeues 5554)
rate 3041Mbit 251096pps backlog 210446b 59p requeues 5554
count 0 lastcount 0 ldelay 889us drop_next 0us
maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 37780
...

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Glenn Judd <glenn.judd@morganstanley.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a5d28090 30-Apr-2015 Eric Dumazet <edumazet@google.com>

codel: fix maxpacket/mtu confusion

Under presence of TSO/GSO/GRO packets, codel at low rates can be quite
useless. In following example, not a single packet was ever dropped,
while average delay in codel queue is ~100 ms !

qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms
Sent 134376498 bytes 88797 pkt (dropped 0, overlimits 0 requeues 0)
backlog 13626b 3p requeues 0
count 0 lastcount 0 ldelay 96.9ms drop_next 0us
maxpacket 9084 ecn_mark 0 drop_overlimit 0

This comes from a confusion of what should be the minimal backlog. It is
pretty clear it is not 64KB or whatever max GSO packet ever reached the
qdisc.

codel intent was to use MTU of the device.

After the fix, we finally drop some packets, and rtt/cwnd of my single
TCP flow are meeting our expectations.

qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms
Sent 102798497 bytes 67912 pkt (dropped 1365, overlimits 0 requeues 0)
backlog 6056b 3p requeues 0
count 1 lastcount 1 ldelay 36.3ms drop_next 0us
maxpacket 10598 ecn_mark 0 drop_overlimit 0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kathleen Nichols <nichols@pollere.com>
Cc: Dave Taht <dave.taht@gmail.com>
Cc: Van Jacobson <vanj@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 25331d6c 28-Sep-2014 John Fastabend <john.fastabend@gmail.com>

net: sched: implement qstat helper routines

This adds helpers to manipulate qstats logic and replaces locations
that touch the counters directly. This simplifies future patches
to push qstats onto per cpu counters.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 865ec552 15-May-2012 Eric Dumazet <edumazet@google.com>

fq_codel: should use qdisc backlog as threshold

codel_should_drop() logic allows a packet being not dropped if queue
size is under max packet size.

In fq_codel, we have two possible backlogs : The qdisc global one, and
the flow local one.

The meaningful one for codel_should_drop() should be the global backlog,
not the per flow one, so that thin flows can have a non zero drop/mark
probability.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Kathleen Nichols <nichols@pollere.com>
Cc: Van Jacobson <van@pollere.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ce5b4b97 14-May-2012 Geert Uytterhoeven <geert@linux-m68k.org>

net/codel: Add missing #include <linux/prefetch.h>

m68k allmodconfig:

net/sched/sch_codel.c: In function ‘dequeue’:
net/sched/sch_codel.c:70: error: implicit declaration of function ‘prefetch’
make[1]: *** [net/sched/sch_codel.o] Error 1

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 76e3cc12 10-May-2012 Eric Dumazet <edumazet@google.com>

codel: Controlled Delay AQM

An implementation of CoDel AQM, from Kathleen Nichols and Van Jacobson.

http://queue.acm.org/detail.cfm?id=2209336

This AQM main input is no longer queue size in bytes or packets, but the
delay packets stay in (FIFO) queue.

As we don't have infinite memory, we still can drop packets in enqueue()
in case of massive load, but mean of CoDel is to drop packets in
dequeue(), using a control law based on two simple parameters :

target : target sojourn time (default 5ms)
interval : width of moving time window (default 100ms)

Based on initial work from Dave Taht.

Refactored to help future codel inclusion as a plugin for other linux
qdisc (FQ_CODEL, ...), like RED.

include/net/codel.h contains codel algorithm as close as possible than
Kathleen reference.

net/sched/sch_codel.c contains the linux qdisc specific glue.

Separate structures permit a memory efficient implementation of fq_codel
(to be sent as a separate work) : Each flow has its own struct
codel_vars.

timestamps are taken at enqueue() time with 1024 ns precision, allowing
a range of 2199 seconds in queue, and 100Gb links support. iproute2 uses
usec as base unit.

Selected packets are dropped, unless ECN is enabled and packets can get
ECN mark instead.

Tested from 2Mb to 10Gb speeds with no particular problems, on ixgbe and
tg3 drivers (BQL enabled).

Usage: tc qdisc ... codel [ limit PACKETS ] [ target TIME ]
[ interval TIME ] [ ecn ]

qdisc codel 10: parent 1:1 limit 2000p target 3.0ms interval 60.0ms ecn
Sent 13347099587 bytes 8815805 pkt (dropped 0, overlimits 0 requeues 0)
rate 202365Kbit 16708pps backlog 113550b 75p requeues 0
count 116 lastcount 98 ldelay 4.3ms dropping drop_next 816us
maxpacket 1514 ecn_mark 84399 drop_overlimit 0

CoDel must be seen as a base module, and should be used keeping in mind
there is still a FIFO queue. So a typical setup will probably need a
hierarchy of several qdiscs and packet classifiers to be able to meet
whatever constraints a user might have.

One possible example would be to use fq_codel, which combines Fair
Queueing and CoDel, in replacement of sfq / sfq_red.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Dave Taht <dave.taht@bufferbloat.net>
Cc: Kathleen Nichols <nichols@pollere.com>
Cc: Van Jacobson <van@pollere.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>