History log of /linux-master/net/
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
fc0c0548 16-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'net-5.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf.

Current release - regressions:

- vhost_net: fix OoB on sendmsg() failure

- mlx5: bridge, fix uninitialized variable usage

- bnxt_en: fix error recovery regression

Current release - new code bugs:

- bpf, mm: fix lockdep warning triggered by stack_map_get_build_id_offset()

Previous releases - regressions:

- r6040: restore MDIO clock frequency after MAC reset

- tcp: fix tp->undo_retrans accounting in tcp_sacktag_one()

- dsa: flush switchdev workqueue before tearing down CPU/DSA ports

Previous releases - always broken:

- ptp: dp83640: don't define PAGE0, avoid compiler warning

- igc: fix tunnel segmentation offloads

- phylink: update SFP selected interface on advertising changes

- stmmac: fix system hang caused by eee_ctrl_timer during suspend/resume

- mlx5e: fix mutual exclusion between CQE compression and HW TS

Misc:

- bpf, cgroups: fix cgroup v2 fallback on v1/v2 mixed mode

- sfc: fallback for lack of xdp tx queues

- hns3: add option to turn off page pool feature"

* tag 'net-5.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (67 commits)
mlxbf_gige: clear valid_polarity upon open
igc: fix tunnel offloading
net/{mlx5|nfp|bnxt}: Remove unnecessary RTNL lock assert
net: wan: wanxl: define CROSS_COMPILE_M68K
selftests: nci: replace unsigned int with int
net: dsa: flush switchdev workqueue before tearing down CPU/DSA ports
Revert "net: phy: Uniform PHY driver access"
net: dsa: destroy the phylink instance on any error in dsa_slave_phy_setup
ptp: dp83640: don't define PAGE0
bnx2x: Fix enabling network interfaces without VFs
Revert "Revert "ipv4: fix memory leaks in ip_cmsg_send() callers""
tcp: fix tp->undo_retrans accounting in tcp_sacktag_one()
net-caif: avoid user-triggerable WARN_ON(1)
bpf, selftests: Add test case for mixed cgroup v1/v2
bpf, selftests: Add cgroup v1 net_cls classid helpers
bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode
bpf: Add oversize check before call kvcalloc()
net: hns3: fix the timing issue of VF clearing interrupt sources
net: hns3: fix the exception when query imp info
net: hns3: disable mac in flr process
...


a57d8c21 14-Sep-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: flush switchdev workqueue before tearing down CPU/DSA ports

Sometimes when unbinding the mv88e6xxx driver on Turris MOX, these error
messages appear:

mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 1 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 0 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 100 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 1 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 0 from fdb: -2

(and similarly for other ports)

What happens is that DSA has a policy "even if there are bugs, let's at
least not leak memory" and dsa_port_teardown() clears the dp->fdbs and
dp->mdbs lists, which are supposed to be empty.

But deleting that cleanup code, the warnings go away.

=> the FDB and MDB lists (used for refcounting on shared ports, aka CPU
and DSA ports) will eventually be empty, but are not empty by the time
we tear down those ports. Aka we are deleting them too soon.

The addresses that DSA complains about are host-trapped addresses: the
local addresses of the ports, and the MAC address of the bridge device.

The problem is that offloading those entries happens from a deferred
work item scheduled by the SWITCHDEV_FDB_DEL_TO_DEVICE handler, and this
races with the teardown of the CPU and DSA ports where the refcounting
is kept.

In fact, not only it races, but fundamentally speaking, if we iterate
through the port list linearly, we might end up tearing down the shared
ports even before we delete a DSA user port which has a bridge upper.

So as it turns out, we need to first tear down the user ports (and the
unused ones, for no better place of doing that), then the shared ports
(the CPU and DSA ports). In between, we need to ensure that all work
items scheduled by our switchdev handlers (which only run for user
ports, hence the reason why we tear them down first) have finished.

Fixes: 161ca59d39e9 ("net: dsa: reference count the MDB entries at the cross-chip notifier level")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210914134726.2305133-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

6a52e733 14-Sep-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: destroy the phylink instance on any error in dsa_slave_phy_setup

DSA supports connecting to a phy-handle, and has a fallback to a non-OF
based method of connecting to an internal PHY on the switch's own MDIO
bus, if no phy-handle and no fixed-link nodes were present.

The -ENODEV error code from the first attempt (phylink_of_phy_connect)
is what triggers the second attempt (phylink_connect_phy).

However, when the first attempt returns a different error code than
-ENODEV, this results in an unbalance of calls to phylink_create and
phylink_destroy by the time we exit the function. The phylink instance
has leaked.

There are many other error codes that can be returned by
phylink_of_phy_connect. For example, phylink_validate returns -EINVAL.
So this is a practical issue too.

Fixes: aab9c4067d23 ("net: dsa: Plug in PHYLINK support")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20210914134331.2303380-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

d198b277 13-Sep-2021 Eric Dumazet <edumazet@google.com>

Revert "Revert "ipv4: fix memory leaks in ip_cmsg_send() callers""

This reverts commit d7807a9adf4856171f8441f13078c33941df48ab.

As mentioned in https://lkml.org/lkml/2021/9/13/1819
5 years old commit 919483096bfe ("ipv4: fix memory leaks in ip_cmsg_send() callers")
was a correct fix.

ip_cmsg_send() can loop over multiple cmsghdr()

If IP_RETOPTS has been successful, but following cmsghdr generates an error,
we do not free ipc.ok

If IP_RETOPTS is not successful, we have freed the allocated temporary space,
not the one currently in ipc.opt.

Sure, code could be refactored, but let's not bring back old bugs.

Fixes: d7807a9adf48 ("Revert "ipv4: fix memory leaks in ip_cmsg_send() callers"")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>

4f884f39 13-Sep-2021 zhenggy <zhenggy@chinatelecom.cn>

tcp: fix tp->undo_retrans accounting in tcp_sacktag_one()

Commit 10d3be569243 ("tcp-tso: do not split TSO packets at retransmit
time") may directly retrans a multiple segments TSO/GSO packet without
split, Since this commit, we can no longer assume that a retransmitted
packet is a single segment.

This patch fixes the tp->undo_retrans accounting in tcp_sacktag_one()
that use the actual segments(pcount) of the retransmitted packet.

Before that commit (10d3be569243), the assumption underlying the
tp->undo_retrans-- seems correct.

Fixes: 10d3be569243 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: zhenggy <zhenggy@chinatelecom.cn>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2865ba82 14-Sep-2021 David S. Miller <davem@davemloft.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2021-09-14

The following pull-request contains BPF updates for your *net* tree.

We've added 7 non-merge commits during the last 13 day(s) which contain
a total of 18 files changed, 334 insertions(+), 193 deletions(-).

The main changes are:

1) Fix mmap_lock lockdep splat in BPF stack map's build_id lookup, from Yonghong Song.

2) Fix BPF cgroup v2 program bypass upon net_cls/prio activation, from Daniel Borkmann.

3) Fix kvcalloc() BTF line info splat on oversized allocation attempts, from Bixuan Cui.

4) Fix BPF selftest build of task_pt_regs test for arm64/s390, from Jean-Philippe Brucker.

5) Fix BPF's disasm.{c,h} to dual-license so that it is aligned with bpftool given the former
is a build dependency for the latter, from Daniel Borkmann with ACKs from contributors.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>


550ac9c1 13-Sep-2021 Eric Dumazet <edumazet@google.com>

net-caif: avoid user-triggerable WARN_ON(1)

syszbot triggers this warning, which looks something
we can easily prevent.

If we initialize priv->list_field in chnl_net_init(),
then always use list_del_init(), we can remove robust_list_del()
completely.

WARNING: CPU: 0 PID: 3233 at net/caif/chnl_net.c:67 robust_list_del net/caif/chnl_net.c:67 [inline]
WARNING: CPU: 0 PID: 3233 at net/caif/chnl_net.c:67 chnl_net_uninit+0xc9/0x2e0 net/caif/chnl_net.c:375
Modules linked in:
CPU: 0 PID: 3233 Comm: syz-executor.3 Not tainted 5.14.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:robust_list_del net/caif/chnl_net.c:67 [inline]
RIP: 0010:chnl_net_uninit+0xc9/0x2e0 net/caif/chnl_net.c:375
Code: 89 eb e8 3a a3 ba f8 48 89 d8 48 c1 e8 03 42 80 3c 28 00 0f 85 bf 01 00 00 48 81 fb 00 14 4e 8d 48 8b 2b 75 d0 e8 17 a3 ba f8 <0f> 0b 5b 5d 41 5c 41 5d e9 0a a3 ba f8 4c 89 e3 e8 02 a3 ba f8 4c
RSP: 0018:ffffc90009067248 EFLAGS: 00010202
RAX: 0000000000008780 RBX: ffffffff8d4e1400 RCX: ffffc9000fd34000
RDX: 0000000000040000 RSI: ffffffff88bb6e49 RDI: 0000000000000003
RBP: ffff88802cd9ee08 R08: 0000000000000000 R09: ffffffff8d0e6647
R10: ffffffff88bb6dc2 R11: 0000000000000000 R12: ffff88803791ae08
R13: dffffc0000000000 R14: 00000000e600ffce R15: ffff888073ed3480
FS: 00007fed10fa0700(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2c322000 CR3: 00000000164a6000 CR4: 00000000001506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
register_netdevice+0xadf/0x1500 net/core/dev.c:10347
ipcaif_newlink+0x4c/0x260 net/caif/chnl_net.c:468
__rtnl_newlink+0x106d/0x1750 net/core/rtnetlink.c:3458
rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3506
rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5572
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504
netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340
netlink_sendmsg+0x86d/0xdb0 net/netlink/af_netlink.c:1929
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:724
__sys_sendto+0x21c/0x320 net/socket.c:2036
__do_sys_sendto net/socket.c:2048 [inline]
__se_sys_sendto net/socket.c:2044 [inline]
__x64_sys_sendto+0xdd/0x1b0 net/socket.c:2044
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: cc36a070b590 ("net-caif: add CAIF netdevice")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8520e224 13-Sep-2021 Daniel Borkmann <daniel@iogearbox.net>

bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode

Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
Back in the days, commit bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
embedded per-socket cgroup information into sock->sk_cgrp_data and in order
to save 8 bytes in struct sock made both mutually exclusive, that is, when
cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2
falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).

The assumption made was "there is no reason to mix the two and this is in line
with how legacy and v2 compatibility is handled" as stated in bd1060a1d671.
However, with Kubernetes more widely supporting cgroups v2 as well nowadays,
this assumption no longer holds, and the possibility of the v1/v2 mixed mode
with the v2 root fallback being hit becomes a real security issue.

Many of the cgroup v2 BPF programs are also used for policy enforcement, just
to pick _one_ example, that is, to programmatically deny socket related system
calls like connect(2) or bind(2). A v2 root fallback would implicitly cause
a policy bypass for the affected Pods.

In production environments, we have recently seen this case due to various
circumstances: i) a different 3rd party agent and/or ii) a container runtime
such as [0] in the user's environment configuring legacy cgroup v1 net_cls
tags, which triggered implicitly mentioned root fallback. Another case is
Kubernetes projects like kind [1] which create Kubernetes nodes in a container
and also add cgroup namespaces to the mix, meaning programs which are attached
to the cgroup v2 root of the cgroup namespace get attached to a non-root
cgroup v2 path from init namespace point of view. And the latter's root is
out of reach for agents on a kind Kubernetes node to configure. Meaning, any
entity on the node setting cgroup v1 net_cls tag will trigger the bypass
despite cgroup v2 BPF programs attached to the namespace root.

Generally, this mutual exclusiveness does not hold anymore in today's user
environments and makes cgroup v2 usage from BPF side fragile and unreliable.
This fix adds proper struct cgroup pointer for the cgroup v2 case to struct
sock_cgroup_data in order to address these issues; this implicitly also fixes
the tradeoffs being made back then with regards to races and refcount leaks
as stated in bd1060a1d671, and removes the fallback, so that cgroup v2 BPF
programs always operate as expected.

[0] https://github.com/nestybox/sysbox/
[1] https://kind.sigs.k8s.io/

Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20210913230759.2313-1-daniel@iogearbox.net

e87b5052 09-Sep-2021 zhang kai <zhangkaiheb@126.com>

ipv6: delay fib6_sernum increase in fib6_add

only increase fib6_sernum in net namespace after add fib6_info
successfully.

Signed-off-by: zhang kai <zhangkaiheb@126.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

f4bb62e6 13-Sep-2021 Hoang Le <hoang.h.le@dektech.com.au>

tipc: increase timeout in tipc_sk_enqueue()

In tipc_sk_enqueue() we use hardcoded 2 jiffies to extract
socket buffer from generic queue to particular socket.
The 2 jiffies is too short in case there are other high priority
tasks get CPU cycles for multiple jiffies update. As result, no
buffer could be enqueued to particular socket.

To solve this, we switch to use constant timeout 20msecs.
Then, the function will be expired between 2 jiffies (CONFIG_100HZ)
and 20 jiffies (CONFIG_1000HZ).

Fixes: c637c1035534 ("tipc: resolve race problem at unicast message reception")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

e50e7113 13-Sep-2021 Aya Levin <ayal@nvidia.com>

udp_tunnel: Fix udp_tunnel_nic work-queue type

Turn udp_tunnel_nic work-queue to an ordered work-queue. This queue
holds the UDP-tunnel configuration commands of the different netdevs.
When the netdevs are functions of the same NIC the order of
execution may be crucial.

Problem example:
NIC with 2 PFs, both PFs declare offload quota of up to 3 UDP-ports.
$ifconfig eth2 1.1.1.1/16 up

$ip link add eth2_19503 type vxlan id 5049 remote 1.1.1.2 dev eth2 dstport 19053
$ip link set dev eth2_19503 up

$ip link add eth2_19504 type vxlan id 5049 remote 1.1.1.3 dev eth2 dstport 19054
$ip link set dev eth2_19504 up

$ip link add eth2_19505 type vxlan id 5049 remote 1.1.1.4 dev eth2 dstport 19055
$ip link set dev eth2_19505 up

$ip link add eth2_19506 type vxlan id 5049 remote 1.1.1.5 dev eth2 dstport 19056
$ip link set dev eth2_19506 up

NIC RX port offload infrastructure offloads the first 3 UDP-ports (on
all devices which sets NETIF_F_RX_UDP_TUNNEL_PORT feature) and not
UDP-port 19056. So both PFs gets this offload configuration.

$ip link set dev eth2_19504 down

This triggers udp-tunnel-core to remove the UDP-port 19504 from
offload-ports-list and offload UDP-port 19056 instead.

In this scenario it is important that the UDP-port of 19504 will be
removed from both PFs before trying to add UDP-port 19056. The NIC can
stop offloading a UDP-port only when all references are removed.
Otherwise the NIC may report exceeding of the offload quota.

Fixes: cc4e3835eff4 ("udp_tunnel: add central NIC RX port offload infrastructure")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d7807a9a 12-Sep-2021 Yajun Deng <yajun.deng@linux.dev>

Revert "ipv4: fix memory leaks in ip_cmsg_send() callers"

This reverts commit 919483096bfe75dda338e98d56da91a263746a0a.

There is only when ip_options_get() return zero need to free.
It already called kfree() when return error.

Fixes: 919483096bfe ("ipv4: fix memory leaks in ip_cmsg_send() callers")
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>

78e70952 11-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:

- vduse driver ("vDPA Device in Userspace") supporting emulated virtio
block devices

- virtio-vsock support for end of record with SEQPACKET

- vdpa: mac and mq support for ifcvf and mlx5

- vdpa: management netlink for ifcvf

- virtio-i2c, gpio dt bindings

- misc fixes and cleanups

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (39 commits)
Documentation: Add documentation for VDUSE
vduse: Introduce VDUSE - vDPA Device in Userspace
vduse: Implement an MMU-based software IOTLB
vdpa: Support transferring virtual addressing during DMA mapping
vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
vhost-iotlb: Add an opaque pointer for vhost IOTLB
vhost-vdpa: Handle the failure of vdpa_reset()
vdpa: Add reset callback in vdpa_config_ops
vdpa: Fix some coding style issues
file: Export receive_fd() to modules
eventfd: Export eventfd_wake_count to modules
iova: Export alloc_iova_fast() and free_iova_fast()
virtio-blk: remove unneeded "likely" statements
virtio-balloon: Use virtio_find_vqs() helper
vdpa: Make use of PFN_PHYS/PFN_UP/PFN_DOWN helper macro
vsock_test: update message bounds test for MSG_EOR
af_vsock: rename variables in receive loop
virtio/vsock: support MSG_EOR bit processing
vhost/vsock: support MSG_EOR bit processing
...


dc41c4a9 09-Sep-2021 Baruch Siach <baruch@tkos.co.il>

net/packet: clarify source of pr_*() messages

Add pr_fmt macro to spell out the source of messages in prefix.

Before this patch:

packet size is too long (1543 > 1518)

With this patch:

af_packet: packet size is too long (1543 > 1518)

Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>

9b6ff7eb 08-Sep-2021 Xiyu Yang <xiyuyang19@fudan.edu.cn>

net/l2tp: Fix reference count leak in l2tp_udp_recv_core

The reference count leak issue may take place in an error handling
path. If both conditions of tunnel->version == L2TP_HDR_VER_3 and the
return value of l2tp_v3_ensure_opt_in_linear is nonzero, the function
would directly jump to label invalid, without decrementing the reference
count of the l2tp_session object session increased earlier by
l2tp_tunnel_get_session(). This may result in refcount leaks.

Fix this issue by decrease the reference count before jumping to the
label invalid.

Fixes: 4522a70db7aa ("l2tp: fix reading optional fields of L2TPv3")
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Xiong <xiongx18@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

04f08eb4 08-Sep-2021 Eric Dumazet <edumazet@google.com>

net/af_unix: fix a data-race in unix_dgram_poll

syzbot reported another data-race in af_unix [1]

Lets change __skb_insert() to use WRITE_ONCE() when changing
skb head qlen.

Also, change unix_dgram_poll() to use lockless version
of unix_recvq_full()

It is verry possible we can switch all/most unix_recvq_full()
to the lockless version, this will be done in a future kernel version.

[1] HEAD commit: 8596e589b787732c8346f0482919e83cc9362db1

BUG: KCSAN: data-race in skb_queue_tail / unix_dgram_poll

write to 0xffff88814eeb24e0 of 4 bytes by task 25815 on cpu 0:
__skb_insert include/linux/skbuff.h:1938 [inline]
__skb_queue_before include/linux/skbuff.h:2043 [inline]
__skb_queue_tail include/linux/skbuff.h:2076 [inline]
skb_queue_tail+0x80/0xa0 net/core/skbuff.c:3264
unix_dgram_sendmsg+0xff2/0x1600 net/unix/af_unix.c:1850
sock_sendmsg_nosec net/socket.c:703 [inline]
sock_sendmsg net/socket.c:723 [inline]
____sys_sendmsg+0x360/0x4d0 net/socket.c:2392
___sys_sendmsg net/socket.c:2446 [inline]
__sys_sendmmsg+0x315/0x4b0 net/socket.c:2532
__do_sys_sendmmsg net/socket.c:2561 [inline]
__se_sys_sendmmsg net/socket.c:2558 [inline]
__x64_sys_sendmmsg+0x53/0x60 net/socket.c:2558
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

read to 0xffff88814eeb24e0 of 4 bytes by task 25834 on cpu 1:
skb_queue_len include/linux/skbuff.h:1869 [inline]
unix_recvq_full net/unix/af_unix.c:194 [inline]
unix_dgram_poll+0x2bc/0x3e0 net/unix/af_unix.c:2777
sock_poll+0x23e/0x260 net/socket.c:1288
vfs_poll include/linux/poll.h:90 [inline]
ep_item_poll fs/eventpoll.c:846 [inline]
ep_send_events fs/eventpoll.c:1683 [inline]
ep_poll fs/eventpoll.c:1798 [inline]
do_epoll_wait+0x6ad/0xf00 fs/eventpoll.c:2226
__do_sys_epoll_wait fs/eventpoll.c:2238 [inline]
__se_sys_epoll_wait fs/eventpoll.c:2233 [inline]
__x64_sys_epoll_wait+0xf6/0x120 fs/eventpoll.c:2233
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

value changed: 0x0000001b -> 0x00000001

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 25834 Comm: syz-executor.1 Tainted: G W 5.14.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: 86b18aaa2b5b ("skbuff: fix a data race in skb_queue_len()")
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

14e2bc4e 08-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'nfsd-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

- Restore performance on memory-starved servers

* tag 'nfsd-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
SUNRPC: improve error response to over-size gss credential
SUNRPC: don't pause on incomplete allocation


34c59da4 08-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag '9p-for-5.15-rc1' of git://github.com/martinetd/linux

Pull 9p updates from Dominique Martinet:
"A couple of harmless fixes, increase max tcp msize (64KB -> 1MB), and
increase default msize (8KB -> 128KB)

The default increase has been discussed with Christian for the qemu
side of things but makes sense for all supported transports"

* tag '9p-for-5.15-rc1' of git://github.com/martinetd/linux:
net/9p: increase default msize to 128k
net/9p: use macro to define default msize
net/9p: increase tcp max msize to 1MB
9p/xen: Fix end of loop tests for list_for_each_entry
9p/trans_virtio: Remove sysfs file on probe failure


581edcd0 07-Sep-2021 Jeremy Kerr <jk@codeconstruct.com.au>

mctp: perform route destruction under RCU read lock

The kernel test robot reports:

[ 843.509974][ T345] =============================
[ 843.524220][ T345] WARNING: suspicious RCU usage
[ 843.538791][ T345] 5.14.0-rc2-00606-g889b7da23abf #1 Not tainted
[ 843.553617][ T345] -----------------------------
[ 843.567412][ T345] net/mctp/route.c:310 RCU-list traversed in non-reader section!!

- we're missing the rcu read lock acquire around the destruction path.

This change adds the acquire/release - the path is already atomic, and
we're using the _rcu list iterators.

Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

d9ea761f 07-Sep-2021 Lin, Zhenpeng <zplin@psu.edu>

dccp: don't duplicate ccid when cloning dccp sock

Commit 2677d2067731 ("dccp: don't free ccid2_hc_tx_sock ...") fixed
a UAF but reintroduced CVE-2017-6074.

When the sock is cloned, two dccps_hc_tx_ccid will reference to the
same ccid. So one can free the ccid object twice from two socks after
cloning.

This issue was found by "Hadar Manor" as well and assigned with
CVE-2020-16119, which was fixed in Ubuntu's kernel. So here I port
the patch from Ubuntu to fix it.

The patch prevents cloned socks from referencing the same ccid.

Fixes: 2677d2067731410 ("dccp: don't free ccid2_hc_tx_sock ...")
Signed-off-by: Zhenpeng Lin <zplin@psu.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>

626bf91a 07-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'net-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes and stragglers from Jakub Kicinski:
"Networking stragglers and fixes, including changes from netfilter,
wireless and can.

Current release - regressions:

- qrtr: revert check in qrtr_endpoint_post(), fixes audio and wifi

- ip_gre: validate csum_start only on pull

- bnxt_en: fix 64-bit doorbell operation on 32-bit kernels

- ionic: fix double use of queue-lock, fix a sleeping in atomic

- can: c_can: fix null-ptr-deref on ioctl()

- cs89x0: disable compile testing on powerpc

Current release - new code bugs:

- bridge: mcast: fix vlan port router deadlock, consistently disable
BH

Previous releases - regressions:

- dsa: tag_rtl4_a: fix egress tags, only port 0 was working

- mptcp: fix possible divide by zero

- netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex

- netfilter: socket: icmp6: fix use-after-scope

- stmmac: fix MAC not working when system resume back with WoL active

Previous releases - always broken:

- ip/ip6_gre: use the same logic as SIT interfaces when computing
v6LL address

- seg6: set fc_nlinfo in nh_create_ipv4, nh_create_ipv6

- mptcp: only send extra TCP acks in eligible socket states

- dsa: lantiq_gswip: fix maximum frame length

- stmmac: fix overall budget calculation for rxtx_napi

- bnxt_en: fix firmware version reporting via devlink

- renesas: sh_eth: add missing barrier to fix freeing wrong tx
descriptor

Stragglers:

- netfilter: conntrack: switch to siphash

- netfilter: refuse insertion if chain has grown too large

- ncsi: add get MAC address command to get Intel i210 MAC address"

* tag 'net-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
ieee802154: Remove redundant initialization of variable ret
net: stmmac: fix MAC not working when system resume back with WoL active
net: phylink: add suspend/resume support
net: renesas: sh_eth: Fix freeing wrong tx descriptor
bonding: 3ad: pass parameter bond_params by reference
cxgb3: fix oops on module removal
can: c_can: fix null-ptr-deref on ioctl()
can: rcar_canfd: add __maybe_unused annotation to silence warning
net: wwan: iosm: Unify IO accessors used in the driver
net: wwan: iosm: Replace io.*64_lo_hi() with regular accessors
net: qcom/emac: Replace strlcpy with strscpy
ip6_gre: Revert "ip6_gre: add validation for csum_start"
net: hns3: make hclgevf_cmd_caps_bit_map0 and hclge_cmd_caps_bit_map0 static
selftests/bpf: Test XDP bonding nest and unwind
bonding: Fix negative jump label count on nested bonding
MAINTAINERS: add VM SOCKETS (AF_VSOCK) entry
stmmac: dwmac-loongson:Fix missing return value
iwlwifi: fix printk format warnings in uefi.c
net: create netdev->dev_addr assignment helpers
bnxt_en: Fix possible unintended driver initiated error recovery
...


0f77f2de 07-Sep-2021 Colin Ian King <colin.king@canonical.com>

ieee802154: Remove redundant initialization of variable ret

The variable ret is being initialized with a value that is never read, it
is being updated later on. The assignment is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fe63339e 06-Sep-2021 Willem de Bruijn <willemb@google.com>

ip6_gre: Revert "ip6_gre: add validation for csum_start"

This reverts commit 9cf448c200ba9935baa94e7a0964598ce947db9d.

This commit was added for equivalence with a similar fix to ip_gre.
That fix proved to have a bug. Upon closer inspection, ip6_gre is not
susceptible to the original bug.

So revert the unnecessary extra check.

In short, ipgre_xmit calls skb_pull to remove ipv4 headers previously
inserted by dev_hard_header. ip6gre_tunnel_xmit does not.

Link: https://lore.kernel.org/netdev/CA+FuTSe+vJgTVLc9SojGuN-f9YQ+xWLPKE_S4f=f+w+_P2hgUg@mail.gmail.com/#t
Fixes: 9cf448c200ba ("ip6_gre: add validation for csum_start")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8fc92b7c 03-Sep-2021 Arseny Krasnov <arseny.krasnov@kaspersky.com>

af_vsock: rename variables in receive loop

Record is supported via MSG_EOR flag, while current logic operates
with message, so rename variables from 'record' to 'message'.

Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20210903123306.3273757-1-arseny.krasnov@kaspersky.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

8d5ac871 03-Sep-2021 Arseny Krasnov <arseny.krasnov@kaspersky.com>

virtio/vsock: support MSG_EOR bit processing

If packet has 'EOR' bit - set MSG_EOR in 'recvmsg()' flags.

Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20210903123251.3273639-1-arseny.krasnov@kaspersky.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

9af8f106 03-Sep-2021 Arseny Krasnov <arseny.krasnov@kaspersky.com>

virtio/vsock: rename 'EOR' to 'EOM' bit.

This current implemented bit is used to mark end of messages
('EOM' - end of message), not records('EOR' - end of record).
Also rename 'record' to 'message' in implementation as it is
different things.

Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20210903123109.3273053-1-arseny.krasnov@kaspersky.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

8a0ed250 05-Sep-2021 Willem de Bruijn <willemb@google.com>

ip_gre: validate csum_start only on pull

The GRE tunnel device can pull existing outer headers in ipge_xmit.
This is a rare path, apparently unique to this device. The below
commit ensured that pulling does not move skb->data beyond csum_start.

But it has a false positive if ip_summed is not CHECKSUM_PARTIAL and
thus csum_start is irrelevant.

Refine to exclude this. At the same time simplify and strengthen the
test.

Simplify, by moving the check next to the offending pull, making it
more self documenting and removing an unnecessary branch from other
code paths.

Strengthen, by also ensuring that the transport header is correct and
therefore the inner headers will be after skb_reset_inner_headers.
The transport header is set to csum_start in skb_partial_csum_set.

Link: https://lore.kernel.org/netdev/YS+h%2FtqCJJiQei+W@shredder/
Fixes: 1d011c4803c7 ("ip_gre: add validation for csum_start")
Reported-by: Ido Schimmel <idosch@idosch.org>
Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e5dd7294 03-Sep-2021 Antonio Quartulli <a@unstable.cc>

ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL address

GRE interfaces are not Ether-like and therefore it is not
possible to generate the v6LL address the same way as (for example)
GRETAP devices.

With default settings, a GRE interface will attempt generating its v6LL
address using the EUI64 approach, but this will fail when the local
endpoint of the GRE tunnel is set to "any". In this case the GRE
interface will end up with no v6LL address, thus violating RFC4291.

SIT interfaces already implement a different logic to ensure that a v6LL
address is always computed.

Change the GRE v6LL generation logic to follow the same approach as SIT.
This way GRE interfaces will always have a v6LL address as well.

Behaviour of GRETAP interfaces has not been changed as they behave like
classic Ether-like interfaces.

To avoid code duplication sit_add_v4_addrs() has been renamed to
add_v4_addrs() and adapted to handle also the IP6GRE/GRE cases.

Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>

9c4d94dc 04-Sep-2021 Christian Schoenebeck <linux_oss@crudebyte.com>

net/9p: increase default msize to 128k

Let's raise the default msize value to 128k.

The 'msize' option defines the maximum message size allowed for any
message being transmitted (in both directions) between 9p server and 9p
client during a 9p session.

Currently the default 'msize' is just 8k, which is way too conservative.
Such a small 'msize' value has quite a negative performance impact,
because individual 9p messages have to be split up far too often into
numerous smaller messages to fit into this message size limitation.

A default value of just 8k also has a much higher probablity of hitting
short-read issues like: https://gitlab.com/qemu-project/qemu/-/issues/409

Unfortunately user feedback showed that many 9p users are not aware that
this option even exists, nor the negative impact it might have if it is
too low.

Link: http://lkml.kernel.org/r/61ea0f0faaaaf26dd3c762eabe4420306ced21b9.1630770829.git.linux_oss@crudebyte.com
Link: https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg01003.html
Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>

9210fc0a 04-Sep-2021 Christian Schoenebeck <linux_oss@crudebyte.com>

net/9p: use macro to define default msize

Use a macro to define the default value for the 'msize' option
at one place instead of using two separate integer literals.

Link: http://lkml.kernel.org/r/28bb651ae0349a7d57e8ddc92c1bd5e62924a912.1630770829.git.linux_oss@crudebyte.com
Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>

22bb3b79 04-Sep-2021 Dominique Martinet <asmadeus@codewreck.org>

net/9p: increase tcp max msize to 1MB

Historically TCP has been limited to 64K buffers, but increasing
msize provides huge performance benefits especially as latency
increase so allow for bigger buffers.

Ideally further improvements could change the allocation from the
current contiguous chunk in slab (kmem_cache) to some scatter-gather
compatible API...

Note this only increases the max possible setting, not the default
value.

Link: http://lkml.kernel.org/r/YTQB5jCbvhmCWzNd@codewreck.org
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>

0961f0c0 04-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'nfs-for-5.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
"New Features:
- Better client responsiveness when server isn't replying
- Use refcount_t in sunrpc rpc_client refcount tracking
- Add srcaddr and dst_port to the sunrpc sysfs info files
- Add basic support for connection sharing between servers with multiple NICs`

Bugfixes and Cleanups:
- Sunrpc tracepoint cleanups
- Disconnect after ib_post_send() errors to avoid deadlocks
- Fix for tearing down rpcrdma_reps
- Fix a potential pNFS layoutget livelock loop
- pNFS layout barrier fixes
- Fix a potential memory corruption in rpc_wake_up_queued_task_set_status()
- Fix reconnection locking
- Fix return value of get_srcport()
- Remove rpcrdma_post_sends()
- Remove pNFS dead code
- Remove copy size restriction for inter-server copies
- Overhaul the NFS callback service
- Clean up sunrpc TCP socket shutdowns
- Always provide aligned buffers to RPC read layers"

* tag 'nfs-for-5.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (39 commits)
NFS: Always provide aligned buffers to the RPC read layers
NFSv4.1 add network transport when session trunking is detected
SUNRPC enforce creation of no more than max_connect xprts
NFSv4 introduce max_connect mount options
SUNRPC add xps_nunique_destaddr_xprts to xprt_switch_info in sysfs
SUNRPC keep track of number of transports to unique addresses
NFSv3: Delete duplicate judgement in nfs3_async_handle_jukebox
SUNRPC: Tweak TCP socket shutdown in the RPC client
SUNRPC: Simplify socket shutdown when not reusing TCP ports
NFSv4.2: remove restriction of copy size for inter-server copy.
NFS: Clean up the synopsis of callback process_op()
NFS: Extract the xdr_init_encode/decode() calls from decode_compound
NFS: Remove unused callback void decoder
NFS: Add a private local dispatcher for NFSv4 callback operations
SUNRPC: Eliminate the RQ_AUTHERR flag
SUNRPC: Set rq_auth_stat in the pg_authenticate() callout
SUNRPC: Add svc_rqst::rq_auth_stat
SUNRPC: Add dst_port to the sysfs xprt info file
SUNRPC: Add srcaddr as a file in sysfs
sunrpc: Fix return value of get_srcport()
...


c7c5e6ff 03-Sep-2021 Eric Dumazet <edumazet@google.com>

fq_codel: reject silly quantum parameters

syzbot found that forcing a big quantum attribute would crash hosts fast,
essentially using this:

tc qd replace dev eth0 root fq_codel quantum 4294967295

This is because fq_codel_dequeue() would have to loop
~2^31 times in :

if (flow->deficit <= 0) {
flow->deficit += q->quantum;
list_move_tail(&flow->flowchain, &q->old_flows);
goto begin;
}

SFQ max quantum is 2^19 (half a megabyte)
Lets adopt a max quantum of one megabyte for FQ_CODEL.

Fixes: 4b549a2ef4be ("fq_codel: Fair Queue Codel AQM")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

10905b4a 03-Sep-2021 Jakub Kicinski <kuba@kernel.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Protect nft_ct template with global mutex, from Pavel Skripkin.

2) Two recent commits switched inet rt and nexthop exception hashes
from jhash to siphash. If those two spots are problematic then
conntrack is affected as well, so switch voer to siphash too.
While at it, add a hard upper limit on chain lengths and reject
insertion if this is hit. Patches from Florian Westphal.

3) Fix use-after-scope in nf_socket_ipv6 reported by KASAN,
from Benjamin Hesmans.

* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: socket: icmp6: fix use-after-scope
netfilter: refuse insertion if chain has grown too large
netfilter: conntrack: switch to siphash
netfilter: conntrack: sanitize table size default settings
netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex
====================

Link: https://lore.kernel.org/r/20210903163020.13741-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


b250e6d1 03-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'kbuild-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

- Add -s option (strict mode) to merge_config.sh to make it fail when
any symbol is redefined.

- Show a warning if a different compiler is used for building external
modules.

- Infer --target from ARCH for CC=clang to let you cross-compile the
kernel without CROSS_COMPILE.

- Make the integrated assembler default (LLVM_IAS=1) for CC=clang.

- Add <linux/stdarg.h> to the kernel source instead of borrowing
<stdarg.h> from the compiler.

- Add Nick Desaulniers as a Kbuild reviewer.

- Drop stale cc-option tests.

- Fix the combination of CONFIG_TRIM_UNUSED_KSYMS and CONFIG_LTO_CLANG
to handle symbols in inline assembly.

- Show a warning if 'FORCE' is missing for if_changed rules.

- Various cleanups

* tag 'kbuild-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (39 commits)
kbuild: redo fake deps at include/ksym/*.h
kbuild: clean up objtool_args slightly
modpost: get the *.mod file path more simply
checkkconfigsymbols.py: Fix the '--ignore' option
kbuild: merge vmlinux_link() between ARCH=um and other architectures
kbuild: do not remove 'linux' link in scripts/link-vmlinux.sh
kbuild: merge vmlinux_link() between the ordinary link and Clang LTO
kbuild: remove stale *.symversions
kbuild: remove unused quiet_cmd_update_lto_symversions
gen_compile_commands: extract compiler command from a series of commands
x86: remove cc-option-yn test for -mtune=
arc: replace cc-option-yn uses with cc-option
s390: replace cc-option-yn uses with cc-option
ia64: move core-y in arch/ia64/Makefile to arch/ia64/Kbuild
sparc: move the install rule to arch/sparc/Makefile
security: remove unneeded subdir-$(CONFIG_...)
kbuild: sh: remove unused install script
kbuild: Fix 'no symbols' warning when CONFIG_TRIM_UNUSD_KSYMS=y
kbuild: Switch to 'f' variants of integrated assembler flag
kbuild: Shuffle blank line to improve comment meaning
...


0c217d50 01-Sep-2021 NeilBrown <neilb@suse.de>

SUNRPC: improve error response to over-size gss credential

When the NFS server receives a large gss (kerberos) credential and tries
to pass it up to rpc.svcgssd (which is deprecated), it triggers an
infinite loop in cache_read().

cache_request() always returns -EAGAIN, and this causes a "goto again".

This patch:
- changes the error to -E2BIG to avoid the infinite loop, and
- generates a WARN_ONCE when rsi_request first sees an over-sized
credential. The warning suggests switching to gssproxy.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=196583
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

730affed 03-Sep-2021 Benjamin Hesmans <benjamin.hesmans@tessares.net>

netfilter: socket: icmp6: fix use-after-scope

Bug reported by KASAN:

BUG: KASAN: use-after-scope in inet6_ehashfn (net/ipv6/inet6_hashtables.c:40)
Call Trace:
(...)
inet6_ehashfn (net/ipv6/inet6_hashtables.c:40)
(...)
nf_sk_lookup_slow_v6 (net/ipv6/netfilter/nf_socket_ipv6.c:91
net/ipv6/netfilter/nf_socket_ipv6.c:146)

It seems that this bug has already been fixed by Eric Dumazet in the
past in:
commit 78296c97ca1f ("netfilter: xt_socket: fix a stack corruption bug")

But a variant of the same issue has been introduced in
commit d64d80a2cde9 ("netfilter: x_tables: don't extract flow keys on early demuxed sks in socket match")

`daddr` and `saddr` potentially hold a reference to ipv6_var that is no
longer in scope when the call to `nf_socket_get_sock_v6` is made.

Fixes: d64d80a2cde9 ("netfilter: x_tables: don't extract flow keys on early demuxed sks in socket match")
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Benjamin Hesmans <benjamin.hesmans@tessares.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

9756e44f 02-Sep-2021 王贇 <yun.wang@linux.alibaba.com>

net: remove the unnecessary check in cipso_v4_doi_free

The commit 733c99ee8be9 ("net: fix NULL pointer reference in
cipso_v4_doi_free") was merged by a mistake, this patch try
to cleanup the mess.

And we already have the commit e842cb60e8ac ("net: fix NULL
pointer reference in cipso_v4_doi_free") which fixed the root
cause of the issue mentioned in it's description.

Suggested-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ddd0d529 02-Sep-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: mcast: fix vlan port router deadlock

Before vlan/port mcast router support was added
br_multicast_set_port_router was used only with bh already disabled due
to the bridge port lock, but that is no longer the case and when it is
called to configure a vlan/port mcast router we can deadlock with the
timer, so always disable bh to make sure it can be called from contexts
with both enabled and disabled bh.

Fixes: 2796d846d74a ("net: bridge: vlan: convert mcast router global option to per-vlan entry")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bf0df73a 02-Sep-2021 Colin Ian King <colin.king@canonical.com>

seg6_iptunnel: Remove redundant initialization of variable err

The variable err is being initialized with a value that is never read, it
is being updated later on. The assignment is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

743902c5 02-Sep-2021 Colin Ian King <colin.king@canonical.com>

tipc: clean up inconsistent indenting

There is a statement that is indented one character too deeply,
clean this up.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c645fe9b 02-Sep-2021 Colin Ian King <colin.king@canonical.com>

skbuff: clean up inconsistent indenting

There is a statement that is indented one character too deeply,
clean this up.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

340fa666 02-Sep-2021 Mat Martineau <mathew.j.martineau@linux.intel.com>

mptcp: Only send extra TCP acks in eligible socket states

Recent changes exposed a bug where specifically-timed requests to the
path manager netlink API could trigger a divide-by-zero in
__tcp_select_window(), as syzkaller does:

divide error: 0000 [#1] SMP KASAN NOPTI
CPU: 0 PID: 9667 Comm: syz-executor.0 Not tainted 5.14.0-rc6+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:__tcp_select_window+0x509/0xa60 net/ipv4/tcp_output.c:3016
Code: 44 89 ff e8 c9 29 e9 fd 45 39 e7 0f 8d 20 ff ff ff e8 db 28 e9 fd 44 89 e3 e9 13 ff ff ff e8 ce 28 e9 fd 44 89 e0 44 89 e3 99 <f7> 7c 24 04 29 d3 e9 fc fe ff ff e8 b7 28 e9 fd 44 89 f1 48 89 ea
RSP: 0018:ffff888031ccf020 EFLAGS: 00010216
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000040000
RDX: 0000000000000000 RSI: ffff88811532c080 RDI: 0000000000000002
RBP: 0000000000000000 R08: ffffffff835807c2 R09: 0000000000000000
R10: 0000000000000004 R11: ffffed1020b92441 R12: 0000000000000000
R13: 1ffff11006399e08 R14: 0000000000000000 R15: 0000000000000000
FS: 00007fa4c8344700(0000) GS:ffff88811ae00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2f424000 CR3: 000000003e4e2003 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
tcp_select_window net/ipv4/tcp_output.c:264 [inline]
__tcp_transmit_skb+0xc00/0x37a0 net/ipv4/tcp_output.c:1351
__tcp_send_ack.part.0+0x3ec/0x760 net/ipv4/tcp_output.c:3972
__tcp_send_ack net/ipv4/tcp_output.c:3978 [inline]
tcp_send_ack+0x7d/0xa0 net/ipv4/tcp_output.c:3978
mptcp_pm_nl_addr_send_ack+0x1ab/0x380 net/mptcp/pm_netlink.c:654
mptcp_pm_remove_addr+0x161/0x200 net/mptcp/pm.c:58
mptcp_nl_remove_id_zero_address+0x197/0x460 net/mptcp/pm_netlink.c:1328
mptcp_nl_cmd_del_addr+0x98b/0xd40 net/mptcp/pm_netlink.c:1359
genl_family_rcv_msg_doit.isra.0+0x225/0x340 net/netlink/genetlink.c:731
genl_family_rcv_msg net/netlink/genetlink.c:775 [inline]
genl_rcv_msg+0x341/0x5b0 net/netlink/genetlink.c:792
netlink_rcv_skb+0x148/0x430 net/netlink/af_netlink.c:2504
genl_rcv+0x24/0x40 net/netlink/genetlink.c:803
netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
netlink_unicast+0x537/0x750 net/netlink/af_netlink.c:1340
netlink_sendmsg+0x846/0xd80 net/netlink/af_netlink.c:1929
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg+0x14e/0x190 net/socket.c:724
____sys_sendmsg+0x709/0x870 net/socket.c:2403
___sys_sendmsg+0xff/0x170 net/socket.c:2457
__sys_sendmsg+0xe5/0x1b0 net/socket.c:2486
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

mptcp_pm_nl_addr_send_ack() was attempting to send a TCP ACK on the
first subflow in the MPTCP socket's connection list without validating
that the subflow was in a suitable connection state. To address this,
always validate subflow state when sending extra ACKs on subflows
for address advertisement or subflow priority change.

Fixes: 84dfe3677a6f ("mptcp: send out dedicated ADD_ADDR packet")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/229
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Acked-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

20e7b9f8 02-Sep-2021 Eric Dumazet <edumazet@google.com>

pktgen: remove unused variable

pktgen_thread_worker() no longer needs wait variable, delete it.

Fixes: ef87979c273a ("pktgen: better scheduler friendliness")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

9aca491e 01-Sep-2021 Ryoga Saito <contact@proelbtn.com>

Set fc_nlinfo in nh_create_ipv4, nh_create_ipv6

This patch fixes kernel NULL pointer dereference when creating nexthop
which is bound with SRv6 decapsulation. In the creation of nexthop,
__seg6_end_dt_vrf_build is called. __seg6_end_dt_vrf_build expects
fc_lninfo in fib6_config is set correctly, but it isn't set in
nh_create_ipv6, which causes kernel crash.

Here is steps to reproduce kernel crash:

1. modprobe vrf
2. ip -6 nexthop add encap seg6local action End.DT4 vrftable 1 dev eth0

We got the following message:

[ 901.370336] BUG: kernel NULL pointer dereference, address: 0000000000000ba0
[ 901.371658] #PF: supervisor read access in kernel mode
[ 901.372672] #PF: error_code(0x0000) - not-present page
[ 901.373672] PGD 0 P4D 0
[ 901.374248] Oops: 0000 [#1] SMP PTI
[ 901.374944] CPU: 0 PID: 8593 Comm: ip Not tainted 5.14-051400-generic #202108310811-Ubuntu
[ 901.376404] Hardware name: Red Hat KVM, BIOS 1.11.1-4.module_el8.2.0+320+13f867d7 04/01/2014
[ 901.377907] RIP: 0010:vrf_ifindex_lookup_by_table_id+0x19/0x90 [vrf]
[ 901.379182] Code: c1 e9 72 ff ff ff e8 96 49 01 c2 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 89 f5 41 54 53 8b 05 47 4c 00 00 <48> 8b 97 a0 0b 00 00 48 8b 1c c2 e8 57 27 53 c1 4c 8d a3 88 00 00
[ 901.382652] RSP: 0018:ffffbf2d02043590 EFLAGS: 00010282
[ 901.383746] RAX: 000000000000000b RBX: ffff990808255e70 RCX: ffffbf2d02043aa8
[ 901.385436] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000000
[ 901.386924] RBP: ffffbf2d020435b0 R08: 00000000000000c0 R09: ffff990808255e40
[ 901.388537] R10: ffffffff83b08c90 R11: 0000000000000009 R12: 0000000000000000
[ 901.389937] R13: 0000000000000001 R14: 0000000000000000 R15: 000000000000000b
[ 901.391226] FS: 00007fe49381f740(0000) GS:ffff99087dc00000(0000) knlGS:0000000000000000
[ 901.392737] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 901.393803] CR2: 0000000000000ba0 CR3: 000000000e3e8003 CR4: 0000000000770ef0
[ 901.395122] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 901.396496] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 901.397833] PKRU: 55555554
[ 901.398578] Call Trace:
[ 901.399144] l3mdev_ifindex_lookup_by_table_id+0x3b/0x70
[ 901.400179] __seg6_end_dt_vrf_build+0x34/0xd0
[ 901.401067] seg6_end_dt4_build+0x16/0x20
[ 901.401904] seg6_local_build_state+0x271/0x430
[ 901.402797] lwtunnel_build_state+0x81/0x130
[ 901.403645] fib_nh_common_init+0x82/0x100
[ 901.404465] ? sock_def_readable+0x4b/0x80
[ 901.405285] fib6_nh_init+0x115/0x7c0
[ 901.406033] nh_create_ipv6.isra.0+0xe1/0x140
[ 901.406932] rtm_new_nexthop+0x3b7/0xeb0
[ 901.407828] rtnetlink_rcv_msg+0x152/0x3a0
[ 901.408663] ? rtnl_calcit.isra.0+0x130/0x130
[ 901.409535] netlink_rcv_skb+0x55/0x100
[ 901.410319] rtnetlink_rcv+0x15/0x20
[ 901.411026] netlink_unicast+0x1a8/0x250
[ 901.411813] netlink_sendmsg+0x238/0x470
[ 901.412602] ? _copy_from_user+0x2b/0x60
[ 901.413394] sock_sendmsg+0x65/0x70
[ 901.414112] ____sys_sendmsg+0x218/0x290
[ 901.414929] ? copy_msghdr_from_user+0x5c/0x90
[ 901.415814] ___sys_sendmsg+0x81/0xc0
[ 901.416559] ? fsnotify_destroy_marks+0x27/0xf0
[ 901.417447] ? call_rcu+0xa4/0x230
[ 901.418153] ? kmem_cache_free+0x23f/0x410
[ 901.418972] ? dentry_free+0x37/0x70
[ 901.419705] ? mntput_no_expire+0x4c/0x260
[ 901.420574] __sys_sendmsg+0x62/0xb0
[ 901.421297] __x64_sys_sendmsg+0x1f/0x30
[ 901.422057] do_syscall_64+0x5c/0xc0
[ 901.422756] ? syscall_exit_to_user_mode+0x27/0x50
[ 901.423675] ? __x64_sys_close+0x12/0x40
[ 901.424462] ? do_syscall_64+0x69/0xc0
[ 901.425219] ? irqentry_exit_to_user_mode+0x9/0x20
[ 901.426149] ? irqentry_exit+0x19/0x30
[ 901.426901] ? exc_page_fault+0x89/0x160
[ 901.427709] ? asm_exc_page_fault+0x8/0x30
[ 901.428536] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 901.429514] RIP: 0033:0x7fe493945747
[ 901.430248] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
[ 901.433549] RSP: 002b:00007ffe9932cf68 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 901.434981] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe493945747
[ 901.436303] RDX: 0000000000000000 RSI: 00007ffe9932cfe0 RDI: 0000000000000003
[ 901.437607] RBP: 00000000613053f7 R08: 0000000000000001 R09: 00007ffe9932d07c
[ 901.438990] R10: 000055f4a903a010 R11: 0000000000000246 R12: 0000000000000001
[ 901.440340] R13: 0000000000000001 R14: 000055f4a802b163 R15: 000055f4a8042020
[ 901.441630] Modules linked in: vrf nls_utf8 isofs nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common isst_if_mbox_msr isst_if_common nfit rapl input_leds joydev serio_raw qemu_fw_cfg mac_hid sch_fq_codel drm virtio_rng ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd virtio_net net_failover cryptd psmouse virtio_blk failover i2c_piix4 pata_acpi floppy
[ 901.450808] CR2: 0000000000000ba0
[ 901.451514] ---[ end trace c27b934b99ade304 ]---
[ 901.452403] RIP: 0010:vrf_ifindex_lookup_by_table_id+0x19/0x90 [vrf]
[ 901.453626] Code: c1 e9 72 ff ff ff e8 96 49 01 c2 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 89 f5 41 54 53 8b 05 47 4c 00 00 <48> 8b 97 a0 0b 00 00 48 8b 1c c2 e8 57 27 53 c1 4c 8d a3 88 00 00
[ 901.456910] RSP: 0018:ffffbf2d02043590 EFLAGS: 00010282
[ 901.457912] RAX: 000000000000000b RBX: ffff990808255e70 RCX: ffffbf2d02043aa8
[ 901.459238] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000000
[ 901.460552] RBP: ffffbf2d020435b0 R08: 00000000000000c0 R09: ffff990808255e40
[ 901.461882] R10: ffffffff83b08c90 R11: 0000000000000009 R12: 0000000000000000
[ 901.463208] R13: 0000000000000001 R14: 0000000000000000 R15: 000000000000000b
[ 901.464529] FS: 00007fe49381f740(0000) GS:ffff99087dc00000(0000) knlGS:0000000000000000
[ 901.466058] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 901.467189] CR2: 0000000000000ba0 CR3: 000000000e3e8003 CR4: 0000000000770ef0
[ 901.468515] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 901.469858] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 901.471139] PKRU: 55555554

Signed-off-by: Ryoga Saito <contact@proelbtn.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d2cabd2d 02-Sep-2021 Dan Carpenter <dan.carpenter@oracle.com>

net: qrtr: revert check in qrtr_endpoint_post()

I tried to make this check stricter as a hardenning measure but it broke
audo and wifi on these devices so revert it.

Fixes: aaa8e4922c88 ("net: qrtr: make checks in qrtr_endpoint_post() stricter")
Reported-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Tested-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

3f22bb13 01-Sep-2021 Jiwon Kim <jiwonaid0@gmail.com>

ipv6: change return type from int to void for mld_process_v2

The mld_process_v2 only returned 0.

So, the return type is changed to void.

Signed-off-by: Jiwon Kim <jiwonaid0@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

205b95fe 30-Aug-2021 Ivan Mikhaylov <i.mikhaylov@yadro.com>

net/ncsi: add get MAC address command to get Intel i210 MAC address

This patch adds OEM Intel GMA command and response handler for it.

Signed-off-by: Brad Ho <Brad_Ho@phoenix.com>
Signed-off-by: Paul Fertser <fercerpav@gmail.com>
Signed-off-by: Ivan Mikhaylov <i.mikhaylov@yadro.com>
Link: https://lore.kernel.org/r/20210830171806.119857-2-i.mikhaylov@yadro.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

1094c6fe 01-Sep-2021 Paolo Abeni <pabeni@redhat.com>

mptcp: fix possible divide by zero

Florian noted that if mptcp_alloc_tx_skb() allocation fails
in __mptcp_push_pending(), we can end-up invoking
mptcp_push_release()/tcp_push() with a zero mss, causing
a divide by 0 error.

This change addresses the issue refactoring the skb allocation
code checking if skb collapsing will happen for sure and doing
the skb allocation only after such check. Skb allocation will
now happen only after the call to tcp_send_mss() which
correctly initializes mss_now.

As side bonuses we now fill the skb tx cache only when needed,
and this also clean-up a bit the output path.

v1 -> v2:
- use lockdep_assert_held_once() - Jakub
- fix indentation - Jakub

Reported-by: Florian Westphal <fw@strlen.de>
Fixes: 724cfd2ee8aa ("mptcp: allocate TX skbs in msk context")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7c314bdf 01-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'tty-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty / serial updates from Greg KH:
"Here is the "big" set of tty/serial driver patches for 5.15-rc1

Nothing major in here at all, just some driver updates and more
cleanups on old tty apis and code that needed it that includes:

- tty.h cleanup of things that didn't belong in it

- other tty cleanups by Jiri

- driver cleanups

- rs485 support added to amba-pl011 driver

- dts updates

- stm32 serial driver updates

- other minor fixes and driver updates

All have been in linux-next for a while with no reported problems"

* tag 'tty-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (83 commits)
tty: serial: uartlite: Use read_poll_timeout for a polling loop
tty: serial: uartlite: Use constants in early_uartlite_putc
tty: Fix data race between tiocsti() and flush_to_ldisc()
serial: vt8500: Use of_device_get_match_data
serial: tegra: Use of_device_get_match_data
serial: 8250_ingenic: Use of_device_get_match_data
tty: serial: linflexuart: Remove redundant check to simplify the code
tty: serial: fsl_lpuart: do software reset for imx7ulp and imx8qxp
tty: serial: fsl_lpuart: enable two stop bits for lpuart32
tty: serial: fsl_lpuart: fix the wrong mapbase value
mxser: use semi-colons instead of commas
tty: moxa: use semi-colons instead of commas
tty: serial: fsl_lpuart: check dma_tx_in_progress in tx dma callback
tty: replace in_irq() with in_hardirq()
serial: sh-sci: fix break handling for sysrq
serial: stm32: use devm_platform_get_and_ioremap_resource()
serial: stm32: use the defined variable to simplify code
Revert "arm pl011 serial: support multi-irq request"
tty: serial: samsung: Add Exynos850 SoC data
tty: serial: samsung: Fix driver data macros style
...


e38b3f20 29-Aug-2021 NeilBrown <neilb@suse.de>

SUNRPC: don't pause on incomplete allocation

alloc_pages_bulk_array() attempts to allocate at least one page based on
the provided pages, and then opportunistically allocates more if that
can be done without dropping the spinlock.

So if it returns fewer than requested, that could just mean that it
needed to drop the lock. In that case, try again immediately.

Only pause for a time if no progress could be made.

Reported-and-tested-by: Mike Javorski <mike.javorski@gmail.com>
Reported-and-tested-by: Lothar Paltins <lopa@mailbox.org>
Fixes: f6e70aab9dfe ("SUNRPC: refresh rq_pages using a bulk page allocator")
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Mel Gorman <mgorman@suse.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

780aa120 31-Aug-2021 Wan Jiabing <wanjiabing@vivo.com>

mptcp: Fix duplicated argument in protocol.h

Fix the following coccicheck warning:
./net/mptcp/protocol.h:36:50-73: duplicated argument to & or |

The OPTION_MPTCP_MPJ_SYNACK here is duplicate.
Here should be OPTION_MPTCP_MPJ_ACK.

Fixes: 74c7dfbee3e18 ("mptcp: consolidate in_opt sub-options fields in a bitmask")
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

0e90dfa7 31-Aug-2021 Linus Walleij <linus.walleij@linaro.org>

net: dsa: tag_rtl4_a: Fix egress tags

I noticed that only port 0 worked on the RTL8366RB since we
started to use custom tags.

It turns out that the format of egress custom tags is actually
different from ingress custom tags. While the lower bits just
contain the port number in ingress tags, egress tags need to
indicate destination port by setting the bit for the
corresponding port.

It was working on port 0 because port 0 added 0x00 as port
number in the lower bits, and if you do this the packet appears
at all ports, including the intended port. Ooops.

Fix this and all ports work again. Use the define for shifting
the "type A" into place while we're at it.

Tested on the D-Link DIR-685 by sending traffic to each of
the ports in turn. It works.

Fixes: 86dd9868b878 ("net: dsa: tag_rtl4_a: Support also egress tags")
Cc: DENG Qingfang <dqfext@gmail.com>
Cc: Mauri Sandberg <sandberg@mailfence.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9e9fb765 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'net-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
"Core:

- Enable memcg accounting for various networking objects.

BPF:

- Introduce bpf timers.

- Add perf link and opaque bpf_cookie which the program can read out
again, to be used in libbpf-based USDT library.

- Add bpf_task_pt_regs() helper to access user space pt_regs in
kprobes, to help user space stack unwinding.

- Add support for UNIX sockets for BPF sockmap.

- Extend BPF iterator support for UNIX domain sockets.

- Allow BPF TCP congestion control progs and bpf iterators to call
bpf_setsockopt(), e.g. to switch to another congestion control
algorithm.

Protocols:

- Support IOAM Pre-allocated Trace with IPv6.

- Support Management Component Transport Protocol.

- bridge: multicast: add vlan support.

- netfilter: add hooks for the SRv6 lightweight tunnel driver.

- tcp:
- enable mid-stream window clamping (by user space or BPF)
- allow data-less, empty-cookie SYN with TFO_SERVER_COOKIE_NOT_REQD
- more accurate DSACK processing for RACK-TLP

- mptcp:
- add full mesh path manager option
- add partial support for MP_FAIL
- improve use of backup subflows
- optimize option processing

- af_unix: add OOB notification support.

- ipv6: add IFLA_INET6_RA_MTU to expose MTU value advertised by the
router.

- mac80211: Target Wake Time support in AP mode.

- can: j1939: extend UAPI to notify about RX status.

Driver APIs:

- Add page frag support in page pool API.

- Many improvements to the DSA (distributed switch) APIs.

- ethtool: extend IRQ coalesce uAPI with timer reset modes.

- devlink: control which auxiliary devices are created.

- Support CAN PHYs via the generic PHY subsystem.

- Proper cross-chip support for tag_8021q.

- Allow TX forwarding for the software bridge data path to be
offloaded to capable devices.

Drivers:

- veth: more flexible channels number configuration.

- openvswitch: introduce per-cpu upcall dispatch.

- Add internet mix (IMIX) mode to pktgen.

- Transparently handle XDP operations in the bonding driver.

- Add LiteETH network driver.

- Renesas (ravb):
- support Gigabit Ethernet IP

- NXP Ethernet switch (sja1105):
- fast aging support
- support for "H" switch topologies
- traffic termination for ports under VLAN-aware bridge

- Intel 1G Ethernet
- support getcrosststamp() with PCIe PTM (Precision Time
Measurement) for better time sync
- support Credit-Based Shaper (CBS) offload, enabling HW traffic
prioritization and bandwidth reservation

- Broadcom Ethernet (bnxt)
- support pulse-per-second output
- support larger Rx rings

- Mellanox Ethernet (mlx5)
- support ethtool RSS contexts and MQPRIO channel mode
- support LAG offload with bridging
- support devlink rate limit API
- support packet sampling on tunnels

- Huawei Ethernet (hns3):
- basic devlink support
- add extended IRQ coalescing support
- report extended link state

- Netronome Ethernet (nfp):
- add conntrack offload support

- Broadcom WiFi (brcmfmac):
- add WPA3 Personal with FT to supported cipher suites
- support 43752 SDIO device

- Intel WiFi (iwlwifi):
- support scanning hidden 6GHz networks
- support for a new hardware family (Bz)

- Xen pv driver:
- harden netfront against malicious backends

- Qualcomm mobile
- ipa: refactor power management and enable automatic suspend
- mhi: move MBIM to WWAN subsystem interfaces

Refactor:

- Ambient BPF run context and cgroup storage cleanup.

- Compat rework for ndo_ioctl.

Old code removal:

- prism54 remove the obsoleted driver, deprecated by the p54 driver.

- wan: remove sbni/granch driver"

* tag 'net-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1715 commits)
net: Add depends on OF_NET for LiteX's LiteETH
ipv6: seg6: remove duplicated include
net: hns3: remove unnecessary spaces
net: hns3: add some required spaces
net: hns3: clean up a type mismatch warning
net: hns3: refine function hns3_set_default_feature()
ipv6: remove duplicated 'net/lwtunnel.h' include
net: w5100: check return value after calling platform_get_resource()
net/mlxbf_gige: Make use of devm_platform_ioremap_resourcexxx()
net: mdio: mscc-miim: Make use of the helper function devm_platform_ioremap_resource()
net: mdio-ipq4019: Make use of devm_platform_ioremap_resource()
fou: remove sparse errors
ipv4: fix endianness issue in inet_rtm_getroute_build_skb()
octeontx2-af: Set proper errorcode for IPv4 checksum errors
octeontx2-af: Fix static code analyzer reported issues
octeontx2-af: Fix mailbox errors in nix_rss_flowkey_cfg
octeontx2-af: Fix loop in free and unmap counter
af_unix: fix potential NULL deref in unix_dgram_connect()
dpaa2-eth: Replace strlcpy with strscpy
octeontx2-af: Use NDC TX for transmit packet data
...


8bda9557 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'nfsd-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
"New features:

- Support for server-side disconnect injection via debugfs

- Protocol definitions for new RPC_AUTH_TLS authentication flavor

Performance improvements:

- Reduce page allocator traffic in the NFSD splice read actor

- Reduce CPU utilization in svcrdma's Send completion handler

Notable bug fixes:

- Stabilize lockd operation when re-exporting NFS mounts

- Fix the use of %.*s in NFSD tracepoints

- Fix /proc/sys/fs/nfs/nsm_use_hostnames"

* tag 'nfsd-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (31 commits)
nfsd: fix crash on LOCKT on reexported NFSv3
nfs: don't allow reexport reclaims
lockd: don't attempt blocking locks on nfs reexports
nfs: don't atempt blocking locks on nfs reexports
Keep read and write fds with each nlm_file
lockd: update nlm_lookup_file reexport comment
nlm: minor refactoring
nlm: minor nlm_lookup_file argument change
lockd: lockd server-side shouldn't set fl_ops
SUNRPC: Add documentation for the fail_sunrpc/ directory
SUNRPC: Server-side disconnect injection
SUNRPC: Move client-side disconnect injection
SUNRPC: Add a /sys/kernel/debug/fail_sunrpc/ directory
svcrdma: xpt_bc_xprt is already clear in __svc_rdma_free()
nfsd4: Fix forced-expiry locking
rpc: fix gss_svc_init cleanup on failure
SUNRPC: Add RPC_AUTH_TLS protocol numbers
lockd: change the proc_handler for nsm_use_hostnames
sysctl: introduce new proc handler proc_dobool
SUNRPC: Fix a NULL pointer deref in trace_svc_stats_latency()
...


29ce8f97 31-Aug-2021 Jakub Kicinski <kuba@kernel.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

include/linux/netdevice.h
net/socket.c

d0efb16294d1 ("net: don't unconditionally copy_from_user a struct ifreq for socket ioctls")

876f0bf9d0d5 ("net: socket: simplify dev_ifconf handling")
29c4964822aa ("net: socket: rework compat_ifreq_ioctl()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>


a9e7c3ce 31-Aug-2021 Lv Ruyi <lv.ruyi@zte.com.cn>

ipv6: seg6: remove duplicated include

Remove all but the first include of net/lwtunnel.h from 'seg6_local.c.

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Lv Ruyi <lv.ruyi@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

53c622db 31-Aug-2021 Lv Ruyi <lv.ruyi@zte.com.cn>

ipv6: remove duplicated 'net/lwtunnel.h' include

Remove all but the first include of net/lwtunnel.h from seg6_iptunnel.c.

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Lv Ruyi <lv.ruyi@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

8d65cd8d 30-Aug-2021 Eric Dumazet <edumazet@google.com>

fou: remove sparse errors

We need to add __rcu qualifier to avoid these errors:

net/ipv4/fou.c:250:18: warning: incorrect type in assignment (different address spaces)
net/ipv4/fou.c:250:18: expected struct net_offload const **offloads
net/ipv4/fou.c:250:18: got struct net_offload const [noderef] __rcu **
net/ipv4/fou.c:251:15: error: incompatible types in comparison expression (different address spaces):
net/ipv4/fou.c:251:15: struct net_offload const [noderef] __rcu *
net/ipv4/fou.c:251:15: struct net_offload const *
net/ipv4/fou.c:272:18: warning: incorrect type in assignment (different address spaces)
net/ipv4/fou.c:272:18: expected struct net_offload const **offloads
net/ipv4/fou.c:272:18: got struct net_offload const [noderef] __rcu **
net/ipv4/fou.c:273:15: error: incompatible types in comparison expression (different address spaces):
net/ipv4/fou.c:273:15: struct net_offload const [noderef] __rcu *
net/ipv4/fou.c:273:15: struct net_offload const *
net/ipv4/fou.c:442:18: warning: incorrect type in assignment (different address spaces)
net/ipv4/fou.c:442:18: expected struct net_offload const **offloads
net/ipv4/fou.c:442:18: got struct net_offload const [noderef] __rcu **
net/ipv4/fou.c:443:15: error: incompatible types in comparison expression (different address spaces):
net/ipv4/fou.c:443:15: struct net_offload const [noderef] __rcu *
net/ipv4/fou.c:443:15: struct net_offload const *
net/ipv4/fou.c:489:18: warning: incorrect type in assignment (different address spaces)
net/ipv4/fou.c:489:18: expected struct net_offload const **offloads
net/ipv4/fou.c:489:18: got struct net_offload const [noderef] __rcu **
net/ipv4/fou.c:490:15: error: incompatible types in comparison expression (different address spaces):
net/ipv4/fou.c:490:15: struct net_offload const [noderef] __rcu *
net/ipv4/fou.c:490:15: struct net_offload const *
net/ipv4/udp_offload.c:170:26: warning: incorrect type in assignment (different address spaces)
net/ipv4/udp_offload.c:170:26: expected struct net_offload const **offloads
net/ipv4/udp_offload.c:170:26: got struct net_offload const [noderef] __rcu **
net/ipv4/udp_offload.c:171:23: error: incompatible types in comparison expression (different address spaces):
net/ipv4/udp_offload.c:171:23: struct net_offload const [noderef] __rcu *
net/ipv4/udp_offload.c:171:23: struct net_offload const *

Fixes: efc98d08e1ec ("fou: eliminate IPv4,v6 specific GRO functions")
Fixes: 8bce6d7d0d1e ("udp: Generalize skb_udp_segment")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

92548b0e 30-Aug-2021 Eric Dumazet <edumazet@google.com>

ipv4: fix endianness issue in inet_rtm_getroute_build_skb()

The UDP length field should be in network order.
This removes the following sparse error:

net/ipv4/route.c:3173:27: warning: incorrect type in assignment (different base types)
net/ipv4/route.c:3173:27: expected restricted __be16 [usertype] len
net/ipv4/route.c:3173:27: got unsigned long

Fixes: 404eb77ea766 ("ipv4: support sport, dport and ip_proto in RTM_GETROUTE")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: David Ahern <dsahern@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

dc56ad70 30-Aug-2021 Eric Dumazet <edumazet@google.com>

af_unix: fix potential NULL deref in unix_dgram_connect()

syzbot was able to trigger NULL deref in unix_dgram_connect() [1]

This happens in

if (unix_peer(sk))
sk->sk_state = other->sk_state = TCP_ESTABLISHED; // crash because @other is NULL

Because locks have been dropped, unix_peer() might be non NULL,
while @other is NULL (AF_UNSPEC case)

We need to move code around, so that we no longer access
unix_peer() and sk_state while locks have been released.

[1]
general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
CPU: 0 PID: 10341 Comm: syz-executor239 Not tainted 5.14.0-rc7-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:unix_dgram_connect+0x32a/0xc60 net/unix/af_unix.c:1226
Code: 00 00 45 31 ed 49 83 bc 24 f8 05 00 00 00 74 69 e8 eb 5b a6 f9 48 8d 7d 12 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 48 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 e0 07 00 00
RSP: 0018:ffffc9000a89fcd8 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000000000004 RCX: 0000000000000000
RDX: 0000000000000002 RSI: ffffffff87cf4ef5 RDI: 0000000000000012
RBP: 0000000000000000 R08: 0000000000000000 R09: ffff88802e1917c3
R10: ffffffff87cf4eba R11: 0000000000000001 R12: ffff88802e191740
R13: 0000000000000000 R14: ffff88802e191d38 R15: ffff88802e1917c0
FS: 00007f3eb0052700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004787d0 CR3: 0000000029c0a000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
__sys_connect_file+0x155/0x1a0 net/socket.c:1890
__sys_connect+0x161/0x190 net/socket.c:1907
__do_sys_connect net/socket.c:1917 [inline]
__se_sys_connect net/socket.c:1914 [inline]
__x64_sys_connect+0x6f/0xb0 net/socket.c:1914
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x446a89
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 a1 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f3eb0052208 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
RAX: ffffffffffffffda RBX: 00000000004cc4d8 RCX: 0000000000446a89
RDX: 000000000000006e RSI: 0000000020000180 RDI: 0000000000000003
RBP: 00000000004cc4d0 R08: 00007f3eb0052700 R09: 0000000000000000
R10: 00007f3eb0052700 R11: 0000000000000246 R12: 00000000004cc4dc
R13: 00007ffd791e79cf R14: 00007f3eb0052300 R15: 0000000000022000
Modules linked in:
---[ end trace 4eb809357514968c ]---
RIP: 0010:unix_dgram_connect+0x32a/0xc60 net/unix/af_unix.c:1226
Code: 00 00 45 31 ed 49 83 bc 24 f8 05 00 00 00 74 69 e8 eb 5b a6 f9 48 8d 7d 12 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 48 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 e0 07 00 00
RSP: 0018:ffffc9000a89fcd8 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000000000004 RCX: 0000000000000000
RDX: 0000000000000002 RSI: ffffffff87cf4ef5 RDI: 0000000000000012
RBP: 0000000000000000 R08: 0000000000000000 R09: ffff88802e1917c3
R10: ffffffff87cf4eba R11: 0000000000000001 R12: ffff88802e191740
R13: 0000000000000000 R14: ffff88802e191d38 R15: ffff88802e1917c0
FS: 00007f3eb0052700(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd791fe960 CR3: 0000000029c0a000 CR4: 00000000001506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Fixes: 83301b5367a9 ("af_unix: Set TCP_ESTABLISHED for datagram sockets too")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <cong.wang@bytedance.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6baeb395 28-Aug-2021 MichelleJin <shjy180909@gmail.com>

net: bridge: use mld2r_ngrec instead of icmpv6_dataun

br_ip6_multicast_mld2_report function uses icmp6h
to parse mld2_report packet.

mld2r_ngrec defines mld2r_hdr.icmp6_dataun.un_data16[1]
in include/net/mld.h.

So, it is more compact to use mld2r rather than icmp6h.

By doing printk test, it is confirmed that
icmp6h->icmp6_dataun.un_data16[1] and mld2r->mld2r_ngrec are
indeed equivalent.

Also, sizeof(*mld2r) and sizeof(*icmp6h) are equivalent, too.

Signed-off-by: MichelleJin <shjy180909@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c6607012 29-Aug-2021 Xiyu Yang <xiyuyang19@fudan.edu.cn>

net: sched: Fix qdisc_rate_table refcount leak when get tcf_block failed

The reference counting issue happens in one exception handling path of
cbq_change_class(). When failing to get tcf_block, the function forgets
to decrease the refcount of "rtab" increased by qdisc_put_rtab(),
causing a refcount leak.

Fix this issue by jumping to "failure" label when get tcf_block failed.

Fixes: 6529eaba33f0 ("net: sched: introduce tcf block infractructure")
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/r/1630252681-71588-1-git-send-email-xiyuyang19@fudan.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c547d89a 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.15/io_uring-2021-08-30' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:

- cancellation cleanups (Hao, Pavel)

- io-wq accounting cleanup (Hao)

- io_uring submit locking fix (Hao)

- io_uring link handling fixes (Hao)

- fixed file improvements (wangyangbo, Pavel)

- allow updates of linked timeouts like regular timeouts (Pavel)

- IOPOLL fix (Pavel)

- remove batched file get optimization (Pavel)

- improve reference handling (Pavel)

- IRQ task_work batching (Pavel)

- allow pure fixed file, and add support for open/accept (Pavel)

- GFP_ATOMIC RT kernel fix

- multiple CQ ring waiter improvement

- funnel IRQ completions through task_work

- add support for limiting async workers explicitly

- add different clocksource support for timeouts

- io-wq wakeup race fix

- lots of cleanups and improvement (Pavel et al)

* tag 'for-5.15/io_uring-2021-08-30' of git://git.kernel.dk/linux-block: (87 commits)
io-wq: fix wakeup race when adding new work
io-wq: wqe and worker locks no longer need to be IRQ safe
io-wq: check max_worker limits if a worker transitions bound state
io_uring: allow updating linked timeouts
io_uring: keep ltimeouts in a list
io_uring: support CLOCK_BOOTTIME/REALTIME for timeouts
io-wq: provide a way to limit max number of workers
io_uring: add build check for buf_index overflows
io_uring: clarify io_req_task_cancel() locking
io_uring: add task-refs-get helper
io_uring: fix failed linkchain code logic
io_uring: remove redundant req_set_fail()
io_uring: don't free request to slab
io_uring: accept directly into fixed file table
io_uring: hand code io_accept() fd installing
io_uring: openat directly into fixed fd table
net: add accept helper not installing fd
io_uring: fix io_try_cancel_userdata race for iowq
io_uring: IRQ rw completion batching
io_uring: batch task work locking
...


9a1d6c9e 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.15/drivers-2021-08-30' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
"Sitting on top of the core block changes, here are the driver changes
for the 5.15 merge window:

- NVMe updates via Christoph:
- suspend improvements for devices with an HMB (Keith Busch)
- handle double completions more gacefull (Sagi Grimberg)
- cleanup the selects for the nvme core code a bit (Sagi Grimberg)
- don't update queue count when failing to set io queues (Ruozhu Li)
- various nvmet connect fixes (Amit Engel)
- cleanup lightnvm leftovers (Keith Busch, me)
- small cleanups (Colin Ian King, Hou Pu)
- add tracing for the Set Features command (Hou Pu)
- CMB sysfs cleanups (Keith Busch)
- add a mutex_destroy call (Keith Busch)

- remove lightnvm subsystem. It's served its purpose and ultimately
led to zoned nvme support, we no longer need it (Christoph)

- revert floppy O_NDELAY fix (Denis)

- nbd fixes (Hou, Pavel, Baokun)

- nbd locking fixes (Tetsuo)

- nbd device removal fixes (Christoph)

- raid10 rcu warning fix (Xiao)

- raid1 write behind fix (Guoqing)

- rnbd fixes (Gioh, Md Haris)

- misc fixes (Colin)"

* tag 'for-5.15/drivers-2021-08-30' of git://git.kernel.dk/linux-block: (42 commits)
Revert "floppy: reintroduce O_NDELAY fix"
raid1: ensure write behind bio has less than BIO_MAX_VECS sectors
md/raid10: Remove unnecessary rcu_dereference in raid10_handle_discard
nbd: remove nbd->destroy_complete
nbd: only return usable devices from nbd_find_unused
nbd: set nbd->index before releasing nbd_index_mutex
nbd: prevent IDR lookups from finding partially initialized devices
nbd: reset NBD to NULL when restarting in nbd_genl_connect
nbd: add missing locking to the nbd_dev_add error path
nvme: remove the unused NVME_NS_* enum
nvme: remove nvm_ndev from ns
nvme: Have NVME_FABRICS select NVME_CORE instead of transport drivers
block: nbd: add sanity check for first_minor
nvmet: check that host sqsize does not exceed ctrl MQES
nvmet: avoid duplicate qid in connect cmd
nvmet: pass back cntlid on successful completion
nvme-rdma: don't update queue count when failing to set io queues
nvme-tcp: don't update queue count when failing to set io queues
nvme-tcp: pair send_mutex init with destroy
nvme: allow user toggling hmb usage
...


19a31d79 30-Aug-2021 Jakub Kicinski <kuba@kernel.org>

Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
bpf-next 2021-08-31

We've added 116 non-merge commits during the last 17 day(s) which contain
a total of 126 files changed, 6813 insertions(+), 4027 deletions(-).

The main changes are:

1) Add opaque bpf_cookie to perf link which the program can read out again,
to be used in libbpf-based USDT library, from Andrii Nakryiko.

2) Add bpf_task_pt_regs() helper to access userspace pt_regs, from Daniel Xu.

3) Add support for UNIX stream type sockets for BPF sockmap, from Jiang Wang.

4) Allow BPF TCP congestion control progs to call bpf_setsockopt() e.g. to switch
to another congestion control algorithm during init, from Martin KaFai Lau.

5) Extend BPF iterator support for UNIX domain sockets, from Kuniyuki Iwashima.

6) Allow bpf_{set,get}sockopt() calls from setsockopt progs, from Prankur Gupta.

7) Add bpf_get_netns_cookie() helper for BPF_PROG_TYPE_{SOCK_OPS,CGROUP_SOCKOPT}
progs, from Xu Liu and Stanislav Fomichev.

8) Support for __weak typed ksyms in libbpf, from Hao Luo.

9) Shrink struct cgroup_bpf by 504 bytes through refactoring, from Dave Marchevsky.

10) Fix a smatch complaint in verifier's narrow load handling, from Andrey Ignatov.

11) Fix BPF interpreter's tail call count limit, from Daniel Borkmann.

12) Big batch of improvements to BPF selftests, from Magnus Karlsson, Li Zhijian,
Yucong Sun, Yonghong Song, Ilya Leoshkevich, Jussi Maki, Ilya Leoshkevich, others.

13) Another big batch to revamp XDP samples in order to give them consistent look
and feel, from Kumar Kartikeya Dwivedi.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (116 commits)
MAINTAINERS: Remove self from powerpc BPF JIT
selftests/bpf: Fix potential unreleased lock
samples: bpf: Fix uninitialized variable in xdp_redirect_cpu
selftests/bpf: Reduce more flakyness in sockmap_listen
bpf: Fix bpf-next builds without CONFIG_BPF_EVENTS
bpf: selftests: Add dctcp fallback test
bpf: selftests: Add connect_to_fd_opts to network_helpers
bpf: selftests: Add sk_state to bpf_tcp_helpers.h
bpf: tcp: Allow bpf-tcp-cc to call bpf_(get|set)sockopt
selftests: xsk: Preface options with opt
selftests: xsk: Make enums lower case
selftests: xsk: Generate packets from specification
selftests: xsk: Generate packet directly in umem
selftests: xsk: Simplify cleanup of ifobjects
selftests: xsk: Decrease sending speed
selftests: xsk: Validate tx stats on tx thread
selftests: xsk: Simplify packet validation in xsk tests
selftests: xsk: Rename worker_* functions that are not thread entry points
selftests: xsk: Disassociate umem size with packets sent
selftests: xsk: Remove end-of-test packet
...
====================

Link: https://lore.kernel.org/r/20210830225618.11634-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


ca49bfd9 26-Aug-2021 Maxim Mikityanskiy <maximmi@nvidia.com>

sch_htb: Fix inconsistency when leaf qdisc creation fails

In HTB offload mode, qdiscs of leaf classes are grafted to netdev
queues. sch_htb expects the dev_queue field of these qdiscs to point to
the corresponding queues. However, qdisc creation may fail, and in that
case noop_qdisc is used instead. Its dev_queue doesn't point to the
right queue, so sch_htb can lose track of used netdev queues, which will
cause internal inconsistencies.

This commit fixes this bug by keeping track of the netdev queue inside
struct htb_class. All reads of cl->leaf.q->dev_queue are replaced by the
new field, the two values are synced on writes, and WARNs are added to
assert equality of the two values.

The driver API has changed: when TC_HTB_LEAF_DEL needs to move a queue,
the driver used to pass the old and new queue IDs to sch_htb. Now that
there is a new field (offload_queue) in struct htb_class that needs to
be updated on this operation, the driver will pass the old class ID to
sch_htb instead (it already knows the new class ID).

Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20210826115425.1744053-1-maximmi@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

1b9fbe81 30-Aug-2021 Yajun Deng <yajun.deng@linux.dev>

net: ipv4: Fix the warning for dereference

Add a if statements to avoid the warning.

Dan Carpenter report:
The patch faf482ca196a: "net: ipv4: Move ip_options_fragment() out of
loop" from Aug 23, 2021, leads to the following Smatch complaint:

net/ipv4/ip_output.c:833 ip_do_fragment()
warn: variable dereferenced before check 'iter.frag' (see line 828)

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: faf482ca196a ("net: ipv4: Move ip_options_fragment() out of loop")
Link: https://lore.kernel.org/netdev/20210830073802.GR7722@kadam/T/#t
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>

aaa8e492 30-Aug-2021 Dan Carpenter <dan.carpenter@oracle.com>

net: qrtr: make checks in qrtr_endpoint_post() stricter

These checks are still not strict enough. The main problem is that if
"cb->type == QRTR_TYPE_NEW_SERVER" is true then "len - hdrlen" is
guaranteed to be 4 but we need to be at least 16 bytes. In fact, we
can reject everything smaller than sizeof(*pkt) which is 20 bytes.

Also I don't like the ALIGN(size, 4). It's better to just insist that
data is needs to be aligned at the start.

Fixes: 0baa99ee353c ("net: qrtr: Allow non-immediate node routing")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

efe487fc 29-Aug-2021 Haimin Zhang <tcs_kernel@tencent.com>

fix array-index-out-of-bounds in taprio_change

syzbot report an array-index-out-of-bounds in taprio_change
index 16 is out of range for type '__u16 [16]'
that's because mqprio->num_tc is lager than TC_MAX_QUEUE,so we check
the return value of netdev_set_num_tc.

Reported-by: syzbot+2b3e5fb6c7ef285a94f6@syzkaller.appspotmail.com
Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e842cb60 30-Aug-2021 王贇 <yun.wang@linux.alibaba.com>

net: fix NULL pointer reference in cipso_v4_doi_free

In netlbl_cipsov4_add_std() when 'doi_def->map.std' alloc
failed, we sometime observe panic:

BUG: kernel NULL pointer dereference, address:
...
RIP: 0010:cipso_v4_doi_free+0x3a/0x80
...
Call Trace:
netlbl_cipsov4_add_std+0xf4/0x8c0
netlbl_cipsov4_add+0x13f/0x1b0
genl_family_rcv_msg_doit.isra.15+0x132/0x170
genl_rcv_msg+0x125/0x240

This is because in cipso_v4_doi_free() there is no check
on 'doi_def->map.std' when doi_def->type got value 1, which
is possibe, since netlbl_cipsov4_add_std() haven't initialize
it before alloc 'doi_def->map.std'.

This patch just add the check to prevent panic happen in similar
cases.

Reported-by: Abaci <abaci@linux.alibaba.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

67d6d681 29-Aug-2021 Eric Dumazet <edumazet@google.com>

ipv4: make exception cache less predictible

Even after commit 6457378fe796 ("ipv4: use siphash instead of Jenkins in
fnhe_hashfun()"), an attacker can still use brute force to learn
some secrets from a victim linux host.

One way to defeat these attacks is to make the max depth of the hash
table bucket a random value.

Before this patch, each bucket of the hash table used to store exceptions
could contain 6 items under attack.

After the patch, each bucket would contains a random number of items,
between 6 and 10. The attacker can no longer infer secrets.

This is slightly increasing memory size used by the hash table,
by 50% in average, we do not expect this to be a problem.

This patch is more complex than the prior one (IPv6 equivalent),
because IPv4 was reusing the oldest entry.
Since we need to be able to evict more than one entry per
update_or_create_fnhe() call, I had to replace
fnhe_oldest() with fnhe_remove_oldest().

Also note that we will queue extra kfree_rcu() calls under stress,
which hopefully wont be a too big issue.

Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Keyu Man <kman001@ucr.edu>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: David Ahern <dsahern@kernel.org>
Tested-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

a00df2ca 29-Aug-2021 Eric Dumazet <edumazet@google.com>

ipv6: make exception cache less predictible

Even after commit 4785305c05b2 ("ipv6: use siphash in rt6_exception_hash()"),
an attacker can still use brute force to learn some secrets from a victim
linux host.

One way to defeat these attacks is to make the max depth of the hash
table bucket a random value.

Before this patch, each bucket of the hash table used to store exceptions
could contain 6 items under attack.

After the patch, each bucket would contains a random number of items,
between 6 and 10. The attacker can no longer infer secrets.

This is slightly increasing memory size used by the hash table,
we do not expect this to be a problem.

Following patch is dealing with the same issue in IPv4.

Fixes: 35732d01fe31 ("ipv6: introduce a hash table to store dst cache")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Keyu Man <kman001@ucr.edu>
Cc: Wei Wang <weiwan@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

9dfa859d 30-Aug-2021 David S. Miller <davem@davemloft.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Clean up and consolidate ct ecache infrastructure by merging ct and
expect notifiers, from Florian Westphal.

2) Missing counters and timestamp in nfnetlink_queue and _log conntrack
information.

3) Missing error check for xt_register_template() in iptables mangle,
as a incremental fix for the previous pull request, also from
Florian Westphal.

4) Add netfilter hooks for the SRv6 lightweigh tunnel driver, from
Ryoga Sato. The hooks are enabled via nf_hooks_lwtunnel sysctl
to make sure existing netfilter rulesets do not break. There is
a static key to disable the hooks by default.

The pktgen_bench_xmit_mode_netif_receive.sh shows no noticeable
impact in the seg6_input path for non-netfilter users: similar
numbers with and without this patch.

This is a sample of the perf report output:

11.67% kpktgend_0 [ipv6] [k] ipv6_get_saddr_eval
7.89% kpktgend_0 [ipv6] [k] __ipv6_addr_label
7.52% kpktgend_0 [ipv6] [k] __ipv6_dev_get_saddr
6.63% kpktgend_0 [kernel.vmlinux] [k] asm_exc_nmi
4.74% kpktgend_0 [ipv6] [k] fib6_node_lookup_1
3.48% kpktgend_0 [kernel.vmlinux] [k] pskb_expand_head
3.33% kpktgend_0 [ipv6] [k] ip6_rcv_core.isra.29
3.33% kpktgend_0 [ipv6] [k] seg6_do_srh_encap
2.53% kpktgend_0 [ipv6] [k] ipv6_dev_get_saddr
2.45% kpktgend_0 [ipv6] [k] fib6_table_lookup
2.24% kpktgend_0 [kernel.vmlinux] [k] ___cache_free
2.16% kpktgend_0 [ipv6] [k] ip6_pol_route
2.11% kpktgend_0 [kernel.vmlinux] [k] __ipv6_addr_type
====================

Signed-off-by: David S. Miller <davem@davemloft.net>


d7e7747a 26-Aug-2021 Florian Westphal <fw@strlen.de>

netfilter: refuse insertion if chain has grown too large

Also add a stat counter for this that gets exported both via old /proc
interface and ctnetlink.

Assuming the old default size of 16536 buckets and max hash occupancy of
64k, this results in 128k insertions (origin+reply), so ~8 entries per
chain on average.

The revised settings in this series will result in about two entries per
bucket on average.

This allows a hard-limit ceiling of 64.

This is not tunable at the moment, but its possible to either increase
nf_conntrack_buckets or decrease nf_conntrack_max to reduce average
lengths.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

dd6d2910 26-Aug-2021 Florian Westphal <fw@strlen.de>

netfilter: conntrack: switch to siphash

Replace jhash in conntrack and nat core with siphash.

While at it, use the netns mix value as part of the input key
rather than abuse the seed value.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

d532bcd0 26-Aug-2021 Florian Westphal <fw@strlen.de>

netfilter: conntrack: sanitize table size default settings

conntrack has two distinct table size settings:
nf_conntrack_max and nf_conntrack_buckets.

The former limits how many conntrack objects are allowed to exist
in each namespace.

The second sets the size of the hashtable.

As all entries are inserted twice (once for original direction, once for
reply), there should be at least twice as many buckets in the table than
the maximum number of conntrack objects that can exist at the same time.

Change the default multiplier to 1 and increase the chosen bucket sizes.
This results in the same nf_conntrack_max settings as before but reduces
the average bucket list length.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

7a3f5b0d 17-Aug-2021 Ryoga Saito <contact@proelbtn.com>

netfilter: add netfilter hooks to SRv6 data plane

This patch introduces netfilter hooks for solving the problem that
conntrack couldn't record both inner flows and outer flows.

This patch also introduces a new sysctl toggle for enabling lightweight
tunnel netfilter hooks.

Signed-off-by: Ryoga Saito <contact@proelbtn.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

49b99da2 27-Aug-2021 Rocco Yue <rocco.yue@mediatek.com>

ipv6: add IFLA_INET6_RA_MTU to expose mtu value

The kernel provides a "/proc/sys/net/ipv6/conf/<iface>/mtu"
file, which can temporarily record the mtu value of the last
received RA message when the RA mtu value is lower than the
interface mtu, but this proc has following limitations:

(1) when the interface mtu (/sys/class/net/<iface>/mtu) is
updeated, mtu6 (/proc/sys/net/ipv6/conf/<iface>/mtu) will
be updated to the value of interface mtu;
(2) mtu6 (/proc/sys/net/ipv6/conf/<iface>/mtu) only affect
ipv6 connection, and not affect ipv4.

Therefore, when the mtu option is carried in the RA message,
there will be a problem that the user sometimes cannot obtain
RA mtu value correctly by reading mtu6.

After this patch set, if a RA message carries the mtu option,
you can send a netlink msg which nlmsg_type is RTM_GETLINK,
and then by parsing the attribute of IFLA_INET6_RA_MTU to
get the mtu value carried in the RA message received on the
inet6 device. In addition, you can also get a link notification
when ra_mtu is updated so it doesn't have to poll.

In this way, if the MTU values that the device receives from
the network in the PCO IPv4 and the RA IPv6 procedures are
different, the user can obtain the correct ipv6 ra_mtu value
and compare the value of ra_mtu and ipv4 mtu, then the device
can use the lower MTU value for both IPv4 and IPv6.

Signed-off-by: Rocco Yue <rocco.yue@mediatek.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20210827150412.9267-1-rocco.yue@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dc48e0ab 27-Aug-2021 Olga Kornievskaia <kolga@netapp.com>

SUNRPC enforce creation of no more than max_connect xprts

If we are adding new transports via rpc_clnt_test_and_add_xprt()
then check if we've reached the limit. Currently only pnfs path
adds transports via that function but this is done in
preparation when the client would add new transports when
session trunking is detected. A warning is logged if the
limit is reached.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

df205d0a 27-Aug-2021 Olga Kornievskaia <kolga@netapp.com>

SUNRPC add xps_nunique_destaddr_xprts to xprt_switch_info in sysfs

In sysfs's xprt_switch_info attribute also display the value of
number of transports with unique destination addresses for this
xprt_switch.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

3a3f9766 27-Aug-2021 Olga Kornievskaia <kolga@netapp.com>

SUNRPC keep track of number of transports to unique addresses

Currently, xprt_switch keeps a number of all xprts (xps_nxprts)
that were added to the switch regardless of whethere it's an
nconnect transport or a transport to a trunkable address.
Introduce a new counter to keep track of transports to unique
destination addresses per xprt_switch.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

7c81e6a9 24-Aug-2021 Trond Myklebust <trond.myklebust@hammerspace.com>

SUNRPC: Tweak TCP socket shutdown in the RPC client

We only really need to call shutdown() if we're in the ESTABLISHED TCP
state, since that is the only case where the client is initiating a
close of an established connection.

If the socket is in FIN_WAIT1 or FIN_WAIT2, then we've already initiated
socket shutdown and are waiting for the server's reply, so do nothing.

In all other cases where we've already received a FIN from the server,
we should be able to just close the socket.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

0a6ff58e 24-Aug-2021 Trond Myklebust <trond.myklebust@hammerspace.com>

SUNRPC: Simplify socket shutdown when not reusing TCP ports

If we're not required to reuse the TCP port, then we can just
immediately close the socket, and leave the cleanup details to the TCP
layer.

Fixes: e6237b6feb37 ("NFSv4.1: Don't rebind to the same source port when reconnecting to the server")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

fe50893a 27-Aug-2021 David S. Miller <davem@davemloft.net>

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/
ipsec-next

Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2021-08-27

1) Remove an unneeded extra variable in esp4 esp_ssg_unref.
From Corey Minyard.

2) Add a configuration option to change the default behaviour
to block traffic if there is no matching policy.
Joint work with Christian Langrock and Antony Antony.

3) Fix a shift-out-of-bounce bug reported from syzbot.
From Pavel Skripkin.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>


9758f40e 26-Aug-2021 Paolo Abeni <pabeni@redhat.com>

mptcp: make the locking tx schema more readable

Florian noted the locking schema used by __mptcp_push_pending()
is hard to follow, let's add some more descriptive comments
and drop an unneeded and confusing check.

Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f6c2ef59 26-Aug-2021 Paolo Abeni <pabeni@redhat.com>

mptcp: optimize the input options processing

Most MPTCP packets carries a single MPTCP subption: the
DSS containing the mapping for the current packet.

Check explicitly for the above, so that is such scenario we
replace most conditional statements with a single likely() one.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

74c7dfbe 26-Aug-2021 Paolo Abeni <pabeni@redhat.com>

mptcp: consolidate in_opt sub-options fields in a bitmask

This makes input options processing more consistent with
output ones and will simplify the next patch.

Also avoid clearing the suboption field after processing
it, since it's not needed.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a086aeba 26-Aug-2021 Paolo Abeni <pabeni@redhat.com>

mptcp: better binary layout for mptcp_options_received

This change reorder the mptcp_options_received fields
to shrink the structure a bit and to ensure the most
frequently used fields are all in the first cacheline.

Sub-opt specific flags are moved out of the suboptions area,
and we must now explicitly set them when the relevant
suboption is parsed.

There is a notable exception: 'csum_reqd' is used by both DSS
and MPC suboptions, and keeping such field in the suboptions
flag area will simplfy the next patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8d548ea1 26-Aug-2021 Paolo Abeni <pabeni@redhat.com>

mptcp: do not set unconditionally csum_reqd on incoming opt

Should be set only if the ingress packets present it, otherwise
we can confuse csum validation.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d0efb162 26-Aug-2021 Peter Collingbourne <pcc@google.com>

net: don't unconditionally copy_from_user a struct ifreq for socket ioctls

A common implementation of isatty(3) involves calling a ioctl passing
a dummy struct argument and checking whether the syscall failed --
bionic and glibc use TCGETS (passing a struct termios), and musl uses
TIOCGWINSZ (passing a struct winsize). If the FD is a socket, we will
copy sizeof(struct ifreq) bytes of data from the argument and return
-EFAULT if that fails. The result is that the isatty implementations
may return a non-POSIX-compliant value in errno in the case where part
of the dummy struct argument is inaccessible, as both struct termios
and struct winsize are smaller than struct ifreq (at least on arm64).

Although there is usually enough stack space following the argument
on the stack that this did not present a practical problem up to now,
with MTE stack instrumentation it's more likely for the copy to fail,
as the memory following the struct may have a different tag.

Fix the problem by adding an early check for whether the ioctl is a
valid socket ioctl, and return -ENOTTY if it isn't.

Fixes: 44c02a2c3dc5 ("dev_ioctl(): move copyin/copyout to callers")
Link: https://linux-review.googlesource.com/id/I869da6cf6daabc3e4b7b82ac979683ba05e27d4d
Signed-off-by: Peter Collingbourne <pcc@google.com>
Cc: <stable@vger.kernel.org> # 4.19
Signed-off-by: David S. Miller <davem@davemloft.net>

3aa7857f 25-Aug-2021 Neil Spring <ntspring@fb.com>

tcp: enable mid stream window clamp

The TCP_WINDOW_CLAMP socket option is defined in tcp(7) to "Bound the size
of the advertised window to this value." Window clamping is distributed
across two variables, window_clamp ("Maximal window to advertise" in
tcp.h) and rcv_ssthresh ("Current window clamp").

This patch updates the function where the window clamp is set to also
reduce the current window clamp, rcv_sshthresh, if needed. With this,
setting the TCP_WINDOW_CLAMP option has the documented effect of limiting
the window.

Signed-off-by: Neil Spring <ntspring@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210825210117.1668371-1-ntspring@fb.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

97c78d0a 26-Aug-2021 Jakub Kicinski <kuba@kernel.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

drivers/net/wwan/mhi_wwan_mbim.c - drop the extra arg.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>


73367f05 26-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'nfsd-5.14-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd fix from Bruce Fields:
"This is a one-liner fix for a serious bug that can cause the server to
become unresponsive to a client, so I think it's worth the last-minute
inclusion for 5.14"

* tag 'nfsd-5.14-1' of git://linux-nfs.org/~bfields/linux:
SUNRPC: Fix XPT_BUSY flag leakage in svc_handle_xprt()...


9ebc2758 26-Aug-2021 Kalle Valo <kvalo@codeaurora.org>

Revert "net: really fix the build..."

This reverts commit ce78ffa3ef1681065ba451cfd545da6126f5ca88.

Wren and Nicolas reported that ath11k was failing to initialise QCA6390
Wi-Fi 6 device with error:

qcom_mhi_qrtr: probe of mhi0_IPCR failed with error -22

Commit ce78ffa3ef16 ("net: really fix the build..."), introduced in
v5.14-rc5, caused this regression in qrtr. Most likely all ath11k
devices are broken, but I only tested QCA6390. Let's revert the broken
commit so that ath11k works again.

Reported-by: Wren Turkal <wt@penguintechs.org>
Reported-by: Nicolas Schichan <nschichan@freebox.fr>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/20210826172816.24478-1-kvalo@codeaurora.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

733c99ee 25-Aug-2021 王贇 <yun.wang@linux.alibaba.com>

net: fix NULL pointer reference in cipso_v4_doi_free

In netlbl_cipsov4_add_std() when 'doi_def->map.std' alloc
failed, we sometime observe panic:

BUG: kernel NULL pointer dereference, address:
...
RIP: 0010:cipso_v4_doi_free+0x3a/0x80
...
Call Trace:
netlbl_cipsov4_add_std+0xf4/0x8c0
netlbl_cipsov4_add+0x13f/0x1b0
genl_family_rcv_msg_doit.isra.15+0x132/0x170
genl_rcv_msg+0x125/0x240

This is because in cipso_v4_doi_free() there is no check
on 'doi_def->map.std' when 'doi_def->type' equal 1, which
is possibe, since netlbl_cipsov4_add_std() haven't initialize
it before alloc 'doi_def->map.std'.

This patch just add the check to prevent panic happen for similar
cases.

Reported-by: Abaci <abaci@linux.alibaba.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

96a6b93b 25-Aug-2021 Andrey Ignatov <rdna@fb.com>

rtnetlink: Return correct error on changing device netns

Currently when device is moved between network namespaces using
RTM_NEWLINK message type and one of netns attributes (FLA_NET_NS_PID,
IFLA_NET_NS_FD, IFLA_TARGET_NETNSID) but w/o specifying IFLA_IFNAME, and
target namespace already has device with same name, userspace will get
EINVAL what is confusing and makes debugging harder.

Fix it so that userspace gets more appropriate EEXIST instead what makes
debugging much easier.

Before:

# ./ifname.sh
+ ip netns add ns0
+ ip netns exec ns0 ip link add l0 type dummy
+ ip netns exec ns0 ip link show l0
8: l0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 66:90:b5:d5:78:69 brd ff:ff:ff:ff:ff:ff
+ ip link add l0 type dummy
+ ip link show l0
10: l0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 6e:c6:1f:15:20:8d brd ff:ff:ff:ff:ff:ff
+ ip link set l0 netns ns0
RTNETLINK answers: Invalid argument

After:

# ./ifname.sh
+ ip netns add ns0
+ ip netns exec ns0 ip link add l0 type dummy
+ ip netns exec ns0 ip link show l0
8: l0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 1e:4a:72:e3:e3:8f brd ff:ff:ff:ff:ff:ff
+ ip link add l0 type dummy
+ ip link show l0
10: l0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether f2:fc:fe:2b:7d:a6 brd ff:ff:ff:ff:ff:ff
+ ip link set l0 netns ns0
RTNETLINK answers: File exists

The problem is that do_setlink() passes its `char *ifname` argument,
that it gets from a caller, to __dev_change_net_namespace() as is (as
`const char *pat`), but semantics of ifname and pat can be different.

For example, __rtnl_newlink() does this:

net/core/rtnetlink.c
3270 char ifname[IFNAMSIZ];
...
3286 if (tb[IFLA_IFNAME])
3287 nla_strscpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ);
3288 else
3289 ifname[0] = '\0';
...
3364 if (dev) {
...
3394 return do_setlink(skb, dev, ifm, extack, tb, ifname, status);
3395 }

, i.e. do_setlink() gets ifname pointer that is always valid no matter
if user specified IFLA_IFNAME or not and then do_setlink() passes this
ifname pointer as is to __dev_change_net_namespace() as pat argument.

But the pat (pattern) in __dev_change_net_namespace() is used as:

net/core/dev.c
11198 err = -EEXIST;
11199 if (__dev_get_by_name(net, dev->name)) {
11200 /* We get here if we can't use the current device name */
11201 if (!pat)
11202 goto out;
11203 err = dev_get_valid_name(net, dev, pat);
11204 if (err < 0)
11205 goto out;
11206 }

As the result the `goto out` path on line 11202 is neven taken and
instead of returning EEXIST defined on line 11198,
__dev_change_net_namespace() returns an error from dev_get_valid_name()
and this, in turn, will be EINVAL for ifname[0] = '\0' set earlier.

Fixes: d8a5ec672768 ("[NET]: netlink support for moving devices between network namespaces.")
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8b325d2a 26-Aug-2021 David S. Miller <davem@davemloft.net>

Merge tag 'mac80211-next-for-net-next-2021-08-26' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
A few more things:
* Use correct DFS domain for self-managed devices
* some preparations for transmit power element handling
and other 6 GHz regulatory handling
* TWT support in AP mode in mac80211
====================

Signed-off-by: David S. Miller <davem@davemloft.net>


723783d0 25-Aug-2021 Yunsheng Lin <linyunsheng@huawei.com>

sock: remove one redundant SKB_FRAG_PAGE_ORDER macro

Both SKB_FRAG_PAGE_ORDER are defined to the same value in
net/core/sock.c and drivers/vhost/net.c.

Move the SKB_FRAG_PAGE_ORDER definition to net/core/sock.h,
as both net/core/sock.c and drivers/vhost/net.c include it,
and it seems a reasonable file to put the macro.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6457378f 25-Aug-2021 Eric Dumazet <edumazet@google.com>

ipv4: use siphash instead of Jenkins in fnhe_hashfun()

A group of security researchers brought to our attention
the weakness of hash function used in fnhe_hashfun().

Lets use siphash instead of Jenkins Hash, to considerably
reduce security risks.

Also remove the inline keyword, this really is distracting.

Fixes: d546c621542d ("ipv4: harden fnhe_hashfun()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Keyu Man <kman001@ucr.edu>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>

4785305c 25-Aug-2021 Eric Dumazet <edumazet@google.com>

ipv6: use siphash in rt6_exception_hash()

A group of security researchers brought to our attention
the weakness of hash function used in rt6_exception_hash()

Lets use siphash instead of Jenkins Hash, to considerably
reduce security risks.

Following patch deals with IPv4.

Fixes: 35732d01fe31 ("ipv6: introduce a hash table to store dst cache")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Keyu Man <kman001@ucr.edu>
Cc: Wei Wang <weiwan@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

90bd5bee 25-Aug-2021 Sriram R <srirrama@codeaurora.org>

cfg80211: use wiphy DFS domain if it is self-managed

Currently during CAC start or other radar events, the DFS
domain is fetched from cfg based on global DFS domain,
even if the wiphy regdomain disagrees.

But this could be different in case of self managed wiphy's
in case the self managed driver updates its database or supports
regions which has DFS domain set to UNSET in cfg80211 local
regdomain.

So for explicitly self-managed wiphys, just use their DFS
domain.

Signed-off-by: Sriram R <srirrama@codeaurora.org>
Link: https://lore.kernel.org/r/1629934730-16388-1-git-send-email-srirrama@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

b0345850 20-Aug-2021 Wen Gong <wgong@codeaurora.org>

mac80211: parse transmit power envelope element

Parse and store the transmit power envelope element.

Signed-off-by: Wen Gong <wgong@codeaurora.org>
Link: https://lore.kernel.org/r/20210820122041.12157-8-wgong@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

eb18b49e 24-Aug-2021 Martin KaFai Lau <kafai@fb.com>

bpf: tcp: Allow bpf-tcp-cc to call bpf_(get|set)sockopt

This patch allows the bpf-tcp-cc to call bpf_setsockopt. One use
case is to allow a bpf-tcp-cc switching to another cc during init().
For example, when the tcp flow is not ecn ready, the bpf_dctcp
can switch to another cc by calling setsockopt(TCP_CONGESTION).

During setsockopt(TCP_CONGESTION), the new tcp-cc's init() will be
called and this could cause a recursion but it is stopped by the
current trampoline's logic (in the prog->active counter).

While retiring a bpf-tcp-cc (e.g. in tcp_v[46]_destroy_sock()),
the tcp stack calls bpf-tcp-cc's release(). To avoid the retiring
bpf-tcp-cc making further changes to the sk, bpf_setsockopt is not
available to the bpf-tcp-cc's release(). This will avoid release()
making setsockopt() call that will potentially allocate new resources.

Although the bpf-tcp-cc already has a more powerful way to read tcp_sock
from the PTR_TO_BTF_ID, it is usually expected that bpf_getsockopt and
bpf_setsockopt are available together. Thus, bpf_getsockopt() is also
added to all tcp_congestion_ops except release().

When the old bpf-tcp-cc is calling setsockopt(TCP_CONGESTION)
to switch to a new cc, the old bpf-tcp-cc will be released by
bpf_struct_ops_put(). Thus, this patch also puts the bpf_struct_ops_map
after a rcu grace period because the trampoline's image cannot be freed
while the old bpf-tcp-cc is still running.

bpf-tcp-cc can only access icsk_ca_priv as SCALAR. All kernel's
tcp-cc is also accessing the icsk_ca_priv as SCALAR. The size
of icsk_ca_priv has already been raised a few times to avoid
extra kmalloc and memory referencing. The only exception is the
kernel's tcp_cdg.c that stores a kmalloc()-ed pointer in icsk_ca_priv.
To avoid the old bpf-tcp-cc accidentally overriding this tcp_cdg's pointer
value stored in icsk_ca_priv after switching and without over-complicating
the bpf's verifier for this one exception in tcp_cdg, this patch does not
allow switching to tcp_cdg. If there is a need, bpf_tcp_cdg can be
implemented and then use the bpf_sk_storage as the extended storage.

bpf_sk_setsockopt proto has only been recently added and used
in bpf-sockopt and bpf-iter-tcp, so impose the tcp_cdg limitation in the
same proto instead of adding a new proto specifically for bpf-tcp-cc.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210824173007.3976921-1-kafai@fb.com

062b829c 25-Aug-2021 Trond Myklebust <trond.myklebust@hammerspace.com>

SUNRPC: Fix XPT_BUSY flag leakage in svc_handle_xprt()...

If the attempt to reserve a slot fails, we currently leak the XPT_BUSY
flag on the socket. Among other things, this make it impossible to close
the socket.

Fixes: 82011c80b3ec ("SUNRPC: Move svc_xprt_received() call sites")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

d32f89da 24-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

net: add accept helper not installing fd

Introduce and reuse a helper that acts similarly to __sys_accept4_file()
but returns struct file instead of installing file descriptor. Will be
used by io_uring.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Acked-by: David S. Miller <davem@davemloft.net>
Link: https://lore.kernel.org/r/c57b9e8e818d93683a3d24f8ca50ca038d1da8c4.1629888991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7bc416f1 23-Aug-2021 Lukas Bulwahn <lukas.bulwahn@gmail.com>

netfilter: x_tables: handle xt_register_template() returning an error value

Commit fdacd57c79b7 ("netfilter: x_tables: never register tables by
default") introduces the function xt_register_template(), and in one case,
a call to that function was missing the error-case handling.

Handle when xt_register_template() returns an error value.

This was identified with the clang-analyzer's Dead-Store analysis.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

6c89dac5 18-Aug-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: ctnetlink: missing counters and timestamp in nfnetlink_{log,queue}

Add counters and timestamps (if available) to the conntrack object
that is represented in nfnetlink_log and _queue messages.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

bd1431db 16-Aug-2021 Florian Westphal <fw@strlen.de>

netfilter: ecache: remove nf_exp_event_notifier structure

Reuse the conntrack event notofier struct, this allows to remove the
extra register/unregister functions and avoids a pointer in struct net.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

b86c0e64 16-Aug-2021 Florian Westphal <fw@strlen.de>

netfilter: ecache: prepare for event notifier merge

This prepares for merge for ct and exp notifier structs.

The 'fcn' member is renamed to something unique.
Second, the register/unregister api is simplified. There is only
one implementation so there is no need to do any error checking.

Replace the EBUSY logic with WARN_ON_ONCE. This allows to remove
error unwinding.

The exp notifier register/unregister function is removed in
a followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

b3afdc17 16-Aug-2021 Florian Westphal <fw@strlen.de>

netfilter: ecache: add common helper for nf_conntrack_eventmask_report

nf_ct_deliver_cached_events and nf_conntrack_eventmask_report are very
similar. Split nf_conntrack_eventmask_report into a common helper
function that can be used for both cases.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

9291f090 16-Aug-2021 Florian Westphal <fw@strlen.de>

netfilter: ecache: remove another indent level

... by changing:

if (unlikely(ret < 0 || missed)) {
if (ret < 0) {
to
if (likely(ret >= 0 && !missed))
goto out;

if (ret < 0) {

After this nf_conntrack_eventmask_report and nf_ct_deliver_cached_events
look pretty much the same, next patch moves common code to a helper.

This patch has no effect on generated code.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

478374a3 16-Aug-2021 Florian Westphal <fw@strlen.de>

netfilter: ecache: remove one indent level

nf_conntrack_eventmask_report and nf_ct_deliver_cached_events shared
most of their code. This unifies the layout by changing

if (nf_ct_is_confirmed(ct)) {
foo
}

to
if (!nf_ct_is_confirmed(ct)))
return
foo

This removes one level of indentation.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

cd9b50ad 24-Aug-2021 Davide Caratti <dcaratti@redhat.com>

net/sched: ets: fix crash when flipping from 'strict' to 'quantum'

While running kselftests, Hangbin observed that sch_ets.sh often crashes,
and splats like the following one are seen in the output of 'dmesg':

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 159f12067 P4D 159f12067 PUD 159f13067 PMD 0
Oops: 0000 [#1] SMP NOPTI
CPU: 2 PID: 921 Comm: tc Not tainted 5.14.0-rc6+ #458
Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
RIP: 0010:__list_del_entry_valid+0x2d/0x50
Code: 48 8b 57 08 48 b9 00 01 00 00 00 00 ad de 48 39 c8 0f 84 ac 6e 5b 00 48 b9 22 01 00 00 00 00 ad de 48 39 ca 0f 84 cf 6e 5b 00 <48> 8b 32 48 39 fe 0f 85 af 6e 5b 00 48 8b 50 08 48 39 f2 0f 85 94
RSP: 0018:ffffb2da005c3890 EFLAGS: 00010217
RAX: 0000000000000000 RBX: ffff9073ba23f800 RCX: dead000000000122
RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff9073ba23fbc8
RBP: ffff9073ba23f890 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000001 R12: dead000000000100
R13: ffff9073ba23fb00 R14: 0000000000000002 R15: 0000000000000002
FS: 00007f93e5564e40(0000) GS:ffff9073bba00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000014ad34000 CR4: 0000000000350ee0
Call Trace:
ets_qdisc_reset+0x6e/0x100 [sch_ets]
qdisc_reset+0x49/0x1d0
tbf_reset+0x15/0x60 [sch_tbf]
qdisc_reset+0x49/0x1d0
dev_reset_queue.constprop.42+0x2f/0x90
dev_deactivate_many+0x1d3/0x3d0
dev_deactivate+0x56/0x90
qdisc_graft+0x47e/0x5a0
tc_get_qdisc+0x1db/0x3e0
rtnetlink_rcv_msg+0x164/0x4c0
netlink_rcv_skb+0x50/0x100
netlink_unicast+0x1a5/0x280
netlink_sendmsg+0x242/0x480
sock_sendmsg+0x5b/0x60
____sys_sendmsg+0x1f2/0x260
___sys_sendmsg+0x7c/0xc0
__sys_sendmsg+0x57/0xa0
do_syscall_64+0x3a/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f93e44b8338
Code: 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 25 43 2c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 41 89 d4 55
RSP: 002b:00007ffc0db737a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000061255c06 RCX: 00007f93e44b8338
RDX: 0000000000000000 RSI: 00007ffc0db73810 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000687880 R14: 0000000000000000 R15: 0000000000000000
Modules linked in: sch_ets sch_tbf dummy rfkill iTCO_wdt iTCO_vendor_support intel_rapl_msr intel_rapl_common joydev i2c_i801 pcspkr i2c_smbus lpc_ich virtio_balloon ip_tables xfs libcrc32c crct10dif_pclmul crc32_pclmul crc32c_intel ahci libahci ghash_clmulni_intel libata serio_raw virtio_blk virtio_console virtio_net net_failover failover sunrpc dm_mirror dm_region_hash dm_log dm_mod
CR2: 0000000000000000

When the change() function decreases the value of 'nstrict', we must take
into account that packets might be already enqueued on a class that flips
from 'strict' to 'quantum': otherwise that class will not be added to the
bandwidth-sharing list. Then, a call to ets_qdisc_reset() will attempt to
do list_del(&alist) with 'alist' filled with zero, hence the NULL pointer
dereference.
For classes flipping from 'strict' to 'quantum', initialize an empty list
and eventually add it to the bandwidth-sharing list, if there are packets
already enqueued. In this way, the kernel will:
a) prevent crashing as described above.
b) avoid retaining the backlog packets (for an arbitrarily long time) in
case no packet is enqueued after a change from 'strict' to 'quantum'.

Reported-by: Hangbin Liu <liuhangbin@gmail.com>
Fixes: dcc68b4d8084 ("net: sch_ets: Add a new Qdisc")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8ded9160 24-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: tag_sja1105: stop asking the sja1105 driver in sja1105_xmit_tpid

Introduced in commit 38b5beeae7a4 ("net: dsa: sja1105: prepare tagger
for handling DSA tags and VLAN simultaneously"), the sja1105_xmit_tpid
function solved quite a different problem than our needs are now.

Then, we used best-effort VLAN filtering and we were using the xmit_tpid
to tunnel packets coming from an 8021q upper through the TX VLAN allocated
by tag_8021q to that egress port. The need for a different VLAN protocol
depending on switch revision came from the fact that this in itself was
more of a hack to trick the hardware into accepting tunneled VLANs in
the first place.

Right now, we deny 8021q uppers (see sja1105_prechangeupper). Even if we
supported them again, we would not do that using the same method of
{tunneling the VLAN on egress, retagging the VLAN on ingress} that we
had in the best-effort VLAN filtering mode. It seems rather simpler that
we just allocate a VLAN in the VLAN table that is simply not used by the
bridge at all, or by any other port.

Anyway, I have 2 gripes with the current sja1105_xmit_tpid:

1. When sending packets on behalf of a VLAN-aware bridge (with the new
TX forwarding offload framework) plus untagged (with the tag_8021q
VLAN added by the tagger) packets, we can see that on SJA1105P/Q/R/S
and later (which have a qinq_tpid of ETH_P_8021AD), some packets sent
through the DSA master have a VLAN protocol of 0x8100 and others of
0x88a8. This is strange and there is no reason for it now. If we have
a bridge and are therefore forced to send using that bridge's TPID,
we can as well blend with that bridge's VLAN protocol for all packets.

2. The sja1105_xmit_tpid introduces a dependency on the sja1105 driver,
because it looks inside dp->priv. It is desirable to keep as much
separation between taggers and switch drivers as possible. Now it
doesn't do that anymore.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b0b8c67e 24-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: sja1105: drop untagged packets on the CPU and DSA ports

The sja1105 driver is a bit special in its use of VLAN headers as DSA
tags. This is because in VLAN-aware mode, the VLAN headers use an actual
TPID of 0x8100, which is understood even by the DSA master as an actual
VLAN header.

Furthermore, control packets such as PTP and STP are transmitted with no
VLAN header as a DSA tag, because, depending on switch generation, there
are ways to steer these control packets towards a precise egress port
other than VLAN tags. Transmitting control packets as untagged means
leaving a door open for traffic in general to be transmitted as untagged
from the DSA master, and for it to traverse the switch and exit a random
switch port according to the FDB lookup.

This behavior is a bit out of line with other DSA drivers which have
native support for DSA tagging. There, it is to be expected that the
switch only accepts DSA-tagged packets on its CPU port, dropping
everything that does not match this pattern.

We perhaps rely a bit too much on the switches' hardware dropping on the
CPU port, and place no other restrictions in the kernel data path to
avoid that. For example, sja1105 is also a bit special in that STP/PTP
packets are transmitted using "management routes"
(sja1105_port_deferred_xmit): when sending a link-local packet from the
CPU, we must first write a SPI message to the switch to tell it to
expect a packet towards multicast MAC DA 01-80-c2-00-00-0e, and to route
it towards port 3 when it gets it. This entry expires as soon as it
matches a packet received by the switch, and it needs to be reinstalled
for the next packet etc. All in all quite a ghetto mechanism, but it is
all that the sja1105 switches offer for injecting a control packet.
The driver takes a mutex for serializing control packets and making the
pairs of SPI writes of a management route and its associated skb atomic,
but to be honest, a mutex is only relevant as long as all parties agree
to take it. With the DSA design, it is possible to open an AF_PACKET
socket on the DSA master net device, and blast packets towards
01-80-c2-00-00-0e, and whatever locking the DSA switch driver might use,
it all goes kaput because management routes installed by the driver will
match skbs sent by the DSA master, and not skbs generated by the driver
itself. So they will end up being routed on the wrong port.

So through the lens of that, maybe it would make sense to avoid that
from happening by doing something in the network stack, like: introduce
a new bit in struct sk_buff, like xmit_from_dsa. Then, somewhere around
dev_hard_start_xmit(), introduce the following check:

if (netdev_uses_dsa(dev) && !skb->xmit_from_dsa)
kfree_skb(skb);

Ok, maybe that is a bit drastic, but that would at least prevent a bunch
of problems. For example, right now, even though the majority of DSA
switches drop packets without DSA tags sent by the DSA master (and
therefore the majority of garbage that user space daemons like avahi and
udhcpcd and friends create), it is still conceivable that an aggressive
user space program can open an AF_PACKET socket and inject a spoofed DSA
tag directly on the DSA master. We have no protection against that; the
packet will be understood by the switch and be routed wherever user
space says. Furthermore: there are some DSA switches where we even have
register access over Ethernet, using DSA tags. So even user space
drivers are possible in this way. This is a huge hole.

However, the biggest thing that bothers me is that udhcpcd attempts to
ask for an IP address on all interfaces by default, and with sja1105, it
will attempt to get a valid IP address on both the DSA master as well as
on sja1105 switch ports themselves. So with IP addresses in the same
subnet on multiple interfaces, the routing table will be messed up and
the system will be unusable for traffic until it is configured manually
to not ask for an IP address on the DSA master itself.

It turns out that it is possible to avoid that in the sja1105 driver, at
least very superficially, by requesting the switch to drop VLAN-untagged
packets on the CPU port. With the exception of control packets, all
traffic originated from tag_sja1105.c is already VLAN-tagged, so only
STP and PTP packets need to be converted. For that, we need to uphold
the equivalence between an untagged and a pvid-tagged packet, and to
remember that the CPU port of sja1105 uses a pvid of 4095.

Now that we drop untagged traffic on the CPU port, non-aggressive user
space applications like udhcpcd stop bothering us, and sja1105 effectively
becomes just as vulnerable to the aggressive kind of user space programs
as other DSA switches are (ok, users can also create 8021q uppers on top
of the DSA master in the case of sja1105, but in future patches we can
easily deny that, but it still doesn't change the fact that VLAN-tagged
packets can still be injected over raw sockets).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

eb7f3365 24-Aug-2021 Geliang Tang <geliangtang@xiaomi.com>

mptcp: add the mibs for MP_FAIL

This patch added the mibs for MP_FAIL: MPTCP_MIB_MPFAILTX and
MPTCP_MIB_MPFAILRX.

Signed-off-by: Geliang Tang <geliangtang@xiaomi.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

478d7700 24-Aug-2021 Geliang Tang <geliangtang@xiaomi.com>

mptcp: send out MP_FAIL when data checksum fails

When a bad checksum is detected, set the send_mp_fail flag to send out
the MP_FAIL option.

Add a new function mptcp_has_another_subflow() to check whether there's
only a single subflow.

When multiple subflows are in use, close the affected subflow with a RST
that includes an MP_FAIL option and discard the data with the bad
checksum.

Set the sk_state of the subsocket to TCP_CLOSE, then the flag
MPTCP_WORK_CLOSE_SUBFLOW will be set in subflow_sched_work_if_closed,
and the subflow will be closed.

When a single subfow is in use, temporarily handled by sending MP_FAIL
with a RST too.

Signed-off-by: Geliang Tang <geliangtang@xiaomi.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5580d41b 24-Aug-2021 Geliang Tang <geliangtang@xiaomi.com>

mptcp: MP_FAIL suboption receiving

This patch added handling for receiving MP_FAIL suboption.

Add a new members mp_fail and fail_seq in struct mptcp_options_received.
When MP_FAIL suboption is received, set mp_fail to 1 and save the sequence
number to fail_seq.

Then invoke mptcp_pm_mp_fail_received to deal with the MP_FAIL suboption.

Signed-off-by: Geliang Tang <geliangtang@xiaomi.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c25aeb4e 24-Aug-2021 Geliang Tang <geliangtang@xiaomi.com>

mptcp: MP_FAIL suboption sending

This patch added the MP_FAIL suboption sending support.

Add a new flag named send_mp_fail in struct mptcp_subflow_context. If
this flag is set, send out MP_FAIL suboption.

Add a new member fail_seq in struct mptcp_out_options to save the data
sequence number to put into the MP_FAIL suboption.

An MP_FAIL option could be included in a RST or on the subflow-level
ACK.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@xiaomi.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1bff1e43 24-Aug-2021 Paolo Abeni <pabeni@redhat.com>

mptcp: optimize out option generation

Currently we have several protocol constraints on MPTCP options
generation (e.g. MPC and MPJ subopt are mutually exclusive)
and some additional ones required by our implementation
(e.g. almost all ADD_ADDR variant are mutually exclusive with
everything else).

We can leverage the above to optimize the out option generation:
we check DSS/MPC/MPJ presence in a mutually exclusive way,
avoiding many unneeded conditionals in the common cases.

Additionally extend the existing constraints on ADD_ADDR opt on
all subvariants, so that it becomes fully mutually exclusive with
the above and we can skip another conditional statement for the
common case.

This change is also needed by the next patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

406f42fa 19-Aug-2021 Gilad Naaman <gnaaman@drivenets.com>

net-next: When a bond have a massive amount of VLANs with IPv6 addresses, performance of changing link state, attaching a VRF, changing an IPv6 address, etc. go down dramtically.

The source of most of the slow down is the `dev_addr_lists.c` module,
which mainatins a linked list of HW addresses.
When using IPv6, this list grows for each IPv6 address added on a
VLAN, since each IPv6 address has a multicast HW address associated with
it.

When performing any modification to the involved links, this list is
traversed many times, often for nothing, all while holding the RTNL
lock.

Instead, this patch adds an auxilliary rbtree which cuts down
traversal time significantly.

Performance can be seen with the following script:

#!/bin/bash
ip netns del test || true 2>/dev/null
ip netns add test

echo 1 | ip netns exec test tee /proc/sys/net/ipv6/conf/all/keep_addr_on_down > /dev/null

set -e

ip -n test link add foo type veth peer name bar
ip -n test link add b1 type bond
ip -n test link add florp type vrf table 10

ip -n test link set bar master b1
ip -n test link set foo up
ip -n test link set bar up
ip -n test link set b1 up
ip -n test link set florp up

VLAN_COUNT=1500
BASE_DEV=b1

echo Creating vlans
ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT);
do ip -n test link add link $BASE_DEV name foo.\$i type vlan id \$i; done"

echo Bringing them up
ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT);
do ip -n test link set foo.\$i up; done"

echo Assiging IPv6 Addresses
ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT);
do ip -n test address add dev foo.\$i 2000::\$i/64; done"

echo Attaching to VRF
ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT);
do ip -n test link set foo.\$i master florp; done"

On an Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz machine, the performance
before the patch is (truncated):

Creating vlans
real 108.35
Bringing them up
real 4.96
Assiging IPv6 Addresses
real 19.22
Attaching to VRF
real 458.84

After the patch:

Creating vlans
real 5.59
Bringing them up
real 5.07
Assiging IPv6 Addresses
real 5.64
Attaching to VRF
real 25.37

Cc: David S. Miller <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Lu Wei <luwei32@huawei.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Gilad Naaman <gnaaman@drivenets.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a37c5c26 23-Aug-2021 Kangmin Park <l4stpr0gr4m@gmail.com>

net: bridge: change return type of br_handle_ingress_vlan_tunnel

br_handle_ingress_vlan_tunnel() is only referenced in
br_handle_frame(). If br_handle_ingress_vlan_tunnel() is called and
return non-zero value, goto drop in br_handle_frame().

But, br_handle_ingress_vlan_tunnel() always return 0. So, the
routines that check the return value and goto drop has no meaning.

Therefore, change return type of br_handle_ingress_vlan_tunnel() to
void and remove if statement of br_handle_frame().

Signed-off-by: Kangmin Park <l4stpr0gr4m@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20210823102118.17966-1-l4stpr0gr4m@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

fab60e29 20-Aug-2021 Xu Liu <liuxu623@gmail.com>

bpf: Allow bpf_get_netns_cookie in BPF_PROG_TYPE_SK_MSG

We'd like to be able to identify netns from sk_msg hooks
to accelerate local process communication form different netns.

Signed-off-by: Xu Liu <liuxu623@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210820071712.52852-2-liuxu623@gmail.com

f3ccfda1 20-Aug-2021 Yufeng Mo <moyufeng@huawei.com>

ethtool: extend coalesce setting uAPI with CQE mode

In order to support more coalesce parameters through netlink,
add two new parameter kernel_coal and extack for .set_coalesce
and .get_coalesce, then some extra info can return to user with
the netlink API.

Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

029ee6b1 20-Aug-2021 Yufeng Mo <moyufeng@huawei.com>

ethtool: add two coalesce attributes for CQE mode

Currently, there are many drivers who support CQE mode configuration,
some configure it as a fixed when initialized, some provide an
interface to change it by ethtool private flags. In order to make it
more generic, add two new 'ETHTOOL_A_COALESCE_USE_CQE_TX' and
'ETHTOOL_A_COALESCE_USE_CQE_RX' coalesce attributes, then these
parameters can be accessed by ethtool netlink coalesce uAPI.

Also add an new structure kernel_ethtool_coalesce, then the
new parameter can be added into this struct.

Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7fb9b66d 24-Aug-2021 Yunsheng Lin <linyunsheng@huawei.com>

page_pool: use relaxed atomic for release side accounting

There is no need to synchronize the account updating, so
use the relaxed atomic to avoid some memory barrier in the
data path.

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

446e7f21 22-Aug-2021 zhang kai <zhangkaiheb@126.com>

ipv6: correct comments about fib6_node sernum

correct comments in set and get fn_sernum

Signed-off-by: zhang kai <zhangkaiheb@126.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

58adf9dc 23-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: let drivers state that they need VLAN filtering while standalone

As explained in commit e358bef7c392 ("net: dsa: Give drivers the chance
to veto certain upper devices"), the hellcreek driver uses some tricks
to comply with the network stack expectations: it enforces port
separation in standalone mode using VLANs. For untagged traffic,
bridging between ports is prevented by using different PVIDs, and for
VLAN-tagged traffic, it never accepts 8021q uppers with the same VID on
two ports, so packets with one VLAN cannot leak from one port to another.

That is almost fine*, and has worked because hellcreek relied on an
implicit behavior of the DSA core that was changed by the previous
patch: the standalone ports declare the 'rx-vlan-filter' feature as 'on
[fixed]'. Since most of the DSA drivers are actually VLAN-unaware in
standalone mode, that feature was actually incorrectly reflecting the
hardware/driver state, so there was a desire to fix it. This leaves the
hellcreek driver in a situation where it has to explicitly request this
behavior from the DSA framework.

We configure the ports as follows:

- Standalone: 'rx-vlan-filter' is on. An 8021q upper on top of a
standalone hellcreek port will go through dsa_slave_vlan_rx_add_vid
and will add a VLAN to the hardware tables, giving the driver the
opportunity to refuse it through .port_prechangeupper.

- Bridged with vlan_filtering=0: 'rx-vlan-filter' is off. An 8021q upper
on top of a bridged hellcreek port will not go through
dsa_slave_vlan_rx_add_vid, because there will not be any attempt to
offload this VLAN. The driver already disables VLAN awareness, so that
upper should receive the traffic it needs.

- Bridged with vlan_filtering=1: 'rx-vlan-filter' is on. An 8021q upper
on top of a bridged hellcreek port will call dsa_slave_vlan_rx_add_vid,
and can again be vetoed through .port_prechangeupper.

*It is not actually completely fine, because if I follow through
correctly, we can have the following situation:

ip link add br0 type bridge vlan_filtering 0
ip link set lan0 master br0 # lan0 now becomes VLAN-unaware
ip link set lan0 nomaster # lan0 fails to become VLAN-aware again, therefore breaking isolation

This patch fixes that corner case by extending the DSA core logic, based
on this requested attribute, to change the VLAN awareness state of the
switch (port) when it leaves the bridge.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

06cfb2df 23-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: don't advertise 'rx-vlan-filter' when not needed

There have been multiple independent reports about
dsa_slave_vlan_rx_add_vid being called (and consequently calling the
drivers' .port_vlan_add) when it isn't needed, and sometimes (not
always) causing problems in the process.

Case 1:
mv88e6xxx_port_vlan_prepare is stubborn and only accepts VLANs on
bridged ports. That is understandably so, because standalone mv88e6xxx
ports are VLAN-unaware, and VTU entries are said to be a scarce
resource.

Otherwise said, the following fails lamentably on mv88e6xxx:

ip link add br0 type bridge vlan_filtering 1
ip link set lan3 master br0
ip link add link lan10 name lan10.1 type vlan id 1
[485256.724147] mv88e6085 d0032004.mdio-mii:12: p10: hw VLAN 1 already used by port 3 in br0
RTNETLINK answers: Operation not supported

This has become a worse issue since commit 9b236d2a69da ("net: dsa:
Advertise the VLAN offload netdev ability only if switch supports it").
Up to that point, the driver was returning -EOPNOTSUPP and DSA was
reconverting that error to 0, making the 8021q upper think all is ok
(but obviously the error message was there even prior to this change).
After that change the -EOPNOTSUPP is propagated to vlan_vid_add, and it
is a hard error.

Case 2:
Ports that don't offload the Linux bridge (have a dp->bridge_dev = NULL
because they don't implement .port_bridge_{join,leave}). Understandably,
a standalone port should not offload VLANs either, it should remain VLAN
unaware and any VLAN should be a software VLAN (as long as the hardware
is not quirky, that is).

In fact, dsa_slave_port_obj_add does do the right thing and rejects
switchdev VLAN objects coming from the bridge when that bridge is not
offloaded:

case SWITCHDEV_OBJ_ID_PORT_VLAN:
if (!dsa_port_offloads_bridge_port(dp, obj->orig_dev))
return -EOPNOTSUPP;

err = dsa_slave_vlan_add(dev, obj, extack);

But it seems that the bridge is able to trick us. The __vlan_vid_add
from br_vlan.c has:

/* Try switchdev op first. In case it is not supported, fallback to
* 8021q add.
*/
err = br_switchdev_port_vlan_add(dev, v->vid, flags, extack);
if (err == -EOPNOTSUPP)
return vlan_vid_add(dev, br->vlan_proto, v->vid);

So it says "no, no, you need this VLAN in your life!". And we, naive as
we are, say "oh, this comes from the vlan_vid_add code path, it must be
an 8021q upper, sure, I'll take that". And we end up with that bridge
VLAN installed on our port anyway. But this time, it has the wrong flags:
if the bridge was trying to install VLAN 1 as a pvid/untagged VLAN,
failed via switchdev, retried via vlan_vid_add, we have this comment:

/* This API only allows programming tagged, non-PVID VIDs */

So what we do makes absolutely no sense.

Backtracing a bit, we see the common pattern. We allow the network stack
to think that our standalone ports are VLAN-aware, but they aren't, for
the vast majority of switches. The quirky ones should not dictate the
norm. The dsa_slave_vlan_rx_add_vid and dsa_slave_vlan_rx_kill_vid
methods exist for drivers that need the 'rx-vlan-filter: on' feature in
ethtool -k, which can be due to any of the following reasons:

1. vlan_filtering_is_global = true, and some ports are under a
VLAN-aware bridge while others are standalone, and the standalone
ports would otherwise drop VLAN-tagged traffic. This is described in
commit 061f6a505ac3 ("net: dsa: Add ndo_vlan_rx_{add, kill}_vid
implementation").

2. the ports that are under a VLAN-aware bridge should also set this
feature, for 8021q uppers having a VID not claimed by the bridge.
In this case, the driver will essentially not even know that the VID
is coming from the 8021q layer and not the bridge.

3. Hellcreek. This driver needs it because in standalone mode, it uses
unique VLANs per port to ensure separation. For separation of untagged
traffic, it uses different PVIDs for each port, and for separation of
VLAN-tagged traffic, it never accepts 8021q uppers with the same vid
on two ports.

If a driver does not fall under any of the above 3 categories, there is
no reason why it should advertise the 'rx-vlan-filter' feature, therefore
no reason why it should offload the VLANs added through vlan_vid_add.

This commit fixes the problem by removing the 'rx-vlan-filter' feature
from the slave devices when they operate in standalone mode, and when
they offload a VLAN-unaware bridge.

The way it works is that vlan_vid_add will now stop its processing here:

vlan_add_rx_filter_info:
if (!vlan_hw_filter_capable(dev, proto))
return 0;

So the VLAN will still be saved in the interface's VLAN RX filtering
list, but because it does not declare VLAN filtering in its features,
the 8021q module will return zero without committing that VLAN to
hardware.

This gives the drivers what they want, since it keeps the 8021q VLANs
away from the VLAN table until VLAN awareness is enabled (point at which
the ports are no longer standalone, hence in the mv88e6xxx case, the
check in mv88e6xxx_port_vlan_prepare passes).

Since the issue predates the existence of the hellcreek driver, case 3
will be dealt with in a separate patch.

The main change that this patch makes is to no longer set
NETIF_F_HW_VLAN_CTAG_FILTER unconditionally, but toggle it dynamically
(for most switches, never).

The second part of the patch addresses an issue that the first part
introduces: because the 'rx-vlan-filter' feature is now dynamically
toggled, and our .ndo_vlan_rx_add_vid does not get called when
'rx-vlan-filter' is off, we need to avoid bugs such as the following by
replaying the VLANs from 8021q uppers every time we enable VLAN
filtering:

ip link add link lan0 name lan0.100 type vlan id 100
ip addr add 192.168.100.1/24 dev lan0.100
ping 192.168.100.2 # should work
ip link add br0 type bridge vlan_filtering 0
ip link set lan0 master br0
ping 192.168.100.2 # should still work
ip link set br0 type bridge vlan_filtering 1
ping 192.168.100.2 # should still work but doesn't

As reported by Florian, some drivers look at ds->vlan_filtering in
their .port_vlan_add() implementation. So this patch also makes sure
that ds->vlan_filtering is committed before calling the driver. This is
the reason why it is first committed, then restored on the failure path.

Reported-by: Tobias Waldekranz <tobias@waldekranz.com>
Reported-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

67b5fb5d 23-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: properly fall back to software bridging

If the driver does not implement .port_bridge_{join,leave}, then we must
fall back to standalone operation on that port, and trigger the error
path of dsa_port_bridge_join. This sets dp->bridge_dev = NULL.

In turn, having a non-NULL dp->bridge_dev when there is no offloading
support makes the following things go wrong:

- dsa_default_offload_fwd_mark make the wrong decision in setting
skb->offload_fwd_mark. It should set skb->offload_fwd_mark = 0 for
ports that don't offload the bridge, which should instruct the bridge
to forward in software. But this does not happen, dp->bridge_dev is
incorrectly set to point to the bridge, so the bridge is told that
packets have been forwarded in hardware, which they haven't.

- switchdev objects (MDBs, VLANs) should not be offloaded by ports that
don't offload the bridge. Standalone ports should behave as packet-in,
packet-out and the bridge should not be able to manipulate the pvid of
the port, or tag stripping on egress, or ingress filtering. This
should already work fine because dsa_slave_port_obj_add has:

case SWITCHDEV_OBJ_ID_PORT_VLAN:
if (!dsa_port_offloads_bridge_port(dp, obj->orig_dev))
return -EOPNOTSUPP;

err = dsa_slave_vlan_add(dev, obj, extack);

but since dsa_port_offloads_bridge_port works based on dp->bridge_dev,
this is again sabotaging us.

All the above work in case the port has an unoffloaded LAG interface, so
this is well exercised code, we should apply it for plain unoffloaded
bridge ports too.

Reported-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

09dba21b 23-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: don't call switchdev_bridge_port_unoffload for unoffloaded bridge ports

For ports that have a NULL dp->bridge_dev, dsa_port_to_bridge_port()
also returns NULL as expected.

Issue #1 is that we are performing a NULL pointer dereference on brport_dev.

Issue #2 is that these are ports on which switchdev_bridge_port_offload
has not been called, so we should not call switchdev_bridge_port_unoffload
on them either.

Both issues are addressed by checking against a NULL brport_dev in
dsa_port_pre_bridge_leave and exiting early.

Fixes: 2f5dc00f7a3e ("net: bridge: switchdev: let drivers inform which bridge ports are offloaded")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f5a4c24e 23-Aug-2021 Lorenzo Bianconi <lorenzo@kernel.org>

mac80211: introduce individual TWT support in AP mode

Introduce TWT action frames parsing support to mac80211.
Currently just individual TWT agreement are support in AP mode.
Whenever the AP receives a TWT action frame from an associated client,
after performing sanity checks, it will notify the underlay driver with
requested parameters in order to check if they are supported and if there
is enough room for a new agreement. The driver is expected to set the
agreement result and report it to mac80211.

Drivers supporting this have two new callbacks:
- add_twt_setup (mandatory)
- twt_teardown_request (optional)

mac80211 will send an action frame reply according to the result
reported by the driver.

Tested-by: Peter Chiu <chui-hao.chiu@mediatek.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/257512f2e22ba42b9f2624942a128dd8f141de4b.1629741512.git.lorenzo@kernel.org
[use le16p_replace_bits(), minor cleanups, use (void *) casts,
fix to use ieee80211_get_he_iftype_cap() correctly]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

c233ef13 23-Aug-2021 Yonglong Li <liyonglong@chinatelecom.cn>

mptcp: remove MPTCP_ADD_ADDR_IPV6 and MPTCP_ADD_ADDR_PORT

MPTCP_ADD_ADDR_IPV6 and MPTCP_ADD_ADDR_PORT are not necessary, we can get
these info from pm.local or pm.remote.

Drop mptcp_pm_should_add_signal_ipv6 and mptcp_pm_should_add_signal_port
too.

Co-developed-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Yonglong Li <liyonglong@chinatelecom.cn>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f462a446 23-Aug-2021 Yonglong Li <liyonglong@chinatelecom.cn>

mptcp: build ADD_ADDR/echo-ADD_ADDR option according pm.add_signal

According to the MPTCP_ADD_ADDR_SIGNAL or MPTCP_ADD_ADDR_ECHO flag, build
the ADD_ADDR/ADD_ADDR_ECHO option.

In mptcp_pm_add_addr_signal(), use opts->addr to save the announced
ADD_ADDR or ADD_ADDR_ECHO address.

Co-developed-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Yonglong Li <liyonglong@chinatelecom.cn>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

119c0220 23-Aug-2021 Yonglong Li <liyonglong@chinatelecom.cn>

mptcp: fix ADD_ADDR and RM_ADDR maybe flush addr_signal each other

ADD_ADDR shares pm.addr_signal with RM_ADDR, so after RM_ADDR/ADD_ADDR
has done, we should not clean ADD_ADDR/RM_ADDR's addr_signal.

Co-developed-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Yonglong Li <liyonglong@chinatelecom.cn>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

18fc1a92 23-Aug-2021 Yonglong Li <liyonglong@chinatelecom.cn>

mptcp: make MPTCP_ADD_ADDR_SIGNAL and MPTCP_ADD_ADDR_ECHO separate

Use MPTCP_ADD_ADDR_SIGNAL only for the action of sending ADD_ADDR, and
use MPTCP_ADD_ADDR_ECHO only for the action of sending ADD_ADDR echo.

Use msk->pm.local to save the announced ADD_ADDR address only, and reuse
msk->pm.remote to save the announced ADD_ADDR_ECHO address.

To prepare for the next patch.

Co-developed-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Yonglong Li <liyonglong@chinatelecom.cn>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1f5e9e2f 23-Aug-2021 Yonglong Li <liyonglong@chinatelecom.cn>

mptcp: move drop_other_suboptions check under pm lock

This patch moved the drop_other_suboptions check from
mptcp_established_options_add_addr() into mptcp_pm_add_addr_signal(), do
it under the PM lock to avoid the race between this check and
mptcp_pm_add_addr_signal().

For this, added a new parameter for mptcp_pm_add_addr_signal() to get
the drop_other_suboptions value. And drop the other suboptions after the
option length check if drop_other_suboptions is true.

Additionally, always drop the other suboption for TCP pure ack:
that makes both the code simpler and the MPTCP behaviour more
consistent.

Co-developed-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Yonglong Li <liyonglong@chinatelecom.cn>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

faf482ca 22-Aug-2021 Yajun Deng <yajun.deng@linux.dev>

net: ipv4: Move ip_options_fragment() out of loop

The ip_options_fragment() only called when iter->offset is equal to zero,
so move it out of loop, and inline 'Copy the flags to each fragment.'
As also, remove the unused parameter in ip_frag_ipcb().

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>

6fc88c35 19-Aug-2021 Dave Marchevsky <davemarchevsky@fb.com>

bpf: Migrate cgroup_bpf to internal cgroup_bpf_attach_type enum

Add an enum (cgroup_bpf_attach_type) containing only valid cgroup_bpf
attach types and a function to map bpf_attach_type values to the new
enum. Inspired by netns_bpf_attach_type.

Then, migrate cgroup_bpf to use cgroup_bpf_attach_type wherever
possible. Functionality is unchanged as attach_type_to_prog_type
switches in bpf/syscall.c were preventing non-cgroup programs from
making use of the invalid cgroup_bpf array slots.

As a result struct cgroup_bpf uses 504 fewer bytes relative to when its
arrays were sized using MAX_BPF_ATTACH_TYPE.

bpf_cgroup_storage is notably not migrated as struct
bpf_cgroup_storage_key is part of uapi and contains a bpf_attach_type
member which is not meant to be opaque. Similarly, bpf_cgroup_link
continues to report its bpf_attach_type member to userspace via fdinfo
and bpf_link_info.

To ease disambiguation, bpf_attach_type variables are renamed from
'type' to 'atype' when changed to cgroup_bpf_attach_type.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210819092420.1984861-2-davemarchevsky@fb.com

d359902d 21-Aug-2021 Jiang Wang <jiang.wang@bytedance.com>

af_unix: Fix NULL pointer bug in unix_shutdown

Commit 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap")
introduced a bug for af_unix SEQPACKET type. In unix_shutdown, the
unhash function will call prot->unhash(), which is NULL for SEQPACKET.
And kernel will panic. On ARM32, it will show following messages: (it
likely affects x86 too).

Fix the bug by checking the prot->unhash is NULL or not first.

Kernel log:
<--- cut here ---
Unable to handle kernel NULL pointer dereference at virtual address
00000000
pgd = 2fba1ffb
*pgd=00000000
Internal error: Oops: 80000005 [#1] PREEMPT SMP THUMB2
Modules linked in:
CPU: 1 PID: 1999 Comm: falkon Tainted: G W
5.14.0-rc5-01175-g94531cfcbe79-dirty #9240
Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
PC is at 0x0
LR is at unix_shutdown+0x81/0x1a8
pc : [<00000000>] lr : [<c08f3311>] psr: 600f0013
sp : e45aff70 ip : e463a3c0 fp : beb54f04
r10: 00000125 r9 : e45ae000 r8 : c4a56664
r7 : 00000001 r6 : c4a56464 r5 : 00000001 r4 : c4a56400
r3 : 00000000 r2 : c5a6b180 r1 : 00000000 r0 : c4a56400
Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
Control: 50c5387d Table: 05aa804a DAC: 00000051
Register r0 information: slab PING start c4a56400 pointer offset 0
Register r1 information: NULL pointer
Register r2 information: slab task_struct start c5a6b180 pointer offset 0
Register r3 information: NULL pointer
Register r4 information: slab PING start c4a56400 pointer offset 0
Register r5 information: non-paged memory
Register r6 information: slab PING start c4a56400 pointer offset 100
Register r7 information: non-paged memory
Register r8 information: slab PING start c4a56400 pointer offset 612
Register r9 information: non-slab/vmalloc memory
Register r10 information: non-paged memory
Register r11 information: non-paged memory
Register r12 information: slab filp start e463a3c0 pointer offset 0
Process falkon (pid: 1999, stack limit = 0x9ec48895)
Stack: (0xe45aff70 to 0xe45b0000)
ff60: e45ae000 c5f26a00 00000000 00000125
ff80: c0100264 c07f7fa3 beb54f04 fffffff7 00000001 e6f3fc0e b5e5e9ec beb54ec4
ffa0: b5da0ccc c010024b b5e5e9ec beb54ec4 0000000f 00000000 00000000 beb54ebc
ffc0: b5e5e9ec beb54ec4 b5da0ccc 00000125 beb54f58 00785238 beb5529c beb54f04
ffe0: b5da1e24 beb54eac b301385c b62b6ee8 600f0030 0000000f 00000000 00000000
[<c08f3311>] (unix_shutdown) from [<c07f7fa3>] (__sys_shutdown+0x2f/0x50)
[<c07f7fa3>] (__sys_shutdown) from [<c010024b>]
(__sys_trace_return+0x1/0x16)
Exception stack(0xe45affa8 to 0xe45afff0)

Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap")
Reported-by: Dmitry Osipenko <digetx@gmail.com>
Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Dmitry Osipenko <digetx@gmail.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Link: https://lore.kernel.org/bpf/20210821180738.1151155-1-jiang.wang@bytedance.com

f5e165e7 19-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: track unique bridge numbers across all DSA switch trees

Right now, cross-tree bridging setups work somewhat by mistake.

In the case of cross-tree bridging with sja1105, all switch instances
need to agree upon a common VLAN ID for forwarding a packet that belongs
to a certain bridging domain.

With TX forwarding offload, the VLAN ID is the bridge VLAN for
VLAN-aware bridging, and the tag_8021q TX forwarding offload VID
(a VLAN which has non-zero VBID bits) for VLAN-unaware bridging.

The VBID for VLAN-unaware bridging is derived from the dp->bridge_num
value calculated by DSA independently for each switch tree.

If ports from one tree join one bridge, and ports from another tree join
another bridge, DSA will assign them the same bridge_num, even though
the bridges are different. If cross-tree bridging is supported, this
is an issue.

Modify DSA to calculate the bridge_num globally across all switch trees.
This has the implication for a driver that the dp->bridge_num value that
DSA will assign to its ports might not be contiguous, if there are
boards with multiple DSA drivers instantiated. Additionally, all
bridge_num values eat up towards each switch's
ds->num_fwd_offloading_bridges maximum, which is potentially unfortunate,
and can be seen as a limitation introduced by this patch. However, that
is the lesser evil for now.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9cf448c2 20-Aug-2021 Shreyansh Chouhan <chouhan.shreyansh630@gmail.com>

ip6_gre: add validation for csum_start

Validate csum_start in gre_handle_offloads before we call _gre_xmit so
that we do not crash later when the csum_start value is used in the
lco_csum function call.

This patch deals with ipv6 code.

Fixes: Fixes: b05229f44228 ("gre6: Cleanup GREv6 transmit path, call common
GRE functions")
Reported-by: syzbot+ff8e1b9f2f36481e2efc@syzkaller.appspotmail.com
Signed-off-by: Shreyansh Chouhan <chouhan.shreyansh630@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1d011c48 20-Aug-2021 Shreyansh Chouhan <chouhan.shreyansh630@gmail.com>

ip_gre: add validation for csum_start

Validate csum_start in gre_handle_offloads before we call _gre_xmit so
that we do not crash later when the csum_start value is used in the
lco_csum function call.

This patch deals with ipv4 code.

Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.")
Reported-by: syzbot+ff8e1b9f2f36481e2efc@syzkaller.appspotmail.com
Signed-off-by: Shreyansh Chouhan <chouhan.shreyansh630@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3a126180 03-Aug-2021 Chuck Lever <chuck.lever@oracle.com>

SUNRPC: Server-side disconnect injection

Disconnect injection stress-tests the ability for both client and
server implementations to behave resiliently in the face of network
instability.

A file called /sys/kernel/debug/fail_sunrpc/ignore-server-disconnect
enables administrators to turn off server-side disconnect injection
while allowing other types of sunrpc errors to be injected. The
default setting is that server-side disconnect injection is enabled
(ignore=false).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

a4ae3081 05-Aug-2021 Chuck Lever <chuck.lever@oracle.com>

SUNRPC: Move client-side disconnect injection

Disconnect injection stress-tests the ability for both client and
server implementations to behave resiliently in the face of network
instability.

Convert the existing client-side disconnect injection infrastructure
to use the kernel's generic error injection facility. The generic
facility has a richer set of injection criteria.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

c782af25 03-Aug-2021 Chuck Lever <chuck.lever@oracle.com>

SUNRPC: Add a /sys/kernel/debug/fail_sunrpc/ directory

This directory will contain a set of administrative controls for
enabling error injection for kernel RPC consumers.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

4af14dba 20-Aug-2021 Jakub Kicinski <kuba@kernel.org>

Merge tag 'mac80211-next-for-net-next-2021-08-20' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Minor updates:
* BSS coloring support
* MEI commands for Intel platforms
* various fixes/cleanups

* tag 'mac80211-next-for-net-next-2021-08-20' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next:
cfg80211: fix BSS color notify trace enum confusion
mac80211: Fix insufficient headroom issue for AMSDU
mac80211: add support for BSS color change
nl80211: add support for BSS coloring
mac80211: Use flex-array for radiotap header bitmap
mac80211: radiotap: Use BIT() instead of shifts
mac80211: Remove unnecessary variable and label
mac80211: include <linux/rbtree.h>
mac80211: Fix monitor MTU limit so that A-MSDUs get through
mac80211: remove unnecessary NULL check in ieee80211_register_hw()
mac80211: Reject zero MAC address in sta_info_insert_check()
nl80211: vendor-cmd: add Intel vendor commands for iwlmei usage
====================

Link: https://lore.kernel.org/r/20210820105329.48674-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


2796d846 20-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: vlan: convert mcast router global option to per-vlan entry

The per-vlan router option controls the port/vlan and host vlan entries'
mcast router config. The global option controlled only the host vlan
config, but that is unnecessary and incosistent as it's not really a
global vlan option, but rather bridge option to control host router
config, so convert BRIDGE_VLANDB_GOPTS_MCAST_ROUTER to
BRIDGE_VLANDB_ENTRY_MCAST_ROUTER which can be used to control both host
vlan and port vlan mcast router config.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a53581d5 20-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: mcast: br_multicast_set_port_router takes multicast context as argument

Change br_multicast_set_port_router to take port multicast context as
its first argument so we can later use it to control port/vlan mcast
router option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7e78c597 19-Aug-2021 Xiaolong Huang <butterflyhuangxx@gmail.com>

net: qrtr: fix another OOB Read in qrtr_endpoint_post

This check was incomplete, did not consider size is 0:

if (len != ALIGN(size, 4) + hdrlen)
goto err;

if size from qrtr_hdr is 0, the result of ALIGN(size, 4)
will be 0, In case of len == hdrlen and size == 0
in header this check won't fail and

if (cb->type == QRTR_TYPE_NEW_SERVER) {
/* Remote node endpoint can bridge other distant nodes */
const struct qrtr_ctrl_pkt *pkt = data + hdrlen;

qrtr_node_assign(node, le32_to_cpu(pkt->server.node));
}

will also read out of bound from data, which is hdrlen allocated block.

Fixes: 194ccc88297a ("net: qrtr: Support decoding incoming v2 packets")
Fixes: ad9d24c9429e ("net: qrtr: fix OOB Read in qrtr_endpoint_post")
Signed-off-by: Xiaolong Huang <butterflyhuangxx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e61fbee7 19-Aug-2021 David S. Miller <davem@davemloft.net>

Merge tag 'for-net-next-2021-08-19' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Luiz Augusto von Dentz says:

====================
bluetooth-next pull request for net-next:

- Add support for Foxconn Mediatek Chip
- Add support for LG LGSBWAC92/TWCM-K505D
- hci_h5 flow control fixes and suspend support
- Switch to use lock_sock for SCO and RFCOMM
- Various fixes for extended advertising
- Reword Intel's setup on btusb unifying the supported generations
====================

Signed-off-by: David S. Miller <davem@davemloft.net>


815cc21d 19-Aug-2021 David S. Miller <davem@davemloft.net>

Merge tag 'batadv-next-pullrequest-20210819' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This cleanup patchset includes the following patches:

- bump version strings, by Simon Wunderlich

- update docs about move IRC channel away from freenode,
by Sven Eckelmann

- Switch to kstrtox.h for kstrtou64, by Sven Eckelmann

- Update NULL checks, by Sven Eckelmann (2 patches)

- remove remaining skb-copy calls for broadcast packets,
by Linus Lüssing
====================

Signed-off-by: David S. Miller <davem@davemloft.net>


f444fea7 19-Aug-2021 Jakub Kicinski <kuba@kernel.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

drivers/ptp/Kconfig:
55c8fca1dae1 ("ptp_pch: Restore dependency on PCI")
e5f31552674e ("ethernet: fix PTP_1588_CLOCK dependencies")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>


61969ef8 19-Aug-2021 Kangmin Park <l4stpr0gr4m@gmail.com>

Bluetooth: Fix return value in hci_dev_do_close()

hci_error_reset() return without calling hci_dev_do_open() when
hci_dev_do_close() return error value which is not 0.

Also, hci_dev_close() return hci_dev_do_close() function's return
value.

But, hci_dev_do_close() return always 0 even if hdev->shutdown
return error value. So, fix hci_dev_do_close() to save and return
the return value of the hdev->shutdown when it is called.

Signed-off-by: Kangmin Park <l4stpr0gr4m@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

f41a4b2b 19-Aug-2021 Pavel Skripkin <paskripkin@gmail.com>

Bluetooth: add timeout sanity check to hci_inquiry

Syzbot hit "task hung" bug in hci_req_sync(). The problem was in
unreasonable huge inquiry timeout passed from userspace.
Fix it by adding sanity check for timeout value to hci_inquiry().

Since hci_inquiry() is the only user of hci_req_sync() with user
controlled timeout value, it makes sense to check timeout value in
hci_inquiry() and don't touch hci_req_sync().

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-and-tested-by: syzbot+be2baed593ea56c6a84c@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

a31e5a41 17-Aug-2021 Kees Cook <keescook@chromium.org>

Bluetooth: mgmt: Pessimize compile-time bounds-check

After gaining __alloc_size hints, GCC thinks it can reach a memcpy()
with eir_len == 0 (since it can't see into the rewrite of status).
Instead, check eir_len == 0, avoiding this future warning:

In function 'eir_append_data',
inlined from 'read_local_oob_ext_data_complete' at net/bluetooth/mgmt.c:7210:12:
./include/linux/fortify-string.h:54:29: warning: '__builtin_memcpy' offset 5 is out of the bounds [0, 3] [-Warray-bounds]
...
net/bluetooth/hci_request.h:133:2: note: in expansion of macro 'memcpy'
133 | memcpy(&eir[eir_len], data, data_len);
| ^~~~~~

Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: linux-bluetooth@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

729580dd 18-Aug-2021 Chuck Lever <chuck.lever@oracle.com>

svcrdma: xpt_bc_xprt is already clear in __svc_rdma_free()

svc_xprt_free() already "puts" the bc_xprt before calling the
transport's "free" method. No need to do it twice.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

74fc4f82 17-Aug-2021 Eli Cohen <elic@nvidia.com>

net: Fix offloading indirect devices dependency on qdisc order creation

Currently, when creating an ingress qdisc on an indirect device before
the driver registered for callbacks, the driver will not have a chance
to register its filter configuration callbacks.

To fix that, modify the code such that it keeps track of all the ingress
qdiscs that call flow_indr_dev_setup_offload(). When a driver calls
flow_indr_dev_register(), go through the list of tracked ingress qdiscs
and call the driver callback entry point so as to give it a chance to
register its callback.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c1c5cb3a 17-Aug-2021 Eli Cohen <elic@nvidia.com>

net/core: Remove unused field from struct flow_indr_dev

rcu field is not used. Remove it.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

67b12f79 18-Aug-2021 Matthieu Baerts <matthieu.baerts@tessares.net>

mptcp: full fully established support after ADD_ADDR

If directly after an MP_CAPABLE 3WHS, the client receives an ADD_ADDR
with HMAC from the server, it is enough to switch to a "fully
established" mode because it has received more MPTCP options.

It was then OK to enable the "fully_established" flag on the MPTCP
socket. Still, best to check if the ADD_ADDR looks valid by looking if
it contains an HMAC (no 'echo' bit). If an ADD_ADDR echo is received
while we are not in "fully established" mode, it is strange and then
we should not switch to this mode now.

But that is not enough. On one hand, the path-manager has be notified
the state has changed. On the other hand, the "fully_established" flag
on the subflow socket should be turned on as well not to re-send the
MP_CAPABLE 3rd ACK content with the next ACK.

Fixes: 84dfe3677a6f ("mptcp: send out dedicated ADD_ADDR packet")
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a0eea5f1 18-Aug-2021 Paolo Abeni <pabeni@redhat.com>

mptcp: fix memory leak on address flush

The endpoint cleanup path is prone to a memory leak, as reported
by syzkaller:

BUG: memory leak
unreferenced object 0xffff88810680ea00 (size 64):
comm "syz-executor.6", pid 6191, jiffies 4295756280 (age 24.138s)
hex dump (first 32 bytes):
58 75 7d 3c 80 88 ff ff 22 01 00 00 00 00 ad de Xu}<....".......
01 00 02 00 00 00 00 00 ac 1e 00 07 00 00 00 00 ................
backtrace:
[<0000000072a9f72a>] kmalloc include/linux/slab.h:591 [inline]
[<0000000072a9f72a>] mptcp_nl_cmd_add_addr+0x287/0x9f0 net/mptcp/pm_netlink.c:1170
[<00000000f6e931bf>] genl_family_rcv_msg_doit.isra.0+0x225/0x340 net/netlink/genetlink.c:731
[<00000000f1504a2c>] genl_family_rcv_msg net/netlink/genetlink.c:775 [inline]
[<00000000f1504a2c>] genl_rcv_msg+0x341/0x5b0 net/netlink/genetlink.c:792
[<0000000097e76f6a>] netlink_rcv_skb+0x148/0x430 net/netlink/af_netlink.c:2504
[<00000000ceefa2b8>] genl_rcv+0x24/0x40 net/netlink/genetlink.c:803
[<000000008ff91aec>] netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
[<000000008ff91aec>] netlink_unicast+0x537/0x750 net/netlink/af_netlink.c:1340
[<0000000041682c35>] netlink_sendmsg+0x846/0xd80 net/netlink/af_netlink.c:1929
[<00000000df3aa8e7>] sock_sendmsg_nosec net/socket.c:704 [inline]
[<00000000df3aa8e7>] sock_sendmsg+0x14e/0x190 net/socket.c:724
[<000000002154c54c>] ____sys_sendmsg+0x709/0x870 net/socket.c:2403
[<000000001aab01d7>] ___sys_sendmsg+0xff/0x170 net/socket.c:2457
[<00000000fa3b1446>] __sys_sendmsg+0xe5/0x1b0 net/socket.c:2486
[<00000000db2ee9c7>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<00000000db2ee9c7>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
[<000000005873517d>] entry_SYSCALL_64_after_hwframe+0x44/0xae

We should not require an allocation to cleanup stuff.

Rework the code a bit so that the additional RCU work is no more needed.

Fixes: 1729cf186d8a ("mptcp: create the listening socket for new port")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c0891ac1 02-Aug-2021 Alexey Dobriyan <adobriyan@gmail.com>

isystem: ship and use stdarg.h

Ship minimal stdarg.h (1 type, 4 macros) as <linux/stdarg.h>.
stdarg.h is the only userspace header commonly used in the kernel.

GPL 2 version of <stdarg.h> can be extracted from
http://archive.debian.org/debian/pool/main/g/gcc-4.2/gcc-4.2_4.2.4.orig.tar.gz

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>

fb4b1373 17-Aug-2021 Gerd Rausch <gerd.rausch@oracle.com>

net/rds: dma_map_sg is entitled to merge entries

Function "dma_map_sg" is entitled to merge adjacent entries
and return a value smaller than what was passed as "nents".

Subsequently "ib_map_mr_sg" needs to work with this value ("sg_dma_len")
rather than the original "nents" parameter ("sg_len").

This old RDS bug was exposed and reliably causes kernel panics
(using RDMA operations "rds-stress -D") on x86_64 starting with:
commit c588072bba6b ("iommu/vt-d: Convert intel iommu driver to the iommu ops")

Simply put: Linux 5.11 and later.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Link: https://lore.kernel.org/r/60efc69f-1f35-529d-a7ef-da0549cad143@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

6cf1770d 18-Aug-2021 Xu Liu <liuxu623@gmail.com>

bpf: Allow bpf_get_netns_cookie in BPF_PROG_TYPE_SOCK_OPS

We'd like to be able to identify netns from sockops hooks to
accelerate local process communication form different netns.

Signed-off-by: Xu Liu <liuxu623@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210818105820.91894-2-liuxu623@gmail.com

808cfdfa 16-May-2021 Linus Lüssing <linus.luessing@c0d3.blue>

batman-adv: bcast: remove remaining skb-copy calls

We currently have two code paths for broadcast packets:

A) self-generated, via batadv_interface_tx()->
batadv_send_bcast_packet().
B) received/forwarded, via batadv_recv_bcast_packet()->
batadv_forw_bcast_packet().

For A), self-generated broadcast packets:

The only modifications to the skb data is the ethernet header which is
added/pushed to the skb in
batadv_send_broadcast_skb()->batadv_send_skb_packet(). However before
doing so, batadv_skb_head_push() is called which calls skb_cow_head() to
unshare the space for the to be pushed ethernet header. So for this
case, it is safe to use skb clones.

For B), received/forwarded packets:

The same applies as in A) for the to be forwarded packets. Only the
ethernet header is added. However after (queueing for) forwarding the
packet in batadv_recv_bcast_packet()->batadv_forw_bcast_packet(), a
packet is additionally decapsulated and is sent up the stack through
batadv_recv_bcast_packet()->batadv_interface_rx().

Protocols higher up the stack are already required to check if the
packet is shared and create a copy for further modifications. When the
next (protocol) layer works correctly, it cannot happen that it tries to
operate on the data behind the skb clone which is still queued up for
forwarding.

Co-authored-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>

7e5a3ef6 17-Aug-2021 Nick Richardson <richardsonnick@google.com>

pktgen: Remove fill_imix_distribution() CONFIG_XFRM dependency

Currently, the declaration of fill_imix_distribution() is dependent
on CONFIG_XFRM. This is incorrect.

Move fill_imix_distribution() declaration out of #ifndef CONFIG_XFRM
block.

Signed-off-by: Nick Richardson <richardsonnick@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4b1327be 17-Aug-2021 Wei Wang <weiwan@google.com>

net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()

Add gfp_t mask as an input parameter to mem_cgroup_charge_skmem(),
to give more control to the networking stack and enable it to change
memcg charging behavior. In the future, the networking stack may decide
to avoid oom-kills when fallbacks are more appropriate.

One behavior change in mem_cgroup_charge_skmem() by this patch is to
avoid force charging by default and let the caller decide when and if
force charging is needed through the presence or absence of
__GFP_NOFAIL.

Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

01634047 17-Aug-2021 kaixi.fan <fankaixi.li@bytedance.com>

ovs: clear skb->tstamp in forwarding path

fq qdisc requires tstamp to be cleared in the forwarding path. Now ovs
doesn't clear skb->tstamp. We encountered a problem with linux
version 5.4.56 and ovs version 2.14.1, and packets failed to
dequeue from qdisc when fq qdisc was attached to ovs port.

Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
Signed-off-by: kaixi.fan <fankaixi.li@bytedance.com>
Signed-off-by: xiexiaohui <xiexiaohui.xxh@bytedance.com>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

41467d2f 17-Aug-2021 Yajun Deng <yajun.deng@linux.dev>

net: net_namespace: Optimize the code

There is only one caller for ops_free(), so inline it.
Separate net_drop_ns() and net_free(), so the net_free()
can be called directly.
Add free_exit_list() helper function for free net_exit_list.

====================
v2:
- v1 does not apply, rebase it.
====================

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>

994d2cbb 17-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: tag_sja1105: be dsa_loop-safe

Add support for tag_sja1105 running on non-sja1105 DSA ports, by making
sure that every time we dereference dp->priv, we check the switch's
dsa_switch_ops (otherwise we access a struct sja1105_port structure that
is in fact something else).

This adds an unconditional build-time dependency between sja1105 being
built as module => tag_sja1105 must also be built as module. This was
there only for PTP before.

Some sane defaults must also take place when not running on sja1105
hardware. These are:

- sja1105_xmit_tpid: the sja1105 driver uses different VLAN protocols
depending on VLAN awareness and switch revision (when an encapsulated
VLAN must be sent). Default to 0x8100.

- sja1105_rcv_meta_state_machine: this aggregates PTP frames with their
metadata timestamp frames. When running on non-sja1105 hardware, don't
do that and accept all frames unmodified.

- sja1105_defer_xmit: calls sja1105_port_deferred_xmit in sja1105_main.c
which writes a management route over SPI. When not running on sja1105
hardware, bypass the SPI write and send the frame as-is.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

86b9bbd3 16-Aug-2021 Toke Høiland-Jørgensen <toke@redhat.com>

sch_cake: fix srchost/dsthost hashing mode

When adding support for using the skb->hash value as the flow hash in CAKE,
I accidentally introduced a logic error that broke the host-only isolation
modes of CAKE (srchost and dsthost keywords). Specifically, the flow_hash
variable should stay initialised to 0 in cake_hash() in pure host-based
hashing mode. Add a check for this before using the skb->hash value as
flow_hash.

Fixes: b0c19ed6088a ("sch_cake: Take advantage of skb->hash where appropriate")
Reported-by: Pete Heist <pete@heistp.net>
Tested-by: Pete Heist <pete@heistp.net>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ec18e845 16-Aug-2021 Yajun Deng <yajun.deng@linux.dev>

net: procfs: add seq_puts() statement for dev_mcast

Add seq_puts() statement for dev_mcast, make it more readable.
As also, keep vertical alignment for {dev, ptype, dev_mcast} that
under /proc/net.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>

95d5e675 15-Aug-2021 Randy Dunlap <rdunlap@infradead.org>

net: RxRPC: make dependent Kconfig symbols be shown indented

Make all dependent RxRPC kconfig entries be dependent on AF_RXRPC
so that they are presented (indented) after AF_RXRPC instead
of being presented at the same level on indentation.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: linux-afs@lists.infradead.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>

1a0d6136 17-Aug-2021 Geliang Tang <geliangtang@xiaomi.com>

mptcp: local addresses fullmesh

In mptcp_pm_nl_add_addr_received(), fill a temporary allocate array of
all local address corresponding to the fullmesh endpoint. If such array
is empty, keep the current behavior.

Elsewhere loop on such array and create a subflow for each local address
towards the given remote address

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@xiaomi.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2843ff6f 17-Aug-2021 Geliang Tang <geliangtang@xiaomi.com>

mptcp: remote addresses fullmesh

This patch added and managed a new per endpoint flag, named
MPTCP_PM_ADDR_FLAG_FULLMESH.

In mptcp_pm_create_subflow_or_signal_addr(), if such flag is set, instead
of:
remote_address((struct sock_common *)sk, &remote);
fill a temporary allocated array of all known remote address. After
releaseing the pm lock loop on such array and create a subflow for each
remote address from the given local.

Note that the we could still use an array even for non 'fullmesh'
endpoint: with a single entry corresponding to the primary MPC subflow
remote address.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@xiaomi.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ee285257 17-Aug-2021 Geliang Tang <geliangtang@xiaomi.com>

mptcp: drop flags and ifindex arguments

This patch added a new helper mptcp_pm_get_flags_and_ifindex_by_id(),
and used it in __mptcp_subflow_connect() to get the flags and ifindex
values.

Then the two arguments flags and ifindex of __mptcp_subflow_connect()
can be dropped.

Signed-off-by: Geliang Tang <geliangtang@xiaomi.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c448f0fd 18-Aug-2021 Johannes Berg <johannes.berg@intel.com>

cfg80211: fix BSS color notify trace enum confusion

The wrong enum was used here, leading to warnings.
Just use a u32 instead.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: 0d2ab3aea50b ("nl80211: add support for BSS coloring")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

5a475344 12-Aug-2021 J. Bruce Fields <bfields@redhat.com>

rpc: fix gss_svc_init cleanup on failure

The failure case here should be rare, but it's obviously wrong.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

5c117207 05-Aug-2021 Chuck Lever <chuck.lever@oracle.com>

SUNRPC: Fix a NULL pointer deref in trace_svc_stats_latency()

Some paths through svc_process() leave rqst->rq_procinfo set to
NULL, which triggers a crash if tracing happens to be enabled.

Fixes: 89ff87494c6e ("SUNRPC: Display RPC procedure names instead of proc numbers")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

07a92d00 08-Feb-2021 Chuck Lever <chuck.lever@oracle.com>

svcrdma: Convert rdma->sc_rw_ctxts to llist

Relieve contention on sc_rw_ctxt_lock by converting rdma->sc_rw_ctxts
to an llist.

The goal is to reduce the average overhead of Send completions,
because a transport's completion handlers are single-threaded on
one CPU core. This change reduces CPU utilization of each Send
completion by 2-3% on my server.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-By: Tom Talpey <tom@talpey.com>

b6c2bfea 09-Feb-2021 Chuck Lever <chuck.lever@oracle.com>

svcrdma: Relieve contention on sc_send_lock.

/proc/lock_stat indicates the the sc_send_lock is heavily
contended when the server is under load from a single client.

To address this, convert the send_ctxt free list to an llist.
Returning an item to the send_ctxt cache is now waitless, which
reduces the instruction path length in the single-threaded Send
handler (svc_rdma_wc_send).

The goal is to enable the ib_comp_wq worker to handle a higher
RPC/RDMA Send completion rate given the same CPU resources. This
change reduces CPU utilization of Send completion by 2-3% on my
server.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-By: Tom Talpey <tom@talpey.com>

6c8c84f5 07-Jul-2021 Chuck Lever <chuck.lever@oracle.com>

svcrdma: Fewer calls to wake_up() in Send completion handler

Because wake_up() takes an IRQ-safe lock, it can be expensive,
especially to call inside of a single-threaded completion handler.
What's more, the Send wait queue almost never has waiters, so
most of the time, this is an expensive no-op.

As always, the goal is to reduce the average overhead of each
completion, because a transport's completion handlers are single-
threaded on one CPU core. This change reduces CPU utilization of
the Send completion thread by 2-3% on my server.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-By: Tom Talpey <tom@talpey.com>

2f0f88f4 01-Jul-2021 Chuck Lever <chuck.lever@oracle.com>

SUNRPC: Add svc_rqst_replace_page() API

Replacing a page in rq_pages[] requires a get_page(), which is a
bus-locked operation, and a put_page(), which can be even more
costly.

To reduce the cost of replacing a page in rq_pages[], batch the
put_page() operations by collecting "freed" pages in a pagevec,
and then release those pages when the pagevec is full. This
pagevec is also emptied when each RPC completes.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

276e189f 16-Aug-2021 Johannes Berg <johannes.berg@intel.com>

mac80211: fix locking in ieee80211_restart_work()

Ilan's change to move locking around accidentally lost the
wiphy_lock() during some porting, add it back.

Fixes: 45daaa131841 ("mac80211: Properly WARN on HW scan before restart")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20210817121210.47bdb177064f.Ib1ef79440cd27f318c028ddfc0c642406917f512@changeid
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

f50d2ff8 16-Aug-2021 Chih-Kang Chang <gary.chang@realtek.com>

mac80211: Fix insufficient headroom issue for AMSDU

ieee80211_amsdu_realloc_pad() fails to account for extra_tx_headroom,
the original reserved headroom might be eaten. Add the necessary
extra_tx_headroom.

Fixes: 6e0456b54545 ("mac80211: add A-MSDU tx support")
Signed-off-by: Chih-Kang Chang <gary.chang@realtek.com>
Signed-off-by: Ping-Ke Shih <pkshih@realtek.com>
Link: https://lore.kernel.org/r/20210816085128.10931-2-pkshih@realtek.com
[fix indentation]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

5f9404ab 02-Jul-2021 John Crispin <john@phrozen.org>

mac80211: add support for BSS color change

The color change announcement is very similar to how CSA works where
we have an IE that includes a counter. When the counter hits 0, the new
color is applied via an updated beacon.

This patch makes the CSA counter functionality reusable, rather than
implementing it again. This also allows for future reuse incase support
for other counter IEs gets added.

Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/057c1e67b82bee561ea44ce6a45a8462d3da6995.1625247619.git.lorenzo@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

0d2ab3ae 02-Jul-2021 John Crispin <john@phrozen.org>

nl80211: add support for BSS coloring

This patch adds support for BSS color collisions to the wireless subsystem.
Add the required functionality to nl80211 that will notify about color
collisions, triggering the color change and notifying when it is completed.

Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/500b3582aec8fe2c42ef46f3117b148cb7cbceb5.1625247619.git.lorenzo@kernel.org
[remove unnecessary NULL initialisation]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

affce9a7 16-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: mcast: toggle also host vlan state in br_multicast_toggle_vlan

When changing vlan mcast state by br_multicast_toggle_vlan it iterates
over all ports and enables/disables the port mcast ctx based on the new
state, but I forgot to update the host vlan (bridge master vlan entry)
with the new state so it will be left out. Also that function is not
used outside of br_multicast.c, so make it static.

Fixes: f4b7002a7076 ("net: bridge: add vlan mcast snooping knob")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3f0d14ef 16-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: mcast: use the correct vlan group helper

When dereferencing the port vlan group we should use the rcu helper
instead of the one relying on rtnl. In br_multicast_pg_to_port_ctx the
entry cannot disappear as we hold the multicast lock and rcu as explained
in the comment above it.
For the same reason we're ok in br_multicast_start_querier.

=============================
WARNING: suspicious RCU usage
5.14.0-rc5+ #429 Tainted: G W
-----------------------------
net/bridge/br_private.h:1478 suspicious rcu_dereference_protected() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
3 locks held by swapper/2/0:
#0: ffff88822be85eb0 ((&p->timer)){+.-.}-{0:0}, at: call_timer_fn+0x5/0x2da
#1: ffff88810b32f260 (&br->multicast_lock){+.-.}-{3:3}, at: br_multicast_port_group_expired+0x28/0x13d [bridge]
#2: ffffffff824f6c80 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire.constprop.0+0x0/0x22 [bridge]

stack backtrace:
CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Tainted: G W 5.14.0-rc5+ #429
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/01/2014
Call Trace:
<IRQ>
dump_stack_lvl+0x45/0x59
nbp_vlan_group+0x3e/0x44 [bridge]
br_multicast_pg_to_port_ctx+0xd6/0x10d [bridge]
br_multicast_star_g_handle_mode+0xa1/0x2ce [bridge]
? netlink_broadcast+0xf/0x11
? nlmsg_notify+0x56/0x99
? br_mdb_notify+0x224/0x2e9 [bridge]
? br_multicast_del_pg+0x1dc/0x26d [bridge]
br_multicast_del_pg+0x1dc/0x26d [bridge]
br_multicast_port_group_expired+0xaa/0x13d [bridge]
? __grp_src_delete_marked.isra.0+0x35/0x35 [bridge]
? __grp_src_delete_marked.isra.0+0x35/0x35 [bridge]
call_timer_fn+0x134/0x2da
__run_timers+0x169/0x193
run_timer_softirq+0x19/0x2d
__do_softirq+0x1bc/0x42a
__irq_exit_rcu+0x5c/0xb3
irq_exit_rcu+0xa/0x12
sysvec_apic_timer_interrupt+0x5e/0x75
</IRQ>
asm_sysvec_apic_timer_interrupt+0x12/0x20
RIP: 0010:default_idle+0xc/0xd
Code: e8 14 40 71 ff e8 10 b3 ff ff 4c 89 e2 48 89 ef 31 f6 5d 41 5c e9 a9 e8 c2 ff cc cc cc cc 0f 1f 44 00 00 e8 7f 55 65 ff fb f4 <c3> 0f 1f 44 00 00 55 65 48 8b 2c 25 40 6f 01 00 53 f0 80 4d 02 20
RSP: 0018:ffff88810033bf00 EFLAGS: 00000206
RAX: ffffffff819cf828 RBX: ffff888100328000 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff819cfa2d
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
R10: ffff8881008302c0 R11: 00000000000006db R12: 0000000000000000
R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
? __sched_text_end+0x4/0x4
? default_idle_call+0x15/0x7b
default_idle_call+0x4d/0x7b
do_idle+0x124/0x2a2
cpu_startup_entry+0x1d/0x1f
secondary_startup_64_no_verify+0xb0/0xbb

Fixes: 74edfd483de8 ("net: bridge: multicast: add helper to get port mcast context from port group")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

05d6f38e 16-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: vlan: account for router port lists when notifying

When sending a global vlan notification we should account for the number
of router ports when allocating the skb, otherwise we might end up
losing notifications.

Fixes: dc002875c22b ("net: bridge: vlan: use br_rports_fill_info() to export mcast router ports")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b92dace3 16-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: vlan: enable mcast snooping for existing master vlans

We always create a vlan with enabled mcast snooping, so when the user
turns on per-vlan mcast contexts they'll get consistent behaviour with
the current situation, but one place wasn't updated when a bridge/master
vlan which already exists (created due to port vlans) is being added as
real bridge vlan (BRIDGE_VLAN_INFO_BRENTRY). We need to enable mcast
snooping for that vlan when that happens.

Fixes: 7b54aaaf53cb ("net: bridge: multicast: add vlan state initialization and control")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

94531cfc 16-Aug-2021 Jiang Wang <jiang.wang@bytedance.com>

af_unix: Add unix_stream_proto for sockmap

Previously, sockmap for AF_UNIX protocol only supports
dgram type. This patch add unix stream type support, which
is similar to unix_dgram_proto. To support sockmap, dgram
and stream cannot share the same unix_proto anymore, because
they have different implementations, such as unhash for stream
type (which will remove closed or disconnected sockets from the map),
so rename unix_proto to unix_dgram_proto and add a new
unix_stream_proto.

Also implement stream related sockmap functions.
And add dgram key words to those dgram specific functions.

Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210816190327.2739291-3-jiang.wang@bytedance.com

77462de1 16-Aug-2021 Jiang Wang <jiang.wang@bytedance.com>

af_unix: Add read_sock for stream socket types

To support sockmap for af_unix stream type, implement
read_sock, which is similar to the read_sock for unix
dgram sockets.

Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210816190327.2739291-2-jiang.wang@bytedance.com

e3faa49b 16-Aug-2021 Luke Hsiao <lukehsiao@google.com>

tcp: enable data-less, empty-cookie SYN with TFO_SERVER_COOKIE_NOT_REQD

Since the original TFO server code was implemented in commit
168a8f58059a22feb9e9a2dcc1b8053dbbbc12ef ("tcp: TCP Fast Open Server -
main code path") the TFO server code has supported the sysctl bit flag
TFO_SERVER_COOKIE_NOT_REQD. Currently, when the TFO_SERVER_ENABLE and
TFO_SERVER_COOKIE_NOT_REQD sysctl bit flags are set, a server connection
will accept a SYN with N bytes of data (N > 0) that has no TFO cookie,
create a new fast open connection, process the incoming data in the SYN,
and make the connection ready for accepting. After accepting, the
connection is ready for read()/recvmsg() to read the N bytes of data in
the SYN, ready for write()/sendmsg() calls and data transmissions to
transmit data.

This commit changes an edge case in this feature by changing this
behavior to apply to (N >= 0) bytes of data in the SYN rather than only
(N > 0) bytes of data in the SYN. Now, a server will accept a data-less
SYN without a TFO cookie if TFO_SERVER_COOKIE_NOT_REQD is set.

Caveat! While this enables a new kind of TFO (data-less empty-cookie
SYN), some firewall rules setup may not work if they assume such packets
are not legit TFOs and will filter them.

Signed-off-by: Luke Hsiao <lukehsiao@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210816205105.2533289-1-luke.w.hsiao@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

fb7dd8bc 15-Aug-2021 Andrii Nakryiko <andrii@kernel.org>

bpf: Refactor BPF_PROG_RUN into a function

Turn BPF_PROG_RUN into a proper always inlined function. No functional and
performance changes are intended, but it makes it much easier to understand
what's going on with how BPF programs are actually get executed. It's more
obvious what types and callbacks are expected. Also extra () around input
parameters can be dropped, as well as `__` variable prefixes intended to avoid
naming collisions, which makes the code simpler to read and write.

This refactoring also highlighted one extra issue. BPF_PROG_RUN is both
a macro and an enum value (BPF_PROG_RUN == BPF_PROG_TEST_RUN). Turning
BPF_PROG_RUN into a function causes naming conflict compilation error. So
rename BPF_PROG_RUN into lower-case bpf_prog_run(), similar to
bpf_prog_run_xdp(), bpf_prog_run_pin_on_cpu(), etc. All existing callers of
BPF_PROG_RUN, the macro, are switched to bpf_prog_run() explicitly.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210815070609.987780-2-andrii@kernel.org

ecb71f25 15-Aug-2021 Kiran K <kiran.k@intel.com>

Bluetooth: Fix race condition in handling NOP command

For NOP command, need to cancel work scheduled on cmd_timer,
on receiving command status or commmand complete event.

Below use case might lead to race condition multiple when NOP
commands are queued sequentially:

hci_cmd_work() {
if (atomic_read(&hdev->cmd_cnt) {
.
.
.
atomic_dec(&hdev->cmd_cnt);
hci_send_frame(hdev,...);
schedule_delayed_work(&hdev->cmd_timer,...);
}
}

On receiving event for first NOP, the work scheduled on hdev->cmd_timer
is not cancelled and second NOP is dequeued and sent to controller.

While waiting for an event for second NOP command, work scheduled on
cmd_timer for the first NOP can get scheduled, resulting in sending third
NOP command (sending back to back NOP commands). This might
cause issues at controller side (like memory overrun, controller going
unresponsive) resulting in hci tx timeouts, hardware errors etc.

The fix to this issue is to cancel the delayed work scheduled on
cmd_timer on receiving command status or command complete event for
NOP command (this patch handles NOP command same as any other SIG
command).

Signed-off-by: Kiran K <kiran.k@intel.com>
Reviewed-by: Chethan T N <chethan.tumkur.narayan@intel.com>
Reviewed-by: Srivatsa Ravishankar <ravishankar.srivatsa@intel.com>
Acked-by: Manish Mandlik <mmandlik@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

7087c4f6 11-Aug-2021 Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

Bluetooth: Store advertising handle so it can be re-enabled

This stores the advertising handle/instance into hci_conn so it is
accessible when re-enabling the advertising once disconnected.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

cafae4cd 11-Aug-2021 Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

Bluetooth: Fix handling of LE Enhanced Connection Complete

LE Enhanced Connection Complete contains the Local RPA used in the
connection which must be used when set otherwise there could problems
when pairing since the address used by the remote stack could be the
Local RPA:

BLUETOOTH CORE SPECIFICATION Version 5.2 | Vol 4, Part E
page 2396

'Resolvable Private Address being used by the local device for this
connection. This is only valid when the Own_Address_Type (from the
HCI_LE_Create_Connection, HCI_LE_Set_Advertising_Parameters,
HCI_LE_Set_Extended_Advertising_Parameters, or
HCI_LE_Extended_Create_Connection commands) is set to 0x02 or
0x03, and the Controller generated a resolvable private address for the
local device using a non-zero local IRK. For other Own_Address_Type
values, the Controller shall return all zeros.'

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

0ea53674 09-Aug-2021 Kai-Heng Feng <kai.heng.feng@canonical.com>

Bluetooth: Move shutdown callback before flushing tx and rx queue

Commit 0ea9fd001a14 ("Bluetooth: Shutdown controller after workqueues
are flushed or cancelled") introduced a regression that makes mtkbtsdio
driver stops working:
[ 36.593956] Bluetooth: hci0: Firmware already downloaded
[ 46.814613] Bluetooth: hci0: Execution of wmt command timed out
[ 46.814619] Bluetooth: hci0: Failed to send wmt func ctrl (-110)

The shutdown callback depends on the result of hdev->rx_work, so we
should call it before flushing rx_work:
-> btmtksdio_shutdown()
-> mtk_hci_wmt_sync()
-> __hci_cmd_send()
-> wait for BTMTKSDIO_TX_WAIT_VND_EVT gets cleared

-> btmtksdio_recv_event()
-> hci_recv_frame()
-> queue_work(hdev->workqueue, &hdev->rx_work)
-> clears BTMTKSDIO_TX_WAIT_VND_EVT

So move the shutdown callback before flushing TX/RX queue to resolve the
issue.

Reported-and-tested-by: Mattijs Korpershoek <mkorpershoek@baylibre.com>
Tested-by: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Fixes: 0ea9fd001a14 ("Bluetooth: Shutdown controller after workqueues are flushed or cancelled")
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

175e6692 16-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: mcast: account for ipv6 size when dumping querier state

We need to account for the IPv6 attributes when dumping querier state.

Fixes: 5e924fe6ccfd ("net: bridge: mcast: dump ipv6 querier state")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cdda378b 16-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: mcast: drop sizeof for nest attribute's zero size

This was a dumb error I made instead of writing nla_total_size(0)
for a nest attribute, I wrote nla_total_size(sizeof(0)).

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 606433fe3e11 ("net: bridge: mcast: dump ipv4 querier state")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f137b7d4 16-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: mcast: don't dump querier state if snooping is disabled

A minor improvement to avoid dumping mcast ctx querier state if snooping
is disabled for that context (either bridge or vlan).

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2a14c9ae 16-Jun-2021 Sagi Grimberg <sagi@grimberg.me>

params: lift param_set_uint_minmax to common code

It is a useful helper hence move it to common code so others can enjoy
it.

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

7387a72c 15-Aug-2021 Xin Long <lucien.xin@gmail.com>

tipc: call tipc_wait_for_connect only when dlen is not 0

__tipc_sendmsg() is called to send SYN packet by either tipc_sendmsg()
or tipc_connect(). The difference is in tipc_connect(), it will call
tipc_wait_for_connect() after __tipc_sendmsg() to wait until connecting
is done. So there's no need to wait in __tipc_sendmsg() for this case.

This patch is to fix it by calling tipc_wait_for_connect() only when dlen
is not 0 in __tipc_sendmsg(), which means it's called by tipc_connect().

Note this also fixes the failure in tipcutils/test/ptts/:

# ./tipcTS &
# ./tipcTC 9
(hang)

Fixes: 36239dab6da7 ("tipc: fix implicit-connect for SYN+")
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b2b89133 13-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: tag_8021q: fix notifiers broadcast when they shouldn't, and vice versa

During the development of the blamed patch, the "bool broadcast"
argument of dsa_port_tag_8021q_vlan_{add,del} was originally called
"bool local", and the meaning was the exact opposite.

Due to a rookie mistake where the patch was modified at the last minute
without retesting, the instances of dsa_port_tag_8021q_vlan_{add,del}
are called with the wrong values. During setup and teardown, cross-chip
notifiers should not be broadcast to all DSA trees, while during
bridging, they should.

Fixes: 724395f4dc95 ("net: dsa: tag_8021q: don't broadcast during setup/teardown")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

19eed721 13-Aug-2021 Rao Shoaib <Rao.Shoaib@oracle.com>

af_unix: check socket state when queuing OOB

edumazet@google.com pointed out that queue_oob
does not check socket state after acquiring
the lock. He also pointed to an incorrect usage
of kfree_skb and an unnecessary setting of skb
length. This patch addresses those issue.

Signed-off-by: Rao Shoaib <Rao.Shoaib@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2c860a43 13-Aug-2021 Kuniyuki Iwashima <kuniyu@amazon.co.jp>

bpf: af_unix: Implement BPF iterator for UNIX domain socket.

This patch implements the BPF iterator for the UNIX domain socket.

Currently, the batch optimisation introduced for the TCP iterator in the
commit 04c7820b776f ("bpf: tcp: Bpf iter batching and lock_sock") is not
used for the UNIX domain socket. It will require replacing the big lock
for the hash table with small locks for each hash list not to block other
processes.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210814015718.42704-2-kuniyu@amazon.co.jp

ddc649d1 13-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: vlan: dump mcast ctx querier state

Use the new mcast querier state dump infrastructure and export vlans'
mcast context querier state embedded in attribute
BRIDGE_VLANDB_GOPTS_MCAST_QUERIER_STATE.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

85b41082 13-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: mcast: dump ipv6 querier state

Add support for dumping global IPv6 querier state, we dump the state
only if our own querier is enabled or there has been another external
querier which has won the election. For the bridge global state we use
a new attribute IFLA_BR_MCAST_QUERIER_STATE and embed the state inside.
The structure is:
[IFLA_BR_MCAST_QUERIER_STATE]
`[BRIDGE_QUERIER_IPV6_ADDRESS] - ip address of the querier
`[BRIDGE_QUERIER_IPV6_PORT] - bridge port ifindex where the querier
was seen (set only if external querier)
`[BRIDGE_QUERIER_IPV6_OTHER_TIMER] - other querier timeout

IPv4 and IPv6 attributes are embedded at the same level of
IFLA_BR_MCAST_QUERIER_STATE. If we didn't dump anything we cancel the nest
and return.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c7fa1d9b 13-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: mcast: dump ipv4 querier state

Add support for dumping global IPv4 querier state, we dump the state
only if our own querier is enabled or there has been another external
querier which has won the election. For the bridge global state we use
a new attribute IFLA_BR_MCAST_QUERIER_STATE and embed the state inside.
The structure is:
[IFLA_BR_MCAST_QUERIER_STATE]
`[BRIDGE_QUERIER_IP_ADDRESS] - ip address of the querier
`[BRIDGE_QUERIER_IP_PORT] - bridge port ifindex where the querier was
seen (set only if external querier)
`[BRIDGE_QUERIER_IP_OTHER_TIMER] - other querier timeout

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c3fb3698 13-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: mcast: consolidate querier selection for ipv4 and ipv6

We can consolidate both functions as they share almost the same logic.
This is easier to maintain and we have a single querier update function.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

67b746f9 13-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: mcast: make sure querier port/address updates are consistent

Use a sequence counter to make sure port/address updates can be read
consistently without requiring the bridge multicast_lock. We need to
zero out the port and address when the other querier has expired and
we're about to select ourselves as querier. br_multicast_read_querier
will be used later when dumping querier state. Updates are done only
with the multicast spinlock and softirqs disabled, while reads are done
from process context and from softirqs (due to notifications).

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bb18ef8e 13-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: mcast: record querier port device ifindex instead of pointer

Currently when a querier port is detected its net_bridge_port pointer is
recorded, but it's used only for comparisons so it's fine to have stale
pointer, in order to dereference and use the port pointer a proper
accounting of its usage must be implemented adding unnecessary
complexity. To solve the problem we can just store the netdevice ifindex
instead of the port pointer and retrieve the bridge port. It is a best
effort and the device needs to be validated that is still part of that
bridge before use, but that is small price to pay for avoiding querier
reference counting for each port/vlan.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ed43fbac 13-Aug-2021 Leon Romanovsky <leon@kernel.org>

devlink: Clear whole devlink_flash_notify struct

The { 0 } doesn't clear all fields in the struct, but tells to the
compiler to set all fields to zero and doesn't touch any sub-fields
if they exists.

The {} is an empty initialiser that instructs to fully initialize whole
struct including sub-fields, which is error-prone for future
devlink_flash_notify extensions.

Fixes: 6700acc5f1fe ("devlink: collect flash notify params into a struct")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

11a861d7 13-Aug-2021 Leon Romanovsky <leon@kernel.org>

devlink: Use xarray to store devlink instances

We can use xarray instead of linearly organized linked lists for the
devlink instances. This will let us revise the locking scheme in favour
of internal xarray locking that protects database.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

437ebfd9 13-Aug-2021 Leon Romanovsky <leon@kernel.org>

devlink: Count struct devlink consumers

The struct devlink itself is protected by internal lock and doesn't
need global lock during operation. That global lock is used to protect
addition/removal new devlink instances from the global list in use by
all devlink consumers in the system.

The future conversion of linked list to be xarray will allow us to
actually delete that lock, but first we need to count all struct devlink
users.

The reference counting provides us a way to ensure that no new user
space commands success to grab devlink instance which is going to be
destroyed makes it is safe to access it without lock.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7ca973dc 13-Aug-2021 Leon Romanovsky <leon@kernel.org>

devlink: Remove check of always valid devlink pointer

Devlink objects are accessible only after they were registered and
have valid devlink_*->devlink pointers.

Remove that check and simplify respective fill functions as an outcome
of such change.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cbf6ab67 13-Aug-2021 Leon Romanovsky <leon@kernel.org>

devlink: Simplify devlink_pernet_pre_exit call

The devlink_pernet_pre_exit() will be called if net namespace exits.

That routine is relevant for devlink instances that were assigned to
that namespaces first. This assignment is possible only with the following
command: "devlink reload DEV netns ...", which already checks reload support.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0460ce22 13-Aug-2021 Paolo Abeni <pabeni@redhat.com>

mptcp: backup flag from incoming MPJ ack option

the parsed incoming backup flag is not propagated
to the subflow itself, the client may end-up using it
to send data.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/191
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fc1b4e3b 13-Aug-2021 Paolo Abeni <pabeni@redhat.com>

mptcp: add mibs for stale subflows processing

This allows monitoring exceptional events like
active backup scenarios.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ff5a0b42 13-Aug-2021 Paolo Abeni <pabeni@redhat.com>

mptcp: faster active backup recovery

The msk can use backup subflows to transmit in-sequence data
only if there are no other active subflow. On active backup
scenario, the MPTCP connection can do forward progress only
due to MPTCP retransmissions - rtx can pick backup subflows.

This patch introduces a new flag flow MPTCP subflows: if the
underlying TCP connection made no progresses for long time,
and there are other less problematic subflows available, the
given subflow become stale.

Stale subflows are not considered active: if all non backup
subflows become stale, the MPTCP scheduler can pick backup
subflows for plain transmissions.

Stale subflows can return in active state, as soon as any reply
from the peer is observed.

Active backup scenarios can now leverage the available b/w
with no restrinction.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/207
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6da14d74 13-Aug-2021 Paolo Abeni <pabeni@redhat.com>

mptcp: cleanup sysctl data and helpers

Reorder the data in mptcp_pernet to avoid wasting space
with no reasons and constify the access helpers.

No functional changes intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1e1d9d6f 13-Aug-2021 Paolo Abeni <pabeni@redhat.com>

mptcp: handle pending data on closed subflow

The PM can close active subflow, e.g. due to ingress RM_ADDR
option. Such subflow could carry data still unacked at the
MPTCP-level, both in the write and the rtx_queue, which has
never reached the other peer.

Currently the mptcp-level retransmission will deliver such data,
but at a very low rate (at most 1 DSM for each MPTCP rtx interval).

We can speed-up the recovery a lot, moving all the unacked in the
tcp write_queue, so that it will be pushed again via other
subflows, at the speed allowed by them.

Also make available the new helper for later patches.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/207
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

71b7dec2 13-Aug-2021 Paolo Abeni <pabeni@redhat.com>

mptcp: less aggressive retransmission strategy

The current mptcp re-inject strategy is very aggressive,
we have mptcp-level retransmissions even on single subflow
connection, if the link in-use is lossy.

Let's be a little more conservative: we do retransmit
only if at least a subflow has write and rtx queue empty.

Additionally use the backup subflows only if the active
subflows are stale - no progresses in at least an rtx period
and ignore stale subflows for rtx timeout update

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/207
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

33d41c9c 13-Aug-2021 Paolo Abeni <pabeni@redhat.com>

mptcp: more accurate timeout

As reported by Maxim, we have a lot of MPTCP-level
retransmissions when multilple links with different latencies
are in use.

This patch refactor the mptcp-level timeout accounting so that
the maximum of all the active subflow timeout is used. To avoid
traversing the subflow list multiple times, the update is
performed inside the packet scheduler.

Additionally clean-up a bit timeout handling.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d8d9ba8d 12-Aug-2021 Lukas Bulwahn <lukas.bulwahn@gmail.com>

net: 802: remove dead leftover after ipx driver removal

Commit 7a2e838d28cf ("staging: ipx: delete it from the tree") removes the
ipx driver and the config IPX. Since then, there is some dead leftover in
./net/802/, that was once used by the IPX driver, but has no other user.

Remove this dead leftover.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

afa79d08 13-Aug-2021 Changbin Du <changbin.du@intel.com>

net: in_irq() cleanup

Replace the obsolete and ambiguos macro in_irq() with new
macro in_hardirq().

Signed-off-by: Changbin Du <changbin.du@gmail.com>
Link: https://lore.kernel.org/r/20210813145749.86512-1-changbin.du@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

876c14ad 11-Aug-2021 Rao Shoaib <rao.shoaib@oracle.com>

af_unix: fix holding spinlock in oob handling

syzkaller found that OOB code was holding spinlock
while calling a function in which it could sleep.

Reported-by: syzbot+8760ca6c1ee783ac4abd@syzkaller.appspotmail.com
Fixes: 314001f0bf92 ("af_unix: Add OOB support")
Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com>
Link: https://lore.kernel.org/r/20210811220652.567434-1-Rao.Shoaib@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

f4083a75 13-Aug-2021 Jakub Kicinski <kuba@kernel.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Conflicts:

drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.h
9e26680733d5 ("bnxt_en: Update firmware call to retrieve TX PTP timestamp")
9e518f25802c ("bnxt_en: 1PPS functions to configure TSIO pins")
099fdeda659d ("bnxt_en: Event handler for PPS events")

kernel/bpf/helpers.c
include/linux/bpf-cgroup.h
a2baf4e8bb0f ("bpf: Fix potentially incorrect results with bpf_get_local_storage()")
c7603cfa04e7 ("bpf: Add ambient BPF runtime context stored in current")

drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
5957cc557dc5 ("net/mlx5: Set all field of mlx5_irq before inserting it to the xarray")
2d0b41a37679 ("net/mlx5: Refcount mlx5_irq with integer")

MAINTAINERS
7b637cd52f02 ("MAINTAINERS: fix Microchip CAN BUS Analyzer Tool entry typo")
7d901a1e878a ("net: phy: add Maxlinear GPY115/21x/24x driver")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>


8c89f7b3 06-Aug-2021 Kees Cook <keescook@chromium.org>

mac80211: Use flex-array for radiotap header bitmap

In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memcpy(), memmove(), and memset(), avoid
intentionally writing across neighboring fields.

The it_present member of struct ieee80211_radiotap_header is treated as a
flexible array (multiple u32s can be conditionally present). In order for
memcpy() to reason (or really, not reason) about the size of operations
against this struct, use of bytes beyond it_present need to be treated
as part of the flexible array. Add a trailing flexible array and
initialize its initial index via pointer arithmetic.

Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210806215305.2875621-1-keescook@chromium.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

5cafd378 06-Aug-2021 Kees Cook <keescook@chromium.org>

mac80211: radiotap: Use BIT() instead of shifts

IEEE80211_RADIOTAP_EXT has a value of 31, which means if shift was ever
cast to 64-bit, the result would become sign-extended. As a matter of
robustness, just replace all the open-coded shifts with BIT().

Suggested-by: David Sterba <dsterba@suse.cz>
Link: https://lore.kernel.org/lkml/20210728092323.GW5047@twin.jikos.cz/
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210806215112.2874773-1-keescook@chromium.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

0323689d 05-Aug-2021 dingsenjie <dingsenjie@yulong.com>

mac80211: Remove unnecessary variable and label

The variable ret and label just used as return, so we delete it and
use the return statement instead of the goto statement.

Signed-off-by: dingsenjie <dingsenjie@yulong.com>
Link: https://lore.kernel.org/r/20210805064349.202148-1-dingsenjie@163.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

779969e3 15-Jul-2021 Johannes Berg <johannes.berg@intel.com>

mac80211: include <linux/rbtree.h>

This is needed for the rbtree, and we shouldn't just rely
on it getting included somewhere implicitly. Include it
explicitly.

Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20210715180234.512d64dee655.Ia51c29a9fb1e651e06bc00eabec90974103d333e@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

79f5962b 28-Jun-2021 Johan Almbladh <johan.almbladh@anyfinetworks.com>

mac80211: Fix monitor MTU limit so that A-MSDUs get through

The maximum MTU was set to 2304, which is the maximum MSDU size. While
this is valid for normal WLAN interfaces, it is too low for monitor
interfaces. A monitor interface may receive and inject MPDU frames, and
the maximum MPDU frame size is larger than 2304. The MPDU may also
contain an A-MSDU frame, in which case the size may be much larger than
the MTU limit. Since the maximum size of an A-MSDU depends on the PHY
mode of the transmitting STA, it is not possible to set an exact MTU
limit for a monitor interface. Now the maximum MTU for a monitor
interface is unrestricted.

Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Link: https://lore.kernel.org/r/20210628123246.2070558-1-johan.almbladh@anyfinetworks.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

4a11174d 28-Jun-2021 Dan Carpenter <dan.carpenter@oracle.com>

mac80211: remove unnecessary NULL check in ieee80211_register_hw()

The address "&sband->iftype_data[i]" points to an array at the end of
struct. It can't be NULL and so the check can be removed.

Fixes: bac2fd3d7534 ("mac80211: remove use of ieee80211_get_he_sta_cap()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/YNmgHi7Rh3SISdog@mwanda
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

deebea0a 26-Jun-2021 YueHaibing <yuehaibing@huawei.com>

mac80211: Reject zero MAC address in sta_info_insert_check()

As commit 52dba8d7d5ab ("mac80211: reject zero MAC address in add station")
said, we don't consider all-zeroes to be a valid MAC address in most places,
so also reject it here.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20210626130334.13624-1-yuehaibing@huawei.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

a9a50701 12-Aug-2021 Jakub Kicinski <kuba@kernel.org>

Merge tag 'ieee802154-for-davem-2021-08-12' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan

Stefan Schmidt says:

====================
ieee802154 for net 2021-08-12

Mostly fixes coming from bot reports. Dongliang Mu tackled some syzkaller
reports in hwsim again and Takeshi Misawa a memory leak in ieee802154 raw.

* tag 'ieee802154-for-davem-2021-08-12' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan:
net: Fix memory leak in ieee802154_raw_deliver
ieee802154: hwsim: fix GPF in hwsim_new_edge_nl
ieee802154: hwsim: fix GPF in hwsim_set_edge_lqi
====================

Link: https://lore.kernel.org/r/20210812183912.1663996-1-stefan@datenfreihafen.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


49b0b6ff 11-Aug-2021 Longpeng(Mike) <longpeng2@huawei.com>

vsock/virtio: avoid potential deadlock when vsock device remove

There's a potential deadlock case when remove the vsock device or
process the RESET event:

vsock_for_each_connected_socket:
spin_lock_bh(&vsock_table_lock) ----------- (1)
...
virtio_vsock_reset_sock:
lock_sock(sk) --------------------- (2)
...
spin_unlock_bh(&vsock_table_lock)

lock_sock() may do initiative schedule when the 'sk' is owned by
other thread at the same time, we would receivce a warning message
that "scheduling while atomic".

Even worse, if the next task (selected by the scheduler) try to
release a 'sk', it need to request vsock_table_lock and the deadlock
occur, cause the system into softlockup state.
Call trace:
queued_spin_lock_slowpath
vsock_remove_bound
vsock_remove_sock
virtio_transport_release
__vsock_release
vsock_release
__sock_release
sock_close
__fput
____fput

So we should not require sk_lock in this case, just like the behavior
in vhost_vsock or vmci.

Fixes: 0ea9e1d3a9e3 ("VSOCK: Introduce virtio_transport.ko")
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20210812053056.1699-1-longpeng2@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

724395f4 11-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: tag_8021q: don't broadcast during setup/teardown

Currently, on my board with multiple sja1105 switches in disjoint trees
described in commit f66a6a69f97a ("net: dsa: permit cross-chip bridging
between all trees in the system"), rebooting the board triggers the
following benign warnings:

[ 12.345566] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 1088 deletion: -ENOENT
[ 12.353804] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 2112 deletion: -ENOENT
[ 12.362019] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 1089 deletion: -ENOENT
[ 12.370246] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 2113 deletion: -ENOENT
[ 12.378466] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 1090 deletion: -ENOENT
[ 12.386683] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 2114 deletion: -ENOENT

Basically switch 1 calls dsa_tag_8021q_unregister, and switch 1's TX and
RX VLANs cannot be found on switch 2's CPU port.

But why would switch 2 even attempt to delete switch 1's TX and RX
tag_8021q VLANs from its CPU port? Well, because we use dsa_broadcast,
and it is supposed that it had added those VLANs in the first place
(because in dsa_port_tag_8021q_vlan_match, all CPU ports match
regardless of their tree index or switch index).

The two trees probe asynchronously, and when switch 1 probed, it called
dsa_broadcast which did not notify the tree of switch 2, because that
didn't probe yet. But during unbind, switch 2's tree _is_ probed, so it
_is_ notified of the deletion.

Before jumping to introduce a synchronization mechanism between the
probing across disjoint switch trees, let's take a step back and see
whether we _need_ to do that in the first place.

The RX and TX VLANs of switch 1 would be needed on switch 2's CPU port
only if switch 1 and 2 were part of a cross-chip bridge. And
dsa_tag_8021q_bridge_join takes care precisely of that (but if probing
was synchronous, the bridge_join would just end up bumping the VLANs'
refcount, because they are already installed by the setup path).

Since by the time the ports are bridged, all DSA trees are already set
up, and we don't need the tag_8021q VLANs of one switch installed on the
other switches during probe time, the answer is that we don't need to
fix the synchronization issue.

So make the setup and teardown code paths call dsa_port_notify, which
notifies only the local tree, and the bridge code paths call
dsa_broadcast, which let the other trees know as well.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ab97462b 11-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: print more information when a cross-chip notifier fails

Currently this error message does not say a lot:

[ 32.693498] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
[ 32.699725] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
[ 32.705931] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
[ 32.712139] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
[ 32.718347] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
[ 32.724554] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT

but in this form, it is immediately obvious (at least to me) what the
problem is, even without further looking at the code:

[ 12.345566] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 1088 deletion: -ENOENT
[ 12.353804] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 2112 deletion: -ENOENT
[ 12.362019] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 1089 deletion: -ENOENT
[ 12.370246] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 2113 deletion: -ENOENT
[ 12.378466] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 1090 deletion: -ENOENT
[ 12.386683] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 2114 deletion: -ENOENT

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

769afb3f 10-Aug-2021 Nick Richardson <richardsonnick@google.com>

pktgen: Add output for imix results

The bps for imix mode is calculated by:
sum(imix_entry.size) / time_elapsed

The actual counts of each imix_entry are displayed under the
"Current:" section of the interface output in the following format:
imix_size_counts: size_1,count_1 size_2,count_2 ... size_n,count_n

Example (count = 200000):
imix_weights: 256,1 859,3 205,2
imix_size_counts: 256,32082 859,99796 205,68122
Result: OK: 17992362(c17964678+d27684) usec, 200000 (859byte,0frags)
11115pps 47Mb/sec (47977140bps) errors: 0

Summary of changes:
Calculate bps based on imix counters when in IMIX mode.
Add output for IMIX counters.

Signed-off-by: Nick Richardson <richardsonnick@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

90149031 10-Aug-2021 Nick Richardson <richardsonnick@google.com>

pktgen: Add imix distribution bins

In order to represent the distribution of imix packet sizes, a
pre-computed data structure is used. It features 100 (IMIX_PRECISION)
"bins". Contiguous ranges of these bins represent the respective
packet size of each imix entry. This is done to avoid the overhead of
selecting the correct imix packet size based on the corresponding weights.

Example:
imix_weights 40,7 576,4 1500,1
total_weight = 7 + 4 + 1 = 12

pkt_size 40 occurs 7/total_weight = 58% of the time
pkt_size 576 occurs 4/total_weight = 33% of the time
pkt_size 1500 occurs 1/total_weight = 9% of the time

We generate a random number between 0-100 and select the corresponding
packet size based on the specified weights.
Eg. random number = 358723895 % 100 = 65
Selects the packet size corresponding to index:65 in the pre-computed
imix_distribution array.
An example of the pre-computed array is below:

The imix_distribution will look like the following:
0 -> 0 (index of imix_entry.size == 40)
1 -> 0 (index of imix_entry.size == 40)
2 -> 0 (index of imix_entry.size == 40)
[...] -> 0 (index of imix_entry.size == 40)
57 -> 0 (index of imix_entry.size == 40)
58 -> 1 (index of imix_entry.size == 576)
[...] -> 1 (index of imix_entry.size == 576)
90 -> 1 (index of imix_entry.size == 576)
91 -> 2 (index of imix_entry.size == 1500)
[...] -> 2 (index of imix_entry.size == 1500)
99 -> 2 (index of imix_entry.size == 1500)

Create and use "bin" representation of the imix distribution.

Signed-off-by: Nick Richardson <richardsonnick@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

52a62f86 10-Aug-2021 Nick Richardson <richardsonnick@google.com>

pktgen: Parse internet mix (imix) input

Adds "imix_weights" command for specifying internet mix distribution.

The command is in this format:
"imix_weights size_1,weight_1 size_2,weight_2 ... size_n,weight_n"
where the probability that packet size_i is picked is:
weight_i / (weight_1 + weight_2 + .. + weight_n)

The user may provide up to 100 imix entries (size_i,weight_i) in this
command.

The user specified imix entries will be displayed in the "Params"
section of the interface output.

Values for clone_skb > 0 is not supported in IMIX mode.

Summary of changes:
Add flag for enabling internet mix mode.
Add command (imix_weights) for internet mix input.
Return -ENOTSUPP when clone_skb > 0 in IMIX mode.
Display imix_weights in Params.
Create data structures to store imix entries and distribution.

Signed-off-by: Nick Richardson <richardsonnick@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

86704993 10-Aug-2021 Hoang Le <hoang.h.le@dektech.com.au>

Revert "tipc: Return the correct errno code"

This reverts commit 0efea3c649f0 because of:
- The returning -ENOBUF error is fine on socket buffer allocation.
- There is side effect in the calling path
tipc_node_xmit()->tipc_link_xmit() when checking error code returning.

Fixes: 0efea3c649f0 ("tipc: Return the correct errno code")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

6c4110d9 09-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: vlan: fix global vlan option range dumping

When global vlan options are equal sequentially we compress them in a
range to save space and reduce processing time. In order to have the
proper range end id we need to update range_end if the options are equal
otherwise we get ranges with the same end vlan id as the start.

Fixes: 743a53d9636a ("net: bridge: vlan: add support for dumping global vlan options")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20210810092139.11700-1-razor@blackwall.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

83f0a0b7 09-Aug-2021 Jeremy Kerr <jk@codeconstruct.com.au>

mctp: Specify route types, require rtm_type in RTM_*ROUTE messages

This change adds a 'type' attribute to routes, which can be parsed from
a RTM_NEWROUTE message. This will help to distinguish local vs. peer
routes in a future change.

This means userspace will need to set a correct rtm_type in RTM_NEWROUTE
and RTM_DELROUTE messages; we currently only accept RTN_UNICAST.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://lore.kernel.org/r/20210810023834.2231088-1-jk@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

b69dd5b3 11-Aug-2021 Eric Dumazet <edumazet@google.com>

net: igmp: increase size of mr_ifc_count

Some arches support cmpxchg() on 4-byte and 8-byte only.
Increase mr_ifc_count width to 32bit to fix this problem.

Fixes: 4a2b285e7e10 ("net: igmp: fix data-race in igmp_ifc_timer_expire()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20210811195715.3684218-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

6de035fe 10-Aug-2021 Neal Cardwell <ncardwell@google.com>

tcp_bbr: fix u32 wrap bug in round logic if bbr_init() called after 2B packets

Currently if BBR congestion control is initialized after more than 2B
packets have been delivered, depending on the phase of the
tp->delivered counter the tracking of BBR round trips can get stuck.

The bug arises because if tp->delivered is between 2^31 and 2^32 at
the time the BBR congestion control module is initialized, then the
initialization of bbr->next_rtt_delivered to 0 will cause the logic to
believe that the end of the round trip is still billions of packets in
the future. More specifically, the following check will fail
repeatedly:

!before(rs->prior_delivered, bbr->next_rtt_delivered)

and thus the connection will take up to 2B packets delivered before
that check will pass and the connection will set:

bbr->round_start = 1;

This could cause many mechanisms in BBR to fail to trigger, for
example bbr_check_full_bw_reached() would likely never exit STARTUP.

This bug is 5 years old and has not been observed, and as a practical
matter this would likely rarely trigger, since it would require
transferring at least 2B packets, or likely more than 3 terabytes of
data, before switching congestion control algorithms to BBR.

This patch is a stable candidate for kernels as far back as v4.9,
when tcp_bbr.c was added.

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Kevin Yang <yyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210811024056.235161-1-ncardwell@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

6922110d 09-Aug-2021 Willy Tarreau <w@1wt.eu>

net: linkwatch: fix failure to restore device state across suspend/resume

After migrating my laptop from 4.19-LTS to 5.4-LTS a while ago I noticed
that my Ethernet port to which a bond and a VLAN interface are attached
appeared to remain up after resuming from suspend with the cable unplugged
(and that problem still persists with 5.10-LTS).

It happens that the following happens:

- the network driver (e1000e here) prepares to suspend, calls e1000e_down()
which calls netif_carrier_off() to signal that the link is going down.
- netif_carrier_off() adds a link_watch event to the list of events for
this device
- the device is completely stopped.
- the machine suspends
- the cable is unplugged and the machine brought to another location
- the machine is resumed
- the queued linkwatch events are processed for the device
- the device doesn't yet have the __LINK_STATE_PRESENT bit and its events
are silently dropped
- the device is resumed with its link down
- the upper VLAN and bond interfaces are never notified that the link had
been turned down and remain up
- the only way to provoke a change is to physically connect the machine
to a port and possibly unplug it.

The state after resume looks like this:
$ ip -br li | egrep 'bond|eth'
bond0 UP e8:6a:64:64:64:64 <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP>
eth0 DOWN e8:6a:64:64:64:64 <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP>
eth0.2@eth0 UP e8:6a:64:64:64:64 <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP>

Placing an explicit call to netdev_state_change() either in the suspend
or the resume code in the NIC driver worked around this but the solution
is not satisfying.

The issue in fact really is in link_watch that loses events while it
ought not to. It happens that the test for the device being present was
added by commit 124eee3f6955 ("net: linkwatch: add check for netdevice
being present to linkwatch_do_dev") in 4.20 to avoid an access to
devices that are not present.

Instead of dropping events, this patch proceeds slightly differently by
postponing their handling so that they happen after the device is fully
resumed.

Fixes: 124eee3f6955 ("net: linkwatch: add check for netdevice being present to linkwatch_do_dev")
Link: https://lists.openwall.net/netdev/2018/03/15/62
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20210809160628.22623-1-w@1wt.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a72808b6 10-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: create a helper for locating EtherType DSA headers on TX

Create a similar helper for locating the offset to the DSA header
relative to skb->data, and make the existing EtherType header taggers to
use it.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

5d928ff4 10-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: create a helper for locating EtherType DSA headers on RX

It seems that protocol tagging driver writers are always surprised about
the formula they use to reach their EtherType header on RX, which
becomes apparent from the fact that there are comments in multiple
drivers that mention the same information.

Create a helper that returns a void pointer to skb->data - 2, as well as
centralize the explanation why that is the case.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

6bef794d 10-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: create a helper which allocates space for EtherType DSA headers

Hide away the memmove used by DSA EtherType header taggers to shift the
MAC SA and DA to the left to make room for the header, after they've
called skb_push(). The call to skb_push() is still left explicit in
drivers, to be symmetric with dsa_strip_etype_header, and because not
all callers can be refactored to do it (for example, brcm_tag_xmit_ll
has common code for a pre-Ethernet DSA tag and an EtherType DSA tag).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f1dacd7a 10-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: create a helper that strips EtherType DSA headers on RX

All header taggers open-code a memmove that is fairly not all that
obvious, and we can hide the details behind a helper function, since the
only thing specific to the driver is the length of the header tag.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9c4a7665 10-Aug-2021 Parav Pandit <parav@nvidia.com>

devlink: Add APIs to publish, unpublish individual parameter

Enable drivers to publish/unpublish individual parameter.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b40c51ef 10-Aug-2021 Parav Pandit <parav@nvidia.com>

devlink: Add API to register and unregister single parameter

Currently device configuration parameters can be registered as an array.
Due to this a constant array must be registered. A single driver
supporting multiple devices each with different device capabilities end
up registering all parameters even if it doesn't support it.

One possible workaround a driver can do is, it registers multiple single
entry arrays to overcome such limitation.

Better is to provide a API that enables driver to register/unregister a
single parameter. This also further helps in two ways.
(1) to reduce the memory of devlink_param_entry by avoiding in registering
parameters which are not supported by the device.
(2) avoid generating multiple parameter add, delete, publish, unpublish,
init value notifications for such unsupported parameters

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

699784f7 10-Aug-2021 Parav Pandit <parav@nvidia.com>

devlink: Create a helper function for one parameter registration

Create and use a helper function for one parameter registration.
Subsequent patch also will reuse this for driver facing routine to
register a single parameter.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

076b2a9d 10-Aug-2021 Parav Pandit <parav@nvidia.com>

devlink: Add new "enable_vnet" generic device param

Add new device generic parameter to enable/disable creation of
VDPA net auxiliary device and associated device functionality
in the devlink instance.

User who prefers to disable such functionality can disable it using below
example.

$ devlink dev param set pci/0000:06:00.0 \
name enable_vnet value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0

At this point devlink instance do not create auxiliary device for the
VDPA net functionality.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8ddaabee 10-Aug-2021 Parav Pandit <parav@nvidia.com>

devlink: Add new "enable_rdma" generic device param

Add new device generic parameter to enable/disable creation of
RDMA auxiliary device and associated device functionality
in the devlink instance.

User who prefers to disable such functionality can disable it using below
example.

$ devlink dev param set pci/0000:06:00.0 \
name enable_rdma value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0

At this point devlink instance do not create auxiliary device for the
RDMA functionality.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f13a5ad8 10-Aug-2021 Parav Pandit <parav@nvidia.com>

devlink: Add new "enable_eth" generic device param

Add new device generic parameter to enable/disable creation of
Ethernet auxiliary device and associated device functionality
in the devlink instance.

User who prefers to disable such functionality can disable it using below
example.

$ devlink dev param set pci/0000:06:00.0 \
name enable_eth value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0

At this point devlink instance do not create auxiliary device for the
Ethernet functionality.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dc002875 10-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: vlan: use br_rports_fill_info() to export mcast router ports

Embed the standard multicast router port export by br_rports_fill_info()
into a new global vlan attribute BRIDGE_VLANDB_GOPTS_MCAST_ROUTER_PORTS.
In order to have the same format for the global bridge mcast context and
the per-vlan mcast context we need a double-nesting:
- BRIDGE_VLANDB_GOPTS_MCAST_ROUTER_PORTS
- MDBA_ROUTER

Currently we don't compare router lists, if any router port exists in
the bridge mcast contexts we consider their option sets as different and
export them separately.

In addition we export the router port vlan id when dumping similar to
the router port notification format.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e04d377f 10-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: mcast: use the proper multicast context when dumping router ports

When we are dumping the router ports of a vlan mcast context we need to
use the bridge/vlan and port/vlan's multicast contexts to check if
IPv4/IPv6 router port is present and later to dump the vlan id.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a97df080 10-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: vlan: add support for mcast router global option

Add support to change and retrieve global vlan multicast router state
which is used for the bridge itself. We just need to pass multicast context
to br_multicast_set_router instead of bridge device and the rest of the
logic remains the same.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

62938182 10-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: vlan: add support for mcast querier global option

Add support to change and retrieve global vlan multicast querier state.
We just need to pass multicast context to br_multicast_set_querier
instead of bridge device and the rest of the logic remains the same.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cb486ce9 10-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: mcast: querier and query state affect only current context type

It is a minor optimization and better behaviour to make sure querier and
query sending routines affect only the matching multicast context
depending if vlan snooping is enabled (vlan ctx vs bridge ctx).
It also avoids sending unnecessary extra query packets.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4d5b4e84 10-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: mcast: move querier state to the multicast context

We need to have the querier state per multicast context in order to have
per-vlan control, so remove the internal option bit and move it to the
multicast context. Also annotate the lockless reads of the new variable.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

941121ee 10-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: vlan: add support for mcast startup query interval global option

Add support to change and retrieve global vlan multicast startup query
interval option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

42521450 10-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: vlan: add support for mcast query response interval global option

Add support to change and retrieve global vlan multicast query response
interval option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d6c08aba 10-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: vlan: add support for mcast query interval global option

Add support to change and retrieve global vlan multicast query interval
option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cd9269d4 10-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: vlan: add support for mcast querier interval global option

Add support to change and retrieve global vlan multicast querier interval
option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2da0aea2 10-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: vlan: add support for mcast membership interval global option

Add support to change and retrieve global vlan multicast membership
interval option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

77f6abab 10-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: vlan: add support for mcast last member interval global option

Add support to change and retrieve global vlan multicast last member
interval option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

50725f6e 10-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: vlan: add support for mcast startup query count global option

Add support to change and retrieve global vlan multicast startup query
count option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

931ba87d 10-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: vlan: add support for mcast last member count global option

Add support to change and retrieve global vlan multicast last member
count option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

df271cd6 10-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: vlan: add support for mcast igmp/mld version global options

Add support to change and retrieve global vlan IGMP/MLD versions.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6f45933d 11-Aug-2021 David S. Miller <davem@davemloft.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Use nfnetlink_unicast() instead of netlink_unicast() in nft_compat.

2) Remove call to nf_ct_l4proto_find() in flowtable offload timeout
fixup.

3) CLUSTERIP registers ARP hook on demand, from Florian.

4) Use clusterip_net to store pernet warning, also from Florian.

5) Remove struct netns_xt, from Florian Westphal.

6) Enable ebtables hooks in initns on demand, from Florian.

7) Allow to filter conntrack netlink dump per status bits,
from Florian Westphal.

8) Register x_tables hooks in initns on demand, from Florian.

9) Remove queue_handler from per-netns structure, again from Florian.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>


e3245a7b 10-Aug-2021 Pavel Skripkin <paskripkin@gmail.com>

netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex

Syzbot hit use-after-free in nf_tables_dump_sets. The problem was in
missing lock protection for nft_ct_pcpu_template_refcnt.

Before commit f102d66b335a ("netfilter: nf_tables: use dedicated
mutex to guard transactions") all transactions were serialized by global
mutex, but then global mutex was changed to local per netnamespace
commit_mutex.

This change causes use-after-free bug, when 2 netnamespaces concurently
changing nft_ct_pcpu_template_refcnt without proper locking. Fix it by
adding nft_ct_pcpu_mutex and protect all nft_ct_pcpu_template_refcnt
changes with it.

Fixes: f102d66b335a ("netfilter: nf_tables: use dedicated mutex to guard transactions")
Reported-and-tested-by: syzbot+649e339fa6658ee623d3@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

d3432bf1 10-Aug-2021 Lahav Schlesinger <lschlesinger@drivenets.com>

net: Support filtering interfaces on no master

Currently there's support for filtering neighbours/links for interfaces
which have a specific master device (using the IFLA_MASTER/NDA_MASTER
attributes).

This patch adds support for filtering interfaces/neighbours dump for
interfaces that *don't* have a master.

Signed-off-by: Lahav Schlesinger <lschlesinger@drivenets.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20210810090658.2778960-1-lschlesinger@drivenets.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a5397d68 09-Aug-2021 Mark Bloch <mbloch@nvidia.com>

net/sched: cls_api, reset flags on replay

tc_new_tfilter() can replay a request if it got EAGAIN. The cited commit
didn't account for this when it converted TC action ->init() API
to use flags instead of parameters. This can lead to passing stale flags
down the call chain which results in trying to lock rtnl when it's
already locked, deadlocking the entire system.

Fix by making sure to reset flags on each replay.

============================================
WARNING: possible recursive locking detected
5.14.0-rc3-custom-49011-g3d2bbb4f104d #447 Not tainted
--------------------------------------------
tc/37605 is trying to acquire lock:
ffffffff841df2f0 (rtnl_mutex){+.+.}-{3:3}, at: tc_setup_cb_add+0x14b/0x4d0

but task is already holding lock:
ffffffff841df2f0 (rtnl_mutex){+.+.}-{3:3}, at: tc_new_tfilter+0xb12/0x22e0

other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(rtnl_mutex);
lock(rtnl_mutex);

*** DEADLOCK ***
May be due to missing lock nesting notation
1 lock held by tc/37605:
#0: ffffffff841df2f0 (rtnl_mutex){+.+.}-{3:3}, at: tc_new_tfilter+0xb12/0x22e0

stack backtrace:
CPU: 0 PID: 37605 Comm: tc Not tainted 5.14.0-rc3-custom-49011-g3d2bbb4f104d #447
Hardware name: Mellanox Technologies Ltd. MSN2010/SA002610, BIOS 5.6.5 08/24/2017
Call Trace:
dump_stack_lvl+0x8b/0xb3
__lock_acquire.cold+0x175/0x3cb
lock_acquire+0x1a4/0x4f0
__mutex_lock+0x136/0x10d0
fl_hw_replace_filter+0x458/0x630 [cls_flower]
fl_change+0x25f2/0x4a64 [cls_flower]
tc_new_tfilter+0xa65/0x22e0
rtnetlink_rcv_msg+0x86c/0xc60
netlink_rcv_skb+0x14d/0x430
netlink_unicast+0x539/0x7e0
netlink_sendmsg+0x84d/0xd80
____sys_sendmsg+0x7ff/0x970
___sys_sendmsg+0xf8/0x170
__sys_sendmsg+0xea/0x1b0
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f7b93b6c0a7
Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48>
RSP: 002b:00007ffe365b3818 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7b93b6c0a7
RDX: 0000000000000000 RSI: 00007ffe365b3880 RDI: 0000000000000003
RBP: 00000000610a75f6 R08: 0000000000000001 R09: 0000000000000000
R10: fffffffffffff3a9 R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000000000 R14: 00007ffe365b7b58 R15: 00000000004822c0

Fixes: 695176bfe5de ("net_sched: refactor TC action init API")
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20210810034305.63997-1-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

519133de 09-Aug-2021 Yang Yingliang <yangyingliang@huawei.com>

net: bridge: fix memleak in br_add_if()

I got a memleak report:

BUG: memory leak
unreferenced object 0x607ee521a658 (size 240):
comm "syz-executor.0", pid 955, jiffies 4294780569 (age 16.449s)
hex dump (first 32 bytes, cpu 1):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<00000000d830ea5a>] br_multicast_add_port+0x1c2/0x300 net/bridge/br_multicast.c:1693
[<00000000274d9a71>] new_nbp net/bridge/br_if.c:435 [inline]
[<00000000274d9a71>] br_add_if+0x670/0x1740 net/bridge/br_if.c:611
[<0000000012ce888e>] do_set_master net/core/rtnetlink.c:2513 [inline]
[<0000000012ce888e>] do_set_master+0x1aa/0x210 net/core/rtnetlink.c:2487
[<0000000099d1cafc>] __rtnl_newlink+0x1095/0x13e0 net/core/rtnetlink.c:3457
[<00000000a01facc0>] rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3488
[<00000000acc9186c>] rtnetlink_rcv_msg+0x369/0xa10 net/core/rtnetlink.c:5550
[<00000000d4aabb9c>] netlink_rcv_skb+0x134/0x3d0 net/netlink/af_netlink.c:2504
[<00000000bc2e12a3>] netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
[<00000000bc2e12a3>] netlink_unicast+0x4a0/0x6a0 net/netlink/af_netlink.c:1340
[<00000000e4dc2d0e>] netlink_sendmsg+0x789/0xc70 net/netlink/af_netlink.c:1929
[<000000000d22c8b3>] sock_sendmsg_nosec net/socket.c:654 [inline]
[<000000000d22c8b3>] sock_sendmsg+0x139/0x170 net/socket.c:674
[<00000000e281417a>] ____sys_sendmsg+0x658/0x7d0 net/socket.c:2350
[<00000000237aa2ab>] ___sys_sendmsg+0xf8/0x170 net/socket.c:2404
[<000000004f2dc381>] __sys_sendmsg+0xd3/0x190 net/socket.c:2433
[<0000000005feca6c>] do_syscall_64+0x37/0x90 arch/x86/entry/common.c:47
[<000000007304477d>] entry_SYSCALL_64_after_hwframe+0x44/0xae

On error path of br_add_if(), p->mcast_stats allocated in
new_nbp() need be freed, or it will be leaked.

Fixes: 1080ab95e3c7 ("net: bridge: add support for IGMP/MLD stats and export them via netlink")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20210809132023.978546-1-yangyingliang@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c35b57ce 10-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: switchdev: zero-initialize struct switchdev_notifier_fdb_info emitted by drivers towards the bridge

The blamed commit added a new field to struct switchdev_notifier_fdb_info,
but did not make sure that all call paths set it to something valid.
For example, a switchdev driver may emit a SWITCHDEV_FDB_ADD_TO_BRIDGE
notifier, and since the 'is_local' flag is not set, it contains junk
from the stack, so the bridge might interpret those notifications as
being for local FDB entries when that was not intended.

To avoid that now and in the future, zero-initialize all
switchdev_notifier_fdb_info structures created by drivers such that all
newly added fields to not need to touch drivers again.

Fixes: 2c4eca3ef716 ("net: bridge: switchdev: include local flag in FDB notifications")
Reported-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/20210810115024.1629983-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

45a68787 10-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: fix flags interpretation for extern learn fdb entries

Ignore fdb flags when adding port extern learn entries and always set
BR_FDB_LOCAL flag when adding bridge extern learn entries. This is
closest to the behaviour we had before and avoids breaking any use cases
which were allowed.

This patch fixes iproute2 calls which assume NUD_PERMANENT and were
allowed before, example:
$ bridge fdb add 00:11:22:33:44:55 dev swp1 extern_learn

Extern learn entries are allowed to roam, but do not expire, so static
or dynamic flags make no sense for them.

Also add a comment for future reference.

Fixes: eb100e0e24a2 ("net: bridge: allow to add externally learned entries from user-space")
Fixes: 0541a6293298 ("net: bridge: validate the NUD_PERMANENT bit when adding an extern_learn FDB entry")
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20210810110010.43859-1-razor@blackwall.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9082e1d9 15-Jul-2021 Chuck Lever <chuck.lever@oracle.com>

SUNRPC: Eliminate the RQ_AUTHERR flag

Now that there is an alternate method for returning an auth_stat
value, replace the RQ_AUTHERR flag with use of that new method.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

5c2465df 15-Jul-2021 Chuck Lever <chuck.lever@oracle.com>

SUNRPC: Set rq_auth_stat in the pg_authenticate() callout

In a few moments, rq_auth_stat will need to be explicitly set to
rpc_auth_ok before execution gets to the dispatcher.

svc_authenticate() already sets it, but it often gets reset to
rpc_autherr_badcred right after that call, even when authentication
is successful. Let's ensure that the pg_authenticate callout and
svc_set_client() set it properly in every case.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

438623a0 15-Jul-2021 Chuck Lever <chuck.lever@oracle.com>

SUNRPC: Add svc_rqst::rq_auth_stat

I'd like to take commit 4532608d71c8 ("SUNRPC: Clean up generic
dispatcher code") even further by using only private local SVC
dispatchers for all kernel RPC services. This change would enable
the removal of the logic that switches between
svc_generic_dispatch() and a service's private dispatcher, and
simplify the invocation of the service's pc_release method
so that humans can visually verify that it is always invoked
properly.

All that will come later.

First, let's provide a better way to return authentication errors
from SVC dispatcher functions. Instead of overloading the dispatch
method's *statp argument, add a field to struct svc_rqst that can
hold an error value.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

e1dee2c1 09-Aug-2021 Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>

Bluetooth: fix repeated calls to sco_sock_kill

In commit 4e1a720d0312 ("Bluetooth: avoid killing an already killed
socket"), a check was added to sco_sock_kill to skip killing a socket
if the SOCK_DEAD flag was set.

This was done after a trace for a use-after-free bug showed that the
same sock pointer was being killed twice.

Unfortunately, this check prevents sco_sock_kill from running on any
socket. sco_sock_kill kills a socket only if it's zapped and orphaned,
however sock_orphan announces that the socket is dead before detaching
it. i.e., orphaned sockets have the SOCK_DEAD flag set.

To fix this, we remove the check for SOCK_DEAD, and avoid repeated
calls to sco_sock_kill by removing incorrect calls in:

1. sco_sock_timeout. The socket should not be killed on timeout as
further processing is expected to be done. For example,
sco_sock_connect sets the timer then waits for the socket to be
connected or for an error to be returned.

2. sco_conn_del. This function should clean up resources for the
connection, but the socket itself should be cleaned up in
sco_sock_release.

3. sco_sock_close. Calls to sco_sock_close in sco_sock_cleanup_listen
and sco_sock_release are followed by sco_sock_kill. Hence the
duplicated call should be removed.

Fixes: 4e1a720d0312 ("Bluetooth: avoid killing an already killed socket")
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

b7ce436a 09-Aug-2021 Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>

Bluetooth: switch to lock_sock in RFCOMM

Other than rfcomm_sk_state_change and rfcomm_connect_ind, functions in
RFCOMM use lock_sock to lock the socket.

Since bh_lock_sock and spin_lock_bh do not provide synchronization
with lock_sock, these calls should be changed to lock_sock.

This is now safe to do because packet processing is now done in a
workqueue instead of a tasklet, so bh_lock_sock/spin_lock_bh are no
longer necessary to synchronise between user contexts and SOFTIRQ
processing.

Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

3f2c89fb 09-Aug-2021 Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>

Bluetooth: serialize calls to sco_sock_{set,clear}_timer

Currently, calls to sco_sock_set_timer are made under the locked
socket, but this does not apply to all calls to sco_sock_clear_timer.

Both sco_sock_{set,clear}_timer should be serialized by lock_sock to
prevent unexpected concurrent clearing/setting of timers.

Additionally, since sco_pi(sk)->conn is only cleared under the locked
socket, this change allows us to avoid races between
sco_sock_clear_timer and the call to kfree(conn) in sco_conn_del.

Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

27c24fda 09-Aug-2021 Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>

Bluetooth: switch to lock_sock in SCO

Since sco_sock_timeout is now scheduled using delayed work, it is no
longer run in SOFTIRQ context. Hence bh_lock_sock is no longer
necessary in SCO to synchronise between user contexts and SOFTIRQ
processing.

As such, calls to bh_lock_sock should be replaced with lock_sock to
synchronize with other concurrent processes that use lock_sock.

Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

734bc5ff 09-Aug-2021 Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>

Bluetooth: avoid circular locks in sco_sock_connect

In a future patch, calls to bh_lock_sock in sco.c should be replaced
by lock_sock now that none of the functions are run in IRQ context.

However, doing so results in a circular locking dependency:

======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc4-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.2/14867 is trying to acquire lock:
ffff88803e3c1120 (sk_lock-AF_BLUETOOTH-BTPROTO_SCO){+.+.}-{0:0}, at:
lock_sock include/net/sock.h:1613 [inline]
ffff88803e3c1120 (sk_lock-AF_BLUETOOTH-BTPROTO_SCO){+.+.}-{0:0}, at:
sco_conn_del+0x12a/0x2a0 net/bluetooth/sco.c:191

but task is already holding lock:
ffffffff8d2dc7c8 (hci_cb_list_lock){+.+.}-{3:3}, at:
hci_disconn_cfm include/net/bluetooth/hci_core.h:1497 [inline]
ffffffff8d2dc7c8 (hci_cb_list_lock){+.+.}-{3:3}, at:
hci_conn_hash_flush+0xda/0x260 net/bluetooth/hci_conn.c:1608

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (hci_cb_list_lock){+.+.}-{3:3}:
__mutex_lock_common kernel/locking/mutex.c:959 [inline]
__mutex_lock+0x12a/0x10a0 kernel/locking/mutex.c:1104
hci_connect_cfm include/net/bluetooth/hci_core.h:1482 [inline]
hci_remote_features_evt net/bluetooth/hci_event.c:3263 [inline]
hci_event_packet+0x2f4d/0x7c50 net/bluetooth/hci_event.c:6240
hci_rx_work+0x4f8/0xd30 net/bluetooth/hci_core.c:5122
process_one_work+0x98d/0x1630 kernel/workqueue.c:2276
worker_thread+0x658/0x11f0 kernel/workqueue.c:2422
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

-> #1 (&hdev->lock){+.+.}-{3:3}:
__mutex_lock_common kernel/locking/mutex.c:959 [inline]
__mutex_lock+0x12a/0x10a0 kernel/locking/mutex.c:1104
sco_connect net/bluetooth/sco.c:245 [inline]
sco_sock_connect+0x227/0xa10 net/bluetooth/sco.c:601
__sys_connect_file+0x155/0x1a0 net/socket.c:1879
__sys_connect+0x161/0x190 net/socket.c:1896
__do_sys_connect net/socket.c:1906 [inline]
__se_sys_connect net/socket.c:1903 [inline]
__x64_sys_connect+0x6f/0xb0 net/socket.c:1903
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #0 (sk_lock-AF_BLUETOOTH-BTPROTO_SCO){+.+.}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3051 [inline]
check_prevs_add kernel/locking/lockdep.c:3174 [inline]
validate_chain kernel/locking/lockdep.c:3789 [inline]
__lock_acquire+0x2a07/0x54a0 kernel/locking/lockdep.c:5015
lock_acquire kernel/locking/lockdep.c:5625 [inline]
lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5590
lock_sock_nested+0xca/0x120 net/core/sock.c:3170
lock_sock include/net/sock.h:1613 [inline]
sco_conn_del+0x12a/0x2a0 net/bluetooth/sco.c:191
sco_disconn_cfm+0x71/0xb0 net/bluetooth/sco.c:1202
hci_disconn_cfm include/net/bluetooth/hci_core.h:1500 [inline]
hci_conn_hash_flush+0x127/0x260 net/bluetooth/hci_conn.c:1608
hci_dev_do_close+0x528/0x1130 net/bluetooth/hci_core.c:1778
hci_unregister_dev+0x1c0/0x5a0 net/bluetooth/hci_core.c:4015
vhci_release+0x70/0xe0 drivers/bluetooth/hci_vhci.c:340
__fput+0x288/0x920 fs/file_table.c:280
task_work_run+0xdd/0x1a0 kernel/task_work.c:164
exit_task_work include/linux/task_work.h:32 [inline]
do_exit+0xbd4/0x2a60 kernel/exit.c:825
do_group_exit+0x125/0x310 kernel/exit.c:922
get_signal+0x47f/0x2160 kernel/signal.c:2808
arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:865
handle_signal_work kernel/entry/common.c:148 [inline]
exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:209
__syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:302
ret_from_fork+0x15/0x30 arch/x86/entry/entry_64.S:288

other info that might help us debug this:

Chain exists of:
sk_lock-AF_BLUETOOTH-BTPROTO_SCO --> &hdev->lock --> hci_cb_list_lock

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(hci_cb_list_lock);
lock(&hdev->lock);
lock(hci_cb_list_lock);
lock(sk_lock-AF_BLUETOOTH-BTPROTO_SCO);

*** DEADLOCK ***

The issue is that the lock hierarchy should go from &hdev->lock -->
hci_cb_list_lock --> sk_lock-AF_BLUETOOTH-BTPROTO_SCO. For example,
one such call trace is:

hci_dev_do_close():
hci_dev_lock();
hci_conn_hash_flush():
hci_disconn_cfm():
mutex_lock(&hci_cb_list_lock);
sco_disconn_cfm():
sco_conn_del():
lock_sock(sk);

However, in sco_sock_connect, we call lock_sock before calling
hci_dev_lock inside sco_connect, thus inverting the lock hierarchy.

We fix this by pulling the call to hci_dev_lock out from sco_connect.

Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

ba316be1 09-Aug-2021 Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>

Bluetooth: schedule SCO timeouts with delayed_work

struct sock.sk_timer should be used as a sock cleanup timer. However,
SCO uses it to implement sock timeouts.

This causes issues because struct sock.sk_timer's callback is run in
an IRQ context, and the timer callback function sco_sock_timeout takes
a spin lock on the socket. However, other functions such as
sco_conn_del and sco_conn_ready take the spin lock with interrupts
enabled.

This inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} lock usage could
lead to deadlocks as reported by Syzbot [1]:
CPU0
----
lock(slock-AF_BLUETOOTH-BTPROTO_SCO);
<Interrupt>
lock(slock-AF_BLUETOOTH-BTPROTO_SCO);

To fix this, we use delayed work to implement SCO sock timouts
instead. This allows us to avoid taking the spin lock on the socket in
an IRQ context, and corrects the misuse of struct sock.sk_timer.

As a note, cancel_delayed_work is used instead of
cancel_delayed_work_sync in sco_sock_set_timer and
sco_sock_clear_timer to avoid a deadlock. In the future, the call to
bh_lock_sock inside sco_sock_timeout should be changed to lock_sock to
synchronize with other functions using lock_sock. However, since
sco_sock_set_timer and sco_sock_clear_timer are sometimes called under
the locked socket (in sco_connect and __sco_sock_close),
cancel_delayed_work_sync might cause them to sleep until an
sco_sock_timeout that has started finishes running. But
sco_sock_timeout would also sleep until it can grab the lock_sock.

Using cancel_delayed_work is fine because sco_sock_timeout does not
change from run to run, hence there is no functional difference
between:
1. waiting for a timeout to finish running before scheduling another
timeout
2. scheduling another timeout while a timeout is running.

Link: https://syzkaller.appspot.com/bug?id=9089d89de0502e120f234ca0fc8a703f7368b31e [1]
Reported-by: syzbot+2f6d7c28bb4bf7e82060@syzkaller.appspotmail.com
Tested-by: syzbot+2f6d7c28bb4bf7e82060@syzkaller.appspotmail.com
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

87029970 04-Aug-2021 Florian Westphal <fw@strlen.de>

netfilter: nf_queue: move hookfn registration out of struct net

This was done to detect when the pernet->init() function was not called
yet, by checking if net->nf.queue_handler is NULL.

Once the nfnetlink_queue module is active, all struct net pointers
contain the same address. So place this back in nf_queue.c.

Handle the 'netns error unwind' test by checking nfnl_queue_net for a
NULL pointer and add a comment for this.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

d1a4e0a9 10-Aug-2021 Jakub Kicinski <kuba@kernel.org>

Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
bpf-next 2021-08-10

We've added 31 non-merge commits during the last 8 day(s) which contain
a total of 28 files changed, 3644 insertions(+), 519 deletions(-).

1) Native XDP support for bonding driver & related BPF selftests, from Jussi Maki.

2) Large batch of new BPF JIT tests for test_bpf.ko that came out as a result from
32-bit MIPS JIT development, from Johan Almbladh.

3) Rewrite of netcnt BPF selftest and merge into test_progs, from Stanislav Fomichev.

4) Fix XDP bpf_prog_test_run infra after net to net-next merge, from Andrii Nakryiko.

5) Follow-up fix in unix_bpf_update_proto() to enforce socket type, from Cong Wang.

6) Fix bpf-iter-tcp4 selftest to print the correct dest IP, from Jose Blanquicet.

7) Various misc BPF XDP sample improvements, from Niklas Söderlund, Matthew Cover,
and Muhammad Falak R Wani.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits)
bpf, tests: Add tail call test suite
bpf, tests: Add tests for BPF_CMPXCHG
bpf, tests: Add tests for atomic operations
bpf, tests: Add test for 32-bit context pointer argument passing
bpf, tests: Add branch conversion JIT test
bpf, tests: Add word-order tests for load/store of double words
bpf, tests: Add tests for ALU operations implemented with function calls
bpf, tests: Add more ALU64 BPF_MUL tests
bpf, tests: Add more BPF_LSH/RSH/ARSH tests for ALU64
bpf, tests: Add more ALU32 tests for BPF_LSH/RSH/ARSH
bpf, tests: Add more tests of ALU32 and ALU64 bitwise operations
bpf, tests: Fix typos in test case descriptions
bpf, tests: Add BPF_MOV tests for zero and sign extension
bpf, tests: Add BPF_JMP32 test cases
samples, bpf: Add an explict comment to handle nested vlan tagging.
selftests/bpf: Add tests for XDP bonding
selftests/bpf: Fix xdp_tx.c prog section name
net, core: Allow netdev_lower_get_next_private_rcu in bh context
bpf, devmap: Exclude XDP broadcast to master device
net, bonding: Add XDP support to the bonding driver
...
====================

Link: https://lore.kernel.org/r/20210810130038.16927-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


2e273b09 10-Aug-2021 Jakub Kicinski <kuba@kernel.org>

Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
bpf 2021-08-10

We've added 5 non-merge commits during the last 2 day(s) which contain
a total of 7 files changed, 27 insertions(+), 15 deletions(-).

1) Fix missing bpf_read_lock_trace() context for BPF loader progs, from Yonghong Song.

2) Fix corner case where BPF prog retrieves wrong local storage, also from Yonghong Song.

3) Restrict availability of BPF write_user helper behind lockdown, from Daniel Borkmann.

4) Fix multiple kernel-doc warnings in BPF core, from Randy Dunlap.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf, core: Fix kernel-doc notation
bpf: Fix potentially incorrect results with bpf_get_local_storage()
bpf: Add missing bpf_read_[un]lock_trace() for syscall program
bpf: Add lockdown check for probe_write_user helper
bpf: Add _kernel suffix to internal lockdown_bpf_read
====================

Link: https://lore.kernel.org/r/20210810144025.22814-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


4a2b285e 10-Aug-2021 Eric Dumazet <edumazet@google.com>

net: igmp: fix data-race in igmp_ifc_timer_expire()

Fix the data-race reported by syzbot [1]
Issue here is that igmp_ifc_timer_expire() can update in_dev->mr_ifc_count
while another change just occured from another context.

in_dev->mr_ifc_count is only 8bit wide, so the race had little
consequences.

[1]
BUG: KCSAN: data-race in igmp_ifc_event / igmp_ifc_timer_expire

write to 0xffff8881051e3062 of 1 bytes by task 12547 on cpu 0:
igmp_ifc_event+0x1d5/0x290 net/ipv4/igmp.c:821
igmp_group_added+0x462/0x490 net/ipv4/igmp.c:1356
____ip_mc_inc_group+0x3ff/0x500 net/ipv4/igmp.c:1461
__ip_mc_join_group+0x24d/0x2c0 net/ipv4/igmp.c:2199
ip_mc_join_group_ssm+0x20/0x30 net/ipv4/igmp.c:2218
do_ip_setsockopt net/ipv4/ip_sockglue.c:1285 [inline]
ip_setsockopt+0x1827/0x2a80 net/ipv4/ip_sockglue.c:1423
tcp_setsockopt+0x8c/0xa0 net/ipv4/tcp.c:3657
sock_common_setsockopt+0x5d/0x70 net/core/sock.c:3362
__sys_setsockopt+0x18f/0x200 net/socket.c:2159
__do_sys_setsockopt net/socket.c:2170 [inline]
__se_sys_setsockopt net/socket.c:2167 [inline]
__x64_sys_setsockopt+0x62/0x70 net/socket.c:2167
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

read to 0xffff8881051e3062 of 1 bytes by interrupt on cpu 1:
igmp_ifc_timer_expire+0x706/0xa30 net/ipv4/igmp.c:808
call_timer_fn+0x2e/0x1d0 kernel/time/timer.c:1419
expire_timers+0x135/0x250 kernel/time/timer.c:1464
__run_timers+0x358/0x420 kernel/time/timer.c:1732
run_timer_softirq+0x19/0x30 kernel/time/timer.c:1745
__do_softirq+0x12c/0x26e kernel/softirq.c:558
invoke_softirq kernel/softirq.c:432 [inline]
__irq_exit_rcu+0x9a/0xb0 kernel/softirq.c:636
sysvec_apic_timer_interrupt+0x69/0x80 arch/x86/kernel/apic/apic.c:1100
asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638
console_unlock+0x8e8/0xb30 kernel/printk/printk.c:2646
vprintk_emit+0x125/0x3d0 kernel/printk/printk.c:2174
vprintk_default+0x22/0x30 kernel/printk/printk.c:2185
vprintk+0x15a/0x170 kernel/printk/printk_safe.c:392
printk+0x62/0x87 kernel/printk/printk.c:2216
selinux_netlink_send+0x399/0x400 security/selinux/hooks.c:6041
security_netlink_send+0x42/0x90 security/security.c:2070
netlink_sendmsg+0x59e/0x7c0 net/netlink/af_netlink.c:1919
sock_sendmsg_nosec net/socket.c:703 [inline]
sock_sendmsg net/socket.c:723 [inline]
____sys_sendmsg+0x360/0x4d0 net/socket.c:2392
___sys_sendmsg net/socket.c:2446 [inline]
__sys_sendmsg+0x1ed/0x270 net/socket.c:2475
__do_sys_sendmsg net/socket.c:2484 [inline]
__se_sys_sendmsg net/socket.c:2482 [inline]
__x64_sys_sendmsg+0x42/0x50 net/socket.c:2482
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

value changed: 0x01 -> 0x02

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 12539 Comm: syz-executor.1 Not tainted 5.14.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1090340f 05-Aug-2021 Takeshi Misawa <jeliantsurux@gmail.com>

net: Fix memory leak in ieee802154_raw_deliver

If IEEE-802.15.4-RAW is closed before receive skb, skb is leaked.
Fix this, by freeing sk_receive_queue in sk->sk_destruct().

syzbot report:
BUG: memory leak
unreferenced object 0xffff88810f644600 (size 232):
comm "softirq", pid 0, jiffies 4294967032 (age 81.270s)
hex dump (first 32 bytes):
10 7d 4b 12 81 88 ff ff 10 7d 4b 12 81 88 ff ff .}K......}K.....
00 00 00 00 00 00 00 00 40 7c 4b 12 81 88 ff ff ........@|K.....
backtrace:
[<ffffffff83651d4a>] skb_clone+0xaa/0x2b0 net/core/skbuff.c:1496
[<ffffffff83fe1b80>] ieee802154_raw_deliver net/ieee802154/socket.c:369 [inline]
[<ffffffff83fe1b80>] ieee802154_rcv+0x100/0x340 net/ieee802154/socket.c:1070
[<ffffffff8367cc7a>] __netif_receive_skb_one_core+0x6a/0xa0 net/core/dev.c:5384
[<ffffffff8367cd07>] __netif_receive_skb+0x27/0xa0 net/core/dev.c:5498
[<ffffffff8367cdd9>] netif_receive_skb_internal net/core/dev.c:5603 [inline]
[<ffffffff8367cdd9>] netif_receive_skb+0x59/0x260 net/core/dev.c:5662
[<ffffffff83fe6302>] ieee802154_deliver_skb net/mac802154/rx.c:29 [inline]
[<ffffffff83fe6302>] ieee802154_subif_frame net/mac802154/rx.c:102 [inline]
[<ffffffff83fe6302>] __ieee802154_rx_handle_packet net/mac802154/rx.c:212 [inline]
[<ffffffff83fe6302>] ieee802154_rx+0x612/0x620 net/mac802154/rx.c:284
[<ffffffff83fe59a6>] ieee802154_tasklet_handler+0x86/0xa0 net/mac802154/main.c:35
[<ffffffff81232aab>] tasklet_action_common.constprop.0+0x5b/0x100 kernel/softirq.c:557
[<ffffffff846000bf>] __do_softirq+0xbf/0x2ab kernel/softirq.c:345
[<ffffffff81232f4c>] do_softirq kernel/softirq.c:248 [inline]
[<ffffffff81232f4c>] do_softirq+0x5c/0x80 kernel/softirq.c:235
[<ffffffff81232fc1>] __local_bh_enable_ip+0x51/0x60 kernel/softirq.c:198
[<ffffffff8367a9a4>] local_bh_enable include/linux/bottom_half.h:32 [inline]
[<ffffffff8367a9a4>] rcu_read_unlock_bh include/linux/rcupdate.h:745 [inline]
[<ffffffff8367a9a4>] __dev_queue_xmit+0x7f4/0xf60 net/core/dev.c:4221
[<ffffffff83fe2db4>] raw_sendmsg+0x1f4/0x2b0 net/ieee802154/socket.c:295
[<ffffffff8363af16>] sock_sendmsg_nosec net/socket.c:654 [inline]
[<ffffffff8363af16>] sock_sendmsg+0x56/0x80 net/socket.c:674
[<ffffffff8363deec>] __sys_sendto+0x15c/0x200 net/socket.c:1977
[<ffffffff8363dfb6>] __do_sys_sendto net/socket.c:1989 [inline]
[<ffffffff8363dfb6>] __se_sys_sendto net/socket.c:1985 [inline]
[<ffffffff8363dfb6>] __x64_sys_sendto+0x26/0x30 net/socket.c:1985

Fixes: 9ec767160357 ("net: add IEEE 802.15.4 socket family implementation")
Reported-and-tested-by: syzbot+1f68113fa907bf0695a8@syzkaller.appspotmail.com
Signed-off-by: Takeshi Misawa <jeliantsurux@gmail.com>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/r/20210805075414.GA15796@DESKTOP
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>

87b7b533 09-Aug-2021 Yonghong Song <yhs@fb.com>

bpf: Add missing bpf_read_[un]lock_trace() for syscall program

Commit 79a7f8bdb159d ("bpf: Introduce bpf_sys_bpf() helper and program type.")
added support for syscall program, which is a sleepable program.

But the program run missed bpf_read_lock_trace()/bpf_read_unlock_trace(),
which is needed to ensure proper rcu callback invocations. This patch adds
bpf_read_[un]lock_trace() properly.

Fixes: 79a7f8bdb159d ("bpf: Introduce bpf_sys_bpf() helper and program type.")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210809235151.1663680-1-yhs@fb.com

53e0961d 05-Aug-2021 Yunsheng Lin <linyunsheng@huawei.com>

page_pool: add frag page recycling support in page pool

Currently page pool only support page recycling when there
is only one user of the page, and the split page reusing
implemented in the most driver can not use the page pool as
bing-pong way of reusing requires the multi user support in
page pool.

Those reusing or recycling has below limitations:
1. page from page pool can only be used be one user in order
for the page recycling to happen.
2. Bing-pong way of reusing in most driver does not support
multi desc using different part of the same page in order
to save memory.

So add multi-users support and frag page recycling in page
pool to overcome the above limitation.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

0e9d2a0a 05-Aug-2021 Yunsheng Lin <linyunsheng@huawei.com>

page_pool: add interface to manipulate frag count in page pool

For 32 bit systems with 64 bit dma, dma_addr[1] is used to
store the upper 32 bit dma addr, those system should be rare
those days.

For normal system, the dma_addr[1] in 'struct page' is not
used, so we can reuse dma_addr[1] for storing frag count,
which means how many frags this page might be splited to.

In order to simplify the page frag support in the page pool,
the PAGE_POOL_DMA_USE_PP_FRAG_COUNT macro is added to indicate
the 32 bit systems with 64 bit dma, and the page frag support
in page pool is disabled for such system.

The newly added page_pool_set_frag_count() is called to reserve
the maximum frag count before any page frag is passed to the
user. The page_pool_atomic_sub_frag_count_return() is called
when user is done with the page frag.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

57f05bc2 05-Aug-2021 Yunsheng Lin <linyunsheng@huawei.com>

page_pool: keep pp info as long as page pool owns the page

Currently, page->pp is cleared and set everytime the page
is recycled, which is unnecessary.

So only set the page->pp when the page is added to the page
pool and only clear it when the page is released from the
page pool.

This is also a preparation to support allocating frag page
in page pool.

Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

d6e712aa 08-Aug-2021 Randy Dunlap <rdunlap@infradead.org>

net: openvswitch: fix kernel-doc warnings in flow.c

Repair kernel-doc notation in a few places to make it conform to
the expected format.

Fixes the following kernel-doc warnings:

flow.c:296: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* Parse vlan tag from vlan header.
flow.c:296: warning: missing initial short description on line:
* Parse vlan tag from vlan header.
flow.c:537: warning: No description found for return value of 'key_extract_l3l4'
flow.c:769: warning: No description found for return value of 'key_extract'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Pravin B Shelar <pshelar@ovn.org>
Cc: dev@openvswitch.org
Link: https://lore.kernel.org/r/20210808190834.23362-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

69f2cd6d 29-Jul-2021 Anna Schumaker <Anna.Schumaker@Netapp.com>

SUNRPC: Add dst_port to the sysfs xprt info file

This is most likely going to be 2049 for NFS, but some servers might be
configured to export on a non-standard port. Let's show this information
just in case somebody needs it.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

e44773da 29-Jul-2021 Anna Schumaker <Anna.Schumaker@Netapp.com>

SUNRPC: Add srcaddr as a file in sysfs

I don't support changing it right now, but it could be useful
information for clients with multiple network cards.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

5d46dd04 20-Jul-2021 Anna Schumaker <Anna.Schumaker@Netapp.com>

sunrpc: Fix return value of get_srcport()

Since bc1c56e9bbe9 transport->srcport may by unset, causing
get_srcport() to return 0 when called. Fix this by querying the port
from the underlying socket instead of the transport.

Fixes: bc1c56e9bbe9 (SUNRPC: prevent port reuse on transports which don't request it)
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

68918669 30-Jul-2021 Jussi Maki <joamaki@gmail.com>

net, core: Allow netdev_lower_get_next_private_rcu in bh context

For the XDP bonding slave lookup to work in the NAPI poll context in which
the redudant rcu_read_lock() has been removed we have to follow the same
approach as in 694cea395fde ("bpf: Allow RCU-protected lookups to happen
from bh context") and modify the WARN_ON to also check rcu_read_lock_bh_held().

Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210731055738.16820-6-joamaki@gmail.com

879af96f 30-Jul-2021 Jussi Maki <joamaki@gmail.com>

net, core: Add support for XDP redirection to slave device

This adds the ndo_xdp_get_xmit_slave hook for transforming XDP_TX
into XDP_REDIRECT after BPF program run when the ingress device
is a bond slave.

The dev_xdp_prog_count is exposed so that slave devices can be checked
for loaded XDP programs in order to avoid the situation where both
bond master and slave have programs loaded according to xdp_state.

Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Link: https://lore.kernel.org/bpf/20210731055738.16820-3-joamaki@gmail.com

f99fa508 26-Jul-2021 Trond Myklebust <trond.myklebust@hammerspace.com>

SUNRPC/xprtrdma: Fix reconnection locking

The xprtrdma client code currently relies on the task that initiated the
connect to hold the XPRT_LOCK for the duration of the connection
attempt. If the task is woken early, due to some other event, then that
lock could get released early.
Avoid races by using the same mechanism that the socket code uses of
transferring lock ownership to the RDMA connect worker itself. That
frees us to call rpcrdma_xprt_disconnect() directly since we're now
guaranteed exclusion w.r.t. other callers.

Fixes: 4cf44be6f1e8 ("xprtrdma: Fix recursion into rpcrdma_xprt_disconnect()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

e26d9972 26-Jul-2021 Trond Myklebust <trond.myklebust@hammerspace.com>

SUNRPC: Clean up scheduling of autoclose

Consolidate duplicated code in xprt_force_disconnect() and
xprt_conditional_disconnect().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

c2dc3e5f 26-Jul-2021 Trond Myklebust <trond.myklebust@hammerspace.com>

SUNRPC: Fix potential memory corruption

We really should not call rpc_wake_up_queued_task_set_status() with
xprt->snd_task as an argument unless we are certain that is actually an
rpc_task.

Fixes: 0445f92c5d53 ("SUNRPC: Fix disconnection races")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

71d3d0eb 26-Jul-2021 Trond Myklebust <trond.myklebust@hammerspace.com>

SUNRPC: Convert rpc_client refcount to use refcount_t

There are now tools in the refcount library that allow us to convert the
client shutdown code.

Reported-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

8d863b1f 02-Aug-2021 Chuck Lever <chuck.lever@oracle.com>

xprtrdma: Eliminate rpcrdma_post_sends()

Clean up.

Now that there is only one registration mode, there is only one
target "post_send" method: frwr_send(). rpcrdma_post_sends() no
longer adds much value, especially since all of its call sites
ignore the return code value except to check if it's non-zero.

Just have them call frwr_send() directly instead.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

d9ae8134 02-Aug-2021 Chuck Lever <chuck.lever@oracle.com>

xprtrdma: Add an xprtrdma_post_send_err tracepoint

Unlike xprtrdma_post_send(), this one can be left enabled all the
time, and should almost never fire. But we do want to know about
immediate errors when they happen.

Note that there is already a similar post_linv_err tracepoint.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

683f31c3 02-Aug-2021 Chuck Lever <chuck.lever@oracle.com>

xprtrdma: Add xprtrdma_post_recvs_err() tracepoint

In the vast majority of cases, rc=0. Don't record that in the
post_recvs tracepoint. Instead, add a separate tracepoint that can
be left enabled all the time to capture the very rare immediate
errors returned by ib_post_recv().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

97480cae 02-Aug-2021 Chuck Lever <chuck.lever@oracle.com>

xprtrdma: Put rpcrdma_reps before waking the tear-down completion

Ensure the tear-down completion is awoken only /after/ we've stopped
fiddling with rpcrdma_rep objects in rpcrdma_post_recvs().

Fixes: 15788d1d1077 ("xprtrdma: Do not refresh Receive Queue while it is draining")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

1143129e 02-Aug-2021 Chuck Lever <chuck.lever@oracle.com>

xprtrdma: Disconnect after an ib_post_send() immediate error

ib_post_send() does not disconnect the QP when it returns an
immediate error. Thus, the code that posts LocalInv has to
explicitly disconnect after an immediate error. This is just
like the frwr_send() callers handle it.

If a disconnect isn't done here, the transport deadlocks.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

823c73d0 19-Jul-2021 Chuck Lever <chuck.lever@oracle.com>

SUNRPC: Unset RPC_TASK_NO_RETRANS_TIMEOUT for NULL RPCs

In some rare failure modes, the server is actually reading the
transport, but then just dropping the requests on the floor.
TCP_USER_TIMEOUT cannot detect that case.

Prevent such a stuck server from pinning client resources
indefinitely by ensuring that certain idempotent requests
(such as NULL) can time out even if the connection is still
operational.

Otherwise rpc_bind_new_program(), gss_destroy_cred(), or
rpc_clnt_test_and_add_xprt() can wait forever.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

aede5172 19-Jul-2021 Chuck Lever <chuck.lever@oracle.com>

SUNRPC: Refactor rpc_ping()

Make it use the rpc_null_call_helper() so that it can share the
new rpc_call_ops structure to be introduced in the next patch.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

2a2b6e36 09-Aug-2021 Leon Romanovsky <leon@kernel.org>

devlink: Fix port_type_set function pointer check

Fix a typo when checking existence of port_type_set function pointer.

Fixes: 82564f6c706a ("devlink: Simplify devlink port API calls")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d09c548d 09-Aug-2021 Hangbin Liu <liuhangbin@gmail.com>

net: sched: act_mirred: Reset ct info when mirror/redirect skb

When mirror/redirect a skb to a different port, the ct info should be reset
for reclassification. Or the pkts will match unexpected rules. For example,
with following topology and commands:

-----------
|
veth0 -+-------
|
veth1 -+-------
|
------------

tc qdisc add dev veth0 clsact
# The same with "action mirred egress mirror dev veth1" or "action mirred ingress redirect dev veth1"
tc filter add dev veth0 egress chain 1 protocol ip flower ct_state +trk action mirred ingress mirror dev veth1
tc filter add dev veth0 egress chain 0 protocol ip flower ct_state -inv action ct commit action goto chain 1
tc qdisc add dev veth1 clsact
tc filter add dev veth1 ingress chain 0 protocol ip flower ct_state +trk action drop

ping <remove ip via veth0> &
tc -s filter show dev veth1 ingress

With command 'tc -s filter show', we can find the pkts were dropped on
veth1.

Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

67161779 09-Aug-2021 Stefan Raspl <raspl@linux.ibm.com>

net/smc: Allow SMC-D 1MB DMB allocations

Commit a3fe3d01bd0d7 ("net/smc: introduce sg-logic for RMBs") introduced
a restriction for RMB allocations as used by SMC-R. However, SMC-D does
not use scatter-gather lists to back its DMBs, yet it was limited by
this restriction, still.
This patch exempts SMC, but limits allocations to the maximum RMB/DMB
size respectively.

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

64513d26 09-Aug-2021 Guvenc Gulce <guvenc@linux.ibm.com>

net/smc: Correct smc link connection counter in case of smc client

SMC clients may be assigned to a different link after the initial
connection between two peers was established. In such a case,
the connection counter was not correctly set.

Update the connection counter correctly when a smc client connection
is assigned to a different smc link.

Fixes: 07d51580ff65 ("net/smc: Add connection counters for links")
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Tested-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8f3d65c1 09-Aug-2021 Karsten Graul <kgraul@linux.ibm.com>

net/smc: fix wait on already cleared link

There can be a race between the waiters for a tx work request buffer
and the link down processing that finally clears the link. Although
all waiters are woken up before the link is cleared there might be
waiters which did not yet get back control and are still waiting.
This results in an access to a cleared wait queue head.

Fix this by introducing atomic reference counting around the wait calls,
and wait with the link clear processing until all waiters have finished.
Move the work request layer related calls into smc_wr.c and set the
link state to INACTIVE before calling smcr_link_clear() in
smc_llc_srv_add_link().

Fixes: 15e1b99aadfb ("net/smc: no WR buffer wait for terminating link group")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

919d13a7 08-Aug-2021 Leon Romanovsky <leon@kernel.org>

devlink: Set device as early as possible

All kernel devlink implementations call to devlink_alloc() during
initialization routine for specific device which is used later as
a parent device for devlink_register().

Such late device assignment causes to the situation which requires us to
call to device_register() before setting other parameters, but that call
opens devlink to the world and makes accessible for the netlink users.

Any attempt to move devlink_register() to be the last call generates the
following error due to access to the devlink->dev pointer.

[ 8.758862] devlink_nl_param_fill+0x2e8/0xe50
[ 8.760305] devlink_param_notify+0x6d/0x180
[ 8.760435] __devlink_params_register+0x2f1/0x670
[ 8.760558] devlink_params_register+0x1e/0x20

The simple change of API to set devlink device in the devlink_alloc()
instead of devlink_register() fixes all this above and ensures that
prior to call to devlink_register() everything already set.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8c39ed48 09-Aug-2021 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

net/iucv: Replace deprecated CPU-hotplug functions.

The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().

Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.

Cc: Julian Wiedmann <jwi@linux.ibm.com>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: linux-s390@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

50348fac 09-Aug-2021 Heiko Carstens <hca@linux.ibm.com>

net/iucv: get rid of register asm usage

Using register asm statements has been proven to be very error prone,
especially when using code instrumentation where gcc may add function
calls, which clobbers register contents in an unexpected way.

Therefore get rid of register asm statements in iucv code, even though
there is currently nothing wrong with it. This way we know for sure
that the above mentioned bug class won't be introduced here.

Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ff8424be 09-Aug-2021 Julian Wiedmann <jwi@linux.ibm.com>

net/af_iucv: remove wrappers around iucv (de-)registration

These wrappers are just unnecessary obfuscation.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4eb9eda6 09-Aug-2021 Julian Wiedmann <jwi@linux.ibm.com>

net/af_iucv: clean up a try_then_request_module()

Use IS_ENABLED(CONFIG_IUCV) to determine whether the iucv_if symbol
is available, and let depmod deal with the module dependency.

This was introduced back with commit 6fcd61f7bf5d ("af_iucv: use
loadable iucv interface"). And to avoid sprinkling IS_ENABLED() over
all the code, we're keeping the indirection through pr_iucv->...().

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

10d6393d 09-Aug-2021 Julian Wiedmann <jwi@linux.ibm.com>

net/af_iucv: support drop monitoring

Change the good paths to use consume_skb() instead of kfree_skb(). This
avoids flooding dropwatch with false-positives.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0fa32ca4 05-Aug-2021 Yunsheng Lin <linyunsheng@huawei.com>

page_pool: mask the page->signature before the checking

As mentioned in commit c07aea3ef4d4 ("mm: add a signature in
struct page"):
"The page->signature field is aliased to page->lru.next and
page->compound_head."

And as the comment in page_is_pfmemalloc():
"lru.next has bit 1 set if the page is allocated from the
pfmemalloc reserves. Callers may simply overwrite it if they
do not need to preserve that information."

The page->signature is OR’ed with PP_SIGNATURE when a page is
allocated in page pool, see __page_pool_alloc_pages_slow(),
and page->signature is checked directly with PP_SIGNATURE in
page_pool_return_skb_page(), which might cause resoure leaking
problem for a page from page pool if bit 1 of lru.next is set
for a pfmemalloc page. What happens here is that the original
pp->signature is OR'ed with PP_SIGNATURE after the allocation
in order to preserve any existing bits(such as the bit 1, used
to indicate a pfmemalloc page), so when those bits are present,
those page is not considered to be from page pool and the DMA
mapping of those pages will be left stale.

As bit 0 is for page->compound_head, So mask both bit 0/1 before
the checking in page_pool_return_skb_page(). And we will return
those pfmemalloc pages back to the page allocator after cleaning
up the DMA mapping.

Fixes: 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB recycling")
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

86aab09a 08-Aug-2021 Randy Dunlap <rdunlap@infradead.org>

dccp: add do-while-0 stubs for dccp_pr_debug macros

GCC complains about empty macros in an 'if' statement, so convert
them to 'do {} while (0)' macros.

Fixes these build warnings:

net/dccp/output.c: In function 'dccp_xmit_packet':
../net/dccp/output.c:283:71: warning: suggest braces around empty body in an 'if' statement [-Wempty-body]
283 | dccp_pr_debug("transmit_skb() returned err=%d\n", err);
net/dccp/ackvec.c: In function 'dccp_ackvec_update_old':
../net/dccp/ackvec.c:163:80: warning: suggest braces around empty body in an 'else' statement [-Wempty-body]
163 | (unsigned long long)seqno, state);

Fixes: dc841e30eaea ("dccp: Extend CCID packet dequeueing interface")
Fixes: 380240864451 ("dccp ccid-2: Update code for the Ack Vector input/registration routine")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: dccp@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

bee7c577 08-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: avoid fast ageing twice when port leaves a bridge

Drivers that support both the toggling of address learning and dynamic
FDB flushing (mv88e6xxx, b53, sja1105) currently need to fast-age a port
twice when it leaves a bridge:

- once, when del_nbp() calls br_stp_disable_port() which puts the port
in the BLOCKING state
- twice, when dsa_port_switchdev_unsync_attrs() calls
dsa_port_clear_brport_flags() which disables address learning

The knee-jerk reaction might be to say "dsa_port_clear_brport_flags does
not need to fast-age the port at all", but the thing is, we still need
both code paths to flush the dynamic FDB entries in different situations.
When a DSA switch port leaves a bonding/team interface that is (still) a
bridge port, no del_nbp() will be called, so we rely on
dsa_port_clear_brport_flags() function to restore proper standalone port
functionality with address learning disabled.

So the solution is just to avoid double the work when both code paths
are called in series. Luckily, DSA already caches the STP port state, so
we can skip flushing the dynamic FDB when we disable address learning
and the STP state is one where no address learning takes place at all.
Under that condition, not flushing the FDB is safe because there is
supposed to not be any dynamic FDB entry at all (they were flushed
during the transition towards that state, and none were learned in the
meanwhile).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a4ffe09f 08-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: still fast-age ports joining a bridge if they can't configure learning

Commit 39f32101543b ("net: dsa: don't fast age standalone ports")
assumed that all standalone ports disable address learning, but if the
switch driver implements .port_fast_age but not .port_bridge_flags (like
ksz9477, ksz8795, lantiq_gswip, lan9303), then that might not actually
be true.

So whereas before, the bridge temporarily walking us through the
BLOCKING STP state meant that the standalone ports had a checkpoint to
flush their baggage and start fresh when they join a bridge, after that
commit they no longer do.

Restore the old behavior for these drivers by checking if the switch can
toggle address learning. If it can't, disregard the "do_fast_age"
argument and unconditionally perform fast ageing on STP state changes.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fdacd57c 03-Aug-2021 Florian Westphal <fw@strlen.de>

netfilter: x_tables: never register tables by default

For historical reasons x_tables still register tables by default in the
initial namespace.
Only newly created net namespaces add the hook on demand.

This means that the init_net always pays hook cost, even if no filtering
rules are added (e.g. only used inside a single netns).

Note that the hooks are added even when 'iptables -L' is called.
This is because there is no way to tell 'iptables -A' and 'iptables -L'
apart at kernel level.

The only solution would be to register the table, but delay hook
registration until the first rule gets added (or policy gets changed).

That however means that counters are not hooked either, so 'iptables -L'
would always show 0-counters even when traffic is flowing which might be
unexpected.

This keeps table and hook registration consistent with what is already done
in non-init netns: first iptables(-save) invocation registers both table
and hooks.

This applies the same solution adopted for ebtables.
All tables register a template that contains the l3 family, the name
and a constructor function that is called when the initial table has to
be added.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

15e58028 09-Aug-2021 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Merge 5.14-rc5 into tty-next

We need the tty/serial fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


9264e4ad 08-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: flush the dynamic FDB of the software bridge when fast ageing a port

Currently, when DSA performs fast ageing on a port, 'bridge fdb' shows
us that the 'self' entries (corresponding to the hardware bridge, as
printed by dsa_slave_fdb_dump) are deleted, but the 'master' entries
(corresponding to the software bridge) aren't.

Indeed, searching through the bridge driver, neither the
brport_attr_learning handler nor the IFLA_BRPORT_LEARNING handler call
br_fdb_delete_by_port. However, br_stp_disable_port does, which is one
of the paths which DSA uses to trigger a fast ageing process anyway.

There is, however, one other very promising caller of
br_fdb_delete_by_port, and that is the bridge driver's handler of the
SWITCHDEV_FDB_FLUSH_TO_BRIDGE atomic notifier. Currently the s390/qeth
HiperSockets card driver is the only user of this.

I can't say I understand that driver's architecture or interaction with
the bridge, but it appears to not be a switchdev driver in the traditional
sense of the word. Nonetheless, the mechanism it provides is a useful
way for DSA to express the fact that it performs fast ageing too, in a
way that does not change the existing behavior for other drivers.

Cc: Alexandra Winter <wintera@linux.ibm.com>
Cc: Julian Wiedmann <jwi@linux.ibm.com>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4eab90d9 08-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: don't fast age bridge ports with learning turned off

On topology changes, stations that were dynamically learned on ports
that are no longer part of the active topology must be flushed - this is
described by clause "17.11 Updating learned station location information"
of IEEE 802.1D-2004.

However, when address learning on the bridge port is turned off in the
first place, there is nothing to flush, so skip a potentially expensive
operation.

We can finally do this now since DSA is aware of the learning state of
its bridged ports.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

045c45d1 08-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: centralize fast ageing when address learning is turned off

Currently DSA leaves it down to device drivers to fast age the FDB on a
port when address learning is disabled on it. There are 2 reasons for
doing that in the first place:

- when address learning is disabled by user space, through
IFLA_BRPORT_LEARNING or the brport_attr_learning sysfs, what user
space typically wants to achieve is to operate in a mode with no
dynamic FDB entry on that port. But if the port is already up, some
addresses might have been already learned on it, and it seems silly to
wait for 5 minutes for them to expire until something useful can be
done.

- when a port leaves a bridge and becomes standalone, DSA turns off
address learning on it. This also has the nice side effect of flushing
the dynamically learned bridge FDB entries on it, which is a good idea
because standalone ports should not have bridge FDB entries on them.

We let drivers manage fast ageing under this condition because if DSA
were to do it, it would need to track each port's learning state, and
act upon the transition, which it currently doesn't.

But there are 2 reasons why doing it is better after all:

- drivers might get it wrong and not do it (see b53_port_set_learning)

- we would like to flush the dynamic entries from the software bridge
too, and letting drivers do that would be another pain point

So track the port learning state and trigger a fast age process
automatically within DSA.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

79a0bffb 08-Aug-2021 Sven Eckelmann <sven@narfation.org>

batman-adv: Drop NULL check before dropping references

The check if a batman-adv related object is NULL or not is now directly in
the batadv_*_put functions. It is not needed anymore to perform this check
outside these function:

The changes were generated using a coccinelle semantic patch:

@@
expression E;
@@
- if (likely(E != NULL))
(
batadv_backbone_gw_put
|
batadv_claim_put
|
batadv_dat_entry_put
|
batadv_gw_node_put
|
batadv_hardif_neigh_put
|
batadv_hardif_put
|
batadv_nc_node_put
|
batadv_nc_path_put
|
batadv_neigh_ifinfo_put
|
batadv_neigh_node_put
|
batadv_orig_ifinfo_put
|
batadv_orig_node_put
|
batadv_orig_node_vlan_put
|
batadv_softif_vlan_put
|
batadv_tp_vars_put
|
batadv_tt_global_entry_put
|
batadv_tt_local_entry_put
|
batadv_tt_orig_list_entry_put
|
batadv_tt_req_node_put
|
batadv_tvlv_container_put
|
batadv_tvlv_handler_put
)(E);

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>

6340dcbd 08-Aug-2021 Sven Eckelmann <sven@narfation.org>

batman-adv: Check ptr for NULL before reducing its refcnt

The commit b37a46683739 ("netdevice: add the case if dev is NULL") changed
the way how the NULL check for net_devices have to be handled when trying
to reduce its reference counter. Before this commit, it was the
responsibility of the caller to check whether the object is NULL or not.
But it was changed to behave more like kfree. Now the callee has to handle
the NULL-case.

The batman-adv code was scanned via cocinelle for similar places. These
were changed to use the paradigm

@@
identifier E, T, R, C;
identifier put;
@@
void put(struct T *E)
{
+ if (!E)
+ return;
kref_put(&E->C, R);
}

Functions which were used in other sources files were moved to the header
to allow the compiler to inline the NULL check and the kref_put call.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>

70eeb75d 23-Jul-2021 Sven Eckelmann <sven@narfation.org>

batman-adv: Switch to kstrtox.h for kstrtou64

The commit 4c52729377ea ("kernel.h: split out kstrtox() and simple_strtox()
to a separate header") moved the kstrtou64 function to a new header called
linux/kstrtox.h.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>

53972e43 13-Jun-2021 Simon Wunderlich <sw@simonwunderlich.de>

batman-adv: Start new development cycle

This version will contain all the (major or even only minor) changes for
Linux 5.15.

The version number isn't a semantic version number with major and minor
information. It is just encoding the year of the expected publishing as
Linux -rc1 and the number of published versions this year (starting at 0).

Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>

82564f6c 08-Aug-2021 Leon Romanovsky <leon@kernel.org>

devlink: Simplify devlink port API calls

Devlink port already has pointer to the devlink instance and all API
calls that forward these devlink ports to the drivers perform same
"devlink_port->devlink" assignment before actual call.

This patch removes useless parameter and allows us in the future
to create specific devlink_port_ops to manage user space access with
reliable ops assignment.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

39f32101 08-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: don't fast age standalone ports

DSA drives the procedure to flush dynamic FDB entries from a port based
on the change of STP state: whenever we go from a state where address
learning is enabled (LEARNING, FORWARDING) to a state where it isn't
(LISTENING, BLOCKING, DISABLED), we need to flush the existing dynamic
entries.

However, there are cases when this is not needed. Internally, when a
DSA switch interface is not under a bridge, DSA still keeps it in the
"FORWARDING" STP state. And when that interface joins a bridge, the
bridge will meticulously iterate that port through all STP states,
starting with BLOCKING and ending with FORWARDING. Because there is a
state transition from the standalone version of FORWARDING into the
temporary BLOCKING bridge port state, DSA calls the fast age procedure.

Since commit 5e38c15856e9 ("net: dsa: configure better brport flags when
ports leave the bridge"), DSA asks standalone ports to disable address
learning. Therefore, there can be no dynamic FDB entries on a standalone
port. Therefore, it does not make sense to flush dynamic FDB entries on
one.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cc4e5eec 06-Aug-2021 Jakub Kicinski <kuba@kernel.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Restrict range element expansion in ipset to avoid soft lockup,
from Jozsef Kadlecsik.

2) Memleak in error path for nf_conntrack_bridge for IPv4 packets,
from Yajun Deng.

3) Simplify conntrack garbage collection strategy to avoid frequent
wake-ups, from Florian Westphal.

4) Fix NFNLA_HOOK_FUNCTION_NAME string, do not include module name.

5) Missing chain family netlink attribute in chain description
in nfnetlink_hook.

6) Incorrect sequence number on nfnetlink_hook dumps.

7) Use netlink request family in reply message for consistency.

8) Remove offload_pickup sysctl, use conntrack for established state
instead, from Florian Westphal.

9) Translate NFPROTO_INET/ingress to NFPROTO_NETDEV/ingress, since
NFPROTO_INET is not exposed through nfnetlink_hook.

* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: nfnetlink_hook: translate inet ingress to netdev
netfilter: conntrack: remove offload_pickup sysctl again
netfilter: nfnetlink_hook: Use same family as request message
netfilter: nfnetlink_hook: use the sequence number of the request message
netfilter: nfnetlink_hook: missing chain family
netfilter: nfnetlink_hook: strip off module name from hookfn
netfilter: conntrack: collect all entries in one cycle
netfilter: nf_conntrack_bridge: Fix memory leak when error
netfilter: ipset: Limit the maximal range of consecutive elements to add/delete
====================

Link: https://lore.kernel.org/r/20210806151149.6356-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


269fc695 05-Aug-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nfnetlink_hook: translate inet ingress to netdev

The NFPROTO_INET pseudofamily is not exposed through this new netlink
interface. The netlink dump either shows NFPROTO_IPV4 or NFPROTO_IPV6
for NFPROTO_INET prerouting/input/forward/output/postrouting hooks.
The NFNLA_CHAIN_FAMILY attribute provides the family chain, which
specifies if this hook applies to inet traffic only (either IPv4 or
IPv6).

Translate the inet/ingress hook to netdev/ingress to fully hide the
NFPROTO_INET implementation details.

Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

4592ee7f 04-Aug-2021 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove offload_pickup sysctl again

These two sysctls were added because the hardcoded defaults (2 minutes,
tcp, 30 seconds, udp) turned out to be too low for some setups.

They appeared in 5.14-rc1 so it should be fine to remove it again.

Marcelo convinced me that there should be no difference between a flow
that was offloaded vs. a flow that was not wrt. timeout handling.
Thus the default is changed to those for TCP established and UDP stream,
5 days and 120 seconds, respectively.

Marcelo also suggested to account for the timeout value used for the
offloading, this avoids increase beyond the value in the conntrack-sysctl
and will also instantly expire the conntrack entry with altered sysctls.

Example:
nf_conntrack_udp_timeout_stream=60
nf_flowtable_udp_timeout=60

This will remove offloaded udp flows after one minute, rather than two.

An earlier version of this patch also cleared the ASSURED bit to
allow nf_conntrack to evict the entry via early_drop (i.e., table full).
However, it looks like we can safely assume that connection timed out
via HW is still in established state, so this isn't needed.

Quoting Oz:
[..] the hardware sends all packets with a set FIN flags to sw.
[..] Connections that are aged in hardware are expected to be in the
established state.

In case it turns out that back-to-sw-path transition can occur for
'dodgy' connections too (e.g., one side disappeared while software-path
would have been in RETRANS timeout), we can adjust this later.

Cc: Oz Shlomo <ozsh@nvidia.com>
Cc: Paul Blakey <paulb@nvidia.com>
Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

69311e7c 03-Aug-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nfnetlink_hook: Use same family as request message

Use the same family as the request message, for consistency. The
netlink payload provides sufficient information to describe the hook
object, including the family.

This makes it easier to userspace to correlate the hooks are that
visited by the packets for a certain family.

Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

3d9bbaf6 03-Aug-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nfnetlink_hook: use the sequence number of the request message

The sequence number allows to correlate the netlink reply message (as
part of the dump) with the original request message.

The cb->seq field is internally used to detect an interference (update)
of the hook list during the netlink dump, do not use it as sequence
number in the netlink dump header.

Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

a6e57c4a 02-Aug-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nfnetlink_hook: missing chain family

The family is relevant for pseudo-families like NFPROTO_INET
otherwise the user needs to rely on the hook function name to
differentiate it from NFPROTO_IPV4 and NFPROTO_IPV6 names.

Add nfnl_hook_chain_desc_attributes instead of using the existing
NFTA_CHAIN_* attributes, since these do not provide a family number.

Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

61e0c2bc 02-Aug-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nfnetlink_hook: strip off module name from hookfn

NFNLA_HOOK_FUNCTION_NAME should include the hook function name only,
the module name is already provided by NFNLA_HOOK_MODULE_NAME.

Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

4608fdfc 26-Jul-2021 Florian Westphal <fw@strlen.de>

netfilter: conntrack: collect all entries in one cycle

Michal Kubecek reports that conntrack gc is responsible for frequent
wakeups (every 125ms) on idle systems.

On busy systems, timed out entries are evicted during lookup.
The gc worker is only needed to remove entries after system becomes idle
after a busy period.

To resolve this, always scan the entire table.
If the scan is taking too long, reschedule so other work_structs can run
and resume from next bucket.

After a completed scan, wait for 2 minutes before the next cycle.
Heuristics for faster re-schedule are removed.

GC_SCAN_INTERVAL could be exposed as a sysctl in the future to allow
tuning this as-needed or even turn the gc worker off.

Reported-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

c73c5708 05-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: don't disable multicast flooding to the CPU even without an IGMP querier

Commit 08cc83cc7fd8 ("net: dsa: add support for BRIDGE_MROUTER
attribute") added an option for users to turn off multicast flooding
towards the CPU if they turn off the IGMP querier on a bridge which
already has enslaved ports (echo 0 > /sys/class/net/br0/bridge/multicast_router).

And commit a8b659e7ff75 ("net: dsa: act as passthrough for bridge port flags")
simply papered over that issue, because it moved the decision to flood
the CPU with multicast (or not) from the DSA core down to individual drivers,
instead of taking a more radical position then.

The truth is that disabling multicast flooding to the CPU is simply
something we are not prepared to do now, if at all. Some reasons:

- ICMP6 neighbor solicitation messages are unregistered multicast
packets as far as the bridge is concerned. So if we stop flooding
multicast, the outside world cannot ping the bridge device's IPv6
link-local address.

- There might be foreign interfaces bridged with our DSA switch ports
(sending a packet towards the host does not necessarily equal
termination, but maybe software forwarding). So if there is no one
interested in that multicast traffic in the local network stack, that
doesn't mean nobody is.

- PTP over L4 (IPv4, IPv6) is multicast, but is unregistered as far as
the bridge is concerned. This should reach the CPU port.

- The switch driver might not do FDB partitioning. And since we don't
even bother to do more fine-grained flood disabling (such as "disable
flooding _from_port_N_ towards the CPU port" as opposed to "disable
flooding _from_any_port_ towards the CPU port"), this breaks standalone
ports, or even multiple bridges where one has an IGMP querier and one
doesn't.

Reverting the logic makes all of the above work.

Fixes: a8b659e7ff75 ("net: dsa: act as passthrough for bridge port flags")
Fixes: 08cc83cc7fd8 ("net: dsa: add support for BRIDGE_MROUTER attribute")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7df4e7449 05-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: stop syncing the bridge mcast_router attribute at join time

Qingfang points out that when a bridge with the default settings is
created and a port joins it:

ip link add br0 type bridge
ip link set swp0 master br0

DSA calls br_multicast_router() on the bridge to see if the br0 device
is a multicast router port, and if it is, it enables multicast flooding
to the CPU port, otherwise it disables it.

If we look through the multicast_router_show() sysfs or at the
IFLA_BR_MCAST_ROUTER netlink attribute, we see that the default mrouter
attribute for the bridge device is "1" (MDB_RTR_TYPE_TEMP_QUERY).

However, br_multicast_router() will return "0" (MDB_RTR_TYPE_DISABLED),
because an mrouter port in the MDB_RTR_TYPE_TEMP_QUERY state may not be
actually _active_ until it receives an actual IGMP query. So, the
br_multicast_router() function should really have been called
br_multicast_router_active() perhaps.

When/if an IGMP query is received, the bridge device will transition via
br_multicast_mark_router() into the active state until the
ip4_mc_router_timer expires after an multicast_querier_interval.

Of course, this does not happen if the bridge is created with an
mcast_router attribute of "2" (MDB_RTR_TYPE_PERM).

The point is that in lack of any IGMP query messages, and in the default
bridge configuration, unregistered multicast packets will not be able to
reach the CPU port through flooding, and this breaks many use cases
(most obviously, IPv6 ND, with its ICMP6 neighbor solicitation multicast
messages).

Leave the multicast flooding setting towards the CPU port down to a driver
level decision.

Fixes: 010e269f91be ("net: dsa: sync up switchdev objects and port attributes when joining the bridge")
Reported-by: DENG Qingfang <dqfext@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

596690e9 05-Aug-2021 Heiner Kallweit <hkallweit1@gmail.com>

ethtool: return error from ethnl_ops_begin if dev is NULL

Julian reported that after d43c65b05b84 Coverity complains about a
missing check whether dev is NULL in ethnl_ops_complete().
There doesn't seem to be any valid case where dev could be NULL when
calling ethnl_ops_begin(), therefore return an error if dev is NULL.

Fixes: d43c65b05b84 ("ethtool: runtime-resume netdev parent in ethnl_ops_begin")
Reported-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0ca8d3ca 05-Aug-2021 Jakub Kicinski <kuba@kernel.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Build failure in drivers/net/wwan/mhi_wwan_mbim.c:
add missing parameter (0, assuming we don't want buffer pre-alloc).

Conflict in drivers/net/dsa/sja1105/sja1105_main.c between:
589918df9322 ("net: dsa: sja1105: be stateless with FDB entries on SJA1105P/Q/R/S/SJA1110 too")
0fac6aa098ed ("net: dsa: sja1105: delete the best_effort_vlan_filtering mode")

Follow the instructions from the commit message of the former commit
- removed the if conditions. When looking at commit 589918df9322 ("net:
dsa: sja1105: be stateless with FDB entries on SJA1105P/Q/R/S/SJA1110 too")
note that the mask_iotag fields get removed by the following patch.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>


902e7f37 05-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'net-5.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Including fixes from ipsec.

Current release - regressions:

- sched: taprio: fix init procedure to avoid inf loop when dumping

- sctp: move the active_key update after sh_keys is added

Current release - new code bugs:

- sparx5: fix build with old GCC & bitmask on 32-bit targets

Previous releases - regressions:

- xfrm: redo the PREEMPT_RT RCU vs hash_resize_mutex deadlock fix

- xfrm: fixes for the compat netlink attribute translator

- phy: micrel: Fix detection of ksz87xx switch

Previous releases - always broken:

- gro: set inner transport header offset in tcp/udp GRO hook to avoid
crashes when such packets reach GSO

- vsock: handle VIRTIO_VSOCK_OP_CREDIT_REQUEST, as required by spec

- dsa: sja1105: fix static FDB entries on SJA1105P/Q/R/S and SJA1110

- bridge: validate the NUD_PERMANENT bit when adding an extern_learn
FDB entry

- usb: lan78xx: don't modify phy_device state concurrently

- usb: pegasus: check for errors of IO routines"

* tag 'net-5.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (48 commits)
net: vxge: fix use-after-free in vxge_device_unregister
net: fec: fix use-after-free in fec_drv_remove
net: pegasus: fix uninit-value in get_interrupt_interval
net: ethernet: ti: am65-cpsw: fix crash in am65_cpsw_port_offload_fwd_mark_update()
bnx2x: fix an error code in bnx2x_nic_load()
net: wwan: iosm: fix recursive lock acquire in unregister
net: wwan: iosm: correct data protocol mask bit
net: wwan: iosm: endianness type correction
net: wwan: iosm: fix lkp buildbot warning
net: usb: lan78xx: don't modify phy_device state concurrently
docs: networking: netdevsim rules
net: usb: pegasus: Remove the changelog and DRIVER_VERSION.
net: usb: pegasus: Check the return value of get_geristers() and friends;
net/prestera: Fix devlink groups leakage in error flow
net: sched: fix lockdep_set_class() typo error for sch->seqlock
net: dsa: qca: ar9331: reorder MDIO write sequence
VSOCK: handle VIRTIO_VSOCK_OP_CREDIT_REQUEST
mptcp: drop unused rcu member in mptcp_pm_addr_entry
net: ipv6: fix returned variable type in ip6_skb_dst_mtu
nfp: update ethtool reporting of pauseframe control
...


e0448092 04-Aug-2021 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>

Bluetooth: defer cleanup of resources in hci_unregister_dev()

syzbot is hitting might_sleep() warning at hci_sock_dev_event() due to
calling lock_sock() with rw spinlock held [1].

It seems that history of this locking problem is a trial and error.

Commit b40df5743ee8 ("[PATCH] bluetooth: fix socket locking in
hci_sock_dev_event()") in 2.6.21-rc4 changed bh_lock_sock() to
lock_sock() as an attempt to fix lockdep warning.

Then, commit 4ce61d1c7a8e ("[BLUETOOTH]: Fix locking in
hci_sock_dev_event().") in 2.6.22-rc2 changed lock_sock() to
local_bh_disable() + bh_lock_sock_nested() as an attempt to fix the
sleep in atomic context warning.

Then, commit 4b5dd696f81b ("Bluetooth: Remove local_bh_disable() from
hci_sock.c") in 3.3-rc1 removed local_bh_disable().

Then, commit e305509e678b ("Bluetooth: use correct lock to prevent UAF
of hdev object") in 5.13-rc5 again changed bh_lock_sock_nested() to
lock_sock() as an attempt to fix CVE-2021-3573.

This difficulty comes from current implementation that
hci_sock_dev_event(HCI_DEV_UNREG) is responsible for dropping all
references from sockets because hci_unregister_dev() immediately
reclaims resources as soon as returning from
hci_sock_dev_event(HCI_DEV_UNREG).

But the history suggests that hci_sock_dev_event(HCI_DEV_UNREG) was not
doing what it should do.

Therefore, instead of trying to detach sockets from device, let's accept
not detaching sockets from device at hci_sock_dev_event(HCI_DEV_UNREG),
by moving actual cleanup of resources from hci_unregister_dev() to
hci_cleanup_dev() which is called by bt_host_release() when all
references to this unregistered device (which is a kobject) are gone.

Since hci_sock_dev_event(HCI_DEV_UNREG) no longer resets
hci_pi(sk)->hdev, we need to check whether this device was unregistered
and return an error based on HCI_UNREGISTER flag. There might be subtle
behavioral difference in "monitor the hdev" functionality; please report
if you found something went wrong due to this patch.

Link: https://syzkaller.appspot.com/bug?extid=a5df189917e79d5e59c9 [1]
Reported-by: syzbot <syzbot+a5df189917e79d5e59c9@syzkaller.appspotmail.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: e305509e678b ("Bluetooth: use correct lock to prevent UAF of hdev object")
Acked-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6ec56613 04-Aug-2021 Tedd Ho-Jeong An <tedd.an@intel.com>

Bluetooth: Add support hdev to allocate private data

This patch adds support hdev to allocate extra size for private data.
The size of private data is specified in the hdev_alloc_size(priv_size)
and the allocated buffer can be accessed with hci_get_priv(hdev).

Signed-off-by: Tedd Ho-Jeong An <tedd.an@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

713baf3d 04-Aug-2021 Colin Ian King <colin.king@canonical.com>

Bluetooth: increase BTNAMSIZ to 21 chars to fix potential buffer overflow

An earlier commit replaced using batostr to using %pMR sprintf for the
construction of session->name. Static analysis detected that this new
method can use a total of 21 characters (including the trailing '\0')
so we need to increase the BTNAMSIZ from 18 to 21 to fix potential
buffer overflows.

Addresses-Coverity: ("Out-of-bounds write")
Fixes: fcb73338ed53 ("Bluetooth: Use %pMR in sprintf/seq_printf instead of batostr")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

f8b17a0b 05-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: tag_sja1105: optionally build as module when switch driver is module if PTP is enabled

TX timestamps are sent by SJA1110 as Ethernet packets containing
metadata, so they are received by the tagging driver but must be
processed by the switch driver - the one that is stateful since it
keeps the TX timestamp queue.

This means that there is an sja1110_process_meta_tstamp() symbol
exported by the switch driver which is called by the tagging driver.

There is a shim definition for that function when the switch driver is
not compiled, which does nothing, but that shim is not effective when
the tagging protocol driver is built-in and the switch driver is a
module, because built-in code cannot call symbols exported by modules.

So add an optional dependency between the tagger and the switch driver,
if PTP support is enabled in the switch driver. If PTP is not enabled,
sja1110_process_meta_tstamp() will translate into the shim "do nothing
with these meta frames" function.

Fixes: 566b18c8b752 ("net: dsa: sja1105: implement TX timestamping for SJA1110")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1160dfa1 05-Aug-2021 Yajun Deng <yajun.deng@linux.dev>

net: Remove redundant if statements

The 'if (dev)' statement already move into dev_{put , hold}, so remove
redundant if statements.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>

9344988d 30-Jul-2021 Florian Westphal <fw@strlen.de>

netfilter: ctnetlink: allow to filter dump by status bits

If CTA_STATUS is present, but CTA_STATUS_MASK is not, then the
mask is automatically set to 'status', so that kernel returns those
entries that have all of the requested bits set.

This makes more sense than using a all-one mask since we'd hardly
ever find a match.

There are no other checks for status bits, so if e.g. userspace
sets impossible combinations it will get an empty dump.

If kernel would reject unknown status bits, then a program that works on
a future kernel that has IPS_FOO bit fails on old kernels.

Same for 'impossible' combinations:

Kernel never sets ASSURED without first having set SEEN_REPLY, but its
possible that a future kernel could do so.

Therefore no sanity tests other than a 0-mask.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

ff1199db 30-Jul-2021 Florian Westphal <fw@strlen.de>

netfilter: ctnetlink: add and use a helper for mark parsing

ctnetlink dumps can be filtered based on the connmark.

Prepare for status bit filtering by using a named structure and by
moving the mark parsing code to a helper.

Else ctnetlink_alloc_filter size grows a bit too big for my taste
when status handling is added.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

e11c0e25 04-Aug-2021 Gustavo A. R. Silva <gustavoars@kernel.org>

net/ipv6/mcast: Use struct_size() helper

Replace IP6_SFLSIZE() with struct_size() helper in order to avoid any
potential type mistakes or integer overflows that, in the worst
scenario, could lead to heap overflows.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

e6a1f7e0 04-Aug-2021 Gustavo A. R. Silva <gustavoars@kernel.org>

net/ipv4/igmp: Use struct_size() helper

Replace IP_SFLSIZE() with struct_size() helper in order to avoid any
potential type mistakes or integer overflows that, in the worst
scenario, could lead to heap overflows.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

db243b79 04-Aug-2021 Gustavo A. R. Silva <gustavoars@kernel.org>

net/ipv4/ipv6: Replace one-element arraya with flexible-array members

There is a regular need in the kernel to provide a way to declare having
a dynamically sized set of trailing elements in a structure. Kernel code
should always use “flexible array members”[1] for these cases. The older
style of one-element or zero-length arrays should no longer be used[2].

Use an anonymous union with a couple of anonymous structs in order to
keep userspace unchanged and refactor the related code accordingly:

$ pahole -C group_filter net/ipv4/ip_sockglue.o
struct group_filter {
union {
struct {
__u32 gf_interface_aux; /* 0 4 */

/* XXX 4 bytes hole, try to pack */

struct __kernel_sockaddr_storage gf_group_aux; /* 8 128 */
/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
__u32 gf_fmode_aux; /* 136 4 */
__u32 gf_numsrc_aux; /* 140 4 */
struct __kernel_sockaddr_storage gf_slist[1]; /* 144 128 */
}; /* 0 272 */
struct {
__u32 gf_interface; /* 0 4 */

/* XXX 4 bytes hole, try to pack */

struct __kernel_sockaddr_storage gf_group; /* 8 128 */
/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
__u32 gf_fmode; /* 136 4 */
__u32 gf_numsrc; /* 140 4 */
struct __kernel_sockaddr_storage gf_slist_flex[0]; /* 144 0 */
}; /* 0 144 */
}; /* 0 272 */

/* size: 272, cachelines: 5, members: 1 */
/* last cacheline: 16 bytes */
};

$ pahole -C compat_group_filter net/ipv4/ip_sockglue.o
struct compat_group_filter {
union {
struct {
__u32 gf_interface_aux; /* 0 4 */
struct __kernel_sockaddr_storage gf_group_aux __attribute__((__aligned__(4))); /* 4 128 */
/* --- cacheline 2 boundary (128 bytes) was 4 bytes ago --- */
__u32 gf_fmode_aux; /* 132 4 */
__u32 gf_numsrc_aux; /* 136 4 */
struct __kernel_sockaddr_storage gf_slist[1] __attribute__((__aligned__(4))); /* 140 128 */
} __attribute__((__packed__)) __attribute__((__aligned__(4))); /* 0 268 */
struct {
__u32 gf_interface; /* 0 4 */
struct __kernel_sockaddr_storage gf_group __attribute__((__aligned__(4))); /* 4 128 */
/* --- cacheline 2 boundary (128 bytes) was 4 bytes ago --- */
__u32 gf_fmode; /* 132 4 */
__u32 gf_numsrc; /* 136 4 */
struct __kernel_sockaddr_storage gf_slist_flex[0] __attribute__((__aligned__(4))); /* 140 0 */
} __attribute__((__packed__)) __attribute__((__aligned__(4))); /* 0 140 */
} __attribute__((__aligned__(1))); /* 0 268 */

/* size: 268, cachelines: 5, members: 1 */
/* forced alignments: 1 */
/* last cacheline: 12 bytes */
} __attribute__((__packed__));

This helps with the ongoing efforts to globally enable -Warray-bounds
and get us closer to being able to tighten the FORTIFY_SOURCE routines
on memcpy().

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.10/process/deprecated.html#zero-length-and-one-element-arrays

Link: https://github.com/KSPP/linux/issues/79
Link: https://github.com/KSPP/linux/issues/109
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

9384eacd 05-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: core: don't call SIOCBRADD/DELIF for non-bridge devices

Commit ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl")
changed SIOCBRADD/DELIF to use bridge's ioctl hook (br_ioctl_hook)
without checking if the target netdevice is actually a bridge which can
cause crashes and generally interpreting other devices' private pointers
as net_bridge pointers.

Crash example (lo - loopback):
$ brctl addif lo ens16
BUG: kernel NULL pointer dereference, address: 000000000000059898
#PF: supervisor read access in kernel modede
#PF: error_code(0x0000) - not-present pagege
PGD 0 P4D 0 ^Ac
Oops: 0000 [#1] SMP NOPTI
CPU: 2 PID: 1376 Comm: brctl Kdump: loaded Tainted: G W 5.14.0-rc3+ #405
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/01/2014
RIP: 0010:add_del_if+0x1f/0x7c [bridge]
Code: 80 bf 1b a0 41 5c e9 c0 3c 03 e1 0f 1f 44 00 00 41 55 41 54 41 89 f4 be 0c 00 00 00 55 48 89 fd 53 48 8b 87 88 00 00 00 89 d3 <4c> 8b a8 98 05 00 00 49 8b bd d0 00 00 00 e8 17 d7 f3 e0 84 c0 74
RSP: 0018:ffff888109d97cb0 EFLAGS: 00010202^Ac
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 000000000000000c RDI: ffff888101239bc0
RBP: ffff888101239bc0 R08: 0000000000000001 R09: 0000000000000000
R10: ffff888109d97cd8 R11: 00000000000000a3 R12: 0000000000000012
R13: 0000000000000000 R14: ffff888101239bc0 R15: ffff888109d97e10
FS: 00007fc1e365b540(0000) GS:ffff88822be80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000598 CR3: 0000000106506000 CR4: 00000000000006e0
Call Trace:
br_ioctl_stub+0x7c/0x441 [bridge]
br_ioctl_call+0x6d/0x8a
dev_ifsioc+0x325/0x4e8
dev_ioctl+0x46b/0x4e1
sock_do_ioctl+0x7b/0xad
sock_ioctl+0x2de/0x2f2
vfs_ioctl+0x1e/0x2b
__do_sys_ioctl+0x63/0x86
do_syscall_64+0xcb/0xf2
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fc1e3589427
Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffc8d501d38 EFLAGS: 00000202 ORIG_RAX: 000000000000001010
RAX: ffffffffffffffda RBX: 0000000000000012 RCX: 00007fc1e3589427
RDX: 00007ffc8d501d60 RSI: 00000000000089a3 RDI: 0000000000000003
RBP: 00007ffc8d501d60 R08: 0000000000000000 R09: fefefeff77686d74
R10: fffffffffffff8f9 R11: 0000000000000202 R12: 00007ffc8d502e06
R13: 00007ffc8d502e06 R14: 0000000000000000 R15: 0000000000000000
Modules linked in: bridge stp llc bonding ipv6 virtio_net [last unloaded: llc]^Ac
CR2: 0000000000000598

Reported-by: syzbot+79f4a8692e267bdb7227@syzkaller.appspotmail.com
Fixes: ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cbd7ad29 05-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: fix ioctl old_deviceless bridge argument

Commit ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl")
changed the source of the argument copy in bridge's old_deviceless() from
args[1] (user ptr to device name) to uarg (ptr to ioctl arguments) causing
wrong device name to be used.

Example (broken, bridge exists but is up):
$ brctl delbr bridge
bridge bridge doesn't exist; can't delete it

Example (working):
$ brctl delbr bridge
bridge bridge is still up; can't delete it

Fixes: ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

893b1958 05-Aug-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: fix ioctl locking

Before commit ad2f99aedf8f ("net: bridge: move bridge ioctls out of
.ndo_do_ioctl") the bridge ioctl calls were divided in two parts:
one was deviceless called by sock_ioctl and didn't expect rtnl to be held,
the other was with a device called by dev_ifsioc() and expected rtnl to be
held. After the commit above they were united in a single ioctl stub, but
it didn't take care of the locking expectations.
For sock_ioctl now we acquire (1) br_ioctl_mutex, (2) rtnl
and for dev_ifsioc we acquire (1) rtnl, (2) br_ioctl_mutex

The fix is to get a refcnt on the netdev for dev_ifsioc calls and drop rtnl
then to reacquire it in the bridge ioctl stub after br_ioctl_mutex has
been acquired. That will avoid playing locking games and make the rules
straight-forward: we always take br_ioctl_mutex first, and then rtnl.

Reported-by: syzbot+34fe5894623c4ab1b379@syzkaller.appspotmail.com
Fixes: ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4167a960 04-Aug-2021 Gustavo A. R. Silva <gustavoars@kernel.org>

net/ipv4: Revert use of struct_size() helper

Revert the use of structr_size() and stay with IP_MSFILTER_SIZE() for
now, as in this case, the size of struct ip_msfilter didn't change with
the addition of the flexible array imsf_slist_flex[]. So, if we use
struct_size() we will be allocating and calculating the size of
struct ip_msfilter with one too many items for imsf_slist_flex[].

We might use struct_size() in the future, but for now let's stay
with IP_MSFILTER_SIZE().

Fixes: 2d3e5caf96b9 ("net/ipv4: Replace one-element array with flexible-array member")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

af352460 04-Aug-2021 Paolo Abeni <pabeni@redhat.com>

net: fix GRO skb truesize update

commit 5e10da5385d2 ("skbuff: allow 'slow_gro' for skb carring sock
reference") introduces a serious regression at the GRO layer setting
the wrong truesize for stolen-head skbs.

Restore the correct truesize: SKB_DATA_ALIGN(...) instead of
SKB_TRUESIZE(...)

Reported-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Fixes: 5e10da5385d2 ("skbuff: allow 'slow_gro' for skb carring sock reference")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2c0b0325 04-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: give preference to local CPU ports

Be there an "H" switch topology, where there are 2 switches connected as
follows:

eth0 eth1
| |
CPU port CPU port
| DSA link |
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 -------- sw1p4 sw1p3 sw1p2 sw1p1 sw1p0
| | | | | |
user user user user user user
port port port port port port

basically one where each switch has its own CPU port for termination,
but there is also a DSA link in case packets need to be forwarded in
hardware between one switch and another.

DSA insists to see this as a daisy chain topology, basically registering
all network interfaces as sw0p0@eth0, ... sw1p0@eth0 and disregarding
eth1 as a valid DSA master.

This is only half the story, since when asked using dsa_port_is_cpu(),
DSA will respond that sw1p1 is a CPU port, however one which has no
dp->cpu_dp pointing to it. So sw1p1 is enabled, but not used.

Furthermore, be there a driver for switches which support only one
upstream port. This driver iterates through its ports and checks using
dsa_is_upstream_port() whether the current port is an upstream one.
For switch 1, two ports pass the "is upstream port" checks:

- sw1p4 is an upstream port because it is a routing port towards the
dedicated CPU port assigned using dsa_tree_setup_default_cpu()

- sw1p1 is also an upstream port because it is a CPU port, albeit one
that is disabled. This is because dsa_upstream_port() returns:

if (!cpu_dp)
return port;

which means that if @dp does not have a ->cpu_dp pointer (which is a
characteristic of CPU ports themselves as well as unused ports), then
@dp is its own upstream port.

So the driver for switch 1 rightfully says: I have two upstream ports,
but I don't support multiple upstream ports! So let me error out, I
don't know which one to choose and what to do with the other one.

Generally I am against enforcing any default policy in the kernel in
terms of user to CPU port assignment (like round robin or such) but this
case is different. To solve the conundrum, one would have to:

- Disable sw1p1 in the device tree or mark it as "not a CPU port" in
order to comply with DSA's view of this topology as a daisy chain,
where the termination traffic from switch 1 must pass through switch 0.
This is counter-productive because it wastes 1Gbps of termination
throughput in switch 1.
- Disable the DSA link between sw0p4 and sw1p4 and do software
forwarding between switch 0 and 1, and basically treat the switches as
part of disjoint switch trees. This is counter-productive because it
wastes 1Gbps of autonomous forwarding throughput between switch 0 and 1.
- Treat sw0p4 and sw1p4 as user ports instead of DSA links. This could
work, but it makes cross-chip bridging impossible. In this setup we
would need to have 2 separate bridges, br0 spanning the ports of
switch 0, and br1 spanning the ports of switch 1, and the "DSA links
treated as user ports" sw0p4 (part of br0) and sw1p4 (part of br1) are
the gateway ports between one bridge and another. This is hard to
manage from a user's perspective, who wants to have a unified view of
the switching fabric and the ability to transparently add ports to the
same bridge. VLANs would also need to be explicitly managed by the
user on these gateway ports.

So it seems that the only reasonable thing to do is to make DSA prefer
CPU ports that are local to the switch. Meaning that by default, the
user and DSA ports of switch 0 will get assigned to the CPU port from
switch 0 (sw0p1) and the user and DSA ports of switch 1 will get
assigned to the CPU port from switch 1.

The way this solves the problem is that sw1p4 is no longer an upstream
port as far as switch 1 is concerned (it no longer views sw0p1 as its
dedicated CPU port).

So here we are, the first multi-CPU port that DSA supports is also
perhaps the most uneventful one: the individual switches don't support
multiple CPUs, however the DSA switch tree as a whole does have multiple
CPU ports. No user space assignment of user ports to CPU ports is
desirable, necessary, or possible.

Ports that do not have a local CPU port (say there was an extra switch
hanging off of sw0p0) default to the standard implementation of getting
assigned to the first CPU port of the DSA switch tree. Is that good
enough? Probably not (if the downstream switch was hanging off of switch
1, we would most certainly prefer its CPU port to be sw1p1), but in
order to support that use case too, we would need to traverse the
dst->rtable in search of an optimum dedicated CPU port, one that has the
smallest number of hops between dp->ds and dp->cpu_dp->ds. At the
moment, the DSA routing table structure does not keep the number of hops
between dl->dp and dl->link_dp, and while it is probably deducible,
there is zero justification to write that code now. Let's hope DSA will
never have to support that use case.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0e8eb9a1 04-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: rename teardown_default_cpu to teardown_cpu_ports

There is nothing specific to having a default CPU port to what
dsa_tree_teardown_default_cpu() does. Even with multiple CPU ports,
it would do the same thing: iterate through the ports of this switch
tree and reset the ->cpu_dp pointer to NULL. So rename it accordingly.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

df7ba0eb 04-Aug-2021 Colin Ian King <colin.king@canonical.com>

mctp: remove duplicated assignment of pointer hdr

The pointer hdr is being initialized and also re-assigned with the
same value from the call to function mctp_hdr. Static analysis reports
that the initializated value is unused. The second assignment is
duplicated and can be removed.

Addresses-Coverity: ("Unused value").
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6d4eb36d 04-Aug-2021 Andrii Nakryiko <andrii@kernel.org>

bpf: Fix bpf_prog_test_run_xdp logic after incorrect merge resolution

During recent net into net-next merge ([0]) a piece of old logic ([1]) got
reintroduced accidentally while resolving merge conflict between bpf's [2]
and bpf-next's [3]. This check was removed in bpf-next tree to allow extra
ctx_in parameter passed for XDP test runs. Reinstating the check breaks
bpf_prog_test_run_xdp logic and causes a corresponding xdp_context_test_run
selftest failure. Fix by removing the check and allow ctx_in for XDP test
runs.

[0] 5af84df962dd ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
[1] 947e8b595b82 ("bpf: explicitly prohibit ctx_{in, out} in non-skb BPF_PROG_TEST_RUN")
[2] 5e21bb4e8125 ("bpf, test: fix NULL pointer dereference on invalid expected_attach_type")
[3] 47316f4a3053 ("bpf: Support input xdp_md context in BPF_PROG_TEST_RUN")

Fixes: 5af84df962dd ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>

372bbdd5 03-Aug-2021 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

net: Replace deprecated CPU-hotplug functions.

The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().

Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c2eecaa1 03-Aug-2021 Nick Richardson <richardsonnick@google.com>

pktgen: Remove redundant clone_skb override

When the netif_receive xmit_mode is set, a line is supposed to set
clone_skb to a default 0 value. This line is made redundant due to a
preceding line that checks if clone_skb is more than zero and returns
-ENOTSUPP.

Overriding clone_skb to 0 does not make any difference to the behavior
because if it was positive we return error. So it can be either 0 or
negative, and in both cases the behavior is the same.

Remove redundant line that sets clone_skb to zero.

Signed-off-by: Nick Richardson <richardsonnick@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

04190bf8 04-Aug-2021 Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

sock: allow reading and changing sk_userlocks with setsockopt

SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags disable automatic socket
buffers adjustment done by kernel (see tcp_fixup_rcvbuf() and
tcp_sndbuf_expand()). If we've just created a new socket this adjustment
is enabled on it, but if one changes the socket buffer size by
setsockopt(SO_{SND,RCV}BUF*) it becomes disabled.

CRIU needs to call setsockopt(SO_{SND,RCV}BUF*) on each socket on
restore as it first needs to increase buffer sizes for packet queues
restore and second it needs to restore back original buffer sizes. So
after CRIU restore all sockets become non-auto-adjustable, which can
decrease network performance of restored applications significantly.

CRIU need to be able to restore sockets with enabled/disabled adjustment
to the same state it was before dump, so let's add special setsockopt
for it.

Let's also export SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags to uAPI so
that using these interface one can reenable automatic socket buffer
adjustment on their sockets.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

957e2235 03-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: make switchdev_bridge_port_{,unoffload} loosely coupled with the bridge

With the introduction of explicit offloading API in switchdev in commit
2f5dc00f7a3e ("net: bridge: switchdev: let drivers inform which bridge
ports are offloaded"), we started having Ethernet switch drivers calling
directly into a function exported by net/bridge/br_switchdev.c, which is
a function exported by the bridge driver.

This means that drivers that did not have an explicit dependency on the
bridge before, like cpsw and am65-cpsw, now do - otherwise it is not
possible to call a symbol exported by a driver that can be built as
module unless you are a module too.

There was an attempt to solve the dependency issue in the form of commit
b0e81817629a ("net: build all switchdev drivers as modules when the
bridge is a module"). Grygorii Strashko, however, says about it:

| In my opinion, the problem is a bit bigger here than just fixing the
| build :(
|
| In case, of ^cpsw the switchdev mode is kinda optional and in many
| cases (especially for testing purposes, NFS) the multi-mac mode is
| still preferable mode.
|
| There were no such tight dependency between switchdev drivers and
| bridge core before and switchdev serviced as independent, notification
| based layer between them, so ^cpsw still can be "Y" and bridge can be
| "M". Now for mostly every kernel build configuration the CONFIG_BRIDGE
| will need to be set as "Y", or we will have to update drivers to
| support build with BRIDGE=n and maintain separate builds for
| networking vs non-networking testing. But is this enough? Wouldn't
| it cause 'chain reaction' required to add more and more "Y" options
| (like CONFIG_VLAN_8021Q)?
|
| PS. Just to be sure we on the same page - ARM builds will be forced
| (with this patch) to have CONFIG_TI_CPSW_SWITCHDEV=m and so all our
| automation testing will just fail with omap2plus_defconfig.

In the light of this, it would be desirable for some configurations to
avoid dependencies between switchdev drivers and the bridge, and have
the switchdev mode as completely optional within the driver.

Arnd Bergmann also tried to write a patch which better expressed the
build time dependency for Ethernet switch drivers where the switchdev
support is optional, like cpsw/am65-cpsw, and this made the drivers
follow the bridge (compile as module if the bridge is a module) only if
the optional switchdev support in the driver was enabled in the first
place:
https://patchwork.kernel.org/project/netdevbpf/patch/20210802144813.1152762-1-arnd@kernel.org/

but this still did not solve the fact that cpsw and am65-cpsw now must
be built as modules when the bridge is a module - it just expressed
correctly that optional dependency. But the new behavior is an apparent
regression from Grygorii's perspective.

So to support the use case where the Ethernet driver is built-in,
NET_SWITCHDEV (a bool option) is enabled, and the bridge is a module, we
need a framework that can handle the possible absence of the bridge from
the running system, i.e. runtime bloatware as opposed to build-time
bloatware.

Luckily we already have this framework, since switchdev has been using
it extensively. Events from the bridge side are transmitted to the
driver side using notifier chains - this was originally done so that
unrelated drivers could snoop for events emitted by the bridge towards
ports that are implemented by other drivers (think of a switch driver
with LAG offload that listens for switchdev events on a bonding/team
interface that it offloads).

There are also events which are transmitted from the driver side to the
bridge side, which again are modeled using notifiers.
SWITCHDEV_FDB_ADD_TO_BRIDGE is an example of this, and deals with
notifying the bridge that a MAC address has been dynamically learned.
So there is a precedent we can use for modeling the new framework.

The difference compared to SWITCHDEV_FDB_ADD_TO_BRIDGE is that the work
that the bridge needs to do when a port becomes offloaded is blocking in
its nature: replay VLANs, MDBs etc. The calling context is indeed
blocking (we are under rtnl_mutex), but the existing switchdev
notification chain that the bridge is subscribed to is only the atomic
one. So we need to subscribe the bridge to the blocking switchdev
notification chain too.

This patch:
- keeps the driver-side perception of the switchdev_bridge_port_{,un}offload
unchanged
- moves the implementation of switchdev_bridge_port_{,un}offload from
the bridge module into the switchdev module.
- makes everybody that is subscribed to the switchdev blocking notifier
chain "hear" offload & unoffload events
- makes the bridge driver subscribe and handle those events
- moves the bridge driver's handling of those events into 2 new
functions called br_switchdev_port_{,un}offload. These functions
contain in fact the core of the logic that was previously in
switchdev_bridge_port_{,un}offload, just that now we go through an
extra indirection layer to reach them.

Unlike all the other switchdev notification structures, the structure
used to carry the bridge port information, struct
switchdev_notifier_brport_info, does not contain a "bool handled".
This is because in the current usage pattern, we always know that a
switchdev bridge port offloading event will be handled by the bridge,
because the switchdev_bridge_port_offload() call was initiated by a
NETDEV_CHANGEUPPER event in the first place, where info->upper_dev is a
bridge. So if the bridge wasn't loaded, then the CHANGEUPPER event
couldn't have happened.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9c0532f9 04-Aug-2021 David S. Miller <davem@davemloft.net>

Merge tag 'linux-can-next-for-5.15-20210804' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2021-08-04

this is a pull request of 5 patches for net-next/master.

The first patch is by me and fixes a typo in a comment in the CAN
J1939 protocol.

The next 2 patches are by Oleksij Rempel and update the CAN J1939
protocol to send RX status updates via the error queue mechanism.

The next patch is by me and adds a missing variable initialization to
the flexcan driver (the problem was introduced in the current net-next
cycle).

The last patch is by Aswath Govindraju and adds power-domains to the
Bosch m_can DT binding documentation.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>


5b9272e9 07-Jul-2021 Oleksij Rempel <linux@rempel-privat.de>

can: j1939: extend UAPI to notify about RX status

To be able to create applications with user friendly feedback, we need be
able to provide receive status information.

Typical ETP transfer may take seconds or even hours. To give user some
clue or show a progress bar, the stack should push status updates.
Same as for the TX information, the socket error queue will be used with
following new signals:
- J1939_EE_INFO_RX_RTS - received and accepted request to send signal.
- J1939_EE_INFO_RX_DPO - received data package offset signal
- J1939_EE_INFO_RX_ABORT - RX session was aborted

Instead of completion signal, user will get data package.
To activate this signals, application should set
SOF_TIMESTAMPING_RX_SOFTWARE to the SO_TIMESTAMPING socket option. This
will avoid unpredictable application behavior for the old software.

Link: https://lore.kernel.org/r/20210707094854.30781-3-o.rempel@pengutronix.de
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

cd85d3ae 07-Jul-2021 Oleksij Rempel <linux@rempel-privat.de>

can: j1939: rename J1939_ERRQUEUE_* to J1939_ERRQUEUE_TX_*

Prepare the world for the J1939_ERRQUEUE_RX_ version

Link: https://lore.kernel.org/r/20210707094854.30781-2-o.rempel@pengutronix.de
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

d00551b4 04-Aug-2021 David S. Miller <davem@davemloft.net>

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

Steffen Klassert says:

====================
pull request (net): ipsec 2021-08-04

1) Fix a sysbot reported memory leak in xfrm_user_rcv_msg.
From Pavel Skripkin.

2) Revert "xfrm: policy: Read seqcount outside of rcu-read side
in xfrm_policy_lookup_bytype". This commit tried to fix a
lockin bug, but only cured some of the symptoms. A proper
fix is applied on top of this revert.

3) Fix a locking bug on xfrm state hash resize. A recent change
on sequence counters accidentally repaced a spinlock by a mutex.
Fix from Frederic Weisbecker.

4) Fix possible user-memory-access in xfrm_user_rcv_msg_compat().
From Dmitry Safonov.

5) Add initialiation sefltest fot xfrm_spdattr_type_t.
From Dmitry Safonov.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>


51b8f812 03-Aug-2021 Eric Dumazet <edumazet@google.com>

ipv6: exthdrs: get rid of indirect calls in ip6_parse_tlv()

As presented last month in our "BIG TCP" talk at netdev 0x15,
we plan using IPv6 jumbograms.

One of the minor problem we talked about is the fact that
ip6_parse_tlv() is currently using tables to list known tlvs,
thus using potentially expensive indirect calls.

While we could mitigate this cost using macros from
indirect_call_wrapper.h, we also can get rid of the tables
and let the compiler emit optimized code.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Justin Iurman <justin.iurman@uliege.be>
Cc: Coco Li <lixiaoyan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

271e5b7d 03-Aug-2021 Jakub Kicinski <kuba@kernel.org>

net: add netif_set_real_num_queues() for device reconfig

netif_set_real_num_rx_queues() and netif_set_real_num_tx_queues()
can fail which breaks drivers trying to implement reconfiguration
in a way that can't leave the device half-broken. In other words
those functions are incompatible with prepare/commit approach.

Luckily setting real number of queues can fail only if the number
is increased, meaning that if we order operations correctly we
can guarantee ending up with either new config (success), or
the old one (on error).

Provide a helper implementing such logic so that drivers don't
have to duplicate it.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

8679c31e 03-Aug-2021 Rocco Yue <rocco.yue@mediatek.com>

net: add extack arg for link ops

Pass extack arg to validate_linkmsg and validate_link_af callbacks.
If a netlink attribute has a reject_message, use the extended ack
mechanism to carry the message back to user space.

Signed-off-by: Rocco Yue <rocco.yue@mediatek.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

06f5553e 03-Aug-2021 Yunsheng Lin <linyunsheng@huawei.com>

net: sched: fix lockdep_set_class() typo error for sch->seqlock

According to comment in qdisc_alloc(), sch->seqlock's lockdep
class key should be set to qdisc_tx_busylock, due to possible
type error, sch->busylock's lockdep class key is set to
qdisc_tx_busylock, which is duplicated because sch->busylock's
lockdep class key is already set in qdisc_alloc().

So fix it by replacing sch->busylock with sch->seqlock.

Fixes: 96009c7d500e ("sched: replace __QDISC_STATE_RUNNING bit with a spin lock")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

314001f0 01-Aug-2021 Rao Shoaib <rao.shoaib@oracle.com>

af_unix: Add OOB support

This patch adds OOB support for AF_UNIX sockets.
The semantics is same as TCP.

The last byte of a message with the OOB flag is
treated as the OOB byte. The byte is separated into
a skb and a pointer to the skb is stored in unix_sock.
The pointer is used to enforce OOB semantics.

Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

38ea9def 29-Jul-2021 Yajun Deng <yajun.deng@linux.dev>

netfilter: nf_conntrack_bridge: Fix memory leak when error

It should be added kfree_skb_list() when err is not equal to zero
in nf_br_ip_fragment().

v2: keep this aligned with IPv6.
v3: modify iter.frag_list to iter.frag.

Fixes: 3c171f496ef5 ("netfilter: bridge: add connection tracking system")
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

5f7b51bf 28-Jul-2021 Jozsef Kadlecsik <kadlec@netfilter.org>

netfilter: ipset: Limit the maximal range of consecutive elements to add/delete

The range size of consecutive elements were not limited. Thus one could
define a huge range which may result soft lockup errors due to the long
execution time. Now the range size is limited to 2^20 entries.

Reported-by: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

421297ef 02-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: tag_sja1105: consistently fail with arbitrary input

Dan Carpenter's smatch tests report that the "vid" variable, populated
by sja1105_vlan_rcv when an skb is received by the tagger that has a
VLAN ID which cannot be decoded by tag_8021q, may be uninitialized when
used here:

if (source_port == -1 || switch_id == -1)
skb->dev = dsa_find_designated_bridge_port_by_vid(netdev, vid);

The sja1105 driver, by construction, sets up the switch in a way that
all data plane packets sent towards the CPU port are VLAN-tagged. So it
is practically impossible, in a functional system, for a packet to be
processed by sja1110_rcv() which is not a control packet and does not
have a VLAN header either.

However, it would be nice if the sja1105 tagging driver could
consistently do something valid, for example fail, even if presented with
packets that do not hold valid sja1105 tags. Currently it is a bit hard
to argue that it does that, given the fact that a data plane packet with
no VLAN tag will trigger a call to dsa_find_designated_bridge_port_by_vid
with a vid argument that is an uninitialized stack variable.

To fix this, we can initialize the u16 vid variable with 0, a value that
can never be a bridge VLAN, so dsa_find_designated_bridge_port_by_vid
will always return a NULL skb->dev.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20210802195137.303625-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

e3ea110d 02-Aug-2021 Harshavardhan Unnibhavi <harshanavkis@gmail.com>

VSOCK: handle VIRTIO_VSOCK_OP_CREDIT_REQUEST

The original implementation of the virtio-vsock driver does not
handle a VIRTIO_VSOCK_OP_CREDIT_REQUEST as required by the
virtio-vsock specification. The vsock device emulated by
vhost-vsock and the virtio-vsock driver never uses this request,
which was probably why nobody noticed it. However, another
implementation of the device may use this request type.

Hence, this commit introduces a way to handle an explicit credit
request by responding with a corresponding credit update as
required by the virtio-vsock specification.

Fixes: 06a8fc78367d ("VSOCK: Introduce virtio_vsock_common.ko")
Signed-off-by: Harshavardhan Unnibhavi <harshanavkis@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20210802173506.2383-1-harshanavkis@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c45074d6 02-Aug-2021 Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

Bluetooth: Fix not generating RPA when required

Code was checking if random_addr and hdev->rpa match without first
checking if the RPA has not been set (BDADDR_ANY), furthermore it was
clearing HCI_RPA_EXPIRED before the command completes and the RPA is
actually programmed which in case of failure would leave the expired
RPA still set.

Since advertising instance have a similar problem the clearing of
HCI_RPA_EXPIRED has been moved to hci_event.c after checking the random
address is in fact the hdev->rap and then proceed to set the expire
timeout.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

10279313 02-Aug-2021 Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

Bluetooth: HCI: Add proper tracking for enable status of adv instances

This adds a field to track if advertising instances are enabled or not
and only clear HCI_LE_ADV flag if there is no instance left advertising.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

2e19bb35 02-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: bridge: switchdev: fix incorrect use of FDB flags when picking the dst device

Nikolay points out that it is incorrect to assume that it is impossible
to have an fdb entry with fdb->dst == NULL and the BR_FDB_LOCAL bit in
fdb->flags not set. This is because there are reader-side places that
test_bit(BR_FDB_LOCAL, &fdb->flags) without the br->hash_lock, and if
the updating of the FDB entry happens on another CPU, there are no
memory barriers at writer or reader side which would ensure that the
reader sees the updates to both fdb->flags and fdb->dst in the same
order, i.e. the reader will not see an inconsistent FDB entry.

So we must be prepared to deal with FDB entries where fdb->dst and
fdb->flags are in a potentially inconsistent state, and that means that
fdb->dst == NULL should remain a condition to pick the net_device that
we report to switchdev as being the bridge device, which is what the
code did prior to the blamed patch.

Fixes: 52e4bec15546 ("net: bridge: switchdev: treat local FDBs the same as entries towards the bridge")
Suggested-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20210802113633.189831-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

abc7285d 02-Aug-2021 Geliang Tang <geliangtang@gmail.com>

mptcp: drop unused rcu member in mptcp_pm_addr_entry

kfree_rcu() had been removed from pm_netlink.c, so this rcu field in
struct mptcp_pm_addr_entry became useless. Let's drop it.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Link: https://lore.kernel.org/r/20210802231914.54709-1-mathew.j.martineau@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bebc3bbf 03-Aug-2021 Yajun Deng <yajun.deng@linux.dev>

net: decnet: Fix refcount warning for new dn_fib_info

fib_treeref needs to be set after kzalloc. The old code had a ++ which
led to the confusion when the int was replaced by a refcount_t.

Fixes: 79976892f7ea ("net: convert fib_treeref from int to refcount_t")
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20210803073739.22339-1-yajun.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

5ea2f5ff 03-Aug-2021 Arnd Bergmann <arnd@arndb.de>

move netdev_boot_setup into Space.c

This is now only used by a handful of old ISA drivers,
and can be moved into the file they already all depend on.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

d43c65b0 31-Jul-2021 Heiner Kallweit <hkallweit1@gmail.com>

ethtool: runtime-resume netdev parent in ethnl_ops_begin

If a network device is runtime-suspended then:
- network device may be flagged as detached and all ethtool ops (even if not
accessing the device) will fail because netif_device_present() returns
false
- ethtool ops may fail because device is not accessible (e.g. because being
in D3 in case of a PCI device)

It may not be desirable that userspace can't use even simple ethtool ops
that not access the device if interface or link is down. To be more friendly
to userspace let's ensure that device is runtime-resumed when executing the
respective ethtool op in kernel.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

41107ac2 31-Jul-2021 Heiner Kallweit <hkallweit1@gmail.com>

ethtool: move netif_device_present check from ethnl_parse_header_dev_get to ethnl_ops_begin

If device is runtime-suspended and not accessible then it may be
flagged as not present. If checking whether device is present is
done too early then we may bail out before we have the chance to
runtime-resume the device. Therefore move this check to
ethnl_ops_begin(). This is in preparation of a follow-up patch
that tries to runtime-resume the device before executing ethtool
ops.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c5ab51df 31-Jul-2021 Heiner Kallweit <hkallweit1@gmail.com>

ethtool: move implementation of ethnl_ops_begin/complete to netlink.c

In preparation of subsequent extensions to both functions move the
implementations from netlink.h to netlink.c.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f32a2137 31-Jul-2021 Heiner Kallweit <hkallweit1@gmail.com>

ethtool: runtime-resume netdev parent before ethtool ioctl ops

If a network device is runtime-suspended then:
- network device may be flagged as detached and all ethtool ops (even if not
accessing the device) will fail because netif_device_present() returns
false
- ethtool ops may fail because device is not accessible (e.g. because being
in D3 in case of a PCI device)

It may not be desirable that userspace can't use even simple ethtool ops
that not access the device if interface or link is down. To be more friendly
to userspace let's ensure that device is runtime-resumed when executing the
respective ethtool op in kernel.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

83f31535 31-Jul-2021 Cong Wang <cong.wang@bytedance.com>

bpf, unix: Check socket type in unix_bpf_update_proto()

As of now, only AF_UNIX datagram socket supports sockmap. But
unix_proto is shared for all kinds of AF_UNIX sockets, so we
have to check the socket type in unix_bpf_update_proto() to
explicitly reject other types, otherwise they could be added
into sockmap, too.

Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
Reported-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210731195038.8084-1-xiyou.wangcong@gmail.com

0547ffe6 02-Aug-2021 Yajun Deng <yajun.deng@linux.dev>

net: Keep vertical alignment

Those files under /proc/net/stat/ don't have vertical alignment, it looks
very difficult. Modify the seq_printf statement, keep vertical alignment.

v2:
- Use seq_puts() and seq_printf() correctly.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>

ae954bbc 01-Aug-2021 Xin Long <lucien.xin@gmail.com>

sctp: move the active_key update after sh_keys is added

In commit 58acd1009226 ("sctp: update active_key for asoc when old key is
being replaced"), sctp_auth_asoc_init_active_key() is called to update
the active_key right after the old key is deleted and before the new key
is added, and it caused that the active_key could be found with the key_id.

In Ying Xu's testing, the BUG_ON in sctp_auth_asoc_init_active_key() was
triggered:

[ ] kernel BUG at net/sctp/auth.c:416!
[ ] RIP: 0010:sctp_auth_asoc_init_active_key.part.8+0xe7/0xf0 [sctp]
[ ] Call Trace:
[ ] sctp_auth_set_key+0x16d/0x1b0 [sctp]
[ ] sctp_setsockopt.part.33+0x1ba9/0x2bd0 [sctp]
[ ] __sys_setsockopt+0xd6/0x1d0
[ ] __x64_sys_setsockopt+0x20/0x30
[ ] do_syscall_64+0x5b/0x1a0

So fix it by moving the active_key update after sh_keys is added.

Fixes: 58acd1009226 ("sctp: update active_key for asoc when old key is being replaced")
Reported-by: Ying Xu <yinxu@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a1e975e1 02-Aug-2021 Vasily Averin <vvs@virtuozzo.com>

bpf: use skb_expand_head in bpf_out_neigh_v4/6

Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.

Additionally this patch replaces commonly used dereferencing with variables.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

53744a4a 02-Aug-2021 Vasily Averin <vvs@virtuozzo.com>

ax25: use skb_expand_head

Use skb_expand_head() in ax25_transmit_buffer and ax25_rt_build_path.
Unlike skb_realloc_headroom, new helper does not allocate a new skb if possible.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5678a595 02-Aug-2021 Vasily Averin <vvs@virtuozzo.com>

ipv4: use skb_expand_head in ip_finish_output2

Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0c9f227b 02-Aug-2021 Vasily Averin <vvs@virtuozzo.com>

ipv6: use skb_expand_head in ip6_xmit

Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.

Additionally this patch replaces commonly used dereferencing with variables.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e415ed3a 02-Aug-2021 Vasily Averin <vvs@virtuozzo.com>

ipv6: use skb_expand_head in ip6_finish_output2

Unlike skb_realloc_headroom, new helper skb_expand_head does not allocate
a new skb if possible.

Additionally this patch replaces commonly used dereferencing with variables.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f1260ff1 02-Aug-2021 Vasily Averin <vvs@virtuozzo.com>

skbuff: introduce skb_expand_head()

Like skb_realloc_headroom(), new helper increases headroom of specified skb.
Unlike skb_realloc_headroom(), it does not allocate a new skb if possible;
copies skb->sk on new skb when as needed and frees original skb in case
of failures.

This helps to simplify ip[6]_finish_output2() and a few other similar cases.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ce78ffa3 03-Aug-2021 David S. Miller <davem@davemloft.net>

net: really fix the build...

Signed-off-by: David S. Miller <davem@davemloft.net>

f8145cff 26-Jul-2021 Marc Kleine-Budde <mkl@pengutronix.de>

can: j1939: j1939_session_tx_dat(): fix typo

This patch fixes a typo in the j1939_session_tx_dat() function.

Link: https://lore.kernel.org/r/20210729113917.1655492-1-mkl@pengutronix.de
Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

0541a629 01-Aug-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: bridge: validate the NUD_PERMANENT bit when adding an extern_learn FDB entry

Currently it is possible to add broken extern_learn FDB entries to the
bridge in two ways:

1. Entries pointing towards the bridge device that are not local/permanent:

ip link add br0 type bridge
bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn static

2. Entries pointing towards the bridge device or towards a port that
are marked as local/permanent, however the bridge does not process the
'permanent' bit in any way, therefore they are recorded as though they
aren't permanent:

ip link add br0 type bridge
bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn permanent

Since commit 52e4bec15546 ("net: bridge: switchdev: treat local FDBs the
same as entries towards the bridge"), these incorrect FDB entries can
even trigger NULL pointer dereferences inside the kernel.

This is because that commit made the assumption that all FDB entries
that are not local/permanent have a valid destination port. For context,
local / permanent FDB entries either have fdb->dst == NULL, and these
point towards the bridge device and are therefore local and not to be
used for forwarding, or have fdb->dst == a net_bridge_port structure
(but are to be treated in the same way, i.e. not for forwarding).

That assumption _is_ correct as long as things are working correctly in
the bridge driver, i.e. we cannot logically have fdb->dst == NULL under
any circumstance for FDB entries that are not local. However, the
extern_learn code path where FDB entries are managed by a user space
controller show that it is possible for the bridge kernel driver to
misinterpret the NUD flags of an entry transmitted by user space, and
end up having fdb->dst == NULL while not being a local entry. This is
invalid and should be rejected.

Before, the two commands listed above both crashed the kernel in this
check from br_switchdev_fdb_notify:

struct net_device *dev = info.is_local ? br->dev : dst->dev;

info.is_local == false, dst == NULL.

After this patch, the invalid entry added by the first command is
rejected:

ip link add br0 type bridge && bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn static; ip link del br0
Error: bridge: FDB entry towards bridge must be permanent.

and the valid entry added by the second command is properly treated as a
local address and does not crash br_switchdev_fdb_notify anymore:

ip link add br0 type bridge && bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn permanent; ip link del br0

Fixes: eb100e0e24a2 ("net: bridge: allow to add externally learned entries from user-space")
Reported-by: syzbot+9ba1174359adba5a5b7c@syzkaller.appspotmail.com
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20210801231730.7493-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

28814cd1 02-Aug-2021 David Ahern <dsahern@kernel.org>

ipv4: Fix refcount warning for new fib_info

Ioana reported a refcount warning when booting over NFS:

[ 5.042532] ------------[ cut here ]------------
[ 5.047184] refcount_t: addition on 0; use-after-free.
[ 5.052324] WARNING: CPU: 7 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0xa4/0x150
...
[ 5.167201] Call trace:
[ 5.169635] refcount_warn_saturate+0xa4/0x150
[ 5.174067] fib_create_info+0xc00/0xc90
[ 5.177982] fib_table_insert+0x8c/0x620
[ 5.181893] fib_magic.isra.0+0x110/0x11c
[ 5.185891] fib_add_ifaddr+0xb8/0x190
[ 5.189629] fib_inetaddr_event+0x8c/0x140

fib_treeref needs to be set after kzalloc. The old code had a ++ which
led to the confusion when the int was replaced by a refcount_t.

Fixes: 79976892f7ea ("net: convert fib_treeref from int to refcount_t")
Signed-off-by: David Ahern <dsahern@kernel.org>
Reported-by: Ioana Ciornei <ciorneiioana@gmail.com>
Cc: Yajun Deng <yajun.deng@linux.dev>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20210802160221.27263-1-dsahern@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2d3e5caf 30-Jul-2021 Gustavo A. R. Silva <gustavoars@kernel.org>

net/ipv4: Replace one-element array with flexible-array member

There is a regular need in the kernel to provide a way to declare having
a dynamically sized set of trailing elements in a structure. Kernel code
should always use “flexible array members”[1] for these cases. The older
style of one-element or zero-length arrays should no longer be used[2].

Use an anonymous union with a couple of anonymous structs in order to
keep userspace unchanged:

$ pahole -C ip_msfilter net/ipv4/ip_sockglue.o
struct ip_msfilter {
union {
struct {
__be32 imsf_multiaddr_aux; /* 0 4 */
__be32 imsf_interface_aux; /* 4 4 */
__u32 imsf_fmode_aux; /* 8 4 */
__u32 imsf_numsrc_aux; /* 12 4 */
__be32 imsf_slist[1]; /* 16 4 */
}; /* 0 20 */
struct {
__be32 imsf_multiaddr; /* 0 4 */
__be32 imsf_interface; /* 4 4 */
__u32 imsf_fmode; /* 8 4 */
__u32 imsf_numsrc; /* 12 4 */
__be32 imsf_slist_flex[0]; /* 16 0 */
}; /* 0 16 */
}; /* 0 20 */

/* size: 20, cachelines: 1, members: 1 */
/* last cacheline: 20 bytes */
};

Also, refactor the code accordingly and make use of the struct_size()
and flex_array_size() helpers.

This helps with the ongoing efforts to globally enable -Warray-bounds
and get us closer to being able to tighten the FORTIFY_SOURCE routines
on memcpy().

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.10/process/deprecated.html#zero-length-and-one-element-arrays

Link: https://github.com/KSPP/linux/issues/79
Link: https://github.com/KSPP/linux/issues/109
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

29a097b7 31-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: remove the struct packet_type argument from dsa_device_ops::rcv()

No tagging driver uses this.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

35d7a6f1 30-Jul-2021 Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

nfc: hci: pass callback data param as pointer in nci_request()

The nci_request() receives a callback function and unsigned long data
argument "opt" which is passed to the callback. Almost all of the
nci_request() callers pass pointer to a stack variable as data argument.
Only few pass scalar value (e.g. u8).

All such callbacks do not modify passed data argument and in previous
commit they were made as const. However passing pointers via unsigned
long removes the const annotation. The callback could simply cast
unsigned long to a pointer to writeable memory.

Use "const void *" as type of this "opt" argument to solve this and
prevent modifying the pointed contents. This is also consistent with
generic pattern of passing data arguments - via "void *". In few places
which pass scalar values, use casts via "unsigned long" to suppress any
warnings.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ebca25ea 30-Jul-2021 Yannick Vignon <yannick.vignon@nxp.com>

net/sched: taprio: Fix init procedure

Commit 13511704f8d759 ("net: taprio offload: enforce qdisc to netdev queue mapping")
resulted in duplicate entries in the qdisc hash.
While this did not impact the overall operation of the qdisc and taprio
code paths, it did result in an infinite loop when dumping the qdisc
properties, at least on one target (NXP LS1028 ARDB).
Removing the duplicate call to qdisc_hash_add() solves the problem.

Fixes: 13511704f8d759 ("net: taprio offload: enforce qdisc to netdev queue mapping")
Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

87663c39 23-Jul-2021 Florian Westphal <fw@strlen.de>

netfilter: ebtables: do not hook tables by default

If any of these modules is loaded, hooks get registered in all netns:

Before: 'unshare -n nft list hooks' shows:
family bridge hook prerouting {
-2147483648 ebt_broute
-0000000300 ebt_nat_hook
}
family bridge hook input {
-0000000200 ebt_filter_hook
}
family bridge hook forward {
-0000000200 ebt_filter_hook
}
family bridge hook output {
+0000000100 ebt_nat_hook
+0000000200 ebt_filter_hook
}
family bridge hook postrouting {
+0000000300 ebt_nat_hook
}

This adds 'template 'tables' for ebtables.

Each ebtable_foo registers the table as a template, with an init function
that gets called once the first get/setsockopt call is made.

ebtables core then searches the (per netns) list of tables.
If no table is found, it searches the list of templates instead.
If a template entry exists, the init function is called which will
enable the table and register the hooks (so packets are diverted
to the table).

If no entry is found in the template list, request_module is called.

After this, hook registration is delayed until the 'ebtables'
(set/getsockopt) request is made for a given table and will only
happen in the specific namespace.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

695176bf 29-Jul-2021 Cong Wang <cong.wang@bytedance.com>

net_sched: refactor TC action init API

TC action ->init() API has 10 parameters, it becomes harder
to read. Some of them are just boolean and can be replaced
by flags. Similarly for the internal API tcf_action_init()
and tcf_exts_validate().

This patch converts them to flags and fold them into
the upper 16 bits of "flags", whose lower 16 bits are still
reserved for user-space. More specifically, the following
kernel flags are introduced:

TCA_ACT_FLAGS_POLICE replace 'name' in a few contexts, to
distinguish whether it is compatible with policer.

TCA_ACT_FLAGS_BIND replaces 'bind', to indicate whether
this action is bound to a filter.

TCA_ACT_FLAGS_REPLACE replaces 'ovr' in most contexts,
means we are replacing an existing action.

TCA_ACT_FLAGS_NO_RTNL replaces 'rtnl_held' but has the
opposite meaning, because we still hold RTNL in most
cases.

The only user-space flag TCA_ACT_FLAGS_NO_PERCPU_STATS is
untouched and still stored as before.

I have tested this patch with tdc and I do not see any
failure related to this patch.

Tested-by: Vlad Buslov <vladbu@nvidia.com>
Acked-by: Jamal Hadi Salim<jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d51c5907 29-Jul-2021 Jakub Sitnicki <jakub@cloudflare.com>

net, gro: Set inner transport header offset in tcp/udp GRO hook

GSO expects inner transport header offset to be valid when
skb->encapsulation flag is set. GSO uses this value to calculate the length
of an individual segment of a GSO packet in skb_gso_transport_seglen().

However, tcp/udp gro_complete callbacks don't update the
skb->inner_transport_header when processing an encapsulated TCP/UDP
segment. As a result a GRO skb has ->inner_transport_header set to a value
carried over from earlier skb processing.

This can have mild to tragic consequences. From miscalculating the GSO
segment length to triggering a page fault [1], when trying to read TCP/UDP
header at an address past the skb->data page.

The latter scenario leads to an oops report like so:

BUG: unable to handle page fault for address: ffff9fa7ec00d008
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 123f201067 P4D 123f201067 PUD 123f209067 PMD 0
Oops: 0000 [#1] SMP NOPTI
CPU: 44 PID: 0 Comm: swapper/44 Not tainted 5.4.53-cloudflare-2020.7.21 #1
Hardware name: HYVE EDGE-METAL-GEN10/HS-1811DLite1, BIOS V2.15 02/21/2020
RIP: 0010:skb_gso_transport_seglen+0x44/0xa0
Code: c0 41 83 e0 11 f6 87 81 00 00 00 20 74 30 0f b7 87 aa 00 00 00 0f [...]
RSP: 0018:ffffad8640bacbb8 EFLAGS: 00010202
RAX: 000000000000feda RBX: ffff9fcc8d31bc00 RCX: ffff9fa7ec00cffc
RDX: ffff9fa7ebffdec0 RSI: 000000000000feda RDI: 0000000000000122
RBP: 00000000000005c4 R08: 0000000000000001 R09: 0000000000000000
R10: ffff9fe588ae3800 R11: ffff9fe011fc92f0 R12: ffff9fcc8d31bc00
R13: ffff9fe0119d4300 R14: 00000000000005c4 R15: ffff9fba57d70900
FS: 0000000000000000(0000) GS:ffff9fe68df00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff9fa7ec00d008 CR3: 0000003e99b1c000 CR4: 0000000000340ee0
Call Trace:
<IRQ>
skb_gso_validate_network_len+0x11/0x70
__ip_finish_output+0x109/0x1c0
ip_sublist_rcv_finish+0x57/0x70
ip_sublist_rcv+0x2aa/0x2d0
? ip_rcv_finish_core.constprop.0+0x390/0x390
ip_list_rcv+0x12b/0x14f
__netif_receive_skb_list_core+0x2a9/0x2d0
netif_receive_skb_list_internal+0x1b5/0x2e0
napi_complete_done+0x93/0x140
veth_poll+0xc0/0x19f [veth]
? mlx5e_napi_poll+0x221/0x610 [mlx5_core]
net_rx_action+0x1f8/0x790
__do_softirq+0xe1/0x2bf
irq_exit+0x8e/0xc0
do_IRQ+0x58/0xe0
common_interrupt+0xf/0xf
</IRQ>

The bug can be observed in a simple setup where we send IP/GRE/IP/TCP
packets into a netns over a veth pair. Inside the netns, packets are
forwarded to dummy device:

trafgen -> [veth A]--[veth B] -forward-> [dummy]

For veth B to GRO aggregate packets on receive, it needs to have an XDP
program attached (for example, a trivial XDP_PASS). Additionally, for UDP,
we need to enable GSO_UDP_L4 feature on the device:

ip netns exec A ethtool -K AB rx-udp-gro-forwarding on

The last component is an artificial delay to increase the chances of GRO
batching happening:

ip netns exec A tc qdisc add dev AB root \
netem delay 200us slot 5ms 10ms packets 2 bytes 64k

With such a setup in place, the bug can be observed by tracing the skb
outer and inner offsets when GSO skb is transmitted from the dummy device:

tcp:

FUNC DEV SKB_LEN NH TH ENC INH ITH GSO_SIZE GSO_TYPE
ip_finish_output dumB 2830 270 290 1 294 254 1383 (tcpv4,gre,)
^^^
udp:

FUNC DEV SKB_LEN NH TH ENC INH ITH GSO_SIZE GSO_TYPE
ip_finish_output dumB 2818 270 290 1 294 254 1383 (gre,udp_l4,)
^^^

Fix it by updating the inner transport header offset in tcp/udp
gro_complete callbacks, similar to how {inet,ipv6}_gro_complete callbacks
update the inner network header offset, when skb->encapsulation flag is
set.

[1] https://lore.kernel.org/netdev/CAKxSbF01cLpZem2GFaUaifh0S-5WYViZemTicAg7FCHOnh6kug@mail.gmail.com/

Fixes: bf296b125b21 ("tcp: Add GRO support")
Fixes: f993bc25e519 ("net: core: handle encapsulation offloads when computing segment lengths")
Fixes: e20cf8d3f1f7 ("udp: implement GRO for plain UDP sockets.")
Reported-by: Alex Forster <aforster@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f2e3778d 22-Jul-2021 Florian Westphal <fw@strlen.de>

netfilter: remove xt pernet data

clusterip is now handled via net_generic.

NOTRACK is tiny compared to rest of xt_CT feature set, even the existing
deprecation warning is bigger than the actual functionality.

Just remove the warning, its not worth keeping/adding a net_generic one.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

ded2d10e 22-Jul-2021 Florian Westphal <fw@strlen.de>

netfilter: ipt_CLUSTERIP: use clusterip_net to store pernet warning

No need to use struct net for this.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

7c1829b6 22-Jul-2021 Florian Westphal <fw@strlen.de>

netfilter: ipt_CLUSTERIP: only add arp mangle hook when required

Do not register the arp mangling hooks from pernet init path.

As-is, load of the module is enough for these hooks to become active
in each net namespace.

Use checkentry instead so hook is only added if a CLUSTERIP rule is used.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

92fb1551 19-Jul-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: flowtable: remove nf_ct_l4proto_find() call

TCP and UDP are built-in conntrack protocol trackers and the flowtable
only supports for TCP and UDP, remove this call.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

241d1af4 15-Jul-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_compat: use nfnetlink_unicast()

Use nfnetlink_unicast() which already translates EAGAIN to ENOBUFS,
since EAGAIN is reserved to report missing module dependencies to the
nfnetlink core.

e0241ae6ac59 ("netfilter: use nfnetlink_unicast() forgot to update
this spot.

Reported-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

d39e8b92 31-Jul-2021 Jakub Kicinski <kuba@kernel.org>

Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Andrii Nakryiko says:

====================
bpf-next 2021-07-30

We've added 64 non-merge commits during the last 15 day(s) which contain
a total of 83 files changed, 5027 insertions(+), 1808 deletions(-).

The main changes are:

1) BTF-guided binary data dumping libbpf API, from Alan.

2) Internal factoring out of libbpf CO-RE relocation logic, from Alexei.

3) Ambient BPF run context and cgroup storage cleanup, from Andrii.

4) Few small API additions for libbpf 1.0 effort, from Evgeniy and Hengqi.

5) bpf_program__attach_kprobe_opts() fixes in libbpf, from Jiri.

6) bpf_{get,set}sockopt() support in BPF iterators, from Martin.

7) BPF map pinning improvements in libbpf, from Martynas.

8) Improved module BTF support in libbpf and bpftool, from Quentin.

9) Bpftool cleanups and documentation improvements, from Quentin.

10) Libbpf improvements for supporting CO-RE on old kernels, from Shuyi.

11) Increased maximum cgroup storage size, from Stanislav.

12) Small fixes and improvements to BPF tests and samples, from various folks.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (64 commits)
tools: bpftool: Complete metrics list in "bpftool prog profile" doc
tools: bpftool: Document and add bash completion for -L, -B options
selftests/bpf: Update bpftool's consistency script for checking options
tools: bpftool: Update and synchronise option list in doc and help msg
tools: bpftool: Complete and synchronise attach or map types
selftests/bpf: Check consistency between bpftool source, doc, completion
tools: bpftool: Slightly ease bash completion updates
unix_bpf: Fix a potential deadlock in unix_dgram_bpf_recvmsg()
libbpf: Add btf__load_vmlinux_btf/btf__load_module_btf
tools: bpftool: Support dumping split BTF by id
libbpf: Add split BTF support for btf__load_from_kernel_by_id()
tools: Replace btf__get_from_id() with btf__load_from_kernel_by_id()
tools: Free BTF objects at various locations
libbpf: Rename btf__get_from_id() as btf__load_from_kernel_by_id()
libbpf: Rename btf__load() as btf__load_into_kernel()
libbpf: Return non-null error on failures in libbpf_find_prog_btf_id()
bpf: Emit better log message if bpf_iter ctx arg btf_id == 0
tools/resolve_btfids: Emit warnings and patch zero id for missing symbols
bpf: Increase supported cgroup storage value size
libbpf: Fix race when pinning maps in parallel
...
====================

Link: https://lore.kernel.org/r/20210730225606.1897330-1-andrii@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


d2e11fd2 31-Jul-2021 Jakub Kicinski <kuba@kernel.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Conflicting commits, all resolutions pretty trivial:

drivers/bus/mhi/pci_generic.c
5c2c85315948 ("bus: mhi: pci-generic: configurable network interface MRU")
56f6f4c4eb2a ("bus: mhi: pci_generic: Apply no-op for wake using sideband wake boolean")

drivers/nfc/s3fwrn5/firmware.c
a0302ff5906a ("nfc: s3fwrn5: remove unnecessary label")
46573e3ab08f ("nfc: s3fwrn5: fix undefined parameter values in dev_err()")
801e541c79bb ("nfc: s3fwrn5: fix undefined parameter values in dev_err()")

MAINTAINERS
7d901a1e878a ("net: phy: add Maxlinear GPY115/21x/24x driver")
8a7b46fa7902 ("MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>


c7d10223 30-Jul-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'net-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.14-rc4, including fixes from bpf, can, WiFi
(mac80211) and netfilter trees.

Current release - regressions:

- mac80211: fix starting aggregation sessions on mesh interfaces

Current release - new code bugs:

- sctp: send pmtu probe only if packet loss in Search Complete state

- bnxt_en: add missing periodic PHC overflow check

- devlink: fix phys_port_name of virtual port and merge error

- hns3: change the method of obtaining default ptp cycle

- can: mcba_usb_start(): add missing urb->transfer_dma initialization

Previous releases - regressions:

- set true network header for ECN decapsulation

- mlx5e: RX, avoid possible data corruption w/ relaxed ordering and
LRO

- phy: re-add check for PHY_BRCM_DIS_TXCRXC_NOENRGY on the BCM54811
PHY

- sctp: fix return value check in __sctp_rcv_asconf_lookup

Previous releases - always broken:

- bpf:
- more spectre corner case fixes, introduce a BPF nospec
instruction for mitigating Spectre v4
- fix OOB read when printing XDP link fdinfo
- sockmap: fix cleanup related races

- mac80211: fix enabling 4-address mode on a sta vif after assoc

- can:
- raw: raw_setsockopt(): fix raw_rcv panic for sock UAF
- j1939: j1939_session_deactivate(): clarify lifetime of session
object, avoid UAF
- fix number of identical memory leaks in USB drivers

- tipc:
- do not blindly write skb_shinfo frags when doing decryption
- fix sleeping in tipc accept routine"

* tag 'net-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits)
gve: Update MAINTAINERS list
can: esd_usb2: fix memory leak
can: ems_usb: fix memory leak
can: usb_8dev: fix memory leak
can: mcba_usb_start(): add missing urb->transfer_dma initialization
can: hi311x: fix a signedness bug in hi3110_cmd()
MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver
bpf: Fix leakage due to insufficient speculative store bypass mitigation
bpf: Introduce BPF nospec instruction for mitigating Spectre v4
sis900: Fix missing pci_disable_device() in probe and remove
net: let flow have same hash in two directions
nfc: nfcsim: fix use after free during module unload
tulip: windbond-840: Fix missing pci_disable_device() in probe and remove
sctp: fix return value check in __sctp_rcv_asconf_lookup
nfc: s3fwrn5: fix undefined parameter values in dev_err()
net/mlx5: Fix mlx5_vport_tbl_attr chain from u16 to u32
net/mlx5e: Fix nullptr in mlx5e_hairpin_get_mdev()
net/mlx5: Unload device upon firmware fatal error
net/mlx5e: Fix page allocation failure for ptp-RQ over SF
net/mlx5e: Fix page allocation failure for trap-RQ over SF
...


26713455 29-Jul-2021 Leon Romanovsky <leon@kernel.org>

devlink: Allocate devlink directly in requested net namespace

There is no need in extra call indirection and check from impossible
flow where someone tries to set namespace without prior call
to devlink_alloc().

Instead of this extra logic and additional EXPORT_SYMBOL, use specialized
devlink allocation function that receives net namespace as an argument.

Such specialized API allows clear view when devlink initialized in wrong
net namespace and/or kernel users don't try to change devlink namespace
under the hood.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

05a7f4a8 29-Jul-2021 Leon Romanovsky <leon@kernel.org>

devlink: Break parameter notification sequence to be before/after unload/load driver

The change of namespaces during devlink reload calls to driver unload
before it accesses devlink parameters. The commands below causes to
use-after-free bug when trying to get flow steering mode.

* ip netns add n1
* devlink dev reload pci/0000:00:09.0 netns n1

==================================================================
BUG: KASAN: use-after-free in mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core]
Read of size 4 at addr ffff888009d04308 by task devlink/275

CPU: 6 PID: 275 Comm: devlink Not tainted 5.12.0-rc2+ #2853
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x93/0xc2
print_address_description.constprop.0+0x18/0x140
? mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core]
? mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core]
kasan_report.cold+0x7c/0xd8
? mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core]
mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core]
devlink_nl_param_fill+0x1c8/0xe80
? __free_pages_ok+0x37a/0x8a0
? devlink_flash_update_timeout_notify+0xd0/0xd0
? lock_acquire+0x1a9/0x6d0
? fs_reclaim_acquire+0xb7/0x160
? lock_is_held_type+0x98/0x110
? 0xffffffff81000000
? lock_release+0x1f9/0x6c0
? fs_reclaim_release+0xa1/0xf0
? lock_downgrade+0x6d0/0x6d0
? lock_is_held_type+0x98/0x110
? lock_is_held_type+0x98/0x110
? memset+0x20/0x40
? __build_skb_around+0x1f8/0x2b0
devlink_param_notify+0x6d/0x180
devlink_reload+0x1c3/0x520
? devlink_remote_reload_actions_performed+0x30/0x30
? mutex_trylock+0x24b/0x2d0
? devlink_nl_cmd_reload+0x62b/0x1070
devlink_nl_cmd_reload+0x66d/0x1070
? devlink_reload+0x520/0x520
? devlink_get_from_attrs+0x1bc/0x260
? devlink_nl_pre_doit+0x64/0x4d0
genl_family_rcv_msg_doit+0x1e9/0x2f0
? mutex_lock_io_nested+0x1130/0x1130
? genl_family_rcv_msg_attrs_parse.constprop.0+0x240/0x240
? security_capable+0x51/0x90
genl_rcv_msg+0x27f/0x4a0
? genl_get_cmd+0x3c0/0x3c0
? lock_acquire+0x1a9/0x6d0
? devlink_reload+0x520/0x520
? lock_release+0x6c0/0x6c0
netlink_rcv_skb+0x11d/0x340
? genl_get_cmd+0x3c0/0x3c0
? netlink_ack+0x9f0/0x9f0
? lock_release+0x1f9/0x6c0
genl_rcv+0x24/0x40
netlink_unicast+0x433/0x700
? netlink_attachskb+0x730/0x730
? _copy_from_iter_full+0x178/0x650
? __alloc_skb+0x113/0x2b0
netlink_sendmsg+0x6f1/0xbd0
? netlink_unicast+0x700/0x700
? lock_is_held_type+0x98/0x110
? netlink_unicast+0x700/0x700
sock_sendmsg+0xb0/0xe0
__sys_sendto+0x193/0x240
? __x64_sys_getpeername+0xb0/0xb0
? do_sys_openat2+0x10b/0x370
? __up_read+0x1a1/0x7b0
? do_user_addr_fault+0x219/0xdc0
? __x64_sys_openat+0x120/0x1d0
? __x64_sys_open+0x1a0/0x1a0
__x64_sys_sendto+0xdd/0x1b0
? syscall_enter_from_user_mode+0x1d/0x50
do_syscall_64+0x2d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fc69d0af14a
Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 c3 0f 1f 44 00 00 55 48 83 ec 30 44 89 4c
RSP: 002b:00007ffc1d8292f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fc69d0af14a
RDX: 0000000000000038 RSI: 0000555f57c56440 RDI: 0000000000000003
RBP: 0000555f57c56410 R08: 00007fc69d17b200 R09: 000000000000000c
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Allocated by task 146:
kasan_save_stack+0x1b/0x40
__kasan_kmalloc+0x99/0xc0
mlx5_init_fs+0xf0/0x1c50 [mlx5_core]
mlx5_load+0xd2/0x180 [mlx5_core]
mlx5_init_one+0x2f6/0x450 [mlx5_core]
probe_one+0x47d/0x6e0 [mlx5_core]
pci_device_probe+0x2a0/0x4a0
really_probe+0x20a/0xc90
driver_probe_device+0xd8/0x380
device_driver_attach+0x1df/0x250
__driver_attach+0xff/0x240
bus_for_each_dev+0x11e/0x1a0
bus_add_driver+0x309/0x570
driver_register+0x1ee/0x380
0xffffffffa06b8062
do_one_initcall+0xd5/0x410
do_init_module+0x1c8/0x760
load_module+0x6d8b/0x9650
__do_sys_finit_module+0x118/0x1b0
do_syscall_64+0x2d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae

Freed by task 275:
kasan_save_stack+0x1b/0x40
kasan_set_track+0x1c/0x30
kasan_set_free_info+0x20/0x30
__kasan_slab_free+0x102/0x140
slab_free_freelist_hook+0x74/0x1b0
kfree+0xd7/0x2a0
mlx5_unload+0x16/0xb0 [mlx5_core]
mlx5_unload_one+0xae/0x120 [mlx5_core]
mlx5_devlink_reload_down+0x1bc/0x380 [mlx5_core]
devlink_reload+0x141/0x520
devlink_nl_cmd_reload+0x66d/0x1070
genl_family_rcv_msg_doit+0x1e9/0x2f0
genl_rcv_msg+0x27f/0x4a0
netlink_rcv_skb+0x11d/0x340
genl_rcv+0x24/0x40
netlink_unicast+0x433/0x700
netlink_sendmsg+0x6f1/0xbd0
sock_sendmsg+0xb0/0xe0
__sys_sendto+0x193/0x240
__x64_sys_sendto+0xdd/0x1b0
do_syscall_64+0x2d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae

The buggy address belongs to the object at ffff888009d04300
which belongs to the cache kmalloc-128 of size 128
The buggy address is located 8 bytes inside of
128-byte region [ffff888009d04300, ffff888009d04380)
The buggy address belongs to the page:
page:0000000086a64ecc refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888009d04000 pfn:0x9d04
head:0000000086a64ecc order:1 compound_mapcount:0
flags: 0x4000000000010200(slab|head)
raw: 4000000000010200 ffffea0000203980 0000000200000002 ffff8880050428c0
raw: ffff888009d04000 000000008020001d 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff888009d04200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888009d04280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888009d04300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888009d04380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888009d04400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

The right solution to devlink reload is to notify about deletion of
parameters, unload driver, change net namespaces, load driver and notify
about addition of parameters.

Fixes: 070c63f20f6c ("net: devlink: allow to change namespaces during reload")
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

0b846445 23-Jul-2021 Cong Wang <cong.wang@bytedance.com>

unix_bpf: Fix a potential deadlock in unix_dgram_bpf_recvmsg()

As Eric noticed, __unix_dgram_recvmsg() may acquire u->iolock
too, so we have to release it before calling this function.

Fixes: 9825d866ce0d ("af_unix: Implement unix_dgram_bpf_recvmsg()")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>

a432934a 30-Jul-2021 Paolo Abeni <pabeni@redhat.com>

sk_buff: avoid potentially clearing 'slow_gro' field

If skb_dst_set_noref() is invoked with a NULL dst, the 'slow_gro'
field is cleared, too. That could lead to wrong behavior if
the skb later enters the GRO stage.

Fix the potential issue replacing preserving a non-zero value of
the 'slow_gro' field.

Additionally, fix a comment typo.

Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Fixes: 8a886b142bd0 ("sk_buff: track dst status in slow_gro")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/aa42529252dc8bb02bd42e8629427040d1058537.1627662501.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bc830525 29-Jul-2021 Yajun Deng <yajun.deng@linux.dev>

net: netlink: Remove unused function

lockdep_genl_is_held() and its caller arm not used now, just remove them.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Link: https://lore.kernel.org/r/20210729074854.8968-1-yajun.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

77411df5 30-Jul-2021 Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

nfc: hci: cleanup unneeded spaces

No need for multiple spaces in variable declaration (the code does not
use them in other places). No functional change.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ddecf555 30-Jul-2021 Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

nfc: nci: constify several pointers to u8, sk_buff and other structs

Several functions receive pointers to u8, sk_buff or other structs but
do not modify the contents so make them const. This allows doing the
same for local variables and in total makes the code a little bit safer.

This makes const also data passed as "unsigned long opt" argument to
nci_request() function. Usual flow for such functions is:
1. Receive "u8 *" and store it (the pointer) in a structure
allocated on stack (e.g. struct nci_set_config_param),
2. Call nci_request() or __nci_request() passing a callback function an
the pointer to the structure via an "unsigned long opt",
3. nci_request() calls the callback which dereferences "unsigned long
opt" in a read-only way.

This converts all above paths to use proper pointer to const data, so
entire flow is safer.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

f2479c0a 30-Jul-2021 Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

nfc: constify local pointer variables

Few pointers to struct nfc_target and struct nfc_se can be made const.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

3df40eb3 30-Jul-2021 Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

nfc: constify several pointers to u8, char and sk_buff

Several functions receive pointers to u8, char or sk_buff but do not
modify the contents so make them const. This allows doing the same for
local variables and in total makes the code a little bit safer.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

4932c378 30-Jul-2021 Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

nfc: hci: annotate nfc_llc_init() as __init

The nfc_llc_init() is used only in other __init annotated context.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bf6cd772 30-Jul-2021 Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

nfc: annotate af_nfc_exit() as __exit

The af_nfc_exit() is used only in other __exit annotated context
(nfc_exit()).

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

79976892 29-Jul-2021 Yajun Deng <yajun.deng@linux.dev>

net: convert fib_treeref from int to refcount_t

refcount_t type should be used instead of int when fib_treeref is used as
a reference counter,and avoid use-after-free risks.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20210729071350.28919-1-yajun.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bea79078 29-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: don't set skb->offload_fwd_mark when not offloading the bridge

DSA has gained the recent ability to deal gracefully with upper
interfaces it cannot offload, such as the bridge, bonding or team
drivers. When such uppers exist, the ports are still in standalone mode
as far as the hardware is concerned.

But when we deliver packets to the software bridge in order for that to
do the forwarding, there is an unpleasant surprise in that the bridge
will refuse to forward them. This is because we unconditionally set
skb->offload_fwd_mark = true, meaning that the bridge thinks the frames
were already forwarded in hardware by us.

Since dp->bridge_dev is populated only when there is hardware offload
for it, but not in the software fallback case, let's introduce a new
helper that can be called from the tagger data path which sets the
skb->offload_fwd_mark accordingly to zero when there is no hardware
offload for bridging. This lets the bridge forward packets back to other
interfaces of our switch, if needed.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3aa26055 28-Jul-2021 Davide Caratti <dcaratti@redhat.com>

net/sched: store the last executed chain also for clsact egress

currently, only 'ingress' and 'clsact ingress' qdiscs store the tc 'chain
id' in the skb extension. However, userspace programs (like ovs) are able
to setup egress rules, and datapath gets confused in case it doesn't find
the 'chain id' for a packet that's "recirculated" by tc.
Change tcf_classify() to have the same semantic as tcf_classify_ingress()
so that a single function can be called in ingress / egress, using the tc
ingress / egress block respectively.

Suggested-by: Alaa Hleilel <alaa@nvidia.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

04a17583 28-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: tag_sja1105: fix control packets on SJA1110 being received on an imprecise port

On RX, a control packet with SJA1110 will have:
- an in-band control extension (DSA tag) composed of a header and an
optional trailer (if it is a timestamp frame). We can (and do) deduce
the source port and switch id from this.
- a VLAN header, which can either be the tag_8021q RX VLAN (pvid) or the
bridge VLAN. The sja1105_vlan_rcv() function attempts to deduce the
source port and switch id a second time from this.

The basic idea is that even though we don't need the source port
information from the tag_8021q header if it's a control packet, we do
need to strip that header before we pass it on to the network stack.

The problem is that we call sja1105_vlan_rcv for ports under VLAN-aware
bridges, and that function tells us it couldn't identify a tag_8021q
header, so we need to perform imprecise RX by VID. Well, we don't,
because we already know the source port and switch ID.

This patch drops the return value from sja1105_vlan_rcv and we just look
at the source_port and switch_id values from sja1105_rcv and sja1110_rcv
which were initialized to -1. If they are still -1 it means we need to
perform imprecise RX.

Fixes: 884be12f8566 ("net: dsa: sja1105: add support for imprecise RX")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

03f2bbc4 28-Jul-2021 Matt Johnston <matt@codeconstruct.com.au>

mctp: Allow per-netns default networks

Currently we have a compile-time default network
(MCTP_INITIAL_DEFAULT_NET). This change introduces a default_net field
on the net namespace, allowing future configuration for new interfaces.

Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

26ab3fca 28-Jul-2021 Matt Johnston <matt@codeconstruct.com.au>

mctp: Add dest neighbour lladdr to route output

Now that we have a neighbour implementation, hook it up to the output
path to set the dest hardware address for outgoing packets.

Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

4a992bbd 28-Jul-2021 Jeremy Kerr <jk@codeconstruct.com.au>

mctp: Implement message fragmentation & reassembly

This change implements MCTP fragmentation (based on route & device MTU),
and corresponding reassembly.

The MCTP specification only allows for fragmentation on the originating
message endpoint, and reassembly on the destination endpoint -
intermediate nodes do not need to reassemble/refragment. Consequently,
we only fragment in the local transmit path, and reassemble
locally-bound packets. Messages are required to be in-order, so we
simply cancel reassembly on out-of-order or missing packets.

In the fragmentation path, we just break up the message into MTU-sized
fragments; the skb structure is a simple copy for now, which we can later
improve with a shared data implementation.

For reassembly, we keep track of incoming message fragments using the
existing tag infrastructure, allocating a key on the (src,dest,tag)
tuple, and reassembles matching fragments into a skb->frag_list.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

833ef3b9 28-Jul-2021 Jeremy Kerr <jk@codeconstruct.com.au>

mctp: Populate socket implementation

Start filling-out the socket syscalls: bind, sendmsg & recvmsg.

This requires an input route implementation, so we add to
mctp_route_input, allowing lookups on binds & message tags. This just
handles single-packet messages at present, we will add fragmentation in
a future change.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

831119f8 28-Jul-2021 Matt Johnston <matt@codeconstruct.com.au>

mctp: Add neighbour netlink interface

This change adds the netlink interfaces for manipulating the MCTP
neighbour table.

Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

4d8b9319 28-Jul-2021 Matt Johnston <matt@codeconstruct.com.au>

mctp: Add neighbour implementation

Add an initial neighbour table implementation, to be used in the route
output path.

Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

06d2f4c5 28-Jul-2021 Matt Johnston <matt@codeconstruct.com.au>

mctp: Add netlink route management

This change adds RTM_GETROUTE, RTM_NEWROUTE & RTM_DELROUTE handlers,
allowing management of the MCTP route table.

Includes changes from Jeremy Kerr <jk@codeconstruct.com.au>.

Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

889b7da2 28-Jul-2021 Jeremy Kerr <jk@codeconstruct.com.au>

mctp: Add initial routing framework

Add a simple routing table, and a couple of route output handlers, and
the mctp packet_type & handler.

Includes changes from Matt Johnston <matt@codeconstruct.com.au>.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

583be982 28-Jul-2021 Jeremy Kerr <jk@codeconstruct.com.au>

mctp: Add device handling and netlink interface

This change adds the infrastructure for managing MCTP netdevices; we add
a pointer to the AF_MCTP-specific data to struct netdevice, and hook up
the rtnetlink operations for adding and removing addresses.

Includes changes from Matt Johnston <matt@codeconstruct.com.au>.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

8f601a1e 28-Jul-2021 Jeremy Kerr <jk@codeconstruct.com.au>

mctp: Add base socket/protocol definitions

Add an empty socket implementation, plus initialisation/destruction
handlers.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

bc49d816 28-Jul-2021 Jeremy Kerr <jk@codeconstruct.com.au>

mctp: Add MCTP base

Add basic Kconfig, an initial (empty) af_mctp source object, and
{AF,PF}_MCTP definitions, and the required definitions for a new
protocol type.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

92fe24a7 28-Jul-2021 Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>

Bluetooth: skip invalid hci_sync_conn_complete_evt

Syzbot reported a corrupted list in kobject_add_internal [1]. This
happens when multiple HCI_EV_SYNC_CONN_COMPLETE event packets with
status 0 are sent for the same HCI connection. This causes us to
register the device more than once which corrupts the kset list.

As this is forbidden behavior, we add a check for whether we're
trying to process the same HCI_EV_SYNC_CONN_COMPLETE event multiple
times for one connection. If that's the case, the event is invalid, so
we report an error that the device is misbehaving, and ignore the
packet.

Link: https://syzkaller.appspot.com/bug?extid=66264bf2fd0476be7e6c [1]
Reported-by: syzbot+66264bf2fd0476be7e6c@syzkaller.appspotmail.com
Tested-by: syzbot+66264bf2fd0476be7e6c@syzkaller.appspotmail.com
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

5e10da53 28-Jul-2021 Paolo Abeni <pabeni@redhat.com>

skbuff: allow 'slow_gro' for skb carring sock reference

This change leverages the infrastructure introduced by the previous
patches to allow soft devices passing to the GRO engine owned skbs
without impacting the fast-path.

It's up to the GRO caller ensuring the slow_gro bit validity before
invoking the GRO engine. The new helper skb_prepare_for_gro() is
introduced for that goal.

On slow_gro, skbs are aggregated only with equal sk.
Additionally, skb truesize on GRO recycle and free is correctly
updated so that sk wmem is not changed by the GRO processing.

rfc-> v1:
- fixed bad truesize on dev_gro_receive NAPI_FREE
- use the existing state bit

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9efb4b5b 28-Jul-2021 Paolo Abeni <pabeni@redhat.com>

net: optimize GRO for the common case.

After the previous patches, at GRO time, skb->slow_gro is
usually 0, unless the packets comes from some H/W offload
slowpath or tunnel.

We can optimize the GRO code assuming !skb->slow_gro is likely.
This remove multiple conditionals in the most common path, at the
price of an additional one when we hit the above "slow-paths".

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b0999f38 28-Jul-2021 Paolo Abeni <pabeni@redhat.com>

sk_buff: track extension status in slow_gro

Similar to the previous one, but tracking the
active_extensions field status.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5d8dbb7f 28-Jul-2021 Pavel Skripkin <paskripkin@gmail.com>

net: xfrm: fix shift-out-of-bounce

We need to check up->dirmask to avoid shift-out-of-bounce bug,
since up->dirmask comes from userspace.

Also, added XFRM_USERPOLICY_DIRMASK_MAX constant to uapi to inform
user-space that up->dirmask has maximum possible value

Fixes: 2d151d39073a ("xfrm: Add possibility to set the default to block if we have no policy")
Reported-and-tested-by: syzbot+9cd5837a045bbee5b810@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

fc16a532 28-Jul-2021 David S. Miller <davem@davemloft.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2021-07-29

The following pull-request contains BPF updates for your *net* tree.

We've added 9 non-merge commits during the last 14 day(s) which contain
a total of 20 files changed, 446 insertions(+), 138 deletions(-).

The main changes are:

1) Fix UBSAN out-of-bounds splat for showing XDP link fdinfo, from Lorenz Bauer.

2) Fix insufficient Spectre v4 mitigation in BPF runtime, from Daniel Borkmann,
Piotr Krysiuk and Benedict Schlueter.

3) Batch of fixes for BPF sockmap found under stress testing, from John Fastabend.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>


a25fca4d 26-Jul-2021 Tedd Ho-Jeong An <tedd.an@intel.com>

Bluetooth: mgmt: Fix wrong opcode in the response for add_adv cmd

This patch fixes the MGMT add_advertising command repsones with the
wrong opcode when it is trying to return the not supported error.

Fixes: cbbdfa6f33198 ("Bluetooth: Enable controller RPA resolution using Experimental feature")
Signed-off-by: Tedd Ho-Jeong An <tedd.an@intel.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

58ce6d5b 26-Jul-2021 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Bluetooth: defer cleanup of resources in hci_unregister_dev()

syzbot is hitting might_sleep() warning at hci_sock_dev_event()
due to calling lock_sock() with rw spinlock held [1].

It seems that history of this locking problem is a trial and error.

Commit b40df5743ee8aed8 ("[PATCH] bluetooth: fix socket locking in
hci_sock_dev_event()") in 2.6.21-rc4 changed bh_lock_sock() to lock_sock()
as an attempt to fix lockdep warning.

Then, commit 4ce61d1c7a8ef4c1 ("[BLUETOOTH]: Fix locking in
hci_sock_dev_event().") in 2.6.22-rc2 changed lock_sock() to
local_bh_disable() + bh_lock_sock_nested() as an attempt to fix
sleep in atomic context warning.

Then, commit 4b5dd696f81b210c ("Bluetooth: Remove local_bh_disable() from
hci_sock.c") in 3.3-rc1 removed local_bh_disable().

Then, commit e305509e678b3a4a ("Bluetooth: use correct lock to prevent UAF
of hdev object") in 5.13-rc5 again changed bh_lock_sock_nested() to
lock_sock() as an attempt to fix CVE-2021-3573.

This difficulty comes from current implementation that
hci_sock_dev_event(HCI_DEV_UNREG) is responsible for dropping all
references from sockets because hci_unregister_dev() immediately reclaims
resources as soon as returning from hci_sock_dev_event(HCI_DEV_UNREG).
But the history suggests that hci_sock_dev_event(HCI_DEV_UNREG) was not
doing what it should do.

Therefore, instead of trying to detach sockets from device, let's accept
not detaching sockets from device at hci_sock_dev_event(HCI_DEV_UNREG),
by moving actual cleanup of resources from hci_unregister_dev() to
hci_release_dev() which is called by bt_host_release when all references
to this unregistered device (which is a kobject) are gone.

Link: https://syzkaller.appspot.com/bug?extid=a5df189917e79d5e59c9 [1]
Reported-by: syzbot <syzbot+a5df189917e79d5e59c9@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: syzbot <syzbot+a5df189917e79d5e59c9@syzkaller.appspotmail.com>
Fixes: e305509e678b3a4a ("Bluetooth: use correct lock to prevent UAF of hdev object")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

52e4bec1 28-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: bridge: switchdev: treat local FDBs the same as entries towards the bridge

Currently the following script:

1. ip link add br0 type bridge vlan_filtering 1 && ip link set br0 up
2. ip link set swp2 up && ip link set swp2 master br0
3. ip link set swp3 up && ip link set swp3 master br0
4. ip link set swp4 up && ip link set swp4 master br0
5. bridge vlan del dev swp2 vid 1
6. bridge vlan del dev swp3 vid 1
7. ip link set swp4 nomaster
8. ip link set swp3 nomaster

produces the following output:

[ 641.010738] sja1105 spi0.1: port 2 failed to delete 00:1f:7b:63:02:48 vid 1 from fdb: -2

[ swp2, swp3 and br0 all have the same MAC address, the one listed above ]

In short, this happens because the number of FDB entry additions
notified to switchdev is unbalanced with the number of deletions.

At step 1, the bridge has a random MAC address. At step 2, the
br_fdb_replay of swp2 receives this initial MAC address. Then the bridge
inherits the MAC address of swp2 via br_fdb_change_mac_address(), and it
notifies switchdev (only swp2 at this point) of the deletion of the
random MAC address and the addition of 00:1f:7b:63:02:48 as a local FDB
entry with fdb->dst == swp2, in VLANs 0 and the default_pvid (1).

During step 7:

del_nbp
-> br_fdb_delete_by_port(br, p, vid=0, do_all=1);
-> fdb_delete_local(br, p, f);

br_fdb_delete_by_port() deletes all entries towards the ports,
regardless of vid, because do_all is 1.

fdb_delete_local() has logic to migrate local FDB entries deleted from
one port to another port which shares the same MAC address and is in the
same VLAN, or to the bridge device itself. This migration happens
without notifying switchdev of the deletion on the old port and the
addition on the new one, just fdb->dst is changed and the added_by_user
flag is cleared.

In the example above, the del_nbp(swp4) causes the
"addr 00:1f:7b:63:02:48 vid 1" local FDB entry with fdb->dst == swp4
that existed up until then to be migrated directly towards the bridge
(fdb->dst == NULL). This is because it cannot be migrated to any of the
other ports (swp2 and swp3 are not in VLAN 1).

After the migration to br0 takes place, swp4 requests a deletion replay
of all FDB entries. Since the "addr 00:1f:7b:63:02:48 vid 1" entry now
point towards the bridge, a deletion of it is replayed. There was just
a prior addition of this address, so the switchdev driver deletes this
entry.

Then, the del_nbp(swp3) at step 8 triggers another br_fdb_replay, and
switchdev is notified again to delete "addr 00:1f:7b:63:02:48 vid 1".
But it can't because it no longer has it, so it returns -ENOENT.

There are other possibilities to trigger this issue, but this is by far
the simplest to explain.

To fix this, we must avoid the situation where the addition of an FDB
entry is notified to switchdev as a local entry on a port, and the
deletion is notified on the bridge itself.

Considering that the 2 types of FDB entries are completely equivalent
and we cannot have the same MAC address as a local entry on 2 bridge
ports, or on a bridge port and pointing towards the bridge at the same
time, it makes sense to hide away from switchdev completely the fact
that a local FDB entry is associated with a given bridge port at all.
Just say that it points towards the bridge, it should make no difference
whatsoever to the switchdev driver and should even lead to a simpler
overall implementation, will less cases to handle.

This also avoids any modification at all to the core bridge driver, just
what is reported to switchdev changes. With the local/permanent entries
on bridge ports being already reported to user space, it is hard to
believe that the bridge behavior can change in any backwards-incompatible
way such as making all local FDB entries point towards the bridge.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b4454bc6 28-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: bridge: switchdev: replay the entire FDB for each port

Currently when a switchdev port joins a bridge, we replay all FDB
entries pointing towards that port or towards the bridge.

However, this is insufficient in certain situations:

(a) DSA, through its assisted_learning_on_cpu_port logic, snoops
dynamically learned FDB entries on foreign interfaces.
These are FDB entries that are pointing neither towards the newly
joined switchdev port, nor towards the bridge. So these addresses
would be missed when joining a bridge where a foreign interface has
already learned some addresses, and they would also linger on if the
DSA port leaves the bridge before the foreign interface forgets them.
None of this happens if we replay the entire FDB when the port joins.

(b) There is a desire to treat local FDB entries on a port (i.e. the
port's termination MAC address) identically to FDB entries pointing
towards the bridge itself. More details on the reason behind this in
the next patch. The point is that this cannot be done given the
current structure of br_fdb_replay() in this situation:
ip link set swp0 master br0 # br0 inherits its MAC address from swp0
ip link set swp1 master br0
What is desirable is that when swp1 joins the bridge, br_fdb_replay()
also notifies swp1 of br0's MAC address, but this won't in fact
happen because the MAC address of br0 does not have fdb->dst == NULL
(it doesn't point towards the bridge), but it has fdb->dst == swp0.
So our current logic makes it impossible for that address to be
replayed. But if we dump the entire FDB instead of just the entries
with fdb->dst == swp1 and fdb->dst == NULL, then the inherited MAC
address of br0 will be replayed too, which is what we need.

A natural question arises: say there is an FDB entry to be replayed,
like a MAC address dynamically learned on a foreign interface that
belongs to a bridge where no switchdev port has joined yet. If 10
switchdev ports belonging to the same driver join this bridge, one by
one, won't every port get notified 10 times of the foreign FDB entry,
amounting to a total of 100 notifications for this FDB entry in the
switchdev driver?

Well, yes, but this is where the "void *ctx" argument for br_fdb_replay
is useful: every port of the switchdev driver is notified whenever any
other port requests an FDB replay, but because the replay was initiated
by a different port, its context is different from the initiating port's
context, so it ignores those replays.

So the foreign FDB entry will be installed only 10 times, once per port.
This is done so that the following 4 code paths are always well balanced:
(a) addition of foreign FDB entry is replayed when port joins bridge
(b) deletion of foreign FDB entry is replayed when port leaves bridge
(c) addition of foreign FDB entry is notified to all ports currently in bridge
(c) deletion of foreign FDB entry is notified to all ports currently in bridge

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cbcf0112 28-Jul-2021 Miklos Szeredi <mszeredi@redhat.com>

af_unix: fix garbage collect vs MSG_PEEK

unix_gc() assumes that candidate sockets can never gain an external
reference (i.e. be installed into an fd) while the unix_gc_lock is
held. Except for MSG_PEEK this is guaranteed by modifying inflight
count under the unix_gc_lock.

MSG_PEEK does not touch any variable protected by unix_gc_lock (file
count is not), yet it needs to be serialized with garbage collection.
Do this by locking/unlocking unix_gc_lock:

1) increment file count

2) lock/unlock barrier to make sure incremented file count is visible
to garbage collection

3) install file into fd

This is a lock barrier (unlike smp_mb()) that ensures that garbage
collection is run completely before or completely after the barrier.

Cc: <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

56af5e74 27-Jul-2021 Peilin Ye <peilin.ye@bytedance.com>

net/sched: act_skbmod: Add SKBMOD_F_ECN option support

Currently, when doing rate limiting using the tc-police(8) action, the
easiest way is to simply drop the packets which exceed or conform the
configured bandwidth limit. Add a new option to tc-skbmod(8), so that
users may use the ECN [1] extension to explicitly inform the receiver
about the congestion instead of dropping packets "on the floor".

The 2 least significant bits of the Traffic Class field in IPv4 and IPv6
headers are used to represent different ECN states [2]:

0b00: "Non ECN-Capable Transport", Non-ECT
0b10: "ECN Capable Transport", ECT(0)
0b01: "ECN Capable Transport", ECT(1)
0b11: "Congestion Encountered", CE

As an example:

$ tc filter add dev eth0 parent 1: protocol ip prio 10 \
matchall action skbmod ecn

Doing the above marks all ECT(0) and ECT(1) packets as CE. It does NOT
affect Non-ECT or non-IP packets. In the tc-police scenario mentioned
above, users may pipe a tc-police action and a tc-skbmod "ecn" action
together to achieve ECN-based rate limiting.

For TCP connections, upon receiving a CE packet, the receiver will respond
with an ECE packet, asking the sender to reduce their congestion window.
However ECN also works with other L4 protocols e.g. DCCP and SCTP [2], and
our implementation does not touch or care about L4 headers.

The updated tc-skbmod SYNOPSIS looks like the following:

tc ... action skbmod { set SETTABLE | swap SWAPPABLE | ecn } ...

Only one of "set", "swap" or "ecn" shall be used in a single tc-skbmod
command. Trying to use more than one of them at a time is considered
undefined behavior; pipe multiple tc-skbmod commands together instead.
"set" and "swap" only affect Ethernet packets, while "ecn" only affects
IPv{4,6} packets.

It is also worth mentioning that, in theory, the same effect could be
achieved by piping a "police" action and a "bpf" action using the
bpf_skb_ecn_set_ce() helper, but this requires eBPF programming from the
user, thus impractical.

Depends on patch "net/sched: act_skbmod: Skip non-Ethernet packets".

[1] https://datatracker.ietf.org/doc/html/rfc3168
[2] https://en.wikipedia.org/wiki/Explicit_Congestion_Notification

Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1e60cebf 28-Jul-2021 zhang kai <zhangkaiheb@126.com>

net: let flow have same hash in two directions

using same source and destination ip/port for flow hash calculation
within the two directions.

Signed-off-by: zhang kai <zhangkaiheb@126.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d7907a2b 28-Jul-2021 Leon Romanovsky <leon@kernel.org>

devlink: Remove duplicated registration check

Both registered flag and devlink pointer are set at the same time
and indicate the same thing - devlink/devlink_port are ready. Instead
of checking ->registered use devlink pointer as an indication.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

557fb586 27-Jul-2021 Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

sctp: fix return value check in __sctp_rcv_asconf_lookup

As Ben Hutchings noticed, this check should have been inverted: the call
returns true in case of success.

Reported-by: Ben Hutchings <ben@decadent.org.uk>
Fixes: 0c5dc070ff3d ("sctp: validate from_addr_param return")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9635720b 27-Jul-2021 John Fastabend <john.fastabend@gmail.com>

bpf, sockmap: Fix memleak on ingress msg enqueue

If backlog handler is running during a tear down operation we may enqueue
data on the ingress msg queue while tear down is trying to free it.

sk_psock_backlog()
sk_psock_handle_skb()
skb_psock_skb_ingress()
sk_psock_skb_ingress_enqueue()
sk_psock_queue_msg(psock,msg)
spin_lock(ingress_lock)
sk_psock_zap_ingress()
_sk_psock_purge_ingerss_msg()
_sk_psock_purge_ingress_msg()
-- free ingress_msg list --
spin_unlock(ingress_lock)
spin_lock(ingress_lock)
list_add_tail(msg,ingress_msg) <- entry on list with no one
left to free it.
spin_unlock(ingress_lock)

To fix we only enqueue from backlog if the ENABLED bit is set. The tear
down logic clears the bit with ingress_lock set so we wont enqueue the
msg in the last step.

Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210727160500.1713554-4-john.fastabend@gmail.com

476d9801 27-Jul-2021 John Fastabend <john.fastabend@gmail.com>

bpf, sockmap: On cleanup we additionally need to remove cached skb

Its possible if a socket is closed and the receive thread is under memory
pressure it may have cached a skb. We need to ensure these skbs are
free'd along with the normal ingress_skb queue.

Before 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()") tear
down and backlog processing both had sock_lock for the common case of
socket close or unhash. So it was not possible to have both running in
parrallel so all we would need is the kfree in those kernels.

But, latest kernels include the commit 799aa7f98d5e and this requires a
bit more work. Without the ingress_lock guarding reading/writing the
state->skb case its possible the tear down could run before the state
update causing it to leak memory or worse when the backlog reads the state
it could potentially run interleaved with the tear down and we might end up
free'ing the state->skb from tear down side but already have the reference
from backlog side. To resolve such races we wrap accesses in ingress_lock
on both sides serializing tear down and backlog case. In both cases this
only happens after an EAGAIN error case so having an extra lock in place
is likely fine. The normal path will skip the locks.

Note, we check state->skb before grabbing lock. This works because
we can only enqueue with the mutex we hold already. Avoiding a race
on adding state->skb after the check. And if tear down path is running
that is also fine if the tear down path then removes state->skb we
will simply set skb=NULL and the subsequent goto is skipped. This
slight complication avoids locking in normal case.

With this fix we no longer see this warning splat from tcp side on
socket close when we hit the above case with redirect to ingress self.

[224913.935822] WARNING: CPU: 3 PID: 32100 at net/core/stream.c:208 sk_stream_kill_queues+0x212/0x220
[224913.935841] Modules linked in: fuse overlay bpf_preload x86_pkg_temp_thermal intel_uncore wmi_bmof squashfs sch_fq_codel efivarfs ip_tables x_tables uas xhci_pci ixgbe mdio xfrm_algo xhci_hcd wmi
[224913.935897] CPU: 3 PID: 32100 Comm: fgs-bench Tainted: G I 5.14.0-rc1alu+ #181
[224913.935908] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
[224913.935914] RIP: 0010:sk_stream_kill_queues+0x212/0x220
[224913.935923] Code: 8b 83 20 02 00 00 85 c0 75 20 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 df e8 2b 11 fe ff eb c3 0f 0b e9 7c ff ff ff 0f 0b eb ce <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 0f 1f 44 00 00 41 57 41
[224913.935932] RSP: 0018:ffff88816271fd38 EFLAGS: 00010206
[224913.935941] RAX: 0000000000000ae8 RBX: ffff88815acd5240 RCX: dffffc0000000000
[224913.935948] RDX: 0000000000000003 RSI: 0000000000000ae8 RDI: ffff88815acd5460
[224913.935954] RBP: ffff88815acd5460 R08: ffffffff955c0ae8 R09: fffffbfff2e6f543
[224913.935961] R10: ffffffff9737aa17 R11: fffffbfff2e6f542 R12: ffff88815acd5390
[224913.935967] R13: ffff88815acd5480 R14: ffffffff98d0c080 R15: ffffffff96267500
[224913.935974] FS: 00007f86e6bd1700(0000) GS:ffff888451cc0000(0000) knlGS:0000000000000000
[224913.935981] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[224913.935988] CR2: 000000c0008eb000 CR3: 00000001020e0005 CR4: 00000000003706e0
[224913.935994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[224913.936000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[224913.936007] Call Trace:
[224913.936016] inet_csk_destroy_sock+0xba/0x1f0
[224913.936033] __tcp_close+0x620/0x790
[224913.936047] tcp_close+0x20/0x80
[224913.936056] inet_release+0x8f/0xf0
[224913.936070] __sock_release+0x72/0x120
[224913.936083] sock_close+0x14/0x20

Fixes: a136678c0bdbb ("bpf: sk_msg, zap ingress queue on psock down")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210727160500.1713554-3-john.fastabend@gmail.com

343597d5 27-Jul-2021 John Fastabend <john.fastabend@gmail.com>

bpf, sockmap: Zap ingress queues after stopping strparser

We don't want strparser to run and pass skbs into skmsg handlers when
the psock is null. We just sk_drop them in this case. When removing
a live socket from map it means extra drops that we do not need to
incur. Move the zap below strparser close to avoid this condition.

This way we stop the stream parser first stopping it from processing
packets and then delete the psock.

Fixes: a136678c0bdbb ("bpf: sk_msg, zap ingress queue on psock down")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210727160500.1713554-2-john.fastabend@gmail.com

8ca34a13 27-Jul-2021 Pavel Skripkin <paskripkin@gmail.com>

net: cipso: fix warnings in netlbl_cipsov4_add_std

Syzbot reported warning in netlbl_cipsov4_add(). The
problem was in too big doi_def->map.std->lvl.local_size
passed to kcalloc(). Since this value comes from userpace there is
no need to warn if value is not correct.

The same problem may occur with other kcalloc() calls in
this function, so, I've added __GFP_NOWARN flag to all
kcalloc() calls there.

Reported-and-tested-by: syzbot+cdd51ee2e6b0b2e18c0d@syzkaller.appspotmail.com
Fixes: 96cb8e3313c7 ("[NetLabel]: CIPSOv4 and Unlabeled packet integration")
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3d9d00bd 27-Jul-2021 Arnd Bergmann <arnd@arndb.de>

net: bonding: move ioctl handling to private ndo operation

All other user triggered operations are gone from ndo_ioctl, so move
the SIOCBOND family into a custom operation as well.

The .ndo_ioctl() helper is no longer called by the dev_ioctl.c code now,
but there are still a few definitions in obsolete wireless drivers as well
as the appletalk and ieee802154 layers to call SIOCSIFADDR/SIOCGIFADDR
helpers from inside the kernel.

Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

ad2f99ae 27-Jul-2021 Arnd Bergmann <arnd@arndb.de>

net: bridge: move bridge ioctls out of .ndo_do_ioctl

Working towards obsoleting the .ndo_do_ioctl operation entirely,
stop passing the SIOCBRADDIF/SIOCBRDELIF device ioctl commands
into this callback.

My first attempt was to add another ndo_siocbr() callback, but
as there is only a single driver that takes these commands and
there is already a hook mechanism to call directly into this
driver, extend this hook instead, and use it for both the
deviceless and the device specific ioctl commands.

Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Nikolay Aleksandrov <nikolay@nvidia.com>
Cc: bridge@lists.linux-foundation.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

88fc023f 27-Jul-2021 Arnd Bergmann <arnd@arndb.de>

net: socket: return changed ifreq from SIOCDEVPRIVATE

Some drivers that use SIOCDEVPRIVATE ioctl commands modify
the ifreq structure and expect it to be passed back to user
space, which has never really happened for compat mode
because the calling these drivers through ndo_do_ioctl
requires overwriting the ifr_data pointer.

Now that all drivers are converted to ndo_siocdevprivate,
change it to handle this correctly in both compat and
native mode.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

ad7eab2a 27-Jul-2021 Arnd Bergmann <arnd@arndb.de>

net: split out ndo_siowandev ioctl

In order to further reduce the scope of ndo_do_ioctl(), move
out the SIOCWANDEV handling into a new network device operation
function.

Adjust the prototype to only pass the if_settings sub-structure
in place of the ifreq, and remove the redundant 'cmd' argument
in the process.

Cc: Krzysztof Halasa <khc@pm.waw.pl>
Cc: "Jan \"Yenya\" Kasprzak" <kas@fi.muni.cz>
Cc: Kevin Curtis <kevin.curtis@farsite.co.uk>
Cc: Zhao Qiang <qiang.zhao@nxp.com>
Cc: Martin Schiller <ms@dev.tdt.de>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: linux-x25@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

a7605370 27-Jul-2021 Arnd Bergmann <arnd@arndb.de>

dev_ioctl: split out ndo_eth_ioctl

Most users of ndo_do_ioctl are ethernet drivers that implement
the MII commands SIOCGMIIPHY/SIOCGMIIREG/SIOCSMIIREG, or hardware
timestamping with SIOCSHWTSTAMP/SIOCGHWTSTAMP.

Separate these from the few drivers that use ndo_do_ioctl to
implement SIOCBOND, SIOCBR and SIOCWANDEV commands.

This is a purely cosmetic change intended to help readers find
their way through the implementation.

Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Vivien Didelot <vivien.didelot@gmail.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Vladimir Oltean <olteanv@gmail.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a554bf96 27-Jul-2021 Arnd Bergmann <arnd@arndb.de>

dev_ioctl: pass SIOCDEVPRIVATE data separately

The compat handlers for SIOCDEVPRIVATE are incorrect for any driver that
passes data as part of struct ifreq rather than as an ifr_data pointer, or
that passes data back this way, since the compat_ifr_data_ioctl() helper
overwrites the ifr_data pointer and does not copy anything back out.

Since all drivers using devprivate commands are now converted to the
new .ndo_siocdevprivate callback, fix this by adding the missing piece
and passing the pointer separately the whole way.

This further unifies the native and compat logic for socket ioctls,
as the new code now passes the correct pointer as well as the correct
data for both native and compat ioctls.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

3e7a1c7c 27-Jul-2021 Arnd Bergmann <arnd@arndb.de>

ip_tunnel: use ndo_siocdevprivate

The various ipv4 and ipv6 tunnel drivers each implement a set
of 12 SIOCDEVPRIVATE commands for managing tunnels. These
all work correctly in compat mode.

Move them over to the new .ndo_siocdevprivate operation.

Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

4747c1a8 27-Jul-2021 Arnd Bergmann <arnd@arndb.de>

phonet: use siocdevprivate

phonet has a single private ioctl that is broken in compat
mode on big-endian machines today because the data returned
from it is never copied back to user space.

Move it over to the ndo_siocdevprivate callback, which also
fixes the compat issue.

Cc: Remi Denis-Courmont <courmisch@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Rémi Denis-Courmont <courmisch@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

561d8352 27-Jul-2021 Arnd Bergmann <arnd@arndb.de>

bridge: use ndo_siocdevprivate

The bridge driver has an old set of ioctls using the SIOCDEVPRIVATE
namespace that have never worked in compat mode and are explicitly
forbidden already.

Move them over to ndo_siocdevprivate and fix compat mode for these,
because we can.

Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Nikolay Aleksandrov <nikolay@nvidia.com>
Cc: bridge@lists.linux-foundation.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

b9067f5d 27-Jul-2021 Arnd Bergmann <arnd@arndb.de>

net: split out SIOCDEVPRIVATE handling from dev_ioctl

SIOCDEVPRIVATE ioctl commands are mainly used in really old
drivers, and they have a number of problems:

- They hide behind the normal .ndo_do_ioctl function that
is also used for other things in modern drivers, so it's
hard to spot a driver that actually uses one of these

- Since drivers use a number different calling conventions,
it is impossible to support compat mode for them in
a generic way.

- With all drivers using the same 16 commands codes, there
is no way to introspect the data being passed through
things like strace.

Add a new net_device_ops callback pointer, to address the
first two of these. Separating them from .ndo_do_ioctl
makes it easy to grep for drivers with a .ndo_siocdevprivate
callback, and the unwieldy name hopefully makes it easier
to spot in code review.

By passing the ifreq structure and the ifr_data pointer
separately, it is no longer necessary to overload these,
and the driver can use either one for a given command.

Cc: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

a657db03 27-Jul-2021 Neal Cardwell <ncardwell@google.com>

tcp: more accurately check DSACKs to grow RACK reordering window

Previously, a DSACK could expand the RACK reordering window when no
reordering has been seen, and/or when the DSACK was due to an
unnecessary TLP retransmit (rather than a spurious fast recovery due
to reordering). This could result in unnecessarily growing the RACK
reordering window and thus unnecessarily delaying RACK-based fast
recovery episodes.

To avoid these issues, this commit tightens the conditions under which
a DSACK triggers the RACK reordering window to grow, so that a
connection only expands its RACK reordering window if:

(a) reordering has been seen in the connection
(b) a DSACKed range does not match the most recent TLP retransmit

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

63f367d9 27-Jul-2021 Yuchung Cheng <ycheng@google.com>

tcp: more accurately detect spurious TLP probes

Previously TLP is considered spurious if the sender receives any
DSACK during a TLP episode. This patch further checks the DSACK
sequences match the TLP's to improve accuracy.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

732b33d0 26-Jul-2021 Harshvardhan Jha <harshvardhan.jha@oracle.com>

9p/xen: Fix end of loop tests for list_for_each_entry

This patch addresses the following problems:
- priv can never be NULL, so this part of the check is useless
- if the loop ran through the whole list, priv->client is invalid and
it is more appropriate and sufficient to check for the end of
list_for_each_entry loop condition.

Link: http://lkml.kernel.org/r/20210727000709.225032-1-harshvardhan.jha@oracle.com
Signed-off-by: Harshvardhan Jha <harshvardhan.jha@oracle.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Tested-by: Stefano Stabellini <sstabellini@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>

f997ea3b 17-May-2021 Xie Yongji <xieyongji@bytedance.com>

9p/trans_virtio: Remove sysfs file on probe failure

This ensures we don't leak the sysfs file if we failed to
allocate chan->vc_wq during probe.

Link: http://lkml.kernel.org/r/20210517083557.172-1-xieyongji@bytedance.com
Fixes: 86c8437383ac ("net/9p: Add sysfs mount_tag file for virtio 9P device")
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>

c7c9d210 24-Jul-2021 Pavel Skripkin <paskripkin@gmail.com>

net: llc: fix skb_over_panic

Syzbot reported skb_over_panic() in llc_pdu_init_as_xid_cmd(). The
problem was in wrong LCC header manipulations.

Syzbot's reproducer tries to send XID packet. llc_ui_sendmsg() is
doing following steps:

1. skb allocation with size = len + header size
len is passed from userpace and header size
is 3 since addr->sllc_xid is set.

2. skb_reserve() for header_len = 3
3. filling all other space with memcpy_from_msg()

Ok, at this moment we have fully loaded skb, only headers needs to be
filled.

Then code comes to llc_sap_action_send_xid_c(). This function pushes 3
bytes for LLC PDU header and initializes it. Then comes
llc_pdu_init_as_xid_cmd(). It initalizes next 3 bytes *AFTER* LLC PDU
header and call skb_push(skb, 3). This looks wrong for 2 reasons:

1. Bytes rigth after LLC header are user data, so this function
was overwriting payload.

2. skb_push(skb, 3) call can cause skb_over_panic() since
all free space was filled in llc_ui_sendmsg(). (This can
happen is user passed 686 len: 686 + 14 (eth header) + 3 (LLC
header) = 703. SKB_DATA_ALIGN(703) = 704)

So, in this patch I added 2 new private constansts: LLC_PDU_TYPE_U_XID
and LLC_PDU_LEN_U_XID. LLC_PDU_LEN_U_XID is used to correctly reserve
header size to handle LLC + XID case. LLC_PDU_TYPE_U_XID is used by
llc_pdu_header_init() function to push 6 bytes instead of 3. And finally
I removed skb_push() call from llc_pdu_init_as_xid_cmd().

This changes should not affect other parts of LLC, since after
all steps we just transmit buffer.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-and-tested-by: syzbot+5e5a981ad7cc54c4b2b4@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

076999e4 23-Jul-2021 Mark Gray <mark.d.gray@redhat.com>

openvswitch: fix sparse warning incorrect type

fix incorrect type in argument 1 (different address spaces)

../net/openvswitch/datapath.c:169:17: warning: incorrect type in argument 1 (different address spaces)
../net/openvswitch/datapath.c:169:17: expected void const *
../net/openvswitch/datapath.c:169:17: got struct dp_nlsk_pids [noderef] __rcu *upcall_portids

Found at: https://patchwork.kernel.org/project/netdevbpf/patch/20210630095350.817785-1-mark.d.gray@redhat.com/#24285159

Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

784dcfa5 23-Jul-2021 Mark Gray <mark.d.gray@redhat.com>

openvswitch: fix alignment issues

Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f9b282b3 26-Jul-2021 Yajun Deng <yajun.deng@linux.dev>

net: netlink: add the case when nlh is NULL

Add the case when nlh is NULL in nlmsg_report(),
so that the caller doesn't need to deal with this case.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>

9f90a4dd 23-Jul-2021 Jiri Slaby <jirislaby@kernel.org>

tty: drop put_tty_driver

put_tty_driver() is an alias for tty_driver_kref_put(). There is no need
for two exported identical functions, therefore switch all users of
old put_tty_driver() to new tty_driver_kref_put() and remove the former
for good.

Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Samuel Iglesias Gonsalvez <siglesias@igalia.com>
Cc: Jens Taprogge <jens.taprogge@taprogge.org>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: Scott Branden <scott.branden@broadcom.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David Lin <dtwlin@gmail.com>
Cc: Johan Hovold <johan@kernel.org>
Cc: Alex Elder <elder@kernel.org>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Laurentiu Tudor <laurentiu.tudor@nxp.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: David Sterba <dsterba@suse.com>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Pengutronix Kernel Team <kernel@pengutronix.de>
Cc: Fabio Estevam <festevam@gmail.com>
Cc: NXP Linux Team <linux-imx@nxp.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Mathias Nyman <mathias.nyman@intel.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Acked-by: Alex Elder <elder@linaro.org>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Link: https://lore.kernel.org/r/20210723074317.32690-8-jslaby@suse.cz
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

39b7b42b 23-Jul-2021 Jiri Slaby <jirislaby@kernel.org>

tty: stop using alloc_tty_driver

alloc_tty_driver was deprecated by tty_alloc_driver in commit
7f0bc6a68ed9 (TTY: pass flags to alloc_tty_driver) in 2012.

I never got into eliminating alloc_tty_driver until now. So we still
have two functions for allocating drivers which might be confusing. So
get rid of alloc_tty_driver uses to eliminate it for good in the next
patch.

Note we need to switch return value checking as tty_alloc_driver uses
ERR_PTR. And flags are now a parameter of tty_alloc_driver.

Cc: Richard Henderson <rth@twiddle.net>(odd fixer:ALPHA PORT)
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Samuel Iglesias Gonsalvez <siglesias@igalia.com>
Cc: Jens Taprogge <jens.taprogge@taprogge.org>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Laurentiu Tudor <laurentiu.tudor@nxp.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: David Sterba <dsterba@suse.com>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Johan Hovold <johan@kernel.org>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Acked-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Link: https://lore.kernel.org/r/20210723074317.32690-5-jslaby@suse.cz
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

9b29a161 26-Jul-2021 Saeed Mahameed <saeedm@nvidia.com>

ethtool: Fix rxnfc copy to user buffer overflow

In the cited commit, copy_to_user() got called with the wrong pointer,
instead of passing the actual buffer ptr to copy from, a pointer to
the pointer got passed, which causes a buffer overflow calltrace to pop
up when executing "ethtool -x ethX".

Fix ethtool_rxnfc_copy_to_user() to use the rxnfc pointer as passed
to the function, instead of a pointer to it.

This fixes below call trace:
[ 15.533533] ------------[ cut here ]------------
[ 15.539007] Buffer overflow detected (8 < 192)!
[ 15.544110] WARNING: CPU: 3 PID: 1801 at include/linux/thread_info.h:200 copy_overflow+0x15/0x20
[ 15.549308] Modules linked in:
[ 15.551449] CPU: 3 PID: 1801 Comm: ethtool Not tainted 5.14.0-rc2+ #1058
[ 15.553919] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 15.558378] RIP: 0010:copy_overflow+0x15/0x20
[ 15.560648] Code: e9 7c ff ff ff b8 a1 ff ff ff eb c4 66 0f 1f 84 00 00 00 00 00 55 48 89 f2 89 fe 48 c7 c7 88 55 78 8a 48 89 e5 e8 06 5c 1e 00 <0f> 0b 5d c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 55
[ 15.565114] RSP: 0018:ffffad49c0523bd0 EFLAGS: 00010286
[ 15.566231] RAX: 0000000000000000 RBX: 00000000000000c0 RCX: 0000000000000000
[ 15.567616] RDX: 0000000000000001 RSI: ffffffff8a7912e7 RDI: 00000000ffffffff
[ 15.569050] RBP: ffffad49c0523bd0 R08: ffffffff8ab2ae28 R09: 00000000ffffdfff
[ 15.570534] R10: ffffffff8aa4ae40 R11: ffffffff8aa4ae40 R12: 0000000000000000
[ 15.571899] R13: 00007ffd4cc2a230 R14: ffffad49c0523c00 R15: 0000000000000000
[ 15.573584] FS: 00007f538112f740(0000) GS:ffff96d5bdd80000(0000) knlGS:0000000000000000
[ 15.575639] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 15.577092] CR2: 00007f5381226d40 CR3: 0000000013542000 CR4: 00000000001506e0
[ 15.578929] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 15.580695] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 15.582441] Call Trace:
[ 15.582970] ethtool_rxnfc_copy_to_user+0x30/0x46
[ 15.583815] ethtool_get_rxnfc.cold+0x23/0x2b
[ 15.584584] dev_ethtool+0x29c/0x25f0
[ 15.585286] ? security_netlbl_sid_to_secattr+0x77/0xd0
[ 15.586728] ? do_set_pte+0xc4/0x110
[ 15.587349] ? _raw_spin_unlock+0x18/0x30
[ 15.588118] ? __might_sleep+0x49/0x80
[ 15.588956] dev_ioctl+0x2c1/0x490
[ 15.589616] sock_ioctl+0x18e/0x330
[ 15.591143] __x64_sys_ioctl+0x41c/0x990
[ 15.591823] ? irqentry_exit_to_user_mode+0x9/0x20
[ 15.592657] ? irqentry_exit+0x33/0x40
[ 15.593308] ? exc_page_fault+0x32f/0x770
[ 15.593877] ? exit_to_user_mode_prepare+0x3c/0x130
[ 15.594775] do_syscall_64+0x35/0x80
[ 15.595397] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 15.596037] RIP: 0033:0x7f5381226d4b
[ 15.596492] Code: 0f 1e fa 48 8b 05 3d b1 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0d b1 0c 00 f7 d8 64 89 01 48
[ 15.598743] RSP: 002b:00007ffd4cc2a1f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 15.599804] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5381226d4b
[ 15.600795] RDX: 00007ffd4cc2a350 RSI: 0000000000008946 RDI: 0000000000000003
[ 15.601712] RBP: 00007ffd4cc2a340 R08: 00007ffd4cc2a350 R09: 0000000000000001
[ 15.602751] R10: 00007f538128a990 R11: 0000000000000246 R12: 0000000000000000
[ 15.603882] R13: 00007ffd4cc2a350 R14: 00007ffd4cc2a4b0 R15: 0000000000000000
[ 15.605042] ---[ end trace 325cf185e2795048 ]---

Fixes: dd98d2895de6 ("ethtool: improve compat ioctl handling")
Reported-by: Shannon Nelson <snelson@pensando.io>
CC: Arnd Bergmann <arnd@arndb.de>
CC: Christoph Hellwig <hch@lst.de>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Tested-by: Shannon Nelson <snelson@pensando.io>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

323e0cb4 26-Jul-2021 Gustavo A. R. Silva <gustavoars@kernel.org>

flow_dissector: Fix out-of-bounds warnings

Fix the following out-of-bounds warnings:

net/core/flow_dissector.c: In function '__skb_flow_dissect':
>> net/core/flow_dissector.c:1104:4: warning: 'memcpy' offset [24, 39] from the object at '<unknown>' is out of the bounds of referenced subobject 'saddr' with type 'struct in6_addr' at offset 8 [-Warray-bounds]
1104 | memcpy(&key_addrs->v6addrs, &iph->saddr,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1105 | sizeof(key_addrs->v6addrs));
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from include/linux/ipv6.h:5,
from net/core/flow_dissector.c:6:
include/uapi/linux/ipv6.h:133:18: note: subobject 'saddr' declared here
133 | struct in6_addr saddr;
| ^~~~~
>> net/core/flow_dissector.c:1059:4: warning: 'memcpy' offset [16, 19] from the object at '<unknown>' is out of the bounds of referenced subobject 'saddr' with type 'unsigned int' at offset 12 [-Warray-bounds]
1059 | memcpy(&key_addrs->v4addrs, &iph->saddr,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1060 | sizeof(key_addrs->v4addrs));
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from include/linux/ip.h:17,
from net/core/flow_dissector.c:5:
include/uapi/linux/ip.h:103:9: note: subobject 'saddr' declared here
103 | __be32 saddr;
| ^~~~~

The problem is that the original code is trying to copy data into a
couple of struct members adjacent to each other in a single call to
memcpy(). So, the compiler legitimately complains about it. As these
are just a couple of members, fix this by copying each one of them in
separate calls to memcpy().

This helps with the ongoing efforts to globally enable -Warray-bounds
and get us closer to being able to tighten the FORTIFY_SOURCE routines
on memcpy().

Link: https://github.com/KSPP/linux/issues/109
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/d5ae2e65-1f18-2577-246f-bada7eee6ccd@intel.com/
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

6321c7ac 26-Jul-2021 Gustavo A. R. Silva <gustavoars@kernel.org>

ipv4: ip_output.c: Fix out-of-bounds warning in ip_copy_addrs()

Fix the following out-of-bounds warning:

In function 'ip_copy_addrs',
inlined from '__ip_queue_xmit' at net/ipv4/ip_output.c:517:2:
net/ipv4/ip_output.c:449:2: warning: 'memcpy' offset [40, 43] from the object at 'fl' is out of the bounds of referenced subobject 'saddr' with type 'unsigned int' at offset 36 [-Warray-bounds]
449 | memcpy(&iph->saddr, &fl4->saddr,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
450 | sizeof(fl4->saddr) + sizeof(fl4->daddr));
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The problem is that the original code is trying to copy data into a
couple of struct members adjacent to each other in a single call to
memcpy(). This causes a legitimate compiler warning because memcpy()
overruns the length of &iph->saddr and &fl4->saddr. As these are just
a couple of struct members, fix this by using direct assignments,
instead of memcpy().

This helps with the ongoing efforts to globally enable -Warray-bounds
and get us closer to being able to tighten the FORTIFY_SOURCE routines
on memcpy().

Link: https://github.com/KSPP/linux/issues/109
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/d5ae2e65-1f18-2577-246f-bada7eee6ccd@intel.com/
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

edac6f63 26-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

Revert "net: dsa: Allow drivers to filter packets they can decode source port from"

This reverts commit cc1939e4b3aaf534fb2f3706820012036825731c.

Currently 2 classes of DSA drivers are able to send/receive packets
directly through the DSA master:
- drivers with DSA_TAG_PROTO_NONE
- sja1105

Now that sja1105 has gained the ability to perform traffic termination
even under the tricky case (VLAN-aware bridge), and that is much more
functional (we can perform VLAN-aware bridging with foreign interfaces),
there is no reason to keep this code in the receive path of the network
core. So delete it.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b6ad86e6 26-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: sja1105: add bridge TX data plane offload based on tag_8021q

The main desire for having this feature in sja1105 is to support network
stack termination for traffic coming from a VLAN-aware bridge.

For sja1105, offloading the bridge data plane means sending packets
as-is, with the proper VLAN tag, to the chip. The chip will look up its
FDB and forward them to the correct destination port.

But we support bridge data plane offload even for VLAN-unaware bridges,
and the implementation there is different. In fact, VLAN-unaware
bridging is governed by tag_8021q, so it makes sense to have the
.bridge_fwd_offload_add() implementation fully within tag_8021q.
The key difference is that we only support 1 VLAN-aware bridge, but we
support multiple VLAN-unaware bridges. So we need to make sure that the
forwarding domain is not crossed by packets injected from the stack.

For this, we introduce the concept of a tag_8021q TX VLAN for bridge
forwarding offload. As opposed to the regular TX VLANs which contain
only 2 ports (the user port and the CPU port), a bridge data plane TX
VLAN is "multicast" (or "imprecise"): it contains all the ports that are
part of a certain bridge, and the hardware will select where the packet
goes within this "imprecise" forwarding domain.

Each VLAN-unaware bridge has its own "imprecise" TX VLAN, so we make use
of the unique "bridge_num" provided by DSA for the data plane offload.
We use the same 3 bits from the tag_8021q VLAN ID format to encode this
bridge number.

Note that these 3 bit positions have been used before for sub-VLANs in
best-effort VLAN filtering mode. The difference is that for best-effort,
the sub-VLANs were only valid on RX (and it was documented that the
sub-VLAN field needed to be transmitted as zero). Whereas for the bridge
data plane offload, these 3 bits are only valid on TX.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

884be12f 26-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: sja1105: add support for imprecise RX

This is already common knowledge by now, but the sja1105 does not have
hardware support for DSA tagging for data plane packets, and tag_8021q
sets up a unique pvid per port, transmitted as VLAN-tagged towards the
CPU, for the source port to be decoded nonetheless.

When the port is part of a VLAN-aware bridge, the pvid committed to
hardware is taken from the bridge and not from tag_8021q, so we need to
work with that the best we can.

Configure the switches to send all packets to the CPU as VLAN-tagged
(even ones that were originally untagged on the wire) and make use of
dsa_untag_bridge_pvid() to get rid of it before we send those packets up
the network stack.

With the classified VLAN used by hardware known to the tagger, we first
peek at the VID in an attempt to figure out if the packet was received
from a VLAN-unaware port (standalone or under a VLAN-unaware bridge),
case in which we can continue to call dsa_8021q_rcv(). If that is not
the case, the packet probably came from a VLAN-aware bridge. So we call
the DSA helper that finds for us a "designated bridge port" - one that
is a member of the VLAN ID from the packet, and is in the proper STP
state - basically these are all checks performed by br_handle_frame() in
the software RX data path.

The bridge will accept the packet as valid even if the source port was
maybe wrong. So it will maybe learn the MAC SA of the packet on the
wrong port, and its software FDB will be out of sync with the hardware
FDB. So replies towards this same MAC DA will not work, because the
bridge will send towards a different netdev.

This is where the bridge data plane offload ("imprecise TX") added by
the next patch comes in handy. The software FDB is wrong, true, but the
hardware FDB isn't, and by offloading the bridge forwarding plane we
have a chance to right a wrong, and have the hardware look up the FDB
for us for the reply packet. So it all cancels out.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ee80dd2e 26-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: bridge: add a helper for retrieving port VLANs from the data path

Introduce a brother of br_vlan_get_info() which is protected by the RCU
mechanism, as opposed to br_vlan_get_info() which relies on taking the
write-side rtnl_mutex.

This is needed for drivers which need to find out whether a bridge port
has a VLAN configured or not. For example, certain DSA switches might
not offer complete source port identification to the CPU on RX, just the
VLAN in which the packet was received. Based on this VLAN, we cannot set
an accurate skb->dev ingress port, but at least we can configure one
that behaves the same as the correct one would (this is possible because
DSA sets skb->offload_fwd_mark = 1).

When we look at the bridge RX handler (br_handle_frame), we see that
what matters regarding skb->dev is the VLAN ID and the port STP state.
So we need to select an skb->dev that has the same bridge VLAN as the
packet we're receiving, and is in the LEARNING or FORWARDING STP state.
The latter is easy, but for the former, we should somehow keep a shadow
list of the bridge VLANs on each port, and a lookup table between VLAN
ID and the 'designated port for imprecise RX'. That is rather
complicated to keep in sync properly (the designated port per VLAN needs
to be updated on the addition and removal of a VLAN, as well as on the
join/leave events of the bridge on that port).

So, to avoid all that complexity, let's just iterate through our finite
number of ports and ask the bridge, for each packet: "do you have this
VLAN configured on this port?".

Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Nikolay Aleksandrov <nikolay@nvidia.com>
Cc: Ido Schimmel <idosch@nvidia.com>
Cc: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f7cdb3ec 26-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: bridge: update BROPT_VLAN_ENABLED before notifying switchdev in br_vlan_filter_toggle

SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING is notified by the bridge from
two places:
- nbp_vlan_init(), during bridge port creation
- br_vlan_filter_toggle(), during a netlink/sysfs/ioctl change requested
by user space

If a switchdev driver uses br_vlan_enabled(br_dev) inside its handler
for the SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING attribute notifier,
different things will be seen depending on whether the bridge calls from
the first path or the second:
- in nbp_vlan_init(), br_vlan_enabled() reflects the current state of
the bridge
- in br_vlan_filter_toggle(), br_vlan_enabled() reflects the past state
of the bridge

This can lead in some cases to complications in driver implementation,
which can be avoided if these could reliably use br_vlan_enabled().

Nothing seems to depend on this behavior, and it seems overall more
straightforward for br_vlan_enabled() to return the proper value even
during the SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING notifier, so
temporarily enable the bridge option, then revert it if the switchdev
notifier failed.

Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Nikolay Aleksandrov <nikolay@nvidia.com>
Cc: Ido Schimmel <idosch@nvidia.com>
Cc: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d20e5880 26-Jul-2021 David S. Miller <davem@davemloft.net>

Merge tag 'linux-can-next-for-5.15-20210725' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

linux-can-next-for-5.15-20210725

Marc Kleine-Budde says:

====================
pull-request: can-next 2021-07-25

this is a pull request of 46 patches for net-next/master.

The first 6 patches target the CAN J1939 protocol. One is from
gushengxian, fixing a grammatical error, 5 are by me fixing a checkpatch
warning, make use of the fallthrough pseudo-keyword, and use
consistent variable naming.

The next 3 patches target the rx-offload helper, are by me and improve
the performance and fix the local softirq work pending error, when
napi_schedule() is called from threaded IRQ context.

The next 3 patches are by Vincent Mailhol and me update the CAN
bittiming and transmitter delay compensation, the documentation for
the struct can_tdc is fixed, clear data_bittiming if FD mode is turned
off and a redundant check is removed.

Followed by 4 patches targeting the m_can driver. Faiz Abbas's patches
add support for CAN PHY via the generic phy subsystem. Yang Yingliang
converts the driver to use devm_platform_ioremap_resource_byname().
And a patch by me which removes the unused support for custom bit
timing.

Andy Shevchenko contributes 2 patches for the mcp251xfd driver to
prepare the driver for ACPI support. A patch by me adds support for
shared IRQ handlers.

Zhen Lei contributes 3 patches to convert the esd_usb2, janz-ican3 and
the at91_can driver to make use of the DEVICE_ATTR_RO/RW() macros.

The next 8 patches are by Peng Li and provide general cleanups for the
at91_can driver.

The next 7 patches target the peak driver. Frist 2 cleanup patches by
me for the peak_pci driver, followed by Stephane Grosjean' patch to
print the name and firmware version of the detected hardware. The
peak_usb driver gets a cleanup patch, loopback and one-shot mode and
an upgrading of the bus state change handling in Stephane Grosjean's
patches.

Vincent Mailhol provides 6 cleanup patches for the etas_es58x driver.

In the last 3 patches Angelo Dureghello add support for the mcf5441x
SoC to the flexcan driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>


2ebda027 25-Jul-2021 Chen Shen <peterchenshen@gmail.com>

sctp: delete addr based on sin6_scope_id

sctp_inet6addr_event deletes 'addr' from 'local_addr_list' when setting
netdev down, but it is possible to delete the incorrect entry (match
the first one with the same ipaddr, but the different 'ifindex'), if
there are some netdevs with the same 'local-link' ipaddr added already.
It should delete the entry depending on 'sin6_addr' and 'sin6_scope_id'
both. otherwise, the endpoint will call 'sctp_sf_ootb' if it can't find
the according association when receives 'heartbeat', and finally will
reply 'abort'.

For example:
1.when linux startup
the entries in local_addr_list:
ifindex:35 addr:fe80::40:43ff:fe80:0 (eths0.201)
ifindex:36 addr:fe80::40:43ff:fe80:0 (eths0.209)
ifindex:37 addr:fe80::40:43ff:fe80:0 (eths0.210)

the route table:
local fe80::40:43ff:fe80:0 dev eths0.201
local fe80::40:43ff:fe80:0 dev eths0.209
local fe80::40:43ff:fe80:0 dev eths0.210

2.after 'ifconfig eths0.209 down'
the entries in local_addr_list:
ifindex:36 addr:fe80::40:43ff:fe80:0 (eths0.209)
ifindex:37 addr:fe80::40:43ff:fe80:0 (eths0.210)

the route table:
local fe80::40:43ff:fe80:0 dev eths0.201
local fe80::40:43ff:fe80:0 dev eths0.210

3.asoc not found for src:[fe80::40:43ff:fe80:0]:37381 dst:[:1]:53335
::1->fe80::40:43ff:fe80:0 HEARTBEAT
fe80::40:43ff:fe80:0->::1 ABORT

Signed-off-by: Chen Shen <peterchenshen@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

480e93e1 25-Jul-2021 Harshvardhan Jha <harshvardhan.jha@oracle.com>

net: xfrm: Fix end of loop tests for list_for_each_entry

The list_for_each_entry() iterator, "pos" in this code, can never be
NULL so the warning will never be printed.

Signed-off-by: Harshvardhan Jha <harshvardhan.jha@oracle.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

eacf078c 25-Jul-2021 Xin Long <lucien.xin@gmail.com>

sctp: send pmtu probe only if packet loss in Search Complete state

This patch is to introduce last_rtx_chunks into sctp_transport to detect
if there's any packet retransmission/loss happened by checking against
asoc's rtx_data_chunks in sctp_transport_pl_send().

If there is, namely, transport->last_rtx_chunks != asoc->rtx_data_chunks,
the pmtu probe will be sent out. Otherwise, increment the pl.raise_count
and return when it's in Search Complete state.

With this patch, if in Search Complete state, which is a long period, it
doesn't need to keep probing the current pmtu unless there's data packet
loss. This will save quite some traffic.

v1->v2:
- add the missing Fixes tag.

Fixes: 0dac127c0557 ("sctp: do black hole detection in search complete state")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

058e6e0e 25-Jul-2021 Xin Long <lucien.xin@gmail.com>

sctp: improve the code for pmtu probe send and recv update

This patch does 3 things:

- make sctp_transport_pl_send() and sctp_transport_pl_recv()
return bool type to decide if more probe is needed to send.

- pr_debug() only when probe is really needed to send.

- count pl.raise_count in sctp_transport_pl_send() instead of
sctp_transport_pl_recv(), and it's only incremented for the
1st probe for the same size.

These are preparations for the next patch to make probes happen
only when there's packet loss in Search Complete state.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

149ea30f 23-Jul-2021 Parav Pandit <parav@nvidia.com>

devlink: Fix phys_port_name of virtual port and merge error

Merge commit cited in fixes tag was incorrect. Due to it phys_port_name
of the virtual port resulted in incorrect name.

Also the phys_port_name of the physical port was written twice due to
the merge error.

Fix it by removing the old code and inserting back the misplaced code.

Related commits of interest in net and net-next branches that resulted
in merge conflict are:

in net-next branch:
commit f285f37cb1e6 ("devlink: append split port number to the port name")

in net branch:
commit b28d8f0c25a9 ("devlink: Correct VIRTUAL port to not have phys_port attributes")

Fixes: 126285651b7 ("Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reported-by: Niklas Schnelle <schnelle@linux.ibm.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cc19862f 23-Jul-2021 Xin Long <lucien.xin@gmail.com>

tipc: fix an use-after-free issue in tipc_recvmsg

syzbot reported an use-after-free crash:

BUG: KASAN: use-after-free in tipc_recvmsg+0xf77/0xf90 net/tipc/socket.c:1979
Call Trace:
tipc_recvmsg+0xf77/0xf90 net/tipc/socket.c:1979
sock_recvmsg_nosec net/socket.c:943 [inline]
sock_recvmsg net/socket.c:961 [inline]
sock_recvmsg+0xca/0x110 net/socket.c:957
tipc_conn_rcv_from_sock+0x162/0x2f0 net/tipc/topsrv.c:398
tipc_conn_recv_work+0xeb/0x190 net/tipc/topsrv.c:421
process_one_work+0x98d/0x1630 kernel/workqueue.c:2276
worker_thread+0x658/0x11f0 kernel/workqueue.c:2422

As Hoang pointed out, it was caused by skb_cb->bytes_read still accessed
after calling tsk_advance_rx_queue() to free the skb in tipc_recvmsg().

This patch is to fix it by accessing skb_cb->bytes_read earlier than
calling tsk_advance_rx_queue().

Fixes: f4919ff59c28 ("tipc: keep the skb in rcv queue until the whole data is read")
Reported-by: syzbot+e6741b97d5552f97c24d@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a08ec5fe 21-May-2021 Marc Kleine-Budde <mkl@pengutronix.de>

can: j1939: j1939_xtp_rx_dat_one(): use separate pointer for session skb control buffer

In the j1939_xtp_rx_dat_one() function, there are 2 variables (skb and
se_skb) holding a skb. The control buffer of the skbs is accessed one
after the other, but using the same "skcb" variable.

To avoid confusion introduce a new variable "se_skcb" to access the
se_skb's control buffer as done in the rest of this file, too.

Cc: Robin van der Gracht <robin@protonic.nl>
Cc: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20210616102811.2449426-6-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

78b77c76 21-May-2021 Marc Kleine-Budde <mkl@pengutronix.de>

can: j1939: j1939_session_tx_dat(): use consistent name se_skcb for session skb control buffer

This patch changes the name of the "skcb" variable in
j1939_session_tx_dat() to "se_skcb" as it's the session skb's control
buffer. The same name is used in other functions for the session skb's
control buffer.

Cc: Robin van der Gracht <robin@protonic.nl>
Cc: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20210616102811.2449426-5-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

7ac56e40 21-May-2021 Marc Kleine-Budde <mkl@pengutronix.de>

can: j1939: j1939_session_completed(): use consistent name se_skb for the session skb

This patch changes the name of the "skb" variable in
j1939_session_completed() to "se_skb" as it's the session skb. The
same name is used in other functions for the session skb.

Cc: Robin van der Gracht <robin@protonic.nl>
Cc: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20210616102811.2449426-4-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

641ba6de 16-Jun-2021 Marc Kleine-Budde <mkl@pengutronix.de>

can: j1939: replace fall through comment by fallthrough pseudo-keyword

Replace the existing /* fall through */ comments the new
pseudo-keyword macro fallthrough.

Cc: Robin van der Gracht <robin@protonic.nl>
Cc: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20210616102811.2449426-3-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

33312873 16-Jun-2021 Marc Kleine-Budde <mkl@pengutronix.de>

can: j1939: fix checkpatch warnings

This patch fixes a checkpatch warning about a long line and wrong
indention.

Cc: Robin van der Gracht <robin@protonic.nl>
Cc: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20210616102811.2449426-2-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

04bdec2b 10-Jun-2021 gushengxian <gushengxian@yulong.com>

can: j1939: j1939_sk_sock_destruct(): correct a grammatical error

Correct a grammatical error.

Link: https://lore.kernel.org/r/20210611043933.17047-1-13145886936@163.com
Signed-off-by: gushengxian <gushengxian@yulong.com>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

7186aac9 24-Jul-2021 Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

nfc: constify nfc_digital_ops

Neither the core nor the drivers modify the passed pointer to struct
nfc_digital_ops, so make it a pointer to const for correctness and safety.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

49545357 24-Jul-2021 Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

nfc: constify nfc_llc_ops

Neither the core nor the drivers modify the passed pointer to struct
nfc_llc_ops, so make it a pointer to const for correctness and safety.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

094c45c8 24-Jul-2021 Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

nfc: constify nfc_hci_ops

Neither the core nor the drivers modify the passed pointer to struct
nfc_hci_ops, so make it a pointer to const for correctness and safety.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f6c802a7 24-Jul-2021 Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

nfc: constify nfc_ops

Neither the core nor the drivers modify the passed pointer to struct
nfc_ops, so make it a pointer to const for correctness and safety.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5f3e6393 24-Jul-2021 Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

nfc: constify nfc_hci_gate

Neither the core nor the drivers modify the passed pointer to struct
nfc_hci_gate, so make it a pointer to const for correctness and safety.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

15944ad2 24-Jul-2021 Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

nfc: constify pointer to nfc_vendor_cmd

Neither the core nor the drivers modify the passed pointer to struct
nfc_vendor_cmd, so make it a pointer to const for correctness and
safety.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cb8caa3c 24-Jul-2021 Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

nfc: constify nci_driver_ops (prop_ops and core_ops)

Neither the core nor the drivers modify the passed pointer to struct
nci_driver_ops (consisting of function pointers), so make it a pointer
to const for correctness and safety.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b9c28286 24-Jul-2021 Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

nfc: constify nci_ops

The struct nci_ops is modified by NFC core in only one case:
nci_allocate_device() receives too many proprietary commands (prop_ops)
to configure. This is a build time known constrain, so a graceful
handling of such case is not necessary.

Instead, fail the nci_allocate_device() and add BUILD_BUG_ON() to places
which set these.

This allows to constify the struct nci_ops (consisting of function
pointers) for correctness and safety.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

48d54403 24-Jul-2021 Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

nfc: constify payload argument in nci_send_cmd()

The nci_send_cmd() payload argument is passed directly to skb_put_data()
which already accepts a pointer to const, so make it const as well for
correctness and safety.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c5381154 23-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: bridge: fix build when setting skb->offload_fwd_mark with CONFIG_NET_SWITCHDEV=n

Switchdev support can be disabled at compile time, and in that case,
struct sk_buff will not contain the offload_fwd_mark field.

To make the code in br_forward.c work in both cases, we do what is done
in other places and we create a helper function, with an empty shim
definition, that is implemented by the br_switchdev.o translation module.
This is always compiled if and only if CONFIG_NET_SWITCHDEV is y or m.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: 472111920f1c ("net: bridge: switchdev: allow the TX data plane forwarding to be offloaded")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3cf4375a 23-Jul-2021 Xin Long <lucien.xin@gmail.com>

tipc: do not write skb_shinfo frags when doing decrytion

One skb's skb_shinfo frags are not writable, and they can be shared with
other skbs' like by pskb_copy(). To write the frags may cause other skb's
data crash.

So before doing en/decryption, skb_cow_data() should always be called for
a cloned or nonlinear skb if req dst is using the same sg as req src.
While at it, the likely branch can be removed, as it will be covered
by skb_cow_data().

Note that esp_input() has the same issue, and I will fix it in another
patch. tipc_aead_encrypt() doesn't have this issue, as it only processes
linear data in the unlikely branch.

Fixes: fc1b6d6de220 ("tipc: introduce TIPC encryption & authentication")
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c6eea1c8 06-Jul-2021 Zhang Changzhong <zhangchangzhong@huawei.com>

can: j1939: j1939_xtp_rx_dat_one(): fix rxtimer value between consecutive TP.DT to 750ms

For receive side, the max time interval between two consecutive TP.DT
should be 750ms.

Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Link: https://lore.kernel.org/r/1625569210-47506-1-git-send-email-zhangchangzhong@huawei.com
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

0c71437d 14-Jul-2021 Oleksij Rempel <linux@rempel-privat.de>

can: j1939: j1939_session_deactivate(): clarify lifetime of session object

The j1939_session_deactivate() is decrementing the session ref-count and
potentially can free() the session. This would cause use-after-free
situation.

However, the code calling j1939_session_deactivate() does always hold
another reference to the session, so that it would not be free()ed in
this code path.

This patch adds a comment to make this clear and a WARN_ON, to ensure
that future changes will not violate this requirement. Further this
patch avoids dereferencing the session pointer as a precaution to avoid
use-after-free if the session is actually free()ed.

Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Link: https://lore.kernel.org/r/20210714111602.24021-1-o.rempel@pengutronix.de
Reported-by: Xiaochen Zou <xzou017@ucr.edu>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

54f93336 22-Jul-2021 Ziyang Xuan <william.xuanziyang@huawei.com>

can: raw: raw_setsockopt(): fix raw_rcv panic for sock UAF

We get a bug during ltp can_filter test as following.

===========================================
[60919.264984] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[60919.265223] PGD 8000003dda726067 P4D 8000003dda726067 PUD 3dda727067 PMD 0
[60919.265443] Oops: 0000 [#1] SMP PTI
[60919.265550] CPU: 30 PID: 3638365 Comm: can_filter Kdump: loaded Tainted: G W 4.19.90+ #1
[60919.266068] RIP: 0010:selinux_socket_sock_rcv_skb+0x3e/0x200
[60919.293289] RSP: 0018:ffff8d53bfc03cf8 EFLAGS: 00010246
[60919.307140] RAX: 0000000000000000 RBX: 000000000000001d RCX: 0000000000000007
[60919.320756] RDX: 0000000000000001 RSI: ffff8d5104a8ed00 RDI: ffff8d53bfc03d30
[60919.334319] RBP: ffff8d9338056800 R08: ffff8d53bfc29d80 R09: 0000000000000001
[60919.347969] R10: ffff8d53bfc03ec0 R11: ffffb8526ef47c98 R12: ffff8d53bfc03d30
[60919.350320] perf: interrupt took too long (3063 > 2500), lowering kernel.perf_event_max_sample_rate to 65000
[60919.361148] R13: 0000000000000001 R14: ffff8d53bcf90000 R15: 0000000000000000
[60919.361151] FS: 00007fb78b6b3600(0000) GS:ffff8d53bfc00000(0000) knlGS:0000000000000000
[60919.400812] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[60919.413730] CR2: 0000000000000010 CR3: 0000003e3f784006 CR4: 00000000007606e0
[60919.426479] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[60919.439339] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[60919.451608] PKRU: 55555554
[60919.463622] Call Trace:
[60919.475617] <IRQ>
[60919.487122] ? update_load_avg+0x89/0x5d0
[60919.498478] ? update_load_avg+0x89/0x5d0
[60919.509822] ? account_entity_enqueue+0xc5/0xf0
[60919.520709] security_sock_rcv_skb+0x2a/0x40
[60919.531413] sk_filter_trim_cap+0x47/0x1b0
[60919.542178] ? kmem_cache_alloc+0x38/0x1b0
[60919.552444] sock_queue_rcv_skb+0x17/0x30
[60919.562477] raw_rcv+0x110/0x190 [can_raw]
[60919.572539] can_rcv_filter+0xbc/0x1b0 [can]
[60919.582173] can_receive+0x6b/0xb0 [can]
[60919.591595] can_rcv+0x31/0x70 [can]
[60919.600783] __netif_receive_skb_one_core+0x5a/0x80
[60919.609864] process_backlog+0x9b/0x150
[60919.618691] net_rx_action+0x156/0x400
[60919.627310] ? sched_clock_cpu+0xc/0xa0
[60919.635714] __do_softirq+0xe8/0x2e9
[60919.644161] do_softirq_own_stack+0x2a/0x40
[60919.652154] </IRQ>
[60919.659899] do_softirq.part.17+0x4f/0x60
[60919.667475] __local_bh_enable_ip+0x60/0x70
[60919.675089] __dev_queue_xmit+0x539/0x920
[60919.682267] ? finish_wait+0x80/0x80
[60919.689218] ? finish_wait+0x80/0x80
[60919.695886] ? sock_alloc_send_pskb+0x211/0x230
[60919.702395] ? can_send+0xe5/0x1f0 [can]
[60919.708882] can_send+0xe5/0x1f0 [can]
[60919.715037] raw_sendmsg+0x16d/0x268 [can_raw]

It's because raw_setsockopt() concurrently with
unregister_netdevice_many(). Concurrent scenario as following.

cpu0 cpu1
raw_bind
raw_setsockopt unregister_netdevice_many
unlist_netdevice
dev_get_by_index raw_notifier
raw_enable_filters ......
can_rx_register
can_rcv_list_find(..., net->can.rx_alldev_list)

......

sock_close
raw_release(sock_a)

......

can_receive
can_rcv_filter(net->can.rx_alldev_list, ...)
raw_rcv(skb, sock_a)
BUG

After unlist_netdevice(), dev_get_by_index() return NULL in
raw_setsockopt(). Function raw_enable_filters() will add sock
and can_filter to net->can.rx_alldev_list. Then the sock is closed.
Followed by, we sock_sendmsg() to a new vcan device use the same
can_filter. Protocol stack match the old receiver whose sock has
been released on net->can.rx_alldev_list in can_rcv_filter().
Function raw_rcv() uses the freed sock. UAF BUG is triggered.

We can find that the key issue is that net_device has not been
protected in raw_setsockopt(). Use rtnl_lock to protect net_device
in raw_setsockopt().

Fixes: c18ce101f2e4 ("[CAN]: Add raw protocol")
Link: https://lore.kernel.org/r/20210722070819.1048263-1-william.xuanziyang@huawei.com
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

3cee6fb8 01-Jul-2021 Martin KaFai Lau <kafai@fb.com>

bpf: tcp: Support bpf_(get|set)sockopt in bpf tcp iter

This patch allows bpf tcp iter to call bpf_(get|set)sockopt.
To allow a specific bpf iter (tcp here) to call a set of helpers,
get_func_proto function pointer is added to bpf_iter_reg.
The bpf iter is a tracing prog which currently requires
CAP_PERFMON or CAP_SYS_ADMIN, so this patch does not
impose other capability checks for bpf_(get|set)sockopt.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210701200619.1036715-1-kafai@fb.com

04c7820b 01-Jul-2021 Martin KaFai Lau <kafai@fb.com>

bpf: tcp: Bpf iter batching and lock_sock

This patch does batching and lock_sock for the bpf tcp iter.
It does not affect the proc fs iteration.

With bpf-tcp-cc, new algo rollout happens more often. Instead of
restarting the application to pick up the new tcp-cc, the next patch
will allow bpf iter to do setsockopt(TCP_CONGESTION). This requires
locking the sock.

Also, unlike the proc iteration (cat /proc/net/tcp[6]), the bpf iter
can inspect all fields of a tcp_sock. It will be useful to have a
consistent view on some of the fields (e.g. the ones reported in
tcp_get_info() that also acquires the sock lock).

Double lock: locking the bucket first and then locking the sock could
lead to deadlock. This patch takes a batching approach similar to
inet_diag. While holding the bucket lock, it batch a number of sockets
into an array first and then unlock the bucket. Before doing show(),
it then calls lock_sock_fast().

In a machine with ~400k connections, the maximum number of
sk in a bucket of the established hashtable is 7. 0.02% of
the established connections fall into this bucket size.

For listen hash (port+addr lhash2), the bucket is usually very
small also except for the SO_REUSEPORT use case which the
userspace could have one SO_REUSEPORT socket per thread.

While batching is used, it can also minimize the chance of missing
sock in the setsockopt use case if the whole bucket is batched.
This patch will start with a batch array with INIT_BATCH_SZ (16)
which will be enough for the most common cases. bpf_iter_tcp_batch()
will try to realloc to a larger array to handle exception case (e.g.
the SO_REUSEPORT case in the lhash2).

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com

05c0b357 01-Jul-2021 Martin KaFai Lau <kafai@fb.com>

tcp: seq_file: Replace listening_hash with lhash2

This patch moves the tcp seq_file iteration on listeners
from the port only listening_hash to the port+addr lhash2.

When iterating from the bpf iter, the next patch will need to
lock the socket such that the bpf iter can call setsockopt (e.g. to
change the TCP_CONGESTION). To avoid locking the bucket and then locking
the sock, the bpf iter will first batch some sockets from the same bucket
and then unlock the bucket. If the bucket size is small (which
usually is), it is easier to batch the whole bucket such that it is less
likely to miss a setsockopt on a socket due to changes in the bucket.

However, the port only listening_hash could have many listeners
hashed to a bucket (e.g. many individual VIP(s):443 and also
multiple by the number of SO_REUSEPORT). We have seen bucket size in
tens of thousands range. Also, the chance of having changes
in some popular port buckets (e.g. 443) is also high.

The port+addr lhash2 was introduced to solve this large listener bucket
issue. Also, the listening_hash usage has already been replaced with
lhash2 in the fast path inet[6]_lookup_listener(). This patch follows
the same direction on moving to lhash2 and iterates the lhash2
instead of listening_hash.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210701200606.1035783-1-kafai@fb.com

b72acf45 01-Jul-2021 Martin KaFai Lau <kafai@fb.com>

tcp: seq_file: Add listening_get_first()

The current listening_get_next() is overloaded by passing
NULL to the 2nd arg, like listening_get_next(seq, NULL), to
mean get_first().

This patch moves some logic from the listening_get_next() into
a new function listening_get_first(). It will be equivalent
to the current established_get_first() and established_get_next()
setup. get_first() is to find a non empty bucket and return
the first sk. get_next() is to find the next sk of the current
bucket and then resorts to get_first() if the current bucket is
exhausted.

The next patch is to move the listener seq_file iteration from
listening_hash (port only) to lhash2 (port+addr).
Separating out listening_get_first() from listening_get_next()
here will make the following lhash2 changes cleaner and easier to
follow.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210701200600.1035353-1-kafai@fb.com

62001372 01-Jul-2021 Martin KaFai Lau <kafai@fb.com>

bpf: tcp: seq_file: Remove bpf_seq_afinfo from tcp_iter_state

A following patch will create a separate struct to store extra
bpf_iter state and it will embed the existing tcp_iter_state like this:
struct bpf_tcp_iter_state {
struct tcp_iter_state state;
/* More bpf_iter specific states here ... */
}

As a prep work, this patch removes the
"struct tcp_seq_afinfo *bpf_seq_afinfo" where its purpose is
to tell if it is iterating from bpf_iter instead of proc fs.
Currently, if "*bpf_seq_afinfo" is not NULL, it is iterating from
bpf_iter. The kernel should not filter by the addr family and
leave this filtering decision to the bpf prog.

Instead of adding a "*bpf_seq_afinfo" pointer, this patch uses the
"seq->op == &bpf_iter_tcp_seq_ops" test to tell if it is iterating
from the bpf iter.

The bpf_iter_(init|fini)_tcp() is left here to prepare for
the change of a following patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210701200554.1034982-1-kafai@fb.com

ad2d6137 01-Jul-2021 Martin KaFai Lau <kafai@fb.com>

tcp: seq_file: Refactor net and family matching

This patch refactors the net and family matching into
two new helpers, seq_sk_match() and seq_file_family().

seq_file_family() is in the later part of the file to prepare
the change of a following patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210701200548.1034629-1-kafai@fb.com

525e2f9f 01-Jul-2021 Martin KaFai Lau <kafai@fb.com>

tcp: seq_file: Avoid skipping sk during tcp_seek_last_pos

st->bucket stores the current bucket number.
st->offset stores the offset within this bucket that is the sk to be
seq_show(). Thus, st->offset only makes sense within the same
st->bucket.

These two variables are an optimization for the common no-lseek case.
When resuming the seq_file iteration (i.e. seq_start()),
tcp_seek_last_pos() tries to continue from the st->offset
at bucket st->bucket.

However, it is possible that the bucket pointed by st->bucket
has changed and st->offset may end up skipping the whole st->bucket
without finding a sk. In this case, tcp_seek_last_pos() currently
continues to satisfy the offset condition in the next (and incorrect)
bucket. Instead, regardless of the offset value, the first sk of the
next bucket should be returned. Thus, "bucket == st->bucket" check is
added to tcp_seek_last_pos().

The chance of hitting this is small and the issue is a decade old,
so targeting for the next tree.

Fixes: a8b690f98baf ("tcp: Fix slowness in read /proc/net/tcp")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210701200541.1033917-1-kafai@fb.com

1f22cf13 23-Jul-2021 David S. Miller <davem@davemloft.net>

Merge tag 'mac80211-for-net-2021-07-23' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Couple of fixes:
* fix aggregation on mesh
* fix late enabling of 4-addr mode
* leave monitor SKBs with some headroom
* limit band information for old applications
* fix virt-wifi WARN_ON
* fix memory leak in cfg80211 BSS list maintenance


52f3456a 23-Jul-2021 Pavel Skripkin <paskripkin@gmail.com>

net: qrtr: fix memory leaks

Syzbot reported memory leak in qrtr. The problem was in unputted
struct sock. qrtr_local_enqueue() function calls qrtr_port_lookup()
which takes sock reference if port was found. Then there is the following
check:

if (!ipc || &ipc->sk == skb->sk) {
...
return -ENODEV;
}

Since we should drop the reference before returning from this function and
ipc can be non-NULL inside this if, we should add qrtr_port_put() inside
this if.

The similar corner case is in qrtr_endpoint_post() as Manivannan
reported. In case of sock_queue_rcv_skb() failure we need to put
port reference to avoid leaking struct sock pointer.

Fixes: e04df98adf7d ("net: qrtr: Remove receive worker")
Fixes: bdabad3e363d ("net: Add Qualcomm IPC router")
Reported-and-tested-by: syzbot+35a511c72ea7356cdcf3@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

200bd566 23-Jul-2021 David S. Miller <davem@davemloft.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayusosays:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Memleak in commit audit error path, from Dongliang Mu.

2) Avoid possible false sharing for flowtable timeout updates
and nft_last use.

3) Adjust conntrack timestamp due to garbage collection delay,
from Florian Westphal.

4) Fix nft_nat without layer 3 address for the inet family.

5) Fix compilation warning in nfnl_hook when ingress support
is disabled, from Arnd Bergmann.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>


6a6b83ca 22-Jul-2021 Kangmin Park <l4stpr0gr4m@gmail.com>

mpls: defer ttl decrement in mpls_forward()

Defer ttl decrement to optimize in tx_err case. There is no need
to decrease ttl in the case of goto tx_err.

Signed-off-by: Kangmin Park <l4stpr0gr4m@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

46c7655f 22-Jul-2021 Kangmin Park <l4stpr0gr4m@gmail.com>

ipv6: decrease hop limit counter in ip6_forward()

Decrease hop limit counter when deliver skb to ndp proxy.

Signed-off-by: Kangmin Park <l4stpr0gr4m@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

227adfb2 22-Jul-2021 Gilad Naaman <gnaaman@drivenets.com>

net: Set true network header for ECN decapsulation

In cases where the header straight after the tunnel header was
another ethernet header (TEB), instead of the network header,
the ECN decapsulation code would treat the ethernet header as if
it was an IP header, resulting in mishandling and possible
wrong drops or corruption of the IP header.

In this case, ECT(1) is sent, so IP_ECN_decapsulate tries to copy it to the
inner IPv4 header, and correct its checksum.

The offset of the ECT bits in an IPv4 header corresponds to the
lower 2 bits of the second octet of the destination MAC address
in the ethernet header.
The IPv4 checksum corresponds to end of the source address.

In order to reproduce:

$ ip netns add A
$ ip netns add B
$ ip -n A link add _v0 type veth peer name _v1 netns B
$ ip -n A link set _v0 up
$ ip -n A addr add dev _v0 10.254.3.1/24
$ ip -n A route add default dev _v0 scope global
$ ip -n B link set _v1 up
$ ip -n B addr add dev _v1 10.254.1.6/24
$ ip -n B route add default dev _v1 scope global
$ ip -n B link add gre1 type gretap local 10.254.1.6 remote 10.254.3.1 key 0x49000000
$ ip -n B link set gre1 up

# Now send an IPv4/GRE/Eth/IPv4 frame where the outer header has ECT(1),
# and the inner header has no ECT bits set:

$ cat send_pkt.py
#!/usr/bin/env python3
from scapy.all import *

pkt = IP(b'E\x01\x00\xa7\x00\x00\x00\x00@/`%\n\xfe\x03\x01\n\xfe\x01\x06 \x00eXI\x00'
b'\x00\x00\x18\xbe\x92\xa0\xee&\x18\xb0\x92\xa0l&\x08\x00E\x00\x00}\x8b\x85'
b'@\x00\x01\x01\xe4\xf2\x82\x82\x82\x01\x82\x82\x82\x02\x08\x00d\x11\xa6\xeb'
b'3\x1e\x1e\\xf3\\xf7`\x00\x00\x00\x00ZN\x00\x00\x00\x00\x00\x00\x10\x11\x12'
b'\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./01234'
b'56789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ')

send(pkt)
$ sudo ip netns exec B tcpdump -neqlllvi gre1 icmp & ; sleep 1
$ sudo ip netns exec A python3 send_pkt.py

In the original packet, the source/destinatio MAC addresses are
dst=18:be:92:a0:ee:26 src=18:b0:92:a0:6c:26

In the received packet, they are
dst=18:bd:92:a0:ee:26 src=18:b0:92:a0:6c:27

Thanks to Lahav Schlesinger <lschlesinger@drivenets.com> and Isaac Garzon <isaac@speed.io>
for helping me pinpoint the origin.

Fixes: b723748750ec ("tunnel: Propagate ECT(1) when decapsulating as recommended by RFC6040")
Cc: David S. Miller <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Gilad Naaman <gnaaman@drivenets.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d237a7f1 22-Jul-2021 Hoang Le <hoang.h.le@dektech.com.au>

tipc: fix sleeping in tipc accept routine

The release_sock() is blocking function, it would change the state
after sleeping. In order to evaluate the stated condition outside
the socket lock context, switch to use wait_woken() instead.

Fixes: 6398e23cdb1d8 ("tipc: standardize accept routine")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

f8dd60de 21-Jul-2021 Xin Long <lucien.xin@gmail.com>

tipc: fix implicit-connect for SYN+

For implicit-connect, when it's either SYN- or SYN+, an ACK should
be sent back to the client immediately. It's not appropriate for
the client to enter established state only after receiving data
from the server.

On client side, after the SYN is sent out, tipc_wait_for_connect()
should be called to wait for the ACK if timeout is set.

This patch also restricts __tipc_sendstream() to call __sendmsg()
only when it's in TIPC_OPEN state, so that the client can program
in a single loop doing both connecting and data sending like:

for (...)
sendmsg(dest, buf);

This makes the implicit-connect more implicit.

Fixes: b97bf3fd8f6a ("[TIPC] Initial merge")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d82f8ab0 22-Jul-2021 Tobias Waldekranz <tobias@waldekranz.com>

net: dsa: tag_dsa: offload the bridge forwarding process

Allow the DSA tagger to generate FORWARD frames for offloaded skbs
sent from a bridge that we offload, allowing the switch to handle any
frame replication that may be required. This also means that source
address learning takes place on packets sent from the CPU, meaning
that return traffic no longer needs to be flooded as unknown unicast.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

123abc06 22-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: add support for bridge TX forwarding offload

For a DSA switch, to offload the forwarding process of a bridge device
means to send the packets coming from the software bridge as data plane
packets. This is contrary to everything that DSA has done so far,
because the current taggers only know to send control packets (ones that
target a specific destination port), whereas data plane packets are
supposed to be forwarded according to the FDB lookup, much like packets
ingressing on any regular ingress port. If the FDB lookup process
returns multiple destination ports (flooding, multicast), then
replication is also handled by the switch hardware - the bridge only
sends a single packet and avoids the skb_clone().

DSA keeps for each bridge port a zero-based index (the number of the
bridge). Multiple ports performing TX forwarding offload to the same
bridge have the same dp->bridge_num value, and ports not offloading the
TX data plane of a bridge have dp->bridge_num = -1.

The tagger can check if the packet that is being transmitted on has
skb->offload_fwd_mark = true or not. If it does, it can be sure that the
packet belongs to the data plane of a bridge, further information about
which can be obtained based on dp->bridge_dev and dp->bridge_num.
It can then compose a DSA tag for injecting a data plane packet into
that bridge number.

For the switch driver side, we offer two new dsa_switch_ops methods,
called .port_bridge_fwd_offload_{add,del}, which are modeled after
.port_bridge_{join,leave}.
These methods are provided in case the driver needs to configure the
hardware to treat packets coming from that bridge software interface as
data plane packets. The switchdev <-> bridge interaction happens during
the netdev_master_upper_dev_link() call, so to switch drivers, the
effect is that the .port_bridge_fwd_offload_add() method is called
immediately after .port_bridge_join().

If the bridge number exceeds the number of bridges for which the switch
driver can offload the TX data plane (and this includes the case where
the driver can offload none), DSA falls back to simply returning
tx_fwd_offload = false in the switchdev_bridge_port_offload() call.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5b22d366 22-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: track the number of switches in a tree

In preparation of supporting data plane forwarding on behalf of a
software bridge, some drivers might need to view bridges as virtual
switches behind the CPU port in a cross-chip topology.

Give them some help and let them know how many physical switches there
are in the tree, so that they can count the virtual switches starting
from that number on.

Note that the first dsa_switch_ops method where this information is
reliably available is .setup(). This is because of how DSA works:
in a tree with 3 switches, each calling dsa_register_switch(), the first
2 will advance until dsa_tree_setup() -> dsa_tree_setup_routing_table()
and exit with error code 0 because the topology is not complete. Since
probing is parallel at this point, one switch does not know about the
existence of the other. Then the third switch comes, and for it,
dsa_tree_setup_routing_table() returns complete = true. This switch goes
ahead and calls dsa_tree_setup_switches() for everybody else, calling
their .setup() methods too. This acts as the synchronization point.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

47211192 22-Jul-2021 Tobias Waldekranz <tobias@waldekranz.com>

net: bridge: switchdev: allow the TX data plane forwarding to be offloaded

Allow switchdevs to forward frames from the CPU in accordance with the
bridge configuration in the same way as is done between bridge
ports. This means that the bridge will only send a single skb towards
one of the ports under the switchdev's control, and expects the driver
to deliver the packet to all eligible ports in its domain.

Primarily this improves the performance of multicast flows with
multiple subscribers, as it allows the hardware to perform the frame
replication.

The basic flow between the driver and the bridge is as follows:

- When joining a bridge port, the switchdev driver calls
switchdev_bridge_port_offload() with tx_fwd_offload = true.

- The bridge sends offloadable skbs to one of the ports under the
switchdev's control using skb->offload_fwd_mark = true.

- The switchdev driver checks the skb->offload_fwd_mark field and lets
its FDB lookup select the destination port mask for this packet.

v1->v2:
- convert br_input_skb_cb::fwd_hwdoms to a plain unsigned long
- introduce a static key "br_switchdev_fwd_offload_used" to minimize the
impact of the newly introduced feature on all the setups which don't
have hardware that can make use of it
- introduce a check for nbp->flags & BR_FWD_OFFLOAD to optimize cache
line access
- reorder nbp_switchdev_frame_mark_accel() and br_handle_vlan() in
__br_forward()
- do not strip VLAN on egress if forwarding offload on VLAN-aware bridge
is being used
- propagate errors from .ndo_dfwd_add_station() if not EOPNOTSUPP

v2->v3:
- replace the solution based on .ndo_dfwd_add_station with a solution
based on switchdev_bridge_port_offload
- rename BR_FWD_OFFLOAD to BR_TX_FWD_OFFLOAD
v3->v4: rebase
v4->v5:
- make sure the static key is decremented on bridge port unoffload
- more function and variable renaming and comments for them:
br_switchdev_fwd_offload_used to br_switchdev_tx_fwd_offload
br_switchdev_accels_skb to br_switchdev_frame_uses_tx_fwd_offload
nbp_switchdev_frame_mark_tx_fwd to nbp_switchdev_frame_mark_tx_fwd_to_hwdom
nbp_switchdev_frame_mark_accel to nbp_switchdev_frame_mark_tx_fwd_offload
fwd_accel to tx_fwd_offload

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5af84df9 23-Jul-2021 David S. Miller <davem@davemloft.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Conflicts are simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>


29c49648 22-Jul-2021 Arnd Bergmann <arnd@arndb.de>

net: socket: rework compat_ifreq_ioctl()

compat_ifreq_ioctl() is one of the last users of copy_in_user() and
compat_alloc_user_space(), as it attempts to convert the 'struct ifreq'
arguments from 32-bit to 64-bit format as used by dev_ioctl() and a
couple of socket family specific interpretations.

The current implementation works correctly when calling dev_ioctl(),
inet_ioctl(), ieee802154_sock_ioctl(), atalk_ioctl(), qrtr_ioctl()
and packet_ioctl(). The ioctl handlers for x25, netrom, rose and x25 do
not interpret the arguments and only block the corresponding commands,
so they do not care.

For af_inet6 and af_decnet however, the compat conversion is slightly
incorrect, as it will copy more data than the native handler accesses,
both of them use a structure that is shorter than ifreq.

Replace the copy_in_user() conversion with a pair of accessor functions
to read and write the ifreq data in place with the correct length where
needed, while leaving the other ones to copy the (already compatible)
structures directly.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

876f0bf9 22-Jul-2021 Arnd Bergmann <arnd@arndb.de>

net: socket: simplify dev_ifconf handling

The dev_ifconf() calling conventions make compat handling
more complicated than necessary, simplify this by moving
the in_compat_syscall() check into the function.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

b0e99d03 22-Jul-2021 Arnd Bergmann <arnd@arndb.de>

net: socket: remove register_gifconf

Since dynamic registration of the gifconf() helper is only used for
IPv4, and this can not be in a loadable module, this can be simplified
noticeably by turning it into a direct function call as a preparation
for cleaning up the compat handling.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

709566d7 22-Jul-2021 Arnd Bergmann <arnd@arndb.de>

net: socket: rework SIOC?IFMAP ioctls

SIOCGIFMAP and SIOCSIFMAP currently require compat_alloc_user_space()
and copy_in_user() for compat mode.

Move the compat handling into the location where the structures are
actually used, to avoid using those interfaces and get a clearer
implementation.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

dd98d289 22-Jul-2021 Arnd Bergmann <arnd@arndb.de>

ethtool: improve compat ioctl handling

The ethtool compat ioctl handling is hidden away in net/socket.c,
which introduces a couple of minor oddities:

- The implementation may end up diverging, as seen in the RXNFC
extension in commit 84a1d9c48200 ("net: ethtool: extend RXNFC
API to support RSS spreading of filter matches") that does not work
in compat mode.

- Most architectures do not need the compat handling at all
because u64 and compat_u64 have the same alignment.

- On x86, the conversion is done for both x32 and i386 user space,
but it's actually wrong to do it for x32 and cannot work there.

- On 32-bit Arm, it never worked for compat oabi user space, since
that needs to do the same conversion but does not.

- It would be nice to get rid of both compat_alloc_user_space()
and copy_in_user() throughout the kernel.

None of these actually seems to be a serious problem that real
users are likely to encounter, but fixing all of them actually
leads to code that is both shorter and more readable.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

217e26bd 21-Jul-2021 Arnd Bergmann <arnd@arndb.de>

netfilter: nfnl_hook: fix unused variable warning

The only user of this variable is in an #ifdef:

net/netfilter/nfnetlink_hook.c: In function 'nfnl_hook_entries_head':
net/netfilter/nfnetlink_hook.c:177:28: error: unused variable 'netdev' [-Werror=unused-variable]

Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

a33f387e 20-Jul-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_nat: allow to specify layer 4 protocol NAT only

nft_nat reports a bogus EAFNOSUPPORT if no layer 3 information is specified.

Fixes: d07db9884a5f ("netfilter: nf_tables: introduce nft_validate_register_load()")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

30a56a2b 18-Jul-2021 Florian Westphal <fw@strlen.de>

netfilter: conntrack: adjust stop timestamp to real expiry value

In case the entry is evicted via garbage collection there is
delay between the timeout value and the eviction event.

This adjusts the stop value based on how much time has passed.

Fixes: b87a2f9199ea82 ("netfilter: conntrack: add gc worker to remove timed-out entries")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

32953df7 17-Jul-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_last: avoid possible false sharing

Use the idiom described in:

https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE#it-may-improve-performance

Moreover, prevent a compiler optimization.

Fixes: 836382dc2471 ("netfilter: nf_tables: add last expression")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

32c3973d 17-Jul-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: flowtable: avoid possible false sharing

The flowtable follows the same timeout approach as conntrack, use the
same idiom as in cc16921351d8 ("netfilter: conntrack: avoid same-timeout
update") but also include the fix provided by e37542ba111f ("netfilter:
conntrack: avoid possible false sharing").

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

f9a5c358 28-Jun-2021 Nguyen Dinh Phi <phind.uet@gmail.com>

cfg80211: Fix possible memory leak in function cfg80211_bss_update

When we exceed the limit of BSS entries, this function will free the
new entry, however, at this time, it is the last door to access the
inputed ies, so these ies will be unreferenced objects and cause memory
leak.
Therefore we should free its ies before deallocating the new entry, beside
of dropping it from hidden_list.

Signed-off-by: Nguyen Dinh Phi <phind.uet@gmail.com>
Link: https://lore.kernel.org/r/20210628132334.851095-1-phind.uet@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

0d059964 12-Jul-2021 Johannes Berg <johannes.berg@intel.com>

nl80211: limit band information in non-split data

In non-split data, we shouldn't be adding S1G and 6 GHz
data (or future bands) since we're really close to the
4k message size limit. Remove those bands, any modern
userspace that can use S1G or 6 GHz should already be
using split dumps, and if not then it needs to update.

Link: https://lore.kernel.org/r/20210712215329.31444162a2c2.I5555312e4a074c84f8b4e7ad79dc4d1fbfc5126c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

a5d3cbdb 01-Jul-2021 Felix Fietkau <nbd@nbd.name>

mac80211: fix enabling 4-address mode on a sta vif after assoc

Notify the driver about the 4-address mode change and also send a nulldata
packet to the AP to notify it about the change

Fixes: 1ff4e8f2dec8 ("mac80211: notify the driver when a sta uses 4-address mode")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20210702050111.47546-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

1a791550 29-Jun-2021 Felix Fietkau <nbd@nbd.name>

mac80211: fix starting aggregation sessions on mesh interfaces

The logic for starting aggregation sessions was recently moved from minstrel_ht
to mac80211, into the subif tx handler just after the sta lookup.
Unfortunately this didn't work for mesh interfaces, since the sta lookup is
deferred until a much later point in time on those.
Fix this by also calling the aggregation check right after the deferred sta
lookup.

Fixes: 08a46c642001 ("mac80211: move A-MPDU session check from minstrel_ht to mac80211")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20210629112853.29785-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

ec61cd49 28-Jun-2021 Johan Almbladh <johan.almbladh@anyfinetworks.com>

mac80211: Do not strip skb headroom on monitor frames

When a monitor interface is present together with other interfaces, a
received skb is copied and received on the monitor netdev. Before, the
copied skb was allocated with exactly the amount of space needed for
the radiotap header, resulting in an skb without any headroom at all
being received on the monitor netdev. With the introduction of eBPF
and XDP in the kernel, skbs may be processed by custom eBPF programs.
However, since the skb cannot be reallocated in the eBPF program, no
more data or headers can be pushed. The old code made sure the final
headroom was zero regardless of the value of NET_SKB_PAD, so increasing
that constant would have no effect.

Now we allocate monitor skb copies with a headroom of NET_SKB_PAD bytes
before the radiotap header. Monitor interfaces now behave in the same
way as other netdev interfaces that honor the NET_SKB_PAD constant.

Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Link: https://lore.kernel.org/r/20210628123713.2070753-1-johan.almbladh@anyfinetworks.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

9af41761 12-Jul-2021 Colin Ian King <colin.king@canonical.com>

6lowpan: iphc: Fix an off-by-one check of array index

The bounds check of id is off-by-one and the comparison should
be >= rather >. Currently the WARN_ON_ONCE check does not stop
the out of range indexing of &ldev->ctx.table[id] so also add
a return path if the bounds are out of range.

Addresses-Coverity: ("Illegal address computation").
Fixes: 5609c185f24d ("6lowpan: iphc: add support for stateful compression")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

59da0b38 25-Jun-2021 Dan Carpenter <dan.carpenter@oracle.com>

Bluetooth: sco: prevent information leak in sco_conn_defer_accept()

Smatch complains that some of these struct members are not initialized
leading to a stack information disclosure:

net/bluetooth/sco.c:778 sco_conn_defer_accept() warn:
check that 'cp.retrans_effort' doesn't leak information

This seems like a valid warning. I've added a default case to fix
this issue.

Fixes: 2f69a82acf6f ("Bluetooth: Use voice setting in deferred SCO connection request")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

176f716c 22-Jul-2021 Matthieu Baerts <matthieu.baerts@tessares.net>

ipv6: fix "'ioam6_if_id_max' defined but not used" warn

When compiling without CONFIG_SYSCTL, this warning appears:

net/ipv6/addrconf.c:99:12: error: 'ioam6_if_id_max' defined but not used [-Werror=unused-variable]
99 | static u32 ioam6_if_id_max = U16_MAX;
| ^~~~~~~~~~~~~~~
cc1: all warnings being treated as errors

Simply moving the declaration of this variable under ...

#ifdef CONFIG_SYSCTL

... with other similar variables fixes the issue.

Fixes: 9ee11f0fff20 ("ipv6: ioam: Data plane support for Pre-allocated Trace")
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

802a76af 22-Jul-2021 Oleksij Rempel <linux@rempel-privat.de>

net: selftests: add MTU test

Test if we actually can send/receive packets with MTU size. This kind of
issue was detected on ASIX HW with bogus EEPROM.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

9d85a6f4 21-Jul-2021 Yajun Deng <yajun.deng@linux.dev>

net: sched: cls_api: Fix the the wrong parameter

The 4th parameter in tc_chain_notify() should be flags rather than seq.
Let's change it back correctly.

Fixes: 32a4f5ecd738 ("net: sched: introduce chain object to uapi")
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>

2b0a5688 21-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: switchdev: fix FDB entries towards foreign ports not getting propagated to us

The newly introduced switchdev_handle_fdb_{add,del}_to_device helpers
solved a problem but introduced another one. They have a severe design
bug: they do not propagate FDB events on foreign interfaces to us, i.e.
this use case:

br0
/ \
/ \
/ \
/ \
swp0 eno0
(switchdev) (foreign)

when an address is learned on eno0, what is supposed to happen is that
this event should also be propagated towards swp0. Somehow I managed to
convince myself that this did work correctly, but obviously it does not.

The trouble with foreign interfaces is that we must reach a switchdev
net_device pointer through a foreign net_device that has no direct
upper/lower relationship with it. So we need to do exploratory searching
through the lower interfaces of the foreign net_device's bridge upper
(to reach swp0 from eno0, we must check its upper, br0, for lower
interfaces that pass the check_cb and foreign_dev_check_cb). This is
something that the previous code did not do, it just assumed that "dev"
will become a switchdev interface at some point, somehow, probably by
magic.

With this patch, assisted address learning on the CPU port works again
in DSA:

ip link add br0 type bridge
ip link set swp0 master br0
ip link set eno0 master br0
ip link set br0 up

[ 46.708929] mscc_felix 0000:00:00.5 swp0: Adding FDB entry towards eno0, addr 00:04:9f:05:f4:ab vid 0 as host address

Fixes: 8ca07176ab00 ("net: switchdev: introduce a fanout helper for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE")
Reported-by: Eric Woudstra <ericwouds@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4e51bf44 21-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: bridge: move the switchdev object replay helpers to "push" mode

Starting with commit 4f2673b3a2b6 ("net: bridge: add helper to replay
port and host-joined mdb entries"), DSA has introduced some bridge
helpers that replay switchdev events (FDB/MDB/VLAN additions and
deletions) that can be lost by the switchdev drivers in a variety of
circumstances:

- an IP multicast group was host-joined on the bridge itself before any
switchdev port joined the bridge, leading to the host MDB entries
missing in the hardware database.
- during the bridge creation process, the MAC address of the bridge was
added to the FDB as an entry pointing towards the bridge device
itself, but with no switchdev ports being part of the bridge yet, this
local FDB entry would remain unknown to the switchdev hardware
database.
- a VLAN/FDB/MDB was added to a bridge port that is a LAG interface,
before any switchdev port joined that LAG, leading to the hardware
database missing those entries.
- a switchdev port left a LAG that is a bridge port, while the LAG
remained part of the bridge, and all FDB/MDB/VLAN entries remained
installed in the hardware database of the switchdev port.

Also, since commit 0d2cfbd41c4a ("net: bridge: ignore switchdev events
for LAG ports which didn't request replay"), DSA introduced a method,
based on a const void *ctx, to ensure that two switchdev ports under the
same LAG that is a bridge port do not see the same MDB/VLAN entry being
replayed twice by the bridge, once for every bridge port that joins the
LAG.

With so many ordering corner cases being possible, it seems unreasonable
to expect a switchdev driver writer to get it right from the first try.
Therefore, now that DSA has experimented with the bridge replay helpers
for a little bit, we can move the code to the bridge driver where it is
more readily available to all switchdev drivers.

To convert the switchdev object replay helpers from "pull mode" (where
the driver asks for them) to a "push mode" (where the bridge offers them
automatically), the biggest problem is that the bridge needs to be aware
when a switchdev port joins and leaves, even when the switchdev is only
indirectly a bridge port (for example when the bridge port is a LAG
upper of the switchdev).

Luckily, we already have a hook for that, in the form of the newly
introduced switchdev_bridge_port_offload() and
switchdev_bridge_port_unoffload() calls. These offer a natural place for
hooking the object addition and deletion replays.

Extend the above 2 functions with:
- pointers to the switchdev atomic notifier (for FDB replays) and the
blocking notifier (for MDB and VLAN replays).
- the "const void *ctx" argument required for drivers to be able to
disambiguate between which port is targeted, when multiple ports are
lowers of the same LAG that is a bridge port. Most of the drivers pass
NULL to this argument, except the ones that support LAG offload and have
the proper context check already in place in the switchdev blocking
notifier handler.

Also unexport the replay helpers, since nobody except the bridge calls
them directly now.

Note that:
(a) we abuse the terminology slightly, because FDB entries are not
"switchdev objects", but we count them as objects nonetheless.
With no direct way to prove it, I think they are not modeled as
switchdev objects because those can only be installed by the bridge
to the hardware (as opposed to FDB entries which can be propagated
in the other direction too). This is merely an abuse of terms, FDB
entries are replayed too, despite not being objects.
(b) the bridge does not attempt to sync port attributes to newly joined
ports, just the countable stuff (the objects). The reason for this
is simple: no universal and symmetric way to sync and unsync them is
known. For example, VLAN filtering: what to do on unsync, disable or
leave it enabled? Similarly, STP state, ageing timer, etc etc. What
a switchdev port does when it becomes standalone again is not really
up to the bridge's competence, and the driver should deal with it.
On the other hand, replaying deletions of switchdev objects can be
seen a matter of cleanup and therefore be treated by the bridge,
hence this patch.

We make the replay helpers opt-in for drivers, because they might not
bring immediate benefits for them:

- nbp_vlan_init() is called _after_ netdev_master_upper_dev_link(),
so br_vlan_replay() should not do anything for the new drivers on
which we call it. The existing drivers where there was even a slight
possibility for there to exist a VLAN on a bridge port before they
join it are already guarded against this: mlxsw and prestera deny
joining LAG interfaces that are members of a bridge.

- br_fdb_replay() should now notify of local FDB entries, but I patched
all drivers except DSA to ignore these new entries in commit
2c4eca3ef716 ("net: bridge: switchdev: include local flag in FDB
notifications"). Driver authors can lift this restriction as they
wish, and when they do, they can also opt into the FDB replay
functionality.

- br_mdb_replay() should fix a real issue which is described in commit
4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined
mdb entries"). However most drivers do not offload the
SWITCHDEV_OBJ_ID_HOST_MDB to see this issue: only cpsw and am65_cpsw
offload this switchdev object, and I don't completely understand the
way in which they offload this switchdev object anyway. So I'll leave
it up to these drivers' respective maintainers to opt into
br_mdb_replay().

So most of the drivers pass NULL notifier blocks for the replay helpers,
except:
- dpaa2-switch which was already acked/regression-tested with the
helpers enabled (and there isn't much of a downside in having them)
- ocelot which already had replay logic in "pull" mode
- DSA which already had replay logic in "pull" mode

An important observation is that the drivers which don't currently
request bridge event replays don't even have the
switchdev_bridge_port_{offload,unoffload} calls placed in proper places
right now. This was done to avoid unnecessary rework for drivers which
might never even add support for this. For driver writers who wish to
add replay support, this can be used as a tentative placement guide:
https://patchwork.kernel.org/project/netdevbpf/patch/20210720134655.892334-11-vladimir.oltean@nxp.com/

Cc: Vadym Kochan <vkochan@marvell.com>
Cc: Taras Chornyi <tchornyi@marvell.com>
Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
Cc: Lars Povlsen <lars.povlsen@microchip.com>
Cc: Steen Hegelund <Steen.Hegelund@microchip.com>
Cc: UNGLinuxDriver@microchip.com
Cc: Claudiu Manoil <claudiu.manoil@nxp.com>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch
Signed-off-by: David S. Miller <davem@davemloft.net>

7105b50b 21-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: bridge: guard the switchdev replay helpers against a NULL notifier block

There is a desire to make the object and FDB replay helpers optional
when moving them inside the bridge driver. For example a certain driver
might not offload host MDBs and there is no case where the replay
helpers would be of immediate use to it.

So it would be nice if we could allow drivers to pass NULL pointers for
the atomic and blocking notifier blocks, and the replay helpers to do
nothing in that case.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2f5dc00f 21-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: bridge: switchdev: let drivers inform which bridge ports are offloaded

On reception of an skb, the bridge checks if it was marked as 'already
forwarded in hardware' (checks if skb->offload_fwd_mark == 1), and if it
is, it assigns the source hardware domain of that skb based on the
hardware domain of the ingress port. Then during forwarding, it enforces
that the egress port must have a different hardware domain than the
ingress one (this is done in nbp_switchdev_allowed_egress).

Non-switchdev drivers don't report any physical switch id (neither
through devlink nor .ndo_get_port_parent_id), therefore the bridge
assigns them a hardware domain of 0, and packets coming from them will
always have skb->offload_fwd_mark = 0. So there aren't any restrictions.

Problems appear due to the fact that DSA would like to perform software
fallback for bonding and team interfaces that the physical switch cannot
offload.

+-- br0 ---+
/ / | \
/ / | \
/ | | bond0
/ | | / \
swp0 swp1 swp2 swp3 swp4

There, it is desirable that the presence of swp3 and swp4 under a
non-offloaded LAG does not preclude us from doing hardware bridging
beteen swp0, swp1 and swp2. The bandwidth of the CPU is often times high
enough that software bridging between {swp0,swp1,swp2} and bond0 is not
impractical.

But this creates an impossible paradox given the current way in which
port hardware domains are assigned. When the driver receives a packet
from swp0 (say, due to flooding), it must set skb->offload_fwd_mark to
something.

- If we set it to 0, then the bridge will forward it towards swp1, swp2
and bond0. But the switch has already forwarded it towards swp1 and
swp2 (not to bond0, remember, that isn't offloaded, so as far as the
switch is concerned, ports swp3 and swp4 are not looking up the FDB,
and the entire bond0 is a destination that is strictly behind the
CPU). But we don't want duplicated traffic towards swp1 and swp2, so
it's not ok to set skb->offload_fwd_mark = 0.

- If we set it to 1, then the bridge will not forward the skb towards
the ports with the same switchdev mark, i.e. not to swp1, swp2 and
bond0. Towards swp1 and swp2 that's ok, but towards bond0? It should
have forwarded the skb there.

So the real issue is that bond0 will be assigned the same hardware
domain as {swp0,swp1,swp2}, because the function that assigns hardware
domains to bridge ports, nbp_switchdev_add(), recurses through bond0's
lower interfaces until it finds something that implements devlink (calls
dev_get_port_parent_id with bool recurse = true). This is a problem
because the fact that bond0 can be offloaded by swp3 and swp4 in our
example is merely an assumption.

A solution is to give the bridge explicit hints as to what hardware
domain it should use for each port.

Currently, the bridging offload is very 'silent': a driver registers a
netdevice notifier, which is put on the netns's notifier chain, and
which sniffs around for NETDEV_CHANGEUPPER events where the upper is a
bridge, and the lower is an interface it knows about (one registered by
this driver, normally). Then, from within that notifier, it does a bunch
of stuff behind the bridge's back, without the bridge necessarily
knowing that there's somebody offloading that port. It looks like this:

ip link set swp0 master br0
|
v
br_add_if() calls netdev_master_upper_dev_link()
|
v
call_netdevice_notifiers
|
v
dsa_slave_netdevice_event
|
v
oh, hey! it's for me!
|
v
.port_bridge_join

What we do to solve the conundrum is to be less silent, and change the
switchdev drivers to present themselves to the bridge. Something like this:

ip link set swp0 master br0
|
v
br_add_if() calls netdev_master_upper_dev_link()
|
v bridge: Aye! I'll use this
call_netdevice_notifiers ^ ppid as the
| | hardware domain for
v | this port, and zero
dsa_slave_netdevice_event | if I got nothing.
| |
v |
oh, hey! it's for me! |
| |
v |
.port_bridge_join |
| |
+------------------------+
switchdev_bridge_port_offload(swp0, swp0)

Then stacked interfaces (like bond0 on top of swp3/swp4) would be
treated differently in DSA, depending on whether we can or cannot
offload them.

The offload case:

ip link set bond0 master br0
|
v
br_add_if() calls netdev_master_upper_dev_link()
|
v bridge: Aye! I'll use this
call_netdevice_notifiers ^ ppid as the
| | switchdev mark for
v | bond0.
dsa_slave_netdevice_event | Coincidentally (or not),
| | bond0 and swp0, swp1, swp2
v | all have the same switchdev
hmm, it's not quite for me, | mark now, since the ASIC
but my driver has already | is able to forward towards
called .port_lag_join | all these ports in hw.
for it, because I have |
a port with dp->lag_dev == bond0. |
| |
v |
.port_bridge_join |
for swp3 and swp4 |
| |
+------------------------+
switchdev_bridge_port_offload(bond0, swp3)
switchdev_bridge_port_offload(bond0, swp4)

And the non-offload case:

ip link set bond0 master br0
|
v
br_add_if() calls netdev_master_upper_dev_link()
|
v bridge waiting:
call_netdevice_notifiers ^ huh, switchdev_bridge_port_offload
| | wasn't called, okay, I'll use a
v | hwdom of zero for this one.
dsa_slave_netdevice_event : Then packets received on swp0 will
| : not be software-forwarded towards
v : swp1, but they will towards bond0.
it's not for me, but
bond0 is an upper of swp3
and swp4, but their dp->lag_dev
is NULL because they couldn't
offload it.

Basically we can draw the conclusion that the lowers of a bridge port
can come and go, so depending on the configuration of lowers for a
bridge port, it can dynamically toggle between offloaded and unoffloaded.
Therefore, we need an equivalent switchdev_bridge_port_unoffload too.

This patch changes the way any switchdev driver interacts with the
bridge. From now on, everybody needs to call switchdev_bridge_port_offload
and switchdev_bridge_port_unoffload, otherwise the bridge will treat the
port as non-offloaded and allow software flooding to other ports from
the same ASIC.

Note that these functions lay the ground for a more complex handshake
between switchdev drivers and the bridge in the future.

For drivers that will request a replay of the switchdev objects when
they offload and unoffload a bridge port (DSA, dpaa2-switch, ocelot), we
place the call to switchdev_bridge_port_unoffload() strategically inside
the NETDEV_PRECHANGEUPPER notifier's code path, and not inside
NETDEV_CHANGEUPPER. This is because the switchdev object replay helpers
need the netdev adjacency lists to be valid, and that is only true in
NETDEV_PRECHANGEUPPER.

Cc: Vadym Kochan <vkochan@marvell.com>
Cc: Taras Chornyi <tchornyi@marvell.com>
Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
Cc: Lars Povlsen <lars.povlsen@microchip.com>
Cc: Steen Hegelund <Steen.Hegelund@microchip.com>
Cc: UNGLinuxDriver@microchip.com
Cc: Claudiu Manoil <claudiu.manoil@nxp.com>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch: regression
Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch
Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com> # ocelot-switch
Signed-off-by: David S. Miller <davem@davemloft.net>

85826610 21-Jul-2021 Tobias Waldekranz <tobias@waldekranz.com>

net: bridge: switchdev: recycle unused hwdoms

Since hwdoms have only been used thus far for equality comparisons, the
bridge has used the simplest possible assignment policy; using a
counter to keep track of the last value handed out.

With the upcoming transmit offloading, we need to perform set
operations efficiently based on hwdoms, e.g. we want to answer
questions like "has this skb been forwarded to any port within this
hwdom?"

Move to a bitmap-based allocation scheme that recycles hwdoms once all
members leaves the bridge. This means that we can use a single
unsigned long to keep track of the hwdoms that have received an skb.

v1->v2: convert the typedef DECLARE_BITMAP(br_hwdom_map_t, BR_HWDOM_MAX)
into a plain unsigned long.
v2->v6: none

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f7cf972f 21-Jul-2021 Tobias Waldekranz <tobias@waldekranz.com>

net: bridge: disambiguate offload_fwd_mark

Before this change, four related - but distinct - concepts where named
offload_fwd_mark:

- skb->offload_fwd_mark: Set by the switchdev driver if the underlying
hardware has already forwarded this frame to the other ports in the
same hardware domain.

- nbp->offload_fwd_mark: An idetifier used to group ports that share
the same hardware forwarding domain.

- br->offload_fwd_mark: Counter used to make sure that unique IDs are
used in cases where a bridge contains ports from multiple hardware
domains.

- skb->cb->offload_fwd_mark: The hardware domain on which the frame
ingressed and was forwarded.

Introduce the term "hardware forwarding domain" ("hwdom") in the
bridge to denote a set of ports with the following property:

If an skb with skb->offload_fwd_mark set, is received on a port
belonging to hwdom N, that frame has already been forwarded to all
other ports in hwdom N.

By decoupling the name from "offload_fwd_mark", we can extend the
term's definition in the future - e.g. to add constraints that
describe expected egress behavior - without overloading the meaning of
"offload_fwd_mark".

- nbp->offload_fwd_mark thus becomes nbp->hwdom.

- br->offload_fwd_mark becomes br->last_hwdom.

- skb->cb->offload_fwd_mark becomes skb->cb->src_hwdom. The slight
change in naming here mandates a slight change in behavior of the
nbp_switchdev_frame_mark() function. Previously, it only set this
value in skb->cb for packets with skb->offload_fwd_mark true (ones
which were forwarded in hardware). Whereas now we always track the
incoming hwdom for all packets coming from a switchdev (even for the
packets which weren't forwarded in hardware, such as STP BPDUs, IGMP
reports etc). As all uses of skb->cb->offload_fwd_mark were already
gated behind checks of skb->offload_fwd_mark, this will not introduce
any functional change, but it paves the way for future changes where
the ingressing hwdom must be known for frames coming from a switchdev
regardless of whether they were forwarded in hardware or not
(basically, if the skb comes from a switchdev, skb->cb->src_hwdom now
always tracks which one).

A typical example where this is relevant: the switchdev has a fixed
configuration to trap STP BPDUs, but STP is not running on the bridge
and the group_fwd_mask allows them to be forwarded. Say we have this
setup:

br0
/ | \
/ | \
swp0 swp1 swp2

A BPDU comes in on swp0 and is trapped to the CPU; the driver does not
set skb->offload_fwd_mark. The bridge determines that the frame should
be forwarded to swp{1,2}. It is imperative that forward offloading is
_not_ allowed in this case, as the source hwdom is already "poisoned".

Recording the source hwdom allows this case to be handled properly.

v2->v3: added code comments
v3->v6: none

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

37120f23 21-Jul-2021 Lino Sanfilippo <LinoSanfilippo@gmx.de>

net: dsa: tag_ksz: dont let the hardware process the layer 4 checksum

If the checksum calculation is offloaded to the network device (e.g due to
NETIF_F_HW_CSUM inherited from the DSA master device), the calculated
layer 4 checksum is incorrect. This is since the DSA tag which is placed
after the layer 4 data is considered as being part of the daa and thus
errorneously included into the checksum calculation.
To avoid this, always calculate the layer 4 checksum in software.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

21cf377a 21-Jul-2021 Lino Sanfilippo <LinoSanfilippo@gmx.de>

net: dsa: ensure linearized SKBs in case of tail taggers

The function skb_put() that is used by tail taggers to make room for the
DSA tag must only be called for linearized SKBS. However in case that the
slave device inherited features like NETIF_F_HW_SG or NETIF_F_FRAGLIST the
SKB passed to the slaves transmit function may not be linearized.
Avoid those SKBs by clearing the NETIF_F_HW_SG and NETIF_F_FRAGLIST flags
for tail taggers.
Furthermore since the tagging protocol can be changed at runtime move the
code for setting up the slaves features into dsa_slave_setup_tagger().

Suggested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

213ad73d 21-Jul-2021 Wei Wang <weiwan@google.com>

tcp: disable TFO blackhole logic by default

Multiple complaints have been raised from the TFO users on the internet
stating that the TFO blackhole logic is too aggressive and gets falsely
triggered too often.
(e.g. https://blog.apnic.net/2021/07/05/tcp-fast-open-not-so-fast/)
Considering that most middleboxes no longer drop TFO packets, we decide
to disable the blackhole logic by setting
/proc/sys/net/ipv4/tcp_fastopen_blackhole_timeout_set to 0 by default.

Fixes: cf1ef3f0719b4 ("net/tcp_fastopen: Disable active side TFO in certain scenarios")
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

58d913a3 21-Jul-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: multicast: add context support for host-joined groups

Adding bridge multicast context support for host-joined groups is easy
because we only need the proper timer value. We pass the already chosen
context and use its timer value.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6567cb43 21-Jul-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: multicast: add mdb context support

Choose the proper bridge multicast context when user-spaces is adding
mdb entries. Currently we require the vlan to be configured on at least
one device (port or bridge) in order to add an mdb entry if vlan
mcast snooping is enabled (vlan snooping implies vlan filtering).
Note that we always allow deleting an entry, regardless of the vlan state.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

02dc2ee7 21-Jul-2021 Xin Long <lucien.xin@gmail.com>

sctp: do not update transport pathmtu if SPP_PMTUD_ENABLE is not set

Currently, in sctp_packet_config(), sctp_transport_pmtu_check() is
called to update transport pathmtu with dst's mtu when dst's mtu
has been changed by non sctp stack like xfrm.

However, this should only happen when SPP_PMTUD_ENABLE is set, no
matter where dst's mtu changed. This patch is to fix by checking
SPP_PMTUD_ENABLE flag before calling sctp_transport_pmtu_check().

Thanks Jacek for reporting and looking into this issue.

v1->v2:
- add the missing "{" to fix the build error.

Fixes: 69fec325a643 ('Revert "sctp: remove sctp_transport_pmtu_check"')
Reported-by: Jacek Szafraniec <jacek.szafraniec@nokia.com>
Tested-by: Jacek Szafraniec <jacek.szafraniec@nokia.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

240bfd13 21-Jul-2021 Eric Dumazet <edumazet@google.com>

tcp: tweak len/truesize ratio for coalesce candidates

tcp_grow_window() is using skb->len/skb->truesize to increase tp->rcv_ssthresh
which has a direct impact on advertized window sizes.

We added TCP coalescing in linux-3.4 & linux-3.5:

Instead of storing skbs with one or two MSS in receive queue (or OFO queue),
we try to append segments together to reduce memory overhead.

High performance network drivers tend to cook skb with 3 parts :

1) sk_buff structure (256 bytes)
2) skb->head contains room to copy headers as needed, and skb_shared_info
3) page fragment(s) containing the ~1514 bytes frame (or more depending on MTU)

Once coalesced into a previous skb, 1) and 2) are freed.

We can therefore tweak the way we compute len/truesize ratio knowing
that skb->truesize is inflated by 1) and 2) soon to be freed.

This is done only for in-order skb, or skb coalesced into OFO queue.

The result is that low rate flows no longer pay the memory price of having
low GRO aggregation factor. Same result for drivers not using GRO.

This is critical to allow a big enough receiver window,
typically tcp_rmem[2] / 2.

We have been using this at Google for about 5 years, it is due time
to make it upstream.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

54cb4319 21-Jul-2021 Nikolay Aleksandrov <nikolay@nvidia.com>

net: bridge: multicast: fix igmp/mld port context null pointer dereferences

With the recent change to use bridge/port multicast context pointers
instead of bridge/port I missed to convert two locations which pass the
port pointer as-is, but with the new model we need to verify the port
context is non-NULL first and retrieve the port from it. The first
location is when doing querier selection when a query is received, the
second location is when leaving a group. The port context will be null
if the packets originated from the bridge device (i.e. from the host).
The fix is simple just check if the port context exists and retrieve
the port pointer from it.

Fixes: adc47037a7d5 ("net: bridge: multicast: use multicast contexts instead of bridge or port")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

739b2adf 21-Jul-2021 Eric Dumazet <edumazet@google.com>

tcp: avoid indirect call in tcp_new_space()

For tcp sockets, sk->sk_write_space is most probably sk_stream_write_space().

Other sk->sk_write_space() calls in TCP are slow path and do not deserve
any change.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9bfce73c 20-Jul-2021 Vadim Fedorenko <vfedorenko@novek.ru>

udp: check encap socket in __udp_lib_err

Commit d26796ae5894 ("udp: check udp sock encap_type in __udp_lib_err")
added checks for encapsulated sockets but it broke cases when there is
no implementation of encap_err_lookup for encapsulation, i.e. ESP in
UDP encapsulation. Fix it by calling encap_err_lookup only if socket
implements this method otherwise treat it as legal socket.

Fixes: d26796ae5894 ("udp: check udp sock encap_type in __udp_lib_err")
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

58acd100 20-Jul-2021 Xin Long <lucien.xin@gmail.com>

sctp: update active_key for asoc when old key is being replaced

syzbot reported a call trace:

BUG: KASAN: use-after-free in sctp_auth_shkey_hold+0x22/0xa0 net/sctp/auth.c:112
Call Trace:
sctp_auth_shkey_hold+0x22/0xa0 net/sctp/auth.c:112
sctp_set_owner_w net/sctp/socket.c:131 [inline]
sctp_sendmsg_to_asoc+0x152e/0x2180 net/sctp/socket.c:1865
sctp_sendmsg+0x103b/0x1d30 net/sctp/socket.c:2027
inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:821
sock_sendmsg_nosec net/socket.c:703 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:723

This is an use-after-free issue caused by not updating asoc->shkey after
it was replaced in the key list asoc->endpoint_shared_keys, and the old
key was freed.

This patch is to fix by also updating active_key for asoc when old key is
being replaced with a new one. Note that this issue doesn't exist in
sctp_auth_del_key_id(), as it's not allowed to delete the active_key
from the asoc.

Fixes: 1b1e0bc99474 ("sctp: add refcnt support for sh_key")
Reported-by: syzbot+b774577370208727d12b@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ac6627a2 20-Jul-2021 Vadim Fedorenko <vfedorenko@novek.ru>

net: ipv4: Consolidate ipv4_mtu and ip_dst_mtu_maybe_forward

Consolidate IPv4 MTU code the same way it is done in IPv6 to have code
aligned in both address families

Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

427faee1 20-Jul-2021 Vadim Fedorenko <vfedorenko@novek.ru>

net: ipv6: introduce ip6_dst_mtu_maybe_forward

Replace ip6_dst_mtu_forward with ip6_dst_mtu_maybe_forward and
reuse this code in ip6_mtu. Actually these two functions were
almost duplicates, this change will simplify the maintaince of
mtu calculation code.

Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

3edede08 20-Jul-2021 Justin Iurman <justin.iurman@uliege.be>

ipv6: ioam: Support for IOAM injection with lwtunnels

Add support for the IOAM inline insertion (only for the host-to-host use case)
which is per-route configured with lightweight tunnels. The target is iproute2
and the patch is ready. It will be posted as soon as this patchset is merged.
Here is an overview:

$ ip -6 ro ad fc00::1/128 encap ioam6 trace type 0x800000 ns 1 size 12 dev eth0

This example configures an IOAM Pre-allocated Trace option attached to the
fc00::1/128 prefix. The IOAM namespace (ns) is 1, the size of the pre-allocated
trace data block is 12 octets (size) and only the first IOAM data (bit 0:
hop_limit + node id) is included in the trace (type) represented as a bitfield.

The reason why the in-transit (IPv6-in-IPv6 encapsulation) use case is not
implemented is explained on the patchset cover.

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

8c6f6fa6 20-Jul-2021 Justin Iurman <justin.iurman@uliege.be>

ipv6: ioam: IOAM Generic Netlink API

Add Generic Netlink commands to allow userspace to configure IOAM
namespaces and schemas. The target is iproute2 and the patch is ready.
It will be posted as soon as this patchset is merged. Here is an overview:

$ ip ioam
Usage: ip ioam { COMMAND | help }
ip ioam namespace show
ip ioam namespace add ID [ data DATA32 ] [ wide DATA64 ]
ip ioam namespace del ID
ip ioam schema show
ip ioam schema add ID DATA
ip ioam schema del ID
ip ioam namespace set ID schema { ID | none }

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

9ee11f0f 20-Jul-2021 Justin Iurman <justin.iurman@uliege.be>

ipv6: ioam: Data plane support for Pre-allocated Trace

Implement support for processing the IOAM Pre-allocated Trace with IPv6,
see [1] and [2]. Introduce a new IPv6 Hop-by-Hop TLV option, see IANA [3].

A new per-interface sysctl is introduced. The value is a boolean to accept (=1)
or ignore (=0, by default) IPv6 IOAM options on ingress for an interface:
- net.ipv6.conf.XXX.ioam6_enabled

Two other sysctls are introduced to define IOAM IDs, represented by an integer.
They are respectively per-namespace and per-interface:
- net.ipv6.ioam6_id
- net.ipv6.conf.XXX.ioam6_id

The value of the first one represents the IOAM ID of the node itself (u32; max
and default value = U32_MAX>>8, due to hop limit concatenation) while the other
represents the IOAM ID of an interface (u16; max and default value = U16_MAX).

Each "ioam6_id" sysctl has a "_wide" equivalent:
- net.ipv6.ioam6_id_wide
- net.ipv6.conf.XXX.ioam6_id_wide

The value of the first one represents the wide IOAM ID of the node itself (u64;
max and default value = U64_MAX>>8, due to hop limit concatenation) while the
other represents the wide IOAM ID of an interface (u32; max and default value
= U32_MAX).

The use of short and wide equivalents is not exclusive, a deployment could
choose to leverage both. For example, net.ipv6.conf.XXX.ioam6_id (short format)
could be an identifier for a physical interface, whereas
net.ipv6.conf.XXX.ioam6_id_wide (wide format) could be an identifier for a
logical sub-interface. Documentation about new sysctls is provided at the end
of this patchset.

Two relativistic hash tables are used: one for IOAM namespaces, the other for
IOAM schemas. A namespace can only have a single active schema and a schema
can only be attached to a single namespace (1:1 relationship).

[1] https://tools.ietf.org/html/draft-ietf-ippm-ioam-ipv6-options
[2] https://tools.ietf.org/html/draft-ietf-ippm-ioam-data
[3] https://www.iana.org/assignments/ipv6-parameters/ipv6-parameters.xhtml#ipv6-parameters-2

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

71f4f89a 20-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: switchdev: recurse into __switchdev_handle_fdb_del_to_device

The difference between __switchdev_handle_fdb_del_to_device and
switchdev_handle_del_to_device is that the former takes an extra
orig_dev argument, while the latter starts with dev == orig_dev.

We should recurse into the variant that does not lose the orig_dev along
the way. This is relevant when deleting FDB entries pointing towards a
bridge (dev changes to the lower interfaces, but orig_dev shouldn't).

The addition helper already recurses properly, just the deletion one
doesn't.

Fixes: 8ca07176ab00 ("net: switchdev: introduce a fanout helper for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4e950506 17-Jul-2021 Dmitry Safonov <0x7f454c46@gmail.com>

net/xfrm/compat: Copy xfrm_spdattr_type_t atributes

The attribute-translator has to take in mind maxtype, that is
xfrm_link::nla_max. When it is set, attributes are not of xfrm_attr_type_t.
Currently, they can be only XFRMA_SPD_MAX (message XFRM_MSG_NEWSPDINFO),
their UABI is the same for 64/32-bit, so just copy them.

Thanks to YueHaibing for reporting this:
In xfrm_user_rcv_msg_compat() if maxtype is not zero and less than
XFRMA_MAX, nlmsg_parse_deprecated() do not initialize attrs array fully.
xfrm_xlate32() will access uninit 'attrs[i]' while iterating all attrs
array.

KASAN: probably user-memory-access in range [0x0000000041b58ab0-0x0000000041b58ab7]
CPU: 0 PID: 15799 Comm: syz-executor.2 Tainted: G W 5.14.0-rc1-syzkaller #0
RIP: 0010:nla_type include/net/netlink.h:1130 [inline]
RIP: 0010:xfrm_xlate32_attr net/xfrm/xfrm_compat.c:410 [inline]
RIP: 0010:xfrm_xlate32 net/xfrm/xfrm_compat.c:532 [inline]
RIP: 0010:xfrm_user_rcv_msg_compat+0x5e5/0x1070 net/xfrm/xfrm_compat.c:577
[...]
Call Trace:
xfrm_user_rcv_msg+0x556/0x8b0 net/xfrm/xfrm_user.c:2774
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504
xfrm_netlink_rcv+0x6b/0x90 net/xfrm/xfrm_user.c:2824
netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340
netlink_sendmsg+0x86d/0xdb0 net/netlink/af_netlink.c:1929
sock_sendmsg_nosec net/socket.c:702 [inline]

Fixes: 5106f4a8acff ("xfrm/compat: Add 32=>64-bit messages translator")
Cc: <stable@kernel.org>
Reported-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

2d151d39 18-Jul-2021 Steffen Klassert <steffen.klassert@secunet.com>

xfrm: Add possibility to set the default to block if we have no policy

As the default we assume the traffic to pass, if we have no
matching IPsec policy. With this patch, we have a possibility to
change this default from allow to block. It can be configured
via netlink. Each direction (input/output/forward) can be
configured separately. With the default to block configuered,
we need allow policies for all packet flows we accept.
We do not use default policy lookup for the loopback device.

v1->v2
- fix compiling when XFRM is disabled
- Reported-by: kernel test robot <lkp@intel.com>

Co-developed-by: Christian Langrock <christian.langrock@secunet.com>
Signed-off-by: Christian Langrock <christian.langrock@secunet.com>
Co-developed-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

8fb4792f 20-Jul-2021 Paolo Abeni <pabeni@redhat.com>

ipv6: fix another slab-out-of-bounds in fib6_nh_flush_exceptions

While running the self-tests on a KASAN enabled kernel, I observed a
slab-out-of-bounds splat very similar to the one reported in
commit 821bbf79fe46 ("ipv6: Fix KASAN: slab-out-of-bounds Read in
fib6_nh_flush_exceptions").

We additionally need to take care of fib6_metrics initialization
failure when the caller provides an nh.

The fix is similar, explicitly free the route instead of calling
fib6_info_release on a half-initialized object.

Fixes: f88d8ea67fbdb ("ipv6: Plumb support for nexthop object in a fib6_info")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8292d7f6 19-Jul-2021 Yang Yang <yang.yang29@zte.com.cn>

net: ipv4: add capability check for net administration

Root in init user namespace can modify /proc/sys/net/ipv4/ip_forward
without CAP_NET_ADMIN, this doesn't follow the principle of
capabilities. For example, let's take a look at netdev_store(),
root can't modify netdev attribute without CAP_NET_ADMIN.
So let's keep the consistency of permission check logic.

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

f8fdadef 16-Jul-2021 Corey Minyard <cminyard@mvista.com>

ipsec: Remove unneeded extra variable in esp4 esp_ssg_unref()

It's assigned twice, but only used to calculate the size of the
structure it points to. Just remove it and take a sizeof the
actual structure.

Signed-off-by: Corey Minyard <cminyard@mvista.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

727d6a8b 19-Jul-2021 Peilin Ye <peilin.ye@bytedance.com>

net/sched: act_skbmod: Skip non-Ethernet packets

Currently tcf_skbmod_act() assumes that packets use Ethernet as their L2
protocol, which is not always the case. As an example, for CAN devices:

$ ip link add dev vcan0 type vcan
$ ip link set up vcan0
$ tc qdisc add dev vcan0 root handle 1: htb
$ tc filter add dev vcan0 parent 1: protocol ip prio 10 \
matchall action skbmod swap mac

Doing the above silently corrupts all the packets. Do not perform skbmod
actions for non-Ethernet packets.

Fixes: 86da71b57383 ("net_sched: Introduce skbmod action")
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b94dc99c 19-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: use switchdev_handle_fdb_{add,del}_to_device

Using the new fan-out helper for FDB entries installed on the software
bridge, we can install host addresses with the proper refcount on the
CPU port, such that this case:

ip link set swp0 master br0
ip link set swp1 master br0
ip link set swp2 master br0
ip link set swp3 master br0
ip link set br0 address 00:01:02:03:04:05
ip link set swp3 nomaster

works properly and the br0 address remains installed as a host entry
with refcount 3 instead of getting deleted.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8ca07176 19-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: switchdev: introduce a fanout helper for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE

Currently DSA has an issue with FDB entries pointing towards the bridge
in the presence of br_fdb_replay() being called at port join and leave
time.

In particular, each bridge port will ask for a replay for the FDB
entries pointing towards the bridge when it joins, and for another
replay when it leaves.

This means that for example, a bridge with 4 switch ports will notify
DSA 4 times of the bridge MAC address.

But if the MAC address of the bridge changes during the normal runtime
of the system, the bridge notifies switchdev [ once ] of the deletion of
the old MAC address as a local FDB towards the bridge, and of the
insertion [ again once ] of the new MAC address as a local FDB.

This is a problem, because DSA keeps the old MAC address as a host FDB
entry with refcount 4 (4 ports asked for it using br_fdb_replay). So the
old MAC address will not be deleted. Additionally, the new MAC address
will only be installed with refcount 1, and when the first switch port
leaves the bridge (leaving 3 others as still members), it will delete
with it the new MAC address of the bridge from the local FDB entries
kept by DSA (because the br_fdb_replay call on deletion will bring the
entry's refcount from 1 to 0).

So the problem, really, is that the number of br_fdb_replay() calls is
not matched with the refcount that a host FDB is offloaded to DSA during
normal runtime.

An elegant way to solve the problem would be to make the switchdev
notification emitted by br_fdb_change_mac_address() result in a host FDB
kept by DSA which has a refcount exactly equal to the number of ports
under that bridge. Then, no matter how many DSA ports join or leave that
bridge, the host FDB entry will always be deleted when there are exactly
zero remaining DSA switch ports members of the bridge.

To implement the proposed solution, we remember that the switchdev
objects and port attributes have some helpers provided by switchdev,
which can be optionally called by drivers:
switchdev_handle_port_obj_{add,del} and switchdev_handle_port_attr_set.
These helpers:
- fan out a switchdev object/attribute emitted for the bridge towards
all the lower interfaces that pass the check_cb().
- fan out a switchdev object/attribute emitted for a bridge port that is
a LAG towards all the lower interfaces that pass the check_cb().

In other words, this is the model we need for the FDB events too:
something that will keep an FDB entry emitted towards a physical port as
it is, but translate an FDB entry emitted towards the bridge into N FDB
entries, one per physical port.

Of course, there are many differences between fanning out a switchdev
object (VLAN) on 3 lower interfaces of a LAG and fanning out an FDB
entry on 3 lower interfaces of a LAG. Intuitively, an FDB entry towards
a LAG should be treated specially, because FDB entries are unicast, we
can't just install the same address towards 3 destinations. It is
imaginable that drivers might want to treat this case specifically, so
create some methods for this case and do not recurse into the LAG lower
ports, just the bridge ports.

DSA also listens for FDB entries on "foreign" interfaces, aka interfaces
bridged with us which are not part of our hardware domain: think an
Ethernet switch bridged with a Wi-Fi AP. For those addresses, DSA
installs host FDB entries. However, there we have the same problem
(those host FDB entries are installed with a refcount of only 1) and an
even bigger one which we did not have with FDB entries towards the
bridge:

br_fdb_replay() is currently not called for FDB entries on foreign
interfaces, just for the physical port and for the bridge itself.

So when DSA sniffs an address learned by the software bridge towards a
foreign interface like an e1000 port, and then that e1000 leaves the
bridge, DSA remains with the dangling host FDB address. That will be
fixed separately by replaying all FDB entries and not just the ones
towards the port and the bridge.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c6451cda 19-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: switchdev: introduce helper for checking dynamically learned FDB entries

It is a bit difficult to understand what DSA checks when it tries to
avoid installing dynamically learned addresses on foreign interfaces as
local host addresses, so create a generic switchdev helper that can be
reused and is generally more readable.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c64b9c05 19-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: tag_8021q: add proper cross-chip notifier support

The big problem which mandates cross-chip notifiers for tag_8021q is
this:

|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]

When the user runs:

ip link add br0 type bridge
ip link set sw0p0 master br0
ip link set sw2p0 master br0

It doesn't work.

This is because dsa_8021q_crosschip_bridge_join() assumes that "ds" and
"other_ds" are at most 1 hop away from each other, so it is sufficient
to add the RX VLAN of {ds, port} into {other_ds, other_port} and vice
versa and presto, the cross-chip link works. When there is another
switch in the middle, such as in this case switch 1 with its DSA links
sw1p3 and sw1p4, somebody needs to tell it about these VLANs too.

Which is exactly why the problem is quadratic: when a port joins a
bridge, for each port in the tree that's already in that same bridge we
notify a tag_8021q VLAN addition of that port's RX VLAN to the entire
tree. It is a very complicated web of VLANs.

It must be mentioned that currently we install tag_8021q VLANs on too
many ports (DSA links - to be precise, on all of them). For example,
when sw2p0 joins br0, and assuming sw1p0 was part of br0 too, we add the
RX VLAN of sw2p0 on the DSA links of switch 0 too, even though there
isn't any port of switch 0 that is a member of br0 (at least yet).
In theory we could notify only the switches which sit in between the
port joining the bridge and the port reacting to that bridge_join event.
But in practice that is impossible, because of the way 'link' properties
are described in the device tree. The DSA bindings require DT writers to
list out not only the real/physical DSA links, but in fact the entire
routing table, like for example switch 0 above will have:

sw0p3: port@3 {
link = <&sw1p4 &sw2p4>;
};

This was done because:

/* TODO: ideally DSA ports would have a single dp->link_dp member,
* and no dst->rtable nor this struct dsa_link would be needed,
* but this would require some more complex tree walking,
* so keep it stupid at the moment and list them all.
*/

but it is a perfect example of a situation where too much information is
actively detrimential, because we are now in the position where we
cannot distinguish a real DSA link from one that is put there to avoid
the 'complex tree walking'. And because DT is ABI, there is not much we
can change.

And because we do not know which DSA links are real and which ones
aren't, we can't really know if DSA switch A is in the data path between
switches B and C, in the general case.

So this is why tag_8021q RX VLANs are added on all DSA links, and
probably why it will never change.

On the other hand, at least the number of additions/deletions is well
balanced, and this means that once we implement reference counting at
the cross-chip notifier level a la fdb/mdb, there is absolutely zero
need for a struct dsa_8021q_crosschip_link, it's all self-managing.

In fact, with the tag_8021q notifiers emitted from the bridge join
notifiers, it becomes so generic that sja1105 does not need to do
anything anymore, we can just delete its implementation of the
.crosschip_bridge_{join,leave} methods.

Among other things we can simply delete is the home-grown implementation
of sja1105_notify_crosschip_switches(). The reason why that is wrong is
because it is not quadratic - it only covers remote switches to which we
have a cross-chip bridging link and that does not cover in-between
switches. This deletion is part of the same patch because sja1105 used
to poke deep inside the guts of the tag_8021q context in order to do
that. Because the cross-chip links went away, so needs the sja1105 code.

Last but not least, dsa_8021q_setup_port() is simplified (and also
renamed). Because our TAG_8021Q_VLAN_ADD notifier is designed to react
on the CPU port too, the four dsa_8021q_vid_apply() calls:
- 1 for RX VLAN on user port
- 1 for the user port's RX VLAN on the CPU port
- 1 for TX VLAN on user port
- 1 for the user port's TX VLAN on the CPU port

now get squashed into only 2 notifier calls via
dsa_port_tag_8021q_vlan_add.

And because the notifiers to add and to delete a tag_8021q VLAN are
distinct, now we finally break up the port setup and teardown into
separate functions instead of relying on a "bool enabled" flag which
tells us what to do. Arguably it should have been this way from the
get go.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e19cc13c 19-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: tag_8021q: manage RX VLANs dynamically at bridge join/leave time

There has been at least one wasted opportunity for tag_8021q to be used
by a driver:

https://patchwork.ozlabs.org/project/netdev/patch/20200710113611.3398-3-kurt@linutronix.de/#2484272

because of a design decision: the declared purpose of tag_8021q is to
offer source port/switch identification for a tagging driver for packets
coming from a switch with no hardware DSA tagging support. It is not
intended to provide VLAN-based port isolation, because its first user,
sja1105, had another mechanism for bridging domain isolation, the L2
Forwarding Table. So even if 2 ports are in the same VLAN but they are
separated via the L2 Forwarding Table, they will not communicate with
one another. The L2 Forwarding Table is managed by the
sja1105_bridge_join() and sja1105_bridge_leave() methods.

As a consequence, today tag_8021q does not bother too much with hooking
into .port_bridge_join() and .port_bridge_leave() because that would
introduce yet another degree of freedom, it just iterates statically
through all ports of a switch and adds the RX VLAN of one port to all
the others. In this way, whenever .port_bridge_join() is called,
bridging will magically work because the RX VLANs are already installed
everywhere they need to be.

This is not to say that the reason for the change in this patch is to
satisfy the hellcreek and similar use cases, that is merely a nice side
effect. Instead it is to make sja1105 cross-chip links work properly
over a DSA link.

For context, sja1105 today supports a degenerate form of cross-chip
bridging, where the switches are interconnected through their CPU ports
("disjoint trees" topology). There is some code which has been
generalized into dsa_8021q_crosschip_link_{add,del}, but it is not
enough, and frankly it is impossible to build upon that.
Real multi-switch DSA trees, like daisy chains or H trees, which have
actual DSA links, do not work.

The problem is that sja1105 is unlike mv88e6xxx, and does not have a PVT
for cross-chip bridging, which is a table by which the local switch can
select the forwarding domain for packets from a certain ingress switch
ID and source port. The sja1105 switches cannot parse their own DSA
tags, because, well, they don't really have support for DSA tags, it's
all VLANs.

So to make something like cross-chip bridging between sw0p0 and sw1p0 to
work over the sw0p3/sw1p3 DSA link to work with sja1105 in the topology
below:

| |
sw0p0 sw0p1 sw0p2 sw0p3 sw1p3 sw1p2 sw1p1 sw1p0
[ user ] [ user ] [ cpu ] [ dsa ] ---- [ dsa ] [ cpu ] [ user ] [ user ]

we need to ask ourselves 2 questions:

(1) how should the L2 Forwarding Table be managed?
(2) how should the VLAN Lookup Table be managed?

i.e. what should prevent packets from going to unwanted ports?

Since as mentioned, there is no PVT, the L2 Forwarding Table only
contains forwarding rules for local ports. So we can say "all user ports
are allowed to forward to all CPU ports and all DSA links".

If we allow forwarding to DSA links unconditionally, this means we must
prevent forwarding using the VLAN Lookup Table. This is in fact
asymmetric with what we do for tag_8021q on ports local to the same
switch, and it matters because now that we are making tag_8021q a core
DSA feature, we need to hook into .crosschip_bridge_join() to add/remove
the tag_8021q VLANs. So for symmetry it makes sense to manage the VLANs
for local forwarding in the same way as cross-chip forwarding.

Note that there is a very precise reason why tag_8021q hooks into
dsa_switch_bridge_join() which acts at the cross-chip notifier level,
and not at a higher level such as dsa_port_bridge_join(). We need to
install the RX VLAN of the newly joining port into the VLAN table of all
the existing ports across the tree that are part of the same bridge, and
the notifier already does the iteration through the switches for us.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

328621f6 19-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: tag_8021q: absorb dsa_8021q_setup into dsa_tag_8021q_{,un}register

Right now, setting up tag_8021q is a 2-step operation for a driver,
first the context structure needs to be created, then the VLANs need to
be installed on the ports. A similar thing is true for teardown.

Merge the 2 steps into the register/unregister methods, to be as
transparent as possible for the driver as to what tag_8021q does behind
the scenes. This also gets rid of the funny "bool setup == true means
setup, == false means teardown" API that tag_8021q used to expose.

Note that dsa_tag_8021q_register() must be called at least in the
.setup() driver method and never earlier (like in the driver probe
function). This is because the DSA switch tree is not initialized at
probe time, and the cross-chip notifiers will not work.

For symmetry with .setup(), the unregister method should be put in
.teardown().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5da11eb4 19-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: make tag_8021q operations part of the core

Make tag_8021q a more central element of DSA and move the 2 driver
specific operations outside of struct dsa_8021q_context (which is
supposed to hold dynamic data and not really constant function
pointers).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d7b1fd52 19-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: let the core manage the tag_8021q context

The basic problem description is as follows:

Be there 3 switches in a daisy chain topology:

|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]

The CPU will not be able to ping through the user ports of the
bottom-most switch (like for example sw2p0), simply because tag_8021q
was not coded up for this scenario - it has always assumed DSA switch
trees with a single switch.

To add support for the topology above, we must admit that the RX VLAN of
sw2p0 must be added on some ports of switches 0 and 1 as well. This is
in fact a textbook example of thing that can use the cross-chip notifier
framework that DSA has set up in switch.c.

There is only one problem: core DSA (switch.c) is not able right now to
make the connection between a struct dsa_switch *ds and a struct
dsa_8021q_context *ctx. Right now, it is drivers who call into
tag_8021q.c and always provide a struct dsa_8021q_context *ctx pointer,
and tag_8021q.c calls them back with the .tag_8021q_vlan_{add,del}
methods.

But with cross-chip notifiers, it is possible for tag_8021q to call
drivers without drivers having ever asked for anything. A good example
is right above: when sw2p0 wants to set itself up for tag_8021q,
the .tag_8021q_vlan_add method needs to be called for switches 1 and 0,
so that they transport sw2p0's VLANs towards the CPU without dropping
them.

So instead of letting drivers manage the tag_8021q context, add a
tag_8021q_ctx pointer inside of struct dsa_switch, which will be
populated when dsa_tag_8021q_register() returns success.

The patch is fairly long-winded because we are partly reverting commit
5899ee367ab3 ("net: dsa: tag_8021q: add a context structure") which made
the driver-facing tag_8021q API use "ctx" instead of "ds". Now that we
can access "ctx" directly from "ds", this is no longer needed.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8b6e638b 19-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: build tag_8021q.c as part of DSA core

Upcoming patches will add tag_8021q related logic to switch.c and
port.c, in order to allow it to make use of cross-chip notifiers.
In addition, a struct dsa_8021q_context *ctx pointer will be added to
struct dsa_switch.

It seems fairly low-reward to #ifdef the *ctx from struct dsa_switch and
to provide shim implementations of the entire tag_8021q.c calling
surface (not even clear what to do about the tag_8021q cross-chip
notifiers to avoid compiling them). The runtime overhead for switches
which don't use tag_8021q is fairly small because all helpers will check
for ds->tag_8021q_ctx being a NULL pointer and stop there.

So let's make it part of dsa_core.o.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cedf4670 19-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: tag_8021q: create dsa_tag_8021q_{register,unregister} helpers

In preparation of moving tag_8021q to core DSA, move all initialization
and teardown related to tag_8021q which is currently done by drivers in
2 functions called "register" and "unregister". These will gather more
functionality in future patches, which will better justify the chosen
naming scheme.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

69ebb370 19-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: tag_8021q: use symbolic error names

Use %pe to give the user a string holding the error code instead of just
a number.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a81a4574 19-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: tag_8021q: use "err" consistently instead of "rc"

Some of the tag_8021q code has been taken out of sja1105, which uses
"rc" for its return code variables, whereas the DSA core uses "err".
Change tag_8021q for consistency.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0fac6aa0 19-Jul-2021 Vladimir Oltean <vladimir.oltean@nxp.com>

net: dsa: sja1105: delete the best_effort_vlan_filtering mode

Simply put, the best-effort VLAN filtering mode relied on VLAN retagging
from a bridge VLAN towards a tag_8021q sub-VLAN in order to be able to
decode the source port in the tagger, but the VLAN retagging
implementation inside the sja1105 chips is not the best and we were
relying on marginal operating conditions.

The most notable limitation of the best-effort VLAN filtering mode is
its incapacity to treat this case properly:

ip link add br0 type bridge vlan_filtering 1
ip link set swp2 master br0
ip link set swp4 master br0
bridge vlan del dev swp4 vid 1
bridge vlan add dev swp4 vid 1 pvid

When sending an untagged packet through swp2, the expectation is for it
to be forwarded to swp4 as egress-tagged (so it will contain VLAN ID 1
on egress). But the switch will send it as egress-untagged.

There was an attempt to fix this here:
https://patchwork.kernel.org/project/netdevbpf/patch/20210407201452.1703261-2-olteanv@gmail.com/

but it failed miserably because it broke PTP RX timestamping, in a way
that cannot be corrected due to hardware issues related to VLAN
retagging.

So with either PTP broken or pushing VLAN headers on egress for untagged
packets being broken, the sad reality is that the best-effort VLAN
filtering code is broken. Delete it.

Note that this means there will be a temporary loss of functionality in
this driver until it is replaced with something better (network stack
RX/TX capability for "mode 2" as described in
Documentation/networking/dsa/sja1105.rst, the "port under VLAN-aware
bridge" case). We simply cannot keep this code until that driver rework
is done, it is super bloated and tangled with tag_8021q.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

542bb396 20-Jul-2021 David S. Miller <davem@davemloft.net>

Merge branch 'veth-flexible-channel-numbers'

Paolo Abeni says:

====================
veth: more flexible channels number configuration

XDP setups can benefit from multiple veth RX/TX queues. Currently
veth allow setting such number only at creation time via the
'numrxqueues' and 'numtxqueues' parameters.

This series introduces support for the ethtool set_channel operation
and allows configuring the queue number via a new module parameter.

The veth default configuration is not changed.

Finally self-tests are updated to check the new features, with both
valid and invalid arguments.

This iteration is a rebase of the most recent RFC, it does not provide
a module parameter to configure the default number of queues, but I
think could be worthy

RFC v1 -> RFC v2:
- report more consistent 'combined' count
- make set_channel as resilient as possible to errors
- drop module parameter - but I would still consider it.
- more self-tests
====================

Signed-off-by: David S. Miller <davem@davemloft.net>


2c080404 20-Jul-2021 David S. Miller <davem@davemloft.net>

Merge branch 'bridge-vlan-multicast'

Nikolay Aleksandrov says:

====================
net: bridge: multicast: add vlan support

This patchset adds initial per-vlan multicast support, most of the code
deals with moving to multicast context pointers from bridge/port pointers.
That allows us to switch them with the per-vlan contexts when a multicast
packet is being processed and vlan multicast snooping has been enabled.
That is controlled by a global bridge option added in patch 06 which is
off by default (BR_BOOLOPT_MCAST_VLAN_SNOOPING). It is important to note
that this option can change only under RTNL and doesn't require
multicast_lock, so we need to be careful when retrieving mcast contexts
in parallel. For packet processing they are switched only once in
br_multicast_rcv() and then used until the packet has been processed.
For the most part we need these contexts only to read config values and
check if they are disabled. The global mcast state which is maintained
consists of querier and router timers, the rest are config options.
The port mcast state which is maintained consists of query timer and
link to router port list if it's ever marked as a router port. Port
multicast contexts _must_ be used only with their respective global
contexts, that is a bridge port's mcast context must be used only with
bridge's global mcast context and a vlan/port's mcast context must be
used only with that vlan's global mcast context due to the router port
lists. This way a bridge port can be marked as a router in multiple
vlans, but might not be a router in some other vlan. Also this allows us
to have per-vlan querier elections, per-vlan queries and basically the
whole multicast state becomes per-vlan when the option is enabled.
One of the hardest parts is synchronization with vlan's memory
management, that is done through a new vlan flag: BR_VLFLAG_MCAST_ENABLED
which is changed only under multicast_lock. When a vlan is being
destroyed first that flag is removed under the lock, then the multicast
context is torn down which includes waiting for any outstanding context
timers. Since all of the vlan processing depends on BR_VLFLAG_MCAST_ENABLED
it must be checked first if the contexts are vlan and the multicast_lock
has been acquired. That is done by all IGMP/MLD packet processing
functions and timers. When processing a packet we have RCU so the vlan
memory won't be freed, but if the flag is missing we must not process it.
The timers are synchronized in the same way with the addition of waiting
for them to finish in case they are running after removing the flag
under multicast_lock (i.e. they were waiting for the lock). Multicast vlan
snooping requires vlan filtering to be enabled, if it's disabled then
snooping gets automatically disabled, too. BR_VLFLAG_GLOBAL_MCAST_ENABLED
controls if a vlan has BR_VLFLAG_MCAS