History log of /linux-master/drivers/net/wireguard/allowedips.c
Revision Date Author Comments
# 46622219 07-Aug-2023 Jason A. Donenfeld <Jason@zx2c4.com>

wireguard: allowedips: expand maximum node depth

In the allowedips self-test, nodes are inserted into the tree, but it
generated an even amount of nodes, but for checking maximum node depth,
there is of course the root node, which makes the total number
necessarily odd. With two few nodes added, it never triggered the
maximum depth check like it should have. So, add 129 nodes instead of
128 nodes, and do so with a more straightforward scheme, starting with
all the bits set, and shifting over one each time. Then increase the
maximum depth to 129, and choose a better name for that variable to
make it clear that it represents depth as opposed to bits.

Cc: stable@vger.kernel.org
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Link: https://lore.kernel.org/r/20230807132146.2191597-2-Jason@zx2c4.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# c31b14d8 02-Aug-2022 Jason A. Donenfeld <Jason@zx2c4.com>

wireguard: allowedips: don't corrupt stack when detecting overflow

In case push_rcu() and related functions are buggy, there's a
WARN_ON(len >= 128), which the selftest tries to hit by being tricky. In
case it is hit, we shouldn't corrupt the kernel's stack, though;
otherwise it may be hard to even receive the report that it's buggy. So
conditionalize the stack write based on that WARN_ON()'s return value.

Note that this never *actually* happens anyway. The WARN_ON() in the
first place is bounded by IS_ENABLED(DEBUG), and isn't expected to ever
actually hit. This is just a debugging sanity check.

Additionally, hoist the constant 128 into a named enum,
MAX_ALLOWEDIPS_BITS, so that it's clear why this value is chosen.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/all/CAHk-=wjJZGA6w_DxA+k7Ejbqsq+uGK==koPai3sqdsfJqemvag@mail.gmail.com/
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# ae928781 29-Nov-2021 Jason A. Donenfeld <Jason@zx2c4.com>

wireguard: allowedips: add missing __rcu annotation to satisfy sparse

A __rcu annotation got lost during refactoring, which caused sparse to
become enraged.

Fixes: bf7b042dc62a ("wireguard: allowedips: free empty intermediate nodes when removing single node")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# bf7b042d 04-Jun-2021 Jason A. Donenfeld <Jason@zx2c4.com>

wireguard: allowedips: free empty intermediate nodes when removing single node

When removing single nodes, it's possible that that node's parent is an
empty intermediate node, in which case, it too should be removed.
Otherwise the trie fills up and never is fully emptied, leading to
gradual memory leaks over time for tries that are modified often. There
was originally code to do this, but was removed during refactoring in
2016 and never reworked. Now that we have proper parent pointers from
the previous commits, we can implement this properly.

In order to reduce branching and expensive comparisons, we want to keep
the double pointer for parent assignment (which lets us easily chain up
to the root), but we still need to actually get the parent's base
address. So encode the bit number into the last two bits of the pointer,
and pack and unpack it as needed. This is a little bit clumsy but is the
fastest and less memory wasteful of the compromises. Note that we align
the root struct here to a minimum of 4, because it's embedded into a
larger struct, and we're relying on having the bottom two bits for our
flag, which would only be 16-bit aligned on m68k.

The existing macro-based helpers were a bit unwieldy for adding the bit
packing to, so this commit replaces them with safer and clearer ordinary
functions.

We add a test to the randomized/fuzzer part of the selftests, to free
the randomized tries by-peer, refuzz it, and repeat, until it's supposed
to be empty, and then then see if that actually resulted in the whole
thing being emptied. That combined with kmemcheck should hopefully make
sure this commit is doing what it should. Along the way this resulted in
various other cleanups of the tests and fixes for recent graphviz.

Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# dc680de2 04-Jun-2021 Jason A. Donenfeld <Jason@zx2c4.com>

wireguard: allowedips: allocate nodes in kmem_cache

The previous commit moved from O(n) to O(1) for removal, but in the
process introduced an additional pointer member to a struct that
increased the size from 60 to 68 bytes, putting nodes in the 128-byte
slab. With deployed systems having as many as 2 million nodes, this
represents a significant doubling in memory usage (128 MiB -> 256 MiB).
Fix this by using our own kmem_cache, that's sized exactly right. This
also makes wireguard's memory usage more transparent in tools like
slabtop and /proc/slabinfo.

Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f634f418 04-Jun-2021 Jason A. Donenfeld <Jason@zx2c4.com>

wireguard: allowedips: remove nodes in O(1)

Previously, deleting peers would require traversing the entire trie in
order to rebalance nodes and safely free them. This meant that removing
1000 peers from a trie with a half million nodes would take an extremely
long time, during which we're holding the rtnl lock. Large-scale users
were reporting 200ms latencies added to the networking stack as a whole
every time their userspace software would queue up significant removals.
That's a serious situation.

This commit fixes that by maintaining a double pointer to the parent's
bit pointer for each node, and then using the already existing node list
belonging to each peer to go directly to the node, fix up its pointers,
and free it with RCU. This means removal is O(1) instead of O(n), and we
don't use gobs of stack.

The removal algorithm has the same downside as the code that it fixes:
it won't collapse needlessly long runs of fillers. We can enhance that
in the future if it ever becomes a problem. This commit documents that
limitation with a TODO comment in code, a small but meaningful
improvement over the prior situation.

Currently the biggest flaw, which the next commit addresses, is that
because this increases the node size on 64-bit machines from 60 bytes to
68 bytes. 60 rounds up to 64, but 68 rounds up to 128. So we wind up
using twice as much memory per node, because of power-of-two
allocations, which is a big bummer. We'll need to figure something out
there.

Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9981159f 04-Feb-2020 Eric Dumazet <edumazet@google.com>

wireguard: allowedips: fix use-after-free in root_remove_peer_lists

In the unlikely case a new node could not be allocated, we need to
remove @newnode from @peer->allowedips_list before freeing it.

syzbot reported:

BUG: KASAN: use-after-free in __list_del_entry_valid+0xdc/0xf5 lib/list_debug.c:54
Read of size 8 at addr ffff88809881a538 by task syz-executor.4/30133

CPU: 0 PID: 30133 Comm: syz-executor.4 Not tainted 5.5.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x197/0x210 lib/dump_stack.c:118
print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
__kasan_report.cold+0x1b/0x32 mm/kasan/report.c:506
kasan_report+0x12/0x20 mm/kasan/common.c:639
__asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:135
__list_del_entry_valid+0xdc/0xf5 lib/list_debug.c:54
__list_del_entry include/linux/list.h:132 [inline]
list_del include/linux/list.h:146 [inline]
root_remove_peer_lists+0x24f/0x4b0 drivers/net/wireguard/allowedips.c:65
wg_allowedips_free+0x232/0x390 drivers/net/wireguard/allowedips.c:300
wg_peer_remove_all+0xd5/0x620 drivers/net/wireguard/peer.c:187
wg_set_device+0xd01/0x1350 drivers/net/wireguard/netlink.c:542
genl_family_rcv_msg_doit net/netlink/genetlink.c:672 [inline]
genl_family_rcv_msg net/netlink/genetlink.c:717 [inline]
genl_rcv_msg+0x67d/0xea0 net/netlink/genetlink.c:734
netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
genl_rcv+0x29/0x40 net/netlink/genetlink.c:745
netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
netlink_unicast+0x59e/0x7e0 net/netlink/af_netlink.c:1328
netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:672
____sys_sendmsg+0x753/0x880 net/socket.c:2343
___sys_sendmsg+0x100/0x170 net/socket.c:2397
__sys_sendmsg+0x105/0x1d0 net/socket.c:2430
__do_sys_sendmsg net/socket.c:2439 [inline]
__se_sys_sendmsg net/socket.c:2437 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2437
do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x45b399
Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f99a9bcdc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f99a9bce6d4 RCX: 000000000045b399
RDX: 0000000000000000 RSI: 0000000020001340 RDI: 0000000000000003
RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000004
R13: 00000000000009ba R14: 00000000004cb2b8 R15: 0000000000000009

Allocated by task 30103:
save_stack+0x23/0x90 mm/kasan/common.c:72
set_track mm/kasan/common.c:80 [inline]
__kasan_kmalloc mm/kasan/common.c:513 [inline]
__kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:486
kasan_kmalloc+0x9/0x10 mm/kasan/common.c:527
kmem_cache_alloc_trace+0x158/0x790 mm/slab.c:3551
kmalloc include/linux/slab.h:556 [inline]
kzalloc include/linux/slab.h:670 [inline]
add+0x70a/0x1970 drivers/net/wireguard/allowedips.c:236
wg_allowedips_insert_v4+0xf6/0x160 drivers/net/wireguard/allowedips.c:320
set_allowedip drivers/net/wireguard/netlink.c:343 [inline]
set_peer+0xfb9/0x1150 drivers/net/wireguard/netlink.c:468
wg_set_device+0xbd4/0x1350 drivers/net/wireguard/netlink.c:591
genl_family_rcv_msg_doit net/netlink/genetlink.c:672 [inline]
genl_family_rcv_msg net/netlink/genetlink.c:717 [inline]
genl_rcv_msg+0x67d/0xea0 net/netlink/genetlink.c:734
netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
genl_rcv+0x29/0x40 net/netlink/genetlink.c:745
netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
netlink_unicast+0x59e/0x7e0 net/netlink/af_netlink.c:1328
netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:672
____sys_sendmsg+0x753/0x880 net/socket.c:2343
___sys_sendmsg+0x100/0x170 net/socket.c:2397
__sys_sendmsg+0x105/0x1d0 net/socket.c:2430
__do_sys_sendmsg net/socket.c:2439 [inline]
__se_sys_sendmsg net/socket.c:2437 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2437
do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 30103:
save_stack+0x23/0x90 mm/kasan/common.c:72
set_track mm/kasan/common.c:80 [inline]
kasan_set_free_info mm/kasan/common.c:335 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/common.c:474
kasan_slab_free+0xe/0x10 mm/kasan/common.c:483
__cache_free mm/slab.c:3426 [inline]
kfree+0x10a/0x2c0 mm/slab.c:3757
add+0x12d2/0x1970 drivers/net/wireguard/allowedips.c:266
wg_allowedips_insert_v4+0xf6/0x160 drivers/net/wireguard/allowedips.c:320
set_allowedip drivers/net/wireguard/netlink.c:343 [inline]
set_peer+0xfb9/0x1150 drivers/net/wireguard/netlink.c:468
wg_set_device+0xbd4/0x1350 drivers/net/wireguard/netlink.c:591
genl_family_rcv_msg_doit net/netlink/genetlink.c:672 [inline]
genl_family_rcv_msg net/netlink/genetlink.c:717 [inline]
genl_rcv_msg+0x67d/0xea0 net/netlink/genetlink.c:734
netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
genl_rcv+0x29/0x40 net/netlink/genetlink.c:745
netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
netlink_unicast+0x59e/0x7e0 net/netlink/af_netlink.c:1328
netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:672
____sys_sendmsg+0x753/0x880 net/socket.c:2343
___sys_sendmsg+0x100/0x170 net/socket.c:2397
__sys_sendmsg+0x105/0x1d0 net/socket.c:2430
__do_sys_sendmsg net/socket.c:2439 [inline]
__se_sys_sendmsg net/socket.c:2437 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2437
do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff88809881a500
which belongs to the cache kmalloc-64 of size 64
The buggy address is located 56 bytes inside of
64-byte region [ffff88809881a500, ffff88809881a540)
The buggy address belongs to the page:
page:ffffea0002620680 refcount:1 mapcount:0 mapping:ffff8880aa400380 index:0x0
raw: 00fffe0000000200 ffffea000250b748 ffffea000254bac8 ffff8880aa400380
raw: 0000000000000000 ffff88809881a000 0000000100000020 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff88809881a400: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
ffff88809881a480: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
>ffff88809881a500: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
^
ffff88809881a580: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
ffff88809881a600: 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc

Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: wireguard@lists.zx2c4.com
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d89ee7d5 15-Dec-2019 Wei Yongjun <weiyongjun1@huawei.com>

wireguard: allowedips: use kfree_rcu() instead of call_rcu()

The callback function of call_rcu() just calls a kfree(), so we
can use kfree_rcu() instead of call_rcu() + callback function.

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e7096c13 08-Dec-2019 Jason A. Donenfeld <Jason@zx2c4.com>

net: WireGuard secure network tunnel

WireGuard is a layer 3 secure networking tunnel made specifically for
the kernel, that aims to be much simpler and easier to audit than IPsec.
Extensive documentation and description of the protocol and
considerations, along with formal proofs of the cryptography, are
available at:

* https://www.wireguard.com/
* https://www.wireguard.com/papers/wireguard.pdf

This commit implements WireGuard as a simple network device driver,
accessible in the usual RTNL way used by virtual network drivers. It
makes use of the udp_tunnel APIs, GRO, GSO, NAPI, and the usual set of
networking subsystem APIs. It has a somewhat novel multicore queueing
system designed for maximum throughput and minimal latency of encryption
operations, but it is implemented modestly using workqueues and NAPI.
Configuration is done via generic Netlink, and following a review from
the Netlink maintainer a year ago, several high profile userspace tools
have already implemented the API.

This commit also comes with several different tests, both in-kernel
tests and out-of-kernel tests based on network namespaces, taking profit
of the fact that sockets used by WireGuard intentionally stay in the
namespace the WireGuard interface was originally created, exactly like
the semantics of userspace tun devices. See wireguard.com/netns/ for
pictures and examples.

The source code is fairly short, but rather than combining everything
into a single file, WireGuard is developed as cleanly separable files,
making auditing and comprehension easier. Things are laid out as
follows:

* noise.[ch], cookie.[ch], messages.h: These implement the bulk of the
cryptographic aspects of the protocol, and are mostly data-only in
nature, taking in buffers of bytes and spitting out buffers of
bytes. They also handle reference counting for their various shared
pieces of data, like keys and key lists.

* ratelimiter.[ch]: Used as an integral part of cookie.[ch] for
ratelimiting certain types of cryptographic operations in accordance
with particular WireGuard semantics.

* allowedips.[ch], peerlookup.[ch]: The main lookup structures of
WireGuard, the former being trie-like with particular semantics, an
integral part of the design of the protocol, and the latter just
being nice helper functions around the various hashtables we use.

* device.[ch]: Implementation of functions for the netdevice and for
rtnl, responsible for maintaining the life of a given interface and
wiring it up to the rest of WireGuard.

* peer.[ch]: Each interface has a list of peers, with helper functions
available here for creation, destruction, and reference counting.

* socket.[ch]: Implementation of functions related to udp_socket and
the general set of kernel socket APIs, for sending and receiving
ciphertext UDP packets, and taking care of WireGuard-specific sticky
socket routing semantics for the automatic roaming.

* netlink.[ch]: Userspace API entry point for configuring WireGuard
peers and devices. The API has been implemented by several userspace
tools and network management utility, and the WireGuard project
distributes the basic wg(8) tool.

* queueing.[ch]: Shared function on the rx and tx path for handling
the various queues used in the multicore algorithms.

* send.c: Handles encrypting outgoing packets in parallel on
multiple cores, before sending them in order on a single core, via
workqueues and ring buffers. Also handles sending handshake and cookie
messages as part of the protocol, in parallel.

* receive.c: Handles decrypting incoming packets in parallel on
multiple cores, before passing them off in order to be ingested via
the rest of the networking subsystem with GRO via the typical NAPI
poll function. Also handles receiving handshake and cookie messages
as part of the protocol, in parallel.

* timers.[ch]: Uses the timer wheel to implement protocol particular
event timeouts, and gives a set of very simple event-driven entry
point functions for callers.

* main.c, version.h: Initialization and deinitialization of the module.

* selftest/*.h: Runtime unit tests for some of the most security
sensitive functions.

* tools/testing/selftests/wireguard/netns.sh: Aforementioned testing
script using network namespaces.

This commit aims to be as self-contained as possible, implementing
WireGuard as a standalone module not needing much special handling or
coordination from the network subsystem. I expect for future
optimizations to the network stack to positively improve WireGuard, and
vice-versa, but for the time being, this exists as intentionally
standalone.

We introduce a menu option for CONFIG_WIREGUARD, as well as providing a
verbose debug log and self-tests via CONFIG_WIREGUARD_DEBUG.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: David Miller <davem@davemloft.net>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>