#
04d9d1fc |
|
08-Mar-2024 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Fix refcnt handling in __inet_hash_connect(). syzbot reported a warning in sk_nulls_del_node_init_rcu(). The commit 66b60b0c8c4a ("dccp/tcp: Unhash sk from ehash for tb2 alloc failure after check_estalblished().") tried to fix an issue that an unconnected socket occupies an ehash entry when bhash2 allocation fails. In such a case, we need to revert changes done by check_established(), which does not hold refcnt when inserting socket into ehash. So, to revert the change, we need to __sk_nulls_add_node_rcu() instead of sk_nulls_add_node_rcu(). Otherwise, sock_put() will cause refcnt underflow and leak the socket. [0]: WARNING: CPU: 0 PID: 23948 at include/net/sock.h:799 sk_nulls_del_node_init_rcu+0x166/0x1a0 include/net/sock.h:799 Modules linked in: CPU: 0 PID: 23948 Comm: syz-executor.2 Not tainted 6.8.0-rc6-syzkaller-00159-gc055fc00c07b #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024 RIP: 0010:sk_nulls_del_node_init_rcu+0x166/0x1a0 include/net/sock.h:799 Code: e8 7f 71 c6 f7 83 fb 02 7c 25 e8 35 6d c6 f7 4d 85 f6 0f 95 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc e8 1b 6d c6 f7 90 <0f> 0b 90 eb b2 e8 10 6d c6 f7 4c 89 e7 be 04 00 00 00 e8 63 e7 d2 RSP: 0018:ffffc900032d7848 EFLAGS: 00010246 RAX: ffffffff89cd0035 RBX: 0000000000000001 RCX: 0000000000040000 RDX: ffffc90004de1000 RSI: 000000000003ffff RDI: 0000000000040000 RBP: 1ffff1100439ac26 R08: ffffffff89ccffe3 R09: 1ffff1100439ac28 R10: dffffc0000000000 R11: ffffed100439ac29 R12: ffff888021cd6140 R13: dffffc0000000000 R14: ffff88802a9bf5c0 R15: ffff888021cd6130 FS: 00007f3b823f16c0(0000) GS:ffff8880b9400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f3b823f0ff8 CR3: 000000004674a000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> __inet_hash_connect+0x140f/0x20b0 net/ipv4/inet_hashtables.c:1139 dccp_v6_connect+0xcb9/0x1480 net/dccp/ipv6.c:956 __inet_stream_connect+0x262/0xf30 net/ipv4/af_inet.c:678 inet_stream_connect+0x65/0xa0 net/ipv4/af_inet.c:749 __sys_connect_file net/socket.c:2048 [inline] __sys_connect+0x2df/0x310 net/socket.c:2065 __do_sys_connect net/socket.c:2075 [inline] __se_sys_connect net/socket.c:2072 [inline] __x64_sys_connect+0x7a/0x90 net/socket.c:2072 do_syscall_64+0xf9/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 RIP: 0033:0x7f3b8167dda9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f3b823f10c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 00007f3b817abf80 RCX: 00007f3b8167dda9 RDX: 000000000000001c RSI: 0000000020000040 RDI: 0000000000000003 RBP: 00007f3b823f1120 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 R13: 000000000000000b R14: 00007f3b817abf80 R15: 00007ffd3beb57b8 </TASK> Reported-by: syzbot+12c506c1aae251e70449@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=12c506c1aae251e70449 Fixes: 66b60b0c8c4a ("dccp/tcp: Unhash sk from ehash for tb2 alloc failure after check_estalblished().") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240308201623.65448-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
6e073572 |
|
06-Mar-2024 |
Eric Dumazet <edumazet@google.com> |
inet: move inet_ehash_secret and udp_ehash_secret into net_hotdata "struct net_protocol" has a 32bit hole in 32bit arches. Use it to store the 32bit secret used by UDP and TCP, to increase cache locality in rx path. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240306160031.874438-15-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
66b60b0c |
|
14-Feb-2024 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
dccp/tcp: Unhash sk from ehash for tb2 alloc failure after check_estalblished(). syzkaller reported a warning [0] in inet_csk_destroy_sock() with no repro. WARN_ON(inet_sk(sk)->inet_num && !inet_csk(sk)->icsk_bind_hash); However, the syzkaller's log hinted that connect() failed just before the warning due to FAULT_INJECTION. [1] When connect() is called for an unbound socket, we search for an available ephemeral port. If a bhash bucket exists for the port, we call __inet_check_established() or __inet6_check_established() to check if the bucket is reusable. If reusable, we add the socket into ehash and set inet_sk(sk)->inet_num. Later, we look up the corresponding bhash2 bucket and try to allocate it if it does not exist. Although it rarely occurs in real use, if the allocation fails, we must revert the changes by check_established(). Otherwise, an unconnected socket could illegally occupy an ehash entry. Note that we do not put tw back into ehash because sk might have already responded to a packet for tw and it would be better to free tw earlier under such memory presure. [0]: WARNING: CPU: 0 PID: 350830 at net/ipv4/inet_connection_sock.c:1193 inet_csk_destroy_sock (net/ipv4/inet_connection_sock.c:1193) Modules linked in: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:inet_csk_destroy_sock (net/ipv4/inet_connection_sock.c:1193) Code: 41 5c 41 5d 41 5e e9 2d 4a 3d fd e8 28 4a 3d fd 48 89 ef e8 f0 cd 7d ff 5b 5d 41 5c 41 5d 41 5e e9 13 4a 3d fd e8 0e 4a 3d fd <0f> 0b e9 61 fe ff ff e8 02 4a 3d fd 4c 89 e7 be 03 00 00 00 e8 05 RSP: 0018:ffffc9000b21fd38 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 0000000000009e78 RCX: ffffffff840bae40 RDX: ffff88806e46c600 RSI: ffffffff840bb012 RDI: ffff88811755cca8 RBP: ffff88811755c880 R08: 0000000000000003 R09: 0000000000000000 R10: 0000000000009e78 R11: 0000000000000000 R12: ffff88811755c8e0 R13: ffff88811755c892 R14: ffff88811755c918 R15: 0000000000000000 FS: 00007f03e5243800(0000) GS:ffff88811ae00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b32f21000 CR3: 0000000112ffe001 CR4: 0000000000770ef0 PKRU: 55555554 Call Trace: <TASK> ? inet_csk_destroy_sock (net/ipv4/inet_connection_sock.c:1193) dccp_close (net/dccp/proto.c:1078) inet_release (net/ipv4/af_inet.c:434) __sock_release (net/socket.c:660) sock_close (net/socket.c:1423) __fput (fs/file_table.c:377) __fput_sync (fs/file_table.c:462) __x64_sys_close (fs/open.c:1557 fs/open.c:1539 fs/open.c:1539) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129) RIP: 0033:0x7f03e53852bb Code: 03 00 00 00 0f 05 48 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 43 c9 f5 ff 8b 7c 24 0c 41 89 c0 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 89 44 24 0c e8 a1 c9 f5 ff 8b 44 RSP: 002b:00000000005dfba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000003 RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f03e53852bb RDX: 0000000000000002 RSI: 0000000000000002 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000167c R10: 0000000008a79680 R11: 0000000000000293 R12: 00007f03e4e43000 R13: 00007f03e4e43170 R14: 00007f03e4e43178 R15: 00007f03e4e43170 </TASK> [1]: FAULT_INJECTION: forcing a failure. name failslab, interval 1, probability 0, space 0, times 0 CPU: 0 PID: 350833 Comm: syz-executor.1 Not tainted 6.7.0-12272-g2121c43f88f5 #9 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1)) should_fail_ex (lib/fault-inject.c:52 lib/fault-inject.c:153) should_failslab (mm/slub.c:3748) kmem_cache_alloc (mm/slub.c:3763 mm/slub.c:3842 mm/slub.c:3867) inet_bind2_bucket_create (net/ipv4/inet_hashtables.c:135) __inet_hash_connect (net/ipv4/inet_hashtables.c:1100) dccp_v4_connect (net/dccp/ipv4.c:116) __inet_stream_connect (net/ipv4/af_inet.c:676) inet_stream_connect (net/ipv4/af_inet.c:747) __sys_connect_file (net/socket.c:2048 (discriminator 2)) __sys_connect (net/socket.c:2065) __x64_sys_connect (net/socket.c:2072) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129) RIP: 0033:0x7f03e5284e5d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48 RSP: 002b:00007f03e4641cc8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 00000000004bbf80 RCX: 00007f03e5284e5d RDX: 0000000000000010 RSI: 0000000020000000 RDI: 0000000000000003 RBP: 00000000004bbf80 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 R13: 000000000000000b R14: 00007f03e52e5530 R15: 0000000000000000 </TASK> Reported-by: syzkaller <syzkaller@googlegroups.com> Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8191792c |
|
18-Dec-2023 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Remove dead code and fields for bhash2. Now all sockets including TIME_WAIT are linked to bhash2 using sock_common.skc_bind_node. We no longer use inet_bind2_bucket.deathrow, sock.sk_bind2_node, and inet_timewait_sock.tw_bind2_node. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
770041d3 |
|
18-Dec-2023 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Link sk and twsk to tb2->owners using skc_bind_node. Now we can use sk_bind_node/tw_bind_node for bhash2, which means we need not link TIME_WAIT sockets separately. The dead code and sk_bind2_node will be removed in the next patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b2cb9f9e |
|
18-Dec-2023 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Unlink sk from bhash. Now we do not use tb->owners and can unlink sockets from bhash. sk_bind_node/tw_bind_node are available for bhash2 and will be used in the following patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8002d44f |
|
18-Dec-2023 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Check hlist_empty(&tb->bhash2) instead of hlist_empty(&tb->owners). We use hlist_empty(&tb->owners) to check if the bhash bucket has a socket. We can check the child bhash2 buckets instead. For this to work, the bhash2 bucket must be freed before the bhash bucket. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
822fb91f |
|
18-Dec-2023 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Link bhash2 to bhash. bhash2 added a new member sk_bind2_node in struct sock to link sockets to bhash2 in addition to bhash. bhash is still needed to search conflicting sockets efficiently from a port for the wildcard address. However, bhash itself need not have sockets. If we link each bhash2 bucket to the corresponding bhash bucket, we can iterate the same set of the sockets from bhash2 via bhash. This patch links bhash2 to bhash only, and the actual use will be in the later patches. Finally, we will remove sk_bind2_node. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4dd71088 |
|
18-Dec-2023 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Rename tb in inet_bind2_bucket_(init|create)(). Later, we no longer link sockets to bhash. Instead, each bhash2 bucket is linked to the corresponding bhash bucket. Then, we pass the bhash bucket to bhash2 allocation functions as tb. However, tb is already used in inet_bind2_bucket_create() and inet_bind2_bucket_init() as the bhash2 bucket. To make the following diff clear, let's use tb2 for the bhash2 bucket there. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5a22bba1 |
|
18-Dec-2023 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Save address type in inet_bind2_bucket. inet_bind2_bucket_addr_match() and inet_bind2_bucket_match_addr_any() are called for each bhash2 bucket to check conflicts. Thus, we call ipv6_addr_any() and ipv6_addr_v4mapped() over and over during bind(). Let's avoid calling them by saving the address type in inet_bind2_bucket. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
06a8c04f |
|
18-Dec-2023 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Save v4 address as v4-mapped-v6 in inet_bind2_bucket.v6_rcv_saddr. In bhash2, IPv4/IPv6 addresses are saved in two union members, which complicate address checks in inet_bind2_bucket_addr_match() and inet_bind2_bucket_match_addr_any() considering uninitialised memory and v4-mapped-v6 conflicts. Let's simplify that by saving IPv4 address as v4-mapped-v6 address and defining tb2.rcv_saddr as tb2.v6_rcv_saddr.s6_addr32[3]. Then, we can compare v6 address as is, and after checking v4-mapped-v6, we can compare v4 address easily. Also, we can remove tb2->family. Note these functions will be further refactored in the next patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
56f3e3f0 |
|
18-Dec-2023 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Rearrange tests in inet_bind2_bucket_(addr_match|match_addr_any)(). The protocol family tests in inet_bind2_bucket_addr_match() and inet_bind2_bucket_match_addr_any() are ordered as follows. if (sk->sk_family != tb2->family) else if (sk->sk_family == AF_INET6) else This patch rearranges them so that AF_INET6 socket is handled first to make the following patch tidy, where tb2->family will be removed. if (sk->sk_family == AF_INET6) else if (tb2->family == AF_INET6) else Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5e07e672 |
|
18-Dec-2023 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Use bhash2 for v4-mapped-v6 non-wildcard address. While checking port availability in bind() or listen(), we used only bhash for all v4-mapped-v6 addresses. But there is no good reason not to use bhash2 for v4-mapped-v6 non-wildcard addresses. Let's do it by returning true in inet_use_bhash2_on_bind(). Then, we also need to add a test in inet_bind2_bucket_match_addr_any() so that ::ffff:X.X.X.X will match with 0.0.0.0. Note that sk->sk_rcv_saddr is initialised for v4-mapped-v6 sk in __inet6_bind(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
20718485 |
|
14-Dec-2023 |
Eric Dumazet <edumazet@google.com> |
tcp/dccp: change source port selection at connect() time In commit 1580ab63fc9a ("tcp/dccp: better use of ephemeral ports in connect()") we added an heuristic to select even ports for connect() and odd ports for bind(). This was nice because no applications changes were needed. But it added more costs when all even ports are in use, when there are few listeners and many active connections. Since then, IP_LOCAL_PORT_RANGE has been added to permit an application to partition ephemeral port range at will. This patch extends the idea so that if IP_LOCAL_PORT_RANGE is set on a socket before accept(), port selection no longer favors even ports. This means that connect() can find a suitable source port faster, and applications can use a different split between connect() and bind() users. This should give more entropy to Toeplitz hash used in RSS: Using even ports was wasting one bit from the 16bit sport. A similar change can be done in inet_csk_find_open_port() if needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://lore.kernel.org/r/20231214192939.1962891-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
871019b2 |
|
08-Nov-2023 |
Stanislav Fomichev <sdf@google.com> |
net: set SOCK_RCU_FREE before inserting socket into hashtable We've started to see the following kernel traces: WARNING: CPU: 83 PID: 0 at net/core/filter.c:6641 sk_lookup+0x1bd/0x1d0 Call Trace: <IRQ> __bpf_skc_lookup+0x10d/0x120 bpf_sk_lookup+0x48/0xd0 bpf_sk_lookup_tcp+0x19/0x20 bpf_prog_<redacted>+0x37c/0x16a3 cls_bpf_classify+0x205/0x2e0 tcf_classify+0x92/0x160 __netif_receive_skb_core+0xe52/0xf10 __netif_receive_skb_list_core+0x96/0x2b0 napi_complete_done+0x7b5/0xb70 <redacted>_poll+0x94/0xb0 net_rx_action+0x163/0x1d70 __do_softirq+0xdc/0x32e asm_call_irq_on_stack+0x12/0x20 </IRQ> do_softirq_own_stack+0x36/0x50 do_softirq+0x44/0x70 __inet_hash can race with lockless (rcu) readers on the other cpus: __inet_hash __sk_nulls_add_node_rcu <- (bpf triggers here) sock_set_flag(SOCK_RCU_FREE) Let's move the SOCK_RCU_FREE part up a bit, before we are inserting the socket into hashtables. Note, that the race is really harmless; the bpf callers are handling this situation (where listener socket doesn't have SOCK_RCU_FREE set) correctly, so the only annoyance is a WARN_ONCE. More details from Eric regarding SOCK_RCU_FREE timeline: Commit 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood") added SOCK_RCU_FREE. At that time, the precise location of sock_set_flag(sk, SOCK_RCU_FREE) did not matter, because the thread calling __inet_hash() owns a reference on sk. SOCK_RCU_FREE was only tested at dismantle time. Commit 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF") started checking SOCK_RCU_FREE _after_ the lookup to infer whether the refcount has been taken care of. Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF") Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8702cf12 |
|
09-Oct-2023 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Fix listen() warning with v4-mapped-v6 address. syzbot reported a warning [0] introduced by commit c48ef9c4aed3 ("tcp: Fix bind() regression for v4-mapped-v6 non-wildcard address."). After the cited commit, a v4 socket's address matches the corresponding v4-mapped-v6 tb2 in inet_bind2_bucket_match_addr(), not vice versa. During X.X.X.X -> ::ffff:X.X.X.X order bind()s, the second bind() uses bhash and conflicts properly without checking bhash2 so that we need not check if a v4-mapped-v6 sk matches the corresponding v4 address tb2 in inet_bind2_bucket_match_addr(). However, the repro shows that we need to check that in a no-conflict case. The repro bind()s two sockets to the 2-tuples using SO_REUSEPORT and calls listen() for the first socket: from socket import * s1 = socket() s1.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1) s1.bind(('127.0.0.1', 0)) s2 = socket(AF_INET6) s2.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1) s2.bind(('::ffff:127.0.0.1', s1.getsockname()[1])) s1.listen() The second socket should belong to the first socket's tb2, but the second bind() creates another tb2 bucket because inet_bind2_bucket_find() returns NULL in inet_csk_get_port() as the v4-mapped-v6 sk does not match the corresponding v4 address tb2. bhash2[] -> tb2(::ffff:X.X.X.X) -> tb2(X.X.X.X) Then, listen() for the first socket calls inet_csk_get_port(), where the v4 address matches the v4-mapped-v6 tb2 and WARN_ON() is triggered. To avoid that, we need to check if v4-mapped-v6 sk address matches with the corresponding v4 address tb2 in inet_bind2_bucket_match(). The same checks are needed in inet_bind2_bucket_addr_match() too, so we can move all checks there and call it from inet_bind2_bucket_match(). Note that now tb->family is just an address family of tb->(v6_)?rcv_saddr and not of sockets in the bucket. This could be refactored later by defining tb->rcv_saddr as tb->v6_rcv_saddr.s6_addr32[3] and prepending ::ffff: when creating v4 tb2. [0]: WARNING: CPU: 0 PID: 5049 at net/ipv4/inet_connection_sock.c:587 inet_csk_get_port+0xf96/0x2350 net/ipv4/inet_connection_sock.c:587 Modules linked in: CPU: 0 PID: 5049 Comm: syz-executor288 Not tainted 6.6.0-rc2-syzkaller-00018-g2cf0f7156238 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/04/2023 RIP: 0010:inet_csk_get_port+0xf96/0x2350 net/ipv4/inet_connection_sock.c:587 Code: 7c 24 08 e8 4c b6 8a 01 31 d2 be 88 01 00 00 48 c7 c7 e0 94 ae 8b e8 59 2e a3 f8 2e 2e 2e 31 c0 e9 04 fe ff ff e8 ca 88 d0 f8 <0f> 0b e9 0f f9 ff ff e8 be 88 d0 f8 49 8d 7e 48 e8 65 ca 5a 00 31 RSP: 0018:ffffc90003abfbf0 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff888026429100 RCX: 0000000000000000 RDX: ffff88807edcbb80 RSI: ffffffff88b73d66 RDI: ffff888026c49f38 RBP: ffff888026c49f30 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff9260f200 R13: ffff888026c49880 R14: 0000000000000000 R15: ffff888026429100 FS: 00005555557d5380(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000045ad50 CR3: 0000000025754000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> inet_csk_listen_start+0x155/0x360 net/ipv4/inet_connection_sock.c:1256 __inet_listen_sk+0x1b8/0x5c0 net/ipv4/af_inet.c:217 inet_listen+0x93/0xd0 net/ipv4/af_inet.c:239 __sys_listen+0x194/0x270 net/socket.c:1866 __do_sys_listen net/socket.c:1875 [inline] __se_sys_listen net/socket.c:1873 [inline] __x64_sys_listen+0x53/0x80 net/socket.c:1873 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f3a5bce3af9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 17 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffc1a1c79e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000032 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a5bce3af9 RDX: 00007f3a5bce3af9 RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00007f3a5bd565f0 R08: 0000000000000006 R09: 0000000000000006 R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000001 R13: 431bde82d7b634db R14: 0000000000000001 R15: 0000000000000001 </TASK> Fixes: c48ef9c4aed3 ("tcp: Fix bind() regression for v4-mapped-v6 non-wildcard address.") Reported-by: syzbot+71e724675ba3958edb31@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=71e724675ba3958edb31 Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231010013814.70571-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
c48ef9c4 |
|
11-Sep-2023 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Fix bind() regression for v4-mapped-v6 non-wildcard address. Since bhash2 was introduced, the example below does not work as expected. These two bind() should conflict, but the 2nd bind() now succeeds. from socket import * s1 = socket(AF_INET6, SOCK_STREAM) s1.bind(('::ffff:127.0.0.1', 0)) s2 = socket(AF_INET, SOCK_STREAM) s2.bind(('127.0.0.1', s1.getsockname()[1])) During the 2nd bind() in inet_csk_get_port(), inet_bind2_bucket_find() fails to find the 1st socket's tb2, so inet_bind2_bucket_create() allocates a new tb2 for the 2nd socket. Then, we call inet_csk_bind_conflict() that checks conflicts in the new tb2 by inet_bhash2_conflict(). However, the new tb2 does not include the 1st socket, thus the bind() finally succeeds. In this case, inet_bind2_bucket_match() must check if AF_INET6 tb2 has the conflicting v4-mapped-v6 address so that inet_bind2_bucket_find() returns the 1st socket's tb2. Note that if we bind two sockets to 127.0.0.1 and then ::FFFF:127.0.0.1, the 2nd bind() fails properly for the same reason mentinoed in the previous commit. Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
aa99e5f8 |
|
11-Sep-2023 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Fix bind() regression for v4-mapped-v6 wildcard address. Andrei Vagin reported bind() regression with strace logs. If we bind() a TCPv6 socket to ::FFFF:0.0.0.0 and then bind() a TCPv4 socket to 127.0.0.1, the 2nd bind() should fail but now succeeds. from socket import * s1 = socket(AF_INET6, SOCK_STREAM) s1.bind(('::ffff:0.0.0.0', 0)) s2 = socket(AF_INET, SOCK_STREAM) s2.bind(('127.0.0.1', s1.getsockname()[1])) During the 2nd bind(), if tb->family is AF_INET6 and sk->sk_family is AF_INET in inet_bind2_bucket_match_addr_any(), we still need to check if tb has the v4-mapped-v6 wildcard address. The example above does not work after commit 5456262d2baa ("net: Fix incorrect address comparison when searching for a bind2 bucket"), but the blamed change is not the commit. Before the commit, the leading zeros of ::FFFF:0.0.0.0 were treated as 0.0.0.0, and the sequence above worked by chance. Technically, this case has been broken since bhash2 was introduced. Note that if we bind() two sockets to 127.0.0.1 and then ::FFFF:0.0.0.0, the 2nd bind() fails properly because we fall back to using bhash to detect conflicts for the v4-mapped-v6 address. Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address") Reported-by: Andrei Vagin <avagin@google.com> Closes: https://lore.kernel.org/netdev/ZPuYBOFC8zsK6r9T@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c6d27706 |
|
11-Sep-2023 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Factorise sk_family-independent comparison in inet_bind2_bucket_match(_addr_any). This is a prep patch to make the following patches cleaner that touch inet_bind2_bucket_match() and inet_bind2_bucket_match_addr_any(). Both functions have duplicated comparison for netns, port, and l3mdev. Let's factorise them. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
74bdfab4 |
|
31-Jul-2023 |
Lorenz Bauer <lmb@isovalent.com> |
net: remove duplicate INDIRECT_CALLABLE_DECLARE of udp[6]_ehashfn There are already INDIRECT_CALLABLE_DECLARE in the hashtable headers, no need to declare them again. Fixes: 0f495f761722 ("net: remove duplicate reuseport_lookup functions") Suggested-by: Martin Lau <martin.lau@linux.dev> Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230731-indir-call-v1-1-4cd0aeaee64f@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
#
6c886db2 |
|
20-Jul-2023 |
Lorenz Bauer <lmb@isovalent.com> |
net: remove duplicate sk_lookup helpers Now that inet[6]_lookup_reuseport are parameterised on the ehashfn we can remove two sk_lookup helpers. Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-6-7021b683cdae@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
#
2a617763 |
|
20-Jul-2023 |
Lorenz Bauer <lmb@isovalent.com> |
net: document inet[6]_lookup_reuseport sk_state requirements The current implementation was extracted from inet[6]_lhash2_lookup in commit 80b373f74f9e ("inet: Extract helper for selecting socket from reuseport group") and commit 5df6531292b5 ("inet6: Extract helper for selecting socket from reuseport group"). In the original context, sk is always in TCP_LISTEN state and so did not have a separate check. Add documentation that specifies which sk_state are valid to pass to the function. Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-5-7021b683cdae@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
#
0f495f76 |
|
20-Jul-2023 |
Lorenz Bauer <lmb@isovalent.com> |
net: remove duplicate reuseport_lookup functions There are currently four copies of reuseport_lookup: one each for (TCP, UDP)x(IPv4, IPv6). This forces us to duplicate all callers of those functions as well. This is already the case for sk_lookup helpers (inet,inet6,udp4,udp6)_lookup_run_bpf. There are two differences between the reuseport_lookup helpers: 1. They call different hash functions depending on protocol 2. UDP reuseport_lookup checks that sk_state != TCP_ESTABLISHED Move the check for sk_state into the caller and use the INDIRECT_CALL infrastructure to cut down the helpers to one per IP version. Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-4-7021b683cdae@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
#
ce796e60 |
|
20-Jul-2023 |
Lorenz Bauer <lmb@isovalent.com> |
net: export inet_lookup_reuseport and inet6_lookup_reuseport Rename the existing reuseport helpers for IPv4 and IPv6 so that they can be invoked in the follow up commit. Export them so that building DCCP and IPv6 as a module works. No change in functionality. Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-3-7021b683cdae@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
#
81b3ade5 |
|
17-Jul-2023 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
Revert "tcp: avoid the lookup process failing to get sk in ehash table" This reverts commit 3f4ca5fafc08881d7a57daa20449d171f2887043. Commit 3f4ca5fafc08 ("tcp: avoid the lookup process failing to get sk in ehash table") reversed the order in how a socket is inserted into ehash to fix an issue that ehash-lookup could fail when reqsk/full sk/twsk are swapped. However, it introduced another lookup failure. The full socket in ehash is allocated from a slab with SLAB_TYPESAFE_BY_RCU and does not have SOCK_RCU_FREE, so the socket could be reused even while it is being referenced on another CPU doing RCU lookup. Let's say a socket is reused and inserted into the same hash bucket during lookup. After the blamed commit, a new socket is inserted at the end of the list. If that happens, we will skip sockets placed after the previous position of the reused socket, resulting in ehash lookup failure. As described in Documentation/RCU/rculist_nulls.rst, we should insert a new socket at the head of the list to avoid such an issue. This issue, the swap-lookup-failure, and another variant reported in [0] can all be handled properly by adding a locked ehash lookup suggested by Eric Dumazet [1]. However, this issue could occur for every packet, thus more likely than the other two races, so let's revert the change for now. Link: https://lore.kernel.org/netdev/20230606064306.9192-1-duanmuquan@baidu.com/ [0] Link: https://lore.kernel.org/netdev/CANn89iK8snOz8TYOhhwfimC7ykYA78GA3Nyv8x06SZYa1nKdyA@mail.gmail.com/ [1] Fixes: 3f4ca5fafc08 ("tcp: avoid the lookup process failing to get sk in ehash table") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230717215918.15723-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
8cdc3223 |
|
27-Mar-2023 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
ipv6: Remove in6addr_any alternatives. Some code defines the IPv6 wildcard address as a local variable and use it with memcmp() or ipv6_addr_equal(). Let's use in6addr_any and ipv6_addr_any() instead. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d9ba9934 |
|
11-Mar-2023 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Fix bind() conflict check for dual-stack wildcard address. Paul Holzinger reported [0] that commit 5456262d2baa ("net: Fix incorrect address comparison when searching for a bind2 bucket") introduced a bind() regression. Paul also gave a nice repro that calls two types of bind() on the same port, both of which now succeed, but the second call should fail: bind(fd1, ::, port) + bind(fd2, 127.0.0.1, port) The cited commit added address family tests in three functions to fix the uninit-value KMSAN report. [1] However, the test added to inet_bind2_bucket_match_addr_any() removed a necessary conflict check; the dual-stack wildcard address no longer conflicts with an IPv4 non-wildcard address. If tb->family is AF_INET6 and sk->sk_family is AF_INET in inet_bind2_bucket_match_addr_any(), we still need to check if tb has the dual-stack wildcard address. Note that the IPv4 wildcard address does not conflict with IPv6 non-wildcard addresses. [0]: https://lore.kernel.org/netdev/e21bf153-80b0-9ec0-15ba-e04a4ad42c34@redhat.com/ [1]: https://lore.kernel.org/netdev/CAG_fn=Ud3zSW7AZWXc+asfMhZVL5ETnvuY44Pmyv4NPv-ijN-A@mail.gmail.com/ Fixes: 5456262d2baa ("net: Fix incorrect address comparison when searching for a bind2 bucket") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reported-by: Paul Holzinger <pholzing@redhat.com> Link: https://lore.kernel.org/netdev/CAG_fn=Ud3zSW7AZWXc+asfMhZVL5ETnvuY44Pmyv4NPv-ijN-A@mail.gmail.com/ Reviewed-by: Eric Dumazet <edumazet@google.com> Tested-by: Paul Holzinger <pholzing@redhat.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
91d0b78c |
|
24-Jan-2023 |
Jakub Sitnicki <jakub@cloudflare.com> |
inet: Add IP_LOCAL_PORT_RANGE socket option Users who want to share a single public IP address for outgoing connections between several hosts traditionally reach for SNAT. However, SNAT requires state keeping on the node(s) performing the NAT. A stateless alternative exists, where a single IP address used for egress can be shared between several hosts by partitioning the available ephemeral port range. In such a setup: 1. Each host gets assigned a disjoint range of ephemeral ports. 2. Applications open connections from the host-assigned port range. 3. Return traffic gets routed to the host based on both, the destination IP and the destination port. An application which wants to open an outgoing connection (connect) from a given port range today can choose between two solutions: 1. Manually pick the source port by bind()'ing to it before connect()'ing the socket. This approach has a couple of downsides: a) Search for a free port has to be implemented in the user-space. If the chosen 4-tuple happens to be busy, the application needs to retry from a different local port number. Detecting if 4-tuple is busy can be either easy (TCP) or hard (UDP). In TCP case, the application simply has to check if connect() returned an error (EADDRNOTAVAIL). That is assuming that the local port sharing was enabled (REUSEADDR) by all the sockets. # Assume desired local port range is 60_000-60_511 s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) s.bind(("192.0.2.1", 60_000)) s.connect(("1.1.1.1", 53)) # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy # Application must retry with another local port In case of UDP, the network stack allows binding more than one socket to the same 4-tuple, when local port sharing is enabled (REUSEADDR). Hence detecting the conflict is much harder and involves querying sock_diag and toggling the REUSEADDR flag [1]. b) For TCP, bind()-ing to a port within the ephemeral port range means that no connecting sockets, that is those which leave it to the network stack to find a free local port at connect() time, can use the this port. IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port will be skipped during the free port search at connect() time. 2. Isolate the app in a dedicated netns and use the use the per-netns ip_local_port_range sysctl to adjust the ephemeral port range bounds. The per-netns setting affects all sockets, so this approach can be used only if: - there is just one egress IP address, or - the desired egress port range is the same for all egress IP addresses used by the application. For TCP, this approach avoids the downsides of (1). Free port search and 4-tuple conflict detection is done by the network stack: system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'") s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1) s.bind(("192.0.2.1", 0)) s.connect(("1.1.1.1", 53)) # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy For UDP this approach has limited applicability. Setting the IP_BIND_ADDRESS_NO_PORT socket option does not result in local source port being shared with other connected UDP sockets. Hence relying on the network stack to find a free source port, limits the number of outgoing UDP flows from a single IP address down to the number of available ephemeral ports. To put it another way, partitioning the ephemeral port range between hosts using the existing Linux networking API is cumbersome. To address this use case, add a new socket option at the SOL_IP level, named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the ephemeral port range for each socket individually. The option can be used only to narrow down the per-netns local port range. If the per-socket range lies outside of the per-netns range, the latter takes precedence. UAPI-wise, the low and high range bounds are passed to the kernel as a pair of u16 values in host byte order packed into a u32. This avoids pointer passing. PORT_LO = 40_000 PORT_HI = 40_511 s = socket(AF_INET, SOCK_STREAM) v = struct.pack("I", PORT_HI << 16 | PORT_LO) s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v) s.bind(("127.0.0.1", 0)) s.getsockname() # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511), # if there is a free port. EADDRINUSE otherwise. [1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116 Reviewed-by: Marek Majkowski <marek@cloudflare.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
21cbd90a |
|
14-Jan-2023 |
Pietro Borrello <borrello@diag.uniroma1.it> |
inet: fix fast path in __inet_hash_connect() __inet_hash_connect() has a fast path taken if sk_head(&tb->owners) is equal to the sk parameter. sk_head() returns the hlist_entry() with respect to the sk_node field. However entries in the tb->owners list are inserted with respect to the sk_bind_node field with sk_add_bind_node(). Thus the check would never pass and the fast path never execute. This fast path has never been executed or tested as this bug seems to be present since commit 1da177e4c3f4 ("Linux-2.6.12-rc2"), thus remove it to reduce code complexity. Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230112-inet_hash_connect_bind_head-v3-1-b591fd212b93@diag.uniroma1.it Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
3f4ca5fa |
|
17-Jan-2023 |
Jason Xing <kernelxing@tencent.com> |
tcp: avoid the lookup process failing to get sk in ehash table While one cpu is working on looking up the right socket from ehash table, another cpu is done deleting the request socket and is about to add (or is adding) the big socket from the table. It means that we could miss both of them, even though it has little chance. Let me draw a call trace map of the server side. CPU 0 CPU 1 ----- ----- tcp_v4_rcv() syn_recv_sock() inet_ehash_insert() -> sk_nulls_del_node_init_rcu(osk) __inet_lookup_established() -> __sk_nulls_add_node_rcu(sk, list) Notice that the CPU 0 is receiving the data after the final ack during 3-way shakehands and CPU 1 is still handling the final ack. Why could this be a real problem? This case is happening only when the final ack and the first data receiving by different CPUs. Then the server receiving data with ACK flag tries to search one proper established socket from ehash table, but apparently it fails as my map shows above. After that, the server fetches a listener socket and then sends a RST because it finds a ACK flag in the skb (data), which obeys RST definition in RFC 793. Besides, Eric pointed out there's one more race condition where it handles tw socket hashdance. Only by adding to the tail of the list before deleting the old one can we avoid the race if the reader has already begun the bucket traversal and it would possibly miss the head. Many thanks to Eric for great help from beginning to end. Fixes: 5e0724d027f0 ("tcp/dccp: fix hashdance race for passive sessions") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/lkml/20230112065336.41034-1-kerneljasonxing@gmail.com/ Link: https://lore.kernel.org/r/20230118015941.1313-1-kerneljasonxing@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
936a192f |
|
26-Dec-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Add TIME_WAIT sockets in bhash2. Jiri Slaby reported regression of bind() with a simple repro. [0] The repro creates a TIME_WAIT socket and tries to bind() a new socket with the same local address and port. Before commit 28044fc1d495 ("net: Add a bhash2 table hashed by port and address"), the bind() failed with -EADDRINUSE, but now it succeeds. The cited commit should have put TIME_WAIT sockets into bhash2; otherwise, inet_bhash2_conflict() misses TIME_WAIT sockets when validating bind() requests if the address is not a wildcard one. The straight option is to move sk_bind2_node from struct sock to struct sock_common to add twsk to bhash2 as implemented as RFC. [1] However, the binary layout change in the struct sock could affect performances moving hot fields on different cachelines. To avoid that, we add another TIME_WAIT list in inet_bind2_bucket and check it while validating bind(). [0]: https://lore.kernel.org/netdev/6b971a4e-c7d8-411e-1f92-fda29b5b2fb9@kernel.org/ [1]: https://lore.kernel.org/netdev/20221221151258.25748-2-kuniyu@amazon.com/ Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address") Reported-by: Jiri Slaby <jirislaby@kernel.org> Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8032bf12 |
|
09-Oct-2022 |
Jason A. Donenfeld <Jason@zx2c4.com> |
treewide: use get_random_u32_below() instead of deprecated function This is a simple mechanical transformation done by: @@ expression E; @@ - prandom_u32_max + get_random_u32_below (E) Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Reviewed-by: SeongJae Park <sj@kernel.org> # for damon Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
#
e0833d1f |
|
18-Nov-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
dccp/tcp: Fixup bhash2 bucket when connect() fails. If a socket bound to a wildcard address fails to connect(), we only reset saddr and keep the port. Then, we have to fix up the bhash2 bucket; otherwise, the bucket has an inconsistent address in the list. Also, listen() for such a socket will fire the WARN_ON() in inet_csk_get_port(). [0] Note that when a system runs out of memory, we give up fixing the bucket and unlink sk from bhash and bhash2 by inet_put_port(). [0]: WARNING: CPU: 0 PID: 207 at net/ipv4/inet_connection_sock.c:548 inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1)) Modules linked in: CPU: 0 PID: 207 Comm: bhash2_prev_rep Not tainted 6.1.0-rc3-00799-gc8421681c845 #63 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.amzn2022.0.1 04/01/2014 RIP: 0010:inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1)) Code: 74 a7 eb 93 48 8b 54 24 18 0f b7 cb 4c 89 e6 4c 89 ff e8 48 b2 ff ff 49 8b 87 18 04 00 00 e9 32 ff ff ff 0f 0b e9 34 ff ff ff <0f> 0b e9 42 ff ff ff 41 8b 7f 50 41 8b 4f 54 89 fe 81 f6 00 00 ff RSP: 0018:ffffc900003d7e50 EFLAGS: 00010202 RAX: ffff8881047fb500 RBX: 0000000000004e20 RCX: 0000000000000000 RDX: 000000000000000a RSI: 00000000fffffe00 RDI: 00000000ffffffff RBP: ffffffff8324dc00 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000001 R14: 0000000000004e20 R15: ffff8881054e1280 FS: 00007f8ac04dc740(0000) GS:ffff88842fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020001540 CR3: 00000001055fa003 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> inet_csk_listen_start (net/ipv4/inet_connection_sock.c:1205) inet_listen (net/ipv4/af_inet.c:228) __sys_listen (net/socket.c:1810) __x64_sys_listen (net/socket.c:1819 net/socket.c:1817 net/socket.c:1817) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) RIP: 0033:0x7f8ac051de5d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48 RSP: 002b:00007ffc1c177248 EFLAGS: 00000206 ORIG_RAX: 0000000000000032 RAX: ffffffffffffffda RBX: 0000000020001550 RCX: 00007f8ac051de5d RDX: ffffffffffffff80 RSI: 0000000000000000 RDI: 0000000000000004 RBP: 00007ffc1c177270 R08: 0000000000000018 R09: 0000000000000007 R10: 0000000020001540 R11: 0000000000000206 R12: 00007ffc1c177388 R13: 0000000000401169 R14: 0000000000403e18 R15: 00007f8ac0723000 </TASK> Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address") Reported-by: syzbot <syzkaller@googlegroups.com> Reported-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
8c5dae4c |
|
18-Nov-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
dccp/tcp: Update saddr under bhash's lock. When we call connect() for a socket bound to a wildcard address, we update saddr locklessly. However, it could result in a data race; another thread iterating over bhash might see a corrupted address. Let's update saddr under the bhash bucket's lock. Fixes: 3df80d9320bc ("[DCCP]: Introduce DCCPv6") Fixes: 7c657876b63c ("[DCCP]: Initial implementation") Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
8acdad37 |
|
18-Nov-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
dccp/tcp: Remove NULL check for prev_saddr in inet_bhash2_update_saddr(). When we call inet_bhash2_update_saddr(), prev_saddr is always non-NULL. Let's remove the unnecessary test. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
aeac4ec8 |
|
14-Nov-2022 |
Gleb Mazovetskiy <glex.spb@gmail.com> |
tcp: configurable source port perturb table size On embedded systems with little memory and no relevant security concerns, it is beneficial to reduce the size of the table. Reducing the size from 2^16 to 2^8 saves 255 KiB of kernel RAM. Makes the table size configurable as an expert option. The size was previously increased from 2^8 to 2^16 in commit 4c2c8f03a5ab ("tcp: increase source port perturb table to 2^16"). Signed-off-by: Gleb Mazovetskiy <glex.spb@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
81895a65 |
|
05-Oct-2022 |
Jason A. Donenfeld <Jason@zx2c4.com> |
treewide: use prandom_u32_max() when possible, part 1 Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes the minimum required bytes from the RNG and avoids divisions. This was done mechanically with this coccinelle script: @basic@ expression E; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u64; @@ ( - ((T)get_random_u32() % (E)) + prandom_u32_max(E) | - ((T)get_random_u32() & ((E) - 1)) + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2) | - ((u64)(E) * get_random_u32() >> 32) + prandom_u32_max(E) | - ((T)get_random_u32() & ~PAGE_MASK) + prandom_u32_max(PAGE_SIZE) ) @multi_line@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; identifier RAND; expression E; @@ - RAND = get_random_u32(); ... when != RAND - RAND %= (E); + RAND = prandom_u32_max(E); // Find a potential literal @literal_mask@ expression LITERAL; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; position p; @@ ((T)get_random_u32()@p & (LITERAL)) // Add one to the literal. @script:python add_one@ literal << literal_mask.LITERAL; RESULT; @@ value = None if literal.startswith('0x'): value = int(literal, 16) elif literal[0] in '123456789': value = int(literal, 10) if value is None: print("I don't know how to handle %s" % (literal)) cocci.include_match(False) elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1: print("Skipping 0x%x for cleanup elsewhere" % (value)) cocci.include_match(False) elif value & (value + 1) != 0: print("Skipping 0x%x because it's not a power of two minus one" % (value)) cocci.include_match(False) elif literal.startswith('0x'): coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1)) else: coccinelle.RESULT = cocci.make_expr("%d" % (value + 1)) // Replace the literal mask with the calculated result. @plus_one@ expression literal_mask.LITERAL; position literal_mask.p; expression add_one.RESULT; identifier FUNC; @@ - (FUNC()@p & (LITERAL)) + prandom_u32_max(RESULT) @collapse_ret@ type T; identifier VAR; expression E; @@ { - T VAR; - VAR = (E); - return VAR; + return E; } @drop_var@ type T; identifier VAR; @@ { - T VAR; ... when != VAR } Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: KP Singh <kpsingh@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
#
2a4187f4 |
|
03-Oct-2022 |
Jason A. Donenfeld <Jason@zx2c4.com> |
once: rename _SLOW to _SLEEPABLE The _SLOW designation wasn't really descriptive of anything. This is meant to be called from process context when it's possible to sleep. So name this more aptly _SLEEPABLE, which better fits its intended use. Fixes: 62c07983bef9 ("once: add DO_ONCE_SLOW() for sleepable contexts") Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20221003181413.1221968-1-Jason@zx2c4.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
62c07983 |
|
01-Oct-2022 |
Eric Dumazet <edumazet@google.com> |
once: add DO_ONCE_SLOW() for sleepable contexts Christophe Leroy reported a ~80ms latency spike happening at first TCP connect() time. This is because __inet_hash_connect() uses get_random_once() to populate a perturbation table which became quite big after commit 4c2c8f03a5ab ("tcp: increase source port perturb table to 2^16") get_random_once() uses DO_ONCE(), which block hard irqs for the duration of the operation. This patch adds DO_ONCE_SLOW() which uses a mutex instead of a spinlock for operations where we prefer to stay in process context. Then __inet_hash_connect() can use get_random_slow_once() to populate its perturbation table. Fixes: 4c2c8f03a5ab ("tcp: increase source port perturb table to 2^16") Fixes: 190cc82489f4 ("tcp: change source port randomizarion at connect() time") Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://lore.kernel.org/netdev/CANn89iLAEYBaoYajy0Y9UmGFff5GPxDUoG-ErVB2jDdRNQ5Tug@mail.gmail.com/T/#t Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willy Tarreau <w@1wt.eu> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5456262d |
|
26-Sep-2022 |
Martin KaFai Lau <martin.lau@kernel.org> |
net: Fix incorrect address comparison when searching for a bind2 bucket The v6_rcv_saddr and rcv_saddr are inside a union in the 'struct inet_bind2_bucket'. When searching a bucket by following the bhash2 hashtable chain, eg. inet_bind2_bucket_match, it is only using the sk->sk_family and there is no way to check if the inet_bind2_bucket has a v6 or v4 address in the union. This leads to an uninit-value KMSAN report in [0] and also potentially incorrect matches. This patch fixes it by adding a family member to the inet_bind2_bucket and then tests 'sk->sk_family != tb->family' before matching the sk's address to the tb's address. Cc: Joanne Koong <joannelkoong@gmail.com> Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address") Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/r/20220927002544.3381205-1-kafai@fb.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
d1e5e640 |
|
07-Sep-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Introduce optional per-netns ehash. The more sockets we have in the hash table, the longer we spend looking up the socket. While running a number of small workloads on the same host, they penalise each other and cause performance degradation. The root cause might be a single workload that consumes much more resources than the others. It often happens on a cloud service where different workloads share the same computing resource. On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash entries), after running iperf3 in different netns, creating 24Mi sockets without data transfer in the root netns causes about 10% performance regression for the iperf3's connection. thash_entries sockets length Gbps 524288 1 1 50.7 24Mi 48 45.1 It is basically related to the length of the list of each hash bucket. For testing purposes to see how performance drops along the length, I set 131072 (1Mi / 8) to thash_entries, and here's the result. thash_entries sockets length Gbps 131072 1 1 50.7 1Mi 8 49.9 2Mi 16 48.9 4Mi 32 47.3 8Mi 64 44.6 16Mi 128 40.6 24Mi 192 36.3 32Mi 256 32.5 40Mi 320 27.0 48Mi 384 25.0 To resolve the socket lookup degradation, we introduce an optional per-netns hash table for TCP, but it's just ehash, and we still share the global bhash, bhash2 and lhash2. With a smaller ehash, we can look up non-listener sockets faster and isolate such noisy neighbours. In addition, we can reduce lock contention. We can control the ehash size by a new sysctl knob. However, depending on workloads, it will require very sensitive tuning, so we disable the feature by default (net.ipv4.tcp_child_ehash_entries == 0). Moreover, we can fall back to using the global ehash in case we fail to allocate enough memory for a new ehash. The maximum size is 16Mi, which is large enough that even if we have 48Mi sockets, the average list length is 3, and regression would be less than 1%. We can check the current ehash size by another read-only sysctl knob, net.ipv4.tcp_ehash_entries. A negative value means the netns shares the global ehash (per-netns ehash is disabled or failed to allocate memory). # dmesg | cut -d ' ' -f 5- | grep "established hash" TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage) # sysctl net.ipv4.tcp_ehash_entries net.ipv4.tcp_ehash_entries = 524288 # can be changed by thash_entries # sysctl net.ipv4.tcp_child_ehash_entries net.ipv4.tcp_child_ehash_entries = 0 # disabled by default # ip netns add test1 # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries net.ipv4.tcp_ehash_entries = -524288 # share the global ehash # sysctl -w net.ipv4.tcp_child_ehash_entries=100 net.ipv4.tcp_child_ehash_entries = 100 # ip netns add test2 # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries net.ipv4.tcp_ehash_entries = 128 # own a per-netns ehash with 2^n buckets When more than two processes in the same netns create per-netns ehash concurrently with different sizes, we need to guarantee the size in one of the following ways: 1) Share the global ehash and create per-netns ehash First, unshare() with tcp_child_ehash_entries==0. It creates dedicated netns sysctl knobs where we can safely change tcp_child_ehash_entries and clone()/unshare() to create a per-netns ehash. 2) Control write on sysctl by BPF We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on sysctl knobs. Note that the global ehash allocated at the boot time is spread over available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate pages for each per-netns ehash depending on the current process's NUMA policy. By default, the allocation is done in the local node only, so the per-netns hash table could fully reside on a random node. Thus, depending on the NUMA policy the netns is created with and the CPU the current thread is running on, we could see some performance differences for highly optimised networking applications. Note also that the default values of two sysctl knobs depend on the ehash size and should be tuned carefully: tcp_max_tw_buckets : tcp_child_ehash_entries / 2 tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128) As a bonus, we can dismantle netns faster. Currently, while destroying netns, we call inet_twsk_purge(), which walks through the global ehash. It can be potentially big because it can have many sockets other than TIME_WAIT in all netns. Splitting ehash changes that situation, where it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets in each netns. With regard to this, we do not free the per-netns ehash in inet_twsk_kill() to avoid UAF while iterating the per-netns ehash in inet_twsk_purge(). Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to keep it protocol-family-independent. In the future, we could optimise ehash lookup/iteration further by removing netns comparison for the per-netns ehash. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
4461568a |
|
07-Sep-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Access &tcp_hashinfo via net. We will soon introduce an optional per-netns ehash. This means we cannot use tcp_hashinfo directly in most places. Instead, access it via net->ipv4.tcp_death_row.hashinfo. The access will be valid only while initialising tcp_hashinfo itself and creating/destroying each netns. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
429e42c1 |
|
07-Sep-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Set NULL to sk->sk_prot->h.hashinfo. We will soon introduce an optional per-netns ehash. This means we cannot use the global sk->sk_prot->h.hashinfo to fetch a TCP hashinfo. Instead, set NULL to sk->sk_prot->h.hashinfo for TCP and get a proper hashinfo from net->ipv4.tcp_death_row.hashinfo. Note that we need not use sk->sk_prot->h.hashinfo if DCCP is disabled. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
08eaef90 |
|
07-Sep-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Clean up some functions. This patch adds no functional change and cleans up some functions that the following patches touch around so that we make them tidy and easy to review/revert. The changes are - Keep reverse christmas tree order - Remove unnecessary init of port in inet_csk_find_open_port() - Use req_to_sk() once in reqsk_queue_unlink() - Use sock_net(sk) once in tcp_time_wait() and tcp_v[46]_connect() Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
28044fc1 |
|
22-Aug-2022 |
Joanne Koong <joannelkoong@gmail.com> |
net: Add a bhash2 table hashed by port and address The current bind hashtable (bhash) is hashed by port only. In the socket bind path, we have to check for bind conflicts by traversing the specified port's inet_bind_bucket while holding the hashbucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()). In instances where there are tons of sockets hashed to the same port at different addresses, the bind conflict check is time-intensive and can cause softirq cpu lockups, as well as stops new tcp connections since __inet_inherit_port() also contests for the spinlock. This patch adds a second bind table, bhash2, that hashes by port and sk->sk_rcv_saddr (ipv4) and sk->sk_v6_rcv_saddr (ipv6). Searching the bhash2 table leads to significantly faster conflict resolution and less time holding the hashbucket spinlock. Please note a few things: * There can be the case where the a socket's address changes after it has been bound. There are two cases where this happens: 1) The case where there is a bind() call on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6) and then a connect() call. The kernel will assign the socket an address when it handles the connect() 2) In inet_sk_reselect_saddr(), which is called when rebuilding the sk header and a few pre-conditions are met (eg rerouting fails). In these two cases, we need to update the bhash2 table by removing the entry for the old address, and add a new entry reflecting the updated address. * The bhash2 table must have its own lock, even though concurrent accesses on the same port are protected by the bhash lock. Bhash2 must have its own lock to protect against cases where sockets on different ports hash to different bhash hashbuckets but to the same bhash2 hashbucket. This brings up a few stipulations: 1) When acquiring both the bhash and the bhash2 lock, the bhash2 lock will always be acquired after the bhash lock and released before the bhash lock is released. 2) There are no nested bhash2 hashbucket locks. A bhash2 lock is always acquired+released before another bhash2 lock is acquired+released. * The bhash table cannot be superseded by the bhash2 table because for bind requests on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6), every socket bound to that port must be checked for a potential conflict. The bhash table is the only source of port->socket associations. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
593d1ebe |
|
15-Jun-2022 |
Joanne Koong <joannelkoong@gmail.com> |
Revert "net: Add a second bind table hashed by port and address" This reverts: commit d5a42de8bdbe ("net: Add a second bind table hashed by port and address") commit 538aaf9b2383 ("selftests: Add test for timing a bind request to a port with a populated bhash entry") Link: https://lore.kernel.org/netdev/20220520001834.2247810-1-kuba@kernel.org/ There are a few things that need to be fixed here: * Updating bhash2 in cases where the socket's rcv saddr changes * Adding bhash2 hashbucket locks Links to syzbot reports: https://lore.kernel.org/netdev/00000000000022208805e0df247a@google.com/ https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/ Fixes: d5a42de8bdbe ("net: Add a second bind table hashed by port and address") Reported-by: syzbot+015d756bbd1f8b5c8f09@syzkaller.appspotmail.com Reported-by: syzbot+98fd2d1422063b0f8c44@syzkaller.appspotmail.com Reported-by: syzbot+0a847a982613c6438fba@syzkaller.appspotmail.com Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://lore.kernel.org/r/20220615193213.2419568-1-joannelkoong@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
e67b72b9 |
|
07-Jun-2022 |
Muchun Song <songmuchun@bytedance.com> |
tcp: use alloc_large_system_hash() to allocate table_perturb In our server, there may be no high order (>= 6) memory since we reserve lots of HugeTLB pages when booting. Then the system panic. So use alloc_large_system_hash() to allocate table_perturb. Fixes: e9261476184b ("tcp: dynamically allocate the perturb table used by source ports") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220607070214.94443-1-songmuchun@bytedance.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
d5a42de8 |
|
19-May-2022 |
Joanne Koong <joannelkoong@gmail.com> |
net: Add a second bind table hashed by port and address We currently have one tcp bind table (bhash) which hashes by port number only. In the socket bind path, we check for bind conflicts by traversing the specified port's inet_bind2_bucket while holding the bucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()). In instances where there are tons of sockets hashed to the same port at different addresses, checking for a bind conflict is time-intensive and can cause softirq cpu lockups, as well as stops new tcp connections since __inet_inherit_port() also contests for the spinlock. This patch proposes adding a second bind table, bhash2, that hashes by port and ip address. Searching the bhash2 table leads to significantly faster conflict resolution and less time holding the spinlock. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
eda090c3 |
|
13-May-2022 |
Eric Dumazet <edumazet@google.com> |
inet: rename INET_MATCH() This is no longer a macro, but an inlined function. INET_MATCH() -> inet_match() Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Olivier Hartkopp <socketcan@hartkopp.net> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5d368f03 |
|
13-May-2022 |
Eric Dumazet <edumazet@google.com> |
ipv6: add READ_ONCE(sk->sk_bound_dev_if) in INET6_MATCH() INET6_MATCH() runs without holding a lock on the socket. We probably need to annotate most reads. This patch makes INET6_MATCH() an inline function to ease our changes. v2: inline function only defined if IS_ENABLED(CONFIG_IPV6) Change the name to inet6_match(), this is no longer a macro. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4915d50e |
|
12-May-2022 |
Eric Dumazet <edumazet@google.com> |
inet: add READ_ONCE(sk->sk_bound_dev_if) in INET_MATCH() INET_MATCH() runs without holding a lock on the socket. We probably need to annotate most reads. This patch makes INET_MATCH() an inline function to ease our changes. v2: We remove the 32bit version of it, as modern compilers should generate the same code really, no need to try to be smarter. Also make 'struct net *net' the first argument. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cae3873c |
|
11-May-2022 |
Martin KaFai Lau <kafai@fb.com> |
net: inet: Retire port only listening_hash The listen sk is currently stored in two hash tables, listening_hash (hashed by port) and lhash2 (hashed by port and address). After commit 0ee58dad5b06 ("net: tcp6: prefer listeners bound to an address") and commit d9fbc7f6431f ("net: tcp: prefer listeners bound to an address"), the TCP-SYN lookup fast path does not use listening_hash. The commit 05c0b35709c5 ("tcp: seq_file: Replace listening_hash with lhash2") also moved the seq_file (/proc/net/tcp) iteration usage from listening_hash to lhash2. There are still a few listening_hash usages left. One of them is inet_reuseport_add_sock() which uses the listening_hash to search a listen sk during the listen() system call. This turns out to be very slow on use cases that listen on many different VIPs at a popular port (e.g. 443). [ On top of the slowness in adding to the tail in the IPv6 case ]. The latter patch has a selftest to demonstrate this case. This patch takes this chance to move all remaining listening_hash usages to lhash2 and then retire listening_hash. Since most changes need to be done together, it is hard to cut the listening_hash to lhash2 switch into small patches. The changes in this patch is highlighted here for the review purpose. 1. Because of the listening_hash removal, lhash2 can use the sk->sk_nulls_node instead of the icsk->icsk_listen_portaddr_node. This will also keep the sk_unhashed() check to work as is after stop adding sk to listening_hash. The union is removed from inet_listen_hashbucket because only nulls_head is needed. 2. icsk->icsk_listen_portaddr_node and its helpers are removed. 3. The current lhash2 users needs to iterate with sk_nulls_node instead of icsk_listen_portaddr_node. One case is in the inet[6]_lhash2_lookup(). Another case is the seq_file iterator in tcp_ipv4.c. One thing to note is sk_nulls_next() is needed because the old inet_lhash2_for_each_icsk_continue() does a "next" first before iterating. 4. Move the remaining listening_hash usage to lhash2 inet_reuseport_add_sock() which this series is trying to improve. inet_diag.c and mptcp_diag.c are the final two remaining use cases and is moved to lhash2 now also. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
e8d00590 |
|
11-May-2022 |
Martin KaFai Lau <kafai@fb.com> |
net: inet: Open code inet_hash2 and inet_unhash2 This patch folds lhash2 related functions into __inet_hash and inet_unhash. This will make the removal of the listening_hash in a latter patch easier to review. First, this patch folds inet_hash2 into __inet_hash. For unhash, the current call sequence is like inet_unhash() => __inet_unhash() => inet_unhash2(). The specific testing cases in __inet_unhash() are mostly related to TCP_LISTEN sk and its caller inet_unhash() already has the TCP_LISTEN test, so this patch folds both __inet_unhash() and inet_unhash2() into inet_unhash(). Note that all listening_hash users also have lhash2 initialized, so the !h->lhash2 check is no longer needed. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
8ea1eebb |
|
11-May-2022 |
Martin KaFai Lau <kafai@fb.com> |
net: inet: Remove count from inet_listen_hashbucket After commit 0ee58dad5b06 ("net: tcp6: prefer listeners bound to an address") and commit d9fbc7f6431f ("net: tcp: prefer listeners bound to an address"), the count is no longer used. This patch removes it. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
e8161345 |
|
02-May-2022 |
Willy Tarreau <w@1wt.eu> |
tcp: drop the hash_32() part from the index calculation In commit 190cc82489f4 ("tcp: change source port randomizarion at connect() time"), the table_perturb[] array was introduced and an index was taken from the port_offset via hash_32(). But it turns out that hash_32() performs a multiplication while the input here comes from the output of SipHash in secure_seq, that is well distributed enough to avoid the need for yet another hash. Suggested-by: Amit Klein <aksecurity@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
4c2c8f03 |
|
02-May-2022 |
Willy Tarreau <w@1wt.eu> |
tcp: increase source port perturb table to 2^16 Moshe Kol, Amit Klein, and Yossi Gilad reported being able to accurately identify a client by forcing it to emit only 40 times more connections than there are entries in the table_perturb[] table. The previous two improvements consisting in resalting the secret every 10s and adding randomness to each port selection only slightly improved the situation, and the current value of 2^8 was too small as it's not very difficult to make a client emit 10k connections in less than 10 seconds. Thus we're increasing the perturb table from 2^8 to 2^16 so that the same precision now requires 2.6M connections, which is more difficult in this time frame and harder to hide as a background activity. The impact is that the table now uses 256 kB instead of 1 kB, which could mostly affect devices making frequent outgoing connections. However such components usually target a small set of destinations (load balancers, database clients, perf assessment tools), and in practice only a few entries will be visited, like before. A live test at 1 million connections per second showed no performance difference from the previous value. Reported-by: Moshe Kol <moshe.kol@mail.huji.ac.il> Reported-by: Yossi Gilad <yossi.gilad@mail.huji.ac.il> Reported-by: Amit Klein <aksecurity@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
e9261476 |
|
02-May-2022 |
Willy Tarreau <w@1wt.eu> |
tcp: dynamically allocate the perturb table used by source ports We'll need to further increase the size of this table and it's likely that at some point its size will not be suitable anymore for a static table. Let's allocate it on boot from inet_hashinfo2_init(), which is called from tcp_init(). Cc: Moshe Kol <moshe.kol@mail.huji.ac.il> Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il> Cc: Amit Klein <aksecurity@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
ca7af040 |
|
02-May-2022 |
Willy Tarreau <w@1wt.eu> |
tcp: add small random increments to the source port Here we're randomly adding between 0 and 7 random increments to the selected source port in order to add some noise in the source port selection that will make the next port less predictable. With the default port range of 32768-60999 this means a worst case reuse scenario of 14116/8=1764 connections between two consecutive uses of the same port, with an average of 14116/4.5=3137. This code was stressed at more than 800000 connections per second to a fixed target with all connections closed by the client using RSTs (worst condition) and only 2 connections failed among 13 billion, despite the hash being reseeded every 10 seconds, indicating a perfectly safe situation. Cc: Moshe Kol <moshe.kol@mail.huji.ac.il> Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il> Cc: Amit Klein <aksecurity@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
9e9b70ae |
|
02-May-2022 |
Willy Tarreau <w@1wt.eu> |
tcp: use different parts of the port_offset for index and offset Amit Klein suggests that we use different parts of port_offset for the table's index and the port offset so that there is no direct relation between them. Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Moshe Kol <moshe.kol@mail.huji.ac.il> Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il> Cc: Amit Klein <aksecurity@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
b2d05756 |
|
02-May-2022 |
Willy Tarreau <w@1wt.eu> |
secure_seq: use the 64 bits of the siphash for port offset calculation SipHash replaced MD5 in secure_ipv{4,6}_port_ephemeral() via commit 7cd23e5300c1 ("secure_seq: use SipHash in place of MD5"), but the output remained truncated to 32-bit only. In order to exploit more bits from the hash, let's make the functions return the full 64-bit of siphash_3u32(). We also make sure the port offset calculation in __inet_hash_connect() remains done on 32-bit to avoid the need for div_u64_rem() and an extra cost on 32-bit systems. Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Moshe Kol <moshe.kol@mail.huji.ac.il> Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il> Cc: Amit Klein <aksecurity@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
4f9bf2a2 |
|
09-Feb-2022 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
tcp: Don't acquire inet_listen_hashbucket::lock with disabled BH. Commit 9652dc2eb9e40 ("tcp: relax listening_hash operations") removed the need to disable bottom half while acquiring listening_hash.lock. There are still two callers left which disable bottom half before the lock is acquired. On PREEMPT_RT the softirqs are preemptible and local_bh_disable() acts as a lock to ensure that resources, that are protected by disabling bottom halves, remain protected. This leads to a circular locking dependency if the lock acquired with disabled bottom halves is also acquired with enabled bottom halves followed by disabling bottom halves. This is the reverse locking order. It has been observed with inet_listen_hashbucket::lock: local_bh_disable() + spin_lock(&ilb->lock): inet_listen() inet_csk_listen_start() sk->sk_prot->hash() := inet_hash() local_bh_disable() __inet_hash() spin_lock(&ilb->lock); acquire(&ilb->lock); Reverse order: spin_lock(&ilb2->lock) + local_bh_disable(): tcp_seq_next() listening_get_next() spin_lock(&ilb2->lock); acquire(&ilb2->lock); tcp4_seq_show() get_tcp4_sock() sock_i_ino() read_lock_bh(&sk->sk_callback_lock); acquire(softirq_ctrl) // <---- whoops acquire(&sk->sk_callback_lock) Drop local_bh_disable() around __inet_hash() which acquires listening_hash->lock. Split inet_unhash() and acquire the listen_hashbucket lock without disabling bottom halves; the inet_ehash lock with disabled bottom halves. Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/12d6f9879a97cd56c09fb53dee343cbb14f7f1f7.camel@gmx.de Link: https://lkml.kernel.org/r/X9CheYjuXWc75Spa@hirez.programming.kicks-ass.net Link: https://lore.kernel.org/r/YgQOebeZ10eNx1W6@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
f8931565 |
|
10-Nov-2021 |
Mark Pashmfouroush <markpash@cloudflare.com> |
bpf: Add ingress_ifindex to bpf_sk_lookup It may be helpful to have access to the ifindex during bpf socket lookup. An example may be to scope certain socket lookup logic to specific interfaces, i.e. an interface may be made exempt from custom lookup code. Add the ifindex of the arriving connection to the bpf_sk_lookup API. Signed-off-by: Mark Pashmfouroush <markpash@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211110111016.5670-2-markpash@cloudflare.com
|
#
19757ceb |
|
14-Oct-2021 |
Eric Dumazet <edumazet@google.com> |
tcp: switch orphan_count to bare per-cpu counters Use of percpu_counter structure to track count of orphaned sockets is causing problems on modern hosts with 256 cpus or more. Stefan Bach reported a serious spinlock contention in real workloads, that I was able to reproduce with a netfilter rule dropping incoming FIN packets. 53.56% server [kernel.kallsyms] [k] queued_spin_lock_slowpath | ---queued_spin_lock_slowpath | --53.51%--_raw_spin_lock_irqsave | --53.51%--__percpu_counter_sum tcp_check_oom | |--39.03%--__tcp_close | tcp_close | inet_release | inet6_release | sock_close | __fput | ____fput | task_work_run | exit_to_usermode_loop | do_syscall_64 | entry_SYSCALL_64_after_hwframe | __GI___libc_close | --14.48%--tcp_out_of_resources tcp_write_timeout tcp_retransmit_timer tcp_write_timer_handler tcp_write_timer call_timer_fn expire_timers __run_timers run_timer_softirq __softirqentry_text_start As explained in commit cf86a086a180 ("net/dst: use a smaller percpu_counter batch for dst entries accounting"), default batch size is too big for the default value of tcp_max_orphans (262144). But even if we reduce batch sizes, there would still be cases where the estimated count of orphans is beyond the limit, and where tcp_too_many_orphans() has to call the expensive percpu_counter_sum_positive(). One solution is to use plain per-cpu counters, and have a timer to periodically refresh this cache. Updating this cache every 100ms seems about right, tcp pressure state is not radically changing over shorter periods. percpu_counter was nice 15 years ago while hosts had less than 16 cpus, not anymore by current standards. v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m, reported by kernel test robot <lkp@intel.com> Remove unused socket argument from tcp_too_many_orphans() Fixes: dd24c00191d5 ("net: Use a percpu_counter for orphan_count") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Stefan Bach <sfb@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8d6c414c |
|
05-Oct-2021 |
Mike Manning <mvrmanning@gmail.com> |
net: prefer socket bound to interface when not in VRF The commit 6da5b0f027a8 ("net: ensure unbound datagram socket to be chosen when not in a VRF") modified compute_score() so that a device match is always made, not just in the case of an l3mdev skb, then increments the score also for unbound sockets. This ensures that sockets bound to an l3mdev are never selected when not in a VRF. But as unbound and bound sockets are now scored equally, this results in the last opened socket being selected if there are matches in the default VRF for an unbound socket and a socket bound to a dev that is not an l3mdev. However, handling prior to this commit was to always select the bound socket in this case. Reinstate this handling by incrementing the score only for bound sockets. The required isolation due to choosing between an unbound socket and a socket bound to an l3mdev remains in place due to the device match always being made. The same approach is taken for compute_score() for stream sockets. Fixes: 6da5b0f027a8 ("net: ensure unbound datagram socket to be chosen when not in a VRF") Fixes: e78190581aff ("net: ensure unbound stream socket to be chosen when not in a VRF") Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/cf0a8523-b362-1edf-ee78-eef63cbbb428@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
333bb73f |
|
12-Jun-2021 |
Kuniyuki Iwashima <kuniyu@amazon.co.jp> |
tcp: Keep TCP_CLOSE sockets in the reuseport group. When we close a listening socket, to migrate its connections to another listener in the same reuseport group, we have to handle two kinds of child sockets. One is that a listening socket has a reference to, and the other is not. The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the accept queue of their listening socket. So we can pop them out and push them into another listener's queue at close() or shutdown() syscalls. On the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the three-way handshake and not in the accept queue. Thus, we cannot access such sockets at close() or shutdown() syscalls. Accordingly, we have to migrate immature sockets after their listening socket has been closed. Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At that time, if we could select a new listener from the same reuseport group, no connection would be aborted. However, we cannot do that because reuseport_detach_sock() sets NULL to sk_reuseport_cb and forbids access to the reuseport group from closed sockets. This patch allows TCP_CLOSE sockets to remain in the reuseport group and access it while any child socket references them. The point is that reuseport_detach_sock() was called twice from inet_unhash() and sk_destruct(). This patch replaces the first reuseport_detach_sock() with reuseport_stop_listen_sock(), which checks if the reuseport group is capable of migration. If capable, it decrements num_socks, moves the socket backwards in socks[] and increments num_closed_socks. When all connections are migrated, sk_destruct() calls reuseport_detach_sock() to remove the socket from socks[], decrement num_closed_socks, and set NULL to sk_reuseport_cb. By this change, closed or shutdowned sockets can keep sk_reuseport_cb. Consequently, calling listen() after shutdown() can cause EADDRINUSE or EBUSY in inet_csk_bind_conflict() or reuseport_add_sock() which expects such sockets not to have the reuseport group. Therefore, this patch also loosens such validation rules so that a socket can listen again if it has a reuseport group with num_closed_socks more than 0. When such sockets listen again, we handle them in reuseport_resurrect(). If there is an existing reuseport group (reuseport_add_sock() path), we move the socket from the old group to the new one and free the old one if necessary. If there is no existing group (reuseport_alloc() path), we allocate a new reuseport group, detach sk from the old one, and free it if necessary, not to break the current shutdown behaviour: - we cannot carry over the eBPF prog of shutdowned sockets - we cannot attach/detach an eBPF prog to/from listening sockets via shutdowned sockets Note that when the number of sockets gets over U16_MAX, we try to detach a closed socket randomly to make room for the new listening socket in reuseport_grow(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-4-kuniyu@amazon.co.jp
|
#
c579bd1b |
|
09-Feb-2021 |
Eric Dumazet <edumazet@google.com> |
tcp: add some entropy in __inet_hash_connect() Even when implementing RFC 6056 3.3.4 (Algorithm 4: Double-Hash Port Selection Algorithm), a patient attacker could still be able to collect enough state from an otherwise idle host. Idea of this patch is to inject some noise, in the cases __inet_hash_connect() found a candidate in the first attempt. This noise should not significantly reduce the collision avoidance, and should be zero if connection table is already well used. Note that this is not implementing RFC 6056 3.3.5 because we think Algorithm 5 could hurt typical workloads. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Dworken <ddworken@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
190cc824 |
|
09-Feb-2021 |
Eric Dumazet <edumazet@google.com> |
tcp: change source port randomizarion at connect() time RFC 6056 (Recommendations for Transport-Protocol Port Randomization) provides good summary of why source selection needs extra care. David Dworken reminded us that linux implements Algorithm 3 as described in RFC 6056 3.3.3 Quoting David : In the context of the web, this creates an interesting info leak where websites can count how many TCP connections a user's computer is establishing over time. For example, this allows a website to count exactly how many subresources a third party website loaded. This also allows: - Distinguishing between different users behind a VPN based on distinct source port ranges. - Tracking users over time across multiple networks. - Covert communication channels between different browsers/browser profiles running on the same computer - Tracking what applications are running on a computer based on the pattern of how fast source ports are getting incremented. Section 3.3.4 describes an enhancement, that reduces attackers ability to use the basic information currently stored into the shared 'u32 hint'. This change also decreases collision rate when multiple applications need to connect() to different destinations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: David Dworken <ddworken@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
01770a16 |
|
20-Nov-2020 |
Ricardo Dias <rdias@singlestore.com> |
tcp: fix race condition when creating child sockets from syncookies When the TCP stack is in SYN flood mode, the server child socket is created from the SYN cookie received in a TCP packet with the ACK flag set. The child socket is created when the server receives the first TCP packet with a valid SYN cookie from the client. Usually, this packet corresponds to the final step of the TCP 3-way handshake, the ACK packet. But is also possible to receive a valid SYN cookie from the first TCP data packet sent by the client, and thus create a child socket from that SYN cookie. Since a client socket is ready to send data as soon as it receives the SYN+ACK packet from the server, the client can send the ACK packet (sent by the TCP stack code), and the first data packet (sent by the userspace program) almost at the same time, and thus the server will equally receive the two TCP packets with valid SYN cookies almost at the same instant. When such event happens, the TCP stack code has a race condition that occurs between the momement a lookup is done to the established connections hashtable to check for the existence of a connection for the same client, and the moment that the child socket is added to the established connections hashtable. As a consequence, this race condition can lead to a situation where we add two child sockets to the established connections hashtable and deliver two sockets to the userspace program to the same client. This patch fixes the race condition by checking if an existing child socket exists for the same client when we are adding the second child socket to the established connections socket. If an existing child socket exists, we drop the packet and discard the second child socket to the same client. Signed-off-by: Ricardo Dias <rdias@singlestore.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
34e1ec31 |
|
31-Aug-2020 |
Miaohe Lin <linmiaohe@huawei.com> |
net: ipv4: remove unused arg exact_dif in compute_score The arg exact_dif is not used anymore, remove it. inet_exact_dif_match() is no longer needed after the above is removed, so remove it too. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d76f3351 |
|
11-Aug-2020 |
Tim Froidcoeur <tim.froidcoeur@tessares.net> |
net: initialize fastreuse on inet_inherit_port In the case of TPROXY, bind_conflict optimizations for SO_REUSEADDR or SO_REUSEPORT are broken, possibly resulting in O(n) instead of O(1) bind behaviour or in the incorrect reuse of a bind. the kernel keeps track for each bind_bucket if all sockets in the bind_bucket support SO_REUSEADDR or SO_REUSEPORT in two fastreuse flags. These flags allow skipping the costly bind_conflict check when possible (meaning when all sockets have the proper SO_REUSE option). For every socket added to a bind_bucket, these flags need to be updated. As soon as a socket that does not support reuse is added, the flag is set to false and will never go back to true, unless the bind_bucket is deleted. Note that there is no mechanism to re-evaluate these flags when a socket is removed (this might make sense when removing a socket that would not allow reuse; this leaves room for a future patch). For this optimization to work, it is mandatory that these flags are properly initialized and updated. When a child socket is created from a listen socket in __inet_inherit_port, the TPROXY case could create a new bind bucket without properly initializing these flags, thus preventing the optimization to work. Alternatively, a socket not allowing reuse could be added to an existing bind bucket without updating the flags, causing bind_conflict to never be called as it should. Call inet_csk_update_fastreuse when __inet_inherit_port decides to create a new bind_bucket or use a different bind_bucket than the one of the listen socket. Fixes: 093d282321da ("tproxy: fix hash locking issue when using port redirection in __inet_inherit_port()") Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Tim Froidcoeur <tim.froidcoeur@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1559b4aa |
|
16-Jul-2020 |
Jakub Sitnicki <jakub@cloudflare.com> |
inet: Run SK_LOOKUP BPF program on socket lookup Run a BPF program before looking up a listening socket on the receive path. Program selects a listening socket to yield as result of socket lookup by calling bpf_sk_assign() helper and returning SK_PASS code. Program can revert its decision by assigning a NULL socket with bpf_sk_assign(). Alternatively, BPF program can also fail the lookup by returning with SK_DROP, or let the lookup continue as usual with SK_PASS on return, when no socket has been selected with bpf_sk_assign(). This lets the user match packets with listening sockets freely at the last possible point on the receive path, where we know that packets are destined for local delivery after undergoing policing, filtering, and routing. With BPF code selecting the socket, directing packets destined to an IP range or to a port range to a single socket becomes possible. In case multiple programs are attached, they are run in series in the order in which they were attached. The end result is determined from return codes of all the programs according to following rules: 1. If any program returned SK_PASS and selected a valid socket, the socket is used as result of socket lookup. 2. If more than one program returned SK_PASS and selected a socket, last selection takes effect. 3. If any program returned SK_DROP, and no program returned SK_PASS and selected a socket, socket lookup fails with -ECONNREFUSED. 4. If all programs returned SK_PASS and none of them selected a socket, socket lookup continues to htable-based lookup. Suggested-by: Marek Majkowski <marek@cloudflare.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200717103536.397595-5-jakub@cloudflare.com
|
#
80b373f7 |
|
16-Jul-2020 |
Jakub Sitnicki <jakub@cloudflare.com> |
inet: Extract helper for selecting socket from reuseport group Prepare for calling into reuseport from __inet_lookup_listener as well. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200717103536.397595-4-jakub@cloudflare.com
|
#
8dbd76e7 |
|
13-Dec-2019 |
Eric Dumazet <edumazet@google.com> |
tcp/dccp: fix possible race __inet_lookup_established() Michal Kubecek and Firo Yang did a very nice analysis of crashes happening in __inet_lookup_established(). Since a TCP socket can go from TCP_ESTABLISH to TCP_LISTEN (via a close()/socket()/listen() cycle) without a RCU grace period, I should not have changed listeners linkage in their hash table. They must use the nulls protocol (Documentation/RCU/rculist_nulls.txt), so that a lookup can detect a socket in a hash list was moved in another one. Since we added code in commit d296ba60d8e2 ("soreuseport: Resolve merge conflict for v4/v6 ordering fix"), we have to add hlist_nulls_add_tail_rcu() helper. Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Michal Kubecek <mkubecek@suse.cz> Reported-by: Firo Yang <firo.yang@suse.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Link: https://lore.kernel.org/netdev/20191120083919.GH27852@unicorn.suse.cz/ Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
|
#
7170a977 |
|
30-Oct-2019 |
Eric Dumazet <edumazet@google.com> |
net: annotate accesses to sk->sk_incoming_cpu This socket field can be read and written by concurrent cpus. Use READ_ONCE() and WRITE_ONCE() annotations to document this, and avoid some compiler 'optimizations'. KCSAN reported : BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0: sk_incoming_cpu_update include/net/sock.h:953 [inline] tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 process_backlog+0x1d3/0x420 net/core/dev.c:5955 napi_poll net/core/dev.c:6392 [inline] net_rx_action+0x3ae/0xa90 net/core/dev.c:6460 __do_softirq+0x115/0x33f kernel/softirq.c:292 do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082 do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337 do_softirq kernel/softirq.c:329 [inline] __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189 read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1: sk_incoming_cpu_update include/net/sock.h:952 [inline] tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 process_backlog+0x1d3/0x420 net/core/dev.c:5955 napi_poll net/core/dev.c:6392 [inline] net_rx_action+0x3ae/0xa90 net/core/dev.c:6460 __do_softirq+0x115/0x33f kernel/softirq.c:292 run_ksoftirqd+0x46/0x60 kernel/softirq.c:603 smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
88e235b8 |
|
05-Jun-2019 |
Enrico Weigelt <info@metux.net> |
net: ipv4: drop unneeded likely() call around IS_ERR() IS_ERR() already calls unlikely(), so this extra unlikely() call around IS_ERR() is not needed. Signed-off-by: Enrico Weigelt <info@metux.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2874c5fd |
|
27-May-2019 |
Thomas Gleixner <tglx@linutronix.de> |
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
c92c81df |
|
24-Dec-2018 |
Peter Oskolkov <posk@google.com> |
net: dccp: fix kernel crash on module load Patch eedbbb0d98b2 "net: dccp: initialize (addr,port) ..." added calling to inet_hashinfo2_init() from dccp_init(). However, inet_hashinfo2_init() is marked as __init(), and thus the kernel panics when dccp is loaded as module. Removing __init() tag from inet_hashinfo2_init() is not feasible because it calls into __init functions in mm. This patch adds inet_hashinfo2_init_mod() function that can be called after the init phase is done; changes dccp_init() to call the new function; un-marks inet_hashinfo2_init() as exported. Fixes: eedbbb0d98b2 ("net: dccp: initialize (addr,port) ...") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
eedbbb0d |
|
16-Dec-2018 |
Peter Oskolkov <posk@google.com> |
net: dccp: initialize (addr,port) listening hashtable Commit d9fbc7f6431f "net: tcp: prefer listeners bound to an address" removes port-only listener lookups. This caused segfaults in DCCP lookups because DCCP did not initialize the (addr,port) hashtable. This patch adds said initialization. The only non-trivial issue here is the size of the new hashtable. It seemed reasonable to make it match the size of the port-only hashtable (= INET_LHTABLE_SIZE) that was used previously. Other parameters to inet_hashinfo2_init() match those used in TCP. V2 changes: marked inet_hashinfo2_init as an exported symbol so that DCCP compiles when configured as a module. Tested: syzcaller issues fixed; the second patch in the patchset tests that DCCP lookups work correctly. Fixes: d9fbc7f6431f "net: tcp: prefer listeners bound to an address" Reported-by: syzcaller <syzkaller@googlegroups.com> Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d9fbc7f6 |
|
12-Dec-2018 |
Peter Oskolkov <posk@google.com> |
net: tcp: prefer listeners bound to an address A relatively common use case is to have several IPs configured on a host, and have different listeners for each of them. We would like to add a "catch all" listener on addr_any, to match incoming connections not served by any of the listeners bound to a specific address. However, port-only lookups can match addr_any sockets when sockets listening on specific addresses are present if so_reuseport flag is set. This patch eliminates lookups into port-only hashtable, as lookups by (addr,port) tuple are easily available. In addition, compute_score() is tweaked to _not_ match addr_any sockets to specific addresses, as hash collisions could result in the unwanted behavior described above. Tested: the patch compiles; full test in the last patch in this patchset. Existing reuseport_* selftests also pass. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e7819058 |
|
07-Nov-2018 |
Mike Manning <mmanning@vyatta.att-mail.com> |
net: ensure unbound stream socket to be chosen when not in a VRF The commit a04a480d4392 ("net: Require exact match for TCP socket lookups if dif is l3mdev") only ensures that the correct socket is selected for packets in a VRF. However, there is no guarantee that the unbound socket will be selected for packets when not in a VRF. By checking for a device match in compute_score() also for the case when there is no bound device and attaching a score to this, the unbound socket is selected. And if a failure is returned when there is no device match, this ensures that bound sockets are never selected, even if there is no unbound socket. Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Tested-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3c82a21f |
|
07-Nov-2018 |
Robert Shearman <rshearma@vyatta.att-mail.com> |
net: allow binding socket in a VRF when there's an unbound socket Change the inet socket lookup to avoid packets arriving on a device enslaved to an l3mdev from matching unbound sockets by removing the wildcard for non sk_bound_dev_if and instead relying on check against the secondary device index, which will be 0 when the input device is not enslaved to an l3mdev and so match against an unbound socket and not match when the input device is enslaved. Change the socket binding to take the l3mdev into account to allow an unbound socket to not conflict sockets bound to an l3mdev given the datapath isolation now guaranteed. Signed-off-by: Robert Shearman <rshearma@vyatta.att-mail.com> Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Tested-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
57c8a661 |
|
30-Oct-2018 |
Mike Rapoport <rppt@linux.vnet.ibm.com> |
mm: remove include/linux/bootmem.h Move remaining definitions and declarations from include/linux/bootmem.h into include/linux/memblock.h and remove the redundant header. The includes were replaced with the semantic patch below and then semi-automated removal of duplicated '#include <linux/memblock.h> @@ @@ - #include <linux/bootmem.h> + #include <linux/memblock.h> [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal] Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8217ca65 |
|
08-Aug-2018 |
Martin KaFai Lau <kafai@fb.com> |
bpf: Enable BPF_PROG_TYPE_SK_REUSEPORT bpf prog in reuseport selection This patch allows a BPF_PROG_TYPE_SK_REUSEPORT bpf prog to select a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY introduced in the earlier patch. "bpf_run_sk_reuseport()" will return -ECONNREFUSED when the BPF_PROG_TYPE_SK_REUSEPORT prog returns SK_DROP. The callers, in inet[6]_hashtable.c and ipv[46]/udp.c, are modified to handle this case and return NULL immediately instead of continuing the sk search from its hashtable. It re-uses the existing SO_ATTACH_REUSEPORT_EBPF setsockopt to attach BPF_PROG_TYPE_SK_REUSEPORT. The "sk_reuseport_attach_bpf()" will check if the attaching bpf prog is in the new SK_REUSEPORT or the existing SOCKET_FILTER type and then check different things accordingly. One level of "__reuseport_attach_prog()" call is removed. The "sk_unhashed() && ..." and "sk->sk_reuseport_cb" tests are pushed back to "reuseport_attach_prog()" in sock_reuseport.c. sock_reuseport.c seems to have more knowledge on those test requirements than filter.c. In "reuseport_attach_prog()", after new_prog is attached to reuse->prog, the old_prog (if any) is also directly freed instead of returning the old_prog to the caller and asking the caller to free. The sysctl_optmem_max check is moved back to the "sk_reuseport_attach_filter()" and "sk_reuseport_attach_bpf()". As of other bpf prog types, the new BPF_PROG_TYPE_SK_REUSEPORT is only bounded by the usual "bpf_prog_charge_memlock()" during load time instead of bounded by both bpf_prog_charge_memlock and sysctl_optmem_max. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
2dbb9b9e |
|
08-Aug-2018 |
Martin KaFai Lau <kafai@fb.com> |
bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY. Like other non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN. BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern" to store the bpf context instead of using the skb->cb[48]. At the SO_REUSEPORT sk lookup time, it is in the middle of transiting from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp). At this point, it is not always clear where the bpf context can be appended in the skb->cb[48] to avoid saving-and-restoring cb[]. Even putting aside the difference between ipv4-vs-ipv6 and udp-vs-tcp. It is not clear if the lower layer is only ipv4 and ipv6 in the future and will it not touch the cb[] again before transiting to the upper layer. For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB instead of IP[6]CB and it may still modify the cb[] after calling the udp[46]_lib_lookup_skb(). Because of the above reason, if sk->cb is used for the bpf ctx, saving-and-restoring is needed and likely the whole 48 bytes cb[] has to be saved and restored. Instead of saving, setting and restoring the cb[], this patch opts to create a new "struct sk_reuseport_kern" and setting the needed values in there. The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)" will serve all ipv4/ipv6 + udp/tcp combinations. There is no protocol specific usage at this point and it is also inline with the current sock_reuseport.c implementation (i.e. no protocol specific requirement). In "struct sk_reuseport_md", this patch exposes data/data_end/len with semantic similar to other existing usages. Together with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()", the bpf prog can peek anywhere in the skb. The "bind_inany" tells the bpf prog that the reuseport group is bind-ed to a local INANY address which cannot be learned from skb. The new "bind_inany" is added to "struct sock_reuseport" which will be used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order to avoid repeating the "bind INANY" test on "sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run. It can only be properly initialized when a "sk->sk_reuseport" enabled sk is adding to a hashtable (i.e. during "reuseport_alloc()" and "reuseport_add_sock()"). The new "sk_select_reuseport()" is the main helper that the bpf prog will use to select a SO_REUSEPORT sk. It is the only function that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY. As mentioned in the earlier patch, the validity of a selected sk is checked in run time in "sk_select_reuseport()". Doing the check in verification time is difficult and inflexible (consider the map-in-map use case). The runtime check is to compare the selected sk's reuseport_id with the reuseport_id that we want. This helper will return -EXXX if the selected sk cannot serve the incoming request (e.g. reuseport_id not match). The bpf prog can decide if it wants to do SK_DROP as its discretion. When the bpf prog returns SK_PASS, the kernel will check if a valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL"). If it does , it will use the selected sk. If not, the kernel will select one from "reuse->socks[]" (as before this patch). The SK_DROP and SK_PASS handling logic will be in the next patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
8c43bd17 |
|
18-Jun-2018 |
David Ahern <dsahern@gmail.com> |
net/tcp: Fix socket lookups with SO_BINDTODEVICE Similar to 69678bcd4d2d ("udp: fix SO_BINDTODEVICE"), TCP socket lookups need to fail if dev_match is not true. Currently, a packet to a given port can match a socket bound to device when it should not. In the VRF case, this causes the lookup to hit a VRF socket and not a global socket resulting in a response trying to go through the VRF when it should not. Fixes: 3fa6f616a7a4d ("net: ipv4: add second dif to inet socket lookups") Fixes: 4297a0ef08572 ("net: ipv6: add second dif to inet6 socket lookups") Reported-by: Lou Berger <lberger@labn.net> Diagnosed-by: Renato Westphal <renato@opensourcerouting.org> Tested-by: Renato Westphal <renato@opensourcerouting.org> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0ba98718 |
|
01-Feb-2018 |
Geert Uytterhoeven <geert@linux-m68k.org> |
inet: Avoid unitialized variable warning in inet_unhash() With gcc-4.1.2: net/ipv4/inet_hashtables.c: In function ‘inet_unhash’: net/ipv4/inet_hashtables.c:628: warning: ‘ilb’ may be used uninitialized in this function While this is a false positive, it can easily be avoided by using the pointer itself as the canary variable. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
563e0bb0 |
|
19-Dec-2017 |
Yafang Shao <laoar.shao@gmail.com> |
net: tracepoint: replace tcp_set_state tracepoint with inet_sock_set_state tracepoint As sk_state is a common field for struct sock, so the state transition tracepoint should not be a TCP specific feature. Currently it traces all AF_INET state transition, so I rename this tracepoint to inet_sock_set_state tracepoint with some minor changes and move it into trace/events/sock.h. We dont need to create a file named trace/events/inet_sock.h for this one single tracepoint. Two helpers are introduced to trace sk_state transition - void inet_sk_state_store(struct sock *sk, int newstate); - void inet_sk_set_state(struct sock *sk, int state); As trace header should not be included in other header files, so they are defined in sock.c. The protocol such as SCTP maybe compiled as a ko, hence export inet_sk_set_state(). Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
61b7c691 |
|
01-Dec-2017 |
Martin KaFai Lau <kafai@fb.com> |
inet: Add a 2nd listener hashtable (port+addr) The current listener hashtable is hashed by port only. When a process is listening at many IP addresses with the same port (e.g. [IP1]:443, [IP2]:443... [IPN]:443), the inet[6]_lookup_listener() performance is degraded to a link list. It is prone to syn attack. UDP had a similar issue and a second hashtable was added to resolve it. This patch adds a second hashtable for the listener's sockets. The second hashtable is hashed by port and address. It cannot reuse the existing skc_portaddr_node which is shared with skc_bind_node. TCP listener needs to use skc_bind_node. Instead, this patch adds a hlist_node 'icsk_listen_portaddr_node' to the inet_connection_sock which the listener (like TCP) also belongs to. The new portaddr hashtable may need two lookup (First by IP:PORT. Second by INADDR_ANY:PORT if the IP:PORT is a not found). Hence, it implements a similar cut off as UDP such that it will only consult the new portaddr hashtable if the current port-only hashtable has >10 sk in the link-list. lhash2 and lhash2_mask are added to 'struct inet_hashinfo'. I take this chance to plug a 4 bytes hole. It is done by first moving the existing bind_bucket_cachep up and then add the new (int lhash2_mask, *lhash2) after the existing bhash_size. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
76d013b2 |
|
01-Dec-2017 |
Martin KaFai Lau <kafai@fb.com> |
inet: Add a count to struct inet_listen_hashbucket This patch adds a count to the 'struct inet_listen_hashbucket'. It counts how many sk is hashed to a bucket. It will be used to decide if the (to-be-added) portaddr listener's hashtable should be used during inet[6]_lookup_listener(). Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e94a62f5 |
|
30-Nov-2017 |
Paolo Abeni <pabeni@redhat.com> |
net/reuseport: drop legacy code Since commit e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection") and commit c125e80b8868 ("soreuseport: fast reuseport TCP socket selection") the relevant reuseport socket matching the current packet is selected by the reuseport_select_sock() call. The only exceptions are invalid BPF filters/filters returning out-of-range indices. In the latter case the code implicitly falls back to using the hash demultiplexing, but instead of selecting the socket inside the reuseport_select_sock() function, it relies on the hash selection logic introduced with the early soreuseport implementation. With this patch, in case of a BPF filter returning a bad socket index value, we fall back to hash-based selection inside the reuseport_select_sock() body, so that we can drop some duplicate code in the ipv4 and ipv6 stack. This also allows faster lookup in the above scenario and will allow us to avoid computing the hash value for successful, BPF based demultiplexing - in a later patch. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1b5f962e |
|
19-Oct-2017 |
Craig Gallek <kraig@google.com> |
soreuseport: fix initialization race Syzkaller stumbled upon a way to trigger WARNING: CPU: 1 PID: 13881 at net/core/sock_reuseport.c:41 reuseport_alloc+0x306/0x3b0 net/core/sock_reuseport.c:39 There are two initialization paths for the sock_reuseport structure in a socket: Through the udp/tcp bind paths of SO_REUSEPORT sockets or through SO_ATTACH_REUSEPORT_[CE]BPF before bind. The existing implementation assumedthat the socket lock protected both of these paths when it actually only protects the SO_ATTACH_REUSEPORT path. Syzkaller triggered this double allocation by running these paths concurrently. This patch moves the check for double allocation into the reuseport_alloc function which is protected by a global spin lock. Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection") Fixes: c125e80b8868 ("soreuseport: fast reuseport TCP socket selection") Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3fa6f616 |
|
07-Aug-2017 |
David Ahern <dsahern@gmail.com> |
net: ipv4: add second dif to inet socket lookups Add a second device index, sdif, to inet socket lookups. sdif is the index for ingress devices enslaved to an l3mdev. It allows the lookups to consider the enslaved device as well as the L3 domain when searching for a socket. TCP moves the data in the cb. Prior to tcp_v4_rcv (e.g., early demux) the ingress index is obtained from IPCB using inet_sdif and after the cb move in tcp_v4_rcv the tcp_v4_sdif helper is used. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
784c372a |
|
03-Jul-2017 |
Eric Dumazet <edumazet@google.com> |
net: make sk_ehashfn() static sk_ehashfn() is only used from a single file. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
41c6d650 |
|
30-Jun-2017 |
Reshetova, Elena <elena.reshetova@intel.com> |
net: convert sock.sk_refcnt from atomic_t to refcount_t refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. This patch uses refcount_inc_not_zero() instead of atomic_inc_not_zero_hint() due to absense of a _hint() version of refcount API. If the hint() version must be used, we might need to revisit API. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
752ade68 |
|
08-May-2017 |
Michal Hocko <mhocko@suse.com> |
treewide: use kv[mz]alloc* rather than opencoded variants There are many code paths opencoding kvmalloc. Let's use the helper instead. The main difference to kvmalloc is that those users are usually not considering all the aspects of the memory allocator. E.g. allocation requests <= 32kB (with 4kB pages) are basically never failing and invoke OOM killer to satisfy the allocation. This sounds too disruptive for something that has a reasonable fallback - the vmalloc. On the other hand those requests might fallback to vmalloc even when the memory allocator would succeed after several more reclaim/compaction attempts previously. There is no guarantee something like that happens though. This patch converts many of those places to kv[mz]alloc* helpers because they are more conservative. Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390 Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim Acked-by: David Sterba <dsterba@suse.com> # btrfs Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4 Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5 Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Santosh Raspatur <santosh@chelsio.com> Cc: Hariprasad S <hariprasad@chelsio.com> Cc: Yishai Hadas <yishaih@mellanox.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: "Yan, Zheng" <zyan@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b9470c27 |
|
17-Jan-2017 |
Josef Bacik <jbacik@fb.com> |
inet: kill smallest_size and smallest_port In inet_csk_get_port we seem to be using smallest_port to figure out where the best place to look for a SO_REUSEPORT sk that matches with an existing set of SO_REUSEPORT's. However if we get to the logic if (smallest_size != -1) { port = smallest_port; goto have_port; } we will do a useless search, because we would have already done the inet_csk_bind_conflict for that port and it would have returned 1, otherwise we would have gone to found_tb and succeeded. Since this logic makes us do yet another trip through inet_csk_bind_conflict for a port we know won't work just delete this code and save us the time. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fe38d2a1 |
|
17-Jan-2017 |
Josef Bacik <jbacik@fb.com> |
inet: collapse ipv4/v6 rcv_saddr_equal functions into one We pass these per-protocol equal functions around in various places, but we can just have one function that checks the sk->sk_family and then do the right comparison function. I've also changed the ipv4 version to not cast to inet_sock since it is unneeded. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a04a480d |
|
16-Oct-2016 |
David Ahern <dsa@cumulusnetworks.com> |
net: Require exact match for TCP socket lookups if dif is l3mdev Currently, socket lookups for l3mdev (vrf) use cases can match a socket that is bound to a port but not a device (ie., a global socket). If the sysctl tcp_l3mdev_accept is not set this leads to ack packets going out based on the main table even though the packet came in from an L3 domain. The end result is that the connection does not establish creating confusion for users since the service is running and a socket shows in ss output. Fix by requiring an exact dif to sk_bound_dev_if match if the skb came through an interface enslaved to an l3mdev device and the tcp_l3mdev_accept is not set. skb's through an l3mdev interface are marked by setting a flag in inet{6}_skb_parm. The IPv6 variant is already set; this patch adds the flag for IPv4. Using an skb flag avoids a device lookup on the dif. The flag is set in the VRF driver using the IP{6}CB macros. For IPv4, the inet_skb_parm struct is moved in the cb per commit 971f10eca186, so the match function in the TCP stack needs to use TCP_SKB_CB. For IPv6, the move is done after the socket lookup, so IP6CB is used. The flags field in inet_skb_parm struct needs to be increased to add another flag. There is currently a 1-byte hole following the flags, so it can be expanded to u16 without increasing the size of the struct. Fixes: 193125dbd8eb ("net: Introduce VRF device driver") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
90e5d0db |
|
28-Apr-2016 |
Craig Gallek <kraig@google.com> |
soreuseport: Fix TCP listener hash collision I forgot to include a check for listener port equality when deciding if two sockets should belong to the same reuseport group. This was not caught previously because it's only necessary when two listening sockets for the same user happen to hash to the same listener bucket. The same error does not exist in the UDP path. Fixes: c125e80b8868("soreuseport: fast reuseport TCP socket selection") Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
02a1d6e7 |
|
27-Apr-2016 |
Eric Dumazet <edumazet@google.com> |
net: rename NET_{ADD|INC}_STATS_BH() Rename NET_INC_STATS_BH() to __NET_INC_STATS() and NET_ADD_STATS_BH() to __NET_ADD_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d296ba60 |
|
25-Apr-2016 |
Craig Gallek <kraig@google.com> |
soreuseport: Resolve merge conflict for v4/v6 ordering fix d894ba18d4e4 ("soreuseport: fix ordering for mixed v4/v6 sockets") was merged as a bug fix to the net tree. Two conflicting changes were committed to net-next before the above fix was merged back to net-next: ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU") 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood") These changes switched the datastructure used for TCP and UDP sockets from hlist_nulls to hlist. This patch applies the necessary parts of the net tree fix to net-next which were not automatic as part of the merge. Fixes: 1602f49b58ab ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net") Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
85017869 |
|
06-Apr-2016 |
Eric Dumazet <edumazet@google.com> |
tcp/dccp: fix inet_reuseport_add_sock() David Ahern reported panics in __inet_hash() caused by my recent commit. The reason is inet_reuseport_add_sock() was still using sk_nulls_for_each_rcu() instead of sk_for_each_rcu(). SO_REUSEPORT enabled listeners were causing an instant crash. While chasing this bug, I found that I forgot to clear SOCK_RCU_FREE flag, as it is inherited from the parent at clone time. Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3b24d854 |
|
01-Apr-2016 |
Eric Dumazet <edumazet@google.com> |
tcp/dccp: do not touch listener sk_refcnt under synflood When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes. By letting listeners use SOCK_RCU_FREE infrastructure, we can relax TCP_LISTEN lookup rules and avoid touching sk_refcnt Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets, only listeners are impacted by this change. Peak performance under SYNFLOOD is increased by ~33% : On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps Most consuming functions are now skb_set_owner_w() and sock_wfree() contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2d331915 |
|
01-Apr-2016 |
Eric Dumazet <edumazet@google.com> |
tcp/dccp: use rcu locking in inet_diag_find_one_icsk() RX packet processing holds rcu_read_lock(), so we can remove pairs of rcu_read_lock()/rcu_read_unlock() in lookup functions if inet_diag also holds rcu before calling them. This is needed anyway as __inet_lookup_listener() and inet6_lookup_listener() will soon no longer increment refcount on the found listener. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1580ab63 |
|
11-Feb-2016 |
Eric Dumazet <edumazet@google.com> |
tcp/dccp: better use of ephemeral ports in connect() In commit 07f4c90062f8 ("tcp/dccp: try to not exhaust ip_local_port_range in connect()"), I added a very simple heuristic, so that we got better chances to use even ports, and allow bind() users to have more available slots. It gave nice results, but with more than 200,000 TCP sessions on a typical server, the ~30,000 ephemeral ports are still a rare resource. I chose to go a step further, by looking at all even ports, and if none was available, fallback to odd ports. The companion patch does the same in bind(), but in opposite way. I've seen exec times of up to 30ms on busy servers, so I no longer disable BH for the whole traversal, but only for each hash bucket. I also call cond_resched() to be gentle to other tasks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c125e80b |
|
10-Feb-2016 |
Craig Gallek <kraig@google.com> |
soreuseport: fast reuseport TCP socket selection This change extends the fast SO_REUSEPORT socket lookup implemented for UDP to TCP. Listener sockets with SO_REUSEPORT and the same receive address are additionally added to an array for faster random access. This means that only a single socket from the group must be found in the listener list before any socket in the group can be used to receive a packet. Previously, every socket in the group needed to be considered before handing off the incoming packet. This feature also exposes the ability to use a BPF program when selecting a socket from a reuseport group. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a583636a |
|
10-Feb-2016 |
Craig Gallek <kraig@google.com> |
inet: refactor inet[6]_lookup functions to take skb This is a preliminary step to allow fast socket lookup of SO_REUSEPORT groups. Doing so with a BPF filter will require access to the skb in question. This change plumbs the skb (and offset to payload data) through the call stack to the listening socket lookup implementations where it will be used in a following patch. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
086c653f |
|
10-Feb-2016 |
Craig Gallek <kraig@google.com> |
sock: struct proto hash function may error In order to support fast reuseport lookups in TCP, the hash function defined in struct proto must be capable of returning an error code. This patch changes the function signature of all related hash functions to return an integer and handles or propagates this return value at all call sites. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5e0724d0 |
|
22-Oct-2015 |
Eric Dumazet <edumazet@google.com> |
tcp/dccp: fix hashdance race for passive sessions Multiple cpus can process duplicates of incoming ACK messages matching a SYN_RECV request socket. This is a rare event under normal operations, but definitely can happen. Only one must win the race, otherwise corruption would occur. To fix this without adding new atomic ops, we use logic in inet_ehash_nolisten() to detect the request was present in the same ehash bucket where we try to insert the new child. If request socket was not found, we have to undo the child creation. This actually removes a spin_lock()/spin_unlock() pair in reqsk_queue_unlink() for the fast path. Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets") Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c2f34a65 |
|
14-Oct-2015 |
Eric Dumazet <edumazet@google.com> |
tcp/dccp: fix potential NULL deref in __inet_inherit_port() As we no longer hold listener lock in fast path, it is possible that a child is created right after listener freed its bound port, if a close() is done while incoming packets are processed. __inet_inherit_port() must detect this and return an error, so that caller can free the child earlier. Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets") Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
70da268b |
|
08-Oct-2015 |
Eric Dumazet <edumazet@google.com> |
net: SO_INCOMING_CPU setsockopt() support SO_INCOMING_CPU as added in commit 2c8c56e15df3 was a getsockopt() command to fetch incoming cpu handling a particular TCP flow after accept() This commits adds setsockopt() support and extends SO_REUSEPORT selection logic : If a TCP listener or UDP socket has this option set, a packet is delivered to this socket only if CPU handling the packet matches the specified one. This allows to build very efficient TCP servers, using one listener per RX queue, as the associated TCP listener should only accept flows handled in softirq by the same cpu. This provides optimal NUMA behavior and keep cpu caches hot. Note that __inet_lookup_listener() still has to iterate over the list of all listeners. Following patch puts sk_refcnt in a different cache line to let this iteration hit only shared and read mostly cache lines. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
079096f1 |
|
02-Oct-2015 |
Eric Dumazet <edumazet@google.com> |
tcp/dccp: install syn_recv requests into ehash table In this patch, we insert request sockets into TCP/DCCP regular ehash table (where ESTABLISHED and TIMEWAIT sockets are) instead of using the per listener hash table. ACK packets find SYN_RECV pseudo sockets without having to find and lock the listener. In nominal conditions, this halves pressure on listener lock. Note that this will allow for SO_REUSEPORT refinements, so that we can select a listener using cpu/numa affinities instead of the prior 'consistent hash', since only SYN packets will apply this selection logic. We will shrink listen_sock in the following patch to ease code review. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ying Cai <ycai@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1ce31c9e |
|
29-Sep-2015 |
Eric Dumazet <edumazet@google.com> |
inet: constify __inet_inherit_port() sock argument socket is not touched, make it const. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
89e478a2 |
|
21-Jul-2015 |
Eric Dumazet <edumazet@google.com> |
tcp: suppress a division by zero warning Andrew Morton reported following warning on one ARM build with gcc-4.4 : net/ipv4/inet_hashtables.c: In function 'inet_ehash_locks_alloc': net/ipv4/inet_hashtables.c:617: warning: division by zero Even guarded with a test on sizeof(spinlock_t), compiler does not like current construct on a !CONFIG_SMP build. Remove the warning by using a temporary variable. Fixes: 095dc8e0c368 ("tcp: fix/cleanup inet_ehash_locks_alloc()") Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dbe7faa4 |
|
08-Jul-2015 |
Eric Dumazet <edumazet@google.com> |
inet: inet_twsk_deschedule factorization inet_twsk_deschedule() calls are followed by inet_twsk_put(). Only particular case is in inet_twsk_purge() but there is no point to defer the inet_twsk_put() after re-enabling BH. Lets rename inet_twsk_deschedule() to inet_twsk_deschedule_put() and move the inet_twsk_put() inside. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fc01538f |
|
08-Jul-2015 |
Eric Dumazet <edumazet@google.com> |
inet: simplify timewait refcounting timewait sockets have a complex refcounting logic. Once we realize it should be similar to established and syn_recv sockets, we can use sk_nulls_del_node_init_rcu() and remove inet_twsk_unhash() In particular, deferred inet_twsk_put() added in commit 13475a30b66cd ("tcp: connect() race with timewait reuse") looks unecessary : When removing a timewait socket from ehash or bhash, caller must own a reference on the socket anyway. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e2baad9e |
|
27-May-2015 |
Eric Dumazet <edumazet@google.com> |
tcp: connect() from bound sockets can be faster __inet_hash_connect() does not use its third argument (port_offset) if socket was already bound to a source port. No need to perform useless but expensive md5 computations. Reported-by: Crestez Dan Leonard <cdleonard@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
07f4c900 |
|
24-May-2015 |
Eric Dumazet <edumazet@google.com> |
tcp/dccp: try to not exhaust ip_local_port_range in connect() A long standing problem on busy servers is the tiny available TCP port range (/proc/sys/net/ipv4/ip_local_port_range) and the default sequential allocation of source ports in connect() system call. If a host is having a lot of active TCP sessions, chances are very high that all ports are in use by at least one flow, and subsequent bind(0) attempts fail, or have to scan a big portion of space to find a slot. In this patch, I changed the starting point in __inet_hash_connect() so that we try to favor even [1] ports, leaving odd ports for bind() users. We still perform a sequential search, so there is no guarantee, but if connect() targets are very different, end result is we leave more ports available to bind(), and we spread them all over the range, lowering time for both connect() and bind() to find a slot. This strategy only works well if /proc/sys/net/ipv4/ip_local_port_range is even, ie if start/end values have different parity. Therefore, default /proc/sys/net/ipv4/ip_local_port_range was changed to 32768 - 60999 (instead of 32768 - 61000) There is no change on security aspects here, only some poor hashing schemes could be eventually impacted by this change. [1] : The odd/even property depends on ip_local_port_range values parity Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
095dc8e0 |
|
26-May-2015 |
Eric Dumazet <edumazet@google.com> |
tcp: fix/cleanup inet_ehash_locks_alloc() If tcp ehash table is constrained to a very small number of buckets (eg boot parameter thash_entries=128), then we can crash if spinlock array has more entries. While we are at it, un-inline inet_ehash_locks_alloc() and make following changes : - Budget 2 cache lines per cpu worth of 'spinlocks' - Try to kmalloc() the array to avoid extra TLB pressure. (Most servers at Google allocate 8192 bytes for this hash table) - Get rid of various #ifdef Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f5af1f57 |
|
20-May-2015 |
Eric Dumazet <edumazet@google.com> |
inet_hashinfo: remove bsocket counter We no longer need bsocket atomic counter, as inet_csk_get_port() calls bind_conflict() regardless of its value, after commit 2b05ad33e1e624e ("tcp: bind() fix autoselection to share ports") This patch removes overhead of maintaining this counter and double inet_csk_get_port() calls under pressure. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Marcelo Ricardo Leitner <mleitner@redhat.com> Cc: Flavio Leitner <fbl@redhat.com> Acked-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
789f558c |
|
12-Apr-2015 |
Eric Dumazet <edumazet@google.com> |
tcp/dccp: get rid of central timewait timer Using a timer wheel for timewait sockets was nice ~15 years ago when memory was expensive and machines had a single processor. This does not scale, code is ugly and source of huge latencies (Typically 30 ms have been seen, cpus spinning on death_lock spinlock.) We can afford to use an extra 64 bytes per timewait sock and spread timewait load to all cpus to have better behavior. Tested: On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1 on the target (lpaa24) Before patch : lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0 419594 lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0 437171 While test is running, we can observe 25 or even 33 ms latencies. lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23 ... 1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2 lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23 ... 1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2 After patch : About 90% increase of throughput : lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0 810442 lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0 800992 And latencies are kept to minimal values during this load, even if network utilization is 90% higher : lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23 ... 1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
00db4124 |
|
03-Apr-2015 |
Ian Morris <ipm@chirality.org.uk> |
ipv4: coding style: comparison for inequality with NULL The ipv4 code uses a mixture of coding styles. In some instances check for non-NULL pointer is done as x != NULL and sometimes as x. x is preferred according to checkpatch and this patch makes the code consistent by adopting the latter form. No changes detected by objdiff. Signed-off-by: Ian Morris <ipm@chirality.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b4d6444e |
|
18-Mar-2015 |
Eric Dumazet <edumazet@google.com> |
inet: get rid of last __inet_hash_connect() argument We now always call __inet_hash_nolisten(), no need to pass it as an argument. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
77a6a471 |
|
18-Mar-2015 |
Eric Dumazet <edumazet@google.com> |
ipv6: get rid of __inet6_hash() We can now use inet_hash() and __inet_hash() instead of private functions. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d1e559d0 |
|
18-Mar-2015 |
Eric Dumazet <edumazet@google.com> |
inet: add IPv6 support to sk_ehashfn() Intent is to converge IPv4 & IPv6 inet_hash functions to factorize code. IPv4 sockets initialize sk_rcv_saddr and sk_v6_daddr in this patch, thanks to new sk_daddr_set() and sk_rcv_saddr_set() helpers. __inet6_hash can now use sk_ehashfn() instead of a private inet6_sk_ehashfn() and will simply use __inet_hash() in a following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5b441f76 |
|
18-Mar-2015 |
Eric Dumazet <edumazet@google.com> |
net: introduce sk_ehashfn() helper Goal is to unify IPv4/IPv6 inet_hash handling, and use common helpers for all kind of sockets (full sockets, timewait and request sockets) inet_sk_ehashfn() becomes sk_ehashfn() but still only copes with IPv4 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6eada011 |
|
18-Mar-2015 |
Eric Dumazet <edumazet@google.com> |
netns: constify net_hash_mix() and various callers const qualifiers ease code review by making clear which objects are not written in a function. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2c13270b |
|
15-Mar-2015 |
Eric Dumazet <edumazet@google.com> |
inet: factorize sock_edemux()/sock_gen_put() code sock_edemux() is not used in fast path, and should really call sock_gen_put() to save some code. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
41b822c5 |
|
12-Mar-2015 |
Eric Dumazet <edumazet@google.com> |
inet: prepare sock_edemux() & sock_gen_put() for new SYN_RECV state sock_edemux() & sock_gen_put() should be ready to cope with request socks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
efd7ef1c |
|
11-Mar-2015 |
Eric W. Biederman <ebiederm@xmission.com> |
net: Kill hold_net release_net hold_net and release_net were an idea that turned out to be useless. The code has been disabled since 2008. Kill the code it is long past due. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8fc54f68 |
|
23-Aug-2014 |
Daniel Borkmann <daniel@iogearbox.net> |
net: use reciprocal_scale() helper Replace open codings of (((u64) <x> * <y>) >> 32) with reciprocal_scale(). Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c7228317 |
|
13-May-2014 |
Joe Perches <joe@perches.com> |
net: Use a more standard macro for INET_ADDR_COOKIE Missing a colon on definition use is a bit odd so change the macro for the 32 bit case to declare an __attribute__((unused)) and __deprecated variable. The __deprecated attribute will cause gcc to emit an error if the variable is actually used. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
122ff243 |
|
12-May-2014 |
WANG Cong <xiyou.wangcong@gmail.com> |
ipv4: make ip_local_reserved_ports per netns ip_local_port_range is already per netns, so should ip_local_reserved_ports be. And since it is none by default we don't actually need it when we don't enable CONFIG_SYSCTL. By the way, rename inet_is_reserved_local_port() to inet_is_local_reserved_port() Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1bbdceef |
|
19-Oct-2013 |
Hannes Frederic Sowa <hannes@stressinduktion.org> |
inet: convert inet_ehash_secret and ipv6_hash_secret to net_get_random_once Initialize the ehash and ipv6_hash_secrets with net_get_random_once. Each compilation unit gets its own secret now: ipv4/inet_hashtables.o ipv4/udp.o ipv6/inet6_hashtables.o ipv6/udp.o rds/connection.o The functions still get inlined into the hashing functions. In the fast path we have at most two (needed in ipv6) if (unlikely(...)). Cc: Eric Dumazet <edumazet@google.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
65cd8033 |
|
19-Oct-2013 |
Hannes Frederic Sowa <hannes@stressinduktion.org> |
ipv4: split inet_ehashfn to hash functions per compilation unit This duplicates a bit of code but let's us easily introduce separate secret keys later. The separate compilation units are ipv4/inet_hashtabbles.o, ipv4/udp.o and rds/connection.o. Cc: Eric Dumazet <edumazet@google.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
05dbc7b5 |
|
03-Oct-2013 |
Eric Dumazet <edumazet@google.com> |
tcp/dccp: remove twchain TCP listener refactoring, part 3 : Our goal is to hash SYN_RECV sockets into main ehash for fast lookup, and parallel SYN processing. Current inet_ehash_bucket contains two chains, one for ESTABLISH (and friend states) sockets, another for TIME_WAIT sockets only. As the hash table is sized to get at most one socket per bucket, it makes little sense to have separate twchain, as it makes the lookup slightly more complicated, and doubles hash table memory usage. If we make sure all socket types have the lookup keys at the same offsets, we can use a generic and faster lookup. It turns out TIME_WAIT and ESTABLISHED sockets already have common lookup fields for IPv4. [ INET_TW_MATCH() is no longer needed ] I'll provide a follow-up to factorize IPv6 lookup as well, to remove INET6_TW_MATCH() This way, SYN_RECV pseudo sockets will be supported the same. A new sock_gen_put() helper is added, doing either a sock_put() or inet_twsk_put() [ and will support SYN_RECV later ]. Note this helper should only be called in real slow path, when rcu lookup found a socket that was moved to another identity (freed/reused immediately), but could eventually be used in other contexts, like sock_edemux() Before patch : dmesg | grep "TCP established" TCP established hash table entries: 524288 (order: 11, 8388608 bytes) After patch : TCP established hash table entries: 524288 (order: 10, 4194304 bytes) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
80ad1d61 |
|
01-Oct-2013 |
Eric Dumazet <edumazet@google.com> |
net: do not call sock_put() on TIMEWAIT sockets commit 3ab5aee7fe84 ("net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls") incorrectly used sock_put() on TIMEWAIT sockets. We should instead use inet_twsk_put() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0bbf87d8 |
|
28-Sep-2013 |
Eric W. Biederman <ebiederm@xmission.com> |
net ipv4: Convert ipv4.ip_local_port_range to be per netns v3 - Move sysctl_local_ports from a global variable into struct netns_ipv4. - Modify inet_get_local_port_range to take a struct net, and update all of the callers. - Move the initialization of sysctl_local_ports into sysctl_net_ipv4.c:ipv4_sysctl_init_net from inet_connection_sock.c v2: - Ensure indentation used tabs - Fixed ip.h so it applies cleanly to todays net-next v3: - Compile fixes of strange callers of inet_get_local_port_range. This patch now successfully passes an allmodconfig build. Removed manual inlining of inet_get_local_port_range in ipv4_local_port_range Originally-by: Samya <samya@twitter.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3b8ccd44 |
|
11-Jul-2013 |
Camelia Groza <camelia.groza@gmail.com> |
inet: fix spacing in assignment Found using checkpatch.pl Signed-off-by: Camelia Groza <camelia.groza@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b67bfe0d |
|
27-Feb-2013 |
Sasha Levin <sasha.levin@oracle.com> |
hlist: drop the node parameter from iterators I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S | hlist_for_each_entry_continue(a, - b, c) S | hlist_for_each_entry_from(a, - b, c) S | hlist_for_each_entry_rcu(a, - b, c, d) S | hlist_for_each_entry_rcu_bh(a, - b, c, d) S | hlist_for_each_entry_continue_rcu_bh(a, - b, c) S | for_each_busy_worker(a, c, - b, d) S | ax25_uid_for_each(a, - b, c) S | ax25_for_each(a, - b, c) S | inet_bind_bucket_for_each(a, - b, c) S | sctp_for_each_hentry(a, - b, c) S | sk_for_each(a, - b, c) S | sk_for_each_rcu(a, - b, c) S | sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S | sk_for_each_safe(a, - b, c, d) S | sk_for_each_bound(a, - b, c) S | hlist_for_each_entry_safe(a, - b, c, d, e) S | hlist_for_each_entry_continue_rcu(a, - b, c) S | nr_neigh_for_each(a, - b, c) S | nr_neigh_for_each_safe(a, - b, c, d) S | nr_node_for_each(a, - b, c) S | nr_node_for_each_safe(a, - b, c, d) S | - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S | - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S | for_each_host(a, - b, c) S | for_each_host_safe(a, - b, c, d) S | for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
da5e3630 |
|
22-Jan-2013 |
Tom Herbert <therbert@google.com> |
soreuseport: TCP/IPv4 implementation Allow multiple listener sockets to bind to the same port. Motivation for soresuseport would be something like a web server binding to port 80 running with multiple threads, where each thread might have it's own listener socket. This could be done as an alternative to other models: 1) have one listener thread which dispatches completed connections to workers. 2) accept on a single listener socket from multiple threads. In case #1 the listener thread can easily become the bottleneck with high connection turn-over rate. In case #2, the proportion of connections accepted per thread tends to be uneven under high connection load (assuming simple event loop: while (1) { accept(); process() }, wakeup does not promote fairness among the sockets. We have seen the disproportion to be as high as 3:1 ratio between thread accepting most connections and the one accepting the fewest. With so_reusport the distribution is uniform. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ce43b03e |
|
30-Nov-2012 |
Eric Dumazet <edumazet@google.com> |
net: move inet_dport/inet_num in sock_common commit 68835aba4d9b (net: optimize INET input path further) moved some fields used for tcp/udp sockets lookup in the first cache line of struct sock_common. This patch moves inet_dport/inet_num as well, filling a 32bit hole on 64 bit arches and reducing number of cache line misses in lookups. Also change INET_MATCH()/INET_TW_MATCH() to perform the ports match before addresses match, as this check is more discriminant. Remove the hash check from MATCH() macros because we dont need to re validate the hash value after taking a refcount on socket, and use likely/unlikely compiler hints, as the sk_hash/hash check makes the following conditional tests 100% predicted by cpu. Introduce skc_addrpair/skc_portpair pair values to better document the alignment requirements of the port/addr pairs used in the various MATCH() macros, and remove some casts. The namespace check can also be done at last. This slightly improves TCP/UDP lookup times. IP/TCP early demux needs inet->rx_dst_ifindex and TCP needs inet->min_ttl, lets group them together in same cache line. With help from Ben Hutchings & Joe Perches. Idea of this patch came after Ling Ma proposal to move skc_hash to the beginning of struct sock_common, and should allow him to submit a final version of his patch. My tests show an improvement doing so. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Joe Perches <joe@perches.com> Cc: Ling Ma <ling.ma.program@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5e73ea1a |
|
14-Apr-2012 |
Daniel Baluta <dbaluta@ixiacom.com> |
ipv4: fix checkpatch errors Fix checkpatch errors of the following type: * ERROR: "foo * bar" should be "foo *bar" * ERROR: "(foo*)" should be "(foo *)" Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6e5714ea |
|
03-Aug-2011 |
David S. Miller <davem@davemloft.net> |
net: Compute protocol sequence numbers and fragment IDs using MD5. Computers have become a lot faster since we compromised on the partial MD4 hash which we use currently for performance reasons. MD5 is a much safer choice, and is inline with both RFC1948 and other ISS generators (OpenBSD, Solaris, etc.) Furthermore, only having 24-bits of the sequence number be truly unpredictable is a very serious limitation. So the periodic regeneration and 8-bit counter have been removed. We compute and use a full 32-bit sequence number. For ipv6, DCCP was found to use a 32-bit truncated initial sequence number (it needs 43-bits) and that is fixed here as well. Reported-by: Dan Kaminsky <dan@doxpara.com> Tested-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b4ff3c90 |
|
26-Nov-2010 |
Nagendra Tomar <tomer_iisc@yahoo.com> |
inet: Fix __inet_inherit_port() to correctly increment bsockets and num_owners inet sockets corresponding to passive connections are added to the bind hash using ___inet_inherit_port(). These sockets are later removed from the bind hash using __inet_put_port(). These two functions are not exactly symmetrical. __inet_put_port() decrements hashinfo->bsockets and tb->num_owners, whereas ___inet_inherit_port() does not increment them. This results in both of these going to -ve values. This patch fixes this by calling inet_bind_hash() from ___inet_inherit_port(), which does the right thing. 'bsockets' and 'num_owners' were introduced by commit a9d8f9110d7e953c (inet: Allowing more than 64k connections and heavily optimize bind(0)) Signed-off-by: Nagendra Singh Tomar <tomer_iisc@yahoo.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
093d2823 |
|
21-Oct-2010 |
Balazs Scheidler <bazsi@balabit.hu> |
tproxy: fix hash locking issue when using port redirection in __inet_inherit_port() When __inet_inherit_port() is called on a tproxy connection the wrong locks are held for the inet_bind_bucket it is added to. __inet_inherit_port() made an implicit assumption that the listener's port number (and thus its bind bucket). Unfortunately, if you're using the TPROXY target to redirect skbs to a transparent proxy that assumption is not true anymore and things break. This patch adds code to __inet_inherit_port() so that it can handle this case by looking up or creating a new bind bucket for the child socket and updates callers of __inet_inherit_port() to gracefully handle __inet_inherit_port() failing. Reported by and original patch from Stephen Buck <stephen.buck@exinda.com>. See http://marc.info/?t=128169268200001&r=1&w=2 for the original discussion. Signed-off-by: KOVACS Krisztian <hidden@balabit.hu> Signed-off-by: Patrick McHardy <kaber@trash.net>
|
#
4bc2f18b |
|
09-Jul-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net/ipv4: EXPORT_SYMBOL cleanups CodingStyle cleanups EXPORT_SYMBOL should immediately follow the symbol declaration. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e3826f1e |
|
04-May-2010 |
Amerigo Wang <amwang@redhat.com> |
net: reserve ports for applications using fixed port numbers (Dropped the infiniband part, because Tetsuo modified the related code, I will send a separate patch for it once this is accepted.) This patch introduces /proc/sys/net/ipv4/ip_local_reserved_ports which allows users to reserve ports for third-party applications. The reserved ports will not be used by automatic port assignments (e.g. when calling connect() or bind() with port number 0). Explicit port allocation behavior is unchanged. Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3cdaedae |
|
03-Dec-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: Fix a connect() race with timewait sockets When we find a timewait connection in __inet_hash_connect() and reuse it for a new connection request, we have a race window, releasing bind list lock and reacquiring it in __inet_twsk_kill() to remove timewait socket from list. Another thread might find the timewait socket we already chose, leading to list corruption and crashes. Fix is to remove timewait socket from bind list before releasing the bind lock. Note: This problem happens if sysctl_tcp_tw_reuse is set. Reported-by: kapil dakhane <kdakhane@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9327f705 |
|
03-Dec-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: Fix a connect() race with timewait sockets First patch changes __inet_hash_nolisten() and __inet6_hash() to get a timewait parameter to be able to unhash it from ehash at same time the new socket is inserted in hash. This makes sure timewait socket wont be found by a concurrent writer in __inet_check_established() Reported-by: kapil dakhane <kdakhane@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
13475a30 |
|
02-Dec-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: connect() race with timewait reuse Its currently possible that several threads issuing a connect() find the same timewait socket and try to reuse it, leading to list corruptions. Condition for bug is that these threads bound their socket on same address/port of to-be-find timewait socket, and connected to same target. (SO_REUSEADDR needed) To fix this problem, we could unhash timewait socket while holding ehash lock, to make sure lookups/changes will be serialized. Only first thread finds the timewait socket, other ones find the established socket and return an EADDRNOTAVAIL error. This second version takes into account Evgeniy's review and makes sure inet_twsk_put() is called outside of locked sections. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
09ad9bc7 |
|
25-Nov-2009 |
Octavian Purdila <opurdila@ixiacom.com> |
net: use net_eq to compare nets Generated with the following semantic patch @@ struct net *n1; struct net *n2; @@ - n1 == n2 + net_eq(n1, n2) @@ struct net *n1; struct net *n2; @@ - n1 != n2 + !net_eq(n1, n2) applied over {include,net,drivers/net}. Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c720c7e8 |
|
15-Oct-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
inet: rename some inet_sock fields In order to have better cache layouts of struct sock (separate zones for rx/tx paths), we need this preliminary patch. Goal is to transfert fields used at lookup time in the first read-mostly cache line (inside struct sock_common) and move sk_refcnt to a separate cache line (only written by rx path) This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr, sport and id fields. This allows a future patch to define these fields as macros, like sk_refcnt, without name clashes. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f373b53b |
|
08-Oct-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: replace ehash_size by ehash_mask Storing the mask (size - 1) instead of the size allows fast path to be a bit faster. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
24dd1fa1 |
|
01-Feb-2009 |
Eric Dumazet <dada1@cosmosbay.com> |
net: move bsockets outside of read only beginning of struct inet_hashinfo And switch bsockets to atomic_t since it might be changed in parallel. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a9d8f911 |
|
19-Jan-2009 |
Evgeniy Polyakov <zbr@ioremap.net> |
inet: Allowing more than 64k connections and heavily optimize bind(0) time. With simple extension to the binding mechanism, which allows to bind more than 64k sockets (or smaller amount, depending on sysctl parameters), we have to traverse the whole bind hash table to find out empty bucket. And while it is not a problem for example for 32k connections, bind() completion time grows exponentially (since after each successful binding we have to traverse one bucket more to find empty one) even if we start each time from random offset inside the hash table. So, when hash table is full, and we want to add another socket, we have to traverse the whole table no matter what, so effectivelly this will be the worst case performance and it will be constant. Attached picture shows bind() time depending on number of already bound sockets. Green area corresponds to the usual binding to zero port process, which turns on kernel port selection as described above. Red area is the bind process, when number of reuse-bound sockets is not limited by 64k (or sysctl parameters). The same exponential growth (hidden by the green area) before number of ports reaches sysctl limit. At this time bind hash table has exactly one reuse-enbaled socket in a bucket, but it is possible that they have different addresses. Actually kernel selects the first port to try randomly, so at the beginning bind will take roughly constant time, but with time number of port to check after random start will increase. And that will have exponential growth, but because of above random selection, not every next port selection will necessary take longer time than previous. So we have to consider the area below in the graph (if you could zoom it, you could find, that there are many different times placed there), so area can hide another. Blue area corresponds to the port selection optimization. This is rather simple design approach: hashtable now maintains (unprecise and racely updated) number of currently bound sockets, and when number of such sockets becomes greater than predefined value (I use maximum port range defined by sysctls), we stop traversing the whole bind hash table and just stop at first matching bucket after random start. Above limit roughly corresponds to the case, when bind hash table is full and we turned on mechanism of allowing to bind more reuse-enabled sockets, so it does not change behaviour of other sockets. Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net> Tested-by: Denys Fedoryschenko <denys@visp.net.lb> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
920de804 |
|
24-Nov-2008 |
Eric Dumazet <dada1@cosmosbay.com> |
net: Make sure BHs are disabled in sock_prot_inuse_add() The rule of calling sock_prot_inuse_add() is that BHs must be disabled. Some new calls were added where this was not true and this tiggers warnings as reported by Ilpo. Fix this by adding explicit BH disabling around those call sites, or moving sock_prot_inuse_add() call inside an existing BH disabled section. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c25eb3bf |
|
23-Nov-2008 |
Eric Dumazet <dada1@cosmosbay.com> |
net: Convert TCP/DCCP listening hash tables to use RCU This is the last step to be able to perform full RCU lookups in __inet_lookup() : After established/timewait tables, we add RCU lookups to listening hash table. The only trick here is that a socket of a given type (TCP ipv4, TCP ipv6, ...) can now flight between two different tables (established and listening) during a RCU grace period, so we must use different 'nulls' end-of-chain values for two tables. We define a large value : #define LISTENING_NULLS_BASE (1U << 29) So that slots in listening table are guaranteed to have different end-of-chain values than slots in established table. A reader can still detect it finished its lookup in the right chain. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9db66bdc |
|
20-Nov-2008 |
Eric Dumazet <dada1@cosmosbay.com> |
net: convert TCP/DCCP ehash rwlocks to spinlocks Now TCP & DCCP use RCU lookups, we can convert ehash rwlocks to spinlocks. /proc/net/tcp and other seq_file 'readers' can safely be converted to 'writers'. This should speedup writers, since spin_lock()/spin_unlock() only use one atomic operation instead of two for write_lock()/write_unlock() Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5caea4ea |
|
20-Nov-2008 |
Eric Dumazet <dada1@cosmosbay.com> |
net: listening_hash get a spinlock per bucket This patch prepares RCU migration of listening_hash table for TCP/DCCP protocols. listening_hash table being small (32 slots per protocol), we add a spinlock for each slot, instead of a single rwlock for whole table. This should reduce hold time of readers, and writers concurrency. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3ab5aee7 |
|
16-Nov-2008 |
Eric Dumazet <dada1@cosmosbay.com> |
net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls RCU was added to UDP lookups, using a fast infrastructure : - sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the price of call_rcu() at freeing time. - hlist_nulls permits to use few memory barriers. This patch uses same infrastructure for TCP/DCCP established and timewait sockets. Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications using short lived TCP connections. A followup patch, converting rwlocks to spinlocks will even speedup this case. __inet_lookup_established() is pretty fast now we dont have to dirty a contended cache line (read_lock/read_unlock) Only established and timewait hashtable are converted to RCU (bind table and listen table are still using traditional locking) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7a9546ee |
|
12-Nov-2008 |
Eric Dumazet <dada1@cosmosbay.com> |
net: ib_net pointer should depends on CONFIG_NET_NS We can shrink size of "struct inet_bind_bucket" by 50%, using read_pnet() and write_pnet() Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
547b792c |
|
25-Jul-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
net: convert BUG_TRAP to generic WARN_ON Removes legacy reinvent-the-wheel type thing. The generic machinery integrates much better to automated debugging aids such as kerneloops.org (and others), and is unambiguous due to better naming. Non-intuively BUG_TRAP() is actually equal to WARN_ON() rather than BUG_ON() though some might actually be promoted to BUG_ON() but I left that to future. I could make at least one BUILD_BUG_ON conversion. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
de0744af |
|
16-Jul-2008 |
Pavel Emelyanov <xemul@openvz.org> |
mib: add net to NET_INC_STATS_BH Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9f26b3ad |
|
16-Jun-2008 |
Pavel Emelyanov <xemul@openvz.org> |
inet: add struct net argument to inet_ehashfn Although this hash takes addresses into account, the ehash chains can also be too long when, for instance, communications via lo occur. So, prepare the inet_hashfn to take struct net into account. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2086a650 |
|
16-Jun-2008 |
Pavel Emelyanov <xemul@openvz.org> |
inet: add struct net argument to inet_lhashfn Listening-on-one-port sockets in many namespaces produce long chains in the listening_hash-es, so prepare the inet_lhashfn to take struct net into account. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7f635ab7 |
|
16-Jun-2008 |
Pavel Emelyanov <xemul@openvz.org> |
inet: add struct net argument to inet_bhashfn Binding to some port in many namespaces may create too long chains in bhash-es, so prepare the hashfn to take struct net into account. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
53083773 |
|
18-Apr-2008 |
Pavel Emelyanov <xemul@openvz.org> |
[INET]: Uninline the __inet_inherit_port call. This deblats ~200 bytes when ipv6 and dccp are 'y'. Besides, this will ease compilation issues for patches I'm working on to make inet hash tables more scalable wrt net namespaces. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8c5da49a |
|
16-Apr-2008 |
Denis V. Lunev <den@openvz.org> |
[NETNS]: Add netns refcnt debug for inet bind buckets. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c29a0bc4 |
|
31-Mar-2008 |
Pavel Emelyanov <xemul@openvz.org> |
[SOCK][NETNS]: Add a struct net argument to sock_prot_inuse_add and _get. This counter is about to become per-proto-and-per-net, so we'll need two arguments to determine which cell in this "table" to work with. All the places, but proc already pass proper net to it - proc will be tuned a bit later. Some indentation with spaces in proc files is done to keep the file coding style consistent. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
878628fb |
|
25-Mar-2008 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[NET] NETNS: Omit namespace comparision without CONFIG_NET_NS. Introduce an inline net_eq() to compare two namespaces. Without CONFIG_NET_NS, since no namespace other than &init_net exists, it is always 1. We do not need to convert 1) inline vs inline and 2) inline vs &init_net comparisons. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
|
#
3b1e0a65 |
|
25-Mar-2008 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[NET] NETNS: Omit sock->sk_net without CONFIG_NET_NS. Introduce per-sock inlines: sock_net(), sock_net_set() and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set(). Without CONFIG_NET_NS, no namespace other than &init_net exists. Let's explicitly define them to help compiler optimizations. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
|
#
39d8cda7 |
|
22-Mar-2008 |
Pavel Emelyanov <xemul@openvz.org> |
[SOCK]: Add udp_hash member to struct proto. Inspired by the commit ab1e0a13 ([SOCK] proto: Add hashinfo member to struct proto) from Arnaldo, I made similar thing for UDP/-Lite IPv4 and -v6 protocols. The result is not that exciting, but it removes some levels of indirection in udpxxx_get_port and saves some space in code and text. The first step is to union existing hashinfo and new udp_hash on the struct proto and give a name to this union, since future initialization of tcpxxx_prot, dccp_vx_protinfo and udpxxx_protinfo will cause gcc warning about inability to initialize anonymous member this way. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
324b5761 |
|
13-Feb-2008 |
Adrian Bunk <bunk@kernel.org> |
[INET]: Unexport inet_listen_wlock This patch removes the no longer used EXPORT_SYMBOL(inet_listen_wlock). Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
74da4d34 |
|
13-Feb-2008 |
Adrian Bunk <bunk@kernel.org> |
[INET]: Unexport __inet_hash_connect This patch removes the unused EXPORT_SYMBOL_GPL(__inet_hash_connect). Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5d8c0aa9 |
|
05-Feb-2008 |
Pavel Emelyanov <xemul@openvz.org> |
[INET]: Fix accidentally broken inet(6)_hash_connect's port offset calculations. The port offset calculations depend on the protocol family, but, as Adrian noticed, I broke this logic with the commit 5ee31fc1ecdcbc234c8c56dcacef87c8e09909d8 [INET]: Consolidate inet(6)_hash_connect. Return this logic back, by passing the port offset directly into the consolidated function. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Noticed-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ab1e0a13 |
|
03-Feb-2008 |
Arnaldo Carvalho de Melo <acme@redhat.com> |
[SOCK] proto: Add hashinfo member to struct proto This way we can remove TCP and DCCP specific versions of sk->sk_prot->get_port: both v4 and v6 use inet_csk_get_port sk->sk_prot->hash: inet_hash is directly used, only v6 need a specific version to deal with mapped sockets sk->sk_prot->unhash: both v4 and v6 use inet_hash directly struct inet_connection_sock_af_ops also gets a new member, bind_conflict, so that inet_csk_get_port can find the per family routine. Now only the lookup routines receive as a parameter a struct inet_hashtable. With this we further reuse code, reducing the difference among INET transport protocols. Eventually work has to be done on UDP and SCTP to make them share this infrastructure and get as a bonus inet_diag interfaces so that iproute can be used with these protocols. net-2.6/net/ipv4/inet_hashtables.c: struct proto | +8 struct inet_connection_sock_af_ops | +8 2 structs changed __inet_hash_nolisten | +18 __inet_hash | -210 inet_put_port | +8 inet_bind_bucket_create | +1 __inet_hash_connect | -8 5 functions changed, 27 bytes added, 218 bytes removed, diff: -191 net-2.6/net/core/sock.c: proto_seq_show | +3 1 function changed, 3 bytes added, diff: +3 net-2.6/net/ipv4/inet_connection_sock.c: inet_csk_get_port | +15 1 function changed, 15 bytes added, diff: +15 net-2.6/net/ipv4/tcp.c: tcp_set_state | -7 1 function changed, 7 bytes removed, diff: -7 net-2.6/net/ipv4/tcp_ipv4.c: tcp_v4_get_port | -31 tcp_v4_hash | -48 tcp_v4_destroy_sock | -7 tcp_v4_syn_recv_sock | -2 tcp_unhash | -179 5 functions changed, 267 bytes removed, diff: -267 net-2.6/net/ipv6/inet6_hashtables.c: __inet6_hash | +8 1 function changed, 8 bytes added, diff: +8 net-2.6/net/ipv4/inet_hashtables.c: inet_unhash | +190 inet_hash | +242 2 functions changed, 432 bytes added, diff: +432 vmlinux: 16 functions changed, 485 bytes added, 492 bytes removed, diff: -7 /home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c: tcp_v6_get_port | -31 tcp_v6_hash | -7 tcp_v6_syn_recv_sock | -9 3 functions changed, 47 bytes removed, diff: -47 /home/acme/git/net-2.6/net/dccp/proto.c: dccp_destroy_sock | -7 dccp_unhash | -179 dccp_hash | -49 dccp_set_state | -7 dccp_done | +1 5 functions changed, 1 bytes added, 242 bytes removed, diff: -241 /home/acme/git/net-2.6/net/dccp/ipv4.c: dccp_v4_get_port | -31 dccp_v4_request_recv_sock | -2 2 functions changed, 33 bytes removed, diff: -33 /home/acme/git/net-2.6/net/dccp/ipv6.c: dccp_v6_get_port | -31 dccp_v6_hash | -7 dccp_v6_request_recv_sock | +5 3 functions changed, 5 bytes added, 38 bytes removed, diff: -33 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c67499c0 |
|
31-Jan-2008 |
Pavel Emelyanov <xemul@openvz.org> |
[NETNS]: Tcp-v4 sockets per-net lookup. Add a net argument to inet_lookup and propagate it further into lookup calls. Plus tune the __inet_check_established. The dccp and inet_diag, which use that lookup functions pass the init_net into them. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
941b1d22 |
|
31-Jan-2008 |
Pavel Emelyanov <xemul@openvz.org> |
[NETNS]: Make bind buckets live in net namespaces. This tags the inet_bind_bucket struct with net pointer, initializes it during creation and makes a filtering during lookup. A better hashfn, that takes the net into account is to be done in the future, but currently all bind buckets with similar port will be in one hash chain. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5ee31fc1 |
|
31-Jan-2008 |
Pavel Emelyanov <xemul@openvz.org> |
[INET]: Consolidate inet(6)_hash_connect. These two functions are the same except for what they call to "check_established" and "hash" for a socket. This saves half-a-kilo for ipv4 and ipv6. add/remove: 1/0 grow/shrink: 1/4 up/down: 582/-1128 (-546) function old new delta __inet_hash_connect - 577 +577 arp_ignore 108 113 +5 static.hint 8 4 -4 rt_worker_func 376 372 -4 inet6_hash_connect 584 25 -559 inet_hash_connect 586 25 -561 Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
65f76517 |
|
03-Jan-2008 |
Eric Dumazet <dada1@cosmosbay.com> |
[NET]: prot_inuse cleanups and optimizations 1) Cleanups (all functions are prefixed by sock_prot_inuse) sock_prot_inc_use(prot) -> sock_prot_inuse_add(prot,-1) sock_prot_dec_use(prot) -> sock_prot_inuse_add(prot,-1) sock_prot_inuse() -> sock_prot_inuse_get() New functions : sock_prot_inuse_init() and sock_prot_inuse_free() to abstract pcounter use. 2) if CONFIG_PROC_FS=n, we can zap 'inuse' member from "struct proto", since nobody wants to read the inuse value. This saves 1372 bytes on i386/SMP and some cpu cycles. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9a429c49 |
|
01-Jan-2008 |
Eric Dumazet <dada1@cosmosbay.com> |
[NET]: Add some acquires/releases sparse annotations. Add __acquires() and __releases() annotations to suppress some sparse warnings. example of warnings : net/ipv4/udp.c:1555:14: warning: context imbalance in 'udp_seq_start' - wrong count at exit net/ipv4/udp.c:1571:13: warning: context imbalance in 'udp_seq_stop' - unexpected unlock Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
77a5ba55 |
|
20-Dec-2007 |
Pavel Emelyanov <xemul@openvz.org> |
[INET]: Uninline the __inet_lookup_established function. This is -700 bytes from the net/ipv4/built-in.o add/remove: 1/0 grow/shrink: 1/3 up/down: 340/-1040 (-700) function old new delta __inet_lookup_established - 339 +339 tcp_sacktag_write_queue 2254 2255 +1 tcp_v4_err 1304 973 -331 tcp_v4_rcv 2089 1744 -345 tcp_v4_do_rcv 826 462 -364 Exporting is for dccp module (used via e.g. inet_lookup). Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
152da81d |
|
20-Dec-2007 |
Pavel Emelyanov <xemul@openvz.org> |
[INET]: Uninline the __inet_hash function. This one is used in quite many places in the networking code and seems to big to be inline. After the patch net/ipv4/build-in.o loses ~650 bytes: add/remove: 2/0 grow/shrink: 0/5 up/down: 461/-1114 (-653) function old new delta __inet_hash_nolisten - 282 +282 __inet_hash - 179 +179 tcp_sacktag_write_queue 2255 2254 -1 __inet_lookup_listener 284 274 -10 tcp_v4_syn_recv_sock 755 493 -262 tcp_v4_hash 389 35 -354 inet_hash_connect 1086 599 -487 This version addresses the issue pointed by Eric, that while being inline this function was optimized by gcc in respect to the 'listen_possible' argument. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
230140cf |
|
07-Nov-2007 |
Eric Dumazet <dada1@cosmosbay.com> |
[INET]: Remove per bucket rwlock in tcp/dccp ehash table. As done two years ago on IP route cache table (commit 22c047ccbc68fa8f3fa57f0e8f906479a062c426) , we can avoid using one lock per hash bucket for the huge TCP/DCCP hash tables. On a typical x86_64 platform, this saves about 2MB or 4MB of ram, for litle performance differences. (we hit a different cache line for the rwlock, but then the bucket cache line have a better sharing factor among cpus, since we dirty it less often). For netstat or ss commands that want a full scan of hash table, we perform fewer memory accesses. Using a 'small' table of hashed rwlocks should be more than enough to provide correct SMP concurrency between different buckets, without using too much memory. Sizing of this table depends on num_possible_cpus() and various CONFIG settings. This patch provides some locking abstraction that may ease a future work using a different model for TCP/DCCP table. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a25de534 |
|
18-Oct-2007 |
Anton Arapov <aarapov@redhat.com> |
[INET]: Justification for local port range robustness. There is a justifying patch for Stephen's patches. Stephen's patches disallows using a port range of one single port and brakes the meaning of the 'remaining' variable, in some places it has different meaning. My patch gives back the sense of 'remaining' variable. It should mean how many ports are remaining and nothing else. Also my patch allows using a single port. I sure we must be able to use mentioned port range, this does not restricted by documentation and does not brake current behavior. usefull links: Patches posted by Stephen Hemminger http://marc.info/?l=linux-netdev&m=119206106218187&w=2 http://marc.info/?l=linux-netdev&m=119206109918235&w=2 Andrew Morton's comment http://marc.info/?l=linux-kernel&m=119248225007737&w=2 1. Allows using a port range of one single port. 2. Gives back sense of 'remaining' variable. Signed-off-by: Anton Arapov <aarapov@redhat.com> Acked-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
227b60f5 |
|
10-Oct-2007 |
Stephen Hemminger <shemminger@linux-foundation.org> |
[INET]: local port range robustness Expansion of original idea from Denis V. Lunev <den@openvz.org> Add robustness and locking to the local_port_range sysctl. 1. Enforce that low < high when setting. 2. Use seqlock to ensure atomic update. The locking might seem like overkill, but there are cases where sysadmin might want to change value in the middle of a DoS attack. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e905a9ed |
|
09-Feb-2007 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[NET] IPV4: Fix whitespace errors. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dbca9b275 |
|
08-Feb-2007 |
Eric Dumazet <dada1@cosmosbay.com> |
[NET]: change layout of ehash table ehash table layout is currently this one : First half of this table is used by sockets not in TIME_WAIT state Second half of it is used by sockets in TIME_WAIT state. This is non optimal because of for a given hash or socket, the two chain heads are located in separate cache lines. Moreover the locks of the second half are never used. If instead of this halving, we use two list heads in inet_ehash_bucket instead of only one, we probably can avoid one cache miss, and reduce ram usage, particularly if sizeof(rwlock_t) is big (various CONFIG_DEBUG_SPINLOCK, CONFIG_DEBUG_LOCK_ALLOC settings). So we still halves the table but we keep together related chains to speedup lookups and socket state change. In this patch I did not try to align struct inet_ehash_bucket, but a future patch could try to make this structure have a convenient size (a power of two or a multiple of L1_CACHE_SIZE). I guess rwlock will just vanish as soon as RCU is plugged into ehash :) , so maybe we dont need to scratch our heads to align the bucket... Note : In case struct inet_ehash_bucket is not a power of two, we could probably change alloc_large_system_hash() (in case it use __get_free_pages()) to free the unused space. It currently allocates a big zone, but the last quarter of it could be freed. Again, this should be a temporary 'problem'. Patch tested on ipv4 tcp only, but should be OK for IPV6 and DCCP. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e18b890b |
|
06-Dec-2006 |
Christoph Lameter <clameter@sgi.com> |
[PATCH] slab: remove kmem_cache_t Replace all uses of kmem_cache_t with struct kmem_cache. The patch was generated using the following script: #!/bin/sh # # Replace one string by another in all the kernel sources. # set -e for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do quilt add $file sed -e "1,\$s/$1/$2/g" $file >/tmp/$$ mv /tmp/$$ $file quilt refresh done The script was run like this sh replace kmem_cache_t "struct kmem_cache" Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
54e6ecb2 |
|
06-Dec-2006 |
Christoph Lameter <clameter@sgi.com> |
[PATCH] slab: remove SLAB_ATOMIC SLAB_ATOMIC is an alias of GFP_ATOMIC Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
fb99c848 |
|
27-Sep-2006 |
Al Viro <viro@zeniv.linux.org.uk> |
[IPV4]: annotate inet_lookup() and friends inet_lookup() annotated along with helper functions (__inet_lookup(), __inet_lookup_established(), inet_lookup_established(), inet_lookup_listener(), __inet_lookup_listener() and inet_ehashfn()) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4f765d84 |
|
27-Sep-2006 |
Al Viro <viro@zeniv.linux.org.uk> |
[IPV4]: INET_MATCH() annotations INET_MATCH() and friends depend on an interesting set of kludges: * there's a pair of adjacent fields in struct inet_sock - __be16 dport followed by __u16 num. We want to search by pair, so we combine the keys into a single 32bit value and compare with 32bit value read from &...->dport. * on 64bit targets we combine comparisons with pair of adjacent __be32 fields in the same way. Make sure that we don't mix those values with anything else and that pairs we form them from have correct types. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8f491069 |
|
09-Aug-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[IPV4]: Use network-order dport for all visible inet_lookup_* Right now most inet_lookup_* functions take a host-order hnum instead of a network-order dport because that's how it is represented internally. This means that users of these functions have to be careful about using the right byte-order. To add more confusion, inet_lookup takes a network-order dport unlike all other functions. So this patch changes all visible inet_lookup functions to take a dport and move all dport->hnum conversion inside them. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
99a92ff5 |
|
08-Aug-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[IPV4]: Uninline inet_lookup_listener By modern standards this function is way too big to be inlined. It's even bigger than __inet_lookup_listener :) Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6ab3d562 |
|
30-Jun-2006 |
Jörn Engel <joern@wohnheim.fh-wedel.de> |
Remove obsolete #include <linux/config.h> Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
#
6c97e72a |
|
12-Apr-2006 |
Adrian Bunk <bunk@stusta.de> |
[IPV4]: Possible cleanups. This patch contains the following possible cleanups: - make the following needlessly global function static: - arp.c: arp_rcv() - remove the following unused EXPORT_SYMBOL's: - devinet.c: devinet_ioctl - fib_frontend.c: ip_rt_ioctl - inet_hashtables.c: inet_bind_bucket_create - inet_hashtables.c: inet_bind_hash - tcp_input.c: sysctl_tcp_abc - tcp_ipv4.c: sysctl_tcp_tw_reuse - tcp_output.c: sysctl_tcp_mtu_probing - tcp_output.c: sysctl_tcp_base_mss Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
53b3531b |
|
24-Mar-2006 |
Alexey Dobriyan <adobriyan@gmail.com> |
[PATCH] s/;;/;/g Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
a7f5e7f1 |
|
14-Dec-2005 |
Arnaldo Carvalho de Melo <acme@mandriva.com> |
[INET]: Generalise tcp_v4_hash_connect Renaming it to inet_hash_connect, making it possible to ditch dccp_v4_hash_connect and share the same code with TCP instead. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
463c84b9 |
|
09-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[NET]: Introduce inet_connection_sock This creates struct inet_connection_sock, moving members out of struct tcp_sock that are shareable with other INET connection oriented protocols, such as DCCP, that in my private tree already uses most of these members. The functions that operate on these members were renamed, using a inet_csk_ prefix while not being moved yet to a new file, so as to ease the review of these changes. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e48c414e |
|
09-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[INET]: Generalise the TCP sock ID lookup routines And also some TIME_WAIT functions. [acme@toy net-2.6.14]$ grep built-in /tmp/before.size /tmp/after.size /tmp/before.size: 282955 13122 9312 305389 4a8ed net/ipv4/built-in.o /tmp/after.size: 281566 13122 9312 304000 4a380 net/ipv4/built-in.o [acme@toy net-2.6.14]$ I kept them still inlined, will uninline at some point to see what would be the performance difference. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
33b62231 |
|
09-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[INET]: Generalise tcp_v4_lookup_listener [acme@toy net-2.6.14]$ grep built-in /tmp/before /tmp/after /tmp/before: 282560 13122 9312 304994 4a762 net/ipv4/built-in.o /tmp/after: 282560 13122 9312 304994 4a762 net/ipv4/built-in.o Will be used in DCCP, not exporting it right now not to get in Adrian Bunk's exported-but-not-used-on-modules radar 8) Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f3f05f70 |
|
09-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[INET]: Generalise the tcp_listen_ lock routines Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2d8c4ce5 |
|
09-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[INET]: Generalise tcp_bind_hash & tcp_inherit_port This required moving tcp_bucket_cachep to inet_hashinfo. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
77d8bf9c |
|
09-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[INET]: Move the TCP hashtable functions/structs to inet_hashtables.[ch] Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|