History log of /linux-master/net/dccp/ipv4.c
Revision Date Author Comments
# fa2df45a 30-Oct-2023 Kuniyuki Iwashima <kuniyu@amazon.com>

dccp: Call security_inet_conn_request() after setting IPv4 addresses.

Initially, commit 4237c75c0a35 ("[MLSXFRM]: Auto-labeling of child
sockets") introduced security_inet_conn_request() in some functions
where reqsk is allocated. The hook is added just after the allocation,
so reqsk's IPv4 remote address was not initialised then.

However, SELinux/Smack started to read it in netlbl_req_setattr()
after the cited commits.

This bug was partially fixed by commit 284904aa7946 ("lsm: Relocate
the IPv4 security_inet_conn_request() hooks").

This patch fixes the last bug in DCCPv4.

Fixes: 389fb800ac8b ("netlabel: Label incoming TCP connections correctly in SELinux")
Fixes: 07feee8f812f ("netlabel: Cleanup the Smack/NetLabel code to fix incoming TCP connections")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# e08d0b3d 21-Sep-2023 Eric Dumazet <edumazet@google.com>

inet: implement lockless IP_TOS

Some reads of inet->tos are racy.

Add needed READ_ONCE() annotations and convert IP_TOS option lockless.

v2: missing changes in include/net/route.h (David Ahern)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6af28974 15-Sep-2023 Eric Dumazet <edumazet@google.com>

dccp: fix dccp_v4_err()/dccp_v6_err() again

dh->dccph_x is the 9th byte (offset 8) in "struct dccp_hdr",
not in the "byte 7" as Jann claimed.

We need to make sure the ICMP messages are big enough,
using more standard ways (no more assumptions).

syzbot reported:
BUG: KMSAN: uninit-value in pskb_may_pull_reason include/linux/skbuff.h:2667 [inline]
BUG: KMSAN: uninit-value in pskb_may_pull include/linux/skbuff.h:2681 [inline]
BUG: KMSAN: uninit-value in dccp_v6_err+0x426/0x1aa0 net/dccp/ipv6.c:94
pskb_may_pull_reason include/linux/skbuff.h:2667 [inline]
pskb_may_pull include/linux/skbuff.h:2681 [inline]
dccp_v6_err+0x426/0x1aa0 net/dccp/ipv6.c:94
icmpv6_notify+0x4c7/0x880 net/ipv6/icmp.c:867
icmpv6_rcv+0x19d5/0x30d0
ip6_protocol_deliver_rcu+0xda6/0x2a60 net/ipv6/ip6_input.c:438
ip6_input_finish net/ipv6/ip6_input.c:483 [inline]
NF_HOOK include/linux/netfilter.h:304 [inline]
ip6_input+0x15d/0x430 net/ipv6/ip6_input.c:492
ip6_mc_input+0xa7e/0xc80 net/ipv6/ip6_input.c:586
dst_input include/net/dst.h:468 [inline]
ip6_rcv_finish+0x5db/0x870 net/ipv6/ip6_input.c:79
NF_HOOK include/linux/netfilter.h:304 [inline]
ipv6_rcv+0xda/0x390 net/ipv6/ip6_input.c:310
__netif_receive_skb_one_core net/core/dev.c:5523 [inline]
__netif_receive_skb+0x1a6/0x5a0 net/core/dev.c:5637
netif_receive_skb_internal net/core/dev.c:5723 [inline]
netif_receive_skb+0x58/0x660 net/core/dev.c:5782
tun_rx_batched+0x83b/0x920
tun_get_user+0x564c/0x6940 drivers/net/tun.c:2002
tun_chr_write_iter+0x3af/0x5d0 drivers/net/tun.c:2048
call_write_iter include/linux/fs.h:1985 [inline]
new_sync_write fs/read_write.c:491 [inline]
vfs_write+0x8ef/0x15c0 fs/read_write.c:584
ksys_write+0x20f/0x4c0 fs/read_write.c:637
__do_sys_write fs/read_write.c:649 [inline]
__se_sys_write fs/read_write.c:646 [inline]
__x64_sys_write+0x93/0xd0 fs/read_write.c:646
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Uninit was created at:
slab_post_alloc_hook+0x12f/0xb70 mm/slab.h:767
slab_alloc_node mm/slub.c:3478 [inline]
kmem_cache_alloc_node+0x577/0xa80 mm/slub.c:3523
kmalloc_reserve+0x13d/0x4a0 net/core/skbuff.c:559
__alloc_skb+0x318/0x740 net/core/skbuff.c:650
alloc_skb include/linux/skbuff.h:1286 [inline]
alloc_skb_with_frags+0xc8/0xbd0 net/core/skbuff.c:6313
sock_alloc_send_pskb+0xa80/0xbf0 net/core/sock.c:2795
tun_alloc_skb drivers/net/tun.c:1531 [inline]
tun_get_user+0x23cf/0x6940 drivers/net/tun.c:1846
tun_chr_write_iter+0x3af/0x5d0 drivers/net/tun.c:2048
call_write_iter include/linux/fs.h:1985 [inline]
new_sync_write fs/read_write.c:491 [inline]
vfs_write+0x8ef/0x15c0 fs/read_write.c:584
ksys_write+0x20f/0x4c0 fs/read_write.c:637
__do_sys_write fs/read_write.c:649 [inline]
__se_sys_write fs/read_write.c:646 [inline]
__x64_sys_write+0x93/0xd0 fs/read_write.c:646
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

CPU: 0 PID: 4995 Comm: syz-executor153 Not tainted 6.6.0-rc1-syzkaller-00014-ga747acc0b752 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/04/2023

Fixes: 977ad86c2a1b ("dccp: Fix out of bounds access in DCCP error handler")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jann Horn <jannh@google.com>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 977ad86c 25-Aug-2023 Jann Horn <jannh@google.com>

dccp: Fix out of bounds access in DCCP error handler

There was a previous attempt to fix an out-of-bounds access in the DCCP
error handlers, but that fix assumed that the error handlers only want
to access the first 8 bytes of the DCCP header. Actually, they also look
at the DCCP sequence number, which is stored beyond 8 bytes, so an
explicit pskb_may_pull() is required.

Fixes: 6706a97fec96 ("dccp: fix out of bound access in dccp_v4_err()")
Fixes: 1aa9d1a0e7ee ("ipv6: dccp: fix out of bound access in dccp_v6_err()")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6b5f43ea 16-Aug-2023 Eric Dumazet <edumazet@google.com>

inet: move inet->recverr to inet->inet_flags

IP_RECVERR socket option can now be set/get without locking the socket.

This patch potentially avoid data-races around inet->recverr.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2d6c85ca 17-Jul-2023 Guillaume Nault <gnault@redhat.com>

dccp: Set TOS and routing scope independently for fib lookups.

There's no reason for setting the RTO_ONLINK flag in ->flowi4_tos as
RT_CONN_FLAGS() does. We can easily set ->flowi4_scope properly
instead. This makes the code more explicit and will allow to convert
->flowi4_tos to dscp_t in the future.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f866fbc8 18-Aug-2023 Eric Dumazet <edumazet@google.com>

ipv4: fix data-races around inet->inet_id

UDP sendmsg() is lockless, so ip_select_ident_segs()
can very well be run from multiple cpus [1]

Convert inet->inet_id to an atomic_t, but implement
a dedicated path for TCP, avoiding cost of a locked
instruction (atomic_add_return())

Note that this patch will cause a trivial merge conflict
because we added inet->flags in net-next tree.

v2: added missing change in
drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
(David Ahern)

[1]

BUG: KCSAN: data-race in __ip_make_skb / __ip_make_skb

read-write to 0xffff888145af952a of 2 bytes by task 7803 on cpu 1:
ip_select_ident_segs include/net/ip.h:542 [inline]
ip_select_ident include/net/ip.h:556 [inline]
__ip_make_skb+0x844/0xc70 net/ipv4/ip_output.c:1446
ip_make_skb+0x233/0x2c0 net/ipv4/ip_output.c:1560
udp_sendmsg+0x1199/0x1250 net/ipv4/udp.c:1260
inet_sendmsg+0x63/0x80 net/ipv4/af_inet.c:830
sock_sendmsg_nosec net/socket.c:725 [inline]
sock_sendmsg net/socket.c:748 [inline]
____sys_sendmsg+0x37c/0x4d0 net/socket.c:2494
___sys_sendmsg net/socket.c:2548 [inline]
__sys_sendmmsg+0x269/0x500 net/socket.c:2634
__do_sys_sendmmsg net/socket.c:2663 [inline]
__se_sys_sendmmsg net/socket.c:2660 [inline]
__x64_sys_sendmmsg+0x57/0x60 net/socket.c:2660
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

read to 0xffff888145af952a of 2 bytes by task 7804 on cpu 0:
ip_select_ident_segs include/net/ip.h:541 [inline]
ip_select_ident include/net/ip.h:556 [inline]
__ip_make_skb+0x817/0xc70 net/ipv4/ip_output.c:1446
ip_make_skb+0x233/0x2c0 net/ipv4/ip_output.c:1560
udp_sendmsg+0x1199/0x1250 net/ipv4/udp.c:1260
inet_sendmsg+0x63/0x80 net/ipv4/af_inet.c:830
sock_sendmsg_nosec net/socket.c:725 [inline]
sock_sendmsg net/socket.c:748 [inline]
____sys_sendmsg+0x37c/0x4d0 net/socket.c:2494
___sys_sendmsg net/socket.c:2548 [inline]
__sys_sendmmsg+0x269/0x500 net/socket.c:2634
__do_sys_sendmmsg net/socket.c:2663 [inline]
__se_sys_sendmmsg net/socket.c:2660 [inline]
__x64_sys_sendmmsg+0x57/0x60 net/socket.c:2660
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

value changed: 0x184d -> 0x184e

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 7804 Comm: syz-executor.1 Not tainted 6.5.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
==================================================================

Fixes: 23f57406b82d ("ipv4: avoid using shared IP generator for connected sockets")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# dc97391e 23-Jun-2023 David Howells <dhowells@redhat.com>

sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)

Remove ->sendpage() and ->sendpage_locked(). sendmsg() with
MSG_SPLICE_PAGES should be used instead. This allows multiple pages and
multipage folios to be passed through.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for net/can
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-afs@lists.infradead.org
cc: mptcp@lists.linux.dev
cc: rds-devel@oss.oracle.com
cc: tipc-discussion@lists.sourceforge.net
cc: virtualization@lists.linux-foundation.org
Link: https://lore.kernel.org/r/20230623225513.2732256-16-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 9a25f0cb 15-Mar-2023 Eric Dumazet <edumazet@google.com>

dccp: annotate lockless accesses to sk->sk_err_soft

This field can be read/written without lock synchronization.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e0833d1f 18-Nov-2022 Kuniyuki Iwashima <kuniyu@amazon.com>

dccp/tcp: Fixup bhash2 bucket when connect() fails.

If a socket bound to a wildcard address fails to connect(), we
only reset saddr and keep the port. Then, we have to fix up the
bhash2 bucket; otherwise, the bucket has an inconsistent address
in the list.

Also, listen() for such a socket will fire the WARN_ON() in
inet_csk_get_port(). [0]

Note that when a system runs out of memory, we give up fixing the
bucket and unlink sk from bhash and bhash2 by inet_put_port().

[0]:
WARNING: CPU: 0 PID: 207 at net/ipv4/inet_connection_sock.c:548 inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
Modules linked in:
CPU: 0 PID: 207 Comm: bhash2_prev_rep Not tainted 6.1.0-rc3-00799-gc8421681c845 #63
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.amzn2022.0.1 04/01/2014
RIP: 0010:inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
Code: 74 a7 eb 93 48 8b 54 24 18 0f b7 cb 4c 89 e6 4c 89 ff e8 48 b2 ff ff 49 8b 87 18 04 00 00 e9 32 ff ff ff 0f 0b e9 34 ff ff ff <0f> 0b e9 42 ff ff ff 41 8b 7f 50 41 8b 4f 54 89 fe 81 f6 00 00 ff
RSP: 0018:ffffc900003d7e50 EFLAGS: 00010202
RAX: ffff8881047fb500 RBX: 0000000000004e20 RCX: 0000000000000000
RDX: 000000000000000a RSI: 00000000fffffe00 RDI: 00000000ffffffff
RBP: ffffffff8324dc00 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000001 R14: 0000000000004e20 R15: ffff8881054e1280
FS: 00007f8ac04dc740(0000) GS:ffff88842fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020001540 CR3: 00000001055fa003 CR4: 0000000000770ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
inet_csk_listen_start (net/ipv4/inet_connection_sock.c:1205)
inet_listen (net/ipv4/af_inet.c:228)
__sys_listen (net/socket.c:1810)
__x64_sys_listen (net/socket.c:1819 net/socket.c:1817 net/socket.c:1817)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
RIP: 0033:0x7f8ac051de5d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
RSP: 002b:00007ffc1c177248 EFLAGS: 00000206 ORIG_RAX: 0000000000000032
RAX: ffffffffffffffda RBX: 0000000020001550 RCX: 00007f8ac051de5d
RDX: ffffffffffffff80 RSI: 0000000000000000 RDI: 0000000000000004
RBP: 00007ffc1c177270 R08: 0000000000000018 R09: 0000000000000007
R10: 0000000020001540 R11: 0000000000000206 R12: 00007ffc1c177388
R13: 0000000000401169 R14: 0000000000403e18 R15: 00007f8ac0723000
</TASK>

Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 8c5dae4c 18-Nov-2022 Kuniyuki Iwashima <kuniyu@amazon.com>

dccp/tcp: Update saddr under bhash's lock.

When we call connect() for a socket bound to a wildcard address, we update
saddr locklessly. However, it could result in a data race; another thread
iterating over bhash might see a corrupted address.

Let's update saddr under the bhash bucket's lock.

Fixes: 3df80d9320bc ("[DCCP]: Introduce DCCPv6")
Fixes: 7c657876b63c ("[DCCP]: Initial implementation")
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 77934dc6 18-Nov-2022 Kuniyuki Iwashima <kuniyu@amazon.com>

dccp/tcp: Reset saddr on failure after inet6?_hash_connect().

When connect() is called on a socket bound to the wildcard address,
we change the socket's saddr to a local address. If the socket
fails to connect() to the destination, we have to reset the saddr.

However, when an error occurs after inet_hash6?_connect() in
(dccp|tcp)_v[46]_conect(), we forget to reset saddr and leave
the socket bound to the address.

From the user's point of view, whether saddr is reset or not varies
with errno. Let's fix this inconsistent behaviour.

Note that after this patch, the repro [0] will trigger the WARN_ON()
in inet_csk_get_port() again, but this patch is not buggy and rather
fixes a bug papering over the bhash2's bug for which we need another
fix.

For the record, the repro causes -EADDRNOTAVAIL in inet_hash6_connect()
by this sequence:

s1 = socket()
s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s1.bind(('127.0.0.1', 10000))
s1.sendto(b'hello', MSG_FASTOPEN, (('127.0.0.1', 10000)))
# or s1.connect(('127.0.0.1', 10000))

s2 = socket()
s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s2.bind(('0.0.0.0', 10000))
s2.connect(('127.0.0.1', 10000)) # -EADDRNOTAVAIL

s2.listen(32) # WARN_ON(inet_csk(sk)->icsk_bind2_hash != tb2);

[0]: https://syzkaller.appspot.com/bug?extid=015d756bbd1f8b5c8f09

Fixes: 3df80d9320bc ("[DCCP]: Introduce DCCPv6")
Fixes: 7c657876b63c ("[DCCP]: Initial implementation")
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 7e3cf084 05-Oct-2022 Jason A. Donenfeld <Jason@zx2c4.com>

treewide: use get_random_{u8,u16}() when possible, part 1

Rather than truncate a 32-bit value to a 16-bit value or an 8-bit value,
simply use the get_random_{u8,u16}() functions, which are faster than
wasting the additional bytes from a 32-bit value. This was done
mechanically with this coccinelle script:

@@
expression E;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u16;
typedef __be16;
typedef __le16;
typedef u8;
@@
(
- (get_random_u32() & 0xffff)
+ get_random_u16()
|
- (get_random_u32() & 0xff)
+ get_random_u8()
|
- (get_random_u32() % 65536)
+ get_random_u16()
|
- (get_random_u32() % 256)
+ get_random_u8()
|
- (get_random_u32() >> 16)
+ get_random_u16()
|
- (get_random_u32() >> 24)
+ get_random_u8()
|
- (u16)get_random_u32()
+ get_random_u16()
|
- (u8)get_random_u32()
+ get_random_u8()
|
- (__be16)get_random_u32()
+ (__be16)get_random_u16()
|
- (__le16)get_random_u32()
+ (__le16)get_random_u16()
|
- prandom_u32_max(65536)
+ get_random_u16()
|
- prandom_u32_max(256)
+ get_random_u8()
|
- E->inet_id = get_random_u32()
+ E->inet_id = get_random_u16()
)

@@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u16;
identifier v;
@@
- u16 v = get_random_u32();
+ u16 v = get_random_u16();

@@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u8;
identifier v;
@@
- u8 v = get_random_u32();
+ u8 v = get_random_u8();

@@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u16;
u16 v;
@@
- v = get_random_u32();
+ v = get_random_u16();

@@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u8;
u8 v;
@@
- v = get_random_u32();
+ v = get_random_u8();

// Find a potential literal
@literal_mask@
expression LITERAL;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
position p;
@@

((T)get_random_u32()@p & (LITERAL))

// Examine limits
@script:python add_one@
literal << literal_mask.LITERAL;
RESULT;
@@

value = None
if literal.startswith('0x'):
value = int(literal, 16)
elif literal[0] in '123456789':
value = int(literal, 10)
if value is None:
print("I don't know how to handle %s" % (literal))
cocci.include_match(False)
elif value < 256:
coccinelle.RESULT = cocci.make_ident("get_random_u8")
elif value < 65536:
coccinelle.RESULT = cocci.make_ident("get_random_u16")
else:
print("Skipping large mask of %s" % (literal))
cocci.include_match(False)

// Replace the literal mask with the calculated result.
@plus_one@
expression literal_mask.LITERAL;
position literal_mask.p;
identifier add_one.RESULT;
identifier FUNC;
@@

- (FUNC()@p & (LITERAL))
+ (RESULT() & LITERAL)

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>


# 28044fc1 22-Aug-2022 Joanne Koong <joannelkoong@gmail.com>

net: Add a bhash2 table hashed by port and address

The current bind hashtable (bhash) is hashed by port only.
In the socket bind path, we have to check for bind conflicts by
traversing the specified port's inet_bind_bucket while holding the
hashbucket's spinlock (see inet_csk_get_port() and
inet_csk_bind_conflict()). In instances where there are tons of
sockets hashed to the same port at different addresses, the bind
conflict check is time-intensive and can cause softirq cpu lockups,
as well as stops new tcp connections since __inet_inherit_port()
also contests for the spinlock.

This patch adds a second bind table, bhash2, that hashes by
port and sk->sk_rcv_saddr (ipv4) and sk->sk_v6_rcv_saddr (ipv6).
Searching the bhash2 table leads to significantly faster conflict
resolution and less time holding the hashbucket spinlock.

Please note a few things:
* There can be the case where the a socket's address changes after it
has been bound. There are two cases where this happens:

1) The case where there is a bind() call on INADDR_ANY (ipv4) or
IPV6_ADDR_ANY (ipv6) and then a connect() call. The kernel will
assign the socket an address when it handles the connect()

2) In inet_sk_reselect_saddr(), which is called when rebuilding the
sk header and a few pre-conditions are met (eg rerouting fails).

In these two cases, we need to update the bhash2 table by removing the
entry for the old address, and add a new entry reflecting the updated
address.

* The bhash2 table must have its own lock, even though concurrent
accesses on the same port are protected by the bhash lock. Bhash2 must
have its own lock to protect against cases where sockets on different
ports hash to different bhash hashbuckets but to the same bhash2
hashbucket.

This brings up a few stipulations:
1) When acquiring both the bhash and the bhash2 lock, the bhash2 lock
will always be acquired after the bhash lock and released before the
bhash lock is released.

2) There are no nested bhash2 hashbucket locks. A bhash2 lock is always
acquired+released before another bhash2 lock is acquired+released.

* The bhash table cannot be superseded by the bhash2 table because for
bind requests on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6), every socket
bound to that port must be checked for a potential conflict. The bhash
table is the only source of port->socket associations.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 36f7cec4 13-May-2022 Eric Dumazet <edumazet@google.com>

dccp: use READ_ONCE() to read sk->sk_bound_dev_if

When reading listener sk->sk_bound_dev_if locklessly,
we must use READ_ONCE().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 67e1e2f4 20-Apr-2022 Guillaume Nault <gnault@redhat.com>

ipv4: Avoid using RTO_ONLINK with ip_route_connect().

Now that ip_rt_fix_tos() doesn't reset ->flowi4_scope unconditionally,
we don't have to rely on the RTO_ONLINK bit to properly set the scope
of a flowi4 structure. We can just set ->flowi4_scope explicitly and
avoid using RTO_ONLINK in ->flowi4_tos.

This patch converts callers of ip_route_connect(). Instead of setting
the tos parameter with RT_CONN_FLAGS(sk), as all callers do, we can:

1- Drop the tos parameter from ip_route_connect(): its value was
entirely based on sk, which is also passed as parameter.

2- Set ->flowi4_scope depending on the SOCK_LOCALROUTE socket option
instead of always initialising it with RT_SCOPE_UNIVERSE (let's
define ip_sock_rt_scope() for this purpose).

3- Avoid overloading ->flowi4_tos with RTO_ONLINK: since the scope is
now properly initialised, we don't need to tell ip_rt_fix_tos() to
adjust ->flowi4_scope for us. So let's define ip_sock_rt_tos(),
which is the same as RT_CONN_FLAGS() but without the RTO_ONLINK
bit overload.

Note:
In the original ip_route_connect() code, __ip_route_output_key()
might clear the RTO_ONLINK bit of fl4->flowi4_tos (because of
ip_rt_fix_tos()). Therefore flowi4_update_output() had to reuse the
original tos variable. Now that we don't set RTO_ONLINK any more,
this is not a problem and we can use fl4->flowi4_tos in
flowi4_update_output().

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 04c494e6 12-May-2022 Eric Dumazet <edumazet@google.com>

Revert "tcp/dccp: get rid of inet_twsk_purge()"

This reverts commits:

0dad4087a86a2cbe177404dc73f18ada26a2c390 ("tcp/dccp: get rid of inet_twsk_purge()")
d507204d3c5cc57d9a8bdf0a477615bb59ea1611 ("tcp/dccp: add tw->tw_bslot")

As Leonard pointed out, a newly allocated netns can happen
to reuse a freed 'struct net'.

While TCP TW timers were covered by my patches, other things were not:

1) Lookups in rx path (INET_MATCH() and INET6_MATCH()), as they look
at 4-tuple plus the 'struct net' pointer.

2) /proc/net/tcp[6] and inet_diag, same reason.

3) hashinfo->bhash[], same reason.

Fixing all this seems risky, lets instead revert.

In the future, we might have a per netns tcp hash table, or
a per netns list of timewait sockets...

Fixes: 0dad4087a86a ("tcp/dccp: get rid of inet_twsk_purge()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Leonard Crestez <cdleonard@gmail.com>
Tested-by: Leonard Crestez <cdleonard@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0dad4087 24-Jan-2022 Eric Dumazet <edumazet@google.com>

tcp/dccp: get rid of inet_twsk_purge()

Prior patches in the series made sure tw_timer_handler()
can be fired after netns has been dismantled/freed.

We no longer have to scan a potentially big TCP ehash
table at netns dismantle.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e3ae2365 27-Jun-2021 Alexander Aring <aahringo@redhat.com>

net: sock: introduce sk_error_report

This patch introduces a function wrapper to call the sk_error_report
callback. That will prepare to add additional handling whenever
sk_error_report is called, for example to trace socket errors.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5796254e 17-May-2021 Yejune Deng <yejune.deng@gmail.com>

net: Remove the member netns_ok

Every protocol has the 'netns_ok' member and it is euqal to 1. The
'if (!prot->netns_ok)' always false in inet_add_protocol().

Signed-off-by: Yejune Deng <yejunedeng@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b98b3304 08-Apr-2021 Florian Westphal <fw@strlen.de>

net: dccp: use net_generic storage

DCCP is virtually never used, so no need to use space in struct net for it.

Put the pernet ipv4/v6 socket in the dccp ipv4/ipv6 modules instead.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20210408174502.1625-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 01770a16 20-Nov-2020 Ricardo Dias <rdias@singlestore.com>

tcp: fix race condition when creating child sockets from syncookies

When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.

The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.

Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.

When such event happens, the TCP stack code has a race condition that
occurs between the momement a lookup is done to the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets to the
userspace program to the same client.

This patch fixes the race condition by checking if an existing child
socket exists for the same client when we are adding the second child
socket to the established connections socket. If an existing child
socket exists, we drop the packet and discard the second child socket
to the same client.

Signed-off-by: Ricardo Dias <rdias@singlestore.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 3df98d79 27-Sep-2020 Paul Moore <paul@paul-moore.com>

lsm,selinux: pass flowi_common instead of flowi to the LSM hooks

As pointed out by Herbert in a recent related patch, the LSM hooks do
not have the necessary address family information to use the flowi
struct safely. As none of the LSMs currently use any of the protocol
specific flowi information, replace the flowi pointers with pointers
to the address family independent flowi_common struct.

Reported-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>


# de033b7d 09-Sep-2020 Wei Wang <weiwan@google.com>

ip: pass tos into ip_build_and_send_pkt()

This commit adds tos as a new passed in parameter to
ip_build_and_send_pkt() which will be used in the later commit.
This is a pure restructure and does not have any functional change.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 54633527 22-Aug-2020 Randy Dunlap <rdunlap@infradead.org>

net: dccp: delete repeated words

Drop duplicated words in /net/dccp/.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Cc: dccp@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3021ad52 17-Jul-2020 Christoph Hellwig <hch@lst.de>

net/ipv6: remove compat_ipv6_{get,set}sockopt

Handle the few cases that need special treatment in-line using
in_compat_syscall(). This also removes all the now unused
compat_{get,set}sockopt methods.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b6238c04 17-Jul-2020 Christoph Hellwig <hch@lst.de>

net/ipv4: remove compat_ip_{get,set}sockopt

Handle the few cases that need special treatment in-line using
in_compat_syscall().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8c918ffb 17-Jul-2020 Christoph Hellwig <hch@lst.de>

net: remove compat_sock_common_{get,set}sockopt

Add the compat handling to sock_common_{get,set}sockopt instead,
keyed of in_compat_syscall(). This allow to remove the now unused
->compat_{get,set}sockopt methods from struct proto_ops.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d0b1101b 12-Jul-2020 Andrew Lunn <andrew@lunn.ch>

net: dccp: kerneldoc fixes

Simple fixes which require no deep knowledge of the code.

Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3d1e5039 04-Nov-2019 Eric Dumazet <edumazet@google.com>

dccp: do not leak jiffies on the wire

For some reason I missed the case of DCCP passive
flows in my previous patch.

Fixes: a904a0693c18 ("inet: stop leaking jiffies on the wire")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Thiemo Nagel <tnagel@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a904a069 01-Nov-2019 Eric Dumazet <edumazet@google.com>

inet: stop leaking jiffies on the wire

Historically linux tried to stick to RFC 791, 1122, 2003
for IPv4 ID field generation.

RFC 6864 made clear that no matter how hard we try,
we can not ensure unicity of IP ID within maximum
lifetime for all datagrams with a given source
address/destination address/protocol tuple.

Linux uses a per socket inet generator (inet_id), initialized
at connection startup with a XOR of 'jiffies' and other
fields that appear clear on the wire.

Thiemo Nagel pointed that this strategy is a privacy
concern as this provides 16 bits of entropy to fingerprint
devices.

Let's switch to a random starting point, this is just as
good as far as RFC 6864 is concerned and does not leak
anything critical.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Thiemo Nagel <tnagel@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 895b5c9f 29-Sep-2019 Florian Westphal <fw@strlen.de>

netfilter: drop bridge nf reset from nf_reset

commit 174e23810cd31
("sk_buff: drop all skb extensions on free and skb scrubbing") made napi
recycle always drop skb extensions. The additional skb_ext_del() that is
performed via nf_reset on napi skb recycle is not needed anymore.

Most nf_reset() calls in the stack are there so queued skb won't block
'rmmod nf_conntrack' indefinitely.

This removes the skb_ext_del from nf_reset, and renames it to a more
fitting nf_reset_ct().

In a few selected places, add a call to skb_ext_reset to make sure that
no active extensions remain.

I am submitting this for "net", because we're still early in the release
cycle. The patch applies to net-next too, but I think the rename causes
needless divergence between those trees.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2874c5fd 27-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# c7cbdbf2 17-Apr-2019 Arnd Bergmann <arnd@arndb.de>

net: rework SIOCGSTAMP ioctl handling

The SIOCGSTAMP/SIOCGSTAMPNS ioctl commands are implemented by many
socket protocol handlers, and all of those end up calling the same
sock_get_timestamp()/sock_get_timestampns() helper functions, which
results in a lot of duplicate code.

With the introduction of 64-bit time_t on 32-bit architectures, this
gets worse, as we then need four different ioctl commands in each
socket protocol implementation.

To simplify that, let's add a new .gettstamp() operation in
struct proto_ops, and move ioctl implementation into the common
sock_ioctl()/compat_sock_ioctl_trans() functions that these all go
through.

We can reuse the sock_get_timestamp() implementation, but generalize
it so it can deal with both native and compat mode, as well as
timeval and timespec structures.

Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/lkml/CAK8P3a038aDQQotzua_QtKGhq8O9n+rdiz2=WDCp82ys8eUT+A@mail.gmail.com/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 32bbd879 07-Nov-2018 Stefano Brivio <sbrivio@redhat.com>

net: Convert protocol error handlers from void to int

We'll need this to handle ICMP errors for tunnels without a sending socket
(i.e. FoU and GUE). There, we might have to look up different types of IP
tunnels, registered as network protocols, before we get a match, so we
want this for the error handlers of IPPROTO_IPIP and IPPROTO_IPV6 in both
inet_protos and inet6_protos. These error codes will be used in the next
patch.

For consistency, return sensible error codes in protocol error handlers
whenever handlers can't handle errors because, even if valid, they don't
match a protocol or any of its states.

This has no effect on existing error handling paths.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2ab2ddd3 02-Oct-2018 Eric Dumazet <edumazet@google.com>

inet: make sure to grab rcu_read_lock before using ireq->ireq_opt

Timer handlers do not imply rcu_read_lock(), so my recent fix
triggered a LOCKDEP warning when SYNACK is retransmit.

Lets add rcu_read_lock()/rcu_read_unlock() pairs around ireq->ireq_opt
usages instead of guessing what is done by callers, since it is
not worth the pain.

Get rid of ireq_opt_deref() helper since it hides the logic
without real benefit, since it is now a standard rcu_dereference().

Fixes: 1ad98e9d1bdf ("tcp/dccp: fix lockdep issue when SYN is backlogged")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a11e1d43 28-Jun-2018 Linus Torvalds <torvalds@linux-foundation.org>

Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL

The poll() changes were not well thought out, and completely
unexplained. They also caused a huge performance regression, because
"->poll()" was no longer a trivial file operation that just called down
to the underlying file operations, but instead did at least two indirect
calls.

Indirect calls are sadly slow now with the Spectre mitigation, but the
performance problem could at least be largely mitigated by changing the
"->get_poll_head()" operation to just have a per-file-descriptor pointer
to the poll head instead. That gets rid of one of the new indirections.

But that doesn't fix the new complexity that is completely unwarranted
for the regular case. The (undocumented) reason for the poll() changes
was some alleged AIO poll race fixing, but we don't make the common case
slower and more complex for some uncommon special case, so this all
really needs way more explanations and most likely a fundamental
redesign.

[ This revert is a revert of about 30 different commits, not reverted
individually because that would just be unnecessarily messy - Linus ]

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f4335f52 31-Dec-2017 Christoph Hellwig <hch@lst.de>

net/dccp: convert to ->poll_mask

Signed-off-by: Christoph Hellwig <hch@lst.de>


# b855ff82 07-Apr-2018 Eric Dumazet <edumazet@google.com>

dccp: initialize ireq->ir_mark

syzbot reported an uninit-value read of skb->mark in iptable_mangle_hook()

Thanks to the nice report, I tracked the problem to dccp not caring
of ireq->ir_mark for passive sessions.

BUG: KMSAN: uninit-value in ipt_mangle_out net/ipv4/netfilter/iptable_mangle.c:66 [inline]
BUG: KMSAN: uninit-value in iptable_mangle_hook+0x5e5/0x720 net/ipv4/netfilter/iptable_mangle.c:84
CPU: 0 PID: 5300 Comm: syz-executor3 Not tainted 4.16.0+ #81
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x185/0x1d0 lib/dump_stack.c:53
kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
__msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
ipt_mangle_out net/ipv4/netfilter/iptable_mangle.c:66 [inline]
iptable_mangle_hook+0x5e5/0x720 net/ipv4/netfilter/iptable_mangle.c:84
nf_hook_entry_hookfn include/linux/netfilter.h:120 [inline]
nf_hook_slow+0x158/0x3d0 net/netfilter/core.c:483
nf_hook include/linux/netfilter.h:243 [inline]
__ip_local_out net/ipv4/ip_output.c:113 [inline]
ip_local_out net/ipv4/ip_output.c:122 [inline]
ip_queue_xmit+0x1d21/0x21c0 net/ipv4/ip_output.c:504
dccp_transmit_skb+0x15eb/0x1900 net/dccp/output.c:142
dccp_xmit_packet+0x814/0x9e0 net/dccp/output.c:281
dccp_write_xmit+0x20f/0x480 net/dccp/output.c:363
dccp_sendmsg+0x12ca/0x12d0 net/dccp/proto.c:818
inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
sock_sendmsg_nosec net/socket.c:630 [inline]
sock_sendmsg net/socket.c:640 [inline]
___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
__sys_sendmsg net/socket.c:2080 [inline]
SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
SyS_sendmsg+0x54/0x80 net/socket.c:2087
do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x455259
RSP: 002b:00007f1a4473dc68 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f1a4473e6d4 RCX: 0000000000455259
RDX: 0000000000000000 RSI: 0000000020b76fc8 RDI: 0000000000000015
RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000000004f0 R14: 00000000006fa720 R15: 0000000000000000

Uninit was stored to memory at:
kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
kmsan_save_stack mm/kmsan/kmsan.c:293 [inline]
kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684
__msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:521
ip_queue_xmit+0x1e35/0x21c0 net/ipv4/ip_output.c:502
dccp_transmit_skb+0x15eb/0x1900 net/dccp/output.c:142
dccp_xmit_packet+0x814/0x9e0 net/dccp/output.c:281
dccp_write_xmit+0x20f/0x480 net/dccp/output.c:363
dccp_sendmsg+0x12ca/0x12d0 net/dccp/proto.c:818
inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
sock_sendmsg_nosec net/socket.c:630 [inline]
sock_sendmsg net/socket.c:640 [inline]
___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
__sys_sendmsg net/socket.c:2080 [inline]
SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
SyS_sendmsg+0x54/0x80 net/socket.c:2087
do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Uninit was stored to memory at:
kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
kmsan_save_stack mm/kmsan/kmsan.c:293 [inline]
kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684
__msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:521
inet_csk_clone_lock+0x503/0x580 net/ipv4/inet_connection_sock.c:797
dccp_create_openreq_child+0x7f/0x890 net/dccp/minisocks.c:92
dccp_v4_request_recv_sock+0x22c/0xe90 net/dccp/ipv4.c:408
dccp_v6_request_recv_sock+0x290/0x2000 net/dccp/ipv6.c:414
dccp_check_req+0x7b9/0x8f0 net/dccp/minisocks.c:197
dccp_v4_rcv+0x12e4/0x2630 net/dccp/ipv4.c:840
ip_local_deliver_finish+0x6ed/0xd40 net/ipv4/ip_input.c:216
NF_HOOK include/linux/netfilter.h:288 [inline]
ip_local_deliver+0x43c/0x4e0 net/ipv4/ip_input.c:257
dst_input include/net/dst.h:449 [inline]
ip_rcv_finish+0x1253/0x16d0 net/ipv4/ip_input.c:397
NF_HOOK include/linux/netfilter.h:288 [inline]
ip_rcv+0x119d/0x16f0 net/ipv4/ip_input.c:493
__netif_receive_skb_core+0x47cf/0x4a80 net/core/dev.c:4562
__netif_receive_skb net/core/dev.c:4627 [inline]
process_backlog+0x62d/0xe20 net/core/dev.c:5307
napi_poll net/core/dev.c:5705 [inline]
net_rx_action+0x7c1/0x1a70 net/core/dev.c:5771
__do_softirq+0x56d/0x93d kernel/softirq.c:285
Uninit was created at:
kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
kmem_cache_alloc+0xaab/0xb90 mm/slub.c:2756
reqsk_alloc include/net/request_sock.h:88 [inline]
inet_reqsk_alloc+0xc4/0x7f0 net/ipv4/tcp_input.c:6145
dccp_v4_conn_request+0x5cc/0x1770 net/dccp/ipv4.c:600
dccp_v6_conn_request+0x299/0x1880 net/dccp/ipv6.c:317
dccp_rcv_state_process+0x2ea/0x2410 net/dccp/input.c:612
dccp_v4_do_rcv+0x229/0x340 net/dccp/ipv4.c:682
dccp_v6_do_rcv+0x16d/0x1220 net/dccp/ipv6.c:578
sk_backlog_rcv include/net/sock.h:908 [inline]
__sk_receive_skb+0x60e/0xf20 net/core/sock.c:513
dccp_v4_rcv+0x24d4/0x2630 net/dccp/ipv4.c:874
ip_local_deliver_finish+0x6ed/0xd40 net/ipv4/ip_input.c:216
NF_HOOK include/linux/netfilter.h:288 [inline]
ip_local_deliver+0x43c/0x4e0 net/ipv4/ip_input.c:257
dst_input include/net/dst.h:449 [inline]
ip_rcv_finish+0x1253/0x16d0 net/ipv4/ip_input.c:397
NF_HOOK include/linux/netfilter.h:288 [inline]
ip_rcv+0x119d/0x16f0 net/ipv4/ip_input.c:493
__netif_receive_skb_core+0x47cf/0x4a80 net/core/dev.c:4562
__netif_receive_skb net/core/dev.c:4627 [inline]
process_backlog+0x62d/0xe20 net/core/dev.c:5307
napi_poll net/core/dev.c:5705 [inline]
net_rx_action+0x7c1/0x1a70 net/core/dev.c:5771
__do_softirq+0x56d/0x93d kernel/softirq.c:285

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2f635cee 27-Mar-2018 Kirill Tkhai <ktkhai@virtuozzo.com>

net: Drop pernet_operations::async

Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5368bd72 05-Mar-2018 Kirill Tkhai <ktkhai@virtuozzo.com>

net: Convert dccp_v4_ops

These pernet_operations create and destroy net::dccp::v4_ctl_sk.
It looks like another pernet_operations don't want to send
dccp packets to dying or creating net. Batch method similar
to ipv4/ipv6 sockets and it has to be safe to be executed
in parallel with anything else. So, we mark them as async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 06f877d6 24-Oct-2017 Eric Dumazet <edumazet@google.com>

tcp/dccp: fix other lockdep splats accessing ireq_opt

In my first attempt to fix the lockdep splat, I forgot we could
enter inet_csk_route_req() with a freshly allocated request socket,
for which refcount has not yet been elevated, due to complex
SLAB_TYPESAFE_BY_RCU rules.

We either are in rcu_read_lock() section _or_ we own a refcount on the
request.

Correct RCU verb to use here is rcu_dereference_check(), although it is
not possible to prove we actually own a reference on a shared
refcount :/

In v2, I added ireq_opt_deref() helper and use in three places, to fix other
possible splats.

[ 49.844590] lockdep_rcu_suspicious+0xea/0xf3
[ 49.846487] inet_csk_route_req+0x53/0x14d
[ 49.848334] tcp_v4_route_req+0xe/0x10
[ 49.850174] tcp_conn_request+0x31c/0x6a0
[ 49.851992] ? __lock_acquire+0x614/0x822
[ 49.854015] tcp_v4_conn_request+0x5a/0x79
[ 49.855957] ? tcp_v4_conn_request+0x5a/0x79
[ 49.858052] tcp_rcv_state_process+0x98/0xdcc
[ 49.859990] ? sk_filter_trim_cap+0x2f6/0x307
[ 49.862085] tcp_v4_do_rcv+0xfc/0x145
[ 49.864055] ? tcp_v4_do_rcv+0xfc/0x145
[ 49.866173] tcp_v4_rcv+0x5ab/0xaf9
[ 49.868029] ip_local_deliver_finish+0x1af/0x2e7
[ 49.870064] ip_local_deliver+0x1b2/0x1c5
[ 49.871775] ? inet_del_offload+0x45/0x45
[ 49.873916] ip_rcv_finish+0x3f7/0x471
[ 49.875476] ip_rcv+0x3f1/0x42f
[ 49.876991] ? ip_local_deliver_finish+0x2e7/0x2e7
[ 49.878791] __netif_receive_skb_core+0x6d3/0x950
[ 49.880701] ? process_backlog+0x7e/0x216
[ 49.882589] __netif_receive_skb+0x1d/0x5e
[ 49.884122] process_backlog+0x10c/0x216
[ 49.885812] net_rx_action+0x147/0x3df

Fixes: a6ca7abe53633 ("tcp/dccp: fix lockdep splat in inet_csk_route_req()")
Fixes: c92e8c02fe66 ("tcp/dccp: fix ireq->opt races")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: kernel test robot <fengguang.wu@intel.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c92e8c02 20-Oct-2017 Eric Dumazet <edumazet@google.com>

tcp/dccp: fix ireq->opt races

syzkaller found another bug in DCCP/TCP stacks [1]

For the reasons explained in commit ce1050089c96 ("tcp/dccp: fix
ireq->pktopts race"), we need to make sure we do not access
ireq->opt unless we own the request sock.

Note the opt field is renamed to ireq_opt to ease grep games.

[1]
BUG: KASAN: use-after-free in ip_queue_xmit+0x1687/0x18e0 net/ipv4/ip_output.c:474
Read of size 1 at addr ffff8801c951039c by task syz-executor5/3295

CPU: 1 PID: 3295 Comm: syz-executor5 Not tainted 4.14.0-rc4+ #80
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:16 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:52
print_address_description+0x73/0x250 mm/kasan/report.c:252
kasan_report_error mm/kasan/report.c:351 [inline]
kasan_report+0x25b/0x340 mm/kasan/report.c:409
__asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:427
ip_queue_xmit+0x1687/0x18e0 net/ipv4/ip_output.c:474
tcp_transmit_skb+0x1ab7/0x3840 net/ipv4/tcp_output.c:1135
tcp_send_ack.part.37+0x3bb/0x650 net/ipv4/tcp_output.c:3587
tcp_send_ack+0x49/0x60 net/ipv4/tcp_output.c:3557
__tcp_ack_snd_check+0x2c6/0x4b0 net/ipv4/tcp_input.c:5072
tcp_ack_snd_check net/ipv4/tcp_input.c:5085 [inline]
tcp_rcv_state_process+0x2eff/0x4850 net/ipv4/tcp_input.c:6071
tcp_child_process+0x342/0x990 net/ipv4/tcp_minisocks.c:816
tcp_v4_rcv+0x1827/0x2f80 net/ipv4/tcp_ipv4.c:1682
ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
NF_HOOK include/linux/netfilter.h:249 [inline]
ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
dst_input include/net/dst.h:464 [inline]
ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
NF_HOOK include/linux/netfilter.h:249 [inline]
ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
__netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
__netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
netif_receive_skb+0xae/0x390 net/core/dev.c:4611
tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
call_write_iter include/linux/fs.h:1770 [inline]
new_sync_write fs/read_write.c:468 [inline]
__vfs_write+0x68a/0x970 fs/read_write.c:481
vfs_write+0x18f/0x510 fs/read_write.c:543
SYSC_write fs/read_write.c:588 [inline]
SyS_write+0xef/0x220 fs/read_write.c:580
entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x40c341
RSP: 002b:00007f469523ec10 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 000000000040c341
RDX: 0000000000000037 RSI: 0000000020004000 RDI: 0000000000000015
RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000f4240 R11: 0000000000000293 R12: 00000000004b7fd1
R13: 00000000ffffffff R14: 0000000020000000 R15: 0000000000025000

Allocated by task 3295:
save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
save_stack+0x43/0xd0 mm/kasan/kasan.c:447
set_track mm/kasan/kasan.c:459 [inline]
kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
__do_kmalloc mm/slab.c:3725 [inline]
__kmalloc+0x162/0x760 mm/slab.c:3734
kmalloc include/linux/slab.h:498 [inline]
tcp_v4_save_options include/net/tcp.h:1962 [inline]
tcp_v4_init_req+0x2d3/0x3e0 net/ipv4/tcp_ipv4.c:1271
tcp_conn_request+0xf6d/0x3410 net/ipv4/tcp_input.c:6283
tcp_v4_conn_request+0x157/0x210 net/ipv4/tcp_ipv4.c:1313
tcp_rcv_state_process+0x8ea/0x4850 net/ipv4/tcp_input.c:5857
tcp_v4_do_rcv+0x55c/0x7d0 net/ipv4/tcp_ipv4.c:1482
tcp_v4_rcv+0x2d10/0x2f80 net/ipv4/tcp_ipv4.c:1711
ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
NF_HOOK include/linux/netfilter.h:249 [inline]
ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
dst_input include/net/dst.h:464 [inline]
ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
NF_HOOK include/linux/netfilter.h:249 [inline]
ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
__netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
__netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
netif_receive_skb+0xae/0x390 net/core/dev.c:4611
tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
call_write_iter include/linux/fs.h:1770 [inline]
new_sync_write fs/read_write.c:468 [inline]
__vfs_write+0x68a/0x970 fs/read_write.c:481
vfs_write+0x18f/0x510 fs/read_write.c:543
SYSC_write fs/read_write.c:588 [inline]
SyS_write+0xef/0x220 fs/read_write.c:580
entry_SYSCALL_64_fastpath+0x1f/0xbe

Freed by task 3306:
save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
save_stack+0x43/0xd0 mm/kasan/kasan.c:447
set_track mm/kasan/kasan.c:459 [inline]
kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
__cache_free mm/slab.c:3503 [inline]
kfree+0xca/0x250 mm/slab.c:3820
inet_sock_destruct+0x59d/0x950 net/ipv4/af_inet.c:157
__sk_destruct+0xfd/0x910 net/core/sock.c:1560
sk_destruct+0x47/0x80 net/core/sock.c:1595
__sk_free+0x57/0x230 net/core/sock.c:1603
sk_free+0x2a/0x40 net/core/sock.c:1614
sock_put include/net/sock.h:1652 [inline]
inet_csk_complete_hashdance+0xd5/0xf0 net/ipv4/inet_connection_sock.c:959
tcp_check_req+0xf4d/0x1620 net/ipv4/tcp_minisocks.c:765
tcp_v4_rcv+0x17f6/0x2f80 net/ipv4/tcp_ipv4.c:1675
ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
NF_HOOK include/linux/netfilter.h:249 [inline]
ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
dst_input include/net/dst.h:464 [inline]
ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
NF_HOOK include/linux/netfilter.h:249 [inline]
ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
__netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
__netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
netif_receive_skb+0xae/0x390 net/core/dev.c:4611
tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
call_write_iter include/linux/fs.h:1770 [inline]
new_sync_write fs/read_write.c:468 [inline]
__vfs_write+0x68a/0x970 fs/read_write.c:481
vfs_write+0x18f/0x510 fs/read_write.c:543
SYSC_write fs/read_write.c:588 [inline]
SyS_write+0xef/0x220 fs/read_write.c:580
entry_SYSCALL_64_fastpath+0x1f/0xbe

Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3fa6f616 07-Aug-2017 David Ahern <dsahern@gmail.com>

net: ipv4: add second dif to inet socket lookups

Add a second device index, sdif, to inet socket lookups. sdif is the
index for ingress devices enslaved to an l3mdev. It allows the lookups
to consider the enslaved device as well as the L3 domain when searching
for a socket.

TCP moves the data in the cb. Prior to tcp_v4_rcv (e.g., early demux) the
ingress index is obtained from IPCB using inet_sdif and after the cb move
in tcp_v4_rcv the tcp_v4_sdif helper is used.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b7953d3c 26-Jul-2017 Xin Long <lucien.xin@gmail.com>

dccp: fix a memleak that dccp_ipv4 doesn't put reqsk properly

The patch "dccp: fix a memleak that dccp_ipv6 doesn't put reqsk
properly" fixed reqsk refcnt leak for dccp_ipv6. The same issue
exists on dccp_ipv4.

This patch is to fix it for dccp_ipv4.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d5494acb 20-Jun-2017 Xin Long <lucien.xin@gmail.com>

dccp: call inet_add_protocol after register_pernet_subsys in dccp_v4_init

Now dccp_ipv4 works as a kernel module. During loading this module, if
one dccp packet is being recieved after inet_add_protocol but before
register_pernet_subsys in which v4_ctl_sk is initialized, a null pointer
dereference may be triggered because of init_net.dccp.v4_ctl_sk is 0x0.

Jianlin found this issue when the following call trace occurred:

[ 171.950177] BUG: unable to handle kernel NULL pointer dereference at 0000000000000110
[ 171.951007] IP: [<ffffffffc0558364>] dccp_v4_ctl_send_reset+0xc4/0x220 [dccp_ipv4]
[...]
[ 171.984629] Call Trace:
[ 171.984859] <IRQ>
[ 171.985061]
[ 171.985213] [<ffffffffc0559a53>] dccp_v4_rcv+0x383/0x3f9 [dccp_ipv4]
[ 171.985711] [<ffffffff815ca054>] ip_local_deliver_finish+0xb4/0x1f0
[ 171.986309] [<ffffffff815ca339>] ip_local_deliver+0x59/0xd0
[ 171.986852] [<ffffffff810cd7a4>] ? update_curr+0x104/0x190
[ 171.986956] [<ffffffff815c9cda>] ip_rcv_finish+0x8a/0x350
[ 171.986956] [<ffffffff815ca666>] ip_rcv+0x2b6/0x410
[ 171.986956] [<ffffffff810c83b4>] ? task_cputime+0x44/0x80
[ 171.986956] [<ffffffff81586f22>] __netif_receive_skb_core+0x572/0x7c0
[ 171.986956] [<ffffffff810d2c51>] ? trigger_load_balance+0x61/0x1e0
[ 171.986956] [<ffffffff81587188>] __netif_receive_skb+0x18/0x60
[ 171.986956] [<ffffffff8158841e>] process_backlog+0xae/0x180
[ 171.986956] [<ffffffff8158799d>] net_rx_action+0x16d/0x380
[ 171.986956] [<ffffffff81090b7f>] __do_softirq+0xef/0x280
[ 171.986956] [<ffffffff816b6a1c>] call_softirq+0x1c/0x30

This patch is to move inet_add_protocol after register_pernet_subsys in
dccp_v4_init, so that v4_ctl_sk is initialized before any incoming dccp
packets are processed.

Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5f0d5a3a 18-Jan-2017 Paul E. McKenney <paulmck@kernel.org>

mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU

A group of Linux kernel hackers reported chasing a bug that resulted
from their assumption that SLAB_DESTROY_BY_RCU provided an existence
guarantee, that is, that no block from such a slab would be reallocated
during an RCU read-side critical section. Of course, that is not the
case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
slab of blocks.

However, there is a phrase for this, namely "type safety". This commit
therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
to avoid future instances of this sort of confusion.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
[ paulmck: Add comments mentioning the old name, as requested by Eric
Dumazet, in order to help people familiar with the old name find
the new one. ]
Acked-by: David Rientjes <rientjes@google.com>


# 45caeaa5 09-Mar-2017 Jon Maxwell <jmaxwell37@gmail.com>

dccp/tcp: fix routing redirect race

As Eric Dumazet pointed out this also needs to be fixed in IPv6.
v2: Contains the IPv6 tcp/Ipv6 dccp patches as well.

We have seen a few incidents lately where a dst_enty has been freed
with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
dst_entry. If the conditions/timings are right a crash then ensues when the
freed dst_entry is referenced later on. A Common crashing back trace is:

#8 [] page_fault at ffffffff8163e648
[exception RIP: __tcp_ack_snd_check+74]
.
.
#9 [] tcp_rcv_established at ffffffff81580b64
#10 [] tcp_v4_do_rcv at ffffffff8158b54a
#11 [] tcp_v4_rcv at ffffffff8158cd02
#12 [] ip_local_deliver_finish at ffffffff815668f4
#13 [] ip_local_deliver at ffffffff81566bd9
#14 [] ip_rcv_finish at ffffffff8156656d
#15 [] ip_rcv at ffffffff81566f06
#16 [] __netif_receive_skb_core at ffffffff8152b3a2
#17 [] __netif_receive_skb at ffffffff8152b608
#18 [] netif_receive_skb at ffffffff8152b690
#19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
#20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
#21 [] net_rx_action at ffffffff8152bac2
#22 [] __do_softirq at ffffffff81084b4f
#23 [] call_softirq at ffffffff8164845c
#24 [] do_softirq at ffffffff81016fc5
#25 [] irq_exit at ffffffff81084ee5
#26 [] do_IRQ at ffffffff81648ff8

Of course it may happen with other NIC drivers as well.

It's found the freed dst_entry here:

224 static bool tcp_in_quickack_mode(struct sock *sk)↩
225 {↩
226 ▹ const struct inet_connection_sock *icsk = inet_csk(sk);↩
227 ▹ const struct dst_entry *dst = __sk_dst_get(sk);↩
228 ↩
229 ▹ return (dst && dst_metric(dst, RTAX_QUICKACK)) ||↩
230 ▹ ▹ (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);↩
231 }↩

But there are other backtraces attributed to the same freed dst_entry in
netfilter code as well.

All the vmcores showed 2 significant clues:

- Remote hosts behind the default gateway had always been redirected to a
different gateway. A rtable/dst_entry will be added for that host. Making
more dst_entrys with lower reference counts. Making this more probable.

- All vmcores showed a postitive LockDroppedIcmps value, e.g:

LockDroppedIcmps 267

A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
regardless of whether user space has the socket locked. This can result in a
race condition where the same dst_entry cached in sk->sk_dst_entry can be
decremented twice for the same socket via:

do_redirect()->__sk_dst_check()-> dst_release().

Which leads to the dst_entry being prematurely freed with another socket
pointing to it via sk->sk_dst_cache and a subsequent crash.

To fix this skip do_redirect() if usespace has the socket locked. Instead let
the redirect take place later when user space does not have the socket
locked.

The dccp/IPv6 code is very similar in this respect, so fixing it there too.

As Eric Garver pointed out the following commit now invalidates routes. Which
can set the dst->obsolete flag so that ipv4_dst_check() returns null and
triggers the dst_release().

Fixes: ceb3320610d6 ("ipv4: Kill routes during PMTU/redirect updates.")
Cc: Eric Garver <egarver@redhat.com>
Cc: Hannes Sowa <hsowa@redhat.com>
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ec7cb62d 21-Feb-2017 Andrey Ryabinin <ryabinin.a.a@gmail.com>

net/dccp: fix use after free in tw_timer_handler()

DCCP doesn't purge timewait sockets on network namespace shutdown.
So, after net namespace destroyed we could still have an active timer
which will trigger use after free in tw_timer_handler():

BUG: KASAN: use-after-free in tw_timer_handler+0x4a/0xa0 at addr ffff88010e0d1e10
Read of size 8 by task swapper/1/0
Call Trace:
__asan_load8+0x54/0x90
tw_timer_handler+0x4a/0xa0
call_timer_fn+0x127/0x480
expire_timers+0x1db/0x2e0
run_timer_softirq+0x12f/0x2a0
__do_softirq+0x105/0x5b4
irq_exit+0xdd/0xf0
smp_apic_timer_interrupt+0x57/0x70
apic_timer_interrupt+0x90/0xa0

Object at ffff88010e0d1bc0, in cache net_namespace size: 6848
Allocated:
save_stack_trace+0x1b/0x20
kasan_kmalloc+0xee/0x180
kasan_slab_alloc+0x12/0x20
kmem_cache_alloc+0x134/0x310
copy_net_ns+0x8d/0x280
create_new_namespaces+0x23f/0x340
unshare_nsproxy_namespaces+0x75/0xf0
SyS_unshare+0x299/0x4f0
entry_SYSCALL_64_fastpath+0x18/0xad
Freed:
save_stack_trace+0x1b/0x20
kasan_slab_free+0xae/0x180
kmem_cache_free+0xb4/0x350
net_drop_ns+0x3f/0x50
cleanup_net+0x3df/0x450
process_one_work+0x419/0xbb0
worker_thread+0x92/0x850
kthread+0x192/0x1e0
ret_from_fork+0x2e/0x40

Add .exit_batch hook to dccp_v4_ops()/dccp_v6_ops() which will purge
timewait sockets on net namespace destruction and prevent above issue.

Fixes: f2bf415cfed7 ("mib: add net to NET_ADD_STATS_BH")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# aa078842 17-Jan-2017 Josef Bacik <jbacik@fb.com>

inet: drop ->bind_conflict

The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict
is how they check the rcv_saddr, so delete this call back and simply
change inet_csk_bind_conflict to call inet_rcv_saddr_equal.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 648f0c28 28-Nov-2016 Eric Dumazet <edumazet@google.com>

net/dccp: fix use-after-free in dccp_invalid_packet

pskb_may_pull() can reallocate skb->head, we need to reload dh pointer
in dccp_invalid_packet() or risk use after free.

Bug found by Andrey Konovalov using syzkaller.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6706a97f 02-Nov-2016 Eric Dumazet <edumazet@google.com>

dccp: fix out of bound access in dccp_v4_err()

dccp_v4_err() does not use pskb_may_pull() and might access garbage.

We only need 4 bytes at the beginning of the DCCP header, like TCP,
so the 8 bytes pulled in icmp_socket_deliver() are more than enough.

This patch might allow to process more ICMP messages, as some routers
are still limiting the size of reflected bytes to 28 (RFC 792), instead
of extended lengths (RFC 1812 4.3.2.3)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c3f24cfb 02-Nov-2016 Eric Dumazet <edumazet@google.com>

dccp: do not release listeners too soon

Andrey Konovalov reported following error while fuzzing with syzkaller :

IPv4: Attempt to release alive inet socket ffff880068e98940
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88006b9e0000 task.stack: ffff880068770000
RIP: 0010:[<ffffffff819ead5f>] [<ffffffff819ead5f>]
selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639
RSP: 0018:ffff8800687771c8 EFLAGS: 00010202
RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a
RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010
RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002
R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940
R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000
FS: 00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0
Stack:
ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007
ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480
ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0
Call Trace:
[<ffffffff819c8dd5>] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317
[<ffffffff82c2a9e7>] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81
[<ffffffff82b81e60>] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460
[<ffffffff838bbf12>] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873
[<ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0
net/ipv4/ip_input.c:216
[< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
[< inline >] NF_HOOK ./include/linux/netfilter.h:255
[<ffffffff8306abd2>] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257
[< inline >] dst_input ./include/net/dst.h:507
[<ffffffff83068500>] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396
[< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
[< inline >] NF_HOOK ./include/linux/netfilter.h:255
[<ffffffff8306b82f>] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487
[<ffffffff82bd9fb7>] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213
[<ffffffff82bdb19a>] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251
[<ffffffff82bdb493>] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279
[<ffffffff82bdb6b8>] netif_receive_skb+0x48/0x250 net/core/dev.c:4303
[<ffffffff8241fc75>] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308
[<ffffffff82421b5a>] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332
[< inline >] new_sync_write fs/read_write.c:499
[<ffffffff8151bd44>] __vfs_write+0x334/0x570 fs/read_write.c:512
[<ffffffff8151f85b>] vfs_write+0x17b/0x500 fs/read_write.c:560
[< inline >] SYSC_write fs/read_write.c:607
[<ffffffff81523184>] SyS_write+0xd4/0x1a0 fs/read_write.c:599
[<ffffffff83fc02c1>] entry_SYSCALL_64_fastpath+0x1f/0xc2

It turns out DCCP calls __sk_receive_skb(), and this broke when
lookups no longer took a reference on listeners.

Fix this issue by adding a @refcounted parameter to __sk_receive_skb(),
so that sock_put() is used only when needed.

Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5ea8ea2c 26-Oct-2016 Eric Dumazet <edumazet@google.com>

tcp/dccp: drop SYN packets if accept queue is full

Per listen(fd, backlog) rules, there is really no point accepting a SYN,
sending a SYNACK, and dropping the following ACK packet if accept queue
is full, because application is not draining accept queue fast enough.

This behavior is fooling TCP clients that believe they established a
flow, while there is nothing at server side. They might then send about
10 MSS (if using IW10) that will be dropped anyway while server is under
stress.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4f0c40d9 12-Jul-2016 Willem de Bruijn <willemb@google.com>

dccp: limit sk_filter trim to payload

Dccp verifies packet integrity, including length, at initial rcv in
dccp_invalid_packet, later pulls headers in dccp_enqueue_skb.

A call to sk_filter in-between can cause __skb_pull to wrap skb->len.
skb_copy_datagram_msg interprets this as a negative value, so
(correctly) fails with EFAULT. The negative length is reported in
ioctl SIOCINQ or possibly in a DCCP_WARN in dccp_close.

Introduce an sk_receive_skb variant that caps how small a filter
program can trim packets, and call this in dccp with the header
length. Excessively trimmed packets are now processed normally and
queued for reception as 0B payloads.

Fixes: 7c657876b63c ("[DCCP]: Initial implementation")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 95556a88 08-Jul-2016 Eric Dumazet <edumazet@google.com>

dccp: avoid deadlock in dccp_v4_ctl_send_reset

In the prep work I did before enabling BH while handling socket backlog,
I missed two points in DCCP :

1) dccp_v4_ctl_send_reset() uses bh_lock_sock(), assuming BH were
blocked. It is not anymore always true.

2) dccp_v4_route_skb() was using __IP_INC_STATS() instead of
IP_INC_STATS()

A similar fix was done for TCP, in commit 47dcc20a39d0
("ipv4: tcp: ip_send_unicast_reply() is not BH safe")

Fixes: 7309f8821fd6 ("dccp: do not assume DCCP code is non preemptible")
Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7309f882 29-Apr-2016 Eric Dumazet <edumazet@google.com>

dccp: do not assume DCCP code is non preemptible

DCCP uses the generic backlog code, and this will soon
be changed to not disable BH when protocol is called back.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 02a1d6e7 27-Apr-2016 Eric Dumazet <edumazet@google.com>

net: rename NET_{ADD|INC}_STATS_BH()

Rename NET_INC_STATS_BH() to __NET_INC_STATS()
and NET_ADD_STATS_BH() to __NET_ADD_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b45386ef 27-Apr-2016 Eric Dumazet <edumazet@google.com>

net: rename IP_INC_STATS_BH()

Rename IP_INC_STATS_BH() to __IP_INC_STATS(), to
better express this is used in non preemptible context.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5d3848bc 27-Apr-2016 Eric Dumazet <edumazet@google.com>

net: rename ICMP_INC_STATS_BH()

Rename ICMP_INC_STATS_BH() to __ICMP_INC_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# aa62d76b 27-Apr-2016 Eric Dumazet <edumazet@google.com>

dccp: rename DCCP_INC_STATS_BH()

Rename DCCP_INC_STATS_BH() to __DCCP_INC_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1e1d04e6 05-Apr-2016 Hannes Frederic Sowa <hannes@stressinduktion.org>

net: introduce lockdep_is_held and update various places to use it

The socket is either locked if we hold the slock spin_lock for
lock_sock_fast and unlock_sock_fast or we own the lock (sk_lock.owned
!= 0). Check for this and at the same time improve that the current
thread/cpu is really holding the lock.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3b24d854 01-Apr-2016 Eric Dumazet <edumazet@google.com>

tcp/dccp: do not touch listener sk_refcnt under synflood

When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple
cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes.

By letting listeners use SOCK_RCU_FREE infrastructure,
we can relax TCP_LISTEN lookup rules and avoid touching sk_refcnt

Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets,
only listeners are impacted by this change.

Peak performance under SYNFLOOD is increased by ~33% :

On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps

Most consuming functions are now skb_set_owner_w() and sock_wfree()
contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e316ea62 16-Mar-2016 Eric Dumazet <edumazet@google.com>

tcp/dccp: remove obsolete WARN_ON() in icmp handlers

Now SYN_RECV request sockets are installed in ehash table, an ICMP
handler can find a request socket while another cpu handles an incoming
packet transforming this SYN_RECV request socket into an ESTABLISHED
socket.

We need to remove the now obsolete WARN_ON(req->sk), since req->sk
is set when a new child is created and added into listener accept queue.

If this race happens, the ICMP will do nothing special.

Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Ben Lazarus <blazarus@google.com>
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7716682c 18-Feb-2016 Eric Dumazet <edumazet@google.com>

tcp/dccp: fix another race at listener dismantle

Ilya reported following lockdep splat:

kernel: =========================
kernel: [ BUG: held lock freed! ]
kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
kernel: -------------------------
kernel: swapper/5/0 is freeing memory
ffff880035c9d200-ffff880035c9dbff, with a lock still held there!
kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
kernel: 4 locks held by swapper/5/0:
kernel: #0: (rcu_read_lock){......}, at: [<ffffffff8169ef6b>]
netif_receive_skb_internal+0x4b/0x1f0
kernel: #1: (rcu_read_lock){......}, at: [<ffffffff816e977f>]
ip_local_deliver_finish+0x3f/0x380
kernel: #2: (slock-AF_INET){+.-...}, at: [<ffffffff81685ffb>]
sk_clone_lock+0x19b/0x440
kernel: #3: (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0

To properly fix this issue, inet_csk_reqsk_queue_add() needs
to return to its callers if the child as been queued
into accept queue.

We also need to make sure listener is still there before
calling sk->sk_data_ready(), by holding a reference on it,
since the reference carried by the child can disappear as
soon as the child is put on accept queue.

Reported-by: Ilya Dryomov <idryomov@gmail.com>
Fixes: ebb516af60e1 ("tcp/dccp: fix race at listener dismantle phase")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a583636a 10-Feb-2016 Craig Gallek <kraig@google.com>

inet: refactor inet[6]_lookup functions to take skb

This is a preliminary step to allow fast socket lookup of SO_REUSEPORT
groups. Doing so with a BPF filter will require access to the
skb in question. This change plumbs the skb (and offset to payload
data) through the call stack to the listening socket lookup
implementations where it will be used in a following patch.

Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5e0724d0 22-Oct-2015 Eric Dumazet <edumazet@google.com>

tcp/dccp: fix hashdance race for passive sessions

Multiple cpus can process duplicates of incoming ACK messages
matching a SYN_RECV request socket. This is a rare event under
normal operations, but definitely can happen.

Only one must win the race, otherwise corruption would occur.

To fix this without adding new atomic ops, we use logic in
inet_ehash_nolisten() to detect the request was present in the same
ehash bucket where we try to insert the new child.

If request socket was not found, we have to undo the child creation.

This actually removes a spin_lock()/spin_unlock() pair in
reqsk_queue_unlink() for the fast path.

Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f03f2e15 14-Oct-2015 Eric Dumazet <edumazet@google.com>

tcp/dccp: add inet_csk_reqsk_queue_drop_and_put() helper

Let's reduce the confusion about inet_csk_reqsk_queue_drop() :
In many cases we also need to release reference on request socket,
so add a helper to do this, reducing code size and complexity.

Fixes: 4bdc3d66147b ("tcp/dccp: fix behavior of stale SYN_RECV request sockets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ef84d8ce 14-Oct-2015 Eric Dumazet <edumazet@google.com>

Revert "inet: fix double request socket freeing"

This reverts commit c69736696cf3742b37d850289dc0d7ead177bb14.

At the time of above commit, tcp_req_err() and dccp_req_err()
were dead code, as SYN_RECV request sockets were not yet in ehash table.

Real bug was fixed later in a different commit.

We need to revert to not leak a refcount on request socket.

inet_csk_reqsk_queue_drop_and_put() will be added
in following commit to make clean inet_csk_reqsk_queue_drop()
does not release the reference owned by caller.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4bdc3d66 13-Oct-2015 Eric Dumazet <edumazet@google.com>

tcp/dccp: fix behavior of stale SYN_RECV request sockets

When a TCP/DCCP listener is closed, its pending SYN_RECV request sockets
become stale, meaning 3WHS can not complete.

But current behavior is wrong :
incoming packets finding such stale sockets are dropped.

We need instead to cleanup the request socket and perform another
lookup :
- Incoming ACK will give a RST answer,
- SYN rtx might find another listener if available.
- We expedite cleanup of request sockets and old listener socket.

Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a1a5344d 04-Oct-2015 Eric Dumazet <edumazet@google.com>

tcp: avoid two atomic ops for syncookies

inet_reqsk_alloc() is used to allocate a temporary request
in order to generate a SYNACK with a cookie. Then later,
syncookie validation also uses a temporary request.

These paths already took a reference on listener refcount,
we can avoid a couple of atomic operations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 079096f1 02-Oct-2015 Eric Dumazet <edumazet@google.com>

tcp/dccp: install syn_recv requests into ehash table

In this patch, we insert request sockets into TCP/DCCP
regular ehash table (where ESTABLISHED and TIMEWAIT sockets
are) instead of using the per listener hash table.

ACK packets find SYN_RECV pseudo sockets without having
to find and lock the listener.

In nominal conditions, this halves pressure on listener lock.

Note that this will allow for SO_REUSEPORT refinements,
so that we can select a listener using cpu/numa affinities instead
of the prior 'consistent hash', since only SYN packets will
apply this selection logic.

We will shrink listen_sock in the following patch to ease
code review.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ying Cai <ycai@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0c27171e 29-Sep-2015 Eric Dumazet <edumazet@google.com>

tcp/dccp: constify syn_recv_sock() method sock argument

We'll soon no longer hold listener socket lock, these
functions do not modify the socket in any way.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a00e7444 29-Sep-2015 Eric Dumazet <edumazet@google.com>

tcp/dccp: constify send_synack and send_reset socket argument

None of these functions need to change the socket, make it
const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ea3bea3a 25-Sep-2015 Eric Dumazet <edumazet@google.com>

tcp/dccp: constify rtx_synack() and friends

This is done to make sure we do not change listener socket
while sending SYNACK packets while socket lock is not held.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b357a364 23-Apr-2015 Eric Dumazet <edumazet@google.com>

inet: fix possible panic in reqsk_queue_unlink()

[ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at
0000000000000080
[ 3897.931025] IP: [<ffffffffa9f27686>] reqsk_timer_handler+0x1a6/0x243

There is a race when reqsk_timer_handler() and tcp_check_req() call
inet_csk_reqsk_queue_unlink() on the same req at the same time.

Before commit fa76ce7328b2 ("inet: get rid of central tcp/dccp listener
timer"), listener spinlock was held and race could not happen.

To solve this bug, we change reqsk_queue_unlink() to not assume req
must be found, and we return a status, to conditionally release a
refcount on the request sock.

This also means tcp_check_req() in non fastopen case might or not
consume req refcount, so tcp_v6_hnd_req() & tcp_v4_hnd_req() have
to properly handle this.

(Same remark for dccp_check_req() and its callers)

inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is
called 4 times in tcp and 3 times in dccp.

Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c6973669 23-Mar-2015 Fan Du <fan.du@intel.com>

inet: fix double request socket freeing

Eric Hugne reported following error :

I'm hitting this warning on latest net-next when i try to SSH into a machine
with eth0 added to a bridge (but i think the problem is older than that)

Steps to reproduce:
node2 ~ # brctl addif br0 eth0
[ 223.758785] device eth0 entered promiscuous mode
node2 ~ # ip link set br0 up
[ 244.503614] br0: port 1(eth0) entered forwarding state
[ 244.505108] br0: port 1(eth0) entered forwarding state
node2 ~ # [ 251.160159] ------------[ cut here ]------------
[ 251.160831] WARNING: CPU: 0 PID: 3 at include/net/request_sock.h:102 tcp_v4_err+0x6b1/0x720()
[ 251.162077] Modules linked in:
[ 251.162496] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.0.0-rc3+ #18
[ 251.163334] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 251.164078] ffffffff81a8365c ffff880038a6ba18 ffffffff8162ace4 0000000000009898
[ 251.165084] 0000000000000000 ffff880038a6ba58 ffffffff8104da85 ffff88003fa437c0
[ 251.166195] ffff88003fa437c0 ffff88003fa74e00 ffff88003fa43bb8 ffff88003fad99a0
[ 251.167203] Call Trace:
[ 251.167533] [<ffffffff8162ace4>] dump_stack+0x45/0x57
[ 251.168206] [<ffffffff8104da85>] warn_slowpath_common+0x85/0xc0
[ 251.169239] [<ffffffff8104db65>] warn_slowpath_null+0x15/0x20
[ 251.170271] [<ffffffff81559d51>] tcp_v4_err+0x6b1/0x720
[ 251.171408] [<ffffffff81630d03>] ? _raw_read_lock_irq+0x3/0x10
[ 251.172589] [<ffffffff81534e20>] ? inet_del_offload+0x40/0x40
[ 251.173366] [<ffffffff81569295>] icmp_socket_deliver+0x65/0xb0
[ 251.174134] [<ffffffff815693a2>] icmp_unreach+0xc2/0x280
[ 251.174820] [<ffffffff8156a82d>] icmp_rcv+0x2bd/0x3a0
[ 251.175473] [<ffffffff81534ea2>] ip_local_deliver_finish+0x82/0x1e0
[ 251.176282] [<ffffffff815354d8>] ip_local_deliver+0x88/0x90
[ 251.177004] [<ffffffff815350f0>] ip_rcv_finish+0xf0/0x310
[ 251.177693] [<ffffffff815357bc>] ip_rcv+0x2dc/0x390
[ 251.178336] [<ffffffff814f5da3>] __netif_receive_skb_core+0x713/0xa20
[ 251.179170] [<ffffffff814f7fca>] __netif_receive_skb+0x1a/0x80
[ 251.179922] [<ffffffff814f97d4>] process_backlog+0x94/0x120
[ 251.180639] [<ffffffff814f9612>] net_rx_action+0x1e2/0x310
[ 251.181356] [<ffffffff81051267>] __do_softirq+0xa7/0x290
[ 251.182046] [<ffffffff81051469>] run_ksoftirqd+0x19/0x30
[ 251.182726] [<ffffffff8106cc23>] smpboot_thread_fn+0x153/0x1d0
[ 251.183485] [<ffffffff8106cad0>] ? SyS_setgroups+0x130/0x130
[ 251.184228] [<ffffffff8106935e>] kthread+0xee/0x110
[ 251.184871] [<ffffffff81069270>] ? kthread_create_on_node+0x1b0/0x1b0
[ 251.185690] [<ffffffff81631108>] ret_from_fork+0x58/0x90
[ 251.186385] [<ffffffff81069270>] ? kthread_create_on_node+0x1b0/0x1b0
[ 251.187216] ---[ end trace c947fc7b24e42ea1 ]---
[ 259.542268] br0: port 1(eth0) entered forwarding state

Remove the double calls to reqsk_put()

[edumazet] :

I got confused because reqsk_timer_handler() _has_ to call
reqsk_put(req) after calling inet_csk_reqsk_queue_drop(), as
the timer handler holds a reference on req.

Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Erik Hugne <erik.hugne@ericsson.com>
Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: David S. Miller <davem@davemloft.net>


# 85645bab 22-Mar-2015 Eric Dumazet <edumazet@google.com>

ipv4: dccp: handle ICMP messages on DCCP_NEW_SYN_RECV request sockets

dccp_v4_err() can restrict lookups to ehash table, and not to listeners.

Note this patch creates the infrastructure, but this means that ICMP
messages for request sockets are ignored until complete conversion.

New dccp_req_err() helper is exported so that we can use it in IPv6
in following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 42cb80a2 22-Mar-2015 Eric Dumazet <edumazet@google.com>

inet: remove sk_listener parameter from syn_ack_timeout()

It is not needed, and req->sk_listener points to the listener anyway.
request_sock argument can be const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fa76ce73 19-Mar-2015 Eric Dumazet <edumazet@google.com>

inet: get rid of central tcp/dccp listener timer

One of the major issue for TCP is the SYNACK rtx handling,
done by inet_csk_reqsk_queue_prune(), fired by the keepalive
timer of a TCP_LISTEN socket.

This function runs for awful long times, with socket lock held,
meaning that other cpus needing this lock have to spin for hundred of ms.

SYNACK are sent in huge bursts, likely to cause severe drops anyway.

This model was OK 15 years ago when memory was very tight.

We now can afford to have a timer per request sock.

Timer invocations no longer need to lock the listener,
and can be run from all cpus in parallel.

With following patch increasing somaxconn width to 32 bits,
I tested a listener with more than 4 million active request sockets,
and a steady SYNFLOOD of ~200,000 SYN per second.
Host was sending ~830,000 SYNACK per second.

This is ~100 times more what we could achieve before this patch.

Later, we will get rid of the listener hash and use ehash instead.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 52452c54 19-Mar-2015 Eric Dumazet <edumazet@google.com>

inet: drop prev pointer handling in request sock

When request sock are put in ehash table, the whole notion
of having a previous request to update dl_next is pointless.

Also, following patch will get rid of big purge timer,
so we want to delete a request sock without holding listener lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 08d2cc3b 18-Mar-2015 Eric Dumazet <edumazet@google.com>

inet: request sock should init IPv6/IPv4 addresses

In order to be able to use sk_ehashfn() for request socks,
we need to initialize their IPv6/IPv4 addresses.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d1e559d0 18-Mar-2015 Eric Dumazet <edumazet@google.com>

inet: add IPv6 support to sk_ehashfn()

Intent is to converge IPv4 & IPv6 inet_hash functions to
factorize code.

IPv4 sockets initialize sk_rcv_saddr and sk_v6_daddr
in this patch, thanks to new sk_daddr_set() and sk_rcv_saddr_set()
helpers.

__inet6_hash can now use sk_ehashfn() instead of a private
inet6_sk_ehashfn() and will simply use __inet_hash() in a
following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 407640de 17-Mar-2015 Eric Dumazet <edumazet@google.com>

inet: add sk_listener argument to inet_reqsk_alloc()

listener socket can be used to set net pointer, and will
be later used to hold a reference on listener.

Add a const qualifier to first argument (struct request_sock_ops *),
and factorize all write_pnet(&ireq->ireq_net, sock_net(sk));

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 16f86165 13-Mar-2015 Eric Dumazet <edumazet@google.com>

inet: fill request sock ir_iif for IPv4

Once request socks will be in ehash table, they will need to have
a valid ir_iff field.

This is currently true only for IPv6. This patch extends support
for IPv4 as well.

This means inet_diag_fill_req() can now properly use ir_iif,
which is better for IPv6 link locals anyway, as request sockets
and established sockets will propagate consistent netlink idiag_if.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3f66b083 12-Mar-2015 Eric Dumazet <edumazet@google.com>

inet: introduce ireq_family

Before inserting request socks into general hash table,
fill their socket family.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bd337c58 12-Mar-2015 Eric Dumazet <edumazet@google.com>

ipv6: add missing ireq_net & ir_cookie initializations

I forgot to update dccp_v6_conn_request() & cookie_v6_check().
They both need to set ireq->ireq_net and ireq->ir_cookie

Lets clear ireq->ir_cookie in inet_reqsk_alloc()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 33cf7c90fe2f ("net: add real socket cookies")
Signed-off-by: David S. Miller <davem@davemloft.net>


# d77c555d 11-Mar-2015 Eric Dumazet <edumazet@google.com>

net: fix CONFIG_NET_NS=n compilation

I forgot to use write_pnet() in three locations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 33cf7c90fe2f9 ("net: add real socket cookies")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 33cf7c90 11-Mar-2015 Eric Dumazet <edumazet@google.com>

net: add real socket cookies

A long standing problem in netlink socket dumps is the use
of kernel socket addresses as cookies.

1) It is a security concern.

2) Sockets can be reused quite quickly, so there is
no guarantee a cookie is used once and identify
a flow.

3) request sock, establish sock, and timewait socks
for a given flow have different cookies.

Part of our effort to bring better TCP statistics requires
to switch to a different allocator.

In this patch, I chose to use a per network namespace 64bit generator,
and to use it only in the case a socket needs to be dumped to netlink.
(This might be refined later if needed)

Note that I tried to carry cookies from request sock, to establish sock,
then timewait sockets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Eric Salo <salo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 54da7996 17-Nov-2014 Fabian Frederick <fabf@skynet.be>

dccp: remove blank lines between function/EXPORT_SYMBOL

See Documentation/CodingStyle chapter 6.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b26ba202 23-May-2014 Tom Herbert <therbert@google.com>

net: Eliminate no_check from protosw

It doesn't seem like an protocols are setting anything other
than the default, and allowing to arbitrarily disable checksums
for a whole protocol seems dangerous. This can be done on a per
socket basis.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8ed1dc44 09-Jan-2014 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv4: introduce hardened ip_no_pmtu_disc mode

This new ip_no_pmtu_disc mode only allowes fragmentation-needed errors
to be honored by protocols which do more stringent validation on the
ICMP's packet payload. This knob is useful for people who e.g. want to
run an unmodified DNS server in a namespace where they need to use pmtu
for TCP connections (as they are used for zone transfers or fallback
for requests) but don't want to use possibly spoofed UDP pmtu information.

Currently the whitelisted protocols are TCP, SCTP and DCCP as they check
if the returned packet is in the window or if the association is valid.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: John Heffner <johnwheffner@gmail.com>
Suggested-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0e0d44ab 28-Aug-2013 Steffen Klassert <steffen.klassert@secunet.com>

net: Remove FLOWI_FLAG_CAN_SLEEP

FLOWI_FLAG_CAN_SLEEP was used to notify xfrm about the posibility
to sleep until the needed states are resolved. This code is gone,
so FLOWI_FLAG_CAN_SLEEP is not needed anymore.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>


# 482fc609 04-Nov-2013 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE

Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery,
their sockets won't accept and install new path mtu information and they
will always use the interface mtu for outgoing packets. It is guaranteed
that the packet is not fragmented locally. But we won't set the DF-Flag
on the outgoing frames.

Florian Weimer had the idea to use this flag to ensure DNS servers are
never generating outgoing fragments. They may well be fragmented on the
path, but the server never stores or usees path mtu values, which could
well be forged in an attack.

(The root of the problem with path MTU discovery is that there is
no reliable way to authenticate ICMP Fragmentation Needed But DF Set
messages because they are sent from intermediate routers with their
source addresses, and the IMCP payload will not always contain sufficient
information to identify a flow.)

Recent research in the DNS community showed that it is possible to
implement an attack where DNS cache poisoning is feasible by spoofing
fragments. This work was done by Amir Herzberg and Haya Shulman:
<https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf>

This issue was previously discussed among the DNS community, e.g.
<http://www.ietf.org/mail-archive/web/dnsext/current/msg01204.html>,
without leading to fixes.

This patch depends on the patch "ipv4: fix DO and PROBE pmtu mode
regarding local fragmentation with UFO/CORK" for the enforcement of the
non-fragmentable checks. If other users than ip_append_page/data should
use this semantic too, we have to add a new flag to IPCB(skb)->flags to
suppress local fragmentation and check for this in ip_finish_output.

Many thanks to Florian Weimer for the idea and feedback while implementing
this patch.

Cc: David S. Miller <davem@davemloft.net>
Suggested-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 634fb979 09-Oct-2013 Eric Dumazet <edumazet@google.com>

inet: includes a sock_common in request_sock

TCP listener refactoring, part 5 :

We want to be able to insert request sockets (SYN_RECV) into main
ehash table instead of the per listener hash table to allow RCU
lookups and remove listener lock contention.

This patch includes the needed struct sock_common in front
of struct request_sock

This means there is no more inet6_request_sock IPv6 specific
structure.

Following inet_request_sock fields were renamed as they became
macros to reference fields from struct sock_common.
Prefix ir_ was chosen to avoid name collisions.

loc_port -> ir_loc_port
loc_addr -> ir_loc_addr
rmt_addr -> ir_rmt_addr
rmt_port -> ir_rmt_port
iif -> ir_iif

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1a2c6181 17-Mar-2013 Christoph Paasch <christoph.paasch@uclouvain.be>

tcp: Remove TCPCT

TCPCT uses option-number 253, reserved for experimental use and should
not be used in production environments.
Further, TCPCT does not fully implement RFC 6013.

As a nice side-effect, removing TCPCT increases TCP's performance for
very short flows:

Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests
for files of 1KB size.

before this patch:
average (among 7 runs) of 20845.5 Requests/Second
after:
average (among 7 runs) of 21403.6 Requests/Second

Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e337e24d 13-Dec-2012 Christoph Paasch <christoph.paasch@uclouvain.be>

inet: Fix kmemleak in tcp_v4/6_syn_recv_sock and dccp_v4/6_request_recv_sock

If in either of the above functions inet_csk_route_child_sock() or
__inet_inherit_port() fails, the newsk will not be freed:

unreferenced object 0xffff88022e8a92c0 (size 1592):
comm "softirq", pid 0, jiffies 4294946244 (age 726.160s)
hex dump (first 32 bytes):
0a 01 01 01 0a 01 01 02 00 00 00 00 a7 cc 16 00 ................
02 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff8153d190>] kmemleak_alloc+0x21/0x3e
[<ffffffff810ab3e7>] kmem_cache_alloc+0xb5/0xc5
[<ffffffff8149b65b>] sk_prot_alloc.isra.53+0x2b/0xcd
[<ffffffff8149b784>] sk_clone_lock+0x16/0x21e
[<ffffffff814d711a>] inet_csk_clone_lock+0x10/0x7b
[<ffffffff814ebbc3>] tcp_create_openreq_child+0x21/0x481
[<ffffffff814e8fa5>] tcp_v4_syn_recv_sock+0x3a/0x23b
[<ffffffff814ec5ba>] tcp_check_req+0x29f/0x416
[<ffffffff814e8e10>] tcp_v4_do_rcv+0x161/0x2bc
[<ffffffff814eb917>] tcp_v4_rcv+0x6c9/0x701
[<ffffffff814cea9f>] ip_local_deliver_finish+0x70/0xc4
[<ffffffff814cec20>] ip_local_deliver+0x4e/0x7f
[<ffffffff814ce9f8>] ip_rcv_finish+0x1fc/0x233
[<ffffffff814cee68>] ip_rcv+0x217/0x267
[<ffffffff814a7bbe>] __netif_receive_skb+0x49e/0x553
[<ffffffff814a7cc3>] netif_receive_skb+0x50/0x82

This happens, because sk_clone_lock initializes sk_refcnt to 2, and thus
a single sock_put() is not enough to free the memory. Additionally, things
like xfrm, memcg, cookie_values,... may have been initialized.
We have to free them properly.

This is fixed by forcing a call to tcp_done(), ending up in
inet_csk_destroy_sock, doing the final sock_put(). tcp_done() is necessary,
because it ends up doing all the cleanup on xfrm, memcg, cookie_values,
xfrm,...

Before calling tcp_done, we have to set the socket to SOCK_DEAD, to
force it entering inet_csk_destroy_sock. To avoid the warning in
inet_csk_destroy_sock, inet_num has to be set to 0.
As inet_csk_destroy_sock does a dec on orphan_count, we first have to
increase it.

Calling tcp_done() allows us to remove the calls to
tcp_clear_xmit_timer() and tcp_cleanup_congestion_control().

A similar approach is taken for dccp by calling dccp_done().

This is in the kernel since 093d282321 (tproxy: fix hash locking issue
when using port redirection in __inet_inherit_port()), thus since
version >= 2.6.37.

Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 92101b3b 23-Jul-2012 David S. Miller <davem@davemloft.net>

ipv4: Prepare for change of rt->rt_iif encoding.

Use inet_iif() consistently, and for TCP record the input interface of
cached RX dst in inet sock.

rt->rt_iif is going to be encoded differently, so that we can
legitimately cache input routes in the FIB info more aggressively.

When the input interface is "use SKB device index" the rt->rt_iif will
be set to zero.

This forces us to move the TCP RX dst cache installation into the ipv4
specific code, and as well it should since doing the route caching for
ipv6 is pointless at the moment since it is not inspected in the ipv6
input paths yet.

Also, remove the unlikely on dst->obsolete, all ipv4 dsts have
obsolete set to a non-zero value to force invocation of the check
callback.

Signed-off-by: David S. Miller <davem@davemloft.net>


# ba3f7f04 17-Jul-2012 David S. Miller <davem@davemloft.net>

ipv4: Kill FLOWI_FLAG_RT_NOCACHE and associated code.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 6700c270 17-Jul-2012 David S. Miller <davem@davemloft.net>

net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}()

This will be used so that we can compose a full flow key.

Even though we have a route in this context, we need more. In the
future the routes will be without destination address, source address,
etc. keying. One ipv4 route will cover entire subnets, etc.

In this environment we have to have a way to possess persistent storage
for redirects and PMTU information. This persistent storage will exist
in the FIB tables, and that's why we'll need to be able to rebuild a
full lookup flow key here. Using that flow key will do a fib_lookup()
and create/update the persistent entry.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 80d0a69f 16-Jul-2012 David S. Miller <davem@davemloft.net>

ipv4: Add helper inet_csk_update_pmtu().

This abstracts away the call to dst_ops->update_pmtu() so that we can
transparently handle the fact that, in the future, the dst itself can
be invalidated by the PMTU update (when we have non-host routes cached
in sockets).

So we try to rebuild the socket cached route after the method
invocation if necessary.

This isn't used by SCTP because it needs to cache dsts per-transport,
and thus will need it's own local version of this helper.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 1ed5c48f 12-Jul-2012 David S. Miller <davem@davemloft.net>

net: Remove checks for dst_ops->redirect being NULL.

No longer necessary.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 55be7a9c 11-Jul-2012 David S. Miller <davem@davemloft.net>

ipv4: Add redirect support to all protocol icmp error handlers.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 7586eceb 19-Jun-2012 Eric Dumazet <edumazet@google.com>

ipv4: tcp: dont cache output dst for syncookies

Don't cache output dst for syncookies, as this adds pressure on IP route
cache and rcu subsystem for no gain.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c72e1183 12-Apr-2012 Eric Dumazet <edumazet@google.com>

inet: makes syn_ack_timeout mandatory

There are two struct request_sock_ops providers, tcp and dccp.

inet_csk_reqsk_queue_prune() can avoid testing syn_ack_timeout being
NULL if we make it non NULL like syn_ack_timeout

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Cc: dccp@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>


# f541fb7e 26-Feb-2012 Samuel Jero <sj323707@ohio.edu>

dccp: fix bug in sequence number validation during connection setup

This fixes a bug in the sequence number validation during the initial handshake.

The code did not treat the initial sequence numbers ISS and ISR as read-only and
did not keep state for GSR and GSS as required by the specification. This causes
problems with retransmissions during the initial handshake, causing the
budding connection to be reset.

This patch now treats ISS/ISR as read-only and tracks GSS/GSR as required.

Signed-off-by: Samuel Jero <sj323707@ohio.edu>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>


# 898f7358 01-Dec-2011 David S. Miller <davem@davemloft.net>

dccp: Evaluate ip_hdr() only once in dccp_v4_route_skb().

This also works around a bogus gcc warning generated by an
upcoming patch from Eric Dumazet that rearranges the layout
of struct flowi4.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 525c6465 21-Nov-2011 RongQing.Li <roy.qing.li@gmail.com>

dccp: fix error propagation in dccp_v4_connect

The errcode is not updated when ip_route_newports() fails.

Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 918eb399 01-Nov-2011 Eric Dumazet <eric.dumazet@gmail.com>

net: add missing bh_unlock_sock() calls

Simon Kirby reported lockdep warnings and following messages :

[104661.897577] huh, entered softirq 3 NET_RX ffffffff81613740
preempt_count 00000101, exited with 00000102?

[104661.923653] huh, entered softirq 3 NET_RX ffffffff81613740
preempt_count 00000101, exited with 00000102?

Problem comes from commit 0e734419
(ipv4: Use inet_csk_route_child_sock() in DCCP and TCP.)

If inet_csk_route_child_sock() returns NULL, we should release socket
lock before freeing it.

Another lock imbalance exists if __inet_inherit_port() returns an error
since commit 093d282321da ( tproxy: fix hash locking issue when using
port redirection in __inet_inherit_port()) a backport is also needed for
>= 2.6.37 kernels.

Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Balazs Scheidler <bazsi@balabit.hu>
CC: KOVACS Krisztian <hidden@balabit.hu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6e5714ea 03-Aug-2011 David S. Miller <davem@davemloft.net>

net: Compute protocol sequence numbers and fragment IDs using MD5.

Computers have become a lot faster since we compromised on the
partial MD4 hash which we use currently for performance reasons.

MD5 is a much safer choice, and is inline with both RFC1948 and
other ISS generators (OpenBSD, Solaris, etc.)

Furthermore, only having 24-bits of the sequence number be truly
unpredictable is a very serious limitation. So the periodic
regeneration and 8-bit counter have been removed. We compute and
use a full 32-bit sequence number.

For ipv6, DCCP was found to use a 32-bit truncated initial sequence
number (it needs 43-bits) and that is fixed here as well.

Reported-by: Dan Kaminsky <dan@doxpara.com>
Tested-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6bd023f3 18-May-2011 David S. Miller <davem@davemloft.net>

ipv4: Make caller provide flowi4 key to inet_csk_route_req().

This way the caller can get at the fully resolved fl4->{daddr,saddr}
etc.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 0e734419 08-May-2011 David S. Miller <davem@davemloft.net>

ipv4: Use inet_csk_route_child_sock() in DCCP and TCP.

Operation order is now transposed, we first create the child
socket then we try to hook up the route.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 2c42758c 06-May-2011 David S. Miller <davem@davemloft.net>

dccp: Use cork flow in dccp_v4_connect()

Since this is invoked from inet_stream_connect() the socket is locked
and therefore this usage is safe.

Signed-off-by: David S. Miller <davem@davemloft.net>


# f1390160 03-May-2011 David S. Miller <davem@davemloft.net>

dccp: Use flowi4->saddr in dccp_v4_connect()

Signed-off-by: David S. Miller <davem@davemloft.net>


# 91ab0b60 29-Apr-2011 David S. Miller <davem@davemloft.net>

ipv4: Get route daddr from flow key in dccp_v4_connect().

Now that output route lookups update the flow with
destination address selection, we can fetch it from
fl4->daddr instead of rt->rt_dst

Signed-off-by: David S. Miller <davem@davemloft.net>


# f6d8bd05 21-Apr-2011 Eric Dumazet <eric.dumazet@gmail.com>

inet: add RCU protection to inet->opt

We lack proper synchronization to manipulate inet->opt ip_options

Problem is ip_make_skb() calls ip_setup_cork() and
ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
without any protection against another thread manipulating inet->opt.

Another thread can change inet->opt pointer and free old one under us.

Use RCU to protect inet->opt (changed to inet->inet_opt).

Instead of handling atomic refcounts, just copy ip_options when
necessary, to avoid cache line dirtying.

We cant insert an rcu_head in struct ip_options since its included in
skb->cb[], so this patch is large because I had to introduce a new
ip_options_rcu structure.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2d7192d6 26-Apr-2011 David S. Miller <davem@davemloft.net>

ipv4: Sanitize and simplify ip_route_{connect,newports}()

These functions are used together as a unit for route resolution
during connect(). They address the chicken-and-egg problem that
exists when ports need to be allocated during connect() processing,
yet such port allocations require addressing information from the
routing code.

It's currently more heavy handed than it needs to be, and in
particular we allocate and initialize a flow object twice.

Let the callers provide the on-stack flow object. That way we only
need to initialize it once in the ip_route_connect() call.

Later, if ip_route_newports() needs to do anything, it re-uses that
flow object as-is except for the ports which it updates before the
route re-lookup.

Also, describe why this set of facilities are needed and how it works
in a big comment.

Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>


# 9cce96df 12-Mar-2011 David S. Miller <davem@davemloft.net>

net: Put fl4_* macros to struct flowi4 and use them again.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 9d6ec938 11-Mar-2011 David S. Miller <davem@davemloft.net>

ipv4: Use flowi4 in public route lookup interfaces.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 6281dcc9 11-Mar-2011 David S. Miller <davem@davemloft.net>

net: Make flowi ports AF dependent.

Create two sets of port member accessors, one set prefixed by fl4_*
and the other prefixed by fl6_*

This will let us to create AF optimal flow instances.

It will work because every context in which we access the ports,
we have to be fully aware of which AF the flowi is anyways.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 1d28f42c 11-Mar-2011 David S. Miller <davem@davemloft.net>

net: Put flowi_* prefix on AF independent members of struct flowi

I intend to turn struct flowi into a union of AF specific flowi
structs. There will be a common structure that each variant includes
first, much like struct sock_common.

This is the first step to move in that direction.

Signed-off-by: David S. Miller <davem@davemloft.net>


# b23dd4fe 02-Mar-2011 David S. Miller <davem@davemloft.net>

ipv4: Make output route lookup return rtable directly.

Instead of on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 273447b3 01-Mar-2011 David S. Miller <davem@davemloft.net>

ipv4: Kill can_sleep arg to ip_route_output_flow()

This boolean state is now available in the flow flags.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 420d44da 01-Mar-2011 David S. Miller <davem@davemloft.net>

ipv4: Make final arg to ip_route_output_flow to be boolean "can_sleep"

Since that is what the current vague "flags" argument means.

Signed-off-by: David S. Miller <davem@davemloft.net>


# abdf7e72 01-Mar-2011 David S. Miller <davem@davemloft.net>

ipv4: Can final ip_route_connect() arg to boolean "can_sleep".

Since that's what the current vague "flags" thing means.

Signed-off-by: David S. Miller <davem@davemloft.net>


# dca8b089 24-Feb-2011 David S. Miller <davem@davemloft.net>

ipv4: Rearrange how ip_route_newports() gets port keys.

ip_route_newports() is the only place in the entire kernel that
cares about the port members in the routing cache entry's lookup
flow key.

Therefore the only reason we store an entire flow inside of the
struct rtentry is for this one special case.

Rewrite ip_route_newports() such that:

1) The caller passes in the original port values, so we don't need
to use the rth->fl.fl_ip_{s,d}port values to remember them.

2) The lookup flow is constructed by hand instead of being copied
from the routing cache entry's flow.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 5811662b 12-Nov-2010 Changli Gao <xiaosuo@gmail.com>

net: use the macros defined for the members of flowi

Use the macros defined for the members of flowi to clean the code up.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 093d2823 21-Oct-2010 Balazs Scheidler <bazsi@balabit.hu>

tproxy: fix hash locking issue when using port redirection in __inet_inherit_port()

When __inet_inherit_port() is called on a tproxy connection the wrong locks are
held for the inet_bind_bucket it is added to. __inet_inherit_port() made an
implicit assumption that the listener's port number (and thus its bind bucket).
Unfortunately, if you're using the TPROXY target to redirect skbs to a
transparent proxy that assumption is not true anymore and things break.

This patch adds code to __inet_inherit_port() so that it can handle this case
by looking up or creating a new bind bucket for the child socket and updates
callers of __inet_inherit_port() to gracefully handle __inet_inherit_port()
failing.

Reported by and original patch from Stephen Buck <stephen.buck@exinda.com>.
See http://marc.info/?t=128169268200001&r=1&w=2 for the original discussion.

Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# d8d1f30b 11-Jun-2010 Changli Gao <xiaosuo@gmail.com>

net-next: remove useless union keyword

remove useless union keyword in rtable, rt6_info and dn_route.

Since there is only one member in a union, the union keyword isn't useful.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bb296246 10-Apr-2010 Herbert Xu <herbert@gondor.apana.org.au>

inet: Remove unused send_check length argument

inet: Remove unused send_check length argument

This patch removes the unused length argument from the send_check
function in struct inet_connection_sock_af_ops.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Yinghai <yinghai.lu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# d14a0ebd 14-Mar-2010 Gerrit Renker <gerrit@erg.abdn.ac.uk>

net-2.6 [Bug-Fix][dccp]: fix oops caused after failed initialisation

dccp: fix panic caused by failed initialisation

This fixes a kernel panic reported thanks to Andre Noll:

if DCCP is compiled into the kernel and any out of the initialisation
steps in net/dccp/proto.c:dccp_init() fail, a subsequent attempt to create
a SOCK_DCCP socket will panic, since inet{,6}_create() are not prevented
from creating DCCP sockets.

This patch fixes the problem by propagating a failure in dccp_init() to
dccp_v{4,6}_init_net(), and from there to dccp_v{4,6}_init(), so that the
DCCP protocol is not made available if its initialisation fails.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2c8c1e72 16-Jan-2010 Alexey Dobriyan <adobriyan@gmail.com>

net: spread __net_init, __net_exit

__net_init/__net_exit are apparently not going away, so use them
to full extent.

In some cases __net_init was removed, because it was called from
__net_exit code.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9327f705 03-Dec-2009 Eric Dumazet <eric.dumazet@gmail.com>

tcp: Fix a connect() race with timewait sockets

First patch changes __inet_hash_nolisten() and __inet6_hash()
to get a timewait parameter to be able to unhash it from ehash
at same time the new socket is inserted in hash.

This makes sure timewait socket wont be found by a concurrent
writer in __inet_check_established()

Reported-by: kapil dakhane <kdakhane@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e6b4d113 02-Dec-2009 William Allen Simpson <william.allen.simpson@gmail.com>

TCPCT part 1a: add request_values parameter for sending SYNACK

Add optional function parameters associated with sending SYNACK.
These parameters are not needed after sending SYNACK, and are not
used for retransmission. Avoids extending struct tcp_request_sock,
and avoids allocating kernel memory.

Also affects DCCP as it uses common struct request_sock_ops,
but this parameter is currently reserved for future use.

Signed-off-by: William.Allen.Simpson@gmail.com
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 13f18aa0 05-Nov-2009 Eric Paris <eparis@redhat.com>

net: drop capability from protocol definitions

struct can_proto had a capability field which wasn't ever used. It is
dropped entirely.

struct inet_protosw had a capability field which can be more clearly
expressed in the code by just checking if sock->type = SOCK_RAW.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c720c7e8 15-Oct-2009 Eric Dumazet <eric.dumazet@gmail.com>

inet: rename some inet_sock fields

In order to have better cache layouts of struct sock (separate zones
for rx/tx paths), we need this preliminary patch.

Goal is to transfert fields used at lookup time in the first
read-mostly cache line (inside struct sock_common) and move sk_refcnt
to a separate cache line (only written by rx path)

This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
sport and id fields. This allows a future patch to define these
fields as macros, like sk_refcnt, without name clashes.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 32613090 13-Sep-2009 Alexey Dobriyan <adobriyan@gmail.com>

net: constify struct net_protocol

Remove long removed "inet_protocol_base" declaration.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3b401a81 01-Sep-2009 Stephen Hemminger <shemminger@vyatta.com>

inet: inet_connection_sock_af_ops const

The function block inet_connect_sock_af_ops contains no data
make it constant.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# adf30907 01-Jun-2009 Eric Dumazet <eric.dumazet@gmail.com>

net: skb->dst accessors

Define three accessors to get/set dst attached to a skb

struct dst_entry *skb_dst(const struct sk_buff *skb)

void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

void skb_dst_drop(struct sk_buff *skb)
This one should replace occurrences of :
dst_release(skb->dst)
skb->dst = NULL;

Delete skb->dst field

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 511c3f92 01-Jun-2009 Eric Dumazet <eric.dumazet@gmail.com>

net: skb->rtable accessor

Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb

Delete skb->rtable field

Setting rtable is not allowed, just set dst instead as rtable is an alias.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3ab5aee7 16-Nov-2008 Eric Dumazet <dada1@cosmosbay.com>

net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls

RCU was added to UDP lookups, using a fast infrastructure :
- sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the
price of call_rcu() at freeing time.
- hlist_nulls permits to use few memory barriers.

This patch uses same infrastructure for TCP/DCCP established
and timewait sockets.

Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications
using short lived TCP connections. A followup patch, converting
rwlocks to spinlocks will even speedup this case.

__inet_lookup_established() is pretty fast now we dont have to
dirty a contended cache line (read_lock/read_unlock)

Only established and timewait hashtable are converted to RCU
(bind table and listen table are still using traditional locking)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d99a7bd2 05-Nov-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp: Cleanup routines for feature negotiation

This inserts the required de-allocation routines for memory allocated
by feature negotiation in the socket destructors, replacing
dccp_feat_clean() in one instance.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ac75773c 05-Nov-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp: Per-socket initialisation of feature negotiation

This provides feature-negotiation initialisation for both DCCP sockets
and DCCP request_sockets, to support feature negotiation during
connection setup.

It also resolves a FIXME regarding the congestion control
initialisation.

Thanks to Wei Yongjun for help with the IPv6 side of this patch.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 21454aaa 31-Oct-2008 Harvey Harrison <harvey.harrison@gmail.com>

net: replace NIPQUAD() in net/*/

Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u
can be replaced with %pI4

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9a1f27c4 07-Oct-2008 Arnaldo Carvalho de Melo <acme@redhat.com>

inet_hashtables: Add inet_lookup_skb helpers

To be able to use the cached socket reference in the skb during input
processing we add a new set of lookup functions that receive the skb on
their argument list.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 410e27a4 09-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

This reverts "Merge branch 'dccp' of git://eden-feed.erg.abdn.ac.uk/dccp_exp"
as it accentally contained the wrong set of patches. These will be
submitted separately.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>


# 70208383 03-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp: Cleanup routines for feature negotiation

This inserts the required de-allocation routines for memory allocated by
feature negotiation in the socket destructors, replacing dccp_feat_clean()
in one instance.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>


# 828755ce 03-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp: Per-socket initialisation of feature negotiation

This provides feature-negotiation initialisation for both DCCP sockets and
DCCP request_sockets, to support feature negotiation during connection setup.

It also resolves a FIXME regarding the congestion control initialisation.

Thanks to Wei Yongjun for help with the IPv6 side of this patch.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>


# 18e1d836 26-Jul-2008 Wei Yongjun <yjwei@cn.fujitsu.com>

dccp: Fix incorrect length check for ICMPv4 packets

Unlike TCP, which only needs 8 octets of original packet data, DCCP requires
minimally 12 or 16 bytes for ICMP-payload sequence number checks.

This patch replaces the insufficient length constant of 8 with a two-stage
test, making sure that 12 bytes are available, before computing the basic
header length required for sequence number checks.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>


# d68f0866 26-Jul-2008 Wei Yongjun <yjwei@cn.fujitsu.com>

dccp: Fix sequence number check for ICMPv4 packets

The payload of ICMP message is a part of the packet sent by ourself,
so the sequence number check must use AWL and AWH, not SWL and SWH.

For example:
Endpoint A Endpoint B

DATA-ACK -------->
(SEQ=X)
<-------- ICMP (Fragmentation Needed)
(SEQ=X)

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Acked-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>


# 547b792c 25-Jul-2008 Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

net: convert BUG_TRAP to generic WARN_ON

Removes legacy reinvent-the-wheel type thing. The generic
machinery integrates much better to automated debugging aids
such as kerneloops.org (and others), and is unambiguous due to
better naming. Non-intuively BUG_TRAP() is actually equal to
WARN_ON() rather than BUG_ON() though some might actually be
promoted to BUG_ON() but I left that to future.

I could make at least one BUILD_BUG_ON conversion.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>


# de0744af 16-Jul-2008 Pavel Emelyanov <xemul@openvz.org>

mib: add net to NET_INC_STATS_BH

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7c73a6fa 16-Jul-2008 Pavel Emelyanov <xemul@openvz.org>

mib: add net to IP_INC_STATS_BH

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# dcfc23ca 15-Jul-2008 Pavel Emelyanov <xemul@openvz.org>

mib: add struct net to ICMP_INC_STATS_BH

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fd54d716 15-Jul-2008 Pavel Emelyanov <xemul@openvz.org>

inet: toss struct net initialization around

Some places, that deal with ICMP statistics already have where
to get a struct net from, but use it directly, without declaring
a separate variable on the stack.

Since I will need this net soon, I declare a struct net on the
stack and use it in the existing places in a separate patch not
to spoil the future ones.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ce4a7d0d 10-Jun-2008 Arnaldo Carvalho de Melo <acme@redhat.com>

inet{6}_request_sock: Init ->opt and ->pktopts in the constructor

Wei Yongjun noticed that we may call reqsk_free on request sock objects where
the opt fields may not be initialized, fix it by introducing inet_reqsk_alloc
where we initialize ->opt to NULL and set ->pktopts to NULL in
inet6_reqsk_alloc.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6079a463 27-May-2008 Wei Yongjun <yjwei@cn.fujitsu.com>

dccp: Fix to handle short sequence numbers packet correctly

RFC4340 said:
8.5. Pseudocode
...
If P.type is not Data, Ack, or DataAck and P.X == 0 (the packet
has short sequence numbers), drop packet and return

But DCCP has some mistake to handle short sequence numbers packet, now
it drop packet only if P.type is Data, Ack, or DataAck and P.X == 0.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Acked-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 13f51d82 14-Apr-2008 Pavel Emelyanov <xemul@openvz.org>

[DCCP]: Fix comment about control sockets.

These sockets now have a bit other names and are no longer global.

Shame on me, I haven't provided a good comment for this when
sending DCCP netnsization patches.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fc5f8580 13-Apr-2008 Pavel Emelyanov <xemul@openvz.org>

[NETNS][DCCPV4]: Enable DCCPv4 in net namespaces.

This enables sockets creation with IPPROTO_DCCP and enables
the ip level to pass DCCP packets to the DCCP level.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b9901a84 13-Apr-2008 Pavel Emelyanov <xemul@openvz.org>

[NETNS][DCCPV4]: Make per-net socket lookup.

The inet_lookup family of functions requires a net to lookup
a socket in, so give a proper one to them.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f5487398 13-Apr-2008 Pavel Emelyanov <xemul@openvz.org>

[NETNS][DCCPV4]: Use proper net to route the reset packet.

The dccp_v4_route_skb used in dccp_v4_ctl_send_reset, currently
works with init_net's routing tables - fix it.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b76c4b27 13-Apr-2008 Pavel Emelyanov <xemul@openvz.org>

[NETNS][DCCPV4]: Actually create ctl socket on each net and use it.

Move the call to inet_ctl_sock_create to init callback (and
inet_ctl_sock_destroy to exit one) and use proper ctl sock
in dccp_v4_ctl_send_reset.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7b1cffa8 13-Apr-2008 Pavel Emelyanov <xemul@openvz.org>

[NETNS][DCCPV4]: Move the dccp_v4_ctl_sk on the struct net.

And replace all its usage with init_net's socket.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 72a2d613 13-Apr-2008 Pavel Emelyanov <xemul@openvz.org>

[NETNS][DCCPV4]: Add dummy per-net operations.

They will be responsible for ctl socket initialization, but
currently they are void.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 028b0275 12-Apr-2008 Patrick McHardy <kaber@trash.net>

[DCCP]: Fix skb->cb conflicts with IP

dev_queue_xmit() and the other IP output functions expect to get a skb
with clear or properly initialized skb->cb. Unlike TCP and UDP, the
dccp_skb_cb doesn't contain a struct inet_skb_parm at the beginning,
so the DCCP-specific data is interpreted by the IP output functions.
This can cause false negatives for the conditional POST_ROUTING hook
invocation, making the packet bypass the hook.

Add a inet_skb_parm/inet6_skb_parm union to the beginning of
dccp_skb_cb to avoid clashes. Also add a BUILD_BUG_ON to make
sure it fits in the cb.

[ Combined with patch from Gerrit Renker to remove two now unnecessary
memsets of IPCB(skb)->opt ]

Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5677242f 03-Apr-2008 Denis V. Lunev <den@openvz.org>

[NETNS]: Inet control socket should not hold a namespace.

This is a generic requirement, so make inet_ctl_sock_create namespace
aware and create a inet_ctl_sock_destroy wrapper around
sk_release_kernel.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# eee4fe4d 03-Apr-2008 Denis V. Lunev <den@openvz.org>

[INET]: Let inet_ctl_sock_create return sock rather than socket.

All upper protocol layers are already use sock internally.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3d58b5fa 03-Apr-2008 Denis V. Lunev <den@openvz.org>

[INET]: Rename inet_csk_ctl_sock_create to inet_ctl_sock_create.

This call is nothing common with INET connection sockets code. It
simply creates an unhashes kernel sockets for protocol messages.

Move the new call into af_inet.c after the rename.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4f049b4f 03-Apr-2008 Denis V. Lunev <den@openvz.org>

[DCCP]: dccp_v(4|6)_ctl_socket is leaked.

This seems a purism as module can't be unloaded, but though if cleanup
method is present it should be correct and clean all staff created.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7630f026 03-Apr-2008 Denis V. Lunev <den@openvz.org>

[DCCP]: Replace socket with sock for reset sending.

Replace dccp_v(4|6)_ctl_socket with sock to unify a code with TCP/ICMP.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bdcde3d7 28-Mar-2008 Pavel Emelyanov <xemul@openvz.org>

[SOCK]: Drop inuse pcounter from struct proto (v2).

An uppercut - do not use the pcounter on struct proto.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 39d8cda7 22-Mar-2008 Pavel Emelyanov <xemul@openvz.org>

[SOCK]: Add udp_hash member to struct proto.

Inspired by the commit ab1e0a13 ([SOCK] proto: Add hashinfo member to
struct proto) from Arnaldo, I made similar thing for UDP/-Lite IPv4
and -v6 protocols.

The result is not that exciting, but it removes some levels of
indirection in udpxxx_get_port and saves some space in code and text.

The first step is to union existing hashinfo and new udp_hash on the
struct proto and give a name to this union, since future initialization
of tcpxxx_prot, dccp_vx_protinfo and udpxxx_protinfo will cause gcc
warning about inability to initialize anonymous member this way.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ee6b9673 05-Mar-2008 Eric Dumazet <dada1@cosmosbay.com>

[IPV4]: Add 'rtable' field in struct sk_buff to alias 'dst' and avoid casts

(Anonymous) unions can help us to avoid ugly casts.

A common cast it the (struct rtable *)skb->dst one.

Defining an union like :
union {
struct dst_entry *dst;
struct rtable *rtable;
};
permits to use skb->rtable in place.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fd80eb94 29-Feb-2008 Denis V. Lunev <den@openvz.org>

[INET]: Remove struct dst_entry *dst from request_sock_ops.rtx_syn_ack.

It looks like dst parameter is used in this API due to historical
reasons. Actually, it is really used in the direct call to
tcp_v4_send_synack only. So, create a wrapper for tcp_v4_send_synack
and remove dst from rtx_syn_ack.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ab1e0a13 03-Feb-2008 Arnaldo Carvalho de Melo <acme@redhat.com>

[SOCK] proto: Add hashinfo member to struct proto

This way we can remove TCP and DCCP specific versions of

sk->sk_prot->get_port: both v4 and v6 use inet_csk_get_port
sk->sk_prot->hash: inet_hash is directly used, only v6 need
a specific version to deal with mapped sockets
sk->sk_prot->unhash: both v4 and v6 use inet_hash directly

struct inet_connection_sock_af_ops also gets a new member, bind_conflict, so
that inet_csk_get_port can find the per family routine.

Now only the lookup routines receive as a parameter a struct inet_hashtable.

With this we further reuse code, reducing the difference among INET transport
protocols.

Eventually work has to be done on UDP and SCTP to make them share this
infrastructure and get as a bonus inet_diag interfaces so that iproute can be
used with these protocols.

net-2.6/net/ipv4/inet_hashtables.c:
struct proto | +8
struct inet_connection_sock_af_ops | +8
2 structs changed
__inet_hash_nolisten | +18
__inet_hash | -210
inet_put_port | +8
inet_bind_bucket_create | +1
__inet_hash_connect | -8
5 functions changed, 27 bytes added, 218 bytes removed, diff: -191

net-2.6/net/core/sock.c:
proto_seq_show | +3
1 function changed, 3 bytes added, diff: +3

net-2.6/net/ipv4/inet_connection_sock.c:
inet_csk_get_port | +15
1 function changed, 15 bytes added, diff: +15

net-2.6/net/ipv4/tcp.c:
tcp_set_state | -7
1 function changed, 7 bytes removed, diff: -7

net-2.6/net/ipv4/tcp_ipv4.c:
tcp_v4_get_port | -31
tcp_v4_hash | -48
tcp_v4_destroy_sock | -7
tcp_v4_syn_recv_sock | -2
tcp_unhash | -179
5 functions changed, 267 bytes removed, diff: -267

net-2.6/net/ipv6/inet6_hashtables.c:
__inet6_hash | +8
1 function changed, 8 bytes added, diff: +8

net-2.6/net/ipv4/inet_hashtables.c:
inet_unhash | +190
inet_hash | +242
2 functions changed, 432 bytes added, diff: +432

vmlinux:
16 functions changed, 485 bytes added, 492 bytes removed, diff: -7

/home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c:
tcp_v6_get_port | -31
tcp_v6_hash | -7
tcp_v6_syn_recv_sock | -9
3 functions changed, 47 bytes removed, diff: -47

/home/acme/git/net-2.6/net/dccp/proto.c:
dccp_destroy_sock | -7
dccp_unhash | -179
dccp_hash | -49
dccp_set_state | -7
dccp_done | +1
5 functions changed, 1 bytes added, 242 bytes removed, diff: -241

/home/acme/git/net-2.6/net/dccp/ipv4.c:
dccp_v4_get_port | -31
dccp_v4_request_recv_sock | -2
2 functions changed, 33 bytes removed, diff: -33

/home/acme/git/net-2.6/net/dccp/ipv6.c:
dccp_v6_get_port | -31
dccp_v6_hash | -7
dccp_v6_request_recv_sock | +5
3 functions changed, 5 bytes added, 38 bytes removed, diff: -33

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c67499c0 31-Jan-2008 Pavel Emelyanov <xemul@openvz.org>

[NETNS]: Tcp-v4 sockets per-net lookup.

Add a net argument to inet_lookup and propagate it further
into lookup calls. Plus tune the __inet_check_established.

The dccp and inet_diag, which use that lookup functions
pass the init_net into them.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f1b050bf 22-Jan-2008 Denis V. Lunev <den@openvz.org>

[NETNS]: Add namespace parameter to ip_route_output_flow.

Needed to propagate it down to the __ip_route_output_key.

Signed_off_by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 152da81d 20-Dec-2007 Pavel Emelyanov <xemul@openvz.org>

[INET]: Uninline the __inet_hash function.

This one is used in quite many places in the networking code and
seems to big to be inline.

After the patch net/ipv4/build-in.o loses ~650 bytes:
add/remove: 2/0 grow/shrink: 0/5 up/down: 461/-1114 (-653)
function old new delta
__inet_hash_nolisten - 282 +282
__inet_hash - 179 +179
tcp_sacktag_write_queue 2255 2254 -1
__inet_lookup_listener 284 274 -10
tcp_v4_syn_recv_sock 755 493 -262
tcp_v4_hash 389 35 -354
inet_hash_connect 1086 599 -487

This version addresses the issue pointed by Eric, that
while being inline this function was optimized by gcc
in respect to the 'listen_possible' argument.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8b819412 12-Dec-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Allow to parse options on Request Sockets

The option parsing code currently only parses on full sk's. This causes a problem for
options sent during the initial handshake (in particular timestamps and feature-negotiation
options). Therefore, this patch extends the option parsing code with an additional argument
for request_socks: if it is non-NULL, options are parsed on the request socket, otherwise
the normal path (parsing on the sk) is used.

Subsequent patches, which implement feature negotiation during connection setup, make use
of this facility.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c62cf5cb 06-Nov-2007 David S. Miller <davem@sunset.davemloft.net>

[DCCP]: Use DEFINE_PROTO_INUSE infrastructure.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 1238d087 24-Oct-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: One more exemption from full sequence number checks

This fixes the following problem: client connects to peer which has no DCCP
enabled or loaded; ICMP error messages ("Protocol Unavailable") can be seen
on the wire, but the application hangs. Reason: ICMP packets don't get through
to dccp_v4_err.

When reporting errors, a sequence number check is made for the DCCP packet
that had caused an ICMP error to arrive.
Such checks can not be made if the socket is in state LISTEN, RESPOND (which
in the implementation is the same as LISTEN), or REQUEST, since update_gsr()
has not been called in these states, hence the sequence window is 0..0.

This patch fixes the problem by adding the REQUEST state as another exemption
to the window check. The error reporting now works as expected on connecting.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>


# fde20105 24-Oct-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Retrieve packet sequence number for error reporting

This fixes a problem when analysing erroneous packets in dccp_v{4,6}_err:
* dccp_hdr_seq currently takes an skb
* however, the transport headers in the skb are shifted, due to the
preceding IPv4/v6 header.
Fixed for v4 and v6 by changing dccp_hdr_seq to take a struct dccp_hdr as
argument. Verified that the correct sequence number is now reported in the
error handler.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>


# 7131c6c7 21-Oct-2007 Jean Delvare <jdelvare@suse.de>

[INET]: Use MODULE_ALIAS_NET_PF_PROTO_TYPE where possible.

Now that we have this new macro, use it where possible.

Signed-off-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4a5409a5 04-Oct-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Twice the wrong reset code in receiving connection-Requests

This fixes two bugs in processing of connection-Requests in
v{4,6}_conn_request:

1. Due to using the variable `reset_code', the Reset code generated
internally by dccp_parse_options() is overwritten with the
initialised value ("Too Busy") of reset_code, which is not what is
intended.

2. When receiving a connection-Request on a multicast or broadcast
address, no Reset should be generated, to avoid storms of such
packets. Instead of jumping to the `drop' label, the
v{4,6}_conn_request functions now return 0. Below is why in my
understanding this is correct:

When the conn_request function returns < 0, then the caller,
dccp_rcv_state_process(), returns 1. In all instances where
dccp_rcv_state_process is called (dccp_v4_do_rcv, dccp_v6_do_rcv,
and dccp_child_process), a return value of != 0 from
dccp_rcv_state_process() means that a Reset is generated.

If on the other hand the conn_request function returns 0, the
packet is discarded and no Reset is generated.

Note: There may be a related problem when sending the Response, due to
the following.

if (dccp_v6_send_response(sk, req, NULL))
goto drop_and_free;
/* ... */
drop_and_free:
return -1;

In this case, if send_response fails due to transmission errors, the
next thing that is generated is a Reset with a code "Too Busy". I
haven't been able to conjure up such a condition, but it might be good
to change the behaviour here also (not done by this patch).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e356d37a 26-Sep-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Factor out common code for generating Resets

This factors code common to dccp_v{4,6}_ctl_send_reset into a separate function,
and adds support for filling in the Data 1 ... Data 3 fields from RFC 4340, 5.6.

It is useful to have this separate, since the following Reset codes will always
be generated from the control socket rather than via dccp_send_reset:
* Code 3, "No Connection", cf. 8.3.1;
* Code 4, "Packet Error" (identification for Data 1 added);
* Code 5, "Option Error" (identification for Data 1..3 added, will be used later);
* Code 6, "Mandatory Error" (same as Option Error);
* Code 7, "Connection Refused" (what on Earth is the difference to "No Connection"?);
* Code 8, "Bad Service Code";
* Code 9, "Too Busy";
* Code 10, "Bad Init Cookie" (not used).

Code 0 is not recommended by the RFC, the following codes would be used in
dccp_send_reset() instead, since they all relate to an established DCCP connection:
* Code 1, "Closed";
* Code 2, "Aborted";
* Code 11, "Aggression Penalty" (12.3).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>


# 9bf55cda 26-Sep-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Sequence number wrap-around when sending reset

This replaces normal addition with mod-48 addition so that sequence number
wraparound is respected.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>


# c7261872 05-Sep-2007 Micah Gruber <micah.gruber@gmail.com>

[DCCP]: Remove unneeded pointer newdp from dccp_v4_request_recv_sock()

This trivial patch removes the unneeded pointer newdp, which is never used.

Signed-off-by: Micah Gruber <micah.gruber@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 88c7664f 13-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce icmp_hdr(), remove skb->h.icmph

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# eddc9ec5 20-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce ip_hdr(), remove skb->nh.iph

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c9eaf173 09-Feb-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[NET] DCCP: Fix whitespace errors.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8eb9086f 08-Feb-2007 David S. Miller <davem@sunset.davemloft.net>

[IPV4/IPV6]: Always wait for IPSEC SA resolution in socket contexts.

Do this even for non-blocking sockets. This avoids the silly -EAGAIN
that applications can see now, even for non-blocking sockets in some
cases (f.e. connect()).

With help from Venkat Tekkirala.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 8109b02b 10-Dec-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Whitespace cleanups

That accumulated over the last months hackaton, shame on me for not
using git-apply whitespace helping hand, will do that from now on.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# 59348b19 20-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Simplified conditions due to use of enum:8 states

This reaps the benefit of the earlier patch, which changed the type of
CCID 3 states to use enums, in that many conditions are now simplified
and the number of possible (unexpected) values is greatly reduced.

In a few instances, this also allowed to simplify pre-conditions; where
care has been taken to retain logical equivalence.

[DCCP]: Introduce a consistent BUG/WARN message scheme

This refines the existing set of DCCP messages so that
* BUG(), BUG_ON(), WARN_ON() have meaningful DCCP-specific counterparts
* DCCP_CRIT (for severe warnings) is not rate-limited
* DCCP_WARN() is introduced as rate-limited wrapper

Using these allows a faster and cleaner transition to their original
counterparts once the code has matured into a full DCCP implementation.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# 58a5a7b9 16-Nov-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[NET]: Conditionally use bh_lock_sock_nested in sk_receive_skb

Spotted by Ian McDonald, tentatively fixed by Gerrit Renker:

http://www.mail-archive.com/dccp%40vger.kernel.org/msg00599.html

Rewritten not to unroll sk_receive_skb, in the common case, i.e. no lock
debugging, its optimized away.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# e523a155 15-Nov-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: One NET_INC_STATS() could be NET_INC_STATS_BH in dccp_v4_err()

Spotted by Eric Dumazet in tcp_v4_rcv().

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# 2bda2853 14-Nov-2006 Al Viro <viro@zeniv.linux.org.uk>

[NET]: Annotate csum_tcpudp_magic() callers in net/*

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cfb6eeb4 14-Nov-2006 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[TCP]: MD5 Signature Option (RFC2385) support.

Based on implementation by Rick Payne.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 09dbc389 13-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Miscellaneous code tidy-ups

This patch does not change code; it performs some trivial clean/tidy-ups:

* removal of a `debug_prefix' string in favour of the
already existing dccp_role(sk)

* add documentation of structures and constants

* separated out the cases for invalid packets (step 1
of the packet validation)

* removing duplicate statements

* combining declaration & initialisation

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# b9df3cb8 14-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[TCP/DCCP]: Introduce net_xmit_eval

Throughout the TCP/DCCP (and tunnelling) code, it often happens that the
return code of a transmit function needs to be tested against NET_XMIT_CN
which is a value that does not indicate a strict error condition.

This patch uses a macro for these recurring situations which is consistent
with the already existing macro net_xmit_errno, saving on duplicated code.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# d7f7365f 13-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCPv6]: Choose a genuine initial sequence number

This
* resolves a FIXME - DCCPv6 connections started all with
an initial sequence number of 1;
* provides a redirection `secure_dccpv6_sequence_number'
in case the init_sequence_v6 code should be updated later;
* concentrates the update of S.GAR into dccp_connect_init();
* removes a duplicate dccp_update_gss() in ipv4.c;
* uses inet->dport instead of usin->sin_port, due to the
following assignment in dccp_v4_connect():
inet->dport = usin->sin_port;

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# 865e9022 13-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Remove redundant statements in init_sequence (ISS)

This patch removes the following redundancies:

1) The test skb->protocol == htons(ETH_P_IPV6) in dccp_v6_init_sequence
is always true since
* dccp_v6_conn_request() is the only calling function
* dccp_v6_conn_request() redirects all skb's with ETH_P_IP to
dccp_v4_conn_request()

2) The first argument, `struct sock *sk', of dccp_v{4,6}_init_sequence()
is never used.

(This is similar for tcp_v{4,6}_init_sequence, an analogous patch has been
submitted to netdev and merged.)

By the way - are the `sport' / `dport' arguments in the right order?
I have made them consistent among calls but they seem to be in the
reverse order.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# afb0a34d 13-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Introduce a consistent naming scheme for sysctls

In order to make their function clearer and obtain a consistent naming
scheme to identify sysctls, all existing DCCP sysctls have been prefixed
with `sysctl_dccp', following the same convention as used by TCP.

Feature-specific sysctls retain the `feat' in the middle, although the
`default' has been dropped, since it is obvious from use.

Also removed a duplicate `dccp_feat_default_sequence_window' in ipv4.c.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# 6f4e5fff 10-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Support for partial checksums (RFC 4340, sec. 9.2)

This patch does the following:
a) introduces variable-length checksums as specified in [RFC 4340, sec. 9.2]
b) provides necessary socket options and documentation as to how to use them
c) basic support and infrastructure for the Minimum Checksum Coverage feature
[RFC 4340, sec. 9.2.1]: acceptability tests, user notification and user
interface

In addition, it

(1) fixes two bugs in the DCCPv4 checksum computation:
* pseudo-header used checksum_len instead of skb->len
* incorrect checksum coverage calculation based on dccph_x
(2) removes dccp_v4_verify_checksum() since it reduplicates code of the
checksum computation; code calling this function is updated accordingly.
(3) now uses skb_checksum(), which is safer than checksum_partial() if the
sk_buff has is a non-linear buffer (has pages attached to it).
(4) fixes an outstanding TODO item:
* If P.CsCov is too large for the packet size, drop packet and return.

The code has been tested with applications, the latest version of tcpdump now
comes with support for partial DCCP checksums.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# d83ca5ac 10-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Update code comments for Step 2/3

Sorts out the comments for processing steps 2,3 in section 8.5 of RFC 4340.
All comments have been updated against this document, and the reference to step
2 has been made consistent throughout the files.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# cf557926 10-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: tidy up dccp_v{4,6}_conn_request

This is a code simplification to remove reduplicated code
by concentrating and abstracting shared code.

Detailed Changes:


# 3d2fe62b 09-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCPv4]: remove forward declarations in ipv4.c

This relates to Arnaldo's announcement in
http://www.mail-archive.com/dccp@vger.kernel.org/msg00604.html

Originally this had been part of the Oops fix and is a revised variant of
http://www.mail-archive.com/dccp@vger.kernel.org/msg00598.html

No code change, merely reshuffling, with the particular objective of
having all request_sock_ops close(r) together for more clarity.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# 8a73cd09 09-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: calling dccp_v{4,6}_reqsk_send_ack is a BUG

This patch removes two functions, the send_ack functions of request_sock,
which are not called/used by the DCCP code. It is correct that these
functions are not called, below is a justification why calling these
functions (on a passive socket in the LISTEN/RESPOND state) would mean
a DCCP protocol violation.

A) Background: using request_sock in TCP:


# d23c7107 10-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Simplify jump labels in dccp_v{4,6}_rcv

This is a code simplification and was singled out from the
DCCPv6 Oops patch on
http://www.mail-archive.com/dccp@vger.kernel.org/msg00600.html

It mainly makes the code consistent between ipv{4,6}.c for the functions
dccp_v4_rcv
dccp_v6_rcv
and removes the do_time_wait label to simplify code somewhat.

Commiter note: fixed up a compile problem, trivial.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# 9b42078e 10-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Combine allocating & zeroing header space on skb

This is a code simplification:
it combines three often recurring operations into one inline function,

* allocate `len' bytes header space in skb
* fill these `len' bytes with zeroes
* cast the start of this header space as dccp_hdr

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# 494b4e7d 09-Nov-2006 David S. Miller <davem@sunset.davemloft.net>

[DCCP]: Fix typo _read_mostly --> __read_mostly.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 72a3effa 16-Nov-2006 Eric Dumazet <dada1@cosmosbay.com>

[NET]: Size listen hash tables using backlog hint

We currently allocate a fixed size (TCP_SYNQ_HSIZE=512) slots hash table for
each LISTEN socket, regardless of various parameters (listen backlog for
example)

On x86_64, this means order-1 allocations (might fail), even for 'small'
sockets, expecting few connections. On the contrary, a huge server wanting a
backlog of 50000 is slowed down a bit because of this fixed limit.

This patch makes the sizing of listen hash table a dynamic parameter,
depending of :
- net.core.somaxconn tunable (default is 128)
- net.ipv4.tcp_max_syn_backlog tunable (default : 256, 1024 or 128)
- backlog value given by user application (2nd parameter of listen())

For large allocations (bigger than PAGE_SIZE), we use vmalloc() instead of
kmalloc().

We still limit memory allocation with the two existing tunables (somaxconn &
tcp_max_syn_backlog). So for standard setups, this patch actually reduce RAM
usage.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0e64e94e 24-Oct-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Update documentation references.

Updates the references to spec documents throughout the code, taking into
account that

* the DCCP, CCID 2, and CCID 3 drafts all became RFCs in March this year

* RFC 1063 was obsoleted by RFC 1191

* draft-ietf-tcpimpl-pmtud-0x.txt was published as an Informational
RFC, RFC 2923 on 2000-09-22.

All references verified.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 82709531 11-Oct-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Fix Oops in DCCPv6

I think I got the cause for the Oops observed in
http://www.mail-archive.com/dccp@vger.kernel.org/msg00578.html

The problem is always with applications listening on PF_INET6 sockets. Apart
from the mentioned oops, I observed another one one, triggered at irregular
intervals via timer interrupt:

run_timer_softirq -> dccp_keepalive_timer
-> inet_csk_reqsk_queue_prune
-> reqsk_free
-> dccp_v6_reqsk_destructor

The latter function is the problem and is also the last function to be called
in said kernel panic.

In any case, there is a real problem with allocating the right request_sock
which is what this patch tackles.

It fixes the following problem:
- application listens on PF_INET6
- DCCPv4 packet comes in, is handed over to dccp_v4_do_rcv, from there
to dccp_v4_conn_request

Now: socket is PF_INET6, packet is IPv4. The following code then furnishes the
connection with IPv6 - request_sock operations:

req = reqsk_alloc(sk->sk_prot->rsk_prot);

The first problem is that all further incoming packets will get a Reset since
the connection can not be looked up.

The second problem is worse:
--> reqsk_alloc is called instead of inet6_reqsk_alloc
--> consequently inet6_rsk_offset is never set (dangling pointer)
--> the request_sock_ops are nevertheless still dccp6_request_ops
--> destructor is called via reqsk_free
--> dccp_v6_reqsk_destructor tries to free random memory location (inet6_rsk_offset not set)
--> panic

I have tested this for a while, DCCP sockets are now handled correctly in all
three scenarios (v4/v6 only/v4-mapped).

Commiter note: I've added the dccp_request_sock_ops forward declaration to keep
the tree building and to reduce the size of the patch for 2.6.19,
later I'll move the functions to the top of the affected source
code to match what we have in the TCP counterpart, where this
problem hasn't existed in the first place, dumb me not to have
done the same thing on DCCP land 8)

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# 9469c7b4 10-Oct-2006 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[NET]: Use typesafe inet_twsk() inline function instead of cast.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bada8adc 26-Sep-2006 Al Viro <viro@zeniv.linux.org.uk>

[IPV4]: ip_route_connect() ipv4 address arguments annotated

annotated address arguments (port number left alone for now); ditto
for inferred net-endian variables in callers.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 00e4d116 22-Sep-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Allow default/fallback service code.

This has been discussed on dccp@vger and removes the necessity for applications
to supply service codes in each and every case.

If an application does not want to provide a service code, that's fine, it will
be given 0. Otherwise, service codes can be set via socket options as before.

This patch has been tested using various client/server configurations
(including listening on multiple service codes).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# 8f491069 09-Aug-2006 Herbert Xu <herbert@gondor.apana.org.au>

[IPV4]: Use network-order dport for all visible inet_lookup_*

Right now most inet_lookup_* functions take a host-order hnum instead
of a network-order dport because that's how it is represented
internally.

This means that users of these functions have to be careful about
using the right byte-order. To add more confusion, inet_lookup takes
a network-order dport unlike all other functions.

So this patch changes all visible inet_lookup functions to take a
dport and move all dport->hnum conversion inside them.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4237c75c 25-Jul-2006 Venkat Yekkirala <vyekkirala@TrustedCS.com>

[MLSXFRM]: Auto-labeling of child sockets

This automatically labels the TCP, Unix stream, and dccp child sockets
as well as openreqs to be at the same MLS level as the peer. This will
result in the selection of appropriately labeled IPSec Security
Associations.

This also uses the sock's sid (as opposed to the isec sid) in SELinux
enforcement of secmark in rcv_skb and postroute_last hooks.

Signed-off-by: Venkat Yekkirala <vyekkirala@TrustedCS.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# beb8d13b 05-Aug-2006 Venkat Yekkirala <vyekkirala@TrustedCS.com>

[MLSXFRM]: Add flow labeling

This labels the flows that could utilize IPSec xfrms at the points the
flows are defined so that IPSec policy and SAs at the right label can
be used.

The following protos are currently not handled, but they should
continue to be able to use single-labeled IPSec like they currently
do.

ipmr
ip_gre
ipip
igmp
sit
sctp
ip6_tunnel (IPv6 over IPv6 tunnel device)
decnet

Signed-off-by: Venkat Yekkirala <vyekkirala@TrustedCS.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4b79f0af 24-Jul-2006 Ian McDonald <ian.mcdonald@jandi.co.nz>

[DCCP]: Fix default sequence window size

When using the default sequence window size (100) I got the following in
my logs:

Jun 22 14:24:09 localhost kernel: [ 1492.114775] DCCP: Step 6 failed for
DATA packet, (LSWL(6279674225) <= P.seqno(6279674749) <=
S.SWH(6279674324)) and (P.ackno doesn't exist or LAWL(18798206530) <=
P.ackno(1125899906842620) <= S.AWH(18798206548), sending SYNC...
Jun 22 14:24:09 localhost kernel: [ 1492.115147] DCCP: Step 6 failed for
DATA packet, (LSWL(6279674225) <= P.seqno(6279674750) <=
S.SWH(6279674324)) and (P.ackno doesn't exist or LAWL(18798206530) <=
P.ackno(1125899906842620) <= S.AWH(18798206549), sending SYNC...

I went to alter the default sysctl and it didn't take for new sockets.
Below patch fixes this.

I think the default is too low but it is what the DCCP spec specifies.

As a side effect of this my rx speed using iperf goes from about 2.8 Mbits/sec
to 3.5. This is still far too slow but it is a step in the right direction.

Compile tested only for IPv6 but not particularly complex change.

Signed off by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6ab3d562 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de>

Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# b8282dcf 10-Apr-2006 Eric Sesterhenn <snakebyte@gmx.de>

[DCCP]: Fix leak in net/dccp/ipv4.c

we dont free req if we cant parse the options.
This fixes coverity bug id #1046

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 543d9cfe 20-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[NET]: Identation & other cleanups related to compat_[gs]etsockopt cset

No code changes, just tidying up, in some cases moving EXPORT_SYMBOLs
to just after the function exported, etc.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3fdadf7d 20-Mar-2006 Dmitry Mishin <dim@openvz.org>

[NET]: {get|set}sockopt compatibility layer

This patch extends {get|set}sockopt compatibility layer in order to
move protocol specific parts to their place and avoid huge universal
net/compat.c file in the future.

Signed-off-by: Dmitry Mishin <dim@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c5fed159 20-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: ditch dccp_v[46]_ctl_send_ack

Merging it with its only user: dccp_v[46]_reqsk_send_ack.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 118b2c95 20-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Use sk->sk_prot->max_header consistently for non-data packets

Using this also provides opportunities for introducing
inet_csk_alloc_skb that would call alloc_skb, account it to the sock
and skb_reserve(max_header), but I'll leave this for later, for now
using sk_prot->max_header consistently is enough.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c4d93909 20-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[ICSK]: Introduce inet_csk_ctl_sock_create

Consolidating open coded sequences in tcp and dccp, v4 and v6.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 72478873 20-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP] ipv6: Add missing ipv6 control socket

I guess I forgot to add it, nah, now it just works:

18:04:33.274066 IP6 ::1.1476 > ::1.5001: request (service=0)
18:04:33.334482 IP6 ::1.5001 > ::1.1476: reset (code=bad_service_code)

Ditched IP_DCCP_UNLOAD_HACK, as now we would have to do it for both
IPv6 and IPv4, so I'll come up with another way for freeing the
control sockets in upcoming changesets.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5e0817f8 20-Mar-2006 Adrian Bunk <bunk@stusta.de>

[DCCP] ipv4: make struct dccp_v4_prot static

There's no reason for struct dccp_v4_prot being global.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b61fafc4 20-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Move the IPv4 specific bits from proto.c to ipv4.c

With this patch in place we can break down the complexity by better
compartmentalizing the code that is common to ipv6 and ipv4.

Now we have these modules:
Module Size Used by
dccp_diag 1344 0
inet_diag 9448 1 dccp_diag
dccp_ccid3 15856 0
dccp_tfrc_lib 12320 1 dccp_ccid3
dccp_ccid2 5764 0
dccp_ipv4 16996 2
dccp 48208 4 dccp_diag,dccp_ccid3,dccp_ccid2,dccp_ipv4

dccp_ipv6 still requires dccp_ipv4 due to dccp_ipv6_mapped, that is
the next target to work on the "hey, ipv4 is legacy, I only want ipv6
dude!" direction.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 075ae866 20-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Move dccp_hashinfo from ipv4.c to the core

As it is used by both ipv4 and ipv6.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0a1ec676 20-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Dont use dccp_v4_checksum in dccp_make_response

dccp_make_response is shared by ipv4/6 and the ipv6 code was
recalculating the checksum, not good, so move the dccp_v4_checksum
call to dccp_v4_send_response.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c985ed70 20-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Move dccp_[un]hash from ipv4.c to the core

As this is used by both ipv4 and ipv6 and is not ipv4 specific.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3e0fadc5 20-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Move dccp_v4_{init,destroy}_sock to the core

Removing one more ipv6 uses ipv4 stuff case in dccp land.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 017487d7 20-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Generalize dccp_v4_send_reset

Renaming it to dccp_send_reset and moving it from the ipv4 specific
code to the core dccp code.

This fixes some bugs in IPV6 where timers would send v4 resets, etc.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 04e2661e 20-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Call dccp_feat_init more early in dccp_v4_init_sock

So that dccp_feat_clean doesn't get confused with uninitialized
list_heads.

Noticed when testing with no ccid kernel modules.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 60fe62e7 20-Mar-2006 Andrea Bittau <a.bittau@cs.ucl.ac.uk>

[DCCP]: sparse endianness annotations

This also fixes the layout of dccp_hdr short sequence numbers, problem
was not fatal now as we only support long (48 bits) sequence numbers.

Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 91f0ebf7 20-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP] CCID: Improve CCID infrastructure

1. No need for ->ccid_init nor ->ccid_exit, this is what module_{init,exit}
does and anynways neither ccid2 nor ccid3 were using it.

2. Rename struct ccid to struct ccid_operations and introduce struct ccid
with a pointer to ccid_operations and rigth after it the rx or tx
private state.

3. Remove the pointer to the state of the half connections from struct
dccp_sock, now its derived thru ccid_priv() from the ccid pointer.

Now we also can implement the setsockopt for changing the CCID easily as
no ccid init routines can affect struct dccp_sock in any way that prevents
other CCIDs from working if a CCID switch operation is asked by apps.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# afe00251 20-Mar-2006 Andrea Bittau <a.bittau@cs.ucl.ac.uk>

[DCCP]: Initial feature negotiation implementation

Still needs more work, but boots and doesn't crashes, even
does some negotiation!

18:38:52.174934 127.0.0.1.43458 > 127.0.0.1.5001: request <change_l ack_ratio 2, change_r ccid 2, change_l ccid 2>
18:38:52.218526 127.0.0.1.5001 > 127.0.0.1.43458: response <nop, nop, change_l ack_ratio 2, confirm_r ccid 2 2, confirm_l ccid 2 2, confirm_r ack_ratio 2>
18:38:52.185398 127.0.0.1.43458 > 127.0.0.1.5001: <nop, confirm_r ack_ratio 2, ack_vector0 0x00, elapsed_time 212>

:-)

Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2a91aa39 20-Mar-2006 Andrea Bittau <a.bittau@cs.ucl.ac.uk>

[DCCP] CCID2: Initial CCID2 (TCP-Like) implementation

Original work by Andrea Bittau, Arnaldo Melo cleaned up and fixed several
issues on the merge process.

For now CCID2 was turned the default for all SOCK_DCCP connections, but this
will be remedied soon with the merge of the feature negotiation code.

Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9833d6da 20-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Don't alloc ack vector for the control sock

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7400d781 20-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP] ackvec: Ditch dccpav_buf_len

Simplifying the code a bit as we're always using DCCP_MAX_ACKVEC_LEN.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5d39a795 31-Jan-2006 Patrick McHardy <kaber@trash.net>

[IPV4]: Always set fl.proto in ip_route_newports

ip_route_newports uses the struct flowi from the struct rtable returned
by ip_route_connect for the new route lookup and just replaces the port
numbers if they have changed. If an IPsec policy exists which doesn't match
port 0 the struct flowi won't have the proto field set and no xfrm lookup
is done for the changed ports.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# eb9c7ebe 07-Jan-2006 Patrick McHardy <kaber@trash.net>

[NETFILTER]: Handle NAT in IPsec policy checks

Handle NAT of decapsulated IPsec packets by reconstructing the struct flowi
of the original packet from the conntrack information for IPsec policy
checks.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b59c2701 07-Jan-2006 Patrick McHardy <kaber@trash.net>

[NETFILTER]: Keep conntrack reference until IPsec policy checks are done

Keep the conntrack reference until policy checks have been performed for
IPsec NAT support. The reference needs to be dropped before a packet is
queued to avoid having the conntrack module unloadable.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 14c85021 26-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[INET_SOCK]: Move struct inet_sock & helper functions to net/inet_sock.h

To help in reducing the number of include dependencies, several files were
touched as they were getting needed headers indirectly for stuff they use.

Thanks also to Alan Menegotto for pointing out that net/dccp/proto.c had
linux/dccp.h include twice.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 25995ff5 26-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[SOCK]: Introduce sk_receive_skb

Its common enough to to justify that, TCP still can't use it as it has the
prequeueing stuff, still to be made generic in the not so distant future :-)

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d83d8461 14-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[IP_SOCKGLUE]: Remove most of the tcp specific calls

As DCCP needs to be called in the same spots.

Now we have a member in inet_sock (is_icsk), set at sock creation time from
struct inet_protosw->flags (if INET_PROTOSW_ICSK is set, like for TCP and
DCCP) to see if a struct sock instance is a inet_connection_sock for places
like the ones in ip_sockglue.c (v4 and v6) where we previously were looking if
sk_type was SOCK_STREAM, that is insufficient because we now use the same code
for DCCP, that has sk_type SOCK_DCCP.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a7f5e7f1 14-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[INET]: Generalise tcp_v4_hash_connect

Renaming it to inet_hash_connect, making it possible to ditch
dccp_v4_hash_connect and share the same code with TCP instead.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6d6ee43e 14-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[TWSK]: Introduce struct timewait_sock_ops

So that we can share several timewait sockets related functions and
make the timewait mini sockets infrastructure closer to the request
mini sockets one.

Next changesets will take advantage of this, moving more code out of
TCP and DCCP v4 and v6 to common infrastructure.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fc44b980 14-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Use reqsk_free in dccp_v4_conn_request

Now we have the destructor (dccp_v4_reqsk_destructor) in our
request_sock_ops vtable.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f21e68ca 14-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Prepare the AF agnostic core for the introduction of DCCPv6

Basically exports a similar set of functions as the one exported by
the non-AF specific TCP code.

In the process moved some non-AF specific code from dccp_v4_connect to
dccp_connect_init and moved the checksum verification from
dccp_invalid_packet to dccp_v4_rcv, so as to use it in dccp_v6_rcv
too.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 34ca6860 14-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Just rename dccp_v4_prot to dccp_prot

To match TCP equivalent.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 57cca05a 14-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Introduce dccp_ipv4_af_ops

And make the core DCCP code AF agnostic, just like TCP, now its time
to work on net/dccp/ipv6.c, we are close to the end!

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 971af18b 14-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[IPV6]: Reuse inet_csk_get_port in tcp_v6_get_port

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4c7e6895 21-Dec-2005 Ian McDonald <imcdnzl@gmail.com>

[DCCP]: Comment typo

I hope to actually change this behaviour shortly but this will help
anybody grepping code at present.

Signed-off-by: Ian McDonald <imcdnzl@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a51482bd 08-Nov-2005 Jesper Juhl <jesper.juhl@gmail.com>

[NET]: kfree cleanup

From: Jesper Juhl <jesper.juhl@gmail.com>

This is the net/ part of the big kfree cleanup patch.

Remove pointless checks for NULL prior to calling kfree() in net/.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Arnaldo Carvalho de Melo <acme@conectiva.com.br>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>


# 6df71634 03-Nov-2005 Stephen Hemminger <shemminger@osdl.org>

[TCP/DCCP]: Randomize port selection

This patch randomizes the port selected on bind() for connections
to help with possible security attacks. It should also be faster
in most cases because there is no need for a global lock.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# 49c5bfaf 17-Oct-2005 Herbert Xu <herbert@gondor.apana.org.au>

[DCCP]: Clear the IPCB area

Turns out the problem has nothing to do with use-after-free or double-free.
It's just that we're not clearing the CB area and DCCP unlike TCP uses a CB
format that's incompatible with IP.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Ian McDonald <imcdnzl@gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# 81c3d547 03-Oct-2005 Eric Dumazet <dada1@cosmosbay.com>

[INET]: speedup inet (tcp/dccp) lookups

Arnaldo and I agreed it could be applied now, because I have other
pending patches depending on this one (Thank you Arnaldo)

(The other important patch moves skc_refcnt in a separate cache line,
so that the SMP/NUMA performance doesnt suffer from cache line ping pongs)

1) First some performance data :
--------------------------------

tcp_v4_rcv() wastes a *lot* of time in __inet_lookup_established()

The most time critical code is :

sk_for_each(sk, node, &head->chain) {
if (INET_MATCH(sk, acookie, saddr, daddr, ports, dif))
goto hit; /* You sunk my battleship! */
}

The sk_for_each() does use prefetch() hints but only the begining of
"struct sock" is prefetched.

As INET_MATCH first comparison uses inet_sk(__sk)->daddr, wich is far
away from the begining of "struct sock", it has to bring into CPU
cache cold cache line. Each iteration has to use at least 2 cache
lines.

This can be problematic if some chains are very long.

2) The goal
-----------

The idea I had is to change things so that INET_MATCH() may return
FALSE in 99% of cases only using the data already in the CPU cache,
using one cache line per iteration.

3) Description of the patch
---------------------------

Adds a new 'unsigned int skc_hash' field in 'struct sock_common',
filling a 32 bits hole on 64 bits platform.

struct sock_common {
unsigned short skc_family;
volatile unsigned char skc_state;
unsigned char skc_reuse;
int skc_bound_dev_if;
struct hlist_node skc_node;
struct hlist_node skc_bind_node;
atomic_t skc_refcnt;
+ unsigned int skc_hash;
struct proto *skc_prot;
};

Store in this 32 bits field the full hash, not masked by (ehash_size -
1) Using this full hash as the first comparison done in INET_MATCH
permits us immediatly skip the element without touching a second cache
line in case of a miss.

Suppress the sk_hashent/tw_hashent fields since skc_hash (aliased to
sk_hash and tw_hash) already contains the slot number if we mask with
(ehash_size - 1)

File include/net/inet_hashtables.h

64 bits platforms :
#define INET_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
(((__sk)->sk_hash == (__hash))
((*((__u64 *)&(inet_sk(__sk)->daddr)))== (__cookie)) && \
((*((__u32 *)&(inet_sk(__sk)->dport))) == (__ports)) && \
(!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))

32bits platforms:
#define TCP_IPV4_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
(((__sk)->sk_hash == (__hash)) && \
(inet_sk(__sk)->daddr == (__saddr)) && \
(inet_sk(__sk)->rcv_saddr == (__daddr)) && \
(!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))


- Adds a prefetch(head->chain.first) in
__inet_lookup_established()/__tcp_v4_check_established() and
__inet6_lookup_established()/__tcp_v6_check_established() and
__dccp_v4_check_established() to bring into cache the first element of the
list, before the {read|write}_lock(&head->lock);

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 561713cf 18-Sep-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Don't use necessarily the same CCID for tx and rx

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ae31c339 18-Sep-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Move the ack vector code to net/dccp/ackvec.[ch]

Isolating it, that will be used when we introduce a CCID2 (TCP-Like)
implementation.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 67e6b629 16-Sep-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Introduce DCCP_SOCKOPT_SERVICE

As discussed in the dccp@vger mailing list:

Now applications have to use setsockopt(DCCP_SOCKOPT_SERVICE, service[s]),
prior to calling listen() and connect().

An array of unsigned ints can be passed meaning that the listening sock accepts
connection requests for several services.

With this we can ditch struct sockaddr_dccp and use only sockaddr_in (and
sockaddr_in6 in the future).

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0c10c5d9 16-Sep-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: More precisely set reset_code when sending RESET packets

Moving the setting of DCCP_SKB_CB(skb)->dccpd_reset_code to the places
where events happen that trigger sending a RESET packet.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# dc19336c 09-Sep-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP] Only call the HC _exit() routines in dccp_v4_destroy_sock

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# b0e56780 08-Sep-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP] Introduce dccp_timestamp

To start the timestamps with 0.0ms, easing the integer maths in the CCIDs, this
probably will be reworked to use the to be introduced struct timeval_offset
infrastructure out of skb_get_timestamp, etc.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# c530cfb1 28-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[CCID3]: Call sk->sk_write_space(sk) when receiving a feedback packet

This makes the send rate calculations behave way more closely to what
is specified, with the jitter previously seen on x and x_recv
disappearing completely on non lossy setups.

This resembles the tcp_data_snd_check code, that possibly we'll end up
using in DCCP as well, perhaps moving this code to
inet_connection_sock.

For now I'm doing the simplest implementation tho.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8efa544f 23-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Call the HC exit routines at dccp_v4_destroy_sock

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 03ace394 21-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Fix the ACK and SEQ window variables settings

This is from a first audit, more eyeballs are more than welcome.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2807d4ff 21-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Fix seqno setting in dccp_v4_ctl_send_reset

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c59eab46 18-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Use LIMIT_NETDEBUG in some debugging printks

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e92ae93a 17-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Send SYNCACK packets in response to SYNC packets

Also fix step 6 when receiving SYNC or SYNCACK packets, i.e. we were not using
the updated swl.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a1d3a355 13-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Fix sparse warnings

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7690af3f 13-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Just reflow the source code to fit in 80 columns

Andrew Morton should be happy now 8)

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 540722ff 10-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[TCPDIAG]: Implement cheapest way of supporting DCCPDIAG_GETSOCK

With ugly ifdefs, etc, but this actually:

1. keeps the existing ABI, i.e. no need to recompile the iproute2
utilities if not interested in DCCP.

2. Provides all the tcp_diag functionality in DCCP, with just a
small patch that makes iproute2 support DCCP.

Of course I'll get this cleaned-up in time, but for now I think its
OK to be this way to quickly get this functionality.

iproute2-ss050808 patch at:

http://vger.kernel.org/~acme/iproute2-ss050808.dccp.patch

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 64cf1e5d 09-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Finish the TIMEWAIT minisock support

Using most of the infrastructure TCP uses, with a dccp_death_row,
etc. As per my current interpretation of the draft what we have with
this changeset seems to be all we need (or very close to it 8)).

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0b4e03bf 09-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Initialize icsk_rto in dccp_v4_init_sock

Fixes nasty bug related to the retransmit timer (yeah, DCCP does
retransmits) firing too early.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f6ccf554 09-Aug-2005 David S. Miller <davem@davemloft.net>

[DCCP]: Fix u64 printf format warnings.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 95b81ef7 09-Aug-2005 Yoshifumi Nishida <nishida@csl.sony.co.jp>

[DCCP]: Fix checksum routines

Signed-off-by: Yoshifumi Nishida <nishida@csl.sony.co.jp>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7c657876 09-Aug-2005 Arnaldo Carvalho de Melo <acme@ghostprotocols.net>

[DCCP]: Initial implementation

Development to this point was done on a subversion repository at:

http://oops.ghostprotocols.net:81/cgi-bin/viewcvs.cgi/dccp-2.6/

This repository will be kept at this site for the foreseable future,
so that interested parties can see the history of this code,
attributions, etc.

If I ever decide to take this offline I'll provide the full history at
some other suitable place.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>