History log of /linux-master/net/dccp/minisocks.c
Revision Date Author Comments
# fbb82952 26-Jan-2022 Eric Dumazet <edumazet@google.com>

tcp: allocate tcp_death_row outside of struct netns_ipv4

I forgot tcp had per netns tracking of timewait sockets,
and their sysctl to change the limit.

After 0dad4087a86a ("tcp/dccp: get rid of inet_twsk_purge()"),
whole struct net can be freed before last tw socket is freed.

We need to allocate a separate struct inet_timewait_death_row
object per netns.

tw_count becomes a refcount and gains associated debugging infrastructure.

BUG: KASAN: use-after-free in inet_twsk_kill+0x358/0x3c0 net/ipv4/inet_timewait_sock.c:46
Read of size 8 at addr ffff88807d5f9f40 by task kworker/1:7/3690

CPU: 1 PID: 3690 Comm: kworker/1:7 Not tainted 5.16.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events pwq_unbound_release_workfn
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255
__kasan_report mm/kasan/report.c:442 [inline]
kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
inet_twsk_kill+0x358/0x3c0 net/ipv4/inet_timewait_sock.c:46
call_timer_fn+0x1a5/0x6b0 kernel/time/timer.c:1421
expire_timers kernel/time/timer.c:1466 [inline]
__run_timers.part.0+0x67c/0xa30 kernel/time/timer.c:1734
__run_timers kernel/time/timer.c:1715 [inline]
run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1747
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
invoke_softirq kernel/softirq.c:432 [inline]
__irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638
RIP: 0010:lockdep_unregister_key+0x1c9/0x250 kernel/locking/lockdep.c:6328
Code: 00 00 00 48 89 ee e8 46 fd ff ff 4c 89 f7 e8 5e c9 ff ff e8 09 cc ff ff 9c 58 f6 c4 02 75 26 41 f7 c4 00 02 00 00 74 01 fb 5b <5d> 41 5c 41 5d 41 5e 41 5f e9 19 4a 08 00 0f 0b 5b 5d 41 5c 41 5d
RSP: 0018:ffffc90004077cb8 EFLAGS: 00000206
RAX: 0000000000000046 RBX: ffff88807b61b498 RCX: 0000000000000001
RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff888077027128 R08: 0000000000000001 R09: ffffffff8f1ea4fc
R10: fffffbfff1ff93ee R11: 000000000000af1e R12: 0000000000000246
R13: 0000000000000000 R14: ffffffff8ffc89b8 R15: ffffffff90157fb0
wq_unregister_lockdep kernel/workqueue.c:3508 [inline]
pwq_unbound_release_workfn+0x254/0x340 kernel/workqueue.c:3746
process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
worker_thread+0x657/0x1110 kernel/workqueue.c:2454
kthread+0x2e9/0x3a0 kernel/kthread.c:377
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
</TASK>

Allocated by task 3635:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:46 [inline]
set_alloc_info mm/kasan/common.c:437 [inline]
__kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:470
kasan_slab_alloc include/linux/kasan.h:260 [inline]
slab_post_alloc_hook mm/slab.h:732 [inline]
slab_alloc_node mm/slub.c:3230 [inline]
slab_alloc mm/slub.c:3238 [inline]
kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3243
kmem_cache_zalloc include/linux/slab.h:705 [inline]
net_alloc net/core/net_namespace.c:407 [inline]
copy_net_ns+0x125/0x760 net/core/net_namespace.c:462
create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
ksys_unshare+0x445/0x920 kernel/fork.c:3048
__do_sys_unshare kernel/fork.c:3119 [inline]
__se_sys_unshare kernel/fork.c:3117 [inline]
__x64_sys_unshare+0x2d/0x40 kernel/fork.c:3117
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

The buggy address belongs to the object at ffff88807d5f9a80
which belongs to the cache net_namespace of size 6528
The buggy address is located 1216 bytes inside of
6528-byte region [ffff88807d5f9a80, ffff88807d5fb400)
The buggy address belongs to the page:
page:ffffea0001f57e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88807d5f9a80 pfn:0x7d5f8
head:ffffea0001f57e00 order:3 compound_mapcount:0 compound_pincount:0
memcg:ffff888070023001
flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000010200 ffff888010dd4f48 ffffea0001404e08 ffff8880118fd000
raw: ffff88807d5f9a80 0000000000040002 00000001ffffffff ffff888070023001
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 3634, ts 119694798460, free_ts 119693556950
prep_new_page mm/page_alloc.c:2434 [inline]
get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4165
__alloc_pages+0x1b2/0x500 mm/page_alloc.c:5389
alloc_pages+0x1aa/0x310 mm/mempolicy.c:2271
alloc_slab_page mm/slub.c:1799 [inline]
allocate_slab mm/slub.c:1944 [inline]
new_slab+0x28a/0x3b0 mm/slub.c:2004
___slab_alloc+0x87c/0xe90 mm/slub.c:3018
__slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3105
slab_alloc_node mm/slub.c:3196 [inline]
slab_alloc mm/slub.c:3238 [inline]
kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3243
kmem_cache_zalloc include/linux/slab.h:705 [inline]
net_alloc net/core/net_namespace.c:407 [inline]
copy_net_ns+0x125/0x760 net/core/net_namespace.c:462
create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
ksys_unshare+0x445/0x920 kernel/fork.c:3048
__do_sys_unshare kernel/fork.c:3119 [inline]
__se_sys_unshare kernel/fork.c:3117 [inline]
__x64_sys_unshare+0x2d/0x40 kernel/fork.c:3117
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1352 [inline]
free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1404
free_unref_page_prepare mm/page_alloc.c:3325 [inline]
free_unref_page+0x19/0x690 mm/page_alloc.c:3404
skb_free_head net/core/skbuff.c:655 [inline]
skb_release_data+0x65d/0x790 net/core/skbuff.c:677
skb_release_all net/core/skbuff.c:742 [inline]
__kfree_skb net/core/skbuff.c:756 [inline]
consume_skb net/core/skbuff.c:914 [inline]
consume_skb+0xc2/0x160 net/core/skbuff.c:908
skb_free_datagram+0x1b/0x1f0 net/core/datagram.c:325
netlink_recvmsg+0x636/0xea0 net/netlink/af_netlink.c:1998
sock_recvmsg_nosec net/socket.c:948 [inline]
sock_recvmsg net/socket.c:966 [inline]
sock_recvmsg net/socket.c:962 [inline]
____sys_recvmsg+0x2c4/0x600 net/socket.c:2632
___sys_recvmsg+0x127/0x200 net/socket.c:2674
__sys_recvmsg+0xe2/0x1a0 net/socket.c:2704
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

Memory state around the buggy address:
ffff88807d5f9e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88807d5f9e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88807d5f9f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88807d5f9f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88807d5fa000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: 0dad4087a86a ("tcp/dccp: get rid of inet_twsk_purge()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220126180714.845362-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# d9ea761f 07-Sep-2021 Lin, Zhenpeng <zplin@psu.edu>

dccp: don't duplicate ccid when cloning dccp sock

Commit 2677d2067731 ("dccp: don't free ccid2_hc_tx_sock ...") fixed
a UAF but reintroduced CVE-2017-6074.

When the sock is cloned, two dccps_hc_tx_ccid will reference to the
same ccid. So one can free the ccid object twice from two socks after
cloning.

This issue was found by "Hadar Manor" as well and assigned with
CVE-2020-16119, which was fixed in Ubuntu's kernel. So here I port
the patch from Ubuntu to fix it.

The patch prevents cloned socks from referencing the same ccid.

Fixes: 2677d2067731410 ("dccp: don't free ccid2_hc_tx_sock ...")
Signed-off-by: Zhenpeng Lin <zplin@psu.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3283ff2e 23-Feb-2020 Jules Irenge <jbi.octave@gmail.com>

dccp: Add missing annotation for dccp_child_process()

Sparse reports a warning at dccp_child_process()
warning: context imbalance in dccp_child_process() - unexpected unlock
The root cause is the missing annotation at dccp_child_process()
Add the missing __releases(child) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2874c5fd 27-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 95358a95 05-Jun-2018 Maciej Żenczykowski <maze@google.com>

net-tcp: remove useless tw_timeout field

Tested: 'git grep tw_timeout' comes up empty and it builds :-)

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ec94c269 11-Dec-2017 Eric Dumazet <edumazet@google.com>

tcp/dccp: avoid one atomic operation for timewait hashdance

First, rename __inet_twsk_hashdance() to inet_twsk_hashdance()

Then, remove one inet_twsk_put() by setting tw_refcnt to 3 instead
of 4, but adding a fat warning that we do not have the right to access
tw anymore after inet_twsk_hashdance()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cfac7f83 01-Dec-2017 Eric Dumazet <edumazet@google.com>

tcp/dccp: block bh before arming time_wait timer

Maciej Żenczykowski reported some panics in tcp_twsk_destructor()
that might be caused by the following bug.

timewait timer is pinned to the cpu, because we want to transition
timwewait refcount from 0 to 4 in one go, once everything has been
initialized.

At the time commit ed2e92394589 ("tcp/dccp: fix timewait races in timer
handling") was merged, TCP was always running from BH habdler.

After commit 5413d1babe8f ("net: do not block BH while processing
socket backlog") we definitely can run tcp_time_wait() from process
context.

We need to block BH in the critical section so that the pinned timer
has still its purpose.

This bug is more likely to happen under stress and when very small RTO
are used in datacenter flows.

Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 62f8f4d9 05-Mar-2017 Eric Dumazet <edumazet@google.com>

dccp: fix use-after-free in dccp_feat_activate_values

Dmitry reported crashes in DCCP stack [1]

Problem here is that when I got rid of listener spinlock, I missed the
fact that DCCP stores a complex state in struct dccp_request_sock,
while TCP does not.

Since multiple cpus could access it at the same time, we need to add
protection.

[1]
BUG: KASAN: use-after-free in dccp_feat_activate_values+0x967/0xab0
net/dccp/feat.c:1541 at addr ffff88003713be68
Read of size 8 by task syz-executor2/8457
CPU: 2 PID: 8457 Comm: syz-executor2 Not tainted 4.10.0-rc7+ #127
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:15 [inline]
dump_stack+0x292/0x398 lib/dump_stack.c:51
kasan_object_err+0x1c/0x70 mm/kasan/report.c:162
print_address_description mm/kasan/report.c:200 [inline]
kasan_report_error mm/kasan/report.c:289 [inline]
kasan_report.part.1+0x20e/0x4e0 mm/kasan/report.c:311
kasan_report mm/kasan/report.c:332 [inline]
__asan_report_load8_noabort+0x29/0x30 mm/kasan/report.c:332
dccp_feat_activate_values+0x967/0xab0 net/dccp/feat.c:1541
dccp_create_openreq_child+0x464/0x610 net/dccp/minisocks.c:121
dccp_v6_request_recv_sock+0x1f6/0x1960 net/dccp/ipv6.c:457
dccp_check_req+0x335/0x5a0 net/dccp/minisocks.c:186
dccp_v6_rcv+0x69e/0x1d00 net/dccp/ipv6.c:711
ip6_input_finish+0x46d/0x17a0 net/ipv6/ip6_input.c:279
NF_HOOK include/linux/netfilter.h:257 [inline]
ip6_input+0xdb/0x590 net/ipv6/ip6_input.c:322
dst_input include/net/dst.h:507 [inline]
ip6_rcv_finish+0x289/0x890 net/ipv6/ip6_input.c:69
NF_HOOK include/linux/netfilter.h:257 [inline]
ipv6_rcv+0x12ec/0x23d0 net/ipv6/ip6_input.c:203
__netif_receive_skb_core+0x1ae5/0x3400 net/core/dev.c:4190
__netif_receive_skb+0x2a/0x170 net/core/dev.c:4228
process_backlog+0xe5/0x6c0 net/core/dev.c:4839
napi_poll net/core/dev.c:5202 [inline]
net_rx_action+0xe70/0x1900 net/core/dev.c:5267
__do_softirq+0x2fb/0xb7d kernel/softirq.c:284
do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:902
</IRQ>
do_softirq.part.17+0x1e8/0x230 kernel/softirq.c:328
do_softirq kernel/softirq.c:176 [inline]
__local_bh_enable_ip+0x1f2/0x200 kernel/softirq.c:181
local_bh_enable include/linux/bottom_half.h:31 [inline]
rcu_read_unlock_bh include/linux/rcupdate.h:971 [inline]
ip6_finish_output2+0xbb0/0x23d0 net/ipv6/ip6_output.c:123
ip6_finish_output+0x302/0x960 net/ipv6/ip6_output.c:148
NF_HOOK_COND include/linux/netfilter.h:246 [inline]
ip6_output+0x1cb/0x8d0 net/ipv6/ip6_output.c:162
ip6_xmit+0xcdf/0x20d0 include/net/dst.h:501
inet6_csk_xmit+0x320/0x5f0 net/ipv6/inet6_connection_sock.c:179
dccp_transmit_skb+0xb09/0x1120 net/dccp/output.c:141
dccp_xmit_packet+0x215/0x760 net/dccp/output.c:280
dccp_write_xmit+0x168/0x1d0 net/dccp/output.c:362
dccp_sendmsg+0x79c/0xb10 net/dccp/proto.c:796
inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:744
sock_sendmsg_nosec net/socket.c:635 [inline]
sock_sendmsg+0xca/0x110 net/socket.c:645
SYSC_sendto+0x660/0x810 net/socket.c:1687
SyS_sendto+0x40/0x50 net/socket.c:1655
entry_SYSCALL_64_fastpath+0x1f/0xc2
RIP: 0033:0x4458b9
RSP: 002b:00007f8ceb77bb58 EFLAGS: 00000282 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000000017 RCX: 00000000004458b9
RDX: 0000000000000023 RSI: 0000000020e60000 RDI: 0000000000000017
RBP: 00000000006e1b90 R08: 00000000200f9fe1 R09: 0000000000000020
R10: 0000000000008010 R11: 0000000000000282 R12: 00000000007080a8
R13: 0000000000000000 R14: 00007f8ceb77c9c0 R15: 00007f8ceb77c700
Object at ffff88003713be50, in cache kmalloc-64 size: 64
Allocated:
PID = 8446
save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
save_stack+0x43/0xd0 mm/kasan/kasan.c:502
set_track mm/kasan/kasan.c:514 [inline]
kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:605
kmem_cache_alloc_trace+0x82/0x270 mm/slub.c:2738
kmalloc include/linux/slab.h:490 [inline]
dccp_feat_entry_new+0x214/0x410 net/dccp/feat.c:467
dccp_feat_push_change+0x38/0x220 net/dccp/feat.c:487
__feat_register_sp+0x223/0x2f0 net/dccp/feat.c:741
dccp_feat_propagate_ccid+0x22b/0x2b0 net/dccp/feat.c:949
dccp_feat_server_ccid_dependencies+0x1b3/0x250 net/dccp/feat.c:1012
dccp_make_response+0x1f1/0xc90 net/dccp/output.c:423
dccp_v6_send_response+0x4ec/0xc20 net/dccp/ipv6.c:217
dccp_v6_conn_request+0xaba/0x11b0 net/dccp/ipv6.c:377
dccp_rcv_state_process+0x51e/0x1650 net/dccp/input.c:606
dccp_v6_do_rcv+0x213/0x350 net/dccp/ipv6.c:632
sk_backlog_rcv include/net/sock.h:893 [inline]
__sk_receive_skb+0x36f/0xcc0 net/core/sock.c:479
dccp_v6_rcv+0xba5/0x1d00 net/dccp/ipv6.c:742
ip6_input_finish+0x46d/0x17a0 net/ipv6/ip6_input.c:279
NF_HOOK include/linux/netfilter.h:257 [inline]
ip6_input+0xdb/0x590 net/ipv6/ip6_input.c:322
dst_input include/net/dst.h:507 [inline]
ip6_rcv_finish+0x289/0x890 net/ipv6/ip6_input.c:69
NF_HOOK include/linux/netfilter.h:257 [inline]
ipv6_rcv+0x12ec/0x23d0 net/ipv6/ip6_input.c:203
__netif_receive_skb_core+0x1ae5/0x3400 net/core/dev.c:4190
__netif_receive_skb+0x2a/0x170 net/core/dev.c:4228
process_backlog+0xe5/0x6c0 net/core/dev.c:4839
napi_poll net/core/dev.c:5202 [inline]
net_rx_action+0xe70/0x1900 net/core/dev.c:5267
__do_softirq+0x2fb/0xb7d kernel/softirq.c:284
Freed:
PID = 15
save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
save_stack+0x43/0xd0 mm/kasan/kasan.c:502
set_track mm/kasan/kasan.c:514 [inline]
kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:578
slab_free_hook mm/slub.c:1355 [inline]
slab_free_freelist_hook mm/slub.c:1377 [inline]
slab_free mm/slub.c:2954 [inline]
kfree+0xe8/0x2b0 mm/slub.c:3874
dccp_feat_entry_destructor.part.4+0x48/0x60 net/dccp/feat.c:418
dccp_feat_entry_destructor net/dccp/feat.c:416 [inline]
dccp_feat_list_pop net/dccp/feat.c:541 [inline]
dccp_feat_activate_values+0x57f/0xab0 net/dccp/feat.c:1543
dccp_create_openreq_child+0x464/0x610 net/dccp/minisocks.c:121
dccp_v6_request_recv_sock+0x1f6/0x1960 net/dccp/ipv6.c:457
dccp_check_req+0x335/0x5a0 net/dccp/minisocks.c:186
dccp_v6_rcv+0x69e/0x1d00 net/dccp/ipv6.c:711
ip6_input_finish+0x46d/0x17a0 net/ipv6/ip6_input.c:279
NF_HOOK include/linux/netfilter.h:257 [inline]
ip6_input+0xdb/0x590 net/ipv6/ip6_input.c:322
dst_input include/net/dst.h:507 [inline]
ip6_rcv_finish+0x289/0x890 net/ipv6/ip6_input.c:69
NF_HOOK include/linux/netfilter.h:257 [inline]
ipv6_rcv+0x12ec/0x23d0 net/ipv6/ip6_input.c:203
__netif_receive_skb_core+0x1ae5/0x3400 net/core/dev.c:4190
__netif_receive_skb+0x2a/0x170 net/core/dev.c:4228
process_backlog+0xe5/0x6c0 net/core/dev.c:4839
napi_poll net/core/dev.c:5202 [inline]
net_rx_action+0xe70/0x1900 net/core/dev.c:5267
__do_softirq+0x2fb/0xb7d kernel/softirq.c:284
Memory state around the buggy address:
ffff88003713bd00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff88003713bd80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88003713be00: fc fc fc fc fc fc fc fc fc fc fb fb fb fb fb fb
^

Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 94352d45 01-Mar-2017 Arnaldo Carvalho de Melo <acme@redhat.com>

net: Introduce sk_clone_lock() error path routine

When handling problems in cloning a socket with the sk_clone_locked()
function we need to perform several steps that were open coded in it and
its callers, so introduce a routine to avoid this duplication:
sk_free_unlock_clone().

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/n/net-ui6laqkotycunhtmqryl9bfx@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d5afb6f9 01-Mar-2017 Arnaldo Carvalho de Melo <acme@redhat.com>

dccp: Unlock sock before calling sk_free()

The code where sk_clone() came from created a new socket and locked it,
but then, on the error path didn't unlock it.

This problem stayed there for a long while, till b0691c8ee7c2 ("net:
Unlock sock before calling sk_free()") fixed it, but unfortunately the
callers of sk_clone() (now sk_clone_locked()) were not audited and the
one in dccp_create_openreq_child() remained.

Now in the age of the syskaller fuzzer, this was finally uncovered, as
reported by Dmitry:

---- 8< ----

I've got the following report while running syzkaller fuzzer on
86292b33d4b7 ("Merge branch 'akpm' (patches from Andrew)")

[ BUG: held lock freed! ]
4.10.0+ #234 Not tainted
-------------------------
syz-executor6/6898 is freeing memory
ffff88006286cac0-ffff88006286d3b7, with a lock still held there!
(slock-AF_INET6){+.-...}, at: [<ffffffff8362c2c9>] spin_lock
include/linux/spinlock.h:299 [inline]
(slock-AF_INET6){+.-...}, at: [<ffffffff8362c2c9>]
sk_clone_lock+0x3d9/0x12c0 net/core/sock.c:1504
5 locks held by syz-executor6/6898:
#0: (sk_lock-AF_INET6){+.+.+.}, at: [<ffffffff839a34b4>] lock_sock
include/net/sock.h:1460 [inline]
#0: (sk_lock-AF_INET6){+.+.+.}, at: [<ffffffff839a34b4>]
inet_stream_connect+0x44/0xa0 net/ipv4/af_inet.c:681
#1: (rcu_read_lock){......}, at: [<ffffffff83bc1c2a>]
inet6_csk_xmit+0x12a/0x5d0 net/ipv6/inet6_connection_sock.c:126
#2: (rcu_read_lock){......}, at: [<ffffffff8369b424>] __skb_unlink
include/linux/skbuff.h:1767 [inline]
#2: (rcu_read_lock){......}, at: [<ffffffff8369b424>] __skb_dequeue
include/linux/skbuff.h:1783 [inline]
#2: (rcu_read_lock){......}, at: [<ffffffff8369b424>]
process_backlog+0x264/0x730 net/core/dev.c:4835
#3: (rcu_read_lock){......}, at: [<ffffffff83aeb5c0>]
ip6_input_finish+0x0/0x1700 net/ipv6/ip6_input.c:59
#4: (slock-AF_INET6){+.-...}, at: [<ffffffff8362c2c9>] spin_lock
include/linux/spinlock.h:299 [inline]
#4: (slock-AF_INET6){+.-...}, at: [<ffffffff8362c2c9>]
sk_clone_lock+0x3d9/0x12c0 net/core/sock.c:1504

Fix it just like was done by b0691c8ee7c2 ("net: Unlock sock before calling
sk_free()").

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170301153510.GE15145@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# aa62d76b 27-Apr-2016 Eric Dumazet <edumazet@google.com>

dccp: rename DCCP_INC_STATS_BH()

Rename DCCP_INC_STATS_BH() to __DCCP_INC_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5e0724d0 22-Oct-2015 Eric Dumazet <edumazet@google.com>

tcp/dccp: fix hashdance race for passive sessions

Multiple cpus can process duplicates of incoming ACK messages
matching a SYN_RECV request socket. This is a rare event under
normal operations, but definitely can happen.

Only one must win the race, otherwise corruption would occur.

To fix this without adding new atomic ops, we use logic in
inet_ehash_nolisten() to detect the request was present in the same
ehash bucket where we try to insert the new child.

If request socket was not found, we have to undo the child creation.

This actually removes a spin_lock()/spin_unlock() pair in
reqsk_queue_unlink() for the fast path.

Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 54105f98 29-Sep-2015 Eric Dumazet <edumazet@google.com>

dccp: constify dccp_create_openreq_child() sock argument

socket no longer needs to be read/write

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a00e7444 29-Sep-2015 Eric Dumazet <edumazet@google.com>

tcp/dccp: constify send_synack and send_reset socket argument

None of these functions need to change the socket, make it
const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ed2e9239 19-Sep-2015 Eric Dumazet <edumazet@google.com>

tcp/dccp: fix timewait races in timer handling

When creating a timewait socket, we need to arm the timer before
allowing other cpus to find it. The signal allowing cpus to find
the socket is setting tw_refcnt to non zero value.

As we set tw_refcnt in __inet_twsk_hashdance(), we therefore need to
call inet_twsk_schedule() first.

This also means we need to remove tw_refcnt changes from
inet_twsk_schedule() and let the caller handle it.

Note that because we use mod_timer_pinned(), we have the guarantee
the timer wont expire before we set tw_refcnt as we run in BH context.

To make things more readable I introduced inet_twsk_reschedule() helper.

When rearming the timer, we can use mod_timer_pending() to make sure
we do not rearm a canceled timer.

Note: This bug can possibly trigger if packets of a flow can hit
multiple cpus. This does not normally happen, unless flow steering
is broken somehow. This explains this bug was spotted ~5 months after
its introduction.

A similar fix is needed for SYN_RECV sockets in reqsk_queue_hash_req(),
but will be provided in a separate patch for proper tracking.

Fixes: 789f558cfb36 ("tcp/dccp: get rid of central timewait timer")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Ying Cai <ycai@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b357a364 23-Apr-2015 Eric Dumazet <edumazet@google.com>

inet: fix possible panic in reqsk_queue_unlink()

[ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at
0000000000000080
[ 3897.931025] IP: [<ffffffffa9f27686>] reqsk_timer_handler+0x1a6/0x243

There is a race when reqsk_timer_handler() and tcp_check_req() call
inet_csk_reqsk_queue_unlink() on the same req at the same time.

Before commit fa76ce7328b2 ("inet: get rid of central tcp/dccp listener
timer"), listener spinlock was held and race could not happen.

To solve this bug, we change reqsk_queue_unlink() to not assume req
must be found, and we return a status, to conditionally release a
refcount on the request sock.

This also means tcp_check_req() in non fastopen case might or not
consume req refcount, so tcp_v6_hnd_req() & tcp_v4_hnd_req() have
to properly handle this.

(Same remark for dccp_check_req() and its callers)

inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is
called 4 times in tcp and 3 times in dccp.

Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 789f558c 12-Apr-2015 Eric Dumazet <edumazet@google.com>

tcp/dccp: get rid of central timewait timer

Using a timer wheel for timewait sockets was nice ~15 years ago when
memory was expensive and machines had a single processor.

This does not scale, code is ugly and source of huge latencies
(Typically 30 ms have been seen, cpus spinning on death_lock spinlock.)

We can afford to use an extra 64 bytes per timewait sock and spread
timewait load to all cpus to have better behavior.

Tested:

On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
on the target (lpaa24)

Before patch :

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
419594

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
437171

While test is running, we can observe 25 or even 33 ms latencies.

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2

After patch :

About 90% increase of throughput :

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
810442

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
800992

And latencies are kept to minimal values during this load, even
if network utilization is 90% higher :

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 52452c54 19-Mar-2015 Eric Dumazet <edumazet@google.com>

inet: drop prev pointer handling in request sock

When request sock are put in ehash table, the whole notion
of having a previous request to update dl_next is pointless.

Also, following patch will get rid of big purge timer,
so we want to delete a request sock without holding listener lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9fe516ba 27-Jun-2014 Eric Dumazet <edumazet@google.com>

inet: move ipv6only in sock_common

When an UDP application switches from AF_INET to AF_INET6 sockets, we
have a small performance degradation for IPv4 communications because of
extra cache line misses to access ipv6only information.

This can also be noticed for TCP listeners, as ipv6_only_sock() is also
used from __inet_lookup_listener()->compute_score()

This is magnified when SO_REUSEPORT is used.

Move ipv6only into struct sock_common so that it is available at
no extra cost in lookups.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 676d2369 11-Apr-2014 David S. Miller <davem@davemloft.net>

net: Fix use after free by removing length arg from sk_data_ready callbacks.

Several spots in the kernel perform a sequence like:

skb_queue_tail(&sk->s_receive_queue, skb);
sk->sk_data_ready(sk, skb->len);

But at the moment we place the SKB onto the socket receive queue it
can be consumed and freed up. So this skb->len access is potentially
to freed up memory.

Furthermore, the skb->len can be modified by the consumer so it is
possible that the value isn't accurate.

And finally, no actual implementation of this callback actually uses
the length argument. And since nobody actually cared about it's
value, lots of call sites pass arbitrary values in such as '0' and
even '1'.

So just remove the length argument from the callback, that way there
is no confusion whatsoever and all of these use-after-free cases get
fixed as a side effect.

Based upon a patch by Eric Dumazet and his suggestion to audit this
issue tree-wide.

Signed-off-by: David S. Miller <davem@davemloft.net>


# b44084c2 10-Oct-2013 Eric Dumazet <edumazet@google.com>

inet: rename ir_loc_port to ir_num

In commit 634fb979e8f ("inet: includes a sock_common in request_sock")
I forgot that the two ports in sock_common do not have same byte order :

skc_dport is __be16 (network order), but skc_num is __u16 (host order)

So sparse complains because ir_loc_port (mapped into skc_num) is
considered as __u16 while it should be __be16

Let rename ir_loc_port to ireq->ir_num (analogy with inet->inet_num),
and perform appropriate htons/ntohs conversions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 634fb979 09-Oct-2013 Eric Dumazet <edumazet@google.com>

inet: includes a sock_common in request_sock

TCP listener refactoring, part 5 :

We want to be able to insert request sockets (SYN_RECV) into main
ehash table instead of the per listener hash table to allow RCU
lookups and remove listener lock contention.

This patch includes the needed struct sock_common in front
of struct request_sock

This means there is no more inet6_request_sock IPv6 specific
structure.

Following inet_request_sock fields were renamed as they became
macros to reference fields from struct sock_common.
Prefix ir_ was chosen to avoid name collisions.

loc_port -> ir_loc_port
loc_addr -> ir_loc_addr
rmt_addr -> ir_rmt_addr
rmt_port -> ir_rmt_port
iif -> ir_iif

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# efe4208f 03-Oct-2013 Eric Dumazet <edumazet@google.com>

ipv6: make lookups simpler and faster

TCP listener refactoring, part 4 :

To speed up inet lookups, we moved IPv4 addresses from inet to struct
sock_common

Now is time to do the same for IPv6, because it permits us to have fast
lookups for all kind of sockets, including upcoming SYN_RECV.

Getting IPv6 addresses in TCP lookups currently requires two extra cache
lines, plus a dereference (and memory stall).

inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6

This patch is way bigger than its IPv4 counter part, because for IPv4,
we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
it's not doable easily.

inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr

And timewait socket also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
at the same offset.

We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
macro.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e6c022a4 27-Oct-2012 Eric Dumazet <edumazet@google.com>

tcp: better retrans tracking for defer-accept

For passive TCP connections using TCP_DEFER_ACCEPT facility,
we incorrectly increment req->retrans each time timeout triggers
while no SYNACK is sent.

SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for
which we received the ACK from client). Only the last SYNACK is sent
so that we can receive again an ACK from client, to move the req into
accept queue. We plan to change this later to avoid the useless
retransmit (and potential problem as this SYNACK could be lost)

TCP_INFO later gives wrong information to user, claiming imaginary
retransmits.

Decouple req->retrans field into two independent fields :

num_retrans : number of retransmit
num_timeout : number of timeouts

num_timeout is the counter that is incremented at each timeout,
regardless of actual SYNACK being sent or not, and used to
compute the exponential timeout.

Introduce inet_rtx_syn_ack() helper to increment num_retrans
only if ->rtx_syn_ack() succeeded.

Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
when we re-send a SYNACK in answer to a (retransmitted) SYN.
Prior to this patch, we were not counting these retransmits.

Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
only if a synack packet was successfully queued.

Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Elliott Hughes <enh@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f541fb7e 26-Feb-2012 Samuel Jero <sj323707@ohio.edu>

dccp: fix bug in sequence number validation during connection setup

This fixes a bug in the sequence number validation during the initial handshake.

The code did not treat the initial sequence numbers ISS and ISR as read-only and
did not keep state for GSR and GSS as required by the specification. This causes
problems with retransmissions during the initial handshake, causing the
budding connection to be reset.

This patch now treats ISS/ISR as read-only and tracks GSS/GSR as required.

Signed-off-by: Samuel Jero <sj323707@ohio.edu>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>


# dfd56b8b 10-Dec-2011 Eric Dumazet <eric.dumazet@gmail.com>

net: use IS_ENABLED(CONFIG_IPV6)

Instead of testing defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4e3fd7a0 20-Nov-2011 Alexey Dobriyan <adobriyan@gmail.com>

net: remove ipv6_addr_copy()

C assignment can handle struct in6_addr copying.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e56c57d0 08-Nov-2011 Eric Dumazet <eric.dumazet@gmail.com>

net: rename sk_clone to sk_clone_lock

Make clear that sk_clone() and inet_csk_clone() return a locked socket.

Add _lock() prefix and kerneldoc.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0b53d460 11-Oct-2010 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp: fix the adjustments to AWL and SWL

This fixes a problem and a potential loophole with regard to seqno/ackno
validity: currently the initial adjustments to AWL/SWL are only performed
once at the begin of the connection, during the handshake.

Since the Sequence Window feature is always greater than Wmin=32 (7.5.2),
it is however necessary to perform these adjustments at least for the first
W/W' (variables as per 7.5.1) packets in the lifetime of a connection.

This requirement is complicated by the fact that W/W' can change at any time
during the lifetime of a connection.

Therefore it is better to perform that safety check each time SWL/AWL are
updated, as implemented by the patch.

A second problem solved by this patch is that the remote/local Sequence Window
feature values (which set the bounds for AWL/SWL/SWH) are undefined until the
feature negotiation has completed.

During the initial handshake we have more stringent sequence number protection;
the changes added by this patch effect that {A,S}W{L,H} are within the correct
bounds at the instant that feature negotiation completes (since the SeqWin
feature activation handlers call dccp_update_gsr/gss()).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# a3a858ff 04-Mar-2010 Zhu Yi <yi.zhu@intel.com>

net: backlog functions rename

sk_add_backlog -> __sk_add_backlog
sk_add_backlog_limited -> sk_add_backlog

Signed-off-by: Zhu Yi <yi.zhu@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e6b4d113 02-Dec-2009 William Allen Simpson <william.allen.simpson@gmail.com>

TCPCT part 1a: add request_values parameter for sending SYNACK

Add optional function parameters associated with sending SYNACK.
These parameters are not needed after sending SYNACK, and are not
used for retransmission. Avoids extending struct tcp_request_sock,
and avoids allocating kernel memory.

Also affects DCCP as it uses common struct request_sock_ops,
but this parameter is currently reserved for future use.

Signed-off-by: William.Allen.Simpson@gmail.com
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 792b4878 16-Jan-2009 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp: Implement both feature-local and feature-remote Sequence Window feature

This adds full support for local/remote Sequence Window feature, from which the
* sequence-number-validity (W) and
* acknowledgment-number-validity (W') windows
derive as specified in RFC 4340, 7.5.3.

Specifically, the following is contained in this patch:
* integrated new socket fields into dccp_sk;
* updated the update_gsr/gss routines with regard to these fields;
* updated handler code: the Sequence Window feature is located at the TX side,
so the local feature is meant if the handler-rx flag is false;
* the initialisation of `rcv_wnd' in reqsk is removed, since
- rcv_wnd is not used by the code anywhere;
- sequence number checks are not done in the LISTEN state (cf. 7.5.3);
- dccp_check_req checks the Ack number validity more rigorously;
* the `struct dccp_minisock' became empty and is now removed.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6fdd34d4 08-Dec-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp ccid-2: Phase out the use of boolean Ack Vector sysctl

This removes the use of the sysctl and the minisock variable for the Send Ack
Vector feature, as it now is handled fully dynamically via feature negotiation
(i.e. when CCID-2 is enabled, Ack Vectors are automatically enabled as per
RFC 4341, 4.).

Using a sysctl in parallel to this implementation would open the door to
crashes, since much of the code relies on tests of the boolean minisock /
sysctl variable. Thus, this patch replaces all tests of type

if (dccp_msk(sk)->dccpms_send_ack_vector)
/* ... */
with
if (dp->dccps_hc_rx_ackvec != NULL)
/* ... */

The dccps_hc_rx_ackvec is allocated by the dccp_hdlr_ackvec() when feature
negotiation concluded that Ack Vectors are to be used on the half-connection.
Otherwise, it is NULL (due to dccp_init_sock/dccp_create_openreq_child),
so that the test is a valid one.

The activation handler for Ack Vectors is called as soon as the feature
negotiation has concluded at the
* server when the Ack marking the transition RESPOND => OPEN arrives;
* client after it has sent its ACK, marking the transition REQUEST => PARTOPEN.

Adding the sequence number of the Response packet to the Ack Vector has been
removed, since
(a) connection establishment implies that the Response has been received;
(b) the CCIDs only look at packets received in the (PART)OPEN state, i.e.
this entry will always be ignored;
(c) it can not be used for anything useful - to detect loss for instance, only
packets received after the loss can serve as pseudo-dupacks.

There was a FIXME to change the error code when dccp_ackvec_add() fails.
I removed this after finding out that:
* the check whether ackno < ISN is already made earlier,
* this Response is likely the 1st packet with an Ackno that the client gets,
* so when dccp_ackvec_add() fails, the reason is likely not a packet error.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4098dce5 08-Dec-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp: Remove manual influence on NDP Count feature

Updating the NDP count feature is handled automatically now:
* for CCID-2 it is disabled, since the code does not use NDP counts;
* for CCID-3 it is enabled, as NDP counts are used to determine loss lengths.

Allowing the user to change NDP values leads to unpredictable and failing
behaviour, since it is then possible to disable NDP counts even when they
are needed (e.g. in CCID-3).

This means that only those user settings are sensible that agree with the
values for Send NDP Count implied by the choice of CCID. But those settings
are already activated by the feature negotiation (CCID dependency tracking),
hence this form of support is redundant.

At startup the initialisation of the NDP count feature uses the default
value of 0, which is done implicitly by the zeroing-out of the socket when
it is allocated. If the choice of CCID or feature negotiation enables NDP
count, this will then be updated via the NDP activation handler.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0049bab5 08-Dec-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp: Remove obsolete parts of the old CCID interface

The TX/RX CCIDs of the minisock are now redundant: similar to the Ack Vector
case, their value equals initially that of the sysctl, but at the end of
feature negotiation may be something different.

The old interface removed by this patch thus has been replaced by the newer
interface to dynamically query the currently loaded CCIDs.

Also removed are the constructors for the TX CCID and the RX CCID, since the
switch "rx <-> non-rx" is done by the handler in minisocks.c (and the handler
is the only place in the code where CCIDs are loaded).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 192b27ff 08-Dec-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp: Integration of dynamic feature activation - part 2 (server side)

This patch integrates the activation of features at the end of negotiation
into the server-side code.

Note regarding the removal of 'const':
--------------------------------------
The 'const' attribute has been removed from 'dreq' since dccp_activate_values()
needs to operate on dreq's feature list. Part of the activation is to remove
those options from the list that have already been confirmed, hence it is not
purely read-only.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>


# dd9c0e36 16-Nov-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp: Deprecate Ack Ratio sysctl

This patch deprecates the Ack Ratio sysctl, since
* Ack Ratio is entirely ignored by CCID-3 and CCID-4,
* Ack Ratio currently doesn't work in CCID-2 (i.e. is always set to 1);
* even if it would work in CCID-2, there is no point for a user to change it:
- Ack Ratio is constrained by cwnd (RFC 4341, 6.1.2),
- if Ack Ratio > cwnd, the system resorts to spurious RTO timeouts
(since waiting for Acks which will never arrive in this window),
- cwnd is not a user-configurable value.

The only reasonable place for Ack Ratio is to print it for debugging. It is
planned to do this later on, as part of e.g. dccp_probe.

With this patch Ack Ratio is now under full control of feature negotiation:
* Ack Ratio is resolved as a dependency of the selected CCID;
* if the chosen CCID supports it (i.e. CCID == CCID-2), Ack Ratio is set to
the default of 2, following RFC 4340, 11.3 - "New connections start with Ack
Ratio 2 for both endpoints";
* what happens then is part of another patch set, since it concerns the
dynamic update of Ack Ratio while the connection is in full flight.

Thanks to Tomasz Grobelny for discussion leading up to this patch.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ac75773c 05-Nov-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp: Per-socket initialisation of feature negotiation

This provides feature-negotiation initialisation for both DCCP sockets
and DCCP request_sockets, to support feature negotiation during
connection setup.

It also resolves a FIXME regarding the congestion control
initialisation.

Thanks to Wei Yongjun for help with the IPv6 side of this patch.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 944f7502 20-Oct-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp: Port redirection support for DCCP

Commit a3116ac5c216fc3c145906a46df9ce542ff7dcf2 from 1st October ("tcp: Port
redirection support for TCP") broke DCCP skb lookup by changing inet_csk_clone,
which is used by DCCP to generate the child socket after the handshake.

This patch updates DCCP to use 'loc_port' instead of 'sport', which fixes the
problem, and thus inheriting port redirection support via the new interface.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 410e27a4 09-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

This reverts "Merge branch 'dccp' of git://eden-feed.erg.abdn.ac.uk/dccp_exp"
as it accentally contained the wrong set of patches. These will be
submitted separately.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>


# bfbddd08 03-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp: Fix the adjustments to AWL and SWL

This fixes a problem and a potential loophole with regard to seqno/ackno
validity: the problem is that the initial adjustments to AWL/SWL were
only performed at the begin of the connection, during the handshake.

Since the Sequence Window feature is always greater than Wmin=32 (7.5.2),
it is however necessary to perform these adjustments at least for the first
W/W' (variables as per 7.5.1) packets in the lifetime of a connection.

This requirement is complicated by the fact that W/W' can change at any time
during the lifetime of a connection.

Therefore the consequence is to perform this safety check each time SWL/AWL
are updated.

A second problem solved by this patch is that the remote/local Sequence Window
feature values (which set the bounds for AWL/SWL/SWH) are undefined until the
feature negotiation has completed.

During the initial handshake we have more stringent sequence number protection,
the changes added by this patch effect that {A,S}W{L,H} are within the correct
bounds at the instant that feature negotiation completes (since the SeqWin
feature activation handlers call dccp_update_gsr/gss()).

A detailed rationale is below -- can be removed from the commit message.


1. Server sequence number checks during initial handshake
---------------------------------------------------------
The server can not use the fields of the listening socket for seqno/ackno checks
and thus needs to store all relevant information on a per-connection basis on
the dccp_request socket. This is a size-constrained structure and has currently
only ISS (dreq_iss) and ISR (dreq_isr) defined.
Adding further fields (SW{L,H}, AW{L,H}) would increase the size of the struct
and it is questionable whether this will have any practical gain. The currently
implemented solution is as follows.
* receiving first Request: dccp_v{4,6}_conn_request sets
ISR := P.seqno, ISS := dccp_v{4,6}_init_sequence()

* sending first Response: dccp_v{4,6}_send_response via dccp_make_response()
sets P.seqno := ISS, sets P.ackno := ISR

* receiving retransmitted Request: dccp_check_req() overrides ISR := P.seqno

* answering retransmitted Request: dccp_make_response() sets ISS += 1,
otherwise as per first Response

* completing the handshake: succeeds in dccp_check_req() for the first Ack
where P.ackno == ISS (P.seqno is not tested)

* creating child socket: ISS, ISR are copied from the request_sock

This solution will succeed whenever the server can receive the Request and the
subsequent Ack in succession, without retransmissions. If there is packet loss,
the client needs to retransmit until this condition succeeds; it will otherwise
eventually give up. Adding further fields to the request_sock could increase
the robustness a bit, in that it would make possible to let a reordered Ack
(from a retransmitted Response) pass. The argument against such a solution is
that if the packet loss is not persistent and an Ack gets through, why not
wait for the one answering the original response: if the loss is persistent, it
is probably better to not start the connection in the first place.

Long story short: the present design (by Arnaldo) is simple and will likely work
just as well as a more complicated solution. As a consequence, {A,S}W{L,H} are
not needed until the moment the request_sock is cloned into the accept queue.

At that stage feature negotiation has completed, so that the values for the local
and remote Sequence Window feature (7.5.2) are known, i.e. we are now in a better
position to compute {A,S}W{L,H}.


2. Client sequence number checks during initial handshake
---------------------------------------------------------
Until entering PARTOPEN the client does not need the adjustments, since it
constrains the Ack window to the packet it sent.

* sending first Request: dccp_v{4,6}_connect() choose ISS,
dccp_connect() then sets GAR := ISS (as per 8.5),
dccp_transmit_skb() (with the previous bug fix) sets
GSS := ISS, AWL := ISS, AWH := GSS
* n-th retransmitted Request (with previous patch):
dccp_retransmit_skb() via timer calls
dccp_transmit_skb(), which sets GSS := ISS+n
and then AWL := ISS, AWH := ISS+n

* receiving any Response: dccp_rcv_request_sent_state_process()
-- accepts packet if AWL <= P.ackno <= AWH;
-- sets GSR = ISR = P.seqno

* sending the Ack completing the handshake: dccp_send_ack() calls
dccp_transmit_skb(), which sets GSS += 1
and AWL := ISS, AWH := GSS


Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>


# 51c7d4fa 03-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp: Implement both feature-local and feature-remote Sequence Window feature

This adds full support for local/remote Sequence Window feature, from which the
* sequence-number-validity (W) and
* acknowledgment-number-validity (W') windows
derive as specified in RFC 4340, 7.5.3.

Specifically, the following changes are introduced:
* integrated new socket fields into dccp_sk;
* updated the update_gsr/gss routines with regard to these fields;
* updated handler code: the Sequence Window feature is located at the TX side,
so the local feature is meant if the handler-rx flag is false;
* the initialisation of `rcv_wnd' in reqsk is removed, since
- rcv_wnd is not used by the code anywhere;
- sequence number checks are not done in the LISTEN state (cf. 7.5.3);
- dccp_check_req checks the Ack number validity more rigorously;
* the `struct dccp_minisock' became empty and is now removed.

Until the handshake completes with activating negotiated values, the local/remote
Sequence-Window values are undefined and thus can not reliably be estimated.
This issue is addressed in a separate patch.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>


# b235dc4a 03-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp ccid-2: Phase out the use of boolean Ack Vector sysctl

This removes the use of the sysctl and the minisock variable for the Send Ack
Vector feature, which is now handled fully dynamically via feature negotiation;
i.e. when CCID2 is enabled, Ack Vectors are automatically enabled (as per
RFC 4341, 4.).

Using a sysctl in parallel to this implementation would open the door to
crashes, since much of the code relies on tests of the boolean minisock /
sysctl variable. Thus, this patch replaces all tests of type

if (dccp_msk(sk)->dccpms_send_ack_vector)
/* ... */
with
if (dp->dccps_hc_rx_ackvec != NULL)
/* ... */

The dccps_hc_rx_ackvec is allocated by the dccp_hdlr_ackvec() when feature
negotiation concluded that Ack Vectors are to be used on the half-connection.
Otherwise, it is NULL (due to dccp_init_sock/dccp_create_openreq_child),
so that the test is a valid one.

The activation handler for Ack Vectors is called as soon as the feature
negotiation has concluded at the
* server when the Ack marking the transition RESPOND => OPEN arrives;
* client after it has sent its ACK, marking the transition REQUEST => PARTOPEN.

Adding the sequence number of the Response packet to the Ack Vector has been
removed, since
(a) connection establishment implies that the Response has been received;
(b) the CCIDs only look at packets received in the (PART)OPEN state, i.e.
this entry will always be ignored;
(c) it can not be used for anything useful - to detect loss for instance, only
packets received after the loss can serve as pseudo-dupacks.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>


# 68e074bf 03-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp: Remove manual influence on NDP Count feature

Updating the NDP count feature is handled automatically now:
* for CCID-2 it is disabled, since the code does not use NDP counts;
* for CCID-3 it is enabled, as NDP counts are used to determine loss lengths.

Allowing the user to change NDP values leads to unpredictable and failing
behaviour, since it is then possible to disable NDP counts even when they
are needed (e.g. in CCID-3).

This means that only those user settings are sensible that agree with the
values for Send NDP Count implied by the choice of CCID. But those settings
are already activated by the feature negotiation (CCID dependency tracking),
hence this form of support is redundant.

At startup the initialisation of the NDP count feature is with the default
value of 0, which is done implicitly by the zeroing-out of the socket when
it is allocated. If the choice of CCID or feature negotiation enables NDP
count, this will then be updated via the NDP activation handler.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>


# 78673e24 03-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp: Remove obsolete parts of the old CCID interface

The TX/RX CCIDs of the minisock are now redundant: similar to the Ack Vector
case, their value equals initially that of the sysctl, but at the end of
feature negotiation may be something different.

The old interface removed by this patch thus has been replaced by the newer
interface to dynamically query the currently loaded CCIDs earlier in this
patch set.

Also removed the constructors for the TX CCID and the RX CCID, since the
switch rx/non-rx is done by the handler in minisocks.c (and the handler is
the only place in the code where CCIDs are loaded).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>


# e70cacb9 03-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp: Integration of dynamic feature activation - part 2 (server side)

This patch integrates the activation of features at the end of negotiation
into the server-side code.

Note:
In dccp_create_openreq_child the request_sock argument is no longer constant,
since dccp_activate_values() uses the feature-negotiation list on dreq to sort
out the initialisation values for the different features of the child socket;
and purges this queue after use (but the `req' argument to openreq_child
can and does still remain constant).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>


# 17c30b40 03-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp: Deprecate Ack Ratio sysctl

This patch deprecates the Ack Ratio sysctl, since
* Ack Ratio is entirely ignored by CCID-3 and CCID-4,
* Ack Ratio currently doesn't work in CCID-2 (i.e. is always set to 1);
* even if it would work in CCID-2, there is no point for a user to change it:
- Ack Ratio is constrained by cwnd (RFC 4341, 6.1.2),
- if Ack Ratio > cwnd, the system resorts to spurious RTO timeouts
(since waiting for Acks which will never arrive in this window),
- cwnd is not a user-configurable value.

The only reasonable place for Ack Ratio is to print it for debugging. It is
planned to do this later on, as part of e.g. dccp_probe.

With this patch Ack Ratio is now under full control of feature negotiation:
* Ack Ratio is resolved as a dependency of the selected CCID;
* if the chosen CCID supports it (i.e. CCID == CCID-2), Ack Ratio is set to
the default of 2, following RFC 4340, 11.3 - "New connections start with Ack
Ratio 2 for both endpoints";
* what happens then is part of another patch set, since it concerns the
dynamic update of Ack Ratio while the connection is in full flight.

Thanks to Tomasz Grobelny for discussion leading up to this patch.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>


# 828755ce 03-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp: Per-socket initialisation of feature negotiation

This provides feature-negotiation initialisation for both DCCP sockets and
DCCP request_sockets, to support feature negotiation during connection setup.

It also resolves a FIXME regarding the congestion control initialisation.

Thanks to Wei Yongjun for help with the IPv6 side of this patch.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>


# 6edafaaf 07-Aug-2008 Gui Jianfeng <guijianfeng@cn.fujitsu.com>

tcp: Fix kernel panic when calling tcp_v(4/6)_md5_do_lookup

If the following packet flow happen, kernel will panic.
MathineA MathineB
SYN
---------------------->
SYN+ACK
<----------------------
ACK(bad seq)
---------------------->
When a bad seq ACK is received, tcp_v4_md5_do_lookup(skb->sk, ip_hdr(skb)->daddr))
is finally called by tcp_v4_reqsk_send_ack(), but the first parameter(skb->sk) is
NULL at that moment, so kernel panic happens.
This patch fixes this bug.

OOPS output is as following:
[ 302.812793] IP: [<c05cfaa6>] tcp_v4_md5_do_lookup+0x12/0x42
[ 302.817075] Oops: 0000 [#1] SMP
[ 302.819815] Modules linked in: ipv6 loop dm_multipath rtc_cmos rtc_core rtc_lib pcspkr pcnet32 mii i2c_piix4 parport_pc i2c_core parport ac button ata_piix libata dm_mod mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod crc_t10dif ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: scsi_wait_scan]
[ 302.849946]
[ 302.851198] Pid: 0, comm: swapper Not tainted (2.6.27-rc1-guijf #5)
[ 302.855184] EIP: 0060:[<c05cfaa6>] EFLAGS: 00010296 CPU: 0
[ 302.858296] EIP is at tcp_v4_md5_do_lookup+0x12/0x42
[ 302.861027] EAX: 0000001e EBX: 00000000 ECX: 00000046 EDX: 00000046
[ 302.864867] ESI: ceb69e00 EDI: 1467a8c0 EBP: cf75f180 ESP: c0792e54
[ 302.868333] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 302.871287] Process swapper (pid: 0, ti=c0792000 task=c0712340 task.ti=c0746000)
[ 302.875592] Stack: c06f413a 00000000 cf75f180 ceb69e00 00000000 c05d0d86 000016d0 ceac5400
[ 302.883275] c05d28f8 000016d0 ceb69e00 ceb69e20 681bf6e3 00001000 00000000 0a67a8c0
[ 302.890971] ceac5400 c04250a3 c06f413a c0792eb0 c0792edc cf59a620 cf59a620 cf59a634
[ 302.900140] Call Trace:
[ 302.902392] [<c05d0d86>] tcp_v4_reqsk_send_ack+0x17/0x35
[ 302.907060] [<c05d28f8>] tcp_check_req+0x156/0x372
[ 302.910082] [<c04250a3>] printk+0x14/0x18
[ 302.912868] [<c05d0aa1>] tcp_v4_do_rcv+0x1d3/0x2bf
[ 302.917423] [<c05d26be>] tcp_v4_rcv+0x563/0x5b9
[ 302.920453] [<c05bb20f>] ip_local_deliver_finish+0xe8/0x183
[ 302.923865] [<c05bb10a>] ip_rcv_finish+0x286/0x2a3
[ 302.928569] [<c059e438>] dev_alloc_skb+0x11/0x25
[ 302.931563] [<c05a211f>] netif_receive_skb+0x2d6/0x33a
[ 302.934914] [<d0917941>] pcnet32_poll+0x333/0x680 [pcnet32]
[ 302.938735] [<c05a3b48>] net_rx_action+0x5c/0xfe
[ 302.941792] [<c042856b>] __do_softirq+0x5d/0xc1
[ 302.944788] [<c042850e>] __do_softirq+0x0/0xc1
[ 302.948999] [<c040564b>] do_softirq+0x55/0x88
[ 302.951870] [<c04501b1>] handle_fasteoi_irq+0x0/0xa4
[ 302.954986] [<c04284da>] irq_exit+0x35/0x69
[ 302.959081] [<c0405717>] do_IRQ+0x99/0xae
[ 302.961896] [<c040422b>] common_interrupt+0x23/0x28
[ 302.966279] [<c040819d>] default_idle+0x2a/0x3d
[ 302.969212] [<c0402552>] cpu_idle+0xb2/0xd2
[ 302.972169] =======================
[ 302.974274] Code: fc ff 84 d2 0f 84 df fd ff ff e9 34 fe ff ff 83 c4 0c 5b 5e 5f 5d c3 90 90 57 89 d7 56 53 89 c3 50 68 3a 41 6f c0 e8 e9 55 e5 ff <8b> 93 9c 04 00 00 58 85 d2 59 74 1e 8b 72 10 31 db 31 c9 85 f6
[ 303.011610] EIP: [<c05cfaa6>] tcp_v4_md5_do_lookup+0x12/0x42 SS:ESP 0068:c0792e54
[ 303.018360] Kernel panic - not syncing: Fatal exception in interrupt

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# be4c798a 11-Jun-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk>

dccp: Bug in initial acknowledgment number assignment

Step 8.5 in RFC 4340 says for the newly cloned socket

Initialize S.GAR := S.ISS,

but what in fact the code (minisocks.c) does is

Initialize S.GAR := S.ISR,

which is wrong (typo?) -- fixed by the patch.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>


# fd80eb94 29-Feb-2008 Denis V. Lunev <den@openvz.org>

[INET]: Remove struct dst_entry *dst from request_sock_ops.rtx_syn_ack.

It looks like dst parameter is used in this API due to historical
reasons. Actually, it is really used in the direct call to
tcp_v4_send_synack only. So, create a wrapper for tcp_v4_send_synack
and remove dst from rtx_syn_ack.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b4d4f7c7 12-Dec-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Handle timestamps on Request/Response exchange separately

In DCCP, timestamps can occur on packets anytime, CCID3 uses a timestamp(/echo) on the Request/Response
exchange. This patch addresses the following situation:
* timestamps are recorded on the listening socket;
* Responses are sent from dccp_request_sockets;
* suppose two connections reach the listening socket with very small time in between:
* the first timestamp value gets overwritten by the second connection request.

This is not really good, so this patch separates timestamps into
* those which are received by the server during the initial handshake (on dccp_request_sock);
* those which are received by the client or the client after connection establishment.

As before, a timestamp of 0 is regarded as indicating that no (meaningful) timestamp has been
received (in addition, a warning message is printed if hosts send 0-valued timestamps).

The timestamp-echoing now works as follows:
* when a timestamp is present on the initial Request, it is placed into dreq, due to the
call to dccp_parse_options in dccp_v{4,6}_conn_request;
* when a timestamp is present on the Ack leading from RESPOND => OPEN, it is copied over
from the request_sock into the child cocket in dccp_create_openreq_child;
* timestamps received on an (established) dccp_sock are treated as before.

Since Elapsed Time is measured in hundredths of milliseconds (13.2), the new dccp_timestamp()
function is used, as it is expected that the time between receiving the timestamp and
sending the timestamp echo will be very small against the wrap-around time. As a byproduct,
this allows smaller timestamping-time fields.

Furthermore, inserting the Timestamp Echo option has been taken out of the block starting with
'!dccp_packet_without_ack()', since Timestamp Echo can be carried on any packet (5.8 and 13.3).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8109616e 12-Dec-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Add (missing) option parsing to request_sock processing

This adds option-parsing code to processing of Acks in the listening state
on request_socks on the server, serving two purposes
(i) resolves a FIXME (removed);
(ii) paves the way for feature-negotiation during connection-setup.

There is an intended subtlety here with regard to dccp_check_req:

Parsing options happens only after testing whether the received packet is
a retransmitted Request. Otherwise, if the Request contained (a possibly
large number of) feature-negotiation options, recomputing state would have to
happen each time a retransmitted Request arrives, which opens the door to an
easy DoS attack. Since in a genuine retransmission the options should not be
different from the original, reusing the already computed state seems better.

The other point is - if there are timestamp options on the Request, they will
not be answered; which means that in the presence of retransmission (likely
due to loss and/or other problems), the use of Request/Response RTT sampling
is suspended, so that startup problems here do not propagate.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 451bc047 04-Oct-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Tidy-up -- minisock initialisation

This

* removes a declaration of a non-existent function
__dccp_minisock_init;

* shifts the initialisation function dccp_minisock_init() from
options.c to minisocks.c, where it is more naturally expected to
be.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8fb8354a 19-Aug-2007 Arnaldo Carvalho de Melo <acme@ghostprotocols.net>

[DCCP]: Nuke dccp_timestamp and dccps_epoch, not used anymore

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4ef8d0ae 26-Apr-2007 Milind Arun Choudhary <milindchoudhary@gmail.com>

[NET]: SPIN_LOCK_UNLOCKED cleanup in drivers/atm, net

SPIN_LOCK_UNLOCKED cleanup,use __SPIN_LOCK_UNLOCKED instead

Signed-off-by: Milind Arun Choudhary <milindchoudhary@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 99c72ce0 06-Mar-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Set RTO for newly created child socket

This mirrors a recent change in tcp_open_req_child, whereby the icsk_rto of the
newly created child socket was not set (but rather on the parent socket). Same
fix for DCCP.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8109b02b 10-Dec-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Whitespace cleanups

That accumulated over the last months hackaton, shame on me for not
using git-apply whitespace helping hand, will do that from now on.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# 59348b19 20-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Simplified conditions due to use of enum:8 states

This reaps the benefit of the earlier patch, which changed the type of
CCID 3 states to use enums, in that many conditions are now simplified
and the number of possible (unexpected) values is greatly reduced.

In a few instances, this also allowed to simplify pre-conditions; where
care has been taken to retain logical equivalence.

[DCCP]: Introduce a consistent BUG/WARN message scheme

This refines the existing set of DCCP messages so that
* BUG(), BUG_ON(), WARN_ON() have meaningful DCCP-specific counterparts
* DCCP_CRIT (for severe warnings) is not rate-limited
* DCCP_WARN() is introduced as rate-limited wrapper

Using these allows a faster and cleaner transition to their original
counterparts once the code has matured into a full DCCP implementation.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# cfb6eeb4 14-Nov-2006 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[TCP]: MD5 Signature Option (RFC2385) support.

Based on implementation by Rick Payne.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# afb0a34d 13-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Introduce a consistent naming scheme for sysctls

In order to make their function clearer and obtain a consistent naming
scheme to identify sysctls, all existing DCCP sysctls have been prefixed
with `sysctl_dccp', following the same convention as used by TCP.

Feature-specific sysctls retain the `feat' in the middle, although the
`default' has been dropped, since it is obvious from use.

Also removed a duplicate `dccp_feat_default_sequence_window' in ipv4.c.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# e11d9d30 13-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Increment sequence numbers on retransmitted Response packets

Problem:


# d83ca5ac 10-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: Update code comments for Step 2/3

Sorts out the comments for processing steps 2,3 in section 8.5 of RFC 4340.
All comments have been updated against this document, and the reference to step
2 has been made consistent throughout the files.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# cf557926 10-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: tidy up dccp_v{4,6}_conn_request

This is a code simplification to remove reduplicated code
by concentrating and abstracting shared code.

Detailed Changes:


# 8a73cd09 09-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk>

[DCCP]: calling dccp_v{4,6}_reqsk_send_ack is a BUG

This patch removes two functions, the send_ack functions of request_sock,
which are not called/used by the DCCP code. It is correct that these
functions are not called, below is a justification why calling these
functions (on a passive socket in the LISTEN/RESPOND state) would mean
a DCCP protocol violation.

A) Background: using request_sock in TCP:


# c4028958 22-Nov-2006 David Howells <dhowells@redhat.com>

WorkStruct: make allyesconfig

Fix up for make allyesconfig.

Signed-Off-By: David Howells <dhowells@redhat.com>


# 6ab3d562 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de>

Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# a4bf3902 20-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP] minisock: Rename struct dccp_options to struct dccp_minisock

This will later be included in struct dccp_request_sock so that we can
have per connection feature negotiation state while in the 3way
handshake, when we clone the DCCP_ROLE_LISTEN socket (in
dccp_create_openreq_child) we'll just copy this state from
dreq_minisock to dccps_minisock.

Also the feature negotiation and option parsing code will mostly touch
dccps_minisock, which will simplify some stuff.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 91f0ebf7 20-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP] CCID: Improve CCID infrastructure

1. No need for ->ccid_init nor ->ccid_exit, this is what module_{init,exit}
does and anynways neither ccid2 nor ccid3 were using it.

2. Rename struct ccid to struct ccid_operations and introduce struct ccid
with a pointer to ccid_operations and rigth after it the rx or tx
private state.

3. Remove the pointer to the state of the half connections from struct
dccp_sock, now its derived thru ccid_priv() from the ccid pointer.

Now we also can implement the setsockopt for changing the CCID easily as
no ccid init routines can affect struct dccp_sock in any way that prevents
other CCIDs from working if a CCID switch operation is asked by apps.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# afe00251 20-Mar-2006 Andrea Bittau <a.bittau@cs.ucl.ac.uk>

[DCCP]: Initial feature negotiation implementation

Still needs more work, but boots and doesn't crashes, even
does some negotiation!

18:38:52.174934 127.0.0.1.43458 > 127.0.0.1.5001: request <change_l ack_ratio 2, change_r ccid 2, change_l ccid 2>
18:38:52.218526 127.0.0.1.5001 > 127.0.0.1.43458: response <nop, nop, change_l ack_ratio 2, confirm_r ccid 2 2, confirm_l ccid 2 2, confirm_r ack_ratio 2>
18:38:52.185398 127.0.0.1.43458 > 127.0.0.1.5001: <nop, confirm_r ack_ratio 2, ack_vector0 0x00, elapsed_time 212>

:-)

Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7400d781 20-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP] ackvec: Ditch dccpav_buf_len

Simplifying the code a bit as we're always using DCCP_MAX_ACKVEC_LEN.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3df80d93 14-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Introduce DCCPv6

Still needs mucho polishing, specially in the checksum code, but works
just fine, inet_diag/iproute2 and all 8)

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f21e68ca 14-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Prepare the AF agnostic core for the introduction of DCCPv6

Basically exports a similar set of functions as the one exported by
the non-AF specific TCP code.

In the process moved some non-AF specific code from dccp_v4_connect to
dccp_connect_init and moved the checksum verification from
dccp_invalid_packet to dccp_v4_rcv, so as to use it in dccp_v6_rcv
too.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 57cca05a 14-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Introduce dccp_ipv4_af_ops

And make the core DCCP code AF agnostic, just like TCP, now its time
to work on net/dccp/ipv6.c, we are close to the end!

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ae31c339 18-Sep-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Move the ack vector code to net/dccp/ackvec.[ch]

Isolating it, that will be used when we introduce a CCID2 (TCP-Like)
implementation.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 67e6b629 16-Sep-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Introduce DCCP_SOCKOPT_SERVICE

As discussed in the dccp@vger mailing list:

Now applications have to use setsockopt(DCCP_SOCKOPT_SERVICE, service[s]),
prior to calling listen() and connect().

An array of unsigned ints can be passed meaning that the listening sock accepts
connection requests for several services.

With this we can ditch struct sockaddr_dccp and use only sockaddr_in (and
sockaddr_in6 in the future).

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b0e56780 08-Sep-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP] Introduce dccp_timestamp

To start the timestamps with 0.0ms, easing the integer maths in the CCIDs, this
probably will be reworked to use the to be introduced struct timeval_offset
infrastructure out of skb_get_timestamp, etc.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# 03ace394 21-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Fix the ACK and SEQ window variables settings

This is from a first audit, more eyeballs are more than welcome.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7690af3f 13-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Just reflow the source code to fit in 80 columns

Andrew Morton should be happy now 8)

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 64cf1e5d 09-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[DCCP]: Finish the TIMEWAIT minisock support

Using most of the infrastructure TCP uses, with a dccp_death_row,
etc. As per my current interpretation of the draft what we have with
this changeset seems to be all we need (or very close to it 8)).

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f6ccf554 09-Aug-2005 David S. Miller <davem@davemloft.net>

[DCCP]: Fix u64 printf format warnings.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 7c657876 09-Aug-2005 Arnaldo Carvalho de Melo <acme@ghostprotocols.net>

[DCCP]: Initial implementation

Development to this point was done on a subversion repository at:

http://oops.ghostprotocols.net:81/cgi-bin/viewcvs.cgi/dccp-2.6/

This repository will be kept at this site for the foreseable future,
so that interested parties can see the history of this code,
attributions, etc.

If I ever decide to take this offline I'll provide the full history at
some other suitable place.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>