#
5eb902b8 |
|
08-Feb-2024 |
Kui-Feng Lee <thinker.li@gmail.com> |
net/ipv6: Remove expired routes with a separated list of routes. FIB6 GC walks trees of fib6_tables to remove expired routes. Walking a tree can be expensive if the number of routes in a table is big, even if most of them are permanent. Checking routes in a separated list of routes having expiration will avoid this potential issue. Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
20df28fb |
|
22-Jan-2024 |
Breno Leitao <leitao@debian.org> |
net/ipv6: resolve warning in ip6_fib.c In some configurations, the 'iter' variable in function fib6_repair_tree() is unused, resulting the following warning when compiled with W=1. net/ipv6/ip6_fib.c:1781:6: warning: variable 'iter' set but not used [-Wunused-but-set-variable] 1781 | int iter = 0; | ^ It is unclear what is the advantage of this RT6_TRACE() macro[1], since users can control pr_debug() in runtime, which is better than at compilation time. pr_debug() has no overhead when disabled. Remove the RT6_TRACE() in favor of simple pr_debug() helpers. [1] Link: https://lore.kernel.org/all/ZZwSEJv2HgI0cD4J@gmail.com/ Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240122181955.2391676-2-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
5a78a812 |
|
18-Dec-2023 |
David Ahern <dsahern@kernel.org> |
net/ipv6: Remove gc_link warn on in fib6_info_release A revert of 3dec89b14d37 ("net/ipv6: Remove expired routes with a separated list of routes") was sent for net-next. Revert the remainder of 5a08d0065a915 which added a warn on if a fib entry is still on the gc_link list to avoid compile failures when net is merged to net-next Signed-off-by: David Ahern <dsahern@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231219030742.25715-1-dsahern@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
a3c205d0 |
|
07-Dec-2023 |
Eric Dumazet <edumazet@google.com> |
ipv6: do not check fib6_has_expires() in fib6_info_release() My prior patch went a bit too far, because apparently fib6_has_expires() could be true while f6i->gc_link is not hashed yet. fib6_set_expires_locked() can indeed set RTF_EXPIRES while f6i->fib6_table is NULL. Original syzbot reports were about corruptions caused by dangling f6i->gc_link. Fixes: 5a08d0065a91 ("ipv6: add debug checks in fib6_info_release()") Reported-by: syzbot+c15aa445274af8674f41@syzkaller.appspotmail.com Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kui-Feng Lee <thinker.li@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231207201322.549000-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
5a08d006 |
|
05-Dec-2023 |
Eric Dumazet <edumazet@google.com> |
ipv6: add debug checks in fib6_info_release() Some elusive syzbot reports are hinting to fib6_info_release(), with a potential dangling f6i->gc_link anchor. Add debug checks so that syzbot can catch the issue earlier eventually. BUG: KASAN: slab-use-after-free in __hlist_del include/linux/list.h:990 [inline] BUG: KASAN: slab-use-after-free in hlist_del_init include/linux/list.h:1016 [inline] BUG: KASAN: slab-use-after-free in fib6_clean_expires_locked include/net/ip6_fib.h:533 [inline] BUG: KASAN: slab-use-after-free in fib6_purge_rt+0x986/0x9c0 net/ipv6/ip6_fib.c:1064 Write of size 8 at addr ffff88802805a840 by task syz-executor.1/10057 CPU: 1 PID: 10057 Comm: syz-executor.1 Not tainted 6.7.0-rc2-syzkaller-00029-g9b6de136b5f0 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/10/2023 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xd9/0x1b0 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:364 [inline] print_report+0xc4/0x620 mm/kasan/report.c:475 kasan_report+0xda/0x110 mm/kasan/report.c:588 __hlist_del include/linux/list.h:990 [inline] hlist_del_init include/linux/list.h:1016 [inline] fib6_clean_expires_locked include/net/ip6_fib.h:533 [inline] fib6_purge_rt+0x986/0x9c0 net/ipv6/ip6_fib.c:1064 fib6_del_route net/ipv6/ip6_fib.c:1993 [inline] fib6_del+0xa7a/0x1750 net/ipv6/ip6_fib.c:2038 __ip6_del_rt net/ipv6/route.c:3866 [inline] ip6_del_rt+0xf7/0x200 net/ipv6/route.c:3881 ndisc_router_discovery+0x295b/0x3560 net/ipv6/ndisc.c:1372 ndisc_rcv+0x3de/0x5f0 net/ipv6/ndisc.c:1856 icmpv6_rcv+0x1470/0x19c0 net/ipv6/icmp.c:979 ip6_protocol_deliver_rcu+0x170/0x13e0 net/ipv6/ip6_input.c:438 ip6_input_finish+0x14f/0x2f0 net/ipv6/ip6_input.c:483 NF_HOOK include/linux/netfilter.h:314 [inline] NF_HOOK include/linux/netfilter.h:308 [inline] ip6_input+0xa1/0xc0 net/ipv6/ip6_input.c:492 ip6_mc_input+0x48b/0xf40 net/ipv6/ip6_input.c:586 dst_input include/net/dst.h:461 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:79 [inline] NF_HOOK include/linux/netfilter.h:314 [inline] NF_HOOK include/linux/netfilter.h:308 [inline] ipv6_rcv+0x24e/0x380 net/ipv6/ip6_input.c:310 __netif_receive_skb_one_core+0x115/0x180 net/core/dev.c:5529 __netif_receive_skb+0x1f/0x1b0 net/core/dev.c:5643 netif_receive_skb_internal net/core/dev.c:5729 [inline] netif_receive_skb+0x133/0x700 net/core/dev.c:5788 tun_rx_batched+0x429/0x780 drivers/net/tun.c:1579 tun_get_user+0x29e3/0x3bc0 drivers/net/tun.c:2002 tun_chr_write_iter+0xe8/0x210 drivers/net/tun.c:2048 call_write_iter include/linux/fs.h:2020 [inline] new_sync_write fs/read_write.c:491 [inline] vfs_write+0x64f/0xdf0 fs/read_write.c:584 ksys_write+0x12f/0x250 fs/read_write.c:637 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x40/0x110 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x63/0x6b RIP: 0033:0x7f38e387b82f Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 b9 80 02 00 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 0c 81 02 00 48 RSP: 002b:00007f38e45c9090 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 00007f38e399bf80 RCX: 00007f38e387b82f RDX: 00000000000003b6 RSI: 0000000020000680 RDI: 00000000000000c8 RBP: 00007f38e38c847a R08: 0000000000000000 R09: 0000000000000000 R10: 00000000000003b6 R11: 0000000000000293 R12: 0000000000000000 R13: 000000000000000b R14: 00007f38e399bf80 R15: 00007f38e3abfa48 </TASK> Allocated by task 10044: kasan_save_stack+0x33/0x50 mm/kasan/common.c:45 kasan_set_track+0x25/0x30 mm/kasan/common.c:52 ____kasan_kmalloc mm/kasan/common.c:374 [inline] __kasan_kmalloc+0xa2/0xb0 mm/kasan/common.c:383 kasan_kmalloc include/linux/kasan.h:198 [inline] __do_kmalloc_node mm/slab_common.c:1007 [inline] __kmalloc+0x59/0x90 mm/slab_common.c:1020 kmalloc include/linux/slab.h:604 [inline] kzalloc include/linux/slab.h:721 [inline] fib6_info_alloc+0x40/0x160 net/ipv6/ip6_fib.c:155 ip6_route_info_create+0x337/0x1e70 net/ipv6/route.c:3749 ip6_route_add+0x26/0x150 net/ipv6/route.c:3843 rt6_add_route_info+0x2e7/0x4b0 net/ipv6/route.c:4316 rt6_route_rcv+0x76c/0xbf0 net/ipv6/route.c:985 ndisc_router_discovery+0x138b/0x3560 net/ipv6/ndisc.c:1529 ndisc_rcv+0x3de/0x5f0 net/ipv6/ndisc.c:1856 icmpv6_rcv+0x1470/0x19c0 net/ipv6/icmp.c:979 ip6_protocol_deliver_rcu+0x170/0x13e0 net/ipv6/ip6_input.c:438 ip6_input_finish+0x14f/0x2f0 net/ipv6/ip6_input.c:483 NF_HOOK include/linux/netfilter.h:314 [inline] NF_HOOK include/linux/netfilter.h:308 [inline] ip6_input+0xa1/0xc0 net/ipv6/ip6_input.c:492 ip6_mc_input+0x48b/0xf40 net/ipv6/ip6_input.c:586 dst_input include/net/dst.h:461 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:79 [inline] NF_HOOK include/linux/netfilter.h:314 [inline] NF_HOOK include/linux/netfilter.h:308 [inline] ipv6_rcv+0x24e/0x380 net/ipv6/ip6_input.c:310 __netif_receive_skb_one_core+0x115/0x180 net/core/dev.c:5529 __netif_receive_skb+0x1f/0x1b0 net/core/dev.c:5643 netif_receive_skb_internal net/core/dev.c:5729 [inline] netif_receive_skb+0x133/0x700 net/core/dev.c:5788 tun_rx_batched+0x429/0x780 drivers/net/tun.c:1579 tun_get_user+0x29e3/0x3bc0 drivers/net/tun.c:2002 tun_chr_write_iter+0xe8/0x210 drivers/net/tun.c:2048 call_write_iter include/linux/fs.h:2020 [inline] new_sync_write fs/read_write.c:491 [inline] vfs_write+0x64f/0xdf0 fs/read_write.c:584 ksys_write+0x12f/0x250 fs/read_write.c:637 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x40/0x110 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x63/0x6b Freed by task 5123: kasan_save_stack+0x33/0x50 mm/kasan/common.c:45 kasan_set_track+0x25/0x30 mm/kasan/common.c:52 kasan_save_free_info+0x2b/0x40 mm/kasan/generic.c:522 ____kasan_slab_free mm/kasan/common.c:236 [inline] ____kasan_slab_free+0x15b/0x1b0 mm/kasan/common.c:200 kasan_slab_free include/linux/kasan.h:164 [inline] slab_free_hook mm/slub.c:1800 [inline] slab_free_freelist_hook+0x114/0x1e0 mm/slub.c:1826 slab_free mm/slub.c:3809 [inline] __kmem_cache_free+0xc0/0x180 mm/slub.c:3822 rcu_do_batch kernel/rcu/tree.c:2158 [inline] rcu_core+0x819/0x1680 kernel/rcu/tree.c:2431 __do_softirq+0x21a/0x8de kernel/softirq.c:553 Last potentially related work creation: kasan_save_stack+0x33/0x50 mm/kasan/common.c:45 __kasan_record_aux_stack+0xbc/0xd0 mm/kasan/generic.c:492 __call_rcu_common.constprop.0+0x9a/0x7a0 kernel/rcu/tree.c:2681 fib6_info_release include/net/ip6_fib.h:332 [inline] fib6_info_release include/net/ip6_fib.h:329 [inline] rt6_route_rcv+0xa4e/0xbf0 net/ipv6/route.c:997 ndisc_router_discovery+0x138b/0x3560 net/ipv6/ndisc.c:1529 ndisc_rcv+0x3de/0x5f0 net/ipv6/ndisc.c:1856 icmpv6_rcv+0x1470/0x19c0 net/ipv6/icmp.c:979 ip6_protocol_deliver_rcu+0x170/0x13e0 net/ipv6/ip6_input.c:438 ip6_input_finish+0x14f/0x2f0 net/ipv6/ip6_input.c:483 NF_HOOK include/linux/netfilter.h:314 [inline] NF_HOOK include/linux/netfilter.h:308 [inline] ip6_input+0xa1/0xc0 net/ipv6/ip6_input.c:492 ip6_mc_input+0x48b/0xf40 net/ipv6/ip6_input.c:586 dst_input include/net/dst.h:461 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:79 [inline] NF_HOOK include/linux/netfilter.h:314 [inline] NF_HOOK include/linux/netfilter.h:308 [inline] ipv6_rcv+0x24e/0x380 net/ipv6/ip6_input.c:310 __netif_receive_skb_one_core+0x115/0x180 net/core/dev.c:5529 __netif_receive_skb+0x1f/0x1b0 net/core/dev.c:5643 netif_receive_skb_internal net/core/dev.c:5729 [inline] netif_receive_skb+0x133/0x700 net/core/dev.c:5788 tun_rx_batched+0x429/0x780 drivers/net/tun.c:1579 tun_get_user+0x29e3/0x3bc0 drivers/net/tun.c:2002 tun_chr_write_iter+0xe8/0x210 drivers/net/tun.c:2048 call_write_iter include/linux/fs.h:2020 [inline] new_sync_write fs/read_write.c:491 [inline] vfs_write+0x64f/0xdf0 fs/read_write.c:584 ksys_write+0x12f/0x250 fs/read_write.c:637 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x40/0x110 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x63/0x6b Second to last potentially related work creation: kasan_save_stack+0x33/0x50 mm/kasan/common.c:45 __kasan_record_aux_stack+0xbc/0xd0 mm/kasan/generic.c:492 insert_work+0x38/0x230 kernel/workqueue.c:1647 __queue_work+0xcdc/0x11f0 kernel/workqueue.c:1803 call_timer_fn+0x193/0x590 kernel/time/timer.c:1700 expire_timers kernel/time/timer.c:1746 [inline] __run_timers+0x585/0xb20 kernel/time/timer.c:2022 run_timer_softirq+0x58/0xd0 kernel/time/timer.c:2035 __do_softirq+0x21a/0x8de kernel/softirq.c:553 The buggy address belongs to the object at ffff88802805a800 which belongs to the cache kmalloc-512 of size 512 The buggy address is located 64 bytes inside of freed 512-byte region [ffff88802805a800, ffff88802805aa00) The buggy address belongs to the physical page: page:ffffea0000a01600 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x28058 head:ffffea0000a01600 order:2 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0xfff00000000840(slab|head|node=0|zone=1|lastcpupid=0x7ff) page_type: 0xffffffff() raw: 00fff00000000840 ffff888013041c80 ffffea0001e02600 dead000000000002 raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 2, migratetype Unmovable, gfp_mask 0x1d20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL), pid 18706, tgid 18699 (syz-executor.2), ts 999991973280, free_ts 996884464281 set_page_owner include/linux/page_owner.h:31 [inline] post_alloc_hook+0x2d0/0x350 mm/page_alloc.c:1537 prep_new_page mm/page_alloc.c:1544 [inline] get_page_from_freelist+0xa25/0x36d0 mm/page_alloc.c:3312 __alloc_pages+0x22e/0x2420 mm/page_alloc.c:4568 alloc_pages_mpol+0x258/0x5f0 mm/mempolicy.c:2133 alloc_slab_page mm/slub.c:1870 [inline] allocate_slab mm/slub.c:2017 [inline] new_slab+0x283/0x3c0 mm/slub.c:2070 ___slab_alloc+0x979/0x1500 mm/slub.c:3223 __slab_alloc.constprop.0+0x56/0xa0 mm/slub.c:3322 __slab_alloc_node mm/slub.c:3375 [inline] slab_alloc_node mm/slub.c:3468 [inline] __kmem_cache_alloc_node+0x131/0x310 mm/slub.c:3517 __do_kmalloc_node mm/slab_common.c:1006 [inline] __kmalloc+0x49/0x90 mm/slab_common.c:1020 kmalloc include/linux/slab.h:604 [inline] kzalloc include/linux/slab.h:721 [inline] copy_splice_read+0x1ac/0x8f0 fs/splice.c:338 vfs_splice_read fs/splice.c:992 [inline] vfs_splice_read+0x2ea/0x3b0 fs/splice.c:962 splice_direct_to_actor+0x2a5/0xa30 fs/splice.c:1069 do_splice_direct+0x1af/0x280 fs/splice.c:1194 do_sendfile+0xb3e/0x1310 fs/read_write.c:1254 __do_sys_sendfile64 fs/read_write.c:1322 [inline] __se_sys_sendfile64 fs/read_write.c:1308 [inline] __x64_sys_sendfile64+0x1d6/0x220 fs/read_write.c:1308 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x40/0x110 arch/x86/entry/common.c:82 page last free stack trace: reset_page_owner include/linux/page_owner.h:24 [inline] free_pages_prepare mm/page_alloc.c:1137 [inline] free_unref_page_prepare+0x4fa/0xaa0 mm/page_alloc.c:2347 free_unref_page_list+0xe6/0xb40 mm/page_alloc.c:2533 release_pages+0x32a/0x14f0 mm/swap.c:1042 tlb_batch_pages_flush+0x9a/0x190 mm/mmu_gather.c:98 tlb_flush_mmu_free mm/mmu_gather.c:293 [inline] tlb_flush_mmu mm/mmu_gather.c:300 [inline] tlb_finish_mmu+0x14b/0x6f0 mm/mmu_gather.c:392 exit_mmap+0x38b/0xa70 mm/mmap.c:3321 __mmput+0x12a/0x4d0 kernel/fork.c:1349 mmput+0x62/0x70 kernel/fork.c:1371 exit_mm kernel/exit.c:567 [inline] do_exit+0x9ad/0x2ae0 kernel/exit.c:858 do_group_exit+0xd4/0x2a0 kernel/exit.c:1021 get_signal+0x23be/0x2790 kernel/signal.c:2904 arch_do_signal_or_restart+0x90/0x7f0 arch/x86/kernel/signal.c:309 exit_to_user_mode_loop kernel/entry/common.c:168 [inline] exit_to_user_mode_prepare+0x121/0x240 kernel/entry/common.c:204 irqentry_exit_to_user_mode+0xa/0x40 kernel/entry/common.c:309 asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:645 Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231205173250.2982846-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
dade3f6a |
|
18-Dec-2023 |
David Ahern <dsahern@kernel.org> |
net/ipv6: Revert remove expired routes with a separated list of routes This reverts commit 3dec89b14d37ee635e772636dad3f09f78f1ab87. The commit has some race conditions given how expires is managed on a fib6_info in relation to gc start, adding the entry to the gc list and setting the timer value leading to UAF. Revert the commit and try again in a later release. Fixes: 3dec89b14d37 ("net/ipv6: Remove expired routes with a separated list of routes") Cc: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: David Ahern <dsahern@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231219030243.25687-1-dsahern@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
8aae7625 |
|
30-Aug-2023 |
Florian Westphal <fw@strlen.de> |
net: fib: avoid warn splat in flow dissector New skbs allocated via nf_send_reset() have skb->dev == NULL. fib*_rules_early_flow_dissect helpers already have a 'struct net' argument but its not passed down to the flow dissector core, which will then WARN as it can't derive a net namespace to use: WARNING: CPU: 0 PID: 0 at net/core/flow_dissector.c:1016 __skb_flow_dissect+0xa91/0x1cd0 [..] ip_route_me_harder+0x143/0x330 nf_send_reset+0x17c/0x2d0 [nf_reject_ipv4] nft_reject_inet_eval+0xa9/0xf2 [nft_reject_inet] nft_do_chain+0x198/0x5d0 [nf_tables] nft_do_chain_inet+0xa4/0x110 [nf_tables] nf_hook_slow+0x41/0xc0 ip_local_deliver+0xce/0x110 .. Cc: Stanislav Fomichev <sdf@google.com> Cc: David Ahern <dsahern@kernel.org> Cc: Ido Schimmel <idosch@nvidia.com> Fixes: 812fa71f0d96 ("netfilter: Dissect flow after packet mangling") Link: https://bugzilla.kernel.org/show_bug.cgi?id=217826 Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230830110043.30497-1-fw@strlen.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
3dec89b1 |
|
15-Aug-2023 |
Kui-Feng Lee <thinker.li@gmail.com> |
net/ipv6: Remove expired routes with a separated list of routes. FIB6 GC walks trees of fib6_tables to remove expired routes. Walking a tree can be expensive if the number of routes in a table is big, even if most of them are permanent. Checking routes in a separated list of routes having expiration will avoid this potential issue. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8cdc3223 |
|
27-Mar-2023 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
ipv6: Remove in6addr_any alternatives. Some code defines the IPv6 wildcard address as a local variable and use it with memcmp() or ipv6_addr_equal(). Let's use in6addr_any and ipv6_addr_any() instead. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d288a162 |
|
23-Mar-2023 |
Wangyang Guo <wangyang.guo@intel.com> |
net: dst: Prevent false sharing vs. dst_entry:: __refcnt dst_entry::__refcnt is highly contended in scenarios where many connections happen from and to the same IP. The reference count is an atomic_t, so the reference count operations have to take the cache-line exclusive. Aside of the unavoidable reference count contention there is another significant problem which is caused by that: False sharing. perf top identified two affected read accesses. dst_entry::lwtstate and rtable::rt_genid. dst_entry:__refcnt is located at offset 64 of dst_entry, which puts it into a seperate cacheline vs. the read mostly members located at the beginning of the struct. That prevents false sharing vs. the struct members in the first 64 bytes of the structure, but there is also dst_entry::lwtstate which is located after the reference count and in the same cache line. This member is read after a reference count has been acquired. struct rtable embeds a struct dst_entry at offset 0. struct dst_entry has a size of 112 bytes, which means that the struct members of rtable which follow the dst member share the same cache line as dst_entry::__refcnt. Especially rtable::rt_genid is also read by the contexts which have a reference count acquired already. When dst_entry:__refcnt is incremented or decremented via an atomic operation these read accesses stall. This was found when analysing the memtier benchmark in 1:100 mode, which amplifies the problem extremly. Move the rt[6i]_uncached[_list] members out of struct rtable and struct rt6_info into struct dst_entry to provide padding and move the lwtstate member after that so it ends up in the same cache line. The resulting improvement depends on the micro-architecture and the number of CPUs. It ranges from +20% to +120% with a localhost memtier/memcached benchmark. [ tglx: Rearrange struct ] Signed-off-by: Wangyang Guo <wangyang.guo@intel.com> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230323102800.042297517@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
2d4feb2c |
|
10-Feb-2022 |
Eric Dumazet <edumazet@google.com> |
ipv6: get rid of net->ipv6.rt6_stats->fib_rt_uncache This counter has never been visible, there is little point trying to maintain it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d95d6320 |
|
16-Feb-2022 |
Eric Dumazet <edumazet@google.com> |
ipv6: fix data-race in fib6_info_hw_flags_set / fib6_purge_rt Because fib6_info_hw_flags_set() is called without any synchronization, all accesses to gi6->offload, fi->trap and fi->offload_failed need some basic protection like READ_ONCE()/WRITE_ONCE(). BUG: KCSAN: data-race in fib6_info_hw_flags_set / fib6_purge_rt read to 0xffff8881087d5886 of 1 bytes by task 13953 on cpu 0: fib6_drop_pcpu_from net/ipv6/ip6_fib.c:1007 [inline] fib6_purge_rt+0x4f/0x580 net/ipv6/ip6_fib.c:1033 fib6_del_route net/ipv6/ip6_fib.c:1983 [inline] fib6_del+0x696/0x890 net/ipv6/ip6_fib.c:2028 __ip6_del_rt net/ipv6/route.c:3876 [inline] ip6_del_rt+0x83/0x140 net/ipv6/route.c:3891 __ipv6_dev_ac_dec+0x2b5/0x370 net/ipv6/anycast.c:374 ipv6_dev_ac_dec net/ipv6/anycast.c:387 [inline] __ipv6_sock_ac_close+0x141/0x200 net/ipv6/anycast.c:207 ipv6_sock_ac_close+0x79/0x90 net/ipv6/anycast.c:220 inet6_release+0x32/0x50 net/ipv6/af_inet6.c:476 __sock_release net/socket.c:650 [inline] sock_close+0x6c/0x150 net/socket.c:1318 __fput+0x295/0x520 fs/file_table.c:280 ____fput+0x11/0x20 fs/file_table.c:313 task_work_run+0x8e/0x110 kernel/task_work.c:164 tracehook_notify_resume include/linux/tracehook.h:189 [inline] exit_to_user_mode_loop kernel/entry/common.c:175 [inline] exit_to_user_mode_prepare+0x160/0x190 kernel/entry/common.c:207 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:300 do_syscall_64+0x50/0xd0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae write to 0xffff8881087d5886 of 1 bytes by task 1912 on cpu 1: fib6_info_hw_flags_set+0x155/0x3b0 net/ipv6/route.c:6230 nsim_fib6_rt_hw_flags_set drivers/net/netdevsim/fib.c:668 [inline] nsim_fib6_rt_add drivers/net/netdevsim/fib.c:691 [inline] nsim_fib6_rt_insert drivers/net/netdevsim/fib.c:756 [inline] nsim_fib6_event drivers/net/netdevsim/fib.c:853 [inline] nsim_fib_event drivers/net/netdevsim/fib.c:886 [inline] nsim_fib_event_work+0x284f/0x2cf0 drivers/net/netdevsim/fib.c:1477 process_one_work+0x3f6/0x960 kernel/workqueue.c:2307 worker_thread+0x616/0xa70 kernel/workqueue.c:2454 kthread+0x2c7/0x2e0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 value changed: 0x22 -> 0x2a Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 1912 Comm: kworker/1:3 Not tainted 5.16.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: events nsim_fib_event_work Fixes: 0c5fcf9e249e ("IPv6: Add "offload failed" indication to routes") Fixes: bb3c4ab93e44 ("ipv6: Add "offload" and "trap" indications to routes") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Amit Cohen <amcohen@nvidia.com> Cc: Ido Schimmel <idosch@nvidia.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20220216173217.3792411-2-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
aafc2e32 |
|
20-Jan-2022 |
Eric Dumazet <edumazet@google.com> |
ipv6: annotate accesses to fn->fn_sernum struct fib6_node's fn_sernum field can be read while other threads change it. Add READ_ONCE()/WRITE_ONCE() annotations. Do not change existing smp barriers in fib6_get_cookie_safe() and __fib6_update_sernum_upto_root() syzbot reported: BUG: KCSAN: data-race in fib6_clean_node / inet6_csk_route_socket write to 0xffff88813df62e2c of 4 bytes by task 1920 on cpu 1: fib6_clean_node+0xc2/0x260 net/ipv6/ip6_fib.c:2178 fib6_walk_continue+0x38e/0x430 net/ipv6/ip6_fib.c:2112 fib6_walk net/ipv6/ip6_fib.c:2160 [inline] fib6_clean_tree net/ipv6/ip6_fib.c:2240 [inline] __fib6_clean_all+0x1a9/0x2e0 net/ipv6/ip6_fib.c:2256 fib6_flush_trees+0x6c/0x80 net/ipv6/ip6_fib.c:2281 rt_genid_bump_ipv6 include/net/net_namespace.h:488 [inline] addrconf_dad_completed+0x57f/0x870 net/ipv6/addrconf.c:4230 addrconf_dad_work+0x908/0x1170 process_one_work+0x3f6/0x960 kernel/workqueue.c:2307 worker_thread+0x616/0xa70 kernel/workqueue.c:2454 kthread+0x1bf/0x1e0 kernel/kthread.c:359 ret_from_fork+0x1f/0x30 read to 0xffff88813df62e2c of 4 bytes by task 15701 on cpu 0: fib6_get_cookie_safe include/net/ip6_fib.h:285 [inline] rt6_get_cookie include/net/ip6_fib.h:306 [inline] ip6_dst_store include/net/ip6_route.h:234 [inline] inet6_csk_route_socket+0x352/0x3c0 net/ipv6/inet6_connection_sock.c:109 inet6_csk_xmit+0x91/0x1e0 net/ipv6/inet6_connection_sock.c:121 __tcp_transmit_skb+0x1323/0x1840 net/ipv4/tcp_output.c:1402 tcp_transmit_skb net/ipv4/tcp_output.c:1420 [inline] tcp_write_xmit+0x1450/0x4460 net/ipv4/tcp_output.c:2680 __tcp_push_pending_frames+0x68/0x1c0 net/ipv4/tcp_output.c:2864 tcp_push+0x2d9/0x2f0 net/ipv4/tcp.c:725 mptcp_push_release net/mptcp/protocol.c:1491 [inline] __mptcp_push_pending+0x46c/0x490 net/mptcp/protocol.c:1578 mptcp_sendmsg+0x9ec/0xa50 net/mptcp/protocol.c:1764 inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:643 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] kernel_sendmsg+0x97/0xd0 net/socket.c:745 sock_no_sendpage+0x84/0xb0 net/core/sock.c:3086 inet_sendpage+0x9d/0xc0 net/ipv4/af_inet.c:834 kernel_sendpage+0x187/0x200 net/socket.c:3492 sock_sendpage+0x5a/0x70 net/socket.c:1007 pipe_to_sendpage+0x128/0x160 fs/splice.c:364 splice_from_pipe_feed fs/splice.c:418 [inline] __splice_from_pipe+0x207/0x500 fs/splice.c:562 splice_from_pipe fs/splice.c:597 [inline] generic_splice_sendpage+0x94/0xd0 fs/splice.c:746 do_splice_from fs/splice.c:767 [inline] direct_splice_actor+0x80/0xa0 fs/splice.c:936 splice_direct_to_actor+0x345/0x650 fs/splice.c:891 do_splice_direct+0x106/0x190 fs/splice.c:979 do_sendfile+0x675/0xc40 fs/read_write.c:1245 __do_sys_sendfile64 fs/read_write.c:1310 [inline] __se_sys_sendfile64 fs/read_write.c:1296 [inline] __x64_sys_sendfile64+0x102/0x140 fs/read_write.c:1296 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x0000026f -> 0x00000271 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 15701 Comm: syz-executor.2 Not tainted 5.16.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 The Fixes tag I chose is probably arbitrary, I do not think we need to backport this patch to older kernels. Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20220120174112.1126644-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
3b80b73a |
|
29-Dec-2021 |
Jakub Kicinski <kuba@kernel.org> |
net: Add includes masked by netdevice.h including uapi/bpf.h Add missing includes unmasked by the subsequent change. Mostly network drivers missing an include for XDP_PACKET_HEADROOM. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211230012742.770642-2-kuba@kernel.org
|
#
8837cbbf |
|
22-Nov-2021 |
Nikolay Aleksandrov <nikolay@nvidia.com> |
net: ipv6: add fib6_nh_release_dsts stub We need a way to release a fib6_nh's per-cpu dsts when replacing nexthops otherwise we can end up with stale per-cpu dsts which hold net device references, so add a new IPv6 stub called fib6_nh_release_dsts. It must be used after an RCU grace period, so no new dsts can be created through a group's nexthop entry. Similar to fib6_nh_release it shouldn't be used if fib6_nh_init has failed so it doesn't need a dummy stub when IPv6 is not enabled. Fixes: 7bf4796dd099 ("nexthops: add support for replace") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
446e7f21 |
|
22-Aug-2021 |
zhang kai <zhangkaiheb@126.com> |
ipv6: correct comments about fib6_node sernum correct comments in set and get fn_sernum Signed-off-by: zhang kai <zhangkaiheb@126.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0c5fcf9e |
|
07-Feb-2021 |
Amit Cohen <amcohen@nvidia.com> |
IPv6: Add "offload failed" indication to routes After installing a route to the kernel, user space receives an acknowledgment, which means the route was installed in the kernel, but not necessarily in hardware. The asynchronous nature of route installation in hardware can lead to a routing daemon advertising a route before it was actually installed in hardware. This can result in packet loss or mis-routed packets until the route is installed in hardware. To avoid such cases, previous patch set added the ability to emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags are changed, this behavior is controlled by sysctl. With the above mentioned behavior, it is possible to know from user-space if the route was offloaded, but if the offload fails there is no indication to user-space. Following a failure, a routing daemon will wait indefinitely for a notification that will never come. This patch adds an "offload_failed" indication to IPv6 routes, so that users will have better visibility into the offload process. 'struct fib6_info' is extended with new field that indicates if route offload failed. Note that the new field is added using unused bit and therefore there is no need to increase struct size. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
907eea48 |
|
01-Feb-2021 |
Amit Cohen <amcohen@nvidia.com> |
net: ipv6: Emit notification when fib hardware flags are changed After installing a route to the kernel, user space receives an acknowledgment, which means the route was installed in the kernel, but not necessarily in hardware. The asynchronous nature of route installation in hardware can lead to a routing daemon advertising a route before it was actually installed in hardware. This can result in packet loss or mis-routed packets until the route is installed in hardware. It is also possible for a route already installed in hardware to change its action and therefore its flags. For example, a host route that is trapping packets can be "promoted" to perform decapsulation following the installation of an IPinIP/VXLAN tunnel. Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags are changed. The aim is to provide an indication to user-space (e.g., routing daemons) about the state of the route in hardware. Introduce a sysctl that controls this behavior. Keep the default value at 0 (i.e., do not emit notifications) for several reasons: - Multiple RTM_NEWROUTE notification per-route might confuse existing routing daemons. - Convergence reasons in routing daemons. - The extra notifications will negatively impact the insertion rate. - Not all users are interested in these notifications. Move fib6_info_hw_flags_set() to C file because it is no longer a short function. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
fbaca8f8 |
|
01-Feb-2021 |
Amit Cohen <amcohen@nvidia.com> |
net: Pass 'net' struct as first argument to fib6_info_hw_flags_set() The next patch will emit notification when hardware flags are changed, in case that fib_notify_on_flag_change sysctl is set to 1. To know sysctl values, net struct is needed. This change is consistent with the IPv4 version, which gets 'net' struct as its first argument. Currently, the only callers of this function are mlxsw and netdevsim. Patch the callers to pass net. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
4b48b0a3 |
|
15-Jul-2020 |
Randy Dunlap <rdunlap@infradead.org> |
net: ip6_fib.h: drop duplicate word in comment Drop doubled word "the" in a comment. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
55cced4f |
|
23-Jun-2020 |
Brian Vazquez <brianvv@google.com> |
ipv6: fib6: avoid indirect calls from fib6_rule_lookup It was reported that a considerable amount of cycles were spent on the expensive indirect calls on fib6_rule_lookup. This patch introduces an inline helper called pol_route_func that uses the indirect_call_wrappers to avoid the indirect calls. This patch saves around 50ns per call. Performance was measured on the receiver by checking the amount of syncookies that server was able to generate under a synflood load. Traffic was generated using trafgen[1] which was pushing around 1Mpps on a single queue. Receiver was using only one rx queue which help to create a bottle neck and make the experiment rx-bounded. These are the syncookies generated over 10s from the different runs: Whithout the patch: TcpExtSyncookiesSent 3553749 0.0 TcpExtSyncookiesSent 3550895 0.0 TcpExtSyncookiesSent 3553845 0.0 TcpExtSyncookiesSent 3541050 0.0 TcpExtSyncookiesSent 3539921 0.0 TcpExtSyncookiesSent 3557659 0.0 TcpExtSyncookiesSent 3526812 0.0 TcpExtSyncookiesSent 3536121 0.0 TcpExtSyncookiesSent 3529963 0.0 TcpExtSyncookiesSent 3536319 0.0 With the patch: TcpExtSyncookiesSent 3611786 0.0 TcpExtSyncookiesSent 3596682 0.0 TcpExtSyncookiesSent 3606878 0.0 TcpExtSyncookiesSent 3599564 0.0 TcpExtSyncookiesSent 3601304 0.0 TcpExtSyncookiesSent 3609249 0.0 TcpExtSyncookiesSent 3617437 0.0 TcpExtSyncookiesSent 3608765 0.0 TcpExtSyncookiesSent 3620205 0.0 TcpExtSyncookiesSent 3601895 0.0 Without the patch the average is 354263 pkt/s or 2822 ns/pkt and with the patch the average is 360738 pkt/s or 2772 ns/pkt which gives an estimate of 50 ns per packet. [1] http://netsniff-ng.org/ Changelog since v1: - Change ordering in the ICW (Paolo Abeni) Cc: Luigi Rizzo <lrizzo@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Brian Vazquez <brianvv@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
38428d68 |
|
21-May-2020 |
Roopa Prabhu <roopa@cumulusnetworks.com> |
nexthop: support for fdb ecmp nexthops This patch introduces ecmp nexthops and nexthop groups for mac fdb entries. In subsequent patches this is used by the vxlan driver fdb entries. The use case is E-VPN multihoming [1,2,3] which requires bridged vxlan traffic to be load balanced to remote switches (vteps) belonging to the same multi-homed ethernet segment (This is analogous to a multi-homed LAG but over vxlan). Changes include new nexthop flag NHA_FDB for nexthops referenced by fdb entries. These nexthops only have ip. This patch includes appropriate checks to avoid routes referencing such nexthops. example: $ip nexthop add id 12 via 172.16.1.2 fdb $ip nexthop add id 13 via 172.16.1.3 fdb $ip nexthop add id 102 group 12/13 fdb $bridge fdb add 02:02:00:00:00:13 dev vxlan1000 nhid 101 self [1] E-VPN https://tools.ietf.org/html/rfc7432 [2] E-VPN VxLAN: https://tools.ietf.org/html/rfc8365 [3] LPC talk with mention of nexthop groups for L2 ecmp http://vger.kernel.org/lpc_net2018_talks/scaling_bridge_fdb_database_slidesV3.pdf v4 - fixed uninitialized variable reported by kernel test robot Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3c32cc1b |
|
13-May-2020 |
Yonghong Song <yhs@fb.com> |
bpf: Enable bpf_iter targets registering ctx argument types Commit b121b341e598 ("bpf: Add PTR_TO_BTF_ID_OR_NULL support") adds a field btf_id_or_null_non0_off to bpf_prog->aux structure to indicate that the first ctx argument is PTR_TO_BTF_ID reg_type and all others are PTR_TO_BTF_ID_OR_NULL. This approach does not really scale if we have other different reg types in the future, e.g., a pointer to a buffer. This patch enables bpf_iter targets registering ctx argument reg types which may be different from the default one. For example, for pointers to structures, the default reg_type is PTR_TO_BTF_ID for tracing program. The target can register a particular pointer type as PTR_TO_BTF_ID_OR_NULL which can be used by the verifier to enforce accesses. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200513180221.2949882-1-yhs@fb.com
|
#
8f34e53b |
|
01-May-2020 |
David Ahern <dsahern@kernel.org> |
ipv6: Use global sernum for dst validation with nexthop objects Nik reported a bug with pcpu dst cache when nexthop objects are used illustrated by the following: $ ip netns add foo $ ip -netns foo li set lo up $ ip -netns foo addr add 2001:db8:11::1/128 dev lo $ ip netns exec foo sysctl net.ipv6.conf.all.forwarding=1 $ ip li add veth1 type veth peer name veth2 $ ip li set veth1 up $ ip addr add 2001:db8:10::1/64 dev veth1 $ ip li set dev veth2 netns foo $ ip -netns foo li set veth2 up $ ip -netns foo addr add 2001:db8:10::2/64 dev veth2 $ ip -6 nexthop add id 100 via 2001:db8:10::2 dev veth1 $ ip -6 route add 2001:db8:11::1/128 nhid 100 Create a pcpu entry on cpu 0: $ taskset -a -c 0 ip -6 route get 2001:db8:11::1 Re-add the route entry: $ ip -6 ro del 2001:db8:11::1 $ ip -6 route add 2001:db8:11::1/128 nhid 100 Route get on cpu 0 returns the stale pcpu: $ taskset -a -c 0 ip -6 route get 2001:db8:11::1 RTNETLINK answers: Network is unreachable While cpu 1 works: $ taskset -a -c 1 ip -6 route get 2001:db8:11::1 2001:db8:11::1 from :: via 2001:db8:10::2 dev veth1 src 2001:db8:10::1 metric 1024 pref medium Conversion of FIB entries to work with external nexthop objects missed an important difference between IPv4 and IPv6 - how dst entries are invalidated when the FIB changes. IPv4 has a per-network namespace generation id (rt_genid) that is bumped on changes to the FIB. Checking if a dst_entry is still valid means comparing rt_genid in the rtable to the current value of rt_genid for the namespace. IPv6 also has a per network namespace counter, fib6_sernum, but the count is saved per fib6_node. With the per-node counter only dst_entries based on fib entries under the node are invalidated when changes are made to the routes - limiting the scope of invalidations. IPv6 uses a reference in the rt6_info, 'from', to track the corresponding fib entry used to create the dst_entry. When validating a dst_entry, the 'from' is used to backtrack to the fib6_node and check the sernum of it to the cookie passed to the dst_check operation. With the inline format (nexthop definition inline with the fib6_info), dst_entries cached in the fib6_nh have a 1:1 correlation between fib entries, nexthop data and dst_entries. With external nexthops, IPv6 looks more like IPv4 which means multiple fib entries across disparate fib6_nodes can all reference the same fib6_nh. That means validation of dst_entries based on external nexthops needs to use the IPv4 format - the per-network namespace counter. Add sernum to rt6_info and set it when creating a pcpu dst entry. Update rt6_get_cookie to return sernum if it is set and update dst_check for IPv6 to look for sernum set and based the check on it if so. Finally, rt6_get_pcpu_route needs to validate the cached entry before returning a pcpu entry (similar to the rt_cache_valid calls in __mkroute_input and __mkroute_output for IPv4). This problem only affects routes using the new, external nexthops. Thanks to the kbuild test robot for catching the IS_ENABLED needed around rt_genid_ipv6 before I sent this out. Fixes: 5b98324ebe29 ("ipv6: Allow routes to use nexthop objects") Reported-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David Ahern <dsahern@kernel.org> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Tested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
af13b3c3 |
|
23-Mar-2020 |
David Laight <David.Laight@ACULAB.COM> |
Remove DST_HOST Previous changes to the IP routing code have removed all the tests for the DS_HOST route flag. Remove the flags and all the code that sets it. Signed-off-by: David Laight <david.laight@aculab.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6e68f499 |
|
02-Mar-2020 |
Gustavo A. R. Silva <gustavo@embeddedor.com> |
net: ip6_fib: Replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bb3c4ab9 |
|
14-Jan-2020 |
Ido Schimmel <idosch@mellanox.com> |
ipv6: Add "offload" and "trap" indications to routes In a similar fashion to previous patch, add "offload" and "trap" indication to IPv6 routes. This is done by using two unused bits in 'struct fib6_info' to hold these indications. Capable drivers are expected to set these when processing the various in-kernel route notifications. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d2f0c9b1 |
|
23-Dec-2019 |
Ido Schimmel <idosch@mellanox.com> |
ipv6: Handle route deletion notification For the purpose of route offload, when a single route is deleted, it is only of interest if it is the first route in the node or if it is sibling to such a route. In the first case, distinguish between several possibilities: 1. Route is the last route in the node. Emit a delete notification 2. Route is followed by a non-multipath route. Emit a replace notification for the non-multipath route. 3. Route is followed by a multipath route. Emit a replace notification for the multipath route. In the second case, only emit a delete notification to ensure the route is no longer used as a valid nexthop. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b9b33e7c |
|
20-Nov-2019 |
Paolo Abeni <pabeni@redhat.com> |
ipv6: keep track of routes using src Use a per namespace counter, increment it on successful creation of any route using the source address, decrement it on deletion of such routes. This allows us to check easily if the routing decision in the current namespace depends on the packet source. Will be used by the next patch. Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1f8ac570 |
|
20-Nov-2019 |
Paolo Abeni <pabeni@redhat.com> |
ipv6: add fib6_has_custom_rules() helper It wraps the namespace field with the same name, to easily access it regardless of build options. Suggested-by: David Ahern <dsahern@gmail.com> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b7a59557 |
|
03-Oct-2019 |
Jiri Pirko <jiri@mellanox.com> |
net: fib_notifier: propagate extack down to the notifier block callback Since errors are propagated all the way up to the caller, propagate possible extack of the caller all the way down to the notifier block callback. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7c550daf |
|
03-Oct-2019 |
Jiri Pirko <jiri@mellanox.com> |
net: fib_notifier: make FIB notifier per-netns Currently all users of FIB notifier only cares about events in init_net. Later in this patchset, users get interested in other namespaces too. However, for every registered block user is interested only about one namespace. Make the FIB notifier registration per-netns and avoid unnecessary calls of notifier block for other namespaces. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1e47b483 |
|
21-Jun-2019 |
Stefano Brivio <sbrivio@redhat.com> |
ipv6: Dump route exceptions if requested Since commit 2b760fcf5cfb ("ipv6: hook up exception table to store dst cache"), route exceptions reside in a separate hash table, and won't be found by walking the FIB, so they won't be dumped to userspace on a RTM_GETROUTE message. This causes 'ip -6 route list cache' and 'ip -6 route flush cache' to have no function anymore: # ip -6 route get fc00:3::1 fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 539sec mtu 1400 pref medium # ip -6 route get fc00:4::1 fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 536sec mtu 1500 pref medium # ip -6 route list cache # ip -6 route flush cache # ip -6 route get fc00:3::1 fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 520sec mtu 1400 pref medium # ip -6 route get fc00:4::1 fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 519sec mtu 1500 pref medium because iproute2 lists cached routes using RTM_GETROUTE, and flushes them by listing all the routes, and deleting them with RTM_DELROUTE one by one. If cached routes are requested using the RTM_F_CLONED flag together with strict checking, or if no strict checking is requested (and hence we can't consistently apply filters), look up exceptions in the hash table associated with the current fib6_info in rt6_dump_route(), and, if present and not expired, add them to the dump. We might be unable to dump all the entries for a given node in a single message, so keep track of how many entries were handled for the current node in fib6_walker, and skip that amount in case we start from the same partially dumped node. When a partial dump restarts, as the starting node might change when 'sernum' changes, we have no guarantee that we need to skip the same amount of in-node entries. Therefore, we need two counters, and we need to zero the in-node counter if the node from which the dump is resumed differs. Note that, with the current version of iproute2, this only fixes the 'ip -6 route list cache': on a flush command, iproute2 doesn't pass RTM_F_CLONED and, due to this inconsistency, 'ip -6 route flush cache' is still unable to fetch the routes to be flushed. This will be addressed in a patch for iproute2. To flush cached routes, a procfs entry could be introduced instead: that's how it works for IPv4. We already have a rt6_flush_exception() function ready to be wired to it. However, this would not solve the issue for listing. Versions of iproute2 and kernel tested: iproute2 kernel 4.14.0 4.15.0 4.19.0 5.0.0 5.1.0 5.1.0, patched 3.18 list + + + + + + flush + + + + + + 4.4 list + + + + + + flush + + + + + + 4.9 list + + + + + + flush + + + + + + 4.14 list + + + + + + flush + + + + + + 4.15 list flush 4.19 list flush 5.0 list flush 5.1 list flush with list + + + + + + fix flush + + + + v7: - Explain usage of "skip" counters in commit message (suggested by David Ahern) v6: - Rebase onto net-next, use recently introduced nexthop walker - Make rt6_nh_dump_exceptions() a separate function (suggested by David Ahern) v5: - Use dump_routes and dump_exceptions from filter, ignore NLM_F_MATCH, update test results (flushing works with iproute2 < 5.0.0 now) v4: - Split NLM_F_MATCH and strict check handling in separate patches - Filter routes using RTM_F_CLONED: if it's not set, only return non-cached routes, and if it's set, only return cached routes: change requested by David Ahern and Martin Lau. This implies that iproute2 needs a separate patch to be able to flush IPv6 cached routes. This is not ideal because we can't fix the breakage caused by 2b760fcf5cfb entirely in kernel. However, two years have passed since then, and this makes it more tolerable v3: - More descriptive comment about expired exceptions in rt6_dump_route() - Swap return values of rt6_dump_route() (suggested by Martin Lau) - Don't zero skip_in_node in case we don't dump anything in a given pass (also suggested by Martin Lau) - Remove check on RTM_F_CLONED altogether: in the current UAPI semantic, it's just a flag to indicate the route was cloned, not to filter on routes v2: Add tracking of number of entries to be skipped in current node after a partial dump. As we restart from the same node, if not all the exceptions for a given node fit in a single message, the dump will not terminate, as suggested by Martin Lau. This is a concrete possibility, setting up a big number of exceptions for the same route actually causes the issue, suggested by David Ahern. Reported-by: Jianlin Shi <jishi@redhat.com> Fixes: 2b760fcf5cfb ("ipv6: hook up exception table to store dst cache") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d5382fef |
|
18-Jun-2019 |
Ido Schimmel <idosch@mellanox.com> |
ipv6: Stop sending in-kernel notifications for each nexthop Both listeners - mlxsw and netdevsim - of IPv6 FIB notifications are now ready to handle IPv6 multipath notifications. Therefore, stop ignoring such notifications in both drivers and stop sending notification for each added / deleted nexthop. v2: * Remove 'multipath_rt' from 'struct fib6_entry_notifier_info' Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d4b96c7b |
|
18-Jun-2019 |
Ido Schimmel <idosch@mellanox.com> |
ipv6: Extend notifier info for multipath routes Extend the IPv6 FIB notifier info with number of sibling routes being notified. This will later allow listeners to process one notification for a multipath routes instead of N, where N is the number of nexthops. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5b98324e |
|
08-Jun-2019 |
David Ahern <dsahern@gmail.com> |
ipv6: Allow routes to use nexthop objects Add support for RTA_NH_ID attribute to allow a user to specify a nexthop id to use with a route. fc_nh_id is added to fib6_config to hold the value passed in the RTA_NH_ID attribute. If a nexthop id is given, the gateway, device, encap and multipath attributes can not be set. Update ip6_route_del to check metric and protocol before nexthop specs. If fc_nh_id is set, then it must match the id in the route entry. Since IPv6 allows delete of a cached entry (an exception), add ip6_del_cached_rt_nh to cycle through all of the fib6_nh in a fib entry if it is using a nexthop. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b7999b07 |
|
02-Jun-2019 |
Xin Long <lucien.xin@gmail.com> |
ipv6: fix the check before getting the cookie in rt6_get_cookie In Jianlin's testing, netperf was broken with 'Connection reset by peer', as the cookie check failed in rt6_check() and ip6_dst_check() always returned NULL. It's caused by Commit 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes"), where the cookie can be got only when 'c1'(see below) for setting dst_cookie whereas rt6_check() is called when !'c1' for checking dst_cookie, as we can see in ip6_dst_check(). Since in ip6_dst_check() both rt6_dst_from_check() (c1) and rt6_check() (!c1) will check the 'from' cookie, this patch is to remove the c1 check in rt6_get_cookie(), so that the dst_cookie can always be set properly. c1: (rt->rt6i_flags & RTF_PCPU || unlikely(!list_empty(&rt->rt6i_uncached))) Fixes: 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes") Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f88d8ea6 |
|
03-Jun-2019 |
David Ahern <dsahern@gmail.com> |
ipv6: Plumb support for nexthop object in a fib6_info Add struct nexthop and nh_list list_head to fib6_info. nh_list is the fib6_info side of the nexthop <-> fib_info relationship. Since a fib6_info referencing a nexthop object can not have 'sibling' entries (the old way of doing multipath routes), the nh_list is a union with fib6_siblings. Add f6i_list list_head to 'struct nexthop' to track fib6_info entries using a nexthop instance. Update __remove_nexthop_fib to walk f6_list and delete fib entries using the nexthop. Add a few nexthop helpers for use when a nexthop is added to fib6_info: - nexthop_fib6_nh - return first fib6_nh in a nexthop object - fib6_info_nh_dev moved to nexthop.h and updated to use nexthop_fib6_nh if the fib6_info references a nexthop object - nexthop_path_fib6_result - similar to ipv4, select a path within a multipath nexthop object. If the nexthop is a blackhole, set fib6_result type to RTN_BLACKHOLE, and set the REJECT flag Update the fib6_info references to check for nh and take a different path as needed: - rt6_qualify_for_ecmp - if a fib entry uses a nexthop object it can NOT be coalesced with other fib entries into a multipath route - rt6_duplicate_nexthop - use nexthop_cmp if either fib6_info references a nexthop - addrconf (host routes), RA's and info entries (anything configured via ndisc) does not use nexthop objects - fib6_info_destroy_rcu - put reference to nexthop object - fib6_purge_rt - drop fib6_info from f6i_list - fib6_select_path - update to use the new nexthop_path_fib6_result when fib entry uses a nexthop object - rt6_device_match - update to catch use of nexthop object as a blackhole and set fib6_type and flags. - ip6_route_info_create - don't add space for fib6_nh if fib entry is going to reference a nexthop object, take a reference to nexthop object, disallow use of source routing - rt6_nlmsg_size - add space for RTA_NH_ID - add rt6_fill_node_nexthop to add nexthop data on a dump As with ipv4, most of the changes push existing code into the else branch of whether the fib entry uses a nexthop object. Update the nexthop code to walk f6i_list on a nexthop deleted to remove fib entries referencing it. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2874c5fd |
|
27-May-2019 |
Thomas Gleixner <tglx@linutronix.de> |
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
1cf844c7 |
|
22-May-2019 |
David Ahern <dsahern@gmail.com> |
ipv6: Make fib6_nh optional at the end of fib6_info Move fib6_nh to the end of fib6_info and make it an array of size 0. Pass a flag to fib6_info_alloc indicating if the allocation needs to add space for a fib6_nh. The current code path always has a fib6_nh allocated with a fib6_info; with nexthop objects they will be separate. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cc5c073a |
|
22-May-2019 |
David Ahern <dsahern@gmail.com> |
ipv6: Move exception bucket to fib6_nh Similar to the pcpu routes exceptions are really per nexthop, so move rt6i_exception_bucket from fib6_info to fib6_nh. To avoid additional increases to the size of fib6_nh for a 1-bit flag, use the lowest bit in the allocated memory pointer for the flushed flag. Add helpers for retrieving the bucket pointer to mask off the flag. The cleanup of the exception bucket is moved to fib6_nh_release. fib6_nh_flush_exceptions can now be called from 2 contexts: 1. deleting a fib entry 2. deleting a fib6_nh For 1., fib6_nh_flush_exceptions is called for a specific fib6_info that is getting deleted. All exceptions in the cache using the entry are deleted. For 2, the fib6_nh itself is getting destroyed so fib6_nh_flush_exceptions is called for a NULL fib6_info which means flush all entries. The pmtu.sh selftest exercises the affected code paths - from creating exceptions to cleaning them up on device delete. All tests pass without any rcu locking or memleak warnings. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f40b6ae2 |
|
22-May-2019 |
David Ahern <dsahern@gmail.com> |
ipv6: Move pcpu cached routes to fib6_nh rt6_info are specific instances of a fib entry and are tied to a device and gateway - ie., a nexthop. Before nexthop objects, IPv6 fib entries have separate fib6_info for each nexthop in a multipath route, so the location of the pcpu cache in the fib6_info struct worked. However, with nexthop objects a fib6_info can point to a set of nexthops (yet another alignment of ipv6 with ipv4). Accordingly, the pcpu cache needs to be moved to the fib6_nh struct so the cached entries are local to the nexthop specification used to create the rt6_info. Initialization and free of the pcpu entries moved to fib6_nh_init and fib6_nh_release. Change in location only, from fib6_info down to fib6_nh; no other functional change intended. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
19a3b7ee |
|
22-May-2019 |
David Ahern <dsahern@gmail.com> |
ipv6: export function to send route updates Add fib6_rt_update to send RTM_NEWROUTE with NLM_F_REPLACE set. This helper will be used by the nexthop code to notify userspace of routes that are impacted when a nexthop config is updated via replace. This notification is needed for legacy apps that do not understand the new nexthop object. Apps that are nexthop aware can use the RTA_NH_ID attribute in the route notification to just ignore it. In the future this should be wrapped in a sysctl to allow OS'es that are fully updated to avoid the notificaton storm. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cdaa16a4 |
|
22-May-2019 |
David Ahern <dsahern@gmail.com> |
ipv6: Add hook to bump sernum for a route to stubs Add hook to ipv6 stub to bump the sernum up to the root node for a route. This is needed by the nexthop code when a nexthop config changes. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
61fb0d01 |
|
15-May-2019 |
Eric Dumazet <edumazet@google.com> |
ipv6: prevent possible fib6 leaks At ipv6 route dismantle, fib6_drop_pcpu_from() is responsible for finding all percpu routes and set their ->from pointer to NULL, so that fib6_ref can reach its expected value (1). The problem right now is that other cpus can still catch the route being deleted, since there is no rcu grace period between the route deletion and call to fib6_drop_pcpu_from() This can leak the fib6 and associated resources, since no notifier will take care of removing the last reference(s). I decided to add another boolean (fib6_destroying) instead of reusing/renaming exception_bucket_flushed to ease stable backports, and properly document the memory barriers used to implement this fix. This patch has been co-developped with Wei Wang. Fixes: 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Wei Wang <weiwan@google.com> Cc: David Ahern <dsahern@gmail.com> Cc: Martin Lau <kafai@fb.com> Acked-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a65120ba |
|
23-Apr-2019 |
David Ahern <dsahern@gmail.com> |
ipv6: Use result arg in fib_lookup_arg consistently arg.result is sometimes used as fib6_result and sometimes used to hold the rt6_info. Add rt6_info to fib6_result and make the use of arg.result consistent through ipv6 rules. The rt6 entry is filled in for lookups returning a dst_entry, but not for direct fib_lookups that just want a fib6_info. Fixes: effda4dd97e8 ("ipv6: Pass fib6_result to fib lookups") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f05713e0 |
|
22-Apr-2019 |
Eric Dumazet <edumazet@google.com> |
ipv6: convert fib6_ref to refcount_t We suspect some issues involving fib6_ref 0 -> 1 transitions might cause strange syzbot reports. Lets convert fib6_ref to refcount_t to catch them earlier. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Wei Wang <weiwan@google.com> Acked-by: Wei Wang <weiwan@google.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7e5f4cdb |
|
20-Apr-2019 |
David Ahern <dsahern@gmail.com> |
ipv6: Remove fib6_info_nh_lwt fib6_info_nh_lwt is no longer used; remove it. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7d21fec9 |
|
16-Apr-2019 |
David Ahern <dsahern@gmail.com> |
ipv6: Add fib6_type and fib6_flags to fib6_result Add the fib6_flags and fib6_type to fib6_result. Update the lookup helpers to set them and update post fib lookup users to use the version from the result. This allows nexthop objects to have blackhole nexthop. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
effda4dd |
|
16-Apr-2019 |
David Ahern <dsahern@gmail.com> |
ipv6: Pass fib6_result to fib lookups Change fib6_lookup and fib6_table_lookup to take a fib6_result and set f6i and nh rather than returning a fib6_info. For now both always return 0. A later patch set can make these more like the IPv4 counterparts and return EINVAL, EACCESS, etc based on fib6_type. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b1d40991 |
|
16-Apr-2019 |
David Ahern <dsahern@gmail.com> |
ipv6: Rename fib6_multipath_select and pass fib6_result Add 'struct fib6_result' to hold the fib entry and fib6_nh from a fib lookup as separate entries, similar to what IPv4 now has with fib_result. Rename fib6_multipath_select to fib6_select_path, pass fib6_result to it, and set f6i and nh in the result once a path selection is done. Call fib6_select_path unconditionally for path selection which means moving the sibling and oif check to fib6_select_path. To handle the two different call paths (2 only call multipath_select if flowi6_oif == 0 and the other always calls it), add a new have_oif_match that controls the sibling walk if relevant. Update callers of fib6_multipath_select accordingly and have them use the fib6_info and fib6_nh from the result. This is needed for multipath nexthop objects where a single f6i can point to multiple fib6_nh (similar to IPv4). Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cc3a86c8 |
|
09-Apr-2019 |
David Ahern <dsahern@gmail.com> |
ipv6: Change rt6_probe to take a fib6_nh rt6_probe sends probes for gateways in a nexthop. As such it really depends on a fib6_nh, not a fib entry. Move last_probe to fib6_nh and update rt6_probe to a fib6_nh struct. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f1741730 |
|
27-Mar-2019 |
David Ahern <dsahern@gmail.com> |
net: Add fib_nh_common and update fib_nh and fib6_nh Add fib_nh_common struct with common nexthop attributes. Convert fib_nh and fib6_nh to use it. Use macros to move existing fib_nh_* references to the new nh_common.nhc_*. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ad1601ae |
|
27-Mar-2019 |
David Ahern <dsahern@gmail.com> |
ipv6: Rename fib6_nh entries Rename fib6_nh entries that will be moved to a fib_nh_common struct. Specifically, the device, gateway, flags, and lwtstate are common with all nexthop definitions. In some places new temporary variables are declared or local variables renamed to maintain line lengths. Rename only; no functional change intended. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2b2450ca |
|
27-Mar-2019 |
David Ahern <dsahern@gmail.com> |
ipv6: Move gateway checks to a fib6_nh setting The gateway setting is not per fib6_info entry but per-fib6_nh. Add a new fib_nh_has_gw flag to fib6_nh and convert references to RTF_GATEWAY to the new flag. For IPv6 address the flag is cheaper than checking that nh_gw is non-0 like IPv4 does. While this increases fib6_nh by 8-bytes, the effective allocation size of a fib6_info is unchanged. The 8 bytes is recovered later with a fib_nh_common change. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dac7d0f2 |
|
27-Mar-2019 |
David Ahern <dsahern@gmail.com> |
ipv6: Create cleanup helper for fib6_nh Move the fib6_nh cleanup code to a new helper, fib6_nh_release. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
83c44251 |
|
27-Mar-2019 |
David Ahern <dsahern@gmail.com> |
ipv6: Create init helper for fib6_nh Similar to IPv4, consolidate the fib6_nh initialization into a helper. As a new standalone function, add a cleanup path to put lwtstate on error. To avoid modifying fib6_config flags, move the reject check to a helper that is invoked once by fib6_nh_init to reset the device and then again in ip6_route_info_create to set the fib6_flags. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c7a1ce39 |
|
21-Mar-2019 |
David Ahern <dsahern@gmail.com> |
ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create Change addrconf_f6i_alloc to generate a fib6_config and call ip6_route_info_create. addrconf_f6i_alloc is the last caller to fib6_info_alloc besides ip6_route_info_create, and there is no reason for it to do its own initialization on a fib6_info. Host routes need to be created even if the device is down, so add a new flag, fc_ignore_dev_down, to fib6_config and update fib6_nh_init to not error out if device is not up. Notes on the conversion: - ip_fib_metrics_init is the same as fib6_config has fc_mx set to NULL and fc_mx_len set to 0 - dst_nocount is handled by the RTF_ADDRCONF flag - dst_host is handled by fc_dst_len = 128 nh_gw does not get set after the conversion to ip6_route_info_create but it should not be set in addrconf_f6i_alloc since this is a host route not a gateway route. Everything else is a straight forward map between fib6_info and fib6_config. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f547fac6 |
|
12-Oct-2018 |
Sabrina Dubroca <sd@queasysnail.net> |
ipv6: rate-limit probes for neighbourless routes When commit 270972554c91 ("[IPV6]: ROUTE: Add Router Reachability Probing (RFC4191).") introduced router probing, the rt6_probe() function required that a neighbour entry existed. This neighbour entry is used to record the timestamp of the last probe via the ->updated field. Later, commit 2152caea7196 ("ipv6: Do not depend on rt->n in rt6_probe().") removed the requirement for a neighbour entry. Neighbourless routes skip the interval check and are not rate-limited. This patch adds rate-limiting for neighbourless routes, by recording the timestamp of the last probe in the fib6_info itself. Fixes: 2152caea7196 ("ipv6: Do not depend on rt->n in rt6_probe().") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7c6bb7d2 |
|
11-Oct-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: Add knob to skip DELROUTE message on device down Another difference between IPv4 and IPv6 is the generation of RTM_DELROUTE notifications when a device is taken down (admin down) or deleted. IPv4 does not generate a message for routes evicted by the down or delete; IPv6 does. A NOS at scale really needs to avoid these messages and have IPv4 and IPv6 behave similarly, relying on userspace to handle link notifications and evict the routes. At this point existing user behavior needs to be preserved. Since notifications are a global action (not per app) the only way to preserve existing behavior and allow the messages to be skipped is to add a new sysctl (net/ipv6/route/skip_notify_on_dev_down) which can be set to disable the notificatioons. IPv6 route code already supports the option to skip the message (it is used for multipath routes for example). Besides the new sysctl we need to pass the skip_notify setting through the generic fib6_clean and fib6_walk functions to fib6_clean_node and to set skip_notify on calls to __ip_del_rt for the addrconf_ifdown path. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
67edf21e |
|
10-Sep-2018 |
David Ahern <dsahern@gmail.com> |
scsi: libcxgbi: fib6_ino reference in rt6_info is rcu protected The fib6_info reference in rt6_info is rcu protected. Add a helper to extract prefsrc from and update cxgbi_check_route6 to use it. Fixes: 0153167aebd0 ("net/ipv6: Remove rt6i_prefsrc") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0153167a |
|
10-Sep-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: Remove rt6i_prefsrc After the conversion to fib6_info, rt6i_prefsrc has a single user that reads the value and otherwise it is only set. The one reader can be converted to use rt->from so rt6i_prefsrc can be removed, reducing rt6_info by another 20 bytes. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e873e4b9 |
|
21-Jul-2018 |
Wei Wang <weiwan@google.com> |
ipv6: use fib6_info_hold_safe() when necessary In the code path where only rcu read lock is held, e.g. in the route lookup code path, it is not safe to directly call fib6_info_hold() because the fib6_info may already have been deleted but still exists in the rcu grace period. Holding reference to it could cause double free and crash the kernel. This patch adds a new function fib6_info_hold_safe() and replace fib6_info_hold() in all necessary places. Syzbot reported 3 crash traces because of this. One of them is: 8021q: adding VLAN 0 to HW filter on device team0 IPv6: ADDRCONF(NETDEV_CHANGE): team0: link becomes ready dst_release: dst:(____ptrval____) refcnt:-1 dst_release: dst:(____ptrval____) refcnt:-2 WARNING: CPU: 1 PID: 4845 at include/net/dst.h:239 dst_hold include/net/dst.h:239 [inline] WARNING: CPU: 1 PID: 4845 at include/net/dst.h:239 ip6_setup_cork+0xd66/0x1830 net/ipv6/ip6_output.c:1204 dst_release: dst:(____ptrval____) refcnt:-1 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 4845 Comm: syz-executor493 Not tainted 4.18.0-rc3+ #10 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 panic+0x238/0x4e7 kernel/panic.c:184 dst_release: dst:(____ptrval____) refcnt:-2 dst_release: dst:(____ptrval____) refcnt:-3 __warn.cold.8+0x163/0x1ba kernel/panic.c:536 dst_release: dst:(____ptrval____) refcnt:-4 report_bug+0x252/0x2d0 lib/bug.c:186 fixup_bug arch/x86/kernel/traps.c:178 [inline] do_error_trap+0x1fc/0x4d0 arch/x86/kernel/traps.c:296 dst_release: dst:(____ptrval____) refcnt:-5 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:316 invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992 RIP: 0010:dst_hold include/net/dst.h:239 [inline] RIP: 0010:ip6_setup_cork+0xd66/0x1830 net/ipv6/ip6_output.c:1204 Code: c1 ed 03 89 9d 18 ff ff ff 48 b8 00 00 00 00 00 fc ff df 41 c6 44 05 00 f8 e9 2d 01 00 00 4c 8b a5 c8 fe ff ff e8 1a f6 e6 fa <0f> 0b e9 6a fc ff ff e8 0e f6 e6 fa 48 8b 85 d0 fe ff ff 48 8d 78 RSP: 0018:ffff8801a8fcf178 EFLAGS: 00010293 RAX: ffff8801a8eba5c0 RBX: 0000000000000000 RCX: ffffffff869511e6 RDX: 0000000000000000 RSI: ffffffff869515b6 RDI: 0000000000000005 RBP: ffff8801a8fcf2c8 R08: ffff8801a8eba5c0 R09: ffffed0035ac8338 R10: ffffed0035ac8338 R11: ffff8801ad6419c3 R12: ffff8801a8fcf720 R13: ffff8801a8fcf6a0 R14: ffff8801ad6419c0 R15: ffff8801ad641980 ip6_make_skb+0x2c8/0x600 net/ipv6/ip6_output.c:1768 udpv6_sendmsg+0x2c90/0x35f0 net/ipv6/udp.c:1376 inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798 sock_sendmsg_nosec net/socket.c:641 [inline] sock_sendmsg+0xd5/0x120 net/socket.c:651 ___sys_sendmsg+0x51d/0x930 net/socket.c:2125 __sys_sendmmsg+0x240/0x6f0 net/socket.c:2220 __do_sys_sendmmsg net/socket.c:2249 [inline] __se_sys_sendmmsg net/socket.c:2246 [inline] __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2246 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x446ba9 Code: e8 cc bb 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fb39a469da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133 RAX: ffffffffffffffda RBX: 00000000006dcc54 RCX: 0000000000446ba9 RDX: 00000000000000b8 RSI: 0000000020001b00 RDI: 0000000000000003 RBP: 00000000006dcc50 R08: 00007fb39a46a700 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 45c828efc7a64843 R13: e6eeb815b9d8a477 R14: 5068caf6f713c6fc R15: 0000000000000001 Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: disabled Rebooting in 86400 seconds.. Fixes: 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes") Reported-by: syzbot+902e2a1bcd4f7808cef5@syzkaller.appspotmail.com Reported-by: syzbot+8ae62d67f647abeeceb9@syzkaller.appspotmail.com Reported-by: syzbot+3f08feb14086930677d0@syzkaller.appspotmail.com Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9b0a8da8 |
|
18-Jun-2018 |
Eric Dumazet <edumazet@google.com> |
net/ipv6: respect rcu grace period before freeing fib6_info syzbot reported use after free that is caused by fib6_info being freed without a proper RCU grace period. CPU: 0 PID: 1407 Comm: udevd Not tainted 4.17.0+ #39 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1b9/0x294 lib/dump_stack.c:113 print_address_description+0x6c/0x20b mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433 __read_once_size include/linux/compiler.h:188 [inline] find_rr_leaf net/ipv6/route.c:705 [inline] rt6_select net/ipv6/route.c:761 [inline] fib6_table_lookup+0x12b7/0x14d0 net/ipv6/route.c:1823 ip6_pol_route+0x1c2/0x1020 net/ipv6/route.c:1856 ip6_pol_route_output+0x54/0x70 net/ipv6/route.c:2082 fib6_rule_lookup+0x211/0x6d0 net/ipv6/fib6_rules.c:122 ip6_route_output_flags+0x2c5/0x350 net/ipv6/route.c:2110 ip6_route_output include/net/ip6_route.h:82 [inline] icmpv6_xrlim_allow net/ipv6/icmp.c:211 [inline] icmp6_send+0x147c/0x2da0 net/ipv6/icmp.c:535 icmpv6_send+0x17a/0x300 net/ipv6/ip6_icmp.c:43 ip6_link_failure+0xa5/0x790 net/ipv6/route.c:2244 dst_link_failure include/net/dst.h:427 [inline] ndisc_error_report+0xd1/0x1c0 net/ipv6/ndisc.c:695 neigh_invalidate+0x246/0x550 net/core/neighbour.c:892 neigh_timer_handler+0xaf9/0xde0 net/core/neighbour.c:978 call_timer_fn+0x230/0x940 kernel/time/timer.c:1326 expire_timers kernel/time/timer.c:1363 [inline] __run_timers+0x79e/0xc50 kernel/time/timer.c:1666 run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692 __do_softirq+0x2e0/0xaf5 kernel/softirq.c:284 invoke_softirq kernel/softirq.c:364 [inline] irq_exit+0x1d1/0x200 kernel/softirq.c:404 exiting_irq arch/x86/include/asm/apic.h:527 [inline] smp_apic_timer_interrupt+0x17e/0x710 arch/x86/kernel/apic/apic.c:1052 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:863 </IRQ> RIP: 0010:strlen+0x5e/0xa0 lib/string.c:482 Code: 24 00 74 3b 48 bb 00 00 00 00 00 fc ff df 4c 89 e0 48 83 c0 01 48 89 c2 48 89 c1 48 c1 ea 03 83 e1 07 0f b6 14 1a 38 ca 7f 04 <84> d2 75 23 80 38 00 75 de 48 83 c4 08 4c 29 e0 5b 41 5c 5d c3 48 RSP: 0018:ffff8801af117850 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13 RAX: ffff880197f53bd0 RBX: dffffc0000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff81c5b06c RDI: ffff880197f53bc0 RBP: ffff8801af117868 R08: ffff88019a976540 R09: 0000000000000000 R10: ffff88019a976540 R11: 0000000000000000 R12: ffff880197f53bc0 R13: ffff880197f53bc0 R14: ffffffff899e4e90 R15: ffff8801d91c6a00 strlen include/linux/string.h:267 [inline] getname_kernel+0x24/0x370 fs/namei.c:218 open_exec+0x17/0x70 fs/exec.c:882 load_elf_binary+0x968/0x5610 fs/binfmt_elf.c:780 search_binary_handler+0x17d/0x570 fs/exec.c:1653 exec_binprm fs/exec.c:1695 [inline] __do_execve_file.isra.35+0x16fe/0x2710 fs/exec.c:1819 do_execveat_common fs/exec.c:1866 [inline] do_execve fs/exec.c:1883 [inline] __do_sys_execve fs/exec.c:1964 [inline] __se_sys_execve fs/exec.c:1959 [inline] __x64_sys_execve+0x8f/0xc0 fs/exec.c:1959 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f1576a46207 Code: 77 19 f4 48 89 d7 44 89 c0 0f 05 48 3d 00 f0 ff ff 76 e0 f7 d8 64 41 89 01 eb d8 f7 d8 64 41 89 01 eb df b8 3b 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 00 8c 2d 00 f7 d8 64 89 02 RSP: 002b:00007ffff2784568 EFLAGS: 00000202 ORIG_RAX: 000000000000003b RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f1576a46207 RDX: 0000000001215b10 RSI: 00007ffff2784660 RDI: 00007ffff2785670 RBP: 0000000000625500 R08: 000000000000589c R09: 000000000000589c R10: 0000000000000000 R11: 0000000000000202 R12: 0000000001215b10 R13: 0000000000000007 R14: 0000000001204250 R15: 0000000000000005 Allocated by task 12188: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553 kmem_cache_alloc_trace+0x152/0x780 mm/slab.c:3620 kmalloc include/linux/slab.h:513 [inline] kzalloc include/linux/slab.h:706 [inline] fib6_info_alloc+0xbb/0x280 net/ipv6/ip6_fib.c:152 ip6_route_info_create+0x782/0x2b50 net/ipv6/route.c:3013 ip6_route_add+0x23/0xb0 net/ipv6/route.c:3154 ipv6_route_ioctl+0x5a5/0x760 net/ipv6/route.c:3660 inet6_ioctl+0x100/0x1f0 net/ipv6/af_inet6.c:546 sock_do_ioctl+0xe4/0x3e0 net/socket.c:973 sock_ioctl+0x30d/0x680 net/socket.c:1097 vfs_ioctl fs/ioctl.c:46 [inline] file_ioctl fs/ioctl.c:500 [inline] do_vfs_ioctl+0x1cf/0x16f0 fs/ioctl.c:684 ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701 __do_sys_ioctl fs/ioctl.c:708 [inline] __se_sys_ioctl fs/ioctl.c:706 [inline] __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:706 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 1402: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528 __cache_free mm/slab.c:3498 [inline] kfree+0xd9/0x260 mm/slab.c:3813 fib6_info_destroy+0x29b/0x350 net/ipv6/ip6_fib.c:207 fib6_info_release include/net/ip6_fib.h:286 [inline] __ip6_del_rt_siblings net/ipv6/route.c:3235 [inline] ip6_route_del+0x11c4/0x13b0 net/ipv6/route.c:3316 ipv6_route_ioctl+0x616/0x760 net/ipv6/route.c:3663 inet6_ioctl+0x100/0x1f0 net/ipv6/af_inet6.c:546 sock_do_ioctl+0xe4/0x3e0 net/socket.c:973 sock_ioctl+0x30d/0x680 net/socket.c:1097 vfs_ioctl fs/ioctl.c:46 [inline] file_ioctl fs/ioctl.c:500 [inline] do_vfs_ioctl+0x1cf/0x16f0 fs/ioctl.c:684 ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701 __do_sys_ioctl fs/ioctl.c:708 [inline] __se_sys_ioctl fs/ioctl.c:706 [inline] __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:706 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff8801b5df2580 which belongs to the cache kmalloc-256 of size 256 The buggy address is located 8 bytes inside of 256-byte region [ffff8801b5df2580, ffff8801b5df2680) The buggy address belongs to the page: page:ffffea0006d77c80 count:1 mapcount:0 mapping:ffff8801da8007c0 index:0xffff8801b5df2e40 flags: 0x2fffc0000000100(slab) raw: 02fffc0000000100 ffffea0006c5cc48 ffffea0007363308 ffff8801da8007c0 raw: ffff8801b5df2e40 ffff8801b5df2080 0000000100000006 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8801b5df2480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8801b5df2500: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc > ffff8801b5df2580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8801b5df2600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8801b5df2680: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb Fixes: a64efe142f5e ("net/ipv6: introduce fib6_info struct and helpers") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Ahern <dsahern@gmail.com> Reported-by: syzbot+9e6d75e3edef427ee888@syzkaller.appspotmail.com Acked-by: David Ahern <dsahern@gmail.com> Tested-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
901731b8 |
|
21-May-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: Add helper to return path MTU based on fib result Determine path MTU from a FIB lookup result. Logic is based on ip6_dst_mtu_forward plus lookup of nexthop exception. Add ip6_dst_mtu_forward to ipv6_stubs to handle access by core bpf code. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
c3506372 |
|
10-Apr-2018 |
Christoph Hellwig <hch@lst.de> |
proc: introduce proc_create_net{,_data} Variants of proc_create{,_data} that directly take a struct seq_operations and deal with network namespaces in ->open and ->release. All callers of proc_create + seq_open_net converted over, and seq_{open,release}_net are removed entirely. Signed-off-by: Christoph Hellwig <hch@lst.de>
|
#
138118ec |
|
09-May-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: Add fib6_lookup Add IPv6 equivalent to fib_lookup. Does a fib lookup, including rules, but returns a FIB entry, fib6_info, rather than a dst based rt6_info. fib6_lookup is any where from 140% (MULTIPLE_TABLES config disabled) to 60% faster than any of the dst based lookup methods (without custom rules) and 25% faster with custom rules (e.g., l3mdev rule). Since the lookup function has a completely different signature, fib6_rule_action is split into 2 paths: the existing one is renamed __fib6_rule_action and a new one for the fib6_info path is added. fib6_rule_action decides which to call based on the lookup_ptr. If it is fib6_table_lookup then the new path is taken. Caller must hold rcu lock as no reference is taken on the returned fib entry. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
1d053da9 |
|
09-May-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: Extract table lookup from ip6_pol_route ip6_pol_route is used for ingress and egress FIB lookups. Refactor it moving the table lookup into a separate fib6_table_lookup that can be invoked separately and export the new function. ip6_pol_route now calls fib6_table_lookup and uses the result to generate a dst based rt6_info. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
3b290a31 |
|
09-May-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: Rename rt6_multipath_select Rename rt6_multipath_select to fib6_multipath_select and export it. A later patch wants access to it similar to IPv4's fib_select_path. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
6454743b |
|
09-May-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: Rename fib6_lookup to fib6_node_lookup Rename fib6_lookup to fib6_node_lookup to better reflect what it returns. The fib6_lookup name will be used in a later patch for an IPv6 equivalent to IPv4's fib_lookup. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
8fb11a9a |
|
04-May-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: rename rt6_next to fib6_next This slipped through the cracks in the followup set to the fib6_info flip. Rename rt6_next to fib6_next. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a68886a6 |
|
20-Apr-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: Make from in rt6_info rcu protected When a dst entry is created from a fib entry, the 'from' in rt6_info is set to the fib entry. The 'from' reference is used most notably for cookie checking - making sure stale dst entries are updated if the fib entry is changed. When a fib entry is deleted, the pcpu routes on it are walked releasing the fib6_info reference. This is needed for the fib6_info cleanup to happen and to make sure all device references are released in a timely manner. There is a race window when a FIB entry is deleted and the 'from' on the pcpu route is dropped and the pcpu route hits a cookie check. Handle this race using rcu on from. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a87b7dc9 |
|
20-Apr-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: Move rcu locking to callers of fib6_get_cookie_safe A later patch protects 'from' in rt6_info and this simplifies the locking needed by it. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a269f1a7 |
|
20-Apr-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: Rename rt6_get_cookie_safe rt6_get_cookie_safe takes a fib6_info and checks the sernum of the node. Update the name to reflect its purpose. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6a3e030f |
|
20-Apr-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: Clean up rt expires helpers rt6_clean_expires and rt6_set_expires are no longer used. Removed them. rt6_update_expires has 1 caller in route.c, so move it from the header. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dcd1f572 |
|
18-Apr-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: Remove fib6_idev fib6_idev can be obtained from __in6_dev_get on the nexthop device rather than caching it in the fib6_info. Remove it. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9ee8cbb2 |
|
18-Apr-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: Remove aca_idev aca_idev has only 1 user - inet6_fill_ifacaddr - and it only wants the device index which can be extracted from the fib6_info nexthop. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
93c2fb25 |
|
18-Apr-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: Rename fib6_info struct elements Change the prefix for fib6_info struct elements from rt6i_ to fib6_. rt6i_pcpu and rt6i_exception_bucket are left as is given that they point to rt6_info entries. Rename only; not functional change intended. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
77634cc6 |
|
17-Apr-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: Remove unused code and variables for rt6_info Drop unneeded elements from rt6_info struct and rearrange layout to something more relevant for the data path. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8d1c802b |
|
17-Apr-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: Flip FIB entries to fib6_info Convert all code paths referencing a FIB entry from rt6_info to fib6_info. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
93531c67 |
|
17-Apr-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: separate handling of FIB entries from dst based routes Last step before flipping the data type for FIB entries: - use fib6_info_alloc to create FIB entries in ip6_route_info_create and addrconf_dst_alloc - use fib6_info_release in place of dst_release, ip6_rt_put and rt6_release - remove the dst_hold before calling __ip6_ins_rt or ip6_del_rt - when purging routes, drop per-cpu routes - replace inc and dec of rt6i_ref with fib6_info_hold and fib6_info_release - use rt->from since it points to the FIB entry - drop references to exception bucket, fib6_metrics and per-cpu from dst entries (those are relevant for fib entries only) Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a64efe14 |
|
17-Apr-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: introduce fib6_info struct and helpers Add fib6_info struct and alloc, destroy, hold and release helpers. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3b6761d1 |
|
17-Apr-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: Move dst flags to booleans in fib entries Continuing to wean FIB paths off of dst_entry, use a bool to hold requests for certain dst settings. Add a helper to convert the flags to DST flags when a FIB entry is converted to a dst_entry. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
14895687 |
|
17-Apr-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: move expires into rt6_info Add expires to rt6_info for FIB entries, and add fib6 helpers to manage it. Data path use of dst.expires remains. The transition is fairly straightforward: when working with fib entries, rt->dst.expires is just rt->expires, rt6_clean_expires is replaced with fib6_clean_expires, rt6_set_expires becomes fib6_set_expires, and rt6_check_expired becomes fib6_check_expired, where the fib6 versions are added by this patch. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d4ead6b3 |
|
17-Apr-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: move metrics from dst to rt6_info Similar to IPv4, add fib metrics to the fib struct, which at the moment is rt6_info. Will be moved to fib6_info in a later patch. Copy metrics into dst by reference using refcount. To make the transition: - add dst_metrics to rt6_info. Default to dst_default_metrics if no metrics are passed during route add. No need for a separate pmtu entry; it can reference the MTU slot in fib6_metrics - ip6_convert_metrics allocates memory in the FIB entry and uses ip_metrics_convert to copy from netlink attribute to metrics entry - the convert metrics call is done in ip6_route_info_create simplifying the route add path + fib6_commit_metrics and fib6_copy_metrics and the temporary mx6_config are no longer needed - add fib6_metric_set helper to change the value of a metric in the fib entry since dst_metric_set can no longer be used - cow_metrics for IPv6 can drop to dst_cow_metrics_generic - rt6_dst_from_metrics_check is no longer needed - rt6_fill_node needs the FIB entry and dst as separate arguments to keep compatibility with existing output. Current dst address is renamed to dest. (to be consistent with IPv4 rt6_fill_node really should be split into 2 functions similar to fib_dump_info and rt_fill_info) - rt6_fill_node no longer needs the temporary metrics variable Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5e670d84 |
|
17-Apr-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: Move nexthop data to fib6_nh Introduce fib6_nh structure and move nexthop related data from rt6_info and rt6_info.dst to fib6_nh. References to dev, gateway or lwtstate from a FIB lookup perspective are converted to use fib6_nh; datapath references to dst version are left as is. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e8478e80 |
|
17-Apr-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: Save route type in rt6_info The RTN_ type for IPv6 FIB entries is currently embedded in rt6i_flags and dst.error. Since dst is going to be removed, it can no longer be relied on for FIB dumps so save the route type as fib6_type. fc_type is set in current users based on the algorithm in rt6_fill_node: - rt6i_flags contains RTF_LOCAL: fc_type = RTN_LOCAL - rt6i_flags contains RTF_ANYCAST: fc_type = RTN_ANYCAST - else fc_type = RTN_UNICAST Similarly, fib6_type is set in the rt6_info templates based on the RTF_REJECT section of rt6_fill_node converting dst.error to RTN type. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7aef6859 |
|
17-Apr-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: Pass net to fib6_update_sernum Pass net namespace to fib6_update_sernum. It can not be marked const as fib6_new_sernum will change ipv6.fib6_sernum. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b75cc8f9 |
|
02-Mar-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv6: Pass skb to route lookup IPv6 does path selection for multipath routes deep in the lookup functions. The next patch adds L4 hash option and needs the skb for the forward path. To get the skb to the relevant FIB lookup functions it needs to go through the fib rules layer, so add a lookup_data argument to the fib_lookup_arg struct. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5e5d6fed |
|
28-Feb-2018 |
Roopa Prabhu <roopa@cumulusnetworks.com> |
ipv6: route: dissect flow in input path if fib rules need it Dissect flow in fwd path if fib rules require it. Controlled by a flag to avoid penatly for the common case. Flag is set when fib rules with sport, dport and proto match that require flow dissect are installed. Also passes the dissected hash keys to the multipath hash function when applicable to avoid dissecting the flow again. icmp packets will continue to use inner header for hash calculations. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
398958ae |
|
09-Jan-2018 |
Ido Schimmel <idosch@mellanox.com> |
ipv6: Add support for non-equal-cost multipath The use of hash-threshold instead of modulo-N makes it trivial to add support for non-equal-cost multipath. Instead of dividing the multipath hash function's output space equally between the nexthops, each nexthop is assigned a region size which is proportional to its weight. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d7dedee1 |
|
09-Jan-2018 |
Ido Schimmel <idosch@mellanox.com> |
ipv6: Calculate hash thresholds for IPv6 nexthops Before we convert IPv6 to use hash-threshold instead of modulo-N, we first need each nexthop to store its region boundary in the hash function's output space. The boundary is calculated by dividing the output space equally between the different active nexthops. That is, nexthops that are not dead or linkdown. The boundaries are rebalanced whenever a nexthop is added or removed to a multipath route and whenever a nexthop becomes active or inactive. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4a8e56ee |
|
06-Jan-2018 |
Ido Schimmel <idosch@mellanox.com> |
ipv6: Export sernum update function We are going to allow dead routes to stay in the FIB tree (e.g., when they are part of a multipath route, directly connected route with no carrier) and revive them when their nexthop device gains carrier or when it is put administratively up. This is equivalent to the addition of the route to the FIB tree and we should therefore take care of updating the sernum of all the parent nodes of the node where the route is stored. Otherwise, we risk sockets caching and using sub-optimal dst entries. Export the function that performs the above, so that it could be invoked from fib6_ifup() later on. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a2c554d3 |
|
06-Jan-2018 |
Ido Schimmel <idosch@mellanox.com> |
ipv6: Add explicit flush indication to routes When routes that are a part of a multipath route are evaluated by fib6_ifdown() in response to NETDEV_DOWN and NETDEV_UNREGISTER events the state of their sibling routes is not considered. This will change in subsequent patches in order to align IPv6 with IPv4's behavior. For example, when the last sibling in a multipath route becomes dead, the entire multipath route needs to be removed. To prevent the tree walker from re-evaluating all the sibling routes each time, we can simply evaluate them once - when the first sibling is traversed. If we determine the entire multipath route needs to be removed, then the 'should_flush' bit is set in all the siblings, which will cause the walker to flush them when it traverses them. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3a2232e9 |
|
28-Nov-2017 |
David Miller <davem@davemloft.net> |
ipv6: Move dst->from into struct rt6_info. The dst->from value is only used by ipv6 routes to track where a route "came from". Any time we clone or copy a core ipv6 route in the ipv6 routing tables, we have the copy/clone's ->from point to the base route. This is used to handle route expiration properly. Only ipv6 uses this mechanism, and only ipv6 code references it. So it is safe to move it into rt6_info. Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Eric Dumazet <edumazet@google.com>
|
#
071fb37e |
|
28-Nov-2017 |
David Miller <davem@davemloft.net> |
ipv6: Move rt6_next from dst_entry into ipv6 route structure. Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Eric Dumazet <edumazet@google.com>
|
#
81eb8447 |
|
06-Oct-2017 |
Wei Wang <weiwan@google.com> |
ipv6: take care of rt6_stats Currently, most of the rt6_stats are not hooked up correctly. As the last part of this patch series, hook up all existing rt6_stats and add one new stat fib_rt_uncache to indicate the number of routes in the uncached list. For details of the stats, please refer to the comments added in include/net/ip6_fib.h. Note: fib_rt_alloc and fib_rt_uncache are not guaranteed to be modified under a lock. So atomic_t is used for them. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
66f5d6ce |
|
06-Oct-2017 |
Wei Wang <weiwan@google.com> |
ipv6: replace rwlock with rcu and spinlock in fib6_table With all the preparation work before, we are now ready to replace rwlock with rcu and spinlock in fib6_table. That means now all fib6_node in fib6_table are protected by rcu. And when freeing fib6_node, call_rcu() is used to wait for the rcu grace period before releasing the memory. When accessing fib6_node, corresponding rcu APIs need to be used. And all previous sessions protected by the write lock will now be protected by the spin lock per table. All previous sessions protected by read lock will now be protected by rcu_read_lock(). A couple of things to note here: 1. As part of the work of replacing rwlock with rcu, the linked list of fn->leaf now has to be rcu protected as well. So both fn->leaf and rt->dst.rt6_next are now __rcu tagged and corresponding rcu APIs are used when manipulating them. 2. For fn->rr_ptr, first of all, it also needs to be rcu protected now and is tagged with __rcu and rcu APIs are used in corresponding places. Secondly, fn->rr_ptr is changed in rt6_select() which is a reader thread. This makes the issue a bit complicated. We think a valid solution for it is to let rt6_select() grab the tb6_lock if it decides to change it. As it is not in the normal operation and only happens when there is no valid neighbor cache for the route, we think the performance impact should be low. 3. fib6_walk_continue() has to be called with tb6_lock held even in the route dumping related functions, e.g. inet6_dump_fib(), fib6_tables_dump() and ipv6_route_seq_ops. It is because fib6_walk_continue() makes modifications to the walker structure, and so are fib6_repair_tree() and fib6_del_route(). In order to do proper syncing between them, we need to let fib6_walk_continue() hold the lock. We may be able to do further improvement on the way we do the tree walk to get rid of the need for holding the spin lock. But not for now. 4. When fib6_del_route() removes a route from the tree, we no longer mark rt->dst.rt6_next to NULL to make simultaneous reader be able to further traverse the list with rcu. However, rt->dst.rt6_next is only valid within this same rcu period. No one should access it later. 5. All the operation of atomic_inc(rt->rt6i_ref) is changed to be performed before we publish this route (either by linking it to fn->leaf or insert it in the list pointed by fn->leaf) just to be safe because as soon as we publish the route, some read thread will be able to access it. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bbd63f06 |
|
06-Oct-2017 |
Wei Wang <weiwan@google.com> |
ipv6: update fn_sernum after route is inserted to tree fib6_add() logic currently calls fib6_add_1() to figure out what node should be used for the newly added route and then call fib6_add_rt2node() to insert the route to the node. And during the call of fib6_add_1(), fn_sernum is updated for all nodes that share the same prefix as the new route. This does not have issue in the current code because reader thread will not be able to access the tree while writer thread is inserting new route to it. However, it is not the case once we transition to use RCU. Reader thread could potentially see the new fn_sernum before the new route is inserted. As a result, reader thread's route lookup will return a stale route with the new fn_sernum. In order to solve this issue, we remove all the update of fn_sernum in fib6_add_1(), and instead, introduce a new function that updates fn_sernum for all related nodes and call this functions once the route is successfully inserted to the tree. Also, smp_wmb() is used after a route is successfully inserted into the fib tree and right before the updated of fn->sernum. And smp_rmb() is used right after fn->sernum is accessed in rt6_get_cookie_safe(). This is to guarantee that when the reader thread sees the new fn->sernum, the new route is already inserted in the tree in memory. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2b760fcf |
|
06-Oct-2017 |
Wei Wang <weiwan@google.com> |
ipv6: hook up exception table to store dst cache This commit makes use of the exception hash table implementation to store dst caches created by pmtu discovery and ip redirect into the hash table under the rt_info and no longer inserts these routes into fib6 tree. This makes the fib6 tree only contain static configured routes and could now be protected by rcu instead of a rw lock. With this change, in the route lookup related functions, after finding the rt6_info with the longest prefix, we also need to search for the exception table before doing backtracking. In the route delete function, if the route being deleted is not a dst cache, deletion of this route also need to flush the whole hash table under it. If it is a dst cache, then only delete the cached dst in the hash table. Note: for fib6_walk_continue() function, w->root now is always pointing to a root node considering that fib6_prune_clones() is removed from the code. So we add a WARN_ON() msg to make sure w->root always points to a root node and also removed the update of w->root in fib6_repair_tree(). This is a prerequisite for later patch because we don't need to make w->root as rcu protected when replacing rwlock with RCU. Also, we remove all prune related variables as it is no longer used. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
38fbeeee |
|
06-Oct-2017 |
Wei Wang <weiwan@google.com> |
ipv6: prepare fib6_locate() for exception table fib6_locate() is used to find the fib6_node according to the passed in prefix address key. It currently tries to find the fib6_node with the exact match of the passed in key. However, when we move cached routes into the exception table, fib6_locate() will fail to find the fib6_node for it as the cached routes will be stored in the exception table under the fib6_node with the longest prefix match of the cache's dst addr key. This commit adds a new parameter to let the caller specify if it needs exact match or longest prefix match. Right now, all callers still does exact match when calling fib6_locate(). It will be changed in later commit where exception table is hooked up to store cached routes. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c757faa8 |
|
06-Oct-2017 |
Wei Wang <weiwan@google.com> |
ipv6: prepare fib6_age() for exception table If all dst cache entries are stored in the exception table under the main route, we have to go through them during fib6_age() when doing garbage collecting. Introduce a new function rt6_age_exception() which goes through all dst entries in the exception table and remove those entries that are expired. This function is called in fib6_age() so that all dst caches are also garbage collected. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
35732d01 |
|
06-Oct-2017 |
Wei Wang <weiwan@google.com> |
ipv6: introduce a hash table to store dst cache Add a hash table into struct rt6_info in order to store dst caches created by pmtu discovery and ip redirect in ipv6 routing code. APIs to add dst cache, delete dst cache, find dst cache and update dst cache in the hash table are implemented and will be used in later commits. This is a preparation work to move all cache routes into the exception table instead of getting inserted into the fib6 tree. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
180ca444 |
|
06-Oct-2017 |
Wei Wang <weiwan@google.com> |
ipv6: introduce a new function fib6_update_sernum() This function takes a route as input and tries to update the sernum in the fib6_node this route is associated with. It will be used in later commit when adding a cached route into the exception table under that route. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4e587ea7 |
|
25-Aug-2017 |
Wei Wang <weiwan@google.com> |
ipv6: fix sparse warning on rt6i_node Commit c5cff8561d2d adds rcu grace period before freeing fib6_node. This generates a new sparse warning on rt->rt6i_node related code: net/ipv6/route.c:1394:30: error: incompatible types in comparison expression (different address spaces) ./include/net/ip6_fib.h:187:14: error: incompatible types in comparison expression (different address spaces) This commit adds "__rcu" tag for rt6i_node and makes sure corresponding rcu API is used for it. After this fix, sparse no longer generates the above warning. Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node") Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c5cff856 |
|
21-Aug-2017 |
Wei Wang <weiwan@google.com> |
ipv6: add rcu grace period before freeing fib6_node We currently keep rt->rt6i_node pointing to the fib6_node for the route. And some functions make use of this pointer to dereference the fib6_node from rt structure, e.g. rt6_check(). However, as there is neither refcount nor rcu taken when dereferencing rt->rt6i_node, it could potentially cause crashes as rt->rt6i_node could be set to NULL by other CPUs when doing a route deletion. This patch introduces an rcu grace period before freeing fib6_node and makes sure the functions that dereference it takes rcu_read_lock(). Note: there is no "Fixes" tag because this bug was there in a very early stage. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fe400799 |
|
15-Aug-2017 |
Ido Schimmel <idosch@mellanox.com> |
ipv6: fib: Provide offload indication using nexthop flags IPv6 routes currently lack nexthop flags as in IPv4. This has several implications. In the forwarding path, it requires us to check the carrier state of the nexthop device and potentially ignore a linkdown route, instead of checking for RTNH_F_LINKDOWN. It also requires capable drivers to use the user facing IPv6-specific route flags to provide offload indication, instead of using the nexthop flags as in IPv4. Add nexthop flags to IPv6 routes in the 40 bytes hole and use it to provide offload indication instead of the RTF_OFFLOAD flag, which is removed while it's still not part of any official kernel release. In the near future we would like to use the field for the RTNH_F_{LINKDOWN,DEAD} flags, but this change is more involved and might not be ready in time for the current cycle. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a460aa83 |
|
03-Aug-2017 |
Ido Schimmel <idosch@mellanox.com> |
ipv6: fib: Add helpers to hold / drop a reference on rt6_info Similar to commit 1c677b3d2828 ("ipv4: fib: Add fib_info_hold() helper") and commit b423cb10807b ("ipv4: fib: Export free_fib_info()") add an helper to hold a reference on rt6_info and export rt6_release() to drop it and potentially release the route. This is needed so that drivers capable of FIB offload could hold a reference on the route before queueing it for offload and drop it after the route has been programmed to the device's tables. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e1ee0a5b |
|
03-Aug-2017 |
Ido Schimmel <idosch@mellanox.com> |
ipv6: fib: Dump tables during registration to FIB chain Dump all the FIB tables in each net namespace upon registration to the FIB notification chain so that the callee will have a complete view of the tables. The integrity of the dump is ensured by a per-table sequence counter that is incremented (under write lock) whenever a route is added or deleted from the table. All the sequence counters are read (under each table's read lock) and summed, prior and after the dump. In case the counters differ, then the dump is either restarted or the registration fails. While it's possible for a table to be modified after its counter has been read, this isn't really a problem. In case it happened before it was read the second time, then the comparison at the end will fail. If it happened afterwards, then we're guaranteed to be notified about the change, as the notification block is registered prior to the second read. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dcb18f76 |
|
03-Aug-2017 |
Ido Schimmel <idosch@mellanox.com> |
ipv6: fib_rules: Dump rules during registration to FIB chain Allow users of the FIB notification chain to receive a complete view of the IPv6 FIB rules upon registration to the chain. The integrity of the dump is ensured by a per-family sequence counter that is incremented (under RTNL) whenever a rule is added or deleted. All the sequence counters are read (under RTNL) and summed, prior and after the dump. In case the counters differ, then the dump is either restarted or the registration fails. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
df77fe4d |
|
03-Aug-2017 |
Ido Schimmel <idosch@mellanox.com> |
ipv6: fib: Add in-kernel notifications for route add / delete As with IPv4, allow listeners of the FIB notification chain to receive notifications whenever a route is added, replaced or deleted. This is done by placing calls to the FIB notification chain in the two lowest level functions that end up performing these operations - namely, fib6_add_rt2node() and fib6_del_route(). Unlike IPv4, APPEND notifications aren't sent as the kernel doesn't distinguish between "append" (NLM_F_CREATE|NLM_F_APPEND) and "prepend" (NLM_F_CREATE). If NLM_F_EXCL isn't set, duplicate routes are always added after the existing duplicate routes. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
16ab6d7d |
|
03-Aug-2017 |
Ido Schimmel <idosch@mellanox.com> |
ipv6: fib: Add FIB notifiers callbacks We're about to add IPv6 FIB offload support, so implement the necessary callbacks in IPv6 code, which will later allow us to add routes and rules notifications. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e3ea9731 |
|
03-Aug-2017 |
Ido Schimmel <idosch@mellanox.com> |
ipv6: fib_rules: Check if rule is a default rule As explained in commit 3c71006d15fd ("ipv4: fib_rules: Check if rule is a default rule"), drivers supporting IPv6 FIB offload need to be able to sanitize the rules they don't support and potentially flush their tables. Add an IPv6 helper to check if a FIB rule is a default rule. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a4c2fd7f |
|
17-Jun-2017 |
Wei Wang <weiwan@google.com> |
net: remove DST_NOCACHE flag DST_NOCACHE flag check has been removed from dst_release() and dst_hold_safe() in a previous patch because all the dst are now ref counted properly and can be released based on refcnt only. Looking at the rest of the DST_NOCACHE use, all of them can now be removed or replaced with other checks. So this patch gets rid of all the DST_NOCACHE usage and remove this flag completely. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
333c4301 |
|
21-May-2017 |
David Ahern <dsahern@gmail.com> |
net: ipv6: Plumb extack through route add functions Plumb extack argument down to route add functions. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0ae81335 |
|
02-Feb-2017 |
David Ahern <dsa@cumulusnetworks.com> |
net: ipv6: Allow shorthand delete of all nexthops in multipath route IPv4 allows multipath routes to be deleted using just the prefix and length. For example: $ ip ro ls vrf red unreachable default metric 8192 1.1.1.0/24 nexthop via 10.100.1.254 dev eth1 weight 1 nexthop via 10.11.200.2 dev eth11.200 weight 1 10.11.200.0/24 dev eth11.200 proto kernel scope link src 10.11.200.3 10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.3 $ ip ro del 1.1.1.0/24 vrf red $ ip ro ls vrf red unreachable default metric 8192 10.11.200.0/24 dev eth11.200 proto kernel scope link src 10.11.200.3 10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.3 The same notation does not work with IPv6 because of how multipath routes are implemented for IPv6. For IPv6 only the first nexthop of a multipath route is deleted if the request contains only a prefix and length. This leads to unnecessary complexity in userspace dealing with IPv6 multipath routes. This patch allows all nexthops to be deleted without specifying each one in the delete request. Internally, this is done by walking the sibling list of the route matching the specifications given (prefix, length, metric, protocol, etc). $ ip -6 ro ls vrf red 2001:db8:1::/120 dev eth1 proto kernel metric 256 pref medium 2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium 2001:db8:200::/120 via 2001:db8:1::2 dev eth1 metric 1024 pref medium 2001:db8:200::/120 via 2001:db8:2::2 dev eth2 metric 1024 pref medium ... $ ip -6 ro del vrf red 2001:db8:200::/120 $ ip -6 ro ls vrf red 2001:db8:1::/120 dev eth1 proto kernel metric 256 pref medium 2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium ... Because IPv6 allows individual nexthops to be deleted without deleting the entire route, the ip6_route_multipath_del and non-multipath code path (ip6_route_del) have to be discriminated so that all nexthops are only deleted for the latter case. This is done by making the existing fc_type in fib6_config a u16 and then adding a new u16 field with fc_delete_all_nh as the first bit. Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
830218c1 |
|
24-Oct-2016 |
David Ahern <dsa@cumulusnetworks.com> |
net: ipv6: Fix processing of RAs in presence of VRF rt6_add_route_info and rt6_add_dflt_router were updated to pull the FIB table from the device index, but the corresponding rt6_get_route_info and rt6_get_dflt_router functions were not leading to the failure to process RA's: ICMPv6: RA: ndisc_router_discovery failed to add default route Fix the 'get' functions by using the table id associated with the device when applicable. Also, now that default routes can be added to tables other than the default table, rt6_purge_dflt_routers needs to be updated as well to look at all tables. To handle that efficiently, add a flag to the table denoting if it is has a default route via RA. Fixes: ca254490c8dfd ("net: Add VRF support to IPv6 stack") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
02bcf4e0 |
|
11-Nov-2015 |
Martin KaFai Lau <kafai@fb.com> |
ipv6: Check rt->dst.from for the DST_NOCACHE route All DST_NOCACHE rt6_info used to have rt->dst.from set to its parent. After commit 8e3d5be73681 ("ipv6: Avoid double dst_free"), DST_NOCACHE is also set to rt6_info which does not have a parent (i.e. rt->dst.from is NULL). This patch catches the rt->dst.from == NULL case. Fixes: 8e3d5be73681 ("ipv6: Avoid double dst_free") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
37a1d361 |
|
13-Sep-2015 |
Roopa Prabhu <roopa@cumulusnetworks.com> |
ipv6: include NLM_F_REPLACE in route replace notifications This patch adds NLM_F_REPLACE flag to ipv6 route replace notifications. This makes nlm_flags in ipv6 replace notifications consistent with ipv4. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
61adedf3 |
|
20-Aug-2015 |
Jiri Benc <jbenc@redhat.com> |
route: move lwtunnel state to dst_entry Currently, the lwtunnel state resides in per-protocol data. This is a problem if we encapsulate ipv6 traffic in an ipv4 tunnel (or vice versa). The xmit function of the tunnel does not know whether the packet has been routed to it by ipv4 or ipv6, yet it needs the lwtstate data. Moving the lwtstate data to dst_entry makes such inter-protocol tunneling possible. As a bonus, this brings a nice diffstat. Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
19e42e45 |
|
21-Jul-2015 |
Roopa Prabhu <roopa@cumulusnetworks.com> |
ipv6: support for fib route lwtunnel encap attributes This patch adds support in ipv6 fib functions to parse Netlink RTA encap attributes and attach encap state data to rt6_info. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d52d3997 |
|
22-May-2015 |
Martin KaFai Lau <kafai@fb.com> |
ipv6: Create percpu rt6_info After the patch 'ipv6: Only create RTF_CACHE routes after encountering pmtu exception', we need to compensate the performance hit (bouncing dst->__refcnt). Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8d0b94af |
|
22-May-2015 |
Martin KaFai Lau <kafai@fb.com> |
ipv6: Keep track of DST_NOCACHE routes in case of iface down/unregister This patch keeps track of the DST_NOCACHE routes in a list and replaces its dev with loopback during the iface down/unregister event. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3da59bd9 |
|
22-May-2015 |
Martin KaFai Lau <kafai@fb.com> |
ipv6: Create RTF_CACHE clone when FLOWI_FLAG_KNOWN_NH is set This patch always creates RTF_CACHE clone with DST_NOCACHE when FLOWI_FLAG_KNOWN_NH is set so that the rt6i_dst is set to the fl6->daddr. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Julian Anastasov <ja@ssi.bg> Tested-by: Julian Anastasov <ja@ssi.bg> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b197df4f |
|
22-May-2015 |
Martin KaFai Lau <kafai@fb.com> |
ipv6: Add rt6_get_cookie() function Instead of doing the rt6->rt6i_node check whenever we need to get the route's cookie. Refactor it into rt6_get_cookie(). It is a prep work to handle FLOWI_FLAG_KNOWN_NH and also percpu rt6_info later. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
afc4eef8 |
|
28-Apr-2015 |
Martin KaFai Lau <kafai@fb.com> |
ipv6: Remove DST_METRICS_FORCE_OVERWRITE and _rt6i_peer _rt6i_peer is no longer needed after the last patch, 'ipv6: Stop rt6_info from using inet_peer's metrics'. DST_METRICS_FORCE_OVERWRITE is added by commit e5fd387ad5b3 ("ipv6: do not overwrite inetpeer metrics prematurely"). Since inetpeer is no longer used for metrics, this bit is also not needed. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Michal Kubeček <mkubecek@suse.cz> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4b32b5ad |
|
28-Apr-2015 |
Martin KaFai Lau <kafai@fb.com> |
ipv6: Stop rt6_info from using inet_peer's metrics inet_peer is indexed by the dst address alone. However, the fib6 tree could have multiple routing entries (rt6_info) for the same dst. For example, 1. A /128 dst via multiple gateways. 2. A RTF_CACHE route cloned from a /128 route. In the above cases, all of them will share the same metrics and step on each other. This patch will steer away from inet_peer's metrics and use dst_cow_metrics_generic() for everything. Change Highlights: 1. Remove rt6_cow_metrics() which currently acquires metrics from inet_peer for DST_HOST route (i.e. /128 route). 2. Add rt6i_pmtu to take care of the pmtu update to avoid creating a full size metrics just to override the RTAX_MTU. 3. After (2), the RTF_CACHE route can also share the metrics with its dst.from route, by: dst_init_metrics(&cache_rt->dst, dst_metrics_ptr(cache_rt->dst.from), true); 4. Stop creating RTF_CACHE route by cloning another RTF_CACHE route. Instead, directly clone from rt->dst. [ Currently, cloning from another RTF_CACHE is only possible during rt6_do_redirect(). Also, the old clone is removed from the tree immediately after the new clone is added. ] In case of cloning from an older redirect RTF_CACHE, it should work as before. In case of cloning from an older pmtu RTF_CACHE, this patch will forget the pmtu and re-learn it (if there is any) from the redirected route. The _rt6i_peer and DST_METRICS_FORCE_OVERWRITE will be removed in the next cleanup patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e715b6d3 |
|
05-Jan-2015 |
Florian Westphal <fw@strlen.de> |
net: fib6: convert cfg metric to u32 outside of table write lock Do the nla validation earlier, outside the write lock. This is needed by followup patch which needs to be able to call request_module (which can sleep) if needed. Joint work with Daniel Borkmann. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
42b18706 |
|
06-Oct-2014 |
Hannes Frederic Sowa <hannes@stressinduktion.org> |
ipv6: make rt_sernum atomic and serial number fields ordinary ints Cc: YOSHIFUJI Hideaki <hideaki@yoshifuji.org> Cc: Martin Lau <kafai@fb.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
94b2cfe0 |
|
06-Oct-2014 |
Hannes Frederic Sowa <hannes@stressinduktion.org> |
ipv6: minor fib6 cleanups like type safety, bool conversion, inline removal Also renamed struct fib6_walker_t to fib6_walker and enum fib_walk_state_t to fib6_walk_state as recommended by Cong Wang. Cc: Cong Wang <cwang@twopensource.com> Cc: YOSHIFUJI Hideaki <hideaki@yoshifuji.org> Cc: Martin Lau <kafai@fb.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
705f1c86 |
|
27-Sep-2014 |
Hannes Frederic Sowa <hannes@stressinduktion.org> |
ipv6: remove rt6i_genid Eric Dumazet noticed that all no-nonexthop or no-gateway routes which are already marked DST_HOST (e.g. input routes routes) will always be invalidated during sk_dst_check. Thus per-socket dst caching absolutely had no effect and early demuxing had no effect. Thus this patch removes rt6i_genid: fn_sernum already gets modified during add operations, so we only must ensure we mutate fn_sernum during ipv6 address remove operations. This is a fairly cost extensive operations, but address removal should not happen that often. Also our mtu update functions do the same and we heard no complains so far. xfrm policy changes also cause a call into fib6_flush_trees. Also plug a hole in rt6_info (no cacheline changes). I verified via tracing that this change has effect. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: YOSHIFUJI Hideaki <hideaki@yoshifuji.org> Cc: Vlad Yasevich <vyasevich@gmail.com> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Cc: Martin Lau <kafai@fb.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e5fd387a |
|
27-Mar-2014 |
Michal Kubeček <mkubecek@suse.cz> |
ipv6: do not overwrite inetpeer metrics prematurely If an IPv6 host route with metrics exists, an attempt to add a new route for the same target with different metrics fails but rewrites the metrics anyway: 12sp0:~ # ip route add fec0::1 dev eth0 rto_min 1000 12sp0:~ # ip -6 route show fe80::/64 dev eth0 proto kernel metric 256 fec0::1 dev eth0 metric 1024 rto_min lock 1s 12sp0:~ # ip route add fec0::1 dev eth0 rto_min 1500 RTNETLINK answers: File exists 12sp0:~ # ip -6 route show fe80::/64 dev eth0 proto kernel metric 256 fec0::1 dev eth0 metric 1024 rto_min lock 1.5s This is caused by all IPv6 host routes using the metrics in their inetpeer (or the shared default). This also holds for the new route created in ip6_route_add() which shares the metrics with the already existing route and thus ip6_route_add() rewrites the metrics even if the new route ends up not being used at all. Another problem is that old metrics in inetpeer can reappear unexpectedly for a new route, e.g. 12sp0:~ # ip route add fec0::1 dev eth0 rto_min 1000 12sp0:~ # ip route del fec0::1 12sp0:~ # ip route add fec0::1 dev eth0 12sp0:~ # ip route change fec0::1 dev eth0 hoplimit 10 12sp0:~ # ip -6 route show fe80::/64 dev eth0 proto kernel metric 256 fec0::1 dev eth0 metric 1024 hoplimit 10 rto_min lock 1s Resolve the first problem by moving the setting of metrics down into fib6_add_rt2node() to the point we are sure we are inserting the new route into the tree. Second problem is addressed by introducing new flag DST_METRICS_FORCE_OVERWRITE which is set for a new host route in ip6_route_add() and makes ipv6_cow_metrics() always overwrite the metrics in inetpeer (even if they are not "new"); it is reset after that. v5: use a flag in _metrics member rather than one in flags v4: fix a typo making a condition always true (thanks to Hannes Frederic Sowa) v3: rewritten based on David Miller's idea to move setting the metrics (and allocation in non-host case) down to the point we already know the route is to be inserted. Also rebased to net-next as it is quite late in the cycle. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0c3584d5 |
|
27-Dec-2013 |
Li RongQing <roy.qing.li@gmail.com> |
ipv6: remove prune parameter for fib6_clean_all since the prune parameter for fib6_clean_all always is 0, remove it. Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
01ba16d6 |
|
24-Oct-2013 |
Hannes Frederic Sowa <hannes@stressinduktion.org> |
ipv6: reset dst.expires value when clearing expire flag On receiving a packet too big icmp error we update the expire value by calling rt6_update_expires. This function uses dst_set_expires which is implemented that it can only reduce the expiration value of the dst entry. If we insert new routing non-expiry information into the ipv6 fib where we already have a matching rt6_info we only clear the RTF_EXPIRES flag in rt6i_flags and leave the dst.expires value as is. When new mtu information arrives for that cached dst_entry we again call dst_set_expires. This time it won't update the dst.expire value because we left the dst.expire value intact from the last update. So dst_set_expires won't touch dst.expires. Fix this by resetting dst.expires when clearing the RTF_EXPIRE flag. dst_set_expires checks for a zero expiration and updates the dst.expires. In the past this (not updating dst.expires) was necessary because dst.expire was placed in a union with the dst_entry *from reference and rt6_clean_expires did assign NULL to it. This split happend in ecd9883724b78cc72ed92c98bcb1a46c764fff21 ("ipv6: fix race condition regarding dst->expires and dst->from"). Reported-by: Steinar H. Gunderson <sgunderson@bigfoot.com> Reported-by: Valentijn Sessink <valentyn@blub.net> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Acked-by: Eric Dumazet <edumazet@google.com> Tested-by: Valentijn Sessink <valentyn@blub.net> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8d2ca1d7 |
|
21-Sep-2013 |
Hannes Frederic Sowa <hannes@stressinduktion.org> |
ipv6: avoid high order memory allocations for /proc/net/ipv6_route Dumping routes on a system with lots rt6_infos in the fibs causes up to 11-order allocations in seq_file (which fail). While we could switch there to vmalloc we could just implement the streaming interface for /proc/net/ipv6_route. This patch switches /proc/net/ipv6_route from single_open_net to seq_open_net. loff_t *pos tracks dst entries. Also kill never used struct rt6_proc_arg and now unused function fib6_clean_all_ro. Cc: Ben Greear <greearb@candelatech.com> Cc: Patrick McHardy <kaber@trash.net> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5c3a0fd7 |
|
21-Sep-2013 |
Joe Perches <joe@perches.com> |
ip*.h: Remove extern from function prototypes There are a mix of function prototypes with and without extern in the kernel sources. Standardize on not using extern for function prototypes. Function prototypes don't need to be written with extern. extern is assumed by the compiler. Its use is as unnecessary as using auto to declare automatic/local variables in a block. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2ac3ac8f |
|
01-Aug-2013 |
Michal Kubeček <mkubecek@suse.cz> |
ipv6: prevent fib6_run_gc() contention On a high-traffic router with many processors and many IPv6 dst entries, soft lockup in fib6_run_gc() can occur when number of entries reaches gc_thresh. This happens because fib6_run_gc() uses fib6_gc_lock to allow only one thread to run the garbage collector but ip6_dst_gc() doesn't update net->ipv6.ip6_rt_last_gc until fib6_run_gc() returns. On a system with many entries, this can take some time so that in the meantime, other threads pass the tests in ip6_dst_gc() (ip6_rt_last_gc is still not updated) and wait for the lock. They then have to run the garbage collector one after another which blocks them for quite long. Resolve this by replacing special value ~0UL of expire parameter to fib6_run_gc() by explicit "force" parameter to choose between spin_lock_bh() and spin_trylock_bh() and call fib6_run_gc() with force=false if gc_thresh is reached but not max_size. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ecd98837 |
|
19-Feb-2013 |
YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org> |
ipv6: fix race condition regarding dst->expires and dst->from. Eric Dumazet wrote: | Some strange crashes happen in rt6_check_expired(), with access | to random addresses. | | At first glance, it looks like the RTF_EXPIRES and | stuff added in commit 1716a96101c49186b | (ipv6: fix problem with expired dst cache) | are racy : same dst could be manipulated at the same time | on different cpus. | | At some point, our stack believes rt->dst.from contains a dst pointer, | while its really a jiffie value (as rt->dst.expires shares the same area | of memory) | | rt6_update_expires() should be fixed, or am I missing something ? | | CC Neil because of https://bugzilla.redhat.com/show_bug.cgi?id=892060 Because we do not have any locks for dst_entry, we cannot change essential structure in the entry; e.g., we cannot change reference to other entity. To fix this issue, split 'from' and 'expires' field in dst_entry out of union. Once it is 'from' is assigned in the constructor, keep the reference until the very last stage of the life time of the object. Of course, it is unsafe to change 'from', so make rt6_set_from simple just for fresh entries. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Reported-by: Neil Horman <nhorman@tuxdriver.com> CC: Gao Feng <gaofeng@cn.fujitsu.com> Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reported-by: Steinar H. Gunderson <sesse@google.com> Reviewed-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
887c95cc |
|
16-Jan-2013 |
YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org> |
ipv6: Complete neighbour entry removal from dst_entry. CC: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a4477c4d |
|
07-Nov-2012 |
Li RongQing <roy.qing.li@gmail.com> |
ipv6: remove rt6i_peer_genid from rt6_info and its handler 6431cbc25f(Create a mechanism for upward inetpeer propagation into routes) introduces these codes, but this mechanism is never enabled since rt6i_peer_genid always is zero whether it is not assigned or assigned by rt6_peer_genid(). After 5943634fc5 (ipv4: Maintain redirect and PMTU info in struct rtable again), the ipv4 related codes of this mechanism has been removed, I think we maybe able to remove them now. Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
94e187c0 |
|
28-Oct-2012 |
Amerigo Wang <amwang@redhat.com> |
ipv6: introduce ip6_rt_put() As suggested by Eric, we could introduce a helper function for ipv6 too, to avoid checking if rt is NULL before dst_release(). Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
51ebd318 |
|
21-Oct-2012 |
Nicolas Dichtel <nicolas.dichtel@6wind.com> |
ipv6: add support of equal cost multipath (ECMP) Each nexthop is added like a single route in the routing table. All routes that have the same metric/weight and destination but not the same gateway are considering as ECMP routes. They are linked together, through a list called rt6i_siblings. ECMP routes can be added in one shot, with RTA_MULTIPATH attribute or one after the other (in both case, the flag NLM_F_EXCL should not be set). The patch is based on a previous work from Luc Saillard <luc.saillard@6wind.com>. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6f3118b5 |
|
10-Sep-2012 |
Nicolas Dichtel <nicolas.dichtel@6wind.com> |
ipv6: use net->rt_genid to check dst validity IPv6 dst should take care of rt_genid too. When a xfrm policy is inserted or deleted, all dst should be invalidated. To force the validation, dst entries should be created with ->obsolete set to DST_OBSOLETE_FORCE_CHK. This was already the case for all functions calling ip6_dst_alloc(), except for ip6_rt_copy(). As a consequence, we can remove the specific code in inet6_connection_sock. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ef2c7d7b |
|
04-Sep-2012 |
Nicolas Dichtel <nicolas.dichtel@6wind.com> |
ipv6: fix handling of blackhole and prohibit routes When adding a blackhole or a prohibit route, they were handling like classic routes. Moreover, it was only possible to add this kind of routes by specifying an interface. Bug already reported here: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=498498 Before the patch: $ ip route add blackhole 2001::1/128 RTNETLINK answers: No such device $ ip route add blackhole 2001::1/128 dev eth0 $ ip -6 route | grep 2001 2001::1 dev eth0 metric 1024 After: $ ip route add blackhole 2001::1/128 $ ip -6 route | grep 2001 blackhole 2001::1 dev lo metric 1024 error -22 v2: wrong patch v3: add a field fc_type in struct fib6_config to store RTN_* type Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
97cac082 |
|
02-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv6: Store route neighbour in rt6_info struct. This makes for a simplified conversion away from dst_get_neighbour*(). All code outside of ipv6 will use neigh lookups via dst_neigh_lookup*(). Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e8803b6c |
|
16-Jun-2012 |
David S. Miller <davem@davemloft.net> |
Revert "ipv6: Prevent access to uninitialized fib_table_hash via /proc/net/ipv6_route" This reverts commit 2a0c451ade8e1783c5d453948289e4a978d417c9. It causes crashes, because now ip6_null_entry is used before it is initialized. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2a0c451a |
|
14-Jun-2012 |
Thomas Graf <tgraf@suug.ch> |
ipv6: Prevent access to uninitialized fib_table_hash via /proc/net/ipv6_route /proc/net/ipv6_route reflects the contents of fib_table_hash. The proc handler is installed in ip6_route_net_init() whereas fib_table_hash is allocated in fib6_net_init() _after_ the proc handler has been installed. This opens up a short time frame to access fib_table_hash with its pants down. fib6_init() as a whole can't be moved to an earlier position as it also registers the rtnetlink message handlers which should be registered at the end. Therefore split it into fib6_init() which is run early and fib6_init_late() to register the rtnetlink message handlers. Signed-off-by: Thomas Graf <tgraf@suug.ch> Reviewed-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8e773277 |
|
11-Jun-2012 |
David S. Miller <davem@davemloft.net> |
inet: Add inetpeer tree roots to the FIB tables. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
97bab73f |
|
09-Jun-2012 |
David S. Miller <davem@davemloft.net> |
inet: Hide route peer accesses behind helpers. We encode the pointer(s) into an unsigned long with one state bit. The state bit is used so we can store the inetpeer tree root to use when resolving the peer later. Later the peer roots will be per-FIB table, and this change works to facilitate that. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cda31e10 |
|
15-Apr-2012 |
Jiri Bohac <jbohac@suse.cz> |
ipv6: clean up rt6_clean_expires Functionally, this change is a NOP. Semantically, rt6_clean_expires() wants to do rt->dst.from = NULL instead of rt->dst.expires = 0. It is clearing the RTF_EXPIRES flag, so the union is going to be treated as a pointer (dst.from) not a long (dst.expires). Signed-off-by: Jiri Bohac <jbohac@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
edfb5d46 |
|
15-Apr-2012 |
Jiri Bohac <jbohac@suse.cz> |
ipv6: fix rt6_update_expires Commit 1716a961 (ipv6: fix problem with expired dst cache) broke PMTU discovery. rt6_update_expires() calls dst_set_expires(), which only updates dst->expires if it has not been set previously (expires == 0) or if the new expires is earlier than the current dst->expires. rt6_update_expires() needs to zero rt->dst.expires, otherwise it will contain ivalid data left over from rt->dst.from and will confuse dst_set_expires(). Signed-off-by: Jiri Bohac <jbohac@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1716a961 |
|
05-Apr-2012 |
Gao feng <gaofeng@cn.fujitsu.com> |
ipv6: fix problem with expired dst cache If the ipv6 dst cache which copy from the dst generated by ICMPV6 RA packet. this dst cache will not check expire because it has no RTF_EXPIRES flag. So this dst cache will always be used until the dst gc run. Change the struct dst_entry,add a union contains new pointer from and expires. When rt6_info.rt6i_flags has no RTF_EXPIRES flag,the dst.expires has no use. we can use this field to point to where the dst cache copy from. The dst.from is only used in IPV6. rt6_check_expired check if rt6_info.dst.from is expired. ip6_rt_copy only set dst.from when the ort has flag RTF_ADDRCONF and RTF_DEFAULT.then hold the ort. ip6_dst_destroy release the ort. Add some functions to operate the RTF_EXPIRES flag and expires(from) together. and change the code to use these new adding functions. Changes from v5: modify ip6_route_add and ndisc_router_discovery to use new adding functions. Only set dst.from when the ort has flag RTF_ADDRCONF and RTF_DEFAULT.then hold the ort. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
32b293a5 |
|
28-Dec-2011 |
Josh Hunt <joshhunt00@gmail.com> |
IPv6: Avoid taking write lock for /proc/net/ipv6_route During some debugging I needed to look into how /proc/net/ipv6_route operated and in my digging I found its calling fib6_clean_all() which uses "write_lock_bh(&table->tb6_lock)" before doing the walk of the table. I found this on 2.6.32, but reading the code I believe the same basic idea exists currently. Looking at the rtnetlink code they are only calling "read_lock_bh(&table->tb6_lock);" via fib6_dump_table(). While I realize reading from proc isn't the recommended way of fetching the ipv6 route table; taking a write lock seems unnecessary and would probably cause network performance issues. To verify this I loaded up the ipv6 route table and then ran iperf in 3 cases: * doing nothing * reading ipv6 route table via proc (while :; do cat /proc/net/ipv6_route > /dev/null; done) * reading ipv6 route table via rtnetlink (while :; do ip -6 route show table all > /dev/null; done) * Load the ipv6 route table up with: * for ((i = 0;i < 4000;i++)); do ip route add unreachable 2000::$i; done * iperf commands: * client: iperf -i 1 -V -c <ipv6 addr> * server: iperf -V -s * iperf results - 3 runs each (in Mbits/sec) * nothing: client: 927,927,927 server: 927,927,927 * proc: client: 179,97,96,113 server: 142,112,133 * iproute: client: 928,927,928 server: 927,927,927 lock_stat shows taking the write lock is causing the slowdown. Using this info I decided to write a version of fib6_clean_all() which replaces write_lock_bh(&table->tb6_lock) with read_lock_bh(&table->tb6_lock). With this new function I see the same results as with my rtnetlink iperf test. Signed-off-by: Josh Hunt <joshhunt00@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d1918542 |
|
28-Dec-2011 |
David S. Miller <davem@davemloft.net> |
ipv6: Kill rt6i_dev and rt6i_expires defines. It just obscures that the netdevice pointer and the expires value are implemented in the dst_entry sub-object of the ipv6 route. And it makes grepping for dst_entry member uses much harder too. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9cbb7ecb |
|
17-Jul-2011 |
David S. Miller <davem@davemloft.net> |
ipv6: Get rid of rt6i_nexthop macro. It just makes it harder to see 1) what the code is doing and 2) grep for all users of dst{->,.}neighbour Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2a9e9507 |
|
24-Apr-2011 |
David S. Miller <davem@davemloft.net> |
net: Remove __KERNEL__ cpp checks from include/net These header files are never installed to user consumption, so any __KERNEL__ cpp checks are superfluous. Projects should also not copy these files into their userland utility sources and try to use them there. If they insist on doing so, the onus is on them to sanitize the headers as needed. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b71d1d42 |
|
21-Apr-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
inet: constify ip headers and in6_addr Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers where possible, to make code intention more obvious. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c3968a85 |
|
13-Apr-2011 |
Daniel Walter <sahne@0x90.at> |
ipv6: RTA_PREFSRC support for ipv6 route source address selection [ipv6] Add support for RTA_PREFSRC This patch allows a user to select the preferred source address for a specific IPv6-Route. It can be set via a netlink message setting RTA_PREFSRC to a valid IPv6 address which must be up on the device the route will be bound to. Signed-off-by: Daniel Walter <dwalter@barracuda.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4c9483b2 |
|
12-Mar-2011 |
David S. Miller <davem@davemloft.net> |
ipv6: Convert to use flowi6 where applicable. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6431cbc2 |
|
07-Feb-2011 |
David S. Miller <davem@davemloft.net> |
inet: Create a mechanism for upward inetpeer propagation into routes. If we didn't have a routing cache, we would not be able to properly propagate certain kinds of dynamic path attributes, for example PMTU information and redirects. The reason is that if we didn't have a routing cache, then there would be no way to lookup all of the active cached routes hanging off of sockets, tunnels, IPSEC bundles, etc. Consider the case where we created a cached route, but no inetpeer entry existed and also we were not asked to pre-COW the route metrics and therefore did not force the creation a new inetpeer entry. If we later get a PMTU message, or a redirect, and store this information in a new inetpeer entry, there is no way to teach that cached route about the newly existing inetpeer entry. The facilities implemented here handle this problem. First we create a generation ID. When we create a cached route of any kind, we remember the generation ID at the time of attachment. Any time we force-create an inetpeer entry in response to new path information, we bump that generation ID. The dst_ops->check() callback is where the knowledge of this event is propagated. If the global generation ID does not equal the one stored in the cached route, and the cached route has not attached to an inetpeer yet, we look it up and attach if one is found. Now that we've updated the cached route's information, we update the route's generation ID too. This clears the way for implementing PMTU and redirects directly in the inetpeer cache. There is absolutely no need to consult cached route information in order to maintain this information. At this point nothing bumps the inetpeer genids, that comes in the later changes which handle PMTUs and redirects using inetpeers. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b3419363 |
|
30-Nov-2010 |
David S. Miller <davem@davemloft.net> |
ipv6: Add infrastructure to bind inet_peer objects to routes. They are only allowed on cached ipv6 routes. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d8d1f30b |
|
11-Jun-2010 |
Changli Gao <xiaosuo@gmail.com> |
net-next: remove useless union keyword remove useless union keyword in rtable, rt6_info and dn_route. Since there is only one member in a union, the union keyword isn't useful. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bd2c77a0 |
|
31-Mar-2010 |
YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org> |
ipv6 fib: Make rt6_info{} more cache-line aware. The head element of rt6_info{} is dst_entry{}, and IPv6 specific elements follow. Because elements at the end of dst_entry{} are frequently updated, it is not good to put frequently-used static elements, such as rt6i_idev, rt6i_dst or rt6i_flags in the same cache line. On the other hand, fib6_table, rt6i_node or rt6i_gateway are rarely used, so it is okay to stay in the same cache line. Let's rearrange rt6_info{}. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bbef49da |
|
18-Feb-2010 |
Alexey Dobriyan <adobriyan@gmail.com> |
ipv6: use standard lists for FIB walks Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2bec5a36 |
|
07-Feb-2010 |
Patrick McHardy <kaber@trash.net> |
ipv6: fib: fix crash when changing large fib while dumping it When the fib size exceeds what can be dumped in a single skb, the dump is suspended and resumed once the last skb has been received by userspace. When the fib is changed while the dump is suspended, the walker might contain stale pointers, causing a crash when the dump is resumed. BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 IP: [<ffffffffa01bce04>] fib6_walk_continue+0xbb/0x124 [ipv6] PGD 5347a067 PUD 65c7067 PMD 0 Oops: 0000 [#1] PREEMPT SMP ... RIP: 0010:[<ffffffffa01bce04>] [<ffffffffa01bce04>] fib6_walk_continue+0xbb/0x124 [ipv6] ... Call Trace: [<ffffffff8104aca3>] ? mutex_spin_on_owner+0x59/0x71 [<ffffffffa01bd105>] inet6_dump_fib+0x11b/0x1b9 [ipv6] [<ffffffff81371af4>] netlink_dump+0x5b/0x19e [<ffffffff8134f288>] ? consume_skb+0x28/0x2a [<ffffffff81373b69>] netlink_recvmsg+0x1ab/0x2c6 [<ffffffff81372781>] ? netlink_unicast+0xfa/0x151 [<ffffffff813483e0>] __sock_recvmsg+0x6d/0x79 [<ffffffff81348a53>] sock_recvmsg+0xca/0xe3 [<ffffffff81066d4b>] ? autoremove_wake_function+0x0/0x38 [<ffffffff811ed1f8>] ? radix_tree_lookup_slot+0xe/0x10 [<ffffffff810b3ed7>] ? find_get_page+0x90/0xa5 [<ffffffff810b5dc5>] ? filemap_fault+0x201/0x34f [<ffffffff810ef152>] ? fget_light+0x2f/0xac [<ffffffff813519e7>] ? verify_iovec+0x4f/0x94 [<ffffffff81349a65>] sys_recvmsg+0x14d/0x223 Store the serial number when beginning to walk the fib and reload pointers when continuing to walk after a change occured. Similar to other dumping functions, this might cause unrelated entries to be missed when entries are deleted. Tested-by: Ben Greear <greearb@candelatech.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fd2c3ef7 |
|
02-Nov-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: cleanup include/net This cleanup patch puts struct/union/enum opening braces, in first line to ease grep games. struct something { becomes : struct something { Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a33bc5c1 |
|
30-Jul-2009 |
Neil Horman <nhorman@tuxdriver.com> |
xfrm: select sane defaults for xfrm[4|6] gc_thresh Choose saner defaults for xfrm[4|6] gc_thresh values on init Currently, the xfrm[4|6] code has hard-coded initial gc_thresh values (set to 1024). Given that the ipv4 and ipv6 routing caches are sized dynamically at boot time, the static selections can be non-sensical. This patch dynamically selects an appropriate gc threshold based on the corresponding main routing table size, using the assumption that we should in the worst case be able to handle as many connections as the routing table can. For ipv4, the maximum route cache size is 16 * the number of hash buckets in the route cache. Given that xfrm4 starts garbage collection at the gc_thresh and prevents new allocations at 2 * gc_thresh, we set gc_thresh to half the maximum route cache size. For ipv6, its a bit trickier. there is no maximum route cache size, but the ipv6 dst_ops gc_thresh is statically set to 1024. It seems sane to select a simmilar gc_thresh for the xfrm6 code that is half the number of hash buckets in the v6 route cache times 16 (like the v4 code does). Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8ed67789 |
|
04-Mar-2008 |
Daniel Lezcano <dlezcano@fr.ibm.com> |
[NETNS][IPV6] rt6_info - move rt6_info structure inside the namespace The rt6_info structures are moved inside the network namespace structure. All references to these structures are now relative to the initial network namespace. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5b7c931d |
|
04-Mar-2008 |
Daniel Lezcano <dlezcano@fr.ibm.com> |
[NETNS][IPV6] ip6_fib - add net to gc timer parameter The fib tables are now relative to the network namespace. When the garbage collector timer expires, we must have a network namespace parameter in order to retrieve the tables. For now this is the init_net, but we should be able to have a timer per namespace and use the timer callback parameter to pass the network namespace from the expired timer. The timer callback, fib6_run_gc, is actually used to be called synchronously by some functions and asynchronously when the timer expires. When the timer expires, the delay specified for fib6_run_gc parameter is always zero. So, I changed fib6_run_gc to not be a timer callback but a function called by the timer callback and I added a timer callback where its work is just to retrieve from the data arg of the timer the network namespace and call fib6_run_gc with zero expiring time and the network namespace parameters. That makes the code cleaner for the fib6_run_gc callers. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f3db4851 |
|
04-Mar-2008 |
Daniel Lezcano <dlezcano@fr.ibm.com> |
[NETNS][IPV6] ip6_fib - fib6_clean_all handle several network namespaces The function fib6_clean_all takes the network namespace as parameter. That allows to flush the routes related to a specific network namespace. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
58f09b78 |
|
04-Mar-2008 |
Daniel Lezcano <dlezcano@fr.ibm.com> |
[NETNS][IPV6] ip6_fib - make it per network namespace The fib table for ipv6 are moved to the network namespace structure. All references to them are made relatively to the network namespace. All external calls to the ip6_fib functions taking the network namespace parameter are made using the init_net variable, so the ip6_fib engine is ready for the namespaces but the callers not yet. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4e881a21 |
|
07-Feb-2008 |
Rami Rosen <ramirose@gmail.com> |
[IPV6] Minor cleanup: remove unused definitions in net/ip6_fib.h This patch removes some unused definitions and one method typedef declaration (f_pnode) in include/net/ip6_fib.h, as they are not used in the kernel. Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a1b05140 |
|
20-Dec-2007 |
Masahide NAKAMURA <nakam@linux-ipv6.org> |
[XFRM] IPv6: Fix dst/routing check at transformation. IPv6 specific thing is wrongly removed from transformation at net-2.6.25. This patch recovers it with current design. o Update "path" of xfrm_dst since IPv6 transformation should care about routing changes. It is required by MIPv6 and off-link destined IPsec. o Rename nfheader_len which is for non-fragment transformation used by MIPv6 to rt6i_nfheader_len as IPv6 name space. Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7e5449c2 |
|
08-Dec-2007 |
Daniel Lezcano <dlezcano@fr.ibm.com> |
[IPV6]: route6 remove ifdef for fib_rules The patch defines the usual static inline functions when the code is disabled for fib6_rules. That's allow to remove some ifdef in route.c file and make the code a little more clear. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9eb87f3f |
|
07-Dec-2007 |
Daniel Lezcano <dlezcano@fr.ibm.com> |
[IPV6]: Make fib6_rules_init to return an error code. When the fib_rules initialization finished, no return code is provided so there is no way to know, for the caller, if the initialization has been successful or has failed. This patch fix that. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Acked-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d63bddbe |
|
07-Dec-2007 |
Daniel Lezcano <dlezcano@fr.ibm.com> |
[IPV6]: Make fib6_init to return an error code. If there is an error in the initialization function, nothing is followed up to the caller. So I add a return value to be set for the init function. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Acked-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b4ce9277 |
|
13-Nov-2007 |
Herbert Xu <herbert@gondor.apana.org.au> |
[IPV6]: Move nfheader_len into rt6_info The dst member nfheader_len is only used by IPv6. It's also currently creating a rather ugly alignment hole in struct dst. Therefore this patch moves it from there into struct rt6_info. It also reorders the fields in rt6_info to minimize holes. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a47ed4cd |
|
06-Sep-2007 |
Noriaki TAKAMIYA <takamiya@po.ntts.co.jp> |
[IPV6] XFRM: Fix connected socket to use transformation. When XFRM policy and state are ready after TCP connection is started, the traffic should be transformed immediately, however it does not on IPv6 TCP. It depends on a dst cache replacement policy with connected socket. It seems that the replacement is always done for IPv4, however, on IPv6 case it is done only when routing cookie is changed. This patch fix that non-transformation dst can be changed to transformation one. This behavior is required by MIPv6 and improves IPv6 IPsec. Fixes by Masahide NAKAMURA. Signed-off-by: Noriaki TAKAMIYA <takamiya@po.ntts.co.jp> Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c127ea2c |
|
22-Mar-2007 |
Thomas Graf <tgraf@suug.ch> |
[IPv6]: Use rtnl registration interface Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f11e6659 |
|
24-Mar-2007 |
David S. Miller <davem@sunset.davemloft.net> |
[IPV6]: Fix routing round-robin locking. As per RFC2461, section 6.3.6, item #2, when no routers on the matching list are known to be reachable or probably reachable we do round robin on those available routes so that we make sure to probe as many of them as possible to detect when one becomes reachable faster. Each routing table has a rwlock protecting the tree and the linked list of routes at each leaf. The round robin code executes during lookup and thus with the rwlock taken as a reader. A small local spinlock tries to provide protection but this does not work at all for two reasons: 1) The round-robin list manipulation, as coded, goes like this (with read lock held): walk routes finding head and tail spin_lock(); rotate list using head and tail spin_unlock(); While one thread is rotating the list, another thread can end up with stale values of head and tail and then proceed to corrupt the list when it gets the lock. This ends up causing the OOPS in fib6_add() later onthat many people have been hitting. 2) All the other code paths that run with the rwlock held as a reader do not expect the list to change on them, they expect it to remain completely fixed while they hold the lock in that way. So, simply stated, it is impossible to implement this correctly using a manipulation of the list without violating the rwlock locking semantics. Reimplement using a per-fib6_node round-robin pointer. This way we don't need to manipulate the list at all, and since the round-robin pointer can only ever point to real existing entries we don't need to perform any locking on the changing of the round-robin pointer itself. We only need to reset the round-robin pointer to NULL when the entry it is pointing to is removed. The idea is from Thomas Graf and it is very similar to how this was implemented before the advanced router selection code when in. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7cc48263 |
|
09-Feb-2007 |
Eric Dumazet <dada1@cosmosbay.com> |
[IPV6]: Convert ipv6 route to use the new dst_entry 'next' pointer This patch removes the next pointer from 'struct rt6_info.u' union, and renames u.next to u.dst.rt6_next. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8bce65b9 |
|
13-Dec-2006 |
Kim Nordlund <kim.nordlund@nokia.com> |
[IPV6]: Make fib6_node subtree depend on IPV6_SUBTREES Make fib6_node 'subtree' depend on IPV6_SUBTREES. Signed-off-by: Kim Nordlund <kim.nordlund@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7a3025b1 |
|
13-Oct-2006 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[IPV6]: Introduce ip6_dst_idev() to get inet6_dev{} stored in dst_entry{}. Otherwise, we will see a lot of casts... Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
|
#
77d16f45 |
|
23-Aug-2006 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[IPV6] ROUTE: Unify RT6_F_xxx and RT6_SELECT_F_xxx flags Unify RT6_F_xxx and RT6_SELECT_F_xxx flags into RT6_LOOKUP_F_xxx flags, and put them into ip6_route.h Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Acked-by: Ville Nuorvala <vnuorval@tcs.hut.fi Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7fc33165 |
|
23-Aug-2006 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[IPV6] ROUTE: Put SUBTREE() as FIB6_SUBTREE() into ip6_fib.h for future use. Based on MIPL2 kernel patch. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Ville Nuorvala <vnuorval@tcs.hut.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
86872cb5 |
|
22-Aug-2006 |
Thomas Graf <tgraf@suug.ch> |
[IPv6] route: FIB6 configuration using struct fib6_config Replaces the struct in6_rtmsg based interface orignating from the ioctl interface with a struct fib6_config based on. Allows changing the interface without breaking the ioctl interface and avoids passing on tons of parameters. The recently introduced struct nl_info is used to pass on netlink authorship information for notifications. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
90d41122 |
|
15-Aug-2006 |
Adrian Bunk <bunk@stusta.de> |
[IPV6] ip6_fib.c: make code static Make the following needlessly global code static: - fib6_walker_lock - struct fib6_walker_list - fib6_walk_continue() - fib6_walk() Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8ce11e6a |
|
07-Aug-2006 |
Adrian Bunk <bunk@stusta.de> |
[NET]: Make code static. This patch makes needlessly global code static. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
101367c2 |
|
04-Aug-2006 |
Thomas Graf <tgraf@suug.ch> |
[IPV6]: Policy Routing Rules Adds support for policy routing rules including a new local table for routes with a local destination. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c71099ac |
|
05-Aug-2006 |
Thomas Graf <tgraf@suug.ch> |
[IPV6]: Multiple Routing Tables Adds the framework to support multiple IPv6 routing tables. Currently all automatically generated routes are put into the same table. This could be changed at a later point after considering the produced locking overhead. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0d51aa80 |
|
21-Jun-2005 |
Jamal Hadi Salim <hadi@cyberus.ca> |
[IPV6]: V6 route events reported with wrong netlink PID and seq number Essentially netlink at the moment always reports a pid and sequence of 0 always for v6 route activities. To understand the repurcassions of this look at: http://lists.quagga.net/pipermail/quagga-dev/2005-June/003507.html While fixing this, i took the liberty to resolve the outstanding issue of IPV6 routes inserted via ioctls to have the correct pids as well. This patch tries to behave as close as possible to the v4 routes i.e maintains whatever PID the socket issuing the command owns as opposed to the process. That made the patch a little bulky. I have tested against both netlink derived utility to add/del routes as well as ioctl derived one. The Quagga folks have tested against quagga. This fixes the problem and so far hasnt been detected to introduce any new issues. Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1da177e4 |
|
16-Apr-2005 |
Linus Torvalds <torvalds@ppc970.osdl.org> |
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!
|