#
58a4c9b1 |
|
21-Apr-2024 |
Eric Dumazet <edumazet@google.com> |
ipv4: check for NULL idev in ip_route_use_hint() syzbot was able to trigger a NULL deref in fib_validate_source() in an old tree [1]. It appears the bug exists in latest trees. All calls to __in_dev_get_rcu() must be checked for a NULL result. [1] general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 2 PID: 3257 Comm: syz-executor.3 Not tainted 5.10.0-syzkaller #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:fib_validate_source+0xbf/0x15a0 net/ipv4/fib_frontend.c:425 Code: 18 f2 f2 f2 f2 42 c7 44 20 23 f3 f3 f3 f3 48 89 44 24 78 42 c6 44 20 27 f3 e8 5d 88 48 fc 4c 89 e8 48 c1 e8 03 48 89 44 24 18 <42> 80 3c 20 00 74 08 4c 89 ef e8 d2 15 98 fc 48 89 5c 24 10 41 bf RSP: 0018:ffffc900015fee40 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88800f7a4000 RCX: ffff88800f4f90c0 RDX: 0000000000000000 RSI: 0000000004001eac RDI: ffff8880160c64c0 RBP: ffffc900015ff060 R08: 0000000000000000 R09: ffff88800f7a4000 R10: 0000000000000002 R11: ffff88800f4f90c0 R12: dffffc0000000000 R13: 0000000000000000 R14: 0000000000000000 R15: ffff88800f7a4000 FS: 00007f938acfe6c0(0000) GS:ffff888058c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f938acddd58 CR3: 000000001248e000 CR4: 0000000000352ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ip_route_use_hint+0x410/0x9b0 net/ipv4/route.c:2231 ip_rcv_finish_core+0x2c4/0x1a30 net/ipv4/ip_input.c:327 ip_list_rcv_finish net/ipv4/ip_input.c:612 [inline] ip_sublist_rcv+0x3ed/0xe50 net/ipv4/ip_input.c:638 ip_list_rcv+0x422/0x470 net/ipv4/ip_input.c:673 __netif_receive_skb_list_ptype net/core/dev.c:5572 [inline] __netif_receive_skb_list_core+0x6b1/0x890 net/core/dev.c:5620 __netif_receive_skb_list net/core/dev.c:5672 [inline] netif_receive_skb_list_internal+0x9f9/0xdc0 net/core/dev.c:5764 netif_receive_skb_list+0x55/0x3e0 net/core/dev.c:5816 xdp_recv_frames net/bpf/test_run.c:257 [inline] xdp_test_run_batch net/bpf/test_run.c:335 [inline] bpf_test_run_xdp_live+0x1818/0x1d00 net/bpf/test_run.c:363 bpf_prog_test_run_xdp+0x81f/0x1170 net/bpf/test_run.c:1376 bpf_prog_test_run+0x349/0x3c0 kernel/bpf/syscall.c:3736 __sys_bpf+0x45c/0x710 kernel/bpf/syscall.c:5115 __do_sys_bpf kernel/bpf/syscall.c:5201 [inline] __se_sys_bpf kernel/bpf/syscall.c:5199 [inline] __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5199 Fixes: 02b24941619f ("ipv4: use dst hint for ipv4 list receive") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240421184326.1704930-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
cf1b7201 |
|
08-Apr-2024 |
Arnd Bergmann <arnd@arndb.de> |
ipv4/route: avoid unused-but-set-variable warning The log_martians variable is only used in an #ifdef, causing a 'make W=1' warning with gcc: net/ipv4/route.c: In function 'ip_rt_send_redirect': net/ipv4/route.c:880:13: error: variable 'log_martians' set but not used [-Werror=unused-but-set-variable] Change the #ifdef to an equivalent IS_ENABLED() to let the compiler see where the variable is used. Fixes: 30038fc61adf ("net: ip_rt_send_redirect() optimization") Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240408074219.3030256-2-arnd@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
0598f8f3 |
|
27-Feb-2024 |
Eric Dumazet <edumazet@google.com> |
inet: annotate devconf data-races Add READ_ONCE() in ipv4_devconf_get() and corresponding WRITE_ONCE() in ipv4_devconf_set() Add IPV4_DEVCONF_RO() and IPV4_DEVCONF_ALL_RO() macros, and use them when reading devconf fields. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240227092411.2315725-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
7eb2bc24 |
|
20-Feb-2024 |
Kunwu Chan <chentao@kylinos.cn> |
ipv4: Simplify the allocation of slab caches in ip_rt_init Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. And change cache name from 'ip_dst_cache' to 'rtable'. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c0e29262 |
|
19-Nov-2023 |
Kunwu Chan <chentao@kylinos.cn> |
ipv4: Correct/silence an endian warning in __ip_do_redirect net/ipv4/route.c:783:46: warning: incorrect type in argument 2 (different base types) net/ipv4/route.c:783:46: expected unsigned int [usertype] key net/ipv4/route.c:783:46: got restricted __be32 [usertype] new_gw Fixes: 969447f226b4 ("ipv4: use new_gw for redirect neigh lookup") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Link: https://lore.kernel.org/r/20231119141759.420477-1-chentao@kylinos.cn Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
bf3fcbf7 |
|
16-Oct-2023 |
Beniamino Galvani <b.galvani@gmail.com> |
ipv4: rename and move ip_route_output_tunnel() At the moment ip_route_output_tunnel() is used only by bareudp. Ideally, other UDP tunnel implementations should use it, but to do so the function needs to accept new parameters that are specific for UDP tunnels, such as the ports. Prepare for these changes by renaming the function to udp_tunnel_dst_lookup() and move it to file net/ipv4/udp_tunnel_core.c. Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
762c8dc7 |
|
11-Sep-2023 |
Zhengchao Shao <shaozhengchao@huawei.com> |
net: dst: remove unnecessary input parameter in dst_alloc and dst_init Since commit 1202cdd66531("Remove DECnet support from kernel") has been merged, all callers pass in the initial_ref value of 1 when they call dst_alloc(). Therefore, remove initial_ref when the dst_alloc() is declared and replace initial_ref with 1 in dst_alloc(). Also when all callers call dst_init(), the value of initial_ref is 1. Therefore, remove the input parameter initial_ref of the dst_init() and replace initial_ref with the value 1 in dst_init. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Link: https://lore.kernel.org/r/20230911125045.346390-1-shaozhengchao@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
0add5c59 |
|
26-Sep-2023 |
Benjamin Poirier <bpoirier@nvidia.com> |
ipv4: Set offload_failed flag in fibmatch results Due to a small omission, the offload_failed flag is missing from ipv4 fibmatch results. Make sure it is set correctly. The issue can be witnessed using the following commands: echo "1 1" > /sys/bus/netdevsim/new_device ip link add dummy1 up type dummy ip route add 192.0.2.0/24 dev dummy1 echo 1 > /sys/kernel/debug/netdevsim/netdevsim1/fib/fail_route_offload ip route add 198.51.100.0/24 dev dummy1 ip route # 192.168.15.0/24 has rt_trap # 198.51.100.0/24 has rt_offload_failed ip route get 192.168.15.1 fibmatch # Result has rt_trap ip route get 198.51.100.1 fibmatch # Result differs from the route shown by `ip route`, it is missing # rt_offload_failed ip link del dev dummy1 echo 1 > /sys/bus/netdevsim/del_device Fixes: 36c5100e859d ("IPv4: Add "offload failed" indication to routes") Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230926182730.231208-1-bpoirier@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
0113d9c9 |
|
14-Sep-2023 |
Kyle Zeng <zengyhkyle@gmail.com> |
ipv4: fix null-deref in ipv4_link_failure Currently, we assume the skb is associated with a device before calling __ip_options_compile, which is not always the case if it is re-routed by ipvs. When skb->dev is NULL, dev_net(skb->dev) will become null-dereference. This patch adds a check for the edge case and switch to use the net_device from the rtable when skb->dev is NULL. Fixes: ed0de45a1008 ("ipv4: recompile ip options in ipv4_link_failure") Suggested-by: David Ahern <dsahern@kernel.org> Signed-off-by: Kyle Zeng <zengyhkyle@gmail.com> Cc: Stephen Suryaputra <ssuryaextr@gmail.com> Cc: Vadim Fedorenko <vfedorenko@novek.ru> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6ac66cb0 |
|
31-Aug-2023 |
Sriram Yagnaraman <sriram.yagnaraman@est.tech> |
ipv4: ignore dst hint for multipath routes Route hints when the nexthop is part of a multipath group causes packets in the same receive batch to be sent to the same nexthop irrespective of the multipath hash of the packet. So, do not extract route hint for packets whose destination is part of a multipath group. A new SKB flag IPSKB_MULTIPATH is introduced for this purpose, set the flag when route is looked up in ip_mkroute_input() and use it in ip_extract_route_hint() to check for the existence of the flag. Fixes: 02b24941619f ("ipv4: use dst hint for ipv4 list receive") Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cafbe182 |
|
16-Aug-2023 |
Eric Dumazet <edumazet@google.com> |
inet: move inet->hdrincl to inet->inet_flags IP_HDRINCL socket option can now be set/read without locking the socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c899710f |
|
08-Aug-2023 |
Joel Granados <joel.granados@gmail.com> |
networking: Update to register_net_sysctl_sz Move from register_net_sysctl to register_net_sysctl_sz for all the networking related files. Do this while making sure to mirror the NULL assignments with a table_size of zero for the unprivileged users. We need to move to the new function in preparation for when we change SIZE_MAX to ARRAY_SIZE() in the register_net_sysctl macro. Failing to do so would erroneously allow ARRAY_SIZE() to be called on a pointer. We hold off the SIZE_MAX to ARRAY_SIZE change until we have migrated all the relevant net sysctl registering functions to register_net_sysctl_sz in subsequent commits. An additional size function was added to the following files in order to calculate the size of an array that is defined in another file: include/net/ipv6.h net/ipv6/icmp.c net/ipv6/route.c net/ipv6/sysctl_net_ipv6.c Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
#
3c5b4d69 |
|
28-Jul-2023 |
Eric Dumazet <edumazet@google.com> |
net: annotate data-races around sk->sk_mark sk->sk_mark is often read while another thread could change the value. Fixes: 4a19ec5800fc ("[NET]: Introducing socket mark socket option.") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
418a7307 |
|
20-Apr-2023 |
Maxime Bizon <mbizon@freebox.fr> |
net: dst: fix missing initialization of rt_uncached xfrm_alloc_dst() followed by xfrm4_dst_destroy(), without a xfrm4_fill_dst() call in between, causes the following BUG: BUG: spinlock bad magic on CPU#0, fbxhostapd/732 lock: 0x890b7668, .magic: 890b7668, .owner: <none>/-1, .owner_cpu: 0 CPU: 0 PID: 732 Comm: fbxhostapd Not tainted 6.3.0-rc6-next-20230414-00613-ge8de66369925-dirty #9 Hardware name: Marvell Kirkwood (Flattened Device Tree) unwind_backtrace from show_stack+0x10/0x14 show_stack from dump_stack_lvl+0x28/0x30 dump_stack_lvl from do_raw_spin_lock+0x20/0x80 do_raw_spin_lock from rt_del_uncached_list+0x30/0x64 rt_del_uncached_list from xfrm4_dst_destroy+0x3c/0xbc xfrm4_dst_destroy from dst_destroy+0x5c/0xb0 dst_destroy from rcu_process_callbacks+0xc4/0xec rcu_process_callbacks from __do_softirq+0xb4/0x22c __do_softirq from call_with_stack+0x1c/0x24 call_with_stack from do_softirq+0x60/0x6c do_softirq from __local_bh_enable_ip+0xa0/0xcc Patch "net: dst: Prevent false sharing vs. dst_entry:: __refcnt" moved rt_uncached and rt_uncached_list fields from rtable struct to dst struct, so they are more zeroed by memset_after(xdst, 0, u.dst) in xfrm_alloc_dst(). Note that rt_uncached (list_head) was never properly initialized at alloc time, but xfrm[46]_dst_destroy() is written in such a way that it was not an issue thanks to the memset: if (xdst->u.rt.dst.rt_uncached_list) rt_del_uncached_list(&xdst->u.rt); The route code does it the other way around: rt_uncached_list is assumed to be valid IIF rt_uncached list_head is not empty: void rt_del_uncached_list(struct rtable *rt) { if (!list_empty(&rt->dst.rt_uncached)) { struct uncached_list *ul = rt->dst.rt_uncached_list; spin_lock_bh(&ul->lock); list_del_init(&rt->dst.rt_uncached); spin_unlock_bh(&ul->lock); } } This patch adds mandatory rt_uncached list_head initialization in generic dst_init(), and adapt xfrm[46]_dst_destroy logic to match the rest of the code. Fixes: d288a162dd1c ("net: dst: Prevent false sharing vs. dst_entry:: __refcnt") Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/oe-lkp/202304162125.18b7bcdd-oliver.sang@intel.com Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> CC: Leon Romanovsky <leon@kernel.org> Signed-off-by: Maxime Bizon <mbizon@freebox.fr> Link: https://lore.kernel.org/r/20230420182508.2417582-1-mbizon@freebox.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
d288a162 |
|
23-Mar-2023 |
Wangyang Guo <wangyang.guo@intel.com> |
net: dst: Prevent false sharing vs. dst_entry:: __refcnt dst_entry::__refcnt is highly contended in scenarios where many connections happen from and to the same IP. The reference count is an atomic_t, so the reference count operations have to take the cache-line exclusive. Aside of the unavoidable reference count contention there is another significant problem which is caused by that: False sharing. perf top identified two affected read accesses. dst_entry::lwtstate and rtable::rt_genid. dst_entry:__refcnt is located at offset 64 of dst_entry, which puts it into a seperate cacheline vs. the read mostly members located at the beginning of the struct. That prevents false sharing vs. the struct members in the first 64 bytes of the structure, but there is also dst_entry::lwtstate which is located after the reference count and in the same cache line. This member is read after a reference count has been acquired. struct rtable embeds a struct dst_entry at offset 0. struct dst_entry has a size of 112 bytes, which means that the struct members of rtable which follow the dst member share the same cache line as dst_entry::__refcnt. Especially rtable::rt_genid is also read by the contexts which have a reference count acquired already. When dst_entry:__refcnt is incremented or decremented via an atomic operation these read accesses stall. This was found when analysing the memtier benchmark in 1:100 mode, which amplifies the problem extremly. Move the rt[6i]_uncached[_list] members out of struct rtable and struct rt6_info into struct dst_entry to provide padding and move the lwtstate member after that so it ends up in the same cache line. The resulting improvement depends on the micro-architecture and the number of CPUs. It ranges from +20% to +120% with a localhost memtier/memcached benchmark. [ tglx: Rearrange struct ] Signed-off-by: Wangyang Guo <wangyang.guo@intel.com> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230323102800.042297517@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
09eed119 |
|
20-Mar-2023 |
Eric Dumazet <edumazet@google.com> |
neighbour: switch to standard rcu, instead of rcu_bh rcu_bh is no longer a win, especially for objects freed with standard call_rcu(). Switch neighbour code to no longer disable BH when not necessary. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
b071af52 |
|
13-Mar-2023 |
Eric Dumazet <edumazet@google.com> |
neighbour: annotate lockless accesses to n->nud_state We have many lockless accesses to n->nud_state. Before adding another one in the following patch, add annotations to readers and writers. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
8032bf12 |
|
09-Oct-2022 |
Jason A. Donenfeld <Jason@zx2c4.com> |
treewide: use get_random_u32_below() instead of deprecated function This is a simple mechanical transformation done by: @@ expression E; @@ - prandom_u32_max + get_random_u32_below (E) Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Reviewed-by: SeongJae Park <sj@kernel.org> # for damon Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
#
197173db |
|
05-Oct-2022 |
Jason A. Donenfeld <Jason@zx2c4.com> |
treewide: use get_random_bytes() when possible The prandom_bytes() function has been a deprecated inline wrapper around get_random_bytes() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. This was done as a basic find and replace. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> # powerpc Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
#
a251c17a |
|
05-Oct-2022 |
Jason A. Donenfeld <Jason@zx2c4.com> |
treewide: use get_random_u32() when possible The prandom_u32() function has been a deprecated inline wrapper around get_random_u32() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. The same also applies to get_random_int(), which is just a wrapper around get_random_u32(). This was done as a basic find and replace. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Acked-by: Helge Deller <deller@gmx.de> # for parisc Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
#
5022e221 |
|
11-Jul-2022 |
Zhengchao Shao <shaozhengchao@huawei.com> |
net: change the type of ip_route_input_rcu to static The type of ip_route_input_rcu should be static. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Link: https://lore.kernel.org/r/20220711073549.8947-1-shaozhengchao@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
b5c8b3fe |
|
20-May-2022 |
Eyal Birger <eyal.birger@gmail.com> |
xfrm: no need to set DST_NOPOLICY in IPv4 This is a cleanup patch following commit e6175a2ed1f1 ("xfrm: fix "disable_policy" flag use when arriving from different devices") which made DST_NOPOLICY no longer be used for inbound policy checks. On outbound the flag was set, but never used. As such, avoid setting it altogether and remove the nopolicy argument from rt_dst_alloc(). Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
#
d62607c3 |
|
07-Jun-2022 |
Jakub Kicinski <kuba@kernel.org> |
net: rename reference+tracking helpers Netdev reference helpers have a dev_ prefix for historic reasons. Renaming the old helpers would be too much churn but we can rename the tracking ones which are relatively recent and should be the default for new code. Rename: dev_hold_track() -> netdev_hold() dev_put_track() -> netdev_put() dev_replace_track() -> netdev_ref_replace() Link: https://lore.kernel.org/r/20220608043955.919359-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
8895a9c2 |
|
18-Jul-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
ipv4: Fix data-races around sysctl_fib_multipath_hash_fields. While reading sysctl_fib_multipath_hash_fields, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: ce5c9c20d364 ("ipv4: Add a sysctl to control multipath hash fields") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7998c12a |
|
18-Jul-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
ipv4: Fix data-races around sysctl_fib_multipath_hash_policy. While reading sysctl_fib_multipath_hash_policy, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: bf4e0a3db97e ("net: ipv4: add support for ECMP hash policy choice") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
60c158dc |
|
13-Jul-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
ip: Fix data-races around sysctl_ip_fwd_use_pmtu. While reading sysctl_ip_fwd_use_pmtu, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: f87c10a8aa1e ("ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b1ad4138 |
|
20-Apr-2022 |
Guillaume Nault <gnault@redhat.com> |
ipv4: Initialise ->flowi4_scope properly in ICMP handlers. All the *_redirect() and *_update_pmtu() functions initialise their struct flowi4 variable with either __build_flow_key() or build_sk_flow_key(). When sk is provided, these functions use RT_CONN_FLAGS() to set ->flowi4_tos and always use RT_SCOPE_UNIVERSE for ->flowi4_scope. Then they rely on ip_rt_fix_tos() to adjust the scope based on the RTO_ONLINK bit and to mask the tos with IPTOS_RT_MASK. This patch modifies __build_flow_key() and build_sk_flow_key() to properly initialise ->flowi4_tos and ->flowi4_scope, so that the ICMP redirects and PMTU handlers don't need an extra call to ip_rt_fix_tos() before doing a fib lookup. That is, we: * Drop RT_CONN_FLAGS(): use ip_sock_rt_tos() and ip_sock_rt_scope() instead, so that we don't have to rely on ip_rt_fix_tos() to adjust the scope anymore. * Apply IPTOS_RT_MASK to the tos, so that we don't need ip_rt_fix_tos() to do it for us. * Drop the ip_rt_fix_tos() calls that now become useless. The only remaining ip_rt_fix_tos() caller is ip_route_output_key_hash() which needs it as long as external callers still use the RTO_ONLINK flag. Note: This patch also drops some useless RT_TOS() calls as IPTOS_RT_MASK is a stronger mask. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
16a28267 |
|
20-Apr-2022 |
Guillaume Nault <gnault@redhat.com> |
ipv4: Don't reset ->flowi4_scope in ip_rt_fix_tos(). All callers already initialise ->flowi4_scope with RT_SCOPE_UNIVERSE, either by manual field assignment, memset(0) of the whole structure or implicit structure initialisation of on-stack variables (RT_SCOPE_UNIVERSE actually equals 0). Therefore, we don't need to always initialise ->flowi4_scope in ip_rt_fix_tos(). We only need to reduce the scope to RT_SCOPE_LINK when the special RTO_ONLINK flag is present in the tos. This will allow some code simplification, like removing ip_rt_fix_tos(). Also, the long term idea is to remove RTO_ONLINK entirely by properly initialising ->flowi4_scope, instead of overloading ->flowi4_tos with a special flag. Eventually, this will allow to convert ->flowi4_tos to dscp_t. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c4eb6641 |
|
13-Apr-2022 |
Menglong Dong <imagedong@tencent.com> |
net: ipv4: add skb drop reasons to ip_error() Eventually, I find out the handler function for inputting route lookup fail: ip_error(). The drop reasons we used in ip_error() are almost corresponding to IPSTATS_MIB_*, and following new reasons are introduced: SKB_DROP_REASON_IP_INADDRERRORS SKB_DROP_REASON_IP_INNOROUTES Isn't the name SKB_DROP_REASON_IP_HOSTUNREACH and SKB_DROP_REASON_IP_NETUNREACH more accurate? To make them corresponding to IPSTATS_MIB_*, we keep their name still. Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
888ade8f |
|
08-Apr-2022 |
Guillaume Nault <gnault@redhat.com> |
ipv4: Use dscp_t in struct fib_rt_info Use the new dscp_t type to replace the tos field of struct fib_rt_info. This ensures ECN bits are ignored and makes it compatible with the fa_dscp field of struct fib_alias. This also allows sparse to flag potential incorrect uses of DSCP and ECN bits. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
e6175a2e |
|
13-May-2022 |
Eyal Birger <eyal.birger@gmail.com> |
xfrm: fix "disable_policy" flag use when arriving from different devices In IPv4 setting the "disable_policy" flag on a device means no policy should be enforced for traffic originating from the device. This was implemented by seting the DST_NOPOLICY flag in the dst based on the originating device. However, dsts are cached in nexthops regardless of the originating devices, in which case, the DST_NOPOLICY flag value may be incorrect. Consider the following setup: +------------------------------+ | ROUTER | +-------------+ | +-----------------+ | | ipsec src |----|-|ipsec0 | | +-------------+ | |disable_policy=0 | +----+ | | +-----------------+ |eth1|-|----- +-------------+ | +-----------------+ +----+ | | noipsec src |----|-|eth0 | | +-------------+ | |disable_policy=1 | | | +-----------------+ | +------------------------------+ Where ROUTER has a default route towards eth1. dst entries for traffic arriving from eth0 would have DST_NOPOLICY and would be cached and therefore can be reused by traffic originating from ipsec0, skipping policy check. Fix by setting a IPSKB_NOPOLICY flag in IPCB and observing it instead of the DST in IN/FWD IPv4 policy checks. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
#
9e6c6d17 |
|
04-May-2022 |
Lokesh Dhoundiyal <lokesh.dhoundiyal@alliedtelesis.co.nz> |
ipv4: drop dst in multicast routing path kmemleak reports the following when routing multicast traffic over an ipsec tunnel. Kmemleak output: unreferenced object 0x8000000044bebb00 (size 256): comm "softirq", pid 0, jiffies 4294985356 (age 126.810s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 80 00 00 00 05 13 74 80 ..............t. 80 00 00 00 04 9b bf f9 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000f83947e0>] __kmalloc+0x1e8/0x300 [<00000000b7ed8dca>] metadata_dst_alloc+0x24/0x58 [<0000000081d32c20>] __ipgre_rcv+0x100/0x2b8 [<00000000824f6cf1>] gre_rcv+0x178/0x540 [<00000000ccd4e162>] gre_rcv+0x7c/0xd8 [<00000000c024b148>] ip_protocol_deliver_rcu+0x124/0x350 [<000000006a483377>] ip_local_deliver_finish+0x54/0x68 [<00000000d9271b3a>] ip_local_deliver+0x128/0x168 [<00000000bd4968ae>] xfrm_trans_reinject+0xb8/0xf8 [<0000000071672a19>] tasklet_action_common.isra.16+0xc4/0x1b0 [<0000000062e9c336>] __do_softirq+0x1fc/0x3e0 [<00000000013d7914>] irq_exit+0xc4/0xe0 [<00000000a4d73e90>] plat_irq_dispatch+0x7c/0x108 [<000000000751eb8e>] handle_int+0x16c/0x178 [<000000001668023b>] _raw_spin_unlock_irqrestore+0x1c/0x28 The metadata dst is leaked when ip_route_input_mc() updates the dst for the skb. Commit f38a9eb1f77b ("dst: Metadata destinations") correctly handled dropping the dst in ip_route_input_slow() but missed the multicast case which is handled by ip_route_input_mc(). Drop the dst in ip_route_input_mc() avoiding the leak. Fixes: f38a9eb1f77b ("dst: Metadata destinations") Signed-off-by: Lokesh Dhoundiyal <lokesh.dhoundiyal@alliedtelesis.co.nz> Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220505020017.3111846-1-chris.packham@alliedtelesis.co.nz Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
544b4dd5 |
|
17-Mar-2022 |
Guillaume Nault <gnault@redhat.com> |
ipv4: Fix route lookups when handling ICMP redirects and PMTU updates The PMTU update and ICMP redirect helper functions initialise their fl4 variable with either __build_flow_key() or build_sk_flow_key(). These initialisation functions always set ->flowi4_scope with RT_SCOPE_UNIVERSE and might set the ECN bits of ->flowi4_tos. This is not a problem when the route lookup is later done via ip_route_output_key_hash(), which properly clears the ECN bits from ->flowi4_tos and initialises ->flowi4_scope based on the RTO_ONLINK flag. However, some helpers call fib_lookup() directly, without sanitising the tos and scope fields, so the route lookup can fail and, as a result, the ICMP redirect or PMTU update aren't taken into account. Fix this by extracting the ->flowi4_tos and ->flowi4_scope sanitisation code into ip_rt_fix_tos(), then use this function in handlers that call fib_lookup() directly. Note 1: We can't sanitise ->flowi4_tos and ->flowi4_scope in a central place (like __build_flow_key() or flowi4_init_output()), because ip_route_output_key_hash() expects non-sanitised values. When called with sanitised values, it can erroneously overwrite RT_SCOPE_LINK with RT_SCOPE_UNIVERSE in ->flowi4_scope. Therefore we have to be careful to sanitise the values only for those paths that don't call ip_route_output_key_hash(). Note 2: The problem is mostly about sanitising ->flowi4_tos. Having ->flowi4_scope initialised with RT_SCOPE_UNIVERSE instead of RT_SCOPE_LINK probably wasn't really a problem: sockets with the SOCK_LOCALROUTE flag set (those that'd result in RTO_ONLINK being set) normally shouldn't receive ICMP redirects or PMTU updates. Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions.") Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
40867d74 |
|
14-Mar-2022 |
David Ahern <dsahern@kernel.org> |
net: Add l3mdev index to flow struct and avoid oif reset for port devices The fundamental premise of VRF and l3mdev core code is binding a socket to a device (l3mdev or netdev with an L3 domain) to indicate L3 scope. Legacy code resets flowi_oif to the l3mdev losing any original port device binding. Ben (among others) has demonstrated use cases where the original port device binding is important and needs to be retained. This patch handles that by adding a new entry to the common flow struct that can indicate the l3mdev index for later rule and table matching avoiding the need to reset flowi_oif. In addition to allowing more use cases that require port device binds, this patch brings a few datapath simplications: 1. l3mdev_fib_rule_match is only called when walking fib rules and always after l3mdev_update_flow. That allows an optimization to bail early for non-VRF type uses cases when flowi_l3mdev is not set. Also, only that index needs to be checked for the FIB table id. 2. l3mdev_update_flow can be called with flowi_oif set to a l3mdev (e.g., VRF) device. By resetting flowi_oif only for this case the FLOWI_FLAG_SKIP_NH_OIF flag is not longer needed and can be removed, removing several checks in the datapath. The flowi_iif path can be simplified to only be called if the it is not loopback (loopback can not be assigned to an L3 domain) and the l3mdev index is not already set. 3. Avoid another device lookup in the output path when the fib lookup returns a reject failure. Note: 2 functional tests for local traffic with reject fib rules are updated to reflect the new direct failure at FIB lookup time for ping rather than the failure on packet path. The current code fails like this: HINT: Fails since address on vrf device is out of device scope COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1 ping: Warning: source address might be selected on device other than: eth1 PING 172.16.3.1 (172.16.3.1) from 172.16.3.1 eth1: 56(84) bytes of data. --- 172.16.3.1 ping statistics --- 1 packets transmitted, 0 received, 100% packet loss, time 0ms where the test now directly fails: HINT: Fails since address on vrf device is out of device scope COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1 ping: connect: No route to host Signed-off-by: David Ahern <dsahern@kernel.org> Tested-by: Ben Greear <greearb@candelatech.com> Link: https://lore.kernel.org/r/20220314204551.16369-1-dsahern@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
29e5375d |
|
10-Feb-2022 |
Eric Dumazet <edumazet@google.com> |
ipv4: add (struct uncached_list)->quarantine list This is an optimization to keep the per-cpu lists as short as possible: Whenever rt_flush_dev() changes one rtable dst.dev matching the disappearing device, it can can transfer the object to a quarantine list, waiting for a final rt_del_uncached_list(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
32ccf110 |
|
04-Feb-2022 |
Guillaume Nault <gnault@redhat.com> |
ipv4: Use dscp_t in struct fib_alias Use the new dscp_t type to replace the fa_tos field of fib_alias. This ensures ECN bits are ignored and makes the field compatible with the fc_dscp field of struct fib_config. Converting old *tos variables and fields to dscp_t allows sparse to flag incorrect uses of DSCP and ECN bits. This patch is entirely about type annotation and shouldn't change any existing behaviour. Signed-off-by: Guillaume Nault <gnault@redhat.com> Acked-by: David Ahern <dsahern@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
47ed9442 |
|
28-Jan-2022 |
David Ahern <dsahern@kernel.org> |
ipv4: Make ip_idents_reserve static ip_idents_reserve is only used in net/ipv4/route.c. Make it static and remove the export. Signed-off-by: David Ahern <dsahern@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2e9589ff |
|
26-Jan-2022 |
xu xin <xu.xin16@zte.com.cn> |
ipv4: Namespaceify min_adv_mss sysctl knob Different netns has different requirement on the setting of min_adv_mss sysctl which the advertised MSS will be never lower than. Enable min_adv_mss to be configured per network namespace. Signed-off-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9fcf986c |
|
16-Feb-2022 |
Eric Dumazet <edumazet@google.com> |
ipv4: fix data races in fib_alias_hw_flags_set fib_alias_hw_flags_set() can be used by concurrent threads, and is only RCU protected. We need to annotate accesses to following fields of struct fib_alias: offload, trap, offload_failed Because of READ_ONCE()WRITE_ONCE() limitations, make these field u8. BUG: KCSAN: data-race in fib_alias_hw_flags_set / fib_alias_hw_flags_set read to 0xffff888134224a6a of 1 bytes by task 2013 on cpu 1: fib_alias_hw_flags_set+0x28a/0x470 net/ipv4/fib_trie.c:1050 nsim_fib4_rt_hw_flags_set drivers/net/netdevsim/fib.c:350 [inline] nsim_fib4_rt_add drivers/net/netdevsim/fib.c:367 [inline] nsim_fib4_rt_insert drivers/net/netdevsim/fib.c:429 [inline] nsim_fib4_event drivers/net/netdevsim/fib.c:461 [inline] nsim_fib_event drivers/net/netdevsim/fib.c:881 [inline] nsim_fib_event_work+0x1852/0x2cf0 drivers/net/netdevsim/fib.c:1477 process_one_work+0x3f6/0x960 kernel/workqueue.c:2307 process_scheduled_works kernel/workqueue.c:2370 [inline] worker_thread+0x7df/0xa70 kernel/workqueue.c:2456 kthread+0x1bf/0x1e0 kernel/kthread.c:377 ret_from_fork+0x1f/0x30 write to 0xffff888134224a6a of 1 bytes by task 4872 on cpu 0: fib_alias_hw_flags_set+0x2d5/0x470 net/ipv4/fib_trie.c:1054 nsim_fib4_rt_hw_flags_set drivers/net/netdevsim/fib.c:350 [inline] nsim_fib4_rt_add drivers/net/netdevsim/fib.c:367 [inline] nsim_fib4_rt_insert drivers/net/netdevsim/fib.c:429 [inline] nsim_fib4_event drivers/net/netdevsim/fib.c:461 [inline] nsim_fib_event drivers/net/netdevsim/fib.c:881 [inline] nsim_fib_event_work+0x1852/0x2cf0 drivers/net/netdevsim/fib.c:1477 process_one_work+0x3f6/0x960 kernel/workqueue.c:2307 process_scheduled_works kernel/workqueue.c:2370 [inline] worker_thread+0x7df/0xa70 kernel/workqueue.c:2456 kthread+0x1bf/0x1e0 kernel/kthread.c:377 ret_from_fork+0x1f/0x30 value changed: 0x00 -> 0x02 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 4872 Comm: kworker/0:0 Not tainted 5.17.0-rc3-syzkaller-00188-g1d41d2e82623-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: events nsim_fib_event_work Fixes: 90b93f1b31f8 ("ipv4: Add "offload" and "trap" indications to routes") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/20220216173217.3792411-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
1135fad2 |
|
04-Jan-2022 |
xu xin <xu.xin16@zte.com.cn> |
Namespaceify mtu_expires sysctl This patch enables the sysctl mtu_expires to be configured per net namespace. Signed-off-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1de6b15a |
|
04-Jan-2022 |
xu xin <xu.xin16@zte.com.cn> |
Namespaceify min_pmtu sysctl This patch enables the sysctl min_pmtu to be configured per net namespace. Signed-off-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9038c320 |
|
04-Dec-2021 |
Eric Dumazet <edumazet@google.com> |
net: dst: add net device refcount tracking to dst_entry We want to track all dev_hold()/dev_put() to ease leak hunting. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
49ecc2e9 |
|
15-Nov-2021 |
Eric Dumazet <edumazet@google.com> |
net: align static siphash keys siphash keys use 16 bytes. Define siphash_aligned_key_t macro so that we can make sure they are not crossing a cache line boundary. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
ffa66f15 |
|
20-Sep-2021 |
Mianhan Liu <liumh1@shanghaitech.edu.cn> |
net/ipv4/route.c: remove superfluous header files from route.c route.c hasn't use any macro or function declared in uaccess.h, types.h, string.h, sockios.h, times.h, protocol.h, arp.h and l3mdev.h. Thus, these files can be removed from route.c safely without affecting the compilation of the net module. Signed-off-by: Mianhan Liu <liumh1@shanghaitech.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
92548b0e |
|
30-Aug-2021 |
Eric Dumazet <edumazet@google.com> |
ipv4: fix endianness issue in inet_rtm_getroute_build_skb() The UDP length field should be in network order. This removes the following sparse error: net/ipv4/route.c:3173:27: warning: incorrect type in assignment (different base types) net/ipv4/route.c:3173:27: expected restricted __be16 [usertype] len net/ipv4/route.c:3173:27: got unsigned long Fixes: 404eb77ea766 ("ipv4: support sport, dport and ip_proto in RTM_GETROUTE") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Roopa Prabhu <roopa@nvidia.com> Cc: David Ahern <dsahern@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
67d6d681 |
|
29-Aug-2021 |
Eric Dumazet <edumazet@google.com> |
ipv4: make exception cache less predictible Even after commit 6457378fe796 ("ipv4: use siphash instead of Jenkins in fnhe_hashfun()"), an attacker can still use brute force to learn some secrets from a victim linux host. One way to defeat these attacks is to make the max depth of the hash table bucket a random value. Before this patch, each bucket of the hash table used to store exceptions could contain 6 items under attack. After the patch, each bucket would contains a random number of items, between 6 and 10. The attacker can no longer infer secrets. This is slightly increasing memory size used by the hash table, by 50% in average, we do not expect this to be a problem. This patch is more complex than the prior one (IPv6 equivalent), because IPv4 was reusing the oldest entry. Since we need to be able to evict more than one entry per update_or_create_fnhe() call, I had to replace fnhe_oldest() with fnhe_remove_oldest(). Also note that we will queue extra kfree_rcu() calls under stress, which hopefully wont be a too big issue. Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Keyu Man <kman001@ucr.edu> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: David Ahern <dsahern@kernel.org> Tested-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1160dfa1 |
|
05-Aug-2021 |
Yajun Deng <yajun.deng@linux.dev> |
net: Remove redundant if statements The 'if (dev)' statement already move into dev_{put , hold}, so remove redundant if statements. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0547ffe6 |
|
02-Aug-2021 |
Yajun Deng <yajun.deng@linux.dev> |
net: Keep vertical alignment Those files under /proc/net/stat/ don't have vertical alignment, it looks very difficult. Modify the seq_printf statement, keep vertical alignment. v2: - Use seq_puts() and seq_printf() correctly. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ac6627a2 |
|
20-Jul-2021 |
Vadim Fedorenko <vfedorenko@novek.ru> |
net: ipv4: Consolidate ipv4_mtu and ip_dst_mtu_maybe_forward Consolidate IPv4 MTU code the same way it is done in IPv6 to have code aligned in both address families Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6457378f |
|
25-Aug-2021 |
Eric Dumazet <edumazet@google.com> |
ipv4: use siphash instead of Jenkins in fnhe_hashfun() A group of security researchers brought to our attention the weakness of hash function used in fnhe_hashfun(). Lets use siphash instead of Jenkins Hash, to considerably reduce security risks. Also remove the inline keyword, this really is distracting. Fixes: d546c621542d ("ipv4: harden fnhe_hashfun()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Keyu Man <kman001@ucr.edu> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fade5641 |
|
25-Jun-2021 |
Vadim Fedorenko <vfedorenko@novek.ru> |
net: lwtunnel: handle MTU calculation in forwading Commit 14972cbd34ff ("net: lwtunnel: Handle fragmentation") moved fragmentation logic away from lwtunnel by carry encap headroom and use it in output MTU calculation. But the forwarding part was not covered and created difference in MTU for output and forwarding and further to silent drops on ipv4 forwarding path. Fix it by taking into account lwtunnel encap headroom. The same commit also introduced difference in how to treat RTAX_MTU in IPv4 and IPv6 where latter explicitly removes lwtunnel encap headroom from route MTU. Make IPv4 version do the same. Fixes: 14972cbd34ff ("net: lwtunnel: Handle fragmentation") Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4253b498 |
|
17-May-2021 |
Ido Schimmel <idosch@OSS.NVIDIA.COM> |
ipv4: Add custom multipath hash policy Add a new multipath hash policy where the packet fields used for hash calculation are determined by user space via the fib_multipath_hash_fields sysctl that was introduced in the previous patch. The current set of available packet fields includes both outer and inner fields, which requires two invocations of the flow dissector. Avoid unnecessary dissection of the outer or inner flows by skipping dissection if none of the outer or inner fields are required. In accordance with the existing policies, when an skb is not available, packet fields are extracted from the provided flow key. In which case, only outer fields are considered. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2e68ea92 |
|
17-May-2021 |
Ido Schimmel <idosch@OSS.NVIDIA.COM> |
ipv4: Calculate multipath hash inside switch statement A subsequent patch will add another multipath hash policy where the multipath hash is calculated directly by the policy specific code and not outside of the switch statement. Prepare for this change by moving the multipath hash calculation inside the switch statement. No functional changes intended. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b87b04f5 |
|
12-Jun-2021 |
David Ahern <dsahern@kernel.org> |
ipv4: Fix device used for dst_alloc with local routes Oliver reported a use case where deleting a VRF device can hang waiting for the refcnt to drop to 0. The root cause is that the dst is allocated against the VRF device but cached on the loopback device. The use case (added to the selftests) has an implicit VRF crossing due to the ordering of the FIB rules (lookup local is before the l3mdev rule, but the problem occurs even if the FIB rules are re-ordered with local after l3mdev because the VRF table does not have a default route to terminate the lookup). The end result is is that the FIB lookup returns the loopback device as the nexthop, but the ingress device is in a VRF. The mismatch causes the dst alloc against the VRF device but then cached on the loopback. The fix is to bring the trick used for IPv6 (see ip6_rt_get_dev_rcu): pick the dst alloc device based the fib lookup result but with checks that the result has a nexthop device (e.g., not an unreachable or prohibit entry). Fixes: f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant") Reported-by: Oliver Herms <oliver.peter.herms@gmail.com> Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
aa6dd211 |
|
24-Mar-2021 |
Eric Dumazet <edumazet@google.com> |
inet: use bigger hash table for IP ID generation In commit 73f156a6e8c1 ("inetpeer: get rid of ip_id_count") I used a very small hash table that could be abused by patient attackers to reveal sensitive information. Switch to a dynamic sizing, depending on RAM size. Typical big hosts will now use 128x more storage (2 MB) to get a similar increase in security and reduction of hash collisions. As a bonus, use of alloc_large_system_hash() spreads allocated memory among all NUMA nodes. Fixes: 73f156a6e8c1 ("inetpeer: get rid of ip_id_count") Reported-by: Amit Klein <aksecurity@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f105f26e |
|
15-Mar-2021 |
Yejune Deng <yejune.deng@gmail.com> |
net: ipv4: route.c: simplify procfs code proc_creat_seq() that directly take a struct seq_operations, and deal with network namespaces in ->open. Signed-off-by: Yejune Deng <yejune.deng@gmail.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6ad08600 |
|
12-Mar-2021 |
Shubhankar Kuranagatti <shubhankarvk@gmail.com> |
net: ipv4: route.c: Fix indentation of multi line comment. All comment lines inside the comment block have been aligned. Every line of comment starts with a * (uniformity in code). Signed-off-by: Shubhankar Kuranagatti <shubhankarvk@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6b9c8f46 |
|
10-Mar-2021 |
Shubhankar Kuranagatti <shubhankarvk@gmail.com> |
net: ipv4: route.c: fix space before tab The extra space before tab space has been removed. Signed-off-by: Shubhankar Kuranagatti <shubhankarvk@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c4c877b2 |
|
09-Mar-2021 |
Daniel Borkmann <daniel@iogearbox.net> |
net: Consolidate common blackhole dst ops Move generic blackhole dst ops to the core and use them from both ipv4_dst_blackhole_ops and ip6_dst_blackhole_ops where possible. No functional change otherwise. We need these also in other locations and having to define them over and over again is not great. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
36c5100e |
|
07-Feb-2021 |
Amit Cohen <amcohen@nvidia.com> |
IPv4: Add "offload failed" indication to routes After installing a route to the kernel, user space receives an acknowledgment, which means the route was installed in the kernel, but not necessarily in hardware. The asynchronous nature of route installation in hardware can lead to a routing daemon advertising a route before it was actually installed in hardware. This can result in packet loss or mis-routed packets until the route is installed in hardware. To avoid such cases, previous patch set added the ability to emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags are changed, this behavior is controlled by sysctl. With the above mentioned behavior, it is possible to know from user-space if the route was offloaded, but if the offload fails there is no indication to user-space. Following a failure, a routing daemon will wait indefinitely for a notification that will never come. This patch adds an "offload_failed" indication to IPv4 routes, so that users will have better visibility into the offload process. 'struct fib_alias', and 'struct fib_rt_info' are extended with new field that indicates if route offload failed. Note that the new field is added using unused bit and therefore there is no need to increase structs size. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9c97921a |
|
04-Feb-2021 |
Brian Vazquez <brianvv@google.com> |
net: fix building errors on powerpc when CONFIG_RETPOLINE is not set This commit fixes the errores reported when building for powerpc: ERROR: modpost: "ip6_dst_check" [vmlinux] is a static EXPORT_SYMBOL ERROR: modpost: "ipv4_dst_check" [vmlinux] is a static EXPORT_SYMBOL ERROR: modpost: "ipv4_mtu" [vmlinux] is a static EXPORT_SYMBOL ERROR: modpost: "ip6_mtu" [vmlinux] is a static EXPORT_SYMBOL Fixes: f67fbeaebdc0 ("net: use indirect call helpers for dst_mtu") Fixes: bbd807dfbf20 ("net: indirect call helpers for ipv4/ipv6 dst_check functions") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Brian Vazquez <brianvv@google.com> Link: https://lore.kernel.org/r/20210204181839.558951-2-brianvv@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
bbd807df |
|
01-Feb-2021 |
Brian Vazquez <brianvv@google.com> |
net: indirect call helpers for ipv4/ipv6 dst_check functions This patch avoids the indirect call for the common case: ip6_dst_check and ipv4_dst_check Signed-off-by: Brian Vazquez <brianvv@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
f67fbeae |
|
01-Feb-2021 |
Brian Vazquez <brianvv@google.com> |
net: use indirect call helpers for dst_mtu This patch avoids the indirect call for the common case: ip6_mtu and ipv4_mtu Signed-off-by: Brian Vazquez <brianvv@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
1ebf1790 |
|
26-Nov-2020 |
Guillaume Nault <gnault@redhat.com> |
ipv4: Fix tos mask in inet_rtm_getroute() When inet_rtm_getroute() was converted to use the RCU variants of ip_route_input() and ip_route_output_key(), the TOS parameters stopped being masked with IPTOS_RT_MASK before doing the route lookup. As a result, "ip route get" can return a different route than what would be used when sending real packets. For example: $ ip route add 192.0.2.11/32 dev eth0 $ ip route add unreachable 192.0.2.11/32 tos 2 $ ip route get 192.0.2.11 tos 2 RTNETLINK answers: No route to host But, packets with TOS 2 (ECT(0) if interpreted as an ECN bit) would actually be routed using the first route: $ ping -c 1 -Q 2 192.0.2.11 PING 192.0.2.11 (192.0.2.11) 56(84) bytes of data. 64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.173 ms --- 192.0.2.11 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.173/0.173/0.173/0.000 ms This patch re-applies IPTOS_RT_MASK in inet_rtm_getroute(), to return results consistent with real route lookups. Fixes: 3765d35ed8b9 ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup") Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/b2d237d08317ca55926add9654a48409ac1b8f5b.1606412894.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
ae8cb932 |
|
13-Nov-2020 |
Oliver Herms <oliver.peter.herms@gmail.com> |
IPv4: RTM_GETROUTE: Add RTA_ENCAP to result This patch adds an IPv4 routes encapsulation attribute to the result of netlink RTM_GETROUTE requests (e.g. ip route get 192.0.2.1). Signed-off-by: Oliver Herms <oliver.peter.herms@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20201113085517.GA1307262@tws Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
62679a8d |
|
07-Nov-2020 |
Vincent Bernat <vincent@bernat.ch> |
net: evaluate net.ipvX.conf.all.disable_policy and disable_xfrm The disable_policy and disable_xfrm are a per-interface sysctl to disable IPsec policy or encryption on an interface. However, while a "all" variant is exposed, it was a noop since it was never evaluated. We use the usual "or" logic for this kind of sysctls. Signed-off-by: Vincent Bernat <vincent@bernat.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
874fb9e2 |
|
09-Oct-2020 |
David Ahern <dsahern@kernel.org> |
ipv4: Restore flowi4_oif update before call to xfrm_lookup_route Tobias reported regressions in IPsec tests following the patch referenced by the Fixes tag below. The root cause is dropping the reset of the flowi4_oif after the fib_lookup. Apparently it is needed for xfrm cases, so restore the oif update to ip_route_output_flow right before the call to xfrm_lookup_route. Fixes: 2fbc6e89b2f1 ("ipv4: Update exception handling for multipath routes via same device") Reported-by: Tobias Brunner <tobias@strongswan.org> Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
2fbc6e89 |
|
14-Sep-2020 |
David Ahern <dsahern@kernel.org> |
ipv4: Update exception handling for multipath routes via same device Kfir reported that pmtu exceptions are not created properly for deployments where multipath routes use the same device. After some digging I see 2 compounding problems: 1. ip_route_output_key_hash_rcu is updating the flowi4_oif *after* the route lookup. This is the second use case where this has been a problem (the first is related to use of vti devices with VRF). I can not find any reason for the oif to be changed after the lookup; the code goes back to the start of git. It does not seem logical so remove it. 2. fib_lookups for exceptions do not call fib_select_path to handle multipath route selection based on the hash. The end result is that the fib_lookup used to add the exception always creates it based using the first leg of the route. An example topology showing the problem: | host1 +------+ | eth0 | .209 +------+ | +------+ switch | br0 | +------+ | +---------+---------+ | host2 | host3 +------+ +------+ | eth0 | .250 | eth0 | 192.168.252.252 +------+ +------+ +-----+ +-----+ | vti | .2 | vti | 192.168.247.3 +-----+ +-----+ \ / ================================= tunnels 192.168.247.1/24 for h in host1 host2 host3; do ip netns add ${h} ip -netns ${h} link set lo up ip netns exec ${h} sysctl -wq net.ipv4.ip_forward=1 done ip netns add switch ip -netns switch li set lo up ip -netns switch link add br0 type bridge stp 0 ip -netns switch link set br0 up for n in 1 2 3; do ip -netns switch link add eth-sw type veth peer name eth-h${n} ip -netns switch li set eth-h${n} master br0 up ip -netns switch li set eth-sw netns host${n} name eth0 done ip -netns host1 addr add 192.168.252.209/24 dev eth0 ip -netns host1 link set dev eth0 up ip -netns host1 route add 192.168.247.0/24 \ nexthop via 192.168.252.250 dev eth0 nexthop via 192.168.252.252 dev eth0 ip -netns host2 addr add 192.168.252.250/24 dev eth0 ip -netns host2 link set dev eth0 up ip -netns host2 addr add 192.168.252.252/24 dev eth0 ip -netns host3 link set dev eth0 up ip netns add tunnel ip -netns tunnel li set lo up ip -netns tunnel li add br0 type bridge ip -netns tunnel li set br0 up for n in $(seq 11 20); do ip -netns tunnel addr add dev br0 192.168.247.${n}/24 done for n in 2 3 do ip -netns tunnel link add vti${n} type veth peer name eth${n} ip -netns tunnel link set eth${n} mtu 1360 master br0 up ip -netns tunnel link set vti${n} netns host${n} mtu 1360 up ip -netns host${n} addr add dev vti${n} 192.168.247.${n}/24 done ip -netns tunnel ro add default nexthop via 192.168.247.2 nexthop via 192.168.247.3 ip netns exec host1 ping -M do -s 1400 -c3 -I 192.168.252.209 192.168.247.11 ip netns exec host1 ping -M do -s 1400 -c3 -I 192.168.252.209 192.168.247.15 ip -netns host1 ro ls cache Before this patch the cache always shows exceptions against the first leg in the multipath route; 192.168.252.250 per this example. Since the hash has an initial random seed, you may need to vary the final octet more than what is listed. In my tests, using addresses between 11 and 19 usually found 1 that used both legs. With this patch, the cache will have exceptions for both legs. Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions") Reported-by: Kfir Itzhak <mastertheknife@gmail.com> Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1869e226 |
|
13-Sep-2020 |
David Ahern <dsahern@gmail.com> |
ipv4: Initialize flowi4_multipath_hash in data path flowi4_multipath_hash was added by the commit referenced below for tunnels. Unfortunately, the patch did not initialize the new field for several fast path lookups that do not initialize the entire flow struct to 0. Fix those locations. Currently, flowi4_multipath_hash is random garbage and affects the hash value computed by fib_multipath_hash for multipath selection. Fixes: 24ba14406c5c ("route: Add multipath_hash in flowi_common to make user-define hash") Signed-off-by: David Ahern <dsahern@gmail.com> Cc: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5af68891 |
|
29-Aug-2020 |
Miaohe Lin <linmiaohe@huawei.com> |
net: clean up codestyle This is a pure codestyle cleanup patch. No functional change intended. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
343d8c60 |
|
25-Aug-2020 |
Miaohe Lin <linmiaohe@huawei.com> |
net: clean up codestyle for net/ipv4 This is a pure codestyle cleanup patch. Also add a blank line after declarations as warned by checkpatch.pl. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8b4510d7 |
|
24-Aug-2020 |
Miaohe Lin <linmiaohe@huawei.com> |
net: gain ipv4 mtu when mtu is not locked When mtu is locked, we should not obtain ipv4 mtu as we return immediately in this case and leave acquired ipv4 mtu unused. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
df23bb18 |
|
03-Aug-2020 |
Stefano Brivio <sbrivio@redhat.com> |
ipv4: route: Ignore output interface in FIB lookup for PMTU route Currently, processes sending traffic to a local bridge with an encapsulation device as a port don't get ICMP errors if they exceed the PMTU of the encapsulated link. David Ahern suggested this as a hack, but it actually looks like the correct solution: when we update the PMTU for a given destination by means of updating or creating a route exception, the encapsulation might trigger this because of PMTU discovery happening either on the encapsulation device itself, or its lower layer. This happens on bridged encapsulations only. The output interface shouldn't matter, because we already have a valid destination. Drop the output interface restriction from the associated route lookup. For UDP tunnels, we will now have a route exception created for the encapsulation itself, with a MTU value reflecting its headroom, which allows a bridge forwarding IP packets originated locally to deliver errors back to the sending socket. The behaviour is now consistent with IPv6 and verified with selftests pmtu_ipv{4,6}_br_{geneve,vxlan}{4,6}_exception introduced later in this series. v2: - reset output interface only for bridge ports (David Ahern) - add and use netif_is_any_bridge_port() helper (David Ahern) Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2ce578ca |
|
27-Jun-2020 |
Miaohe Lin <linmiaohe@huawei.com> |
net: ipv4: Fix wrong type conversion from hint to rt in ip_route_use_hint() We can't cast sk_buff to rtable by (struct rtable *)hint. Use skb_rtable(). Fixes: 02b24941619f ("ipv4: use dst hint for ipv4 list receive") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a6211caa |
|
15-May-2020 |
Yuqi Jin <jinyuqi@huawei.com> |
net: revert "net: get rid of an signed integer overflow in ip_idents_reserve()" Commit adb03115f459 ("net: get rid of an signed integer overflow in ip_idents_reserve()") used atomic_cmpxchg to replace "atomic_add_return" inside the function "ip_idents_reserve". The reason was to avoid UBSAN warning. However, this change has caused performance degrade and in GCC-8, fno-strict-overflow is now mapped to -fwrapv -fwrapv-pointer and signed integer overflow is now undefined by default at all optimization levels[1]. Moreover, it was a bug in UBSAN vs -fwrapv /-fno-strict-overflow, so Let's revert it safely. [1] https://gcc.gnu.org/gcc-8/changes.html Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Eric Dumazet <edumazet@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Jiri Pirko <jiri@resnulli.us> Cc: Arvind Sankar <nivedita@alum.mit.edu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Jiong Wang <jiongwang@huawei.com> Signed-off-by: Yuqi Jin <jinyuqi@huawei.com> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
57644431 |
|
08-May-2020 |
Paolo Abeni <pabeni@redhat.com> |
net: ipv4: really enforce backoff for redirects In commit b406472b5ad7 ("net: ipv4: avoid mixed n_redirects and rate_tokens usage") I missed the fact that a 0 'rate_tokens' will bypass the backoff algorithm. Since rate_tokens is cleared after a redirect silence, and never incremented on redirects, if the host keeps receiving packets requiring redirect it will reply ignoring the backoff. Additionally, the 'rate_last' field will be updated with the cadence of the ingress packet requiring redirect. If that rate is high enough, that will prevent the host from generating any other kind of ICMP messages The check for a zero 'rate_tokens' value was likely a shortcut to avoid the more complex backoff algorithm after a redirect silence period. Address the issue checking for 'n_redirects' instead, which is incremented on successful redirect, and does not interfere with other ICMP replies. Fixes: b406472b5ad7 ("net: ipv4: avoid mixed n_redirects and rate_tokens usage") Reported-and-tested-by: Colin Walters <walters@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
32927393 |
|
24-Apr-2020 |
Christoph Hellwig <hch@lst.de> |
sysctl: pass kernel pointers to ->proc_handler Instead of having all the sysctl handlers deal with user pointers, which is rather hairy in terms of the BPF interaction, copy the input to and from userspace in common code. This also means that the strings are always NUL-terminated by the common code, making the API a little bit safer. As most handler just pass through the data to one of the common handlers a lot of the changes are mechnical. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
af13b3c3 |
|
23-Mar-2020 |
David Laight <David.Laight@ACULAB.COM> |
Remove DST_HOST Previous changes to the IP routing code have removed all the tests for the DS_HOST route flag. Remove the flags and all the code that sets it. Signed-off-by: David Laight <david.laight@aculab.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
571912c6 |
|
23-Feb-2020 |
Martin Varghese <martin.varghese@nokia.com> |
net: UDP tunnel encapsulation module for tunnelling different protocols like MPLS, IP, NSH etc. The Bareudp tunnel module provides a generic L3 encapsulation tunnelling module for tunnelling different protocols like MPLS, IP,NSH etc inside a UDP tunnel. Signed-off-by: Martin Varghese <martin.varghese@nokia.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
97a32539 |
|
03-Feb-2020 |
Alexey Dobriyan <adobriyan@gmail.com> |
proc: convert everything to "struct proc_ops" The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in seq_file.h. Conversion rule is: llseek => proc_lseek unlocked_ioctl => proc_ioctl xxx => proc_xxx delete ".owner = THIS_MODULE" line [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c] [sfr@canb.auug.org.au: fix kernel/sched/psi.c] Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a3ea8673 |
|
23-Jan-2020 |
Vasily Averin <vvs@virtuozzo.com> |
rt_cpu_seq_next should increase position index if seq_file .next fuction does not change position index, read after some lseek can generate unexpected output. https://bugzilla.kernel.org/show_bug.cgi?id=206283 Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
90b93f1b |
|
14-Jan-2020 |
Ido Schimmel <idosch@mellanox.com> |
ipv4: Add "offload" and "trap" indications to routes When performing L3 offload, routes and nexthops are usually programmed into two different tables in the underlying device. Therefore, the fact that a nexthop resides in hardware does not necessarily mean that all the associated routes also reside in hardware and vice-versa. While the kernel can signal to user space the presence of a nexthop in hardware (via 'RTNH_F_OFFLOAD'), it does not have a corresponding flag for routes. In addition, the fact that a route resides in hardware does not necessarily mean that the traffic is offloaded. For example, unreachable routes (i.e., 'RTN_UNREACHABLE') are programmed to trap packets to the CPU so that the kernel will be able to generate the appropriate ICMP error packet. This patch adds an "offload" and "trap" indications to IPv4 routes, so that users will have better visibility into the offload process. 'struct fib_alias' is extended with two new fields that indicate if the route resides in hardware or not and if it is offloading traffic from the kernel or trapping packets to it. Note that the new fields are added in the 6 bytes hole and therefore the struct still fits in a single cache line [1]. Capable drivers are expected to invoke fib_alias_hw_flags_set() with the route's key in order to set the flags. The indications are dumped to user space via a new flags (i.e., 'RTM_F_OFFLOAD' and 'RTM_F_TRAP') in the 'rtm_flags' field in the ancillary header. v2: * Make use of 'struct fib_rt_info' in fib_alias_hw_flags_set() [1] struct fib_alias { struct hlist_node fa_list; /* 0 16 */ struct fib_info * fa_info; /* 16 8 */ u8 fa_tos; /* 24 1 */ u8 fa_type; /* 25 1 */ u8 fa_state; /* 26 1 */ u8 fa_slen; /* 27 1 */ u32 tb_id; /* 28 4 */ s16 fa_default; /* 32 2 */ u8 offload:1; /* 34: 0 1 */ u8 trap:1; /* 34: 1 1 */ u8 unused:6; /* 34: 2 1 */ /* XXX 5 bytes hole, try to pack */ struct callback_head rcu __attribute__((__aligned__(8))); /* 40 16 */ /* size: 56, cachelines: 1, members: 12 */ /* sum members: 50, holes: 1, sum holes: 5 */ /* sum bitfield members: 8 bits (1 bytes) */ /* forced alignments: 1, forced holes: 1, sum forced holes: 5 */ /* last cacheline: 56 bytes */ } __attribute__((__aligned__(8))); Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1e301fd0 |
|
14-Jan-2020 |
Ido Schimmel <idosch@mellanox.com> |
ipv4: Encapsulate function arguments in a struct fib_dump_info() is used to prepare RTM_{NEW,DEL}ROUTE netlink messages using the passed arguments. Currently, the function takes 11 arguments, 6 of which are attributes of the route being dumped (e.g., prefix, TOS). The next patch will need the function to also dump to user space an indication if the route is present in hardware or not. Instead of passing yet another argument, change the function to take a struct containing the different route attributes. v2: * Name last argument of fib_dump_info() * Move 'struct fib_rt_info' to include/net/ip_fib.h so that it could later be passed to fib_alias_hw_flags_set() Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bd085ef6 |
|
21-Dec-2019 |
Hangbin Liu <liuhangbin@gmail.com> |
net: add bool confirm_neigh parameter for dst_ops.update_pmtu The MTU update code is supposed to be invoked in response to real networking events that update the PMTU. In IPv6 PMTU update function __ip6_rt_update_pmtu() we called dst_confirm_neigh() to update neighbor confirmed time. But for tunnel code, it will call pmtu before xmit, like: - tnl_update_pmtu() - skb_dst_update_pmtu() - ip6_rt_update_pmtu() - __ip6_rt_update_pmtu() - dst_confirm_neigh() If the tunnel remote dst mac address changed and we still do the neigh confirm, we will not be able to update neigh cache and ping6 remote will failed. So for this ip_tunnel_xmit() case, _EVEN_ if the MTU is changed, we should not be invoking dst_confirm_neigh() as we have no evidence of successful two-way communication at this point. On the other hand it is also important to keep the neigh reachability fresh for TCP flows, so we cannot remove this dst_confirm_neigh() call. To fix the issue, we have to add a new bool parameter for dst_ops.update_pmtu to choose whether we should do neigh update or not. I will add the parameter in this patch and set all the callers to true to comply with the previous way, and fix the tunnel code one by one on later patches. v5: No change. v4: No change. v3: Do not remove dst_confirm_neigh, but add a new bool parameter in dst_ops.update_pmtu to control whether we should do neighbor confirm. Also split the big patch to small ones for each area. v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu. Suggested-by: David Miller <davem@davemloft.net> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
02b24941 |
|
20-Nov-2019 |
Paolo Abeni <pabeni@redhat.com> |
ipv4: use dst hint for ipv4 list receive This is alike the previous change, with some additional ipv4 specific quirk. Even when using the route hint we still have to do perform additional per packet checks about source address validity: a new helper is added to wrap them. Hints are explicitly disabled if the destination is a local broadcast, that keeps the code simple and local broadcast are a slower path anyway. UDP flood performances vs recvmmsg() receiver: vanilla patched delta Kpps Kpps % 1683 1871 +11 In the worst case scenario - each packet has a different destination address - the performance delta is within noise range. v3 -> v4: - re-enable hints for forward v2 -> v3: - really fix build (sic) and hint usage check - use fib4_has_custom_rules() helpers (David A.) - add ip_extract_route_hint() helper (Edward C.) - use prev skb as hint instead of copying data (Willem) v1 -> v2: - fix build issue with !CONFIG_IP_MULTIPLE_TABLES Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
54074f1d |
|
01-Nov-2019 |
Matteo Croce <mcroce@redhat.com> |
icmp: remove duplicate code The same code which recognizes ICMP error packets is duplicated several times. Use the icmp_is_err() and icmpv6_is_err() helpers instead, which do the same thing. ip_multipath_l3_keys() and tcf_nat_act() didn't check for all the error types, assume that they should instead. Signed-off-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5018c596 |
|
16-Oct-2019 |
Wei Wang <weiwan@google.com> |
ipv4: fix race condition between route lookup and invalidation Jesse and Ido reported the following race condition: <CPU A, t0> - Received packet A is forwarded and cached dst entry is taken from the nexthop ('nhc->nhc_rth_input'). Calls skb_dst_set() <t1> - Given Jesse has busy routers ("ingesting full BGP routing tables from multiple ISPs"), route is added / deleted and rt_cache_flush() is called <CPU B, t2> - Received packet B tries to use the same cached dst entry from t0, but rt_cache_valid() is no longer true and it is replaced in rt_cache_route() by the newer one. This calls dst_dev_put() on the original dst entry which assigns the blackhole netdev to 'dst->dev' <CPU A, t3> - dst_input(skb) is called on packet A and it is dropped due to 'dst->dev' being the blackhole netdev There are 2 issues in the v4 routing code: 1. A per-netns counter is used to do the validation of the route. That means whenever a route is changed in the netns, users of all routes in the netns needs to redo lookup. v6 has an implementation of only updating fn_sernum for routes that are affected. 2. When rt_cache_valid() returns false, rt_cache_route() is called to throw away the current cache, and create a new one. This seems unnecessary because as long as this route does not change, the route cache does not need to be recreated. To fully solve the above 2 issues, it probably needs quite some code changes and requires careful testing, and does not suite for net branch. So this patch only tries to add the deleted cached rt into the uncached list, so user could still be able to use it to receive packets until it's done. Fixes: 95c47f9cf5e0 ("ipv4: call dst_dev_put() properly") Signed-off-by: Wei Wang <weiwan@google.com> Reported-by: Ido Schimmel <idosch@idosch.org> Reported-by: Jesse Hathaway <jesse@mbuki-mvuki.org> Tested-by: Jesse Hathaway <jesse@mbuki-mvuki.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Cc: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
595e0651 |
|
16-Oct-2019 |
Stefano Brivio <sbrivio@redhat.com> |
ipv4: Return -ENETUNREACH if we can't create route but saddr is valid ...instead of -EINVAL. An issue was found with older kernel versions while unplugging a NFS client with pending RPCs, and the wrong error code here prevented it from recovering once link is back up with a configured address. Incidentally, this is not an issue anymore since commit 4f8943f80883 ("SUNRPC: Replace direct task wakeups from softirq context"), included in 5.2-rc7, had the effect of decoupling the forwarding of this error by using SO_ERROR in xs_wake_error(), as pointed out by Benjamin Coddington. To the best of my knowledge, this isn't currently causing any further issue, but the error code doesn't look appropriate anyway, and we might hit this in other paths as well. In detail, as analysed by Gonzalo Siero, once the route is deleted because the interface is down, and can't be resolved and we return -EINVAL here, this ends up, courtesy of inet_sk_rebuild_header(), as the socket error seen by tcp_write_err(), called by tcp_retransmit_timer(). In turn, tcp_write_err() indirectly calls xs_error_report(), which wakes up the RPC pending tasks with a status of -EINVAL. This is then seen by call_status() in the SUN RPC implementation, which aborts the RPC call calling rpc_exit(), instead of handling this as a potentially temporary condition, i.e. as a timeout. Return -EINVAL only if the input parameters passed to ip_route_output_key_hash_rcu() are actually invalid (this is the case if the specified source address is multicast, limited broadcast or all zeroes), but return -ENETUNREACH in all cases where, at the given moment, the given source address doesn't allow resolving the route. While at it, drop the initialisation of err to -ENETUNREACH, which was added to __ip_route_output_key() back then by commit 0315e3827048 ("net: Fix behaviour of unreachable, blackhole and prohibit routes"), but actually had no effect, as it was, and is, overwritten by the fib_lookup() return code assignment, and anyway ignored in all other branches, including the if (fl4->saddr) one: I find this rather confusing, as it would look like -ENETUNREACH is the "default" error, while that statement has no effect. Also note that after commit fc75fc8339e7 ("ipv4: dont create routes on down devices"), we would get -ENETUNREACH if the device is down, but -EINVAL if the source address is specified and we can't resolve the route, and this appears to be rather inconsistent. Reported-by: Stefan Walter <walteste@inf.ethz.ch> Analysed-by: Benjamin Coddington <bcodding@redhat.com> Analysed-by: Gonzalo Siero <gsierohu@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b406472b |
|
04-Oct-2019 |
Paolo Abeni <pabeni@redhat.com> |
net: ipv4: avoid mixed n_redirects and rate_tokens usage Since commit c09551c6ff7f ("net: ipv4: use a dedicated counter for icmp_v4 redirect packets") we use 'n_redirects' to account for redirect packets, but we still use 'rate_tokens' to compute the redirect packets exponential backoff. If the device sent to the relevant peer any ICMP error packet after sending a redirect, it will also update 'rate_token' according to the leaking bucket schema; typically 'rate_token' will raise above BITS_PER_LONG and the redirect packets backoff algorithm will produce undefined behavior. Fix the issue using 'n_redirects' to compute the exponential backoff in ip_rt_send_redirect(). Note that we still clear rate_tokens after a redirect silence period, to avoid changing an established behaviour. The root cause predates git history; before the mentioned commit in the critical scenario, the kernel stopped sending redirects, after the mentioned commit the behavior more randomic. Reported-by: Xiumei Mu <xmu@redhat.com> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Fixes: c09551c6ff7f ("net: ipv4: use a dedicated counter for icmp_v4 redirect packets") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
77d5bc7e |
|
17-Sep-2019 |
David Ahern <dsahern@gmail.com> |
ipv4: Revert removal of rt_uses_gateway Julian noted that rt_uses_gateway has a more subtle use than 'is gateway set': https://lore.kernel.org/netdev/alpine.LFD.2.21.1909151104060.2546@ja.home.ssi.bg/ Revert that part of the commit referenced in the Fixes tag. Currently, there are no u8 holes in 'struct rtable'. There is a 4-byte hole in the second cacheline which contains the gateway declaration. So move rt_gw_family down to the gateway declarations since they are always used together, and then re-use that u8 for rt_uses_gateway. End result is that rtable size is unchanged. Fixes: 1550c171935d ("ipv4: Prepare rtable for IPv6 gateway") Reported-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
|
#
e93fb3e9 |
|
23-Aug-2019 |
John Fastabend <john.fastabend@gmail.com> |
net: route dump netlink NLM_F_MULTI flag missing An excerpt from netlink(7) man page, In multipart messages (multiple nlmsghdr headers with associated payload in one byte stream) the first and all following headers have the NLM_F_MULTI flag set, except for the last header which has the type NLMSG_DONE. but, after (ee28906) there is a missing NLM_F_MULTI flag in the middle of a FIB dump. The result is user space applications following above man page excerpt may get confused and may stop parsing msg believing something went wrong. In the golang netlink lib [0] the library logic stops parsing believing the message is not a multipart message. Found this running Cilium[1] against net-next while adding a feature to auto-detect routes. I noticed with multiple route tables we no longer could detect the default routes on net tree kernels because the library logic was not returning them. Fix this by handling the fib_dump_info_fnhe() case the same way the fib_dump_info() handles it by passing the flags argument through the call chain and adding a flags argument to rt_fill_info(). Tested with Cilium stack and auto-detection of routes works again. Also annotated libs to dump netlink msgs and inspected NLM_F_MULTI and NLMSG_DONE flags look correct after this. Note: In inet_rtm_getroute() pass rt_fill_info() '0' for flags the same as is done for fib_dump_info() so this looks correct to me. [0] https://github.com/vishvananda/netlink/ [1] https://github.com/cilium/ Fixes: ee28906fd7a14 ("ipv4: Dump route exceptions if requested") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
828b2b44 |
|
06-Jul-2019 |
Stephen Suryaputra <ssuryaextr@gmail.com> |
ipv4: Multipath hashing on inner L3 needs to consider inner IPv6 pkts Commit 363887a2cdfe ("ipv4: Support multipath hashing on inner IP pkts for GRE tunnel") supports multipath policy value of 2, Layer 3 or inner Layer 3 if present, but it only considers inner IPv4. There is a use case of IPv6 is tunneled by IPv4 GRE, thus add the ability to hash on inner IPv6 addresses. Fixes: 363887a2cdfe ("ipv4: Support multipath hashing on inner IP pkts for GRE tunnel") Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
537de0c8 |
|
04-Jul-2019 |
Ido Schimmel <idosch@mellanox.com> |
ipv4: Fix NULL pointer dereference in ipv4_neigh_lookup() Both ip_neigh_gw4() and ip_neigh_gw6() can return either a valid pointer or an error pointer, but the code currently checks that the pointer is not NULL. Fix this by checking that the pointer is not an error pointer, as this can result in a NULL pointer dereference [1]. Specifically, I believe that what happened is that ip_neigh_gw4() returned '-EINVAL' (0xffffffffffffffea) to which the offset of 'refcnt' (0x70) was added, which resulted in the address 0x000000000000005a. [1] BUG: KASAN: null-ptr-deref in refcount_inc_not_zero_checked+0x6e/0x180 Read of size 4 at addr 000000000000005a by task swapper/2/0 CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.2.0-rc6-custom-reg-179657-gaa32d89 #396 Hardware name: Mellanox Technologies Ltd. MSN2010/SA002610, BIOS 5.6.5 08/24/2017 Call Trace: <IRQ> dump_stack+0x73/0xbb __kasan_report+0x188/0x1ea kasan_report+0xe/0x20 refcount_inc_not_zero_checked+0x6e/0x180 ipv4_neigh_lookup+0x365/0x12c0 __neigh_update+0x1467/0x22f0 arp_process.constprop.6+0x82e/0x1f00 __netif_receive_skb_one_core+0xee/0x170 process_backlog+0xe3/0x640 net_rx_action+0x755/0xd90 __do_softirq+0x29b/0xae7 irq_exit+0x177/0x1c0 smp_apic_timer_interrupt+0x164/0x5e0 apic_timer_interrupt+0xf/0x20 </IRQ> Fixes: 5c9f7c1dfc2e ("ipv4: Add helpers for neigh lookup for nexthop") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reported-by: Shalom Toledo <shalomt@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8d7017fd |
|
01-Jul-2019 |
Mahesh Bandewar <maheshb@google.com> |
blackhole_netdev: use blackhole_netdev to invalidate dst entries Use blackhole_netdev instead of 'lo' device with lower MTU when marking dst "dead". Signed-off-by: Mahesh Bandewar <maheshb@google.com> Tested-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5cdda5f1 |
|
24-Jun-2019 |
Christian Brauner <christian@brauner.io> |
ipv4: enable route flushing in network namespaces Tools such as vpnc try to flush routes when run inside network namespaces by writing 1 into /proc/sys/net/ipv4/route/flush. This currently does not work because flush is not enabled in non-initial network namespaces. Since routes are per network namespace it is safe to enable /proc/sys/net/ipv4/route/flush in there. Link: https://github.com/lxc/lxd/issues/4257 Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5b18f128 |
|
26-Jun-2019 |
Stephen Suryaputra <ssuryaextr@gmail.com> |
ipv4: reset rt_iif for recirculated mcast/bcast out pkts Multicast or broadcast egress packets have rt_iif set to the oif. These packets might be recirculated back as input and lookup to the raw sockets may fail because they are bound to the incoming interface (skb_iif). If rt_iif is not zero, during the lookup, inet_iif() function returns rt_iif instead of skb_iif. Hence, the lookup fails. v2: Make it non vrf specific (David Ahern). Reword the changelog to reflect it. Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
93ed54b1 |
|
26-Jun-2019 |
Eric Dumazet <edumazet@google.com> |
ipv4: fix suspicious RCU usage in fib_dump_info_fnhe() sysbot reported that we lack appropriate rcu_read_lock() protection in fib_dump_info_fnhe() net/ipv4/route.c:2875 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by syz-executor609/8966: #0: 00000000b7dbe288 (rtnl_mutex){+.+.}, at: netlink_dump+0xe7/0xfb0 net/netlink/af_netlink.c:2199 stack backtrace: CPU: 0 PID: 8966 Comm: syz-executor609 Not tainted 5.2.0-rc5+ #43 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5250 fib_dump_info_fnhe+0x9d9/0x1080 net/ipv4/route.c:2875 fn_trie_dump_leaf net/ipv4/fib_trie.c:2141 [inline] fib_table_dump+0x64a/0xd00 net/ipv4/fib_trie.c:2175 inet_dump_fib+0x83c/0xa90 net/ipv4/fib_frontend.c:1004 rtnl_dump_all+0x295/0x490 net/core/rtnetlink.c:3445 netlink_dump+0x558/0xfb0 net/netlink/af_netlink.c:2244 __netlink_dump_start+0x5b1/0x7d0 net/netlink/af_netlink.c:2352 netlink_dump_start include/linux/netlink.h:226 [inline] rtnetlink_rcv_msg+0x73d/0xb00 net/core/rtnetlink.c:5182 netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477 rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5237 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline] netlink_unicast+0x531/0x710 net/netlink/af_netlink.c:1328 netlink_sendmsg+0x8ae/0xd70 net/netlink/af_netlink.c:1917 sock_sendmsg_nosec net/socket.c:646 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:665 sock_write_iter+0x27c/0x3e0 net/socket.c:994 call_write_iter include/linux/fs.h:1872 [inline] new_sync_write+0x4d3/0x770 fs/read_write.c:483 __vfs_write+0xe1/0x110 fs/read_write.c:496 vfs_write+0x20c/0x580 fs/read_write.c:558 ksys_write+0x14f/0x290 fs/read_write.c:611 __do_sys_write fs/read_write.c:623 [inline] __se_sys_write fs/read_write.c:620 [inline] __x64_sys_write+0x73/0xb0 fs/read_write.c:620 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x4401b9 Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffc8e134978 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401b9 RDX: 000000000000001c RSI: 0000000020000000 RDI: 0000000000000003 RBP: 00000000006ca018 R08: 00000000004002c8 R09: 00000000004002c8 R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000401a40 R13: 0000000000401ad0 R14: 0000000000000000 R15: 0000000000000000 Fixes: ee28906fd7a1 ("ipv4: Dump route exceptions if requested") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stefano Brivio <sbrivio@redhat.com> Cc: David Ahern <dsahern@gmail.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ee28906f |
|
21-Jun-2019 |
Stefano Brivio <sbrivio@redhat.com> |
ipv4: Dump route exceptions if requested Since commit 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions."), cached exception routes are stored as a separate entity, so they are not dumped on a FIB dump, even if the RTM_F_CLONED flag is passed. This implies that the command 'ip route list cache' doesn't return any result anymore. If the RTM_F_CLONED is passed, and strict checking requested, retrieve nexthop exception routes and dump them. If no strict checking is requested, filtering can't be performed consistently: dump everything in that case. With this, we need to add an argument to the netlink callback in order to track how many entries were already dumped for the last leaf included in a partial netlink dump. A single additional argument is sufficient, even if we traverse logically nested structures (nexthop objects, hash table buckets, bucket chains): it doesn't matter if we stop in the middle of any of those, because they are always traversed the same way. As an example, s_i values in [], s_fa values in (): node (fa) #1 [1] nexthop #1 bucket #1 -> #0 in chain (1) bucket #2 -> #0 in chain (2) -> #1 in chain (3) -> #2 in chain (4) bucket #3 -> #0 in chain (5) -> #1 in chain (6) nexthop #2 bucket #1 -> #0 in chain (7) -> #1 in chain (8) bucket #2 -> #0 in chain (9) -- node (fa) #2 [2] nexthop #1 bucket #1 -> #0 in chain (1) -> #1 in chain (2) bucket #2 -> #0 in chain (3) it doesn't matter if we stop at (3), (4), (7) for "node #1", or at (2) for "node #2": walking flattens all that. It would even be possible to drop the distinction between the in-tree (s_i) and in-node (s_fa) counter, but a further improvement might advise against this. This is only as accurate as the existing tracking mechanism for leaves: if a partial dump is restarted after exceptions are removed or expired, we might skip some non-dumped entries. To improve this, we could attach a 'sernum' attribute (similar to the one used for IPv6) to nexthop entities, and bump this counter whenever exceptions change: having a distinction between the two counters would make this more convenient. Listing of exception routes (modified routes pre-3.5) was tested against these versions of kernel and iproute2: iproute2 kernel 4.14.0 4.15.0 4.19.0 5.0.0 5.1.0 3.5-rc4 + + + + + 4.4 4.9 4.14 4.15 4.19 5.0 5.1 fixed + + + + + v7: - Move loop over nexthop objects to route.c, and pass struct fib_info and table ID to it, not a struct fib_alias (suggested by David Ahern) - While at it, note that the NULL check on fa->fa_info is redundant, and the check on RTNH_F_DEAD is also not consistent with what's done with regular route listing: just keep it for nhc_flags - Rename entry point function for dumping exceptions to fib_dump_info_fnhe(), and rearrange arguments for consistency with fib_dump_info() - Rename fnhe_dump_buckets() to fnhe_dump_bucket() and make it handle one bucket at a time - Expand commit message to describe why we can have a single "skip" counter for all exceptions stored in bucket chains in nexthop objects (suggested by David Ahern) v6: - Rebased onto net-next - Loop over nexthop paths too. Move loop over fnhe buckets to route.c, avoids need to export rt_fill_info() and to touch exceptions from fib_trie.c. Pass NULL as flow to rt_fill_info(), it now allows that (suggested by David Ahern) Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions.") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d948974c |
|
21-Jun-2019 |
Stefano Brivio <sbrivio@redhat.com> |
ipv4/route: Allow NULL flowinfo in rt_fill_info() In the next patch, we're going to use rt_fill_info() to dump exception routes upon RTM_GETROUTE with NLM_F_ROOT, meaning userspace is requesting a dump and not a specific route selection, which in turn implies the input interface is not relevant. Update rt_fill_info() to handle a NULL flowinfo. v7: If fl4 is NULL, explicitly set r->rtm_tos to 0: it's not initialised otherwise (spotted by David Ahern) v6: New patch Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
363887a2 |
|
13-Jun-2019 |
Stephen Suryaputra <ssuryaextr@gmail.com> |
ipv4: Support multipath hashing on inner IP pkts for GRE tunnel Multipath hash policy value of 0 isn't distributing since the outer IP dest and src aren't varied eventhough the inner ones are. Since the flow is on the inner ones in the case of tunneled traffic, hashing on them is desired. This is done mainly for IP over GRE, hence only tested for that. But anything else supported by flow dissection should work. v2: Use skb_flow_dissect_flow_keys() directly so that other tunneling can be supported through flow dissection (per Nikolay Aleksandrov). v3: Remove accidental inclusion of ports in the hash keys and clarify the documentation (Nikolay Alexandrov). Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0a90478b |
|
02-Jun-2019 |
Xin Long <lucien.xin@gmail.com> |
ipv4: not do cache for local delivery if bc_forwarding is enabled With the topo: h1 ---| rp1 | | route rp3 |--- h3 (192.168.200.1) h2 ---| rp2 | If rp1 bc_forwarding is set while rp2 bc_forwarding is not, after doing "ping 192.168.200.255" on h1, then ping 192.168.200.255 on h2, and the packets can still be forwared. This issue was caused by the input route cache. It should only do the cache for either bc forwarding or local delivery. Otherwise, local delivery can use the route cache for bc forwarding of other interfaces. This patch is to fix it by not doing cache for local delivery if all.bc_forwarding is enabled. Note that we don't fix it by checking route cache local flag after rt_cache_valid() in "local_input:" and "ip_mkroute_input", as the common route code shouldn't be touched for bc_forwarding. Fixes: 5cbf777cfdf6 ("route: add support for directed broadcast forwarding") Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dcb1ecb5 |
|
03-Jun-2019 |
David Ahern <dsahern@gmail.com> |
ipv4: Prepare for fib6_nh from a nexthop object Convert more IPv4 code to use fib_nh_common over fib_nh to enable routes to use a fib6_nh based nexthop. In the end, only code not using a nexthop object in a fib_info should directly access fib_nh in a fib_info without checking the famiy and going through fib_nh_common. Those functions will be marked when it is not directly evident. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5481d73f |
|
03-Jun-2019 |
David Ahern <dsahern@gmail.com> |
ipv4: Use accessors for fib_info nexthop data Use helpers to access fib_nh and fib_nhs fields of a fib_info. Drop the fib_dev macro which is an alias for the first nexthop. Replacements: fi->fib_dev --> fib_info_nh(fi, 0)->fib_nh_dev fi->fib_nh --> fib_info_nh(fi, 0) fi->fib_nh[i] --> fib_info_nh(fi, i) fi->fib_nhs --> fib_info_num_path(fi) where fib_info_nh(fi, i) returns fi->fib_nh[nhsel] and fib_info_num_path returns fi->fib_nhs. Move the existing fib_info_nhc to nexthop.h and define the new ones there. A later patch adds a check if a fib_info uses a nexthop object, and defining the helpers in nexthop.h avoid circular header dependencies. After this all remaining open coded references to fi->fib_nhs and fi->fib_nh are in: - fib_create_info and helpers used to lookup an existing fib_info entry, and - the netdev event functions fib_sync_down_dev and fib_sync_up. The latter two will not be reused for nexthops, and the fib_create_info will be updated to handle a nexthop in a fib_info. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2874c5fd |
|
27-May-2019 |
Thomas Gleixner <tglx@linutronix.de> |
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
a5995e71 |
|
30-Apr-2019 |
David Ahern <dsahern@gmail.com> |
ipv4: Move exception bucket to nh_common Similar to the cached routes, make IPv4 exceptions accessible when using an IPv6 nexthop struct with IPv4 routes. Simplify the exception functions by passing in fib_nh_common since that is all it needs, and then cleanup the call sites that have extraneous fib_nh conversions. As with the cached routes this is a change in location only, from fib_nh up to fib_nh_common; no functional change intended. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
87063a1f |
|
30-Apr-2019 |
David Ahern <dsahern@gmail.com> |
ipv4: Pass fib_nh_common to rt_cache_route Now that the cached routes are in fib_nh_common, pass it to rt_cache_route and simplify its callers. For rt_set_nexthop, the tclassid becomes the last user of fib_nh so move the container_of under the #ifdef CONFIG_IP_ROUTE_CLASSID. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0f457a36 |
|
30-Apr-2019 |
David Ahern <dsahern@gmail.com> |
ipv4: Move cached routes to fib_nh_common While the cached routes, nh_pcpu_rth_output and nh_rth_input, are IPv4 specific, a later patch wants to make them accessible for IPv6 nexthops with IPv4 routes using a fib6_nh. Move the cached routes from fib_nh to fib_nh_common and update references. Initialization of the cached entries is moved to fib_nh_common_init, and free is moved to fib_nh_common_release. Change in location only, from fib_nh up to fib_nh_common; no functional change intended. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8cb08174 |
|
26-Apr-2019 |
Johannes Berg <johannes.berg@intel.com> |
netlink: make validation more configurable for future strictness We currently have two levels of strict validation: 1) liberal (default) - undefined (type >= max) & NLA_UNSPEC attributes accepted - attribute length >= expected accepted - garbage at end of message accepted 2) strict (opt-in) - NLA_UNSPEC attributes accepted - attribute length >= expected accepted Split out parsing strictness into four different options: * TRAILING - check that there's no trailing data after parsing attributes (in message or nested) * MAXTYPE - reject attrs > max known type * UNSPEC - reject attributes with NLA_UNSPEC policy entries * STRICT_ATTRS - strictly validate attribute size The default for future things should be *everything*. The current *_strict() is a combination of TRAILING and MAXTYPE, and is renamed to _deprecated_strict(). The current regular parsing has none of this, and is renamed to *_parse_deprecated(). Additionally it allows us to selectively set one of the new flags even on old policies. Notably, the UNSPEC flag could be useful in this case, since it can be arranged (by filling in the policy) to not be an incompatible userspace ABI change, but would then going forward prevent forgetting attribute entries. Similar can apply to the POLICY flag. We end up with the following renames: * nla_parse -> nla_parse_deprecated * nla_parse_strict -> nla_parse_deprecated_strict * nlmsg_parse -> nlmsg_parse_deprecated * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict * nla_parse_nested -> nla_parse_nested_deprecated * nla_validate_nested -> nla_validate_nested_deprecated Using spatch, of course: @@ expression TB, MAX, HEAD, LEN, POL, EXT; @@ -nla_parse(TB, MAX, HEAD, LEN, POL, EXT) +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression TB, MAX, NLA, POL, EXT; @@ -nla_parse_nested(TB, MAX, NLA, POL, EXT) +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT) @@ expression START, MAX, POL, EXT; @@ -nla_validate_nested(START, MAX, POL, EXT) +nla_validate_nested_deprecated(START, MAX, POL, EXT) @@ expression NLH, HDRLEN, MAX, POL, EXT; @@ -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT) +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT) For this patch, don't actually add the strict, non-renamed versions yet so that it breaks compile if I get it wrong. Also, while at it, make nla_validate and nla_parse go down to a common __nla_validate_parse() function to avoid code duplication. Ultimately, this allows us to have very strict validation for every new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the next patch, while existing things will continue to work as is. In effect then, this adds fully strict validation for any new command. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
20ff83f1 |
|
24-Apr-2019 |
Eric Dumazet <edumazet@google.com> |
ipv4: add sanity checks in ipv4_link_failure() Before calling __ip_options_compile(), we need to ensure the network header is a an IPv4 one, and that it is already pulled in skb->head. RAW sockets going through a tunnel can end up calling ipv4_link_failure() with total garbage in the skb, or arbitrary lengthes. syzbot report : BUG: KASAN: stack-out-of-bounds in memcpy include/linux/string.h:355 [inline] BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0x294/0x1120 net/ipv4/ip_options.c:123 Write of size 69 at addr ffff888096abf068 by task syz-executor.4/9204 CPU: 0 PID: 9204 Comm: syz-executor.4 Not tainted 5.1.0-rc5+ #77 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187 kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 check_memory_region_inline mm/kasan/generic.c:185 [inline] check_memory_region+0x123/0x190 mm/kasan/generic.c:191 memcpy+0x38/0x50 mm/kasan/common.c:133 memcpy include/linux/string.h:355 [inline] __ip_options_echo+0x294/0x1120 net/ipv4/ip_options.c:123 __icmp_send+0x725/0x1400 net/ipv4/icmp.c:695 ipv4_link_failure+0x29f/0x550 net/ipv4/route.c:1204 dst_link_failure include/net/dst.h:427 [inline] vti6_xmit net/ipv6/ip6_vti.c:514 [inline] vti6_tnl_xmit+0x10d4/0x1c0c net/ipv6/ip6_vti.c:553 __netdev_start_xmit include/linux/netdevice.h:4414 [inline] netdev_start_xmit include/linux/netdevice.h:4423 [inline] xmit_one net/core/dev.c:3292 [inline] dev_hard_start_xmit+0x1b2/0x980 net/core/dev.c:3308 __dev_queue_xmit+0x271d/0x3060 net/core/dev.c:3878 dev_queue_xmit+0x18/0x20 net/core/dev.c:3911 neigh_direct_output+0x16/0x20 net/core/neighbour.c:1527 neigh_output include/net/neighbour.h:508 [inline] ip_finish_output2+0x949/0x1740 net/ipv4/ip_output.c:229 ip_finish_output+0x73c/0xd50 net/ipv4/ip_output.c:317 NF_HOOK_COND include/linux/netfilter.h:278 [inline] ip_output+0x21f/0x670 net/ipv4/ip_output.c:405 dst_output include/net/dst.h:444 [inline] NF_HOOK include/linux/netfilter.h:289 [inline] raw_send_hdrinc net/ipv4/raw.c:432 [inline] raw_sendmsg+0x1d2b/0x2f20 net/ipv4/raw.c:663 inet_sendmsg+0x147/0x5d0 net/ipv4/af_inet.c:798 sock_sendmsg_nosec net/socket.c:651 [inline] sock_sendmsg+0xdd/0x130 net/socket.c:661 sock_write_iter+0x27c/0x3e0 net/socket.c:988 call_write_iter include/linux/fs.h:1866 [inline] new_sync_write+0x4c7/0x760 fs/read_write.c:474 __vfs_write+0xe4/0x110 fs/read_write.c:487 vfs_write+0x20c/0x580 fs/read_write.c:549 ksys_write+0x14f/0x2d0 fs/read_write.c:599 __do_sys_write fs/read_write.c:611 [inline] __se_sys_write fs/read_write.c:608 [inline] __x64_sys_write+0x73/0xb0 fs/read_write.c:608 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x458c29 Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f293b44bc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000458c29 RDX: 0000000000000014 RSI: 00000000200002c0 RDI: 0000000000000003 RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f293b44c6d4 R13: 00000000004c8623 R14: 00000000004ded68 R15: 00000000ffffffff The buggy address belongs to the page: page:ffffea00025aafc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0 flags: 0x1fffc0000000000() raw: 01fffc0000000000 0000000000000000 ffffffff025a0101 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888096abef80: 00 00 00 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 f2 ffff888096abf000: f2 f2 f2 f2 00 00 00 00 00 00 00 00 00 00 00 00 >ffff888096abf080: 00 00 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 ^ ffff888096abf100: 00 00 00 00 f1 f1 f1 f1 00 00 f3 f3 00 00 00 00 ffff888096abf180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Fixes: ed0de45a1008 ("ipv4: recompile ip options in ipv4_link_failure") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Suryaputra <ssuryaextr@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c543cb4a |
|
13-Apr-2019 |
Eric Dumazet <edumazet@google.com> |
ipv4: ensure rcu_read_lock() in ipv4_link_failure() fib_compute_spec_dst() needs to be called under rcu protection. syzbot reported : WARNING: suspicious RCU usage 5.1.0-rc4+ #165 Not tainted include/linux/inetdevice.h:220 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by swapper/0/0: #0: 0000000051b67925 ((&n->timer)){+.-.}, at: lockdep_copy_map include/linux/lockdep.h:170 [inline] #0: 0000000051b67925 ((&n->timer)){+.-.}, at: call_timer_fn+0xda/0x720 kernel/time/timer.c:1315 stack backtrace: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.1.0-rc4+ #165 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5162 __in_dev_get_rcu include/linux/inetdevice.h:220 [inline] fib_compute_spec_dst+0xbbd/0x1030 net/ipv4/fib_frontend.c:294 spec_dst_fill net/ipv4/ip_options.c:245 [inline] __ip_options_compile+0x15a7/0x1a10 net/ipv4/ip_options.c:343 ipv4_link_failure+0x172/0x400 net/ipv4/route.c:1195 dst_link_failure include/net/dst.h:427 [inline] arp_error_report+0xd1/0x1c0 net/ipv4/arp.c:297 neigh_invalidate+0x24b/0x570 net/core/neighbour.c:995 neigh_timer_handler+0xc35/0xf30 net/core/neighbour.c:1081 call_timer_fn+0x190/0x720 kernel/time/timer.c:1325 expire_timers kernel/time/timer.c:1362 [inline] __run_timers kernel/time/timer.c:1681 [inline] __run_timers kernel/time/timer.c:1649 [inline] run_timer_softirq+0x652/0x1700 kernel/time/timer.c:1694 __do_softirq+0x266/0x95a kernel/softirq.c:293 invoke_softirq kernel/softirq.c:374 [inline] irq_exit+0x180/0x1d0 kernel/softirq.c:414 exiting_irq arch/x86/include/asm/apic.h:536 [inline] smp_apic_timer_interrupt+0x14a/0x570 arch/x86/kernel/apic/apic.c:1062 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:807 Fixes: ed0de45a1008 ("ipv4: recompile ip options in ipv4_link_failure") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ed0de45a |
|
12-Apr-2019 |
Stephen Suryaputra <ssuryaextr@gmail.com> |
ipv4: recompile ip options in ipv4_link_failure Recompile IP options since IPCB may not be valid anymore when ipv4_link_failure is called from arp_error_report. Refer to the commit 3da1ed7ac398 ("net: avoid use IPCB in cipso_v4_error") and the commit before that (9ef6b42ad6fd) for a similar issue. Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6de9c055 |
|
05-Apr-2019 |
David Ahern <dsahern@gmail.com> |
ipv4: Handle ipv6 gateway in ipv4_confirm_neigh Update ipv4_confirm_neigh to handle an ipv6 gateway. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5c9f7c1d |
|
05-Apr-2019 |
David Ahern <dsahern@gmail.com> |
ipv4: Add helpers for neigh lookup for nexthop A common theme in the output path is looking up a neigh entry for a nexthop, either the gateway in an rtable or a fallback to the daddr in the skb: nexthop = (__force u32)rt_nexthop(rt, ip_hdr(skb)->daddr); neigh = __ipv4_neigh_lookup_noref(dev, nexthop); if (unlikely(!neigh)) neigh = __neigh_create(&arp_tbl, &nexthop, dev, false); To allow the nexthop to be an IPv6 address we need to consider the family of the nexthop and then call __ipv{4,6}_neigh_lookup_noref based on it. To make this simpler, add a ip_neigh_gw4 helper similar to ip_neigh_gw6 added in an earlier patch which handles: neigh = __ipv4_neigh_lookup_noref(dev, nexthop); if (unlikely(!neigh)) neigh = __neigh_create(&arp_tbl, &nexthop, dev, false); And then add a second one, ip_neigh_for_gw, that calls either ip_neigh_gw4 or ip_neigh_gw6 based on the address family of the gateway. Update the output paths in the VRF driver and core v4 code to use ip_neigh_for_gw simplifying the family based lookup and making both ready for a v6 nexthop. ipv4_neigh_lookup has a different need - the potential to resolve a passed in address in addition to any gateway in the rtable or skb. Since this is a one-off, add ip_neigh_gw4 and ip_neigh_gw6 diectly. The difference between __neigh_create used by the helpers and neigh_create called by ipv4_neigh_lookup is taking a refcount, so add rcu_read_lock_bh and bump the refcnt on the neigh entry. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0f5f7d7b |
|
05-Apr-2019 |
David Ahern <dsahern@gmail.com> |
ipv4: Add support to rtable for ipv6 gateway Add support for an IPv6 gateway to rtable. Since a gateway is either IPv4 or IPv6, make it a union with rt_gw4 where rt_gw_family decides which address is in use. When dumping the route data, encode an ipv6 nexthop using RTA_VIA. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1550c171 |
|
05-Apr-2019 |
David Ahern <dsahern@gmail.com> |
ipv4: Prepare rtable for IPv6 gateway To allow the gateway to be either an IPv4 or IPv6 address, remove rt_uses_gateway from rtable and replace with rt_gw_family. If rt_gw_family is set it implies rt_uses_gateway. Rename rt_gateway to rt_gw4 to represent the IPv4 version. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bdf00467 |
|
05-Apr-2019 |
David Ahern <dsahern@gmail.com> |
net: Replace nhc_has_gw with nhc_gw_family Allow the gateway in a fib_nh_common to be from a different address family than the outer fib{6}_nh. To that end, replace nhc_has_gw with nhc_gw_family and update users of nhc_has_gw to check nhc_gw_family. Now nhc_family is used to know if the nh_common is part of a fib_nh or fib6_nh (used for container_of to get to route family specific data), and nhc_gw_family represents the address family for the gateway. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
eba618ab |
|
02-Apr-2019 |
David Ahern <dsahern@gmail.com> |
ipv4: Add fib_nh_common to fib_result Most of the ipv4 code only needs data from fib_nh_common. Add fib_nh_common selection to fib_result and update users to use it. Right now, fib_nh_common in fib_result will point to a fib_nh struct that is embedded within a fib_info: fib_info --> fib_nh fib_nh ... fib_nh ^ fib_result->nhc ----+ Later, nhc can point to a fib_nh within a nexthop struct: fib_info --> nexthop --> fib_nh ^ fib_result->nhc ---------------+ or for a nexthop group: fib_info --> nexthop --> nexthop --> fib_nh nexthop --> fib_nh ... nexthop --> fib_nh ^ fib_result->nhc ---------------------------+ In all cases nhsel within fib_result will point to which leg in the multipath route is used. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b75ed8b1 |
|
27-Mar-2019 |
David Ahern <dsahern@gmail.com> |
ipv4: Rename fib_nh entries Rename fib_nh entries that will be moved to a fib_nh_common struct. Specifically, the device, oif, gateway, flags, scope, lwtstate, nh_weight and nh_upper_bound are common with all nexthop definitions. In the process shorten fib_nh_lwtstate to fib_nh_lws to avoid really long lines. Rename only; no functional change intended. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
df453700 |
|
27-Mar-2019 |
Eric Dumazet <edumazet@google.com> |
inet: switch IP ID generator to siphash According to Amit Klein and Benny Pinkas, IP ID generation is too weak and might be used by attackers. Even with recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix()) having 64bit key and Jenkins hash is risky. It is time to switch to siphash and its 128bit keys. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Amit Klein <aksecurity@gmail.com> Reported-by: Benny Pinkas <benny@pinkas.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
02afc7ad |
|
20-Mar-2019 |
Julian Wiedmann <jwi@linux.ibm.com> |
net: dst: remove gc leftovers Get rid of some obsolete gc-related documentation and macros that were missed in commit 5b7c9a8ff828 ("net: remove dst gc related code"). CC: Wei Wang <weiwan@google.com> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Acked-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ee60ad21 |
|
07-Mar-2019 |
Xin Long <lucien.xin@gmail.com> |
route: set the deleted fnhe fnhe_daddr to 0 in ip_del_fnhe to fix a race The race occurs in __mkroute_output() when 2 threads lookup a dst: CPU A CPU B find_exception() find_exception() [fnhe expires] ip_del_fnhe() [fnhe is deleted] rt_bind_exception() In rt_bind_exception() it will bind a deleted fnhe with the new dst, and this dst will get no chance to be freed. It causes a dev defcnt leak and consecutive dmesg warnings: unregister_netdevice: waiting for ethX to become free. Usage count = 1 Especially thanks Jon to identify the issue. This patch fixes it by setting fnhe_daddr to 0 in ip_del_fnhe() to stop binding the deleted fnhe with a new dst when checking fnhe's fnhe_daddr and daddr in rt_bind_exception(). It works as both ip_del_fnhe() and rt_bind_exception() are protected by fnhe_lock and the fhne is freed by kfree_rcu(). Fixes: deed49df7390 ("route: check and remove route cache when we get route") Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
22c74764 |
|
06-Mar-2019 |
Paolo Abeni <pabeni@redhat.com> |
ipv4/route: fail early when inet dev is missing If a non local multicast packet reaches ip_route_input_rcu() while the ingress device IPv4 private data (in_dev) is NULL, we end up doing a NULL pointer dereference in IN_DEV_MFORWARD(). Since the later call to ip_route_input_mc() is going to fail if !in_dev, we can fail early in such scenario and avoid the dangerous code path. v1 -> v2: - clarified the commit message, no code changes Reported-by: Tianhao Zhao <tizhao@redhat.com> Fixes: e58e41596811 ("net: Enable support for VRF with ipv4 multicast") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2a8e4997 |
|
01-Mar-2019 |
Ido Schimmel <idosch@mellanox.com> |
net: ipv4: Fix NULL pointer dereference in route lookup When calculating the multipath hash for input routes the flow info is not available and therefore should not be used. Fixes: 24ba14406c5c ("route: Add multipath_hash in flowi_common to make user-define hash") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Cc: wenxu <wenxu@ucloud.cn> Acked-by: wenxu <wenxu@ucloud.cn> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5e1a99ea |
|
27-Feb-2019 |
Hangbin Liu <liuhangbin@gmail.com> |
ipv4: Add ICMPv6 support when parse route ipproto For ip rules, we need to use 'ipproto ipv6-icmp' to match ICMPv6 headers. But for ip -6 route, currently we only support tcp, udp and icmp. Add ICMPv6 support so we can match ipv6-icmp rules for route lookup. v2: As David Ahern and Sabrina Dubroca suggested, Add an argument to rtm_getroute_parse_ip_proto() to handle ICMP/ICMPv6 with different family. Reported-by: Jianlin Shi <jishi@redhat.com> Fixes: eacb9384a3fe ("ipv6: support sport, dport and ip_proto in RTM_GETROUTE") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
24ba1440 |
|
23-Feb-2019 |
wenxu <wenxu@ucloud.cn> |
route: Add multipath_hash in flowi_common to make user-define hash Current fib_multipath_hash_policy can make hash based on the L3 or L4. But it only work on the outer IP. So a specific tunnel always has the same hash value. But a specific tunnel may contain so many inner connections. This patch provide a generic multipath_hash in floi_common. It can make a user-define hash which can mix with L3 or L4 hash. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c09551c6 |
|
06-Feb-2019 |
Lorenzo Bianconi <lorenzo.bianconi@redhat.com> |
net: ipv4: use a dedicated counter for icmp_v4 redirect packets According to the algorithm described in the comment block at the beginning of ip_rt_send_redirect, the host should try to send 'ip_rt_redirect_number' ICMP redirect packets with an exponential backoff and then stop sending them at all assuming that the destination ignores redirects. If the device has previously sent some ICMP error packets that are rate-limited (e.g TTL expired) and continues to receive traffic, the redirect packets will never be transmitted. This happens since peer->rate_tokens will be typically greater than 'ip_rt_redirect_number' and so it will never be reset even if the redirect silence timeout (ip_rt_redirect_silence) has elapsed without receiving any packet requiring redirects. Fix it by using a dedicated counter for the number of ICMP redirect packets that has been sent by the host I have not been able to identify a given commit that introduced the issue since ip_rt_send_redirect implements the same rate-limiting algorithm from commit 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1d2f4ebb |
|
31-Jan-2019 |
Edward Chron <echron@arista.com> |
ipv4/igmp: Don't drop IGMP pkt with zeros src addr Don't drop IGMP packets with a source address of all zeros which are IGMP proxy reports. This is documented in Section 2.1.1 IGMP Forwarding Rules of RFC 4541 IGMP and MLD Snooping Switches Considerations. Signed-off-by: Edward Chron <echron@arista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a00302b6 |
|
18-Jan-2019 |
Jakub Kicinski <kuba@kernel.org> |
net: ipv4: route: perform strict checks also for doit handlers Make RTM_GETROUTE's doit handler use strict checks when NETLINK_F_STRICT_CHK is set. v2: - new patch (DaveA). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
21f94775 |
|
20-Dec-2018 |
Ido Schimmel <idosch@mellanox.com> |
net: ipv4: Set skb->dev for output route resolution When user requests to resolve an output route, the kernel synthesizes an skb where the relevant parameters (e.g., source address) are set. The skb is then passed to ip_route_output_key_hash_rcu() which might call into the flow dissector in case a multipath route was hit and a nexthop needs to be selected based on the multipath hash. Since both 'skb->dev' and 'skb->sk' are not set, a warning is triggered in the flow dissector [1]. The warning is there to prevent codepaths from silently falling back to the standard flow dissector instead of the BPF one. Therefore, instead of removing the warning, set 'skb->dev' to the loopback device, as its not used for anything but resolving the correct namespace. [1] WARNING: CPU: 1 PID: 24819 at net/core/flow_dissector.c:764 __skb_flow_dissect+0x314/0x16b0 ... RSP: 0018:ffffa0df41fdf650 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8bcded232000 RCX: 0000000000000000 RDX: ffffa0df41fdf7e0 RSI: ffffffff98e415a0 RDI: ffff8bcded232000 RBP: ffffa0df41fdf760 R08: 0000000000000000 R09: 0000000000000000 R10: ffffa0df41fdf7e8 R11: ffff8bcdf27a3000 R12: ffffffff98e415a0 R13: ffffa0df41fdf7e0 R14: ffffffff98dd2980 R15: ffffa0df41fdf7e0 FS: 00007f46f6897680(0000) GS:ffff8bcdf7a80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055933e95f9a0 CR3: 000000021e636000 CR4: 00000000001006e0 Call Trace: fib_multipath_hash+0x28c/0x2d0 ? fib_multipath_hash+0x28c/0x2d0 fib_select_path+0x241/0x32f ? __fib_lookup+0x6a/0xb0 ip_route_output_key_hash_rcu+0x650/0xa30 ? __alloc_skb+0x9b/0x1d0 inet_rtm_getroute+0x3f7/0xb80 ? __alloc_pages_nodemask+0x11c/0x2c0 rtnetlink_rcv_msg+0x1d9/0x2f0 ? rtnl_calcit.isra.24+0x120/0x120 netlink_rcv_skb+0x54/0x130 rtnetlink_rcv+0x15/0x20 netlink_unicast+0x20a/0x2c0 netlink_sendmsg+0x2d1/0x3d0 sock_sendmsg+0x39/0x50 ___sys_sendmsg+0x2a0/0x2f0 ? filemap_map_pages+0x16b/0x360 ? __handle_mm_fault+0x108e/0x13d0 __sys_sendmsg+0x63/0xa0 ? __sys_sendmsg+0x63/0xa0 __x64_sys_sendmsg+0x1f/0x30 do_syscall_64+0x5a/0x120 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: d0e13a1488ad ("flow_dissector: lookup netns by skb->sk if skb->dev is NULL") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b2c85100 |
|
20-Nov-2018 |
David S. Miller <davem@davemloft.net> |
ipv4: Don't try to print ASCII of link level header in martian dumps. This has no value whatsoever. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
28d35bcd |
|
09-Oct-2018 |
Sabrina Dubroca <sd@queasysnail.net> |
net: ipv4: don't let PMTU updates increase route MTU When an MTU update with PMTU smaller than net.ipv4.route.min_pmtu is received, we must clamp its value. However, we can receive a PMTU exception with PMTU < old_mtu < ip_rt_min_pmtu, which would lead to an increase in PMTU. To fix this, take the smallest of the old MTU and ip_rt_min_pmtu. Before this patch, in case of an update, the exception's MTU would always change. Now, an exception can have only its lock flag updated, but not the MTU, so we need to add a check on locking to the following "is this exception getting updated, or close to expiring?" test. Fixes: d52e5a7e7ca4 ("ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1620a336 |
|
04-Oct-2018 |
David Ahern <dsahern@gmail.com> |
net: Move free of dst_metrics to helper Move the refcounting and potential free of dst metrics associated for ipv4 and ipv6 to a common helper. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e1255ed4 |
|
04-Oct-2018 |
David Ahern <dsahern@gmail.com> |
net: common metrics init helper for dst_entry ipv4 and ipv6 both use refcounted metrics if FIB entries have metrics set. Move the common initialization code to a helper and use for both protocols. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e8e3fbe9 |
|
30-Sep-2018 |
Maciej Żenczykowski <maze@google.com> |
net: inet_rtm_getroute() - use new style struct initializer instead of memset Signed-off-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e351bb62 |
|
30-Sep-2018 |
Maciej Żenczykowski <maze@google.com> |
net: ip_rt_get_source() - use new style struct initializer instead of memset (allows for better compiler optimization) Signed-off-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1042caa7 |
|
25-Sep-2018 |
Maciej Żenczykowski <maze@google.com> |
net-ipv4: remove 2 always zero parameters from ipv4_redirect() (the parameters in question are mark and flow_flags) Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d888f396 |
|
25-Sep-2018 |
Maciej Żenczykowski <maze@google.com> |
net-ipv4: remove 2 always zero parameters from ipv4_update_pmtu() (the parameters in question are mark and flow_flags) Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5cbf777c |
|
27-Jul-2018 |
Xin Long <lucien.xin@gmail.com> |
route: add support for directed broadcast forwarding This patch implements the feature described in rfc1812#section-5.3.5.2 and rfc2644. It allows the router to forward directed broadcast when sysctl bc_forwarding is enabled. Note that this feature could be done by iptables -j TEE, but it would cause some problems: - target TEE's gateway param has to be set with a specific address, and it's not flexible especially when the route wants forward all directed broadcasts. - this duplicates the directed broadcasts so this may cause side effects to applications. Besides, to keep consistent with other os router like BSD, it's also necessary to implement it in the route rx path. Note that route cache needs to be flushed when bc_forwarding is changed. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6396bb22 |
|
12-Jun-2018 |
Kees Cook <keescook@chromium.org> |
treewide: kzalloc() -> kcalloc() The kzalloc() function has a 2-factor argument form, kcalloc(). This patch replaces cases of: kzalloc(a * b, gfp) with: kcalloc(a * b, gfp) as well as handling cases of: kzalloc(a * b * c, gfp) with: kzalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kzalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kzalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(char) * COUNT + COUNT , ...) | kzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kzalloc + kcalloc ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kzalloc(C1 * C2 * C3, ...) | kzalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kzalloc(sizeof(THING) * C2, ...) | kzalloc(sizeof(TYPE) * C2, ...) | kzalloc(C1 * C2 * C3, ...) | kzalloc(C1 * C2, ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - (E1) * E2 + E1, E2 , ...) | - kzalloc + kcalloc ( - (E1) * (E2) + E1, E2 , ...) | - kzalloc + kcalloc ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
|
#
6da2ec56 |
|
12-Jun-2018 |
Kees Cook <keescook@chromium.org> |
treewide: kmalloc() -> kmalloc_array() The kmalloc() function has a 2-factor argument form, kmalloc_array(). This patch replaces cases of: kmalloc(a * b, gfp) with: kmalloc_array(a * b, gfp) as well as handling cases of: kmalloc(a * b * c, gfp) with: kmalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kmalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kmalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The tools/ directory was manually excluded, since it has its own implementation of kmalloc(). The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kmalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kmalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kmalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(char) * COUNT + COUNT , ...) | kmalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kmalloc + kmalloc_array ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kmalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kmalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kmalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kmalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kmalloc(C1 * C2 * C3, ...) | kmalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kmalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kmalloc(sizeof(THING) * C2, ...) | kmalloc(sizeof(TYPE) * C2, ...) | kmalloc(C1 * C2 * C3, ...) | kmalloc(C1 * C2, ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - (E1) * E2 + E1, E2 , ...) | - kmalloc + kmalloc_array ( - (E1) * (E2) + E1, E2 , ...) | - kmalloc + kmalloc_array ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
|
#
404eb77e |
|
22-May-2018 |
Roopa Prabhu <roopa@cumulusnetworks.com> |
ipv4: support sport, dport and ip_proto in RTM_GETROUTE This is a followup to fib rules sport, dport and ipproto match support. Only supports tcp, udp and icmp for ipproto. Used by fib rule self tests. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
50d889b1 |
|
21-May-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv4: Add helper to return path MTU based on fib result Determine path MTU from a FIB lookup result. Logic is a distillation of ip_dst_mtu_maybe_forward. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
5a847a6e |
|
16-May-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv4: Initialize proto and ports in flow struct Updating the FIB tracepoint for the recent change to allow rules using the protocol and ports exposed a few places where the entries in the flow struct are not initialized. For __fib_validate_source add the call to fib4_rules_early_flow_dissect since it is invoked for the input path. For netfilter, add the memset on the flow struct to avoid future problems like this. In ip_route_input_slow need to set the fields if the skb dissection does not happen. Fixes: bfff4862653b ("net: fib_rules: support for match on ip_proto, sport and dport") Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3f3942ac |
|
15-May-2018 |
Christoph Hellwig <hch@lst.de> |
proc: introduce proc_create_single{,_data} Variants of proc_create{,_data} that directly take a seq_file show callback and drastically reduces the boilerplate code in the callers. All trivial callers converted over. Signed-off-by: Christoph Hellwig <hch@lst.de>
|
#
0e8411e4 |
|
09-May-2018 |
Hangbin Liu <liuhangbin@gmail.com> |
ipv4: reset fnhe_mtu_locked after cache route flushed After route cache is flushed via ipv4_sysctl_rtcache_flush(), we forget to reset fnhe_mtu_locked in rt_bind_exception(). When pmtu is updated in __ip_rt_update_pmtu(), it will return directly since the pmtu is still locked. e.g. + ip netns exec client ping 10.10.1.1 -c 1 -s 1400 -M do PING 10.10.1.1 (10.10.1.1) 1400(1428) bytes of data. >From 10.10.0.254 icmp_seq=1 Frag needed and DF set (mtu = 0) Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
94720e3a |
|
02-May-2018 |
Julian Anastasov <ja@ssi.bg> |
ipv4: fix fnhe usage by non-cached routes Allow some non-cached routes to use non-expired fnhe: 1. ip_del_fnhe: moved above and now called by find_exception. The 4.5+ commit deed49df7390 expires fnhe only when caching routes. Change that to: 1.1. use fnhe for non-cached local output routes, with the help from (2) 1.2. allow __mkroute_input to detect expired fnhe (outdated fnhe_gw, for example) when do_cache is false, eg. when itag!=0 for unicast destinations. 2. __mkroute_output: keep fi to allow local routes with orig_oif != 0 to use fnhe info even when the new route will not be cached into fnhe. After commit 839da4d98960 ("net: ipv4: set orig_oif based on fib result for local traffic") it means all local routes will be affected because they are not cached. This change is used to solve a PMTU problem with IPVS (and probably Netfilter DNAT) setups that redirect local clients from target local IP (local route to Virtual IP) to new remote IP target, eg. IPVS TUN real server. Loopback has 64K MTU and we need to create fnhe on the local route that will keep the reduced PMTU for the Virtual IP. Without this change fnhe_pmtu is updated from ICMP but never exposed to non-cached local routes. This includes routes with flowi4_oif!=0 for 4.6+ and with flowi4_oif=any for 4.14+). 3. update_or_create_fnhe: make sure fnhe_expires is not 0 for new entries Fixes: 839da4d98960 ("net: ipv4: set orig_oif based on fib result for local traffic") Fixes: d6d5e999e5df ("route: do not cache fib route info on local routes with oif") Fixes: deed49df7390 ("route: check and remove route cache when we get route") Cc: David Ahern <dsahern@gmail.com> Cc: Xin Long <lucien.xin@gmail.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d0ea2b12 |
|
07-Apr-2018 |
Eric Dumazet <edumazet@google.com> |
ipv4: fix uninit-value in ip_route_output_key_hash_rcu() syzbot complained that res.type could be used while not initialized. Using RTN_UNSPEC as initial value seems better than using garbage. BUG: KMSAN: uninit-value in __mkroute_output net/ipv4/route.c:2200 [inline] BUG: KMSAN: uninit-value in ip_route_output_key_hash_rcu+0x31f0/0x3940 net/ipv4/route.c:2493 CPU: 1 PID: 12207 Comm: syz-executor0 Not tainted 4.16.0+ #81 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x185/0x1d0 lib/dump_stack.c:53 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676 __mkroute_output net/ipv4/route.c:2200 [inline] ip_route_output_key_hash_rcu+0x31f0/0x3940 net/ipv4/route.c:2493 ip_route_output_key_hash net/ipv4/route.c:2322 [inline] __ip_route_output_key include/net/route.h:126 [inline] ip_route_output_flow+0x1eb/0x3c0 net/ipv4/route.c:2577 raw_sendmsg+0x1861/0x3ed0 net/ipv4/raw.c:653 inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764 sock_sendmsg_nosec net/socket.c:630 [inline] sock_sendmsg net/socket.c:640 [inline] SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747 SyS_sendto+0x8a/0xb0 net/socket.c:1715 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x455259 RSP: 002b:00007fdc0625dc68 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00007fdc0625e6d4 RCX: 0000000000455259 RDX: 0000000000000000 RSI: 0000000020000040 RDI: 0000000000000013 RBP: 000000000072bea0 R08: 0000000020000080 R09: 0000000000000010 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff R13: 00000000000004f7 R14: 00000000006fa7c8 R15: 0000000000000000 Local variable description: ----res.i.i@ip_route_output_flow Variable was created at: ip_route_output_flow+0x75/0x3c0 net/ipv4/route.c:2576 raw_sendmsg+0x1861/0x3ed0 net/ipv4/raw.c:653 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
514c6032 |
|
05-Apr-2018 |
Randy Dunlap <rdunlap@infradead.org> |
headers: untangle kmemleak.h from mm.h Currently <linux/slab.h> #includes <linux/kmemleak.h> for no obvious reason. It looks like it's only a convenience, so remove kmemleak.h from slab.h and add <linux/kmemleak.h> to any users of kmemleak_* that don't already #include it. Also remove <linux/kmemleak.h> from source files that do not use it. This is tested on i386 allmodconfig and x86_64 allmodconfig. It would be good to run it through the 0day bot for other $ARCHes. I have neither the horsepower nor the storage space for the other $ARCHes. Update: This patch has been extensively build-tested by both the 0day bot & kisskb/ozlabs build farms. Both of them reported 2 build failures for which patches are included here (in v2). [ slab.h is the second most used header file after module.h; kernel.h is right there with slab.h. There could be some minor error in the counting due to some #includes having comments after them and I didn't combine all of those. ] [akpm@linux-foundation.org: security/keys/big_key.c needs vmalloc.h, per sfr] Link: http://lkml.kernel.org/r/e4309f98-3749-93e1-4bb7-d9501a39d015@infradead.org Link: http://kisskb.ellerman.id.au/kisskb/head/13396/ Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Reported-by: Michael Ellerman <mpe@ellerman.id.au> [2 build failures] Reported-by: Fengguang Wu <fengguang.wu@intel.com> [2 build failures] Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Wei Yongjun <weiyongjun1@huawei.com> Cc: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mimi Zohar <zohar@linux.vnet.ibm.com> Cc: John Johansen <john.johansen@canonical.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2f635cee |
|
27-Mar-2018 |
Kirill Tkhai <ktkhai@virtuozzo.com> |
net: Drop pernet_operations::async Synchronous pernet_operations are not allowed anymore. All are asynchronous. So, drop the structure member. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d6444062 |
|
23-Mar-2018 |
Joe Perches <joe@perches.com> |
net: Use octal not symbolic permissions Prefer the direct use of octal for permissions. Done with checkpatch -f --types=SYMBOLIC_PERMS --fix-inplace and some typing. Miscellanea: o Whitespace neatening around these conversions. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d52e5a7e |
|
14-Mar-2018 |
Sabrina Dubroca <sd@queasysnail.net> |
ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu Prior to the rework of PMTU information storage in commit 2c8cec5c10bc ("ipv4: Cache learned PMTU information in inetpeer."), when a PMTU event advertising a PMTU smaller than net.ipv4.route.min_pmtu was received, we would disable setting the DF flag on packets by locking the MTU metric, and set the PMTU to net.ipv4.route.min_pmtu. Since then, we don't disable DF, and set PMTU to net.ipv4.route.min_pmtu, so the intermediate router that has this link with a small MTU will have to drop the packets. This patch reestablishes pre-2.6.39 behavior by splitting rtable->rt_pmtu into a bitfield with rt_mtu_locked and rt_pmtu. rt_mtu_locked indicates that we shouldn't set the DF bit on that path, and is checked in ip_dont_fragment(). One possible workaround is to set net.ipv4.route.min_pmtu to a value low enough to accommodate the lowest MTU encountered. Fixes: 2c8cec5c10bc ("ipv4: Cache learned PMTU information in inetpeer.") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ec7127a5 |
|
02-Mar-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv4: Simplify fib_multipath_hash with optional flow keys As of commit e37b1e978bec5 ("ipv6: route: dissect flow in input path if fib rules need it") fib_multipath_hash takes an optional flow keys. If non-NULL it means the skb has already been dissected. If not set, then fib_multipath_hash needs to call skb_flow_dissect_flow_keys. Simplify the logic by setting flkeys to the local stack variable keys. Simplifies fib_multipath_hash by only have 1 set of instructions setting hash_keys. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6f74b6c2 |
|
02-Mar-2018 |
David Ahern <dsahern@gmail.com> |
net: Align ip_multipath_l3_keys and ip6_multipath_l3_keys Symmetry is good and allows easy comparison that ipv4 and ipv6 are doing the same thing. To that end, change ip_multipath_l3_keys to set addresses at the end after the icmp compares, and move the initialization of ipv6 flow keys to rt6_multipath_hash. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7efc0b6b |
|
02-Mar-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv4: Pass net to fib_multipath_hash instead of fib_info fib_multipath_hash only needs net struct to check a sysctl. Make it clear by passing net instead of fib_info. In the end this allows alignment between the ipv4 and ipv6 versions. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e2c0dc1f |
|
27-Feb-2018 |
Stephen Suryaputra <ssuryaextr@gmail.com> |
vrf: check forwarding on the original netdevice when generating ICMP dest unreachable When ip_error() is called the device is the l3mdev master instead of the original device. So the forwarding check should be on the original one. Changes from v2: - Handle the original device disappearing (per David Ahern) - Minimize the change in code order Changes from v1: - Only need to reset the device on which __in_dev_get_rcu() is done (per David Ahern). Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
773daa3c |
|
28-Feb-2018 |
Arnd Bergmann <arnd@arndb.de> |
net: ipv4: avoid unused variable warning for sysctl The newly introudced ip_min_valid_pmtu variable is only used when CONFIG_SYSCTL is set: net/ipv4/route.c:135:12: error: 'ip_min_valid_pmtu' defined but not used [-Werror=unused-variable] This moves it to the other variables like it, to avoid the harmless warning. Fixes: c7272c2f1229 ("net: ipv4: don't allow setting net.ipv4.route.min_pmtu below 68") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e37b1e97 |
|
28-Feb-2018 |
Roopa Prabhu <roopa@cumulusnetworks.com> |
ipv6: route: dissect flow in input path if fib rules need it Dissect flow in fwd path if fib rules require it. Controlled by a flag to avoid penatly for the common case. Flag is set when fib rules with sport, dport and proto match that require flow dissect are installed. Also passes the dissected hash keys to the multipath hash function when applicable to avoid dissecting the flow again. icmp packets will continue to use inner header for hash calculations (Thanks to Nikolay Aleksandrov for some review here). Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c7272c2f |
|
26-Feb-2018 |
Sabrina Dubroca <sd@queasysnail.net> |
net: ipv4: don't allow setting net.ipv4.route.min_pmtu below 68 According to RFC 1191 sections 3 and 4, ICMP frag-needed messages indicating an MTU below 68 should be rejected: A host MUST never reduce its estimate of the Path MTU below 68 octets. and (talking about ICMP frag-needed's Next-Hop MTU field): This field will never contain a value less than 68, since every router "must be able to forward a datagram of 68 octets without fragmentation". Furthermore, by letting net.ipv4.route.min_pmtu be set to negative values, we can end up with a very large PMTU when (-1) is cast into u32. Let's also make ip_rt_min_pmtu a u32, since it's only ever compared to unsigned ints. Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1fe4b118 |
|
21-Feb-2018 |
David Ahern <dsahern@gmail.com> |
net: ipv4: Set addr_type in hash_keys for forwarded case The result of the skb flow dissect is copied from keys to hash_keys to ensure only the intended data is hashed. The original L4 hash patch overlooked setting the addr_type for this case; add it. Fixes: bf4e0a3db97eb ("net: ipv4: add support for ECMP hash policy choice") Reported-by: Ido Schimmel <idosch@idosch.org> Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
510c321b |
|
14-Feb-2018 |
Xin Long <lucien.xin@gmail.com> |
xfrm: reuse uncached_list to track xdsts In early time, when freeing a xdst, it would be inserted into dst_garbage.list first. Then if it's refcnt was still held somewhere, later it would be put into dst_busy_list in dst_gc_task(). When one dev was being unregistered, the dev of these dsts in dst_busy_list would be set with loopback_dev and put this dev. So that this dev's removal wouldn't get blocked, and avoid the kmsg warning: kernel:unregister_netdevice: waiting for veth0 to become \ free. Usage count = 2 However after Commit 52df157f17e5 ("xfrm: take refcnt of dst when creating struct xfrm_dst bundle"), the xdst will not be freed with dst gc, and this warning happens. To fix it, we need to find these xdsts that are still held by others when removing the dev, and free xdst's dev and set it with loopback_dev. But unfortunately after flow_cache for xfrm was deleted, no list tracks them anymore. So we need to save these xdsts somewhere to release the xdst's dev later. To make this easier, this patch is to reuse uncached_list to track xdsts, so that the dev refcnt can be released in the event NETDEV_UNREGISTER process of fib_netdev_notifier. Thanks to Florian, we could move forward this fix quickly. Fixes: 52df157f17e5 ("xfrm: take refcnt of dst when creating struct xfrm_dst bundle") Reported-by: Jianlin Shi <jishi@redhat.com> Reported-by: Hangbin Liu <liuhangbin@gmail.com> Tested-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
#
68e813aa |
|
14-Feb-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv4: Remove fib table id from rtable Remove rt_table_id from rtable. It was added for getroute to return the table id that was hit in the lookup. With the changes for fibmatch the table id can be extracted from the fib_info returned in the fib_result so it no longer needs to be in rtable directly. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9942895b |
|
13-Feb-2018 |
David Ahern <dsahern@gmail.com> |
net: Move ipv4 set_lwt_redirect helper to lwtunnel IPv4 uses set_lwt_redirect to set the lwtunnel redirect functions as needed. Move it to lwtunnel.h as lwtunnel_set_redirect and change IPv6 to also use it. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8c2ceabe |
|
13-Feb-2018 |
David Ahern <dsahern@gmail.com> |
net/ipv4: Unexport fib_multipath_hash and fib_select_path Do not export fib_multipath_hash or fib_select_path; both are only used by core ipv4 code. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f84c6821 |
|
12-Feb-2018 |
Kirill Tkhai <ktkhai@virtuozzo.com> |
net: Convert pernet_subsys, registered from inet_init() arp_net_ops just addr/removes /proc entry. devinet_ops allocates and frees duplicate of init_net tables and (un)registers sysctl entries. fib_net_ops allocates and frees pernet tables, creates/destroys netlink socket and (un)initializes /proc entries. Foreign pernet_operations do not touch them. ip_rt_proc_ops only modifies pernet /proc entries. xfrm_net_ops creates/destroys /proc entries, allocates/frees pernet statistics, hashes and tables, and (un)initializes sysctl files. These are not touched by foreigh pernet_operations xfrm4_net_ops allocates/frees private pernet memory, and configures sysctls. sysctl_route_ops creates/destroys sysctls. rt_genid_ops only initializes fields of just allocated net. ipv4_inetpeer_ops allocated/frees net private memory. igmp_net_ops just creates/destroys /proc files and socket, noone else interested in. tcp_sk_ops seems to be safe, because tcp_sk_init() does not depend on any other pernet_operations modifications. Iteration over hash table in inet_twsk_purge() is made under RCU lock, and it's safe to iterate the table this way. Removing from the table happen from inet_twsk_deschedule_put(), but this function is safe without any extern locks, as it's synchronized inside itself. There are many examples, it's used in different context. So, it's safe to leave tcp_sk_exit_batch() unlocked. tcp_net_metrics_ops is synchronized on tcp_metrics_lock and safe. udplite4_net_ops only creates/destroys pernet /proc file. icmp_sk_ops creates percpu sockets, not touched by foreign pernet_operations. ipmr_net_ops creates/destroys pernet fib tables, (un)registers fib rules and /proc files. This seem to be safe to execute in parallel with foreign pernet_operations. af_inet_ops just sets up default parameters of newly created net. ipv4_mib_ops creates and destroys pernet percpu statistics. raw_net_ops, tcp4_net_ops, udp4_net_ops, ping_v4_net_ops and ip_proc_ops only create/destroy pernet /proc files. ip4_frags_ops creates and destroys sysctl file. So, it's safe to make the pernet_operations async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
96890d62 |
|
15-Jan-2018 |
Alexey Dobriyan <adobriyan@gmail.com> |
net: delete /proc THIS_MODULE references /proc has been ignoring struct file_operations::owner field for 10 years. Specifically, it started with commit 786d7e1612f0b0adb6046f19b906609e4fe8b1ba ("Fix rmmod/read/write races in /proc entries"). Notice the chunk where inode->i_fop is initialized with proxy struct file_operations for regular files: - if (de->proc_fops) - inode->i_fop = de->proc_fops; + if (de->proc_fops) { + if (S_ISREG(inode->i_mode)) + inode->i_fop = &proc_reg_file_ops; + else + inode->i_fop = de->proc_fops; + } VFS stopped pinning module at this point. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6503a304 |
|
11-Jan-2018 |
Lorenzo Colitti <lorenzo@google.com> |
net: ipv4: Make "ip route get" match iif lo rules again. Commit 3765d35ed8b9 ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup") broke "ip route get" in the presence of rules that specify iif lo. Host-originated traffic always has iif lo, because ip_route_output_key_hash and ip6_route_output_flags set the flow iif to LOOPBACK_IFINDEX. Thus, putting "iif lo" in an ip rule is a convenient way to select only originated traffic and not forwarded traffic. inet_rtm_getroute used to match these rules correctly because even though it sets the flow iif to 0, it called ip_route_output_key which overwrites iif with LOOPBACK_IFINDEX. But now that it calls ip_route_output_key_hash_rcu, the ifindex will remain 0 and not match the iif lo in the rule. As a result, "ip route get" will return ENETUNREACH. Fixes: 3765d35ed8b9 ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup") Tested: https://android.googlesource.com/kernel/tests/+/master/net/test/multinetwork_test.py passes again Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0f6c480f |
|
28-Nov-2017 |
David Miller <davem@davemloft.net> |
xfrm: Move dst->path into struct xfrm_dst The first member of an IPSEC route bundle chain sets it's dst->path to the underlying ipv4/ipv6 route that carries the bundle. Stated another way, if one were to follow the xfrm_dst->child chain of the bundle, the final non-NULL pointer would be the path and point to either an ipv4 or an ipv6 route. This is largely used to make sure that PMTU events propagate down to the correct ipv4 or ipv6 route. When we don't have the top of an IPSEC bundle 'dst->path == dst'. Move it down into xfrm_dst and key off of dst->xfrm. Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Eric Dumazet <edumazet@google.com>
|
#
cebe84c6 |
|
16-Nov-2017 |
Xin Long <lucien.xin@gmail.com> |
route: also update fnhe_genid when updating a route cache Now when ip route flush cache and it turn out all fnhe_genid != genid. If a redirect/pmtu icmp packet comes and the old fnhe is found and all it's members but fnhe_genid will be updated. Then next time when it looks up route and tries to rebind this fnhe to the new dst, the fnhe will be flushed due to fnhe_genid != genid. It causes this redirect/pmtu icmp packet acutally not to be applied. This patch is to also reset fnhe_genid when updating a route cache. Fixes: 5aad1de5ea2c ("ipv4: use separate genid for next hop exceptions") Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e39d5246 |
|
16-Nov-2017 |
Xin Long <lucien.xin@gmail.com> |
route: update fnhe_expires for redirect when the fnhe exists Now when creating fnhe for redirect, it sets fnhe_expires for this new route cache. But when updating the exist one, it doesn't do it. It will cause this fnhe never to be expired. Paolo already noticed it before, in Jianlin's test case, it became even worse: When ip route flush cache, the old fnhe is not to be removed, but only clean it's members. When redirect comes again, this fnhe will be found and updated, but never be expired due to fnhe_expires not being set. So fix it by simply updating fnhe_expires even it's for redirect. Fixes: aee06da6726d ("ipv4: use seqlock for nh_exceptions") Reported-by: Jianlin Shi <jishi@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6aa7de05 |
|
23-Oct-2017 |
Mark Rutland <mark.rutland@arm.com> |
locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE() Please do not apply this to mainline directly, instead please re-run the coccinelle script shown below and apply its output. For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't harmful, and changing them results in churn. However, for some features, the read/write distinction is critical to correct operation. To distinguish these cases, separate read/write accessors must be used. This patch migrates (most) remaining ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following coccinelle script: ---- // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and // WRITE_ONCE() // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch virtual patch @ depends on patch @ expression E1, E2; @@ - ACCESS_ONCE(E1) = E2 + WRITE_ONCE(E1, E2) @ depends on patch @ expression E; @@ - ACCESS_ONCE(E) + READ_ONCE(E) ---- Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: davem@davemloft.net Cc: linux-arch@vger.kernel.org Cc: mpe@ellerman.id.au Cc: shuah@kernel.org Cc: snitzer@redhat.com Cc: thor.thayer@linux.intel.com Cc: tj@kernel.org Cc: viro@zeniv.linux.org.uk Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
164a5e7a |
|
18-Oct-2017 |
Eric Dumazet <edumazet@google.com> |
ipv4: ipv4_default_advmss() should use route mtu ipv4_default_advmss() incorrectly uses the device MTU instead of the route provided one. IPv6 has the proper behavior, lets harmonize the two protocols. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6c0e7284 |
|
09-Oct-2017 |
Steffen Klassert <steffen.klassert@secunet.com> |
ipv4: Fix traffic triggered IPsec connections. A recent patch removed the dst_free() on the allocated dst_entry in ipv4_blackhole_route(). The dst_free() marked the dst_entry as dead and added it to the gc list. I.e. it was setup for a one time usage. As a result we may now have a blackhole route cached at a socket on some IPsec scenarios. This makes the connection unusable. Fix this by marking the dst_entry directly at allocation time as 'dead', so it is used only once. Fixes: b838d5e1c5b6 ("ipv4: mark DST_NOGC and remove the operation of dst_free()") Reported-by: Tobias Brunner <tobias@strongswan.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1bcdca3f |
|
04-Oct-2017 |
Tim Hansen <devtimhansen@gmail.com> |
net/ipv4: Remove unused variable in route.c int rc is unmodified after initalization in net/ipv4/route.c, this patch simply cleans up that variable and returns 0. This was found with coccicheck M=net/ipv4/ on linus' tree. Signed-off-by: Tim Hansen <devtimhansen@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bc044e8d |
|
28-Sep-2017 |
Paolo Abeni <pabeni@redhat.com> |
udp: perform source validation for mcast early demux The UDP early demux can leverate the rx dst cache even for multicast unconnected sockets. In such scenario the ipv4 source address is validated only on the first packet in the given flow. After that, when we fetch the dst entry from the socket rx cache, we stop enforcing the rp_filter and we even start accepting any kind of martian addresses. Disabling the dst cache for unconnected multicast socket will cause large performace regression, nearly reducing by half the max ingress tput. Instead we factor out a route helper to completely validate an skb source address for multicast packets and we call it from the UDP early demux for mcast packets landing on unconnected sockets, after successful fetching the related cached dst entry. This still gives a measurable, but limited performance regression: rp_filter = 0 rp_filter = 1 edmux disabled: 1182 Kpps 1127 Kpps edmux before: 2238 Kpps 2238 Kpps edmux after: 2037 Kpps 2019 Kpps The above figures are on top of current net tree. Applying the net-next commit 6e617de84e87 ("net: avoid a full fib lookup when rp_filter is disabled.") the delta with rp_filter == 0 will decrease even more. Fixes: 421b3885bf6d ("udp: ipv4: Add udp early demux") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bc3aae2b |
|
16-Aug-2017 |
Roopa Prabhu <roopa@cumulusnetworks.com> |
net: check and errout if res->fi is NULL when RTM_F_FIB_MATCH is set Syzkaller hit 'general protection fault in fib_dump_info' bug on commit 4.13-rc5.. Guilty file: net/ipv4/fib_semantics.c kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN Modules linked in: CPU: 0 PID: 2808 Comm: syz-executor0 Not tainted 4.13.0-rc5 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 task: ffff880078562700 task.stack: ffff880078110000 RIP: 0010:fib_dump_info+0x388/0x1170 net/ipv4/fib_semantics.c:1314 RSP: 0018:ffff880078117010 EFLAGS: 00010206 RAX: dffffc0000000000 RBX: 00000000000000fe RCX: 0000000000000002 RDX: 0000000000000006 RSI: ffff880078117084 RDI: 0000000000000030 RBP: ffff880078117268 R08: 000000000000000c R09: ffff8800780d80c8 R10: 0000000058d629b4 R11: 0000000067fce681 R12: 0000000000000000 R13: ffff8800784bd540 R14: ffff8800780d80b5 R15: ffff8800780d80a4 FS: 00000000022fa940(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000004387d0 CR3: 0000000079135000 CR4: 00000000000006f0 Call Trace: inet_rtm_getroute+0xc89/0x1f50 net/ipv4/route.c:2766 rtnetlink_rcv_msg+0x288/0x680 net/core/rtnetlink.c:4217 netlink_rcv_skb+0x340/0x470 net/netlink/af_netlink.c:2397 rtnetlink_rcv+0x28/0x30 net/core/rtnetlink.c:4223 netlink_unicast_kernel net/netlink/af_netlink.c:1265 [inline] netlink_unicast+0x4c4/0x6e0 net/netlink/af_netlink.c:1291 netlink_sendmsg+0x8c4/0xca0 net/netlink/af_netlink.c:1854 sock_sendmsg_nosec net/socket.c:633 [inline] sock_sendmsg+0xca/0x110 net/socket.c:643 ___sys_sendmsg+0x779/0x8d0 net/socket.c:2035 __sys_sendmsg+0xd1/0x170 net/socket.c:2069 SYSC_sendmsg net/socket.c:2080 [inline] SyS_sendmsg+0x2d/0x50 net/socket.c:2076 entry_SYSCALL_64_fastpath+0x1a/0xa5 RIP: 0033:0x4512e9 RSP: 002b:00007ffc75584cc8 EFLAGS: 00000216 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00000000004512e9 RDX: 0000000000000000 RSI: 0000000020f2cfc8 RDI: 0000000000000003 RBP: 000000000000000e R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000216 R12: fffffffffffffffe R13: 0000000000718000 R14: 0000000020c44ff0 R15: 0000000000000000 Code: 00 0f b6 8d ec fd ff ff 48 8b 85 f0 fd ff ff 88 48 17 48 8b 45 28 48 8d 78 30 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e cb 0c 00 00 48 8b 45 28 44 RIP: fib_dump_info+0x388/0x1170 net/ipv4/fib_semantics.c:1314 RSP: ffff880078117010 ---[ end trace 254a7af28348f88b ]--- This patch adds a res->fi NULL check. example run: $ip route get 0.0.0.0 iif virt1-0 broadcast 0.0.0.0 dev lo cache <local,brd> iif virt1-0 $ip route get 0.0.0.0 iif virt1-0 fibmatch RTNETLINK answers: No route to host Reported-by: idaifish <idaifish@gmail.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Fixes: b61798130f1b ("net: ipv4: RTM_GETROUTE: return matched fib result when requested") Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9620fef2 |
|
18-Aug-2017 |
Eric Dumazet <edumazet@google.com> |
ipv4: convert dst_metrics.refcnt from atomic_t to refcount_t refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c780a049 |
|
16-Aug-2017 |
Eric Dumazet <edumazet@google.com> |
ipv4: better IP_MAX_MTU enforcement While working on yet another syzkaller report, I found that our IP_MAX_MTU enforcements were not properly done. gcc seems to reload dev->mtu for min(dev->mtu, IP_MAX_MTU), and final result can be bigger than IP_MAX_MTU :/ This is a problem because device mtu can be changed on other cpus or threads. While this patch does not fix the issue I am working on, it is probably worth addressing it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
394f51ab |
|
15-Aug-2017 |
Florian Westphal <fw@strlen.de> |
ipv4: route: set ipv4 RTM_GETROUTE to not use rtnl Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2c87d63a |
|
13-Aug-2017 |
Florian Westphal <fw@strlen.de> |
ipv4: route: fix inet_rtm_getroute induced crash "ip route get $daddr iif eth0 from $saddr" causes: BUG: KASAN: use-after-free in ip_route_input_rcu+0x1535/0x1b50 Call Trace: ip_route_input_rcu+0x1535/0x1b50 ip_route_input_noref+0xf9/0x190 tcp_v4_early_demux+0x1a4/0x2b0 ip_rcv+0xbcb/0xc05 __netif_receive_skb+0x9c/0xd0 netif_receive_skb_internal+0x5a8/0x890 Problem is that inet_rtm_getroute calls either ip_route_input_rcu (if an iif was provided) or ip_route_output_key_hash_rcu. But ip_route_input_rcu, unlike ip_route_output_key_hash_rcu, already associates the dst_entry with the skb. This clears the SKB_DST_NOREF bit (i.e. skb_dst_drop will release/free the entry while it should not). Thus only set the dst if we called ip_route_output_key_hash_rcu(). I tested this patch by running: while true;do ip r get 10.0.1.2;done > /dev/null & while true;do ip r get 10.0.1.2 iif eth0 from 10.0.1.1;done > /dev/null & ... and saw no crash or memory leak. Cc: Roopa Prabhu <roopa@cumulusnetworks.com> Cc: David Ahern <dsahern@gmail.com> Fixes: ba52d61e0ff ("ipv4: route: restore skb_dst_set in inet_rtm_getroute") Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9438c871 |
|
11-Aug-2017 |
David Ahern <dsahern@gmail.com> |
net: ipv4: remove unnecessary check on orig_oif rt_iif is going to be set to either 0 or orig_oif. If orig_oif is 0 it amounts to the same end result so remove the check. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
839da4d9 |
|
10-Aug-2017 |
David Ahern <dsahern@gmail.com> |
net: ipv4: set orig_oif based on fib result for local traffic Attempts to connect to a local address with a socket bound to a device with the local address hangs if there is no listener: $ ip addr sh dev eth1 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000 link/ether 02:e0:f9:1c:00:37 brd ff:ff:ff:ff:ff:ff inet 10.100.1.4/24 scope global eth1 valid_lft forever preferred_lft forever inet6 2001:db8:1::4/120 scope global valid_lft forever preferred_lft forever inet6 fe80::e0:f9ff:fe1c:37/64 scope link valid_lft forever preferred_lft forever $ vrf-test -I eth1 -r 10.100.1.4 <hangs when there is no server> (don't let the command name fool you; vrf-test works without vrfs.) The problem is that the original intended device, eth1 in this case, is lost when the tcp reset is sent, so the socket lookup does not find a match for the reset and the connect attempt hangs. Fix by adjusting orig_oif for local traffic to the device from the fib lookup result. With this patch you get the more user friendly: $ vrf-test -I eth1 -r 10.100.1.4 connect failed: 111: Connection refused orig_oif is saved to the newly created rtable as rt_iif and when set it is used as the dif for socket lookups. It is set based on flowi4_oif passed in to ip_route_output_key_hash_rcu and will be set to either the loopback device, an l3mdev device, nothing (flowi4_oif = 0 which is the case in the example above) or a netdev index depending on the lookup path. In each case, resetting orig_oif to the device in the fib result for the RTN_LOCAL case allows the actual device to be preserved as the skb tx and rx is done over the loopback or VRF device. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b97bac64 |
|
09-Aug-2017 |
Florian Westphal <fw@strlen.de> |
rtnetlink: make rtnl_register accept a flags parameter This change allows us to later indicate to rtnetlink core that certain doit functions should be called without acquiring rtnl_mutex. This change should have no effect, we simply replace the last (now unused) calcit argument with the new flag. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7aed9f72 |
|
07-Jun-2017 |
Jason A. Donenfeld <Jason@zx2c4.com> |
net/route: use get_random_int for random counter Using get_random_int here is faster, more fitting of the use case, and just as cryptographically secure. It also has the benefit of providing better randomness at early boot, which is when many of these structures are assigned. Also, semantically, it's not really proper to have been assigning an atomic_t in this way before, even if in practice it works fine. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
a4c2fd7f |
|
17-Jun-2017 |
Wei Wang <weiwan@google.com> |
net: remove DST_NOCACHE flag DST_NOCACHE flag check has been removed from dst_release() and dst_hold_safe() in a previous patch because all the dst are now ref counted properly and can be released based on refcnt only. Looking at the rest of the DST_NOCACHE use, all of them can now be removed or replaced with other checks. So this patch gets rid of all the DST_NOCACHE usage and remove this flag completely. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b2a9c0ed |
|
17-Jun-2017 |
Wei Wang <weiwan@google.com> |
net: remove DST_NOGC flag Now that all the components have been changed to release dst based on refcnt only and not depend on dst gc anymore, we can remove the temporary flag DST_NOGC. Note that we also need to remove the DST_NOCACHE check in dst_release() and dst_hold_safe() because now all the dst are released based on refcnt and behaves as DST_NOCACHE. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b838d5e1 |
|
17-Jun-2017 |
Wei Wang <weiwan@google.com> |
ipv4: mark DST_NOGC and remove the operation of dst_free() With the previous preparation patches, we are ready to get rid of the dst gc operation in ipv4 code and release dst based on refcnt only. So this patch adds DST_NOGC flag for all IPv4 dst and remove the calls to dst_free(). At this point, all dst created in ipv4 code do not use the dst gc anymore and will be destroyed at the point when refcnt drops to 0. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9df16efa |
|
17-Jun-2017 |
Wei Wang <weiwan@google.com> |
ipv4: call dst_hold_safe() properly This patch checks all the calls to dst_hold()/skb_dst_force()/dst_clone()/dst_use() to see if dst_hold_safe() is needed to avoid double free issue if dst gc is removed and dst_release() directly destroys dst when dst->__refcnt drops to 0. In tx path, TCP hold sk->sk_rx_dst ref count and also hold sock_lock(). UDP and other similar protocols always hold refcount for skb->_skb_refdst. So both paths seem to be safe. In rx path, as it is lockless and skb_dst_set_noref() is likely to be used, dst_hold_safe() should always be used when trying to hold dst. In the routing code, if dst is held during an rcu protected session, it is necessary to call dst_hold_safe() as the current dst might be in its rcu grace period. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
95c47f9c |
|
17-Jun-2017 |
Wei Wang <weiwan@google.com> |
ipv4: call dst_dev_put() properly As the intend of this patch series is to completely remove dst gc, we need to call dst_dev_put() to release the reference to dst->dev when removing routes from fib because we won't keep the gc list anymore and will lose the dst pointer right after removing the routes. Without the gc list, there is no way to find all the dst's that have dst->dev pointing to the going-down dev. Hence, we are doing dst_dev_put() immediately before we lose the last reference of the dst from the routing code. The next dst_check() will trigger a route re-lookup to find another route (if there is any). Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0830106c |
|
17-Jun-2017 |
Wei Wang <weiwan@google.com> |
ipv4: take dst->__refcnt when caching dst in fib In IPv4 routing code, fib_nh and fib_nh_exception can hold pointers to struct rtable but they never increment dst->__refcnt. This leads to the need of the dst garbage collector because when user is done with this dst and calls dst_release(), it can only decrement dst->__refcnt and can not free the dst even it sees dst->__refcnt drops from 1 to 0 (unless DST_NOCACHE flag is set) because the routing code might still hold reference to it. And when the routing code tries to delete a route, it has to put the dst to the gc_list if dst->__refcnt is not yet 0 and have a gc thread running periodically to check on dst->__refcnt and finally to free dst when refcnt becomes 0. This patch increments dst->__refcnt when fib_nh/fib_nh_exception holds reference to this dst and properly release the dst when fib_nh/fib_nh_exception has been updated with a new dst. This patch is a preparation in order to fully get rid of dst gc later. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1dbe3252 |
|
17-Jun-2017 |
Wei Wang <weiwan@google.com> |
net: use loopback dev when generating blackhole route Existing ipv4/6_blackhole_route() code generates a blackhole route with dst->dev pointing to the passed in dst->dev. It is not necessary to hold reference to the passed in dst->dev because the packets going through this route are dropped anyway. A loopback interface is good enough so that we don't need to worry about releasing this dst->dev when this dev is going down. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ba52d61e |
|
31-May-2017 |
Roopa Prabhu <roopa@cumulusnetworks.com> |
ipv4: route: restore skb_dst_set in inet_rtm_getroute recent updates to inet_rtm_getroute dropped skb_dst_set in inet_rtm_getroute. This patch restores it because it is needed to release the dst correctly. Fixes: 3765d35ed8b9 ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup") Reported-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3fb07daf |
|
25-May-2017 |
Eric Dumazet <edumazet@google.com> |
ipv4: add reference counting to metrics Andrey Konovalov reported crashes in ipv4_mtu() I could reproduce the issue with KASAN kernels, between 10.246.7.151 and 10.246.7.152 : 1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 & 2) At the same time run following loop : while : do ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500 ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500 done Cong Wang attempted to add back rt->fi in commit 82486aa6f1b9 ("ipv4: restore rt->fi for reference counting") but this proved to add some issues that were complex to solve. Instead, I suggested to add a refcount to the metrics themselves, being a standalone object (in particular, no reference to other objects) I tried to make this patch as small as possible to ease its backport, instead of being super clean. Note that we believe that only ipv4 dst need to take care of the metric refcount. But if this is wrong, this patch adds the basic infrastructure to extend this to other families. Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang for his efforts on this problem. Fixes: 2860583fe840 ("ipv4: Kill rt->fi") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Julian Anastasov <ja@ssi.bg> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b6179813 |
|
25-May-2017 |
Roopa Prabhu <roopa@cumulusnetworks.com> |
net: ipv4: RTM_GETROUTE: return matched fib result when requested This patch adds support to return matched fib result when RTM_F_FIB_MATCH flag is specified in RTM_GETROUTE request. This is useful for user-space applications/controllers wanting to query a matching route. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3765d35e |
|
25-May-2017 |
David Ahern <dsahern@gmail.com> |
net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup Convert inet_rtm_getroute to use ip_route_input_rcu and ip_route_output_key_hash_rcu passing the fib_result arg to both. The rcu lock is held through the creation of the response, so the rtable/dst does not need to be attached to the skb and is passed to rt_fill_info directly. In converting from ip_route_output_key to ip_route_output_key_hash_rcu the xfrm_lookup_route in ip_route_output_flow is dropped since flowi4_proto is not set for a route get request. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d3166e0c |
|
25-May-2017 |
David Ahern <dsahern@gmail.com> |
net: ipv4: Remove event arg to rt_fill_info rt_fill_info has 1 caller with the event set to RTM_NEWROUTE. Given that remove the arg and use RTM_NEWROUTE directly in rt_fill_info. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5510cdf7 |
|
25-May-2017 |
David Ahern <dsahern@gmail.com> |
net: ipv4: refactor ip_route_input_noref A later patch wants access to the fib result on an input route lookup with the rcu lock held. Refactor ip_route_input_noref pushing the logic between rcu_read_lock ... rcu_read_unlock into a new helper that takes the fib_result as an input arg. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3abd1ade |
|
25-May-2017 |
David Ahern <dsahern@gmail.com> |
net: ipv4: refactor __ip_route_output_key_hash A later patch wants access to the fib result on an output route lookup with the rcu lock held. Refactor __ip_route_output_key_hash, pushing the logic between rcu_read_lock ... rcu_read_unlock into a new helper with the fib_result as an input arg. To keep the name length under control remove the leading underscores from the name and add _rcu to the name of the new helper indicating it is called with the rcu read lock held. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
32f1bc0f |
|
08-May-2017 |
David S. Miller <davem@davemloft.net> |
Revert "ipv4: restore rt->fi for reference counting" This reverts commit 82486aa6f1b9bc8145e6d0fa2bc0b44307f3b875. As implemented, this causes dangling netdevice refs. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
82486aa6 |
|
04-May-2017 |
WANG Cong <xiyou.wangcong@gmail.com> |
ipv4: restore rt->fi for reference counting IPv4 dst could use fi->fib_metrics to store metrics but fib_info itself is refcnt'ed, so without taking a refcnt fi and fi->fib_metrics could be freed while dst metrics still points to it. This triggers use-after-free as reported by Andrey twice. This patch reverts commit 2860583fe840 ("ipv4: Kill rt->fi") to restore this reference counting. It is a quick fix for -net and -stable, for -net-next, as Eric suggested, we can consider doing reference counting for metrics itself instead of relying on fib_info. IPv6 is very different, it copies or steals the metrics from mx6_config in fib6_commit_metrics() so probably doesn't need a refcnt. Decnet has already done the refcnt'ing, see dn_fib_semantic_match(). Fixes: 2860583fe840 ("ipv4: Kill rt->fi") Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b7c8487c |
|
21-Apr-2017 |
Robert Shearman <rshearma@brocade.com> |
ipv4: Avoid caching l3mdev dst on mismatched local route David reported that doing the following: ip li add red type vrf table 10 ip link set dev eth1 vrf red ip addr add 127.0.0.1/8 dev red ip link set dev eth1 up ip li set red up ping -c1 -w1 -I red 127.0.0.1 ip li del red when either policy routing IP rules are present or the local table lookup ip rule is before the l3mdev lookup results in a hang with these messages: unregister_netdevice: waiting for red to become free. Usage count = 1 The problem is caused by caching the dst used for sending the packet out of the specified interface on a local route with a different nexthop interface. Thus the dst could stay around until the route in the table the lookup was done is deleted which may be never. Address the problem by not forcing output device to be the l3mdev in the flow's output interface if the lookup didn't use the l3mdev. This then results in the dst using the right device according to the route. Changes in v2: - make the dev_out passed in by __ip_route_output_key_hash correct instead of checking the nh dev if FLOWI_FLAG_SKIP_NH_OIF is set as suggested by David. Fixes: 5f02ce24c2696 ("net: l3mdev: Allow the l3mdev to be a loopback") Reported-by: David Ahern <dsa@cumulusnetworks.com> Suggested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c21ef3e3 |
|
16-Apr-2017 |
David Ahern <dsa@cumulusnetworks.com> |
net: rtnetlink: plumb extended ack to doit function Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse for doit functions that call it directly. This is the first step to using extended error reporting in rtnetlink. >From here individual subsystems can be updated to set netlink_ext_ack as needed. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fceb6435 |
|
12-Apr-2017 |
Johannes Berg <johannes.berg@intel.com> |
netlink: pass extended ACK struct to parsing functions Pass the new extended ACK reporting struct to all of the generic netlink parsing functions. For now, pass NULL in almost all callers (except for some in the core.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7ed14d97 |
|
11-Apr-2017 |
Gao Feng <fgao@ikuai8.com> |
net: ipv4: Refine the ipv4_default_advmss 1. Don't get the metric RTAX_ADVMSS of dst. There are two reasons. 1) Its caller dst_metric_advmss has already invoke dst_metric_advmss before invoke default_advmss. 2) The ipv4_default_advmss is used to get the default mss, it should not try to get the metric like ip6_default_advmss. 2. Use sizeof(tcphdr)+sizeof(iphdr) instead of literal 40. 3. Define one new macro IPV4_MAX_PMTU instead of 65535 according to RFC 2675, section 5.1. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bbadb9a2 |
|
07-Apr-2017 |
Florian Larysch <fl@n621.de> |
net: ipv4: fix multipath RTM_GETROUTE behavior when iif is given inet_rtm_getroute synthesizes a skeletal ICMP skb, which is passed to ip_route_input when iif is given. If a multipath route is present for the designated destination, fib_multipath_hash ends up being called with that skb. However, as that skb contains no information beyond the protocol type, the calculated hash does not match the one we would see for a real packet. There is currently no way to fix this for layer 4 hashing, as RTM_GETROUTE doesn't have the necessary information to create layer 4 headers. To fix this for layer 3 hashing, set appropriate saddr/daddrs in the skb and also change the protocol to UDP to avoid special treatment for ICMP. Signed-off-by: Florian Larysch <fl@n621.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a8801799 |
|
03-Apr-2017 |
Florian Larysch <fl@n621.de> |
net: ipv4: fix multipath RTM_GETROUTE behavior when iif is given inet_rtm_getroute synthesizes a skeletal ICMP skb, which is passed to ip_route_input when iif is given. If a multipath route is present for the designated destination, ip_multipath_icmp_hash ends up being called, which uses the source/destination addresses within the skb to calculate a hash. However, those are not set in the synthetic skb, causing it to return an arbitrary and incorrect result. Instead, use UDP, which gets no such special treatment. Signed-off-by: Florian Larysch <fl@n621.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bf4e0a3d |
|
16-Mar-2017 |
Nikolay Aleksandrov <nikolay@cumulusnetworks.com> |
net: ipv4: add support for ECMP hash policy choice This patch adds support for ECMP hash policy choice via a new sysctl called fib_multipath_hash_policy and also adds support for L4 hashes. The current values for fib_multipath_hash_policy are: 0 - layer 3 (default) 1 - layer 4 If there's an skb hash already set and it matches the chosen policy then it will be used instead of being calculated (currently only for L4). In L3 mode we always calculate the hash due to the ICMP error special case, the flow dissector's field consistentification should handle the address order thus we can remove the address reversals. If the skb is provided we always use it for the hash calculation, otherwise we fallback to fl4, that is if skb is NULL fl4 has to be set. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6e28099d |
|
26-Feb-2017 |
Julian Anastasov <ja@ssi.bg> |
ipv4: mask tos for input route Restore the lost masking of TOS in input route code to allow ip rules to match it properly. Problem [1] noticed by Shmulik Ladkani <shmulik.ladkani@gmail.com> [1] http://marc.info/?t=137331755300040&r=1&w=2 Fixes: 89aef8921bfb ("ipv4: Delete routing cache.") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8bcfd092 |
|
26-Feb-2017 |
Julian Anastasov <ja@ssi.bg> |
ipv4: add missing initialization for flowi4_uid Avoid matching of random stack value for uid when rules are looked up on input route or when RP filter is used. Problem should affect only setups that use ip rules with uid range. Fixes: 622ec2c9d524 ("net: core: add UID to flows, rules, and routes") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
63fca65d |
|
06-Feb-2017 |
Julian Anastasov <ja@ssi.bg> |
net: add confirm_neigh method to dst_ops Add confirm_neigh method to dst_ops and use it from IPv4 and IPv6 to lookup and confirm the neighbour. Its usage via the new helper dst_confirm_neigh() should be restricted to MSG_PROBE users for performance reasons. For XFRM prefer the last tunnel address, if present. With help from Steffen Klassert. Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8a430ed5 |
|
11-Jan-2017 |
David Ahern <dsa@cumulusnetworks.com> |
net: ipv4: fix table id in getroute response rtm_table is an 8-bit field while table ids are allowed up to u32. Commit 709772e6e065 ("net: Fix routing tables with id > 255 for legacy software") added the preference to set rtm_table in dumps to RT_TABLE_COMPAT if the table id is > 255. The table id returned on get route requests should do the same. Fixes: c36ba6603a11 ("net: Allow user to get table id from route lookup") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
eafea739 |
|
07-Jan-2017 |
David Ahern <dsa@cumulusnetworks.com> |
net: ipv4: remove disable of bottom half in inet_rtm_getroute Nothing about the route lookup requires bottom half to be disabled. Remove the local_bh_disable ... local_bh_enable around ip_route_input. This appears to be a vestige of days gone by as it has been there since the beginning of git time. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dc33da59 |
|
06-Jan-2017 |
David Ahern <dsa@cumulusnetworks.com> |
net: ipv4: Remove flow arg from ip_mkroute_input fl4 arg is not used; remove it. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9f09eaea |
|
06-Jan-2017 |
David Ahern <dsa@cumulusnetworks.com> |
net: ipmr: Remove nowait arg to ipmr_get_route ipmr_get_route has 1 caller and the nowait arg is 0. Remove the arg and simplify ipmr_get_route accordingly. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0c8d803f |
|
05-Jan-2017 |
David Ahern <dsa@cumulusnetworks.com> |
net: ipv4: Simplify rt_fill_info rt_fill_info has only 1 caller and both of the last 2 args -- nowait and flags -- are hardcoded to 0. Given that remove them as input arguments and simplify rt_fill_info accordingly. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f5a0aab8 |
|
29-Dec-2016 |
David Ahern <dsa@cumulusnetworks.com> |
net: ipv4: dst for local input routes should use l3mdev if relevant IPv4 output routes already use l3mdev device instead of loopback for dst's if it is applicable. Change local input routes to do the same. This fixes icmp responses for unreachable UDP ports which are directed to the wrong table after commit 9d1a6c4ea43e4 because local_input routes use the loopback device. Moving from ingress device to loopback loses the L3 domain causing responses based on the dst to get to lost. Fixes: 9d1a6c4ea43e4 ("net: icmp_route_lookup should use rt dev to determine L3 domain") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7c0f6ba6 |
|
24-Dec-2016 |
Linus Torvalds <torvalds@linux-foundation.org> |
Replace <asm/uaccess.h> with <linux/uaccess.h> globally This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
7d995694 |
|
22-Dec-2016 |
Lorenzo Colitti <lorenzo@google.com> |
net: ipv4: Don't crash if passing a null sk to ip_do_redirect. Commit e2d118a1cb5e ("net: inet: Support UID-based routing in IP protocols.") made ip_do_redirect call sock_net(sk) to determine the network namespace of the passed-in socket. This crashes if sk is NULL. Fix this by getting the network namespace from the skb instead. Fixes: e2d118a1cb5e ("net: inet: Support UID-based routing in IP protocols.") Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
efd85700 |
|
30-Nov-2016 |
Thomas Graf <tgraf@suug.ch> |
route: Set lwtstate for local traffic and cached input dsts A route on the output path hitting a RTN_LOCAL route will keep the dst associated on its way through the loopback device. On the receive path, the dst_input() call will thus invoke the input handler of the route created in the output path. Thus, lwt redirection for input must be done for dsts allocated in the otuput path as well. Also, if a route is cached in the input path, the allocated dst should respect lwtunnel configuration on the nexthop as well. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
11b3d9c5 |
|
30-Nov-2016 |
Thomas Graf <tgraf@suug.ch> |
route: Set orig_output when redirecting to lwt on locally generated traffic orig_output for IPv4 was only set for dsts which hit an input route. Set it consistently for locally generated traffic as well to allow lwt to continue the dst_output() path as configured by the nexthop. Fixes: 2536862311d ("lwt: Add support to redirect dst.input") Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d109e61b |
|
29-Nov-2016 |
Lorenzo Colitti <lorenzo@google.com> |
net: ipv4: Don't crash if passing a null sk to ip_rt_update_pmtu. Commit e2d118a1cb5e ("net: inet: Support UID-based routing in IP protocols.") made __build_flow_key call sock_net(sk) to determine the network namespace of the passed-in socket. This crashes if sk is NULL. Fix this by getting the network namespace from the skb instead. Fixes: e2d118a1cb5e ("net: inet: Support UID-based routing in IP protocols.") Reported-by: Erez Shitrit <erezsh@dev.mellanox.co.il> Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
969447f2 |
|
10-Nov-2016 |
Stephen Suryaputra Lin <stephen.suryaputra.lin@gmail.com> |
ipv4: use new_gw for redirect neigh lookup In v2.6, ip_rt_redirect() calls arp_bind_neighbour() which returns 0 and then the state of the neigh for the new_gw is checked. If the state isn't valid then the redirected route is deleted. This behavior is maintained up to v3.5.7 by check_peer_redirect() because rt->rt_gateway is assigned to peer->redirect_learned.a4 before calling ipv4_neigh_lookup(). After commit 5943634fc559 ("ipv4: Maintain redirect and PMTU info in struct rtable again."), ipv4_neigh_lookup() is performed without the rt_gateway assigned to the new_gw. In the case when rt_gateway (old_gw) isn't zero, the function uses it as the key. The neigh is most likely valid since the old_gw is the one that sends the ICMP redirect message. Then the new_gw is assigned to fib_nh_exception. The problem is: the new_gw ARP may never gets resolved and the traffic is blackholed. So, use the new_gw for neigh lookup. Changes from v1: - use __ipv4_neigh_lookup instead (per Eric Dumazet). Fixes: 5943634fc559 ("ipv4: Maintain redirect and PMTU info in struct rtable again.") Signed-off-by: Stephen Suryaputra Lin <ssurya@ieee.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e2d118a1 |
|
03-Nov-2016 |
Lorenzo Colitti <lorenzo@google.com> |
net: inet: Support UID-based routing in IP protocols. - Use the UID in routing lookups made by protocol connect() and sendmsg() functions. - Make sure that routing lookups triggered by incoming packets (e.g., Path MTU discovery) take the UID of the socket into account. - For packets not associated with a userspace socket, (e.g., ping replies) use UID 0 inside the user namespace corresponding to the network namespace the socket belongs to. This allows all namespaces to apply routing and iptables rules to kernel-originated traffic in that namespaces by matching UID 0. This is better than using the UID of the kernel socket that is sending the traffic, because the UID of kernel sockets created at namespace creation time (e.g., the per-processor ICMP and TCP sockets) is the UID of the user that created the socket, which might not be mapped in the namespace. Tested: compiles allnoconfig, allyesconfig, allmodconfig Tested: https://android-review.googlesource.com/253302 Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
622ec2c9 |
|
03-Nov-2016 |
Lorenzo Colitti <lorenzo@google.com> |
net: core: add UID to flows, rules, and routes - Define a new FIB rule attributes, FRA_UID_RANGE, to describe a range of UIDs. - Define a RTA_UID attribute for per-UID route lookups and dumps. - Support passing these attributes to and from userspace via rtnetlink. The value INVALID_UID indicates no UID was specified. - Add a UID field to the flow structures. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e58e4159 |
|
31-Oct-2016 |
David Ahern <dsa@cumulusnetworks.com> |
net: Enable support for VRF with ipv4 multicast Enable support for IPv4 multicast: - similar to unicast the flow struct is updated to L3 master device if relevant prior to calling fib_rules_lookup. The table id is saved to the lookup arg so the rule action for ipmr can return the table associated with the device. - ip_mr_forward needs to check for master device mismatch as well since the skb->dev is set to it - allow multicast address on VRF device for Rx by checking for the daddr in the VRF device as well as the original ingress device - on Tx need to drop to __mkroute_output when FIB lookup fails for multicast destination address. - if CONFIG_IP_MROUTE_MULTIPLE_TABLES is enabled VRF driver creates IPMR FIB rules on first device create similar to FIB rules. In addition the VRF driver does not divert IPv4 multicast packets: it breaks on Tx since the fib lookup fails on the mcast address. With this patch, ipmr forwarding and local rx/tx work. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6104e112 |
|
12-Oct-2016 |
David Ahern <dsa@cumulusnetworks.com> |
net: ipv4: Do not drop to make_route if oif is l3mdev Commit e0d56fdd7342 was a bit aggressive removing l3mdev calls in the IPv4 stack. If the fib_lookup fails we do not want to drop to make_route if the oif is an l3mdev device. Also reverts 19664c6a0009 ("net: l3mdev: Remove netif_index_is_l3_master") which removed netif_index_is_l3_master. Fixes: e0d56fdd7342 ("net: l3mdev: remove redundant calls") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2cf75070 |
|
25-Sep-2016 |
Nikolay Aleksandrov <nikolay@cumulusnetworks.com> |
ipmr, ip6mr: fix scheduling while atomic and a deadlock with ipmr_get_route Since the commit below the ipmr/ip6mr rtnl_unicast() code uses the portid instead of the previous dst_pid which was copied from in_skb's portid. Since the skb is new the portid is 0 at that point so the packets are sent to the kernel and we get scheduling while atomic or a deadlock (depending on where it happens) by trying to acquire rtnl two times. Also since this is RTM_GETROUTE, it can be triggered by a normal user. Here's the sleeping while atomic trace: [ 7858.212557] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620 [ 7858.212748] in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper/0 [ 7858.212881] 2 locks held by swapper/0/0: [ 7858.213013] #0: (((&mrt->ipmr_expire_timer))){+.-...}, at: [<ffffffff810fbbf5>] call_timer_fn+0x5/0x350 [ 7858.213422] #1: (mfc_unres_lock){+.....}, at: [<ffffffff8161e005>] ipmr_expire_process+0x25/0x130 [ 7858.213807] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.0-rc7+ #179 [ 7858.213934] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 [ 7858.214108] 0000000000000000 ffff88005b403c50 ffffffff813a7804 0000000000000000 [ 7858.214412] ffffffff81a1338e ffff88005b403c78 ffffffff810a4a72 ffffffff81a1338e [ 7858.214716] 000000000000026c 0000000000000000 ffff88005b403ca8 ffffffff810a4b9f [ 7858.215251] Call Trace: [ 7858.215412] <IRQ> [<ffffffff813a7804>] dump_stack+0x85/0xc1 [ 7858.215662] [<ffffffff810a4a72>] ___might_sleep+0x192/0x250 [ 7858.215868] [<ffffffff810a4b9f>] __might_sleep+0x6f/0x100 [ 7858.216072] [<ffffffff8165bea3>] mutex_lock_nested+0x33/0x4d0 [ 7858.216279] [<ffffffff815a7a5f>] ? netlink_lookup+0x25f/0x460 [ 7858.216487] [<ffffffff8157474b>] rtnetlink_rcv+0x1b/0x40 [ 7858.216687] [<ffffffff815a9a0c>] netlink_unicast+0x19c/0x260 [ 7858.216900] [<ffffffff81573c70>] rtnl_unicast+0x20/0x30 [ 7858.217128] [<ffffffff8161cd39>] ipmr_destroy_unres+0xa9/0xf0 [ 7858.217351] [<ffffffff8161e06f>] ipmr_expire_process+0x8f/0x130 [ 7858.217581] [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180 [ 7858.217785] [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180 [ 7858.217990] [<ffffffff810fbc95>] call_timer_fn+0xa5/0x350 [ 7858.218192] [<ffffffff810fbbf5>] ? call_timer_fn+0x5/0x350 [ 7858.218415] [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180 [ 7858.218656] [<ffffffff810fde10>] run_timer_softirq+0x260/0x640 [ 7858.218865] [<ffffffff8166379b>] ? __do_softirq+0xbb/0x54f [ 7858.219068] [<ffffffff816637c8>] __do_softirq+0xe8/0x54f [ 7858.219269] [<ffffffff8107a948>] irq_exit+0xb8/0xc0 [ 7858.219463] [<ffffffff81663452>] smp_apic_timer_interrupt+0x42/0x50 [ 7858.219678] [<ffffffff816625bc>] apic_timer_interrupt+0x8c/0xa0 [ 7858.219897] <EOI> [<ffffffff81055f16>] ? native_safe_halt+0x6/0x10 [ 7858.220165] [<ffffffff810d64dd>] ? trace_hardirqs_on+0xd/0x10 [ 7858.220373] [<ffffffff810298e3>] default_idle+0x23/0x190 [ 7858.220574] [<ffffffff8102a20f>] arch_cpu_idle+0xf/0x20 [ 7858.220790] [<ffffffff810c9f8c>] default_idle_call+0x4c/0x60 [ 7858.221016] [<ffffffff810ca33b>] cpu_startup_entry+0x39b/0x4d0 [ 7858.221257] [<ffffffff8164f995>] rest_init+0x135/0x140 [ 7858.221469] [<ffffffff81f83014>] start_kernel+0x50e/0x51b [ 7858.221670] [<ffffffff81f82120>] ? early_idt_handler_array+0x120/0x120 [ 7858.221894] [<ffffffff81f8243f>] x86_64_start_reservations+0x2a/0x2c [ 7858.222113] [<ffffffff81f8257c>] x86_64_start_kernel+0x13b/0x14a Fixes: 2942e9005056 ("[RTNETLINK]: Use rtnl_unicast() for rtnetlink unicasts") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
adb03115 |
|
20-Sep-2016 |
Eric Dumazet <edumazet@google.com> |
net: get rid of an signed integer overflow in ip_idents_reserve() Jiri Pirko reported an UBSAN warning happening in ip_idents_reserve() [] UBSAN: Undefined behaviour in ./arch/x86/include/asm/atomic.h:156:11 [] signed integer overflow: [] -2117905507 + -695755206 cannot be represented in type 'int' Since we do not have uatomic_add_return() yet, use atomic_cmpxchg() so that the arithmetics can be done using unsigned int. Fixes: 04ca6973f7c1 ("ip: make IP identifiers less predictable") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e0d56fdd |
|
10-Sep-2016 |
David Ahern <dsa@cumulusnetworks.com> |
net: l3mdev: remove redundant calls A previous patch added l3mdev flow update making these hooks redundant. Remove them. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ebfc102c |
|
10-Sep-2016 |
David Ahern <dsa@cumulusnetworks.com> |
net: vrf: Flip IPv4 output path from FIB lookup hook to out hook Flip the IPv4 output path to use the l3mdev tx out hook. The VRF dst is not returned on the first FIB lookup. Instead, the dst on the skb is switched at the beginning of the IPv4 output processing to send the packet to the VRF driver on xmit. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5f02ce24 |
|
10-Sep-2016 |
David Ahern <dsa@cumulusnetworks.com> |
net: l3mdev: Allow the l3mdev to be a loopback Allow an L3 master device to act as the loopback for that L3 domain. For IPv4 the device can also have the address 127.0.0.1. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
14972cbd |
|
24-Aug-2016 |
Roopa Prabhu <roopa@cumulusnetworks.com> |
net: lwtunnel: Handle fragmentation Today mpls iptunnel lwtunnel_output redirect expects the tunnel output function to handle fragmentation. This is ok but can be avoided if we did not do the mpls output redirect too early. ie we could wait until ip fragmentation is done and then call mpls output for each ip fragment. To make this work we will need, 1) the lwtunnel state to carry encap headroom 2) and do the redirect to the encap output handler on the ip fragment (essentially do the output redirect after fragmentation) This patch adds tunnel headroom in lwtstate to make sure we account for tunnel data in mtu calculations during fragmentation and adds new xmit redirect handler to redirect to lwtunnel xmit func after ip fragmentation. This includes IPV6 and some mtu fixes and testing from David Ahern. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1ff23bee |
|
07-May-2016 |
David Ahern <dsa@cumulusnetworks.com> |
net: l3mdev: Allow send on enslaved interface Allow udp and raw sockets to send by oif that is an enslaved interface versus the l3mdev/VRF device. For example, this allows BFD to use ifindex from IP_PKTINFO on a receive to send a response without the need to convert to the VRF index. It also allows ping and ping6 to work when specifying an enslaved interface (e.g., ping -I swp1 <ip>) which is a natural use case. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b45386ef |
|
27-Apr-2016 |
Eric Dumazet <edumazet@google.com> |
net: rename IP_INC_STATS_BH() Rename IP_INC_STATS_BH() to __IP_INC_STATS(), to better express this is used in non preemptible context. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d6d5e999 |
|
08-Apr-2016 |
Chris Friesen <chris.friesen@windriver.com> |
route: do not cache fib route info on local routes with oif For local routes that require a particular output interface we do not want to cache the result. Caching the result causes incorrect behaviour when there are multiple source addresses on the interface. The end result being that if the intended recipient is waiting on that interface for the packet he won't receive it because it will be delivered on the loopback interface and the IP_PKTINFO ipi_ifindex will be set to the loopback interface as well. This can be tested by running a program such as "dhcp_release" which attempts to inject a packet on a particular interface so that it is received by another program on the same board. The receiving process should see an IP_PKTINFO ipi_ifndex value of the source interface (e.g., eth1) instead of the loopback interface (e.g., lo). The packet will still appear on the loopback interface in tcpdump but the important aspect is that the CMSG info is correct. Sample dhcp_release command line: dhcp_release eth1 192.168.204.222 02:11:33:22:44:66 Signed-off-by: Allain Legacy <allain.legacy@windriver.com> Signed off-by: Chris Friesen <chris.friesen@windriver.com> Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9ab179d8 |
|
07-Apr-2016 |
David Ahern <dsa@cumulusnetworks.com> |
net: vrf: Fix dst reference counting Vivek reported a kernel exception deleting a VRF with an active connection through it. The root cause is that the socket has a cached reference to a dst that is destroyed. Converting the dst_destroy to dst_release and letting proper reference counting kick in does not work as the dst has a reference to the device which needs to be released as well. I talked to Hannes about this at netdev and he pointed out the ipv4 and ipv6 dst handling has dst_ifdown for just this scenario. Rather than continuing with the reinvented dst wheel in VRF just remove it and leverage the ipv4 and ipv6 versions. Fixes: 193125dbd8eb2 ("net: Introduce VRF device driver") Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
deed49df |
|
18-Feb-2016 |
Xin Long <lucien.xin@gmail.com> |
route: check and remove route cache when we get route Since the gc of ipv4 route was removed, the route cached would has no chance to be removed, and even it has been timeout, it still could be used, cause no code to check it's expires. Fix this issue by checking and removing route cache when we get route. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
28335a74 |
|
07-Oct-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Do not drop to make_route if oif is l3mdev Commit deaa0a6a930 ("net: Lookup actual route when oif is VRF device") exposed a bug in __ip_route_output_key_hash for VRF devices: on FIB lookup failure if the oif is specified the current logic drops to make_route on the assumption that the route tables are wrong. For VRF/L3 master devices this leads to wrong dst entries and route lookups. For example: $ ip route ls table vrf-red unreachable default broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.2 10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.2 local 10.2.1.2 dev eth1 proto kernel scope host src 10.2.1.2 broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.2 $ ip route get oif vrf-red 1.1.1.1 1.1.1.1 dev vrf-red src 10.0.0.2 cache With this patch: $ ip route get oif vrf-red 1.1.1.1 RTNETLINK answers: No route to host which is the correct response based on the default route Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ede2059d |
|
07-Oct-2015 |
Eric W. Biederman <ebiederm@xmission.com> |
dst: Pass net into dst->output The network namespace is already passed into dst_output pass it into dst->output lwt->output and friends. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b92dacd4 |
|
07-Oct-2015 |
Eric W. Biederman <ebiederm@xmission.com> |
ipv4: Merge __ip_local_out and __ip_local_out_sk Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4ebdfba7 |
|
07-Oct-2015 |
Eric W. Biederman <ebiederm@xmission.com> |
dst: Pass a sk into .local_out For consistency with the other similar methods in the kernel pass a struct sock into the dst_ops .local_out method. Simplifying the socket passing case is needed a prequel to passing a struct net reference into .local_out. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
deaa0a6a |
|
05-Oct-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Lookup actual route when oif is VRF device If the user specifies a VRF device in a get route query the custom route pointing to the VRF device is returned: $ ip route ls table vrf-red unreachable default broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.2 10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.2 local 10.2.1.2 dev eth1 proto kernel scope host src 10.2.1.2 broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.2 $ ip route get oif vrf-red 10.2.1.40 10.2.1.40 dev vrf-red cache Add the flags to skip the custom route and go directly to the FIB. With this patch the actual route is returned: $ ip route get oif vrf-red 10.2.1.40 10.2.1.40 dev eth1 src 10.2.1.2 cache Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3ce58d84 |
|
05-Oct-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Refactor path selection in __ip_route_output_key_hash VRF device needs the same path selection following lookup to set source address. Rather than duplicating code, move existing code into a function that is exported to modules. Code move only; no functional change. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
79a13159 |
|
30-Sep-2015 |
Peter Nørlund <pch@ordbogen.com> |
ipv4: ICMP packet inspection for multipath ICMP packets are inspected to let them route together with the flow they belong to, minimizing the chance that a problematic path will affect flows on other paths, and so that anycast environments can work with ECMP. Signed-off-by: Peter Nørlund <pch@ordbogen.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0e884c78 |
|
30-Sep-2015 |
Peter Nørlund <pch@ordbogen.com> |
ipv4: L3 hash-based multipath Replaces the per-packet multipath with a hash-based multipath using source and destination address. Signed-off-by: Peter Nørlund <pch@ordbogen.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b84f7878 |
|
29-Sep-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Initialize flow flags in input path The fib_table_lookup tracepoint found 2 places where the flowi4_flags is not initialized. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8e1ed705 |
|
29-Sep-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Replace calls to vrf_dev_get_rth Replace calls to vrf_dev_get_rth with l3mdev_get_rtable. The check on the flow flags is handled in the l3mdev operation. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
385add90 |
|
29-Sep-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Replace vrf_master_ifindex{, _rcu} with l3mdev equivalents Replace calls to vrf_master_ifindex_rcu and vrf_master_ifindex with either l3mdev_master_ifindex_rcu or l3mdev_master_ifindex. The pattern: oif = vrf_master_ifindex(dev) ? : dev->ifindex; is replaced with oif = l3mdev_fib_oif(dev); And remove the now unused vrf macros. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
007979ea |
|
29-Sep-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Rename IFF_VRF_MASTER to IFF_L3MDEV_MASTER Rename IFF_VRF_MASTER to IFF_L3MDEV_MASTER and update the name of the netif_is_vrf and netif_index_is_vrf macros. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0d753960 |
|
28-Sep-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Remove martian_source_keep_err goto label err is initialized to -EINVAL when it is declared. It is not reset until fib_lookup which is well after the 3 users of the martian_source jump. So resetting err to -EINVAL at martian_source label is not needed. Removing that line obviates the need for the martian_source_keep_err label so delete it. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
75fea73d |
|
28-Sep-2015 |
Alexander Duyck <aduyck@mirantis.com> |
net: Swap ordering of tests in ip_route_input_mc This patch just swaps the ordering of one of the conditional tests in ip_route_input_mc. Specifically it swaps the testing for the source address to see if it is loopback, and the test to see if we allow a loopback source address. The reason for swapping these two tests is because it is much faster to test if an address is loopback than it is to dereference several pointers to get at the net structure to see if the use of loopback is allowed. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6f9c9615 |
|
25-Sep-2015 |
Eric Dumazet <edumazet@google.com> |
inet: constify ip_route_output_flow() socket argument Very soon, TCP stack might call inet_csk_route_req(), which calls inet_csk_route_req() with an unlocked listener socket, so we need to make sure ip_route_output_flow() is not trying to change any field from its socket argument. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0315e382 |
|
17-Sep-2015 |
Nikola Forró <nforro@redhat.com> |
net: Fix behaviour of unreachable, blackhole and prohibit routes Man page of ip-route(8) says following about route types: unreachable - these destinations are unreachable. Packets are dis‐ carded and the ICMP message host unreachable is generated. The local senders get an EHOSTUNREACH error. blackhole - these destinations are unreachable. Packets are dis‐ carded silently. The local senders get an EINVAL error. prohibit - these destinations are unreachable. Packets are discarded and the ICMP message communication administratively prohibited is generated. The local senders get an EACCES error. In the inet6 address family, this was correct, except the local senders got ENETUNREACH error instead of EHOSTUNREACH in case of unreachable route. In the inet address family, all three route types generated ICMP message net unreachable, and the local senders got ENETUNREACH error. In both address families all three route types now behave consistently with documentation. Signed-off-by: Nikola Forró <nforro@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bde6f9de |
|
16-Sep-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Initialize table in fib result Sergey, Richard and Fabio reported an oops in ip_route_input_noref. e.g., from Richard: [ 0.877040] BUG: unable to handle kernel NULL pointer dereference at 0000000000000056 [ 0.877597] IP: [<ffffffff8155b5e2>] ip_route_input_noref+0x1a2/0xb00 [ 0.877597] PGD 3fa14067 PUD 3fa6e067 PMD 0 [ 0.877597] Oops: 0000 [#1] SMP [ 0.877597] Modules linked in: virtio_net virtio_pci virtio_ring virtio [ 0.877597] CPU: 1 PID: 119 Comm: ifconfig Not tainted 4.2.0+ #1 [ 0.877597] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 0.877597] task: ffff88003fab0bc0 ti: ffff88003faa8000 task.ti: ffff88003faa8000 [ 0.877597] RIP: 0010:[<ffffffff8155b5e2>] [<ffffffff8155b5e2>] ip_route_input_noref+0x1a2/0xb00 [ 0.877597] RSP: 0018:ffff88003ed03ba0 EFLAGS: 00010202 [ 0.877597] RAX: 0000000000000046 RBX: 00000000ffffff8f RCX: 0000000000000020 [ 0.877597] RDX: ffff88003fab50b8 RSI: 0000000000000200 RDI: ffffffff8152b4b8 [ 0.877597] RBP: ffff88003ed03c50 R08: 0000000000000000 R09: 0000000000000000 [ 0.877597] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003fab6f00 [ 0.877597] R13: ffff88003fab5000 R14: 0000000000000000 R15: ffffffff81cb5600 [ 0.877597] FS: 00007f6de5751700(0000) GS:ffff88003ed00000(0000) knlGS:0000000000000000 [ 0.877597] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.877597] CR2: 0000000000000056 CR3: 000000003fa6d000 CR4: 00000000000006e0 [ 0.877597] Stack: [ 0.877597] 0000000000000000 0000000000000046 ffff88003fffa600 ffff88003ed03be0 [ 0.877597] ffff88003f9e2c00 697da8c0017da8c0 ffff880000000000 000000000007fd00 [ 0.877597] 0000000000000000 0000000000000046 0000000000000000 0000000400000000 [ 0.877597] Call Trace: [ 0.877597] <IRQ> [ 0.877597] [<ffffffff812bfa1f>] ? cpumask_next_and+0x2f/0x40 [ 0.877597] [<ffffffff8158e13c>] arp_process+0x39c/0x690 [ 0.877597] [<ffffffff8158e57e>] arp_rcv+0x13e/0x170 [ 0.877597] [<ffffffff8151feec>] __netif_receive_skb_core+0x60c/0xa00 [ 0.877597] [<ffffffff81515795>] ? __build_skb+0x25/0x100 [ 0.877597] [<ffffffff81515795>] ? __build_skb+0x25/0x100 [ 0.877597] [<ffffffff81521ff6>] __netif_receive_skb+0x16/0x70 [ 0.877597] [<ffffffff81522078>] netif_receive_skb_internal+0x28/0x90 [ 0.877597] [<ffffffff8152288f>] napi_gro_receive+0x7f/0xd0 [ 0.877597] [<ffffffffa0017906>] virtnet_receive+0x256/0x910 [virtio_net] [ 0.877597] [<ffffffffa0017fd8>] virtnet_poll+0x18/0x80 [virtio_net] [ 0.877597] [<ffffffff815234cd>] net_rx_action+0x1dd/0x2f0 [ 0.877597] [<ffffffff81053228>] __do_softirq+0x98/0x260 [ 0.877597] [<ffffffff8164969c>] do_softirq_own_stack+0x1c/0x30 The root cause is use of res.table uninitialized. Thanks to Nikolay for noticing the uninitialized use amongst the maze of gotos. As Nikolay pointed out the second initialization is not required to fix the oops, but rather to fix a related problem where a valid lookup should be invalidated before creating the rth entry. Fixes: b7503e0cdb5d ("net: Add FIB table id to rtable") Reported-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Reported-by: Richard Alpe <richard.alpe@ericsson.com> Reported-by: Fabio Estevam <festevam@gmail.com> Tested-by: Fabio Estevam <fabio.estevam@freescale.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c36ba660 |
|
02-Sep-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Allow user to get table id from route lookup rt_fill_info which is called for 'route get' requests hardcodes the table id as RT_TABLE_MAIN which is not correct when multiple tables are used. Use the newly added table id in the rtable to send back the correct table similar to what is done for IPv6. To maintain current ABI a new request flag, RTM_F_LOOKUP_TABLE, is added to indicate the actual table is wanted versus the hardcoded response. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b7503e0c |
|
02-Sep-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Add FIB table id to rtable Add the FIB table id to rtable to make the information available for IPv4 as it is for IPv6. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d08c4f35 |
|
02-Sep-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Refactor rtable initialization All callers to rt_dst_alloc have nearly the same initialization following a successful allocation. Consolidate it into rt_dst_alloc. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
46fa062a |
|
28-Aug-2015 |
Jiri Benc <jbenc@redhat.com> |
ip_tunnels: convert the mode field of ip_tunnel_info to flags The mode field holds a single bit of information only (whether the ip_tunnel_info struct is for rx or tx). Change the mode field to bit flags. This allows more mode flags to be added. Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
192132b9 |
|
27-Aug-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Add support for VRFs to inetpeer cache inetpeer caches based on address only, so duplicate IP addresses within a namespace return the same cached entry. Enhance the ipv4 address key to contain both the IPv4 address and VRF device index. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
61adedf3 |
|
20-Aug-2015 |
Jiri Benc <jbenc@redhat.com> |
route: move lwtunnel state to dst_entry Currently, the lwtunnel state resides in per-protocol data. This is a problem if we encapsulate ipv6 traffic in an ipv4 tunnel (or vice versa). The xmit function of the tunnel does not know whether the packet has been routed to it by ipv4 or ipv6, yet it needs the lwtstate data. Moving the lwtstate data to dst_entry makes such inter-protocol tunneling possible. As a bonus, this brings a nice diffstat. Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
25368623 |
|
17-Aug-2015 |
Tom Herbert <tom@herbertland.com> |
lwt: Add support to redirect dst.input This patch adds the capability to redirect dst input in the same way that dst output is redirected by LWT. Also, save the original dst.input and and dst.out when setting up lwtunnel redirection. These can be called by the client as a pass- through. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
613d09b3 |
|
13-Aug-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Use VRF device index for lookups on TX As with ingress use the index of VRF master device for route lookups on egress. However, the oif should only be used to direct the lookups to a specific table. Routes in the table are not based on the VRF device but rather interfaces that are part of the VRF so do not consider the oif for lookups within the table. The FLOWI_FLAG_VRFSRC is used to control this latter part. Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cd2fbe1b |
|
13-Aug-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Use VRF device index for lookups on RX On ingress use index of VRF master device for route lookups if real device is enslaved. Rules are expected to be installed for the VRF device to direct lookups to a specific table. Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0335f5b5 |
|
03-Aug-2015 |
Robert Shearman <rshearma@brocade.com> |
ipv4: apply lwtunnel encap for locally-generated packets lwtunnel encap is applied for forwarded packets, but not for locally-generated packets. This is because the output function is not overridden in __mkroute_output, unlike it is in __mkroute_input. The lwtunnel state is correctly set on the rth through the call to rt_set_nexthop, so all that needs to be done is to override the dst output function to be lwtunnel_output if there is lwtunnel state present and it requires output redirection. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5a6228a0 |
|
23-Jul-2015 |
Nicolas Dichtel <nicolas.dichtel@6wind.com> |
lwtunnel: change prototype of lwtunnel_state_get() It saves some lines and simplify a bit the code when the state is returning by this function. It's also useful to handle a NULL entry. To avoid too long lines, I've also renamed lwtunnel_state_get() and lwtunnel_state_put() to lwtstate_get() and lwtstate_put(). CC: Thomas Graf <tgraf@suug.ch> CC: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2392debc |
|
22-Jul-2015 |
Julian Anastasov <ja@ssi.bg> |
ipv4: consider TOS in fib_select_default fib_select_default considers alternative routes only when res->fi is for the first alias in res->fa_head. In the common case this can happen only when the initial lookup matches the first alias with highest TOS value. This prevents the alternative routes to require specific TOS. This patch solves the problem as follows: - routes that require specific TOS should be returned by fib_select_default only when TOS matches, as already done in fib_table_lookup. This rule implies that depending on the TOS we can have many different lists of alternative gateways and we have to keep the last used gateway (fa_default) in first alias for the TOS instead of using single tb_default value. - as the aliases are ordered by many keys (TOS desc, fib_priority asc), we restrict the possible results to routes with matching TOS and lowest metric (fib_priority) and routes that match any TOS, again with lowest metric. For example, packet with TOS 8 can not use gw3 (not lowest metric), gw4 (different TOS) and gw6 (not lowest metric), all other gateways can be used: tos 8 via gw1 metric 2 <--- res->fa_head and res->fi tos 8 via gw2 metric 2 tos 8 via gw3 metric 3 tos 4 via gw4 tos 0 via gw5 tos 0 via gw6 metric 1 Reported-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3093fbe7 |
|
21-Jul-2015 |
Thomas Graf <tgraf@suug.ch> |
route: Per route IP tunnel metadata via lightweight tunnel This introduces a new IP tunnel lightweight tunnel type which allows to specify IP tunnel instructions per route. Only IPv4 is supported at this point. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1b7179d3 |
|
21-Jul-2015 |
Thomas Graf <tgraf@suug.ch> |
route: Extend flow representation with tunnel key Add a new flowi_tunnel structure which is a subset of ip_tunnel_key to allow routes to match on tunnel metadata. For now, the tunnel id is added to flowi_tunnel which allows for routes to be bound to specific virtual tunnels. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f38a9eb1 |
|
21-Jul-2015 |
Thomas Graf <tgraf@suug.ch> |
dst: Metadata destinations Introduces a new dst_metadata which enables to carry per packet metadata between forwarding and processing elements via the skb->dst pointer. The structure is set up to be a union. Thus, each separate type of metadata requires its own dst instance. If demand arises to carry multiple types of metadata concurrently, metadata dst entries can be made stackable. The metadata dst entry is refcnt'ed as expected for now but a non reference counted use is possible if the reference is forced before queueing the skb. In order to allow allocating dsts with variable length, the existing dst_alloc() is split into a dst_alloc() and dst_init() function. The existing dst_init() function to initialize the subsystem is being renamed to dst_subsys_init() to make it clear what is what. The check before ip_route_input() is changed to ignore metadata dsts and drop the dst inside the routing function thus allowing to interpret metadata in a later commit. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8602a625 |
|
21-Jul-2015 |
Roopa Prabhu <roopa@cumulusnetworks.com> |
ipv4: redirect dst output to lwtunnel output For input routes with tunnel encap state this patch redirects dst output functions to lwtunnel_output which later resolves to the corresponding lwtunnel output function. This has been tested to work with mpls ip tunnels. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
571e7226 |
|
21-Jul-2015 |
Roopa Prabhu <roopa@cumulusnetworks.com> |
ipv4: support for fib route lwtunnel encap attributes This patch adds support in ipv4 fib functions to parse user provided encap attributes and attach encap state data to fib_nh and rtable. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cb1c6168 |
|
08-Jul-2015 |
Masatake YAMATO <yamato@redhat.com> |
route: remove unsed variable in __mkroute_input flags local variable in __mkroute_input is not used as a variable. Signed-off-by: Masatake YAMATO <yamato@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0eeb075f |
|
23-Jun-2015 |
Andy Gospodarek <gospo@cumulusnetworks.com> |
net: ipv4 sysctl option to ignore routes when nexthop link is down This feature is only enabled with the new per-interface or ipv4 global sysctls called 'ignore_routes_with_linkdown'. net.ipv4.conf.all.ignore_routes_with_linkdown = 0 net.ipv4.conf.default.ignore_routes_with_linkdown = 0 net.ipv4.conf.lo.ignore_routes_with_linkdown = 0 ... When the above sysctls are set, will report to userspace that a route is dead and will no longer resolve to this nexthop when performing a fib lookup. This will signal to userspace that the route will not be selected. The signalling of a RTNH_F_DEAD is only passed to userspace if the sysctl is enabled and link is down. This was done as without it the netlink listeners would have no idea whether or not a nexthop would be selected. The kernel only sets RTNH_F_DEAD internally if the interface has IFF_UP cleared. With the new sysctl set, the following behavior can be observed (interface p8p1 is link-down): default via 10.0.5.2 dev p9p1 10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15 70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1 80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 dead linkdown 90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 dead linkdown 90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2 90.0.0.1 via 70.0.0.2 dev p7p1 src 70.0.0.1 cache local 80.0.0.1 dev lo src 80.0.0.1 cache <local> 80.0.0.2 via 10.0.5.2 dev p9p1 src 10.0.5.15 cache While the route does remain in the table (so it can be modified if needed rather than being wiped away as it would be if IFF_UP was cleared), the proper next-hop is chosen automatically when the link is down. Now interface p8p1 is linked-up: default via 10.0.5.2 dev p9p1 10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15 70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1 80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2 192.168.56.0/24 dev p2p1 proto kernel scope link src 192.168.56.2 90.0.0.1 via 80.0.0.2 dev p8p1 src 80.0.0.1 cache local 80.0.0.1 dev lo src 80.0.0.1 cache <local> 80.0.0.2 dev p8p1 src 80.0.0.1 cache and the output changes to what one would expect. If the sysctl is not set, the following output would be expected when p8p1 is down: default via 10.0.5.2 dev p9p1 10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15 70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1 80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 linkdown 90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 linkdown 90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2 Since the dead flag does not appear, there should be no expectation that the kernel would skip using this route due to link being down. v2: Split kernel changes into 2 patches, this actually makes a behavioral change if the sysctl is set. Also took suggestion from Alex to simplify code by only checking sysctl during fib lookup and suggestion from Scott to add a per-interface sysctl. v3: Code clean-ups to make it more readable and efficient as well as a reverse path check fix. v4: Drop binary sysctl v5: Whitespace fixups from Dave v6: Style changes from Dave and checkpatch suggestions v7: One more checkpatch fixup Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com> Acked-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
381c759d |
|
22-May-2015 |
Eric W. Biederman <ebiederm@xmission.com> |
ipv4: Avoid crashing in ip_error ip_error does not check if in_dev is NULL before dereferencing it. IThe following sequence of calls is possible: CPU A CPU B ip_rcv_finish ip_route_input_noref() ip_route_input_slow() inetdev_destroy() dst_input() With the result that a network device can be destroyed while processing an input packet. A crash was triggered with only unicast packets in flight, and forwarding enabled on the only network device. The error condition was created by the removal of the network device. As such it is likely the that error code was -EHOSTUNREACH, and the action taken by ip_error (if in_dev had been accessible) would have been to not increment any counters and to have tried and likely failed to send an icmp error as the network device is going away. Therefore handle this weird case by just dropping the packet if !in_dev. It will result in dropping the packet sooner, and will not result in an actual change of behavior. Fixes: 251da4130115b ("ipv4: Cache ip_error() routes even when not forwarding.") Reported-by: Vittorio Gambaletta <linuxbugs@vittgam.net> Tested-by: Vittorio Gambaletta <linuxbugs@vittgam.net> Signed-off-by: Vittorio Gambaletta <linuxbugs@vittgam.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6a211654 |
|
01-May-2015 |
Andrew Lunn <andrew@lunn.ch> |
net: ipv4: route: Fix sending IGMP messages with link address In setups with a global scope address on an interface, and a lesser scope address on an interface sending IGMP reports, the reports can be sent using the other interfaces global scope address rather than the local interface address. RFC 2236 suggests: Ignore the Report if you cannot identify the source address of the packet as belonging to a subnet assigned to the interface on which the packet was received. since such reports could be forged. Look at the protocol when deciding if a RT_SCOPE_LINK address should be used for the packet. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
355b590c |
|
01-May-2015 |
Eric Dumazet <edumazet@google.com> |
ipv4: speedup ip_idents_reserve() Under stress, ip_idents_reserve() is accessing a contended cache line twice, with non optimal MESI transactions. If we place timestamps in separate location, we reduce this pressure by ~50% and allow atomic_add_return() to issue a Request for Ownership. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cb6ccf09 |
|
27-Apr-2015 |
Herbert Xu <herbert@gondor.apana.org.au> |
route: Use ipv4_mtu instead of raw rt_pmtu The commit 3cdaa5be9e81a914e633a6be7b7d2ef75b528562 ("ipv4: Don't increase PMTU with Datagram Too Big message") broke PMTU in cases where the rt_pmtu value has expired but is smaller than the new PMTU value. This obsolete rt_pmtu then prevents the new PMTU value from being installed. Fixes: 3cdaa5be9e81 ("ipv4: Don't increase PMTU with Datagram Too Big message") Reported-by: Gerd v. Egidy <gerd.von.egidy@intra2net.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
00db4124 |
|
03-Apr-2015 |
Ian Morris <ipm@chirality.org.uk> |
ipv4: coding style: comparison for inequality with NULL The ipv4 code uses a mixture of coding styles. In some instances check for non-NULL pointer is done as x != NULL and sometimes as x. x is preferred according to checkpatch and this patch makes the code consistent by adopting the latter form. No changes detected by objdiff. Signed-off-by: Ian Morris <ipm@chirality.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
51456b29 |
|
03-Apr-2015 |
Ian Morris <ipm@chirality.org.uk> |
ipv4: coding style: comparison for equality with NULL The ipv4 code uses a mixture of coding styles. In some instances check for NULL pointer is done as x == NULL and sometimes as !x. !x is preferred according to checkpatch and this patch makes the code consistent by adopting the latter form. No changes detected by objdiff. Signed-off-by: Ian Morris <ipm@chirality.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
67b61f6c |
|
29-Mar-2015 |
Jiri Benc <jbenc@redhat.com> |
netlink: implement nla_get_in_addr and nla_get_in6_addr Those are counterparts to nla_put_in_addr and nla_put_in6_addr. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
930345ea |
|
29-Mar-2015 |
Jiri Benc <jbenc@redhat.com> |
netlink: implement nla_put_in_addr and nla_put_in6_addr IP addresses are often stored in netlink attributes. Add generic functions to do that. For nla_put_in_addr, it would be nicer to pass struct in_addr but this is not used universally throughout the kernel, in way too many places __be32 is used to store IPv4 address. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b6a7719a |
|
25-Mar-2015 |
Hannes Frederic Sowa <hannes@stressinduktion.org> |
ipv4: hash net ptr into fragmentation bucket selection As namespaces are sometimes used with overlapping ip address ranges, we should also use the namespace as input to the hash to select the ip fragmentation counter bucket. Cc: Eric Dumazet <edumazet@google.com> Cc: Flavio Leitner <fbl@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ddb3b603 |
|
09-Mar-2015 |
Eric W. Biederman <ebiederm@xmission.com> |
net: Remove protocol from struct dst_ops After my change to neigh_hh_init to obtain the protocol from the neigh_table there are no more users of protocol in struct dst_ops. Remove the protocol field from dst_ops and all of it's initializers. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3cdaa5be |
|
29-Jan-2015 |
Li Wei <lw@cn.fujitsu.com> |
ipv4: Don't increase PMTU with Datagram Too Big message. RFC 1191 said, "a host MUST not increase its estimate of the Path MTU in response to the contents of a Datagram Too Big message." Signed-off-by: Li Wei <lw@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
df4d9254 |
|
22-Jan-2015 |
Hannes Frederic Sowa <hannes@stressinduktion.org> |
ipv4: try to cache dst_entries which would cause a redirect Not caching dst_entries which cause redirects could be exploited by hosts on the same subnet, causing a severe DoS attack. This effect aggravated since commit f88649721268999 ("ipv4: fix dst race in sk_dst_get()"). Lookups causing redirects will be allocated with DST_NOCACHE set which will force dst_release to free them via RCU. Unfortunately waiting for RCU grace period just takes too long, we can end up with >1M dst_entries waiting to be released and the system will run OOM. rcuos threads cannot catch up under high softirq load. Attaching the flag to emit a redirect later on to the specific skb allows us to cache those dst_entries thus reducing the pressure on allocation and deallocation. This issue was discovered by Marcelo Leitner. Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: Marcelo Leitner <mleitner@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7b46a644 |
|
18-Jan-2015 |
David S. Miller <davem@davemloft.net> |
netlink: Fix bugs in nlmsg_end() conversions. Commit 053c095a82cf ("netlink: make nlmsg_end() and genlmsg_end() void") didn't catch all of the cases where callers were breaking out on the return value being equal to zero, which they no longer should when zero means success. Fix all such cases. Reported-by: Marcel Holtmann <marcel@holtmann.org> Reported-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
053c095a |
|
16-Jan-2015 |
Johannes Berg <johannes.berg@intel.com> |
netlink: make nlmsg_end() and genlmsg_end() void Contrary to common expectations for an "int" return, these functions return only a positive value -- if used correctly they cannot even return 0 because the message header will necessarily be in the skb. This makes the very common pattern of if (genlmsg_end(...) < 0) { ... } be a whole bunch of dead code. Many places also simply do return nlmsg_end(...); and the caller is expected to deal with it. This also commonly (at least for me) causes errors, because it is very common to write if (my_function(...)) /* error condition */ and if my_function() does "return nlmsg_end()" this is of course wrong. Additionally, there's not a single place in the kernel that actually needs the message length returned, and if anyone needs it later then it'll be very easy to just use skb->len there. Remove this, and make the functions void. This removes a bunch of dead code as described above. The patch adds lines because I did - return nlmsg_end(...); + nlmsg_end(...); + return 0; I could have preserved all the function's return values by returning skb->len, but instead I've audited all the places calling the affected functions and found that none cared. A few places actually compared the return value with <= 0 in dump functionality, but that could just be changed to < 0 with no change in behaviour, so I opted for the more efficient version. One instance of the error I've made numerous times now is also present in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't check for <0 or <=0 and thus broke out of the loop every single time. I've preserved this since it will (I think) have caused the messages to userspace to be formatted differently with just a single message for every SKB returned to userspace. It's possible that this isn't needed for the tools that actually use this, but I don't even know what they are so couldn't test that changing this behaviour would be acceptable. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5055c371 |
|
14-Jan-2015 |
Eric Dumazet <edumazet@google.com> |
ipv4: per cpu uncached list RAW sockets with hdrinc suffer from contention on rt_uncached_lock spinlock. One solution is to use percpu lists, since most routes are destroyed by the cpu that created them. It is unclear why we even have to put these routes in uncached_list, as all outgoing packets should be freed when a device is dismantled. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: caacf05e5ad1 ("ipv4: Properly purge netdev references on uncached routes.") Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fa19c2b0 |
|
30-Oct-2014 |
Nicolas Cavallari <nicolas.cavallari@green-communications.fr> |
ipv4: Do not cache routing failures due to disabled forwarding. If we cache them, the kernel will reuse them, independently of whether forwarding is enabled or not. Which means that if forwarding is disabled on the input interface where the first routing request comes from, then that unreachable result will be cached and reused for other interfaces, even if forwarding is enabled on them. The opposite is also true. This can be verified with two interfaces A and B and an output interface C, where B has forwarding enabled, but not A and trying ip route get $dst iif A from $src && ip route get $dst iif B from $src Signed-off-by: Nicolas Cavallari <nicolas.cavallari@green-communications.fr> Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2c1a4311 |
|
24-Sep-2014 |
WANG Cong <xiyou.wangcong@gmail.com> |
neigh: check error pointer instead of NULL for ipv4_neigh_lookup() Fixes: commit f187bc6efb7250afee0e2009b6106 ("ipv4: No need to set generic neighbour pointer") Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f92ee619 |
|
16-Sep-2014 |
Steffen Klassert <steffen.klassert@secunet.com> |
xfrm: Generate blackhole routes only from route lookup functions Currently we genarate a blackhole route route whenever we have matching policies but can not resolve the states. Here we assume that dst_output() is called to kill the balckholed packets. Unfortunately this assumption is not true in all cases, so it is possible that these packets leave the system unwanted. We fix this by generating blackhole routes only from the route lookup functions, here we can guarantee a call to dst_output() afterwards. Fixes: 2774c131b1d ("xfrm: Handle blackhole route creation via afinfo.") Reported-by: Konstantinos Kolelis <k.kolelis@sirrix.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
#
d546c621 |
|
04-Sep-2014 |
Eric Dumazet <edumazet@google.com> |
ipv4: harden fnhe_hashfun() Lets make this hash function a bit secure, as ICMP attacks are still in the wild. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
caa41527 |
|
03-Sep-2014 |
Eric Dumazet <edumazet@google.com> |
ipv4: fix a race in update_or_create_fnhe() nh_exceptions is effectively used under rcu, but lacks proper barriers. Between kzalloc() and setting of nh->nh_exceptions(), we need a proper memory barrier. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 4895c771c7f00 ("ipv4: Add FIB nexthop exceptions.") Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
903ceff7 |
|
16-Aug-2014 |
Christoph Lameter <cl@linux.com> |
net: Replace get_cpu_var through this_cpu_ptr Replace uses of get_cpu_var for address calculation through this_cpu_ptr. Cc: netdev@vger.kernel.org Cc: Eric Dumazet <edumazet@google.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>
|
#
b7a71b51 |
|
08-Aug-2014 |
Niv Yehezkel <executerx@gmail.com> |
ipv4: removed redundant conditional Since fib_lookup cannot return ESRCH no longer, checking for this error code is no longer neccesary. Signed-off-by: Niv Yehezkel <executerx@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
04ca6973 |
|
26-Jul-2014 |
Eric Dumazet <edumazet@google.com> |
ip: make IP identifiers less predictable In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and Jedidiah describe ways exploiting linux IP identifier generation to infer whether two machines are exchanging packets. With commit 73f156a6e8c1 ("inetpeer: get rid of ip_id_count"), we changed IP id generation, but this does not really prevent this side-channel technique. This patch adds a random amount of perturbation so that IP identifiers for a given destination [1] are no longer monotonically increasing after an idle period. Note that prandom_u32_max(1) returns 0, so if generator is used at most once per jiffy, this patch inserts no hole in the ID suite and do not increase collision probability. This is jiffies based, so in the worst case (HZ=1000), the id can rollover after ~65 seconds of idle time, which should be fine. We also change the hash used in __ip_select_ident() to not only hash on daddr, but also saddr and protocol, so that ICMP probes can not be used to infer information for other protocols. For IPv6, adds saddr into the hash as well, but not nexthdr. If I ping the patched target, we can see ID are now hard to predict. 21:57:11.008086 IP (...) A > target: ICMP echo request, seq 1, length 64 21:57:11.010752 IP (... id 2081 ...) target > A: ICMP echo reply, seq 1, length 64 21:57:12.013133 IP (...) A > target: ICMP echo request, seq 2, length 64 21:57:12.015737 IP (... id 3039 ...) target > A: ICMP echo reply, seq 2, length 64 21:57:13.016580 IP (...) A > target: ICMP echo request, seq 3, length 64 21:57:13.019251 IP (... id 3437 ...) target > A: ICMP echo reply, seq 3, length 64 [1] TCP sessions uses a per flow ID generator not changed by this patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jeffrey Knockel <jeffk@cs.unm.edu> Reported-by: Jedidiah R. Crandall <crandall@cs.unm.edu> Cc: Willy Tarreau <w@1wt.eu> Cc: Hannes Frederic Sowa <hannes@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7f502361 |
|
30-Jun-2014 |
Eric Dumazet <edumazet@google.com> |
ipv4: irq safe sk_dst_[re]set() and ipv4_sk_update_pmtu() fix We have two different ways to handle changes to sk->sk_dst First way (used by TCP) assumes socket lock is owned by caller, and use no extra lock : __sk_dst_set() & __sk_dst_reset() Another way (used by UDP) uses sk_dst_lock because socket lock is not always taken. Note that sk_dst_lock is not softirq safe. These ways are not inter changeable for a given socket type. ipv4_sk_update_pmtu(), added in linux-3.8, added a race, as it used the socket lock as synchronization, but users might be UDP sockets. Instead of converting sk_dst_lock to a softirq safe version, use xchg() as we did for sk_rx_dst in commit e47eb5dfb296b ("udp: ipv4: do not use sk_dst_lock from softirq context") In a follow up patch, we probably can remove sk_dst_lock, as it is only used in IPv6. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Fixes: 9cb3a50c5f63e ("ipv4: Invalidate the socket cached route on pmtu events if possible") Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
73f156a6 |
|
02-Jun-2014 |
Eric Dumazet <edumazet@google.com> |
inetpeer: get rid of ip_id_count Ideally, we would need to generate IP ID using a per destination IP generator. linux kernels used inet_peer cache for this purpose, but this had a huge cost on servers disabling MTU discovery. 1) each inet_peer struct consumes 192 bytes 2) inetpeer cache uses a binary tree of inet_peer structs, with a nominal size of ~66000 elements under load. 3) lookups in this tree are hitting a lot of cache lines, as tree depth is about 20. 4) If server deals with many tcp flows, we have a high probability of not finding the inet_peer, allocating a fresh one, inserting it in the tree with same initial ip_id_count, (cf secure_ip_id()) 5) We garbage collect inet_peer aggressively. IP ID generation do not have to be 'perfect' Goal is trying to avoid duplicates in a short period of time, so that reassembly units have a chance to complete reassembly of fragments belonging to one message before receiving other fragments with a recycled ID. We simply use an array of generators, and a Jenkin hash using the dst IP as a key. ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it belongs (it is only used from this file) secure_ip_id() and secure_ipv6_id() no longer are needed. Rename ip_select_ident_more() to ip_select_ident_segs() to avoid unnecessary decrement/increment of the number of segments. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fbdc0ad0 |
|
22-May-2014 |
Li RongQing <roy.qing.li@gmail.com> |
ipv4: initialise the itag variable in __mkroute_input the value of itag is a random value from stack, and may not be initiated by fib_validate_source, which called fib_combine_itag if CONFIG_IP_ROUTE_CLASSID is not set This will make the cached dst uncertainty Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1b3c61dc |
|
13-May-2014 |
Lorenzo Colitti <lorenzo@google.com> |
net: Use fwmark reflection in PMTU discovery. Currently, routing lookups used for Path PMTU Discovery in absence of a socket or on unmarked sockets use a mark of 0. This causes PMTUD not to work when using routing based on netfilter fwmark mangling and fwmark ip rules, such as: iptables -j MARK --set-mark 17 ip rule add fwmark 17 lookup 100 This patch causes these route lookups to use the fwmark from the received ICMP error when the fwmark_reflect sysctl is enabled. This allows the administrator to make PMTUD work by configuring appropriate fwmark rules to mark the inbound ICMP packets. Black-box tested using user-mode linux by pointing different fwmarks at routing tables egressing on different interfaces, and using iptables mangling to mark packets inbound on each interface with the interface's fwmark. ICMPv4 and ICMPv6 PMTU discovery work as expected when mark reflection is enabled and fail when it is disabled. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0d5edc68 |
|
15-Apr-2014 |
Cong Wang <cwang@twopensource.com> |
ipv4, route: pass 0 instead of LOOPBACK_IFINDEX to fib_validate_source() In my special case, when a packet is redirected from veth0 to lo, its skb->dev->ifindex would be LOOPBACK_IFINDEX. Meanwhile we pass the hard-coded LOOPBACK_IFINDEX to fib_validate_source() in ip_route_input_slow(). This would cause the following check in fib_validate_source() fail: (dev->ifindex != oif || !IN_DEV_TX_REDIRECTS(idev)) when rp_filter is disabeld on loopback. As suggested by Julian, the caller should pass 0 here so that we will not end up by calling __fib_validate_source(). Cc: Eric Biederman <ebiederm@xmission.com> Cc: Julian Anastasov <ja@ssi.bg> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
aad88724 |
|
15-Apr-2014 |
Eric Dumazet <edumazet@google.com> |
ipv4: add a sock pointer to dst->output() path. In the dst->output() path for ipv4, the code assumes the skb it has to transmit is attached to an inet socket, specifically via ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the provider of the packet is an AF_PACKET socket. The dst->output() method gets an additional 'struct sock *sk' parameter. This needs a cascade of changes so that this parameter can be propagated from vxlan to final consumer. Fixes: 8f646c922d55 ("vxlan: keep original skb ownership") Reported-by: lucien xin <lucien.xin@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
91146153 |
|
13-Apr-2014 |
Julian Anastasov <ja@ssi.bg> |
ipv4: return valid RTA_IIF on ip route get Extend commit 13378cad02afc2adc6c0e07fca03903c7ada0b37 ("ipv4: Change rt->rt_iif encoding.") from 3.6 to return valid RTA_IIF on 'ip route get ... iif DEVICE' instead of rt_iif 0 which is displayed as 'iif *'. inet_iif is not appropriate to use because skb_iif is not set. Use the skb->dev->ifindex instead. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3ed66e91 |
|
07-Apr-2014 |
Christoph Lameter <cl@linux.com> |
net: replace __this_cpu_inc in route.c with raw_cpu_inc The RT_CACHE_STAT_INC macro triggers the new preemption checks for __this_cpu ops. I do not see any other synchronization that would allow the use of a __this_cpu operation here however in commit dbd2915ce87e ("[IPV4]: RT_CACHE_STAT_INC() warning fix") Andrew justifies the use of raw_smp_processor_id() here because "we do not care" about races. In the past we agreed that the price of disabling interrupts here to get consistent counters would be too high. These counters may be inaccurate due to race conditions. The use of __this_cpu op improves the situation already from what commit dbd2915ce87e did since the single instruction emitted on x86 does not allow the race to occur anymore. However, non x86 platforms could still experience a race here. Trace: __this_cpu_add operation in preemptible [00000000] code: avahi-daemon/1193 caller is __this_cpu_preempt_check+0x38/0x60 CPU: 1 PID: 1193 Comm: avahi-daemon Tainted: GF 3.12.0-rc4+ #187 Call Trace: check_preemption_disabled+0xec/0x110 __this_cpu_preempt_check+0x38/0x60 __ip_route_output_key+0x575/0x8c0 ip_route_output_flow+0x27/0x70 udp_sendmsg+0x825/0xa20 inet_sendmsg+0x85/0xc0 sock_sendmsg+0x9c/0xd0 ___sys_sendmsg+0x37c/0x390 __sys_sendmsg+0x49/0x90 SyS_sendmsg+0x12/0x20 tracesys+0xe1/0xe6 Signed-off-by: Christoph Lameter <cl@linux.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0b8c7f6f |
|
20-Mar-2014 |
Li RongQing <roy.qing.li@gmail.com> |
ipv4: remove ip_rt_dump from route.c ip_rt_dump do nothing after IPv4 route caches removal, so we can remove it. Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4a4eb21f |
|
23-Mar-2014 |
Li RongQing <roy.qing.li@gmail.com> |
ipv4: remove ipv4_ifdown_dst from route.c ipv4_ifdown_dst does nothing after IPv4 route caches removal, so we can remove it. Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a6254864 |
|
17-Feb-2014 |
Duan Jiong <duanj.fnst@cn.fujitsu.com> |
ipv4: fix counter in_slow_tot since commit 89aef8921bf("ipv4: Delete routing cache."), the counter in_slow_tot can't work correctly. The counter in_slow_tot increase by one when fib_lookup() return successfully in ip_route_input_slow(), but actually the dst struct maybe not be created and cached, so we can increase in_slow_tot after the dst struct is created. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cd0f0b95 |
|
14-Feb-2014 |
Duan Jiong <duanj.fnst@cn.fujitsu.com> |
ipv4: distinguish EHOSTUNREACH from the ENETUNREACH since commit 251da413("ipv4: Cache ip_error() routes even when not forwarding."), the counter IPSTATS_MIB_INADDRERRORS can't work correctly, because the value of err was always set to ENETUNREACH. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2045ceae |
|
12-Feb-2014 |
stephen hemminger <stephen@networkplumber.org> |
net: remove unnecessary return's One of my pet coding style peeves is the practice of adding extra return; at the end of function. Kill several instances of this in network code. I suppose some coccinelle wizardy could do this automatically. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f87c10a8 |
|
09-Jan-2014 |
Hannes Frederic Sowa <hannes@stressinduktion.org> |
ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing While forwarding we should not use the protocol path mtu to calculate the mtu for a forwarded packet but instead use the interface mtu. We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was introduced for multicast forwarding. But as it does not conflict with our usage in unicast code path it is perfect for reuse. I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular dependencies because of IPSKB_FORWARDED. Because someone might have written a software which does probe destinations manually and expects the kernel to honour those path mtus I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone can disable this new behaviour. We also still use mtus which are locked on a route for forwarding. The reason for this change is, that path mtus information can be injected into the kernel via e.g. icmp_err protocol handler without verification of local sockets. As such, this could cause the IPv4 forwarding path to wrongfully emit fragmentation needed notifications or start to fragment packets along a path. Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED won't be set and further fragmentation logic will use the path mtu to determine the fragmentation size. They also recheck packet size with help of path mtu discovery and report appropriate errors. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: John Heffner <johnwheffner@gmail.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dcdfdf56 |
|
19-Nov-2013 |
Alexei Starovoitov <ast@kernel.org> |
ipv4: fix race in concurrent ip_route_input_slow() CPUs can ask for local route via ip_route_input_noref() concurrently. if nh_rth_input is not cached yet, CPUs will proceed to allocate equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input via rt_cache_route() Most of the time they succeed, but on occasion the following two lines: orig = *p; prev = cmpxchg(p, orig, rt); in rt_cache_route() do race and one of the cpus fails to complete cmpxchg. But ip_route_input_slow() doesn't check the return code of rt_cache_route(), so dst is leaking. dst_destroy() is never called and 'lo' device refcnt doesn't go to zero, which can be seen in the logs as: unregister_netdevice: waiting for lo to become free. Usage count = 1 Adding mdelay() between above two lines makes it easily reproducible. Fix it similar to nh_pcpu_rth_output case. Fixes: d2d68ba9fe8b ("ipv4: Cache input routes in fib_info nexthops.") Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
482fc609 |
|
04-Nov-2013 |
Hannes Frederic Sowa <hannes@stressinduktion.org> |
ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery, their sockets won't accept and install new path mtu information and they will always use the interface mtu for outgoing packets. It is guaranteed that the packet is not fragmented locally. But we won't set the DF-Flag on the outgoing frames. Florian Weimer had the idea to use this flag to ensure DNS servers are never generating outgoing fragments. They may well be fragmented on the path, but the server never stores or usees path mtu values, which could well be forged in an attack. (The root of the problem with path MTU discovery is that there is no reliable way to authenticate ICMP Fragmentation Needed But DF Set messages because they are sent from intermediate routers with their source addresses, and the IMCP payload will not always contain sufficient information to identify a flow.) Recent research in the DNS community showed that it is possible to implement an attack where DNS cache poisoning is feasible by spoofing fragments. This work was done by Amir Herzberg and Haya Shulman: <https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf> This issue was previously discussed among the DNS community, e.g. <http://www.ietf.org/mail-archive/web/dnsext/current/msg01204.html>, without leading to fixes. This patch depends on the patch "ipv4: fix DO and PROBE pmtu mode regarding local fragmentation with UFO/CORK" for the enforcement of the non-fragmentable checks. If other users than ip_append_page/data should use this semantic too, we have to add a new flag to IPCB(skb)->flags to suppress local fragmentation and check for this in ip_finish_output. Many thanks to Florian Weimer for the idea and feedback while implementing this patch. Cc: David S. Miller <davem@davemloft.net> Suggested-by: Florian Weimer <fweimer@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0baf2b35 |
|
16-Oct-2013 |
Eric Dumazet <edumazet@google.com> |
ipv4: shrink rt_cache_stat Half of the rt_cache_stat fields are no longer used after IP route cache removal, lets shrink this per cpu area. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0a7e2260 |
|
04-Oct-2013 |
Jiri Benc <jbenc@redhat.com> |
ipv4: fix ineffective source address selection When sending out multicast messages, the source address in inet->mc_addr is ignored and rewritten by an autoselected one. This is caused by a typo in commit 813b3b5db831 ("ipv4: Use caller's on-stack flowi as-is in output route lookups"). Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
734d2725 |
|
18-Aug-2013 |
Eric Dumazet <edumazet@google.com> |
ipv4: raise IP_MAX_MTU to theoretical limit As discussed last year [1], there is no compelling reason to limit IPv4 MTU to 0xFFF0, while real limit is 0xFFFF [1] : http://marc.info/?l=linux-netdev&m=135607247609434&w=2 Willem raised this issue again because some of our internal regression tests broke after lo mtu being set to 65536. IP_MTU reports 0xFFF0, and the test attempts to send a RAW datagram of mtu + 1 bytes, expecting the send() to fail, but it does not. Alexey raised interesting points about TCP MSS, that should be addressed in follow-up patches in TCP stack if needed, as someone could also set an odd mtu anyway. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ca4c3fc2 |
|
29-Jul-2013 |
fan.du <fan.du@windriver.com> |
net: split rt_genid for ipv4 and ipv6 Current net name space has only one genid for both IPv4 and IPv6, it has below drawbacks: - Add/delete an IPv4 address will invalidate all IPv6 routing table entries. - Insert/remove XFRM policy will also invalidate both IPv4/IPv6 routing table entries even when the policy is only applied for one address family. Thus, this patch attempt to split one genid for two to cater for IPv4 and IPv6 separately in a fine granularity. Signed-off-by: Fan Du <fan.du@windriver.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2ffae99d |
|
27-Jun-2013 |
Timo Teräs <timo.teras@iki.fi> |
ipv4: use next hop exceptions also for input routes Commit d2d68ba9 (ipv4: Cache input routes in fib_info nexthops) assmued that "locally destined, and routed packets, never trigger PMTU events or redirects that will be processed by us". However, it seems that tunnel devices do trigger PMTU events in certain cases. At least ip_gre, ip6_gre, sit, and ipip do use the inner flow's skb_dst(skb)->ops->update_pmtu to propage mtu information from the outer flows. These can cause the inner flow mtu to be decreased. If next hop exceptions are not consulted for pmtu, IP fragmentation will not be done properly for these routes. It also seems that we really need to have the PMTU information always for netfilter TCPMSS clamp-to-pmtu feature to work properly. So for the time being, cache separate copies of input routes for each next hop exception. Signed-off-by: Timo Teräs <timo.teras@iki.fi> Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fe2c6338 |
|
12-Jun-2013 |
Joe Perches <joe@perches.com> |
net: Convert uses of typedef ctl_table to struct ctl_table Reduce the uses of this unnecessary typedef. Done via perl script: $ git grep --name-only -w ctl_table net | \ xargs perl -p -i -e '\ sub trim { my ($local) = @_; $local =~ s/(^\s+|\s+$)//g; return $local; } \ s/\b(?<!struct\s)ctl_table\b(\s*\*\s*|\s+\w+)/"struct ctl_table " . trim($1)/ge' Reflow the modified lines that now exceed 80 columns. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5aad1de5 |
|
27-May-2013 |
Timo Teräs <timo.teras@iki.fi> |
ipv4: use separate genid for next hop exceptions commit 13d82bf5 (ipv4: Fix flushing of cached routing informations) added the support to flush learned pmtu information. However, using rt_genid is quite heavy as it is bumped on route add/change and multicast events amongst other places. These can happen quite often, especially if using dynamic routing protocols. While this is ok with routes (as they are just recreated locally), the pmtu information is learned from remote systems and the icmp notification can come with long delays. It is worthy to have separate genid to avoid excessive pmtu resets. Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Timo Teräs <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f016229e |
|
27-May-2013 |
Timo Teräs <timo.teras@iki.fi> |
ipv4: rate limit updating of next hop exceptions with same pmtu The tunnel devices call update_pmtu for each packet sent, this causes contention on the fnhe_lock. Ignore the pmtu update if pmtu is not actually changed, and there is still plenty of time before the entry expires. Signed-off-by: Timo Teräs <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
387aa65a |
|
27-May-2013 |
Timo Teräs <timo.teras@iki.fi> |
ipv4: properly refresh rtable entries on pmtu/redirect events This reverts commit 05ab86c5 (xfrm4: Invalidate all ipv4 routes on IPsec pmtu events). Flushing all cached entries is not needed. Instead, invalidate only the related next hop dsts to recheck for the added next hop exception where needed. This also fixes a subtle race due to bumping generation id's before updating the pmtu. Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Timo Teräs <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f96ef988 |
|
28-May-2013 |
Michal Kubecek <mkubecek@suse.cz> |
ipv4: fix redirect handling for TCP packets Unlike ipv4_redirect() and ipv4_sk_redirect(), ip_do_redirect() doesn't call __build_flow_key() directly but via ip_rt_build_flow_key() wrapper. This leads to __build_flow_key() getting pointer to IPv4 header of the ICMP redirect packet rather than pointer to the embedded IPv4 header of the packet initiating the redirect. As a result, handling of ICMP redirects initiated by TCP packets is broken. Issue was introduced by 4895c771c ("ipv4: Add FIB nexthop exceptions.") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
661d2967 |
|
21-Mar-2013 |
Thomas Graf <tgraf@suug.ch> |
rtnetlink: Remove passing of attributes into rtnl_doit functions With decnet converted, we can finally get rid of rta_buf and its computations around it. It also gets rid of the minimal header length verification since all message handlers do that explicitly anyway. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
082c7ca4 |
|
18-Feb-2013 |
Gao feng <gaofeng@cn.fujitsu.com> |
net: ipv4: fix waring -Wunused-variable the vars ip_rt_gc_timeout is used only when CONFIG_SYSCTL is selected. move these vars into CONFIG_SYSCTL. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d4beaa66 |
|
17-Feb-2013 |
Gao feng <gaofeng@cn.fujitsu.com> |
net: proc: change proc_net_fops_create to proc_create Right now, some modules such as bonding use proc_create to create proc entries under /proc/net/, and other modules such as ipv4 use proc_net_fops_create. It looks a little chaos.this patch changes all of proc_net_fops_create to proc_create. we can remove proc_net_fops_create after this patch. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b44108db |
|
21-Jan-2013 |
Steffen Klassert <steffen.klassert@secunet.com> |
ipv4: Fix route refcount on pmtu discovery git commit 9cb3a50c (ipv4: Invalidate the socket cached route on pmtu events if possible) introduced a refcount problem. We don't get a refcount on the route if we get it from__sk_dst_get(), but we need one if we want to reuse this route because __sk_dst_set() releases the refcount of the old route. This patch adds proper refcount handling for that case. We introduce a 'new' flag to indicate that we are going to use a new route and we release the old route only if we replace it by a new one. Reported-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9cb3a50c |
|
20-Jan-2013 |
Steffen Klassert <steffen.klassert@secunet.com> |
ipv4: Invalidate the socket cached route on pmtu events if possible The route lookup in ipv4_sk_update_pmtu() might return a route different from the route we cached at the socket. This is because standart routes are per cpu, so each cpu has it's own struct rtable. This means that we do not invalidate the socket cached route if the NET_RX_SOFTIRQ is not served by the same cpu that the sending socket uses. As a result, the cached route reused until we disconnect. With this patch we invalidate the socket cached route if possible. If the socket is owened by the user, we can't update the cached route directly. A followup patch will implement socket release callback functions for datagram sockets to handle this case. Reported-by: Yurij M. Plotnikov <Yurij.Plotnikov@oktetlabs.ru> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fa1e492a |
|
16-Jan-2013 |
Steffen Klassert <steffen.klassert@secunet.com> |
ipv4: Don't update the pmtu on mtu locked routes Routes with locked mtu should not use learned pmtu informations, so do not update the pmtu on these routes. Reported-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
38d523e2 |
|
16-Jan-2013 |
Steffen Klassert <steffen.klassert@secunet.com> |
ipv4: Remove output route check in ipv4_mtu The output route check was introduced with git commit 261663b0 (ipv4: Don't use the cached pmtu informations for input routes) during times when we cached the pmtu informations on the inetpeer. Now the pmtu informations are back in the routes, so this check is obsolete. It also had some unwanted side effects, as reported by Timo Teras and Lukas Tribus. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Timo Teräs <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8caaf7b6 |
|
03-Dec-2012 |
Nicolas Dichtel <nicolas.dichtel@6wind.com> |
ipv4/route/rtnl: get mcast attributes when dst is multicast Commit f1ce3062c538 (ipv4: Remove 'rt_dst' from 'struct rtable') removes the call to ipmr_get_route(), which will get multicast parameters of the route. I revert the part of the patch that remove this call. I think the goal was only to get rid of rt_dst field. The patch is only compiled-tested. My first idea was to remove ipmr_get_route() because rt_fill_info() was the only user, but it seems the previous patch cleans the code a bit too much ;-) Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
63617421 |
|
22-Nov-2012 |
Julian Anastasov <ja@ssi.bg> |
ipv4: do not cache looped multicasts Starting from 3.6 we cache output routes for multicasts only when using route to 224/4. For local receivers we can set RTCF_LOCAL flag depending on the membership but in such case we use maddr and saddr which are not caching keys as before. Additionally, we can not use same place to cache routes that differ in RTCF_LOCAL flag value. Fix it by caching only RTCF_MULTICAST entries without RTCF_LOCAL (send-only, no loopback). As a side effect, we avoid unneeded lookup for fnhe when not caching because multicasts are not redirected and they do not learn PMTU. Thanks to Maxime Bizon for showing the caching problems in __mkroute_output for 3.6 kernels: different RTCF_LOCAL flag in cache can lead to wrong ip_mc_output or ip_output call and the visible problem is that traffic can not reach local receivers via loopback. Reported-by: Maxime Bizon <mbizon@freebox.fr> Tested-by: Maxime Bizon <mbizon@freebox.fr> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
464dc801 |
|
15-Nov-2012 |
Eric W. Biederman <ebiederm@xmission.com> |
net: Don't export sysctls to unprivileged users In preparation for supporting the creation of network namespaces by unprivileged users, modify all of the per net sysctl exports and refuse to allow them to unprivileged users. This makes it safe for unprivileged users in general to access per net sysctls, and allows sysctls to be exported to unprivileged users on an individual basis as they are deemed safe. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
703fb94e |
|
13-Nov-2012 |
Steffen Klassert <steffen.klassert@secunet.com> |
xfrm: Fix the gc threshold value for ipv4 The xfrm gc threshold value depends on ip_rt_max_size. This value was set to INT_MAX with the routing cache removal patch, so we start doing garbage collecting when we have INT_MAX/2 IPsec routes cached. Fix this by going back to the static threshold of 1024 routes. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
#
13d82bf5 |
|
17-Oct-2012 |
Steffen Klassert <steffen.klassert@secunet.com> |
ipv4: Fix flushing of cached routing informations Currently we can not flush cached pmtu/redirect informations via the ipv4_sysctl_rtcache_flush sysctl. We need to check the rt_genid of the old route and reset the nh exeption if the old route is expired when we bind a new route to a nh exeption. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
68aaed54 |
|
10-Oct-2012 |
stephen hemminger <shemminger@vyatta.com> |
ipv4: fix route mark sparse warning Sparse complains about RTA_MARK which is should be host order according to include file and usage in iproute. net/ipv4/route.c:2223:46: warning: incorrect type in argument 3 (different base types) net/ipv4/route.c:2223:46: expected restricted __be32 [usertype] value net/ipv4/route.c:2223:46: got unsigned int [unsigned] [usertype] flowic_mark Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c92b9655 |
|
08-Oct-2012 |
Julian Anastasov <ja@ssi.bg> |
ipv4: Add FLOWI_FLAG_KNOWN_NH Add flag to request that output route should be returned with known rt_gateway, in case we want to use it as nexthop for neighbour resolving. The returned route can be cached as follows: - in NH exception: because the cached routes are not shared with other destinations - in FIB NH: when using gateway because all destinations for NH share same gateway As last option, to return rt_gateway!=0 we have to set DST_NOCACHE. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
155e8336 |
|
08-Oct-2012 |
Julian Anastasov <ja@ssi.bg> |
ipv4: introduce rt_uses_gateway Add new flag to remember when route is via gateway. We will use it to allow rt_gateway to contain address of directly connected host for the cases when DST_NOCACHE is used or when the NH exception caches per-destination route without DST_NOCACHE flag, i.e. when routes are not used for other destinations. By this way we force the neighbour resolving to work with the routed destination but we can use different address in the packet, feature needed for IPVS-DR where original packet for virtual IP is routed via route to real IP. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f8a17175 |
|
08-Oct-2012 |
Julian Anastasov <ja@ssi.bg> |
ipv4: make sure nh_pcpu_rth_output is always allocated Avoid checking nh_pcpu_rth_output in fast path, abort fib_info creation on alloc_percpu failure. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e81da0e1 |
|
08-Oct-2012 |
Julian Anastasov <ja@ssi.bg> |
ipv4: fix sending of redirects After "Cache input routes in fib_info nexthops" (commit d2d68ba9fe) and "Elide fib_validate_source() completely when possible" (commit 7a9bc9b81a) we can not send ICMP redirects. It seems we should not cache the RTCF_DOREDIRECT flag in nh_rth_input because the same fib_info can be used for traffic that is not redirected, eg. from other input devices or from sources that are not in same subnet. As result, we have to disable the caching of RTCF_DOREDIRECT flag and to force source validation for the case when forwarding traffic to the input device. If traffic comes from directly connected source we allow redirection as it was done before both changes. Avoid setting RTCF_DOREDIRECT if IN_DEV_TX_REDIRECTS is disabled, this can avoid source address validation and to help caching the routes. After the change "Adjust semantics of rt->rt_gateway" (commit f8126f1d51) we should make sure our ICMP_REDIR_HOST messages contain daddr instead of 0.0.0.0 when target is directly connected. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ee9a8f7a |
|
07-Oct-2012 |
Steffen Klassert <steffen.klassert@secunet.com> |
ipv4: Don't report stale pmtu values to userspace We report cached pmtu values even if they are already expired. Change this to not report these values after they are expired and fix a race in the expire time calculation, as suggested by Eric Dumazet. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7f92d334 |
|
07-Oct-2012 |
Steffen Klassert <steffen.klassert@secunet.com> |
ipv4: Don't create nh exeption when the device mtu is smaller than the reported pmtu When a local tool like tracepath tries to send packets bigger than the device mtu, we create a nh exeption and set the pmtu to device mtu. The device mtu does not expire, so check if the device mtu is smaller than the reported pmtu and don't crerate a nh exeption in that case. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d851c12b |
|
07-Oct-2012 |
Steffen Klassert <steffen.klassert@secunet.com> |
ipv4: Always invalidate or update the route on pmtu events Some protocols, like IPsec still cache routes. So we need to invalidate the old route on pmtu events to avoid the reuse of stale routes. We also need to update the mtu and expire time of the route if we already use a nh exception route, otherwise we ignore newly learned pmtu values after the first expiration. With this patch we always invalidate or update the route on pmtu events. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b42664f8 |
|
10-Sep-2012 |
Nicolas Dichtel <nicolas.dichtel@6wind.com> |
netns: move net->ipv4.rt_genid to net->rt_genid This commit prepares the use of rt_genid by both IPv4 and IPv6. Initialization is left in IPv4 part. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2885da72 |
|
07-Sep-2012 |
Eric Dumazet <edumazet@google.com> |
net: rt_cache_flush() cleanup We dont use jhash anymore since route cache removal, so we can get rid of get_random_bytes() calls for rt_genid changes. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bafa6d9d |
|
06-Sep-2012 |
Nicolas Dichtel <nicolas.dichtel@6wind.com> |
ipv4/route: arg delay is useless in rt_cache_flush() Since route cache deletion (89aef8921bfbac22f), delay is no more used. Remove it. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
15e47304 |
|
07-Sep-2012 |
Eric W. Biederman <ebiederm@xmission.com> |
netlink: Rename pid to portid to avoid confusion It is a frequent mistake to confuse the netlink port identifier with a process identifier. Try to reduce this confusion by renaming fields that hold port identifiers portid instead of pid. I have carefully avoided changing the structures exported to userspace to avoid changing the userspace API. I have successfully built an allyesconfig kernel with this change. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ba8bd0ea |
|
07-Sep-2012 |
Eric Dumazet <edumazet@google.com> |
net: rt_cache_flush() cleanup We dont use jhash anymore since route cache removal, so we can get rid of get_random_bytes() calls for rt_genid changes. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4ccfe6d4 |
|
06-Sep-2012 |
Nicolas Dichtel <nicolas.dichtel@6wind.com> |
ipv4/route: arg delay is useless in rt_cache_flush() Since route cache deletion (89aef8921bfbac22f), delay is no more used. Remove it. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
98d75c37 |
|
27-Aug-2012 |
Alexander Duyck <alexander.h.duyck@intel.com> |
ipv4: Minor logic clean-up in ipv4_mtu In ipv4_mtu there is some logic where we are testing for a non-zero value and a timer expiration, then setting the value to zero, and then testing if the value is zero we set it to a value based on the dst. Instead of bothering with the extra steps it is easier to just cleanup the logic so that we set it to the dst based value if it is zero or if the timer has expired. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
|
#
c5ae7d41 |
|
27-Aug-2012 |
Eric Dumazet <edumazet@google.com> |
ipv4: must use rcu protection while calling fib_lookup Following lockdep splat was reported by Pavel Roskin : [ 1570.586223] =============================== [ 1570.586225] [ INFO: suspicious RCU usage. ] [ 1570.586228] 3.6.0-rc3-wl-main #98 Not tainted [ 1570.586229] ------------------------------- [ 1570.586231] /home/proski/src/linux/net/ipv4/route.c:645 suspicious rcu_dereference_check() usage! [ 1570.586233] [ 1570.586233] other info that might help us debug this: [ 1570.586233] [ 1570.586236] [ 1570.586236] rcu_scheduler_active = 1, debug_locks = 0 [ 1570.586238] 2 locks held by Chrome_IOThread/4467: [ 1570.586240] #0: (slock-AF_INET){+.-...}, at: [<ffffffff814f2c0c>] release_sock+0x2c/0xa0 [ 1570.586253] #1: (fnhe_lock){+.-...}, at: [<ffffffff815302fc>] update_or_create_fnhe+0x2c/0x270 [ 1570.586260] [ 1570.586260] stack backtrace: [ 1570.586263] Pid: 4467, comm: Chrome_IOThread Not tainted 3.6.0-rc3-wl-main #98 [ 1570.586265] Call Trace: [ 1570.586271] [<ffffffff810976ed>] lockdep_rcu_suspicious+0xfd/0x130 [ 1570.586275] [<ffffffff8153042c>] update_or_create_fnhe+0x15c/0x270 [ 1570.586278] [<ffffffff815305b3>] __ip_rt_update_pmtu+0x73/0xb0 [ 1570.586282] [<ffffffff81530619>] ip_rt_update_pmtu+0x29/0x90 [ 1570.586285] [<ffffffff815411dc>] inet_csk_update_pmtu+0x2c/0x80 [ 1570.586290] [<ffffffff81558d1e>] tcp_v4_mtu_reduced+0x2e/0xc0 [ 1570.586293] [<ffffffff81553bc4>] tcp_release_cb+0xa4/0xb0 [ 1570.586296] [<ffffffff814f2c35>] release_sock+0x55/0xa0 [ 1570.586300] [<ffffffff815442ef>] tcp_sendmsg+0x4af/0xf50 [ 1570.586305] [<ffffffff8156fc60>] inet_sendmsg+0x120/0x230 [ 1570.586308] [<ffffffff8156fb40>] ? inet_sk_rebuild_header+0x40/0x40 [ 1570.586312] [<ffffffff814f4bdd>] ? sock_update_classid+0xbd/0x3b0 [ 1570.586315] [<ffffffff814f4c50>] ? sock_update_classid+0x130/0x3b0 [ 1570.586320] [<ffffffff814ec435>] do_sock_write+0xc5/0xe0 [ 1570.586323] [<ffffffff814ec4a3>] sock_aio_write+0x53/0x80 [ 1570.586328] [<ffffffff8114bc83>] do_sync_write+0xa3/0xe0 [ 1570.586332] [<ffffffff8114c5a5>] vfs_write+0x165/0x180 [ 1570.586335] [<ffffffff8114c805>] sys_write+0x45/0x90 [ 1570.586340] [<ffffffff815d2722>] system_call_fastpath+0x16/0x1b Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Pavel Roskin <proski@gnu.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
78df76a0 |
|
23-Aug-2012 |
Eric Dumazet <edumazet@google.com> |
ipv4: take rt_uncached_lock only if needed Multicast traffic allocates dst with DST_NOCACHE, but dst is not inserted into rt_uncached_list. This slowdown multicast workloads on SMP because rt_uncached_lock is contended. Change the test before taking the lock to actually check the dst was inserted into rt_uncached_list. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9b04f350 |
|
21-Aug-2012 |
Eric Dumazet <edumazet@google.com> |
ipv4: properly update pmtu Sylvain Munault reported following info : - TCP connection get "stuck" with data in send queue when doing "large" transfers ( like typing 'ps ax' on a ssh connection ) - Only happens on path where the PMTU is lower than the MTU of the interface - Is not present right after boot, it only appears 10-20min after boot or so. (and that's inside the _same_ TCP connection, it works fine at first and then in the same ssh session, it'll get stuck) - Definitely seems related to fragments somehow since I see a router sending ICMP message saying fragmentation is needed. - Exact same setup works fine with kernel 3.5.1 Problem happens when the 10 minutes (ip_rt_mtu_expires) expiration period is over. ip_rt_update_pmtu() calls dst_set_expires() to rearm a new expiration, but dst_set_expires() does nothing because dst.expires is already set. It seems we want to set the expires field to a new value, regardless of prior one. With help from Julian Anastasov. Reported-by: Sylvain Munaut <s.munaut@whatever-company.com> Signed-off-by: Eric Dumazet <edumazet@google.com> CC: Julian Anastasov <ja@ssi.bg> Tested-by: Sylvain Munaut <s.munaut@whatever-company.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7bd86cc2 |
|
12-Aug-2012 |
Yan, Zheng <zheng.z.yan@intel.com> |
ipv4: Cache local output routes Commit caacf05e5ad1abf causes big drop of UDP loop back performance. The cause of the regression is that we do not cache the local output routes. Each time we send a datagram from unconnected UDP socket, the kernel allocates a dst_entry and adds it to the rt_uncached_list. It creates lock contention on the rt_uncached_lock. Reported-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1fb9489b |
|
08-Aug-2012 |
Pavel Emelyanov <xemul@parallels.com> |
net: Loopback ifindex is constant now As pointed out, there are places, that access net->loopback_dev->ifindex and after ifindex generation is made per-net this value becomes constant equals 1. So go ahead and introduce the LOOPBACK_IFINDEX constant and use it where appropriate. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9eb43e76 |
|
03-Aug-2012 |
Eric Dumazet <edumazet@google.com> |
ipv4: Introduce IN_DEV_NET_ROUTE_LOCALNET performance profiles show a high cost in the IN_DEV_ROUTE_LOCALNET() call done in ip_route_input_slow(), because of multiple dereferences, even if cache lines are clean and available in cpu caches. Since we already have the 'net' pointer, introduce IN_DEV_NET_ROUTE_LOCALNET() macro avoiding two dereferences (dev_net(in_dev->dev)) Also change the tests to use IN_DEV_NET_ROUTE_LOCALNET() only if saddr or/and daddr are loopback addresse. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e33cdac0 |
|
01-Aug-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
ipv4: route.c cleanup Remove unused includes after IP cache removal Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
caacf05e |
|
31-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Properly purge netdev references on uncached routes. When a device is unregistered, we have to purge all of the references to it that may exist in the entire system. If a route is uncached, we currently have no way of accomplishing this. So create a global list that is scanned when a network device goes down. This mirrors the logic in net/core/dst.c's dst_ifdown(). Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c5038a83 |
|
31-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Cache routes in nexthop exception entries. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d26b3a7c |
|
30-Jul-2012 |
Eric Dumazet <edumazet@google.com> |
ipv4: percpu nh_rth_output cache Input path is mostly run under RCU and doesnt touch dst refcnt But output path on forwarding or UDP workloads hits badly dst refcount, and we have lot of false sharing, for example in ipv4_mtu() when reading rt->rt_pmtu Using a percpu cache for nh_rth_output gives a nice performance increase at a small cost. 24 udpflood test on my 24 cpu machine (dummy0 output device) (each process sends 1.000.000 udp frames, 24 processes are started) before : 5.24 s after : 2.06 s For reference, time on linux-3.5 : 6.60 s Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
54764bb6 |
|
30-Jul-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
ipv4: Restore old dst_free() behavior. commit 404e0a8b6a55 (net: ipv4: fix RCU races on dst refcounts) tried to solve a race but added a problem at device/fib dismantle time : We really want to call dst_free() as soon as possible, even if sockets still have dst in their cache. dst_release() calls in free_fib_info_rcu() are not welcomed. Root of the problem was that now we also cache output routes (in nh_rth_output), we must use call_rcu() instead of call_rcu_bh() in rt_free(), because output route lookups are done in process context. Based on feedback and initial patch from David Miller (adding another call_rcu_bh() call in fib, but it appears it was not the right fix) I left the inet_sk_rx_dst_set() helper and added __rcu attributes to nh_rth_output and nh_rth_input to better document what is going on in this code. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
404e0a8b |
|
29-Jul-2012 |
Eric Dumazet <edumazet@google.com> |
net: ipv4: fix RCU races on dst refcounts commit c6cffba4ffa2 (ipv4: Fix input route performance regression.) added various fatal races with dst refcounts. crashes happen on tcp workloads if routes are added/deleted at the same time. The dst_free() calls from free_fib_info_rcu() are clearly racy. We need instead regular dst refcounting (dst_release()) and make sure dst_release() is aware of RCU grace periods : Add DST_RCU_FREE flag so that dst_release() respects an RCU grace period before dst destruction for cached dst Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero() to make sure we dont increase a zero refcount (On a dst currently waiting an rcu grace period before destruction) rt_cache_route() must take a reference on the new cached route, and release it if was not able to install it. With this patch, my machines survive various benchmarks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c6cffba4 |
|
26-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Fix input route performance regression. With the routing cache removal we lost the "noref" code paths on input, and this can kill some routing workloads. Reinstate the noref path when we hit a cached route in the FIB nexthops. With help from Eric Dumazet. Reported-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4331debc |
|
24-Jul-2012 |
Eric Dumazet <edumazet@google.com> |
ipv4: rt_cache_valid must check expired routes commit d2d68ba9fe8 (ipv4: Cache input routes in fib_info nexthops.) introduced rt_cache_valid() helper. It unfortunately doesn't check if route is expired before caching it. I noticed sk_setup_caps() was constantly called on a tcp workload. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
13378cad |
|
23-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Change rt->rt_iif encoding. On input packet processing, rt->rt_iif will be zero if we should use skb->dev->ifindex. Since we access rt->rt_iif consistently via inet_iif(), that is the only spot whose interpretation have to adjust. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
92101b3b |
|
23-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Prepare for change of rt->rt_iif encoding. Use inet_iif() consistently, and for TCP record the input interface of cached RX dst in inet sock. rt->rt_iif is going to be encoded differently, so that we can legitimately cache input routes in the FIB info more aggressively. When the input interface is "use SKB device index" the rt->rt_iif will be set to zero. This forces us to move the TCP RX dst cache installation into the ipv4 specific code, and as well it should since doing the route caching for ipv6 is pointless at the moment since it is not inspected in the ipv6 input paths yet. Also, remove the unlikely on dst->obsolete, all ipv4 dsts have obsolete set to a non-zero value to force invocation of the check callback. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fe3edf45 |
|
23-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Remove all RTCF_DIRECTSRC handliing. The last and final kernel user, ICMP address replies, has been removed. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2860583f |
|
17-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Kill rt->fi It's not really needed. We only grabbed a reference to the fib_info for the sake of fib_info local metrics. However, fib_info objects are freed using RCU, as are therefore their private metrics (if any). We would have triggered a route cache flush if we eliminated a reference to a fib_info object in the routing tables. Therefore, any existing cached routes will first check and see that they have been invalidated before an errant reference to these metric values would occur. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9917e1e8 |
|
17-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Turn rt->rt_route_iif into rt->rt_is_input. That is this value's only use, as a boolean to indicate whether a route is an input route or not. So implement it that way, using a u16 gap present in the struct already. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4fd551d7 |
|
17-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Kill rt->rt_oif Never actually used. It was being set on output routes to the original OIF specified in the flow key used for the lookup. Adjust the only user, ipmr_rt_fib_lookup(), for greater correctness of the flowi4_oif and flowi4_iif values, thanks to feedback from Julian Anastasov. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
93ac5341 |
|
17-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Dirty less cache lines in route caching paths. Don't bother incrementing dst->__use and setting dst->lastuse, they are completely pointless and just slow things down. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ba3f7f04 |
|
17-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Kill FLOWI_FLAG_RT_NOCACHE and associated code. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d2d68ba9 |
|
17-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Cache input routes in fib_info nexthops. Caching input routes is slightly simpler than output routes, since we don't need to be concerned with nexthop exceptions. (locally destined, and routed packets, never trigger PMTU events or redirects that will be processed by us). However, we have to elide caching for the DIRECTSRC and non-zero itag cases. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f2bb4bed |
|
17-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Cache output routes in fib_info nexthops. If we have an output route that lacks nexthop exceptions, we can cache it in the FIB info nexthop. Such routes will have DST_HOST cleared because such routes refer to a family of destinations, rather than just one. The sequence of the handling of exceptions during route lookup is adjusted to make the logic work properly. Before we allocate the route, we lookup the exception. Then we know if we will cache this route or not, and therefore whether DST_HOST should be set on the allocated route. Then we use DST_HOST to key off whether we should store the resulting route, during rt_set_nexthop(), in the FIB nexthop cache. With help from Eric Dumazet. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ceb33206 |
|
17-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Kill routes during PMTU/redirect updates. Mark them obsolete so there will be a re-lookup to fetch the FIB nexthop exception info. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f5b0a874 |
|
19-Jul-2012 |
David S. Miller <davem@davemloft.net> |
net: Document dst->obsolete better. Add a big comment explaining how the field works, and use defines instead of magic constants for the values assigned to it. Suggested by Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f8126f1d |
|
13-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Adjust semantics of rt->rt_gateway. In order to allow prefixed routes, we have to adjust how rt_gateway is set and interpreted. The new interpretation is: 1) rt_gateway == 0, destination is on-link, nexthop is iph->daddr 2) rt_gateway != 0, destination requires a nexthop gateway Abstract the fetching of the proper nexthop value using a new inline helper, rt_nexthop(), as suggested by Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
|
#
f1ce3062 |
|
12-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Remove 'rt_dst' from 'struct rtable' Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b4869889 |
|
30-Jun-2012 |
David Miller <davem@davemloft.net> |
ipv4: Remove 'rt_mark' from 'struct rtable' Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d6c0a4f6 |
|
30-Jun-2012 |
David Miller <davem@davemloft.net> |
ipv4: Kill 'rt_src' from 'struct rtable' Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1a00fee4 |
|
30-Jun-2012 |
David Miller <davem@davemloft.net> |
ipv4: Remove rt_key_{src,dst,tos} from struct rtable. They are always used in contexts where they can be reconstituted, or where the finally resolved rt->rt_{src,dst} is semantically equivalent. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
38a424e4 |
|
30-Jun-2012 |
David Miller <davem@davemloft.net> |
ipv4: Kill ip_route_input_noref(). The "noref" argument to ip_route_input_common() is now always ignored because we do not cache routes, and in that case we must always grab a reference to the resulting 'dst'. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
89aef892 |
|
17-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Delete routing cache. The ipv4 routing cache is non-deterministic, performance wise, and is subject to reasonably easy to launch denial of service attacks. The routing cache works great for well behaved traffic, and the world was a much friendlier place when the tradeoffs that led to the routing cache's design were considered. What it boils down to is that the performance of the routing cache is a product of the traffic patterns seen by a system rather than being a product of the contents of the routing tables. The former of which is controllable by external entitites. Even for "well behaved" legitimate traffic, high volume sites can see hit rates in the routing cache of only ~%10. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
521f5490 |
|
19-Jul-2012 |
Julian Anastasov <ja@ssi.bg> |
ipv4: show pmtu in route list Override the metrics with rt_pmtu Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f31fd383 |
|
19-Jul-2012 |
Julian Anastasov <ja@ssi.bg> |
ipv4: Fix again the time difference calculation Fix again the diff value in rt_bind_exception after collision of two latest patches, my original commit actually fixed the same problem. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
aee06da6 |
|
18-Jul-2012 |
Julian Anastasov <ja@ssi.bg> |
ipv4: use seqlock for nh_exceptions Use global seqlock for the nh_exceptions. Call fnhe_oldest with the right hash chain. Correct the diff value for dst_set_expires. v2: after suggestions from Eric Dumazet: * get rid of spin lock fnhe_lock, rearrange update_or_create_fnhe * continue daddr search in rt_bind_exception v3: * remove the daddr check before seqlock in rt_bind_exception * restart lookup in rt_bind_exception on detected seqlock change, as suggested by David Miller Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7fed84f6 |
|
19-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Fix time difference calculation in rt_bind_exception(). Reported-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5abf7f7e |
|
17-Jul-2012 |
Eric Dumazet <edumazet@google.com> |
ipv4: fix rcu splat free_nh_exceptions() should use rcu_dereference_protected(..., 1) since its called after one RCU grace period. Also add some const-ification in recent code. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d3a25c98 |
|
17-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Fix nexthop exception hash computation. Need to mask it with (FNHE_HASH_SIZE - 1). Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4895c771 |
|
17-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Add FIB nexthop exceptions. In a regime where we have subnetted route entries, we need a way to store persistent storage about destination specific learned values such as redirects and PMTU values. This is implemented here via nexthop exceptions. The initial implementation is a 2048 entry hash table with relaiming starting at chain length 5. A more sophisticated scheme can be devised if that proves necessary. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6700c270 |
|
17-Jul-2012 |
David S. Miller <davem@davemloft.net> |
net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}() This will be used so that we can compose a full flow key. Even though we have a route in this context, we need more. In the future the routes will be without destination address, source address, etc. keying. One ipv4 route will cover entire subnets, etc. In this environment we have to have a way to possess persistent storage for redirects and PMTU information. This persistent storage will exist in the FIB tables, and that's why we'll need to be able to rebuild a full lookup flow key here. Using that flow key will do a fib_lookup() and create/update the persistent entry. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
85b91b03 |
|
13-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Don't store a rule pointer in fib_result. We only use it to fetch the rule's tclassid, so just store the tclassid there instead. This also decreases the size of fib_result by a full 8 bytes on 64-bit. On 32-bits it's a wash. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
99ee038d |
|
12-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Fix warnings in ip_do_redirect() for some configurations. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b587ee3b |
|
12-Jul-2012 |
David S. Miller <davem@davemloft.net> |
net: Add dummy dst_ops->redirect method where needed. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1f42539d |
|
11-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Kill ip_rt_redirect(). No longer needed, as the protocol handlers now all properly propagate the redirect back into the routing code. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b42597e2 |
|
11-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Add ipv4_redirect() and ipv4_sk_redirect() helper functions. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e47a185b |
|
11-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Generalize ip_do_redirect() and hook into new dst_ops->redirect. All of the redirect acceptance policy is now contained within. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
94206125 |
|
11-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Rearrange arguments to ip_rt_redirect() Pass in the SKB rather than just the IP addresses, so that policy and other aspects can reside in ip_rt_redirect() rather then icmp_redirect(). Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d0da720f |
|
11-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Pull redirect instantiation out into a helper function. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f185071d |
|
10-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Remove inetpeer from routes. No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
31248731 |
|
10-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Calling ->cow_metrics() now is a bug. Nothing every writes to ipv4 metrics any longer. PMTU is stored in rt->rt_pmtu. Dynamic TCP metrics are stored in a special TCP metrics cache, completely outside of the routes. Therefore ->cow_metrics() can simply nothing more than a WARN_ON trigger so we can catch anyone who tries to add new writes to ipv4 route metrics. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2db2d67e |
|
10-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Kill dst_copy_metrics() call from ipv4_blackhole_route(). Blackhole routes have a COW metrics operation that returns NULL always, therefore this dst_copy_metrics() call did absolutely nothing. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
710ab6c0 |
|
10-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Enforce max MTU metric at route insertion time. Rather than at every struct rtable creation. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5943634f |
|
10-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Maintain redirect and PMTU info in struct rtable again. Maintaining this in the inetpeer entries was not the right way to do this at all. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
87a50699 |
|
10-Jul-2012 |
David S. Miller <davem@davemloft.net> |
rtnetlink: Remove ts/tsage args to rtnl_put_cacheinfo(). Nobody provides non-zero values any longer. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3e12939a |
|
10-Jul-2012 |
David S. Miller <davem@davemloft.net> |
inet: Kill FLOWI_FLAG_PRECOW_METRICS. No longer needed. TCP writes metrics, but now in it's own special cache that does not dirty the route metrics. Therefore there is no longer any reason to pre-cow metrics in this way. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1d861aa4 |
|
10-Jul-2012 |
David S. Miller <davem@davemloft.net> |
inet: Minimize use of cached route inetpeer. Only use it in the absolutely required cases: 1) COW'ing metrics 2) ipv4 PMTU 3) ipv4 redirects Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
81166dd6 |
|
10-Jul-2012 |
David S. Miller <davem@davemloft.net> |
tcp: Move timestamps from inetpeer to metrics cache. With help from Lin Ming. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
794785bf |
|
10-Jul-2012 |
David S. Miller <davem@davemloft.net> |
net: Don't report route RTT metric value in cache dumps. We don't maintain it dynamically any longer, so reporting it would be extremely misleading. Report zero instead. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f187bc6e |
|
03-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: No need to set generic neighbour pointer. Nobody reads it any longer. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f894cbf8 |
|
02-Jul-2012 |
David S. Miller <davem@davemloft.net> |
net: Add optional SKB arg to dst_ops->neigh_lookup(). Causes the handler to use the daddr in the ipv4/ipv6 header when the route gateway is unspecified (local subnet). Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3c521f2b |
|
02-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Don't report neigh uptodate state in rtcache procfs. Soon routes will not have a cached neigh attached, nor will we be able to necessarily go directly to a neigh from an arbitrary route. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a263b309 |
|
02-Jul-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Make neigh lookups directly in output packet path. Do not use the dst cached neigh, we'll be getting rid of that. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3085a4b7 |
|
28-Jun-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Remove extraneous assignment of dst->tclassid. We already set it several lines above. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9e56e380 |
|
28-Jun-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Adjust in_dev handling in fib_validate_source() Checking for in_dev being NULL is pointless. In fact, all of our callers have in_dev precomputed already, so just pass it in and remove the NULL checking. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
41347dcd |
|
28-Jun-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Kill rt->rt_spec_dst, no longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c10237e0 |
|
27-Jun-2012 |
David S. Miller <davem@davemloft.net> |
Revert "ipv4: tcp: dont cache unconfirmed intput dst" This reverts commit c074da2810c118b3812f32d6754bd9ead2f169e7. This change has several unwanted side effects: 1) Sockets will cache the DST_NOCACHE route in sk->sk_rx_dst and we'll thus never create a real cached route. 2) All TCP traffic will use DST_NOCACHE and never use the routing cache at all. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c074da28 |
|
26-Jun-2012 |
Eric Dumazet <edumazet@google.com> |
ipv4: tcp: dont cache unconfirmed intput dst DDOS synflood attacks hit badly IP route cache. On typical machines, this cache is allowed to hold up to 8 Millions dst entries, 256 bytes for each, for a total of 2GB of memory. rt_garbage_collect() triggers and tries to cleanup things. Eventually route cache is disabled but machine is under fire and might OOM and crash. This patch exploits the new TCP early demux, to set a nocache boolean in case incoming TCP frame is for a not yet ESTABLISHED or TIMEWAIT socket. This 'nocache' boolean is then used in case dst entry is not found in route cache, to create an unhashed dst entry (DST_NOCACHE) SYN-cookie-ACK sent use a similar mechanism (ipv4: tcp: dont cache output dst for syncookies), so after this patch, a machine is able to absorb a DDOS synflood attack without polluting its IP route cache. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hans Schillstrom <hans.schillstrom@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
251da413 |
|
26-Jun-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Cache ip_error() routes even when not forwarding. And account for the fact that, when we are not forwarding, we should bump statistic counters rather than emit an ICMP response. RP-filter rejected lookups are still not cached. Since -EHOSTUNREACH and -ENETUNREACH can now no longer be seen in ip_rcv_finish(), remove those checks. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
df67e6c9 |
|
26-Jun-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Remove unnecessary code from rt_check_expire(). IPv4 routing cache entries no longer use dst->expires, because the metrics, PMTU, and redirect information are stored in the inetpeer cache. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7586eceb |
|
19-Jun-2012 |
Eric Dumazet <edumazet@google.com> |
ipv4: tcp: dont cache output dst for syncookies Don't cache output dst for syncookies, as this adds pressure on IP route cache and rcu subsystem for no gain. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hans Schillstrom <hans.schillstrom@ericsson.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6fac2625 |
|
17-Jun-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Cap ADVMSS metric in the FIB rather than the routing cache. It makes no sense to execute this limit test every time we create a routing cache entry. We can't simply error out on these things since we've silently accepted and truncated them forever. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
36393395 |
|
14-Jun-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Handle PMTU in all ICMP error handlers. With ip_rt_frag_needed() removed, we have to explicitly update PMTU information in every ICMP error handler. Create two helper functions to facilitate this. 1) ipv4_sk_update_pmtu() This updates the PMTU when we have a socket context to work with. 2) ipv4_update_pmtu() Raw version, used when no socket context is available. For this interface, we essentially just pass in explicit arguments for the flow identity information we would have extracted from the socket. And you'll notice that ipv4_sk_update_pmtu() is simply implemented in terms of ipv4_update_pmtu() Note that __ip_route_output_key() is used, rather than something like ip_route_output_flow() or ip_route_output_key(). This is because we absolutely do not want to end up with a route that does IPSEC encapsulation and the like. Instead, we only want the route that would get us to the node described by the outermost IP header. Reported-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d0daebc3 |
|
11-Jun-2012 |
Thomas Graf <tgraf@suug.ch> |
ipv4: Add interface option to enable routing of 127.0.0.0/8 Routing of 127/8 is tradtionally forbidden, we consider packets from that address block martian when routing and do not process corresponding ARP requests. This is a sane default but renders a huge address space practically unuseable. The RFC states that no address within the 127/8 block should ever appear on any network anywhere but it does not forbid the use of such addresses outside of the loopback device in particular. For example to address a pool of virtual guests behind a load balancer. This patch adds a new interface option 'route_localnet' enabling routing of the 127/8 address block and processing of ARP requests on a specific interface. Note that for the feature to work, the default local route covering 127/8 dev lo needs to be removed. Example: $ sysctl -w net.ipv4.conf.eth0.route_localnet=1 $ ip route del 127.0.0.0/8 dev lo table local $ ip addr add 127.1.0.1/16 dev eth0 $ ip route flush cache V2: Fix invalid check to auto flush cache (thanks davem) Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7b34ca2a |
|
11-Jun-2012 |
David S. Miller <davem@davemloft.net> |
inet: Avoid potential NULL peer dereference. We handle NULL in rt{,6}_set_peer but then our caller will try to pass that NULL pointer into inet_putpeer() which isn't ready for it. Fix this by moving the NULL check one level up, and then remove the now unnecessary NULL check from inetpeer_ptr_set_peer(). Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8b96d22d |
|
11-Jun-2012 |
David S. Miller <davem@davemloft.net> |
inet: Use FIB table peer roots in routes. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b48c80ece |
|
10-Jun-2012 |
David S. Miller <davem@davemloft.net> |
inet: Add family scope inetpeer flushes. This implementation can deal with having many inetpeer roots, which is a necessary prerequisite for per-FIB table rooted peer tables. Each family (AF_INET, AF_INET6) has a sequence number which we bump when we get a family invalidation request. Each peer lookup cheaply checks whether the flush sequence of the root we are using is out of date, and if so flushes it and updates the sequence number. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
46517008 |
|
10-Jun-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Kill ip_rt_frag_needed(). There is zero point to this function. It's only real substance is to perform an extremely outdated BSD4.2 ICMP check, which we can safely remove. If you really have a MTU limited link being routed by a BSD4.2 derived system, here's a nickel go buy yourself a real router. The other actions of ip_rt_frag_needed(), checking and conditionally updating the peer, are done by the per-protocol handlers of the ICMP event. TCP, UDP, et al. have a handler which will receive this event and transmit it back into the associated route via dst_ops->update_pmtu(). This simplification is important, because it eliminates the one place where we do not have a proper route context in which to make an inetpeer lookup. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
97bab73f |
|
09-Jun-2012 |
David S. Miller <davem@davemloft.net> |
inet: Hide route peer accesses behind helpers. We encode the pointer(s) into an unsigned long with one state bit. The state bit is used so we can store the inetpeer tree root to use when resolving the peer later. Later the peer roots will be per-FIB table, and this change works to facilitate that. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c0efc887 |
|
09-Jun-2012 |
David S. Miller <davem@davemloft.net> |
inet: Pass inetpeer root into inet_getpeer*() interfaces. Otherwise we reference potentially non-existing members when ipv6 is disabled. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
56a6b248 |
|
09-Jun-2012 |
David S. Miller <davem@davemloft.net> |
inet: Consolidate inetpeer_invalidate_tree() interfaces. We only need one interface for this operation, since we always know which inetpeer root we want to flush. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c3426b47 |
|
09-Jun-2012 |
David S. Miller <davem@davemloft.net> |
inet: Initialize per-netns inetpeer roots in net/ipv{4,6}/route.c Instead of net/ipv4/inetpeer.c Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fbfe95a4 |
|
09-Jun-2012 |
David S. Miller <davem@davemloft.net> |
inet: Create and use rt{,6}_get_peer_create(). There's a lot of places that open-code rt{,6}_get_peer() only because they want to set 'create' to one. So add an rt{,6}_get_peer_create() for their sake. There were also a few spots open-coding plain rt{,6}_get_peer() and those are transformed here as well. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
54db0cc2 |
|
07-Jun-2012 |
Gao feng <gaofeng@cn.fujitsu.com> |
inetpeer: add parameter net for inet_getpeer_v4,v6 add struct net as a parameter of inet_getpeer_v[4,6], use net to replace &init_net. and modify some places to provide net for inet_getpeer_v[4,6] Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c8a627ed |
|
07-Jun-2012 |
Gao feng <gaofeng@cn.fujitsu.com> |
inetpeer: add namespace support for inetpeer now inetpeer doesn't support namespace,the information will be leaking across namespace. this patch move the global vars v4_peers and v6_peers to netns_ipv4 and netns_ipv6 as a field peers. add struct pernet_operations inetpeer_ops to initial pernet inetpeer data. and change family_to_base and inet_getpeer to support namespace. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
31fe62b9 |
|
23-May-2012 |
Tim Bird <tim.bird@am.sony.com> |
mm: add a low limit to alloc_large_system_hash UDP stack needs a minimum hash size value for proper operation and also uses alloc_large_system_hash() for proper NUMA distribution of its hash tables and automatic sizing depending on available system memory. On some low memory situations, udp_table_init() must ignore the alloc_large_system_hash() result and reallocs a bigger memory area. As we cannot easily free old hash table, we leak it and kmemleak can issue a warning. This patch adds a low limit parameter to alloc_large_system_hash() to solve this problem. We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table allocation. Reported-by: Mark Asselstine <mark.asselstine@windriver.com> Reported-by: Tim Bird <tim.bird@am.sony.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
413c27d8 |
|
19-May-2012 |
Eldad Zack <eldad@fogrefinery.com> |
net/ipv4: replace simple_strtoul with kstrtoul Replace simple_strtoul with kstrtoul in three similar occurrences, all setup handlers: * route.c: set_rhash_entries * tcp.c: set_thash_entries * udp.c: set_uhash_entries Also check if the conversion failed. Signed-off-by: Eldad Zack <eldad@fogrefinery.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
91df42be |
|
15-May-2012 |
Joe Perches <joe@perches.com> |
net: ipv4 and ipv6: Convert printk(KERN_DEBUG to pr_debug Use the current debugging style and enable dynamic_debug. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e87cc472 |
|
13-May-2012 |
Joe Perches <joe@perches.com> |
net: Convert net_ratelimit uses to net_<level>_ratelimited Standardize the net core ratelimited logging functions. Coalesce formats, align arguments. Change a printk then vprintk sequence to use printf extension %pV. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ec8f23ce |
|
19-Apr-2012 |
Eric W. Biederman <ebiederm@xmission.com> |
net: Convert all sysctl registrations to register_net_sysctl This results in code with less boiler plate that is a bit easier to read. Additionally stops us from using compatibility code in the sysctl core, hastening the day when the compatibility code can be removed. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4e5ca785 |
|
19-Apr-2012 |
Eric W. Biederman <ebiederm@xmission.com> |
net ipv4: Remove the unneeded registration of an empty net/ipv4/neigh sysctl no longer requires explicit creation of directories. The neigh directory is always populated with at least a default entry so this won't cause any user visible changes. Delete the ipv4_path and the ipv4_skeleton these are no longer needed. Directly register the ipv4_route_table. And since I am an idiot remove the header definitions that I should have removed in the previous patch. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5dd3df10 |
|
19-Apr-2012 |
Eric W. Biederman <ebiederm@xmission.com> |
net: Move all of the network sysctls without a namespace into init_net. This makes it clearer which sysctls are relative to your current network namespace. This makes it a little less error prone by not exposing sysctls for the initial network namespace in other namespaces. This is the same way we handle all of our other network interfaces to userspace and I can't honestly remember why we didn't do this for sysctls right from the start. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7426a564 |
|
18-Apr-2012 |
Shan Wei <davidshan@tencent.com> |
net: fix compile error of leaking kmemleak.h header net/core/sysctl_net_core.c: In function ‘sysctl_core_init’: net/core/sysctl_net_core.c:259: error: implicit declaration of function ‘kmemleak_not_leak’ with same error in net/ipv4/route.c Signed-off-by: Shan Wei <davidshan@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7f593881 |
|
16-Apr-2012 |
majianpeng <majianpeng@gmail.com> |
net/ipv4:Remove two memleak reports by kmemleak_not_leak. Signed-off-by: majianpeng <majianpeng@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
95c96174 |
|
14-Apr-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: cleanup unsigned to unsigned int Use of "unsigned int" is preferred to bare "unsigned" in net tree. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5e73ea1a |
|
14-Apr-2012 |
Daniel Baluta <dbaluta@ixiacom.com> |
ipv4: fix checkpatch errors Fix checkpatch errors of the following type: * ERROR: "foo * bar" should be "foo *bar" * ERROR: "(foo*)" should be "(foo *)" Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d4a96865 |
|
04-Apr-2012 |
Amir Vadai <amirv@mellanox.com> |
net/route: export symbol ip_tos2prio Need to export this to enable drivers use rt_tos2priority() Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f3756b79 |
|
01-Apr-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Stop using NLA_PUT*(). These macros contain a hidden goto, and are thus extremely error prone and make code hard to audit. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9ffc93f2 |
|
28-Mar-2012 |
David Howells <dhowells@redhat.com> |
Remove all #inclusions of asm/system.h Remove all #inclusions of asm/system.h preparatory to splitting and killing it. Performed with the following command: perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *` Signed-off-by: David Howells <dhowells@redhat.com>
|
#
4e7b2f14 |
|
27-Mar-2012 |
Benjamin LaHaise <bcrl@kvack.org> |
net/ipv4: fix IPv4 multicast over network namespaces When using multicast over a local bridge feeding a number of LXC guests using veth, the LXC guests are unable to get a response from other guests when pinging 224.0.0.1. Multicast packets did not appear to be getting delivered to the network namespaces of the guest hosts, and further inspection showed that the incoming route was pointing to the loopback device of the host, not the guest. This lead to the wrong network namespace being picked up by sockets (like ICMP). Fix this by using the correct network namespace when creating the inbound route entry. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
afd46503 |
|
12-Mar-2012 |
Joe Perches <joe@perches.com> |
net: ipv4: Standardize prefixes for message logging Add #define pr_fmt(fmt) as appropriate. Add "IPv4: ", "TCP: ", and "IPsec: " to appropriate files. Standardize on "UDPLite: " for appropriate uses. Some prefixes were previously "UDPLITE: " and "UDP-Lite: ". Add KBUILD_MODNAME ": " to icmp and gre. Remove embedded prefixes as appropriate. Add missing "\n" to pr_info in gre.c. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
058bd4d2 |
|
11-Mar-2012 |
Joe Perches <joe@perches.com> |
net: Convert printks to pr_<level> Use a more current kernel messaging style. Convert a printk block to print_hex_dump. Coalesce formats, align arguments. Use %s, __func__ instead of embedding function names. Some messages that were prefixed with <foo>_close are now prefixed with <foo>_fini. Some ah4 and esp messages are now not prefixed with "ip ". The intent of this patch is to later add something like #define pr_fmt(fmt) "IPv4: " fmt. to standardize the output messages. Text size is trivially reduced. (x86-32 allyesconfig) $ size net/ipv4/built-in.o* text data bss dec hex filename 887888 31558 249696 1169142 11d6f6 net/ipv4/built-in.o.new 887934 31558 249800 1169292 11d78c net/ipv4/built-in.o.old Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ac3f48de |
|
06-Mar-2012 |
Steffen Klassert <steffen.klassert@secunet.com> |
route: Remove redirect_genid As we invalidate the inetpeer tree along with the routing cache now, we don't need a genid to reset the redirect handling when the routing cache is flushed. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5faa5df1 |
|
06-Mar-2012 |
Steffen Klassert <steffen.klassert@secunet.com> |
inetpeer: Invalidate the inetpeer tree along with the routing cache We initialize the routing metrics with the values cached on the inetpeer in rt_init_metrics(). So if we have the metrics cached on the inetpeer, we ignore the user configured fib_metrics. To fix this issue, we replace the old tree with a fresh initialized inet_peer_base. The old tree is removed later with a delayed work queue. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
80703d26 |
|
15-Feb-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Eliminate spurious argument to __ipv4_neigh_lookup 'tbl' is always arp_tbl, so specifying it is pointless. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
39232973 |
|
26-Jan-2012 |
David S. Miller <davem@davemloft.net> |
ipv4/ipv6: Prepare for new route gateway semantics. In the future the ipv4/ipv6 route gateway will take on two types of values: 1) INADDR_ANY/IN6ADDR_ANY, for local network routes, and in this case the neighbour must be obtained using the destination address in ipv4/ipv6 header as the lookup key. 2) Everything else, the actual nexthop route address. So if the gateway is not inaddr-any we use it, otherwise we must use the packet's destination address. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e688a604 |
|
21-Dec-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: introduce DST_NOPEER dst flag Chris Boot reported crashes occurring in ipv6_select_ident(). [ 461.457562] RIP: 0010:[<ffffffff812dde61>] [<ffffffff812dde61>] ipv6_select_ident+0x31/0xa7 [ 461.578229] Call Trace: [ 461.580742] <IRQ> [ 461.582870] [<ffffffff812efa7f>] ? udp6_ufo_fragment+0x124/0x1a2 [ 461.589054] [<ffffffff812dbfe0>] ? ipv6_gso_segment+0xc0/0x155 [ 461.595140] [<ffffffff812700c6>] ? skb_gso_segment+0x208/0x28b [ 461.601198] [<ffffffffa03f236b>] ? ipv6_confirm+0x146/0x15e [nf_conntrack_ipv6] [ 461.608786] [<ffffffff81291c4d>] ? nf_iterate+0x41/0x77 [ 461.614227] [<ffffffff81271d64>] ? dev_hard_start_xmit+0x357/0x543 [ 461.620659] [<ffffffff81291cf6>] ? nf_hook_slow+0x73/0x111 [ 461.626440] [<ffffffffa0379745>] ? br_parse_ip_options+0x19a/0x19a [bridge] [ 461.633581] [<ffffffff812722ff>] ? dev_queue_xmit+0x3af/0x459 [ 461.639577] [<ffffffffa03747d2>] ? br_dev_queue_push_xmit+0x72/0x76 [bridge] [ 461.646887] [<ffffffffa03791e3>] ? br_nf_post_routing+0x17d/0x18f [bridge] [ 461.653997] [<ffffffff81291c4d>] ? nf_iterate+0x41/0x77 [ 461.659473] [<ffffffffa0374760>] ? br_flood+0xfa/0xfa [bridge] [ 461.665485] [<ffffffff81291cf6>] ? nf_hook_slow+0x73/0x111 [ 461.671234] [<ffffffffa0374760>] ? br_flood+0xfa/0xfa [bridge] [ 461.677299] [<ffffffffa0379215>] ? nf_bridge_update_protocol+0x20/0x20 [bridge] [ 461.684891] [<ffffffffa03bb0e5>] ? nf_ct_zone+0xa/0x17 [nf_conntrack] [ 461.691520] [<ffffffffa0374760>] ? br_flood+0xfa/0xfa [bridge] [ 461.697572] [<ffffffffa0374812>] ? NF_HOOK.constprop.8+0x3c/0x56 [bridge] [ 461.704616] [<ffffffffa0379031>] ? nf_bridge_push_encap_header+0x1c/0x26 [bridge] [ 461.712329] [<ffffffffa037929f>] ? br_nf_forward_finish+0x8a/0x95 [bridge] [ 461.719490] [<ffffffffa037900a>] ? nf_bridge_pull_encap_header+0x1c/0x27 [bridge] [ 461.727223] [<ffffffffa0379974>] ? br_nf_forward_ip+0x1c0/0x1d4 [bridge] [ 461.734292] [<ffffffff81291c4d>] ? nf_iterate+0x41/0x77 [ 461.739758] [<ffffffffa03748cc>] ? __br_deliver+0xa0/0xa0 [bridge] [ 461.746203] [<ffffffff81291cf6>] ? nf_hook_slow+0x73/0x111 [ 461.751950] [<ffffffffa03748cc>] ? __br_deliver+0xa0/0xa0 [bridge] [ 461.758378] [<ffffffffa037533a>] ? NF_HOOK.constprop.4+0x56/0x56 [bridge] This is caused by bridge netfilter special dst_entry (fake_rtable), a special shared entry, where attaching an inetpeer makes no sense. Problem is present since commit 87c48fa3b46 (ipv6: make fragment identifications less predictable) Introduce DST_NOPEER dst flag and make sure ipv6_select_ident() and __ip_select_ident() fallback to the 'no peer attached' handling. Reported-by: Chris Boot <bootc@bootc.net> Tested-by: Chris Boot <bootc@bootc.net> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b9eda06f |
|
21-Dec-2011 |
Stephen Rothwell <sfr@canb.auug.org.au> |
ipv4: using prefetch requires including prefetch.h Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9f28a2fc |
|
21-Dec-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
ipv4: reintroduce route cache garbage collector Commit 2c8cec5c10b (ipv4: Cache learned PMTU information in inetpeer) removed IP route cache garbage collector a bit too soon, as this gc was responsible for expired routes cleanup, releasing their neighbour reference. As pointed out by Robert Gladewitz, recent kernels can fill and exhaust their neighbour cache. Reintroduce the garbage collection, since we'll have to wait our neighbour lookups become refcount-less to not depend on this stuff. Reported-by: Robert Gladewitz <gladewitz@gmx.de> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
27217455 |
|
02-Dec-2011 |
David Miller <davem@davemloft.net> |
net: Rename dst_get_neighbour{, _raw} to dst_get_neighbour_noref{, _raw}. To reflect the fact that a refrence is not obtained to the resulting neighbour entry. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Roland Dreier <roland@purestorage.com>
|
#
de398fb8 |
|
05-Dec-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Fix peer validation on cached lookup. If ipv4_valdiate_peer() fails during a cached entry lookup, we'll NULL derer since the loop iterator assumes rth is not NULL. Letting this be handled as a failure is just bogus, so just make it not fail. If we have trouble getting a non-NULL neighbour for the redirected gateway, just restore the original gateway and continue. The very next use of this cached route will try again. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f61759e6 |
|
02-Dec-2011 |
Julian Anastasov <ja@ssi.bg> |
ipv4: make sure RTO_ONLINK is saved in routing cache __mkroute_output fails to work with the original tos and uses value with stripped RTO_ONLINK bit. Make sure we put the original TOS bits into rt_key_tos because it used to match cached route. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
efbc368d |
|
01-Dec-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Perform peer validation on cached route lookup. Otherwise we won't notice the peer GENID change. Reported-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
32092ecf |
|
24-Jul-2011 |
David Miller <davem@davemloft.net> |
atm: clip: Use device neigh support on top of "arp_tbl". Instead of instantiating an entire new neigh_table instance just for ATM handling, use the neigh device private facility. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
218fa90f |
|
29-Nov-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
ipv4: fix lockdep splat in rt_cache_seq_show After commit f2c31e32b378 (fix NULL dereferences in check_peer_redir()), dst_get_neighbour() should be guarded by rcu_read_lock() / rcu_read_unlock() section. Reported-by: Miles Lane <miles.lane@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
de68dca1 |
|
25-Nov-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
inet: add a redirect generation id in inetpeer Now inetpeer is the place where we cache redirect information for ipv4 destinations, we must be able to invalidate informations when a route is added/removed on host. As inetpeer is not yet namespace aware, this patch adds a shared redirect_genid, and a per inetpeer redirect_genid. This might be changed later if inetpeer becomes ns aware. Cache information for one inerpeer is valid as long as its redirect_genid has the same value than global redirect_genid. Reported-by: Arkadiusz Miśkiewicz <a.miskiewicz@gmail.com> Tested-by: Arkadiusz Miśkiewicz <a.miskiewicz@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
261663b0 |
|
22-Nov-2011 |
Steffen Klassert <steffen.klassert@secunet.com> |
ipv4: Don't use the cached pmtu informations for input routes The pmtu informations on the inetpeer are visible for output and input routes. On packet forwarding, we might propagate a learned pmtu to the sender. As we update the pmtu informations of the inetpeer on demand, the original sender of the forwarded packets might never notice when the pmtu to that inetpeer increases. So use the mtu of the outgoing device on packet forwarding instead of the pmtu to the final destination. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
618f9bc7 |
|
22-Nov-2011 |
Steffen Klassert <steffen.klassert@secunet.com> |
net: Move mtu handling down to the protocol depended handlers We move all mtu handling from dst_mtu() down to the protocol layer. So each protocol can implement the mtu handling in a different manner. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ebb762f2 |
|
22-Nov-2011 |
Steffen Klassert <steffen.klassert@secunet.com> |
net: Rename the dst_opt default_mtu method to mtu We plan to invoke the dst_opt->default_mtu() method unconditioally from dst_mtu(). So rename the method to dst_opt->mtu() to match the name with the new meaning. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6b600b26 |
|
22-Nov-2011 |
Steffen Klassert <steffen.klassert@secunet.com> |
route: Use the device mtu as the default for blackhole routes As it is, we return null as the default mtu of blackhole routes. This may lead to a propagation of a bogus pmtu if the default_mtu method of a blackhole route is invoked. So return dst->dev->mtu as the default mtu instead. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9cc20b26 |
|
18-Nov-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
ipv4: fix redirect handling commit f39925dbde77 (ipv4: Cache learned redirect information in inetpeer.) introduced a regression in ICMP redirect handling. It assumed ipv4_dst_check() would be called because all possible routes were attached to the inetpeer we modify in ip_rt_redirect(), but thats not true. commit 7cc9150ebe (route: fix ICMP redirect validation) tried to fix this but solution was not complete. (It fixed only one route) So we must lookup existing routes (including different TOS values) and call check_peer_redir() on them. Reported-by: Ivan Zahariev <famzah@icdsoft.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2bc8ca40 |
|
10-Oct-2011 |
Steffen Klassert <steffen.klassert@secunet.com> |
ipv4: Fix inetpeer expire time information As we update the learned pmtu informations on demand, we might report a nagative expiration time value to userspace if the pmtu informations are already expired and we have not send a packet to that inetpeer after expiration. With this patch we send a expire time of null to userspace after expiration until the next packet is send to that inetpeer. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
59445b6b |
|
19-Oct-2011 |
Gao feng <gaofeng@cn.fujitsu.com> |
ipv4: avoid useless call of the function check_peer_pmtu In func ipv4_dst_check,check_peer_pmtu should be called only when peer is updated. So,if the peer is not updated in ip_rt_frag_needed,we can not inc __rt_peer_genid. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7cc9150e |
|
24-Oct-2011 |
Flavio Leitner <fbl@redhat.com> |
route: fix ICMP redirect validation The commit f39925dbde7788cfb96419c0f092b086aa325c0f (ipv4: Cache learned redirect information in inetpeer.) removed some ICMP packet validations which are required by RFC 1122, section 3.2.2.2: ... A Redirect message SHOULD be silently discarded if the new gateway address it specifies is not on the same connected (sub-) net through which the Redirect arrived [INTRO:2, Appendix A], or if the source of the Redirect is not the current first-hop gateway for the specified destination (see Section 3.3.1). Signed-off-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
349d2895 |
|
29-Sep-2011 |
Vasily Averin <vvs@sw.ru> |
ipv4: NET_IPV4_ROUTE_GC_INTERVAL removal removing obsoleted sysctl, ip_rt_gc_interval variable no longer used since 2.6.38 Signed-off-by: Vasily Averin <vvs@sw.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
33d480ce |
|
11-Aug-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: cleanup some rcu_dereference_raw RCU api had been completed and rcu_access_pointer() or rcu_dereference_protected() are better than generic rcu_dereference_raw() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
97a80410 |
|
08-Aug-2011 |
Julian Anastasov <ja@ssi.bg> |
ipv4: some rt_iif -> rt_route_iif conversions As rt_iif represents input device even for packets coming from loopback with output route, it is not an unique key specific to input routes. Now rt_route_iif has such role, it was fl.iif in 2.6.38, so better to change the checks at some places to save CPU cycles and to restore 2.6.38 semantics. compare_keys: - input routes: only rt_route_iif matters, rt_iif is same - output routes: only rt_oif matters, rt_iif is not used for matching in __ip_route_output_key - now we are back to 2.6.38 state ip_route_input_common: - matching rt_route_iif implies input route - compared to 2.6.38 we eliminated one rth->fl.oif check because it was not needed even for 2.6.38 compare_hash_inputs: Only the change here is not an optimization, it has effect only for output routes. I assume I'm restoring the original intention to ignore oif, it was using fl.iif - now we are back to 2.6.38 state Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d547f727 |
|
07-Aug-2011 |
Julian Anastasov <ja@ssi.bg> |
ipv4: fix the reusing of routing cache entries compare_keys and ip_route_input_common rely on rt_oif for distinguishing of input and output routes with same keys values. But sometimes the input route has also same hash chain (keyed by iif != 0) with the output routes (keyed by orig_oif=0). Problem visible if running with small number of rhash_entries. Fix them to use rt_route_iif instead. By this way input route can not be returned to users that request output route. The patch fixes the ip_rt_bug errors that were reported in ip_local_out context, mostly for 255.255.255.255 destinations. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6e5714ea |
|
03-Aug-2011 |
David S. Miller <davem@davemloft.net> |
net: Compute protocol sequence numbers and fragment IDs using MD5. Computers have become a lot faster since we compromised on the partial MD4 hash which we use currently for performance reasons. MD5 is a much safer choice, and is inline with both RFC1948 and other ISS generators (OpenBSD, Solaris, etc.) Furthermore, only having 24-bits of the sequence number be truly unpredictable is a very serious limitation. So the periodic regeneration and 8-bit counter have been removed. We compute and use a full 32-bit sequence number. For ipv6, DCCP was found to use a 32-bit truncated initial sequence number (it needs 43-bits) and that is fixed here as well. Reported-by: Dan Kaminsky <dan@doxpara.com> Tested-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f2c31e32 |
|
29-Jul-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: fix NULL dereferences in check_peer_redir() Gergely Kalman reported crashes in check_peer_redir(). It appears commit f39925dbde778 (ipv4: Cache learned redirect information in inetpeer.) added a race, leading to possible NULL ptr dereference. Since we can now change dst neighbour, we should make sure a reader can safely use a neighbour. Add RCU protection to dst neighbour, and make sure check_peer_redir() can be called safely by different cpus in parallel. As neighbours are already freed after one RCU grace period, this patch should not add typical RCU penalty (cache cold effects) Many thanks to Gergely for providing a pretty report pointing to the bug. Reported-by: Gergely Kalman <synapse@hippy.csoma.elte.hu> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b0fe4a31 |
|
22-Jul-2011 |
Julian Anastasov <ja@ssi.bg> |
ipv4: use RT_TOS after some rt_tos conversions rt_tos was changed to iph->tos but it must be filtered by RT_TOS Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d3aaeb38 |
|
18-Jul-2011 |
David S. Miller <davem@davemloft.net> |
net: Add ->neigh_lookup() operation to dst_ops In the future dst entries will be neigh-less. In that environment we need to have an easy transition point for current users of dst->neighbour outside of the packet output fast path. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
69cce1d1 |
|
18-Jul-2011 |
David S. Miller <davem@davemloft.net> |
net: Abstract dst->neighbour accesses behind helpers. dst_{get,set}_neighbour() Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b23b5455 |
|
16-Jul-2011 |
David S. Miller <davem@davemloft.net> |
neigh: Kill hh_cache->hh_output It's just taking on one of two possible values, either neigh_ops->output or dev_queue_xmit(). And this is purely depending upon whether nud_state has NUD_CONNECTED set or not. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f6b72b62 |
|
14-Jul-2011 |
David S. Miller <davem@davemloft.net> |
net: Embed hh_cache inside of struct neighbour. Now that there is a one-to-one correspondance between neighbour and hh_cache entries, we no longer need: 1) dynamic allocation 2) attachment to dst->hh 3) refcounting Initialization of the hh_cache entry is indicated by hh_len being non-zero, and such initialization is always done with the neighbour's lock held as a writer. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3769cffb |
|
11-Jul-2011 |
David Miller <davem@davemloft.net> |
ipv4: Inline neigh binding. Get rid of all of the useless and costly indirection by doing the neigh hash table lookup directly inside of the neighbour binding. Rename from arp_bind_neighbour to rt_bind_neighbour. Use new helpers {__,}ipv4_neigh_lookup() In rt_bind_neighbour() get rid of useless tests which are never true in the context this function is called, namely dev is never NULL and the dst->neighbour is always NULL. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4500ebf8 |
|
01-Jul-2011 |
Joe Perches <joe@perches.com> |
ipv4: Reduce switch/case indent Make the case labels the same indent as the switch. git diff -w shows no difference. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9aa3c94c |
|
18-Jun-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
ipv4: fix multicast losses Knut Tidemann found that first packet of a multicast flow was not correctly received, and bisected the regression to commit b23dd4fe42b4 (Make output route lookup return rtable directly.) Special thanks to Knut, who provided a very nice bug report, including sample programs to demonstrate the bug. Reported-and-bisectedby: Knut Tidemann <knut.andre.tidemann@jotron.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c7ac8679 |
|
09-Jun-2011 |
Greg Rose <gregory.v.rose@intel.com> |
rtnetlink: Compute and store minimum ifinfo dump size The message size allocated for rtnl ifinfo dumps was limited to a single page. This is not enough for additional interface info available with devices that support SR-IOV and caused a bug in which VF info would not be displayed if more than approximately 40 VFs were created per interface. Implement a new function pointer for the rtnl_register service that will calculate the amount of data required for the ifinfo dump and allocate enough data to satisfy the request. Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
fe6fe792 |
|
08-Jun-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: pmtu_expires fixes commit 2c8cec5c10bc (ipv4: Cache learned PMTU information in inetpeer) added some racy peer->pmtu_expires accesses. As its value can be changed by another cpu/thread, we should be more careful, reading its value once. Add peer_pmtu_expired() and peer_pmtu_cleaned() helpers Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c378a9c0 |
|
21-May-2011 |
Dave Jones <davej@redhat.com> |
ipv4: Give backtrace in ip_rt_bug(). Add a stack backtrace to the ip_rt_bug path for debugging Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a48eff12 |
|
18-May-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Pass explicit destination address to rt_bind_peer(). Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6882f933 |
|
18-May-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Kill RT_CACHE_DEBUG It's way past it's usefulness. And this gets rid of a bunch of stray ->rt_{dst,src} references. Even the comment documenting the macro was inaccurate (stated default was 1 when it's 0). If reintroduced, it should be done properly, with dynamic debug facilities. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c5be24ff |
|
13-May-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Trivial rt->rt_src conversions in net/ipv4/route.c At these points we have a fully filled in value via the IP header the form of ip_hdr(skb)->saddr Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8e36360a |
|
13-May-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Remove route key identity dependencies in ip_rt_get_source(). Pass in the sk_buff so that we can fetch the necessary keys from the packet header when working with input routes. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9a1b9496 |
|
04-May-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Pass explicit saddr/daddr args to ipmr_get_route(). This eliminates the need to use rt->rt_{src,dst}. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
475949d8 |
|
03-May-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Renamt struct rtable's rt_tos to rt_key_tos. To more accurately reflect that it is purely a routing cache lookup key and is used in no other context. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
56157872 |
|
02-May-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Make sure flowi4->{saddr,daddr} are always set. Slow path output route resolution always makes sure that ->{saddr,daddr} are set, and also if we trigger into IPSEC resolution we initialize them as well, because xfrm_lookup() expects them to be fully resolved. But if we hit the fast path and flowi4->flowi4_proto is zero, we won't do this initialization. Therefore, move the IPSEC path initialization to the route cache lookup fast path to make sure these are always set. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
813b3b5d |
|
28-Apr-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Use caller's on-stack flowi as-is in output route lookups. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cf911662 |
|
28-Apr-2011 |
David S. Miller <davem@davemloft.net> |
net: Use non-zero allocations in dst_alloc(). Make dst_alloc() and it's users explicitly initialize the entire entry. The zero'ing done by kmem_cache_zalloc() was almost entirely redundant. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5c1e6aa3 |
|
28-Apr-2011 |
David S. Miller <davem@davemloft.net> |
net: Make dst_alloc() take more explicit initializations. Now the dst->dev, dev->obsolete, and dst->flags values can be specified as well. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0972ddb2 |
|
24-Apr-2011 |
Held Bernhard <berny156@gmx.de> |
net: provide cow_metrics() methods to blackhole dst_ops Since commit 62fa8a846d7d (net: Implement read-only protection and COW'ing of metrics.) the kernel throws an oops. [ 101.620985] BUG: unable to handle kernel NULL pointer dereference at (null) [ 101.621050] IP: [< (null)>] (null) [ 101.621084] PGD 6e53c067 PUD 3dd6a067 PMD 0 [ 101.621122] Oops: 0010 [#1] SMP [ 101.621153] last sysfs file: /sys/devices/virtual/ppp/ppp/uevent [ 101.621192] CPU 2 [ 101.621206] Modules linked in: l2tp_ppp pppox ppp_generic slhc l2tp_netlink l2tp_core deflate zlib_deflate twofish_x86_64 twofish_common des_generic cbc ecb sha1_generic hmac af_key iptable_filter snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device loop snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_pcm snd_timer snd i2c_i801 iTCO_wdt psmouse soundcore snd_page_alloc evdev uhci_hcd ehci_hcd thermal [ 101.621552] [ 101.621567] Pid: 5129, comm: openl2tpd Not tainted 2.6.39-rc4-Quad #3 Gigabyte Technology Co., Ltd. G33-DS3R/G33-DS3R [ 101.621637] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [ 101.621684] RSP: 0018:ffff88003ddeba60 EFLAGS: 00010202 [ 101.621716] RAX: ffff88003ddb5600 RBX: ffff88003ddb5600 RCX: 0000000000000020 [ 101.621758] RDX: ffffffff81a69a00 RSI: ffffffff81b7ee61 RDI: ffff88003ddb5600 [ 101.621800] RBP: ffff8800537cd900 R08: 0000000000000000 R09: ffff88003ddb5600 [ 101.621840] R10: 0000000000000005 R11: 0000000000014b38 R12: ffff88003ddb5600 [ 101.621881] R13: ffffffff81b7e480 R14: ffffffff81b7e8b8 R15: ffff88003ddebad8 [ 101.621924] FS: 00007f06e4182700(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000 [ 101.621971] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 101.622005] CR2: 0000000000000000 CR3: 0000000045274000 CR4: 00000000000006e0 [ 101.622046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 101.622087] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 101.622129] Process openl2tpd (pid: 5129, threadinfo ffff88003ddea000, task ffff88003de9a280) [ 101.622177] Stack: [ 101.622191] ffffffff81447efa ffff88007d3ded80 ffff88003de9a280 ffff88007d3ded80 [ 101.622245] 0000000000000001 ffff88003ddebbb8 ffffffff8148d5a7 0000000000000212 [ 101.622299] ffff88003dcea000 ffff88003dcea188 ffffffff00000001 ffffffff81b7e480 [ 101.622353] Call Trace: [ 101.622374] [<ffffffff81447efa>] ? ipv4_blackhole_route+0x1ba/0x210 [ 101.622415] [<ffffffff8148d5a7>] ? xfrm_lookup+0x417/0x510 [ 101.622450] [<ffffffff8127672a>] ? extract_buf+0x9a/0x140 [ 101.622485] [<ffffffff8144c6a0>] ? __ip_flush_pending_frames+0x70/0x70 [ 101.622526] [<ffffffff8146fbbf>] ? udp_sendmsg+0x62f/0x810 [ 101.622562] [<ffffffff813f98a6>] ? sock_sendmsg+0x116/0x130 [ 101.622599] [<ffffffff8109df58>] ? find_get_page+0x18/0x90 [ 101.622633] [<ffffffff8109fd6a>] ? filemap_fault+0x12a/0x4b0 [ 101.622668] [<ffffffff813fb5c4>] ? move_addr_to_kernel+0x64/0x90 [ 101.622706] [<ffffffff81405d5a>] ? verify_iovec+0x7a/0xf0 [ 101.622739] [<ffffffff813fc772>] ? sys_sendmsg+0x292/0x420 [ 101.622774] [<ffffffff810b994a>] ? handle_pte_fault+0x8a/0x7c0 [ 101.622810] [<ffffffff810b76fe>] ? __pte_alloc+0xae/0x130 [ 101.622844] [<ffffffff810ba2f8>] ? handle_mm_fault+0x138/0x380 [ 101.622880] [<ffffffff81024af9>] ? do_page_fault+0x189/0x410 [ 101.622915] [<ffffffff813fbe03>] ? sys_getsockname+0xf3/0x110 [ 101.622952] [<ffffffff81450c4d>] ? ip_setsockopt+0x4d/0xa0 [ 101.622986] [<ffffffff813f9932>] ? sockfd_lookup_light+0x22/0x90 [ 101.623024] [<ffffffff814b61fb>] ? system_call_fastpath+0x16/0x1b [ 101.623060] Code: Bad RIP value. [ 101.623090] RIP [< (null)>] (null) [ 101.623125] RSP <ffff88003ddeba60> [ 101.623146] CR2: 0000000000000000 [ 101.650871] ---[ end trace ca3856a7d8e8dad4 ]--- [ 101.651011] __sk_free: optmem leakage (160 bytes) detected. The oops happens in dst_metrics_write_ptr() include/net/dst.h:124: return dst->ops->cow_metrics(dst, p); dst->ops->cow_metrics is NULL and causes the oops. Provide cow_metrics() methods, like we did in commit 214f45c91bb (net: provide default_advmss() methods to blackhole dst_ops) Signed-off-by: Held Bernhard <berny156@gmx.de> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b71d1d42 |
|
21-Apr-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
inet: constify ip headers and in6_addr Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers where possible, to make code intention more obvious. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
21d8c49e |
|
14-Apr-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Call fib_select_default() only when actually necessary. fib_select_default() is a complete NOP, and completely pointless to invoke, when we have no more than 1 default route installed. And this is far and away the common case. So remember how many prefixlen==0 routes we have in the routing table, and elide the call when we have no more than one of those. This cuts output route creation time by 157 cycles on Niagara2+. In order to add the new int to fib_table, we have to correct the type of ->tb_data[] to unsigned long, otherwise the private area will be unaligned on 64-bit systems. Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
|
#
5c04c819 |
|
06-Apr-2011 |
Michael Smith <msmith@cbnco.com> |
fib_validate_source(): pass sk_buff instead of mark This makes sk_buff available for other use in fib_validate_source(). Signed-off-by: Michael Smith <msmith@cbnco.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1b86a58f |
|
07-Apr-2011 |
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> |
ipv4: Fix "Set rt->rt_iif more sanely on output routes." Commit 1018b5c01636c7c6bda31a719bda34fc631db29a ("Set rt->rt_iif more sanely on output routes.") breaks rt_is_{output,input}_route. This became the cause to return "IP_PKTINFO's ->ipi_ifindex == 0". To fix it, this does: 1) Add "int rt_route_iif;" to struct rtable 2) For input routes, always set rt_route_iif to same value as rt_iif 3) For output routes, always set rt_route_iif to zero. Set rt_iif as it is done currently. 4) Change rt_is_{output,input}_route() to test rt_route_iif Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
25985edc |
|
30-Mar-2011 |
Lucas De Marchi <lucas.demarchi@profusion.mobi> |
Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
|
#
436c3b66 |
|
24-Mar-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Invalidate nexthop cache nh_saddr more correctly. Any operation that: 1) Brings up an interface 2) Adds an IP address to an interface 3) Deletes an IP address from an interface can potentially invalidate the nh_saddr value, requiring it to be recomputed. Perform the recomputation lazily using a generation ID. Reported-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
eb49a973 |
|
23-Mar-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
ipv4: fix ip_rt_update_pmtu() commit 2c8cec5c10bc (Cache learned PMTU information in inetpeer) added an extra inet_putpeer() call in ip_rt_update_pmtu(). This results in various problems, since we can free one inetpeer, while it is still in use. Ref: http://www.spinics.net/lists/netdev/msg159121.html Reported-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4a2b9c37 |
|
15-Mar-2011 |
Dan Siemon <dan@coverfire.com> |
net_sched: fix ip_tos2prio ECN support incorrectly maps ECN BESTEFFORT packets to TC_PRIO_FILLER (1) instead of TC_PRIO_BESTEFFORT (0) This means ECN enabled flows are placed in pfifo_fast/prio low priority band, giving ECN enabled flows [ECT(0) and CE codepoints] higher drop probabilities. This is rather unfortunate, given we would like ECN being more widely used. Ref : http://www.coverfire.com/archives/2011/03/13/pfifo_fast-and-ecn/ Signed-off-by: Dan Siemon <dan@coverfire.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dave Täht <d@taht.net> Cc: Jonathan Morton <chromatix99@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
46af3180 |
|
09-Mar-2011 |
Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> |
ipv4: Fix PMTU update. On current net-next-2.6, when Linux receives ICMP Type: 3, Code: 4 (Destination unreachable (Fragmentation needed)), icmp_unreach -> ip_rt_frag_needed (peer->pmtu_expires is set here) -> tcp_v4_err -> do_pmtu_discovery -> ip_rt_update_pmtu (peer->pmtu_expires is already set, so check_peer_pmtu is skipped.) -> check_peer_pmtu check_peer_pmtu is skipped and MTU is not updated. To fix this, let check_peer_pmtu execute unconditionally. And some minor fixes 1) Avoid potential peer->pmtu_expires set to be zero. 2) In check_peer_pmtu, argument of time_before is reversed. 3) check_peer_pmtu expects peer->pmtu_orig is initialized as zero, but not initialized. Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9d6ec938 |
|
11-Mar-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Use flowi4 in public route lookup interfaces. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
68a5e3dd |
|
11-Mar-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Use struct flowi4 internally in routing lookups. We will change the externally visible APIs next. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
22bd5b9b |
|
11-Mar-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Pass ipv4 flow objects into fib_lookup() paths. To start doing these conversions, we need to add some temporary flow4_* macros which will eventually go away when all the protocol code paths are changed to work on AF specific flowi objects. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1d28f42c |
|
11-Mar-2011 |
David S. Miller <davem@davemloft.net> |
net: Put flowi_* prefix on AF independent members of struct flowi I intend to turn struct flowi into a union of AF specific flowi structs. There will be a common structure that each variant includes first, much like struct sock_common. This is the first step to move in that direction. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1b7fe593 |
|
10-Mar-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Kill flowi arg to fib_select_multipath() Completely unused. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ff3fccb3 |
|
10-Mar-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Remove unnecessary test from ip_mkroute_input() fl->oif will always be zero on the input path, so there is no reason to test for that. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dbdd9a52 |
|
10-Mar-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Remove redundant RCU locking in ip_check_mc(). All callers are under rcu_read_lock() protection already. Rename to ip_check_mc_rcu() to make it even more clear. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
67e28ffd |
|
09-Mar-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Optimize flow initialization in input route lookup. Like in commit 44713b67db10c774f14280c129b0d5fd13c70cf2 ("ipv4: Optimize flow initialization in output route lookup." we can optimize the on-stack flow setup to only initialize the members which are actually used. Otherwise we bzero the entire structure, then initialize explicitly the first half of it. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5e2b61f7 |
|
04-Mar-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Remove flowi from struct rtable. The only necessary parts are the src/dst addresses, the interface indexes, the TOS, and the mark. The rest is unnecessary bloat, which amounts to nearly 50 bytes on 64-bit. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1018b5c0 |
|
04-Mar-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Set rt->rt_iif more sanely on output routes. rt->rt_iif is only ever inspected on input routes, for example DCCP uses this to populate a route lookup flow key when generating replies to another packet. Therefore, setting it to anything other than zero on output routes makes no sense. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3c0afdca |
|
04-Mar-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Get peer more cheaply in rt_init_metrics(). We know this is a new route object, so doing atomics and stuff makes no sense at all. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
44713b67 |
|
04-Mar-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Optimize flow initialization in output route lookup. We burn a lot of useless cycles, cpu store buffer traffic, and memory operations memset()'ing the on-stack flow used to perform output route lookups in __ip_route_output_key(). Only the first half of the flow object members even matter for output route lookups in this context, specifically: FIB rules matching cares about: dst, src, tos, iif, oif, mark FIB trie lookup cares about: dst FIB semantic match cares about: tos, scope, oif Therefore only initialize these specific members and elide the memset entirely. On Niagara2 this kills about ~300 cycles from the output route lookup path. Likely, we can take things further, since all callers of output route lookups essentially throw away the on-stack flow they use. So they don't care if we use it as a scratch-pad to compute the final flow key. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
|
#
5bfa787f |
|
02-Mar-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: ip_route_output_key() is better as an inline. This avoid a stack frame at zero cost. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b23dd4fe |
|
02-Mar-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Make output route lookup return rtable directly. Instead of on the stack. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
452edd59 |
|
02-Mar-2011 |
David S. Miller <davem@davemloft.net> |
xfrm: Return dst directly from xfrm_lookup() Instead of on the stack. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2774c131 |
|
01-Mar-2011 |
David S. Miller <davem@davemloft.net> |
xfrm: Handle blackhole route creation via afinfo. That way we don't have to potentially do this in every xfrm_lookup() caller. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
80c0bc9e |
|
01-Mar-2011 |
David S. Miller <davem@davemloft.net> |
xfrm: Kill XFRM_LOOKUP_WAIT flag. This can be determined from the flow flags instead. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
273447b3 |
|
01-Mar-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Kill can_sleep arg to ip_route_output_flow() This boolean state is now available in the flow flags. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
420d44da |
|
01-Mar-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Make final arg to ip_route_output_flow to be boolean "can_sleep" Since that is what the current vague "flags" argument means. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
214f45c9 |
|
18-Feb-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: provide default_advmss() methods to blackhole dst_ops Commit 0dbaee3b37e118a (net: Abstract default ADVMSS behind an accessor.) introduced a possible crash in tcp_connect_init(), when dst->default_advmss() is called from dst_metric_advmss() Reported-by: George Spelvin <linux@horizon.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
982721f3 |
|
16-Feb-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Use const'ify fib_result deep in the route call chains. The only troublesome bit here is __mkroute_output which wants to override res->fi and res->type, compute those in local variables instead. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3c7bd1a1 |
|
16-Feb-2011 |
David S. Miller <davem@davemloft.net> |
net: Add initial_ref arg to dst_alloc(). This allows avoiding multiple writes to the initial __refcnt. The most simplest cases of wanting an initial reference of "1" in ipv4 and ipv6 have been converted, the rest have been left along and kept at the existing "0". Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0c4dcd58 |
|
17-Feb-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Consolidate ipv4 dst allocation logic. This also allows us to combine all the dst->flags settings and avoid read/modify/write sequences to this struct member. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
010c2708 |
|
17-Feb-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Move rcu_read_{lock,unlock}() into ip_route_output_slow(). Simplifies tail of __ip_route_output_key(). Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5ada5527 |
|
17-Feb-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Simplify output route creation call sequence. There's a lot of redundancy and unnecessary stack frames in the output route creation path. 1) Make __mkroute_output() return error pointers. 2) Eliminate ip_mkroute_output() entirely, made possible by #1. 3) Call __mkroute_output() directly and handling the returning error pointers in ip_route_output_slow(). Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f39925db |
|
09-Feb-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Cache learned redirect information in inetpeer. Note that we do not generate the redirect netevent any longer, because we don't create a new cached route. Instead, once the new neighbour is bound to the cached route, we emit a neigh update event instead. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2c8cec5c |
|
09-Feb-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Cache learned PMTU information in inetpeer. The general idea is that if we learn new PMTU information, we bump the peer genid. This triggers the dst_ops->check() code to validate and if necessary propagate the new PMTU value into the metrics. Learned PMTU information self-expires. This means that it is not necessary to kill a cached route entry just because the PMTU information is too old. As a consequence: 1) When the path appears unreachable (dst_ops->link_failure or dst_ops->negative_advice) we unwind the PMTU state if it is out of date, instead of killing the cached route. A redirected route will still be invalidated in these situations. 2) rt_check_expire(), rt_worker_func(), et al. are no longer necessary at all. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6431cbc2 |
|
07-Feb-2011 |
David S. Miller <davem@davemloft.net> |
inet: Create a mechanism for upward inetpeer propagation into routes. If we didn't have a routing cache, we would not be able to properly propagate certain kinds of dynamic path attributes, for example PMTU information and redirects. The reason is that if we didn't have a routing cache, then there would be no way to lookup all of the active cached routes hanging off of sockets, tunnels, IPSEC bundles, etc. Consider the case where we created a cached route, but no inetpeer entry existed and also we were not asked to pre-COW the route metrics and therefore did not force the creation a new inetpeer entry. If we later get a PMTU message, or a redirect, and store this information in a new inetpeer entry, there is no way to teach that cached route about the newly existing inetpeer entry. The facilities implemented here handle this problem. First we create a generation ID. When we create a cached route of any kind, we remember the generation ID at the time of attachment. Any time we force-create an inetpeer entry in response to new path information, we bump that generation ID. The dst_ops->check() callback is where the knowledge of this event is propagated. If the global generation ID does not equal the one stored in the cached route, and the cached route has not attached to an inetpeer yet, we look it up and attach if one is found. Now that we've updated the cached route's information, we update the route's generation ID too. This clears the way for implementing PMTU and redirects directly in the inetpeer cache. There is absolutely no need to consult cached route information in order to maintain this information. At this point nothing bumps the inetpeer genids, that comes in the later changes which handle PMTUs and redirects using inetpeers. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8d13a2a9 |
|
08-Feb-2011 |
David S. Miller <davem@davemloft.net> |
net: Kill NETEVENT_PMTU_UPDATE. Nobody actually does anything in response to the event, so just kill it off. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
92d86829 |
|
04-Feb-2011 |
David S. Miller <davem@davemloft.net> |
inetpeer: Move ICMP rate limiting state into inet_peer entries. Like metrics, the ICMP rate limiting bits are cached state about a destination. So move it into the inet_peer entries. If an inet_peer cannot be bound (the reason is memory allocation failure or similar), the policy is to allow. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0131ba45 |
|
04-Feb-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Don't miss existing cached metrics in new routes. Always lookup to see if we have an existing inetpeer entry for a route. Let FLOWI_FLAG_PRECOW_METRICS merely influence the "create" argument to rt_bind_peer(). Also, call rt_bind_peer() unconditionally since it is not possible for rt->peer to be non-NULL at this point. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0c838ff1 |
|
31-Jan-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: Consolidate all default route selection implementations. Both fib_trie and fib_hash have a local implementation of fib_table_select_default(). This is completely unnecessary code duplication. Since we now remember the fib_table and the head of the fib alias list of the default route, we can implement one single generic version of this routine. Looking at the fib_hash implementation you may get the impression that it's possible for there to be multiple top-level routes in the table for the default route. The truth is, it isn't, the insert code will only allow one entry to exist in the zero prefix hash table, because all keys evaluate to zero and all keys in a hash table must be unique. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ec831ea7 |
|
31-Jan-2011 |
Roland Dreier <roland@purestorage.com> |
net: Add default_mtu() methods to blackhole dst_ops When an IPSEC SA is still being set up, __xfrm_lookup() will return -EREMOTE and so ip_route_output_flow() will return a blackhole route. This can happen in a sndmsg call, and after d33e455337ea ("net: Abstract default MTU metric calculation behind an accessor.") this leads to a crash in ip_append_data() because the blackhole dst_ops have no default_mtu() method and so dst_mtu() calls a NULL pointer. Fix this by adding default_mtu() methods (that simply return 0, matching the old behavior) to the blackhole dst_ops. The IPv4 part of this patch fixes a crash that I saw when using an IPSEC VPN; the IPv6 part is untested because I don't have an IPv6 VPN, but it looks to be needed as well. Signed-off-by: Roland Dreier <roland@purestorage.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b8dad61c |
|
28-Jan-2011 |
David S. Miller <davem@davemloft.net> |
ipv4: If fib metrics are default, no need to grab ref to FIB info. The fib metric memory in this case is static in the kernel image, so we don't need to reference count it since it's never going to go away on us. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a4daad6b |
|
27-Jan-2011 |
David S. Miller <davem@davemloft.net> |
net: Pre-COW metrics for TCP. TCP is going to record metrics for the connection, so pre-COW the route metrics at route cache entry creation time. This avoids several atomic operations that have to occur if we COW the metrics after the entry reaches global visibility. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
06582540 |
|
27-Jan-2011 |
David S. Miller <davem@davemloft.net> |
net: Store ipv4/ipv6 COW'd metrics in inetpeer cache. Please note that the IPSEC dst entry metrics keep using the generic metrics COW'ing mechanism using kmalloc/kfree. This gives the IPSEC routes an opportunity to use metrics which are unique to their encapsulated paths. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
62fa8a84 |
|
26-Jan-2011 |
David S. Miller <davem@davemloft.net> |
net: Implement read-only protection and COW'ing of metrics. Routing metrics are now copy-on-write. Initially a route entry points it's metrics at a read-only location. If a routing table entry exists, it will point there. Else it will point at the all zero metric place-holder called 'dst_default_metrics'. The writeability state of the metrics is stored in the low bits of the metrics pointer, we have two bits left to spare if we want to store more states. For the initial implementation, COW is implemented simply via kmalloc. However future enhancements will change this to place the writable metrics somewhere else, in order to increase sharing. Very likely this "somewhere else" will be the inetpeer cache. Note also that this means that metrics updates may transiently fail if we cannot COW the metrics successfully. But even by itself, this patch should decrease memory usage and increase cache locality especially for routing workloads. In those cases the read-only metric copies stay in place and never get written to. TCP workloads where metrics get updated, and those rare cases where PMTU triggers occur, will take a very slight performance hit. But that hit will be alleviated when the long-term writable metrics move to a more sharable location. Since the metrics storage went from a u32 array of RTAX_MAX entries to what is essentially a pointer, some retooling of the dst_entry layout was necessary. Most importantly, we need to preserve the alignment of the reference count so that it doesn't share cache lines with the read-mostly state, as per Eric Dumazet's alignment assertion checks. The only non-trivial bit here is the move of the 'flags' member into the writeable cacheline. This is OK since we are always accessing the flags around the same moment when we made a modification to the reference count. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c7066f70 |
|
14-Jan-2011 |
Patrick McHardy <kaber@trash.net> |
netfilter: fix Kconfig dependencies Fix dependencies of netfilter realm match: it depends on NET_CLS_ROUTE, which itself depends on NET_SCHED; this dependency is missing from netfilter. Since matching on realms is also useful without having NET_SCHED enabled and the option really only controls whether the tclassid member is included in route and dst entries, rename the config option to IP_ROUTE_CLASSID and move it outside of traffic scheduling context to get rid of the NET_SCHED dependeny. Reported-by: Vladis Kletnieks <Valdis.Kletnieks@vt.edu> Signed-off-by: Patrick McHardy <kaber@trash.net>
|
#
9fc3bbb4 |
|
03-Jan-2011 |
Joel Sing <jsing@google.com> |
ipv4/route.c: respect prefsrc for local routes The preferred source address is currently ignored for local routes, which results in all local connections having a src address that is the same as the local dst address. Fix this by respecting the preferred source address when it is provided for local routes. This bug can be demonstrated as follows: # ifconfig dummy0 192.168.0.1 # ip route show table local | grep local.*dummy0 local 192.168.0.1 dev dummy0 proto kernel scope host src 192.168.0.1 # ip route change table local local 192.168.0.1 dev dummy0 \ proto kernel scope host src 127.0.0.1 # ip route show table local | grep local.*dummy0 local 192.168.0.1 dev dummy0 proto kernel scope host src 127.0.0.1 We now establish a local connection and verify the source IP address selection: # nc -l 192.168.0.1 3128 & # nc 192.168.0.1 3128 & # netstat -ant | grep 192.168.0.1:3128.*EST tcp 0 0 192.168.0.1:3128 192.168.0.1:33228 ESTABLISHED tcp 0 0 192.168.0.1:33228 192.168.0.1:3128 ESTABLISHED Signed-off-by: Joel Sing <jsing@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fc75fc83 |
|
21-Dec-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
ipv4: dont create routes on down devices In ip_route_output_slow(), instead of allowing a route to be created on a not UPed device, report -ENETUNREACH immediately. # ip tunnel add mode ipip remote 10.16.0.164 local 10.16.0.72 dev eth0 # (Note : tunl1 is down) # ping -I tunl1 10.1.2.3 PING 10.1.2.3 (10.1.2.3) from 192.168.18.5 tunl1: 56(84) bytes of data. (nothing) # ./a.out tunl1 # ip tunnel del tunl1 Message from syslogd@shelby at Dec 22 10:12:08 ... kernel: unregister_netdevice: waiting for tunl1 to become free. Usage count = 3 After patch: # ping -I tunl1 10.1.2.3 connect: Network is unreachable Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Reviewed-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6561a3b1 |
|
19-Dec-2010 |
David S. Miller <davem@davemloft.net> |
ipv4: Flush per-ns routing cache more sanely. Flush the routing cache only of entries that match the network namespace in which the purge event occurred. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
|
#
d33e4553 |
|
14-Dec-2010 |
David S. Miller <davem@davemloft.net> |
net: Abstract default MTU metric calculation behind an accessor. Like RTAX_ADVMSS, make the default calculation go through a dst_ops method rather than caching the computation in the routing cache entries. Now dst metrics are pretty much left as-is when new entries are created, thus optimizing metric sharing becomes a real possibility. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0dbaee3b |
|
13-Dec-2010 |
David S. Miller <davem@davemloft.net> |
net: Abstract default ADVMSS behind an accessor. Make all RTAX_ADVMSS metric accesses go through a new helper function, dst_metric_advmss(). Leave the actual default metric as "zero" in the real metric slot, and compute the actual default value dynamically via a new dst_ops AF specific callback. For stacked IPSEC routes, we use the advmss of the path which preserves existing behavior. Unlike ipv4/ipv6, DecNET ties the advmss to the mtu and thus updates advmss on pmtu updates. This inconsistency in advmss handling results in more raw metric accesses than I wish we ended up with. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
323e126f |
|
12-Dec-2010 |
David S. Miller <davem@davemloft.net> |
ipv4: Don't pre-seed hoplimit metric. Always go through a new ip4_dst_hoplimit() helper, just like ipv6. This allowed several simplifications: 1) The interim dst_metric_hoplimit() can go as it's no longer userd. 2) The sysctl_ip_default_ttl entry no longer needs to use ipv4_doint_and_flush, since the sysctl is not cached in routing cache metrics any longer. 3) ipv4_doint_and_flush no longer needs to be exported and therefore can be marked static. When ipv4_doint_and_flush_strategy was removed some time ago, the external declaration in ip.h was mistakenly left around so kill that off too. We have to move the sysctl_ip_default_ttl declaration into ipv4's route cache definition header net/route.h, because currently net/ip.h (where the declaration lives now) has a back dependency on net/route.h Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5170ae82 |
|
12-Dec-2010 |
David S. Miller <davem@davemloft.net> |
net: Abstract RTAX_HOPLIMIT metric accesses behind helper. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
defb3519 |
|
08-Dec-2010 |
David S. Miller <davem@davemloft.net> |
net: Abstract away all dst_entry metrics accesses. Use helper functions to hide all direct accesses, especially writes, to dst_entry metrics values. This will allow us to: 1) More easily change how the metrics are stored. 2) Implement COW for metrics. In particular this will help us put metrics into the inetpeer cache if that is what we end up doing. We can make the _metrics member a pointer instead of an array, initially have it point at the read-only metrics in the FIB, and then on the first set grab an inetpeer entry and point the _metrics member there. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
|
#
b534ecf1 |
|
30-Nov-2010 |
David S. Miller <davem@davemloft.net> |
inetpeer: Make inet_getpeer() take an inet_peer_adress_t pointer. And make an inet_getpeer_v4() helper, update callers. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5811662b |
|
12-Nov-2010 |
Changli Gao <xiaosuo@gmail.com> |
net: use the macros defined for the members of flowi Use the macros defined for the members of flowi to clean the code up. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c7537967 |
|
11-Nov-2010 |
David S. Miller <davem@davemloft.net> |
ipv4: Make rt->fl.iif tests lest obscure. When we test rt->fl.iif against zero, we're seeing if it's an output or an input route. Make that explicit with some helper functions. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
72cdd1d9 |
|
11-Nov-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: get rid of rtable->idev It seems idev field in struct rtable has no special purpose, but adding extra atomic ops. We hold refcounts on the device itself (using percpu data, so pretty cheap in current kernel). infiniband case is solved using dst.dev instead of idev->dev Removal of this field means routing without route cache is now using shared data, percpu data, and only potential contention is a pair of atomic ops on struct neighbour per forwarded packet. About 5% speedup on routing test. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Roland Dreier <rolandd@cisco.com> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1c31720a |
|
25-Oct-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
ipv4: add __rcu annotations to routes.c Add __rcu annotations to : (struct dst_entry)->rt_next (struct rt_hash_bucket)->chain And use appropriate rcu primitives to reduce sparse warnings if CONFIG_SPARSE_RCU_POINTER=y Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
27b75c95 |
|
14-Oct-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: avoid RCU for NOCACHE dst There is no point using RCU for dst we allocate for a very short time (used once). Change dst_release() to take DST_NOCACHE into account, but also change skb_dst_set_noref() to force a refcount increment for such dst. This is a _huge_ gain, because we dont waste memory to store xx thousand of dsts. Instead of queueing them to RCU, we can free them instantly. CPU caches can stay hot, re-using same memory blocks to hold temporary dsts. Note : remove unneeded smp_mb__before_atomic_dec(); in dst_release(), since atomic_dec_return() implies a full memory barrier. Stress test, 160.000.000 udp frames sent, IP route cache disabled (DDOS). Before: real 0m38.091s user 0m13.189s sys 7m53.018s After: real 0m29.946s user 0m12.157s sys 7m40.605s For reference, if IP route cache was enabled : real 0m32.030s user 0m10.521s sys 8m15.243s Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
27a954bd |
|
17-Oct-2010 |
Andy Walls <awalls@md.metrocast.net> |
IPv4: route.c: Change checks against 0xffffffff to ipv4_is_lbcast() Change a few checks against the hardcoded broadcast address, 0xffffffff, to ipv4_is_lbcast(). Remove some existing checks using ipv4_is_lbcast() that are now obviously superfluous. Signed-off-by: Andy Walls <awalls@md.metrocast.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fc66f95c |
|
08-Oct-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net dst: use a percpu_counter to track entries struct dst_ops tracks number of allocated dst in an atomic_t field, subject to high cache line contention in stress workload. Switch to a percpu_counter, to reduce number of time we need to dirty a central location. Place it on a separate cache line to avoid dirtying read only fields. Stress test : (Sending 160.000.000 UDP frames, IP route cache disabled, dual E5540 @2.53GHz, 32bit kernel, FIB_TRIE, SLUB/NUMA) Before: real 0m51.179s user 0m15.329s sys 10m15.942s After: real 0m45.570s user 0m15.525s sys 9m56.669s With a small reordering of struct neighbour fields, subject of a following patch, (to separate refcnt from other read mostly fields) real 0m41.841s user 0m15.261s sys 8m45.949s Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8391d07b |
|
07-Oct-2010 |
Dimitris Michailidis <dm@chelsio.com> |
ipv4: Remove leftover rcu_read_unlock calls from __mkroute_output() Commit "fib: RCU conversion of fib_lookup()" removed rcu_read_lock() from __mkroute_output but left a couple of calls to rcu_read_unlock() in there. This causes lockdep to complain that the rcu_read_unlock() call in __ip_route_output_key causes a lock inbalance and quickly crashes the kernel. The below fixes this for me. Signed-off-by: Dimitris Michailidis <dm@chelsio.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ebc0ffae |
|
05-Oct-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
fib: RCU conversion of fib_lookup() fib_lookup() converted to be called in RCU protected context, no reference taken and released on a contended cache line (fib_clntref) fib_table_lookup() and fib_semantic_match() get an additional parameter. struct fib_info gets an rcu_head field, and is freed after an rcu grace period. Stress test : (Sending 160.000.000 UDP frames on same neighbour, IP route cache disabled, dual E5540 @2.53GHz, 32bit kernel, FIB_HASH) (about same results for FIB_TRIE) Before patch : real 1m31.199s user 0m13.761s sys 23m24.780s After patch: real 1m5.375s user 0m14.997s sys 15m50.115s Before patch Profile : 13044.00 15.4% __ip_route_output_key vmlinux 8438.00 10.0% dst_destroy vmlinux 5983.00 7.1% fib_semantic_match vmlinux 5410.00 6.4% fib_rules_lookup vmlinux 4803.00 5.7% neigh_lookup vmlinux 4420.00 5.2% _raw_spin_lock vmlinux 3883.00 4.6% rt_set_nexthop vmlinux 3261.00 3.9% _raw_read_lock vmlinux 2794.00 3.3% fib_table_lookup vmlinux 2374.00 2.8% neigh_resolve_output vmlinux 2153.00 2.5% dst_alloc vmlinux 1502.00 1.8% _raw_read_lock_bh vmlinux 1484.00 1.8% kmem_cache_alloc vmlinux 1407.00 1.7% eth_header vmlinux 1406.00 1.7% ipv4_dst_destroy vmlinux 1298.00 1.5% __copy_from_user_ll vmlinux 1174.00 1.4% dev_queue_xmit vmlinux 1000.00 1.2% ip_output vmlinux After patch Profile : 13712.00 15.8% dst_destroy vmlinux 8548.00 9.9% __ip_route_output_key vmlinux 7017.00 8.1% neigh_lookup vmlinux 4554.00 5.3% fib_semantic_match vmlinux 4067.00 4.7% _raw_read_lock vmlinux 3491.00 4.0% dst_alloc vmlinux 3186.00 3.7% neigh_resolve_output vmlinux 3103.00 3.6% fib_table_lookup vmlinux 2098.00 2.4% _raw_read_lock_bh vmlinux 2081.00 2.4% kmem_cache_alloc vmlinux 2013.00 2.3% _raw_spin_lock vmlinux 1763.00 2.0% __copy_from_user_ll vmlinux 1763.00 2.0% ip_output vmlinux 1761.00 2.0% ipv4_dst_destroy vmlinux 1631.00 1.9% eth_header vmlinux 1440.00 1.7% _raw_read_unlock_bh vmlinux Reference results, if IP route cache is enabled : real 0m29.718s user 0m10.845s sys 7m37.341s 25213.00 29.5% __ip_route_output_key vmlinux 9011.00 10.5% dst_release vmlinux 4817.00 5.6% ip_push_pending_frames vmlinux 4232.00 5.0% ip_finish_output vmlinux 3940.00 4.6% udp_sendmsg vmlinux 3730.00 4.4% __copy_from_user_ll vmlinux 3716.00 4.4% ip_route_output_flow vmlinux 2451.00 2.9% __xfrm_lookup vmlinux 2221.00 2.6% ip_append_data vmlinux 1718.00 2.0% _raw_spin_lock_bh vmlinux 1655.00 1.9% __alloc_skb vmlinux 1572.00 1.8% sock_wfree vmlinux 1345.00 1.6% kfree vmlinux Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c7d4426a |
|
03-Oct-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: introduce DST_NOCACHE flag While doing stress tests with IP route cache disabled, and multi queue devices, I noticed a very high contention on one rwlock used in neighbour code. When many cpus are trying to send frames (possibly using a high performance multiqueue device) to the same neighbour, they fight for the neigh->lock rwlock in order to call neigh_hh_init(), and fight on hh->hh_refcnt (a pair of atomic_inc/atomic_dec_and_test()) But we dont need to call neigh_hh_init() for dst that are used only once. It costs four atomic operations at least, on two contended cache lines, plus the high contention on neigh->lock rwlock. Introduce a new dst flag, DST_NOCACHE, that is set when dst was not inserted in route cache. With the stress test bench, sending 160000000 frames on one neighbour, results are : Before patch: real 2m28.406s user 0m11.781s sys 36m17.964s After patch: real 1m26.532s user 0m12.185s sys 20m3.903s Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0197aa38 |
|
29-Sep-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
ipv4: rcu conversion in ip_route_output_slow ip_route_output_slow() is enclosed in an rcu_read_lock() protected section, so that no references are taken/released on device, thanks to __ip_dev_find() & dev_get_by_index_rcu() Tested with ip route cache disabled, and a stress test : Before patch: elapsed time : real 1m38.347s user 0m11.909s sys 23m51.501s Profile: 13788.00 22.7% ip_route_output_slow [kernel] 7875.00 13.0% dst_destroy [kernel] 3925.00 6.5% fib_semantic_match [kernel] 3144.00 5.2% fib_rules_lookup [kernel] 3061.00 5.0% dst_alloc [kernel] 2276.00 3.7% rt_set_nexthop [kernel] 1762.00 2.9% fib_table_lookup [kernel] 1538.00 2.5% _raw_read_lock [kernel] 1358.00 2.2% ip_output [kernel] After patch: real 1m28.808s user 0m13.245s sys 20m37.293s 10950.00 17.2% ip_route_output_slow [kernel] 10726.00 16.9% dst_destroy [kernel] 5170.00 8.1% fib_semantic_match [kernel] 3937.00 6.2% dst_alloc [kernel] 3635.00 5.7% rt_set_nexthop [kernel] 2900.00 4.6% fib_rules_lookup [kernel] 2240.00 3.5% fib_table_lookup [kernel] 1427.00 2.2% _raw_read_lock [kernel] 1157.00 1.8% kmem_cache_alloc [kernel] Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dd28d1a0 |
|
29-Sep-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
ipv4: __mkroute_output() speedup While doing stress tests with a disabled IP route cache, I found __mkroute_output() was touching three times in_device atomic refcount. Use RCU to touch it once to reduce cache line ping pongs. Before patch time to perform the test real 1m42.009s user 0m12.545s sys 25m0.726s Profile : 16109.00 26.4% ip_route_output_slow vmlinux 7434.00 12.2% dst_destroy vmlinux 3280.00 5.4% fib_rules_lookup vmlinux 3252.00 5.3% fib_semantic_match vmlinux 2622.00 4.3% fib_table_lookup vmlinux 2535.00 4.1% dst_alloc vmlinux 1750.00 2.9% _raw_read_lock vmlinux 1532.00 2.5% rt_set_nexthop vmlinux After patch real 1m36.503s user 0m12.977s sys 23m25.608s 14234.00 22.4% ip_route_output_slow vmlinux 8717.00 13.7% dst_destroy vmlinux 4052.00 6.4% fib_rules_lookup vmlinux 3951.00 6.2% fib_semantic_match vmlinux 3191.00 5.0% dst_alloc vmlinux 1764.00 2.8% fib_table_lookup vmlinux 1692.00 2.7% _raw_read_lock vmlinux 1605.00 2.5% rt_set_nexthop vmlinux Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7e1b33e5 |
|
27-Sep-2010 |
Ulrich Weber <uweber@astaro.com> |
ipv6: add IPv6 to neighbour table overflow warning IPv4 and IPv6 have separate neighbour tables, so the warning messages should be distinguishable. [ Add a suitable message prefix on the ipv4 side as well -DaveM ] Signed-off-by: Ulrich Weber <uweber@astaro.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
83180af0 |
|
23-Sep-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: fix rcu use in ip_route_output_slow __in_dev_get_rtnl(dev_out) is called while RTNL is not held, thus triggers a lockdep fault. At this point, we only perform a raw test of dev_out->ip_ptr being NULL, we dont need to make sure ip_ptr cant changed right after. We can use rcu_dereference_raw() for this. Reported-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a02cec21 |
|
22-Sep-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: return operator cleanup Change "return (EXPR);" to "return EXPR;" return is not a function, parentheses are not required. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ae2688d5 |
|
08-Sep-2010 |
Jianzhao Wang <jianzhao.wang@6wind.com> |
net: blackhole route should always be recalculated Blackhole routes are used when xfrm_lookup() returns -EREMOTE (error triggered by IKE for example), hence this kind of route is always temporary and so we should check if a better route exists for next packets. Bug has been introduced by commit d11a4dc18bf41719c9f0d7ed494d295dd2973b92. Signed-off-by: Jianzhao Wang <jianzhao.wang@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
49e8ab03 |
|
19-Aug-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: build_ehash_secret() and rt_bind_peer() cleanups Now cmpxchg() is available on all arches, we can use it in build_ehash_secret() and rt_bind_peer() instead of using spinlocks. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
963bfeee |
|
20-Jul-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: RTA_MARK addition Add a new rt attribute, RTA_MARK, and use it in rt_fill_info()/inet_rtm_getroute() to support following commands : ip route get 192.168.20.110 mark NUMBER ip route get 192.168.20.108 from 192.168.20.110 iif eth1 mark NUMBER ip route list cache [192.168.20.110] mark NUMBER Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4bc2f18b |
|
09-Jul-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net/ipv4: EXPORT_SYMBOL cleanups CodingStyle cleanups EXPORT_SYMBOL should immediately follow the symbol declaration. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
317fe0e6 |
|
15-Jun-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
inetpeer: restore small inet_peer structures Addition of rcu_head to struct inet_peer added 16bytes on 64bit arches. Thats a bit unfortunate, since old size was exactly 64 bytes. This can be solved, using an union between this rcu_head an four fields, that are normally used only when a refcount is taken on inet_peer. rcu_head is used only when refcnt=-1, right before structure freeing. Add a inet_peer_refcheck() function to check this assertion for a while. We can bring back SLAB_HWCACHE_ALIGN qualifier in kmem cache creation. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d8d1f30b |
|
11-Jun-2010 |
Changli Gao <xiaosuo@gmail.com> |
net-next: remove useless union keyword remove useless union keyword in rtable, rt6_info and dn_route. Since there is only one member in a union, the union keyword isn't useful. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ed7865a4 |
|
07-Jun-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
ipv4: avoid two atomic ops in ip_rt_redirect() in_dev_get() -> __in_dev_get_rcu() in a rcu protected function. [ Fix build with CONFIG_IP_ROUTE_VERBOSE disabled. -DaveM ] Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
47360228 |
|
02-Jun-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
ipv4: RCU changes in __mkroute_input() Avoid two atomic ops on output device refcount Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
96d36220 |
|
02-Jun-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
ipv4: RCU conversion of ip_route_input_slow/ip_route_input_mc Avoid two atomic ops on struct in_device refcount per incoming packet, if slow path taken, (or route cache disabled) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b5f7e755 |
|
01-Jun-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
ipv4: add LINUX_MIB_IPRPFILTER snmp counter Christoph Lameter mentioned that packets could be dropped in input path because of rp_filter settings, without any SNMP counter being incremented. System administrator can have a hard time to track the problem. This patch introduces a new counter, LINUX_MIB_IPRPFILTER, incremented each time we drop a packet because Reverse Path Filter triggers. (We receive an IPv4 datagram on a given interface, and find the route to send an answer would use another interface) netstat -s | grep IPReversePathFilter IPReversePathFilter: 21714 Reported-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
27f39c73e |
|
19-May-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: Use __this_cpu_inc() in fast path This patch saves 224 bytes of text on my machine. __this_cpu_inc() generates a single instruction, using no scratch registers : 65 ff 04 25 a8 30 01 00 incl %gs:0x130a8 instead of : 48 c7 c2 80 30 01 00 mov $0x13080,%rdx 65 48 8b 04 25 88 ea 00 00 mov %gs:0xea88,%rax 83 44 10 28 01 addl $0x1,0x28(%rax,%rdx,1) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
407eadd9 |
|
10-May-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: implements ip_route_input_noref() ip_route_input() is the version returning a refcounted dst, while ip_route_input_noref() returns a non refcounted one. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7fee226a |
|
11-May-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: add a noref bit on skb dst Use low order bit of skb->_skb_dst to tell dst is not refcounted. Change _skb_dst to _skb_refdst to make sure all uses are catched. skb_dst() returns the dst, regardless of noref bit set or not, but with a lockdep check to make sure a noref dst is not given if current user is not rcu protected. New skb_dst_set_noref() helper to set an notrefcounted dst on a skb. (with lockdep check) skb_dst_drop() drops a reference only if skb dst was refcounted. skb_dst_force() helper is used to force a refcount on dst, when skb is queued and not anymore RCU protected. Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if !IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in sock_queue_rcv_skb(), in __nf_queue(). Use skb_dst_force() in dev_requeue_skb(). Note: dst_use_noref() still dirties dst, we might transform it later to do one dirtying per jiffies. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3ee94372 |
|
08-May-2010 |
Neil Horman <nhorman@tuxdriver.com> |
ipv4: remove ip_rt_secret timer (v4) A while back there was a discussion regarding the rt_secret_interval timer. Given that we've had the ability to do emergency route cache rebuilds for awhile now, based on a statistical analysis of the various hash chain lengths in the cache, the use of the flush timer is somewhat redundant. This patch removes the rt_secret_interval sysctl, allowing us to rely solely on the statistical analysis mechanism to determine the need for route cache flushes. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0eae88f3 |
|
20-Apr-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: Fix various endianness glitches Sparse can help us find endianness bugs, but we need to make some cleanups to be able to more easily spot real bugs. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5a0e3ad6 |
|
24-Mar-2010 |
Tejun Heo <tj@kernel.org> |
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
|
#
6a2bad70 |
|
24-Mar-2010 |
Pavel Emelyanov <xemul@openvz.org> |
ipv4: Restart rt_intern_hash after emergency rebuild (v2) The the rebuild changes the genid which in turn is used at the hash calculation. Thus if we don't restart and go on with inserting the rt will happen in wrong chain. (Fixed Neil's comment about the index passed into the rt_intern_hash) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Reviewed-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b35ecb5d |
|
24-Mar-2010 |
Pavel Emelyanov <xemul@openvz.org> |
ipv4: Cleanup struct net dereference in rt_intern_hash There's no need in getting it 3 times and gcc isn't smart enough to understand this himself. This is just a cleanup before the fix (next patch). Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5e016cbf |
|
21-Mar-2010 |
Guenter Roeck <guenter.roeck@ericsson.com> |
ipv4: Don't drop redirected route cache entry unless PTMU actually expired TCP sessions over IPv4 can get stuck if routers between endpoints do not fragment packets but implement PMTU instead, and we are using those routers because of an ICMP redirect. Setup is as follows MTU1 MTU2 MTU1 A--------B------C------D with MTU1 > MTU2. A and D are endpoints, B and C are routers. B and C implement PMTU and drop packets larger than MTU2 (for example because DF is set on all packets). TCP sessions are initiated between A and D. There is packet loss between A and D, causing frequent TCP retransmits. After the number of retransmits on a TCP session reaches tcp_retries1, tcp calls dst_negative_advice() prior to each retransmit. This results in route cache entries for the peer to be deleted in ipv4_negative_advice() if the Path MTU is set. If the outstanding data on an affected TCP session is larger than MTU2, packets sent from the endpoints will be dropped by B or C, and ICMP NEEDFRAG will be returned. A and D receive NEEDFRAG messages and update PMTU. Before the next retransmit, tcp will again call dst_negative_advice(), causing the route cache entry (with correct PMTU) to be deleted. The retransmitted packet will be larger than MTU2, causing it to be dropped again. This sequence repeats until the TCP session aborts or is terminated. Problem is fixed by removing redirected route cache entries in ipv4_negative_advice() only if the PMTU is expired. Signed-off-by: Guenter Roeck <guenter.roeck@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d11a4dc1 |
|
18-Mar-2010 |
Timo Teräs <timo.teras@iki.fi> |
ipv4: check rt_genid in dst_check Xfrm_dst keeps a reference to ipv4 rtable entries on each cached bundle. The only way to renew xfrm_dst when the underlying route has changed, is to implement dst_check for this. This is what ipv6 side does too. The problems started after 87c1e12b5eeb7b30b4b41291bef8e0b41fc3dde9 ("ipsec: Fix bogus bundle flowi") which fixed a bug causing xfrm_dst to not get reused, until that all lookups always generated new xfrm_dst with new route reference and path mtu worked. But after the fix, the old routes started to get reused even after they were expired causing pmtu to break (well it would occationally work if the rtable gc had run recently and marked the route obsolete causing dst_check to get called). Signed-off-by: Timo Teras <timo.teras@iki.fi> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
858a18a6 |
|
15-Mar-2010 |
Vitaliy Gusev <vgusev@openvz.org> |
route: Fix caught BUG_ON during rt_secret_rebuild_oneshot() route: Fix caught BUG_ON during rt_secret_rebuild_oneshot() Call rt_secret_rebuild can cause BUG_ON(timer_pending(&net->ipv4.rt_secret_timer)) in add_timer as there is not any synchronization for call rt_secret_rebuild_oneshot() for the same net namespace. Also this issue affects to rt_secret_reschedule(). Thus use mod_timer enstead. Signed-off-by: Vitaliy Gusev <vgusev@openvz.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
98376387 |
|
07-Mar-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: fix route cache rebuilds We added an automatic route cache rebuilding in commit 1080d709fb9d8cd43 but had to correct few bugs. One of the assumption of original patch, was that entries where kept sorted in a given way. This assumption is known to be wrong (commit 1ddbcb005c395518 gave an explanation of this and corrected a leak) and expensive to respect. Paweł Staszewski reported to me one of his machine got its routing cache disabled after few messages like : [ 2677.850065] Route hash chain too long! [ 2677.850080] Adjust your secret_interval! [82839.662993] Route hash chain too long! [82839.662996] Adjust your secret_interval! [155843.731650] Route hash chain too long! [155843.731664] Adjust your secret_interval! [155843.811881] Route hash chain too long! [155843.811891] Adjust your secret_interval! [155843.858209] vlan0811: 5 rebuilds is over limit, route caching disabled [155843.858212] Route hash chain too long! [155843.858213] Adjust your secret_interval! This is because rt_intern_hash() might be fooled when computing a chain length, because multiple entries with same keys can differ because of TOS (or mark/oif) bits. In the rare case the fast algorithm see a too long chain, and before taking expensive path, we call a helper function in order to not count duplicates of same routes, that only differ with tos/mark/oif bits. This helper works with data already in cpu cache and is not be very expensive, despite its O(N^2) implementation. Paweł Staszewski sucessfully tested this patch on his loaded router. Reported-and-tested-by: Paweł Staszewski <pstaszewski@itcare.pl> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a898def2 |
|
22-Feb-2010 |
Paul E. McKenney <paulmck@kernel.org> |
net: Add checking to rcu_dereference() primitives Update rcu_dereference() primitives to use new lockdep-based checking. The rcu_dereference() in __in6_dev_get() may be protected either by rcu_read_lock() or RTNL, per Eric Dumazet. The rcu_dereference() in __sk_free() is protected by the fact that it is never reached if an update could change it. Check for this by using rcu_dereference_check() to verify that the struct sock's ->sk_wmem_alloc counter is zero. Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-5-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
#
7d720c3e |
|
16-Feb-2010 |
Tejun Heo <tj@kernel.org> |
percpu: add __percpu sparse annotations to net Add __percpu sparse annotations to net. These annotations are to make sparse consider percpu variables to be in a different address space and warn if accessed without going through percpu accessors. This patch doesn't affect normal builds. The macro and type tricks around snmp stats make things a bit interesting. DEFINE/DECLARE_SNMP_STAT() macros mark the target field as __percpu and SNMP_UPD_PO_STATS() macro is updated accordingly. All snmp_mib_*() users which used to cast the argument to (void **) are updated to cast it to (void __percpu **). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Vlad Yasevich <vladislav.yasevich@hp.com> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0a931acf |
|
16-Jan-2010 |
Alexey Dobriyan <adobriyan@gmail.com> |
ipv4: don't remove /proc/net/rt_acct /proc/net/rt_acct is not created if NET_CLS_ROUTE=n. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
65324144 |
|
04-Jan-2010 |
Jesper Dangaard Brouer <hawk@comx.dk> |
net: RFC3069, private VLAN proxy arp support This is to be used together with switch technologies, like RFC3069, that where the individual ports are not allowed to communicate with each other, but they are allowed to talk to the upstream router. As described in RFC 3069, it is possible to allow these hosts to communicate through the upstream router by proxy_arp'ing. This patch basically allow proxy arp replies back to the same interface (from which the ARP request/solicitation was received). Tunable per device via proc "proxy_arp_pvlan": /proc/sys/net/ipv4/conf/*/proxy_arp_pvlan This switch technology is known by different vendor names: - In RFC 3069 it is called VLAN Aggregation. - Cisco and Allied Telesyn call it Private VLAN. - Hewlett-Packard call it Source-Port filtering or port-isolation. - Ericsson call it MAC-Forced Forwarding (RFC Draft). Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a5ee1551 |
|
29-Nov-2009 |
Eric W. Biederman <ebiederm@xmission.com> |
net: NETDEV_UNREGISTER_PERNET -> NETDEV_UNREGISTER_BATCH The motivation for an additional notifier in batched netdevice notification (rt_do_flush) only needs to be called once per batch not once per namespace. For further batching improvements I need a guarantee that the netdevices are unregistered in order allowing me to unregister an all of the network devices in a network namespace at the same time with the guarantee that the loopback device is really and truly unregistered last. Additionally it appears that we moved the route cache flush after the final synchronize_net, which seems wrong and there was no explanation. So I have restored the original location of the final synchronize_net. Cc: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a661c419 |
|
25-Nov-2009 |
Alexey Dobriyan <adobriyan@gmail.com> |
net: convert /proc/net/rt_acct to seq_file Rewrite statistics accumulation to be in terms of structure fields, not raw u32 additions. Keep them in same order, though. This is the last user of create_proc_read_entry() in net/, please NAK all new ones as well as all new ->write_proc, ->read_proc and create_proc_entry() users. Cc me if there are problems. :-) Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
09ad9bc7 |
|
25-Nov-2009 |
Octavian Purdila <opurdila@ixiacom.com> |
net: use net_eq to compare nets Generated with the following semantic patch @@ struct net *n1; struct net *n2; @@ - n1 == n2 + net_eq(n1, n2) @@ struct net *n1; struct net *n2; @@ - n1 != n2 + !net_eq(n1, n2) applied over {include,net,drivers/net}. Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9d4fb27d |
|
23-Nov-2009 |
Joe Perches <joe@perches.com> |
net/ipv4: Move && and || to end of previous line On Sun, 2009-11-22 at 16:31 -0800, David Miller wrote: > It should be of the form: > if (x && > y) > > or: > if (x && y) > > Fix patches, rather than complaints, for existing cases where things > do not follow this pattern are certainly welcome. Also collapsed some multiple tabs to single space. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2c1409a0 |
|
12-Nov-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
inetpeer: Optimize inet_getid() While investigating for network latencies, I found inet_getid() was a contention point for some workloads, as inet_peer_idlock is shared by all inet_getid() users regardless of peers. One way to fix this is to make ip_id_count an atomic_t instead of __u16, and use atomic_add_return(). In order to keep sizeof(struct inet_peer) = 64 on 64bit arches tcp_ts_stamp is also converted to __u32 instead of "unsigned long". Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f8572d8f |
|
05-Nov-2009 |
Eric W. Biederman <ebiederm@xmission.com> |
sysctl net: Remove unused binary sysctl code Now that sys_sysctl is a compatiblity wrapper around /proc/sys all sysctl strategy routines, and all ctl_name and strategy entries in the sysctl tables are unused, and can be revmoed. In addition neigh_sysctl_register has been modified to no longer take a strategy argument and it's callers have been modified not to pass one. Cc: "David Miller" <davem@davemloft.net> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: netdev@vger.kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
|
#
b0c110ca |
|
17-Oct-2009 |
jamal <hadi@cyberus.ca> |
net: Fix RPF to work with policy routing Policy routing is not looked up by mark on reverse path filtering. This fixes it. Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0eae750e |
|
19-Oct-2009 |
John Dykstra <john.dykstra1@gmail.com> |
IP: Cleanups Use symbols instead of magic constants while checking PMTU discovery setsockopt. Remove redundant test in ip_rt_frag_needed() (done by caller). Signed-off-by: John Dykstra <john.dykstra1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8d65af78 |
|
23-Sep-2009 |
Alexey Dobriyan <adobriyan@gmail.com> |
sysctl: remove "struct file *" argument of ->proc_handler It's unused. It isn't needed -- read or write flag is already passed and sysctl shouldn't care about the rest. It _was_ used in two places at arch/frv for some reason. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
4481374c |
|
21-Sep-2009 |
Jan Beulich <JBeulich@novell.com> |
mm: replace various uses of num_physpages by totalram_pages Sizing of memory allocations shouldn't depend on the number of physical pages found in a system, as that generally includes (perhaps a huge amount of) non-RAM pages. The amount of what actually is usable as storage should instead be used as a basis here. Some of the calculations (i.e. those not intending to use high memory) should likely even use (totalram_pages - totalhigh_pages). Signed-off-by: Jan Beulich <jbeulich@novell.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Dave Airlie <airlied@linux.ie> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
30038fc6 |
|
29-Aug-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: ip_rt_send_redirect() optimization While doing some forwarding benchmarks, I noticed ip_rt_send_redirect() is rather expensive, even if send_redirects is false for the device. Fix is to avoid two atomic ops, we dont really need to take a reference on in_dev Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a33bc5c1 |
|
30-Jul-2009 |
Neil Horman <nhorman@tuxdriver.com> |
xfrm: select sane defaults for xfrm[4|6] gc_thresh Choose saner defaults for xfrm[4|6] gc_thresh values on init Currently, the xfrm[4|6] code has hard-coded initial gc_thresh values (set to 1024). Given that the ipv4 and ipv6 routing caches are sized dynamically at boot time, the static selections can be non-sensical. This patch dynamically selects an appropriate gc threshold based on the corresponding main routing table size, using the assumption that we should in the worst case be able to handle as many connections as the routing table can. For ipv4, the maximum route cache size is 16 * the number of hash buckets in the route cache. Given that xfrm4 starts garbage collection at the gc_thresh and prevents new allocations at 2 * gc_thresh, we set gc_thresh to half the maximum route cache size. For ipv6, its a bit trickier. there is no maximum route cache size, but the ipv6 dst_ops gc_thresh is statically set to 1024. It seems sane to select a simmilar gc_thresh for the xfrm6 code that is half the number of hash buckets in the v6 route cache times 16 (like the v4 code does). Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b6280b47 |
|
22-Jun-2009 |
Neil Horman <nhorman@tuxdriver.com> |
ipv4 routing: Ensure that route cache entries are usable and reclaimable with caching is off When route caching is disabled (rt_caching returns false), We still use route cache entries that are created and passed into rt_intern_hash once. These routes need to be made usable for the one call path that holds a reference to them, and they need to be reclaimed when they're finished with their use. To be made usable, they need to be associated with a neighbor table entry (which they currently are not), otherwise iproute_finish2 just discards the packet, since we don't know which L2 peer to send the packet to. To do this binding, we need to follow the path a bit higher up in rt_intern_hash, which calls arp_bind_neighbour, but not assign the route entry to the hash table. Currently, if caching is off, we simply assign the route to the rp pointer and are reutrn success. This patch associates us with a neighbor entry first. Secondly, we need to make sure that any single use routes like this are known to the garbage collector when caching is off. If caching is off, and we try to hash in a route, it will leak when its refcount reaches zero. To avoid this, this patch calls rt_free on the route cache entry passed into rt_intern_hash. This places us on the gc list for the route cache garbage collector, so that when its refcount reaches zero, it will be reclaimed (Thanks to Alexey for this suggestion). I've tested this on a local system here, and with these patches in place, I'm able to maintain routed connectivity to remote systems, even if I set /proc/sys/net/ipv4/rt_cache_rebuild_count to -1, which forces rt_caching to return false. Signed-off-by: Neil Horman <nhorman@redhat.com> Reported-by: Jarek Poplawski <jarkao2@gmail.com> Reported-by: Maxime Bizon <mbizon@freebox.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
73e42897 |
|
20-Jun-2009 |
Neil Horman <nhorman@tuxdriver.com> |
ipv4: fix NULL pointer + success return in route lookup path Don't drop route if we're not caching I recently got a report of an oops on a route lookup. Maxime was testing what would happen if route caching was turned off (doing so by setting making rt_caching always return 0), and found that it triggered an oops. I looked at it and found that the problem stemmed from the fact that the route lookup routines were returning success from their lookup paths (which is good), but never set the **rp pointer to anything (which is bad). This happens because in rt_intern_hash, if rt_caching returns false, we call rt_drop and return 0. This almost emulates slient success. What we should be doing is assigning *rp = rt and _not_ dropping the route. This way, during slow path lookups, when we create a new route cache entry, we don't immediately discard it, rather we just don't add it into the cache hash table, but we let this one lookup use it for the purpose of this route request. Maxime has tested and reports it prevents the oops. There is still a subsequent routing issue that I'm looking into further, but I'm confident that, even if its related to this same path, this patch makes sense to take. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
125bb8f5 |
|
11-Jun-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: use a deferred timer in rt_check_expire For the sake of power saver lovers, use a deferrable timer to fire rt_check_expire() As some big routers cache equilibrium depends on garbage collection done in time, we take into account elapsed time between two rt_check_expire() invocations to adjust the amount of slots we have to check. Based on an initial idea and patch from Tero Kristo Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Tero Kristo <tero.kristo@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
adf30907 |
|
01-Jun-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: skb->dst accessors Define three accessors to get/set dst attached to a skb struct dst_entry *skb_dst(const struct sk_buff *skb) void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst) void skb_dst_drop(struct sk_buff *skb) This one should replace occurrences of : dst_release(skb->dst) skb->dst = NULL; Delete skb->dst field Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
511c3f92 |
|
01-Jun-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: skb->rtable accessor Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb Delete skb->rtable field Setting rtable is not allowed, just set dst instead as rtable is an alias. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1ddbcb00 |
|
19-May-2009 |
Eric Dumazet <dada1@cosmosbay.com> |
net: fix rtable leak in net/ipv4/route.c Alexander V. Lukyanov found a regression in 2.6.29 and made a complete analysis found in http://bugzilla.kernel.org/show_bug.cgi?id=13339 Quoted here because its a perfect one : begin_of_quotation 2.6.29 patch has introduced flexible route cache rebuilding. Unfortunately the patch has at least one critical flaw, and another problem. rt_intern_hash calculates rthi pointer, which is later used for new entry insertion. The same loop calculates cand pointer which is used to clean the list. If the pointers are the same, rtable leak occurs, as first the cand is removed then the new entry is appended to it. This leak leads to unregister_netdevice problem (usage count > 0). Another problem of the patch is that it tries to insert the entries in certain order, to facilitate counting of entries distinct by all but QoS parameters. Unfortunately, referencing an existing rtable entry moves it to list beginning, to speed up further lookups, so the carefully built order is destroyed. For the first problem the simplest patch it to set rthi=0 when rthi==cand, but it will also destroy the ordering. end_of_quotation Problematic commit is 1080d709fb9d8cd4392f93476ee46a9d6ea05a5b (net: implement emergency route cache rebulds when gc_elasticity is exceeded) Trying to keep dst_entries ordered is too complex and breaks the fact that order should depend on the frequency of use for garbage collection. A possible fix is to make rt_intern_hash() simpler, and only makes rt_check_expire() a litle bit smarter, being able to cope with an arbitrary entries order. The added loop is running on cache hot data, while cpu is prefetching next object, so should be unnoticied. Reported-and-analyzed-by: Alexander V. Lukyanov <lav@yar.ru> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cf8da764 |
|
19-May-2009 |
Eric Dumazet <dada1@cosmosbay.com> |
net: fix length computation in rt_check_expire() rt_check_expire() computes average and standard deviation of chain lengths, but not correclty reset length to 0 at beginning of each chain. This probably gives overflows for sum2 (and sum) on loaded machines instead of meaningful results. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c9503e0f |
|
27-Apr-2009 |
Anton Blanchard <anton@samba.org> |
ipv4: Limit size of route cache hash table Right now we have no upper limit on the size of the route cache hash table. On a 128GB POWER6 box it ends up as 32MB: IP route cache hash table entries: 4194304 (order: 9, 33554432 bytes) It would be nice to cap this for memory consumption reasons, but a massive hashtable also causes a significant spike when measuring OS jitter. With a 32MB hashtable and 4 million entries, rt_worker_func is taking 5 ms to complete. On another system with more memory it's taking 14 ms. Even though rt_worker_func does call cond_sched() to limit its impact, in an HPC environment we want to keep all sources of OS jitter to a minimum. With the patch applied we limit the number of entries to 512k which can still be overriden by using the rt_entries boot option: IP route cache hash table entries: 524288 (order: 6, 4194304 bytes) With this patch rt_worker_func now takes 0.460 ms on the same system. Signed-off-by: Anton Blanchard <anton@samba.org> Acked-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0dcec8c2 |
|
25-Feb-2009 |
Ingo Molnar <mingo@elte.hu> |
alloc_percpu: add align argument to __alloc_percpu, fix Impact: build fix API was changed, but not all usage sites were converted: net/ipv4/route.c: In function ‘ip_rt_init’: net/ipv4/route.c:3379: error: too few arguments to function ‘__alloc_percpu’ Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
#
09640e63 |
|
01-Feb-2009 |
Harvey Harrison <harvey.harrison@gmail.com> |
net: replace uses of __constant_{endian} Base versions handle constant folding now. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4feb88e5 |
|
21-Jan-2009 |
Benjamin Thery <benjamin.thery@bull.net> |
netns: ipmr: enable namespace support in ipv4 multicast routing code This last patch makes the appropriate changes to use and propagate the network namespace where needed in IPv4 multicast routing code. This consists mainly in replacing all the remaining init_net occurences with current netns pointer retrieved from sockets, net devices or mfc_caches depending on the routines' contexts. Some routines receive a new 'struct net' parameter to propagate the current netns: * vif_add/vif_delete * ipmr_new_tunnel * mroute_clean_tables * ipmr_cache_find * ipmr_cache_report * ipmr_cache_unresolved * ipmr_mfc_add/ipmr_mfc_delete * ipmr_get_route * rt_fill_info (in route.c) Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0f23174a |
|
28-Dec-2008 |
Rusty Russell <rusty@rustcorp.com.au> |
cpumask: prepare for iterators to only go to nr_cpu_ids/nr_cpumask_bits: net In future all cpumask ops will only be valid (in general) for bit numbers < nr_cpu_ids. So use that instead of NR_CPUS in iterators and other comparisons. This is always safe: no cpu number can be >= nr_cpu_ids, and nr_cpu_ids is initialized to NR_CPUS at boot. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Mike Travis <travis@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
52479b62 |
|
25-Nov-2008 |
Alexey Dobriyan <adobriyan@gmail.com> |
netns xfrm: lookup in netns Pass netns to xfrm_lookup()/__xfrm_lookup(). For that pass netns to flow_cache_lookup() and resolver callback. Take it from socket or netdevice. Stub DECnet to init_net. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6bb3ce25 |
|
11-Nov-2008 |
Alexey Dobriyan <adobriyan@gmail.com> |
net: remove struct dst_entry::entry_size Unused after kmem_cache_zalloc() conversion. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6d9f239a |
|
03-Nov-2008 |
Alexey Dobriyan <adobriyan@gmail.com> |
net: '&' redux I want to compile out proc_* and sysctl_* handlers totally and stub them to NULL depending on config options, however usage of & will prevent this, since taking adress of NULL pointer will break compilation. So, drop & in front of every ->proc_handler and every ->strategy handler, it was never needed in fact. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
673d57e7 |
|
31-Oct-2008 |
Harvey Harrison <harvey.harrison@gmail.com> |
net: replace NIPQUAD() in net/ipv4/ net/ipv6/ Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u can be replaced with %pI4 Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
93adcc80 |
|
28-Oct-2008 |
Alexey Dobriyan <adobriyan@gmail.com> |
net: don't use INIT_RCU_HEAD call_rcu() will unconditionally rewrite RCU head anyway. Applies to struct neigh_parms struct neigh_table struct net struct cipso_v4_doi struct in_ifaddr struct in_device rt->u.dst Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
def8b4fa |
|
28-Oct-2008 |
Alexey Dobriyan <adobriyan@gmail.com> |
net: reduce structures when XFRM=n ifdef out * struct sk_buff::sp (pointer) * struct dst_entry::xfrm (pointer) * struct sock::sk_policy (2 pointers) Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1080d709 |
|
27-Oct-2008 |
Neil Horman <nhorman@tuxdriver.com> |
net: implement emergency route cache rebulds when gc_elasticity is exceeded This is a patch to provide on demand route cache rebuilding. Currently, our route cache is rebulid periodically regardless of need. This introduced unneeded periodic latency. This patch offers a better approach. Using code provided by Eric Dumazet, we compute the standard deviation of the average hash bucket chain length while running rt_check_expire. Should any given chain length grow to larger that average plus 4 standard deviations, we trigger an emergency hash table rebuild for that net namespace. This allows for the common case in which chains are well behaved and do not grow unevenly to not incur any latency at all, while those systems (which may be being maliciously attacked), only rebuild when the attack is detected. This patch take 2 other factors into account: 1) chains with multiple entries that differ by attributes that do not affect the hash value are only counted once, so as not to unduly bias system to rebuilding if features like QOS are heavily used 2) if rebuilding crosses a certain threshold (which is adjustable via the added sysctl in this patch), route caching is disabled entirely for that net namespace, since constant rebuilding is less efficient that no caching at all Tested successfully by me. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
00269b54 |
|
16-Oct-2008 |
Eric Dumazet <dada1@cosmosbay.com> |
ipv4: Add a missing rcu_assign_pointer() in routing cache. rt_intern_hash() is doing an update of a RCU guarded hash chain without using rcu_assign_pointer() or equivalent barrier. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f221e726 |
|
15-Oct-2008 |
Alexey Dobriyan <adobriyan@gmail.com> |
sysctl: simplify ->strategy name and nlen parameters passed to ->strategy hook are unused, remove them. In general ->strategy hook should know what it's doing, and don't do something tricky for which, say, pointer to original userspace array may be needed (name). Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> [ networking bits ] Cc: Ralf Baechle <ralf@linux-mips.org> Cc: David Howells <dhowells@redhat.com> Cc: Matt Mackall <mpm@selenic.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a210d01a |
|
01-Oct-2008 |
Julian Anastasov <ja@ssi.bg> |
ipv4: Loosen source address check on IPv4 output ip_route_output() contains a check to make sure that no flows with non-local source IP addresses are routed. This obviously makes using such addresses impossible. This patch introduces a flowi flag which makes omitting this check possible. The new flag provides a way of handling transparent and non-transparent connections differently. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a6272665 |
|
28-Aug-2008 |
Eric Dumazet <dada1@cosmosbay.com> |
ip: speedup /proc/net/rt_cache handling When scanning route cache hash table, we can avoid taking locks for empty buckets. Both /proc/net/rt_cache and NETLINK RTM_GETROUTE interface are taken into account. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d994af0d |
|
27-Aug-2008 |
Hugh Dickins <hugh@veritas.com> |
ipv4: mode 0555 in ipv4_skeleton vpnc on today's kernel says Cannot open "/proc/sys/net/ipv4/route/flush": d--------- 0 root root 0 2008-08-26 11:32 /proc/sys/net/ipv4/route d--------- 0 root root 0 2008-08-26 19:16 /proc/sys/net/ipv4/neigh Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2f4520d3 |
|
25-Aug-2008 |
Al Viro <viro@zeniv.linux.org.uk> |
ipv4: sysctl fixes net.ipv4.neigh should be a part of skeleton to avoid ordering problems Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c6153b5b |
|
15-Aug-2008 |
Herbert Xu <herbert@gondor.apana.org.au> |
ipv4: Disable route secret interval on zero interval Let me first state that disabling the route cache hash rebuild should not be done without extensive analysis on the risk profile and careful deliberation. However, there are times when this can be done safely or for testing. For example, when you have mechanisms for ensuring that offending parties do not exist in your network. This patch lets the user disable the rebuild if the interval is set to zero. This also incidentally fixes a divide-by-zero error with name-spaces. In addition, this patch makes the effect of an interval change immediate rather than it taking effect at the next rebuild as is currently the case. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
11d46123 |
|
06-Aug-2008 |
David S. Miller <davem@davemloft.net> |
ipv4: Fix over-ifdeffing of ip_static_sysctl_init. Noticed by Paulius Zaleckas. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6d273f8d |
|
06-Aug-2008 |
Rami Rosen <ramirose@gmail.com> |
ipv4: replace dst_metric() with dst_mtu() in net/ipv4/route.c. This patch replaces dst_metric() with dst_mtu() in net/ipv4/route.c. Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a1bc6eb4 |
|
30-Jul-2008 |
Al Viro <viro@zeniv.linux.org.uk> |
[PATCH] ipv4_static_sysctl_init() should be under CONFIG_SYSCTL Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
8a9204db |
|
31-Jul-2008 |
Ingo Molnar <mingo@elte.hu> |
net/ipv4/route.c: fix build error fix: net/ipv4/route.c: In function 'ip_static_sysctl_init': net/ipv4/route.c:3225: error: 'ipv4_route_path' undeclared (first use in this function) net/ipv4/route.c:3225: error: (Each undeclared identifier is reported only once net/ipv4/route.c:3225: error: for each function it appears in.) net/ipv4/route.c:3225: error: 'ipv4_route_table' undeclared (first use in this function) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
eeb61f71 |
|
27-Jul-2008 |
Al Viro <viro@ZenIV.linux.org.uk> |
missing bits of net-namespace / sysctl Piss-poor sysctl registration API strikes again, film at 11... What we really need is _pathname_ required to be present in already registered table, so that kernel could warn about bad order. That's the next target for sysctl stuff (and generally saner and more explicit order of initialization of ipv[46] internals wouldn't hurt either). For the time being, here are full fixups required by ..._rotable() stuff; we make per-net sysctl sets descendents of "ro" one and make sure that sufficient skeleton is there before we start registering per-net sysctls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6f9f489a |
|
27-Jul-2008 |
Al Viro <viro@zeniv.linux.org.uk> |
net: missing bits of net-namespace / sysctl Piss-poor sysctl registration API strikes again, film at 11... What we really need is _pathname_ required to be present in already registered table, so that kernel could warn about bad order. That's the next target for sysctl stuff (and generally saner and more explicit order of initialization of ipv[46] internals wouldn't hurt either). For the time being, here are full fixups required by ..._rotable() stuff; we make per-net sysctl sets descendents of "ro" one and make sure that sufficient skeleton is there before we start registering per-net sysctls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6c3b8fc6 |
|
26-Jul-2008 |
Hugh Dickins <hugh@veritas.com> |
netns: fix ip_rt_frag_needed rt_is_expired Running recent kernels, and using a particular vpn gateway, I've been having to edit my mails down to get them accepted by the smtp server. Git bisect led to commit e84f84f276473dcc673f360e8ff3203148bdf0e2 - netns: place rt_genid into struct net. The conversion from a != test to rt_is_expired() put one negative too many: and now my mail works. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7c73a6fa |
|
16-Jul-2008 |
Pavel Emelyanov <xemul@openvz.org> |
mib: add net to IP_INC_STATS_BH Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
81c684d1 |
|
08-Jul-2008 |
Denis V. Lunev <den@openvz.org> |
ipv4: remove flush_mutex from ipv4_sysctl_rtcache_flush It is possible to avoid locking at all in ipv4_sysctl_rtcache_flush by defining local ctl_table on the stack. The patch is based on the suggestion from Eric W. Biederman. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
32cb5b4e |
|
05-Jul-2008 |
Denis V. Lunev <den@openvz.org> |
netns: selective flush of rt_cache dst cache is marked as expired on the per/namespace basis by previous path. Right now we have to implement selective cache shrinking. This procedure has been ported from older OpenVz codebase. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e84f84f2 |
|
05-Jul-2008 |
Denis V. Lunev <den@openvz.org> |
netns: place rt_genid into struct net Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b00180de |
|
05-Jul-2008 |
Denis V. Lunev <den@openvz.org> |
ipv4: pass current value of rt_genid into rt_hash Basically, there is no difference to atomic_read internally or pass it as a parameter as rt_hash is inline. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
86c657f6 |
|
05-Jul-2008 |
Denis V. Lunev <den@openvz.org> |
netns: add struct net parameter to rt_cache_invalidate Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9f5e97e5 |
|
05-Jul-2008 |
Denis V. Lunev <den@openvz.org> |
netns: make rt_secret_rebuild timer per namespace Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
39a23e75 |
|
05-Jul-2008 |
Denis V. Lunev <den@openvz.org> |
netns: register net.ipv4.route.flush in each namespace Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
639e104f |
|
05-Jul-2008 |
Denis V. Lunev <den@openvz.org> |
ipv4: remove static flush_delay variable flush delay is used as an external storage for net.ipv4.route.flush sysctl entry. It is write-only. The ctl_table->data for this entry is used once. Fix this case to point to the stack to remove global variable. Do this to avoid additional variable on struct net in the next patch. Possible race (as it was before) accessing this local variable is removed using flush_mutex. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
76e6ebfb |
|
05-Jul-2008 |
Denis V. Lunev <den@openvz.org> |
netns: add namespace parameter to rt_cache_flush Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0b040829 |
|
10-Jun-2008 |
Adrian Bunk <bunk@kernel.org> |
net: remove CVS keywords This patch removes CVS keywords that weren't updated for a long time from comments. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
51b77cae |
|
03-Jun-2008 |
Thomas Graf <tgraf@suug.ch> |
route: Mark unused route cache flags as such. Also removes an obsolete check for the unused flag RTCF_MASQ. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1ac06e03 |
|
20-May-2008 |
Herbert Xu <herbert@gondor.apana.org.au> |
ipsec: Use the correct ip_local_out function Because the IPsec output function xfrm_output_resume does its own dst_output call it should always call __ip_local_output instead of ip_local_output as the latter may invoke dst_output directly. Otherwise the return values from nf_hook and dst_output may clash as they both use the value 1 but for different purposes. When that clash occurs this can cause a packet to be used after it has been freed which usually leads to a crash. Because the offending value is only returned from dst_output with qdiscs such as HTB, this bug is normally not visible. Thanks to Marco Berizzi for his perseverance in tracking this down. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5ffc02a1 |
|
04-May-2008 |
Satoru SATOH <satoru.satoh@gmail.com> |
ip: Use inline function dst_metric() instead of direct access to dst->metric[] There are functions to refer to the value of dst->metric[THE_METRIC-1] directly without use of a inline function "dst_metric" defined in net/dst.h. The following patch changes them to use the inline function consistently. Signed-off-by: Satoru SATOH <satoru.satoh@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0bbeafd0 |
|
04-May-2008 |
Satoru SATOH <satoru.satoh@gmail.com> |
ip: Make use of the inline function dst_metric_locked() Signed-off-by: Satoru SATOH <satoru.satoh@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0010e465 |
|
29-Apr-2008 |
Timo Teras <timo.teras@iki.fi> |
ipv4: Update MTU to all related cache entries in ip_rt_frag_needed() Add struct net_device parameter to ip_rt_frag_needed() and update MTU to cache entries where ifindex is specified. This is similar to what is already done in ip_rt_redirect(). Signed-off-by: Timo Teras <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5e659e4c |
|
24-Apr-2008 |
Pavel Emelyanov <xemul@openvz.org> |
[NET]: Fix heavy stack usage in seq_file output routines. Plan C: we can follow the Al Viro's proposal about %n like in this patch. The same applies to udp, fib (the /proc/net/route file), rt_cache and sctp debug. This is minus ~150-200 bytes for each. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a7d632b6 |
|
14-Apr-2008 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[IPV4]: Use NIPQUAD_FMT to format ipv4 addresses. And use %u to format port. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c0b8c32b |
|
10-Apr-2008 |
Stephen Hemminger <shemminger@vyatta.com> |
IPV4: use xor rather than multiple ands for route compare The comparison in ip_route_input is a hot path, by recoding the C "and" as bit operations, fewer conditional branches get generated so the code should be faster. Maybe someday Gcc will be smart enough to do this? Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2fa7527b |
|
10-Apr-2008 |
Stephen Hemminger <shemminger@vyatta.com> |
IPV4: route rekey timer can be deferrable No urgency on the rehash interval timer, so mark it as deferrable. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1294fc4a |
|
10-Apr-2008 |
Stephen Hemminger <shemminger@vyatta.com> |
IPV4: route use jhash3 Since route hash is a triple, use jhash_3words rather doing the mixing directly. This should be as fast and give better distribution. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5969f71d |
|
10-Apr-2008 |
Stephen Hemminger <shemminger@vyatta.com> |
IPV4: route inline changes Don't mark functions that are large as inline, let compiler decide. Also, use inline rather than __inline__. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
878628fb |
|
25-Mar-2008 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[NET] NETNS: Omit namespace comparision without CONFIG_NET_NS. Introduce an inline net_eq() to compare two namespaces. Without CONFIG_NET_NS, since no namespace other than &init_net exists, it is always 1. We do not need to convert 1) inline vs inline and 2) inline vs &init_net comparisons. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
|
#
1218854a |
|
25-Mar-2008 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[NET] NETNS: Omit seq_net_private->net without CONFIG_NET_NS. Without CONFIG_NET_NS, no namespace other than &init_net exists, no need to store net in seq_net_private. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
|
#
3b1e0a65 |
|
25-Mar-2008 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[NET] NETNS: Omit sock->sk_net without CONFIG_NET_NS. Introduce per-sock inlines: sock_net(), sock_net_set() and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set(). Without CONFIG_NET_NS, no namespace other than &init_net exists. Let's explicitly define them to help compiler optimizations. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
|
#
c346dca1 |
|
25-Mar-2008 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[NET] NETNS: Omit net_device->nd_net without CONFIG_NET_NS. Introduce per-net_device inlines: dev_net(), dev_net_set(). Without CONFIG_NET_NS, no namespace other than &init_net exists. Let's explicitly define them to help compiler optimizations. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
|
#
817bc4db |
|
22-Mar-2008 |
Stephen Hemminger <shemminger@vyatta.com> |
[IPV4] route: use read_mostly The route table parameters are set based on system memory and sysctl values that almost never change. Also the genid only changes every 10 minutes. RTprint is defined by never used. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ce259990 |
|
22-Mar-2008 |
Denis V. Lunev <den@openvz.org> |
[IPV4]: sk parameter is unused in ipv4_dst_blackhole. Just remove it. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ee6b9673 |
|
05-Mar-2008 |
Eric Dumazet <dada1@cosmosbay.com> |
[IPV4]: Add 'rtable' field in struct sk_buff to alias 'dst' and avoid casts (Anonymous) unions can help us to avoid ugly casts. A common cast it the (struct rtable *)skb->dst one. Defining an union like : union { struct dst_entry *dst; struct rtable *rtable; }; permits to use skb->rtable in place. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1937504d |
|
28-Feb-2008 |
Denis V. Lunev <den@openvz.org> |
[NETNS]: Enable all routing manipulation via netlink inside namespace. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
73b38711 |
|
28-Feb-2008 |
Denis V. Lunev <den@openvz.org> |
[NETNS]: Register /proc/net/rt_cache for each namespace. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a75e936f |
|
28-Feb-2008 |
Denis V. Lunev <den@openvz.org> |
[NETNS]: Process /proc/net/rt_cache inside a namespace. Show routing cache for a particular namespace only. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
642d6318 |
|
28-Feb-2008 |
Denis V. Lunev <den@openvz.org> |
[IPV4]: rt_cache_get_next should take rt_genid into account. In the other case /proc/net/rt_cache will look inconsistent in respect to genid. Signed-off-by: Denis V. Lunev <den@openvz.org> Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
317805b8 |
|
28-Feb-2008 |
Denis V. Lunev <den@openvz.org> |
[NETNS]: Process ip_rt_redirect in the correct namespace. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
77020720 |
|
28-Feb-2008 |
Wang Chen <wangchen@cn.fujitsu.com> |
[IPV4]: Use proc_create() to setup ->proc_fops first Use proc_create() to make sure that ->proc_fops be setup before gluing PDE to main tree. Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4136cd52 |
|
07-Feb-2008 |
Patrick McHardy <kaber@trash.net> |
[IPV4]: route: fix crash ip_route_input ip_route_me_harder() may call ip_route_input() with skbs that don't have skb->dev set for skbs rerouted in LOCAL_OUT and TCP resets generated by the REJECT target, resulting in a crash when dereferencing skb->dev->nd_net. Since ip_route_input() has an input device argument, it seems correct to use that one anyway. Bug introduced in b5921910a1 (Routing cache virtualization). Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
29e75252 |
|
31-Jan-2008 |
Eric Dumazet <dada1@cosmosbay.com> |
[IPV4] route cache: Introduce rt_genid for smooth cache invalidation Current ip route cache implementation is not suited to large caches. We can consume a lot of CPU when cache must be invalidated, since we currently need to evict all cache entries, and this eviction is sometimes asynchronous. min_delay & max_delay can somewhat control this asynchronism behavior, but whole thing is a kludge, regularly triggering infamous soft lockup messages. When entries are still in use, this also consumes a lot of ram, filling dst_garbage.list. A better scheme is to use a generation identifier on each entry, so that cache invalidation can be performed by changing the table identifier, without having to scan all entries. No more delayed flushing, no more stalling when secret_interval expires. Invalidated entries will then be freed at GC time (controled by ip_rt_gc_timeout or stress), or when an invalidated entry is found in a chain when an insert is done. Thus we keep a normal equilibrium. This patch : - renames rt_hash_rnd to rt_genid (and makes it an atomic_t) - Adds a new rt_genid field to 'struct rtable' (filling a hole on 64bit) - Checks entry->rt_genid at appropriate places :
|
#
e2422970 |
|
30-Jan-2008 |
Eric Dumazet <dada1@cosmosbay.com> |
[NET]: should explicitely initialize atomic_t field in struct dst_ops All but one struct dst_ops static initializations miss explicit initialization of entries field. As this field is atomic_t, we should use ATOMIC_INIT(0), and not rely on atomic_t implementation. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b5921910 |
|
23-Jan-2008 |
Denis V. Lunev <den@openvz.org> |
[NETNS]: Routing cache virtualization. Basically, this piece looks relatively easy. Namespace is already available on the dst entry via device and the device is safe to dereferrence. Compare it with one of a searcher and skip entry if appropriate. The only exception is ip_rt_frag_needed. So, add namespace parameter to it. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f206351a |
|
22-Jan-2008 |
Denis V. Lunev <den@openvz.org> |
[NETNS]: Add namespace parameter to ip_route_output_key. Needed to propagate it down to the ip_route_output_flow. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f1b050bf |
|
22-Jan-2008 |
Denis V. Lunev <den@openvz.org> |
[NETNS]: Add namespace parameter to ip_route_output_flow. Needed to propagate it down to the __ip_route_output_key. Signed_off_by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
611c183e |
|
22-Jan-2008 |
Denis V. Lunev <den@openvz.org> |
[NETNS]: Add namespace parameter to __ip_route_output_key. This is only required to propagate it down to the ip_route_output_slow. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b40afd0e |
|
22-Jan-2008 |
Denis V. Lunev <den@openvz.org> |
[NETNS]: Add namespace parameter to ip_route_output_slow. This function needs a net namespace to lookup devices, fib tables, etc. in, so pass it there. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1ab35276 |
|
22-Jan-2008 |
Denis V. Lunev <den@openvz.org> |
[NETNS]: Add namespace parameter to ip_dev_find. in_dev_find() need a namespace to pass it to fib_get_table(), so add an argument. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
010278ec |
|
22-Jan-2008 |
Denis V. Lunev <den@openvz.org> |
[NETNS]: Add netns parameter to fib_select_default. Currently fib_select_default calls fib_get_table() with the init_net. Prepare it to provide a correct namespace to lookup default route. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ecfdc8c5 |
|
21-Jan-2008 |
Denis V. Lunev <den@openvz.org> |
[NETNS]: Pass correct namespace in ip_rt_get_source. ip_rt_get_source is the infamous place for which dst_ifdown kludges have been implemented. This means that rt->u.dst.dev can be safely dereferrenced obtain nd_net. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
84a885f4 |
|
21-Jan-2008 |
Denis V. Lunev <den@openvz.org> |
[NETNS]: Pass correct namespace in ip_route_input_slow. The packet on the input path always has a referrence to an input network device it is passed from. Extract network namespace from it. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
da0e28cb |
|
21-Jan-2008 |
Denis V. Lunev <den@openvz.org> |
[NETNS]: Add netns parameter to fib_lookup. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1e637c74 |
|
21-Jan-2008 |
Jan Engelhardt <jengelh@computergmbh.de> |
[IPV4]: Enable use of 240/4 address space. This short patch modifies the IPv4 networking to enable use of the 240.0.0.0/4 (aka "class-E") address space as propsed in the internet draft draft-fuller-240space-00.txt. Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
569d3645 |
|
18-Jan-2008 |
Daniel Lezcano <dlezcano@fr.ibm.com> |
[NETNS][DST] dst: pass the dst_ops as parameter to the gc functions The garbage collection function receive the dst_ops structure as parameter. This is useful for the next incoming patchset because it will need the dst_ops (there will be several instances) and the network namespace pointer (contained in the dst_ops). The protocols which do not take care of the namespaces will not be impacted by this change (expect for the function signature), they do just ignore the parameter. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6b175b26 |
|
10-Jan-2008 |
Eric W. Biederman <ebiederm@xmission.com> |
[NETNS]: Add netns parameter to inet_(dev_)add_type. The patch extends the inet_addr_type and inet_dev_addr_type with the network namespace pointer. That allows to access the different tables relatively to the network namespace. The modification of the signature function is reported in all the callers of the inet_addr_type using the pointer to the well known init_net. Acked-by: Benjamin Thery <benjamin.thery@bull.net> Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cb7928a5 |
|
09-Jan-2008 |
Rami Rosen <ramirose@gmail.com> |
[IPV4]: Remove unsupported DNAT (RTCF_NAT and RTCF_NAT) in IPV4 - The DNAT (Destination NAT) is not implemented in IPV4. - This patch remove the code which checks these flags in net/ipv4/arp.c and net/ipv4/route.c. The RTCF_NAT and RTCF_NAT should stay in the header (linux/in_route.h) because they are used in DECnet. Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b790cedd |
|
21-Dec-2007 |
Eric Dumazet <dada1@cosmosbay.com> |
[INET]: Avoid an integer divide in rt_garbage_collect() Since 'goal' is a signed int, compiler may emit an integer divide to compute goal/2. Using a right shift is OK here and less expensive. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f97c1e0c |
|
16-Dec-2007 |
Joe Perches <joe@perches.com> |
[IPV4] net/ipv4: Use ipv4_is_<type> Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
586f1211 |
|
16-Dec-2007 |
Pavel Emelyanov <xemul@openvz.org> |
[IPV4]: Switch users of ipv4_devconf(_all) to use the pernet one These are scattered over the code, but almost all the "critical" places already have the proper struct net at hand except for snmp proc showing function and routing rtnl handler. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bb72845e |
|
12-Dec-2007 |
Herbert Xu <herbert@gondor.apana.org.au> |
[IPSEC]: Make callers of xfrm_lookup to use XFRM_LOOKUP_WAIT This patch converts all callers of xfrm_lookup that used an explicit value of 1 to indiciate blocking to use the new flag XFRM_LOOKUP_WAIT. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5a3e55d6 |
|
07-Dec-2007 |
Denis V. Lunev <den@openvz.org> |
[NET]: Multiple namespaces in the all dst_ifdown routines. Move dst entries to a namespace loopback to catch refcounting leaks. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1ff1cc20 |
|
05-Dec-2007 |
Pavel Emelyanov <xemul@openvz.org> |
[IPV4] ROUTE: Convert rt_hash_lock_init() macro into function There's no need in having this function exist in a form of macro. Properly formatted function looks much better. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
107f1634 |
|
05-Dec-2007 |
Pavel Emelyanov <xemul@openvz.org> |
[IPV4] ROUTE: Clean up proc files creation. The rt_cache, stats/rt_cache and rt_acct(optional) files creation looks a bit messy. Clean this out and join them to other proc-related functions under the proper ifdef. The struct net * argument in a new function will help net namespaces patches look nicer. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
78c686e9 |
|
05-Dec-2007 |
Pavel Emelyanov <xemul@openvz.org> |
[IPV4] ROUTE: Collect proc-related functions together The net/ipv4/route.c file declares some entries for proc to dump some routing info. The reading functions are scattered over this file - collect them together. Besides, remove a useless IP_RT_ACCT_CPU macro. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
beb659bd |
|
19-Nov-2007 |
Eric Dumazet <dada1@cosmosbay.com> |
[PATCH] IPV4 : Move ip route cache flush (secret_rebuild) from softirq to workqueue Every 600 seconds (ip_rt_secret_interval), a softirq flush of the whole ip route cache is triggered. On loaded machines, this can starve softirq for many seconds and can eventually crash. This patch moves this flush to a workqueue context, using the worker we intoduced in commit 39c90ece7565f5c47110c2fa77409d7a9478bd5b (IPV4: Convert rt_check_expire() from softirq processing to workqueue.) Also, immediate flushes (echo 0 >/proc/sys/net/ipv4/route/flush) are using rt_do_flush() helper function, wich take attention to rescheduling. Next step will be to handle delayed flushes ("echo -1 >/proc/sys/net/ipv4/route/flush" or "ip route flush cache") Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
97c53cac |
|
19-Nov-2007 |
Denis V. Lunev <den@openvz.org> |
[NET]: Make rtnetlink infrastructure network namespace aware (v3) After this patch none of the netlink callback support anything except the initial network namespace but the rtnetlink infrastructure now handles multiple network namespaces. Changes from v2: - IPv6 addrlabel processing Changes from v1: - no need for special rtnl_unlock handling - fixed IPv6 ndisc Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b854272b |
|
30-Nov-2007 |
Denis V. Lunev <den@openvz.org> |
[NET]: Modify all rtnetlink methods to only work in the initial namespace (v2) Before I can enable rtnetlink to work in all network namespaces I need to be certain that something won't break. So this patch deliberately disables all of the rtnletlink methods in everything except the initial network namespace. After the methods have been audited this extra check can be disabled. Changes from v1: - added IPv6 addrlabel protection Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
#
8dbde28d |
|
16-Nov-2007 |
Eric Dumazet <dada1@cosmosbay.com> |
[NET]: NET_CLS_ROUTE : convert ip_rt_acct to per_cpu variables ip_rt_acct needs 4096 bytes per cpu to perform some accounting. It is actually allocated as a single huge array [4096*NR_CPUS] (rounded up to a power of two) Converting it to a per cpu variable is wanted to : - Save space on machines were num_possible_cpus() < NR_CPUS - Better NUMA placement (each cpu gets memory on its node) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
862b82c6 |
|
13-Nov-2007 |
Herbert Xu <herbert@gondor.apana.org.au> |
[IPSEC]: Merge most of the output path As part of the work on asynchrnous cryptographic operations, we need to be able to resume from the spot where they occur. As such, it helps if we isolate them to one spot. This patch moves most of the remaining family-specific processing into the common output code. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
352e512c |
|
13-Nov-2007 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Eliminate duplicate copies of dst_discard We have a number of copies of dst_discard scattered around the place which all do the same thing, namely free a packet on the input or output paths. This patch deletes all of them except dst_discard and points all the users to it. The only non-trivial bit is decnet where it returns an error. However, conceptually this is identical to the blackhole functions used in IPv4 and IPv6 which do not return errors. So they should either all return errors or all return zero. For now I've stuck with the majority and picked zero as the return value. It doesn't really matter in practice since few if any driver would react differently depending on a zero return value or NET_RX_DROP. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b24b8a24 |
|
23-Jan-2008 |
Pavel Emelyanov <xemul@openvz.org> |
[NET]: Convert init_timer into setup_timer Many-many code in the kernel initialized the timer->function and timer->data together with calling init_timer(timer). There is already a helper for this. Use it for networking code. The patch is HUGE, but makes the code 130 lines shorter (98 insertions(+), 228 deletions(-)). Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0bcceadc |
|
10-Jan-2008 |
Eric Dumazet <dada1@cosmosbay.com> |
[IPV4] ROUTE: fix rcu_dereference() uses in /proc/net/rt_cache In rt_cache_get_next(), no need to guard seq->private by a rcu_dereference() since seq is private to the thread running this function. Reading seq.private once (as guaranted bu rcu_dereference()) or several time if compiler really is dumb enough wont change the result. But we miss real spots where rcu_dereference() are needed, both in rt_cache_get_first() and rt_cache_get_next() Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d8c92830 |
|
07-Jan-2008 |
Eric Dumazet <dada1@cosmosbay.com> |
[IPV4] ROUTE: ip_rt_dump() is unecessary slow I noticed "ip route list cache x.y.z.t" can be *very* slow. While strace-ing -T it I also noticed that first part of route cache is fetched quite fast : recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202 GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3772 <0.000047> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\234\0\0\0\30\0\2\0\254i\ 202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3736 <0.000042> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\ 202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3740 <0.000055> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\234\0\0\0\30\0\2\0\254i\ 202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3712 <0.000043> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\ 202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3732 <0.000053> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202 GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3708 <0.000052> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202 GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3680 <0.000041> while the part at the end of the table is more expensive: recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3656 <0.003857> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3772 <0.003891> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3712 <0.003765> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3700 <0.003879> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3676 <0.003797> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3724 <0.003856> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\234\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3736 <0.003848> The following patch corrects this performance/latency problem, removing quadratic behavior. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
56c99d04 |
|
06-Dec-2007 |
Denis V. Lunev <den@openvz.org> |
[IPV4]: Remove prototype of ip_rt_advice ip_rt_advice has been gone, so no need to keep prototype and debug message. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7f53878d |
|
07-Dec-2007 |
Mitsuru Chinen <mitch@linux.vnet.ibm.com> |
[IPv4]: Reply net unreachable ICMP message IPv4 stack doesn't reply any ICMP destination unreachable message with net unreachable code when IP detagrams are being discarded because of no route could be found in the forwarding path. Incidentally, IPv6 stack replies such ICMPv6 message in the similar situation. Signed-off-by: Mitsuru Chinen <mitch@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
483b23ff |
|
16-Nov-2007 |
Eric Dumazet <dada1@cosmosbay.com> |
[NET]: Corrects a bug in ip_rt_acct_read() It seems that stats of cpu 0 are counted twice, since for_each_possible_cpu() is looping on all possible cpus, including 0 Before percpu conversion of ip_rt_acct, we should also remove the assumption that CPU 0 is online (or even possible) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d90bf5a9 |
|
14-Nov-2007 |
Eric Dumazet <dada1@cosmosbay.com> |
[NET]: rt_check_expire() can take a long time, add a cond_resched() On commit 39c90ece7565f5c47110c2fa77409d7a9478bd5b: [IPV4]: Convert rt_check_expire() from softirq processing to workqueue. we converted rt_check_expire() from softirq to workqueue, allowing the function to perform all work it was supposed to do. When the IP route cache is big, rt_check_expire() can take a long time to run. (default settings : 20% of the hash table is scanned at each invocation) Adding cond_resched() helps giving cpu to higher priority tasks if necessary. Using a "if (need_resched())" test before calling "cond_resched();" is necessary to avoid spending too much time doing the resched check. (My tests gave a time reduction from 88 ms to 25 ms per rt_check_expire() run on my i686 test machine) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
03f49f34 |
|
10-Nov-2007 |
Pavel Emelyanov <xemul@openvz.org> |
[NET]: Make helper to get dst entry and "use" it There are many places that get the dst entry, increase the __use counter and set the "lastuse" time stamp. Make a helper for this. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b1667609 |
|
10-Nov-2007 |
Pavel Emelyanov <xemul@openvz.org> |
[IPV4]: Remove bugus goto-s from ip_route_input_slow Both places look like if (err == XXX) goto yyy; done: while both yyy targets look like err = XXX; goto done; so this is ok to remove the above if-s. yyy labels are used in other places and are not removed. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cf7732e4 |
|
10-Oct-2007 |
Pavel Emelyanov <xemul@openvz.org> |
[NET]: Make core networking code use seq_open_private This concerns the ipv4 and ipv6 code mostly, but also the netlink and unix sockets. The netlink code is an example of how to use the __seq_open_private() call - it saves the net namespace on this private. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cfcabdcc |
|
09-Oct-2007 |
Stephen Hemminger <shemminger@linux-foundation.org> |
[NET]: sparse warning fixes Fix a bunch of sparse warnings. Mostly about 0 used as NULL pointer, and shadowed variable declarations. One notable case was that hash size should have been unsigned. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2774c7ab |
|
26-Sep-2007 |
Eric W. Biederman <ebiederm@xmission.com> |
[NET]: Make the loopback device per network namespace. This patch makes loopback_dev per network namespace. Adding code to create a different loopback device for each network namespace and adding the code to free a loopback device when a network namespace exits. This patch modifies all users the loopback_dev so they access it as init_net.loopback_dev, keeping all of the code compiling and working. A later pass will be needed to update the users to use something other than the initial network namespace. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
de3cb747 |
|
25-Sep-2007 |
Daniel Lezcano <dlezcano@fr.ibm.com> |
[NET]: Dynamically allocate the loopback device, part 1. This patch replaces all occurences to the static variable loopback_dev to a pointer loopback_dev. That provides the mindless, trivial, uninteressting change part for the dynamic allocation for the loopback. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Acked-By: Kirill Korotaev <dev@sw.ru> Acked-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
39c90ece |
|
15-Sep-2007 |
Eric Dumazet <dada1@cosmosbay.com> |
[IPV4]: Convert rt_check_expire() from softirq processing to workqueue. On loaded/big hosts, rt_check_expire() if of litle use, because it generally breaks out of its main loop because of a jiffies change. It can take a long time (read : timer invocations) to actually scan the whole hash table, freeing unused entries. Converting it to use a workqueue instead of softirq is a nice move because we can allow rt_check_expire() to do the scan it is supposed to do, without hogging the CPU. This has an impact on the average number of entries in cache, reducing ram usage. Cache is more responsive to parameter changes (/proc/sys/net/ipv4/route/gc_timeout and /proc/sys/net/ipv4/route/gc_interval) Note: Maybe the default value of gc_interval (60 seconds) is too high, since this means we actually need 5 (300/60) invocations to scan the whole table. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
881d966b |
|
17-Sep-2007 |
Eric W. Biederman <ebiederm@xmission.com> |
[NET]: Make the device list and device lookups per namespace. This patch makes most of the generic device layer network namespace safe. This patch makes dev_base_head a network namespace variable, and then it picks up a few associated variables. The functions: dev_getbyhwaddr dev_getfirsthwbytype dev_get_by_flags dev_get_by_name __dev_get_by_name dev_get_by_index __dev_get_by_index dev_ioctl dev_ethtool dev_load wireless_process_ioctl were modified to take a network namespace argument, and deal with it. vlan_ioctl_set and brioctl_set were modified so their hooks will receive a network namespace argument. So basically anthing in the core of the network stack that was affected to by the change of dev_base was modified to handle multiple network namespaces. The rest of the network stack was simply modified to explicitly use &init_net the initial network namespace. This can be fixed when those components of the network stack are modified to handle multiple network namespaces. For now the ifindex generator is left global. Fundametally ifindex numbers are per namespace, or else we will have corner case problems with migration when we get that far. At the same time there are assumptions in the network stack that the ifindex of a network device won't change. Making the ifindex number global seems a good compromise until the network stack can cope with ifindex changes when you change namespaces, and the like. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
457c4cbc |
|
11-Sep-2007 |
Eric W. Biederman <ebiederm@xmission.com> |
[NET]: Make /proc/net per network namespace This patch makes /proc/net per network namespace. It modifies the global variables proc_net and proc_net_stat to be per network namespace. The proc_net file helpers are modified to take a network namespace argument, and all of their callers are fixed to pass &init_net for that argument. This ensures that all of the /proc/net files are only visible and usable in the initial network namespace until the code behind them has been updated to be handle multiple network namespaces. Making /proc/net per namespace is necessary as at least some files in /proc/net depend upon the set of network devices which is per network namespace, and even more files in /proc/net have contents that are relevant to a single network namespace. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1bcabbdb |
|
01-Aug-2007 |
Mariusz Kozlowski <m.kozlowski@tuxland.pl> |
[IPV4] route.c: mostly kmalloc + memset conversion to k[cz]alloc Signed-off-by: Mariusz Kozlowski <m.kozlowski@tuxland.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
20c2df83 |
|
19-Jul-2007 |
Paul Mundt <lethal@linux-sh.org> |
mm: Remove slab destructors from kmem_cache_create(). Slab destructors were no longer supported after Christoph's c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been BUGs for both slab and slub, and slob never supported them either. This rips out support for the dtor pointer from kmem_cache_create() completely and fixes up every single callsite in the kernel (there were about 224, not including the slab allocator definitions themselves, or the documentation references). Signed-off-by: Paul Mundt <lethal@linux-sh.org>
|
#
4839c52b |
|
09-Jul-2007 |
Philippe De Muyter <phdm@macqel.be> |
[IPV4]: Make ip_tos2prio const. Signed-off-by: Philippe De Muyter <phdm@macqel.be> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e06e7c61 |
|
10-Jun-2007 |
David S. Miller <davem@sunset.davemloft.net> |
[IPV4]: The scheduled removal of multipath cached routing support. With help from Chris Wedgwood. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
42f811b8 |
|
05-Jun-2007 |
Herbert Xu <herbert@gondor.apana.org.au> |
[IPV4]: Convert IPv4 devconf to an array This patch converts the ipv4_devconf config members (everything except sysctl) to an array. This allows easier manipulation which will be needed later on to provide better management of default config values. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
14e50e57 |
|
24-May-2007 |
David S. Miller <davem@sunset.davemloft.net> |
[XFRM]: Allow packet drops during larval state resolution. The current IPSEC rule resolution behavior we have does not work for a lot of people, even though technically it's an improvement from the -EAGAIN buisness we had before. Right now we'll block until the key manager resolves the route. That works for simple cases, but many folks would rather packets get silently dropped until the key manager resolves the IPSEC rules. We can't tell these folks to "set the socket non-blocking" because they don't have control over the non-block setting of things like the sockets used to resolve DNS deep inside of the resolver libraries in libc. With that in mind I coded up the patch below with some help from Herbert Xu which provides packet-drop behavior during larval state resolution, controllable via sysctl and off by default. This lays the framework to either: 1) Make this default at some point or... 2) Move this logic into xfrm{4,6}_policy.c and implement the ARP-like resolution queue we've all been dreaming of. The idea would be to queue packets to the policy, then once the larval state is resolved by the key manager we re-resolve the route and push the packets out. The packets would timeout if the rule didn't get resolved in a certain amount of time. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f6c5d736 |
|
18-May-2007 |
David S. Miller <davem@sunset.davemloft.net> |
[IPV4]: Remove IPVS icmp hack from route.c for now. Revert: 2d771cd86d4c3af26f34a7bcdc1b87696824cad9 This is dangerous if enabled and a better solution to the problem is being worked on. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2d771cd8 |
|
26-Mar-2007 |
Janusz Krzysztofik <jkrzyszt@tis.icnet.pl> |
[IPV4] LVS: Allow to send ICMP unreachable responses when real-servers are removed this is a small patch by Janusz Krzysztofik to ip_route_output_slow() that allows VIP-less LVS linux director to generate packets originating >From VIP if sysctl_ip_nonlocal_bind is set. In a nutshell, the intention is for an LVS linux director to be able to send ICMP unreachable responses to end-users when real-servers are removed. http://archive.linuxvirtualserver.org/html/lvs-users/2007-01/msg00106.html Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
63f3444f |
|
22-Mar-2007 |
Thomas Graf <tgraf@suug.ch> |
[IPv4]: Use rtnl registration interface Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
eddc9ec5 |
|
20-Apr-2007 |
Arnaldo Carvalho de Melo <acme@redhat.com> |
[SK_BUFF]: Introduce ip_hdr(), remove skb->nh.iph Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f690808e |
|
12-Mar-2007 |
Stephen Hemminger <shemminger@linux-foundation.org> |
[NET]: make seq_operations const The seq_file operations stuff can be marked constant to get it out of dirty cache. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c1d2bbe1 |
|
10-Apr-2007 |
Arnaldo Carvalho de Melo <acme@redhat.com> |
[SK_BUFF]: Introduce skb_reset_network_header(skb) For the common, open coded 'skb->nh.raw = skb->data' operation, so that we can later turn skb->nh.raw into a offset, reducing the size of struct sk_buff in 64bit land while possibly keeping it as a pointer on 32bit. This one touches just the most simple case, next will handle the slightly more "complex" cases. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
98e399f8 |
|
19-Mar-2007 |
Arnaldo Carvalho de Melo <acme@redhat.com> |
[SK_BUFF]: Introduce skb_mac_header() For the places where we need a pointer to the mac header, it is still legal to touch skb->mac.raw directly if just adding to, subtracting from or setting it to another layer header. This one also converts some more cases to skb_reset_mac_header() that my regex missed as it had no spaces before nor after '=', ugh. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
459a98ed |
|
19-Mar-2007 |
Arnaldo Carvalho de Melo <acme@redhat.com> |
[SK_BUFF]: Introduce skb_reset_mac_header(skb) For the common, open coded 'skb->mac.raw = skb->data' operation, so that we can later turn skb->mac.raw into a offset, reducing the size of struct sk_buff in 64bit land while possibly keeping it as a pointer on 32bit. This one touches just the most simple case, next will handle the slightly more "complex" cases. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9d729f72 |
|
04-Mar-2007 |
James Morris <jmorris@namei.org> |
[NET]: Convert xtime.tv_sec to get_seconds() Where appropriate, convert references to xtime.tv_sec to the get_seconds() helper function. Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cd354f1a |
|
14-Feb-2007 |
Tim Schmielau <tim@physik3.uni-rostock.de> |
[PATCH] remove many unneeded #includes of sched.h After Al Viro (finally) succeeded in removing the sched.h #include in module.h recently, it makes sense again to remove other superfluous sched.h includes. There are quite a lot of files which include it but don't actually need anything defined in there. Presumably these includes were once needed for macros that used to live in sched.h, but moved to other header files in the course of cleaning it up. To ease the pain, this time I did not fiddle with any header files and only removed #includes from .c-files, which tend to cause less trouble. Compile tested against 2.6.20-rc2 and 2.6.20-rc2-mm2 (with offsets) on alpha, arm, i386, ia64, mips, powerpc, and x86_64 with allnoconfig, defconfig, allmodconfig, and allyesconfig as well as a few randconfigs on x86_64 and all configs in arch/arm/configs on arm. I also checked that no new warnings were introduced by the patch (actually, some warnings are removed that were emitted by unnecessarily included header files). Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9a32144e |
|
12-Feb-2007 |
Arjan van de Ven <arjan@linux.intel.com> |
[PATCH] mark struct file_operations const 7 Many struct file_operations in the kernel can be "const". Marking them const moves these to the .rodata section, which avoids false sharing with potential dirty data. In addition it'll catch accidental writes at compile time to these shared resources. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
093c2ca4 |
|
09-Feb-2007 |
Eric Dumazet <dada1@cosmosbay.com> |
[IPV4]: Convert ipv4 route to use the new dst_entry 'next' pointer This patch removes the rt_next pointer from 'struct rtable.u' union, and renames u.rt_next to u.dst_rt_next. It also moves 'struct flowi' right after 'struct dst_entry' to prepare the gain on lookups. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e905a9ed |
|
09-Feb-2007 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[NET] IPV4: Fix whitespace errors. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
26932566 |
|
01-Feb-2007 |
Patrick McHardy <kaber@trash.net> |
[NETLINK]: Don't BUG on undersized allocations Currently netlink users BUG when the allocated skb for an event notification is undersized. While this is certainly a kernel bug, its not critical and crashing the kernel is too drastic, especially when considering that these errors have appeared multiple times in the past and it BUGs even if no listeners are present. This patch replaces BUG by WARN_ON and changes the notification functions to inform potential listeners of undersized allocations using a unique error code (EMSGSIZE). Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
14fb8a76 |
|
18-Dec-2006 |
Li Yewang <lyw@nanjing-fnst.com> |
[IPV4]: Fix BUG of ip_rt_send_redirect() Fix the redirect packet of the router if the jiffies wraparound. Signed-off-by: Li Yewang <lyw@nanjing-fnst.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1f29bcd7 |
|
10-Dec-2006 |
Alexey Dobriyan <adobriyan@gmail.com> |
[PATCH] sysctl: remove unused "context" param Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andi Kleen <ak@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: David Howells <dhowells@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
1b6651f1 |
|
04-Dec-2006 |
Patrick McHardy <kaber@trash.net> |
[XFRM]: Use output device disable_xfrm for forwarded packets Currently the behaviour of disable_xfrm is inconsistent between locally generated and forwarded packets. For locally generated packets disable_xfrm disables the policy lookup if it is set on the output device, for forwarded traffic however it looks at the input device. This makes it impossible to disable xfrm on all devices but a dummy device and use normal routing to direct traffic to that device. Always use the output device when checking disable_xfrm. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e3703b3d |
|
27-Nov-2006 |
Thomas Graf <tgraf@suug.ch> |
[RTNETLINK]: Add rtnl_put_cacheinfo() to unify some code IPv4, IPv6, and DECNet all use struct rta_cacheinfo in a similiar way, therefore rtnl_put_cacheinfo() is added to reuse code. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
714e85be |
|
14-Nov-2006 |
Al Viro <viro@zeniv.linux.org.uk> |
[IPV6]: Assorted trivial endianness annotations. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
47dcf0cb |
|
09-Nov-2006 |
Thomas Graf <tgraf@suug.ch> |
[NET]: Rethink mark field in struct flowi Now that all protocols have been made aware of the mark field it can be moved out of the union thus simplyfing its usage. The config options in the IPv4/IPv6/DECnet subsystems to enable respectively disable mark based routing only obfuscate the code with ifdefs, the cost for the additional comparison in the flow key is insignificant, and most distributions have all these options enabled by default anyway. Therefore it makes sense to remove the config options and enable mark based routing by default. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
82e91ffe |
|
09-Nov-2006 |
Thomas Graf <tgraf@suug.ch> |
[NET]: Turn nfmark into generic mark nfmark is being used in various subsystems and has become the defacto mark field for all kinds of packets. Therefore it makes sense to rename it to `mark' and remove the dependency on CONFIG_NETFILTER. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8238b218 |
|
12-Oct-2006 |
David S. Miller <davem@sunset.davemloft.net> |
[NET]: Do not memcmp() over pad bytes of struct flowi. They are not necessarily initialized to zero by the compiler, for example when using run-time initializers of automatic on-stack variables. Noticed by Eric Dumazet and Patrick McHardy. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
17fb2c64 |
|
26-Sep-2006 |
Al Viro <viro@zeniv.linux.org.uk> |
[IPV4]: RTA_{DST,SRC,GATEWAY,PREFSRC} annotated these are passed net-endian; use be32 netlink accessors Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e448515c |
|
26-Sep-2006 |
Al Viro <viro@zeniv.linux.org.uk> |
[IPV4] net/ipv4/route.c: trivial endianness annotations Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d9c9df8c |
|
26-Sep-2006 |
Al Viro <viro@zeniv.linux.org.uk> |
[IPV4]: fib_validate_source() annotations annotated arguments and inferred net-endian variables in callers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a61ced5d |
|
26-Sep-2006 |
Al Viro <viro@zeniv.linux.org.uk> |
[IPV4]: inet_select_addr() annotations argument and return value are net-endian. Annotated function and inferred net-endian variables in callers. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8c7bc840 |
|
26-Sep-2006 |
Al Viro <viro@zeniv.linux.org.uk> |
[IPV4]: annotate rt_hash_code() users Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f7655229 |
|
26-Sep-2006 |
Al Viro <viro@zeniv.linux.org.uk> |
[IPV4]: ip_rt_redirect() annotations The first 4 arguments of ip_rt_redirect() are net-endian. Annotated. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9e12bb22 |
|
26-Sep-2006 |
Al Viro <viro@zeniv.linux.org.uk> |
[IPV4]: ip_route_input() annotations ip_route_input() takes net-endian source and destination address. * Annotated as such. * arguments of its invocations annotated where needed. * local helpers getting the same values passed to by it (ip_route_input_mc(), ip_route_input_slow(), ip_handle_martian_source(), ip_mkroute_input(), ip_mkroute_input_def(), __mkroute_input()) annotated Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e5d679f33 |
|
26-Aug-2006 |
Alexey Dobriyan <adobriyan@gmail.com> |
[NET]: Use SLAB_PANIC Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d889ce3b |
|
17-Aug-2006 |
Thomas Graf <tgraf@suug.ch> |
[IPv4]: Convert route get to new netlink api Fixes various unvalidated netlink attributes causing memory corruptions when left empty by userspace applications. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
be403ea1 |
|
17-Aug-2006 |
Thomas Graf <tgraf@suug.ch> |
[IPv4]: Convert FIB dumping to use new netlink api Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2942e900 |
|
15-Aug-2006 |
Thomas Graf <tgraf@suug.ch> |
[RTNETLINK]: Use rtnl_unicast() for rtnetlink unicasts Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9e762a4a |
|
11-Aug-2006 |
Patrick McHardy <kaber@trash.net> |
[NET]: Introduce RTA_TABLE/FRA_TABLE attributes Introduce RTA_TABLE route attribute and FRA_TABLE routing rule attribute to hold 32 bit routing table IDs. Usespace compatibility is provided by continuing to accept and send the rtm_table field, but because of its limited size it can only carry the low 8 bits of the table ID. This implies that if larger IDs are used, _all_ userspace programs using them need to use RTA_TABLE. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8d1502de |
|
07-Aug-2006 |
Kirill Korotaev <dev@sw.ru> |
[IPV4]: Limit rt cache size properly. From: Kirill Korotaev <dev@sw.ru> During OpenVZ stress testing we found that UDP traffic with random src can generate too much excessive rt hash growing leading finally to OOM and kernel panics. It was found that for 4GB i686 system (having 1048576 total pages and 225280 normal zone pages) kernel allocates the following route hash: syslog: IP route cache hash table entries: 262144 (order: 8, 1048576 bytes) => ip_rt_max_size = 4194304 entries, i.e. max rt size is 4194304 * 256b = 1Gb of RAM > normal_zone Attached the patch which removes HASH_HIGHMEM flag from alloc_large_system_hash() call. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8d71740c |
|
30-Jul-2006 |
Tom Tucker <tom@opengridcomputing.com> |
[NET]: Core net changes to generate netevents Generate netevents for: - neighbour changes - routing redirects - pmtu changes Signed-off-by: Tom Tucker <tom@opengridcomputing.com> Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
62051200 |
|
03-Jul-2006 |
Ingo Molnar <mingo@elte.hu> |
[PATCH] lockdep: fix RT_HASH_LOCK_SZ On lockdep we have a quite big spinlock_t, so keep the size down. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
8a25d5de |
|
03-Jul-2006 |
Ingo Molnar <mingo@elte.hu> |
[PATCH] lockdep: prove spinlock rwlock locking correctness Use the lock validator framework to prove spinlock and rwlock locking correctness. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
6ab3d562 |
|
30-Jun-2006 |
Jörn Engel <joern@wohnheim.fh-wedel.de> |
Remove obsolete #include <linux/config.h> Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
#
bfe5d834 |
|
25-Jun-2006 |
Paul Mackerras <paulus@samba.org> |
[PATCH] Define __raw_get_cpu_var and use it There are several instances of per_cpu(foo, raw_smp_processor_id()), which is semantically equivalent to __get_cpu_var(foo) but without the warning that smp_processor_id() can give if CONFIG_DEBUG_PREEMPT is enabled. For those architectures with optimized per-cpu implementations, namely ia64, powerpc, s390, sparc64 and x86_64, per_cpu() turns into more and slower code than __get_cpu_var(), so it would be preferable to use __get_cpu_var on those platforms. This defines a __raw_get_cpu_var(x) macro which turns into per_cpu(x, raw_smp_processor_id()) on architectures that use the generic per-cpu implementation, and turns into __get_cpu_var(x) on the architectures that have an optimized per-cpu implementation. Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
d2c962b8 |
|
17-Apr-2006 |
Stephen Hemminger <shemminger@osdl.org> |
[IPV4]: ip_route_input panic fix This fixes http://bugzilla.kernel.org/show_bug.cgi?id=6388 The bug is caused by ip_route_input dereferencing skb->nh.protocol of the dummy skb passed dow from inet_rtm_getroute (Thanks Thomas for seeing it). It only happens if the route requested is for a multicast IP address. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6f912042 |
|
10-Apr-2006 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
[PATCH] for_each_possible_cpu: network codes for_each_cpu() actually iterates across all possible CPUs. We've had mistakes in the past where people were using for_each_cpu() where they should have been iterating across only online or present CPUs. This is inefficient and possibly buggy. We're renaming for_each_cpu() to for_each_possible_cpu() to avoid this in the future. This patch replaces for_each_cpu with for_each_possible_cpu under /net Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
cef2685e |
|
25-Mar-2006 |
Ilia Sotnikov <hostcc@gmail.com> |
[IPV4]: Aggregate route entries with different TOS values When we get an ICMP need-to-frag message, the original TOS value in the ICMP payload cannot be used as a key to look up the routes to update. This is because the TOS field may have been modified by routers on the way. Similarly, ip_rt_redirect should also ignore the TOS as the router that gave us the message may have modified the TOS value. The patch achieves this objective by aggregating entries with different TOS values (but are otherwise identical) into the same bucket. This makes it easy to update them at the same time when an ICMP message is received. In future we should use a twin-hashing scheme where teh aggregation occurs at the entry level. That is, the TOS goes back into the hash for normal lookups while ICMP lookups will end up with a node that gives us a list that contains all other route entries that differ only by TOS. Signed-off-by: Ilia Sotnikov <hostcc@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
85259878 |
|
21-Feb-2006 |
Suresh Bhogavilli <sbhogavilli@verisign.com> |
[IPV4]: Fix garbage collection of multipath route entries When garbage collecting route cache entries of multipath routes in rt_garbage_collect(), entries were deleted from the hash bucket 'i' while holding a spin lock on bucket 'k' resulting in a system hang. Delete entries, if any, from bucket 'k' instead. Signed-off-by: Suresh Bhogavilli <sbhogavilli@verisign.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dbd2915c |
|
17-Jan-2006 |
Andrew Morton <akpm@osdl.org> |
[IPV4]: RT_CACHE_STAT_INC() warning fix BUG: using smp_processor_id() in preemptible [00000001] code: rpc.statd/2408 And it _is_ a bug, but I guess we don't care enough to add preempt_disable(). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2f970d83 |
|
17-Jan-2006 |
Eric Dumazet <dada1@cosmosbay.com> |
[IPV4]: rt_cache_stat can be statically defined Using __get_cpu_var(obj) is slightly faster than per_cpu_ptr(obj, raw_smp_processor_id()). 1) Smaller code and memory use For static and small objects, DEFINE_PER_CPU(type, object) is preferred over a alloc_percpu() : Better and smaller code to access them, and no extra memory (storing the pointer, and the percpu array of pointers) x86_64 code before patch mov 1237577(%rip),%rax # ffffffff803e5990 <rt_cache_stat> not %rax # part of per_cpu machinery mov %gs:0x3c,%edx # get cpu number movslq %edx,%rdx # extend 32 bits cpu number to 64 bits mov (%rax,%rdx,8),%rax # get the pointer for this cpu incl 0x38(%rax) x86_64 code after patch mov $per_cpu__rt_cache_stat,%rdx mov %gs:0x48,%rax # get percpu data offset incl 0x38(%rax,%rdx,1) 2) False sharing avoidance for SMP : For a small NR_CPUS, the array of per cpu pointers allocated in alloc_percpu() can be <= 32 bytes. This let slab code gives a part of a cache line. If the other part of this 64 bytes (or 128 bytes) cache line is used by a mostly written object, we can have false sharing and expensive per_cpu_ptr() operations. Size of rt_cache_stat is 64 bytes, so this patch is not a danger of a too big increase of bss (in UP mode) or static per_cpu data for SMP (PERCPU_ENOUGH_ROOM is currently 32768 bytes) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9b5b5cff |
|
29-Nov-2005 |
Arjan van de Ven <arjan@infradead.org> |
[NET]: Add const markers to various variables. the patch below marks various variables const in net/; the goal is to move them to the .rodata section so that they can't false-share cachelines with things that get written to, as well as potentially helping gcc a bit with optimisations. (these were found using a gcc patch to warn about such variables) Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
18955cfc |
|
29-Nov-2005 |
Mike Stroyan <mike.stroyan@hp.com> |
[IPV4] tcp/route: Another look at hash table sizes The tcp_ehash hash table gets too big on systems with really big memory. It is worse on systems with pages larger than 4KB. It wastes memory that could be better used. It also makes the netstat command slow because reading /proc/net/tcp and /proc/net/tcp6 needs to go through the full hash table. The default value should not be larger for larger page sizes. It seems that the effect of page size is an unintended error dating back a long time. I also wonder if the default value really should be a larger fraction of memory for systems with more memory. While systems with really big ram can afford more space for hash tables, it is not clear to me that they benefit from increasing the allocation ratio for this table. The amount of memory allocated is determined by net/ipv4/tcp.c:tcp_init and mm/page_alloc.c:alloc_large_system_hash. tcp_init calls alloc_large_system_hash passing parameters- bucketsize=sizeof(struct tcp_ehash_bucket) numentries=thash_entries scale=(num_physpages >= 128 * 1024) ? (25-PAGE_SHIFT) : (27-PAGE_SHIFT) limit=0 On i386, PAGE_SHIFT is 12 for a page size of 4K On ia64, PAGE_SHIFT defaults to 14 for a page size of 16K The num_physpages test above makes the allocation take a larger fraction of the total memory on systems with larger memory. The threshold size for a i386 system is 512MB. For an ia64 system with 16KB pages the threshold is 2GB. For smaller memory systems- On i386, scale = (27 - 12) = 15 On ia64, scale = (27 - 14) = 13 For larger memory systems- On i386, scale = (25 - 12) = 13 On ia64, scale = (25 - 14) = 11 For the rest of this discussion, I'll just track the larger memory case. The default behavior has numentries=thash_entries=0, so the allocated size is determined by either scale or by the default limit of 1/16 of total memory. In alloc_large_system_hash- | numentries = (flags & HASH_HIGHMEM) ? nr_all_pages : nr_kernel_pages; | numentries += (1UL << (20 - PAGE_SHIFT)) - 1; | numentries >>= 20 - PAGE_SHIFT; | numentries <<= 20 - PAGE_SHIFT; At this point, numentries is pages for all of memory, rounded up to the nearest megabyte boundary. | /* limit to 1 bucket per 2^scale bytes of low memory */ | if (scale > PAGE_SHIFT) | numentries >>= (scale - PAGE_SHIFT); | else | numentries <<= (PAGE_SHIFT - scale); On i386, numentries >>= (13 - 12), so numentries is 1/8196 of bytes of total memory. On ia64, numentries <<= (14 - 11), so numentries is 1/2048 of bytes of total memory. | log2qty = long_log2(numentries); | | do { | size = bucketsize << log2qty; bucketsize is 16, so size is 16 times numentries, rounded down to a power of two. On i386, size is 1/512 of bytes of total memory. On ia64, size is 1/128 of bytes of total memory. For smaller systems the results are On i386, size is 1/2048 of bytes of total memory. On ia64, size is 1/512 of bytes of total memory. The large page effect can be removed by just replacing the use of PAGE_SHIFT with a constant of 12 in the calls to alloc_large_system_hash. That makes them more like the other uses of that function from fs/inode.c and fs/dcache.c Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e5ed6399 |
|
03-Oct-2005 |
Herbert Xu <herbert@gondor.apana.org.au> |
[IPV4]: Replace __in_dev_get with __in_dev_get_rcu/rtnl The following patch renames __in_dev_get() to __in_dev_get_rtnl() and introduces __in_dev_get_rcu() to cover the second case. 1) RCU with refcnt should use in_dev_get(). 2) RCU without refcnt should use __in_dev_get_rcu(). 3) All others must hold RTNL and use __in_dev_get_rtnl(). There is one exception in net/ipv4/route.c which is in fact a pre-existing race condition. I've marked it as such so that we remember to fix it. This patch is based on suggestions and prior work by Suzanne Wood and Paul McKenney. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ce723d8e |
|
08-Sep-2005 |
Julian Anastasov <ja@ssi.bg> |
[IPV4]: Fix refcount damaging in net/ipv4/route.c One such place that can damage the dst refcnts is route.c with CONFIG_IP_ROUTE_MULTIPATH_CACHED enabled, i don't see the user's .config. In this new code i see that rt_intern_hash is called before dst->refcnt is set to 1, dst is the 2nd arg to rt_intern_hash. Arg 2 of rt_intern_hash must come with refcnt 1 as it is added to table or dropped depending on error/add/update. One such example is ip_mkroute_input where __mkroute_input return rth with refcnt 0 which is provided to rt_intern_hash. ip_mkroute_output looks like a 2nd such place. Appending untested patch for comments and review. The idea is to put previous reference as we are going to return next result/error. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d8c97a94 |
|
09-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[NET]: Export symbols needed by the current DCCP code Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0742fd53 |
|
09-Aug-2005 |
Adrian Bunk <bunk@stusta.de> |
[IPV4]: possible cleanups This patch contains the following possible cleanups: - make needlessly global code static - #if 0 the following unused global function: - xfrm4_state.c: xfrm4_state_fini - remove the following unneeded EXPORT_SYMBOL's: - ip_output.c: ip_finish_output - ip_output.c: sysctl_ip_default_ttl - fib_frontend.c: ip_dev_find - inetpeer.c: inet_peer_idlock - ip_options.c: ip_options_compile - ip_options.c: ip_options_undo - net/core/request_sock.c: sysctl_max_syn_backlog Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0b7f22aa |
|
11-Jul-2005 |
Olaf Kirch <okir@suse.de> |
[IPV4]: Prevent oops when printing martian source In some cases, we may be generating packets with a source address that qualifies as martian. This can happen when we're in the middle of setting up the network, and netfilter decides to reject a packet with an RST. The IPv4 routing code would try to print a warning and oops, because locally generated packets do not have a valid skb->mac.raw pointer at this point. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bb1d23b0 |
|
05-Jul-2005 |
Eric Dumazet <dada1@cosmosbay.com> |
[IPV4]: Bug fix in rt_check_expire() - rt_check_expire() fixes (an overflow occured if size of the hash was >= 65536) reminder of the bugfix: The rt_check_expire() has a serious problem on machines with large route caches, and a standard HZ value of 1000. With default values, ie ip_rt_gc_interval = 60*HZ = 60000 ; the loop count : for (t = ip_rt_gc_interval << rt_hash_log; t >= 0; overflows (t is a 31 bit value) as soon rt_hash_log is >= 16 (65536 slots in route cache hash table). In this case, rt_check_expire() does nothing at all Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
424c4b70 |
|
05-Jul-2005 |
Eric Dumazet <dada1@cosmosbay.com> |
[IPV4]: Use the fancy alloc_large_system_hash() function for route hash table - rt hash table allocated using alloc_large_system_hash() function. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
22c047cc |
|
05-Jul-2005 |
Eric Dumazet <dada1@cosmosbay.com> |
[NET]: Hashed spinlocks in net/ipv4/route.c - Locking abstraction - Spinlocks moved out of rt hash table : Less memory (50%) used by rt hash table. it's a win even on UP. - Sizing of spinlocks table depends on NR_CPUS Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2c2910a4 |
|
28-Jun-2005 |
Dietmar Eggemann <dietmar.eggemann@gmx.de> |
[IPV4]: Snmpv2 Mib IP counter ipInAddrErrors support I followed Thomas' proposal to see every martian destination as a case where the ipInAddrErrors counter has to be incremented. There are two advantages by doing so: (1) The relation between the ipInReceive counter and all the other ipInXXX counters is more accurate in the case the RTN_UNICAST code check fails and (2) it makes the code in ip_route_input_slow easier. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7abaa27c |
|
22-Jun-2005 |
Chuck Short <zulcss@gmail.com> |
[IPV4]: Fix route.c gcc4 warnings Signed-off by: Chuck Short <zulcss@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b6544c0b |
|
18-Jun-2005 |
Jamal Hadi Salim <hadi@cyberus.ca> |
[NETLINK]: Correctly set NLM_F_MULTI without checking the pid This patch rectifies some rtnetlink message builders that derive the flags from the pid. It is now explicit like the other cases which get it right. Also fixes half a dozen dumpers which did not set NLM_F_MULTI at all. Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
02c30a84 |
|
05-May-2005 |
Jesper Juhl <juhl-lkml@dif.dk> |
[PATCH] update Ross Biro bouncing email address Ross moved. Remove the bad email address so people will find the correct one in ./CREDITS. Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
5bec0039 |
|
28-Apr-2005 |
Olaf Rempel <razzor@kopf-tisch.de> |
[NET]: /proc/net/stat/* header cleanup Signed-off-by: Olaf Rempel <razzor@kopf-tisch.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7e3e0360 |
|
28-Apr-2005 |
Dave Jones <davej@redhat.com> |
[IPV4]: Incorrect permissions on route flush sysctl This has been brought up before.. http://lkml.org/lkml/2000/1/21/116 but didnt seem to get resolved. This morning I got someone file a bugzilla about it breaking sysctl(8). Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9c2b3328 |
|
19-Apr-2005 |
Stephen Hemminger <shemminger@osdl.org> |
[NET]: skbuff: remove old NET_CALLER macro Here is a revised alternative that uses BUG_ON/WARN_ON (as suggested by Herbert Xu) to eliminate NET_CALLER. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1da177e4 |
|
16-Apr-2005 |
Linus Torvalds <torvalds@ppc970.osdl.org> |
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!
|