History log of /linux-master/net/ipv4/icmp.c
Revision Date Author Comments
# c58e88d4 20-Apr-2024 Eric Dumazet <edumazet@google.com>

icmp: prevent possible NULL dereferences from icmp_build_probe()

First problem is a double call to __in_dev_get_rcu(), because
the second one could return NULL.

if (__in_dev_get_rcu(dev) && __in_dev_get_rcu(dev)->ifa_list)

Second problem is a read from dev->ip6_ptr with no NULL check:

if (!list_empty(&rcu_dereference(dev->ip6_ptr)->addr_list))

Use the correct RCU API to fix these.

v2: add missing include <net/addrconf.h>

Fixes: d329ea5bd884 ("icmp: add response to RFC 8335 PROBE messages")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2b1dc628 04-Oct-2023 Florian Westphal <fw@strlen.de>

xfrm: pass struct net to xfrm_decode_session wrappers

Preparation patch, extra arg is not used.
No functional changes intended.

This is needed to replace the xfrm session decode functions with
the flow dissector.

skb_flow_dissect() cannot be used as-is, because it attempts to deduce the
'struct net' to use for bpf program fetch from skb->sk or skb->dev, but
xfrm code path can see skbs that have neither sk or dev filled in.

So either flow dissector needs to try harder, e.g. by also trying
skb->dst->dev, or we have to pass the struct net explicitly.

Passing the struct net doesn't look too bad to me, most places
already have it available or can derive it from the output device.

Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/netdev/202309271628.27fd2187-oliver.sang@intel.com/
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>


# 7d63b671 30-Mar-2023 Eric Dumazet <edumazet@google.com>

icmp: guard against too small mtu

syzbot was able to trigger a panic [1] in icmp_glue_bits(), or
more exactly in skb_copy_and_csum_bits()

There is no repro yet, but I think the issue is that syzbot
manages to lower device mtu to a small value, fooling __icmp_send()

__icmp_send() must make sure there is enough room for the
packet to include at least the headers.

We might in the future refactor skb_copy_and_csum_bits() and its
callers to no longer crash when something bad happens.

[1]
kernel BUG at net/core/skbuff.c:3343 !
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 15766 Comm: syz-executor.0 Not tainted 6.3.0-rc4-syzkaller-00039-gffe78bbd5121 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
RIP: 0010:skb_copy_and_csum_bits+0x798/0x860 net/core/skbuff.c:3343
Code: f0 c1 c8 08 41 89 c6 e9 73 ff ff ff e8 61 48 d4 f9 e9 41 fd ff ff 48 8b 7c 24 48 e8 52 48 d4 f9 e9 c3 fc ff ff e8 c8 27 84 f9 <0f> 0b 48 89 44 24 28 e8 3c 48 d4 f9 48 8b 44 24 28 e9 9d fb ff ff
RSP: 0018:ffffc90000007620 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 00000000000001e8 RCX: 0000000000000100
RDX: ffff8880276f6280 RSI: ffffffff87fdd138 RDI: 0000000000000005
RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000000
R10: 00000000000001e8 R11: 0000000000000001 R12: 000000000000003c
R13: 0000000000000000 R14: ffff888028244868 R15: 0000000000000b0e
FS: 00007fbc81f1c700(0000) GS:ffff88802ca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2df43000 CR3: 00000000744db000 CR4: 0000000000150ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
icmp_glue_bits+0x7b/0x210 net/ipv4/icmp.c:353
__ip_append_data+0x1d1b/0x39f0 net/ipv4/ip_output.c:1161
ip_append_data net/ipv4/ip_output.c:1343 [inline]
ip_append_data+0x115/0x1a0 net/ipv4/ip_output.c:1322
icmp_push_reply+0xa8/0x440 net/ipv4/icmp.c:370
__icmp_send+0xb80/0x1430 net/ipv4/icmp.c:765
ipv4_send_dest_unreach net/ipv4/route.c:1239 [inline]
ipv4_link_failure+0x5a9/0x9e0 net/ipv4/route.c:1246
dst_link_failure include/net/dst.h:423 [inline]
arp_error_report+0xcb/0x1c0 net/ipv4/arp.c:296
neigh_invalidate+0x20d/0x560 net/core/neighbour.c:1079
neigh_timer_handler+0xc77/0xff0 net/core/neighbour.c:1166
call_timer_fn+0x1a0/0x580 kernel/time/timer.c:1700
expire_timers+0x29b/0x4b0 kernel/time/timer.c:1751
__run_timers kernel/time/timer.c:2022 [inline]

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+d373d60fddbdc915e666@syzkaller.appspotmail.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230330174502.1915328-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# d0941130 24-Jan-2023 Jamie Bainbridge <jamie.bainbridge@gmail.com>

icmp: Add counters for rate limits

There are multiple ICMP rate limiting mechanisms:

* Global limits: net.ipv4.icmp_msgs_burst/icmp_msgs_per_sec
* v4 per-host limits: net.ipv4.icmp_ratelimit/ratemask
* v6 per-host limits: net.ipv6.icmp_ratelimit/ratemask

However, when ICMP output is limited, there is no way to tell
which limit has been hit or even if the limits are responsible
for the lack of ICMP output.

Add counters for each of the cases above. As we are within
local_bh_disable(), use the __INC stats variant.

Example output:

# nstat -sz "*RateLimit*"
IcmpOutRateLimitGlobal 134 0.0
IcmpOutRateLimitHost 770 0.0
Icmp6OutRateLimitHost 84 0.0

Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com>
Suggested-by: Abhishek Rawal <rawal.abhishek92@gmail.com>
Link: https://lore.kernel.org/r/273b32241e6b7fdc5c609e6f5ebc68caf3994342.1674605770.git.jamie.bainbridge@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# 8032bf12 09-Oct-2022 Jason A. Donenfeld <Jason@zx2c4.com>

treewide: use get_random_u32_below() instead of deprecated function

This is a simple mechanical transformation done by:

@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
(E)

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>


# 0968d2a4 13-Jul-2022 Kuniyuki Iwashima <kuniyu@amazon.com>

ip: Fix data-races around sysctl_ip_no_pmtu_disc.

While reading sysctl_ip_no_pmtu_disc, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1ebcb25a 11-Jul-2022 Kuniyuki Iwashima <kuniyu@amazon.com>

icmp: Fix a data-race around sysctl_icmp_ratemask.

While reading sysctl_icmp_ratemask, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2a4eb714 11-Jul-2022 Kuniyuki Iwashima <kuniyu@amazon.com>

icmp: Fix a data-race around sysctl_icmp_ratelimit.

While reading sysctl_icmp_ratelimit, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d2efabce 11-Jul-2022 Kuniyuki Iwashima <kuniyu@amazon.com>

icmp: Fix a data-race around sysctl_icmp_errors_use_inbound_ifaddr.

While reading sysctl_icmp_errors_use_inbound_ifaddr, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its reader.

Fixes: 1c2fb7f93cb2 ("[IPV4]: Sysctl configurable icmp error source address.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b04f9b7e 11-Jul-2022 Kuniyuki Iwashima <kuniyu@amazon.com>

icmp: Fix a data-race around sysctl_icmp_ignore_bogus_error_responses.

While reading sysctl_icmp_ignore_bogus_error_responses, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its reader.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 66484bb9 11-Jul-2022 Kuniyuki Iwashima <kuniyu@amazon.com>

icmp: Fix a data-race around sysctl_icmp_echo_ignore_broadcasts.

While reading sysctl_icmp_echo_ignore_broadcasts, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its reader.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4a2f7083 11-Jul-2022 Kuniyuki Iwashima <kuniyu@amazon.com>

icmp: Fix data-races around sysctl_icmp_echo_enable_probe.

While reading sysctl_icmp_echo_enable_probe, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.

Fixes: d329ea5bd884 ("icmp: add response to RFC 8335 PROBE messages")
Fixes: 1fd07f33c3ea ("ipv6: ICMPV6: add response to ICMPV6 RFC 8335 PROBE messages")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bb7bb35a 11-Jul-2022 Kuniyuki Iwashima <kuniyu@amazon.com>

icmp: Fix a data-race around sysctl_icmp_echo_ignore_all.

While reading sysctl_icmp_echo_ignore_all, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 48d7ee32 06-Jul-2022 Kuniyuki Iwashima <kuniyu@amazon.com>

icmp: Fix data-races around sysctl.

While reading icmp sysctl variables, they can be changed concurrently.
So, we need to add READ_ONCE() to avoid data-races.

Fixes: 4cdf507d5452 ("icmp: add a global rate limitation")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2e47eece 28-Apr-2022 Yu Zhe <yuzhe@nfschina.com>

ipv4: remove unnecessary type castings

remove unnecessary void* type castings.

Signed-off-by: Yu Zhe <yuzhe@nfschina.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b384c95a 07-Apr-2022 Menglong Dong <imagedong@tencent.com>

net: icmp: add skb drop reasons to icmp protocol

Replace kfree_skb() used in icmp_rcv() and icmpv6_rcv() with
kfree_skb_reason().

In order to get the reasons of the skb drops after icmp message handle,
we change the return type of 'handler()' in 'struct icmp_control' from
'bool' to 'enum skb_drop_reason'. This may change its original
intention, as 'false' means failure, but 'SKB_NOT_DROPPED_YET' means
success now. Therefore, all 'handler' and the call of them need to be
handled. Following 'handler' functions are involved:

icmp_unreach()
icmp_redirect()
icmp_echo()
icmp_timestamp()
icmp_discard()

And following new drop reasons are added:

SKB_DROP_REASON_ICMP_CSUM
SKB_DROP_REASON_INVALID_PROTO

The reason 'INVALID_PROTO' is introduced for the case that the packet
doesn't follow rfc 1122 and is dropped. This is not a common case, and
I believe we can locate the problem from the data in the packet. For now,
this 'INVALID_PROTO' is used for the icmp broadcasts with wrong types.

Maybe there should be a document file for these reasons. For example,
list all the case that causes the 'UNHANDLED_PROTO' and 'INVALID_PROTO'
drop reason. Therefore, users can locate their problems according to the
document.

Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a15c89c7 24-Jan-2022 Eric Dumazet <edumazet@google.com>

ipv4: do not use per netns icmp sockets

Back in linux-2.6.25 (commit 4a6ad7a141cb "[NETNS]: Make icmp_sk per namespace."),
we added private per-cpu/per-netns ipv4 icmp sockets.

This adds memory and cpu costs, which do not seem needed.
Now typical servers have 256 or more cores, this adds considerable
tax to netns users.

icmp sockets are used from BH context, are not receiving packets,
and do not store any persistent state but the 'struct net' pointer.

icmp_xmit_lock() already makes sure to lock the chosen per-cpu
socket.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1fcd7945 14-Oct-2021 Xin Long <lucien.xin@gmail.com>

icmp: fix icmp_ext_echo_iio parsing in icmp_build_probe

In icmp_build_probe(), the icmp_ext_echo_iio parsing should be done
step by step and skb_header_pointer() return value should always be
checked, this patch fixes 3 places in there:

- On case ICMP_EXT_ECHO_CTYPE_NAME, it should only copy ident.name
from skb by skb_header_pointer(), its len is ident_len. Besides,
the return value of skb_header_pointer() should always be checked.

- On case ICMP_EXT_ECHO_CTYPE_INDEX, move ident_len check ahead of
skb_header_pointer(), and also do the return value check for
skb_header_pointer().

- On case ICMP_EXT_ECHO_CTYPE_ADDR, before accessing iio->ident.addr.
ctype3_hdr.addrlen, skb_header_pointer() should be called first,
then check its return value and ident_len.
On subcases ICMP_AFI_IP and ICMP_AFI_IP6, also do check for ident.
addr.ctype3_hdr.addrlen and skb_header_pointer()'s return value.
On subcase ICMP_AFI_IP, the len for skb_header_pointer() should be
"sizeof(iio->extobj_hdr) + sizeof(iio->ident.addr.ctype3_hdr) +
sizeof(struct in_addr)" or "ident_len".

v1->v2:
- To make it more clear, call skb_header_pointer() once only for
iio->indent's parsing as Jakub Suggested.
v2->v3:
- The extobj_hdr.length check against sizeof(_iio) should be done
before calling skb_header_pointer(), as Eric noticed.

Fixes: d329ea5bd884 ("icmp: add response to RFC 8335 PROBE messages")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/31628dd76657ea62f5cf78bb55da6b35240831f1.1634205050.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 1160dfa1 05-Aug-2021 Yajun Deng <yajun.deng@linux.dev>

net: Remove redundant if statements

The 'if (dev)' statement already move into dev_{put , hold}, so remove
redundant if statements.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1fd07f33 26-Jun-2021 Andreas Roeseler <andreas.a.roeseler@gmail.com>

ipv6: ICMPV6: add response to ICMPV6 RFC 8335 PROBE messages

This patch builds off of commit 2b246b2569cd2ac6ff700d0dce56b8bae29b1842
and adds functionality to respond to ICMPV6 PROBE requests.

Add icmp_build_probe function to construct PROBE requests for both
ICMPV4 and ICMPV6.

Modify icmpv6_rcv to detect ICMPV6 PROBE messages and call the
icmpv6_echo_reply handler.

Modify icmpv6_echo_reply to build a PROBE response message based on the
queried interface.

This patch has been tested using a branch of the iputils git repo which can
be found here: https://github.com/Juniper-Clinic-2020/iputils/tree/probe-request

Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e32ea44c 03-Jun-2021 Andreas Roeseler <andreas.a.roeseler@gmail.com>

icmp: fix lib conflict with trinity

Including <linux/in.h> and <netinet/in.h> in the dependencies breaks
compilation of trinity due to multiple definitions. <linux/in.h> is only
used in <linux/icmp.h> to provide the definition of the struct in_addr,
but this can be substituted out by using the datatype __be32.

Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 32182747 18-Jun-2021 Toke Høiland-Jørgensen <toke@redhat.com>

icmp: don't send out ICMP messages with a source address of 0.0.0.0

When constructing ICMP response messages, the kernel will try to pick a
suitable source address for the outgoing packet. However, if no IPv4
addresses are configured on the system at all, this will fail and we end up
producing an ICMP message with a source address of 0.0.0.0. This can happen
on a box routing IPv4 traffic via v6 nexthops, for instance.

Since 0.0.0.0 is not generally routable on the internet, there's a good
chance that such ICMP messages will never make it back to the sender of the
original packet that the ICMP message was sent in response to. This, in
turn, can create connectivity and PMTUd problems for senders. Fortunately,
RFC7600 reserves a dummy address to be used as a source for ICMP
messages (192.0.0.8/32), so let's teach the kernel to substitute that
address as a last resort if the regular source address selection procedure
fails.

Below is a quick example reproducing this issue with network namespaces:

ip netns add ns0
ip l add type veth peer netns ns0
ip l set dev veth0 up
ip a add 10.0.0.1/24 dev veth0
ip a add fc00:dead:cafe:42::1/64 dev veth0
ip r add 10.1.0.0/24 via inet6 fc00:dead:cafe:42::2
ip -n ns0 l set dev veth0 up
ip -n ns0 a add fc00:dead:cafe:42::2/64 dev veth0
ip -n ns0 r add 10.0.0.0/24 via inet6 fc00:dead:cafe:42::1
ip netns exec ns0 sysctl -w net.ipv4.icmp_ratelimit=0
ip netns exec ns0 sysctl -w net.ipv4.ip_forward=1
tcpdump -tpni veth0 -c 2 icmp &
ping -w 1 10.1.0.1 > /dev/null
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on veth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
IP 10.0.0.1 > 10.1.0.1: ICMP echo request, id 29, seq 1, length 64
IP 0.0.0.0 > 10.0.0.1: ICMP net 10.1.0.1 unreachable, length 92
2 packets captured
2 packets received by filter
0 packets dropped by kernel

With this patch the above capture changes to:
IP 10.0.0.1 > 10.1.0.1: ICMP echo request, id 31127, seq 1, length 64
IP 192.0.0.8 > 10.0.0.1: ICMP net 10.1.0.1 unreachable, length 92

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Juliusz Chroboczek <jch@irif.fr>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e542d29c 27-Apr-2021 Andreas Roeseler <andreas.a.roeseler@gmail.com>

icmp: standardize naming of RFC 8335 PROBE constants

The current definitions of constants for PROBE, currently defined only
in the net-next kernel branch, are inconsistent, with
some beginning with ICMP and others with simply EXT. This patch
attempts to standardize the naming conventions of the constants for
PROBE before their release into a stable Kernel, and to update the
relevant definitions in net/ipv4/icmp.c.

Similarly, the definitions for the code field (previously
ICMP_EXT_MAL_QUERY, etc) use the same prefixes as the type field. This
patch adds _CODE_ to the prefix to clarify the distinction of these
constants.

Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Acked-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20210427153635.2591-1-andreas.a.roeseler@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 31433202 12-Apr-2021 Andreas Roeseler <andreas.a.roeseler@gmail.com>

icmp: ICMPV6: pass RFC 8335 reply messages to ping_rcv

The current icmp_rcv function drops all unknown ICMP types, including
ICMP_EXT_ECHOREPLY (type 43). In order to parse Extended Echo Reply messages, we have
to pass these packets to the ping_rcv function, which does not do any
other filtering and passes the packet to the designated socket.

Pass incoming RFC 8335 ICMP Extended Echo Reply packets to the ping_rcv
handler instead of discarding the packet.

Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d329ea5b 29-Mar-2021 Andreas Roeseler <andreas.a.roeseler@gmail.com>

icmp: add response to RFC 8335 PROBE messages

Modify the icmp_rcv function to check PROBE messages and call icmp_echo
if a PROBE request is detected.

Modify the existing icmp_echo function to respond ot both ping and PROBE
requests.

This was tested using a custom modification to the iputils package and
wireshark. It supports IPV4 probing by name, ifindex, and probing by
both IPV4 and IPV6 addresses. It currently does not support responding
to probes off the proxy node (see RFC 8335 Section 2).

The modification to the iputils package is still in development and can
be found here: https://github.com/Juniper-Clinic-2020/iputils.git. It
supports full sending functionality of PROBE requests, but currently
does not parse the response messages, which is why Wireshark is required
to verify the sent and recieved PROBE messages. The modification adds
the ``-e'' flag to the command which allows the user to specify the
interface identifier to query the probed host. An example usage would be
<./ping -4 -e 1 [destination]> to send a PROBE request of ifindex 1 to the
destination node.

Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ee576c47 23-Feb-2021 Jason A. Donenfeld <Jason@zx2c4.com>

net: icmp: pass zeroed opts from icmp{,v6}_ndo_send before sending

The icmp{,v6}_send functions make all sorts of use of skb->cb, casting
it with IPCB or IP6CB, assuming the skb to have come directly from the
inet layer. But when the packet comes from the ndo layer, especially
when forwarded, there's no telling what might be in skb->cb at that
point. As a result, the icmp sending code risks reading bogus memory
contents, which can result in nasty stack overflows such as this one
reported by a user:

panic+0x108/0x2ea
__stack_chk_fail+0x14/0x20
__icmp_send+0x5bd/0x5c0
icmp_ndo_send+0x148/0x160

In icmp_send, skb->cb is cast with IPCB and an ip_options struct is read
from it. The optlen parameter there is of particular note, as it can
induce writes beyond bounds. There are quite a few ways that can happen
in __ip_options_echo. For example:

// sptr/skb are attacker-controlled skb bytes
sptr = skb_network_header(skb);
// dptr/dopt points to stack memory allocated by __icmp_send
dptr = dopt->__data;
// sopt is the corrupt skb->cb in question
if (sopt->rr) {
optlen = sptr[sopt->rr+1]; // corrupt skb->cb + skb->data
soffset = sptr[sopt->rr+2]; // corrupt skb->cb + skb->data
// this now writes potentially attacker-controlled data, over
// flowing the stack:
memcpy(dptr, sptr+sopt->rr, optlen);
}

In the icmpv6_send case, the story is similar, but not as dire, as only
IP6CB(skb)->iif and IP6CB(skb)->dsthao are used. The dsthao case is
worse than the iif case, but it is passed to ipv6_find_tlv, which does
a bit of bounds checking on the value.

This is easy to simulate by doing a `memset(skb->cb, 0x41,
sizeof(skb->cb));` before calling icmp{,v6}_ndo_send, and it's only by
good fortune and the rarity of icmp sending from that context that we've
avoided reports like this until now. For example, in KASAN:

BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0xa0e/0x12b0
Write of size 38 at addr ffff888006f1f80e by task ping/89
CPU: 2 PID: 89 Comm: ping Not tainted 5.10.0-rc7-debug+ #5
Call Trace:
dump_stack+0x9a/0xcc
print_address_description.constprop.0+0x1a/0x160
__kasan_report.cold+0x20/0x38
kasan_report+0x32/0x40
check_memory_region+0x145/0x1a0
memcpy+0x39/0x60
__ip_options_echo+0xa0e/0x12b0
__icmp_send+0x744/0x1700

Actually, out of the 4 drivers that do this, only gtp zeroed the cb for
the v4 case, while the rest did not. So this commit actually removes the
gtp-specific zeroing, while putting the code where it belongs in the
shared infrastructure of icmp{,v6}_ndo_send.

This commit fixes the issue by passing an empty IPCB or IP6CB along to
the functions that actually do the work. For the icmp_send, this was
already trivial, thanks to __icmp_send providing the plumbing function.
For icmpv6_send, this required a tiny bit of refactoring to make it
behave like the v4 case, after which it was straight forward.

Fixes: a2b78e9b2cac ("sunvnet: generate ICMP PTMUD messages for smaller port MTUs")
Reported-by: SinYu <liuxyon@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/netdev/CAF=yD-LOF116aHub6RMe8vB8ZpnrrnoTdqhobEx+bvoA8AsP0w@mail.gmail.com/T/
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Link: https://lore.kernel.org/r/20210223131858.72082-1-Jason@zx2c4.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 3df98d79 27-Sep-2020 Paul Moore <paul@paul-moore.com>

lsm,selinux: pass flowi_common instead of flowi to the LSM hooks

As pointed out by Herbert in a recent related patch, the LSM hooks do
not have the necessary address family information to use the flowi
struct safely. As none of the LSMs currently use any of the protocol
specific flowi information, replace the flowi pointers with pointers
to the address family independent flowi_common struct.

Reported-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>


# b38e7819 15-Oct-2020 Eric Dumazet <edumazet@google.com>

icmp: randomize the global rate limiter

Keyu Man reported that the ICMP rate limiter could be used
by attackers to get useful signal. Details will be provided
in an upcoming academic publication.

Our solution is to add some noise, so that the attackers
no longer can get help from the predictable token bucket limiter.

Fixes: 4cdf507d5452 ("icmp: add a global rate limitation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Keyu Man <kman001@ucr.edu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# e1e84eb5 12-Oct-2020 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

ipv4/icmp: l3mdev: Perform icmp error route lookup on source device routing table (v2)

As per RFC792, ICMP errors should be sent to the source host.

However, in configurations with Virtual Routing and Forwarding tables,
looking up which routing table to use is currently done by using the
destination net_device.

commit 9d1a6c4ea43e ("net: icmp_route_lookup should use rt dev to
determine L3 domain") changes the interface passed to
l3mdev_master_ifindex() and inet_addr_type_dev_table() from skb_in->dev
to skb_dst(skb_in)->dev. This effectively uses the destination device
rather than the source device for choosing which routing table should be
used to lookup where to send the ICMP error.

Therefore, if the source and destination interfaces are within separate
VRFs, or one in the global routing table and the other in a VRF, looking
up the source host in the destination interface's routing table will
fail if the destination interface's routing table contains no route to
the source host.

One observable effect of this issue is that traceroute does not work in
the following cases:

- Route leaking between global routing table and VRF
- Route leaking between VRFs

Preferably use the source device routing table when sending ICMP error
messages. If no source device is set, fall-back on the destination
device routing table. Else, use the main routing table (index 0).

[ It has been pointed out that a similar issue may exist with ICMP
errors triggered when forwarding between network namespaces. It would
be worthwhile to investigate, but is outside of the scope of this
investigation. ]

[ It has also been pointed out that a similar issue exists with
unreachable / fragmentation needed messages, which can be triggered by
changing the MTU of eth1 in r1 to 1400 and running:

ip netns exec h1 ping -s 1450 -Mdo -c1 172.16.2.2

Some investigation points to raw_icmp_error() and raw_err() as being
involved in this last scenario. The focus of this patch is TTL expired
ICMP messages, which go through icmp_route_lookup.
Investigation of failure modes related to raw_icmp_error() is beyond
this investigation's scope. ]

Fixes: 9d1a6c4ea43e ("net: icmp_route_lookup should use rt dev to determine L3 domain")
Link: https://tools.ietf.org/html/rfc792
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 5af68891 29-Aug-2020 Miaohe Lin <linmiaohe@huawei.com>

net: clean up codestyle

This is a pure codestyle cleanup patch. No functional change intended.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 373c15c2 24-Aug-2020 Miaohe Lin <linmiaohe@huawei.com>

net: Use helper macro RT_TOS() in __icmp_send()

Use helper macro RT_TOS() to get tos in __icmp_send().

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cc44c17b 10-Jul-2020 Al Viro <viro@zeniv.linux.org.uk>

csum_partial_copy_nocheck(): drop the last argument

It's always 0. Note that we theoretically could use ~0U as well -
result will be the same modulo 0xffff, _if_ the damn thing did the
right thing for any value of initial sum; later we'll make use of
that when convenient.

However, unlike csum_and_copy_..._user(), there are instances that
did not work for arbitrary initial sums; c6x is one such.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 3ea7ca80 10-Jul-2020 Al Viro <viro@zeniv.linux.org.uk>

icmp_push_reply(): reorder adding the checksum up

do csum_partial_copy_nocheck() on the first fragment, then
add the rest to it. Equivalent transformation.

That was the only caller of csum_partial_copy_nocheck() that
might pass it non-zero as the last argument.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8d5930df 10-Jul-2020 Al Viro <viro@zeniv.linux.org.uk>

skb_copy_and_csum_bits(): don't bother with the last argument

it's always 0

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 01370434 24-Jul-2020 Willem de Bruijn <willemb@google.com>

icmp6: support rfc 4884

Extend the rfc 4884 read interface introduced for ipv4 in
commit eba75c587e81 ("icmp: support rfc 4884") to ipv6.

Add socket option SOL_IPV6/IPV6_RECVERR_RFC4884.

Changes v1->v2:
- make ipv6_icmp_error_rfc4884 static (file scope)

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 178c49d9 24-Jul-2020 Willem de Bruijn <willemb@google.com>

icmp: prepare rfc 4884 for ipv6

The RFC 4884 spec is largely the same between IPv4 and IPv6.
Factor out the IPv4 specific parts in preparation for IPv6 support:

- icmp types supported

- icmp header size, and thus offset to original datagram start

- datagram length field offset in icmp(6)hdr.

- datagram length field word size: 4B for IPv4, 8B for IPv6.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c4e9e09f 24-Jul-2020 Willem de Bruijn <willemb@google.com>

icmp: revise rfc4884 tests

1) Only accept packets with original datagram len field >= header len.

The extension header must start after the original datagram headers.
The embedded datagram len field is compared against the 128B minimum
stipulated by RFC 4884. It is unlikely that headers extend beyond
this. But as we know the exact header length, check explicitly.

2) Remove the check that datagram length must be <= 576B.

This is a send constraint. There is no value in testing this on rx.
Within private networks it may be known safe to send larger packets.
Process these packets.

This test was also too lax. It compared original datagram length
rather than entire icmp packet length. The stand-alone fix would be:

- if (hlen + skb->len > 576)
+ if (-skb_network_offset(skb) + skb->len > 576)

Fixes: eba75c587e81 ("icmp: support rfc 4884")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# eba75c58 10-Jul-2020 Willem de Bruijn <willemb@google.com>

icmp: support rfc 4884

Add setsockopt SOL_IP/IP_RECVERR_4884 to return the offset to an
extension struct if present.

ICMP messages may include an extension structure after the original
datagram. RFC 4884 standardized this behavior. It stores the offset
in words to the extension header in u8 icmphdr.un.reserved[1].

The field is valid only for ICMP types destination unreachable, time
exceeded and parameter problem, if length is at least 128 bytes and
entire packet does not exceed 576 bytes.

Return the offset to the start of the extension struct when reading an
ICMP error from the error queue, if it matches the above constraints.

Do not return the raw u8 field. Return the offset from the start of
the user buffer, in bytes. The kernel does not return the network and
transport headers, so subtract those.

Also validate the headers. Return the offset regardless of validation,
as an invalid extension must still not be misinterpreted as part of
the original datagram. Note that !invalid does not imply valid. If
the extension version does not match, no validation can take place,
for instance.

For backward compatibility, make this optional, set by setsockopt
SOL_IP/IP_RECVERR_RFC4884. For API example and feature test, see
github.com/wdebruij/kerneltools/blob/master/tests/recv_icmp_v2.c

For forward compatibility, reserve only setsockopt value 1, leaving
other bits for additional icmp extensions.

Changes
v1->v2:
- convert word offset to byte offset from start of user buffer
- return in ee_data as u8 may be insufficient
- define extension struct and object header structs
- return len only if constraints met
- if returning len, also validate

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0da7536f 01-Jul-2020 Willem de Bruijn <willemb@google.com>

ip: Fix SO_MARK in RST, ACK and ICMP packets

When no full socket is available, skbs are sent over a per-netns
control socket. Its sk_mark is temporarily adjusted to match that
of the real (request or timewait) socket or to reflect an incoming
skb, so that the outgoing skb inherits this in __ip_make_skb.

Introduction of the socket cookie mark field broke this. Now the
skb is set through the cookie and cork:

<caller> # init sockc.mark from sk_mark or cmsg
ip_append_data
ip_setup_cork # convert sockc.mark to cork mark
ip_push_pending_frames
ip_finish_skb
__ip_make_skb # set skb->mark to cork mark

But I missed these special control sockets. Update all callers of
__ip(6)_make_skb that were originally missed.

For IPv6, the same two icmp(v6) paths are affected. The third
case is not, as commit 92e55f412cff ("tcp: don't annotate
mark on control socket from tcp_v6_send_response()") replaced
the ctl_sk->sk_mark with passing the mark field directly as a
function argument. That commit predates the commit that
introduced the bug.

Fixes: c6af0c227a22 ("ip: support SO_MARK cmsg")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reported-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1cec2cac 27-Apr-2020 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

docs: networking: convert ip-sysctl.txt to ReST

- add SPDX header;
- adjust titles and chapters, adding proper markups;
- mark code blocks and literals as such;
- mark lists as such;
- mark tables as such;
- use footnote markup;
- adjust identation, whitespaces and blank lines;
- add to networking/index.rst.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a8eceea8 12-Mar-2020 Joe Perches <joe@perches.com>

inet: Use fallthrough;

Convert the various uses of fallthrough comments to fallthrough;

Done via script
Link: https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/

And by hand:

net/ipv6/ip6_fib.c has a fallthrough comment outside of an #ifdef block
that causes gcc to emit a warning if converted in-place.

So move the new fallthrough; inside the containing #ifdef/#endif too.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0b41713b 11-Feb-2020 Jason A. Donenfeld <Jason@zx2c4.com>

icmp: introduce helper for nat'd source address in network device context

This introduces a helper function to be called only by network drivers
that wraps calls to icmp[v6]_send in a conntrack transformation, in case
NAT has been used. We don't want to pollute the non-driver path, though,
so we introduce this as a helper to be called by places that actually
make use of this, as suggested by Florian.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bbab7ef2 08-Nov-2019 Eric Dumazet <edumazet@google.com>

net: icmp: fix data-race in cmp_global_allow()

This code reads two global variables without protection
of a lock. We need READ_ONCE()/WRITE_ONCE() pairs to
avoid load/store-tearing and better document the intent.

KCSAN reported :
BUG: KCSAN: data-race in icmp_global_allow / icmp_global_allow

read to 0xffffffff861a8014 of 4 bytes by task 11201 on cpu 0:
icmp_global_allow+0x36/0x1b0 net/ipv4/icmp.c:254
icmpv6_global_allow net/ipv6/icmp.c:184 [inline]
icmpv6_global_allow net/ipv6/icmp.c:179 [inline]
icmp6_send+0x493/0x1140 net/ipv6/icmp.c:514
icmpv6_send+0x71/0xb0 net/ipv6/ip6_icmp.c:43
ip6_link_failure+0x43/0x180 net/ipv6/route.c:2640
dst_link_failure include/net/dst.h:419 [inline]
vti_xmit net/ipv4/ip_vti.c:243 [inline]
vti_tunnel_xmit+0x27f/0xa50 net/ipv4/ip_vti.c:279
__netdev_start_xmit include/linux/netdevice.h:4420 [inline]
netdev_start_xmit include/linux/netdevice.h:4434 [inline]
xmit_one net/core/dev.c:3280 [inline]
dev_hard_start_xmit+0xef/0x430 net/core/dev.c:3296
__dev_queue_xmit+0x14c9/0x1b60 net/core/dev.c:3873
dev_queue_xmit+0x21/0x30 net/core/dev.c:3906
neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
neigh_output include/net/neighbour.h:511 [inline]
ip6_finish_output2+0x7a6/0xec0 net/ipv6/ip6_output.c:116
__ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
__ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
NF_HOOK_COND include/linux/netfilter.h:294 [inline]
ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
dst_output include/net/dst.h:436 [inline]
ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179

write to 0xffffffff861a8014 of 4 bytes by task 11183 on cpu 1:
icmp_global_allow+0x174/0x1b0 net/ipv4/icmp.c:272
icmpv6_global_allow net/ipv6/icmp.c:184 [inline]
icmpv6_global_allow net/ipv6/icmp.c:179 [inline]
icmp6_send+0x493/0x1140 net/ipv6/icmp.c:514
icmpv6_send+0x71/0xb0 net/ipv6/ip6_icmp.c:43
ip6_link_failure+0x43/0x180 net/ipv6/route.c:2640
dst_link_failure include/net/dst.h:419 [inline]
vti_xmit net/ipv4/ip_vti.c:243 [inline]
vti_tunnel_xmit+0x27f/0xa50 net/ipv4/ip_vti.c:279
__netdev_start_xmit include/linux/netdevice.h:4420 [inline]
netdev_start_xmit include/linux/netdevice.h:4434 [inline]
xmit_one net/core/dev.c:3280 [inline]
dev_hard_start_xmit+0xef/0x430 net/core/dev.c:3296
__dev_queue_xmit+0x14c9/0x1b60 net/core/dev.c:3873
dev_queue_xmit+0x21/0x30 net/core/dev.c:3906
neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
neigh_output include/net/neighbour.h:511 [inline]
ip6_finish_output2+0x7a6/0xec0 net/ipv6/ip6_output.c:116
__ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
__ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
NF_HOOK_COND include/linux/netfilter.h:294 [inline]
ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 11183 Comm: syz-executor.2 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: 4cdf507d5452 ("icmp: add a global rate limitation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2adf81c0 31-Oct-2019 Francesco Ruggeri <fruggeri@arista.com>

net: icmp: use input address in traceroute

Even with icmp_errors_use_inbound_ifaddr set, traceroute returns the
primary address of the interface the packet was received on, even if
the path goes through a secondary address. In the example:

1.0.3.1/24
---- 1.0.1.3/24 1.0.1.1/24 ---- 1.0.2.1/24 1.0.2.4/24 ----
|H1|--------------------------|R1|--------------------------|H2|
---- N1 ---- N2 ----

where 1.0.3.1/24 is R1's primary address on N1, traceroute from
H1 to H2 returns:

traceroute to 1.0.2.4 (1.0.2.4), 30 hops max, 60 byte packets
1 1.0.3.1 (1.0.3.1) 0.018 ms 0.006 ms 0.006 ms
2 1.0.2.4 (1.0.2.4) 0.021 ms 0.007 ms 0.007 ms

After applying this patch, it returns:

traceroute to 1.0.2.4 (1.0.2.4), 30 hops max, 60 byte packets
1 1.0.1.1 (1.0.1.1) 0.033 ms 0.007 ms 0.006 ms
2 1.0.2.4 (1.0.2.4) 0.011 ms 0.007 ms 0.007 ms

Original-patch-by: Bill Fenner <fenner@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e2c69393 22-Aug-2019 Hangbin Liu <liuhangbin@gmail.com>

ipv4/icmp: fix rt dst dev null pointer dereference

In __icmp_send() there is a possibility that the rt->dst.dev is NULL,
e,g, with tunnel collect_md mode, which will cause kernel crash.
Here is what the code path looks like, for GRE:

- ip6gre_tunnel_xmit
- ip6gre_xmit_ipv4
- __gre6_xmit
- ip6_tnl_xmit
- if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
- icmp_send
- net = dev_net(rt->dst.dev); <-- here

The reason is __metadata_dst_init() init dst->dev to NULL by default.
We could not fix it in __metadata_dst_init() as there is no dev supplied.
On the other hand, the reason we need rt->dst.dev is to get the net.
So we can just try get it from skb->dev when rt->dst.dev is NULL.

v4: Julian Anastasov remind skb->dev also could be NULL. We'd better
still use dst.dev and do a check to avoid crash.

v3: No changes.

v2: fix the issue in __icmp_send() instead of updating shared dst dev
in {ip_md, ip6}_tunnel_xmit.

Fixes: c8b34e680a09 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0f404bbd 19-Aug-2019 Li RongQing <lirongqing@baidu.com>

net: fix icmp_socket_deliver argument 2 input

it expects a unsigned int, but got a __be32

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 046386ca 31-May-2019 Eric Dumazet <edumazet@google.com>

ipv4: icmp: use this_cpu_read() in icmp_sk()

this_cpu_read(*X) is faster than *this_cpu_ptr(X)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2874c5fd 27-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 9ef6b42a 25-Feb-2019 Nazarov Sergey <s-nazarov@yandex.ru>

net: Add __icmp_send helper.

Add __icmp_send function having ip_options struct parameter

Signed-off-by: Sergey Nazarov <s-nazarov@yandex.ru>
Reviewed-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e9128c14 23-Feb-2019 Kefeng Wang <wangkefeng.wang@huawei.com>

ipv4: icmp: use icmp_sk_exit()

Simply use icmp_sk_exit() when inet_ctl_sock_create() fail in icmp_sk_init().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 32bbd879 07-Nov-2018 Stefano Brivio <sbrivio@redhat.com>

net: Convert protocol error handlers from void to int

We'll need this to handle ICMP errors for tunnels without a sending socket
(i.e. FoU and GUE). There, we might have to look up different types of IP
tunnels, registered as network protocols, before we get a match, so we
want this for the error handlers of IPPROTO_IPIP and IPPROTO_IPV6 in both
inet_protos and inet6_protos. These error codes will be used in the next
patch.

For consistency, return sensible error codes in protocol error handlers
whenever handlers can't handle errors because, even if valid, they don't
match a protocol or any of its states.

This has no effect on existing error handling paths.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1042caa7 25-Sep-2018 Maciej Żenczykowski <maze@google.com>

net-ipv4: remove 2 always zero parameters from ipv4_redirect()

(the parameters in question are mark and flow_flags)

Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d888f396 25-Sep-2018 Maciej Żenczykowski <maze@google.com>

net-ipv4: remove 2 always zero parameters from ipv4_update_pmtu()

(the parameters in question are mark and flow_flags)

Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 35178206 06-Jul-2018 Willem de Bruijn <willemb@google.com>

ipv4: ipcm_cookie initializers

Initialize the cookie in one location to reduce code duplication and
avoid bugs from inconsistent initialization, such as that fixed in
commit 9887cba19978 ("ip: limit use of gso_size to udp").

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bc969a97 03-Jul-2018 Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>

net: ipv4: Hook into time based transmission

Add a transmit_time field to struct inet_cork, then copy the
timestamp from the CMSG cookie at ip_setup_cork() so we can
safely copy it into the skb later during __ip_make_skb().

For the raw fast path, just perform the copy at raw_send_hdrinc().

Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2f635cee 27-Mar-2018 Kirill Tkhai <ktkhai@virtuozzo.com>

net: Drop pernet_operations::async

Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f84c6821 12-Feb-2018 Kirill Tkhai <ktkhai@virtuozzo.com>

net: Convert pernet_subsys, registered from inet_init()

arp_net_ops just addr/removes /proc entry.

devinet_ops allocates and frees duplicate of init_net tables
and (un)registers sysctl entries.

fib_net_ops allocates and frees pernet tables, creates/destroys
netlink socket and (un)initializes /proc entries. Foreign
pernet_operations do not touch them.

ip_rt_proc_ops only modifies pernet /proc entries.

xfrm_net_ops creates/destroys /proc entries, allocates/frees
pernet statistics, hashes and tables, and (un)initializes
sysctl files. These are not touched by foreigh pernet_operations

xfrm4_net_ops allocates/frees private pernet memory, and
configures sysctls.

sysctl_route_ops creates/destroys sysctls.

rt_genid_ops only initializes fields of just allocated net.

ipv4_inetpeer_ops allocated/frees net private memory.

igmp_net_ops just creates/destroys /proc files and socket,
noone else interested in.

tcp_sk_ops seems to be safe, because tcp_sk_init() does not
depend on any other pernet_operations modifications. Iteration
over hash table in inet_twsk_purge() is made under RCU lock,
and it's safe to iterate the table this way. Removing from
the table happen from inet_twsk_deschedule_put(), but this
function is safe without any extern locks, as it's synchronized
inside itself. There are many examples, it's used in different
context. So, it's safe to leave tcp_sk_exit_batch() unlocked.

tcp_net_metrics_ops is synchronized on tcp_metrics_lock and safe.

udplite4_net_ops only creates/destroys pernet /proc file.

icmp_sk_ops creates percpu sockets, not touched by foreign
pernet_operations.

ipmr_net_ops creates/destroys pernet fib tables, (un)registers
fib rules and /proc files. This seem to be safe to execute
in parallel with foreign pernet_operations.

af_inet_ops just sets up default parameters of newly created net.

ipv4_mib_ops creates and destroys pernet percpu statistics.

raw_net_ops, tcp4_net_ops, udp4_net_ops, ping_v4_net_ops
and ip_proc_ops only create/destroy pernet /proc files.

ip4_frags_ops creates and destroys sysctl file.

So, it's safe to make the pernet_operations async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 15285402 23-Oct-2017 Gustavo A. R. Silva <garsilva@embeddedor.com>

ipv4: icmp: use BUG_ON instead of if condition followed by BUG

Use BUG_ON instead of if condition followed by BUG in icmp_timestamp.

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 258bbb1b 12-Oct-2017 Matteo Croce <mcroce@redhat.com>

icmp: don't fail on fragment reassembly time exceeded

The ICMP implementation currently replies to an ICMP time exceeded message
(type 11) with an ICMP host unreachable message (type 3, code 1).

However, time exceeded messages can either represent "time to live exceeded
in transit" (code 0) or "fragment reassembly time exceeded" (code 1).

Unconditionally replying to "fragment reassembly time exceeded" with
host unreachable messages might cause unjustified connection resets
which are now easily triggered as UFO has been removed, because, in turn,
sending large buffers triggers IP fragmentation.

The issue can be easily reproduced by running a lot of UDP streams
which is likely to trigger IP fragmentation:

# start netserver in the test namespace
ip netns add test
ip netns exec test netserver

# create a VETH pair
ip link add name veth0 type veth peer name veth0 netns test
ip link set veth0 up
ip -n test link set veth0 up

for i in $(seq 20 29); do
# assign addresses to both ends
ip addr add dev veth0 192.168.$i.1/24
ip -n test addr add dev veth0 192.168.$i.2/24

# start the traffic
netperf -L 192.168.$i.1 -H 192.168.$i.2 -t UDP_STREAM -l 0 &
done

# wait
send_data: data send error: No route to host (errno 113)
netperf: send_omni: send_data failed: No route to host

We need to differentiate instead: if fragment reassembly time exceeded
is reported, we need to silently drop the packet,
if time to live exceeded is reported, maintain the current behaviour.
In both cases increment the related error count "icmpInTimeExcds".

While at it, fix a typo in a comment, and convert the if statement
into a switch to mate it more readable.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 91ed1e66 03-Aug-2017 Paolo Abeni <pabeni@redhat.com>

ip/options: explicitly provide net ns to __ip_options_echo()

__ip_options_echo() uses the current network namespace, and
currently retrives it via skb->dst->dev.

This commit adds an explicit 'net' argument to __ip_options_echo()
and update all the call sites to provide it, usually via a simpler
sock_net().

After this change, __ip_options_echo() no more needs to access
skb->dst and we can drop a couple of hack to preserve such
info in the rx path.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 849a44de 14-Jun-2017 Jesper Dangaard Brouer <brouer@redhat.com>

net: don't global ICMP rate limit packets originating from loopback

Florian Weimer seems to have a glibc test-case which requires that
loopback interfaces does not get ICMP ratelimited. This was broken by
commit c0303efeab73 ("net: reduce cycles spend on ICMP replies that
gets rate limited").

An ICMP response will usually be routed back-out the same incoming
interface. Thus, take advantage of this and skip global ICMP
ratelimit when the incoming device is loopback. In the unlikely event
that the outgoing it not loopback, due to strange routing policy
rules, ICMP rate limiting still works via peer ratelimiting via
icmpv4_xrlim_allow(). Thus, we should still comply with RFC1812
(section 4.3.2.8 "Rate Limiting").

This seems to fix the reproducer given by Florian. While still
avoiding to perform expensive and unneeded outgoing route lookup for
rate limited packets (in the non-loopback case).

Fixes: c0303efeab73 ("net: reduce cycles spend on ICMP replies that gets rate limited")
Reported-by: Florian Weimer <fweimer@redhat.com>
Reported-by: "H.J. Lu" <hjl.tools@gmail.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3abd1ade 25-May-2017 David Ahern <dsahern@gmail.com>

net: ipv4: refactor __ip_route_output_key_hash

A later patch wants access to the fib result on an output route lookup
with the rcu lock held. Refactor __ip_route_output_key_hash, pushing
the logic between rcu_read_lock ... rcu_read_unlock into a new helper
with the fib_result as an input arg.

To keep the name length under control remove the leading underscores
from the name and add _rcu to the name of the new helper indicating it
is called with the rcu read lock held.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bf4e0a3d 16-Mar-2017 Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

net: ipv4: add support for ECMP hash policy choice

This patch adds support for ECMP hash policy choice via a new sysctl
called fib_multipath_hash_policy and also adds support for L4 hashes.
The current values for fib_multipath_hash_policy are:
0 - layer 3 (default)
1 - layer 4
If there's an skb hash already set and it matches the chosen policy then it
will be used instead of being calculated (currently only for L4).
In L3 mode we always calculate the hash due to the ICMP error special
case, the flow dissector's field consistentification should handle the
address order thus we can remove the address reversals.
If the skb is provided we always use it for the hash calculation,
otherwise we fallback to fl4, that is if skb is NULL fl4 has to be set.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7ba91ecb 09-Jan-2017 Jesper Dangaard Brouer <brouer@redhat.com>

net: for rate-limited ICMP replies save one atomic operation

It is possible to avoid the atomic operation in icmp{v6,}_xmit_lock,
by checking the sysctl_icmp_msgs_per_sec ratelimit before these calls,
as pointed out by Eric Dumazet, but the BH disabled state must be correct.

The icmp_global_allow() call states it must be called with BH
disabled. This protection was given by the calls icmp_xmit_lock and
icmpv6_xmit_lock. Thus, split out local_bh_disable/enable from these
functions and maintain it explicitly at callers.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c0303efe 09-Jan-2017 Jesper Dangaard Brouer <brouer@redhat.com>

net: reduce cycles spend on ICMP replies that gets rate limited

This patch split the global and per (inet)peer ICMP-reply limiter
code, and moves the global limit check to earlier in the packet
processing path. Thus, avoid spending cycles on ICMP replies that
gets limited/suppressed anyhow.

The global ICMP rate limiter icmp_global_allow() is a good solution,
it just happens too late in the process. The kernel goes through the
full route lookup (return path) for the ICMP message, before taking
the rate limit decision of not sending the ICMP reply.

Details: The kernels global rate limiter for ICMP messages got added
in commit 4cdf507d5452 ("icmp: add a global rate limitation"). It is
a token bucket limiter with a global lock. It brilliantly avoids
locking congestion by only updating when 20ms (HZ/50) were elapsed. It
can then avoids taking lock when credit is exhausted (when under
pressure) and time constraint for refill is not yet meet.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8d9ba388 09-Jan-2017 Jesper Dangaard Brouer <brouer@redhat.com>

Revert "icmp: avoid allocating large struct on stack"

This reverts commit 9a99d4a50cb8 ("icmp: avoid allocating large struct
on stack"), because struct icmp_bxm no really a large struct, and
allocating and free of this small 112 bytes hurts performance.

Fixes: 9a99d4a50cb8 ("icmp: avoid allocating large struct on stack")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7c0f6ba6 24-Dec-2016 Linus Torvalds <torvalds@linux-foundation.org>

Replace <asm/uaccess.h> with <linux/uaccess.h> globally

This was entirely automated, using the script by Al:

PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f91c58d6 06-Dec-2016 Zhang Shengju <zhangshengju@cmss.chinamobile.com>

icmp: correct return value of icmp_rcv()

Currently, icmp_rcv() always return zero on a packet delivery upcall.

To make its behavior more compliant with the way this API should be
used, this patch changes this to let it return NET_RX_SUCCESS when the
packet is proper handled, and NET_RX_DROP otherwise.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9d1a6c4e 07-Nov-2016 David Ahern <dsa@cumulusnetworks.com>

net: icmp_route_lookup should use rt dev to determine L3 domain

icmp_send is called in response to some event. The skb may not have
the device set (skb->dev is NULL), but it is expected to have an rt.
Update icmp_route_lookup to use the rt on the skb to determine L3
domain.

Fixes: 613d09b30f8b ("net: Use VRF device index for lookups on TX")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e2d118a1 03-Nov-2016 Lorenzo Colitti <lorenzo@google.com>

net: inet: Support UID-based routing in IP protocols.

- Use the UID in routing lookups made by protocol connect() and
sendmsg() functions.
- Make sure that routing lookups triggered by incoming packets
(e.g., Path MTU discovery) take the UID of the socket into
account.
- For packets not associated with a userspace socket, (e.g., ping
replies) use UID 0 inside the user namespace corresponding to
the network namespace the socket belongs to. This allows
all namespaces to apply routing and iptables rules to
kernel-originated traffic in that namespaces by matching UID 0.
This is better than using the UID of the kernel socket that is
sending the traffic, because the UID of kernel sockets created
at namespace creation time (e.g., the per-processor ICMP and
TCP sockets) is the UID of the user that created the socket,
which might not be mapped in the namespace.

Tested: compiles allnoconfig, allyesconfig, allmodconfig
Tested: https://android-review.googlesource.com/253302
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 214d3f1f 27-Apr-2016 Eric Dumazet <edumazet@google.com>

net: icmp: rename ICMPMSGIN_INC_STATS_BH()

Remove misleading _BH suffix.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5d3848bc 27-Apr-2016 Eric Dumazet <edumazet@google.com>

net: rename ICMP_INC_STATS_BH()

Rename ICMP_INC_STATS_BH() to __ICMP_INC_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 822c8685 27-Feb-2016 Deepa Dinamani <deepa.kernel@gmail.com>

net: ipv4: Convert IP network timestamps to be y2038 safe

ICMP timestamp messages and IP source route options require
timestamps to be in milliseconds modulo 24 hours from
midnight UT format.

Add inet_current_timestamp() function to support this. The function
returns the required timestamp in network byte order.

Timestamp calculation is also changed to call ktime_get_real_ts64()
which uses struct timespec64. struct timespec64 is y2038 safe.
Previously it called getnstimeofday() which uses struct timespec.
struct timespec is not y2038 safe.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 02a6d613 14-Oct-2015 Paolo Abeni <pabeni@redhat.com>

Revert "ipv4/icmp: redirect messages can use the ingress daddr as source"

Revert the commit e2ca690b657f ("ipv4/icmp: redirect messages
can use the ingress daddr as source"), which tried to introduce a more
suitable behaviour for ICMP redirect messages generated by VRRP routers.
However RFC 5798 section 8.1.1 states:

The IPv4 source address of an ICMP redirect should be the address
that the end-host used when making its next-hop routing decision.

while said commit used the generating packet destination
address, which do not match the above and in most cases leads to
no redirect packets to be generated.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e2ca690b 09-Oct-2015 Paolo Abeni <pabeni@redhat.com>

ipv4/icmp: redirect messages can use the ingress daddr as source

This patch allows configuring how the source address of ICMP
redirect messages is selected; by default the old behaviour is
retained, while setting icmp_redirects_use_orig_daddr force the
usage of the destination address of the packet that caused the
redirect.

The new behaviour fits closely the RFC 5798 section 8.1.1, and fix the
following scenario:

Two machines are set up with VRRP to act as routers out of a subnet,
they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to
x.x.x.254/24.

If a host in said subnet needs to get an ICMP redirect from the VRRP
router, i.e. to reach a destination behind a different gateway, the
source IP in the ICMP redirect is chosen as the primary IP on the
interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2.

The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2,
and will continue to use the wrong next-op.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 79a13159 30-Sep-2015 Peter Nørlund <pch@ordbogen.com>

ipv4: ICMP packet inspection for multipath

ICMP packets are inspected to let them route together with the flow they
belong to, minimizing the chance that a problematic path will affect flows
on other paths, and so that anycast environments can work with ECMP.

Signed-off-by: Peter Nørlund <pch@ordbogen.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 385add90 29-Sep-2015 David Ahern <dsa@cumulusnetworks.com>

net: Replace vrf_master_ifindex{, _rcu} with l3mdev equivalents

Replace calls to vrf_master_ifindex_rcu and vrf_master_ifindex with either
l3mdev_master_ifindex_rcu or l3mdev_master_ifindex.

The pattern:
oif = vrf_master_ifindex(dev) ? : dev->ifindex;
is replaced with
oif = l3mdev_fib_oif(dev);

And remove the now unused vrf macros.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bdb06cbf 24-Sep-2015 David Ahern <dsa@cumulusnetworks.com>

net: Fix panic in icmp_route_lookup

Andrey reported a panic:

[ 7249.865507] BUG: unable to handle kernel pointer dereference at 000000b4
[ 7249.865559] IP: [<c16afeca>] icmp_route_lookup+0xaa/0x320
[ 7249.865598] *pdpt = 0000000030f7f001 *pde = 0000000000000000
[ 7249.865637] Oops: 0000 [#1]
...
[ 7249.866811] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
4.3.0-999-generic #201509220155
[ 7249.866876] Hardware name: MSI MS-7250/MS-7250, BIOS 080014 08/02/2006
[ 7249.866916] task: c1a5ab00 ti: c1a52000 task.ti: c1a52000
[ 7249.866949] EIP: 0060:[<c16afeca>] EFLAGS: 00210246 CPU: 0
[ 7249.866981] EIP is at icmp_route_lookup+0xaa/0x320
[ 7249.867012] EAX: 00000000 EBX: f483ba48 ECX: 00000000 EDX: f2e18a00
[ 7249.867045] ESI: 000000c0 EDI: f483ba70 EBP: f483b9ec ESP: f483b974
[ 7249.867077] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 7249.867108] CR0: 8005003b CR2: 000000b4 CR3: 36ee07c0 CR4: 000006f0
[ 7249.867141] Stack:
[ 7249.867165] 320310ee 00000000 00000042 320310ee 00000000 c1aeca00
f3920240 f0c69180
[ 7249.867268] f483ba04 f855058b a89b66cd f483ba44 f8962f4b 00000000
e659266c f483ba54
[ 7249.867361] 8004753c f483ba5c f8962f4b f2031140 000003c1 ffbd8fa0
c16b0e00 00000064
[ 7249.867448] Call Trace:
[ 7249.867494] [<f855058b>] ? e1000_xmit_frame+0x87b/0xdc0 [e1000e]
[ 7249.867534] [<f8962f4b>] ? tcp_in_window+0xeb/0xb10 [nf_conntrack]
[ 7249.867576] [<f8962f4b>] ? tcp_in_window+0xeb/0xb10 [nf_conntrack]
[ 7249.867615] [<c16b0e00>] ? icmp_send+0xa0/0x380
[ 7249.867648] [<c16b102f>] icmp_send+0x2cf/0x380
[ 7249.867681] [<f89c8126>] nf_send_unreach+0xa6/0xc0 [nf_reject_ipv4]
[ 7249.867714] [<f89cd0da>] reject_tg+0x7a/0x9f [ipt_REJECT]
[ 7249.867746] [<f88c29a7>] ipt_do_table+0x317/0x70c [ip_tables]
[ 7249.867780] [<f895e0a6>] ? __nf_conntrack_find_get+0x166/0x3b0
[nf_conntrack]
[ 7249.867838] [<f895eea8>] ? nf_conntrack_in+0x398/0x600 [nf_conntrack]
[ 7249.867889] [<f84c0035>] iptable_filter_hook+0x35/0x80 [iptable_filter]
[ 7249.867933] [<c16776a1>] nf_iterate+0x71/0x80
[ 7249.867970] [<c1677715>] nf_hook_slow+0x65/0xc0
[ 7249.868002] [<c1681811>] __ip_local_out_sk+0xc1/0xd0
[ 7249.868034] [<c1680f30>] ? ip_forward_options+0x1a0/0x1a0
[ 7249.868066] [<c1681836>] ip_local_out_sk+0x16/0x30
[ 7249.868097] [<c1684054>] ip_send_skb+0x14/0x80
[ 7249.868129] [<c16840f4>] ip_push_pending_frames+0x34/0x40
[ 7249.868163] [<c16844a2>] ip_send_unicast_reply+0x282/0x310
[ 7249.868196] [<c16a0863>] tcp_v4_send_reset+0x1b3/0x380
[ 7249.868227] [<c16a1b63>] tcp_v4_rcv+0x323/0x990
[ 7249.868257] [<c16776a1>] ? nf_iterate+0x71/0x80
[ 7249.868289] [<c167dc2b>] ip_local_deliver_finish+0x8b/0x230
[ 7249.868322] [<c167df4c>] ip_local_deliver+0x4c/0xa0
[ 7249.868353] [<c167dba0>] ? ip_rcv_finish+0x390/0x390
[ 7249.868384] [<c167d88c>] ip_rcv_finish+0x7c/0x390
[ 7249.868415] [<c167e280>] ip_rcv+0x2e0/0x420
...

Prior to the VRF change the oif was not set in the flow struct, so the
VRF support should really have only added the vrf_master_ifindex lookup.

Fixes: 613d09b30f8b ("net: Use VRF device index for lookups on TX")
Cc: Andrey Melnikov <temnota.am@gmail.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 192132b9 27-Aug-2015 David Ahern <dsa@cumulusnetworks.com>

net: Add support for VRFs to inetpeer cache

inetpeer caches based on address only, so duplicate IP addresses within
a namespace return the same cached entry. Enhance the ipv4 address key
to contain both the IPv4 address and VRF device index.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 18041e31 18-Aug-2015 Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

vrf: vrf_master_ifindex_rcu is not always called with rcu read lock

While running net-next I hit this:
[ 634.073119] ===============================
[ 634.073150] [ INFO: suspicious RCU usage. ]
[ 634.073182] 4.2.0-rc6+ #45 Not tainted
[ 634.073213] -------------------------------
[ 634.073244] include/net/vrf.h:38 suspicious rcu_dereference_check()
usage!
[ 634.073274]
other info that might help us debug this:

[ 634.073307]
rcu_scheduler_active = 1, debug_locks = 1
[ 634.073338] 2 locks held by swapper/0/0:
[ 634.073369] #0: (((&n->timer))){+.-...}, at: [<ffffffff8112bc35>]
call_timer_fn+0x5/0x480
[ 634.073412] #1: (slock-AF_INET){+.-...}, at: [<ffffffff8174f0f5>]
icmp_send+0x155/0x5f0
[ 634.073450]
stack backtrace:
[ 634.073483] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.2.0-rc6+ #45
[ 634.073514] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
VirtualBox 12/01/2006
[ 634.073545] 0000000000000000 0593ba8242d9ace4 ffff88002fc03b48
ffffffff81803f1b
[ 634.073612] 0000000000000000 ffffffff81e12500 ffff88002fc03b78
ffffffff811003c5
[ 634.073642] 0000000000000000 ffff88002ec4e600 ffffffff81f00f80
ffff88002fc03cf0
[ 634.073669] Call Trace:
[ 634.073694] <IRQ> [<ffffffff81803f1b>] dump_stack+0x4c/0x65
[ 634.073728] [<ffffffff811003c5>] lockdep_rcu_suspicious+0xc5/0x100
[ 634.073763] [<ffffffff8174eb56>] icmp_route_lookup+0x176/0x5c0
[ 634.073793] [<ffffffff8174f2fb>] ? icmp_send+0x35b/0x5f0
[ 634.073818] [<ffffffff8174f274>] ? icmp_send+0x2d4/0x5f0
[ 634.073844] [<ffffffff8174f3ce>] icmp_send+0x42e/0x5f0
[ 634.073873] [<ffffffff8170b662>] ipv4_link_failure+0x22/0xa0
[ 634.073899] [<ffffffff8174bdda>] arp_error_report+0x3a/0x80
[ 634.073926] [<ffffffff816d6100>] ? neigh_lookup+0x2c0/0x2c0
[ 634.073952] [<ffffffff816d396e>] neigh_invalidate+0x8e/0x110
[ 634.073984] [<ffffffff816d62ae>] neigh_timer_handler+0x1ae/0x290
[ 634.074013] [<ffffffff816d6100>] ? neigh_lookup+0x2c0/0x2c0
[ 634.074013] [<ffffffff8112bce3>] call_timer_fn+0xb3/0x480
[ 634.074013] [<ffffffff8112bc35>] ? call_timer_fn+0x5/0x480
[ 634.074013] [<ffffffff816d6100>] ? neigh_lookup+0x2c0/0x2c0
[ 634.074013] [<ffffffff8112c2bc>] run_timer_softirq+0x20c/0x430
[ 634.074013] [<ffffffff810af50e>] __do_softirq+0xde/0x630
[ 634.074013] [<ffffffff810afc97>] irq_exit+0x117/0x120
[ 634.074013] [<ffffffff81810976>] smp_apic_timer_interrupt+0x46/0x60
[ 634.074013] [<ffffffff8180e950>] apic_timer_interrupt+0x70/0x80
[ 634.074013] <EOI> [<ffffffff8106b9d6>] ? native_safe_halt+0x6/0x10
[ 634.074013] [<ffffffff81101d8d>] ? trace_hardirqs_on+0xd/0x10
[ 634.074013] [<ffffffff81027d43>] default_idle+0x23/0x200
[ 634.074013] [<ffffffff8102852f>] arch_cpu_idle+0xf/0x20
[ 634.074013] [<ffffffff810f89ba>] default_idle_call+0x2a/0x40
[ 634.074013] [<ffffffff810f8dcc>] cpu_startup_entry+0x39c/0x4c0
[ 634.074013] [<ffffffff817f9cad>] rest_init+0x13d/0x150
[ 634.074013] [<ffffffff81f69038>] start_kernel+0x4a8/0x4c9
[ 634.074013] [<ffffffff81f68120>] ?
early_idt_handler_array+0x120/0x120
[ 634.074013] [<ffffffff81f68339>] x86_64_start_reservations+0x2a/0x2c
[ 634.074013] [<ffffffff81f68485>] x86_64_start_kernel+0x14a/0x16d

It would seem vrf_master_ifindex_rcu() can be called without RCU held in
other contexts as well so introduce a new helper which acquires rcu and
returns the ifindex.
Also add curly braces around both the "if" and "else" parts as per the
style guide.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 30bbaa19 13-Aug-2015 David Ahern <dsa@cumulusnetworks.com>

net: Fix up inet_addr_type checks

Currently inet_addr_type and inet_dev_addr_type expect local addresses
to be in the local table. With the VRF device local routes for devices
associated with a VRF will be in the table associated with the VRF.
Provide an alternate inet_addr lookup to use a specific table rather
than defaulting to the local table.

inet_addr_type_dev_table keeps the same semantics as inet_addr_type but
if the passed in device is enslaved to a VRF then the table for that VRF
is used for the lookup.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 613d09b3 13-Aug-2015 David Ahern <dsa@cumulusnetworks.com>

net: Use VRF device index for lookups on TX

As with ingress use the index of VRF master device for route lookups on
egress. However, the oif should only be used to direct the lookups to a
specific table. Routes in the table are not based on the VRF device but
rather interfaces that are part of the VRF so do not consider the oif for
lookups within the table. The FLOWI_FLAG_VRFSRC is used to control this
latter part.

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 773a69d6 21-Jul-2015 Thomas Graf <tgraf@suug.ch>

icmp: Don't leak original dst into ip_route_input()

ip_route_input() unconditionally overwrites the dst. Hide the original
dst attached to the skb by calling skb_dst_set(skb, NULL) prior to
ip_route_input().

Reported-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 51456b29 03-Apr-2015 Ian Morris <ipm@chirality.org.uk>

ipv4: coding style: comparison for equality with NULL

The ipv4 code uses a mixture of coding styles. In some instances check
for NULL pointer is done as x == NULL and sometimes as !x. !x is
preferred according to checkpatch and this patch makes the code
consistent by adopting the latter form.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 349c9e3c 29-Jan-2015 Eric Dumazet <edumazet@google.com>

ipv4: icmp: use percpu allocation

Get rid of nr_cpu_ids and use modern percpu allocation.

Note that the sockets themselves are not yet allocated
using NUMA affinity.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e3e32170 17-Nov-2014 Rick Jones <rick.jones2@hp.com>

icmp: Remove some spurious dropped packet profile hits from the ICMP path

If icmp_rcv() has successfully processed the incoming ICMP datagram, we
should use consume_skb() rather than kfree_skb() because a hit on the likes
of perf -e skb:kfree_skb is not called-for.

Signed-off-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ba7a46f1 11-Nov-2014 Joe Perches <joe@perches.com>

net: Convert LIMIT_NETDEBUG to net_dbg_ratelimited

Use the more common dynamic_debug capable net_dbg_ratelimited
and remove the LIMIT_NETDEBUG macro.

All messages are still ratelimited.

Some KERN_<LEVEL> uses are changed to KERN_DEBUG.

This may have some negative impact on messages that were
emitted at KERN_INFO that are not not enabled at all unless
DEBUG is defined or dynamic_debug is enabled. Even so,
these messages are now _not_ emitted by default.

This also eliminates the use of the net_msg_warn sysctl
"/proc/sys/net/core/warnings". For backward compatibility,
the sysctl is not removed, but it has no function. The extern
declaration of net_msg_warn is removed from sock.h and made
static in net/core/sysctl_net_core.c

Miscellanea:

o Update the sysctl documentation
o Remove the embedded uses of pr_fmt
o Coalesce format fragments
o Realign arguments

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4cdf507d 19-Sep-2014 Eric Dumazet <edumazet@google.com>

icmp: add a global rate limitation

Current ICMP rate limiting uses inetpeer cache, which is an RBL tree
protected by a lock, meaning that hosts can be stuck hard if all cpus
want to check ICMP limits.

When say a DNS or NTP server process is restarted, inetpeer tree grows
quick and machine comes to its knees.

iptables can not help because the bottleneck happens before ICMP
messages are even cooked and sent.

This patch adds a new global limitation, using a token bucket filter,
controlled by two new sysctl :

icmp_msgs_per_sec - INTEGER
Limit maximal number of ICMP packets sent per second from this host.
Only messages whose type matches icmp_ratemask are
controlled by this limit.
Default: 1000

icmp_msgs_burst - INTEGER
icmp_msgs_per_sec controls number of ICMP packets sent per second,
while icmp_msgs_burst controls the burst size of these packets.
Default: 50

Note that if we really want to send millions of ICMP messages per
second, we might extend idea and infra added in commit 04ca6973f7c1a
("ip: make IP identifiers less predictable") :
add a token bucket in the ip_idents hash and no longer rely on inetpeer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 188b1210 01-Aug-2014 Duan Jiong <duanj.fnst@cn.fujitsu.com>

ipv4: remove nested rcu_read_lock/unlock

ip_local_deliver_finish() already have a rcu_read_lock/unlock, so
the rcu_read_lock/unlock is unnecessary.

See the stack below:
ip_local_deliver_finish
|
|
->icmp_rcv
|
|
->icmp_socket_deliver

Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7304fe46 31-Jul-2014 Duan Jiong <duanj.fnst@cn.fujitsu.com>

net: fix the counter ICMP_MIB_INERRORS/ICMP6_MIB_INERRORS

When dealing with ICMPv[46] Error Message, function icmp_socket_deliver()
and icmpv6_notify() do some valid checks on packet's length, but then some
protocols check packet's length redaudantly. So remove those duplicated
statements, and increase counter ICMP_MIB_INERRORS/ICMP6_MIB_INERRORS in
function icmp_socket_deliver() and icmpv6_notify() respectively.

In addition, add missed counter in udp6/udplite6 when socket is NULL.

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 68b7107b 30-Jun-2014 Edward Allcutt <edward.allcutt@openmarket.com>

ipv4: icmp: Fix pMTU handling for rare case

Some older router implementations still send Fragmentation Needed
errors with the Next-Hop MTU field set to zero. This is explicitly
described as an eventuality that hosts must deal with by the
standard (RFC 1191) since older standards specified that those
bits must be zero.

Linux had a generic (for all of IPv4) implementation of the algorithm
described in the RFC for searching a list of MTU plateaus for a good
value. Commit 46517008e116 ("ipv4: Kill ip_rt_frag_needed().")
removed this as part of the changes to remove the routing cache.
Subsequently any Fragmentation Needed packet with a zero Next-Hop
MTU has been discarded without being passed to the per-protocol
handlers or notifying userspace for raw sockets.

When there is a router which does not implement RFC 1191 on an
MTU limited path then this results in stalled connections since
large packets are discarded and the local protocols are not
notified so they never attempt to lower the pMTU.

One example I have seen is an OpenBSD router terminating IPSec
tunnels. It's worth pointing out that this case is distinct from
the BSD 4.2 bug which incorrectly calculated the Next-Hop MTU
since the commit in question dismissed that as a valid concern.

All of the per-protocols handlers implement the simple approach from
RFC 1191 of immediately falling back to the minimum value. Although
this is sub-optimal it is vastly preferable to connections hanging
indefinitely.

Remove the Next-Hop MTU != 0 check and allow such packets
to follow the normal path.

Fixes: 46517008e116 ("ipv4: Kill ip_rt_frag_needed().")
Signed-off-by: Edward Allcutt <edward.allcutt@openmarket.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e110861f 13-May-2014 Lorenzo Colitti <lorenzo@google.com>

net: add a sysctl to reflect the fwmark on replies

Kernel-originated IP packets that have no user socket associated
with them (e.g., ICMP errors and echo replies, TCP RSTs, etc.)
are emitted with a mark of zero. Add a sysctl to make them have
the same mark as the packet they are replying to.

This allows an administrator that wishes to do so to use
mark-based routing, firewalling, etc. for these replies by
marking the original packets inbound.

Tested using user-mode linux:
- ICMP/ICMPv6 echo replies and errors.
- TCP RST packets (IPv4 and IPv6).

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 29a96e1f 07-May-2014 Tom Herbert <therbert@google.com>

icmp: Call skb_checksum_simple_validate

Use skb_checksum_simple_validate to verify checksum.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8ed1dc44 09-Jan-2014 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv4: introduce hardened ip_no_pmtu_disc mode

This new ip_no_pmtu_disc mode only allowes fragmentation-needed errors
to be honored by protocols which do more stringent validation on the
ICMP's packet payload. This knob is useful for people who e.g. want to
run an unmodified DNS server in a namespace where they need to use pmtu
for TCP connections (as they are used for zone transfers or fallback
for requests) but don't want to use possibly spoofed UDP pmtu information.

Currently the whitelisted protocols are TCP, SCTP and DCCP as they check
if the returned packet is in the window or if the association is valid.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: John Heffner <johnwheffner@gmail.com>
Suggested-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cd174e67 13-Dec-2013 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv4: new ip_no_pmtu_disc mode to always discard incoming frag needed msgs

This new mode discards all incoming fragmentation-needed notifications
as I guess was originally intended with this knob. To not break backward
compatibility too much, I only added a special case for mode 2 in the
receiving path.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 974eda11 13-Dec-2013 Hannes Frederic Sowa <hannes@stressinduktion.org>

inet: make no_pmtu_disc per namespace and kill ipv4_config

The other field in ipv4_config, log_martians, was converted to a
per-interface setting, so we can just remove the whole structure.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# aa661581 24-Sep-2013 Francesco Fusco <ffusco@redhat.com>

ipv4: processing ancillary IP_TOS or IP_TTL

If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out
packets with the specified TTL or TOS overriding the socket values specified
with the traditional setsockopt().

The struct inet_cork stores the values of TOS, TTL and priority that are
passed through the struct ipcm_cookie. If there are user-specified TOS
(tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are
used to override the per-socket values. In case of TOS also the priority
is changed accordingly.

Two helper functions get_rttos and get_rtconn_flags are defined to take
into account the presence of a user specified TOS value when computing
RT_TOS and RT_CONN_FLAGS.

Signed-off-by: Francesco Fusco <ffusco@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9a99d4a5 02-Jun-2013 Cong Wang <amwang@redhat.com>

icmp: avoid allocating large struct on stack

struct icmp_bxm is a large struct, reduce stack usage
by allocating it on heap.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 08578d8d 02-Jun-2013 Rami Rosen <ramirose@gmail.com>

] icmp: fix icmp_unreach() comment.

ICMP_PARAMETERPROB is handled by icmp_unreach(); This patch adds
ICMP_PARAMETERPROB to the list of ICMP message types handled by icmp_unreach().

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f7c0c2ae 28-May-2013 Simon Horman <horms@verge.net.au>

ipv4: Correct comparisons and calculations using skb->tail and skb-transport_header

This corrects an regression introduced by "net: Use 16bits for *_headers
fields of struct skbuff" when NET_SKBUFF_DATA_USES_OFFSET is not set. In
that case skb->tail will be a pointer whereas skb->transport_header
will be an offset from head. This is corrected by using wrappers that
ensure that comparisons and calculations are always made using pointers.

Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6d0bfe22 22-May-2013 Lorenzo Colitti <lorenzo@google.com>

net: ipv6: Add IPv6 support to the ping socket.

This adds the ability to send ICMPv6 echo requests without a
raw socket. The equivalent ability for ICMPv4 was added in
2011.

Instead of having separate code paths for IPv4 and IPv6, make
most of the code in net/ipv4/ping.c dual-stack and only add a
few IPv6-specific bits (like the protocol definition) to a new
net/ipv6/ping.c. Hopefully this will reduce divergence and/or
duplication of bugs in the future.

Caveats:

- Setting options via ancillary data (e.g., using IPV6_PKTINFO
to specify the outgoing interface) is not yet supported.
- There are no separate security settings for IPv4 and IPv6;
everything is controlled by /proc/net/ipv4/ping_group_range.
- The proc interface does not yet display IPv6 ping sockets
properly.

Tested with a patched copy of ping6 and using raw socket calls.
Compiles and works with all of CONFIG_IPV6={n,m,y}.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6a5dc9e5 29-Apr-2013 Eric Dumazet <edumazet@google.com>

net: Add MIB counters for checksum errors

Add MIB counters for checksum errors in IP layer,
and TCP/UDP/ICMP layers, to help diagnose problems.

$ nstat -a | grep Csum
IcmpInCsumErrors 72 0.0
TcpInCsumErrors 382 0.0
UdpInCsumErrors 463221 0.0
Icmp6InCsumErrors 75 0.0
Udp6InCsumErrors 173442 0.0
IpExtInCsumErrors 10884 0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5b052042 21-Feb-2013 Li Wei <lw@cn.fujitsu.com>

ipv4: fix error handling in icmp_protocol.

Now we handle icmp errors in each transport protocol's err_handler,
for icmp protocols, that is ping_err. Since this handler only care
of those icmp errors triggered by echo request, errors triggered
by echo reply(which sent by kernel) are sliently ignored.

So wrap ping_err() with icmp_err() to deal with those icmp errors.

Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e1a67642 24-Nov-2012 Neal Cardwell <ncardwell@google.com>

ipv4: avoid passing NULL to inet_putpeer() in icmpv4_xrlim_allow()

inet_getpeer_v4() can return NULL under OOM conditions, and while
inet_peer_xrlim_allow() is OK with a NULL peer, inet_putpeer() will
crash.

This code path now uses the same idiom as the others from:
1d861aa4b3fb08822055345f480850205ffe6170 ("inet: Minimize use of
cached route inetpeer.").

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 92101b3b 23-Jul-2012 David S. Miller <davem@davemloft.net>

ipv4: Prepare for change of rt->rt_iif encoding.

Use inet_iif() consistently, and for TCP record the input interface of
cached RX dst in inet sock.

rt->rt_iif is going to be encoded differently, so that we can
legitimately cache input routes in the FIB info more aggressively.

When the input interface is "use SKB device index" the rt->rt_iif will
be set to zero.

This forces us to move the TCP RX dst cache installation into the ipv4
specific code, and as well it should since doing the route caching for
ipv6 is pointless at the moment since it is not inspected in the ipv6
input paths yet.

Also, remove the unlikely on dst->obsolete, all ipv4 dsts have
obsolete set to a non-zero value to force invocation of the check
callback.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 838942a5 23-Jul-2012 David S. Miller <davem@davemloft.net>

ipv4: Really ignore ICMP address requests/replies.

Alexey removed kernel side support for requests, and the
only thing we do for replies is log a message if something
doesn't look right.

As Alexey's comment indicates, this belongs in userspace (if
anywhere), and thus we can safely just get rid of this code.

Signed-off-by: David S. Miller <davem@davemloft.net>


# f0a70e90 12-Jul-2012 David S. Miller <davem@davemloft.net>

ipv4: Put proper checks into icmp_socket_deliver().

All handler->err() routines expect that we've done a pskb_may_pull()
test to make sure that IP header length + 8 bytes can be safely
pulled.

Reported-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1f42539d 11-Jul-2012 David S. Miller <davem@davemloft.net>

ipv4: Kill ip_rt_redirect().

No longer needed, as the protocol handlers now all properly
propagate the redirect back into the routing code.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 94206125 11-Jul-2012 David S. Miller <davem@davemloft.net>

ipv4: Rearrange arguments to ip_rt_redirect()

Pass in the SKB rather than just the IP addresses, so that policy
and other aspects can reside in ip_rt_redirect() rather then
icmp_redirect().

Signed-off-by: David S. Miller <davem@davemloft.net>


# d3351b75 11-Jul-2012 David S. Miller <davem@davemloft.net>

ipv4: Deliver ICMP redirects to sockets too.

And thus, we can remove the ping_err() hack.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 1de9243b 11-Jul-2012 David S. Miller <davem@davemloft.net>

ipv4: Pull icmp socket delivery out into a helper function.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 1d861aa4 10-Jul-2012 David S. Miller <davem@davemloft.net>

inet: Minimize use of cached route inetpeer.

Only use it in the absolutely required cases:

1) COW'ing metrics

2) ipv4 PMTU

3) ipv4 redirects

Signed-off-by: David S. Miller <davem@davemloft.net>


# 35ebf65e 28-Jun-2012 David S. Miller <davem@davemloft.net>

ipv4: Create and use fib_compute_spec_dst() helper.

The specific destination is the host we direct unicast replies to.
Usually this is the original packet source address, but if we are
responding to a multicast or broadcast packet we have to use something
different.

Specifically we must use the source address we would use if we were to
send a packet to the unicast source of the original packet.

The routing cache precomputes this value, but we want to remove that
precomputation because it creates a hard dependency on the expensive
rpfilter source address validation which we'd like to make cheaper.

There are only three places where this matters:

1) ICMP replies.

2) pktinfo CMSG

3) IP options

Now there will be no real users of rt->rt_spec_dst and we can simply
remove it altogether.

Signed-off-by: David S. Miller <davem@davemloft.net>


# f9242b6b 19-Jun-2012 David S. Miller <davem@davemloft.net>

inet: Sanitize inet{,6} protocol demux.

Don't pretend that inet_protos[] and inet6_protos[] are hashes, thay
are just a straight arrays. Remove all unnecessary hash masking.

Document MAX_INET_PROTOS.

Use RAW_HTABLE_SIZE when appropriate.

Reported-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 46517008 10-Jun-2012 David S. Miller <davem@davemloft.net>

ipv4: Kill ip_rt_frag_needed().

There is zero point to this function.

It's only real substance is to perform an extremely outdated BSD4.2
ICMP check, which we can safely remove. If you really have a MTU
limited link being routed by a BSD4.2 derived system, here's a nickel
go buy yourself a real router.

The other actions of ip_rt_frag_needed(), checking and conditionally
updating the peer, are done by the per-protocol handlers of the ICMP
event.

TCP, UDP, et al. have a handler which will receive this event and
transmit it back into the associated route via dst_ops->update_pmtu().

This simplification is important, because it eliminates the one place
where we do not have a proper route context in which to make an
inetpeer lookup.

Signed-off-by: David S. Miller <davem@davemloft.net>


# fbfe95a4 09-Jun-2012 David S. Miller <davem@davemloft.net>

inet: Create and use rt{,6}_get_peer_create().

There's a lot of places that open-code rt{,6}_get_peer() only because
they want to set 'create' to one. So add an rt{,6}_get_peer_create()
for their sake.

There were also a few spots open-coding plain rt{,6}_get_peer() and
those are transformed here as well.

Signed-off-by: David S. Miller <davem@davemloft.net>


# e87cc472 13-May-2012 Joe Perches <joe@perches.com>

net: Convert net_ratelimit uses to net_<level>_ratelimited

Standardize the net core ratelimited logging functions.

Coalesce formats, align arguments.
Change a printk then vprintk sequence to use printf extension %pV.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9ffc93f2 28-Mar-2012 David Howells <dhowells@redhat.com>

Remove all #inclusions of asm/system.h

Remove all #inclusions of asm/system.h preparatory to splitting and killing
it. Performed with the following command:

perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`

Signed-off-by: David Howells <dhowells@redhat.com>


# afd46503 12-Mar-2012 Joe Perches <joe@perches.com>

net: ipv4: Standardize prefixes for message logging

Add #define pr_fmt(fmt) as appropriate.

Add "IPv4: ", "TCP: ", and "IPsec: " to appropriate files.
Standardize on "UDPLite: " for appropriate uses.
Some prefixes were previously "UDPLITE: " and "UDP-Lite: ".

Add KBUILD_MODNAME ": " to icmp and gre.
Remove embedded prefixes as appropriate.

Add missing "\n" to pr_info in gre.c.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 058bd4d2 11-Mar-2012 Joe Perches <joe@perches.com>

net: Convert printks to pr_<level>

Use a more current kernel messaging style.

Convert a printk block to print_hex_dump.
Coalesce formats, align arguments.
Use %s, __func__ instead of embedding function names.

Some messages that were prefixed with <foo>_close are
now prefixed with <foo>_fini. Some ah4 and esp messages
are now not prefixed with "ip ".

The intent of this patch is to later add something like
#define pr_fmt(fmt) "IPv4: " fmt.
to standardize the output messages.

Text size is trivially reduced. (x86-32 allyesconfig)

$ size net/ipv4/built-in.o*
text data bss dec hex filename
887888 31558 249696 1169142 11d6f6 net/ipv4/built-in.o.new
887934 31558 249800 1169292 11d78c net/ipv4/built-in.o.old

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 87fb4b7b 13-Oct-2011 Eric Dumazet <eric.dumazet@gmail.com>

net: more accurate skb truesize

skb truesize currently accounts for sk_buff struct and part of skb head.
kmalloc() roundings are also ignored.

Considering that skb_shared_info is larger than sk_buff, its time to
take it into account for better memory accounting.

This patch introduces SKB_TRUESIZE(X) macro to centralize various
assumptions into a single place.

At skb alloc phase, we put skb_shared_info struct at the exact end of
skb head, to allow a better use of memory (lowering number of
reallocations), since kmalloc() gives us power-of-two memory blocks.

Unless SLUB/SLUB debug is active, both skb->head and skb_shared_info are
aligned to cache lines, as before.

Note: This patch might trigger performance regressions because of
misconfigured protocol stacks, hitting per socket or global memory
limits that were previously not reached. But its a necessary step for a
more accurate memory accounting.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Andi Kleen <ak@linux.intel.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 415b3334 22-Jul-2011 David S. Miller <davem@davemloft.net>

icmp: Fix regression in nexthop resolution during replies.

icmp_route_lookup() uses the wrong flow parameters if the reverse
session route lookup isn't used.

So do not commit to the re-decoded flow until we actually make a
final decision to use a real route saved in 'rt2'.

Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a48eff12 18-May-2011 David S. Miller <davem@davemloft.net>

ipv4: Pass explicit destination address to rt_bind_peer().

Signed-off-by: David S. Miller <davem@davemloft.net>


# c319b4d7 13-May-2011 Vasiliy Kulikov <segoon@openwall.com>

net: ipv4: add IPPROTO_ICMP socket kind

This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).

Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/

A new ping socket is created with

socket(PF_INET, SOCK_DGRAM, PROT_ICMP)

Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.

Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.

ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.

ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).

socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.

The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).

Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping

For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/

Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.

All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.

PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.

PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().

PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.

RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9f6abb5f 09-May-2011 David S. Miller <davem@davemloft.net>

ipv4: icmp: Eliminate remaining uses of rt->rt_src

On input packets, rt->rt_src always equals ip_hdr(skb)->saddr

Anything that mangles or otherwise changes the IP header must
relookup the route found at skb_rtable(). Therefore this
invariant must always hold true.

Signed-off-by: David S. Miller <davem@davemloft.net>


# f5fca608 08-May-2011 David S. Miller <davem@davemloft.net>

ipv4: Pass flow key down into ip_append_*().

This way rt->rt_dst accesses are unnecessary.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 77968b78 08-May-2011 David S. Miller <davem@davemloft.net>

ipv4: Pass flow keys down into datagram packet building engine.

This way ip_output.c no longer needs rt->rt_{src,dst}.

We already have these keys sitting, ready and waiting, on the stack or
in a socket structure.

Signed-off-by: David S. Miller <davem@davemloft.net>


# f6d8bd05 21-Apr-2011 Eric Dumazet <eric.dumazet@gmail.com>

inet: add RCU protection to inet->opt

We lack proper synchronization to manipulate inet->opt ip_options

Problem is ip_make_skb() calls ip_setup_cork() and
ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
without any protection against another thread manipulating inet->opt.

Another thread can change inet->opt pointer and free old one under us.

Use RCU to protect inet->opt (changed to inet->inet_opt).

Instead of handling atomic refcounts, just copy ip_options when
necessary, to avoid cache line dirtying.

We cant insert an rcu_head in struct ip_options since its included in
skb->cb[], so this patch is large because I had to introduce a new
ip_options_rcu structure.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b71d1d42 21-Apr-2011 Eric Dumazet <eric.dumazet@gmail.com>

inet: constify ip headers and in6_addr

Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers
where possible, to make code intention more obvious.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 25985edc 30-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi>

Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>


# 9cce96df 12-Mar-2011 David S. Miller <davem@davemloft.net>

net: Put fl4_* macros to struct flowi4 and use them again.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 9d6ec938 11-Mar-2011 David S. Miller <davem@davemloft.net>

ipv4: Use flowi4 in public route lookup interfaces.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 6281dcc9 11-Mar-2011 David S. Miller <davem@davemloft.net>

net: Make flowi ports AF dependent.

Create two sets of port member accessors, one set prefixed by fl4_*
and the other prefixed by fl6_*

This will let us to create AF optimal flow instances.

It will work because every context in which we access the ports,
we have to be fully aware of which AF the flowi is anyways.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 1d28f42c 11-Mar-2011 David S. Miller <davem@davemloft.net>

net: Put flowi_* prefix on AF independent members of struct flowi

I intend to turn struct flowi into a union of AF specific flowi
structs. There will be a common structure that each variant includes
first, much like struct sock_common.

This is the first step to move in that direction.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 5e2b61f7 04-Mar-2011 David S. Miller <davem@davemloft.net>

ipv4: Remove flowi from struct rtable.

The only necessary parts are the src/dst addresses, the
interface indexes, the TOS, and the mark.

The rest is unnecessary bloat, which amounts to nearly
50 bytes on 64-bit.

Signed-off-by: David S. Miller <davem@davemloft.net>


# b23dd4fe 02-Mar-2011 David S. Miller <davem@davemloft.net>

ipv4: Make output route lookup return rtable directly.

Instead of on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 452edd59 02-Mar-2011 David S. Miller <davem@davemloft.net>

xfrm: Return dst directly from xfrm_lookup()

Instead of on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>


# f6d460cf 01-Mar-2011 David S. Miller <davem@davemloft.net>

ipv4: Make icmp route lookup code a bit clearer.

The route lookup code in icmp_send() is slightly tricky as a result of
having to handle all of the requirements of RFC 4301 host relookups.

Pull the route resolution into a seperate function, so that the error
handling and route reference counting is hopefully easier to see and
contained wholly within this new routine.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 92d86829 04-Feb-2011 David S. Miller <davem@davemloft.net>

inetpeer: Move ICMP rate limiting state into inet_peer entries.

Like metrics, the ICMP rate limiting bits are cached state about
a destination. So move it into the inet_peer entries.

If an inet_peer cannot be bound (the reason is memory allocation
failure or similar), the policy is to allow.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 5811662b 12-Nov-2010 Changli Gao <xiaosuo@gmail.com>

net: use the macros defined for the members of flowi

Use the macros defined for the members of flowi to clean the code up.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7d98ffd8 04-Nov-2010 Ulrich Weber <uweber@astaro.com>

xfrm: update flowi saddr in icmp_send if unset

otherwise xfrm_lookup will fail to find correct policy

Signed-off-by: Ulrich Weber <uweber@astaro.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c7537967 11-Nov-2010 David S. Miller <davem@davemloft.net>

ipv4: Make rt->fl.iif tests lest obscure.

When we test rt->fl.iif against zero, we're seeing if it's
an output or an input route.

Make that explicit with some helper functions.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 2244d07b 17-Aug-2010 Oliver Hartkopp <socketcan@hartkopp.net>

net: simplify flags for tx timestamping

This patch removes the abstraction introduced by the union skb_shared_tx in
the shared skb data.

The access of the different union elements at several places led to some
confusion about accessing the shared tx_flags e.g. in skb_orphan_try().

http://marc.info/?l=linux-netdev&m=128084897415886&w=2

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4bc2f18b 09-Jul-2010 Eric Dumazet <eric.dumazet@gmail.com>

net/ipv4: EXPORT_SYMBOL cleanups

CodingStyle cleanups

EXPORT_SYMBOL should immediately follow the symbol declaration.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d8d1f30b 11-Jun-2010 Changli Gao <xiaosuo@gmail.com>

net-next: remove useless union keyword

remove useless union keyword in rtable, rt6_info and dn_route.

Since there is only one member in a union, the union keyword isn't useful.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cfa087f6 07-Jun-2010 Eric Dumazet <eric.dumazet@gmail.com>

icmp: RCU conversion in icmp_address_reply()

- rcu_read_lock() already held by caller
- use __in_dev_get_rcu() instead of in_dev_get() / in_dev_put()
- remove goto out;

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7fee226a 11-May-2010 Eric Dumazet <eric.dumazet@gmail.com>

net: add a noref bit on skb dst

Use low order bit of skb->_skb_dst to tell dst is not refcounted.

Change _skb_dst to _skb_refdst to make sure all uses are catched.

skb_dst() returns the dst, regardless of noref bit set or not, but
with a lockdep check to make sure a noref dst is not given if current
user is not rcu protected.

New skb_dst_set_noref() helper to set an notrefcounted dst on a skb.
(with lockdep check)

skb_dst_drop() drops a reference only if skb dst was refcounted.

skb_dst_force() helper is used to force a refcount on dst, when skb
is queued and not anymore RCU protected.

Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if
!IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in
sock_queue_rcv_skb(), in __nf_queue().

Use skb_dst_force() in dev_requeue_skb().

Note: dst_use_noref() still dirties dst, we might transform it
later to do one dirtying per jiffies.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1f8438a8 03-Apr-2010 Eric Dumazet <eric.dumazet@gmail.com>

icmp: Account for ICMP out errors

When ip_append() fails because of socket limit or memory shortage,
increment ICMP_MIB_OUTERRORS counter, so that "netstat -s" can report
these errors.

LANG=C netstat -s | grep "ICMP messages failed"
0 ICMP messages failed

For IPV6, implement ICMP6_MIB_OUTERRORS counter as well.

# grep Icmp6OutErrors /proc/net/dev_snmp6/*
/proc/net/dev_snmp6/eth0:Icmp6OutErrors 0
/proc/net/dev_snmp6/lo:Icmp6OutErrors 0

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# e754834e 22-Jan-2010 Alexey Dobriyan <adobriyan@gmail.com>

icmp: move icmp_err_convert[] to .rodata

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 685c7944 01-Nov-2009 Eric Dumazet <eric.dumazet@gmail.com>

icmp: icmp_send() can avoid a dev_put()

We can avoid touching device refcount in icmp_send(),
using dev_get_by_index_rcu()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b3a5b6cc 23-Sep-2009 Eric Dumazet <eric.dumazet@gmail.com>

icmp: No need to call sk_write_space()

We can make icmp messages tx completion callback a litle bit faster.

Setting SOCK_USE_WRITE_QUEUE sk flag tells sock_wfree() to
not call sk_write_space() on a socket we know no thread is posssibly
waiting for write space. (on per cpu kernel internal icmp sockets only)

This avoids the sock_def_write_space() call and
read_lock(&sk->sk_callback_lock)/read_unlock(&sk->sk_callback_lock) calls
as well.

We avoid three atomic ops.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 32613090 13-Sep-2009 Alexey Dobriyan <adobriyan@gmail.com>

net: constify struct net_protocol

Remove long removed "inet_protocol_base" declaration.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# adf30907 01-Jun-2009 Eric Dumazet <eric.dumazet@gmail.com>

net: skb->dst accessors

Define three accessors to get/set dst attached to a skb

struct dst_entry *skb_dst(const struct sk_buff *skb)

void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

void skb_dst_drop(struct sk_buff *skb)
This one should replace occurrences of :
dst_release(skb->dst)
skb->dst = NULL;

Delete skb->dst field

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 511c3f92 01-Jun-2009 Eric Dumazet <eric.dumazet@gmail.com>

net: skb->rtable accessor

Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb

Delete skb->rtable field

Setting rtable is not allowed, just set dst instead as rtable is an alias.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6eb07772 22-Feb-2009 Eric W. Biederman <ebiederm@aristanetworks.com>

netns: Fix icmp shutdown.

Recently I had a kernel panic in icmp_send during a network namespace
cleanup. There were packets in the arp queue that failed to be sent
and we attempted to generate an ICMP host unreachable message, but
failed because icmp_sk_exit had already been called.

The network devices are removed from a network namespace and their
arp queues are flushed before we do attempt to shutdown subsystems
so this error should have been impossible.

It turns out icmp_init is using register_pernet_device instead
of register_pernet_subsys. Which resulted in icmp being shut down
while we still had the possibility of packets in flight, making
a nasty NULL pointer deference in interrupt context possible.

Changing this to register_pernet_subsys fixes the problem in
my testing.

Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Acked-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 959d2726 22-Feb-2009 Eric W. Biederman <ebiederm@aristanetworks.com>

netns: Fix icmp shutdown.

Recently I had a kernel panic in icmp_send during a network namespace
cleanup. There were packets in the arp queue that failed to be sent
and we attempted to generate an ICMP host unreachable message, but
failed because icmp_sk_exit had already been called.

The network devices are removed from a network namespace and their
arp queues are flushed before we do attempt to shutdown subsystems
so this error should have been impossible.

It turns out icmp_init is using register_pernet_device instead
of register_pernet_subsys. Which resulted in icmp being shut down
while we still had the possibility of packets in flight, making
a nasty NULL pointer deference in interrupt context possible.

Changing this to register_pernet_subsys fixes the problem in
my testing.

Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Acked-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 51f31cab 11-Feb-2009 Patrick Ohly <patrick.ohly@intel.com>

ip: support for TX timestamps on UDP and RAW sockets

Instructions for time stamping outgoing packets are take from the
socket layer and later copied into the new skb.

Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 52479b62 25-Nov-2008 Alexey Dobriyan <adobriyan@gmail.com>

netns xfrm: lookup in netns

Pass netns to xfrm_lookup()/__xfrm_lookup(). For that pass netns
to flow_cache_lookup() and resolver callback.

Take it from socket or netdevice. Stub DECnet to init_net.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2e77d89b 24-Nov-2008 Eric Dumazet <dada1@cosmosbay.com>

net: avoid a pair of dst_hold()/dst_release() in ip_append_data()

We can reduce pressure on dst entry refcount that slowdown UDP transmit
path on SMP machines. This pressure is visible on RTP servers when
delivering content to mediagateways, especially big ones, handling
thousand of streams. Several cpus send UDP frames to the same
destination, hence use the same dst entry.

This patch makes ip_append_data() eventually steal the refcount its
callers had to take on the dst entry.

This doesnt avoid all refcounting, but still gives speedups on SMP,
on UDP/RAW transmit path

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 673d57e7 31-Oct-2008 Harvey Harrison <harvey.harrison@gmail.com>

net: replace NIPQUAD() in net/ipv4/ net/ipv6/

Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u
can be replaced with %pI4

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# def8b4fa 28-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com>

net: reduce structures when XFRM=n

ifdef out
* struct sk_buff::sp (pointer)
* struct dst_entry::xfrm (pointer)
* struct sock::sk_policy (2 pointers)

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 113aa838 13-Oct-2008 Alan Cox <alan@redhat.com>

net: Rationalise email address: Network Specific Parts

Clean up the various different email addresses of mine listed in the code
to a single current and valid address. As Dave says his network merges
for 2.6.28 are now done this seems a good point to send them in where
they won't risk disrupting real changes.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fdc0bde9 23-Aug-2008 Denis V. Lunev <den@openvz.org>

icmp: icmp_sk() should not use smp_processor_id() in preemptible code

Pass namespace into icmp_xmit_lock, obtain socket inside and return
it as a result for caller.

Thanks Alexey Dobryan for this report:

Steps to reproduce:

CONFIG_PREEMPT=y
CONFIG_DEBUG_PREEMPT=y
tracepath <something>

BUG: using smp_processor_id() in preemptible [00000000] code: tracepath/3205
caller is icmp_sk+0x15/0x30
Pid: 3205, comm: tracepath Not tainted 2.6.27-rc4 #1

Call Trace:
[<ffffffff8031af14>] debug_smp_processor_id+0xe4/0xf0
[<ffffffff80409405>] icmp_sk+0x15/0x30
[<ffffffff8040a17b>] icmp_send+0x4b/0x3f0
[<ffffffff8025a415>] ? trace_hardirqs_on_caller+0xd5/0x160
[<ffffffff8025a4ad>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff8023a475>] ? local_bh_enable_ip+0x95/0x110
[<ffffffff804285b9>] ? _spin_unlock_bh+0x39/0x40
[<ffffffff8025a26c>] ? mark_held_locks+0x4c/0x90
[<ffffffff8025a4ad>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff8025a415>] ? trace_hardirqs_on_caller+0xd5/0x160
[<ffffffff803e91b4>] ip_fragment+0x8d4/0x900
[<ffffffff803e7030>] ? ip_finish_output2+0x0/0x290
[<ffffffff803e91e0>] ? ip_finish_output+0x0/0x60
[<ffffffff803e6650>] ? dst_output+0x0/0x10
[<ffffffff803e922c>] ip_finish_output+0x4c/0x60
[<ffffffff803e92e3>] ip_output+0xa3/0xf0
[<ffffffff803e68d0>] ip_local_out+0x20/0x30
[<ffffffff803e753f>] ip_push_pending_frames+0x27f/0x400
[<ffffffff80406313>] udp_push_pending_frames+0x233/0x3d0
[<ffffffff804067d1>] udp_sendmsg+0x321/0x6f0
[<ffffffff8040d155>] inet_sendmsg+0x45/0x80
[<ffffffff803b967f>] sock_sendmsg+0xdf/0x110
[<ffffffff8024a100>] ? autoremove_wake_function+0x0/0x40
[<ffffffff80257ce5>] ? validate_chain+0x415/0x1010
[<ffffffff8027dc10>] ? __do_fault+0x140/0x450
[<ffffffff802597d0>] ? __lock_acquire+0x260/0x590
[<ffffffff803b9e55>] ? sockfd_lookup_light+0x45/0x80
[<ffffffff803ba50a>] sys_sendto+0xea/0x120
[<ffffffff80428e42>] ? _spin_unlock_irqrestore+0x42/0x80
[<ffffffff803134bc>] ? __up_read+0x4c/0xb0
[<ffffffff8024e0c6>] ? up_read+0x26/0x30
[<ffffffff8020b8bb>] system_call_fastpath+0x16/0x1b

icmp6_sk() is similar.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 923c6586 18-Jul-2008 Pavel Emelyanov <xemul@openvz.org>

mib: put icmpmsg statistics on struct net

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b60538a0 18-Jul-2008 Pavel Emelyanov <xemul@openvz.org>

mib: put icmp statistics on struct net

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f66ac03d 15-Jul-2008 Pavel Emelyanov <xemul@openvz.org>

mib: add struct net to ICMPMSGIN_INC_STATS_BH

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 903fc196 15-Jul-2008 Pavel Emelyanov <xemul@openvz.org>

mib: add struct net to ICMPMSGOUT_INC_STATS

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# dcfc23ca 15-Jul-2008 Pavel Emelyanov <xemul@openvz.org>

mib: add struct net to ICMP_INC_STATS_BH

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 75c939bb 15-Jul-2008 Pavel Emelyanov <xemul@openvz.org>

mib: add struct net to ICMP_INC_STATS

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fd54d716 15-Jul-2008 Pavel Emelyanov <xemul@openvz.org>

inet: toss struct net initialization around

Some places, that deal with ICMP statistics already have where
to get a struct net from, but use it directly, without declaring
a separate variable on the stack.

Since I will need this net soon, I declare a struct net on the
stack and use it in the existing places in a separate patch not
to spoil the future ones.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0388b004 15-Jul-2008 Pavel Emelyanov <xemul@openvz.org>

icmp: add struct net argument to icmp_out_count

This routine deals with ICMP statistics, but doesn't have a
struct net at hands, so add one.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0b040829 10-Jun-2008 Adrian Bunk <bunk@kernel.org>

net: remove CVS keywords

This patch removes CVS keywords that weren't updated for a long time
from comments.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0010e465 29-Apr-2008 Timo Teras <timo.teras@iki.fi>

ipv4: Update MTU to all related cache entries in ip_rt_frag_needed()

Add struct net_device parameter to ip_rt_frag_needed() and update MTU to
cache entries where ifindex is specified. This is similar to what is
already done in ip_rt_redirect().

Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f25c3d61 21-Apr-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV4]: Convert do_gettimeofday() to getnstimeofday().

What do_gettimeofday() does is to call getnstimeofday() and
to convert the result from timespec{} to timeval{}.
After that, these callers convert the result again to msec.
Use getnstimeofday() and convert the units at once.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 263173af 21-Apr-2008 Adrian Bunk <bunk@kernel.org>

[IPV4]: Make icmp_sk_init() static.

This patch makes the needlessly global icmp_sk_init() static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a7d632b6 14-Apr-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV4]: Use NIPQUAD_FMT to format ipv4 addresses.

And use %u to format port.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c1e9894d 03-Apr-2008 Denis V. Lunev <den@openvz.org>

[ICMP]: Simplify ICMP control socket creation.

Replace sock_create_kern with inet_ctl_sock_create.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# af268182 03-Apr-2008 Herbert Xu <herbert@gondor.apana.org.au>

[ICMP]: Ensure that ICMP relookup maintains status quo

The ICMP relookup path is only meant to modify behaviour when
appropriate IPsec policies are in place and marked as requiring
relookups. It is certainly not meant to modify behaviour when
IPsec policies don't exist at all.

However, due to an oversight on the error paths existing behaviour
may in fact change should one of the relookup steps fail.

This patch corrects this by redirecting all errors on relookup
failures to the previous code path. That is, if the initial
xfrm_lookup let the packet pass, we will stand by that decision
should the relookup fail due to an error.

This should be safe from a security point-of-view because compliant
systems must install a default deny policy so the packet would'nt
have passed in that case.

Many thanks to Julian Anastasov for pointing out this error.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7c0ecc4c 26-Mar-2008 Pavel Emelyanov <xemul@openvz.org>

[ICMP]: Dst entry leak in icmp_send host re-lookup code (v2).

Commit 8b7817f3a959ed99d7443afc12f78a7e1fcc2063 ([IPSEC]: Add ICMP host
relookup support) introduced some dst leaks on error paths: the rt
pointer can be forgotten to be put. Fix it bu going to a proper label.

Found after net namespace's lo refused to unregister :) Many thanks to
Den for valuable help during debugging.

Herbert pointed out, that xfrm_lookup() will put the rtable in case
of error itself, so the first goto fix is redundant.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 789e41e6 26-Mar-2008 Pavel Emelyanov <xemul@openvz.org>

[NETNS][ICMP]: Build fix for NET_NS=n case (dev->nd_net is omitted).

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b34a95ee 26-Mar-2008 Pavel Emelyanov <xemul@openvz.org>

[NETNS][ICMP]: Use per-net sysctls in ipv4/icmp.c.

This mostly re-uses the net, used in icmp netnsization patches from Denis.

After this ICMP sysctls are completely virtualized.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a24022e1 26-Mar-2008 Pavel Emelyanov <xemul@openvz.org>

[NETNS][ICMP]: Move ICMP sysctls on struct net.

Initialization is moved to icmp_sk_init, all the places, that
refer to them use init_net for now.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c346dca1 25-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[NET] NETNS: Omit net_device->nd_net without CONFIG_NET_NS.

Introduce per-net_device inlines: dev_net(), dev_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# ee6b9673 05-Mar-2008 Eric Dumazet <dada1@cosmosbay.com>

[IPV4]: Add 'rtable' field in struct sk_buff to alias 'dst' and avoid casts

(Anonymous) unions can help us to avoid ugly casts.

A common cast it the (struct rtable *)skb->dst one.

Defining an union like :
union {
struct dst_entry *dst;
struct rtable *rtable;
};
permits to use skb->rtable in place.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1d1c8d13 29-Feb-2008 Denis V. Lunev <den@openvz.org>

[ICMP]: Section conflict between icmp_sk_init/icmp_sk_exit.

Functions from __exit section should not be called from ones in __init
section. Fix this conflict.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4a6ad7a1 29-Feb-2008 Denis V. Lunev <den@openvz.org>

[NETNS]: Make icmp_sk per namespace.

All preparations are done. Now just add a hook to perform an
initialization on namespace startup and replace icmp_sk macro with
proper inline call.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5c8cafd6 29-Feb-2008 Denis V. Lunev <den@openvz.org>

[NETNS]: icmp(v6)_sk should not pin a namespace.

So, change icmp(v6)_sk creation/disposal to the scheme used in the
netlink for rtnl, i.e. create a socket in the context of the init_net
and assign the namespace without getting a referrence later.

Also use sk_release_kernel instead of sock_release to properly destroy
such sockets.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 79c91159 29-Feb-2008 Denis V. Lunev <den@openvz.org>

[ICMP]: Allocate data for __icmp(v6)_sk dynamically.

Own __icmp(v6)_sk should be present in each namespace. So, it should be
allocated dynamically. Though, alloc_percpu does not fit the case as it
implies additional dereferrence for no bonus.

Allocate data for pointers just like __percpu_alloc_mask does and place
pointers to struct sock into this array.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 405666db 29-Feb-2008 Denis V. Lunev <den@openvz.org>

[ICMP]: Pass proper ICMP socket into icmp(v6)_xmit_(un)lock.

We have to get socket lock inside icmp(v6)_xmit_lock/unlock. The socket
is get from global variable now. When this code became namespaces, one
should pass a namespace and get socket from it.

Though, above is useless. Socket is available in the caller, just pass
it inside. This saves a bit of code now and saves more later.

add/remove: 0/0 grow/shrink: 1/3 up/down: 1/-169 (-168)
function old new delta
icmp_rcv 718 719 +1
icmpv6_rcv 2343 2303 -40
icmp_send 1566 1518 -48
icmp_reply 549 468 -81

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b7e729c4 29-Feb-2008 Denis V. Lunev <den@openvz.org>

[ICMP]: Store sock rather than socket for ICMP flow control.

Basically, there is no difference, what to store: socket or sock. Though,
sock looks better as there will be 1 less dereferrence on the fast path.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1e3cf683 29-Feb-2008 Denis V. Lunev <den@openvz.org>

[ICMP]: Optimize icmp_socket usage.

Use this macro only once in a function to save a bit of space.

add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-98 (-98)
function old new delta
icmp_reply 562 561 -1
icmp_push_reply 305 258 -47
icmp_init 273 223 -50

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a5710d65 29-Feb-2008 Denis V. Lunev <den@openvz.org>

[ICMP]: Add return code to icmp_init.

icmp_init could fail and this is normal for namespace other than initial.
So, the panic should be triggered only on init_net initialization path.

Additionally create rollback path for icmp_init as a separate function.
It will also be used later during namespace destruction.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9b0f976f 29-Feb-2008 Denis V. Lunev <den@openvz.org>

[INET]: Remove struct net_proto_family* from _init calls.

struct net_proto_family* is not used in icmp[v6]_init, ndisc_init,
igmp_init and tcp_v4_init. Remove it.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8cf22943 05-Feb-2008 Herbert Xu <herbert@gondor.apana.org.au>

[ICMP]: Restore pskb_pull calls in receive function

Somewhere along the development of my ICMP relookup patch the header
length check went AWOL on the non-IPsec path. This patch restores the
check.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# dde1bc0e 23-Jan-2008 Denis V. Lunev <den@openvz.org>

[NETNS]: Add namespace for ICMP replying code.

All needed API is done, the namespace is available when required from
the device on the DST entry from the incoming packet. So, just replace
init_net with proper namespace.

Other protocols will follow.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b5921910 23-Jan-2008 Denis V. Lunev <den@openvz.org>

[NETNS]: Routing cache virtualization.

Basically, this piece looks relatively easy. Namespace is already
available on the dst entry via device and the device is safe to
dereferrence. Compare it with one of a searcher and skip entry if
appropriate.

The only exception is ip_rt_frag_needed. So, add namespace parameter to it.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f206351a 22-Jan-2008 Denis V. Lunev <den@openvz.org>

[NETNS]: Add namespace parameter to ip_route_output_key.

Needed to propagate it down to the ip_route_output_flow.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 611c183e 22-Jan-2008 Denis V. Lunev <den@openvz.org>

[NETNS]: Add namespace parameter to __ip_route_output_key.

This is only required to propagate it down to the
ip_route_output_slow.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 69a73829 22-Jan-2008 Eric Dumazet <dada1@cosmosbay.com>

[DST]: shrinks sizeof(struct rtable) by 64 bytes on x86_64

On x86_64, sizeof(struct rtable) is 0x148, which is rounded up to
0x180 bytes by SLAB allocator.

We can reduce this to exactly 0x140 bytes, without alignment overhead,
and store 12 struct rtable per PAGE instead of 10.

rate_tokens is currently defined as an "unsigned long", while its
content should not exceed 6*HZ. It can safely be converted to an
unsigned int.

Moving tclassid right after rate_tokens to fill the 4 bytes hole
permits to save 8 bytes on 'struct dst_entry', which finally permits
to save 8 bytes on 'struct rtable'

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6b175b26 10-Jan-2008 Eric W. Biederman <ebiederm@xmission.com>

[NETNS]: Add netns parameter to inet_(dev_)add_type.

The patch extends the inet_addr_type and inet_dev_addr_type with the
network namespace pointer. That allows to access the different tables
relatively to the network namespace.

The modification of the signature function is reported in all the
callers of the inet_addr_type using the pointer to the well known
init_net.

Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 74feb6e8 24-Jan-2008 Eric Dumazet <dada1@cosmosbay.com>

[ICMP]: Avoid sparse warnings in net/ipv4/icmp.c

CHECK net/ipv4/icmp.c
net/ipv4/icmp.c:249:13: warning: context imbalance in 'icmp_xmit_unlock' -
unexpected unlock
net/ipv4/icmp.c:376:13: warning: context imbalance in 'icmp_reply' - different
lock contexts for basic block
net/ipv4/icmp.c:430:6: warning: context imbalance in 'icmp_send' - different
lock contexts for basic block

Solution is to declare both icmp_xmit_lock() and icmp_xmit_unlock() as inline

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# aebcf82c 12-Dec-2007 Herbert Xu <herbert@gondor.apana.org.au>

[IPSEC]: Do not let packets pass when ICMP flag is off

This fixes a logical error in ICMP policy checks which lets
packets through if the state ICMP flag is off.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7233b9f3 12-Dec-2007 Herbert Xu <herbert@gondor.apana.org.au>

[IPSEC]: Fix reversed ICMP6 policy check

The policy check I added for ICMP on IPv6 is reversed. This
patch fixes that.

It also adds an skb->sp check so that unprotected packets that
fail the policy check do not crash the machine.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8b7817f3 12-Dec-2007 Herbert Xu <herbert@gondor.apana.org.au>

[IPSEC]: Add ICMP host relookup support

RFC 4301 requires us to relookup ICMP traffic that does not match any
policies using the reverse of its payload. This patch implements this
for ICMP traffic that originates from or terminates on localhost.

This is activated on outbound with the new policy flag XFRM_POLICY_ICMP,
and on inbound by the new state flag XFRM_STATE_ICMP.

On inbound the policy check is now performed by the ICMP protocol so
that it can repeat the policy check where necessary.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7bc54c90 19-Nov-2007 Pavel Emelyanov <xemul@openvz.org>

[IPv4] RAW: Compact the API for the kernel

The raw sockets functions are explicitly used from
inside the kernel in two places:

1. in ip_local_deliver_finish to intercept skb-s
2. in icmp_error

For this purposes many functions and even data structures,
that are naturally internal for raw protocol, are exported.

Compact the API to two functions and hide all the other
(including hash table and rwlock) inside the net/ipv4/raw.c

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5b4d383a 21-Jan-2008 Wang Chen <wangchen@cn.fujitsu.com>

[ICMP]: ICMP_MIB_OUTMSGS increment duplicated

Commit "96793b482540f3a26e2188eaf75cb56b7829d3e3" (Add ICMPMsgStats
MIB (RFC 4293)) made a mistake.

In that patch, David L added a icmp_out_count() in
ip_push_pending_frames(), remove icmp_out_count() from
icmp_reply(). But he forgot to remove icmp_out_count() from
icmp_send() too. Since icmp_send and icmp_reply will call
icmp_push_reply, which will call ip_push_pending_frames, a duplicated
increment happened in icmp_send.

This patch remove the icmp_out_count from icmp_send too.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 39296ed6 26-Oct-2007 Adrian Bunk <bunk@kernel.org>

[INET]: Unexport icmpmsg_statistics

This patch removes the unused EXPORT_SYMBOL(icmpmsg_statistics).

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 96793b48 17-Sep-2007 David L Stevens <dlstevens@us.ibm.com>

[IPV4]: Add ICMPMsgStats MIB (RFC 4293)

Background: RFC 4293 deprecates existing individual, named ICMP
type counters to be replaced with the ICMPMsgStatsTable. This table
includes entries for both IPv4 and IPv6, and requires counting of all
ICMP types, whether or not the machine implements the type.

These patches "remove" (but not really) the existing counters, and
replace them with the ICMPMsgStats tables for v4 and v6.
It includes the named counters in the /proc places they were, but gets the
values for them from the new tables. It also counts packets generated
from raw socket output (e.g., OutEchoes, MLD queries, RA's from
radvd, etc).

Changes:
1) create icmpmsg_statistics mib
2) create icmpv6msg_statistics mib
3) modify existing counters to use these
4) modify /proc/net/snmp to add "IcmpMsg" with all ICMP types
listed by number for easy SNMP parsing
5) modify /proc/net/snmp printing for "Icmp" to get the named data
from new counters.

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 881d966b 17-Sep-2007 Eric W. Biederman <ebiederm@xmission.com>

[NET]: Make the device list and device lookups per namespace.

This patch makes most of the generic device layer network
namespace safe. This patch makes dev_base_head a
network namespace variable, and then it picks up
a few associated variables. The functions:
dev_getbyhwaddr
dev_getfirsthwbytype
dev_get_by_flags
dev_get_by_name
__dev_get_by_name
dev_get_by_index
__dev_get_by_index
dev_ioctl
dev_ethtool
dev_load
wireless_process_ioctl

were modified to take a network namespace argument, and
deal with it.

vlan_ioctl_set and brioctl_set were modified so their
hooks will receive a network namespace argument.

So basically anthing in the core of the network stack that was
affected to by the change of dev_base was modified to handle
multiple network namespaces. The rest of the network stack was
simply modified to explicitly use &init_net the initial network
namespace. This can be fixed when those components of the network
stack are modified to handle multiple network namespaces.

For now the ifindex generator is left global.

Fundametally ifindex numbers are per namespace, or else
we will have corner case problems with migration when
we get that far.

At the same time there are assumptions in the network stack
that the ifindex of a network device won't change. Making
the ifindex number global seems a good compromise until
the network stack can cope with ifindex changes when
you change namespaces, and the like.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6e1d9103 01-Jun-2007 Patrick McHardy <kaber@trash.net>

[ICMP]: Fix icmp_errors_use_inbound_ifaddr sysctl

Currently when icmp_errors_use_inbound_ifaddr is set and an ICMP error is
sent after the packet passed through ip_output(), an address from the
outgoing interface is chosen as ICMP source address since skb->dev doesn't
point to the incoming interface anymore.

Fix this by doing an interface lookup on rt->dst.iif and using that device.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d8cf2728 19-May-2007 Patrick McHardy <kaber@trash.net>

[IPV4]: icmp: fix crash with sysctl_icmp_errors_use_inbound_ifaddr

When icmp_send is called on the local output path before the
packet hits ip_output, skb->dev is not set, causing a crash
when sysctl_icmp_errors_use_inbound_ifaddr is set. This can
happen with the netfilter REJECT target or IPsec tunnels.

Let routing decide the ICMP source address in that case, since the
packet is locally generated there is no inbound interface and
the sysctl should not apply.

The option actually seems to be unfixable broken, on the path
after ip_output() skb->dev points to the outgoing device and
we don't know the incoming device anymore, so its going to do
the absolute wrong thing and pick the address of the outgoing
interface. Add a comment about this.

Reported by Curtis Doty <Curtis@GreenKey.net>.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 27a884dc 19-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Convert skb->tail to sk_buff_data_t

So that it is also an offset from skb->head, reduces its size from 8 to 4 bytes
on 64bit architectures, allowing us to combine the 4 bytes hole left by the
layer headers conversion, reducing struct sk_buff size to 256 bytes, i.e. 4
64byte cachelines, and since the sk_buff slab cache is SLAB_HWCACHE_ALIGN...
:-)

Many calculations that previously required that skb->{transport,network,
mac}_header be first converted to a pointer now can be done directly, being
meaningful as offsets or pointers.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 88c7664f 13-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce icmp_hdr(), remove skb->h.icmph

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# eddc9ec5 20-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce ip_hdr(), remove skb->nh.iph

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d56f90a7 10-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce skb_network_header()

For the places where we need a pointer to the network header, it is still legal
to touch skb->nh.raw directly if just adding to, subtracting from or setting it
to another layer header.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e905a9ed 09-Feb-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[NET] IPV4: Fix whitespace errors.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5f92a738 14-Nov-2006 Al Viro <viro@zeniv.linux.org.uk>

[NET]: Annotate callers of the reset of checksum.h stuff.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d3bc23e7 14-Nov-2006 Al Viro <viro@zeniv.linux.org.uk>

[NET]: Annotate callers of csum_fold() in net/*

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b03d73e3 27-Sep-2006 Al Viro <viro@zeniv.linux.org.uk>

[IPV4] net/ipv4/icmp.c: trivial annotations

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3ca3c68e 27-Sep-2006 Al Viro <viro@zeniv.linux.org.uk>

[IPV4]: struct ip_options annotations

->faddr is net-endian; annotated as such, variables inferred to be net-endian
annotated.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e4883014 26-Sep-2006 Al Viro <viro@zeniv.linux.org.uk>

[IPV4]: icmp_send() annotation

The last argument is network-endian (it will go straight into the packet).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a144ea4b 28-Sep-2006 Al Viro <viro@zeniv.linux.org.uk>

[IPV4]: annotate struct in_ifaddr

ifa_local, ifa_address, ifa_mask, ifa_broadcast and ifa_anycast are
net-endian. Annotated them and variables that are inferred to be
net-endian.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a61ced5d 26-Sep-2006 Al Viro <viro@zeniv.linux.org.uk>

[IPV4]: inet_select_addr() annotations

argument and return value are net-endian. Annotated function and inferred
net-endian variables in callers.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ab32ea5d 22-Sep-2006 Brian Haley <brian.haley@hp.com>

[NET/IPV4/IPV6]: Change some sysctl variables to __read_mostly

Change net/core, ipv4 and ipv6 sysctl variables to __read_mostly.

Couldn't actually measure any performance increase while testing (.3%
I consider noise), but seems like the right thing to do.

Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 84fa7933 29-Aug-2006 Patrick McHardy <kaber@trash.net>

[NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE

Replace CHECKSUM_HW by CHECKSUM_PARTIAL (for outgoing packets, whose
checksum still needs to be completed) and CHECKSUM_COMPLETE (for
incoming packets, device supplied full checksum).

Patch originally from Herbert Xu, updated by myself for 2.6.18-rc3.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# beb8d13b 05-Aug-2006 Venkat Yekkirala <vyekkirala@TrustedCS.com>

[MLSXFRM]: Add flow labeling

This labels the flows that could utilize IPSec xfrms at the points the
flows are defined so that IPSec policy and SAs at the right label can
be used.

The following protos are currently not handled, but they should
continue to be able to use single-labeled IPSec like they currently
do.

ipmr
ip_gre
ipip
igmp
sit
sctp
ip6_tunnel (IPv6 over IPv6 tunnel device)
decnet

Signed-off-by: Venkat Yekkirala <vyekkirala@TrustedCS.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6ab3d562 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de>

Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# f86502bf 05-Jun-2006 David S. Miller <davem@sunset.davemloft.net>

[IPV4] icmp: Kill local 'ip' arg in icmp_redirect().

It is typed wrong, and it's only assigned and used once.
So just pass in iph->daddr directly which fixes both problems.

Based upon a patch by Alexey Dobriyan.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 6f912042 10-Apr-2006 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

[PATCH] for_each_possible_cpu: network codes

for_each_cpu() actually iterates across all possible CPUs. We've had mistakes
in the past where people were using for_each_cpu() where they should have been
iterating across only online or present CPUs. This is inefficient and
possibly buggy.

We're renaming for_each_cpu() to for_each_possible_cpu() to avoid this in the
future.

This patch replaces for_each_cpu with for_each_possible_cpu under /net

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# cef2685e 25-Mar-2006 Ilia Sotnikov <hostcc@gmail.com>

[IPV4]: Aggregate route entries with different TOS values

When we get an ICMP need-to-frag message, the original TOS value in the
ICMP payload cannot be used as a key to look up the routes to update.
This is because the TOS field may have been modified by routers on the
way. Similarly, ip_rt_redirect should also ignore the TOS as the router
that gave us the message may have modified the TOS value.

The patch achieves this objective by aggregating entries with different
TOS values (but are otherwise identical) into the same bucket. This
makes it easy to update them at the same time when an ICMP message is
received.

In future we should use a twin-hashing scheme where teh aggregation
occurs at the entry level. That is, the TOS goes back into the hash
for normal lookups while ICMP lookups will end up with a node that
gives us a list that contains all other route entries that differ
only by TOS.

Signed-off-by: Ilia Sotnikov <hostcc@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 77decfc7 13-Feb-2006 Dave Jones <davej@redhat.com>

[IPV4] ICMP: Invert default for invalid icmp msgs sysctl

isic can trigger these msgs to be spewed at a very high rate.
There's already a sysctl to turn them off. Given these messages
aren't useful for most people, this patch disables them by
default.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fa60cf7f 04-Feb-2006 Herbert Xu <herbert@gondor.apana.org.au>

[ICMP]: Fix extra dst release when ip_options_echo fails

When two ip_route_output_key lookups in icmp_send were combined I
forgot to change the error path for ip_options_echo to not drop the
dst reference since it now sits before the dst lookup. To fix it we
simply jump past the ip_rt_put call.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f00c401b 02-Feb-2006 Horms <horms@verge.net.au>

[IPV4]: Remove suprious use of goto out: in icmp_reply

This seems to be an artifact of the follwoing commit in February '02.

e7e173af42dbf37b1d946f9ee00219cb3b2bea6a

In a nutshell, goto out and return actually do the same thing,
and both are called in this function. This patch removes out.

Signed-Off-By: Horms <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 09a62660 08-Jan-2006 Kris Katterjohn <kjak@users.sourceforge.net>

[NET]: Change some "if (x) BUG();" to "BUG_ON(x);"

This changes some simple "if (x) BUG();" statements to "BUG_ON(x);"

Signed-off-by: Kris Katterjohn <kjak@users.sourceforge.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 14c85021 26-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[INET_SOCK]: Move struct inet_sock & helper functions to net/inet_sock.h

To help in reducing the number of include dependencies, several files were
touched as they were getting needed headers indirectly for stuff they use.

Thanks also to Alan Menegotto for pointing out that net/dccp/proto.c had
linux/dccp.h include twice.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9b5b5cff 29-Nov-2005 Arjan van de Ven <arjan@infradead.org>

[NET]: Add const markers to various variables.

the patch below marks various variables const in net/; the goal is to
move them to the .rodata section so that they can't false-share
cachelines with things that get written to, as well as potentially
helping gcc a bit with optimisations. (these were found using a gcc
patch to warn about such variables)

Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fb286bb2 10-Nov-2005 Herbert Xu <herbert@gondor.apana.org.au>

[NET]: Detect hardware rx checksum faults correctly

Here is the patch that introduces the generic skb_checksum_complete
which also checks for hardware RX checksum faults. If that happens,
it'll call netdev_rx_csum_fault which currently prints out a stack
trace with the device name. In future it can turn off RX checksum.

I've converted every spot under net/ that does RX checksum checks to
use skb_checksum_complete or __skb_checksum_complete with the
exceptions of:

* Those places where checksums are done bit by bit. These will call
netdev_rx_csum_fault directly.

* The following have not been completely checked/converted:

ipmr
ip_vs
netfilter
dccp

This patch is based on patches and suggestions from Stephen Hemminger
and David S. Miller.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 670c02c2 13-Oct-2005 John Hawkes <hawkes@sgi.com>

[NET]: Wider use of for_each_*cpu()

In 'net' change the explicit use of for-loops and NR_CPUS into the
general for_each_cpu() or for_each_online_cpu() constructs, as
appropriate. This widens the scope of potential future optimizations
of the general constructs, as well as takes advantage of the existing
optimizations of first_cpu() and next_cpu(), which is advantageous
when the true CPU count is much smaller than NR_CPUS.

Signed-off-by: John Hawkes <hawkes@sgi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# 7ce31246 03-Oct-2005 David S. Miller <davem@sunset.davemloft.net>

[IPV4]: Update icmp sysctl docs and disable broadcast ECHO/TIMESTAMP by default

It's not a good idea to be smurf'able by default.
The few people who need this can turn it on.

Signed-off-by: David S. Miller <davem@davemloft.net>


# ba89966c 26-Aug-2005 Eric Dumazet <dada1@cosmosbay.com>

[NET]: use __read_mostly on kmem_cache_t , DEFINE_SNMP_STAT pointers

This patch puts mostly read only data in the right section
(read_mostly), to help sharing of these data between CPUS without
memory ping pongs.

On one of my production machine, tcp_statistics was sitting in a
heavily modified cache line, so *every* SNMP update had to force a
reload.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 64ce2073 09-Aug-2005 Patrick McHardy <kaber@trash.net>

[NET]: Make NETDEBUG pure printk wrappers

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cb94c62c 18-Aug-2005 Patrick McHardy <kaber@trash.net>

[IPV4]: Fix DST leak in icmp_push_reply()

Based upon a bug report and initial patch by
Ollie Wild.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ca933452 08-Aug-2005 Heikki Orsila <heikki.orsila@iki.fi>

[IPV4]: Debug cleanup

Here's a small patch to cleanup NETDEBUG() use in net/ipv4/ for Linux
kernel 2.6.13-rc5. Also weird use of indentation is changed in some
places.

Signed-off-by: Heikki Orsila <heikki.orsila@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4c866aa7 08-Jul-2005 Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>

[IPV4]: Apply sysctl_icmp_echo_ignore_broadcasts to ICMP_TIMESTAMP as well.

This was the full intention of the original code.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 1c2fb7f9 13-Jun-2005 J. Simonetti <jeroen@simonetti.nl>

[IPV4]: Sysctl configurable icmp error source address.

This patch alows you to change the source address of icmp error
messages. It applies cleanly to 2.6.11.11 and retains the default
behaviour.

In the old (default) behaviour icmp error messages are sent with the ip
of the exiting interface.

The new behaviour (when the sysctl variable is toggled on), it will send
the message with the ip of the interface that received the packet that
caused the icmp error. This is the behaviour network administrators will
expect from a router. It makes debugging complicated network layouts
much easier. Also, all 'vendor routers' I know of have the later
behaviour.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 1da177e4 16-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org>

Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!