History log of /linux-master/net/ipv4/ping.c
Revision Date Author Comments
# fefba7d1 11-Oct-2023 Daan De Meyer <daan.j.demeyer@gmail.com>

bpf: Propagate modified uaddrlen from cgroup sockaddr programs

As prep for adding unix socket support to the cgroup sockaddr hooks,
let's propagate the sockaddr length back to the caller after running
a bpf cgroup sockaddr hook program. While not important for AF_INET or
AF_INET6, the sockaddr length is important when working with AF_UNIX
sockaddrs as the size of the sockaddr cannot be determined just from the
address family or the sockaddr's contents.

__cgroup_bpf_run_filter_sock_addr() is modified to take the uaddrlen as
an input/output argument. After running the program, the modified sockaddr
length is stored in the uaddrlen pointer.

Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Link: https://lore.kernel.org/r/20231011185113.140426-3-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>


# 02715925 21-Sep-2023 Eric Dumazet <edumazet@google.com>

inet: implement lockless getsockopt(IP_MULTICAST_IF)

Add missing annotations to inet->mc_index and inet->mc_addr
to fix data-races.

getsockopt(IP_MULTICAST_IF) can be lockless.

setsockopt() side is left for later.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 959d5c11 21-Sep-2023 Eric Dumazet <edumazet@google.com>

inet: implement lockless getsockopt(IP_UNICAST_IF)

Add missing READ_ONCE() annotations when reading inet->uc_index

Implementing getsockopt(IP_UNICAST_IF) locklessly seems possible,
the setsockopt() part might not be possible at the moment.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ceaa7141 21-Sep-2023 Eric Dumazet <edumazet@google.com>

inet: implement lockless IP_MTU_DISCOVER

inet->pmtudisc can be read locklessly.

Implement proper lockless reads and writes to inet->pmtudisc

ip_sock_set_mtu_discover() can now be called from arbitrary
contexts.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 859f8b26 12-Sep-2023 Eric Dumazet <edumazet@google.com>

ipv6: lockless IPV6_FLOWINFO_SEND implementation

np->sndflow reads are racy.

Use one bit ftom atomic inet->inet_flags instead,
IPV6_FLOWINFO_SEND setsockopt() can be lockless.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3fa29971 12-Sep-2023 Eric Dumazet <edumazet@google.com>

ipv6: lockless IPV6_RECVERR implemetation

np->recverr is moved to inet->inet_flags to fix data-races.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6b5f43ea 16-Aug-2023 Eric Dumazet <edumazet@google.com>

inet: move inet->recverr to inet->inet_flags

IP_RECVERR socket option can now be set/get without locking the socket.

This patch potentially avoid data-races around inet->recverr.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c274af22 16-Aug-2023 Eric Dumazet <edumazet@google.com>

inet: introduce inet->inet_flags

Various inet fields are currently racy.

do_ip_setsockopt() and do_ip_getsockopt() are mostly holding
the socket lock, but some (fast) paths do not.

Use a new inet->inet_flags to hold atomic bits in the series.

Remove inet->cmsg_flags, and use instead 9 bits from inet_flags.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 726de790 22-May-2023 Guillaume Nault <gnault@redhat.com>

ping: Stop using RTO_ONLINK.

Define a new helper to figure out the correct route scope to use on TX,
depending on socket configuration, ancillary data and send flags.

Use this new helper to properly initialise the scope in
flowi4_init_output(), instead of overriding tos with the RTO_ONLINK
flag.

The objective is to eventually remove RTO_ONLINK, which will allow
converting .flowi4_tos to dscp_t.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f1b5dfe6 10-May-2023 Kuniyuki Iwashima <kuniyu@amazon.com>

ping: Convert hlist_nulls to plain hlist.

Since introduced in commit c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP
socket kind"), ping socket does not use SLAB_TYPESAFE_BY_RCU nor check
nulls marker in loops.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ab5fb73f 03-Apr-2023 Kuniyuki Iwashima <kuniyu@amazon.com>

ping: Fix potentail NULL deref for /proc/net/icmp.

After commit dbca1596bbb0 ("ping: convert to RCU lookups, get rid
of rwlock"), we use RCU for ping sockets, but we should use spinlock
for /proc/net/icmp to avoid a potential NULL deref mentioned in
the previous patch.

Let's go back to using spinlock there.

Note we can convert ping sockets to use hlist instead of hlist_nulls
because we do not use SLAB_TYPESAFE_BY_RCU for ping sockets.

Fixes: dbca1596bbb0 ("ping: convert to RCU lookups, get rid of rwlock")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 7a7160ed 18-Nov-2022 Kuniyuki Iwashima <kuniyu@amazon.com>

net: Return errno in sk->sk_prot->get_port().

We assume the correct errno is -EADDRINUSE when sk->sk_prot->get_port()
fails, so some ->get_port() functions return just 1 on failure and the
callers return -EADDRINUSE instead.

However, mptcp_get_port() can return -EINVAL. Let's not ignore the error.

Note the only exception is inet_autobind(), all of whose callers return
-EAGAIN instead.

Fixes: cec37a6e41aa ("mptcp: Handle MP_CAPABLE options for outgoing connections")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c25b7a7a 29-Nov-2022 Florian Westphal <fw@strlen.de>

inet: ping: use hlist_nulls rcu iterator during lookup

ping_lookup() does not acquire the table spinlock, so iteration should
use hlist_nulls_for_each_entry_rcu().

Spotted during code review.

Fixes: dbca1596bbb0 ("ping: convert to RCU lookups, get rid of rwlock")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20221129140644.28525-1-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# 0d24148b 11-Oct-2022 Eric Dumazet <edumazet@google.com>

inet: ping: fix recent breakage

Blamed commit broke the assumption used by ping sendmsg() that
allocated skb would have MAX_HEADER bytes in skb->head.

This patch changes the way ping works, by making sure
the skb head contains space for the icmp header,
and adjusting ping_getfrag() which was desperate
about going past the icmp header :/

This is adopting what UDP does, mostly.

syzbot is able to crash a host using both kfence and following repro in a loop.

fd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_ICMPV6)
connect(fd, {sa_family=AF_INET6, sin6_port=htons(0), sin6_flowinfo=htonl(0),
inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28
sendmsg(fd, {msg_name=NULL, msg_namelen=0, msg_iov=[
{iov_base="\200\0\0\0\23\0\0\0\0\0\0\0\0\0"..., iov_len=65496}],
msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0

When kfence triggers, skb->head only has 64 bytes, immediately followed
by struct skb_shared_info (no extra headroom based on ksize(ptr))

Then icmpv6_push_pending_frames() is overwriting first bytes
of skb_shinfo(skb), making nr_frags bigger than MAX_SKB_FRAGS,
and/or setting shinfo->gso_size to a non zero value.

If nr_frags is mangled, a crash happens in skb_release_data()

If gso_size is mangled, we have the following report:

lo: caps=(0x00000516401d7c69, 0x00000516401d7c69)
WARNING: CPU: 0 PID: 7548 at net/core/dev.c:3239 skb_warn_bad_offload+0x119/0x230 net/core/dev.c:3239
Modules linked in:
CPU: 0 PID: 7548 Comm: syz-executor268 Not tainted 6.0.0-syzkaller-02754-g557f050166e5 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022
RIP: 0010:skb_warn_bad_offload+0x119/0x230 net/core/dev.c:3239
Code: 70 03 00 00 e8 58 c3 24 fa 4c 8d a5 e8 00 00 00 e8 4c c3 24 fa 4c 89 e9 4c 89 e2 4c 89 f6 48 c7 c7 00 53 f5 8a e8 13 ac e7 01 <0f> 0b 5b 5d 41 5c 41 5d 41 5e e9 28 c3 24 fa e8 23 c3 24 fa 48 89
RSP: 0018:ffffc9000366f3e8 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff88807a9d9d00 RCX: 0000000000000000
RDX: ffff8880780c0000 RSI: ffffffff8160f6f8 RDI: fffff520006cde6f
RBP: ffff888079952000 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000400 R11: 0000000000000000 R12: ffff8880799520e8
R13: ffff88807a9da070 R14: ffff888079952000 R15: 0000000000000000
FS: 0000555556be6300(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020010000 CR3: 000000006eb7b000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
gso_features_check net/core/dev.c:3521 [inline]
netif_skb_features+0x83e/0xb90 net/core/dev.c:3554
validate_xmit_skb+0x2b/0xf10 net/core/dev.c:3659
__dev_queue_xmit+0x998/0x3ad0 net/core/dev.c:4248
dev_queue_xmit include/linux/netdevice.h:3008 [inline]
neigh_hh_output include/net/neighbour.h:530 [inline]
neigh_output include/net/neighbour.h:544 [inline]
ip6_finish_output2+0xf97/0x1520 net/ipv6/ip6_output.c:134
__ip6_finish_output net/ipv6/ip6_output.c:195 [inline]
ip6_finish_output+0x690/0x1160 net/ipv6/ip6_output.c:206
NF_HOOK_COND include/linux/netfilter.h:291 [inline]
ip6_output+0x1ed/0x540 net/ipv6/ip6_output.c:227
dst_output include/net/dst.h:445 [inline]
ip6_local_out+0xaf/0x1a0 net/ipv6/output_core.c:161
ip6_send_skb+0xb7/0x340 net/ipv6/ip6_output.c:1966
ip6_push_pending_frames+0xdd/0x100 net/ipv6/ip6_output.c:1986
icmpv6_push_pending_frames+0x2af/0x490 net/ipv6/icmp.c:303
ping_v6_sendmsg+0xc44/0x1190 net/ipv6/ping.c:190
inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:734
____sys_sendmsg+0x712/0x8c0 net/socket.c:2482
___sys_sendmsg+0x110/0x1b0 net/socket.c:2536
__sys_sendmsg+0xf3/0x1c0 net/socket.c:2565
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f21aab42b89
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 41 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fff1729d038 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f21aab42b89
RDX: 0000000000000000 RSI: 0000000020000180 RDI: 0000000000000003
RBP: 0000000000000000 R08: 000000000000000d R09: 000000000000000d
R10: 000000000000000d R11: 0000000000000246 R12: 00007fff1729d050
R13: 00000000000f4240 R14: 0000000000021dd1 R15: 00007fff1729d044
</TASK>

Fixes: 47cf88993c91 ("net: unify alloclen calculation for paged requests")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 87445f36 11-Oct-2022 Eric Dumazet <edumazet@google.com>

ipv6: ping: fix wrong checksum for large frames

For a given ping datagram, ping_getfrag() is called once
per skb fragment.

A large datagram requiring more than one page fragment
is currently getting the checksum of the last fragment,
instead of the cumulative one.

After this patch, "ping -s 35000 ::1" is working correctly.

Fixes: 6d0bfe226116 ("net: ipv6: Add IPv6 support to the ping socket.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0ffe2412 08-Sep-2022 YiFei Zhu <zhuyifei@google.com>

bpf: Invoke cgroup/connect{4,6} programs for unprivileged ICMP ping

Usually when a TCP/UDP connection is initiated, we can bind the socket
to a specific IP attached to an interface in a cgroup/connect hook.
But for pings, this is impossible, as the hook is not being called.

This adds the hook invocation to unprivileged ICMP ping (i.e. ping
sockets created with SOCK_DGRAM IPPROTO_ICMP(V6) as opposed to
SOCK_RAW. Logic is mirrored from UDP sockets where the hook is invoked
during pre_connect, after a check for suficiently sized addr_len.

Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Link: https://lore.kernel.org/r/5764914c252fad4cd134fb6664c6ede95f409412.1662682323.git.zhuyifei@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>


# dbca1596 17-Jun-2022 Eric Dumazet <edumazet@google.com>

ping: convert to RCU lookups, get rid of rwlock

Using rwlock in networking code is extremely risky.
writers can starve if enough readers are constantly
grabing the rwlock.

I thought rwlock were at fault and sent this patch:

https://lkml.org/lkml/2022/6/17/272

But Peter and Linus essentially told me rwlock had to be unfair.

We need to get rid of rwlock in networking code.

Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b4a028c4 17-Jun-2022 Riccardo Paolo Bestetti <pbl@bestov.io>

ipv4: ping: fix bind address validity check

Commit 8ff978b8b222 ("ipv4/raw: support binding to nonlocal addresses")
introduced a helper function to fold duplicated validity checks of bind
addresses into inet_addr_valid_or_nonlocal(). However, this caused an
unintended regression in ping_check_bind_addr(), which previously would
reject binding to multicast and broadcast addresses, but now these are
both incorrectly allowed as reported in [1].

This patch restores the original check. A simple reordering is done to
improve readability and make it evident that multicast and broadcast
addresses should not be allowed. Also, add an early exit for INADDR_ANY
which replaces lost behavior added by commit 0ce779a9f501 ("net: Avoid
unnecessary inet_addr_type() call when addr is INADDR_ANY").

Furthermore, this patch introduces regression selftests to catch these
specific cases.

[1] https://lore.kernel.org/netdev/CANP3RGdkAcDyAZoT1h8Gtuu0saq+eOrrTiWbxnOs+5zn+cpyKg@mail.gmail.com/

Fixes: 8ff978b8b222 ("ipv4/raw: support binding to nonlocal addresses")
Cc: Miaohe Lin <linmiaohe@huawei.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Signed-off-by: Riccardo Paolo Bestetti <pbl@bestov.io>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2e47eece 28-Apr-2022 Yu Zhe <yuzhe@nfschina.com>

ipv4: remove unnecessary type castings

remove unnecessary void* type castings.

Signed-off-by: Yu Zhe <yuzhe@nfschina.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ec095263 11-Apr-2022 Oliver Hartkopp <socketcan@hartkopp.net>

net: remove noblock parameter from recvmsg() entities

The internal recvmsg() functions have two parameters 'flags' and 'noblock'
that were merged inside skb_recv_datagram(). As a follow up patch to commit
f4b41f062c42 ("net: remove noblock parameter from skb_recv_datagram()")
this patch removes the separate 'noblock' parameter for recvmsg().

Analogue to the referenced patch for skb_recv_datagram() the 'flags' and
'noblock' parameters are unnecessarily split up with e.g.

err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
flags & ~MSG_DONTWAIT, &addr_len);

or in

err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
sk, msg, size, flags & MSG_DONTWAIT,
flags & ~MSG_DONTWAIT, &addr_len);

instead of simply using only flags all the time and check for MSG_DONTWAIT
where needed (to preserve for the formerly separated no(n)block condition).

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# b384c95a 07-Apr-2022 Menglong Dong <imagedong@tencent.com>

net: icmp: add skb drop reasons to icmp protocol

Replace kfree_skb() used in icmp_rcv() and icmpv6_rcv() with
kfree_skb_reason().

In order to get the reasons of the skb drops after icmp message handle,
we change the return type of 'handler()' in 'struct icmp_control' from
'bool' to 'enum skb_drop_reason'. This may change its original
intention, as 'false' means failure, but 'SKB_NOT_DROPPED_YET' means
success now. Therefore, all 'handler' and the call of them need to be
handled. Following 'handler' functions are involved:

icmp_unreach()
icmp_redirect()
icmp_echo()
icmp_timestamp()
icmp_discard()

And following new drop reasons are added:

SKB_DROP_REASON_ICMP_CSUM
SKB_DROP_REASON_INVALID_PROTO

The reason 'INVALID_PROTO' is introduced for the case that the packet
doesn't follow rfc 1122 and is dropped. This is not a common case, and
I believe we can locate the problem from the data in the packet. For now,
this 'INVALID_PROTO' is used for the icmp broadcasts with wrong types.

Maybe there should be a document file for these reasons. For example,
list all the case that causes the 'UNHANDLED_PROTO' and 'INVALID_PROTO'
drop reason. Therefore, users can locate their problems according to the
document.

Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 41a95a00 07-Apr-2022 Menglong Dong <imagedong@tencent.com>

net: icmp: introduce __ping_queue_rcv_skb() to report drop reasons

In order to avoid to change the return value of ping_queue_rcv_skb(),
introduce the function __ping_queue_rcv_skb(), which is able to report
the reasons of skb drop as its return value, as Paolo suggested.

Meanwhile, make ping_queue_rcv_skb() a simple call to
__ping_queue_rcv_skb().

The kfree_skb() and sock_queue_rcv_skb() used in ping_queue_rcv_skb()
are replaced with kfree_skb_reason() and sock_queue_rcv_skb_reason()
now.

Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f4b41f06 04-Apr-2022 Oliver Hartkopp <socketcan@hartkopp.net>

net: remove noblock parameter from skb_recv_datagram()

skb_recv_datagram() has two parameters 'flags' and 'noblock' that are
merged inside skb_recv_datagram() by 'flags | (noblock ? MSG_DONTWAIT : 0)'

As 'flags' may contain MSG_DONTWAIT as value most callers split the 'flags'
into 'flags' and 'noblock' with finally obsolete bit operations like this:

skb_recv_datagram(sk, flags & ~MSG_DONTWAIT, flags & MSG_DONTWAIT, &rc);

And this is not even done consistently with the 'flags' parameter.

This patch removes the obsolete and costly splitting into two parameters
and only performs bit operations when really needed on the caller side.

One missing conversion thankfully reported by kernel test robot. I missed
to enable kunit tests to build the mctp code.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e1a7ac6f 04-May-2022 Nicolas Dichtel <nicolas.dichtel@6wind.com>

ping: fix address binding wrt vrf

When ping_group_range is updated, 'ping' uses the DGRAM ICMP socket,
instead of an IP raw socket. In this case, 'ping' is unable to bind its
socket to a local address owned by a vrflite.

Before the patch:
$ sysctl -w net.ipv4.ping_group_range='0 2147483647'
$ ip link add blue type vrf table 10
$ ip link add foo type dummy
$ ip link set foo master blue
$ ip link set foo up
$ ip addr add 192.168.1.1/24 dev foo
$ ip addr add 2001::1/64 dev foo
$ ip vrf exec blue ping -c1 -I 192.168.1.1 192.168.1.2
ping: bind: Cannot assign requested address
$ ip vrf exec blue ping6 -c1 -I 2001::1 2001::2
ping6: bind icmp socket: Cannot assign requested address

CC: stable@vger.kernel.org
Fixes: 1b69c6d0ae90 ("net: Introduce L3 Master device abstraction")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# cd33bdcb 23-Feb-2022 Xin Long <lucien.xin@gmail.com>

ping: remove pr_err from ping_lookup

As Jakub noticed, prints should be avoided on the datapath.
Also, as packets would never come to the else branch in
ping_lookup(), remove pr_err() from ping_lookup().

Fixes: 35a79e64de29 ("ping: fix the dif and sdif check in ping_lookup")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/1ef3f2fcd31bd681a193b1fcf235eee1603819bd.1645674068.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 35a79e64 15-Feb-2022 Xin Long <lucien.xin@gmail.com>

ping: fix the dif and sdif check in ping_lookup

When 'ping' changes to use PING socket instead of RAW socket by:

# sysctl -w net.ipv4.ping_group_range="0 100"

There is another regression caused when matching sk_bound_dev_if
and dif, RAW socket is using inet_iif() while PING socket lookup
is using skb->dev->ifindex, the cmd below fails due to this:

# ip link add dummy0 type dummy
# ip link set dummy0 up
# ip addr add 192.168.111.1/24 dev dummy0
# ping -I dummy0 192.168.111.1 -c1

The issue was also reported on:

https://github.com/iputils/iputils/issues/104

But fixed in iputils in a wrong way by not binding to device when
destination IP is on device, and it will cause some of kselftests
to fail, as Jianlin noticed.

This patch is to use inet(6)_iif and inet(6)_sdif to get dif and
sdif for PING socket, and keep consistent with RAW socket.

Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2afc3b5a 22-Jan-2022 Xin Long <lucien.xin@gmail.com>

ping: fix the sk_bound_dev_if match in ping_lookup

When 'ping' changes to use PING socket instead of RAW socket by:

# sysctl -w net.ipv4.ping_group_range="0 100"

the selftests 'router_broadcast.sh' will fail, as such command

# ip vrf exec vrf-h1 ping -I veth0 198.51.100.255 -b

can't receive the response skb by the PING socket. It's caused by mismatch
of sk_bound_dev_if and dif in ping_rcv() when looking up the PING socket,
as dif is vrf-h1 if dif's master was set to vrf-h1.

This patch is to fix this regression by also checking the sk_bound_dev_if
against sdif so that the packets can stil be received even if the socket
is not bound to the vrf device but to the real iif.

Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
Reported-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 91a760b2 06-Jan-2022 Menglong Dong <imagedong@tencent.com>

net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()

The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
__inet_bind() is not handled properly. While the return value
is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and
exit:

err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
if (err) {
inet->inet_saddr = inet->inet_rcv_saddr = 0;
goto out_release_sock;
}

Let's take UDP for example and see what will happen. For UDP
socket, it will be added to 'udp_prot.h.udp_table->hash' and
'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port()
called success. If 'inet->inet_rcv_saddr' is specified here,
then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong
to (because inet_saddr is changed to 0), and UDP packet received
will not be passed to this sock. If 'inet->inet_rcv_saddr' is not
specified here, the sock will work fine, as it can receive packet
properly, which is wired, as the 'bind()' is already failed.

To undo the get_port() operation, introduce the 'put_port' field
for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP
proto, it is udp_lib_unhash(); For icmp proto, it is
ping_unhash().

Therefore, after sys_bind() fail caused by
BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which
means that it can try to be binded to another port.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com


# 8ff978b8 17-Nov-2021 Riccardo Paolo Bestetti <pbl@bestov.io>

ipv4/raw: support binding to nonlocal addresses

Add support to inet v4 raw sockets for binding to nonlocal addresses
through the IP_FREEBIND and IP_TRANSPARENT socket options, as well as
the ipv4.ip_nonlocal_bind kernel parameter.

Add helper function to inet_sock.h to check for bind address validity on
the base of the address type and whether nonlocal address are enabled
for the socket via any of the sockopts/sysctl, deduplicating checks in
ipv4/ping.c, ipv4/af_inet.c, ipv6/af_inet6.c (for mapped v4->v6
addresses), and ipv4/raw.c.

Add test cases with IP[V6]_FREEBIND verifying that both v4 and v6 raw
sockets support binding to nonlocal addresses after the change. Add
necessary support for the test cases to nettest.

Signed-off-by: Riccardo Paolo Bestetti <pbl@bestov.io>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20211117090010.125393-1-pbl@bestov.io
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# d198b277 13-Sep-2021 Eric Dumazet <edumazet@google.com>

Revert "Revert "ipv4: fix memory leaks in ip_cmsg_send() callers""

This reverts commit d7807a9adf4856171f8441f13078c33941df48ab.

As mentioned in https://lkml.org/lkml/2021/9/13/1819
5 years old commit 919483096bfe ("ipv4: fix memory leaks in ip_cmsg_send() callers")
was a correct fix.

ip_cmsg_send() can loop over multiple cmsghdr()

If IP_RETOPTS has been successful, but following cmsghdr generates an error,
we do not free ipc.ok

If IP_RETOPTS is not successful, we have freed the allocated temporary space,
not the one currently in ipc.opt.

Sure, code could be refactored, but let's not bring back old bugs.

Fixes: d7807a9adf48 ("Revert "ipv4: fix memory leaks in ip_cmsg_send() callers"")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d7807a9a 12-Sep-2021 Yajun Deng <yajun.deng@linux.dev>

Revert "ipv4: fix memory leaks in ip_cmsg_send() callers"

This reverts commit 919483096bfe75dda338e98d56da91a263746a0a.

There is only when ip_options_get() return zero need to free.
It already called kfree() when return error.

Fixes: 919483096bfe ("ipv4: fix memory leaks in ip_cmsg_send() callers")
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e3ae2365 27-Jun-2021 Alexander Aring <aahringo@redhat.com>

net: sock: introduce sk_error_report

This patch introduces a function wrapper to call the sk_error_report
callback. That will prepare to add additional handling whenever
sk_error_report is called, for example to trace socket errors.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9d44fa3e 09-Jun-2021 Zheng Yongjun <zhengyongjun3@huawei.com>

ping: Check return value of function 'ping_queue_rcv_skb'

Function 'ping_queue_rcv_skb' not always return success, which will
also return fail. If not check the wrong return value of it, lead to function
`ping_rcv` return success.

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 08baf54f 29-Mar-2021 Andreas Roeseler <andreas.a.roeseler@gmail.com>

net: add support for sending RFC 8335 PROBE messages

Modify the ping_supported function to support PROBE message types. This
allows tools such as the ping command in the iputils package to be
modified to send PROBE requests through the existing framework for
sending ping requests.

Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3df98d79 27-Sep-2020 Paul Moore <paul@paul-moore.com>

lsm,selinux: pass flowi_common instead of flowi to the LSM hooks

As pointed out by Herbert in a recent related patch, the LSM hooks do
not have the necessary address family information to use the flowi
struct safely. As none of the LSMs currently use any of the protocol
specific flowi information, replace the flowi pointers with pointers
to the address family independent flowi_common struct.

Reported-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>


# 5af68891 29-Aug-2020 Miaohe Lin <linmiaohe@huawei.com>

net: clean up codestyle

This is a pure codestyle cleanup patch. No functional change intended.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0ce779a9 25-Aug-2020 Miaohe Lin <linmiaohe@huawei.com>

net: Avoid unnecessary inet_addr_type() call when addr is INADDR_ANY

We can avoid unnecessary inet_addr_type() call by check addr against
INADDR_ANY first.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0316a211 25-Aug-2020 Miaohe Lin <linmiaohe@huawei.com>

net: Set ping saddr after we successfully get the ping port

We can defer set ping saddr until we successfully get the ping port. So we
can avoid clear saddr when failed. Since ping_clear_saddr() is not used
anymore now, remove it.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5eff0690 03-Jul-2020 Sabrina Dubroca <sd@queasysnail.net>

ipv4: fill fl4_icmp_{type,code} in ping_v4_sendmsg

IPv4 ping sockets don't set fl4.fl4_icmp_{type,code}, which leads to
incomplete IPsec ACQUIRE messages being sent to userspace. Currently,
both raw sockets and IPv6 ping sockets set those fields.

Expected output of "ip xfrm monitor":
acquire proto esp
sel src 10.0.2.15/32 dst 8.8.8.8/32 proto icmp type 8 code 0 dev ens4
policy src 10.0.2.15/32 dst 8.8.8.8/32
<snip>

Currently with ping sockets:
acquire proto esp
sel src 10.0.2.15/32 dst 8.8.8.8/32 proto icmp type 0 code 0 dev ens4
policy src 10.0.2.15/32 dst 8.8.8.8/32
<snip>

The Libreswan test suite found this problem after Fedora changed the
value for the sysctl net.ipv4.ping_group_range.

Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
Reported-by: Paul Wouters <pwouters@redhat.com>
Tested-by: Paul Wouters <pwouters@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c6af0c22 11-Sep-2019 Willem de Bruijn <willemb@google.com>

ip: support SO_MARK cmsg

Enable setting skb->mark for UDP and RAW sockets using cmsg.

This is analogous to existing support for TOS, TTL, txtime, etc.

Packet sockets already support this as of commit c7d39e32632e
("packet: support per-packet fwmark for af_packet sendmsg").

Similar to other fields, implement by
1. initialize the sockcm_cookie.mark from socket option sk_mark
2. optionally overwrite this in ip_cmsg_send/ip6_datagram_send_ctl
3. initialize inet_cork.mark from sockcm_cookie.mark
4. initialize each (usually just one) skb->mark from inet_cork.mark

Step 1 is handled in one location for most protocols by ipcm_init_sk
as of commit 351782067b6b ("ipv4: ipcm_cookie initializers").

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2874c5fd 27-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ea9a0379 17-May-2019 Patrick Talbert <ptalbert@redhat.com>

net: Treat sock->sk_drops as an unsigned int when printing

Currently, procfs socket stats format sk_drops as a signed int (%d). For large
values this will cause a negative number to be printed.

We know the drop count can never be a negative so change the format specifier to
%u.

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 854da991 01-Oct-2018 Robert Shearman <rshearma@vyatta.att-mail.com>

ipv4: Allow sending multicast packets on specific i/f using VRF socket

It is useful to be able to use the same socket for listening in a
specific VRF, as for sending multicast packets out of a specific
interface. However, the bound device on the socket currently takes
precedence and results in the packets not being sent.

Relax the condition on overriding the output interface to use for
sending packets out of UDP, raw and ping sockets to allow multicast
packets to be sent using the specified multicast interface.

Signed-off-by: Robert Shearman <rshearma@vyatta.att-mail.com>
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 83ba4645 31-Jul-2018 Vincent Bernat <vincent@bernat.im>

net: add helpers checking if socket can be bound to nonlocal address

The construction "net->ipv4.sysctl_ip_nonlocal_bind || inet->freebind
|| inet->transparent" is present three times and its IPv6 counterpart
is also present three times. We introduce two small helpers to
characterize these tests uniformly.

Signed-off-by: Vincent Bernat <vincent@bernat.im>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 678ca42d 06-Jul-2018 Willem de Bruijn <willemb@google.com>

ip: remove tx_flags from ipcm_cookie and use same logic for v4 and v6

skb_shinfo(skb)->tx_flags is derived from sk->sk_tsflags, possibly
after modification by __sock_cmsg_send, by calling sock_tx_timestamp.

The IPv4 and IPv6 paths do this conversion differently. In IPv4, the
individual protocols that support tx timestamps call this function
and store the result in ipc.tx_flags. In IPv6, sock_tx_timestamp is
called in __ip6_append_data.

There is no need to store both tx_flags and ts_flags in the cookie
as one is derived from the other. Convert when setting up the cork
and remove the redundant field. This is similar to IPv6, only have
the conversion happen only once per datagram, in ip(6)_setup_cork.

Also change __ip6_append_data to match __ip_append_data. Only update
tskey if timestamping is enabled with OPT_ID. The SOCK_.. test is
redundant: only valid protocols can have non-zero cork->tx_flags.

After this change the IPv4 and IPv6 logic is the same.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 35178206 06-Jul-2018 Willem de Bruijn <willemb@google.com>

ipv4: ipcm_cookie initializers

Initialize the cookie in one location to reduce code duplication and
avoid bugs from inconsistent initialization, such as that fixed in
commit 9887cba19978 ("ip: limit use of gso_size to udp").

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bc969a97 03-Jul-2018 Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>

net: ipv4: Hook into time based transmission

Add a transmit_time field to struct inet_cork, then copy the
timestamp from the CMSG cookie at ip_setup_cork() so we can
safely copy it into the skb later during __ip_make_skb().

For the raw fast path, just perform the copy at raw_send_hdrinc().

Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c3506372 10-Apr-2018 Christoph Hellwig <hch@lst.de>

proc: introduce proc_create_net{,_data}

Variants of proc_create{,_data} that directly take a struct seq_operations
and deal with network namespaces in ->open and ->release. All callers of
proc_create + seq_open_net converted over, and seq_{open,release}_net are
removed entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>


# f4550221 10-Apr-2018 Christoph Hellwig <hch@lst.de>

ipv{4,6}/ping: simplify proc file creation

Remove the pointless ping_seq_afinfo indirection and make the code look
like most other protocols.

Signed-off-by: Christoph Hellwig <hch@lst.de>


# 1b97013b 10-May-2018 Andrey Ignatov <rdna@fb.com>

ipv4: fix memory leaks in udp_sendmsg, ping_v4_sendmsg

Fix more memory leaks in ip_cmsg_send() callers. Part of them were fixed
earlier in 919483096bfe.

* udp_sendmsg one was there since the beginning when linux sources were
first added to git;
* ping_v4_sendmsg one was copy/pasted in c319b4d76b9e.

Whenever return happens in udp_sendmsg() or ping_v4_sendmsg() IP options
have to be freed if they were allocated previously.

Add label so that future callers (if any) can use it instead of kfree()
before return that is easy to forget.

Fixes: c319b4d76b9e (net: ipv4: add IPPROTO_ICMP socket kind)
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2f635cee 27-Mar-2018 Kirill Tkhai <ktkhai@virtuozzo.com>

net: Drop pernet_operations::async

Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d6444062 23-Mar-2018 Joe Perches <joe@perches.com>

net: Use octal not symbolic permissions

Prefer the direct use of octal for permissions.

Done with checkpatch -f --types=SYMBOLIC_PERMS --fix-inplace
and some typing.

Miscellanea:

o Whitespace neatening around these conversions.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f84c6821 12-Feb-2018 Kirill Tkhai <ktkhai@virtuozzo.com>

net: Convert pernet_subsys, registered from inet_init()

arp_net_ops just addr/removes /proc entry.

devinet_ops allocates and frees duplicate of init_net tables
and (un)registers sysctl entries.

fib_net_ops allocates and frees pernet tables, creates/destroys
netlink socket and (un)initializes /proc entries. Foreign
pernet_operations do not touch them.

ip_rt_proc_ops only modifies pernet /proc entries.

xfrm_net_ops creates/destroys /proc entries, allocates/frees
pernet statistics, hashes and tables, and (un)initializes
sysctl files. These are not touched by foreigh pernet_operations

xfrm4_net_ops allocates/frees private pernet memory, and
configures sysctls.

sysctl_route_ops creates/destroys sysctls.

rt_genid_ops only initializes fields of just allocated net.

ipv4_inetpeer_ops allocated/frees net private memory.

igmp_net_ops just creates/destroys /proc files and socket,
noone else interested in.

tcp_sk_ops seems to be safe, because tcp_sk_init() does not
depend on any other pernet_operations modifications. Iteration
over hash table in inet_twsk_purge() is made under RCU lock,
and it's safe to iterate the table this way. Removing from
the table happen from inet_twsk_deschedule_put(), but this
function is safe without any extern locks, as it's synchronized
inside itself. There are many examples, it's used in different
context. So, it's safe to leave tcp_sk_exit_batch() unlocked.

tcp_net_metrics_ops is synchronized on tcp_metrics_lock and safe.

udplite4_net_ops only creates/destroys pernet /proc file.

icmp_sk_ops creates percpu sockets, not touched by foreign
pernet_operations.

ipmr_net_ops creates/destroys pernet fib tables, (un)registers
fib rules and /proc files. This seem to be safe to execute
in parallel with foreign pernet_operations.

af_inet_ops just sets up default parameters of newly created net.

ipv4_mib_ops creates and destroys pernet percpu statistics.

raw_net_ops, tcp4_net_ops, udp4_net_ops, ping_v4_net_ops
and ip_proc_ops only create/destroy pernet /proc files.

ip4_frags_ops creates and destroys sysctl file.

So, it's safe to make the pernet_operations async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 41c6d650 30-Jun-2017 Reshetova, Elena <elena.reshetova@intel.com>

net: convert sock.sk_refcnt from atomic_t to refcount_t

refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

This patch uses refcount_inc_not_zero() instead of
atomic_inc_not_zero_hint() due to absense of a _hint()
version of refcount API. If the hint() version must
be used, we might need to revisit API.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 43a66845 24-Mar-2017 Eric Dumazet <edumazet@google.com>

ping: implement proper locking

We got a report of yet another bug in ping

http://www.openwall.com/lists/oss-security/2017/03/24/6

->disconnect() is not called with socket lock held.

Fix this by acquiring ping rwlock earlier.

Thanks to Daniel, Alexander and Andrey for letting us know this problem.

Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Daniel Jiang <danieljiang0415@gmail.com>
Reported-by: Solar Designer <solar@openwall.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 73d2c667 07-Feb-2017 WANG Cong <xiyou.wangcong@gmail.com>

ping: fix a null pointer dereference

Andrey reported a kernel crash:

general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 2 PID: 3880 Comm: syz-executor1 Not tainted 4.10.0-rc6+ #124
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff880060048040 task.stack: ffff880069be8000
RIP: 0010:ping_v4_push_pending_frames net/ipv4/ping.c:647 [inline]
RIP: 0010:ping_v4_sendmsg+0x1acd/0x23f0 net/ipv4/ping.c:837
RSP: 0018:ffff880069bef8b8 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: ffff880069befb90 RCX: 0000000000000000
RDX: 0000000000000018 RSI: ffff880069befa30 RDI: 00000000000000c2
RBP: ffff880069befbb8 R08: 0000000000000008 R09: 0000000000000000
R10: 0000000000000002 R11: 0000000000000000 R12: ffff880069befab0
R13: ffff88006c624a80 R14: ffff880069befa70 R15: 0000000000000000
FS: 00007f6f7c716700(0000) GS:ffff88006de00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004a6f28 CR3: 000000003a134000 CR4: 00000000000006e0
Call Trace:
inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:744
sock_sendmsg_nosec net/socket.c:635 [inline]
sock_sendmsg+0xca/0x110 net/socket.c:645
SYSC_sendto+0x660/0x810 net/socket.c:1687
SyS_sendto+0x40/0x50 net/socket.c:1655
entry_SYSCALL_64_fastpath+0x1f/0xc2

This is because we miss a check for NULL pointer for skb_peek() when
the queue is empty. Other places already have the same check.

Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0dec879f 06-Feb-2017 Julian Anastasov <ja@ssi.bg>

net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TP

When same struct dst_entry can be used for many different
neighbours we can not use it for pending confirmations.

The datagram protocols can use MSG_CONFIRM to confirm the
neighbour. When used with MSG_PROBE we do not reach the
code where neighbour is confirmed, so we have to do the
same slow lookup by using the dst_confirm_neigh() helper.
When MSG_PROBE is not used, ip_append_data/ip6_append_data
will set the skb flag dst_pending_confirm.

Reported-by: YueHaibing <yuehaibing@huawei.com>
Fixes: 5110effee8fd ("net: Do delayed neigh confirmation.")
Fixes: f2bb4bedf35d ("ipv4: Cache output routes in fib_info nexthops.")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a7ef6715 16-Jan-2017 Gao Feng <fgao@ikuai8.com>

net: ping: Use right format specifier to avoid type casting

The inet_num is u16, so use %hu instead of casting it to int. And
the sk_bound_dev_if is int actually, so it needn't cast to int.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0b62fca2 03-Nov-2016 Al Viro <viro@zeniv.linux.org.uk>

switch getfrag callbacks to ..._full() primitives

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0eab121e 05-Dec-2016 Kees Cook <keescook@chromium.org>

net: ping: check minimum size on ICMP header length

Prior to commit c0371da6047a ("put iov_iter into msghdr") in v3.19, there
was no check that the iovec contained enough bytes for an ICMP header,
and the read loop would walk across neighboring stack contents. Since the
iov_iter conversion, bad arguments are noticed, but the returned error is
EFAULT. Returning EINVAL is a clearer error and also solves the problem
prior to v3.19.

This was found using trinity with KASAN on v3.18:

BUG: KASAN: stack-out-of-bounds in memcpy_fromiovec+0x60/0x114 at addr ffffffc071077da0
Read of size 8 by task trinity-c2/9623
page:ffffffbe034b9a08 count:0 mapcount:0 mapping: (null) index:0x0
flags: 0x0()
page dumped because: kasan: bad access detected
CPU: 0 PID: 9623 Comm: trinity-c2 Tainted: G BU 3.18.0-dirty #15
Hardware name: Google Tegra210 Smaug Rev 1,3+ (DT)
Call trace:
[<ffffffc000209c98>] dump_backtrace+0x0/0x1ac arch/arm64/kernel/traps.c:90
[<ffffffc000209e54>] show_stack+0x10/0x1c arch/arm64/kernel/traps.c:171
[< inline >] __dump_stack lib/dump_stack.c:15
[<ffffffc000f18dc4>] dump_stack+0x7c/0xd0 lib/dump_stack.c:50
[< inline >] print_address_description mm/kasan/report.c:147
[< inline >] kasan_report_error mm/kasan/report.c:236
[<ffffffc000373dcc>] kasan_report+0x380/0x4b8 mm/kasan/report.c:259
[< inline >] check_memory_region mm/kasan/kasan.c:264
[<ffffffc00037352c>] __asan_load8+0x20/0x70 mm/kasan/kasan.c:507
[<ffffffc0005b9624>] memcpy_fromiovec+0x5c/0x114 lib/iovec.c:15
[< inline >] memcpy_from_msg include/linux/skbuff.h:2667
[<ffffffc000ddeba0>] ping_common_sendmsg+0x50/0x108 net/ipv4/ping.c:674
[<ffffffc000dded30>] ping_v4_sendmsg+0xd8/0x698 net/ipv4/ping.c:714
[<ffffffc000dc91dc>] inet_sendmsg+0xe0/0x12c net/ipv4/af_inet.c:749
[< inline >] __sock_sendmsg_nosec net/socket.c:624
[< inline >] __sock_sendmsg net/socket.c:632
[<ffffffc000cab61c>] sock_sendmsg+0x124/0x164 net/socket.c:643
[< inline >] SYSC_sendto net/socket.c:1797
[<ffffffc000cad270>] SyS_sendto+0x178/0x1d8 net/socket.c:1761

CVE-2016-8399

Reported-by: Qidan He <i@flanker017.me>
Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e2d118a1 03-Nov-2016 Lorenzo Colitti <lorenzo@google.com>

net: inet: Support UID-based routing in IP protocols.

- Use the UID in routing lookups made by protocol connect() and
sendmsg() functions.
- Make sure that routing lookups triggered by incoming packets
(e.g., Path MTU discovery) take the UID of the socket into
account.
- For packets not associated with a userspace socket, (e.g., ping
replies) use UID 0 inside the user namespace corresponding to
the network namespace the socket belongs to. This allows
all namespaces to apply routing and iptables rules to
kernel-originated traffic in that namespaces by matching UID 0.
This is better than using the UID of the kernel socket that is
sending the traffic, because the UID of kernel sockets created
at namespace creation time (e.g., the per-processor ICMP and
TCP sockets) is the UID of the user that created the socket,
which might not be mapped in the namespace.

Tested: compiles allnoconfig, allyesconfig, allmodconfig
Tested: https://android-review.googlesource.com/253302
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 286c72de 20-Oct-2016 Eric Dumazet <edumazet@google.com>

udp: must lock the socket in udp_disconnect()

Baozeng Ding reported KASAN traces showing uses after free in
udp_lib_get_port() and other related UDP functions.

A CONFIG_DEBUG_PAGEALLOC=y kernel would eventually crash.

I could write a reproducer with two threads doing :

static int sock_fd;
static void *thr1(void *arg)
{
for (;;) {
connect(sock_fd, (const struct sockaddr *)arg,
sizeof(struct sockaddr_in));
}
}

static void *thr2(void *arg)
{
struct sockaddr_in unspec;

for (;;) {
memset(&unspec, 0, sizeof(unspec));
connect(sock_fd, (const struct sockaddr *)&unspec,
sizeof(unspec));
}
}

Problem is that udp_disconnect() could run without holding socket lock,
and this was causing list corruptions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 81243eac 07-Oct-2016 Alexey Dobriyan <adobriyan@gmail.com>

cred: simpler, 1D supplementary groups

Current supplementary groups code can massively overallocate memory and
is implemented in a way so that access to individual gid is done via 2D
array.

If number of gids is <= 32, memory allocation is more or less tolerable
(140/148 bytes). But if it is not, code allocates full page (!)
regardless and, what's even more fun, doesn't reuse small 32-entry
array.

2D array means dependent shifts, loads and LEAs without possibility to
optimize them (gid is never known at compile time).

All of the above is unnecessary. Switch to the usual
trailing-zero-len-array scheme. Memory is allocated with
kmalloc/vmalloc() and only as much as needed. Accesses become simpler
(LEA 8(gi,idx,4) or even without displacement).

Maximum number of gids is 65536 which translates to 256KB+8 bytes. I
think kernel can handle such allocation.

On my usual desktop system with whole 9 (nine) aux groups, struct
group_info shrinks from 148 bytes to 44 bytes, yay!

Nice side effects:

- "gi->gid[i]" is shorter than "GROUP_AT(gi, i)", less typing,

- fix little mess in net/ipv4/ping.c
should have been using GROUP_AT macro but this point becomes moot,

- aux group allocation is persistent and should be accounted as such.

Link: http://lkml.kernel.org/r/20160817201927.GA2096@p183.telecom.by
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c14ac945 02-Apr-2016 Soheil Hassas Yeganeh <soheil@google.com>

sock: enable timestamping using control messages

Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
This is very costly when users want to sample writes to gather
tx timestamps.

Add support for enabling SO_TIMESTAMPING via control messages by
using tsflags added in `struct sockcm_cookie` (added in the previous
patches in this series) to set the tx_flags of the last skb created in
a sendmsg. With this patch, the timestamp recording bits in tx_flags
of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.

Please note that this is only effective for overriding the recording
timestamps flags. Users should enable timestamp reporting (e.g.,
SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
socket options and then should ask for SOF_TIMESTAMPING_TX_*
using control messages per sendmsg to sample timestamps for each
write.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 24025c46 02-Apr-2016 Soheil Hassas Yeganeh <soheil@google.com>

ipv4: process socket-level control messages in IPv4

Process socket-level control messages by invoking
__sock_cmsg_send in ip_cmsg_send for control messages on
the SOL_SOCKET layer.

This makes sure whenever ip_cmsg_send is called in udp, icmp,
and raw, we also process socket-level control messages.

Note that this commit interprets new control messages that
were ignored before. As such, this commit does not change
the behavior of IPv4 control messages.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a9d56235 29-Feb-2016 Eric Engestrom <eric.engestrom@imgtec.com>

net/ipv4: remove left over dead code

8cc785f6f429c2a3fb81745dc142cbd72a462c4a ("net: ipv4: make the ping
/proc code AF-independent") removed the code using it, but renamed this
variable instead of removing it.

Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 91948309 04-Feb-2016 Eric Dumazet <edumazet@google.com>

ipv4: fix memory leaks in ip_cmsg_send() callers

Dmitry reported memory leaks of IP options allocated in
ip_cmsg_send() when/if this function returns an error.

Callers are responsible for the freeing.

Many thanks to Dmitry for the report and diagnostic.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 086c653f 10-Feb-2016 Craig Gallek <kraig@google.com>

sock: struct proto hash function may error

In order to support fast reuseport lookups in TCP, the hash function
defined in struct proto must be capable of returning an error code.
This patch changes the function signature of all related hash functions
to return an integer and handles or propagates this return value at
all call sites.

Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ad64b8be 06-Jan-2016 Lance Richardson <lrichard@redhat.com>

ipv4: eliminate lock count warnings in ping.c

Add lock release/acquire annotations to ping_seq_start() and
ping_seq_stop() to satisfy sparse.

Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 35a256fe 08-Jul-2015 Tom Herbert <tom@herbertland.com>

ipv6: Nonlocal bind

Add support to allow non-local binds similar to how this was done for IPv4.
Non-local binds are very useful in emulating the Internet in a box, etc.

This add the ip_nonlocal_bind sysctl under ipv6.

Testing:

Set up nonlocal binding and receive routing on a host, e.g.:

ip -6 rule add from ::/0 iif eth0 lookup 200
ip -6 route add local 2001:0:0:1::/64 dev lo proto kernel scope host table 200
sysctl -w net.ipv6.ip_nonlocal_bind=1

Set up routing to 2001:0:0:1::/64 on peer to go to first host

ping6 -I 2001:0:0:1::1 peer-address -- to verify

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a134f083 01-May-2015 David S. Miller <davem@davemloft.net>

ipv4: Missing sk_nulls_node_init() in ping_unhash().

If we don't do that, then the poison value is left in the ->pprev
backlink.

This can cause crashes if we do a disconnect, followed by a connect().

Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Wen Xu <hotdog3645@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 00db4124 03-Apr-2015 Ian Morris <ipm@chirality.org.uk>

ipv4: coding style: comparison for inequality with NULL

The ipv4 code uses a mixture of coding styles. In some instances check
for non-NULL pointer is done as x != NULL and sometimes as x. x is
preferred according to checkpatch and this patch makes the code
consistent by adopting the latter form.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 51456b29 03-Apr-2015 Ian Morris <ipm@chirality.org.uk>

ipv4: coding style: comparison for equality with NULL

The ipv4 code uses a mixture of coding styles. In some instances check
for NULL pointer is done as x == NULL and sometimes as !x. !x is
preferred according to checkpatch and this patch makes the code
consistent by adopting the latter form.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6eada011 18-Mar-2015 Eric Dumazet <edumazet@google.com>

netns: constify net_hash_mix() and various callers

const qualifiers ease code review by making clear
which objects are not written in a function.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9145736d 03-Mar-2015 Lorenzo Colitti <lorenzo@google.com>

net: ping: Return EAFNOSUPPORT when appropriate.

1. For an IPv4 ping socket, ping_check_bind_addr does not check
the family of the socket address that's passed in. Instead,
make it behave like inet_bind, which enforces either that the
address family is AF_INET, or that the family is AF_UNSPEC and
the address is 0.0.0.0.
2. For an IPv6 ping socket, ping_check_bind_addr returns EINVAL
if the socket family is not AF_INET6. Return EAFNOSUPPORT
instead, for consistency with inet6_bind.
3. Make ping_v4_sendmsg and ping_v6_sendmsg return EAFNOSUPPORT
instead of EINVAL if an incorrect socket address structure is
passed in.
4. Make IPv6 ping sockets be IPv6-only. The code does not support
IPv4, and it cannot easily be made to support IPv4 because
the protocol numbers for ICMP and ICMPv6 are different. This
makes connect(::ffff:192.0.2.1) fail with EAFNOSUPPORT instead
of making the socket unusable.

Among other things, this fixes an oops that can be triggered by:

int s = socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);
struct sockaddr_in6 sin6 = {
.sin6_family = AF_INET6,
.sin6_addr = in6addr_any,
};
bind(s, (struct sockaddr *) &sin6, sizeof(sin6));

Change-Id: If06ca86d9f1e4593c0d6df174caca3487c57a241
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1b784140 02-Mar-2015 Ying Xue <ying.xue@windriver.com>

net: Remove iocb argument from sendmsg and recvmsg

After TIPC doesn't depend on iocb argument in its internal
implementations of sendmsg() and recvmsg() hooks defined in proto
structure, no any user is using iocb argument in them at all now.
Then we can drop the redundant iocb argument completely from kinds of
implementations of both sendmsg() and recvmsg() in the entire
networking stack.

Cc: Christoph Hellwig <hch@lst.de>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 21226abb 28-Nov-2014 Al Viro <viro@zeniv.linux.org.uk>

net: switch memcpy_fromiovec()/memcpy_fromiovecend() users to copy_from_iter()

That takes care of the majority of ->sendmsg() instances - most of them
via memcpy_to_msg() or assorted getfrag() callbacks. One place where we
still keep memcpy_fromiovecend() is tipc - there we potentially read the
same data over and over; separate patch, that...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# cacdc7d2 27-Nov-2014 Al Viro <viro@zeniv.linux.org.uk>

ip: stash a pointer to msghdr in struct ping_fakehdr

... instead of storing its ->mgs_iter.iov there

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# fc752f1f 23-Jan-2015 Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>

ping: Fix race in free in receive path

An exception is seen in ICMP ping receive path where the skb
destructor sock_rfree() tries to access a freed socket. This happens
because ping_rcv() releases socket reference with sock_put() and this
internally frees up the socket. Later icmp_rcv() will try to free the
skb and as part of this, skb destructor is called and which leads
to a kernel panic as the socket is freed already in ping_rcv().

-->|exception
-007|sk_mem_uncharge
-007|sock_rfree
-008|skb_release_head_state
-009|skb_release_all
-009|__kfree_skb
-010|kfree_skb
-011|icmp_rcv
-012|ip_local_deliver_finish

Fix this incorrect free by cloning this skb and processing this cloned
skb instead.

This patch was suggested by Eric Dumazet

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c0371da6 24-Nov-2014 Al Viro <viro@zeniv.linux.org.uk>

put iov_iter into msghdr

Note that the code _using_ ->msg_iter at that point will be very
unhappy with anything other than unshifted iovec-backed iov_iter.
We still need to convert users to proper primitives.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# f4713a3d 26-Nov-2014 Willem de Bruijn <willemb@google.com>

net-timestamp: make tcp_recvmsg call ipv6_recv_error for AF_INET6 socks

TCP timestamping introduced MSG_ERRQUEUE handling for TCP sockets.
If the socket is of family AF_INET6, call ipv6_recv_error instead
of ip_recv_error.

This change is more complex than a single branch due to the loadable
ipv6 module. It reuses a pre-existing indirect function call from
ping. The ping code is safe to call, because it is part of the core
ipv6 module and always present when AF_INET6 sockets are active.

Fixes: 4ed2d765 (net-timestamp: TCP timestamping)
Signed-off-by: Willem de Bruijn <willemb@google.com>

----

It may also be worthwhile to add WARN_ON_ONCE(sk->family == AF_INET6)
to ip_recv_error.
Signed-off-by: David S. Miller <davem@davemloft.net>


# 91a0b603 24-Nov-2014 Jane Zhou <a17711@motorola.com>

net/ping: handle protocol mismatching scenario

ping_lookup() may return a wrong sock if sk_buff's and sock's protocols
dont' match. For example, sk_buff's protocol is ETH_P_IPV6, but sock's
sk_family is AF_INET, in that case, if sk->sk_bound_dev_if is zero, a wrong
sock will be returned.
the fix is to "continue" the searching, if no matching, return NULL.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Jane Zhou <a17711@motorola.com>
Signed-off-by: Yiwei Zhao <gbjc64@motorola.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6ce8e9ce 06-Apr-2014 Al Viro <viro@zeniv.linux.org.uk>

new helper: memcpy_from_msg()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e3e32170 17-Nov-2014 Rick Jones <rick.jones2@hp.com>

icmp: Remove some spurious dropped packet profile hits from the ICMP path

If icmp_rcv() has successfully processed the incoming ICMP datagram, we
should use consume_skb() rather than kfree_skb() because a hit on the likes
of perf -e skb:kfree_skb is not called-for.

Signed-off-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 51f3d02b 05-Nov-2014 David S. Miller <davem@davemloft.net>

net: Add and use skb_copy_datagram_msg() helper.

This encapsulates all of the skb_copy_datagram_iovec() callers
with call argument signature "skb, offset, msghdr->msg_iov, length".

When we move to iov_iters in the networking, the iov_iter object will
sit in the msghdr.

Having a helper like this means there will be less places to touch
during that transformation.

Based upon descriptions and patch from Al Viro.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 49a60158 05-Sep-2014 Vincent Bernat <vincent@bernat.im>

net/ipv4: bind ip_nonlocal_bind to current netns

net.ipv4.ip_nonlocal_bind sysctl was global to all network
namespaces. This patch allows to set a different value for each
network namespace.

Signed-off-by: Vincent Bernat <vincent@bernat.im>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4330487a 31-Jul-2014 Duan Jiong <duanj.fnst@cn.fujitsu.com>

net: use inet6_iif instead of IP6CB()->iif

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ba6b918a 06-May-2014 Cong Wang <xiyou.wangcong@gmail.com>

ping: move ping_group_range out of CONFIG_SYSCTL

Similarly, when CONFIG_SYSCTL is not set, ping_group_range should still
work, just that no one can change it. Therefore we should move it out of
sysctl_net_ipv4.c. And, it should not share the same seqlock with
ip_local_port_range.

BTW, rename it to ->ping_group_range instead.

Cc: David S. Miller <davem@davemloft.net>
Cc: Francois Romieu <romieu@fr.zoreil.com>
Reported-by: Stefan de Konink <stefan@konink.de>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c9d8f1a6 06-May-2014 Cong Wang <xiyou.wangcong@gmail.com>

ipv4: move local_port_range out of CONFIG_SYSCTL

When CONFIG_SYSCTL is not set, ip_local_port_range should still work,
just that no one can change it. Therefore we should move it out of sysctl_inet.c.
Also, rename it to ->ip_local_ports instead.

Cc: David S. Miller <davem@davemloft.net>
Cc: Francois Romieu <romieu@fr.zoreil.com>
Reported-by: Stefan de Konink <stefan@konink.de>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b04c4619 13-Apr-2014 Wang, Xiaoming <xiaoming.wang@intel.com>

net: ipv4: current group_info should be put after using.

Plug a group_info refcount leak in ping_init.
group_info is only needed during initialization and
the code failed to release the reference on exit.
While here move grabbing the reference to a place
where it is actually needed.

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
Signed-off-by: Zhang Dongxing <dongxing.zhang@intel.com>
Signed-off-by: xiaoming wang <xiaoming.wang@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c8e6ad08 18-Feb-2014 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: honor IPV6_PKTINFO with v4 mapped addresses on sendmsg

In case we decide in udp6_sendmsg to send the packet down the ipv4
udp_sendmsg path because the destination is either of family AF_INET or
the destination is an ipv4 mapped ipv6 address, we don't honor the
maybe specified ipv4 mapped ipv6 address in IPV6_PKTINFO.

We simply can check for this option in ip_cmsg_send because no calls to
ipv6 module functions are needed to do so.

Reported-by: Gert Doering <gert@space.net>
Cc: Tore Anderson <tore@fud.no>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 82b276cd 19-Jan-2014 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: protect protocols not handling ipv4 from v4 connection/bind attempts

Some ipv6 protocols cannot handle ipv4 addresses, so we must not allow
connecting and binding to them. sendmsg logic does already check msg->name
for this but must trust already connected sockets which could be set up
for connection to ipv4 address family.

Per-socket flag ipv6only is of no use here, as it is under users control
by setsockopt.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4b261c75 19-Jan-2014 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: make IPV6_RECVPKTINFO work for ipv4 datagrams

We currently don't report IPV6_RECVPKTINFO in cmsg access ancillary data
for IPv4 datagrams on IPv6 sockets.

This patch splits the ip6_datagram_recv_ctl into two functions, one
which handles both protocol families, AF_INET and AF_INET6, while the
ip6_datagram_recv_specific_ctl only handles IPv6 cmsg data.

ip6_datagram_recv_*_ctl never reported back any errors, so we can make
them return void. Also provide a helper for protocols which don't offer dual
personality to further use ip6_datagram_recv_ctl, which is exported to
modules.

I needed to shuffle the code for ping around a bit to make it easier to
implement dual personality for ping ipv6 sockets in future.

Reported-by: Gert Doering <gert@space.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 342dfc30 17-Jan-2014 Steffen Hurrle <steffen@hurrle.net>

net: add build-time checks for msg->msg_name size

This is a follow-up patch to f3d3342602f8bc ("net: rework recvmsg
handler msg_name and msg_namelen logic").

DECLARE_SOCKADDR validates that the structure we use for writing the
name information to is not larger than the buffer which is reserved
for msg->msg_name (which is 128 bytes). Also use DECLARE_SOCKADDR
consistently in sendmsg code paths.

Signed-off-by: Steffen Hurrle <steffen@hurrle.net>
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ea074b34 28-Dec-2013 Stephen Hemminger <stephen@networkplumber.org>

ipv4: ping make local stuff static

Don't export ping_table or ping_v4_sendmsg. Both are only used
inside ping code.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f1d8cba6 28-Nov-2013 Eric Dumazet <edumazet@google.com>

inet: fix possible seqlock deadlocks

In commit c9e9042994d3 ("ipv4: fix possible seqlock deadlock") I left
another places where IP_INC_STATS_BH() were improperly used.

udp_sendmsg(), ping_v4_sendmsg() and tcp_v4_connect() are called from
process context, not from softirq context.

This was detected by lockdep seqlock support.

Reported-by: jongman heo <jongman.heo@samsung.com>
Fixes: 584bdf8cbdf6 ("[IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP")
Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 85fbaa75 22-Nov-2013 Hannes Frederic Sowa <hannes@stressinduktion.org>

inet: fix addr_len/msg->msg_namelen assignment in recv_error and rxpmtu functions

Commit bceaa90240b6019ed73b49965eac7d167610be69 ("inet: prevent leakage
of uninitialized memory to user in recv syscalls") conditionally updated
addr_len if the msg_name is written to. The recv_error and rxpmtu
functions relied on the recvmsg functions to set up addr_len before.

As this does not happen any more we have to pass addr_len to those
functions as well and set it to the size of the corresponding sockaddr
length.

This broke traceroute and such.

Fixes: bceaa90240b6 ("inet: prevent leakage of uninitialized memory to user in recv syscalls")
Reported-by: Brad Spengler <spender@grsecurity.net>
Reported-by: Tom Labanowski
Cc: mpb <mpb.mail@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cf970c00 17-Nov-2013 Hannes Frederic Sowa <hannes@stressinduktion.org>

ping: prevent NULL pointer dereference on write to msg_name

A plain read() on a socket does set msg->msg_name to NULL. So check for
NULL pointer first.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bceaa902 17-Nov-2013 Hannes Frederic Sowa <hannes@stressinduktion.org>

inet: prevent leakage of uninitialized memory to user in recv syscalls

Only update *addr_len when we actually fill in sockaddr, otherwise we
can return uninitialized memory from the stack to the caller in the
recvfrom, recvmmsg and recvmsg syscalls. Drop the the (addr_len == NULL)
checks because we only get called with a valid addr_len pointer either
from sock_common_recvmsg or inet_recvmsg.

If a blocking read waits on a socket which is concurrently shut down we
now return zero and set msg_msgnamelen to 0.

Reported-by: mpb <mpb.mail@gmail.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 652586df 14-Nov-2013 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

seq_file: remove "%n" usage from seq_file users

All seq_printf() users are using "%n" for calculating padding size,
convert them to use seq_setwidth() / seq_pad() pair.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Joe Perches <joe@perches.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c2bb06db 09-Oct-2013 Eric Dumazet <edumazet@google.com>

net: fix build errors if ipv6 is disabled

CONFIG_IPV6=n is still a valid choice ;)

It appears we can remove dead code.

Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# efe4208f 03-Oct-2013 Eric Dumazet <edumazet@google.com>

ipv6: make lookups simpler and faster

TCP listener refactoring, part 4 :

To speed up inet lookups, we moved IPv4 addresses from inet to struct
sock_common

Now is time to do the same for IPv6, because it permits us to have fast
lookups for all kind of sockets, including upcoming SYN_RECV.

Getting IPv6 addresses in TCP lookups currently requires two extra cache
lines, plus a dereference (and memory stall).

inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6

This patch is way bigger than its IPv4 counter part, because for IPv4,
we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
it's not doable easily.

inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr

And timewait socket also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
at the same offset.

We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
macro.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0bbf87d8 28-Sep-2013 Eric W. Biederman <ebiederm@xmission.com>

net ipv4: Convert ipv4.ip_local_port_range to be per netns v3

- Move sysctl_local_ports from a global variable into struct netns_ipv4.
- Modify inet_get_local_port_range to take a struct net, and update all
of the callers.
- Move the initialization of sysctl_local_ports into
sysctl_net_ipv4.c:ipv4_sysctl_init_net from inet_connection_sock.c

v2:
- Ensure indentation used tabs
- Fixed ip.h so it applies cleanly to todays net-next

v3:
- Compile fixes of strange callers of inet_get_local_port_range.
This patch now successfully passes an allmodconfig build.
Removed manual inlining of inet_get_local_port_range in ipv4_local_port_range

Originally-by: Samya <samya@twitter.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# aa661581 24-Sep-2013 Francesco Fusco <ffusco@redhat.com>

ipv4: processing ancillary IP_TOS or IP_TTL

If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out
packets with the specified TTL or TOS overriding the socket values specified
with the traditional setsockopt().

The struct inet_cork stores the values of TOS, TTL and priority that are
passed through the struct ipcm_cookie. If there are user-specified TOS
(tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are
used to override the per-socket values. In case of TOS also the priority
is changed accordingly.

Two helper functions get_rttos and get_rtconn_flags are defined to take
into account the presence of a user specified TOS value when computing
RT_TOS and RT_CONN_FLAGS.

Signed-off-by: Francesco Fusco <ffusco@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d14c5ab6 15-Aug-2013 Francesco Fusco <ffusco@redhat.com>

net: proc_fs: trivial: print UIDs as unsigned int

UIDs are printed in the proc_fs as signed int, whereas
they are unsigned int.

Signed-off-by: Francesco Fusco <ffusco@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a06a2d37 12-Jun-2013 Wu Fengguang <fengguang.wu@intel.com>

net: ping_check_bind_addr() etc. can be static

net/ipv4/ping.c:286:5: sparse: symbol 'ping_check_bind_addr' was not declared. Should it be static?
net/ipv4/ping.c:355:6: sparse: symbol 'ping_set_saddr' was not declared. Should it be static?
net/ipv4/ping.c:370:6: sparse: symbol 'ping_clear_saddr' was not declared. Should it be static?

net/ipv6/ping.c:60:5: sparse: symbol 'dummy_ipv6_recv_error' was not declared. Should it be static?
net/ipv6/ping.c:64:5: sparse: symbol 'dummy_ip6_datagram_recv_ctl' was not declared. Should it be static?
net/ipv6/ping.c:69:5: sparse: symbol 'dummy_icmpv6_err_convert' was not declared. Should it be static?
net/ipv6/ping.c:73:6: sparse: symbol 'dummy_ipv6_icmp_error' was not declared. Should it be static?
net/ipv6/ping.c:75:5: sparse: symbol 'dummy_ipv6_chk_addr' was not declared. Should it be static?
net/ipv6/ping.c:201:5: sparse: symbol 'ping_v6_seq_show' was not declared. Should it be static?

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c26d6b46 02-Jun-2013 Cong Wang <amwang@redhat.com>

ping: always initialize ->sin6_scope_id and ->sin6_flowinfo

If we don't need scope id, we should initialize it to zero.
Same for ->sin6_flowinfo.

Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d862e546 31-May-2013 Lorenzo Colitti <lorenzo@google.com>

net: ipv6: Implement /proc/net/icmp6.

The format is based on /proc/net/icmp and /proc/net/{udp,raw}6.

Compiles and displays reasonable results with CONFIG_IPV6={n,m,y}
Couldn't figure out how to test without CONFIG_PROC_FS enabled.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8cc785f6 31-May-2013 Lorenzo Colitti <lorenzo@google.com>

net: ipv4: make the ping /proc code AF-independent

Introduce a ping_seq_afinfo structure (similar to its UDP
equivalent) and use it to make some of the ping /proc functions
address-family independent. Rename the remaining ping /proc
functions from ping_* to ping_v4_*.

Compiles and displays reasonable results with CONFIG_IPV6={n,m,y}

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6d0bfe22 22-May-2013 Lorenzo Colitti <lorenzo@google.com>

net: ipv6: Add IPv6 support to the ping socket.

This adds the ability to send ICMPv6 echo requests without a
raw socket. The equivalent ability for ICMPv4 was added in
2011.

Instead of having separate code paths for IPv4 and IPv6, make
most of the code in net/ipv4/ping.c dual-stack and only add a
few IPv6-specific bits (like the protocol definition) to a new
net/ipv6/ping.c. Hopefully this will reduce divergence and/or
duplication of bugs in the future.

Caveats:

- Setting options via ancillary data (e.g., using IPV6_PKTINFO
to specify the outgoing interface) is not yet supported.
- There are no separate security settings for IPv4 and IPv6;
everything is controlled by /proc/net/ipv4/ping_group_range.
- The proc interface does not yet display IPv6 ping sockets
properly.

Tested with a patched copy of ping6 and using raw socket calls.
Compiles and works with all of CONFIG_IPV6={n,m,y}.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bf84a010 14-Apr-2013 Daniel Borkmann <daniel@iogearbox.net>

net: sock: make sock_tx_timestamp void

Currently, sock_tx_timestamp() always returns 0. The comment that
describes the sock_tx_timestamp() function wrongly says that it
returns an error when an invalid argument is passed (from commit
20d4947353be, ``net: socket infrastructure for SO_TIMESTAMPING'').
Make the function void, so that we can also remove all the unneeded
if conditions that check for such a _non-existant_ error case in the
output path.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b531ed61 20-Feb-2013 Li Wei <lw@cn.fujitsu.com>

ipv4: fix a bug in ping_err().

We should get 'type' and 'code' from the outer ICMP header.

Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ece31ffd 17-Feb-2013 Gao feng <gaofeng@cn.fujitsu.com>

net: proc: change proc_net_remove to remove_proc_entry

proc_net_remove is only used to remove proc entries
that under /proc/net,it's not a general function for
removing proc entries of netns. if we want to remove
some proc entries which under /proc/net/stat/, we still
need to call remove_proc_entry.

this patch use remove_proc_entry to replace proc_net_remove.
we can remove proc_net_remove after this patch.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d4beaa66 17-Feb-2013 Gao feng <gaofeng@cn.fujitsu.com>

net: proc: change proc_net_fops_create to proc_create

Right now, some modules such as bonding use proc_create
to create proc entries under /proc/net/, and other modules
such as ipv4 use proc_net_fops_create.

It looks a little chaos.this patch changes all of
proc_net_fops_create to proc_create. we can remove
proc_net_fops_create after this patch.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8141ed9f 20-Jan-2013 Steffen Klassert <steffen.klassert@secunet.com>

ipv4: Add a socket release callback for datagram sockets

This implements a socket release callback function to check
if the socket cached route got invalid during the time
we owned the socket. The function is used from udp, raw
and ping sockets.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7064d16e 24-May-2012 Eric W. Biederman <ebiederm@xmission.com>

userns: Use kgids for sysctl_ping_group_range

- Store sysctl_ping_group_range as a paire of kgid_t values
instead of a pair of gid_t values.
- Move the kgid conversion work from ping_init_sock into ipv4_ping_group_range
- For invalid cases reset to the default disabled state.

With the kgid_t conversion made part of the original value sanitation
from userspace understand how the code will react becomes clearer
and it becomes possible to set the sysctl ping group range from
something other than the initial user namespace.

Cc: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# a7cb5a49 24-May-2012 Eric W. Biederman <ebiederm@xmission.com>

userns: Print out socket uids in a user namespace aware fashion.

Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# 55be7a9c 11-Jul-2012 David S. Miller <davem@davemloft.net>

ipv4: Add redirect support to all protocol icmp error handlers.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 36393395 14-Jun-2012 David S. Miller <davem@davemloft.net>

ipv4: Handle PMTU in all ICMP error handlers.

With ip_rt_frag_needed() removed, we have to explicitly update PMTU
information in every ICMP error handler.

Create two helper functions to facilitate this.

1) ipv4_sk_update_pmtu()

This updates the PMTU when we have a socket context to
work with.

2) ipv4_update_pmtu()

Raw version, used when no socket context is available. For this
interface, we essentially just pass in explicit arguments for
the flow identity information we would have extracted from the
socket.

And you'll notice that ipv4_sk_update_pmtu() is simply implemented
in terms of ipv4_update_pmtu()

Note that __ip_route_output_key() is used, rather than something like
ip_route_output_flow() or ip_route_output_key(). This is because we
absolutely do not want to end up with a route that does IPSEC
encapsulation and the like. Instead, we only want the route that
would get us to the node described by the outermost IP header.

Reported-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ae2975bc 14-Nov-2011 Eric W. Biederman <ebiederm@xmission.com>

userns: Convert group_info values from gid_t to kgid_t.

As a first step to converting struct cred to be all kuid_t and kgid_t
values convert the group values stored in group_info to always be
kgid_t values. Unless user namespaces are used this change should
have no effect.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# 95c96174 14-Apr-2012 Eric Dumazet <eric.dumazet@gmail.com>

net: cleanup unsigned to unsigned int

Use of "unsigned int" is preferred to bare "unsigned" in net tree.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5e73ea1a 14-Apr-2012 Daniel Baluta <dbaluta@ixiacom.com>

ipv4: fix checkpatch errors

Fix checkpatch errors of the following type:
* ERROR: "foo * bar" should be "foo *bar"
* ERROR: "(foo*)" should be "(foo *)"

Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9ffc93f2 28-Mar-2012 David Howells <dhowells@redhat.com>

Remove all #inclusions of asm/system.h

Remove all #inclusions of asm/system.h preparatory to splitting and killing
it. Performed with the following command:

perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`

Signed-off-by: David Howells <dhowells@redhat.com>


# 058bd4d2 11-Mar-2012 Joe Perches <joe@perches.com>

net: Convert printks to pr_<level>

Use a more current kernel messaging style.

Convert a printk block to print_hex_dump.
Coalesce formats, align arguments.
Use %s, __func__ instead of embedding function names.

Some messages that were prefixed with <foo>_close are
now prefixed with <foo>_fini. Some ah4 and esp messages
are now not prefixed with "ip ".

The intent of this patch is to later add something like
#define pr_fmt(fmt) "IPv4: " fmt.
to standardize the output messages.

Text size is trivially reduced. (x86-32 allyesconfig)

$ size net/ipv4/built-in.o*
text data bss dec hex filename
887888 31558 249696 1169142 11d6f6 net/ipv4/built-in.o.new
887934 31558 249800 1169292 11d78c net/ipv4/built-in.o.old

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a5e7424d 21-Feb-2012 David S. Miller <davem@davemloft.net>

ipv4: ping: Fix recvmsg MSG_OOB error handling.

Don't return an uninitialized variable as the error, return
-EOPNOTSUPP instead.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 76e21053 08-Feb-2012 Erich E. Hoover <ehoover@mines.edu>

ipv4: Implement IP_UNICAST_IF socket option.

The IP_UNICAST_IF feature is needed by the Wine project. This patch
implements the feature by setting the outgoing interface in a similar
fashion to that of IP_MULTICAST_IF. A separate option is needed to
handle this feature since the existing options do not provide all of
the characteristics required by IP_UNICAST_IF, a summary is provided
below.

SO_BINDTODEVICE:
* SO_BINDTODEVICE requires administrative privileges, IP_UNICAST_IF
does not. From reading some old mailing list articles my
understanding is that SO_BINDTODEVICE requires administrative
privileges because it can override the administrator's routing
settings.
* The SO_BINDTODEVICE option restricts both outbound and inbound
traffic, IP_UNICAST_IF only impacts outbound traffic.

IP_PKTINFO:
* Since IP_PKTINFO and IP_UNICAST_IF are independent options,
implementing IP_UNICAST_IF with IP_PKTINFO will likely break some
applications.
* Implementing IP_UNICAST_IF on top of IP_PKTINFO significantly
complicates the Wine codebase and reduces the socket performance
(doing this requires a lot of extra communication between the
"server" and "user" layers).

bind():
* bind() does not work on broadcast packets, IP_UNICAST_IF is
specifically intended to work with broadcast packets.
* Like SO_BINDTODEVICE, bind() restricts both outbound and inbound
traffic.

Signed-off-by: Erich E. Hoover <ehoover@mines.edu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 747465ef 16-Jan-2012 Eric Dumazet <eric.dumazet@gmail.com>

net: fix some sparse errors

make C=2 CF="-D__CHECK_ENDIAN__" M=net

And fix flowi4_init_output() prototype for sport

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fb120c0a 16-Nov-2011 Eric Dumazet <eric.dumazet@gmail.com>

ping: dont increment ICMP_MIB_INERRORS

ping module incorrectly increments ICMP_MIB_INERRORS if feeded with a
frame not belonging to its own sockets.

RFC 2011 states that ICMP_MIB_INERRORS should count "the number of ICMP
messages which the entiry received but determined as having
ICMP-specific errors (bad ICMP checksums, bad length, etc.)."

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: Flavio Leitner <fbl@redhat.com>
Acked-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bc3b2d7f 15-Jul-2011 Paul Gortmaker <paul.gortmaker@windriver.com>

net: Add export.h for EXPORT_SYMBOL/THIS_MODULE to non-modules

These files are non modular, but need to export symbols using
the macros now living in export.h -- call out the include so
that things won't break when we remove the implicit presence
of module.h from everywhere.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>


# 8ad2475e 19-Jun-2011 Jesper Juhl <jj@chaosbits.net>

ipv4, ping: Remove duplicate icmp.h include

Remove the duplicate inclusion of net/icmp.h from net/ipv4/ping.c

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e44ba033 20-Jun-2011 Vitaliy Ivanov <vitalivanov@gmail.com>

treewide: remove duplicate includes

Many stupid corrections of duplicated includes based on the output of
scripts/checkincludes.pl.

Signed-off-by: Vitaliy Ivanov <vitalivanov@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# 19a76fa9 22-May-2011 Eric Dumazet <eric.dumazet@gmail.com>

net: ping: cleanups ping_v4_unhash()

net/ipv4/ping.c: In function ‘ping_v4_unhash’:
net/ipv4/ping.c:140:28: warning: variable ‘hslot’ set but not used

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 75e308c8 18-May-2011 Changli Gao <xiaosuo@gmail.com>

net: ping: fix the coding style

The characters in a line should be no more than 80.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bb0cd2fb 18-May-2011 Changli Gao <xiaosuo@gmail.com>

net: ping: make local functions static

As these functions are only used in this file.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f56e03e8 16-May-2011 Vasiliy Kulikov <segoon@openwall.com>

net: ping: fix build failure

If CONFIG_PROC_SYSCTL=n the building process fails:

ping.c:(.text+0x52af3): undefined reference to `inet_get_ping_group_range_net'

Moved inet_get_ping_group_range_net() to ping.c.

Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1a8218e9 15-May-2011 Eric Dumazet <eric.dumazet@gmail.com>

net: ping: dont call udp_ioctl()

udp_ioctl() really handles UDP and UDPLite protocols.

1) It can increment UDP_MIB_INERRORS in case first_packet_length() finds
a frame with bad checksum.

2) It has a dependency on sizeof(struct udphdr), not applicable to
ICMP/PING

If ping sockets need to handle SIOCINQ/SIOCOUTQ ioctl, this should be
done differently.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1b1cb1f7 13-May-2011 Eric Dumazet <eric.dumazet@gmail.com>

net: ping: small changes

ping_table is not __read_mostly, since it contains one rwlock,
and is static to ping.c

ping_port_rover & ping_v4_lookup are static

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c319b4d7 13-May-2011 Vasiliy Kulikov <segoon@openwall.com>

net: ipv4: add IPPROTO_ICMP socket kind

This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).

Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/

A new ping socket is created with

socket(PF_INET, SOCK_DGRAM, PROT_ICMP)

Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.

Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.

ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.

ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).

socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.

The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).

Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping

For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/

Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.

All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.

PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.

PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().

PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.

RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>