History log of /linux-master/net/ipv6/ip6_input.c
Revision Date Author Comments
# 2f0ff05a 28-Feb-2024 Eric Dumazet <edumazet@google.com>

ipv6/addrconf: annotate data-races around devconf fields (II)

Final (?) round of this series.

Annotate lockless reads on following devconf fields,
because they be changed concurrently from /proc/net/ipv6/conf.

- accept_dad
- optimistic_dad
- use_optimistic
- use_oif_addrs_only
- ra_honor_pio_life
- keep_addr_on_down
- ndisc_notify
- ndisc_evict_nocarrier
- suppress_frag_ndisc
- addr_gen_mode
- seg6_enabled
- ioam6_enabled
- ioam6_id
- ioam6_id_wide
- drop_unicast_in_l2_multicast
- mldv[12]_unsolicited_report_interval
- force_mld_version
- force_tllao
- accept_untracked_na
- drop_unsolicited_na
- accept_source_route

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d289ab65 28-Feb-2024 Eric Dumazet <edumazet@google.com>

ipv6: annotate data-races around cnf.disable_ipv6

disable_ipv6 is read locklessly, add appropriate READ_ONCE()
and WRITE_ONCE() annotations.

v2: do not preload net before rtnl_trylock() in
addrconf_disable_ipv6() (Jiri)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8423be89 31-Aug-2023 Sriram Yagnaraman <sriram.yagnaraman@est.tech>

ipv6: ignore dst hint for multipath routes

Route hints when the nexthop is part of a multipath group causes packets
in the same receive batch to be sent to the same nexthop irrespective of
the multipath hash of the packet. So, do not extract route hint for
packets whose destination is part of a multipath group.

A new SKB flag IP6SKB_MULTIPATH is introduced for this purpose, set the
flag when route is looked up in fib6_select_path() and use it in
ip6_can_use_hint() to check for the existence of the flag.

Fixes: 197dbf24e360 ("ipv6: introduce and uses route look hints for list input.")
Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b0e214d2 21-Mar-2023 Madhu Koriginja <madhu.koriginja@nxp.com>

netfilter: keep conntrack reference until IPsecv6 policy checks are done

Keep the conntrack reference until policy checks have been performed for
IPsec V6 NAT support, just like ipv4.

The reference needs to be dropped before a packet is
queued to avoid having the conntrack module unloadable.

Fixes: 58a317f1061c ("netfilter: ipv6: add IPv6 NAT support")
Signed-off-by: Madhu Koriginja <madhu.koriginja@nxp.com>
Signed-off-by: Florian Westphal <fw@strlen.de>


# 11052589 13-Jul-2022 Kuniyuki Iwashima <kuniyu@amazon.com>

tcp/udp: Make early_demux back namespacified.

Commit e21145a9871a ("ipv4: namespacify ip_early_demux sysctl knob") made
it possible to enable/disable early_demux on a per-netns basis. Then, we
introduced two knobs, tcp_early_demux and udp_early_demux, to switch it for
TCP/UDP in commit dddb64bcb346 ("net: Add sysctl to toggle early demux for
tcp and udp"). However, the .proc_handler() was wrong and actually
disabled us from changing the behaviour in each netns.

We can execute early_demux if net.ipv4.ip_early_demux is on and each proto
.early_demux() handler is not NULL. When we toggle (tcp|udp)_early_demux,
the change itself is saved in each netns variable, but the .early_demux()
handler is a global variable, so the handler is switched based on the
init_net's sysctl variable. Thus, netns (tcp|udp)_early_demux knobs have
nothing to do with the logic. Whether we CAN execute proto .early_demux()
is always decided by init_net's sysctl knob, and whether we DO it or not is
by each netns ip_early_demux knob.

This patch namespacifies (tcp|udp)_early_demux again. For now, the users
of the .early_demux() handler are TCP and UDP only, and they are called
directly to avoid retpoline. So, we can remove the .early_demux() handler
from inet6?_protos and need not dereference them in ip6?_rcv_finish_core().
If another proto needs .early_demux(), we can restore it at that time.

Fixes: dddb64bcb346 ("net: Add sysctl to toggle early demux for tcp and udp")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20220713175207.7727-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 0339d25a 13-Apr-2022 Eric Dumazet <edumazet@google.com>

ipv6: fix NULL deref in ip6_rcv_core()

idev can be NULL, as the surrounding code suggests.

Fixes: 4daf841a2ef3 ("net: ipv6: add skb drop reasons to ip6_rcv_core()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Menglong Dong <imagedong@tencent.com>
Cc: Jiang Biao <benbjiang@tencent.com>
Cc: Hao Peng <flyingpeng@tencent.com>
Link: https://lore.kernel.org/r/20220413205653.1178458-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# eeab7e7f 13-Apr-2022 Menglong Dong <imagedong@tencent.com>

net: ipv6: add skb drop reasons to ip6_protocol_deliver_rcu()

Replace kfree_skb() used in ip6_protocol_deliver_rcu() with
kfree_skb_reason().

No new reasons are added.

Some paths are ignored, as they are not common, such as encapsulation
on non-final protocol.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4daf841a 13-Apr-2022 Menglong Dong <imagedong@tencent.com>

net: ipv6: add skb drop reasons to ip6_rcv_core()

Replace kfree_skb() used in ip6_rcv_core() with kfree_skb_reason().
No new drop reasons are added.

Seems now we use 'SKB_DROP_REASON_IP_INHDR' for too many case during
ipv6 header parse or check, just like what 'IPSTATS_MIB_INHDRERRORS'
do. Will it be too general and hard to know what happened?

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 794c24e9 06-Apr-2022 Jeffrey Ji <jeffreyji@google.com>

net-core: rx_otherhost_dropped to core_stats

Increment rx_otherhost_dropped counter when packet dropped due to
mismatched dest MAC addr.

An example when this drop can occur is when manually crafting raw
packets that will be consumed by a user space application via a tap
device. For testing purposes local traffic was generated using trafgen
for the client and netcat to start a server

Tested: Created 2 netns, sent 1 packet using trafgen from 1 to the other
with "{eth(daddr=$INCORRECT_MAC...}", verified that iproute2 showed the
counter was incremented. (Also had to modify iproute2 to show the stat,
additional patch for that coming next.)

Signed-off-by: Jeffrey Ji <jeffreyji@google.com>
Reviewed-by: Brian Vazquez <brianvv@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220406172600.1141083-1-jeffreyjilinux@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# cd14e9b7 02-Mar-2022 Martin KaFai Lau <kafai@fb.com>

net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally

The previous patches handled the delivery_time in the ingress path
before the routing decision is made. This patch can postpone clearing
delivery_time in a skb until knowing it is delivered locally and also
set the (rcv) timestamp if needed. This patch moves the
skb_clear_delivery_time() from dev.c to ip_local_deliver_finish()
and ip6_input_finish().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 145c7a79 04-Feb-2022 Eric Dumazet <edumazet@google.com>

ipv6: make mc_forwarding atomic

This fixes minor data-races in ip6_mc_input() and
batadv_mcast_mla_rtr_flags_softif_get_ipv6()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# dcc32f4f 17-Mar-2021 Jakub Kicinski <kuba@kernel.org>

ipv6: weaken the v4mapped source check

This reverts commit 6af1799aaf3f1bc8defedddfa00df3192445bbf3.

Commit 6af1799aaf3f ("ipv6: drop incoming packets having a v4mapped
source address") introduced an input check against v4mapped addresses.
Use of such addresses on the wire is indeed questionable and not
allowed on public Internet. As the commit pointed out

https://tools.ietf.org/html/draft-itojun-v6ops-v4mapped-harmful-02

lists potential issues.

Unfortunately there are applications which use v4mapped addresses,
and breaking them is a clear regression. For example v4mapped
addresses (or any semi-valid addresses, really) may be used
for uni-direction event streams or packet export.

Since the issue which sparked the addition of the check was with
TCP and request_socks in particular push the check down to TCPv6
and DCCP. This restores the ability to receive UDPv6 packets with
v4mapped address as the source.

Keep using the IPSTATS_MIB_INHDRERRORS statistic to minimize the
user-visible changes.

Fixes: 6af1799aaf3f ("ipv6: drop incoming packets having a v4mapped source address")
Reported-by: Sunyi Shao <sunyishao@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f9a4719c 03-Feb-2021 Leon Romanovsky <leon@kernel.org>

ipv6: move udp declarations to net/udp.h

Fix the following compilation warning:

net/ipv6/udp.c:1031:30: warning: no previous prototype for 'udp_v6_early_demux' [-Wmissing-prototypes]
1031 | INDIRECT_CALLABLE_SCOPE void udp_v6_early_demux(struct sk_buff *skb)
| ^~~~~~~~~~~~~~~~~~
net/ipv6/udp.c:1072:29: warning: no previous prototype for 'udpv6_rcv' [-Wmissing-prototypes]
1072 | INDIRECT_CALLABLE_SCOPE int udpv6_rcv(struct sk_buff *skb)
| ^~~~~~~~~

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# cf7fbe66 29-Mar-2020 Joe Stringer <joe@wand.net.nz>

bpf: Add socket assign support

Add support for TPROXY via a new bpf helper, bpf_sk_assign().

This helper requires the BPF program to discover the socket via a call
to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
helper takes its own reference to the socket in addition to any existing
reference that may or may not currently be obtained for the duration of
BPF processing. For the destination socket to receive the traffic, the
traffic must be routed towards that socket via local route. The
simplest example route is below, but in practice you may want to route
traffic more narrowly (eg by CIDR):

$ ip route add local default dev lo

This patch avoids trying to introduce an extra bit into the skb->sk, as
that would require more invasive changes to all code interacting with
the socket to ensure that the bit is handled correctly, such as all
error-handling cases along the path from the helper in BPF through to
the orphan path in the input. Instead, we opt to use the destructor
variable to switch on the prefetch of the socket.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200329225342.16317-2-joe@wand.net.nz


# 197dbf24 20-Nov-2019 Paolo Abeni <pabeni@redhat.com>

ipv6: introduce and uses route look hints for list input.

When doing RX batch packet processing, we currently always repeat
the route lookup for each ingress packet. When no custom rules are
in place, and there aren't routes depending on source addresses,
we know that packets with the same destination address will use
the same dst.

This change tries to avoid per packet route lookup caching
the destination address of the latest successful lookup, and
reusing it for the next packet when the above conditions are
in place. Ingress traffic for most servers should fit.

The measured performance delta under UDP flood vs a recvmmsg
receiver is as follow:

vanilla patched delta
Kpps Kpps %
1431 1674 +17

In the worst-case scenario - each packet has a different
destination address - the performance delta is within noise
range.

v3 -> v4:
- support hints for SUBFLOW build, too (David A.)
- several style fixes (Eric)

v2 -> v3:
- add fib6_has_custom_rules() helpers (David A.)
- add ip6_extract_route_hint() helper (Edward C.)
- use hint directly in ip6_list_rcv_finish() (Willem)

v1 -> v2:
- fix build issue with !CONFIG_IPV6_MULTIPLE_TABLES
- fix potential race when fib6_has_custom_rules is set
while processing a packet batch

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 51210ad5 28-Oct-2019 Florian Westphal <fw@strlen.de>

inet: do not call sublist_rcv on empty list

syzbot triggered struct net NULL deref in NF_HOOK_LIST:
RIP: 0010:NF_HOOK_LIST include/linux/netfilter.h:331 [inline]
RIP: 0010:ip6_sublist_rcv+0x5c9/0x930 net/ipv6/ip6_input.c:292
ipv6_list_rcv+0x373/0x4b0 net/ipv6/ip6_input.c:328
__netif_receive_skb_list_ptype net/core/dev.c:5274 [inline]

Reason:
void ipv6_list_rcv(struct list_head *head, struct packet_type *pt,
struct net_device *orig_dev)
[..]
list_for_each_entry_safe(skb, next, head, list) {
/* iterates list */
skb = ip6_rcv_core(skb, dev, net);
/* ip6_rcv_core drops skb -> NULL is returned */
if (skb == NULL)
continue;
[..]
}
/* sublist is empty -> curr_net is NULL */
ip6_sublist_rcv(&sublist, curr_dev, curr_net);

Before the recent change NF_HOOK_LIST did a list iteration before
struct net deref, i.e. it was a no-op in the empty list case.

List iteration now happens after *net deref, causing crash.

Follow the same pattern as the ip(v6)_list_rcv loop and add a list_empty
test for the final sublist dispatch too.

Cc: Edward Cree <ecree@solarflare.com>
Reported-by: syzbot+c54f457cad330e57e967@syzkaller.appspotmail.com
Fixes: ca58fbe06c54 ("netfilter: add and use nf_hook_slow_list()")
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Leon Romanovsky <leonro@mellanox.com>
Tested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6af1799a 02-Oct-2019 Eric Dumazet <edumazet@google.com>

ipv6: drop incoming packets having a v4mapped source address

This began with a syzbot report. syzkaller was injecting
IPv6 TCP SYN packets having a v4mapped source address.

After an unsuccessful 4-tuple lookup, TCP creates a request
socket (SYN_RECV) and calls reqsk_queue_hash_req()

reqsk_queue_hash_req() calls sk_ehashfn(sk)

At this point we have AF_INET6 sockets, and the heuristic
used by sk_ehashfn() to either hash the IPv4 or IPv6 addresses
is to use ipv6_addr_v4mapped(&sk->sk_v6_daddr)

For the particular spoofed packet, we end up hashing V4 addresses
which were not initialized by the TCP IPv6 stack, so KMSAN fired
a warning.

I first fixed sk_ehashfn() to test both source and destination addresses,
but then faced various problems, including user-space programs
like packetdrill that had similar assumptions.

Instead of trying to fix the whole ecosystem, it is better
to admit that we have a dual stack behavior, and that we
can not build linux kernels without V4 stack anyway.

The dual stack API automatically forces the traffic to be IPv4
if v4mapped addresses are used at bind() or connect(), so it makes
no sense to allow IPv6 traffic to use the same v4mapped class.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 895b5c9f 29-Sep-2019 Florian Westphal <fw@strlen.de>

netfilter: drop bridge nf reset from nf_reset

commit 174e23810cd31
("sk_buff: drop all skb extensions on free and skb scrubbing") made napi
recycle always drop skb extensions. The additional skb_ext_del() that is
performed via nf_reset on napi skb recycle is not needed anymore.

Most nf_reset() calls in the stack are there so queued skb won't block
'rmmod nf_conntrack' indefinitely.

This removes the skb_ext_del from nf_reset, and renames it to a more
fitting nf_reset_ct().

In a few selected places, add a call to skb_ext_reset to make sure that
no active extensions remain.

I am submitting this for "net", because we're still early in the release
cycle. The patch applies to net-next too, but I think the rename causes
needless divergence between those trees.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c7a42eb4 23-Aug-2019 Xin Long <lucien.xin@gmail.com>

net: ipv6: fix listify ip6_rcv_finish in case of forwarding

We need a similar fix for ipv6 as Commit 0761680d5215 ("net: ipv4: fix
listify ip_rcv_finish in case of forwarding") does for ipv4.

This issue can be reprocuded by syzbot since Commit 323ebb61e32b ("net:
use listified RX for handling GRO_NORMAL skbs") on net-next. The call
trace was:

kernel BUG at include/linux/skbuff.h:2225!
RIP: 0010:__skb_pull include/linux/skbuff.h:2225 [inline]
RIP: 0010:skb_pull+0xea/0x110 net/core/skbuff.c:1902
Call Trace:
sctp_inq_pop+0x2f1/0xd80 net/sctp/inqueue.c:202
sctp_endpoint_bh_rcv+0x184/0x8d0 net/sctp/endpointola.c:385
sctp_inq_push+0x1e4/0x280 net/sctp/inqueue.c:80
sctp_rcv+0x2807/0x3590 net/sctp/input.c:256
sctp6_rcv+0x17/0x30 net/sctp/ipv6.c:1049
ip6_protocol_deliver_rcu+0x2fe/0x1660 net/ipv6/ip6_input.c:397
ip6_input_finish+0x84/0x170 net/ipv6/ip6_input.c:438
NF_HOOK include/linux/netfilter.h:305 [inline]
NF_HOOK include/linux/netfilter.h:299 [inline]
ip6_input+0xe4/0x3f0 net/ipv6/ip6_input.c:447
dst_input include/net/dst.h:442 [inline]
ip6_sublist_rcv_finish+0x98/0x1e0 net/ipv6/ip6_input.c:84
ip6_list_rcv_finish net/ipv6/ip6_input.c:118 [inline]
ip6_sublist_rcv+0x80c/0xcf0 net/ipv6/ip6_input.c:282
ipv6_list_rcv+0x373/0x4b0 net/ipv6/ip6_input.c:316
__netif_receive_skb_list_ptype net/core/dev.c:5049 [inline]
__netif_receive_skb_list_core+0x5fc/0x9d0 net/core/dev.c:5097
__netif_receive_skb_list net/core/dev.c:5149 [inline]
netif_receive_skb_list_internal+0x7eb/0xe60 net/core/dev.c:5244
gro_normal_list.part.0+0x1e/0xb0 net/core/dev.c:5757
gro_normal_list net/core/dev.c:5755 [inline]
gro_normal_one net/core/dev.c:5769 [inline]
napi_frags_finish net/core/dev.c:5782 [inline]
napi_gro_frags+0xa6a/0xea0 net/core/dev.c:5855
tun_get_user+0x2e98/0x3fa0 drivers/net/tun.c:1974
tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2020

Fixes: d8269e2cbf90 ("net: ipv6: listify ipv6_rcv() and ip6_rcv_finish()")
Fixes: 323ebb61e32b ("net: use listified RX for handling GRO_NORMAL skbs")
Reported-by: syzbot+eb349eeee854e389c36d@syzkaller.appspotmail.com
Reported-by: syzbot+4a0643a653ac375612d1@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2874c5fd 27-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 97ff7ffb 03-May-2019 Paolo Abeni <pabeni@redhat.com>

net: use indirect calls helpers at early demux stage

So that we avoid another indirect call per RX packet, if
early demux is enabled.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0e219ae4 03-May-2019 Paolo Abeni <pabeni@redhat.com>

net: use indirect calls helpers for L3 handler hooks

So that we avoid another indirect call per RX packet in the common
case.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 22f6bbb7 04-Dec-2018 Edward Cree <ecree@solarflare.com>

net: use skb_list_del_init() to remove from RX sublists

list_del() leaves the skb->next pointer poisoned, which can then lead to
a crash in e.g. OVS forwarding. For example, setting up an OVS VXLAN
forwarding bridge on sfc as per:

========
$ ovs-vsctl show
5dfd9c47-f04b-4aaa-aa96-4fbb0a522a30
Bridge "br0"
Port "br0"
Interface "br0"
type: internal
Port "enp6s0f0"
Interface "enp6s0f0"
Port "vxlan0"
Interface "vxlan0"
type: vxlan
options: {key="1", local_ip="10.0.0.5", remote_ip="10.0.0.4"}
ovs_version: "2.5.0"
========
(where 10.0.0.5 is an address on enp6s0f1)
and sending traffic across it will lead to the following panic:
========
general protection fault: 0000 [#1] SMP PTI
CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.20.0-rc3-ehc+ #701
Hardware name: Dell Inc. PowerEdge R710/0M233H, BIOS 6.4.0 07/23/2013
RIP: 0010:dev_hard_start_xmit+0x38/0x200
Code: 53 48 89 fb 48 83 ec 20 48 85 ff 48 89 54 24 08 48 89 4c 24 18 0f 84 ab 01 00 00 48 8d 86 90 00 00 00 48 89 f5 48 89 44 24 10 <4c> 8b 33 48 c7 03 00 00 00 00 48 8b 05 c7 d1 b3 00 4d 85 f6 0f 95
RSP: 0018:ffff888627b437e0 EFLAGS: 00010202
RAX: 0000000000000000 RBX: dead000000000100 RCX: ffff88862279c000
RDX: ffff888614a342c0 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff888618a88000 R08: 0000000000000001 R09: 00000000000003e8
R10: 0000000000000000 R11: ffff888614a34140 R12: 0000000000000000
R13: 0000000000000062 R14: dead000000000100 R15: ffff888616430000
FS: 0000000000000000(0000) GS:ffff888627b40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6d2bc6d000 CR3: 000000000200a000 CR4: 00000000000006e0
Call Trace:
<IRQ>
__dev_queue_xmit+0x623/0x870
? masked_flow_lookup+0xf7/0x220 [openvswitch]
? ep_poll_callback+0x101/0x310
do_execute_actions+0xaba/0xaf0 [openvswitch]
? __wake_up_common+0x8a/0x150
? __wake_up_common_lock+0x87/0xc0
? queue_userspace_packet+0x31c/0x5b0 [openvswitch]
ovs_execute_actions+0x47/0x120 [openvswitch]
ovs_dp_process_packet+0x7d/0x110 [openvswitch]
ovs_vport_receive+0x6e/0xd0 [openvswitch]
? dst_alloc+0x64/0x90
? rt_dst_alloc+0x50/0xd0
? ip_route_input_slow+0x19a/0x9a0
? __udp_enqueue_schedule_skb+0x198/0x1b0
? __udp4_lib_rcv+0x856/0xa30
? __udp4_lib_rcv+0x856/0xa30
? cpumask_next_and+0x19/0x20
? find_busiest_group+0x12d/0xcd0
netdev_frame_hook+0xce/0x150 [openvswitch]
__netif_receive_skb_core+0x205/0xae0
__netif_receive_skb_list_core+0x11e/0x220
netif_receive_skb_list+0x203/0x460
? __efx_rx_packet+0x335/0x5e0 [sfc]
efx_poll+0x182/0x320 [sfc]
net_rx_action+0x294/0x3c0
__do_softirq+0xca/0x297
irq_exit+0xa6/0xb0
do_IRQ+0x54/0xd0
common_interrupt+0xf/0xf
</IRQ>
========
So, in all listified-receive handling, instead pull skbs off the lists with
skb_list_del_init().

Fixes: 9af86f933894 ("net: core: fix use-after-free in __netif_receive_skb_list_core")
Fixes: 7da517a3bc52 ("net: core: Another step of skb receive list processing")
Fixes: a4ca8b7df73c ("net: ipv4: fix drop handling in ip_list_rcv() and ip_list_rcv_finish()")
Fixes: d8269e2cbf90 ("net: ipv6: listify ipv6_rcv() and ip6_rcv_finish()")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 80bde363 06-Nov-2018 Paolo Abeni <pabeni@redhat.com>

ipv6: factor out protocol delivery helper

So that we can re-use it at the UDP level in the next patch

rfc v3 -> v1:
- add the helper declaration into the ipv6 header

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5226b6a9 07-Nov-2018 Mike Manning <mmanning@vyatta.att-mail.com>

ipv6: handling of multicast packets received in VRF

If the skb for multicast packets marked as enslaved to a VRF are
received, then the secondary device index should be used to obtain
the real device. And verify the multicast address against the
enslaved rather than the l3mdev device.

Signed-off-by: Dewi Morgan <morgand@vyatta.att-mail.com>
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3ede0bbc 19-Sep-2018 Robert Shearman <rshearma@vyatta.att-mail.com>

ipv6: Allow the l3mdev to be a loopback

There is no way currently for an IPv6 client connect using a loopback
address in a VRF, whereas for IPv4 the loopback address can be added:

$ sudo ip addr add dev vrfred 127.0.0.1/8
$ sudo ip -6 addr add ::1/128 dev vrfred
RTNETLINK answers: Cannot assign requested address

So allow ::1 to be configured on an L3 master device. In order for
this to be usable ip_route_output_flags needs to not consider ::1 to
be a link scope address (since oif == l3mdev and so it would be
dropped), and ipv6_rcv needs to consider the l3mdev to be a loopback
device so that it doesn't drop the packets.

Signed-off-by: Robert Shearman <rshearma@vyatta.att-mail.com>
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d8269e2c 05-Jul-2018 Edward Cree <ecree@solarflare.com>

net: ipv6: listify ipv6_rcv() and ip6_rcv_finish()

Essentially the same as the ipv4 equivalents.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bdb7cc64 16-Apr-2018 Stephen Suryaputra <ssuryaextr@gmail.com>

ipv6: Count interface receive statistics on the ingress netdev

The statistics such as InHdrErrors should be counted on the ingress
netdev rather than on the dev from the dst, which is the egress.

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0aa8c13e 14-Apr-2017 Florian Westphal <fw@strlen.de>

ipv6: drop non loopback packets claiming to originate from ::1

We lack a saddr check for ::1. This causes security issues e.g. with acls
permitting connections from ::1 because of assumption that these originate
from local machine.

Assuming a source address of ::1 is local seems reasonable.
RFC4291 doesn't allow such a source address either, so drop such packets.

Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# dddb64bc 23-Mar-2017 Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>

net: Add sysctl to toggle early demux for tcp and udp

Certain system process significant unconnected UDP workload.
It would be preferrable to disable UDP early demux for those systems
and enable it for TCP only.

By disabling UDP demux, we see these slight gains on an ARM64 system-
782 -> 788Mbps unconnected single stream UDPv4
633 -> 654Mbps unconnected UDPv4 different sources

The performance impact can change based on CPU architecure and cache
sizes. There will not much difference seen if entire UDP hash table
is in cache.

Both sysctls are enabled by default to preserve existing behavior.

v1->v2: Change function pointer instead of adding conditional as
suggested by Stephen.

v2->v3: Read once in callers to avoid issues due to compiler
optimizations. Also update commit message with the tests.

v3->v4: Store and use read once result instead of querying pointer
again incorrectly.

v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n}

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Suggested-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Tom Herbert <tom@herbertland.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b4869aa2 06-Jun-2016 David Ahern <dsa@cumulusnetworks.com>

net: vrf: ipv6 support for local traffic to local addresses

Add support for locally originated traffic to VRF-local IPv6 addresses.
Similar to IPv4 a local dst is set on the skb and the packet is
reinserted with a call to netif_rx. With this patch, ping, tcp and udp
packets to a local IPv6 address are successfully routed:

$ ip addr show dev eth1
4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff
inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1
valid_lft forever preferred_lft forever
inet6 2100:1::1/120 scope global
valid_lft forever preferred_lft forever
inet6 fe80::e0:f9ff:fe1c:b974/64 scope link
valid_lft forever preferred_lft forever

$ ping6 -c1 -I red 2100:1::1
ping6: Warning: source address might be selected on device other than red.
PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes
64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.098 ms

ip6_input is exported so the VRF driver can use it for the dst input
function. The dst_alloc function for IPv4 defaults to setting the input and
output functions; IPv6's does not. VRF does not need to duplicate the Rx path
so just export the ipv6 input function.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1da44f9c 18-May-2016 Tom Herbert <tom@herbertland.com>

ipv6: Change "final" protocol processing for encapsulation

When performing foo-over-UDP, UDP packets are processed by the
encapsulation handler which returns another protocol to process.
This may result in processing two (or more) protocols in the
loop that are marked as INET6_PROTO_FINAL. The actions taken
for hitting a final protocol, in particular the skb_postpull_rcsum
can only be performed once.

This patch set adds a check of a final protocol has been seen. The
rules are:
- If the final protocol has not been seen any protocol is processed
(final and non-final). In the case of a final protocol, the final
actions are taken (like the skb_postpull_rcsum)
- If a final protocol has been seen (e.g. an encapsulating UDP
header) then no further non-final protocols are allowed
(e.g. extension headers). For more final protocols the
final actions are not taken (e.g. skb_postpull_rcsum).

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4c64242a 18-May-2016 Tom Herbert <tom@herbertland.com>

ipv6: Fix nexthdr for reinjection

In ip6_input_finish the nexthdr protocol is retrieved from the
next header offset that is returned in the cb of the skb.
This method does not work for UDP encapsulation that may not
even have a concept of a nexthdr field (e.g. FOU).

This patch checks for a final protocol (INET6_PROTO_FINAL) when a
protocol handler returns > 0. If the protocol is not final then
resubmission is performed on nhoff value. If the protocol is final
then the nexthdr is taken to be the return value.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 74b20582 10-May-2016 David Ahern <dsa@cumulusnetworks.com>

net: l3mdev: Add hook in ip and ipv6

Currently the VRF driver uses the rx_handler to switch the skb device
to the VRF device. Switching the dev prior to the ip / ipv6 layer
means the VRF driver has to duplicate IP/IPv6 processing which adds
overhead and makes features such as retaining the ingress device index
more complicated than necessary.

This patch moves the hook to the L3 layer just after the first NF_HOOK
for PRE_ROUTING. This location makes exposing the original ingress device
trivial (next patch) and allows adding other NF_HOOKs to the VRF driver
in the future.

dev_queue_xmit_nit is exported so that the VRF driver can cycle the skb
with the switched device through the packet taps to maintain current
behavior (tcpdump can be used on either the vrf device or the enslaved
devices).

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c2005eb0 27-Apr-2016 Eric Dumazet <edumazet@google.com>

ipv6: rename IP6_UPD_PO_STATS_BH()

Rename IP6_UPD_PO_STATS_BH() to __IP6_UPD_PO_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1d015503 27-Apr-2016 Eric Dumazet <edumazet@google.com>

ipv6: rename IP6_INC_STATS_BH()

Rename IP6_INC_STATS_BH() to __IP6_INC_STATS()
and IP6_ADD_STATS_BH() to __IP6_ADD_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e21145a9 14-Feb-2016 Nikolay Borisov <kernel@kyup.com>

ipv4: namespacify ip_early_demux sysctl knob

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# abbc3043 04-Feb-2016 Johannes Berg <johannes.berg@intel.com>

ipv6: add option to drop unicast encapsulated in L2 multicast

In order to solve a problem with 802.11, the so-called hole-196 attack,
add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if
enabled, causes the stack to drop IPv6 unicast packets encapsulated in
link-layer multi- or broadcast frames. Such frames can (as an attack)
be created by any member of the same wireless network and transmitted
as valid encrypted frames since the symmetric key for broadcast frames
is shared between all stations.

Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0c4b51f0 15-Sep-2015 Eric W. Biederman <ebiederm@xmission.com>

netfilter: Pass net into okfn

This is immediately motivated by the bridge code that chains functions that
call into netfilter. Without passing net into the okfns the bridge code would
need to guess about the best expression for the network namespace to process
packets in.

As net is frequently one of the first things computed in continuation functions
after netfilter has done it's job passing in the desired network namespace is in
many cases a code simplification.

To support this change the function dst_output_okfn is introduced to
simplify passing dst_output as an okfn. For the moment dst_output_okfn
just silently drops the struct net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 29a26a56 15-Sep-2015 Eric W. Biederman <ebiederm@xmission.com>

netfilter: Pass struct net into the netfilter hooks

Pass a network namespace parameter into the netfilter hooks. At the
call site of the netfilter hooks the path a packet is taking through
the network stack is well known which allows the network namespace to
be easily and reliabily.

This allows the replacement of magic code like
"dev_net(state->in?:state->out)" that appears at the start of most
netfilter hooks with "state->net".

In almost all cases the network namespace passed in is derived
from the first network device passed in, guaranteeing those
paths will not see any changes in practice.

The exceptions are:
xfrm/xfrm_output.c:xfrm_output_resume() xs_net(skb_dst(skb)->xfrm)
ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont() ip_vs_conn_net(cp)
ipvs/ip_vs_xmit.c:ip_vs_send_or_cont() ip_vs_conn_net(cp)
ipv4/raw.c:raw_send_hdrinc() sock_net(sk)
ipv6/ip6_output.c:ip6_xmit() sock_net(sk)
ipv6/ndisc.c:ndisc_send_skb() dev_net(skb->dev) not dev_net(dst->dev)
ipv6/raw.c:raw6_send_hdrinc() sock_net(sk)
br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev

In all cases these exceptions seem to be a better expression for the
network namespace the packet is being processed in then the historic
"dev_net(in?in:out)". I am documenting them in case something odd
pops up and someone starts trying to track down what happened.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9865249f 15-Sep-2015 Eric W. Biederman <ebiederm@xmission.com>

ipv6: Don't recompute net in ip6_rcv

Avoid silly redundant code

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 48fb6b55 22-Jul-2015 Wei-Chun Chao <weichunc@plumgrid.com>

ipv6: fix crash over flow-based vxlan device

Similar check was added in ip_rcv but not in ipv6_rcv.

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff81734e0a>] ipv6_rcv+0xfa/0x500
Call Trace:
[<ffffffff816c9786>] ? ip_rcv+0x296/0x400
[<ffffffff817732d2>] ? packet_rcv+0x52/0x410
[<ffffffff8168e99f>] __netif_receive_skb_core+0x63f/0x9a0
[<ffffffffc02b34a0>] ? br_handle_frame_finish+0x580/0x580 [bridge]
[<ffffffff8109912c>] ? update_rq_clock.part.81+0x1c/0x40
[<ffffffff8168ed18>] __netif_receive_skb+0x18/0x60
[<ffffffff8168fa1f>] process_backlog+0x9f/0x150

Fixes: ee122c79d422 (vxlan: Flow based tunneling)
Signed-off-by: Wei-Chun Chao <weichunc@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4c938d22 02-Jul-2015 Angga <Hermin.Anggawijaya@alliedtelesis.co.nz>

ipv6: Make MLD packets to only be processed locally

Before commit daad151263cf ("ipv6: Make ipv6_is_mld() inline and use it
from ip6_mc_input().") MLD packets were only processed locally. After the
change, a copy of MLD packet goes through ip6_mr_input, causing
MRT6MSG_NOCACHE message to be generated to user space.

Make MLD packet only processed locally.

Fixes: daad151263cf ("ipv6: Make ipv6_is_mld() inline and use it from ip6_mc_input().")
Signed-off-by: Hermin Anggawijaya <hermin.anggawijaya@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1b0ccfe5 10-Jun-2015 David S. Miller <davem@davemloft.net>

Revert "ipv6: Fix protocol resubmission"

This reverts commit 0243508edd317ff1fa63b495643a7c192fbfcd92.

It introduces new regressions.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 0243508e 07-Jun-2015 Josh Hunt <johunt@akamai.com>

ipv6: Fix protocol resubmission

UDP encapsulation is broken on IPv6. This is because the logic to resubmit
the nexthdr is inverted, checking for a ret value > 0 instead of < 0. Also,
the resubmit label is in the wrong position since we already get the
nexthdr value when performing decapsulation. In addition the skb pull is no
longer necessary either.

This changes the return value check to look for < 0, using it for the
nexthdr on the next iteration, and moves the resubmit label to the proper
location.

With these changes the v6 code now matches what we do in the v4 ip input
code wrt resubmitting when decapsulating.

Signed-off-by: Josh Hunt <johunt@akamai.com>
Acked-by: "Tom Herbert" <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7026b1dd 05-Apr-2015 David Miller <davem@davemloft.net>

netfilter: Pass socket pointer down through okfn().

On the output paths in particular, we have to sometimes deal with two
socket contexts. First, and usually skb->sk, is the local socket that
generated the frame.

And second, is potentially the socket used to control a tunneling
socket, such as one the encapsulates using UDP.

We do not want to disassociate skb->sk when encapsulating in order
to fix this, because that would break socket memory accounting.

The most extreme case where this can cause huge problems is an
AF_PACKET socket transmitting over a vxlan device. We hit code
paths doing checks that assume they are dealing with an ipv4
socket, but are actually operating upon the AF_PACKET one.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 53b24b8f 29-Mar-2015 Ian Morris <ipm@chirality.org.uk>

ipv6: coding style: comparison for inequality with NULL

The ipv6 code uses a mixture of coding styles. In some instances check for NULL
pointer is done as x != NULL and sometimes as x. x is preferred according to
checkpatch and this patch makes the code consistent by adopting the latter
form.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e5d08d71 23-Nov-2014 Ian Morris <ipm@chirality.org.uk>

ipv6: coding style improvements (remove assignment in if statements)

This change has no functional impact and simply addresses some coding
style issues detected by checkpatch. Specifically this change
adjusts "if" statements which also include the assignment of a
variable.

No changes to the resultant object files result as determined by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 67ba4152 24-Aug-2014 Ian Morris <ipm@chirality.org.uk>

ipv6: White-space cleansing : Line Layouts

This patch makes no changes to the logic of the code but simply addresses
coding style issues as detected by checkpatch.

Both objdump and diff -w show no differences.

A number of items are addressed in this patch:
* Multiple spaces converted to tabs
* Spaces before tabs removed.
* Spaces in pointer typing cleansed (char *)foo etc.
* Remove space after sizeof
* Ensure spacing around comparators such as if statements.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a452ce34 27-Jan-2014 Holger Eitzenberger <holger@eitzenberger.org>

net: Fix memory leak if TPROXY used with TCP early demux

I see a memory leak when using a transparent HTTP proxy using TPROXY
together with TCP early demux and Kernel v3.8.13.15 (Ubuntu stable):

unreferenced object 0xffff88008cba4a40 (size 1696):
comm "softirq", pid 0, jiffies 4294944115 (age 8907.520s)
hex dump (first 32 bytes):
0a e0 20 6a 40 04 1b 37 92 be 32 e2 e8 b4 00 00 .. j@..7..2.....
02 00 07 01 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff810b710a>] kmem_cache_alloc+0xad/0xb9
[<ffffffff81270185>] sk_prot_alloc+0x29/0xc5
[<ffffffff812702cf>] sk_clone_lock+0x14/0x283
[<ffffffff812aaf3a>] inet_csk_clone_lock+0xf/0x7b
[<ffffffff8129a893>] netlink_broadcast+0x14/0x16
[<ffffffff812c1573>] tcp_create_openreq_child+0x1b/0x4c3
[<ffffffff812c033e>] tcp_v4_syn_recv_sock+0x38/0x25d
[<ffffffff812c13e4>] tcp_check_req+0x25c/0x3d0
[<ffffffff812bf87a>] tcp_v4_do_rcv+0x287/0x40e
[<ffffffff812a08a7>] ip_route_input_noref+0x843/0xa55
[<ffffffff812bfeca>] tcp_v4_rcv+0x4c9/0x725
[<ffffffff812a26f4>] ip_local_deliver_finish+0xe9/0x154
[<ffffffff8127a927>] __netif_receive_skb+0x4b2/0x514
[<ffffffff8127aa77>] process_backlog+0xee/0x1c5
[<ffffffff8127c949>] net_rx_action+0xa7/0x200
[<ffffffff81209d86>] add_interrupt_randomness+0x39/0x157

But there are many more, resulting in the machine going OOM after some
days.

From looking at the TPROXY code, and with help from Florian, I see
that the memory leak is introduced in tcp_v4_early_demux():

void tcp_v4_early_demux(struct sk_buff *skb)
{
/* ... */

iph = ip_hdr(skb);
th = tcp_hdr(skb);

if (th->doff < sizeof(struct tcphdr) / 4)
return;

sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
iph->saddr, th->source,
iph->daddr, ntohs(th->dest),
skb->skb_iif);
if (sk) {
skb->sk = sk;

where the socket is assigned unconditionally to skb->sk, also bumping
the refcnt on it. This is problematic, because in our case the skb
has already a socket assigned in the TPROXY target. This then results
in the leak I see.

The very same issue seems to be with IPv6, but haven't tested.

Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1f07d03e 06-Aug-2013 Eric Dumazet <edumazet@google.com>

net: add SNMP counters tracking incoming ECN bits

With GRO/LRO processing, there is a problem because Ip[6]InReceives SNMP
counters do not count the number of frames, but number of aggregated
segments.

Its probably too late to change this now.

This patch adds four new counters, tracking number of frames, regardless
of LRO/GRO, and on a per ECN status basis, for IPv4 and IPv6.

Ip[6]NoECTPkts : Number of packets received with NOECT
Ip[6]ECT1Pkts : Number of packets received with ECT(1)
Ip[6]ECT0Pkts : Number of packets received with ECT(0)
Ip[6]CEPkts : Number of packets received with Congestion Experienced

lph37:~# nstat | egrep "Pkts|InReceive"
IpInReceives 1634137 0.0
Ip6InReceives 3714107 0.0
Ip6InNoECTPkts 19205 0.0
Ip6InECT0Pkts 52651828 0.0
IpExtInNoECTPkts 33630 0.0
IpExtInECT0Pkts 15581379 0.0
IpExtInCEPkts 6 0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1c4a154e 26-Mar-2013 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: don't accept node local multicast traffic from the wire

Erik Hugne's errata proposal (Errata ID: 3480) to RFC4291 has been
verified: http://www.rfc-editor.org/errata_search.php?eid=3480

We have to check for pkt_type and loopback flag because either the
packets are allowed to travel over the loopback interface (in which case
pkt_type is PACKET_HOST and IFF_LOOPBACK flag is set) or they travel
over a non-loopback interface back to us (in which case PACKET_TYPE is
PACKET_LOOPBACK and IFF_LOOPBACK flag is not set).

Cc: Erik Hugne <erik.hugne@ericsson.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ddf64354 07-Mar-2013 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: stop multicast forwarding to process interface scoped addresses

v2:
a) used struct ipv6_addr_props

v3:
a) reverted changes for ipv6_addr_props

v4:
a) do not use __ipv6_addr_needs_scope_id

Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d8c6f4b9 01-Mar-2013 Neil Horman <nhorman@tuxdriver.com>

ipv[4|6]: correct dropwatch false positive in local_deliver_finish

I had a report recently of a user trying to use dropwatch to localise some frame
loss, and they were getting false positives. Turned out they were using a user
space SCTP stack that used raw sockets to grab frames. When we don't have a
registered protocol for a given packet, we record it as a drop, even if a raw
socket receieves the frame. We should only record the drop in the event a raw
socket doesnt exist to receive the frames

Tested by the reported successfully

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: William Reich <reich@ulticom.com>
Tested-by: William Reich <reich@ulticom.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: William Reich <reich@ulticom.com>
CC: eric.dumazet@gmail.com
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 20314092 09-Feb-2013 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: don't accept multicast traffic with scope 0

v2:
a) moved before multicast source address check
b) changed comment to netdev style

Cc: Erik Hugne <erik.hugne@ericsson.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# dd3332bf 12-Jan-2013 YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org>

ipv6: Store Router Alert option in IP6CB directly.

Router Alert option is very small and we can store the value
itself in the skb.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# daad1512 12-Jan-2013 YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org>

ipv6: Make ipv6_is_mld() inline and use it from ip6_mc_input().

Move generalized version of ipv6_is_mld() to header,
and use it from ip6_mc_input().

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cca32e4b 29-Jul-2012 Eric Dumazet <edumazet@google.com>

net: TCP early demux cleanup

early_demux() handlers should be called in RCU context, and as we
use skb_dst_set_noref(skb, dst), caller must not exit from RCU context
before dst use (skb_dst(skb)) or release (skb_drop(dst))

Therefore, rcu_read_lock()/rcu_read_unlock() pairs around
->early_demux() are confusing and not needed :

Protocol handlers are already in an RCU read lock section.
(__netif_receive_skb() does the rcu_read_lock() )

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c7109986 25-Jul-2012 Eric Dumazet <edumazet@google.com>

ipv6: Early TCP socket demux

This is the IPv6 missing bits for infrastructure added in commit
41063e9dd1195 (ipv4: Early TCP socket demux.)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f9242b6b 19-Jun-2012 David S. Miller <davem@davemloft.net>

inet: Sanitize inet{,6} protocol demux.

Don't pretend that inet_protos[] and inet6_protos[] are hashes, thay
are just a straight arrays. Remove all unnecessary hash masking.

Document MAX_INET_PROTOS.

Use RAW_HTABLE_SIZE when appropriate.

Reported-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a50feda5 18-May-2012 Eric Dumazet <edumazet@google.com>

ipv6: bool/const conversions phase2

Mostly bool conversions, some inline removals and const additions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 75f2811c 30-Nov-2011 Jesse Gross <jesse@nicira.com>

ipv6: Add fragment reporting to ipv6_skip_exthdr().

While parsing through IPv6 extension headers, fragment headers are
skipped making them invisible to the caller. This reports the
fragment offset of the last header in order to make it possible to
determine whether the packet is fragmented and, if so whether it is
a first or last fragment.

Signed-off-by: Jesse Gross <jesse@nicira.com>


# c457338d 07-Nov-2011 Brian Haley <brian.haley@hp.com>

ipv6: drop packets when source address is multicast

RFC 4291 Section 2.7 says Multicast addresses must not be used as source
addresses in IPv6 packets - drop them on input so we don't process the
packet further.

Signed-off-by: Brian Haley <brian.haley@hp.com>
Reported-and-Tested-by: Kumar Sanghvi <divinekumar@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b71d1d42 21-Apr-2011 Eric Dumazet <eric.dumazet@gmail.com>

inet: constify ip headers and in6_addr

Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers
where possible, to make code intention more obvious.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# b2e0b385 22-Mar-2010 Jan Engelhardt <jengelh@medozas.de>

netfilter: ipv6: use NFPROTO values for NF_HOOK invocation

The semantic patch that was used:
// <smpl>
@@
@@
(NF_HOOK
|NF_HOOK_THRESH
|nf_hook
)(
-PF_INET6,
+NFPROTO_IPV6,
...)
// </smpl>

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>


# 3ffe533c 18-Feb-2010 Alexey Dobriyan <adobriyan@gmail.com>

ipv6: drop unused "dev" arg of icmpv6_send()

Dunno, what was the idea, it wasn't used for a long time.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 41135cc8 13-Sep-2009 Alexey Dobriyan <adobriyan@gmail.com>

net: constify struct inet6_protocol

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5c91face 06-Jul-2009 Mark Smith <lk-netdev@lk-netdev.nosense.org>

ipv6: correct return on ipv6_rcv() packet drop

The routine ipv6_rcv() uses magic number 0 for a return when it drops a
packet. This corresponds to NET_RX_SUCCESS, which is obviously
incorrect. Correct this by using NET_RX_DROP instead.

ps. It isn't exactly clear who the IPv6 maintainers are, apologies if
I've missed any.

Signed-off-by: Mark Smith <markzzzsmith@yahoo.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 71f9dacd 26-Jun-2009 Herbert Xu <herbert@gondor.apana.org.au>

inet: Call skb_orphan before tproxy activates

As transparent proxying looks up the socket early and assigns
it to the skb for later processing, we must drop any existing
socket ownership prior to that in order to distinguish between
the case where tproxy is active and where it is not.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# adf30907 01-Jun-2009 Eric Dumazet <eric.dumazet@gmail.com>

net: skb->dst accessors

Define three accessors to get/set dst attached to a skb

struct dst_entry *skb_dst(const struct sk_buff *skb)

void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

void skb_dst_drop(struct sk_buff *skb)
This one should replace occurrences of :
dst_release(skb->dst)
skb->dst = NULL;

Delete skb->dst field

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# edf391ff 27-Apr-2009 Neil Horman <nhorman@tuxdriver.com>

snmp: add missing counters for RFC 4293

The IP MIB (RFC 4293) defines stats for InOctets, OutOctets, InMcastOctets and
OutMcastOctets:
http://tools.ietf.org/html/rfc4293
But it seems we don't track those in any way that easy to separate from other
protocols. This patch adds those missing counters to the stats file. Tested
successfully by me

With help from Eric Dumazet.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 71f6f6df 27-Mar-2009 Jesper Nilsson <jesper.nilsson@axis.com>

ipv6: Plug sk_buff leak in ipv6_rcv (net/ipv6/ip6_input.c)

Commit 778d80be52699596bf70e0eb0761cf5e1e46088d
(ipv6: Add disable_ipv6 sysctl to disable IPv6 operaion on specific interface)
seems to have introduced a leak of sk_buff's for ipv6 traffic,
at least in some configurations where idev is NULL, or when ipv6
is disabled via sysctl.

The problem is that if the first condition of the if-statement
returns non-NULL, it returns an skb with only one reference,
and when the other conditions apply, execution jumps to the "out"
label, which does not call kfree_skb for it.

To plug this leak, change to use the "drop" label instead.
(this relies on it being ok to call kfree_skb on NULL)
This also allows us to avoid calling rcu_read_unlock here,
and removes the only user of the "out" label.

Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1d6e55f1 27-Jan-2009 Thomas Goff <thomas.goff@boeing.com>

IPv6: Fix multicast routing bugs.

This patch addresses the IPv6 multicast routing issues described
below. It was tested with XORP 1.4/1.5 as the IPv6 PIM-SM routing
daemon against FreeBSD peers.

net/ipv6/ip6_input.c:

- Don't try to forward link-local multicast packets.

- Don't reset skb2->dev before calling ip6_mr_input() so packets can
be identified as coming from the PIM register vif properly.

net/ipv6/ip6mr.c:

- Fix incoming PIM register messages processing:

* The IPv6 pseudo-header should be included when checksumming PIM
messages (RFC 4601 section 4.9; RFC 3973 section 4.7.1).

* Packets decapsulated from PIM register messages should have
skb->protocol ETH_P_IPV6.

- Enable/disable IPv6 multicast forwarding on the corresponding
interface when a routing daemon adds/removes a multicast virtual
interface.

- Remove incorrect skb_pull() to fix userspace signaling.

- Enable/disable global IPv6 multicast forwarding when an IPv6
multicast routing socket is opened/closed.

net/ipv6/route.c:

- Don't use strict routing logic for packets decapsulated from PIM
register messages (similar to disabling rp_filter for the IPv4
case).

Signed-off-by: Thomas Goff <thomas.goff@boeing.com>
Reviewed-by: Fred Templin <fred.l.templin@boeing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 483a47d2 08-Oct-2008 Denis V. Lunev <den@openvz.org>

ipv6: added net argument to IP6_INC_STATS_BH

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 53b7997f 19-Jul-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

ipv6 netns: Make several "global" sysctl variables namespace aware.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 778d80be 27-Jun-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

ipv6: Add disable_ipv6 sysctl to disable IPv6 operaion on specific interface.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# f630e43a 19-Jun-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

ipv6: Drop packets for loopback address from outside of the box.

[ Based upon original report and patch by Karsten Keil. Karsten
has verified that this fixes the TAHI test case "ICMPv6 test
v6LC.5.1.2 Part F". -DaveM ]

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0b040829 10-Jun-2008 Adrian Bunk <bunk@kernel.org>

net: remove CVS keywords

This patch removes CVS keywords that weren't updated for a long time
from comments.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# aba6096b 10-Apr-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6]: Kill several warnings without CONFIG_IPV6_MROUTE.

Pointed out by Andrew Morton <akpm@linux-foundation.org>.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# 7bc570c8 02-Apr-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6] MROUTE: Support multicast forwarding.

Based on ancient patch by Mickael Hoerdt
<hoerdt@clarinet.u-strasbg.fr>, which is available at
<http://www-r2.u-strasbg.fr/~hoerdt/dev/linux_ipv6_mforwarding/patch-linux-ipv6-mforwarding-0.1a>.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# 4c7966b8 31-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6] MCAST: Ensure to check multicast listener(s).

In ip6_mc_input(), we need to check whether we have listener(s) for
the packet.

After commit ae7bf20a6316272acfcaef5d265b18aaa54b41e4, all packets
for multicast destinations are delivered to upper layer if
IFF_PROMISC or IFF_ALLMULTI is set.

In fact, bug was rather ancient; the original (before the commit)
intent of the dev->flags check was to skip the ipv6_chk_mcast_addr()
call, assuming L2 filters packets appropriately, but it was even not
true.

Let's explicitly check our multicast list.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# da6bb5c0 05-Mar-2008 Daniel Lezcano <dlezcano@fr.ibm.com>

[NETNS][IPV6] ip6_input - enable ipv6_rcv to handle several network namespace

The different subsystem of ipv6 are ready for namespaces, so let's
activate it for ipv6_rcv.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 69d6da0b 19-Nov-2007 Pavel Emelyanov <xemul@openvz.org>

[IPv6] RAW: Compact the API for the kernel

Same as in the previous patch for ipv4, compact the
API and hide hash table and rwlock inside the raw.c
file.

Plus fix some "bad" places from checkpatch.pl point
of view (assignments inside if()).

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6e23ae2a 19-Nov-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: Introduce NF_INET_ hook values

The IPv4 and IPv6 hook values are identical, yet some code tries to figure
out the "correct" value by looking at the address family. Introduce NF_INET_*
values for both IPv4 and IPv6. The old values are kept in a #ifndef __KERNEL__
section for userspace compatibility.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e5bbef20 15-Oct-2007 Herbert Xu <herbert@gondor.apana.org.au>

[IPV6]: Replace sk_buff ** with sk_buff * in input handlers

With all the users of the double pointers removed from the IPv6 input path,
this patch converts all occurances of sk_buff ** to sk_buff * in IPv6 input
handlers.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ad643a79 15-Oct-2007 Patrick McHardy <kaber@trash.net>

[IPV6]: Uninline netfilter okfns

Uninline netfilter okfns for those cases where gcc can generate tail-calls.

Before:
text data bss dec hex filename
8994153 1016524 524652 10535329 a0c1a1 vmlinux

After:
text data bss dec hex filename
8992761 1016524 524652 10533937 a0bc31 vmlinux
-------------------------------------------------------
-1392

All cases have been verified to generate tail-calls with and without netfilter.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# de3cb747 25-Sep-2007 Daniel Lezcano <dlezcano@fr.ibm.com>

[NET]: Dynamically allocate the loopback device, part 1.

This patch replaces all occurences to the static variable
loopback_dev to a pointer loopback_dev. That provides the
mindless, trivial, uninteressting change part for the dynamic
allocation for the loopback.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-By: Kirill Korotaev <dev@sw.ru>
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e730c155 17-Sep-2007 Eric W. Biederman <ebiederm@xmission.com>

[NET]: Make packet reception network namespace safe

This patch modifies every packet receive function
registered with dev_add_pack() to drop packets if they
are not from the initial network namespace.

This should ensure that the various network stacks do
not receive packets in a anything but the initial network
namespace until the code has been converted and is ready
for them.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ae7bf20a 14-May-2007 Corey Mutter <crm-netdev@mutternet.com>

[IPV6]: Reverse sense of promisc tests in ip6_mc_input

Reverse the sense of the promiscuous-mode tests in ip6_mc_input().

Signed-off-by: Corey Mutter <crm-netdev@mutternet.com>
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b0e380b1 10-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: unions of just one member don't get anything done, kill them

Renaming skb->h to skb->transport_header, skb->nh to skb->network_header and
skb->mac to skb->mac_header, to match the names of the associated helpers
(skb[_[re]set]_{transport,network,mac}_header).

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cfe1fc77 16-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce skb_network_header_len

For the common sequence "skb->h.raw - skb->nh.raw", similar to skb->mac_len,
that is precalculated tho, don't think we need to bloat skb with one more
member, so just use this new helper, reducing the number of non-skbuff.h
references to the layer headers even more.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ea2ae17d 25-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce skb_transport_offset()

For the quite common 'skb->h.raw - skb->data' sequence.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# badff6d0 13-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce skb_reset_transport_header(skb)

For the common, open coded 'skb->h.raw = skb->data' operation, so that we can
later turn skb->h.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.

This one touches just the most simple cases:

skb->h.raw = skb->data;
skb->h.raw = {skb_push|[__]skb_pull}()

The next ones will handle the slightly more "complex" cases.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0660e03f 25-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce ipv6_hdr(), remove skb->nh.ipv6h

Now the skb->nh union has just one member, .raw, i.e. it is just like the
skb->mac union, strange, no? I'm just leaving it like that till the transport
layer is done with, when we'll rename skb->mac.raw to skb->mac_header (or
->mac_header_offset?), ditto for ->{h,nh}.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d56f90a7 10-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce skb_network_header()

For the places where we need a pointer to the network header, it is still legal
to touch skb->nh.raw directly if just adding to, subtracting from or setting it
to another layer header.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 60e5c166 05-Apr-2007 Mitsuru Chinen <mitch@linux.vnet.ibm.com>

[IPv6]: Exclude truncated packets from InHdrErrors statistics

Incoming trancated packets are counted as not only InTruncatedPkts but
also InHdrErrors. They should be counted as InTruncatedPkts only.

Signed-off-by: Mitsuru Chinen <mitch@linux.vnet.ibm.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cd354f1a 14-Feb-2007 Tim Schmielau <tim@physik3.uni-rostock.de>

[PATCH] remove many unneeded #includes of sched.h

After Al Viro (finally) succeeded in removing the sched.h #include in module.h
recently, it makes sense again to remove other superfluous sched.h includes.
There are quite a lot of files which include it but don't actually need
anything defined in there. Presumably these includes were once needed for
macros that used to live in sched.h, but moved to other header files in the
course of cleaning it up.

To ease the pain, this time I did not fiddle with any header files and only
removed #includes from .c-files, which tend to cause less trouble.

Compile tested against 2.6.20-rc2 and 2.6.20-rc2-mm2 (with offsets) on alpha,
arm, i386, ia64, mips, powerpc, and x86_64 with allnoconfig, defconfig,
allmodconfig, and allyesconfig as well as a few randconfigs on x86_64 and all
configs in arch/arm/configs on arm. I also checked that no new warnings were
introduced by the patch (actually, some warnings are removed that were emitted
by unnecessarily included header files).

Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1ab1457c 09-Feb-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[NET] IPV6: Fix whitespace errors.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a11d206d 04-Nov-2006 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6]: Per-interface statistics support.

For IP MIB (RFC4293).

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# 7a3025b1 13-Oct-2006 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6]: Introduce ip6_dst_idev() to get inet6_dev{} stored in dst_entry{}.

Otherwise, we will see a lot of casts...

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# a80ff03e 23-Aug-2006 Masahide NAKAMURA <nakam@linux-ipv6.org>

[IPV6]: Allow to replace skbuff by TLV parser.

In receiving Mobile IPv6 home address option which is a TLV carried by
destination options header, kernel will try to mangle source adderss
of packet. Think of cloned skbuff it is required to replace it by the
parser just like routing header case.

This is a framework to achieve that to allow TLV parser to replace
inbound skbuff pointer.

Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6b7fdc3a 25-Jul-2006 Guillaume Chazarain <guichaz@yahoo.fr>

[IPV6]: Clean skb cb on IPv6 input.

Clear the accumulated junk in IP6CB when starting to handle an IPV6
packet.

Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2889139a 30-Jun-2006 Herbert Xu <herbert@gondor.apana.org.au>

[IPV6]: Remove redundant length check on input

We don't need to check skb->len when we're just about to call
pskb_may_pull since that checks it for us.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b809739a 18-Apr-2006 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6]: Clean up hop-by-hop options handler.

- Removed unused argument (nhoff) for ipv6_parse_hopopts().
- Make ipv6_parse_hopopts() to align with other extension header
handlers.
- Removed pointless assignment (hdr), which is not used afterwards.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b05e1066 07-Jan-2006 Patrick McHardy <kaber@trash.net>

[IPV4/6]: Netfilter IPsec input hooks

When the innermost transform uses transport mode the decapsulated packet
is not visible to netfilter. Pass the packet through the PRE_ROUTING and
LOCAL_IN hooks again before handing it to upper layer protocols to make
netfilter-visibility symetrical to the output path.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 951dbc8a 07-Jan-2006 Patrick McHardy <kaber@trash.net>

[IPV6]: Move nextheader offset to the IP6CB

Move nextheader offset to the IP6CB to make it possible to pass a
packet to ip6_input_finish multiple times and have it skip already
parsed headers. As a nice side effect this gets rid of the manual
hopopts skipping in ip6_input_finish.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9fb9cbb1 09-Nov-2005 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>

[NETFILTER]: Add nf_conntrack subsystem.

The existing connection tracking subsystem in netfilter can only
handle ipv4. There were basically two choices present to add
connection tracking support for ipv6. We could either duplicate all
of the ipv4 connection tracking code into an ipv6 counterpart, or (the
choice taken by these patches) we could design a generic layer that
could handle both ipv4 and ipv6 and thus requiring only one sub-protocol
(TCP, UDP, etc.) connection tracking helper module to be written.

In fact nf_conntrack is capable of working with any layer 3
protocol.

The existing ipv4 specific conntrack code could also not deal
with the pecularities of doing connection tracking on ipv6,
which is also cured here. For example, these issues include:

1) ICMPv6 handling, which is used for neighbour discovery in
ipv6 thus some messages such as these should not participate
in connection tracking since effectively they are like ARP
messages

2) fragmentation must be handled differently in ipv6, because
the simplistic "defrag, connection track and NAT, refrag"
(which the existing ipv4 connection tracking does) approach simply
isn't feasible in ipv6

3) ipv6 extension header parsing must occur at the correct spots
before and after connection tracking decisions, and there were
no provisions for this in the existing connection tracking
design

4) ipv6 has no need for stateful NAT

The ipv4 specific conntrack layer is kept around, until all of
the ipv4 specific conntrack helpers are ported over to nf_conntrack
and it is feature complete. Once that occurs, the old conntrack
stuff will get placed into the feature-removal-schedule and we will
fully kill it off 6 months later.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# d13964f4 09-Aug-2005 Patrick McHardy <kaber@trash.net>

[IPV4/6]: Check if packet was actually delivered to a raw socket to decide whether to send an ICMP unreachable

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f2ccd8fa 09-Aug-2005 David S. Miller <davem@davemloft.net>

[NET]: Kill skb->real_dev

Bonding just wants the device before the skb_bond()
decapsulation occurs, so simply pass that original
device into packet_type->func() as an argument.

It remains to be seen whether we can use this same
exact thing to get rid of skb->input_dev as well.

Signed-off-by: David S. Miller <davem@davemloft.net>


# bfd272b1 17-Aug-2005 Patrick McHardy <kaber@trash.net>

[IPV6]: Fix SKB leak in ip6_input_finish()

Changing it to how ip_input handles should fix it.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fad87aca 16-Aug-2005 Patrick McHardy <kaber@trash.net>

[IPV6]: Fix SKB leak in ip6_input_finish()

Changing it to how ip_input handles should fix it.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1da177e4 16-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org>

Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!