History log of /linux-master/net/ipv6/ip6_output.c
Revision Date Author Comments
# 35c3e279 14-Mar-2024 Abhishek Chauhan <quic_abchauha@quicinc.com>

Revert "net: Re-use and set mono_delivery_time bit for userspace tstamp packets"

This reverts commit 885c36e59f46375c138de18ff1692f18eff67b7f.

The patch currently broke the bpf selftest test_tc_dtime because
uapi field __sk_buff->tstamp_type depends on skb->mono_delivery_time which
does not necessarily mean mono with the original fix as the bit was re-used
for userspace timestamp as well to avoid tstamp reset in the forwarding
path. To solve this we need to keep mono_delivery_time as is and
introduce another bit called user_delivery_time and fall back to the
initial proposal of setting the user_delivery_time bit based on
sk_clockid set from userspace.

Fixes: 885c36e59f46 ("net: Re-use and set mono_delivery_time bit for userspace tstamp packets")
Link: https://lore.kernel.org/netdev/bc037db4-58bb-4861-ac31-a361a93841d3@linux.dev/
Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 885c36e5 01-Mar-2024 Abhishek Chauhan <quic_abchauha@quicinc.com>

net: Re-use and set mono_delivery_time bit for userspace tstamp packets

Bridge driver today has no support to forward the userspace timestamp
packets and ends up resetting the timestamp. ETF qdisc checks the
packet coming from userspace and encounters to be 0 thereby dropping
time sensitive packets. These changes will allow userspace timestamps
packets to be forwarded from the bridge to NIC drivers.

Setting the same bit (mono_delivery_time) to avoid dropping of
userspace tstamp packets in the forwarding path.

Existing functionality of mono_delivery_time remains unaltered here,
instead just extended with userspace tstamp support for bridge
forwarding path.

Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240301201348.2815102-1-quic_abchauha@quicinc.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# 624d5aec 28-Feb-2024 Eric Dumazet <edumazet@google.com>

ipv6: annotate data-races around devconf->disable_policy

idev->cnf.disable_policy and net->ipv6.devconf_all->disable_policy
can be read locklessly. Add appropriate annotations on reads
and writes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a8fbd4d9 28-Feb-2024 Eric Dumazet <edumazet@google.com>

ipv6: annotate data-races around devconf->proxy_ndp

devconf->proxy_ndp can be read and written locklessly,
add appropriate annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 32f75417 28-Feb-2024 Eric Dumazet <edumazet@google.com>

ipv6: annotate data-races around cnf.forwarding

idev->cnf.forwarding and net->ipv6.devconf_all->forwarding
might be read locklessly, add appropriate READ_ONCE()
and WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d289ab65 28-Feb-2024 Eric Dumazet <edumazet@google.com>

ipv6: annotate data-races around cnf.disable_ipv6

disable_ipv6 is read locklessly, add appropriate READ_ONCE()
and WRITE_ONCE() annotations.

v2: do not preload net before rtnl_trylock() in
addrconf_disable_ipv6() (Jiri)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 488b6d91 13-Feb-2024 Vadim Fedorenko <vadim.fedorenko@linux.dev>

net-timestamp: make sk_tskey more predictable in error path

When SOF_TIMESTAMPING_OPT_ID is used to ambiguate timestamped datagrams,
the sk_tskey can become unpredictable in case of any error happened
during sendmsg(). Move increment later in the code and make decrement of
sk_tskey in error path. This solution is still racy in case of multiple
threads doing snedmsg() over the very same socket in parallel, but still
makes error path much more predictable.

Fixes: 09c2d251b707 ("net-timestamp: add key to disambiguate concurrent datagrams")
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240213110428.1681540-1-vadfed@meta.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# 03d6c848 24-Oct-2023 Yan Zhai <yan@cloudflare.com>

ipv6: avoid atomic fragment on GSO packets

When the ipv6 stack output a GSO packet, if its gso_size is larger than
dst MTU, then all segments would be fragmented. However, it is possible
for a GSO packet to have a trailing segment with smaller actual size
than both gso_size as well as the MTU, which leads to an "atomic
fragment". Atomic fragments are considered harmful in RFC-8021. An
Existing report from APNIC also shows that atomic fragments are more
likely to be dropped even it is equivalent to a no-op [1].

Add an extra check in the GSO slow output path. For each segment from
the original over-sized packet, if it fits with the path MTU, then avoid
generating an atomic fragment.

Link: https://www.potaroo.net/presentations/2022-03-01-ipv6-frag.pdf [1]
Fixes: b210de4f8c97 ("net: ipv6: Validate GSO SKB before finish IPv6 processing")
Reported-by: David Wragg <dwragg@cloudflare.com>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Link: https://lore.kernel.org/r/90912e3503a242dca0bc36958b11ed03a2696e5e.1698156966.git.yan@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 1f7ec1b3 24-Oct-2023 Yan Zhai <yan@cloudflare.com>

ipv6: refactor ip6_finish_output for GSO handling

Separate GSO and non-GSO packets handling to make the logic cleaner. For
GSO packets, frag_max_size check can be omitted because it is only
useful for packets defragmented by netfilter hooks. Both local output
and GRO logic won't produce GSO packets when defragment is needed. This
also mirrors what IPv4 side code is doing.

Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/0e1d4599f858e2becff5c4fe0b5f843236bc3fe8.1698156966.git.yan@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# e57a3447 24-Oct-2023 Yan Zhai <yan@cloudflare.com>

ipv6: drop feature RTAX_FEATURE_ALLFRAG

RTAX_FEATURE_ALLFRAG was added before the first git commit:

https://www.mail-archive.com/bk-commits-head@vger.kernel.org/msg03399.html

The feature would send packets to the fragmentation path if a box
receives a PMTU value with less than 1280 byte. However, since commit
9d289715eb5c ("ipv6: stop sending PTB packets for MTU < 1280"), such
message would be simply discarded. The feature flag is neither supported
in iproute2 utility. In theory one can still manipulate it with direct
netlink message, but it is not ideal because it was based on obsoleted
guidance of RFC-2460 (replaced by RFC-8200).

The feature would always test false at the moment, so remove related
code or mark them as unused.

Signed-off-by: Yan Zhai <yan@cloudflare.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/d78e44dcd9968a252143ffe78460446476a472a1.1698156966.git.yan@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# fc47e86d 20-Oct-2023 Beniamino Galvani <b.galvani@gmail.com>

ipv6: rename and move ip6_dst_lookup_tunnel()

At the moment ip6_dst_lookup_tunnel() is used only by bareudp.
Ideally, other UDP tunnel implementations should use it, but to do so
the function needs to accept new parameters that are specific for UDP
tunnels, such as the ports.

Prepare for these changes by renaming the function to
udp_tunnel6_dst_lookup() and move it to file
net/ipv6/ip6_udp_tunnel.c.

This is similar to what already done for IPv4 in commit bf3fcbf7e7a0
("ipv4: rename and move ip_route_output_tunnel()").

Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Beniamino Galvani <b.galvani@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b4a11b20 18-Oct-2023 Heng Guo <heng.guo@windriver.com>

net: fix IPSTATS_MIB_OUTPKGS increment in OutForwDatagrams.

Reproduce environment:
network with 3 VM linuxs is connected as below:
VM1<---->VM2(latest kernel 6.5.0-rc7)<---->VM3
VM1: eth0 ip: 192.168.122.207 MTU 1500
VM2: eth0 ip: 192.168.122.208, eth1 ip: 192.168.123.224 MTU 1500
VM3: eth0 ip: 192.168.123.240 MTU 1500

Reproduce:
VM1 send 1400 bytes UDP data to VM3 using tools scapy with flags=0.
scapy command:
send(IP(dst="192.168.123.240",flags=0)/UDP()/str('0'*1400),count=1,
inter=1.000000)

Result:
Before IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails
FragOKs FragFails FragCreates
Ip: 1 64 11 0 3 4 0 0 4 7 0 0 0 0 0 0 0 0 0
......
----------------------------------------------------------------------
After IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails
FragOKs FragFails FragCreates
Ip: 1 64 12 0 3 5 0 0 4 8 0 0 0 0 0 0 0 0 0
......
----------------------------------------------------------------------
"ForwDatagrams" increase from 4 to 5 and "OutRequests" also increase
from 7 to 8.

Issue description and patch:
IPSTATS_MIB_OUTPKTS("OutRequests") is counted with IPSTATS_MIB_OUTOCTETS
("OutOctets") in ip_finish_output2().
According to RFC 4293, it is "OutOctets" counted with "OutTransmits" but
not "OutRequests". "OutRequests" does not include any datagrams counted
in "ForwDatagrams".
ipSystemStatsOutOctets OBJECT-TYPE
DESCRIPTION
"The total number of octets in IP datagrams delivered to the
lower layers for transmission. Octets from datagrams
counted in ipIfStatsOutTransmits MUST be counted here.
ipSystemStatsOutRequests OBJECT-TYPE
DESCRIPTION
"The total number of IP datagrams that local IP user-
protocols (including ICMP) supplied to IP in requests for
transmission. Note that this counter does not include any
datagrams counted in ipSystemStatsOutForwDatagrams.
So do patch to define IPSTATS_MIB_OUTPKTS to "OutTransmits" and add
IPSTATS_MIB_OUTREQUESTS for "OutRequests".
Add IPSTATS_MIB_OUTREQUESTS counter in __ip_local_out() for ipv4 and add
IPSTATS_MIB_OUT counter in ip6_finish_output2() for ipv6.

Test result with patch:
Before IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails
FragOKs FragFails FragCreates OutTransmits
Ip: 1 64 9 0 5 1 0 0 3 3 0 0 0 0 0 0 0 0 0 4
......
root@qemux86-64:~# cat /proc/net/netstat
......
IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts
OutBcastPkts InOctets OutOctets InMcastOctets OutMcastOctets
InBcastOctets OutBcastOctets InCsumErrors InNoECTPkts InECT1Pkts
InECT0Pkts InCEPkts ReasmOverlaps
IpExt: 0 0 0 0 0 0 2976 1896 0 0 0 0 0 9 0 0 0 0
----------------------------------------------------------------------
After IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails
FragOKs FragFails FragCreates OutTransmits
Ip: 1 64 10 0 5 2 0 0 3 3 0 0 0 0 0 0 0 0 0 5
......
root@qemux86-64:~# cat /proc/net/netstat
......
IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts
OutBcastPkts InOctets OutOctets InMcastOctets OutMcastOctets
InBcastOctets OutBcastOctets InCsumErrors InNoECTPkts InECT1Pkts
InECT0Pkts InCEPkts ReasmOverlaps
IpExt: 0 0 0 0 0 0 4404 3324 0 0 0 0 0 10 0 0 0 0
----------------------------------------------------------------------
"ForwDatagrams" increase from 1 to 2 and "OutRequests" is keeping 3.
"OutTransmits" increase from 4 to 5 and "OutOctets" increase 1428.

Signed-off-by: Heng Guo <heng.guo@windriver.com>
Reviewed-by: Kun Song <Kun.Song@windriver.com>
Reviewed-by: Filip Pudak <filip.pudak@windriver.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cf8b49fb 10-Oct-2023 Heng Guo <heng.guo@windriver.com>

net: fix IPSTATS_MIB_OUTFORWDATAGRAMS increment after fragment check

Reproduce environment:
network with 3 VM linuxs is connected as below:
VM1<---->VM2(latest kernel 6.5.0-rc7)<---->VM3
VM1: eth0 ip: 192.168.122.207 MTU 1800
VM2: eth0 ip: 192.168.122.208, eth1 ip: 192.168.123.224 MTU 1500
VM3: eth0 ip: 192.168.123.240 MTU 1800

Reproduce:
VM1 send 1600 bytes UDP data to VM3 using tools scapy with flags='DF'.
scapy command:
send(IP(dst="192.168.123.240",flags='DF')/UDP()/str('0'*1600),count=1,
inter=1.000000)

Result:
Before IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
OutDiscards OutNoRoutes ReasmTimeout ReasmReqdss
Ip: 1 64 6 0 2 2 0 0 2 4 0 0 0 0 0 0 0 0 0
......
root@qemux86-64:~#
----------------------------------------------------------------------
After IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
OutDiscards OutNoRoutes ReasmTimeout ReasmReqdss
Ip: 1 64 7 0 2 2 0 0 2 5 0 0 0 0 0 0 0 1 0
......
root@qemux86-64:~#
----------------------------------------------------------------------
ForwDatagrams is always keeping 2 without increment.

Issue description and patch:
ip_exceeds_mtu() in ip_forward() drops this IP datagram because skb len
(1600 sending by scapy) is over MTU(1500 in VM2) if "DF" is set.
According to RFC 4293 "3.2.3. IP Statistics Tables",
+-------+------>------+----->-----+----->-----+
| InForwDatagrams (6) | OutForwDatagrams (6) |
| V +->-+ OutFragReqds
| InNoRoutes | | (packets)
/ (local packet (3) | |
| IF is that of the address | +--> OutFragFails
| and may not be the receiving IF) | | (packets)
the IPSTATS_MIB_OUTFORWDATAGRAMS should be counted before fragment
check.
The existing implementation, instead, would incease the counter after
fragment check: ip_exceeds_mtu() in ipv4 and ip6_pkt_too_big() in ipv6.
So do patch to move IPSTATS_MIB_OUTFORWDATAGRAMS counter to ip_forward()
for ipv4 and ip6_forward() for ipv6.

Test result with patch:
Before IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
OutDiscards OutNoRoutes ReasmTimeout ReasmReqdss
Ip: 1 64 6 0 2 2 0 0 2 4 0 0 0 0 0 0 0 0 0
......
root@qemux86-64:~#
----------------------------------------------------------------------
After IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
OutDiscards OutNoRoutes ReasmTimeout ReasmReqdss
Ip: 1 64 7 0 2 3 0 0 2 5 0 0 0 0 0 0 0 1 0
......
root@qemux86-64:~#
----------------------------------------------------------------------
ForwDatagrams is updated from 2 to 3.

Reviewed-by: Filip Pudak <filip.pudak@windriver.com>
Signed-off-by: Heng Guo <heng.guo@windriver.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20231011015137.27262-1-heng.guo@windriver.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 10bbf165 21-Sep-2023 Eric Dumazet <edumazet@google.com>

net: implement lockless SO_PRIORITY

This is a followup of 8bf43be799d4 ("net: annotate data-races
around sk->sk_priority").

sk->sk_priority can be read and written without holding the socket lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fa17a6d8 18-Sep-2023 Eric Dumazet <edumazet@google.com>

ipv6: lockless IPV6_ADDR_PREFERENCES implementation

We have data-races while reading np->srcprefs

Switch the field to a plain byte, add READ_ONCE()
and WRITE_ONCE() annotations where needed,
and IPV6_ADDR_PREFERENCES setsockopt() can now be lockless.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230918142321.1794107-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# 6b724bc4 12-Sep-2023 Eric Dumazet <edumazet@google.com>

ipv6: lockless IPV6_MTU_DISCOVER implementation

Most np->pmtudisc reads are racy.

Move this 3bit field on a full byte, add annotations
and make IPV6_MTU_DISCOVER setsockopt() lockless.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 83cd5eb6 12-Sep-2023 Eric Dumazet <edumazet@google.com>

ipv6: lockless IPV6_ROUTER_ALERT_ISOLATE implementation

Reads from np->rtalert_isolate are racy.

Move this flag to inet->inet_flags to fix data-races.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1086ca7c 12-Sep-2023 Eric Dumazet <edumazet@google.com>

ipv6: lockless IPV6_DONTFRAG implementation

Move np->dontfrag flag to inet->inet_flags to fix data-races.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5121516b 12-Sep-2023 Eric Dumazet <edumazet@google.com>

ipv6: lockless IPV6_AUTOFLOWLABEL implementation

Move np->autoflowlabel and np->autoflowlabel_set in inet->inet_flags,
to fix data-races.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 15f926c4 12-Sep-2023 Eric Dumazet <edumazet@google.com>

ipv6: lockless IPV6_MTU implementation

np->frag_size can be read/written without holding socket lock.

Add missing annotations and make IPV6_MTU setsockopt() lockless.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b0adfba7 12-Sep-2023 Eric Dumazet <edumazet@google.com>

ipv6: lockless IPV6_UNICAST_HOPS implementation

Some np->hop_limit accesses are racy, when socket lock is not held.

Add missing annotations and switch to full lockless implementation.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e3390b30 31-Aug-2023 Eric Dumazet <edumazet@google.com>

net: annotate data-races around sk->sk_tsflags

sk->sk_tsflags can be read locklessly, add corresponding annotations.

Fixes: b9f40e21ef42 ("net-timestamp: move timestamp flags out of sk_flags")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e4da8c78 25-Aug-2023 Heng Guo <heng.guo@windriver.com>

net: ipv4, ipv6: fix IPSTATS_MIB_OUTOCTETS increment duplicated

commit edf391ff1723 ("snmp: add missing counters for RFC 4293") had
already added OutOctets for RFC 4293. In commit 2d8dbb04c63e ("snmp: fix
OutOctets counter to include forwarded datagrams"), OutOctets was
counted again, but not removed from ip_output().

According to RFC 4293 "3.2.3. IP Statistics Tables",
ipipIfStatsOutTransmits is not equal to ipIfStatsOutForwDatagrams. So
"IPSTATS_MIB_OUTOCTETS must be incremented when incrementing" is not
accurate. And IPSTATS_MIB_OUTOCTETS should be counted after fragment.

This patch reverts commit 2d8dbb04c63e ("snmp: fix OutOctets counter to
include forwarded datagrams") and move IPSTATS_MIB_OUTOCTETS to
ip_finish_output2 for ipv4.

Reviewed-by: Filip Pudak <filip.pudak@windriver.com>
Signed-off-by: Heng Guo <heng.guo@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a171fbec 17-Aug-2023 Yan Zhai <yan@cloudflare.com>

lwt: Check LWTUNNEL_XMIT_CONTINUE strictly

LWTUNNEL_XMIT_CONTINUE is implicitly assumed in ip(6)_finish_output2,
such that any positive return value from a xmit hook could cause
unexpected continue behavior, despite that related skb may have been
freed. This could be error-prone for future xmit hook ops. One of the
possible errors is to return statuses of dst_output directly.

To make the code safer, redefine LWTUNNEL_XMIT_CONTINUE value to
distinguish from dst_output statuses and check the continue
condition explicitly.

Fixes: 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure")
Suggested-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/96b939b85eda00e8df4f7c080f770970a4c5f698.1692326837.git.yan@cloudflare.com


# cafbe182 16-Aug-2023 Eric Dumazet <edumazet@google.com>

inet: move inet->hdrincl to inet->inet_flags

IP_HDRINCL socket option can now be set/read
without locking the socket.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ce650a16 02-Aug-2023 David Howells <dhowells@redhat.com>

udp6: Fix __ip6_append_data()'s handling of MSG_SPLICE_PAGES

__ip6_append_data() can has a similar problem to __ip_append_data()[1] when
asked to splice into a partially-built UDP message that has more than the
frag-limit data and up to the MTU limit, but in the ipv6 case, it errors
out with EINVAL. This can be triggered with something like:

pipe(pfd);
sfd = socket(AF_INET6, SOCK_DGRAM, 0);
connect(sfd, ...);
send(sfd, buffer, 8137, MSG_CONFIRM|MSG_MORE);
write(pfd[1], buffer, 8);
splice(pfd[0], 0, sfd, 0, 0x4ffe0ul, 0);

where the amount of data given to send() is dependent on the MTU size (in
this instance an interface with an MTU of 8192).

The problem is that the calculation of the amount to copy in
__ip6_append_data() goes negative in two places, but a check has been put
in to give an error in this case.

This happens because when pagedlen > 0 (which happens for MSG_ZEROCOPY and
MSG_SPLICE_PAGES), the terms in:

copy = datalen - transhdrlen - fraggap - pagedlen;

then mostly cancel when pagedlen is substituted for, leaving just -fraggap.

Fix this by:

(1) Insert a note about the dodgy calculation of 'copy'.

(2) If MSG_SPLICE_PAGES, clear copy if it is negative from the above
equation, so that 'offset' isn't regressed and 'length' isn't
increased, which will mean that length and thus copy should match the
amount left in the iterator.

(3) When handling MSG_SPLICE_PAGES, give a warning and return -EIO if
we're asked to splice more than is in the iterator. It might be
better to not give the warning or even just give a 'short' write.

(4) If MSG_SPLICE_PAGES, override the copy<0 check.

[!] Note that this should also affect MSG_ZEROCOPY, but that will return
-EINVAL for the range of send sizes that requires the skbuff to be split.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/r/000000000000881d0606004541d1@google.com/ [1]
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/1580952.1690961810@warthog.procyon.org.uk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# 5a6f6873 14-Jun-2023 David Howells <dhowells@redhat.com>

ip, ip6: Fix splice to raw and ping sockets

Splicing to SOCK_RAW sockets may set MSG_SPLICE_PAGES, but in such a case,
__ip_append_data() will call skb_splice_from_iter() to access the 'from'
data, assuming it to point to a msghdr struct with an iter, instead of
using the provided getfrag function to access it.

In the case of raw_sendmsg(), however, this is not the case and 'from' will
point to a raw_frag_vec struct and raw_getfrag() will be the frag-getting
function. A similar issue may occur with rawv6_sendmsg().

Fix this by ignoring MSG_SPLICE_PAGES if getfrag != ip_generic_getfrag as
ip_generic_getfrag() expects "from" to be a msghdr*, but the other getfrags
don't. Note that this will prevent MSG_SPLICE_PAGES from being effective
for udplite.

This likely affects ping sockets too. udplite looks like it should be okay
as it expects "from" to be a msghdr.

Signed-off-by: David Howells <dhowells@redhat.com>
Reported-by: syzbot+d8486855ef44506fd675@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/000000000000ae4cbf05fdeb8349@google.com/
Fixes: 2dc334f1a63a ("splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()")
Tested-by: syzbot+d8486855ef44506fd675@syzkaller.appspotmail.com
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/1410156.1686729856@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# d457a0e3 08-Jun-2023 Eric Dumazet <edumazet@google.com>

net: move gso declarations and functions to their own files

Move declarations into include/net/gso.h and code into net/core/gso.c

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 6d8192bd 22-May-2023 David Howells <dhowells@redhat.com>

ip6, udp6: Support MSG_SPLICE_PAGES

Make IP6/UDP6 sendmsg() support MSG_SPLICE_PAGES. This causes pages to be
spliced from the source iterator if possible, copying the data if not.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 09eed119 20-Mar-2023 Eric Dumazet <edumazet@google.com>

neighbour: switch to standard rcu, instead of rcu_bh

rcu_bh is no longer a win, especially for objects freed
with standard call_rcu().

Switch neighbour code to no longer disable BH when not necessary.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# b071af52 13-Mar-2023 Eric Dumazet <edumazet@google.com>

neighbour: annotate lockless accesses to n->nud_state

We have many lockless accesses to n->nud_state.

Before adding another one in the following patch,
add annotations to readers and writers.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 8ca5a579 06-Mar-2023 Vadim Fedorenko <vadfed@meta.com>

net-timestamp: extend SOF_TIMESTAMPING_OPT_ID to HW timestamps

When the feature was added it was enabled for SW timestamps only but
with current hardware the same out-of-order timestamps can be seen.
Let's expand the area for the feature to all types of timestamps.

Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ea30388b 03-Apr-2023 Ziyang Xuan <william.xuanziyang@huawei.com>

ipv6: Fix an uninit variable access bug in __ip6_make_skb()

Syzbot reported a bug as following:

=====================================================
BUG: KMSAN: uninit-value in arch_atomic64_inc arch/x86/include/asm/atomic64_64.h:88 [inline]
BUG: KMSAN: uninit-value in arch_atomic_long_inc include/linux/atomic/atomic-long.h:161 [inline]
BUG: KMSAN: uninit-value in atomic_long_inc include/linux/atomic/atomic-instrumented.h:1429 [inline]
BUG: KMSAN: uninit-value in __ip6_make_skb+0x2f37/0x30f0 net/ipv6/ip6_output.c:1956
arch_atomic64_inc arch/x86/include/asm/atomic64_64.h:88 [inline]
arch_atomic_long_inc include/linux/atomic/atomic-long.h:161 [inline]
atomic_long_inc include/linux/atomic/atomic-instrumented.h:1429 [inline]
__ip6_make_skb+0x2f37/0x30f0 net/ipv6/ip6_output.c:1956
ip6_finish_skb include/net/ipv6.h:1122 [inline]
ip6_push_pending_frames+0x10e/0x550 net/ipv6/ip6_output.c:1987
rawv6_push_pending_frames+0xb12/0xb90 net/ipv6/raw.c:579
rawv6_sendmsg+0x297e/0x2e60 net/ipv6/raw.c:922
inet_sendmsg+0x101/0x180 net/ipv4/af_inet.c:827
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg net/socket.c:734 [inline]
____sys_sendmsg+0xa8e/0xe70 net/socket.c:2476
___sys_sendmsg+0x2a1/0x3f0 net/socket.c:2530
__sys_sendmsg net/socket.c:2559 [inline]
__do_sys_sendmsg net/socket.c:2568 [inline]
__se_sys_sendmsg net/socket.c:2566 [inline]
__x64_sys_sendmsg+0x367/0x540 net/socket.c:2566
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Uninit was created at:
slab_post_alloc_hook mm/slab.h:766 [inline]
slab_alloc_node mm/slub.c:3452 [inline]
__kmem_cache_alloc_node+0x71f/0xce0 mm/slub.c:3491
__do_kmalloc_node mm/slab_common.c:967 [inline]
__kmalloc_node_track_caller+0x114/0x3b0 mm/slab_common.c:988
kmalloc_reserve net/core/skbuff.c:492 [inline]
__alloc_skb+0x3af/0x8f0 net/core/skbuff.c:565
alloc_skb include/linux/skbuff.h:1270 [inline]
__ip6_append_data+0x51c1/0x6bb0 net/ipv6/ip6_output.c:1684
ip6_append_data+0x411/0x580 net/ipv6/ip6_output.c:1854
rawv6_sendmsg+0x2882/0x2e60 net/ipv6/raw.c:915
inet_sendmsg+0x101/0x180 net/ipv4/af_inet.c:827
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg net/socket.c:734 [inline]
____sys_sendmsg+0xa8e/0xe70 net/socket.c:2476
___sys_sendmsg+0x2a1/0x3f0 net/socket.c:2530
__sys_sendmsg net/socket.c:2559 [inline]
__do_sys_sendmsg net/socket.c:2568 [inline]
__se_sys_sendmsg net/socket.c:2566 [inline]
__x64_sys_sendmsg+0x367/0x540 net/socket.c:2566
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

It is because icmp6hdr does not in skb linear region under the scenario
of SOCK_RAW socket. Access icmp6_hdr(skb)->icmp6_type directly will
trigger the uninit variable access bug.

Use a local variable icmp6_type to carry the correct value in different
scenarios.

Fixes: 14878f75abd5 ("[IPV6]: Add ICMPMsgStats MIB (RFC 4293) [rev 2]")
Reported-by: syzbot+8257f4dcef79de670baf@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=3d605ec1d0a7f2a269a1a6936ac7f2b85975ee9c
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9f535c87 19-Jan-2023 Gergely Risko <gergely.risko@gmail.com>

ipv6: fix reachability confirmation with proxy_ndp

When proxying IPv6 NDP requests, the adverts to the initial multicast
solicits are correct and working. On the other hand, when later a
reachability confirmation is requested (on unicast), no reply is sent.

This causes the neighbor entry expiring on the sending node, which is
mostly a non-issue, as a new multicast request is sent. There are
routers, where the multicast requests are intentionally delayed, and in
these environments the current implementation causes periodic packet
loss for the proxied endpoints.

The root cause is the erroneous decrease of the hop limit, as this
is checked in ndisc.c and no answer is generated when it's 254 instead
of the correct 255.

Cc: stable@vger.kernel.org
Fixes: 46c7655f0b56 ("ipv6: decrease hop limit counter in ip6_forward()")
Signed-off-by: Gergely Risko <gergely.risko@gmail.com>
Tested-by: Gergely Risko <gergely.risko@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 803e8486 06-Dec-2022 Eric Dumazet <edumazet@google.com>

ipv6: avoid use-after-free in ip6_fragment()

Blamed commit claimed rcu_read_lock() was held by ip6_fragment() callers.

It seems to not be always true, at least for UDP stack.

syzbot reported:

BUG: KASAN: use-after-free in ip6_dst_idev include/net/ip6_fib.h:245 [inline]
BUG: KASAN: use-after-free in ip6_fragment+0x2724/0x2770 net/ipv6/ip6_output.c:951
Read of size 8 at addr ffff88801d403e80 by task syz-executor.3/7618

CPU: 1 PID: 7618 Comm: syz-executor.3 Not tainted 6.1.0-rc6-syzkaller-00012-g4312098baf37 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd1/0x138 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:284 [inline]
print_report+0x15e/0x45d mm/kasan/report.c:395
kasan_report+0xbf/0x1f0 mm/kasan/report.c:495
ip6_dst_idev include/net/ip6_fib.h:245 [inline]
ip6_fragment+0x2724/0x2770 net/ipv6/ip6_output.c:951
__ip6_finish_output net/ipv6/ip6_output.c:193 [inline]
ip6_finish_output+0x9a3/0x1170 net/ipv6/ip6_output.c:206
NF_HOOK_COND include/linux/netfilter.h:291 [inline]
ip6_output+0x1f1/0x540 net/ipv6/ip6_output.c:227
dst_output include/net/dst.h:445 [inline]
ip6_local_out+0xb3/0x1a0 net/ipv6/output_core.c:161
ip6_send_skb+0xbb/0x340 net/ipv6/ip6_output.c:1966
udp_v6_send_skb+0x82a/0x18a0 net/ipv6/udp.c:1286
udp_v6_push_pending_frames+0x140/0x200 net/ipv6/udp.c:1313
udpv6_sendmsg+0x18da/0x2c80 net/ipv6/udp.c:1606
inet6_sendmsg+0x9d/0xe0 net/ipv6/af_inet6.c:665
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg+0xd3/0x120 net/socket.c:734
sock_write_iter+0x295/0x3d0 net/socket.c:1108
call_write_iter include/linux/fs.h:2191 [inline]
new_sync_write fs/read_write.c:491 [inline]
vfs_write+0x9ed/0xdd0 fs/read_write.c:584
ksys_write+0x1ec/0x250 fs/read_write.c:637
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fde3588c0d9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fde365b6168 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007fde359ac050 RCX: 00007fde3588c0d9
RDX: 000000000000ffdc RSI: 00000000200000c0 RDI: 000000000000000a
RBP: 00007fde358e7ae9 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fde35acfb1f R14: 00007fde365b6300 R15: 0000000000022000
</TASK>

Allocated by task 7618:
kasan_save_stack+0x22/0x40 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
__kasan_slab_alloc+0x82/0x90 mm/kasan/common.c:325
kasan_slab_alloc include/linux/kasan.h:201 [inline]
slab_post_alloc_hook mm/slab.h:737 [inline]
slab_alloc_node mm/slub.c:3398 [inline]
slab_alloc mm/slub.c:3406 [inline]
__kmem_cache_alloc_lru mm/slub.c:3413 [inline]
kmem_cache_alloc+0x2b4/0x3d0 mm/slub.c:3422
dst_alloc+0x14a/0x1f0 net/core/dst.c:92
ip6_dst_alloc+0x32/0xa0 net/ipv6/route.c:344
ip6_rt_pcpu_alloc net/ipv6/route.c:1369 [inline]
rt6_make_pcpu_route net/ipv6/route.c:1417 [inline]
ip6_pol_route+0x901/0x1190 net/ipv6/route.c:2254
pol_lookup_func include/net/ip6_fib.h:582 [inline]
fib6_rule_lookup+0x52e/0x6f0 net/ipv6/fib6_rules.c:121
ip6_route_output_flags_noref+0x2e6/0x380 net/ipv6/route.c:2625
ip6_route_output_flags+0x76/0x320 net/ipv6/route.c:2638
ip6_route_output include/net/ip6_route.h:98 [inline]
ip6_dst_lookup_tail+0x5ab/0x1620 net/ipv6/ip6_output.c:1092
ip6_dst_lookup_flow+0x90/0x1d0 net/ipv6/ip6_output.c:1222
ip6_sk_dst_lookup_flow+0x553/0x980 net/ipv6/ip6_output.c:1260
udpv6_sendmsg+0x151d/0x2c80 net/ipv6/udp.c:1554
inet6_sendmsg+0x9d/0xe0 net/ipv6/af_inet6.c:665
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg+0xd3/0x120 net/socket.c:734
__sys_sendto+0x23a/0x340 net/socket.c:2117
__do_sys_sendto net/socket.c:2129 [inline]
__se_sys_sendto net/socket.c:2125 [inline]
__x64_sys_sendto+0xe1/0x1b0 net/socket.c:2125
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Freed by task 7599:
kasan_save_stack+0x22/0x40 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
kasan_save_free_info+0x2e/0x40 mm/kasan/generic.c:511
____kasan_slab_free mm/kasan/common.c:236 [inline]
____kasan_slab_free+0x160/0x1c0 mm/kasan/common.c:200
kasan_slab_free include/linux/kasan.h:177 [inline]
slab_free_hook mm/slub.c:1724 [inline]
slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1750
slab_free mm/slub.c:3661 [inline]
kmem_cache_free+0xee/0x5c0 mm/slub.c:3683
dst_destroy+0x2ea/0x400 net/core/dst.c:127
rcu_do_batch kernel/rcu/tree.c:2250 [inline]
rcu_core+0x81f/0x1980 kernel/rcu/tree.c:2510
__do_softirq+0x1fb/0xadc kernel/softirq.c:571

Last potentially related work creation:
kasan_save_stack+0x22/0x40 mm/kasan/common.c:45
__kasan_record_aux_stack+0xbc/0xd0 mm/kasan/generic.c:481
call_rcu+0x9d/0x820 kernel/rcu/tree.c:2798
dst_release net/core/dst.c:177 [inline]
dst_release+0x7d/0xe0 net/core/dst.c:167
refdst_drop include/net/dst.h:256 [inline]
skb_dst_drop include/net/dst.h:268 [inline]
skb_release_head_state+0x250/0x2a0 net/core/skbuff.c:838
skb_release_all net/core/skbuff.c:852 [inline]
__kfree_skb net/core/skbuff.c:868 [inline]
kfree_skb_reason+0x151/0x4b0 net/core/skbuff.c:891
kfree_skb_list_reason+0x4b/0x70 net/core/skbuff.c:901
kfree_skb_list include/linux/skbuff.h:1227 [inline]
ip6_fragment+0x2026/0x2770 net/ipv6/ip6_output.c:949
__ip6_finish_output net/ipv6/ip6_output.c:193 [inline]
ip6_finish_output+0x9a3/0x1170 net/ipv6/ip6_output.c:206
NF_HOOK_COND include/linux/netfilter.h:291 [inline]
ip6_output+0x1f1/0x540 net/ipv6/ip6_output.c:227
dst_output include/net/dst.h:445 [inline]
ip6_local_out+0xb3/0x1a0 net/ipv6/output_core.c:161
ip6_send_skb+0xbb/0x340 net/ipv6/ip6_output.c:1966
udp_v6_send_skb+0x82a/0x18a0 net/ipv6/udp.c:1286
udp_v6_push_pending_frames+0x140/0x200 net/ipv6/udp.c:1313
udpv6_sendmsg+0x18da/0x2c80 net/ipv6/udp.c:1606
inet6_sendmsg+0x9d/0xe0 net/ipv6/af_inet6.c:665
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg+0xd3/0x120 net/socket.c:734
sock_write_iter+0x295/0x3d0 net/socket.c:1108
call_write_iter include/linux/fs.h:2191 [inline]
new_sync_write fs/read_write.c:491 [inline]
vfs_write+0x9ed/0xdd0 fs/read_write.c:584
ksys_write+0x1ec/0x250 fs/read_write.c:637
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Second to last potentially related work creation:
kasan_save_stack+0x22/0x40 mm/kasan/common.c:45
__kasan_record_aux_stack+0xbc/0xd0 mm/kasan/generic.c:481
call_rcu+0x9d/0x820 kernel/rcu/tree.c:2798
dst_release net/core/dst.c:177 [inline]
dst_release+0x7d/0xe0 net/core/dst.c:167
refdst_drop include/net/dst.h:256 [inline]
skb_dst_drop include/net/dst.h:268 [inline]
__dev_queue_xmit+0x1b9d/0x3ba0 net/core/dev.c:4211
dev_queue_xmit include/linux/netdevice.h:3008 [inline]
neigh_resolve_output net/core/neighbour.c:1552 [inline]
neigh_resolve_output+0x51b/0x840 net/core/neighbour.c:1532
neigh_output include/net/neighbour.h:546 [inline]
ip6_finish_output2+0x56c/0x1530 net/ipv6/ip6_output.c:134
__ip6_finish_output net/ipv6/ip6_output.c:195 [inline]
ip6_finish_output+0x694/0x1170 net/ipv6/ip6_output.c:206
NF_HOOK_COND include/linux/netfilter.h:291 [inline]
ip6_output+0x1f1/0x540 net/ipv6/ip6_output.c:227
dst_output include/net/dst.h:445 [inline]
NF_HOOK include/linux/netfilter.h:302 [inline]
NF_HOOK include/linux/netfilter.h:296 [inline]
mld_sendpack+0xa09/0xe70 net/ipv6/mcast.c:1820
mld_send_cr net/ipv6/mcast.c:2121 [inline]
mld_ifc_work+0x720/0xdc0 net/ipv6/mcast.c:2653
process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289
worker_thread+0x669/0x1090 kernel/workqueue.c:2436
kthread+0x2e8/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306

The buggy address belongs to the object at ffff88801d403dc0
which belongs to the cache ip6_dst_cache of size 240
The buggy address is located 192 bytes inside of
240-byte region [ffff88801d403dc0, ffff88801d403eb0)

The buggy address belongs to the physical page:
page:ffffea00007500c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1d403
memcg:ffff888022f49c81
flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000000200 ffffea0001ef6580 dead000000000002 ffff88814addf640
raw: 0000000000000000 00000000800c000c 00000001ffffffff ffff888022f49c81
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112a20(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL), pid 3719, tgid 3719 (kworker/0:6), ts 136223432244, free_ts 136222971441
prep_new_page mm/page_alloc.c:2539 [inline]
get_page_from_freelist+0x10b5/0x2d50 mm/page_alloc.c:4288
__alloc_pages+0x1cb/0x5b0 mm/page_alloc.c:5555
alloc_pages+0x1aa/0x270 mm/mempolicy.c:2285
alloc_slab_page mm/slub.c:1794 [inline]
allocate_slab+0x213/0x300 mm/slub.c:1939
new_slab mm/slub.c:1992 [inline]
___slab_alloc+0xa91/0x1400 mm/slub.c:3180
__slab_alloc.constprop.0+0x56/0xa0 mm/slub.c:3279
slab_alloc_node mm/slub.c:3364 [inline]
slab_alloc mm/slub.c:3406 [inline]
__kmem_cache_alloc_lru mm/slub.c:3413 [inline]
kmem_cache_alloc+0x31a/0x3d0 mm/slub.c:3422
dst_alloc+0x14a/0x1f0 net/core/dst.c:92
ip6_dst_alloc+0x32/0xa0 net/ipv6/route.c:344
icmp6_dst_alloc+0x71/0x680 net/ipv6/route.c:3261
mld_sendpack+0x5de/0xe70 net/ipv6/mcast.c:1809
mld_send_cr net/ipv6/mcast.c:2121 [inline]
mld_ifc_work+0x720/0xdc0 net/ipv6/mcast.c:2653
process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289
worker_thread+0x669/0x1090 kernel/workqueue.c:2436
kthread+0x2e8/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1459 [inline]
free_pcp_prepare+0x65c/0xd90 mm/page_alloc.c:1509
free_unref_page_prepare mm/page_alloc.c:3387 [inline]
free_unref_page+0x1d/0x4d0 mm/page_alloc.c:3483
__unfreeze_partials+0x17c/0x1a0 mm/slub.c:2586
qlink_free mm/kasan/quarantine.c:168 [inline]
qlist_free_all+0x6a/0x170 mm/kasan/quarantine.c:187
kasan_quarantine_reduce+0x184/0x210 mm/kasan/quarantine.c:294
__kasan_slab_alloc+0x66/0x90 mm/kasan/common.c:302
kasan_slab_alloc include/linux/kasan.h:201 [inline]
slab_post_alloc_hook mm/slab.h:737 [inline]
slab_alloc_node mm/slub.c:3398 [inline]
kmem_cache_alloc_node+0x304/0x410 mm/slub.c:3443
__alloc_skb+0x214/0x300 net/core/skbuff.c:497
alloc_skb include/linux/skbuff.h:1267 [inline]
netlink_alloc_large_skb net/netlink/af_netlink.c:1191 [inline]
netlink_sendmsg+0x9a6/0xe10 net/netlink/af_netlink.c:1896
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg+0xd3/0x120 net/socket.c:734
__sys_sendto+0x23a/0x340 net/socket.c:2117
__do_sys_sendto net/socket.c:2129 [inline]
__se_sys_sendto net/socket.c:2125 [inline]
__x64_sys_sendto+0xe1/0x1b0 net/socket.c:2125
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: 1758fd4688eb ("ipv6: remove unnecessary dst_hold() in ip6_fragment()")
Reported-by: syzbot+8c0ac31aa9681abb9e2d@syzkaller.appspotmail.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20221206101351.2037285-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# e7d2b510 23-Sep-2022 Pavel Begunkov <asml.silence@gmail.com>

net: shrink struct ubuf_info

We can benefit from a smaller struct ubuf_info, so leave only mandatory
fields and let users to decide how they want to extend it. Convert
MSG_ZEROCOPY to struct ubuf_info_msgzc and remove duplicated fields.
This reduces the size from 48 bytes to just 16.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 47cf8899 25-Aug-2022 Pavel Begunkov <asml.silence@gmail.com>

net: unify alloclen calculation for paged requests

Consolidate alloclen and pagedlen calculation for zerocopy and normal
paged requests. The current non-zerocopy paged version can a bit
overallocate and unnecessary copy a small chunk of data into the linear
part.

Cc: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/netdev/CA+FuTSf0+cJ9_N_xrHmCGX_KoVCWcE0YQBdtgEkzGvcLMSv7Qw@mail.gmail.com/
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b0e4edb7b91f171c7119891d3c61040b8c56596e.1661428921.git.asml.silence@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# ab7e2e0d 05-Aug-2022 Matthias May <matthias.may@westermo.com>

ipv6: do not use RT_TOS for IPv6 flowlabel

According to Guillaume Nault RT_TOS should never be used for IPv6.

Quote:
RT_TOS() is an old macro used to interprete IPv4 TOS as described in
the obsolete RFC 1349. It's conceptually wrong to use it even in IPv4
code, although, given the current state of the code, most of the
existing calls have no consequence.

But using RT_TOS() in IPv6 code is always a bug: IPv6 never had a "TOS"
field to be interpreted the RFC 1349 way. There's no historical
compatibility to worry about.

Fixes: 571912c69f0e ("net: UDP tunnel encapsulation module for tunnelling different protocols like MPLS, IP, NSH etc.")
Acked-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Matthias May <matthias.may@westermo.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 1fd3ae8c 12-Jul-2022 Pavel Begunkov <asml.silence@gmail.com>

ipv6/udp: support externally provided ubufs

Teach ipv6/udp how to use external ubuf_info provided in msghdr and
also prepare it for managed frags by sprinkling
skb_zcopy_downgrade_managed() when it could mix managed and not managed
frags.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 773ba4fe 12-Jul-2022 Pavel Begunkov <asml.silence@gmail.com>

ipv6: avoid partial copy for zc

Even when zerocopy transmission is requested and possible,
__ip_append_data() will still copy a small chunk of data just because it
allocated some extra linear space (e.g. 128 bytes). It wastes CPU cycles
on copy and iter manipulations and also misalignes potentially aligned
data. Avoid such copies. And as a bonus we can allocate smaller skb.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# f93431c8 07-Jun-2022 Wang Yufen <wangyufen@huawei.com>

ipv6: Fix signed integer overflow in __ip6_append_data

Resurrect ubsan overflow checks and ubsan report this warning,
fix it by change the variable [length] type to size_t.

UBSAN: signed-integer-overflow in net/ipv6/ip6_output.c:1489:19
2147479552 + 8567 cannot be represented in type 'int'
CPU: 0 PID: 253 Comm: err Not tainted 5.16.0+ #1
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x214/0x230
show_stack+0x30/0x78
dump_stack_lvl+0xf8/0x118
dump_stack+0x18/0x30
ubsan_epilogue+0x18/0x60
handle_overflow+0xd0/0xf0
__ubsan_handle_add_overflow+0x34/0x44
__ip6_append_data.isra.48+0x1598/0x1688
ip6_append_data+0x128/0x260
udpv6_sendmsg+0x680/0xdd0
inet6_sendmsg+0x54/0x90
sock_sendmsg+0x70/0x88
____sys_sendmsg+0xe8/0x368
___sys_sendmsg+0x98/0xe0
__sys_sendmmsg+0xf4/0x3b8
__arm64_sys_sendmmsg+0x34/0x48
invoke_syscall+0x64/0x160
el0_svc_common.constprop.4+0x124/0x300
do_el0_svc+0x44/0xc8
el0_svc+0x3c/0x1e8
el0t_64_sync_handler+0x88/0xb0
el0t_64_sync+0x16c/0x170

Changes since v1:
-Change the variable [length] type to unsigned, as Eric Dumazet suggested.
Changes since v2:
-Don't change exthdrlen type in ip6_make_skb, as Paolo Abeni suggested.
Changes since v3:
-Don't change ulen type in udpv6_sendmsg and l2tp_ip6_sendmsg, as
Jakub Kicinski suggested.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
Link: https://lore.kernel.org/r/20220607120028.845916-1-wangyufen@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 80e425b6 13-May-2022 Coco Li <lixiaoyan@google.com>

ipv6: Add hop-by-hop header to jumbograms in ip6_output

Instead of simply forcing a 0 payload_len in IPv6 header,
implement RFC 2675 and insert a custom extension header.

Note that only TCP stack is currently potentially generating
jumbograms, and that this extension header is purely local,
it wont be sent on a physical link.

This is needed so that packet capture (tcpdump and friends)
can properly dissect these large packets.

Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 58f71be5 28-Apr-2022 Pavel Begunkov <asml.silence@gmail.com>

ipv6: refactor ip6_finish_output2()

Throw neigh checks in ip6_finish_output2() under a single slow path if,
so we don't have the overhead in the hot path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4b143ed7 28-Apr-2022 Pavel Begunkov <asml.silence@gmail.com>

ipv6: help __ip6_finish_output() inlining

There are two callers of __ip6_finish_output(), both are in
ip6_finish_output(). We can combine the call sites into one and handle
return code after, that will inline __ip6_finish_output().

Note, error handling under NET_XMIT_CN will only return 0 if
__ip6_finish_output() succeded, and in this case it return 0.
Considering that NET_XMIT_SUCCESS is 0, it'll be returning exactly the
same result for it as before.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2edc1a38 13-Apr-2022 Menglong Dong <imagedong@tencent.com>

net: ip: add skb drop reasons to ip forwarding

Replace kfree_skb() which is used in ip6_forward() and ip_forward()
with kfree_skb_reason().

The new drop reason 'SKB_DROP_REASON_PKT_TOO_BIG' is introduced for
the case that the length of the packet exceeds MTU and can't
fragment.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e3fa461d 08-Apr-2022 Nicolas Dichtel <nicolas.dichtel@6wind.com>

ipv6: fix panic when forwarding a pkt with no in6 dev

kongweibin reported a kernel panic in ip6_forward() when input interface
has no in6 dev associated.

The following tc commands were used to reproduce this panic:
tc qdisc del dev vxlan100 root
tc qdisc add dev vxlan100 root netem corrupt 5%

CC: stable@vger.kernel.org
Fixes: ccd27f05ae7b ("ipv6: fix 'disable_policy' for fwd packets")
Reported-by: kongweibin <kongweibin2@huawei.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 40867d74 14-Mar-2022 David Ahern <dsahern@kernel.org>

net: Add l3mdev index to flow struct and avoid oif reset for port devices

The fundamental premise of VRF and l3mdev core code is binding a socket
to a device (l3mdev or netdev with an L3 domain) to indicate L3 scope.
Legacy code resets flowi_oif to the l3mdev losing any original port
device binding. Ben (among others) has demonstrated use cases where the
original port device binding is important and needs to be retained.
This patch handles that by adding a new entry to the common flow struct
that can indicate the l3mdev index for later rule and table matching
avoiding the need to reset flowi_oif.

In addition to allowing more use cases that require port device binds,
this patch brings a few datapath simplications:

1. l3mdev_fib_rule_match is only called when walking fib rules and
always after l3mdev_update_flow. That allows an optimization to bail
early for non-VRF type uses cases when flowi_l3mdev is not set. Also,
only that index needs to be checked for the FIB table id.

2. l3mdev_update_flow can be called with flowi_oif set to a l3mdev
(e.g., VRF) device. By resetting flowi_oif only for this case the
FLOWI_FLAG_SKIP_NH_OIF flag is not longer needed and can be removed,
removing several checks in the datapath. The flowi_iif path can be
simplified to only be called if the it is not loopback (loopback can
not be assigned to an L3 domain) and the l3mdev index is not already
set.

3. Avoid another device lookup in the output path when the fib lookup
returns a reject failure.

Note: 2 functional tests for local traffic with reject fib rules are
updated to reflect the new direct failure at FIB lookup time for ping
rather than the failure on packet path. The current code fails like this:

HINT: Fails since address on vrf device is out of device scope
COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1
ping: Warning: source address might be selected on device other than: eth1
PING 172.16.3.1 (172.16.3.1) from 172.16.3.1 eth1: 56(84) bytes of data.

--- 172.16.3.1 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

where the test now directly fails:

HINT: Fails since address on vrf device is out of device scope
COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1
ping: connect: No route to host

Signed-off-by: David Ahern <dsahern@kernel.org>
Tested-by: Ben Greear <greearb@candelatech.com>
Link: https://lore.kernel.org/r/20220314204551.16369-1-dsahern@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# de799101 02-Mar-2022 Martin KaFai Lau <kafai@fb.com>

net: Add skb_clear_tstamp() to keep the mono delivery_time

Right now, skb->tstamp is reset to 0 whenever the skb is forwarded.

If skb->tstamp has the mono delivery_time, clearing it can hurt
the performance when it finally transmits out to fq@phy-dev.

The earlier patch added a skb->mono_delivery_time bit to
flag the skb->tstamp carrying the mono delivery_time.

This patch adds skb_clear_tstamp() helper which keeps
the mono delivery_time and clears everything else.

The delivery_time clearing will be postponed until the stack knows the
skb will be delivered locally. It will be done in a latter patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a1ac9c8a 02-Mar-2022 Martin KaFai Lau <kafai@fb.com>

net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp

skb->tstamp was first used as the (rcv) timestamp.
The major usage is to report it to the user (e.g. SO_TIMESTAMP).

Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
during egress and used by the qdisc (e.g. sch_fq) to make decision on when
the skb can be passed to the dev.

Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
or the delivery_time, so it is always reset to 0 whenever forwarded
between egress and ingress.

While it makes sense to always clear the (rcv) timestamp in skb->tstamp
to avoid confusing sch_fq that expects the delivery_time, it is a
performance issue [0] to clear the delivery_time if the skb finally
egress to a fq@phy-dev. For example, when forwarding from egress to
ingress and then finally back to egress:

tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
^ ^
reset rest

This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp
is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.

The current use case is to keep the TCP mono delivery_time (EDT) and
to be used with sch_fq. A latter patch will also allow tc-bpf@ingress
to read and change the mono delivery_time.

In the future, another bit (e.g. skb->user_delivery_time) can be added
for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid.

[ This patch is a prep work. The following patches will
get the other parts of the stack ready first. Then another patch
after that will finally set the skb->mono_delivery_time. ]

skb_set_delivery_time() function is added. It is used by the tcp_output.c
and during ip[6] fragmentation to assign the delivery_time to
the skb->tstamp and also set the skb->mono_delivery_time.

A note on the change in ip_send_unicast_reply() in ip_output.c.
It is only used by TCP to send reset/ack out of a ctl_sk.
Like the new skb_set_delivery_time(), this patch sets
the skb->mono_delivery_time to 0 for now as a place
holder. It will be enabled in a latter patch.
A similar case in tcp_ipv6 can be done with
skb_set_delivery_time() in tcp_v6_send_response().

[0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5e187189 25-Feb-2022 Menglong Dong <imagedong@tencent.com>

net: ip: add skb drop reasons for ip egress path

Replace kfree_skb() which is used in the packet egress path of IP layer
with kfree_skb_reason(). Functions that are involved include:

__ip_queue_xmit()
ip_finish_output()
ip_mc_finish_output()
ip6_output()
ip6_finish_output()
ip6_finish_output2()

Following new drop reasons are introduced:

SKB_DROP_REASON_IP_OUTNOROUTES
SKB_DROP_REASON_BPF_CGROUP_EGRESS
SKB_DROP_REASON_IPV6DISABLED
SKB_DROP_REASON_NEIGH_CREATEFAIL

Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 13651224 16-Feb-2022 Jakub Kicinski <kuba@kernel.org>

net: ping6: support setting basic SOL_IPV6 options via cmsg

Support setting IPV6_HOPLIMIT, IPV6_TCLASS, IPV6_DONTFRAG
during sendmsg via SOL_IPV6 cmsgs.

tclass and dontfrag are init'ed from struct ipv6_pinfo in
ipcm6_init_sk(), while hlimit is inited to -1, so we need
to handle it being populated via cmsg explicitly.

Leave extension headers and flowlabel unimplemented.
Those are slightly more laborious to test and users
seem to primarily care about IPV6_TCLASS.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 40ac240c 26-Jan-2022 Pavel Begunkov <asml.silence@gmail.com>

ipv6: optimise dst refcounting on cork init

udpv6_sendmsg() doesn't need dst after calling ip6_make_skb(), so
instead of taking an additional reference inside ip6_setup_cork()
and releasing the initial one afterwards, we can hand over a reference
into ip6_make_skb() saving two atomics. The only other user of
ip6_setup_cork() is ip6_append_data() and it requires an extra
dst_hold().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# f37a4cc6 26-Jan-2022 Pavel Begunkov <asml.silence@gmail.com>

udp6: pass flow in ip6_make_skb together with cork

Another preparation patch. inet_cork_full already contains a field for
iflow, so we can avoid passing a separate struct iflow6 into
__ip6_append_data() and ip6_make_skb(), and use the flow stored in
inet_cork_full. Make sure callers set cork->fl, i.e. we init it in
ip6_append_data() and before calling ip6_make_skb().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# f3b46a3e 26-Jan-2022 Pavel Begunkov <asml.silence@gmail.com>

ipv6: pass full cork into __ip6_append_data()

Convert a struct inet_cork argument in __ip6_append_data() to struct
inet_cork_full. As one struct contains another inet_cork is still can
be accessed via ->base field. It's a preparation patch making further
changes a bit cleaner.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 940ea00b 26-Jan-2022 Pavel Begunkov <asml.silence@gmail.com>

ipv6: don't zero inet_cork_full::fl after use

It doesn't appear there is any reason for ip6_cork_release() to zero
cork->fl, it'll be fully filled on next initialisation. This 88 bytes
memset accounts to 0.3-0.5% of total CPU cycles.
It's also needed in following patches and allows to remove an extar flow
copy in udp_v6_push_pending_frames().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# d656b2ea 26-Jan-2022 Pavel Begunkov <asml.silence@gmail.com>

ipv6: clean up cork setup/release

Clean up ip6_setup_cork() and ip6_cork_release() adding a local variable
for v6_cork->opt. It's a preparation patch for further changes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# b60d4e58 26-Jan-2022 Pavel Begunkov <asml.silence@gmail.com>

ipv6: remove daddr temp buffer in __ip6_make_skb

ipv6_push_nfrag_opts() doesn't change passed daddr, and so
__ip6_make_skb() doesn't actually need to keep an on-stack copy of
fl6->daddr. Set initially final_dst to fl6->daddr,
ipv6_push_nfrag_opts() will override it if needed, and get rid of extra
copies.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# cd3c7480 26-Jan-2022 Pavel Begunkov <asml.silence@gmail.com>

ipv6: optimise dst refcounting on skb init

__ip6_make_skb() gets a cork->dst ref, hands it over to skb and shortly
after puts cork->dst. Save two atomics by stealing it without extra
referencing, ip6_cork_release() handles NULL cork->dst.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 5e34af41 10-Mar-2022 Tadeusz Struk <tadeusz.struk@linaro.org>

net: ipv6: fix skb_over_panic in __ip6_append_data

Syzbot found a kernel bug in the ipv6 stack:
LINK: https://syzkaller.appspot.com/bug?id=205d6f11d72329ab8d62a610c44c5e7e25415580
The reproducer triggers it by sending a crafted message via sendmmsg()
call, which triggers skb_over_panic, and crashes the kernel:

skbuff: skb_over_panic: text:ffffffff84647fb4 len:65575 put:65575
head:ffff888109ff0000 data:ffff888109ff0088 tail:0x100af end:0xfec0
dev:<NULL>

Update the check that prevents an invalid packet with MTU equal
to the fregment header size to eat up all the space for payload.

The reproducer can be found here:
LINK: https://syzkaller.appspot.com/text?tag=ReproC&x=1648c83fb00000

Reported-by: syzbot+e223cf47ec8ae183f2a0@syzkaller.appspotmail.com
Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20220310232538.1044947-1-tadeusz.struk@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 6596a022 19-Jan-2022 Jiri Bohac <jbohac@suse.cz>

xfrm: fix MTU regression

Commit 749439bfac6e1a2932c582e2699f91d329658196 ("ipv6: fix udpv6
sendmsg crash caused by too small MTU") breaks PMTU for xfrm.

A Packet Too Big ICMPv6 message received in response to an ESP
packet will prevent all further communication through the tunnel
if the reported MTU minus the ESP overhead is smaller than 1280.

E.g. in a case of a tunnel-mode ESP with sha256/aes the overhead
is 92 bytes. Receiving a PTB with MTU of 1371 or less will result
in all further packets in the tunnel dropped. A ping through the
tunnel fails with "ping: sendmsg: Invalid argument".

Apparently the MTU on the xfrm route is smaller than 1280 and
fails the check inside ip6_setup_cork() added by 749439bf.

We found this by debugging USGv6/ipv6ready failures. Failing
tests are: "Phase-2 Interoperability Test Scenario IPsec" /
5.3.11 and 5.4.11 (Tunnel Mode: Fragmentation).

Commit b515d2637276a3810d6595e10ab02c13bfd0b63a ("xfrm:
xfrm_state_mtu should return at least 1280 for ipv6") attempted
to fix this but caused another regression in TCP MSS calculations
and had to be reverted.

The patch below fixes the situation by dropping the MTU
check and instead checking for the underflows described in the
749439bf commit message.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Fixes: 749439bfac6e ("ipv6: fix udpv6 sendmsg crash caused by too small MTU")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>


# a1cdec57 17-Feb-2022 Eric Dumazet <edumazet@google.com>

net-timestamp: convert sk->sk_tskey to atomic_t

UDP sendmsg() can be lockless, this is causing all kinds
of data races.

This patch converts sk->sk_tskey to remove one of these races.

BUG: KCSAN: data-race in __ip_append_data / __ip_append_data

read to 0xffff8881035d4b6c of 4 bytes by task 8877 on cpu 1:
__ip_append_data+0x1c1/0x1de0 net/ipv4/ip_output.c:994
ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
sock_sendmsg_nosec net/socket.c:705 [inline]
sock_sendmsg net/socket.c:725 [inline]
____sys_sendmsg+0x39a/0x510 net/socket.c:2413
___sys_sendmsg net/socket.c:2467 [inline]
__sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
__do_sys_sendmmsg net/socket.c:2582 [inline]
__se_sys_sendmmsg net/socket.c:2579 [inline]
__x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

write to 0xffff8881035d4b6c of 4 bytes by task 8880 on cpu 0:
__ip_append_data+0x1d8/0x1de0 net/ipv4/ip_output.c:994
ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
sock_sendmsg_nosec net/socket.c:705 [inline]
sock_sendmsg net/socket.c:725 [inline]
____sys_sendmsg+0x39a/0x510 net/socket.c:2413
___sys_sendmsg net/socket.c:2467 [inline]
__sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
__do_sys_sendmmsg net/socket.c:2582 [inline]
__se_sys_sendmmsg net/socket.c:2579 [inline]
__x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

value changed: 0x0000054d -> 0x0000054e

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 8880 Comm: syz-executor.5 Not tainted 5.17.0-rc2-syzkaller-00167-gdcb85f85fa6f-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: 09c2d251b707 ("net-timestamp: add key to disambiguate concurrent datagrams")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# aba54656 15-Nov-2021 Eric Dumazet <edumazet@google.com>

net: remove sk_route_nocaps

Instead of using a full netdev_features_t, we can use a single bit,
as sk_route_nocaps is only used to remove NETIF_F_GSO_MASK from
sk->sk_route_cap.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 19d36c5f 18-Nov-2021 Eric Dumazet <edumazet@google.com>

ipv6: fix typos in __ip6_finish_output()

We deal with IPv6 packets, so we need to use IP6CB(skb)->flags and
IP6SKB_REROUTED, instead of IPCB(skb)->flags and IPSKB_REROUTED

Found by code inspection, please double check that fixing this bug
does not surface other bugs.

Fixes: 09ee9dba9611 ("ipv6: Reinject IPv6 packets if IPsec policy matches after SNAT")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tobias Brunner <tobias@strongswan.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: David Ahern <dsahern@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Tested-by: Tobias Brunner <tobias@strongswan.org>
Acked-by: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0857d6f8c 14-Oct-2021 Stephen Suryaputra <ssuryaextr@gmail.com>

ipv6: When forwarding count rx stats on the orig netdev

Commit bdb7cc643fc9 ("ipv6: Count interface receive statistics on the
ingress netdev") does not work when ip6_forward() executes on the skbs
with vrf-enslaved netdev. Use IP6CB(skb)->iif to get to the right one.

Add a selftest script to verify.

Fixes: bdb7cc643fc9 ("ipv6: Count interface receive statistics on the ingress netdev")
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20211014130845.410602-1-ssuryaextr@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 0c9f227b 02-Aug-2021 Vasily Averin <vvs@virtuozzo.com>

ipv6: use skb_expand_head in ip6_xmit

Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.

Additionally this patch replaces commonly used dereferencing with variables.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e415ed3a 02-Aug-2021 Vasily Averin <vvs@virtuozzo.com>

ipv6: use skb_expand_head in ip6_finish_output2

Unlike skb_realloc_headroom, new helper skb_expand_head does not allocate
a new skb if possible.

Additionally this patch replaces commonly used dereferencing with variables.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 427faee1 20-Jul-2021 Vadim Fedorenko <vfedorenko@novek.ru>

net: ipv6: introduce ip6_dst_mtu_maybe_forward

Replace ip6_dst_mtu_forward with ip6_dst_mtu_maybe_forward and
reuse this code in ip6_mtu. Actually these two functions were
almost duplicates, this change will simplify the maintaince of
mtu calculation code.

Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 46c7655f 22-Jul-2021 Kangmin Park <l4stpr0gr4m@gmail.com>

ipv6: decrease hop limit counter in ip6_forward()

Decrease hop limit counter when deliver skb to ndp proxy.

Signed-off-by: Kangmin Park <l4stpr0gr4m@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2d85a1b3 19-Jul-2021 Vasily Averin <vvs@virtuozzo.com>

ipv6: ip6_finish_output2: set sk into newly allocated nskb

skb_set_owner_w() should set sk not to old skb but to new nskb.

Fixes: 5796015fa968 ("ipv6: allocate enough headroom in ip6_finish_output2()")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Link: https://lore.kernel.org/r/70c0744f-89ae-1869-7e3e-4fa292158f4b@virtuozzo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 5796015f 12-Jul-2021 Vasily Averin <vvs@virtuozzo.com>

ipv6: allocate enough headroom in ip6_finish_output2()

When TEE target mirrors traffic to another interface, sk_buff may
not have enough headroom to be processed correctly.
ip_finish_output2() detect this situation for ipv4 and allocates
new skb with enogh headroom. However ipv6 lacks this logic in
ip_finish_output2 and it leads to skb_under_panic:

skbuff: skb_under_panic: text:ffffffffc0866ad4 len:96 put:24
head:ffff97be85e31800 data:ffff97be85e317f8 tail:0x58 end:0xc0 dev:gre0
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:110!
invalid opcode: 0000 [#1] SMP PTI
CPU: 2 PID: 393 Comm: kworker/2:2 Tainted: G OE 5.13.0 #13
Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
Workqueue: ipv6_addrconf addrconf_dad_work
RIP: 0010:skb_panic+0x48/0x4a
Call Trace:
skb_push.cold.111+0x10/0x10
ipgre_header+0x24/0xf0 [ip_gre]
neigh_connected_output+0xae/0xf0
ip6_finish_output2+0x1a8/0x5a0
ip6_output+0x5c/0x110
nf_dup_ipv6+0x158/0x1000 [nf_dup_ipv6]
tee_tg6+0x2e/0x40 [xt_TEE]
ip6t_do_table+0x294/0x470 [ip6_tables]
nf_hook_slow+0x44/0xc0
nf_hook.constprop.34+0x72/0xe0
ndisc_send_skb+0x20d/0x2e0
ndisc_send_ns+0xd1/0x210
addrconf_dad_work+0x3c8/0x540
process_one_work+0x1d1/0x370
worker_thread+0x30/0x390
kthread+0x116/0x130
ret_from_fork+0x22/0x30

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ccd27f05 06-Jul-2021 Nicolas Dichtel <nicolas.dichtel@6wind.com>

ipv6: fix 'disable_policy' for fwd packets

The goal of commit df789fe75206 ("ipv6: Provide ipv6 version of
"disable_policy" sysctl") was to have the disable_policy from ipv4
available on ipv6.
However, it's not exactly the same mechanism. On IPv4, all packets coming
from an interface, which has disable_policy set, bypass the policy check.
For ipv6, this is done only for local packets, ie for packets destinated to
an address configured on the incoming interface.

Let's align ipv6 with ipv4 so that the 'disable_policy' sysctl has the same
effect for both protocols.

My first approach was to create a new kind of route cache entries, to be
able to set DST_NOPOLICY without modifying routes. This would have added a
lot of code. Because the local delivery path is already handled, I choose
to focus on the forwarding path to minimize code churn.

Fixes: df789fe75206 ("ipv6: Provide ipv6 version of "disable_policy" sysctl")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c305b9e6 23-Jun-2021 zhang kai <zhangkaiheb@126.com>

ipv6: delete useless dst check in ip6_dst_lookup_tail

parameter dst always points to null.

Signed-off-by: zhang kai <zhangkaiheb@126.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6d123b81 23-Jun-2021 Jakub Kicinski <kuba@kernel.org>

net: ip: avoid OOM kills with large UDP sends over loopback

Dave observed number of machines hitting OOM on the UDP send
path. The workload seems to be sending large UDP packets over
loopback. Since loopback has MTU of 64k kernel will try to
allocate an skb with up to 64k of head space. This has a good
chance of failing under memory pressure. What's worse if
the message length is <32k the allocation may trigger an
OOM killer.

This is entirely avoidable, we can use an skb with page frags.

af_unix solves a similar problem by limiting the head
length to SKB_MAX_ALLOC. This seems like a good and simple
approach. It means that UDP messages > 16kB will now
use fragments if underlying device supports SG, if extra
allocator pressure causes regressions in real workloads
we can switch to trying the large allocation first and
falling back.

v4: pre-calculate all the additions to alloclen so
we can be sure it won't go over order-2

Reported-by: Dave Jones <dsj@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6585d7dc 01-Feb-2021 Brian Vazquez <brianvv@google.com>

net: use indirect call helpers for dst_output

This patch avoids the indirect call for the common case:
ip6_output and ip_output

Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 8e044917 06-Jan-2021 Jonathan Lemon <jonathan.lemon@gmail.com>

skbuff: Rename skb_zcopy_{get|put} to net_zcopy_{get|put}

Unlike the rest of the skb_zcopy_ functions, these routines
operate on a 'struct ubuf', not a skb. Remove the 'skb_'
prefix from the naming to make things clearer.

Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 8c793822 06-Jan-2021 Jonathan Lemon <jonathan.lemon@gmail.com>

skbuff: rename sock_zerocopy_* to msg_zerocopy_*

At Willem's suggestion, rename the sock_zerocopy_* functions
so that they match the MSG_ZEROCOPY flag, which makes it clear
they are specific to this zerocopy implementation.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 236a6b1c 06-Jan-2021 Jonathan Lemon <jonathan.lemon@gmail.com>

skbuff: Call sock_zerocopy_put_abort from skb_zcopy_put_abort

The sock_zerocopy_put_abort function contains logic which is
specific to the current zerocopy implementation. Add a wrapper
which checks the callback and dispatches apppropriately.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# b210de4f 07-Jan-2021 Aya Levin <ayal@nvidia.com>

net: ipv6: Validate GSO SKB before finish IPv6 processing

There are cases where GSO segment's length exceeds the egress MTU:
- Forwarding of a TCP GRO skb, when DF flag is not set.
- Forwarding of an skb that arrived on a virtualisation interface
(virtio-net/vhost/tap) with TSO/GSO size set by other network
stack.
- Local GSO skb transmitted on an NETIF_F_TSO tunnel stacked over an
interface with a smaller MTU.
- Arriving GRO skb (or GSO skb in a virtualised environment) that is
bridged to a NETIF_F_TSO tunnel stacked over an interface with an
insufficient MTU.

If so:
- Consume the SKB and its segments.
- Issue an ICMP packet with 'Packet Too Big' message containing the
MTU, allowing the source host to reduce its Path MTU appropriately.

Note: These cases are handled in the same manner in IPv4 output finish.
This patch aligns the behavior of IPv6 and the one of IPv4.

Fixes: 9e50849054a4 ("netfilter: ipv6: move POSTROUTING invocation before fragmentation")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/1610027418-30438-1-git-send-email-ayal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 272928d1 12-Oct-2020 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

ipv6/icmp: l3mdev: Perform icmp error route lookup on source device routing table (v2)

As per RFC4443, the destination address field for ICMPv6 error messages
is copied from the source address field of the invoking packet.

In configurations with Virtual Routing and Forwarding tables, looking up
which routing table to use for sending ICMPv6 error messages is
currently done by using the destination net_device.

If the source and destination interfaces are within separate VRFs, or
one in the global routing table and the other in a VRF, looking up the
source address of the invoking packet in the destination interface's
routing table will fail if the destination interface's routing table
contains no route to the invoking packet's source address.

One observable effect of this issue is that traceroute6 does not work in
the following cases:

- Route leaking between global routing table and VRF
- Route leaking between VRFs

Use the source device routing table when sending ICMPv6 error
messages.

[ In the context of ipv4, it has been pointed out that a similar issue
may exist with ICMP errors triggered when forwarding between network
namespaces. It would be worthwhile to investigate whether ipv6 has
similar issues, but is outside of the scope of this investigation. ]

[ Testing shows that similar issues exist with ipv6 unreachable /
fragmentation needed messages. However, investigation of this
additional failure mode is beyond this investigation's scope. ]

Link: https://tools.ietf.org/html/rfc4443
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 634a63e7 17-Sep-2020 Randy Dunlap <rdunlap@infradead.org>

net: ipv6: delete duplicated words

Drop repeated words in net/ipv6/.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8d5930df 10-Jul-2020 Al Viro <viro@zeniv.linux.org.uk>

skb_copy_and_csum_bits(): don't bother with the last argument

it's always 0

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b51cd7c8 12-Jul-2020 Andrew Lunn <andrew@lunn.ch>

net: ipv6: kerneldoc fixes

Simple fixes which require no deep knowledge of the code.

Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 571912c6 23-Feb-2020 Martin Varghese <martin.varghese@nokia.com>

net: UDP tunnel encapsulation module for tunnelling different protocols like MPLS, IP, NSH etc.

The Bareudp tunnel module provides a generic L3 encapsulation
tunnelling module for tunnelling different protocols like MPLS,
IP,NSH etc inside a UDP tunnel.

Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c4e85f73 04-Dec-2019 Sabrina Dubroca <sd@queasysnail.net>

net: ipv6: add net argument to ip6_dst_lookup_flow

This will be used in the conversion of ipv6_stub to ip6_dst_lookup_flow,
as some modules currently pass a net argument without a socket to
ip6_dst_lookup. This is equivalent to commit 343d60aada5a ("ipv6: change
ipv6_stub_impl.ipv6_dst_lookup to take net argument").

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 28f8bfd1 12-Nov-2019 Phil Sutter <phil@nwl.cc>

netfilter: Support iif matches in POSTROUTING

Instead of generally passing NULL to NF_HOOK_COND() for input device,
pass skb->dev which contains input device for routed skbs.

Note that iptables (both legacy and nft) reject rules with input
interface match from being added to POSTROUTING chains, but nftables
allows this.

Cc: Eric Garver <eric@garver.life>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 9669fffc 16-Oct-2019 Eric Dumazet <edumazet@google.com>

net: ensure correct skb->tstamp in various fragmenters

Thomas found that some forwarded packets would be stuck
in FQ packet scheduler because their skb->tstamp contained
timestamps far in the future.

We thought we addressed this point in commit 8203e2d844d3
("net: clear skb->tstamp in forwarding paths") but there
is still an issue when/if a packet needs to be fragmented.

In order to meet EDT requirements, we have to make sure all
fragments get the original skb->tstamp.

Note that this original skb->tstamp should be zero in
forwarding path, but might have a non zero value in
output path if user decided so.

Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Thomas Bartschies <Thomas.Bartschies@cvk.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4f6570d7 24-Sep-2019 Eric Dumazet <edumazet@google.com>

ipv6: add priority parameter to ip6_xmit()

Currently, ip6_xmit() sets skb->priority based on sk->sk_priority

This is not desirable for TCP since TCP shares the same ctl socket
for a given netns. We want to be able to send RST or ACK packets
with a non zero skb->priority.

This patch has no functional change.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c6af0c22 11-Sep-2019 Willem de Bruijn <willemb@google.com>

ip: support SO_MARK cmsg

Enable setting skb->mark for UDP and RAW sockets using cmsg.

This is analogous to existing support for TOS, TTL, txtime, etc.

Packet sockets already support this as of commit c7d39e32632e
("packet: support per-packet fwmark for af_packet sendmsg").

Similar to other fields, implement by
1. initialize the sockcm_cookie.mark from socket option sk_mark
2. optionally overwrite this in ip_cmsg_send/ip6_datagram_send_ctl
3. initialize inet_cork.mark from sockcm_cookie.mark
4. initialize each (usually just one) skb->mark from inet_cork.mark

Step 1 is handled in one location for most protocols by ipcm_init_sk
as of commit 351782067b6b ("ipv4: ipcm_cookie initializers").

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9b1c1ef1 24-Jun-2019 Nicolas Dichtel <nicolas.dichtel@6wind.com>

ipv6: constify rt6_nexthop()

There is no functional change in this patch, it only prepares the next one.

rt6_nexthop() will be used by ip6_dst_lookup_neigh(), which uses const
variables.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 522924b5 07-Jun-2019 Willem de Bruijn <willemb@google.com>

net: correct udp zerocopy refcnt also when zerocopy only on append

The below patch fixes an incorrect zerocopy refcnt increment when
appending with MSG_MORE to an existing zerocopy udp skb.

send(.., MSG_ZEROCOPY | MSG_MORE); // refcnt 1
send(.., MSG_ZEROCOPY | MSG_MORE); // refcnt still 1 (bar frags)

But it missed that zerocopy need not be passed at the first send. The
right test whether the uarg is newly allocated and thus has extra
refcnt 1 is not !skb, but !skb_zcopy.

send(.., MSG_MORE); // <no uarg>
send(.., MSG_ZEROCOPY); // refcnt 1

Fixes: 100f6d8e09905 ("net: correct zerocopy refcnt with udp MSG_MORE")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b7034146 02-Jun-2019 Eric Dumazet <edumazet@google.com>

net: fix use-after-free in kfree_skb_list

syzbot reported nasty use-after-free [1]

Lets remove frag_list field from structs ip_fraglist_iter
and ip6_fraglist_iter. This seens not needed anyway.

[1] :
BUG: KASAN: use-after-free in kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
Read of size 8 at addr ffff888085a3cbc0 by task syz-executor303/8947

CPU: 0 PID: 8947 Comm: syz-executor303 Not tainted 5.2.0-rc2+ #12
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 lib/dump_stack.c:113
print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
__kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
kasan_report+0x12/0x20 mm/kasan/common.c:614
__asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
ip6_fragment+0x1ef4/0x2680 net/ipv6/ip6_output.c:882
__ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
NF_HOOK_COND include/linux/netfilter.h:294 [inline]
ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
dst_output include/net/dst.h:433 [inline]
ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:671
___sys_sendmsg+0x803/0x920 net/socket.c:2292
__sys_sendmsg+0x105/0x1d0 net/socket.c:2330
__do_sys_sendmsg net/socket.c:2339 [inline]
__se_sys_sendmsg net/socket.c:2337 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x44add9
Code: e8 7c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b 05 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f826f33bce8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000006e7a18 RCX: 000000000044add9
RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005
RBP: 00000000006e7a10 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e7a1c
R13: 00007ffcec4f7ebf R14: 00007f826f33c9c0 R15: 20c49ba5e353f7cf

Allocated by task 8947:
save_stack+0x23/0x90 mm/kasan/common.c:71
set_track mm/kasan/common.c:79 [inline]
__kasan_kmalloc mm/kasan/common.c:489 [inline]
__kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
slab_post_alloc_hook mm/slab.h:437 [inline]
slab_alloc_node mm/slab.c:3269 [inline]
kmem_cache_alloc_node+0x131/0x710 mm/slab.c:3579
__alloc_skb+0xd5/0x5e0 net/core/skbuff.c:199
alloc_skb include/linux/skbuff.h:1058 [inline]
__ip6_append_data.isra.0+0x2a24/0x3640 net/ipv6/ip6_output.c:1519
ip6_append_data+0x1e5/0x320 net/ipv6/ip6_output.c:1688
rawv6_sendmsg+0x1467/0x35e0 net/ipv6/raw.c:940
inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:671
___sys_sendmsg+0x803/0x920 net/socket.c:2292
__sys_sendmsg+0x105/0x1d0 net/socket.c:2330
__do_sys_sendmsg net/socket.c:2339 [inline]
__se_sys_sendmsg net/socket.c:2337 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 8947:
save_stack+0x23/0x90 mm/kasan/common.c:71
set_track mm/kasan/common.c:79 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
__cache_free mm/slab.c:3432 [inline]
kmem_cache_free+0x86/0x260 mm/slab.c:3698
kfree_skbmem net/core/skbuff.c:625 [inline]
kfree_skbmem+0xc5/0x150 net/core/skbuff.c:619
__kfree_skb net/core/skbuff.c:682 [inline]
kfree_skb net/core/skbuff.c:699 [inline]
kfree_skb+0xf0/0x390 net/core/skbuff.c:693
kfree_skb_list+0x44/0x60 net/core/skbuff.c:708
__dev_xmit_skb net/core/dev.c:3551 [inline]
__dev_queue_xmit+0x3034/0x36b0 net/core/dev.c:3850
dev_queue_xmit+0x18/0x20 net/core/dev.c:3914
neigh_direct_output+0x16/0x20 net/core/neighbour.c:1532
neigh_output include/net/neighbour.h:511 [inline]
ip6_finish_output2+0x1034/0x2550 net/ipv6/ip6_output.c:120
ip6_fragment+0x1ebb/0x2680 net/ipv6/ip6_output.c:863
__ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
NF_HOOK_COND include/linux/netfilter.h:294 [inline]
ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
dst_output include/net/dst.h:433 [inline]
ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:671
___sys_sendmsg+0x803/0x920 net/socket.c:2292
__sys_sendmsg+0x105/0x1d0 net/socket.c:2330
__do_sys_sendmsg net/socket.c:2339 [inline]
__se_sys_sendmsg net/socket.c:2337 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff888085a3cbc0
which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 0 bytes inside of
224-byte region [ffff888085a3cbc0, ffff888085a3cca0)
The buggy address belongs to the page:
page:ffffea0002168f00 refcount:1 mapcount:0 mapping:ffff88821b6f63c0 index:0x0
flags: 0x1fffc0000000200(slab)
raw: 01fffc0000000200 ffffea00027bbf88 ffffea0002105b88 ffff88821b6f63c0
raw: 0000000000000000 ffff888085a3c080 000000010000000c 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff888085a3ca80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff888085a3cb00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
>ffff888085a3cb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
^
ffff888085a3cc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888085a3cc80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc

Fixes: 0feca6190f88 ("net: ipv6: add skbuff fraglist splitter")
Fixes: c8b17be0b7a4 ("net: ipv4: add skbuff fraglist splitter")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 956fe219 28-May-2019 brakmo <brakmo@fb.com>

bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls

Update BPF_CGROUP_RUN_PROG_INET_EGRESS() callers to support returning
congestion notifications from the BPF programs.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 100f6d8e 30-May-2019 Willem de Bruijn <willemb@google.com>

net: correct zerocopy refcnt with udp MSG_MORE

TCP zerocopy takes a uarg reference for every skb, plus one for the
tcp_sendmsg_locked datapath temporarily, to avoid reaching refcnt zero
as it builds, sends and frees skbs inside its inner loop.

UDP and RAW zerocopy do not send inside the inner loop so do not need
the extra sock_zerocopy_get + sock_zerocopy_put pair. Commit
52900d22288ed ("udp: elide zerocopy operation in hot path") introduced
extra_uref to pass the initial reference taken in sock_zerocopy_alloc
to the first generated skb.

But, sock_zerocopy_realloc takes this extra reference at the start of
every call. With MSG_MORE, no new skb may be generated to attach the
extra_uref to, so refcnt is incorrectly 2 with only one skb.

Do not take the extra ref if uarg && !tcp, which implies MSG_MORE.
Update extra_uref accordingly.

This conditional assignment triggers a false positive may be used
uninitialized warning, so have to initialize extra_uref at define.

Changes v1->v2: fix typo in Fixes SHA1

Fixes: 52900d22288e7 ("udp: elide zerocopy operation in hot path")
Reported-by: syzbot <syzkaller@googlegroups.com>
Diagnosed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8a6a1f17 29-May-2019 Pablo Neira Ayuso <pablo@netfilter.org>

net: ipv6: split skbuff into fragments transformer

This patch exposes a new API to refragment a skbuff. This allows you to
split either a linear skbuff or to force the refragmentation of an
existing fraglist using a different mtu. The API consists of:

* ip6_frag_init(), that initializes the internal state of the transformer.
* ip6_frag_next(), that allows you to fetch the next fragment. This function
internally allocates the skbuff that represents the fragment, it pushes
the IPv6 header, and it also copies the payload for each fragment.

The ip6_frag_state object stores the internal state of the splitter.

This code has been extracted from ip6_fragment(). Symbols are also
exported to allow to reuse this iterator from the bridge codepath to
build its own refragmentation routine by reusing the existing codebase.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0feca619 29-May-2019 Pablo Neira Ayuso <pablo@netfilter.org>

net: ipv6: add skbuff fraglist splitter

This patch adds the skbuff fraglist split iterator. This API provides an
iterator to transform the fraglist into single skbuff objects, it
consists of:

* ip6_fraglist_init(), that initializes the internal state of the
fraglist iterator.
* ip6_fraglist_prepare(), that restores the IPv6 header on the fragment.
* ip6_fraglist_next(), that retrieves the fragment from the fraglist and
updates the internal state of the iterator to point to the next
fragment in the fraglist.

The ip6_fraglist_iter object stores the internal state of the iterator.

This code has been extracted from ip6_fragment(). Symbols are also
exported to allow to reuse this iterator from the bridge codepath to
build its own refragmentation routine by reusing the existing codebase.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2874c5fd 27-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 0353f282 05-Apr-2019 David Ahern <dsahern@gmail.com>

neighbor: Add skip_cache argument to neigh_output

A later patch allows an IPv6 gateway with an IPv4 route. The neighbor
entry will exist in the v6 ndisc table and the cached header will contain
the ipv6 protocol which is wrong for an IPv4 packet. For an IPv4 packet to
use the v6 neighbor entry, neigh_output needs to skip the cached header
and just use the output callback for the neigh entry.

A future patchset can look at expanding the hh_cache to handle 2
protocols. For now, IPv6 gateways with an IPv4 route will take the
extra overhead of generating the header.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ef0efcd3 02-Apr-2019 Junwei Hu <hujunwei4@huawei.com>

ipv6: Fix dangling pointer when ipv6 fragment

At the beginning of ip6_fragment func, the prevhdr pointer is
obtained in the ip6_find_1stfragopt func.
However, all the pointers pointing into skb header may change
when calling skb_checksum_help func with
skb->ip_summed = CHECKSUM_PARTIAL condition.
The prevhdr pointe will be dangling if it is not reloaded after
calling __skb_linearize func in skb_checksum_help func.

Here, I add a variable, nexthdr_offset, to evaluate the offset,
which does not changes even after calling __skb_linearize func.

Fixes: 405c92f7a541 ("ipv6: add defensive check for CHECKSUM_PARTIAL skbs in ip_fragment")
Signed-off-by: Junwei Hu <hujunwei4@huawei.com>
Reported-by: Wenhao Zhang <zhangwenhao8@huawei.com>
Reported-by: syzbot+e8ce541d095e486074fc@syzkaller.appspotmail.com
Reviewed-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9036b2fe 01-Mar-2019 Francesco Ruggeri <fruggeri@arista.com>

net: ipv6: add socket option IPV6_ROUTER_ALERT_ISOLATE

By default IPv6 socket with IPV6_ROUTER_ALERT socket option set will
receive all IPv6 RA packets from all namespaces.
IPV6_ROUTER_ALERT_ISOLATE socket option restricts packets received by
the socket to be only from the socket's namespace.

Signed-off-by: Maxim Martynov <maxim@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# df5042f4 18-Dec-2018 Florian Westphal <fw@strlen.de>

sk_buff: add skb extension infrastructure

This adds an optional extension infrastructure, with ispec (xfrm) and
bridge netfilter as first users.
objdiff shows no changes if kernel is built without xfrm and br_netfilter
support.

The third (planned future) user is Multipath TCP which is still
out-of-tree.
MPTCP needs to map logical mptcp sequence numbers to the tcp sequence
numbers used by individual subflows.

This DSS mapping is read/written from tcp option space on receive and
written to tcp option space on transmitted tcp packets that are part of
and MPTCP connection.

Extending skb_shared_info or adding a private data field to skb fclones
doesn't work for incoming skb, so a different DSS propagation method would
be required for the receive side.

mptcp has same requirements as secpath/bridge netfilter:

1. extension memory is released when the sk_buff is free'd.
2. data is shared after cloning an skb (clone inherits extension)
3. adding extension to an skb will COW the extension buffer if needed.

The "MPTCP upstreaming" effort adds SKB_EXT_MPTCP extension to store the
mapping for tx and rx processing.

Two new members are added to sk_buff:
1. 'active_extensions' byte (filling a hole), telling which extensions
are available for this skb.
This has two purposes.
a) avoids the need to initialize the pointer.
b) allows to "delete" an extension by clearing its bit
value in ->active_extensions.

While it would be possible to store the active_extensions byte
in the extension struct instead of sk_buff, there is one problem
with this:
When an extension has to be disabled, we can always clear the
bit in skb->active_extensions. But in case it would be stored in the
extension buffer itself, we might have to COW it first, if
we are dealing with a cloned skb. On kmalloc failure we would
be unable to turn an extension off.

2. extension pointer, located at the end of the sk_buff.
If the active_extensions byte is 0, the pointer is undefined,
it is not initialized on skb allocation.

This adds extra code to skb clone and free paths (to deal with
refcount/free of extension area) but this replaces similar code that
manages skb->nf_bridge and skb->sp structs in the followup patches of
the series.

It is possible to add support for extensions that are not preseved on
clones/copies.

To do this, it would be needed to define a bitmask of all extensions that
need copy/cow semantics, and change __skb_ext_copy() to check
->active_extensions & SKB_EXT_PRESERVE_ON_CLONE, then just set
->active_extensions to 0 on the new clone.

This isn't done here because all extensions that get added here
need the copy/cow semantics.

v2:
Allocate entire extension space using kmem_cache.
Upside is that this allows better tracking of used memory,
downside is that we will allocate more space than strictly needed in
most cases (its unlikely that all extensions are active/needed at same
time for same skb).
The allocated memory (except the small extension header) is not cleared,
so no additonal overhead aside from memory usage.

Avoid atomic_dec_and_test operation on skb_ext_put()
by using similar trick as kfree_skbmem() does with fclone_ref:
If recount is 1, there is no concurrent user and we can free right away.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8203e2d8 14-Dec-2018 Eric Dumazet <edumazet@google.com>

net: clear skb->tstamp in forwarding paths

Sergey reported that forwarding was no longer working
if fq packet scheduler was used.

This is caused by the recent switch to EDT model, since incoming
packets might have been timestamped by __net_timestamp()

__net_timestamp() uses ktime_get_real(), while fq expects packets
using CLOCK_MONOTONIC base.

The fix is to clear skb->tstamp in forwarding paths.

Fixes: 80b14dee2bea ("net: Add a new socket option for a future transmit time.")
Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Sergey Matyukevich <geomatsi@gmail.com>
Tested-by: Sergey Matyukevich <geomatsi@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 97ef7b4c 08-Dec-2018 Willem de Bruijn <willemb@google.com>

ip: silence udp zerocopy smatch false positive

extra_uref is used in __ip(6)_append_data only if uarg is set.

Smatch sees that the variable is passed to sock_zerocopy_put_abort.
This function accesses it only when uarg is set, but smatch cannot
infer this.

Make this dependency explicit.

Fixes: 52900d22288e ("udp: elide zerocopy operation in hot path")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 66033f47 06-Dec-2018 Stefano Brivio <sbrivio@redhat.com>

ipv6: Check available headroom in ip6_xmit() even without options

Even if we send an IPv6 packet without options, MAX_HEADER might not be
enough to account for the additional headroom required by alignment of
hardware headers.

On a configuration without HYPERV_NET, WLAN, AX25, and with IPV6_TUNNEL,
sending short SCTP packets over IPv4 over L2TP over IPv6, we start with
100 bytes of allocated headroom in sctp_packet_transmit(), end up with 54
bytes after l2tp_xmit_skb(), and 14 bytes in ip6_finish_output2().

Those would be enough to append our 14 bytes header, but we're going to
align that to 16 bytes, and write 2 bytes out of the allocated slab in
neigh_hh_output().

KASan says:

[ 264.967848] ==================================================================
[ 264.967861] BUG: KASAN: slab-out-of-bounds in ip6_finish_output2+0x1aec/0x1c70
[ 264.967866] Write of size 16 at addr 000000006af1c7fe by task netperf/6201
[ 264.967870]
[ 264.967876] CPU: 0 PID: 6201 Comm: netperf Not tainted 4.20.0-rc4+ #1
[ 264.967881] Hardware name: IBM 2827 H43 400 (z/VM 6.4.0)
[ 264.967887] Call Trace:
[ 264.967896] ([<00000000001347d6>] show_stack+0x56/0xa0)
[ 264.967903] [<00000000017e379c>] dump_stack+0x23c/0x290
[ 264.967912] [<00000000007bc594>] print_address_description+0xf4/0x290
[ 264.967919] [<00000000007bc8fc>] kasan_report+0x13c/0x240
[ 264.967927] [<000000000162f5e4>] ip6_finish_output2+0x1aec/0x1c70
[ 264.967935] [<000000000163f890>] ip6_finish_output+0x430/0x7f0
[ 264.967943] [<000000000163fe44>] ip6_output+0x1f4/0x580
[ 264.967953] [<000000000163882a>] ip6_xmit+0xfea/0x1ce8
[ 264.967963] [<00000000017396e2>] inet6_csk_xmit+0x282/0x3f8
[ 264.968033] [<000003ff805fb0ba>] l2tp_xmit_skb+0xe02/0x13e0 [l2tp_core]
[ 264.968037] [<000003ff80631192>] l2tp_eth_dev_xmit+0xda/0x150 [l2tp_eth]
[ 264.968041] [<0000000001220020>] dev_hard_start_xmit+0x268/0x928
[ 264.968069] [<0000000001330e8e>] sch_direct_xmit+0x7ae/0x1350
[ 264.968071] [<000000000122359c>] __dev_queue_xmit+0x2b7c/0x3478
[ 264.968075] [<00000000013d2862>] ip_finish_output2+0xce2/0x11a0
[ 264.968078] [<00000000013d9b14>] ip_finish_output+0x56c/0x8c8
[ 264.968081] [<00000000013ddd1e>] ip_output+0x226/0x4c0
[ 264.968083] [<00000000013dbd6c>] __ip_queue_xmit+0x894/0x1938
[ 264.968100] [<000003ff80bc3a5c>] sctp_packet_transmit+0x29d4/0x3648 [sctp]
[ 264.968116] [<000003ff80b7bf68>] sctp_outq_flush_ctrl.constprop.5+0x8d0/0xe50 [sctp]
[ 264.968131] [<000003ff80b7c716>] sctp_outq_flush+0x22e/0x7d8 [sctp]
[ 264.968146] [<000003ff80b35c68>] sctp_cmd_interpreter.isra.16+0x530/0x6800 [sctp]
[ 264.968161] [<000003ff80b3410a>] sctp_do_sm+0x222/0x648 [sctp]
[ 264.968177] [<000003ff80bbddac>] sctp_primitive_ASSOCIATE+0xbc/0xf8 [sctp]
[ 264.968192] [<000003ff80b93328>] __sctp_connect+0x830/0xc20 [sctp]
[ 264.968208] [<000003ff80bb11ce>] sctp_inet_connect+0x2e6/0x378 [sctp]
[ 264.968212] [<0000000001197942>] __sys_connect+0x21a/0x450
[ 264.968215] [<000000000119aff8>] sys_socketcall+0x3d0/0xb08
[ 264.968218] [<000000000184ea7a>] system_call+0x2a2/0x2c0

[...]

Just like ip_finish_output2() does for IPv4, check that we have enough
headroom in ip6_xmit(), and reallocate it if we don't.

This issue is older than git history.

Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f839a6c9 04-Dec-2018 Ido Schimmel <idosch@mellanox.com>

net: Do not route unicast IP packets twice

Packets marked with 'offload_l3_fwd_mark' were already forwarded by a
capable device and should not be forwarded again by the kernel.
Therefore, have the kernel consume them.

The check is performed in ip{,6}_forward_finish() in order to allow the
kernel to process such packets in ip{,6}_forward() and generate required
exceptions. For example, ICMP redirects.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 52900d22 30-Nov-2018 Willem de Bruijn <willemb@google.com>

udp: elide zerocopy operation in hot path

With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info.
Release of its last reference triggers a completion notification.

The TCP stack in tcp_sendmsg_locked holds an extra ref independent of
the skbs, because it can build, send and free skbs within its loop,
possibly reaching refcount zero and freeing the ubuf_info too soon.

The UDP stack currently also takes this extra ref, but does not need
it as all skbs are sent after return from __ip(6)_append_data.

Avoid the extra refcount_inc and refcount_dec_and_test, and generally
the sock_zerocopy_put in the common path, by passing the initial
reference to the first skb.

This approach is taken instead of initializing the refcount to 0, as
that would generate error "refcount_t: increment on 0" on the
next skb_zcopy_set.

Changes
v3 -> v4
- Move skb_zcopy_set below the only kfree_skb that might cause
a premature uarg destroy before skb_zerocopy_put_abort
- Move the entire skb_shinfo assignment block, to keep that
cacheline access in one place

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b5947e5d 30-Nov-2018 Willem de Bruijn <willemb@google.com>

udp: msg_zerocopy

Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and
interpret flag MSG_ZEROCOPY.

This patch was previously part of the zerocopy RFC patchsets. Zerocopy
is not effective at small MTU. With segmentation offload building
larger datagrams, the benefit of page flipping outweights the cost of
generating a completion notification.

tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on
test patch and making skb_orphan_frags_rx same as skb_orphan_frags:

ipv4 udp -t 1
tx=191312 (11938 MB) txc=0 zc=n
rx=191312 (11938 MB)
ipv4 udp -z -t 1
tx=304507 (19002 MB) txc=304507 zc=y
rx=304507 (19002 MB)
ok
ipv6 udp -t 1
tx=174485 (10888 MB) txc=0 zc=n
rx=174485 (10888 MB)
ipv6 udp -z -t 1
tx=294801 (18396 MB) txc=294801 zc=y
rx=294801 (18396 MB)
ok

Changes
v1 -> v2
- Fixup reverse christmas tree violation
v2 -> v3
- Split refcount avoidance optimization into separate patch
- Fix refcount leak on error in fragmented case
(thanks to Paolo Abeni for pointing this one out!)
- Fix refcount inc on zero
- Test sock_flag SOCK_ZEROCOPY directly in __ip_append_data.
This is needed since commit 5cf4a8532c99 ("tcp: really ignore
MSG_ZEROCOPY if no SO_ZEROCOPY") did the same for tcp.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# aba36930 24-Nov-2018 Willem de Bruijn <willemb@google.com>

net: always initialize pagedlen

In ip packet generation, pagedlen is initialized for each skb at the
start of the loop in __ip(6)_append_data, before label alloc_new_skb.

Depending on compiler options, code can be generated that jumps to
this label, triggering use of an an uninitialized variable.

In practice, at -O2, the generated code moves the initialization below
the label. But the code should not rely on that for correctness.

Fixes: 15e36f5b8e98 ("udp: paged allocation with gso")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bbd6528d 14-Sep-2018 Eric Dumazet <edumazet@google.com>

ipv6: fix possible use-after-free in ip6_xmit()

In the unlikely case ip6_xmit() has to call skb_realloc_headroom(),
we need to call skb_set_owner_w() before consuming original skb,
otherwise we risk a use-after-free.

Bring IPv6 in line with what we do in IPv4 to fix this.

Fixes: 1da177e4c3f41 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a8305bff 29-Jul-2018 David S. Miller <davem@davemloft.net>

net: Add and use skb_mark_not_on_list().

An SKB is not on a list if skb->next is NULL.

Codify this convention into a helper function and use it
where we are dequeueing an SKB and need to mark it as such.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 3dd1c9a1 23-Jul-2018 Paolo Abeni <pabeni@redhat.com>

ip: hash fragments consistently

The skb hash for locally generated ip[v6] fragments belonging
to the same datagram can vary in several circumstances:
* for connected UDP[v6] sockets, the first fragment get its hash
via set_owner_w()/skb_set_hash_from_sk()
* for unconnected IPv6 UDPv6 sockets, the first fragment can get
its hash via ip6_make_flowlabel()/skb_get_hash_flowi6(), if
auto_flowlabel is enabled

For the following frags the hash is usually computed via
skb_get_hash().
The above can cause OoO for unconnected IPv6 UDPv6 socket: in that
scenario the egress tx queue can be selected on a per packet basis
via the skb hash.
It may also fool flow-oriented schedulers to place fragments belonging
to the same datagram in different flows.

Fix the issue by copying the skb hash from the head frag into
the others at fragmentation time.

Before this commit:
perf probe -a "dev_queue_xmit skb skb->hash skb->l4_hash:b1@0/8 skb->sw_hash:b1@1/8"
netperf -H $IPV4 -t UDP_STREAM -l 5 -- -m 2000 -n &
perf record -e probe:dev_queue_xmit -e probe:skb_set_owner_w -a sleep 0.1
perf script
probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=3713014309 l4_hash=1 sw_hash=0
probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=0 l4_hash=0 sw_hash=0

After this commit:
probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=2171763177 l4_hash=1 sw_hash=0
probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=2171763177 l4_hash=1 sw_hash=0

Fixes: b73c3d0e4f0e ("net: Save TX flow hash in sock and set in skbuf on xmit")
Fixes: 67800f9b1f4e ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fbf47813 06-Jul-2018 Willem de Bruijn <willemb@google.com>

ip: unconditionally set cork gso_size

Now that ipc(6)->gso_size is correctly initialized in all callers of
ip(6)_setup_cork, it is safe to unconditionally pass it to the cork.

Link: http://lkml.kernel.org/r/20180619164752.143249-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 678ca42d 06-Jul-2018 Willem de Bruijn <willemb@google.com>

ip: remove tx_flags from ipcm_cookie and use same logic for v4 and v6

skb_shinfo(skb)->tx_flags is derived from sk->sk_tsflags, possibly
after modification by __sock_cmsg_send, by calling sock_tx_timestamp.

The IPv4 and IPv6 paths do this conversion differently. In IPv4, the
individual protocols that support tx timestamps call this function
and store the result in ipc.tx_flags. In IPv6, sock_tx_timestamp is
called in __ip6_append_data.

There is no need to store both tx_flags and ts_flags in the cookie
as one is derived from the other. Convert when setting up the cork
and remove the redundant field. This is similar to IPv6, only have
the conversion happen only once per datagram, in ip(6)_setup_cork.

Also change __ip6_append_data to match __ip_append_data. Only update
tskey if timestamping is enabled with OPT_ID. The SOCK_.. test is
redundant: only valid protocols can have non-zero cork->tx_flags.

After this change the IPv4 and IPv6 logic is the same.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5fdaa88d 06-Jul-2018 Willem de Bruijn <willemb@google.com>

ipv6: fold sockcm_cookie into ipcm6_cookie

ipcm_cookie includes sockcm_cookie. Do the same for ipcm6_cookie.

This reduces the number of arguments that need to be passed around,
applies ipcm6_init to all cookie fields at once and reduces code
differentiation between ipv4 and ipv6.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a818f75e 03-Jul-2018 Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>

net: ipv6: Hook into time based transmission

Add a struct sockcm_cookie parameter to ip6_setup_cork() so
we can easily re-use the transmit_time field from struct inet_cork
for most paths, by copying the timestamp from the CMSG cookie.
This is later copied into the skb during __ip6_make_skb().

For the raw fast path, also pass the sockcm_cookie as a parameter
so we can just perform the copy at rawv6_send_hdrinc() directly.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9887cba1 18-Jun-2018 Willem de Bruijn <willemb@google.com>

ip: limit use of gso_size to udp

The ipcm(6)_cookie field gso_size is set only in the udp path. The ip
layer copies this to cork only if sk_type is SOCK_DGRAM. This check
proved too permissive. Ping and l2tp sockets have the same type.

Limit to sockets of type SOCK_DGRAM and protocol IPPROTO_UDP to
exclude ping sockets.

v1 -> v2
- remove irrelevant whitespace changes

Fixes: bec1f6f69736 ("udp: generate gso with UDP_SEGMENT")
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2f17becf 31-May-2018 Stephen Suryaputra <ssuryaextr@gmail.com>

vrf: check the original netdevice for generating redirect

Use the right device to determine if redirect should be sent especially
when using vrf. Same as well as when sending the redirect.

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 113f99c3 17-May-2018 Willem de Bruijn <willemb@google.com>

net: test tailroom before appending to linear skb

Device features may change during transmission. In particular with
corking, a device may toggle scatter-gather in between allocating
and writing to an skb.

Do not unconditionally assume that !NETIF_F_SG at write time implies
that the same held at alloc time and thus the skb has sufficient
tailroom.

This issue predates git history.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 15e36f5b 26-Apr-2018 Willem de Bruijn <willemb@google.com>

udp: paged allocation with gso

When sending large datagrams that are later segmented, store data in
page frags to avoid copying from linear in skb_segment.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bec1f6f6 26-Apr-2018 Willem de Bruijn <willemb@google.com>

udp: generate gso with UDP_SEGMENT

Support generic segmentation offload for udp datagrams. Callers can
concatenate and send at once the payload of multiple datagrams with
the same destination.

To set segment size, the caller sets socket option UDP_SEGMENT to the
length of each discrete payload. This value must be smaller than or
equal to the relevant MTU.

A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a
per send call basis.

Total byte length may then exceed MTU. If not an exact multiple of
segment size, the last segment will be shorter.

The implementation adds a gso_size field to the udp socket, ip(v6)
cmsg cookie and inet_cork structure to be able to set the value at
setsockopt or cmsg time and to work with both lockless and corked
paths.

Initial benchmark numbers show UDP GSO about as expensive as TCP GSO.

tcp tso
3197 MB/s 54232 msg/s 54232 calls/s
6,457,754,262 cycles

tcp gso
1765 MB/s 29939 msg/s 29939 calls/s
11,203,021,806 cycles

tcp without tso/gso *
739 MB/s 12548 msg/s 12548 calls/s
11,205,483,630 cycles

udp
876 MB/s 14873 msg/s 624666 calls/s
11,205,777,429 cycles

udp gso
2139 MB/s 36282 msg/s 36282 calls/s
11,204,374,561 cycles

[*] after reverting commit 0a6b2a1dc2a2
("tcp: switch to GSO being always on")

Measured total system cycles ('-a') for one core while pinning both
the network receive path and benchmark process to that core:

perf stat -a -C 12 -e cycles \
./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4

Note the reduction in calls/s with GSO. Bytes per syscall drops
increases from 1470 to 61818.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1cd7884d 26-Apr-2018 Willem de Bruijn <willemb@google.com>

udp: expose inet cork to udp

UDP segmentation offload needs access to inet_cork in the udp layer.
Pass the struct to ip(6)_make_skb instead of allocating it on the
stack in that function itself.

This patch is a noop otherwise.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a68886a6 20-Apr-2018 David Ahern <dsahern@gmail.com>

net/ipv6: Make from in rt6_info rcu protected

When a dst entry is created from a fib entry, the 'from' in rt6_info
is set to the fib entry. The 'from' reference is used most notably for
cookie checking - making sure stale dst entries are updated if the
fib entry is changed.

When a fib entry is deleted, the pcpu routes on it are walked releasing
the fib6_info reference. This is needed for the fib6_info cleanup to
happen and to make sure all device references are released in a timely
manner.

There is a race window when a FIB entry is deleted and the 'from' on the
pcpu route is dropped and the pcpu route hits a cookie check. Handle
this race using rcu on from.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 07cb9623 26-Feb-2018 Felix Fietkau <nbd@nbd.name>

ipv6: make ip6_dst_mtu_forward inline

Just like ip_dst_mtu_maybe_forward(), to avoid a dependency with ipv6.ko.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 93531c67 17-Apr-2018 David Ahern <dsahern@gmail.com>

net/ipv6: separate handling of FIB entries from dst based routes

Last step before flipping the data type for FIB entries:
- use fib6_info_alloc to create FIB entries in ip6_route_info_create
and addrconf_dst_alloc
- use fib6_info_release in place of dst_release, ip6_rt_put and
rt6_release
- remove the dst_hold before calling __ip6_ins_rt or ip6_del_rt
- when purging routes, drop per-cpu routes
- replace inc and dec of rt6i_ref with fib6_info_hold and fib6_info_release
- use rt->from since it points to the FIB entry
- drop references to exception bucket, fib6_metrics and per-cpu from
dst entries (those are relevant for fib entries only)

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bdb7cc64 16-Apr-2018 Stephen Suryaputra <ssuryaextr@gmail.com>

ipv6: Count interface receive statistics on the ingress netdev

The statistics such as InHdrErrors should be counted on the ingress
netdev rather than on the dev from the dst, which is the egress.

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 71a1c915 05-Apr-2018 Jeff Barnhill <0xeffeff@gmail.com>

net/ipv6: Increment OUTxxx counters after netfilter hook

At the end of ip6_forward(), IPSTATS_MIB_OUTFORWDATAGRAMS and
IPSTATS_MIB_OUTOCTETS are incremented immediately before the NF_HOOK call
for NFPROTO_IPV6 / NF_INET_FORWARD. As a result, these counters get
incremented regardless of whether or not the netfilter hook allows the
packet to continue being processed. This change increments the counters
in ip6_forward_finish() so that it will not happen if the netfilter hook
chooses to terminate the packet, which is similar to how IPv4 works.

Signed-off-by: Jeff Barnhill <0xeffeff@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9e8445a5 04-Apr-2018 Paolo Abeni <pabeni@redhat.com>

net: avoid unneeded atomic operation in ip*_append_data()

After commit 694aba690de0 ("ipv4: factorize sk_wmem_alloc updates
done by __ip_append_data()") and commit 1f4c6eb24029 ("ipv6:
factorize sk_wmem_alloc updates done by __ip6_append_data()"),
when transmitting sub MTU datagram, an addtional, unneeded atomic
operation is performed in ip*_append_data() to update wmem_alloc:
in the above condition the delta is 0.

The above cause small but measurable performance regression in UDP
xmit tput test with packet size below MTU.

This change avoids such overhead updating wmem_alloc only if
wmem_alloc_delta is non zero.

The error path is left intentionally unmodified: it's a slow path
and simplicity is preferred to performances.

Fixes: 694aba690de0 ("ipv4: factorize sk_wmem_alloc updates done by __ip_append_data()")
Fixes: 1f4c6eb24029 ("ipv6: factorize sk_wmem_alloc updates done by __ip6_append_data()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 96818159 03-Apr-2018 Alexey Kodanev <alexey.kodanev@oracle.com>

ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()

Add 'connected' parameter to ip6_sk_dst_lookup_flow() and update
the cache only if ip6_sk_dst_check() returns NULL and a socket
is connected.

The function is used as before, the new behavior for UDP sockets
in udpv6_sendmsg() will be enabled in the next patch.

Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1f4c6eb2 31-Mar-2018 Eric Dumazet <edumazet@google.com>

ipv6: factorize sk_wmem_alloc updates done by __ip6_append_data()

While testing my inet defrag changes, I found that the senders
could spend ~20% of cpu cycles in skb_set_owner_w() updating
sk->sk_wmem_alloc for every fragment they cook, competing
with TX completion of prior skbs possibly happening on another cpus.

The solution to this problem is to use alloc_skb() instead
of sock_wmalloc() and manually perform a single sk_wmem_alloc change.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 10b8a3de 23-Mar-2018 Paolo Abeni <pabeni@redhat.com>

ipv6: the entire IPv6 header chain must fit the first fragment

While building ipv6 datagram we currently allow arbitrary large
extheaders, even beyond pmtu size. The syzbot has found a way
to exploit the above to trigger the following splat:

kernel BUG at ./include/linux/skbuff.h:2073!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 4230 Comm: syzkaller672661 Not tainted 4.16.0-rc2+ #326
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__skb_pull include/linux/skbuff.h:2073 [inline]
RIP: 0010:__ip6_make_skb+0x1ac8/0x2190 net/ipv6/ip6_output.c:1636
RSP: 0018:ffff8801bc18f0f0 EFLAGS: 00010293
RAX: ffff8801b17400c0 RBX: 0000000000000738 RCX: ffffffff84f01828
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8801b415ac18
RBP: ffff8801bc18f360 R08: ffff8801b4576844 R09: 0000000000000000
R10: ffff8801bc18f380 R11: ffffed00367aee4e R12: 00000000000000d6
R13: ffff8801b415a740 R14: dffffc0000000000 R15: ffff8801b45767c0
FS: 0000000001535880(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000002000b000 CR3: 00000001b4123001 CR4: 00000000001606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
ip6_finish_skb include/net/ipv6.h:969 [inline]
udp_v6_push_pending_frames+0x269/0x3b0 net/ipv6/udp.c:1073
udpv6_sendmsg+0x2a96/0x3400 net/ipv6/udp.c:1343
inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:764
sock_sendmsg_nosec net/socket.c:630 [inline]
sock_sendmsg+0xca/0x110 net/socket.c:640
___sys_sendmsg+0x320/0x8b0 net/socket.c:2046
__sys_sendmmsg+0x1ee/0x620 net/socket.c:2136
SYSC_sendmmsg net/socket.c:2167 [inline]
SyS_sendmmsg+0x35/0x60 net/socket.c:2162
do_syscall_64+0x280/0x940 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x4404c9
RSP: 002b:00007ffdce35f948 EFLAGS: 00000217 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004404c9
RDX: 0000000000000003 RSI: 0000000020001f00 RDI: 0000000000000003
RBP: 00000000006cb018 R08: 00000000004002c8 R09: 00000000004002c8
R10: 0000000020000080 R11: 0000000000000217 R12: 0000000000401df0
R13: 0000000000401e80 R14: 0000000000000000 R15: 0000000000000000
Code: ff e8 1d 5e b9 fc e9 15 e9 ff ff e8 13 5e b9 fc e9 44 e8 ff ff e8 29
5e b9 fc e9 c0 e6 ff ff e8 3f f3 80 fc 0f 0b e8 38 f3 80 fc <0f> 0b 49 8d
87 80 00 00 00 4d 8d 87 84 00 00 00 48 89 85 20 fe
RIP: __skb_pull include/linux/skbuff.h:2073 [inline] RSP: ffff8801bc18f0f0
RIP: __ip6_make_skb+0x1ac8/0x2190 net/ipv6/ip6_output.c:1636 RSP:
ffff8801bc18f0f0

As stated by RFC 7112 section 5:

When a host fragments an IPv6 datagram, it MUST include the entire
IPv6 Header Chain in the First Fragment.

So this patch addresses the issue dropping datagrams with excessive
extheader length. It also updates the error path to report to the
calling socket nonnegative pmtu values.

The issue apparently predates git history.

v1 -> v2: cleanup error path, as per Eric's suggestion

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+91e6f9932ff122fa4410@syzkaller.appspotmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 779b7931 28-Feb-2018 Daniel Axtens <dja@axtens.net>

net: rename skb_gso_validate_mtu -> skb_gso_validate_network_len

If you take a GSO skb, and split it into packets, will the network
length (L3 headers + L4 headers + payload) of those packets be small
enough to fit within a given MTU?

skb_gso_validate_mtu gives you the answer to that question. However,
we recently added to add a way to validate the MAC length of a split GSO
skb (L2+L3+L4+payload), and the names get confusing, so rename
skb_gso_validate_mtu to skb_gso_validate_network_len

Signed-off-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8571ab47 28-Feb-2018 Yuval Mintz <yuvalm@mellanox.com>

ip6mr: Make mroute_sk rcu-based

In ipmr the mr_table socket is handled under RCU. Introduce the same
for ip6mr.

Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e9191ffb 22-Jan-2018 Ben Hutchings <ben.hutchings@codethink.co.uk>

ipv6: Fix getsockopt() for sockets with default IPV6_AUTOFLOWLABEL

Commit 513674b5a2c9 ("net: reevalulate autoflowlabel setting after
sysctl setting") removed the initialisation of
ipv6_pinfo::autoflowlabel and added a second flag to indicate
whether this field or the net namespace default should be used.

The getsockopt() handling for this case was not updated, so it
currently returns 0 for all sockets for which IPV6_AUTOFLOWLABEL is
not explicitly enabled. Fix it to return the effective value, whether
that has been set at the socket or net namespace level.

Fixes: 513674b5a2c9 ("net: reevalulate autoflowlabel setting after sysctl ...")
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 95ef498d 11-Jan-2018 Eric Dumazet <edumazet@google.com>

ipv6: ip6_make_skb() needs to clear cork.base.dst

In my last patch, I missed fact that cork.base.dst was not initialized
in ip6_make_skb() :

If ip6_setup_cork() returns an error, we might attempt a dst_release()
on some random pointer.

Fixes: 862c03ee1deb ("ipv6: fix possible mem leaks in ipv6_make_skb()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 749439bf 09-Jan-2018 Mike Maloney <maloney@google.com>

ipv6: fix udpv6 sendmsg crash caused by too small MTU

The logic in __ip6_append_data() assumes that the MTU is at least large
enough for the headers. A device's MTU may be adjusted after being
added while sendmsg() is processing data, resulting in
__ip6_append_data() seeing any MTU. For an mtu smaller than the size of
the fragmentation header, the math results in a negative 'maxfraglen',
which causes problems when refragmenting any previous skb in the
skb_write_queue, leaving it possibly malformed.

Instead sendmsg returns EINVAL when the mtu is calculated to be less
than IPV6_MIN_MTU.

Found by syzkaller:
kernel BUG at ./include/linux/skbuff.h:2064!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 14216 Comm: syz-executor5 Not tainted 4.13.0-rc4+ #2
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
task: ffff8801d0b68580 task.stack: ffff8801ac6b8000
RIP: 0010:__skb_pull include/linux/skbuff.h:2064 [inline]
RIP: 0010:__ip6_make_skb+0x18cf/0x1f70 net/ipv6/ip6_output.c:1617
RSP: 0018:ffff8801ac6bf570 EFLAGS: 00010216
RAX: 0000000000010000 RBX: 0000000000000028 RCX: ffffc90003cce000
RDX: 00000000000001b8 RSI: ffffffff839df06f RDI: ffff8801d9478ca0
RBP: ffff8801ac6bf780 R08: ffff8801cc3f1dbc R09: 0000000000000000
R10: ffff8801ac6bf7a0 R11: 43cb4b7b1948a9e7 R12: ffff8801cc3f1dc8
R13: ffff8801cc3f1d40 R14: 0000000000001036 R15: dffffc0000000000
FS: 00007f43d740c700(0000) GS:ffff8801dc100000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7834984000 CR3: 00000001d79b9000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
ip6_finish_skb include/net/ipv6.h:911 [inline]
udp_v6_push_pending_frames+0x255/0x390 net/ipv6/udp.c:1093
udpv6_sendmsg+0x280d/0x31a0 net/ipv6/udp.c:1363
inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:762
sock_sendmsg_nosec net/socket.c:633 [inline]
sock_sendmsg+0xca/0x110 net/socket.c:643
SYSC_sendto+0x352/0x5a0 net/socket.c:1750
SyS_sendto+0x40/0x50 net/socket.c:1718
entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x4512e9
RSP: 002b:00007f43d740bc08 EFLAGS: 00000216 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00000000007180a8 RCX: 00000000004512e9
RDX: 000000000000002e RSI: 0000000020d08000 RDI: 0000000000000005
RBP: 0000000000000086 R08: 00000000209c1000 R09: 000000000000001c
R10: 0000000000040800 R11: 0000000000000216 R12: 00000000004b9c69
R13: 00000000ffffffff R14: 0000000000000005 R15: 00000000202c2000
Code: 9e 01 fe e9 c5 e8 ff ff e8 7f 9e 01 fe e9 4a ea ff ff 48 89 f7 e8 52 9e 01 fe e9 aa eb ff ff e8 a8 b6 cf fd 0f 0b e8 a1 b6 cf fd <0f> 0b 49 8d 45 78 4d 8d 45 7c 48 89 85 78 fe ff ff 49 8d 85 ba
RIP: __skb_pull include/linux/skbuff.h:2064 [inline] RSP: ffff8801ac6bf570
RIP: __ip6_make_skb+0x18cf/0x1f70 net/ipv6/ip6_output.c:1617 RSP: ffff8801ac6bf570

Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Mike Maloney <maloney@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 862c03ee 10-Jan-2018 Eric Dumazet <edumazet@google.com>

ipv6: fix possible mem leaks in ipv6_make_skb()

ip6_setup_cork() might return an error, while memory allocations have
been done and must be rolled back.

Fixes: 6422398c2ab0 ("ipv6: introduce ipv6_make_skb")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Reported-by: Mike Maloney <maloney@google.com>
Acked-by: Mike Maloney <maloney@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 09952107 06-Jan-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: flow table support for IPv6

This patch adds the IPv6 flow table type, that implements the datapath
flow table to forward IPv6 traffic.

This patch exports ip6_dst_mtu_forward() that is required to check for
mtu to pass up packets that need PMTUD handling to the classic
forwarding path.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 09ee9dba 21-Dec-2017 Tobias Brunner <tobias@strongswan.org>

ipv6: Reinject IPv6 packets if IPsec policy matches after SNAT

If SNAT modifies the source address the resulting packet might match
an IPsec policy, reinject the packet if that's the case.

The exact same thing is already done for IPv4.

Signed-off-by: Tobias Brunner <tobias@strongswan.org>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 513674b5 20-Dec-2017 Shaohua Li <shli@fb.com>

net: reevalulate autoflowlabel setting after sysctl setting

sysctl.ip6.auto_flowlabels is default 1. In our hosts, we set it to 2.
If sockopt doesn't set autoflowlabel, outcome packets from the hosts are
supposed to not include flowlabel. This is true for normal packet, but
not for reset packet.

The reason is ipv6_pinfo.autoflowlabel is set in sock creation. Later if
we change sysctl.ip6.auto_flowlabels, the ipv6_pinfo.autoflowlabel isn't
changed, so the sock will keep the old behavior in terms of auto
flowlabel. Reset packet is suffering from this problem, because reset
packet is sent from a special control socket, which is created at boot
time. Since sysctl.ipv6.auto_flowlabels is 1 by default, the control
socket will always have its ipv6_pinfo.autoflowlabel set, even after
user set sysctl.ipv6.auto_flowlabels to 1, so reset packset will always
have flowlabel. Normal sock created before sysctl setting suffers from
the same issue. We can't even turn off autoflowlabel unless we kill all
socks in the hosts.

To fix this, if IPV6_AUTOFLOWLABEL sockopt is used, we use the
autoflowlabel setting from user, otherwise we always call
ip6_default_np_autolabel() which has the new settings of sysctl.

Note, this changes behavior a little bit. Before commit 42240901f7c4
(ipv6: Implement different admin modes for automatic flow labels), the
autoflowlabel behavior of a sock isn't sticky, eg, if sysctl changes,
existing connection will change autoflowlabel behavior. After that
commit, autoflowlabel behavior is sticky in the whole life of the sock.
With this patch, the behavior isn't sticky again.

Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Tom Herbert <tom@quantonium.net>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0f6c480f 28-Nov-2017 David Miller <davem@davemloft.net>

xfrm: Move dst->path into struct xfrm_dst

The first member of an IPSEC route bundle chain sets it's dst->path to
the underlying ipv4/ipv6 route that carries the bundle.

Stated another way, if one were to follow the xfrm_dst->child chain of
the bundle, the final non-NULL pointer would be the path and point to
either an ipv4 or an ipv6 route.

This is largely used to make sure that PMTU events propagate down to
the correct ipv4 or ipv6 route.

When we don't have the top of an IPSEC bundle 'dst->path == dst'.

Move it down into xfrm_dst and key off of dst->xfrm.

Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>


# 864e2a1f 21-Oct-2017 Eric Dumazet <edumazet@google.com>

ipv6: flowlabel: do not leave opt->tot_len with garbage

When syzkaller team brought us a C repro for the crash [1] that
had been reported many times in the past, I finally could find
the root cause.

If FlowLabel info is merged by fl6_merge_options(), we leave
part of the opt_space storage provided by udp/raw/l2tp with random value
in opt_space.tot_len, unless a control message was provided at sendmsg()
time.

Then ip6_setup_cork() would use this random value to perform a kzalloc()
call. Undefined behavior and crashes.

Fix is to properly set tot_len in fl6_merge_options()

At the same time, we can also avoid consuming memory and cpu cycles
to clear it, if every option is copied via a kmemdup(). This is the
change in ip6_setup_cork().

[1]
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 6613 Comm: syz-executor0 Not tainted 4.14.0-rc4+ #127
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
task: ffff8801cb64a100 task.stack: ffff8801cc350000
RIP: 0010:ip6_setup_cork+0x274/0x15c0 net/ipv6/ip6_output.c:1168
RSP: 0018:ffff8801cc357550 EFLAGS: 00010203
RAX: dffffc0000000000 RBX: ffff8801cc357748 RCX: 0000000000000010
RDX: 0000000000000002 RSI: ffffffff842bd1d9 RDI: 0000000000000014
RBP: ffff8801cc357620 R08: ffff8801cb17f380 R09: ffff8801cc357b10
R10: ffff8801cb64a100 R11: 0000000000000000 R12: ffff8801cc357ab0
R13: ffff8801cc357b10 R14: 0000000000000000 R15: ffff8801c3bbf0c0
FS: 00007f9c5c459700(0000) GS:ffff8801db200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020324000 CR3: 00000001d1cf2000 CR4: 00000000001406f0
DR0: 0000000020001010 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Call Trace:
ip6_make_skb+0x282/0x530 net/ipv6/ip6_output.c:1729
udpv6_sendmsg+0x2769/0x3380 net/ipv6/udp.c:1340
inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:762
sock_sendmsg_nosec net/socket.c:633 [inline]
sock_sendmsg+0xca/0x110 net/socket.c:643
SYSC_sendto+0x358/0x5a0 net/socket.c:1750
SyS_sendto+0x40/0x50 net/socket.c:1718
entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x4520a9
RSP: 002b:00007f9c5c458c08 EFLAGS: 00000216 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 00000000004520a9
RDX: 0000000000000001 RSI: 0000000020fd1000 RDI: 0000000000000016
RBP: 0000000000000086 R08: 0000000020e0afe4 R09: 000000000000001c
R10: 0000000000000000 R11: 0000000000000216 R12: 00000000004bb1ee
R13: 00000000ffffffff R14: 0000000000000016 R15: 0000000000000029
Code: e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 ea 0f 00 00 48 8d 79 04 48 b8 00 00 00 00 00 fc ff df 45 8b 74 24 04 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85
RIP: ip6_setup_cork+0x274/0x15c0 net/ipv6/ip6_output.c:1168 RSP: ffff8801cc357550

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 85f1bd9a 09-Aug-2017 Willem de Bruijn <willemb@google.com>

udp: consistently apply ufo or fragmentation

When iteratively building a UDP datagram with MSG_MORE and that
datagram exceeds MTU, consistently choose UFO or fragmentation.

Once skb_is_gso, always apply ufo. Conversely, once a datagram is
split across multiple skbs, do not consider ufo.

Sendpage already maintains the first invariant, only add the second.
IPv6 does not have a sendpage implementation to modify.

A gso skb must have a partial checksum, do not follow sk_no_check_tx
in udp_send_skb.

Found by syzkaller.

Fixes: e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# afce615a 24-Jul-2017 Stefano Brivio <sbrivio@redhat.com>

ipv6: Don't increase IPSTATS_MIB_FRAGFAILS twice in ip6_fragment()

RFC 2465 defines ipv6IfStatsOutFragFails as:

"The number of IPv6 datagrams that have been discarded
because they needed to be fragmented at this output
interface but could not be."

The existing implementation, instead, would increase the counter
twice in case we fail to allocate room for single fragments:
once for the fragment, once for the datagram.

This didn't look intentional though. In one of the two affected
affected failure paths, the double increase was simply a result
of a new 'goto fail' statement, introduced to avoid a skb leak.
The other path appears to be affected since at least 2.6.12-rc2.

Reported-by: Sabrina Dubroca <sdubroca@redhat.com>
Fixes: 1d325d217c7f ("ipv6: ip6_fragment: fix headroom tests and skb leak")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 988cf74d 03-Jul-2017 David S. Miller <davem@davemloft.net>

inet: Stop generating UFO packets.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 14afee4b 30-Jun-2017 Reshetova, Elena <elena.reshetova@intel.com>

net: convert sock.sk_wmem_alloc from atomic_t to refcount_t

refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a5cb659b 19-Jun-2017 Michal Kubeček <mkubecek@suse.cz>

net: account for current skb length when deciding about UFO

Our customer encountered stuck NFS writes for blocks starting at specific
offsets w.r.t. page boundary caused by networking stack sending packets via
UFO enabled device with wrong checksum. The problem can be reproduced by
composing a long UDP datagram from multiple parts using MSG_MORE flag:

sendto(sd, buff, 1000, MSG_MORE, ...);
sendto(sd, buff, 1000, MSG_MORE, ...);
sendto(sd, buff, 3000, 0, ...);

Assume this packet is to be routed via a device with MTU 1500 and
NETIF_F_UFO enabled. When second sendto() gets into __ip_append_data(),
this condition is tested (among others) to decide whether to call
ip_ufo_append_data():

((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))

At the moment, we already have skb with 1028 bytes of data which is not
marked for GSO so that the test is false (fragheaderlen is usually 20).
Thus we append second 1000 bytes to this skb without invoking UFO. Third
sendto(), however, has sufficient length to trigger the UFO path so that we
end up with non-UFO skb followed by a UFO one. Later on, udp_send_skb()
uses udp_csum() to calculate the checksum but that assumes all fragments
have correct checksum in skb->csum which is not true for UFO fragments.

When checking against MTU, we need to add skb->len to length of new segment
if we already have a partially filled skb and fragheaderlen only if there
isn't one.

In the IPv6 case, skb can only be null if this is the first segment so that
we have to use headersize (length of the first IPv6 header) rather than
fragheaderlen (length of IPv6 header of further fragments) for skb == NULL.

Fixes: e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach")
Fixes: e4c5e13aa45c ("ipv6: Should use consistent conditional judgement for
ip6 fragment between __ip6_append_data and ip6_finish_output")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1758fd46 17-Jun-2017 Wei Wang <weiwan@google.com>

ipv6: remove unnecessary dst_hold() in ip6_fragment()

In ipv6 tx path, rcu_read_lock() is taken so that dst won't get freed
during the execution of ip6_fragment(). Hence, no need to hold dst in
it.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d58ff351 16-Jun-2017 Johannes Berg <johannes.berg@intel.com>

networking: make skb_push & __skb_push return void pointers

It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.

Make these functions return void * and remove all the casts across
the tree, adding a (u8 *) cast only where the unsigned char pointer
was used directly, all done with the following spatch:

@@
expression SKB, LEN;
typedef u8;
identifier fn = { skb_push, __skb_push, skb_push_rcsum };
@@
- *(fn(SKB, LEN))
+ *(u8 *)fn(SKB, LEN)

@@
expression E, SKB, LEN;
identifier fn = { skb_push, __skb_push, skb_push_rcsum };
type T;
@@
- E = ((T *)(fn(SKB, LEN)))
+ E = fn(SKB, LEN)

@@
expression SKB, LEN;
identifier fn = { skb_push, __skb_push, skb_push_rcsum };
@@
- fn(SKB, LEN)[0]
+ *(u8 *)fn(SKB, LEN)

Note that the last part there converts from push(...)[0] to the
more idiomatic *(u8 *)push(...).

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 89dfba3e 10-Jun-2017 Chenbo Feng <fengc@google.com>

Remove the redundant skb->dev initialization in ip6_fragment

After moves the skb->dev and skb->protocol initialization into
ip6_output, setting the skb->dev inside ip6_fragment is unnecessary.

Fixes: 97a7a37a7b7b("ipv6: Initial skb->dev and skb->protocol in ip6_output")
Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 97a7a37a 09-Jun-2017 Chenbo Feng <fengc@google.com>

ipv6: Initial skb->dev and skb->protocol in ip6_output

Move the initialization of skb->dev and skb->protocol from
ip6_finish_output2 to ip6_output. This can make the skb->dev and
skb->protocol information avalaible to the CGROUP eBPF filter.

Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 232cd35d 19-May-2017 Eric Dumazet <edumazet@google.com>

ipv6: fix out of bound writes in __ip6_append_data()

Andrey Konovalov and idaifish@gmail.com reported crashes caused by
one skb shared_info being overwritten from __ip6_append_data()

Andrey program lead to following state :

copy -4200 datalen 2000 fraglen 2040
maxfraglen 2040 alloclen 2048 transhdrlen 0 offset 0 fraggap 6200

The skb_copy_and_csum_bits(skb_prev, maxfraglen, data + transhdrlen,
fraggap, 0); is overwriting skb->head and skb_shared_info

Since we apparently detect this rare condition too late, move the
code earlier to even avoid allocating skb and risking crashes.

Once again, many thanks to Andrey and syzkaller team.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: <idaifish@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7dd7eb95 17-May-2017 David S. Miller <davem@davemloft.net>

ipv6: Check ip6_find_1stfragopt() return value properly.

Do not use unsigned variables to see if it returns a negative
error or not.

Fixes: 2423496af35d ("ipv6: Prevent overrun when parsing v6 header options")
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2423496a 16-May-2017 Craig Gallek <kraig@google.com>

ipv6: Prevent overrun when parsing v6 header options

The KASAN warning repoted below was discovered with a syzkaller
program. The reproducer is basically:
int s = socket(AF_INET6, SOCK_RAW, NEXTHDR_HOP);
send(s, &one_byte_of_data, 1, MSG_MORE);
send(s, &more_than_mtu_bytes_data, 2000, 0);

The socket() call sets the nexthdr field of the v6 header to
NEXTHDR_HOP, the first send call primes the payload with a non zero
byte of data, and the second send call triggers the fragmentation path.

The fragmentation code tries to parse the header options in order
to figure out where to insert the fragment option. Since nexthdr points
to an invalid option, the calculation of the size of the network header
can made to be much larger than the linear section of the skb and data
is read outside of it.

This fix makes ip6_find_1stfrag return an error if it detects
running out-of-bounds.

[ 42.361487] ==================================================================
[ 42.364412] BUG: KASAN: slab-out-of-bounds in ip6_fragment+0x11c8/0x3730
[ 42.365471] Read of size 840 at addr ffff88000969e798 by task ip6_fragment-oo/3789
[ 42.366469]
[ 42.366696] CPU: 1 PID: 3789 Comm: ip6_fragment-oo Not tainted 4.11.0+ #41
[ 42.367628] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
[ 42.368824] Call Trace:
[ 42.369183] dump_stack+0xb3/0x10b
[ 42.369664] print_address_description+0x73/0x290
[ 42.370325] kasan_report+0x252/0x370
[ 42.370839] ? ip6_fragment+0x11c8/0x3730
[ 42.371396] check_memory_region+0x13c/0x1a0
[ 42.371978] memcpy+0x23/0x50
[ 42.372395] ip6_fragment+0x11c8/0x3730
[ 42.372920] ? nf_ct_expect_unregister_notifier+0x110/0x110
[ 42.373681] ? ip6_copy_metadata+0x7f0/0x7f0
[ 42.374263] ? ip6_forward+0x2e30/0x2e30
[ 42.374803] ip6_finish_output+0x584/0x990
[ 42.375350] ip6_output+0x1b7/0x690
[ 42.375836] ? ip6_finish_output+0x990/0x990
[ 42.376411] ? ip6_fragment+0x3730/0x3730
[ 42.376968] ip6_local_out+0x95/0x160
[ 42.377471] ip6_send_skb+0xa1/0x330
[ 42.377969] ip6_push_pending_frames+0xb3/0xe0
[ 42.378589] rawv6_sendmsg+0x2051/0x2db0
[ 42.379129] ? rawv6_bind+0x8b0/0x8b0
[ 42.379633] ? _copy_from_user+0x84/0xe0
[ 42.380193] ? debug_check_no_locks_freed+0x290/0x290
[ 42.380878] ? ___sys_sendmsg+0x162/0x930
[ 42.381427] ? rcu_read_lock_sched_held+0xa3/0x120
[ 42.382074] ? sock_has_perm+0x1f6/0x290
[ 42.382614] ? ___sys_sendmsg+0x167/0x930
[ 42.383173] ? lock_downgrade+0x660/0x660
[ 42.383727] inet_sendmsg+0x123/0x500
[ 42.384226] ? inet_sendmsg+0x123/0x500
[ 42.384748] ? inet_recvmsg+0x540/0x540
[ 42.385263] sock_sendmsg+0xca/0x110
[ 42.385758] SYSC_sendto+0x217/0x380
[ 42.386249] ? SYSC_connect+0x310/0x310
[ 42.386783] ? __might_fault+0x110/0x1d0
[ 42.387324] ? lock_downgrade+0x660/0x660
[ 42.387880] ? __fget_light+0xa1/0x1f0
[ 42.388403] ? __fdget+0x18/0x20
[ 42.388851] ? sock_common_setsockopt+0x95/0xd0
[ 42.389472] ? SyS_setsockopt+0x17f/0x260
[ 42.390021] ? entry_SYSCALL_64_fastpath+0x5/0xbe
[ 42.390650] SyS_sendto+0x40/0x50
[ 42.391103] entry_SYSCALL_64_fastpath+0x1f/0xbe
[ 42.391731] RIP: 0033:0x7fbbb711e383
[ 42.392217] RSP: 002b:00007ffff4d34f28 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 42.393235] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbbb711e383
[ 42.394195] RDX: 0000000000001000 RSI: 00007ffff4d34f60 RDI: 0000000000000003
[ 42.395145] RBP: 0000000000000046 R08: 00007ffff4d34f40 R09: 0000000000000018
[ 42.396056] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000400aad
[ 42.396598] R13: 0000000000000066 R14: 00007ffff4d34ee0 R15: 00007fbbb717af00
[ 42.397257]
[ 42.397411] Allocated by task 3789:
[ 42.397702] save_stack_trace+0x16/0x20
[ 42.398005] save_stack+0x46/0xd0
[ 42.398267] kasan_kmalloc+0xad/0xe0
[ 42.398548] kasan_slab_alloc+0x12/0x20
[ 42.398848] __kmalloc_node_track_caller+0xcb/0x380
[ 42.399224] __kmalloc_reserve.isra.32+0x41/0xe0
[ 42.399654] __alloc_skb+0xf8/0x580
[ 42.400003] sock_wmalloc+0xab/0xf0
[ 42.400346] __ip6_append_data.isra.41+0x2472/0x33d0
[ 42.400813] ip6_append_data+0x1a8/0x2f0
[ 42.401122] rawv6_sendmsg+0x11ee/0x2db0
[ 42.401505] inet_sendmsg+0x123/0x500
[ 42.401860] sock_sendmsg+0xca/0x110
[ 42.402209] ___sys_sendmsg+0x7cb/0x930
[ 42.402582] __sys_sendmsg+0xd9/0x190
[ 42.402941] SyS_sendmsg+0x2d/0x50
[ 42.403273] entry_SYSCALL_64_fastpath+0x1f/0xbe
[ 42.403718]
[ 42.403871] Freed by task 1794:
[ 42.404146] save_stack_trace+0x16/0x20
[ 42.404515] save_stack+0x46/0xd0
[ 42.404827] kasan_slab_free+0x72/0xc0
[ 42.405167] kfree+0xe8/0x2b0
[ 42.405462] skb_free_head+0x74/0xb0
[ 42.405806] skb_release_data+0x30e/0x3a0
[ 42.406198] skb_release_all+0x4a/0x60
[ 42.406563] consume_skb+0x113/0x2e0
[ 42.406910] skb_free_datagram+0x1a/0xe0
[ 42.407288] netlink_recvmsg+0x60d/0xe40
[ 42.407667] sock_recvmsg+0xd7/0x110
[ 42.408022] ___sys_recvmsg+0x25c/0x580
[ 42.408395] __sys_recvmsg+0xd6/0x190
[ 42.408753] SyS_recvmsg+0x2d/0x50
[ 42.409086] entry_SYSCALL_64_fastpath+0x1f/0xbe
[ 42.409513]
[ 42.409665] The buggy address belongs to the object at ffff88000969e780
[ 42.409665] which belongs to the cache kmalloc-512 of size 512
[ 42.410846] The buggy address is located 24 bytes inside of
[ 42.410846] 512-byte region [ffff88000969e780, ffff88000969e980)
[ 42.411941] The buggy address belongs to the page:
[ 42.412405] page:ffffea000025a780 count:1 mapcount:0 mapping: (null) index:0x0 compound_mapcount: 0
[ 42.413298] flags: 0x100000000008100(slab|head)
[ 42.413729] raw: 0100000000008100 0000000000000000 0000000000000000 00000001800c000c
[ 42.414387] raw: ffffea00002a9500 0000000900000007 ffff88000c401280 0000000000000000
[ 42.415074] page dumped because: kasan: bad access detected
[ 42.415604]
[ 42.415757] Memory state around the buggy address:
[ 42.416222] ffff88000969e880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 42.416904] ffff88000969e900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 42.417591] >ffff88000969e980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 42.418273] ^
[ 42.418588] ffff88000969ea00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 42.419273] ffff88000969ea80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 42.419882] ==================================================================

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 79e49503 13-Mar-2017 Florian Westphal <fw@strlen.de>

ipv6: avoid write to a possibly cloned skb

ip6_fragment, in case skb has a fraglist, checks if the
skb is cloned. If it is, it will move to the 'slow path' and allocates
new skbs for each fragment.

However, right before entering the slowpath loop, it updates the
nexthdr value of the last ipv6 extension header to NEXTHDR_FRAGMENT,
to account for the fragment header that will be inserted in the new
ipv6-fragment skbs.

In case original skb is cloned this munges nexthdr value of another
skb. Avoid this by doing the nexthdr update for each of the new fragment
skbs separately.

This was observed with tcpdump on a bridge device where netfilter ipv6
reassembly is active: tcpdump shows malformed fragment headers as
the l4 header (icmpv6, tcp, etc). is decoded as a fragment header.

Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reported-by: Andreas Karis <akaris@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4b3b45ed 09-Mar-2017 Alexey Kodanev <alexey.kodanev@oracle.com>

udp: avoid ufo handling on IP payload compression packets

commit c146066ab802 ("ipv4: Don't use ufo handling on later transformed
packets") and commit f89c56ce710a ("ipv6: Don't use ufo handling on
later transformed packets") added a check that 'rt->dst.header_len' isn't
zero in order to skip UFO, but it doesn't include IPcomp in transport mode
where it equals zero.

Packets, after payload compression, may not require further fragmentation,
and if original length exceeds MTU, later compressed packets will be
transmitted incorrectly. This can be reproduced with LTP udp_ipsec.sh test
on veth device with enabled UFO, MTU is 1500 and UDP payload is 2000:

* IPv4 case, offset is wrong + unnecessary fragmentation
udp_ipsec.sh -p comp -m transport -s 2000 &
tcpdump -ni ltp_ns_veth2
...
IP (tos 0x0, ttl 64, id 45203, offset 0, flags [+],
proto Compressed IP (108), length 49)
10.0.0.2 > 10.0.0.1: IPComp(cpi=0x1000)
IP (tos 0x0, ttl 64, id 45203, offset 1480, flags [none],
proto UDP (17), length 21) 10.0.0.2 > 10.0.0.1: ip-proto-17

* IPv6 case, sending small fragments
udp_ipsec.sh -6 -p comp -m transport -s 2000 &
tcpdump -ni ltp_ns_veth2
...
IP6 (flowlabel 0x6b9ba, hlim 64, next-header Compressed IP (108)
payload length: 37) fd00::2 > fd00::1: IPComp(cpi=0x1000)
IP6 (flowlabel 0x6b9ba, hlim 64, next-header Compressed IP (108)
payload length: 21) fd00::2 > fd00::1: IPComp(cpi=0x1000)

Fix it by checking 'rt->dst.xfrm' pointer to 'xfrm_state' struct, skip UFO
if xfrm is set. So the new check will include both cases: IPcomp and IPsec.

Fixes: c146066ab802 ("ipv4: Don't use ufo handling on later transformed packets")
Fixes: f89c56ce710a ("ipv6: Don't use ufo handling on later transformed packets")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 00ea1cee 18-Feb-2017 Willem de Bruijn <willemb@google.com>

ipv6: release dst on error in ip6_dst_lookup_tail

If ip6_dst_lookup_tail has acquired a dst and fails the IPv4-mapped
check, release the dst before returning an error.

Fixes: ec5e3b0a1d41 ("ipv6: Inhibit IPv4-mapped src address on the wire.")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ec5e3b0a 12-Feb-2017 Jonathan T. Leighton <jtleight@udel.edu>

ipv6: Inhibit IPv4-mapped src address on the wire.

This patch adds a check for the problematic case of an IPv4-mapped IPv6
source address and a destination address that is neither an IPv4-mapped
IPv6 address nor in6addr_any, and returns an appropriate error. The
check in done before returning from looking up the route.

Signed-off-by: Jonathan T. Leighton <jtleight@udel.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c16ec185 11-Feb-2017 Julian Anastasov <ja@ssi.bg>

net: rename dst_neigh_output back to neigh_output

After the dst->pending_confirm flag was removed, we do not
need anymore to provide dst arg to dst_neigh_output.
So, rename it to neigh_output as before commit 5110effee8fd
("net: Do delayed neigh confirmation.").

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0dec879f 06-Feb-2017 Julian Anastasov <ja@ssi.bg>

net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TP

When same struct dst_entry can be used for many different
neighbours we can not use it for pending confirmations.

The datagram protocols can use MSG_CONFIRM to confirm the
neighbour. When used with MSG_PROBE we do not reach the
code where neighbour is confirmed, so we have to do the
same slow lookup by using the dst_confirm_neigh() helper.
When MSG_PROBE is not used, ip_append_data/ip6_append_data
will set the skb flag dst_pending_confirm.

Reported-by: YueHaibing <yuehaibing@huawei.com>
Fixes: 5110effee8fd ("net: Do delayed neigh confirmation.")
Fixes: f2bb4bedf35d ("ipv4: Cache output routes in fib_info nexthops.")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4ff06203 06-Feb-2017 Julian Anastasov <ja@ssi.bg>

net: add dst_pending_confirm flag to skbuff

Add new skbuff flag to allow protocols to confirm neighbour.
When same struct dst_entry can be used for many different
neighbours we can not use it for pending confirmations.

Add sock_confirm_neigh() helper to confirm the neighbour and
use it for IPv4, IPv6 and VRF before dst_neigh_output.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2b89ed65 29-Jan-2017 Vlad Yasevich <vyasevich@gmail.com>

ipv6: Paritially checksum full MTU frames

IPv6 will mark data that is smaller that mtu - headersize as
CHECKSUM_PARTIAL, but if the data will completely fill the mtu,
the packet checksum will be computed in software instead.
Extend the conditional to include the data that fills the mtu
as well.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 92e55f41 26-Jan-2017 Pablo Neira <pablo@netfilter.org>

tcp: don't annotate mark on control socket from tcp_v6_send_response()

Unlike ipv4, this control socket is shared by all cpus so we cannot use
it as scratchpad area to annotate the mark that we pass to ip6_xmit().

Add a new parameter to ip6_xmit() to indicate the mark. The SCTP socket
family caches the flowi6 structure in the sctp_transport structure, so
we cannot use to carry the mark unless we later on reset it back, which
I discarded since it looks ugly to me.

Fixes: bf99b4ded5f8 ("tcp: fix mark propagation with fwmark_reflect enabled")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e4c5e13a 28-Dec-2016 Zheng Li <james.z.li@ericsson.com>

ipv6: Should use consistent conditional judgement for ip6 fragment between __ip6_append_data and ip6_finish_output

There is an inconsistent conditional judgement between __ip6_append_data
and ip6_finish_output functions, the variable length in __ip6_append_data
just include the length of application's payload and udp6 header, don't
include the length of ipv6 header, but in ip6_finish_output use
(skb->len > ip6_skb_dst_mtu(skb)) as judgement, and skb->len include the
length of ipv6 header.

That causes some particular application's udp6 payloads whose length are
between (MTU - IPv6 Header) and MTU were fragmented by ip6_fragment even
though the rst->dev support UFO feature.

Add the length of ipv6 header to length in __ip6_append_data to keep
consistent conditional judgement as ip6_finish_output for ip6 fragment.

Signed-off-by: Zheng Li <james.z.li@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 33b48679 23-Nov-2016 Daniel Mack <daniel@zonque.org>

net: ipv4, ipv6: run cgroup eBPF egress programs

If the cgroup associated with the receiving socket has an eBPF
programs installed, run them from ip_output(), ip6_output() and
ip_mc_output(). From mentioned functions we have two socket contexts
as per 7026b1ddb6b8 ("netfilter: Pass socket pointer down through
okfn()."). We explicitly need to use sk instead of skb->sk here,
since otherwise the same program would run multiple times on egress
when encap devices are involved, which is not desired in our case.

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the skb through bpf_skb_load_bytes(), and the payload starts at the
network headers (L3).

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c72d8cda 18-Nov-2016 Alexey Dobriyan <adobriyan@gmail.com>

net: fix bogus cast in skb_pagelen() and use unsigned variables

1) cast to "int" is unnecessary:
u8 will be promoted to int before decrementing,
small positive numbers fit into "int", so their values won't be changed
during promotion.

Once everything is int including loop counters, signedness doesn't
matter: 32-bit operations will stay 32-bit operations.

But! Someone tried to make this loop smart by making everything of
the same type apparently in an attempt to optimise it.
Do the optimization, just differently.
Do the cast where it matters. :^)

2) frag size is unsigned entity and sum of fragments sizes is also
unsigned.

Make everything unsigned, leave no MOVSX instruction behind.

add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-4 (-4)
function old new delta
skb_cow_data 835 834 -1
ip_do_fragment 2549 2548 -1
ip6_fragment 3130 3128 -2
Total: Before=154865032, After=154865028, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 613fa3ca 08-Nov-2016 David Lebrun <david.lebrun@uclouvain.be>

ipv6: add source address argument for ipv6_push_nfrag_opts

This patch prepares for insertion of SRH through setsockopt().
The new source address argument is used when an HMAC field is
present in the SRH, which must be filled. The HMAC signature
process requires the source address as input text.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f89c56ce 26-Oct-2016 Jakub Sitnicki <jkbs@redhat.com>

ipv6: Don't use ufo handling on later transformed packets

Similar to commit c146066ab802 ("ipv4: Don't use ufo handling on later
transformed packets"), don't perform UFO on packets that will be IPsec
transformed. To detect it we rely on the fact that headerlen in
dst_entry is non-zero only for transformation bundles (xfrm_dst
objects).

Unwanted segmentation can be observed with a NETIF_F_UFO capable device,
such as a dummy device:

DEV=dum0 LEN=1493

ip li add $DEV type dummy
ip addr add fc00::1/64 dev $DEV nodad
ip link set $DEV up
ip xfrm policy add dir out src fc00::1 dst fc00::2 \
tmpl src fc00::1 dst fc00::2 proto esp spi 1
ip xfrm state add src fc00::1 dst fc00::2 \
proto esp spi 1 enc 'aes' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b

tcpdump -n -nn -i $DEV -t &
socat /dev/zero,readbytes=$LEN udp6:[fc00::2]:$LEN

tcpdump output before:

IP6 fc00::1 > fc00::2: frag (0|1448) ESP(spi=0x00000001,seq=0x1), length 1448
IP6 fc00::1 > fc00::2: frag (1448|48)
IP6 fc00::1 > fc00::2: ESP(spi=0x00000001,seq=0x2), length 88

... and after:

IP6 fc00::1 > fc00::2: frag (0|1448) ESP(spi=0x00000001,seq=0x1), length 1448
IP6 fc00::1 > fc00::2: frag (1448|80)

Fixes: e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach")

Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8a966fc0 10-Sep-2016 David Ahern <dsa@cumulusnetworks.com>

net: ipv6: Remove l3mdev_get_saddr6

No longer needed

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e0d56fdd 10-Sep-2016 David Ahern <dsa@cumulusnetworks.com>

net: l3mdev: remove redundant calls

A previous patch added l3mdev flow update making these hooks
redundant. Remove them.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a8e3e1a9 10-Sep-2016 David Ahern <dsa@cumulusnetworks.com>

net: l3mdev: Add hook to output path

This patch adds the infrastructure to the output path to pass an skb
to an l3mdev device if it has a hook registered. This is the Tx parallel
to l3mdev_ip{6}_rcv in the receive path and is the basis for removing
the existing hook that returns the vrf dst on the fib lookup.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 14972cbd 24-Aug-2016 Roopa Prabhu <roopa@cumulusnetworks.com>

net: lwtunnel: Handle fragmentation

Today mpls iptunnel lwtunnel_output redirect expects the tunnel
output function to handle fragmentation. This is ok but can be
avoided if we did not do the mpls output redirect too early.
ie we could wait until ip fragmentation is done and then call
mpls output for each ip fragment.

To make this work we will need,
1) the lwtunnel state to carry encap headroom
2) and do the redirect to the encap output handler on the ip fragment
(essentially do the output redirect after fragmentation)

This patch adds tunnel headroom in lwtstate to make sure we
account for tunnel data in mtu calculations during fragmentation
and adds new xmit redirect handler to redirect to lwtunnel xmit func
after ip fragmentation.

This includes IPV6 and some mtu fixes and testing from David Ahern.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0d240e78 16-Jun-2016 David Ahern <dsa@cumulusnetworks.com>

net: vrf: Implement get_saddr for IPv6

IPv6 source address selection needs to consider the real egress route.
Similar to IPv4 implement a get_saddr6 method which is called if
source address has not been set. The get_saddr6 method does a full
lookup which means pulling a route from the VRF FIB table and properly
considering linklocal/multicast destination addresses. Lookup failures
(eg., unreachable) then cause the source address selection to fail
which gets propagated back to the caller.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 00bc0ef5 08-Jun-2016 Jakub Sitnicki <jkbs@redhat.com>

ipv6: Skip XFRM lookup if dst_entry in socket cache is valid

At present we perform an xfrm_lookup() for each UDPv6 message we
send. The lookup involves querying the flow cache (flow_cache_lookup)
and, in case of a cache miss, creating an XFRM bundle.

If we miss the flow cache, we can end up creating a new bundle and
deriving the path MTU (xfrm_init_pmtu) from on an already transformed
dst_entry, which we pass from the socket cache (sk->sk_dst_cache) down
to xfrm_lookup(). This can happen only if we're caching the dst_entry
in the socket, that is when we're using a connected UDP socket.

To put it another way, the path MTU shrinks each time we miss the flow
cache, which later on leads to incorrectly fragmented payload. It can
be observed with ESPv6 in transport mode:

1) Set up a transformation and lower the MTU to trigger fragmentation
# ip xfrm policy add dir out src ::1 dst ::1 \
tmpl src ::1 dst ::1 proto esp spi 1
# ip xfrm state add src ::1 dst ::1 \
proto esp spi 1 enc 'aes' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b
# ip link set dev lo mtu 1500

2) Monitor the packet flow and set up an UDP sink
# tcpdump -ni lo -ttt &
# socat udp6-listen:12345,fork /dev/null &

3) Send a datagram that needs fragmentation with a connected socket
# perl -e 'print "@" x 1470 | socat - udp6:[::1]:12345
2016/06/07 18:52:52 socat[724] E read(3, 0x555bb3d5ba00, 8192): Protocol error
00:00:00.000000 IP6 ::1 > ::1: frag (0|1448) ESP(spi=0x00000001,seq=0x2), length 1448
00:00:00.000014 IP6 ::1 > ::1: frag (1448|32)
00:00:00.000050 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x3), length 1272
(^ ICMPv6 Parameter Problem)
00:00:00.000022 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x5), length 136

4) Compare it to a non-connected socket
# perl -e 'print "@" x 1500' | socat - udp6-sendto:[::1]:12345
00:00:40.535488 IP6 ::1 > ::1: frag (0|1448) ESP(spi=0x00000001,seq=0x6), length 1448
00:00:00.000010 IP6 ::1 > ::1: frag (1448|64)

What happens in step (3) is:

1) when connecting the socket in __ip6_datagram_connect(), we
perform an XFRM lookup, miss the flow cache, create an XFRM
bundle, and cache the destination,

2) afterwards, when sending the datagram, we perform an XFRM lookup,
again, miss the flow cache (due to mismatch of flowi6_iif and
flowi6_oif, which is an issue of its own), and recreate an XFRM
bundle based on the cached (and already transformed) destination.

To prevent the recreation of an XFRM bundle, avoid an XFRM lookup
altogether whenever we already have a destination entry cached in the
socket. This prevents the path MTU shrinkage and brings us on par with
UDPv4.

The fix also benefits connected PINGv6 sockets, another user of
ip6_sk_dst_lookup_flow(), who also suffer messages being transformed
twice.

Joint work with Hannes Frederic Sowa.

Reported-by: Jan Tluka <jtluka@redhat.com>
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ae7ef81e 02-Jun-2016 Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

skbuff: introduce skb_gso_validate_mtu

skb_gso_network_seglen is not enough for checking fragment sizes if
skb is using GSO_BY_FRAGS as we have to check frag per frag.

This patch introduces skb_gso_validate_mtu, based on the former, which
will wrap the use case inside it as all calls to skb_gso_network_seglen
were to validate if it fits on a given TMU, and improve the check.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 26879da5 02-May-2016 Wei Wang <weiwan@google.com>

ipv6: add new struct ipcm6_cookie

In the sendmsg function of UDP, raw, ICMP and l2tp sockets, we use local
variables like hlimits, tclass, opt and dontfrag and pass them to corresponding
functions like ip6_make_skb, ip6_append_data and xxx_push_pending_frames.
This is not a good practice and makes it hard to add new parameters.
This fix introduces a new struct ipcm6_cookie similar to ipcm_cookie in
ipv4 and include the above mentioned variables. And we only pass the
pointer to this structure to corresponding functions. This makes it easier
to add new parameters in the future and makes the function cleaner.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1d015503 27-Apr-2016 Eric Dumazet <edumazet@google.com>

ipv6: rename IP6_INC_STATS_BH()

Rename IP6_INC_STATS_BH() to __IP6_INC_STATS()
and IP6_ADD_STATS_BH() to __IP6_ADD_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3ba3458f 05-Apr-2016 Jakub Sitnicki <jkbs@redhat.com>

ipv6: Count in extension headers in skb->network_header

When sending a UDPv6 message longer than MTU, account for the length
of fragmentable IPv6 extension headers in skb->network_header offset.
Same as we do in alloc_new_skb path in __ip6_append_data().

This ensures that later on __ip6_make_skb() will make space in
headroom for fragmentable extension headers:

/* move skb->data to ip header from ext header */
if (skb->data < skb_network_header(skb))
__skb_pull(skb, skb_network_offset(skb));

Prevents a splat due to skb_under_panic:

skbuff: skb_under_panic: text:ffffffff8143397b len:2126 put:14 \
head:ffff880005bacf50 data:ffff880005bacf4a tail:0x48 end:0xc0 dev:lo
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:104!
invalid opcode: 0000 [#1] KASAN
CPU: 0 PID: 160 Comm: reproducer Not tainted 4.6.0-rc2 #65
[...]
Call Trace:
[<ffffffff813eb7b9>] skb_push+0x79/0x80
[<ffffffff8143397b>] eth_header+0x2b/0x100
[<ffffffff8141e0d0>] neigh_resolve_output+0x210/0x310
[<ffffffff814eab77>] ip6_finish_output2+0x4a7/0x7c0
[<ffffffff814efe3a>] ip6_output+0x16a/0x280
[<ffffffff815440c1>] ip6_local_out+0xb1/0xf0
[<ffffffff814f1115>] ip6_send_skb+0x45/0xd0
[<ffffffff81518836>] udp_v6_send_skb+0x246/0x5d0
[<ffffffff8151985e>] udpv6_sendmsg+0xa6e/0x1090
[...]

Reported-by: Ji Jianwen <jiji@redhat.com>
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c14ac945 02-Apr-2016 Soheil Hassas Yeganeh <soheil@google.com>

sock: enable timestamping using control messages

Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
This is very costly when users want to sample writes to gather
tx timestamps.

Add support for enabling SO_TIMESTAMPING via control messages by
using tsflags added in `struct sockcm_cookie` (added in the previous
patches in this series) to set the tx_flags of the last skb created in
a sendmsg. With this patch, the timestamp recording bits in tx_flags
of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.

Please note that this is only effective for overriding the recording
timestamps flags. Users should enable timestamp reporting (e.g.,
SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
socket options and then should ask for SOF_TIMESTAMPING_TX_*
using control messages per sendmsg to sample timestamps for each
write.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 64d4e343 27-Feb-2016 WANG Cong <xiyou.wangcong@gmail.com>

net: remove skb_sender_cpu_clear()

After commit 52bd2d62ce67 ("net: better skb->sender_cpu and skb->napi_id cohabitation")
skb_sender_cpu_clear() becomes empty and can be removed.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6f21c96a 28-Jan-2016 Paolo Abeni <pabeni@redhat.com>

ipv6: enforce flowi6_oif usage in ip6_dst_lookup_tail()

The current implementation of ip6_dst_lookup_tail basically
ignore the egress ifindex match: if the saddr is set,
ip6_route_output() purposefully ignores flowi6_oif, due
to the commit d46a9d678e4c ("net: ipv6: Dont add RT6_LOOKUP_F_IFACE
flag if saddr set"), if the saddr is 'any' the first route lookup
in ip6_dst_lookup_tail fails, but upon failure a second lookup will
be performed with saddr set, thus ignoring the ifindex constraint.

This commit adds an output route lookup function variant, which
allows the caller to specify lookup flags, and modify
ip6_dst_lookup_tail() to enforce the ifindex match on the second
lookup via said helper.

ip6_route_output() becames now a static inline function build on
top of ip6_route_output_flags(); as a side effect, out-of-tree
modules need now a GPL license to access the output route lookup
functionality.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 40ba3302 10-Jan-2016 Michal Kubeček <mkubecek@suse.cz>

udp: disallow UFO for sockets with SO_NO_CHECK option

Commit acf8dd0a9d0b ("udp: only allow UFO for packets from SOCK_DGRAM
sockets") disallows UFO for packets sent from raw sockets. We need to do
the same also for SOCK_DGRAM sockets with SO_NO_CHECK options, even if
for a bit different reason: while such socket would override the
CHECKSUM_PARTIAL set by ip_ufo_append_data(), gso_size is still set and
bad offloading flags warning is triggered in __skb_gso_segment().

In the IPv6 case, SO_NO_CHECK option is ignored but we need to disallow
UFO for packets sent by sockets with UDP_NO_CHECK6_TX option.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Tested-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c8cd0989 14-Dec-2015 Tom Herbert <tom@herbertland.com>

net: Eliminate NETIF_F_GEN_CSUM and NETIF_F_V[46]_CSUM

These netif flags are unnecessary convolutions. It is more
straightforward to just use NETIF_F_HW_CSUM, NETIF_F_IP_CSUM,
and NETIF_F_IPV6_CSUM directly.

This patch also:
- Cleans up can_checksum_protocol
- Simplifies netdev_intersect_features

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 405c92f7 27-Oct-2015 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: add defensive check for CHECKSUM_PARTIAL skbs in ip_fragment

CHECKSUM_PARTIAL skbs should never arrive in ip_fragment. If we get one
of those warn about them once and handle them gracefully by recalculating
the checksum.

Fixes: commit 32dce968dd987 ("ipv6: Allow for partial checksums on non-ufo packets")
See-also: commit 72e843bb09d45 ("ipv6: ip6_fragment() should check CHECKSUM_PARTIAL")
Cc: Eric Dumazet <edumazet@google.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 682b1a9d 27-Oct-2015 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: no CHECKSUM_PARTIAL on MSG_MORE corked sockets

We cannot reliable calculate packet size on MSG_MORE corked sockets
and thus cannot decide if they are going to be fragmented later on,
so better not use CHECKSUM_PARTIAL in the first place.

The IPv6 code also intended to protect and not use CHECKSUM_PARTIAL in
the existence of IPv6 extension headers, but the condition was wrong. Fix
it up, too. Also the condition to check whether the packet fits into
one fragment was wrong and has been corrected.

Fixes: commit 32dce968dd987 ("ipv6: Allow for partial checksums on non-ufo packets")
See-also: commit 72e843bb09d45 ("ipv6: ip6_fragment() should check CHECKSUM_PARTIAL")
Cc: Eric Dumazet <edumazet@google.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 89bc7848 28-Oct-2015 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: protect mtu calculation of wrap-around and infinite loop by rounding issues

Raw sockets with hdrincl enabled can insert ipv6 extension headers
right into the data stream. In case we need to fragment those packets,
we reparse the options header to find the place where we can insert
the fragment header. If the extension headers exceed the link's MTU we
actually cannot make progress in such a case.

Instead of ending up in broken arithmetic or rounding towards 0 and
entering an endless loop in ip6_fragment, just prevent those cases by
aborting early and signal -EMSGSIZE to user space.

This is the second version of the patch which doesn't use the
overflow_usub function, which got reverted for now.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1e0d69a9 28-Oct-2015 Hannes Frederic Sowa <hannes@stressinduktion.org>

Revert "Merge branch 'ipv6-overflow-arith'"

Linus dislikes these changes. To not hold up the net-merge let's revert
it for now and fix the bug like Linus suggested.

This reverts commit ec3661b42257d9a06cf0d318175623ac7a660113, reversing
changes made to c80dbe04612986fd6104b4a1be21681b113b5ac9.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b72a2b01 16-Oct-2015 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: protect mtu calculation of wrap-around and infinite loop by rounding issues

Raw sockets with hdrincl enabled can insert ipv6 extension headers
right into the data stream. In case we need to fragment those packets,
we reparse the options header to find the place where we can insert
the fragment header. If the extension headers exceed the link's MTU we
actually cannot make progress in such a case.

Instead of ending up in broken arithmetic or rounding towards 0 and
entering an endless loop in ip6_fragment, just prevent those cases by
aborting early and signal -EMSGSIZE to user space.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f1900fb5 19-Oct-2015 David Ahern <dsa@cumulusnetworks.com>

net: Really fix vti6 with oif in dst lookups

6e28b000825d ("net: Fix vti use case with oif in dst lookups for IPv6")
is missing the checks on FLOWI_FLAG_SKIP_NH_OIF. Add them.

Fixes: 42a7b32b73d6 ("xfrm: Add oif to dst lookups")
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ca254490 12-Oct-2015 David Ahern <dsa@cumulusnetworks.com>

net: Add VRF support to IPv6 stack

As with IPv4 support for VRFs added to IPv6 stack by replacing hardcoded
table ids with possibly device specific ones and manipulating the oif in
the flowi6. The flow flags are used to skip oif compare in nexthop lookups
if the device is enslaved to a VRF via the L3 master device.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9ef2e965 08-Oct-2015 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: drop frames with attached skb->sk in forwarding

This is a clone of commit 2ab957492d13b ("ip_forward: Drop frames with
attached skb->sk") for ipv6.

This commit has exactly the same reasons as the above mentioned commit,
namely to prevent panics during netfilter reload or a misconfigured stack.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ede2059d 07-Oct-2015 Eric W. Biederman <ebiederm@xmission.com>

dst: Pass net into dst->output

The network namespace is already passed into dst_output pass it into
dst->output lwt->output and friends.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 33224b16 07-Oct-2015 Eric W. Biederman <ebiederm@xmission.com>

ipv4, ipv6: Pass net into ip_local_out and ip6_local_out

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 79288330 07-Oct-2015 Eric W. Biederman <ebiederm@xmission.com>

ipv6: Merge ip6_local_out and ip6_local_out_sk

Stop hidding the sk parameter with an inline helper function and make
all of the callers pass it, so that it is clear what the function is
doing.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 13206b6b 07-Oct-2015 Eric W. Biederman <ebiederm@xmission.com>

net: Pass net into dst_output and remove dst_output_okfn

Replace dst_output_okfn with dst_output

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7d8c6e39 12-Jun-2015 Eric W. Biederman <ebiederm@xmission.com>

ipv6: Pass struct net through ip6_fragment

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# 1c1e9d2b 25-Sep-2015 Eric Dumazet <edumazet@google.com>

ipv6: constify ip6_xmit() sock argument

This is to document that socket lock might not be held at this point.

skb_set_owner_w() and ipv6_local_error() are using proper atomic ops
or spinlocks, so we promote the socket to non const when calling them.

netfilter hooks should never assume socket lock is held,
we also promote the socket to non const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3aef934f 25-Sep-2015 Eric Dumazet <edumazet@google.com>

ipv6: constify ip6_dst_lookup_{flow|tail}() sock arguments

ip6_dst_lookup_flow() and ip6_dst_lookup_tail() do not touch
socket, lets add a const qualifier.

This will permit the same change in inet6_csk_route_req()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1d325d21 16-Sep-2015 Florian Westphal <fw@strlen.de>

ipv6: ip6_fragment: fix headroom tests and skb leak

David Woodhouse reports skb_under_panic when we try to push ethernet
header to fragmented ipv6 skbs:

skbuff: skb_under_panic: text:c1277f1e len:1294 put:14 head:dec98000
data:dec97ffc tail:0xdec9850a end:0xdec98f40 dev:br-lan
[..]
ip6_finish_output2+0x196/0x4da

David further debugged this:
[..] offending fragments were arriving here with skb_headroom(skb)==10.
Which is reasonable, being the Solos ADSL card's header of 8 bytes
followed by 2 bytes of PPP frame type.

The problem is that if netfilter ipv6 defragmentation is used, skb_cow()
in ip6_forward will only see reassembled skb.

Therefore, headroom is overestimated by 8 bytes (we pulled fragment
header) and we don't check the skbs in the frag_list either.

We can't do these checks in netfilter defrag since outdev isn't known yet.

Furthermore, existing tests in ip6_fragment did not consider the fragment
or ipv6 header size when checking headroom of the fraglist skbs.

While at it, also fix a skb leak on memory allocation -- ip6_fragment
must consume the skb.

I tested this e1000 driver hacked to not allocate additional headroom
(we end up in slowpath, since LL_RESERVED_SPACE is 16).

If 2 bytes of headroom are allocated, fastpath is taken (14 byte
ethernet header was pulled, so 16 byte headroom available in all
fragments).

Reported-by: David Woodhouse <dwmw2@infradead.org>
Diagnosed-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# be10de0a 17-Sep-2015 Eric W. Biederman <ebiederm@xmission.com>

netfilter: Add blank lines in callers of netfilter hooks

In code review it was noticed that I had failed to add some blank lines
in places where they are customarily used. Taking a second look at the
code I have to agree blank lines would be nice so I have added them
here.

Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0c4b51f0 15-Sep-2015 Eric W. Biederman <ebiederm@xmission.com>

netfilter: Pass net into okfn

This is immediately motivated by the bridge code that chains functions that
call into netfilter. Without passing net into the okfns the bridge code would
need to guess about the best expression for the network namespace to process
packets in.

As net is frequently one of the first things computed in continuation functions
after netfilter has done it's job passing in the desired network namespace is in
many cases a code simplification.

To support this change the function dst_output_okfn is introduced to
simplify passing dst_output as an okfn. For the moment dst_output_okfn
just silently drops the struct net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 29a26a56 15-Sep-2015 Eric W. Biederman <ebiederm@xmission.com>

netfilter: Pass struct net into the netfilter hooks

Pass a network namespace parameter into the netfilter hooks. At the
call site of the netfilter hooks the path a packet is taking through
the network stack is well known which allows the network namespace to
be easily and reliabily.

This allows the replacement of magic code like
"dev_net(state->in?:state->out)" that appears at the start of most
netfilter hooks with "state->net".

In almost all cases the network namespace passed in is derived
from the first network device passed in, guaranteeing those
paths will not see any changes in practice.

The exceptions are:
xfrm/xfrm_output.c:xfrm_output_resume() xs_net(skb_dst(skb)->xfrm)
ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont() ip_vs_conn_net(cp)
ipvs/ip_vs_xmit.c:ip_vs_send_or_cont() ip_vs_conn_net(cp)
ipv4/raw.c:raw_send_hdrinc() sock_net(sk)
ipv6/ip6_output.c:ip6_xmit() sock_net(sk)
ipv6/ndisc.c:ndisc_send_skb() dev_net(skb->dev) not dev_net(dst->dev)
ipv6/raw.c:raw6_send_hdrinc() sock_net(sk)
br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev

In all cases these exceptions seem to be a better expression for the
network namespace the packet is being processed in then the historic
"dev_net(in?in:out)". I am documenting them in case something odd
pops up and someone starts trying to track down what happened.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 19a0644c 15-Sep-2015 Eric W. Biederman <ebiederm@xmission.com>

ipv6: Cache net in ip6_output

Keep net in a local variable so I can use it in NF_HOOK_COND
when I pass struct net to all of the netfilter hooks.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 78126c41 15-Sep-2015 Eric W. Biederman <ebiederm@xmission.com>

ipv6: Only compute net once in ip6_finish_output2

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5a70649e 15-Sep-2015 Eric W. Biederman <ebiederm@xmission.com>

net: Merge dst_output and dst_output_sk

Add a sock paramter to dst_output making dst_output_sk superfluous.
Add a skb->sk parameter to all of the callers of dst_output
Have the callers of dst_output_sk call dst_output.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 67800f9b 31-Jul-2015 Tom Herbert <tom@herbertland.com>

ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel

We can't call skb_get_hash here since the packet is not complete to do
flow_dissector. Create hash based on flowi6 instead.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 343d60aa 30-Jul-2015 Roopa Prabhu <roopa@cumulusnetworks.com>

ipv6: change ipv6_stub_impl.ipv6_dst_lookup to take net argument

This patch adds net argument to ipv6_stub_impl.ipv6_dst_lookup
for use cases where sk is not available (like mpls).
sk appears to be needed to get the namespace 'net' and is optional
otherwise. This patch series changes ipv6_stub_impl.ipv6_dst_lookup
to take net argument. sk remains optional.

All callers of ipv6_stub_impl.ipv6_dst_lookup have been modified
to pass net. I have modified them to use already available
'net' in the scope of the call. I can change them to
sock_net(sk) to avoid any unintended change in behaviour if sock
namespace is different. They dont seem to be from code inspection.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a0a9f33b 15-Jul-2015 Phil Sutter <phil@nwl.cc>

net/ipv6: update flowi6_oif in ip6_dst_lookup_flow if not set

Newly created flows don't have flowi6_oif set (at least if the
associated socket is not interface-bound). This leads to a mismatch in
__xfrm6_selector_match() for policies which specify an interface in the
selector (sel->ifindex != 0).

Backtracing shows this happens in code-paths originating from e.g.
ip6_datagram_connect(), rawv6_sendmsg() or tcp_v6_connect(). (UDP was
not tested for.)

In summary, this patch fixes policy matching on outgoing interface for
locally generated packets.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 485fca66 21-May-2015 Florian Westphal <fw@strlen.de>

ipv6: don't increase size when refragmenting forwarded ipv6 skbs

since commit 6aafeef03b9d ("netfilter: push reasm skb through instead of
original frag skbs") we will end up sometimes re-fragmenting skbs
that we've reassembled.

ipv6 defrag preserves the original skbs using the skb frag list, i.e. as long
as the skb frag list is preserved there is no problem since we keep
original geometry of fragments intact.

However, in the rare case where the frag list is munged or skb
is linearized, we might send larger fragments than what we originally
received.

A router in the path might then send packet-too-big errors even if
sender never sent fragments exceeding the reported mtu:

mtu 1500 - 1500:1400 - 1400:1280 - 1280
A R1 R2 B

1 - A sends to B, fragment size 1400
2 - R2 sends pkttoobig error for 1280
3 - A sends to B, fragment size 1280
4 - R2 sends pkttoobig error for 1280 again because it sees fragments of size 1400.

make sure ip6_fragment always caps MTU at largest packet size seen
when defragmented skb is forwarded.

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2647a9b0 22-May-2015 Martin KaFai Lau <kafai@fb.com>

ipv6: Remove external dependency on rt6i_gateway and RTF_ANYCAST

When creating a RTF_CACHE route, RTF_ANYCAST is set based on rt6i_dst.
Also, rt6i_gateway is always set to the nexthop while the nexthop
could be a gateway or the rt6i_dst.addr.

After removing the rt6i_dst and rt6i_src dependency in the last patch,
we also need to stop the caller from depending on rt6i_gateway and
RTF_ANYCAST.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fd0273d7 22-May-2015 Martin KaFai Lau <kafai@fb.com>

ipv6: Remove external dependency on rt6i_dst and rt6i_src

This patch removes the assumptions that the returned rt is always
a RTF_CACHE entry with the rt6i_dst and rt6i_src containing the
destination and source address. The dst and src can be recovered from
the calling site.

We may consider to rename (rt6i_dst, rt6i_src) to
(rt6i_key_dst, rt6i_key_src) later.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 286c2349 22-May-2015 Martin KaFai Lau <kafai@fb.com>

ipv6: Clean up ipv6_select_ident() and ip6_fragment()

This patch changes the ipv6_select_ident() signature to return a
fragment id instead of taking a whole frag_hdr as a param to
only set the frag_hdr->identification.

It also cleans up ip6_fragment() to obtain the fragment id at the
beginning instead of using multiple "if" later to check fragment id
has been generated or not.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e87a468e 14-May-2015 Vlad Yasevich <vyasevich@gmail.com>

ipv6: Fix udp checksums with raw sockets

It was reported that trancerout6 would cause
a kernel to crash when trying to compute checksums
on raw UDP packets. The cause was the check in
__ip6_append_data that would attempt to use
partial checksums on the packet. However,
raw sockets do not initialize partial checksum
fields so partial checksums can't be used.

Solve this the same way IPv4 does it. raw sockets
pass transhdrlen value of 0 to ip_append_data which
causes the checksum to be computed in software. Use
the same check in ip6_append_data (check transhdrlen).

Reported-by: Wolfgang Walter <linux@stwm.de>
CC: Wolfgang Walter <linux@stwm.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e16e888b 05-May-2015 Markus Stenberg <markus.stenberg@iki.fi>

ipv6: Fixed source specific default route handling.

If there are only IPv6 source specific default routes present, the
host gets -ENETUNREACH on e.g. connect() because ip6_dst_lookup_tail
calls ip6_route_output first, and given source address any, it fails,
and ip6_route_get_saddr is never called.

The change is to use the ip6_route_get_saddr, even if the initial
ip6_route_output fails, and then doing ip6_route_output _again_ after
we have appropriate source address available.

Note that this is '99% fix' to the problem; a correct fix would be to
do route lookups only within addrconf.c when picking a source address,
and never call ip6_route_output before source address has been
populated.

Signed-off-by: Markus Stenberg <markus.stenberg@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7026b1dd 05-Apr-2015 David Miller <davem@davemloft.net>

netfilter: Pass socket pointer down through okfn().

On the output paths in particular, we have to sometimes deal with two
socket contexts. First, and usually skb->sk, is the local socket that
generated the frame.

And second, is potentially the socket used to control a tunneling
socket, such as one the encapsulates using UDP.

We do not want to disassociate skb->sk when encapsulating in order
to fix this, because that would break socket memory accounting.

The most extreme case where this can cause huge problems is an
AF_PACKET socket transmitting over a vxlan device. We hit code
paths doing checks that assume they are dealing with an ipv4
socket, but are actually operating upon the AF_PACKET one.

Signed-off-by: David S. Miller <davem@davemloft.net>


# f60e5990 01-Apr-2015 hannes@stressinduktion.org <hannes@stressinduktion.org>

ipv6: protect skb->sk accesses from recursive dereference inside the stack

We should not consult skb->sk for output decisions in xmit recursion
levels > 0 in the stack. Otherwise local socket settings could influence
the result of e.g. tunnel encapsulation process.

ipv6 does not conform with this in three places:

1) ip6_fragment: we do consult ipv6_npinfo for frag_size

2) sk_mc_loop in ipv6 uses skb->sk and checks if we should
loop the packet back to the local socket

3) ip6_skb_dst_mtu could query the settings from the user socket and
force a wrong MTU

Furthermore:
In sk_mc_loop we could potentially land in WARN_ON(1) if we use a
PF_PACKET socket ontop of an IPv6-backed vxlan device.

Reuse xmit_recursion as we are currently only interested in protecting
tunnel devices.

Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 53b24b8f 29-Mar-2015 Ian Morris <ipm@chirality.org.uk>

ipv6: coding style: comparison for inequality with NULL

The ipv6 code uses a mixture of coding styles. In some instances check for NULL
pointer is done as x != NULL and sometimes as x. x is preferred according to
checkpatch and this patch makes the code consistent by adopting the latter
form.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 63159f29 29-Mar-2015 Ian Morris <ipm@chirality.org.uk>

ipv6: coding style: comparison for equality with NULL

The ipv6 code uses a mixture of coding styles. In some instances check for NULL
pointer is done as x == NULL and sometimes as !x. !x is preferred according to
checkpatch and this patch makes the code consistent by adopting the latter
form.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5a352dd0 25-Mar-2015 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: hash net ptr into fragmentation bucket selection

As namespaces are sometimes used with overlapping ip address ranges,
we should also use the namespace as input to the hash to select the ip
fragmentation counter bucket.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c29390c6 11-Mar-2015 Eric Dumazet <edumazet@google.com>

xps: must clear sender_cpu before forwarding

John reported that my previous commit added a regression
on his router.

This is because sender_cpu & napi_id share a common location,
so get_xps_queue() can see garbage and perform an out of bound access.

We need to make sure sender_cpu is cleared before doing the transmit,
otherwise any NIC busy poll enabled (skb_mark_napi_id()) can trigger
this bug.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: John <jw@nuclearfallout.net>
Bisected-by: John <jw@nuclearfallout.net>
Fixes: 2bd82484bb4c ("xps: fix xps for stacked devices")
Signed-off-by: David S. Miller <davem@davemloft.net>


# acf8dd0a 02-Mar-2015 Michal Kubeček <mkubecek@suse.cz>

udp: only allow UFO for packets from SOCK_DGRAM sockets

If an over-MTU UDP datagram is sent through a SOCK_RAW socket to a
UFO-capable device, ip_ufo_append_data() sets skb->ip_summed to
CHECKSUM_PARTIAL unconditionally as all GSO code assumes transport layer
checksum is to be computed on segmentation. However, in this case,
skb->csum_start and skb->csum_offset are never set as raw socket
transmit path bypasses udp_send_skb() where they are usually set. As a
result, driver may access invalid memory when trying to calculate the
checksum and store the result (as observed in virtio_net driver).

Moreover, the very idea of modifying the userspace provided UDP header
is IMHO against raw socket semantics (I wasn't able to find a document
clearly stating this or the opposite, though). And while allowing
CHECKSUM_NONE in the UFO case would be more efficient, it would be a bit
too intrusive change just to handle a corner case like this. Therefore
disallowing UFO for packets from SOCK_DGRAM seems to be the best option.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bf250a1f 10-Feb-2015 Vlad Yasevich <vyasevich@gmail.com>

ipv6: Partial checksum only UDP packets

ip6_append_data is used by other protocols and some of them can't
be partially checksummed. Only partially checksum UDP protocol.

Fixes: 32dce968dd987a (ipv6: Allow for partial checksums on non-ufo packets)
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0508c07f 03-Feb-2015 Vlad Yasevich <vyasevich@gmail.com>

ipv6: Select fragment id during UFO segmentation if not set.

If the IPv6 fragment id has not been set and we perform
fragmentation due to UFO, select a new fragment id.
We now consider a fragment id of 0 as unset and if id selection
process returns 0 (after all the pertrubations), we set it to
0x80000000, thus giving us ample space not to create collisions
with the next packet we may have to fragment.

When doing UFO integrity checking, we also select the
fragment id if it has not be set yet. This is stored into
the skb_shinfo() thus allowing UFO to function correclty.

This patch also removes duplicate fragment id generation code
and moves ipv6_select_ident() into the header as it may be
used during GSO.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 32dce968 31-Jan-2015 Vlad Yasevich <vyasevich@gmail.com>

ipv6: Allow for partial checksums on non-ufo packets

Currntly, if we are not doing UFO on the packet, all UDP
packets will start with CHECKSUM_NONE and thus perform full
checksum computations in software even if device support
IPv6 checksum offloading.

Let's start start with CHECKSUM_PARTIAL if the device
supports it and we are sending only a single packet at
or below mtu size.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6422398c 31-Jan-2015 Vlad Yasevich <vyasevich@gmail.com>

ipv6: introduce ipv6_make_skb

This commit is very similar to
commit 1c32c5ad6fac8cee1a77449f5abf211e911ff830
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue Mar 1 02:36:47 2011 +0000

inet: Add ip_make_skb and ip_finish_skb

It adds IPv6 version of the helpers ip6_make_skb and ip6_finish_skb.

The job of ip6_make_skb is to collect messages into an ipv6 packet
and poplulate ipv6 eader. The job of ip6_finish_skb is to transmit
the generated skb. Together they replicated the job of
ip6_push_pending_frames() while also provide the capability to be
called independently. This will be needed to add lockless UDP sendmsg
support.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0bbe84a6 31-Jan-2015 Vlad Yasevich <vyasevich@gmail.com>

ipv6: Append sending data to arbitrary queue

Add the ability to append data to arbitrary queue. This
will be needed later to implement lockless UDP sends.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 366e41d9 31-Jan-2015 Vlad Yasevich <vyasevich@gmail.com>

ipv6: pull cork initialization into its own function.

Pull IPv6 cork initialization into its own function that
can be re-used. IPv6 specific cork data did not have an
explicit data structure. This patch creats eone so that
just ipv6 cork data can be as arguemts. Also, since
IPv6 tries to save the flow label into inet_cork_full
tructure, pass the full cork.

Adjust ip6_cork_release() to take cork data structures.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e5d08d71 23-Nov-2014 Ian Morris <ipm@chirality.org.uk>

ipv6: coding style improvements (remove assignment in if statements)

This change has no functional impact and simply addresses some coding
style issues detected by checkpatch. Specifically this change
adjusts "if" statements which also include the assignment of a
variable.

No changes to the resultant object files result as determined by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cbffccc9 05-Nov-2014 Joe Perches <joe@perches.com>

net; ipv[46] - Remove 2 unnecessary NETDEBUG OOM messages

These messages aren't useful as there's a generic dump_stack()
on OOM.

Neaten the comment and if test above the OOM by separating the
assign in if into an allocation then if test.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f92ee619 16-Sep-2014 Steffen Klassert <steffen.klassert@secunet.com>

xfrm: Generate blackhole routes only from route lookup functions

Currently we genarate a blackhole route route whenever we have
matching policies but can not resolve the states. Here we assume
that dst_output() is called to kill the balckholed packets.
Unfortunately this assumption is not true in all cases, so
it is possible that these packets leave the system unwanted.

We fix this by generating blackhole routes only from the
route lookup functions, here we can guarantee a call to
dst_output() afterwards.

Fixes: 2774c131b1d ("xfrm: Handle blackhole route creation via afinfo.")
Reported-by: Konstantinos Kolelis <k.kolelis@sirrix.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>


# 46cfd725 09-Sep-2014 Florian Westphal <fw@strlen.de>

net: use kfree_skb_list() helper in more places

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4c83acbc 24-Aug-2014 Ian Morris <ipm@chirality.org.uk>

ipv6: White-space cleansing : gaps between function and symbol export

This patch makes no changes to the logic of the code but simply addresses
coding style issues as detected by checkpatch.

Both objdump and diff -w show no differences.

This patch removes some blank lines between the end of a function
definition and the EXPORT_SYMBOL_GPL macro in order to prevent
checkpatch warning that EXPORT_SYMBOL must immediately follow
a function.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 67ba4152 24-Aug-2014 Ian Morris <ipm@chirality.org.uk>

ipv6: White-space cleansing : Line Layouts

This patch makes no changes to the logic of the code but simply addresses
coding style issues as detected by checkpatch.

Both objdump and diff -w show no differences.

A number of items are addressed in this patch:
* Multiple spaces converted to tabs
* Spaces before tabs removed.
* Spaces in pointer typing cleansed (char *)foo etc.
* Remove space after sizeof
* Ensure spacing around comparators such as if statements.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 09c2d251 04-Aug-2014 Willem de Bruijn <willemb@google.com>

net-timestamp: add key to disambiguate concurrent datagrams

Datagrams timestamped on transmission can coexist in the kernel stack
and be reordered in packet scheduling. When reading looped datagrams
from the socket error queue it is not always possible to unique
correlate looped data with original send() call (for application
level retransmits). Even if possible, it may be expensive and complex,
requiring packet inspection.

Introduce a data-independent ID mechanism to associate timestamps with
send calls. Pass an ID alongside the timestamp in field ee_data of
sock_extended_err.

The ID is a simple 32 bit unsigned int that is associated with the
socket and incremented on each send() call for which software tx
timestamp generation is enabled.

The feature is enabled only if SOF_TIMESTAMPING_OPT_ID is set, to
avoid changing ee_data for existing applications that expect it 0.
The counter is reset each time the flag is reenabled. Reenabling
does not change the ID of already submitted data. It is possible
to receive out of order IDs if the timestamp stream is not quiesced
first.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 04ca6973 26-Jul-2014 Eric Dumazet <edumazet@google.com>

ip: make IP identifiers less predictable

In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and
Jedidiah describe ways exploiting linux IP identifier generation to
infer whether two machines are exchanging packets.

With commit 73f156a6e8c1 ("inetpeer: get rid of ip_id_count"), we
changed IP id generation, but this does not really prevent this
side-channel technique.

This patch adds a random amount of perturbation so that IP identifiers
for a given destination [1] are no longer monotonically increasing after
an idle period.

Note that prandom_u32_max(1) returns 0, so if generator is used at most
once per jiffy, this patch inserts no hole in the ID suite and do not
increase collision probability.

This is jiffies based, so in the worst case (HZ=1000), the id can
rollover after ~65 seconds of idle time, which should be fine.

We also change the hash used in __ip_select_ident() to not only hash
on daddr, but also saddr and protocol, so that ICMP probes can not be
used to infer information for other protocols.

For IPv6, adds saddr into the hash as well, but not nexthdr.

If I ping the patched target, we can see ID are now hard to predict.

21:57:11.008086 IP (...)
A > target: ICMP echo request, seq 1, length 64
21:57:11.010752 IP (... id 2081 ...)
target > A: ICMP echo reply, seq 1, length 64

21:57:12.013133 IP (...)
A > target: ICMP echo request, seq 2, length 64
21:57:12.015737 IP (... id 3039 ...)
target > A: ICMP echo reply, seq 2, length 64

21:57:13.016580 IP (...)
A > target: ICMP echo request, seq 3, length 64
21:57:13.019251 IP (... id 3437 ...)
target > A: ICMP echo reply, seq 3, length 64

[1] TCP sessions uses a per flow ID generator not changed by this patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jeffrey Knockel <jeffk@cs.unm.edu>
Reported-by: Jedidiah R. Crandall <crandall@cs.unm.edu>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Hannes Frederic Sowa <hannes@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ac3d2e5a 25-Jul-2014 Li RongQing <roy.qing.li@gmail.com>

ipv6: remove obsolete comment in ip6_append_data()

After 11878b40e[net-timestamp: SOCK_RAW and PING timestamping], this comment
becomes obsolete since the codes check not only UDP socket, but also RAW sock;
and the codes are clear, not need the comments

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 11878b40 14-Jul-2014 Willem de Bruijn <willemb@google.com>

net-timestamp: SOCK_RAW and PING timestamping

Add SO_TIMESTAMPING to sockets of type PF_INET[6]/SOCK_RAW:

Add the necessary sock_tx_timestamp calls to the datapath for RAW
sockets (ping sockets already had these calls).

Fix the IP output path to pass the timestamp flags on the first
fragment also for these sockets. The existing code relies on
transhdrlen != 0 to indicate a first fragment. For these sockets,
that assumption does not hold.

This fixes http://bugzilla.kernel.org/show_bug.cgi?id=77221

Tested SOCK_RAW on IPv4 and IPv6, not PING.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e3f0b86b 11-Jul-2014 Himangi Saraogi <himangi774@gmail.com>

ipv6: Use BUG_ON

The semantic patch that makes this transformation is as follows:

// <smpl>
@@ expression e; @@
-if (e) BUG();
+BUG_ON(e);
// </smpl>

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cb1ce2ef 01-Jul-2014 Tom Herbert <therbert@google.com>

ipv6: Implement automatic flow label generation on transmit

Automatically generate flow labels for IPv6 packets on transmit.
The flow label is computed based on skb_get_hash. The flow label will
only automatically be set when it is zero otherwise (i.e. flow label
manager hasn't set one). This supports the transmit side functionality
of RFC 6438.

Added an IPv6 sysctl auto_flowlabels to enable/disable this behavior
system wide, and added IPV6_AUTOFLOWLABEL socket option to enable this
functionality per socket.

By default, auto flowlabels are disabled to avoid possible conflicts
with flow label manager, however if this feature proves useful we
may want to enable it by default.

It should also be noted that FreeBSD has already implemented automatic
flow labels (including the sysctl and socket option). In FreeBSD,
automatic flow labels default to enabled.

Performance impact:

Running super_netperf with 200 flows for TCP_RR and UDP_RR for
IPv6. Note that in UDP case, __skb_get_hash will be called for
every packet with explains slight regression. In the TCP case
the hash is saved in the socket so there is no regression.

Automatic flow labels disabled:

TCP_RR:
86.53% CPU utilization
127/195/322 90/95/99% latencies
1.40498e+06 tps

UDP_RR:
90.70% CPU utilization
118/168/243 90/95/99% latencies
1.50309e+06 tps

Automatic flow labels enabled:

TCP_RR:
85.90% CPU utilization
128/199/337 90/95/99% latencies
1.40051e+06

UDP_RR
92.61% CPU utilization
115/164/236 90/95/99% latencies
1.4687e+06

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 73f156a6 02-Jun-2014 Eric Dumazet <edumazet@google.com>

inetpeer: get rid of ip_id_count

Ideally, we would need to generate IP ID using a per destination IP
generator.

linux kernels used inet_peer cache for this purpose, but this had a huge
cost on servers disabling MTU discovery.

1) each inet_peer struct consumes 192 bytes

2) inetpeer cache uses a binary tree of inet_peer structs,
with a nominal size of ~66000 elements under load.

3) lookups in this tree are hitting a lot of cache lines, as tree depth
is about 20.

4) If server deals with many tcp flows, we have a high probability of
not finding the inet_peer, allocating a fresh one, inserting it in
the tree with same initial ip_id_count, (cf secure_ip_id())

5) We garbage collect inet_peer aggressively.

IP ID generation do not have to be 'perfect'

Goal is trying to avoid duplicates in a short period of time,
so that reassembly units have a chance to complete reassembly of
fragments belonging to one message before receiving other fragments
with a recycled ID.

We simply use an array of generators, and a Jenkin hash using the dst IP
as a key.

ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
belongs (it is only used from this file)

secure_ip_id() and secure_ipv6_id() no longer are needed.

Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
unnecessary decrement/increment of the number of segments.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3a1cebe7 11-May-2014 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: fix calculation of option len in ip6_append_data

tot_len does specify the size of struct ipv6_txoptions. We need opt_flen +
opt_nflen to calculate the overall length of additional ipv6 extensions.

I found this while auditing the ipv6 output path for a memory corruption
reported by Alexey Preobrazhensky while he fuzzed an instrumented
AddressSanitizer kernel with trinity. This may or may not be the cause
of the original bug.

Fixes: 4df98e76cde7c6 ("ipv6: pmtudisc setting not respected with UFO/CORK")
Reported-by: Alexey Preobrazhensky <preobr@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 60ff7467 04-May-2014 WANG Cong <xiyou.wangcong@gmail.com>

net: rename local_df to ignore_df

As suggested by several people, rename local_df to ignore_df,
since it means "ignore df bit if it is set".

Cc: Maciej Żenczykowski <maze@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 418a3156 04-May-2014 Florian Westphal <fw@strlen.de>

net: ipv6: send pkttoobig immediately if orig frag size > mtu

If conntrack defragments incoming ipv6 frags it stores largest original
frag size in ip6cb and sets ->local_df.

We must thus first test the largest original frag size vs. mtu, and not
vice versa.

Without this patch PKTTOOBIG is still generated in ip6_fragment() later
in the stack, but

1) IPSTATS_MIB_INTOOBIGERRORS won't increment
2) packet did (needlessly) traverse netfilter postrouting hook.

Fixes: fe6cc55f3a9 ("net: ip, ipv6: handle gso skbs in forwarding path")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# aad88724 15-Apr-2014 Eric Dumazet <edumazet@google.com>

ipv4: add a sock pointer to dst->output() path.

In the dst->output() path for ipv4, the code assumes the skb it has to
transmit is attached to an inet socket, specifically via
ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the
provider of the packet is an AF_PACKET socket.

The dst->output() method gets an additional 'struct sock *sk'
parameter. This needs a cascade of changes so that this parameter can
be propagated from vxlan to final consumer.

Fixes: 8f646c922d55 ("vxlan: keep original skb ownership")
Reported-by: lucien xin <lucien.xin@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 43a43b60 31-Mar-2014 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: some ipv6 statistic counters failed to disable bh

After commit c15b1ccadb323ea ("ipv6: move DAD and addrconf_verify
processing to workqueue") some counters are now updated in process context
and thus need to disable bh before doing so, otherwise deadlocks can
happen on 32-bit archs. Fabio Estevam noticed this while while mounting
a NFS volume on an ARM board.

As a compensation for missing this I looked after the other *_STATS_BH
and found three other calls which need updating:

1) icmp6_send: ip6_fragment -> icmpv6_send -> icmp6_send (error handling)
2) ip6_push_pending_frames: rawv6_sendmsg -> rawv6_push_pending_frames -> ...
(only in case of icmp protocol with raw sockets in error handling)
3) ping6_v6_sendmsg (error handling)

Fixes: c15b1ccadb323ea ("ipv6: move DAD and addrconf_verify processing to workqueue")
Reported-by: Fabio Estevam <festevam@gmail.com>
Tested-by: Fabio Estevam <fabio.estevam@freescale.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e367c2d0 16-Mar-2014 lucien <lucien.xin@gmail.com>

ipv6: ip6_append_data_mtu do not handle the mtu of the second fragment properly

In ip6_append_data_mtu(), when the xfrm mode is not tunnel(such as
transport),the ipsec header need to be added in the first fragment, so the mtu
will decrease to reserve space for it, then the second fragment come, the mtu
should be turn back, as the commit 0c1833797a5a6ec23ea9261d979aa18078720b74
said. however, in the commit a493e60ac4bbe2e977e7129d6d8cbb0dd236be, it use
*mtu = min(*mtu, ...) to change the mtu, which lead to the new mtu is alway
equal with the first fragment's. and cannot turn back.

when I test through ping6 -c1 -s5000 $ip (mtu=1280):
...frag (0|1232) ESP(spi=0x00002000,seq=0xb), length 1232
...frag (1232|1216)
...frag (2448|1216)
...frag (3664|1216)
...frag (4880|164)

which should be:
...frag (0|1232) ESP(spi=0x00001000,seq=0x1), length 1232
...frag (1232|1232)
...frag (2464|1232)
...frag (3696|1232)
...frag (4928|116)

so delete the min() when change back the mtu.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Fixes: 75a493e60ac4bb ("ipv6: ip6_append_data_mtu did not care about pmtudisc and frag_size")
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 090f1166 10-Mar-2014 Li RongQing <roy.qing.li@gmail.com>

ipv6: ip6_forward: perform skb->pkt_type check at the beginning

Packets which have L2 address different from ours should be
already filtered before entering into ip6_forward().

Perform that check at the beginning to avoid processing such packets.

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0b95227a 25-Feb-2014 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: yet another new IPV6_MTU_DISCOVER option IPV6_PMTUDISC_OMIT

This option has the same semantic as IP_PMTUDISC_OMIT for IPv4 which
got recently introduced. It doesn't honor the path mtu discovered by the
host but in contrary to IPV6_PMTUDISC_INTERFACE allows the generation of
fragments if the packet size exceeds the MTU of the outgoing interface
MTU.

Fixes: 93b36cf3425b9b ("ipv6: support IPV6_PMTU_INTERFACE on sockets")
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 478b360a 15-Feb-2014 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: fix nf_trace always-on with XT_TRACE=n

When using nftables with CONFIG_NETFILTER_XT_TARGET_TRACE=n, we get
lots of "TRACE: filter:output:policy:1 IN=..." warnings as several
places will leave skb->nf_trace uninitialised.

Unlike iptables tracing functionality is not conditional in nftables,
so always copy/zero nf_trace setting when nftables is enabled.

Move this into __nf_copy() helper.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# fe6cc55f 13-Feb-2014 Florian Westphal <fw@strlen.de>

net: ip, ipv6: handle gso skbs in forwarding path

Marcelo Ricardo Leitner reported problems when the forwarding link path
has a lower mtu than the incoming one if the inbound interface supports GRO.

Given:
Host <mtu1500> R1 <mtu1200> R2

Host sends tcp stream which is routed via R1 and R2. R1 performs GRO.

In this case, the kernel will fail to send ICMP fragmentation needed
messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu
checks in forward path. Instead, Linux tries to send out packets exceeding
the mtu.

When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does
not fragment the packets when forwarding, and again tries to send out
packets exceeding R1-R2 link mtu.

This alters the forwarding dstmtu checks to take the individual gso
segment lengths into account.

For ipv6, we send out pkt too big error for gso if the individual
segments are too big.

For ipv4, we either send icmp fragmentation needed, or, if the DF bit
is not set, perform software segmentation and let the output path
create fragments when the packet is leaving the machine.
It is not 100% correct as the error message will contain the headers of
the GRO skb instead of the original/segmented one, but it seems to
work fine in my (limited) tests.

Eric Dumazet suggested to simply shrink mss via ->gso_size to avoid
sofware segmentation.

However it turns out that skb_segment() assumes skb nr_frags is related
to mss size so we would BUG there. I don't want to mess with it considering
Herbert and Eric disagree on what the correct behavior should be.

Hannes Frederic Sowa notes that when we would shrink gso_size
skb_segment would then also need to deal with the case where
SKB_MAX_FRAGS would be exceeded.

This uses sofware segmentation in the forward path when we hit ipv4
non-DF packets and the outgoing link mtu is too small. Its not perfect,
but given the lack of bug reports wrt. GRO fwd being broken this is a
rare case anyway. Also its not like this could not be improved later
once the dust settles.

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Reported-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0954cf9c 09-Jan-2014 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: introduce ip6_dst_mtu_forward and protect forwarding path with it

In the IPv6 forwarding path we are only concerend about the outgoing
interface MTU, but also respect locked MTUs on routes. Tunnel provider
or IPSEC already have to recheck and if needed send PtB notifications
to the sending host in case the data does not fit into the packet with
added headers (we only know the final header sizes there, while also
using path MTU information).

The reason for this change is, that path MTU information can be injected
into the kernel via e.g. icmp_err protocol handler without verification
of local sockets. As such, this could cause the IPv6 forwarding path to
wrongfully emit Packet-too-Big errors and drop IPv6 packets.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: John Heffner <johnwheffner@gmail.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4df98e76 15-Dec-2013 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: pmtudisc setting not respected with UFO/CORK

Sockets marked with IPV6_PMTUDISC_PROBE (or later IPV6_PMTUDISC_INTERFACE)
don't respect this setting when the outgoing interface supports UFO.

We had the same problem in IPv4, which was fixed in commit
daba287b299ec7a2c61ae3a714920e90e8396ad5 ("ipv4: fix DO and PROBE pmtu
mode regarding local fragmentation with UFO/CORK").

Also IPV6_DONTFRAG mode did not care about already corked data, thus
it may generate a fragmented frame even if this socket option was
specified. It also did not care about the length of the ipv6 header and
possible options.

In the error path allow the user to receive the pmtu notifications via
both, rxpmtu method or error queue. The user may opted in for both,
so deliver the notification to both error handlers (the handlers check
if the error needs to be enqueued).

Also report back consistent pmtu values when sending on an already
cork-appended socket.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 93b36cf3 14-Dec-2013 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: support IPV6_PMTU_INTERFACE on sockets

IPV6_PMTU_INTERFACE is the same as IPV6_PMTU_PROBE for ipv6. Add it
nontheless for symmetry with IPv4 sockets. Also drop incoming MTU
information if this mode is enabled.

The additional bit in ipv6_pinfo just eats in the padding behind the
bitfield. There are no changes to the layout of the struct at all.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 15c77d8b 05-Dec-2013 Eric Dumazet <edumazet@google.com>

ipv6: consistent use of IP6_INC_STATS_BH() in ip6_forward()

ip6_forward() runs from softirq context, we can use the SNMP macros
assuming this.

Use same indentation for all IP6_INC_STATS_BH() calls.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0e0d44ab 28-Aug-2013 Steffen Klassert <steffen.klassert@secunet.com>

net: Remove FLOWI_FLAG_CAN_SLEEP

FLOWI_FLAG_CAN_SLEEP was used to notify xfrm about the posibility
to sleep until the needed states are resolved. This code is gone,
so FLOWI_FLAG_CAN_SLEEP is not needed anymore.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>


# 7f88c6b2 28-Nov-2013 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: fix possible seqlock deadlock in ip6_finish_output2

IPv6 stats are 64 bits and thus are protected with a seqlock. By not
disabling bottom-half we could deadlock here if we don't disable bh and
a softirq reentrantly updates the same mib.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9037c357 06-Nov-2013 Jiri Pirko <jiri@resnulli.us>

ip6_output: fragment outgoing reassembled skb properly

If reassembled packet would fit into outdev MTU, it is not fragmented
according the original frag size and it is send as single big packet.

The second case is if skb is gso. In that case fragmentation does not happen
according to the original frag size.

This patch fixes these.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5ac68e7c 07-Oct-2013 John Stultz <john.stultz@linaro.org>

ipv6: Fix possible ipv6 seqlock deadlock

While enabling lockdep on seqlocks, I ran across the warning below
caused by the ipv6 stats being updated in both irq and non-irq context.

This patch changes from IP6_INC_STATS_BH to IP6_INC_STATS (suggested
by Eric Dumazet) to resolve this problem.

[ 11.120383] =================================
[ 11.121024] [ INFO: inconsistent lock state ]
[ 11.121663] 3.12.0-rc1+ #68 Not tainted
[ 11.122229] ---------------------------------
[ 11.122867] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[ 11.123741] init/4483 [HC0[0]:SC1[3]:HE1:SE0] takes:
[ 11.124505] (&stats->syncp.seq#6){+.?...}, at: [<c1ab80c2>] ndisc_send_ns+0xe2/0x130
[ 11.125736] {SOFTIRQ-ON-W} state was registered at:
[ 11.126447] [<c10e0eb7>] __lock_acquire+0x5c7/0x1af0
[ 11.127222] [<c10e2996>] lock_acquire+0x96/0xd0
[ 11.127925] [<c1a9a2c3>] write_seqcount_begin+0x33/0x40
[ 11.128766] [<c1a9aa03>] ip6_dst_lookup_tail+0x3a3/0x460
[ 11.129582] [<c1a9e0ce>] ip6_dst_lookup_flow+0x2e/0x80
[ 11.130014] [<c1ad18e0>] ip6_datagram_connect+0x150/0x4e0
[ 11.130014] [<c1a4d0b5>] inet_dgram_connect+0x25/0x70
[ 11.130014] [<c198dd61>] SYSC_connect+0xa1/0xc0
[ 11.130014] [<c198f571>] SyS_connect+0x11/0x20
[ 11.130014] [<c198fe6b>] SyS_socketcall+0x12b/0x300
[ 11.130014] [<c1bbf880>] syscall_call+0x7/0xb
[ 11.130014] irq event stamp: 1184
[ 11.130014] hardirqs last enabled at (1184): [<c1086901>] local_bh_enable+0x71/0x110
[ 11.130014] hardirqs last disabled at (1183): [<c10868cd>] local_bh_enable+0x3d/0x110
[ 11.130014] softirqs last enabled at (0): [<c108014d>] copy_process.part.42+0x45d/0x11a0
[ 11.130014] softirqs last disabled at (1147): [<c1086e05>] irq_exit+0xa5/0xb0
[ 11.130014]
[ 11.130014] other info that might help us debug this:
[ 11.130014] Possible unsafe locking scenario:
[ 11.130014]
[ 11.130014] CPU0
[ 11.130014] ----
[ 11.130014] lock(&stats->syncp.seq#6);
[ 11.130014] <Interrupt>
[ 11.130014] lock(&stats->syncp.seq#6);
[ 11.130014]
[ 11.130014] *** DEADLOCK ***
[ 11.130014]
[ 11.130014] 3 locks held by init/4483:
[ 11.130014] #0: (rcu_read_lock){.+.+..}, at: [<c109363c>] SyS_setpriority+0x4c/0x620
[ 11.130014] #1: (((&ifa->dad_timer))){+.-...}, at: [<c108c1c0>] call_timer_fn+0x0/0xf0
[ 11.130014] #2: (rcu_read_lock){.+.+..}, at: [<c1ab6494>] ndisc_send_skb+0x54/0x5d0
[ 11.130014]
[ 11.130014] stack backtrace:
[ 11.130014] CPU: 0 PID: 4483 Comm: init Not tainted 3.12.0-rc1+ #68
[ 11.130014] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 11.130014] 00000000 00000000 c55e5c10 c1bb0e71 c57128b0 c55e5c4c c1badf79 c1ec1123
[ 11.130014] c1ec1484 00001183 00000000 00000000 00000001 00000003 00000001 00000000
[ 11.130014] c1ec1484 00000004 c5712dcc 00000000 c55e5c84 c10de492 00000004 c10755f2
[ 11.130014] Call Trace:
[ 11.130014] [<c1bb0e71>] dump_stack+0x4b/0x66
[ 11.130014] [<c1badf79>] print_usage_bug+0x1d3/0x1dd
[ 11.130014] [<c10de492>] mark_lock+0x282/0x2f0
[ 11.130014] [<c10755f2>] ? kvm_clock_read+0x22/0x30
[ 11.130014] [<c10dd8b0>] ? check_usage_backwards+0x150/0x150
[ 11.130014] [<c10e0e74>] __lock_acquire+0x584/0x1af0
[ 11.130014] [<c10b1baf>] ? sched_clock_cpu+0xef/0x190
[ 11.130014] [<c10de58c>] ? mark_held_locks+0x8c/0xf0
[ 11.130014] [<c10e2996>] lock_acquire+0x96/0xd0
[ 11.130014] [<c1ab80c2>] ? ndisc_send_ns+0xe2/0x130
[ 11.130014] [<c1ab66d3>] ndisc_send_skb+0x293/0x5d0
[ 11.130014] [<c1ab80c2>] ? ndisc_send_ns+0xe2/0x130
[ 11.130014] [<c1ab80c2>] ndisc_send_ns+0xe2/0x130
[ 11.130014] [<c108cc32>] ? mod_timer+0xf2/0x160
[ 11.130014] [<c1aa706e>] ? addrconf_dad_timer+0xce/0x150
[ 11.130014] [<c1aa70aa>] addrconf_dad_timer+0x10a/0x150
[ 11.130014] [<c1aa6fa0>] ? addrconf_dad_completed+0x1c0/0x1c0
[ 11.130014] [<c108c233>] call_timer_fn+0x73/0xf0
[ 11.130014] [<c108c1c0>] ? __internal_add_timer+0xb0/0xb0
[ 11.130014] [<c1aa6fa0>] ? addrconf_dad_completed+0x1c0/0x1c0
[ 11.130014] [<c108c5b1>] run_timer_softirq+0x141/0x1e0
[ 11.130014] [<c1086b20>] ? __do_softirq+0x70/0x1b0
[ 11.130014] [<c1086b70>] __do_softirq+0xc0/0x1b0
[ 11.130014] [<c1086e05>] irq_exit+0xa5/0xb0
[ 11.130014] [<c106cfd5>] smp_apic_timer_interrupt+0x35/0x50
[ 11.130014] [<c1bbfbca>] apic_timer_interrupt+0x32/0x38
[ 11.130014] [<c10936ed>] ? SyS_setpriority+0xfd/0x620
[ 11.130014] [<c10e26c9>] ? lock_release+0x9/0x240
[ 11.130014] [<c10936d7>] ? SyS_setpriority+0xe7/0x620
[ 11.130014] [<c1bbee6d>] ? _raw_read_unlock+0x1d/0x30
[ 11.130014] [<c1093701>] SyS_setpriority+0x111/0x620
[ 11.130014] [<c109363c>] ? SyS_setpriority+0x4c/0x620
[ 11.130014] [<c1bbf880>] syscall_call+0x7/0xb

Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/1381186321-4906-5-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 550bab42 20-Oct-2013 Julian Anastasov <ja@ssi.bg>

ipv6: fill rt6i_gateway with nexthop address

Make sure rt6i_gateway contains nexthop information in
all routes returned from lookup or when routes are directly
attached to skb for generated ICMP packets.

The effect of this patch should be a faster version of
rt6_nexthop() and the consideration of local addresses as
nexthop.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c547dbf5 18-Oct-2013 Jiri Pirko <jiri@resnulli.us>

ip6_output: do skb ufo init for peeked non ufo skb as well

Now, if user application does:
sendto len<mtu flag MSG_MORE
sendto len>mtu flag 0
The skb is not treated as fragmented one because it is not initialized
that way. So move the initialization to fix this.

introduced by:
commit e89e9cf539a28df7d0eb1d0a545368e9920b34ac "[IPv4/IPv6]: UFO Scatter-gather approach"

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2811ebac 20-Sep-2013 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: udp packets following an UFO enqueued packet need also be handled by UFO

In the following scenario the socket is corked:
If the first UDP packet is larger then the mtu we try to append it to the
write queue via ip6_ufo_append_data. A following packet, which is smaller
than the mtu would be appended to the already queued up gso-skb via
plain ip6_append_data. This causes random memory corruptions.

In ip6_ufo_append_data we also have to be careful to not queue up the
same skb multiple times. So setup the gso frame only when no first skb
is available.

This also fixes a shortcoming where we add the current packet's length to
cork->length but return early because of a packet > mtu with dontfrag set
(instead of sutracting it again).

Found with trinity.

Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 788787b5 30-Aug-2013 Cong Wang <amwang@redhat.com>

ipv6: move ip6_local_out into core kernel

It will be used the vxlan kernel module.

Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9c9c9ad5 25-Aug-2013 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: set skb->protocol on tcp, raw and ip6_append_data genereated skbs

Currently we don't initialize skb->protocol when transmitting data via
tcp, raw(with and without inclhdr) or udp+ufo or appending data directly
to the socket transmit queue (via ip6_append_data). This needs to be
done so that we can get the correct mtu in the xfrm layer.

Setting of skb->protocol happens only in functions where we also have
a transmitting socket and a new skb, so we don't overwrite old values.

Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>


# 75a493e6 02-Jul-2013 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: ip6_append_data_mtu did not care about pmtudisc and frag_size

If the socket had an IPV6_MTU value set, ip6_append_data_mtu lost track
of this when appending the second frame on a corked socket. This results
in the following splat:

[37598.993962] ------------[ cut here ]------------
[37598.994008] kernel BUG at net/core/skbuff.c:2064!
[37598.994008] invalid opcode: 0000 [#1] SMP
[37598.994008] Modules linked in: tcp_lp uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev media vfat fat usb_storage fuse ebtable_nat xt_CHECKSUM bridge stp llc ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat
+nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi
+scsi_transport_iscsi rfcomm bnep iTCO_wdt iTCO_vendor_support snd_hda_codec_conexant arc4 iwldvm mac80211 snd_hda_intel acpi_cpufreq mperf coretemp snd_hda_codec microcode cdc_wdm cdc_acm
[37598.994008] snd_hwdep cdc_ether snd_seq snd_seq_device usbnet mii joydev btusb snd_pcm bluetooth i2c_i801 e1000e lpc_ich mfd_core ptp iwlwifi pps_core snd_page_alloc mei cfg80211 snd_timer thinkpad_acpi snd tpm_tis soundcore rfkill tpm tpm_bios vhost_net tun macvtap macvlan kvm_intel kvm uinput binfmt_misc
+dm_crypt i915 i2c_algo_bit drm_kms_helper drm i2c_core wmi video
[37598.994008] CPU 0
[37598.994008] Pid: 27320, comm: t2 Not tainted 3.9.6-200.fc18.x86_64 #1 LENOVO 27744PG/27744PG
[37598.994008] RIP: 0010:[<ffffffff815443a5>] [<ffffffff815443a5>] skb_copy_and_csum_bits+0x325/0x330
[37598.994008] RSP: 0018:ffff88003670da18 EFLAGS: 00010202
[37598.994008] RAX: ffff88018105c018 RBX: 0000000000000004 RCX: 00000000000006c0
[37598.994008] RDX: ffff88018105a6c0 RSI: ffff88018105a000 RDI: ffff8801e1b0aa00
[37598.994008] RBP: ffff88003670da78 R08: 0000000000000000 R09: ffff88018105c040
[37598.994008] R10: ffff8801e1b0aa00 R11: 0000000000000000 R12: 000000000000fff8
[37598.994008] R13: 00000000000004fc R14: 00000000ffff0504 R15: 0000000000000000
[37598.994008] FS: 00007f28eea59740(0000) GS:ffff88023bc00000(0000) knlGS:0000000000000000
[37598.994008] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[37598.994008] CR2: 0000003d935789e0 CR3: 00000000365cb000 CR4: 00000000000407f0
[37598.994008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[37598.994008] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[37598.994008] Process t2 (pid: 27320, threadinfo ffff88003670c000, task ffff88022c162ee0)
[37598.994008] Stack:
[37598.994008] ffff88022e098a00 ffff88020f973fc0 0000000000000008 00000000000004c8
[37598.994008] ffff88020f973fc0 00000000000004c4 ffff88003670da78 ffff8801e1b0a200
[37598.994008] 0000000000000018 00000000000004c8 ffff88020f973fc0 00000000000004c4
[37598.994008] Call Trace:
[37598.994008] [<ffffffff815fc21f>] ip6_append_data+0xccf/0xfe0
[37598.994008] [<ffffffff8158d9f0>] ? ip_copy_metadata+0x1a0/0x1a0
[37598.994008] [<ffffffff81661f66>] ? _raw_spin_lock_bh+0x16/0x40
[37598.994008] [<ffffffff8161548d>] udpv6_sendmsg+0x1ed/0xc10
[37598.994008] [<ffffffff812a2845>] ? sock_has_perm+0x75/0x90
[37598.994008] [<ffffffff815c3693>] inet_sendmsg+0x63/0xb0
[37598.994008] [<ffffffff812a2973>] ? selinux_socket_sendmsg+0x23/0x30
[37598.994008] [<ffffffff8153a450>] sock_sendmsg+0xb0/0xe0
[37598.994008] [<ffffffff810135d1>] ? __switch_to+0x181/0x4a0
[37598.994008] [<ffffffff8153d97d>] sys_sendto+0x12d/0x180
[37598.994008] [<ffffffff810dfb64>] ? __audit_syscall_entry+0x94/0xf0
[37598.994008] [<ffffffff81020ed1>] ? syscall_trace_enter+0x231/0x240
[37598.994008] [<ffffffff8166a7e7>] tracesys+0xdd/0xe2
[37598.994008] Code: fe 07 00 00 48 c7 c7 04 28 a6 81 89 45 a0 4c 89 4d b8 44 89 5d a8 e8 1b ac b1 ff 44 8b 5d a8 4c 8b 4d b8 8b 45 a0 e9 cf fe ff ff <0f> 0b 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 48
[37598.994008] RIP [<ffffffff815443a5>] skb_copy_and_csum_bits+0x325/0x330
[37598.994008] RSP <ffff88003670da18>
[37599.007323] ---[ end trace d69f6a17f8ac8eee ]---

While there, also check if path mtu discovery is activated for this
socket. The logic was adapted from ip6_append_data when first writing
on the corked socket.

This bug was introduced with commit
0c1833797a5a6ec23ea9261d979aa18078720b74 ("ipv6: fix incorrect ipsec
fragment").

v2:
a) Replace IPV6_PMTU_DISC_DO with IPV6_PMTUDISC_PROBE.
b) Don't pass ipv6_pinfo to ip6_append_data_mtu (suggestion by Gao
feng, thanks!).
c) Change mtu to unsigned int, else we get a warning about
non-matching types because of the min()-macro type-check.

Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a963a37d 26-Jun-2013 Eric Dumazet <edumazet@google.com>

ipv6: ip6_sk_dst_check() must not assume ipv6 dst

It's possible to use AF_INET6 sockets and to connect to an IPv4
destination. After this, socket dst cache is a pointer to a rtable,
not rt6_info.

ip6_sk_dst_check() should check the socket dst cache is IPv6, or else
various corruptions/crashes can happen.

Dave Jones can reproduce immediate crash with
trinity -q -l off -n -c sendmsg -c connect

With help from Hannes Frederic Sowa

Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ab4eb353 21-Jun-2013 YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org>

ipv6: Process unicast packet with Router Alert by checking flag in skb.

Router Alert option is marked in skb.
Previously, IP6CB(skb)->ra was set to positive value for such packets.
Since commit dd3332bf ("ipv6: Store Router Alert option in IP6CB
directly."), IP6SKB_ROUTERALERT is set in IP6CB(skb)->flags, and
the value of Router Alert option (in network byte order) is set
to IP6CB(skb)->ra for such packets.

Multicast forwarding path uses that flag and value, but unicast
forwarding path does not use the flag and misuses IP6CB(skb)->ra
value.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 284041ef 16-May-2013 Eric Dumazet <edumazet@google.com>

ipv6: fix possible crashes in ip6_cork_release()

commit 0178b695fd6b4 ("ipv6: Copy cork options in ip6_append_data")
added some code duplication and bad error recovery, leading to potential
crash in ip6_cork_release() as kfree() could be called with garbage.

use kzalloc() to make sure this wont happen.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Neal Cardwell <ncardwell@google.com>


# bf84a010 14-Apr-2013 Daniel Borkmann <daniel@iogearbox.net>

net: sock: make sock_tx_timestamp void

Currently, sock_tx_timestamp() always returns 0. The comment that
describes the sock_tx_timestamp() function wrongly says that it
returns an error when an invalid argument is passed (from commit
20d4947353be, ``net: socket infrastructure for SO_TIMESTAMPING'').
Make the function void, so that we can also remove all the unneeded
if conditions that check for such a _non-existant_ error case in the
output path.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# dd408515 09-Feb-2013 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: don't let node/interface scoped multicast traffic escape on the wire

Reported-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Erik Hugne <erik.hugne@ericsson.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f4e53e29 04-Feb-2013 Steffen Klassert <steffen.klassert@secunet.com>

ipv6: Don't send packet to big messages to self

Calling icmpv6_send() on a local message size error leads to an
incorrect update of the path mtu in the case when IPsec is used.
So use ipv6_local_error() instead to notify the socket about the
error.

Reported-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9647bb80 22-Jan-2013 Cong Wang <amwang@redhat.com>

ipv6: remove duplicated declaration of ip6_fragment()

It is declared in:
include/net/ip6_route.h:187:int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *));

and net/ip6_route.h is already included.

Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2576f17d 20-Jan-2013 YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org>

ipv6: Unshare ip6_nd_hdr() and change return type to void.

- move ip6_nd_hdr() to its users' source files.
In net/ipv6/mcast.c, it will be called ip6_mc_hdr().
- make return type to void since this function never fails.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6fd6ce20 16-Jan-2013 YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org>

ipv6: Do not depend on rt->n in ip6_finish_output2().

If neigh is not found, create new one.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 707be1ff 16-Jan-2013 YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org>

ipv6: Do not depend on rt->n in ip6_dst_lookup_tail().

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7efdba5b 15-Jan-2013 Romain KUNTZ <r.kuntz@ipflavors.com>

ipv6: fix header length calculation in ip6_append_data()

Commit 299b0767 (ipv6: Fix IPsec slowpath fragmentation problem)
has introduced a error in the header length calculation that
provokes corrupted packets when non-fragmentable extensions
headers (Destination Option or Routing Header Type 2) are used.

rt->rt6i_nfheader_len is the length of the non-fragmentable
extension header, and it should be substracted to
rt->dst.header_len, and not to exthdrlen, as it was done before
commit 299b0767.

This patch reverts to the original and correct behavior. It has
been successfully tested with and without IPsec on packets
that include non-fragmentable extensions headers.

Signed-off-by: Romain Kuntz <r.kuntz@ipflavors.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3e4e4c1f 12-Jan-2013 YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org>

ipv6: Introduce ip6_flow_hdr() to fill version, tclass and flowlabel.

This is not only for readability but also for optimization.
What we do here is to build the 32bit word at the beginning of the ipv6
header (the "ip6_flow" virtual member of struct ip6_hdr in RFC3542) and
we do not need to read the tclass portion of the target buffer.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3c73a036 15-Nov-2012 Vlad Yasevich <vyasevic@redhat.com>

ipv6: Update ipv6 static library with newly needed functions

UDP offload needs some additional functions to be in the static kernel
for it work correclty. Move those functions into the core.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 94e187c0 28-Oct-2012 Amerigo Wang <amwang@redhat.com>

ipv6: introduce ip6_rt_put()

As suggested by Eric, we could introduce a helper function
for ipv6 too, to avoid checking if rt is NULL before
dst_release().

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 07a93626 29-Oct-2012 Amerigo Wang <amwang@redhat.com>

ipv6: use IS_ENABLED()

#if defined(CONFIG_FOO) || defined(CONFIG_FOO_MODULE)

can be replaced by

#if IS_ENABLED(CONFIG_FOO)

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5640f768 23-Sep-2012 Eric Dumazet <edumazet@google.com>

net: use a per task frag allocator

We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.

This page is used to build fragments for skbs.

Its done to increase probability of coalescing small write() into
single segments in skbs still in write queue (not yet sent)

But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page

Its also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
page allocator more than wanted.

This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.

(up to 32768 bytes per frag, thats order-3 pages on x86)

This increases TCP stream performance by 20% on loopback device,
but also benefits on other network devices, since 8x less frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.

Its possible some SG enabled hardware cant cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment in sub fragments, since some arches have PAGE_SIZE=65536

Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fdd6681d 09-Sep-2012 Amerigo Wang <amwang@redhat.com>

ipv6: remove some useless RCU read lock

After this commit:
commit 97cac0821af4474ec4ba3a9e7a36b98ed9b6db88
Author: David S. Miller <davem@davemloft.net>
Date: Mon Jul 2 22:43:47 2012 -0700

ipv6: Store route neighbour in rt6_info struct.

we no longer use RCU to protect route neighbour.

Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4cdd3408 26-Aug-2012 Patrick McHardy <kaber@trash.net>

netfilter: nf_conntrack_ipv6: improve fragmentation handling

The IPv6 conntrack fragmentation currently has a couple of shortcomings.
Fragmentes are collected in PREROUTING/OUTPUT, are defragmented, the
defragmented packet is then passed to conntrack, the resulting conntrack
information is attached to each original fragment and the fragments then
continue their way through the stack.

Helper invocation occurs in the POSTROUTING hook, at which point only
the original fragments are available. The result of this is that
fragmented packets are never passed to helpers.

This patch improves the situation in the following way:

- If a reassembled packet belongs to a connection that has a helper
assigned, the reassembled packet is passed through the stack instead
of the original fragments.

- During defragmentation, the largest received fragment size is stored.
On output, the packet is refragmented if required. If the largest
received fragment size exceeds the outgoing MTU, a "packet too big"
message is generated, thus behaving as if the original fragments
were passed through the stack from an outside point of view.

- The ipv6_helper() hook function can't receive fragments anymore for
connections using a helper, so it is switched to use ipv6_skip_exthdr()
instead of the netfilter specific nf_ct_ipv6_skip_exthdr() and the
reassembled packets are passed to connection tracking helpers.

The result of this is that we can properly track fragmented packets, but
still generate ICMPv6 Packet too big messages if we would have before.

This patch is also required as a precondition for IPv6 NAT, where NAT
helpers might enlarge packets up to a point that they require
fragmentation. In that case we can't generate Packet too big messages
since the proper MTU can't be calculated in all cases (f.i. when
changing textual representation of a variable amount of addresses),
so the packet is transparently fragmented iff the original packet or
fragments would have fit the outgoing MTU.

IPVS parts by Jesper Dangaard Brouer <brouer@redhat.com>.

Signed-off-by: Patrick McHardy <kaber@trash.net>


# 1d861aa4 10-Jul-2012 David S. Miller <davem@davemloft.net>

inet: Minimize use of cached route inetpeer.

Only use it in the absolutely required cases:

1) COW'ing metrics

2) ipv4 PMTU

3) ipv4 redirects

Signed-off-by: David S. Miller <davem@davemloft.net>


# c56bf6fe 06-Jul-2012 Eric Dumazet <edumazet@google.com>

ipv6: fix a bad cast in ip6_dst_lookup_tail()

Fix a bug in ip6_dst_lookup_tail(), where typeof(dst) is
"struct dst_entry **", not "struct dst_entry *"

Reported-by: Fengguang Wu <wfg@linux.intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 97cac082 02-Jul-2012 David S. Miller <davem@davemloft.net>

ipv6: Store route neighbour in rt6_info struct.

This makes for a simplified conversion away from dst_get_neighbour*().

All code outside of ipv6 will use neigh lookups via dst_neigh_lookup*().

Signed-off-by: David S. Miller <davem@davemloft.net>


# 5110effe 02-Jul-2012 David S. Miller <davem@davemloft.net>

net: Do delayed neigh confirmation.

When a dst_confirm() happens, mark the confirmation as pending in the
dst. Then on the next packet out, when we have the neigh in-hand, do
the update.

This removes the dependency in dst_confirm() of dst's having an
attached neigh.

While we're here, remove the explicit 'dst' NULL check, all except 2
or 3 call sites ensure it's not NULL. So just fix those cases up.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 95603e22 12-Jun-2012 Michel Machado <michel@digirati.com.br>

net-next: add dev_loopback_xmit() to avoid duplicate code

Add dev_loopback_xmit() in order to deduplicate functions
ip_dev_loopback_xmit() (in net/ipv4/ip_output.c) and
ip6_dev_loopback_xmit() (in net/ipv6/ip6_output.c).

I was about to reinvent the wheel when I noticed that
ip_dev_loopback_xmit() and ip6_dev_loopback_xmit() do exactly what I
need and are not IP-only functions, but they were not available to reuse
elsewhere.

ip6_dev_loopback_xmit() does not have line "skb_dst_force(skb);", but I
understand that this is harmless, and should be in dev_loopback_xmit().

Signed-off-by: Michel Machado <michel@digirati.com.br>
CC: "David S. Miller" <davem@davemloft.net>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jpirko@redhat.com>
CC: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fbfe95a4 09-Jun-2012 David S. Miller <davem@davemloft.net>

inet: Create and use rt{,6}_get_peer_create().

There's a lot of places that open-code rt{,6}_get_peer() only because
they want to set 'create' to one. So add an rt{,6}_get_peer_create()
for their sake.

There were also a few spots open-coding plain rt{,6}_get_peer() and
those are transformed here as well.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 2d8dbb04 04-Jun-2012 Vincent Bernat <bernat@luffy.cx>

snmp: fix OutOctets counter to include forwarded datagrams

RFC 4293 defines ipIfStatsOutOctets (similar definition for
ipSystemStatsOutOctets):

The total number of octets in IP datagrams delivered to the lower
layers for transmission. Octets from datagrams counted in
ipIfStatsOutTransmits MUST be counted here.

And ipIfStatsOutTransmits:

The total number of IP datagrams that this entity supplied to the
lower layers for transmission. This includes datagrams generated
locally and those forwarded by this entity.

Therefore, IPSTATS_MIB_OUTOCTETS must be incremented when incrementing
IPSTATS_MIB_OUTFORWDATAGRAMS.

IP_UPD_PO_STATS is not used since ipIfStatsOutRequests must not
include forwarded datagrams:

The total number of IP datagrams that local IP user-protocols
(including ICMP) supplied to IP in requests for transmission. Note
that this counter does not include any datagrams counted in
ipIfStatsOutForwDatagrams.

Signed-off-by: Vincent Bernat <bernat@luffy.cx>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0c183379 25-May-2012 Gao feng <gaofeng@cn.fujitsu.com>

ipv6: fix incorrect ipsec fragment

Since commit ad0081e43a
"ipv6: Fragment locally generated tunnel-mode IPSec6 packets as needed"
the fragment of packets is incorrect.
because tunnel mode needs IPsec headers and trailer for all fragments,
while on transport mode it is sufficient to add the headers to the
first fragment and the trailer to the last.

so modify mtu and maxfraglen base on ipsec mode and if fragment is first
or last.

with my test,it work well(every fragment's size is the mtu)
and does not trigger slow fragment path.

Changes from v1:
though optimization, mtu_prev and maxfraglen_prev can be delete.
replace xfrm mode codes with dst_entry's new frag DST_XFRM_TUNNEL.
add fuction ip6_append_data_mtu to make codes clearer.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a34a101e 18-May-2012 Eric Dumazet <edumazet@google.com>

ipv6: disable GSO on sockets hitting dst_allfrag

If the allfrag feature has been set on a host route (due to an ICMPv6
Packet Too Big received indicating a MTU of less than 1280), we hit a
very slow behavior in TCP stack, because all big packets are dropped and
only a retransmit timer is able to push one MSS frame every 200 ms.

One way to handle this is to disable GSO on the socket the first time a
super packet is dropped. Adding a specific dst_allfrag() in the fast
path is probably overkill since the dst_allfrag() case almost never
happen.

Result on netperf TCP_STREAM, one flow :

Before : 60 kbit/sec
After : 1.6 Gbit/sec

Reported-by: Tore Anderson <tore@fud.no>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Tore Anderson <tore@fud.no>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 72e843bb 18-May-2012 Eric Dumazet <edumazet@google.com>

ipv6: ip6_fragment() should check CHECKSUM_PARTIAL

Quoting Tore Anderson from :

If the allfrag feature has been set on a host route (due to an ICMPv6
Packet Too Big received indicating a MTU of less than 1280),
TCP SYN/ACK packets to that destination appears to get an incorrect
TCP checksum. This in turn means they are thrown away as invalid.

In the case of an IPv4 client behind a link with a MTU of less than
1260, accessing an IPv6 server through a stateless translator,
this means that the client can only download a single large file
from the server, because once it is in the server's routing cache
with the allfrag feature set, new TCP connections can no longer
be established.

</endquote>

It appears ip6_fragment() doesn't handle CHECKSUM_PARTIAL properly.

As network drivers are not prepared to fetch correct transport header, a
safe fix is to call skb_checksum_help() before fragmenting packet.

Reported-by: Tore Anderson <tore@fud.no>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Tore Anderson <tore@fud.no>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d7f7c0ac 17-May-2012 Eric Dumazet <edumazet@google.com>

ipv6: remove csummode in ip6_append_data()

csummode variable is always CHECKSUM_NONE in ip6_append_data()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e87cc472 13-May-2012 Joe Perches <joe@perches.com>

net: Convert net_ratelimit uses to net_<level>_ratelimited

Standardize the net core ratelimited logging functions.

Coalesce formats, align arguments.
Change a printk then vprintk sequence to use printf extension %pV.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a495f836 29-Apr-2012 Chris Elston <celston@katalix.com>

ipv6: Export ipv6 functions for use by other protocols

For implementing other protocols on top of IPv6, such as L2TPv3's IP
encapsulation over ipv6, we'd like to call some IPv6 functions which
are not currently exported. This patch exports them.

Signed-off-by: Chris Elston <celston@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 808db80a 24-Apr-2012 Eric Dumazet <edumazet@google.com>

ipv6: call consume_skb() in frag/reassembly

Some kfree_skb() calls should be replaced by consume_skb() to avoid
drop_monitor/dropwatch false positives.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1f85851e 19-Mar-2012 Gao feng <gaofeng@cn.fujitsu.com>

ipv6: fix incorrent ipv6 ipsec packet fragment

Since commit 299b0767(ipv6: Fix IPsec slowpath fragmentation problem)
In func ip6_append_data,after call skb_put(skb, fraglen + dst_exthdrlen)
the skb->len contains dst_exthdrlen,and we don't reduce dst_exthdrlen at last
This will make fraggap>0 in next "while cycle",and cause the size of skb incorrent

Fix this by reserve headroom for dst_exthdrlen.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c45a3dfb 27-Jan-2012 David S. Miller <davem@davemloft.net>

ipv6: Eliminate dst_get_neighbour_noref() usage in ip6_forward().

It's only used to get at neigh->primary_key, which in this context is
always going to be the same as rt->rt6i_gateway.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 4991969a 27-Jan-2012 David S. Miller <davem@davemloft.net>

ipv6: Remove neigh argument from ndisc_send_redirect()

Instead, compute it as-needed inside of that function using
dst_neigh_lookup().

Signed-off-by: David S. Miller <davem@davemloft.net>


# e688a604 21-Dec-2011 Eric Dumazet <eric.dumazet@gmail.com>

net: introduce DST_NOPEER dst flag

Chris Boot reported crashes occurring in ipv6_select_ident().

[ 461.457562] RIP: 0010:[<ffffffff812dde61>] [<ffffffff812dde61>]
ipv6_select_ident+0x31/0xa7

[ 461.578229] Call Trace:
[ 461.580742] <IRQ>
[ 461.582870] [<ffffffff812efa7f>] ? udp6_ufo_fragment+0x124/0x1a2
[ 461.589054] [<ffffffff812dbfe0>] ? ipv6_gso_segment+0xc0/0x155
[ 461.595140] [<ffffffff812700c6>] ? skb_gso_segment+0x208/0x28b
[ 461.601198] [<ffffffffa03f236b>] ? ipv6_confirm+0x146/0x15e
[nf_conntrack_ipv6]
[ 461.608786] [<ffffffff81291c4d>] ? nf_iterate+0x41/0x77
[ 461.614227] [<ffffffff81271d64>] ? dev_hard_start_xmit+0x357/0x543
[ 461.620659] [<ffffffff81291cf6>] ? nf_hook_slow+0x73/0x111
[ 461.626440] [<ffffffffa0379745>] ? br_parse_ip_options+0x19a/0x19a
[bridge]
[ 461.633581] [<ffffffff812722ff>] ? dev_queue_xmit+0x3af/0x459
[ 461.639577] [<ffffffffa03747d2>] ? br_dev_queue_push_xmit+0x72/0x76
[bridge]
[ 461.646887] [<ffffffffa03791e3>] ? br_nf_post_routing+0x17d/0x18f
[bridge]
[ 461.653997] [<ffffffff81291c4d>] ? nf_iterate+0x41/0x77
[ 461.659473] [<ffffffffa0374760>] ? br_flood+0xfa/0xfa [bridge]
[ 461.665485] [<ffffffff81291cf6>] ? nf_hook_slow+0x73/0x111
[ 461.671234] [<ffffffffa0374760>] ? br_flood+0xfa/0xfa [bridge]
[ 461.677299] [<ffffffffa0379215>] ?
nf_bridge_update_protocol+0x20/0x20 [bridge]
[ 461.684891] [<ffffffffa03bb0e5>] ? nf_ct_zone+0xa/0x17 [nf_conntrack]
[ 461.691520] [<ffffffffa0374760>] ? br_flood+0xfa/0xfa [bridge]
[ 461.697572] [<ffffffffa0374812>] ? NF_HOOK.constprop.8+0x3c/0x56
[bridge]
[ 461.704616] [<ffffffffa0379031>] ?
nf_bridge_push_encap_header+0x1c/0x26 [bridge]
[ 461.712329] [<ffffffffa037929f>] ? br_nf_forward_finish+0x8a/0x95
[bridge]
[ 461.719490] [<ffffffffa037900a>] ?
nf_bridge_pull_encap_header+0x1c/0x27 [bridge]
[ 461.727223] [<ffffffffa0379974>] ? br_nf_forward_ip+0x1c0/0x1d4 [bridge]
[ 461.734292] [<ffffffff81291c4d>] ? nf_iterate+0x41/0x77
[ 461.739758] [<ffffffffa03748cc>] ? __br_deliver+0xa0/0xa0 [bridge]
[ 461.746203] [<ffffffff81291cf6>] ? nf_hook_slow+0x73/0x111
[ 461.751950] [<ffffffffa03748cc>] ? __br_deliver+0xa0/0xa0 [bridge]
[ 461.758378] [<ffffffffa037533a>] ? NF_HOOK.constprop.4+0x56/0x56
[bridge]

This is caused by bridge netfilter special dst_entry (fake_rtable), a
special shared entry, where attaching an inetpeer makes no sense.

Problem is present since commit 87c48fa3b46 (ipv6: make fragment
identifications less predictable)

Introduce DST_NOPEER dst flag and make sure ipv6_select_ident() and
__ip_select_ident() fallback to the 'no peer attached' handling.

Reported-by: Chris Boot <bootc@bootc.net>
Tested-by: Chris Boot <bootc@bootc.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 27217455 02-Dec-2011 David Miller <davem@davemloft.net>

net: Rename dst_get_neighbour{, _raw} to dst_get_neighbour_noref{, _raw}.

To reflect the fact that a refrence is not obtained to the
resulting neighbour entry.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Roland Dreier <roland@purestorage.com>


# 75f2811c 30-Nov-2011 Jesse Gross <jesse@nicira.com>

ipv6: Add fragment reporting to ipv6_skip_exthdr().

While parsing through IPv6 extension headers, fragment headers are
skipped making them invisible to the caller. This reports the
fragment offset of the last header in order to make it possible to
determine whether the packet is fragmented and, if so whether it is
a first or last fragment.

Signed-off-by: Jesse Gross <jesse@nicira.com>


# 4e3fd7a0 20-Nov-2011 Alexey Dobriyan <adobriyan@gmail.com>

net: remove ipv6_addr_copy()

C assignment can handle struct in6_addr copying.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a7ae1992 17-Nov-2011 Herbert Xu <herbert@gondor.apana.org.au>

ipv6: Remove all uses of LL_ALLOCATED_SPACE

ipv6: Remove all uses of LL_ALLOCATED_SPACE

The macro LL_ALLOCATED_SPACE was ill-conceived. It applies the
alignment to the sum of needed_headroom and needed_tailroom. As
the amount that is then reserved for head room is needed_headroom
with alignment, this means that the tail room left may be too small.

This patch replaces all uses of LL_ALLOCATED_SPACE in net/ipv6
with the macro LL_RESERVED_SPACE and direct reference to
needed_tailroom.

This also fixes the problem with needed_headroom changing between
allocating the skb and reserving the head room.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 504744e4 27-Oct-2011 Zheng Yan <zheng.z.yan@intel.com>

ipv6: fix error propagation in ip6_ufo_append_data()

We should return errcode from sock_alloc_send_skb()

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b903d324 26-Oct-2011 Eric Dumazet <eric.dumazet@gmail.com>

ipv6: tcp: fix TCLASS value in ACK messages sent from TIME_WAIT

commit 66b13d99d96a (ipv4: tcp: fix TOS value in ACK messages sent from
TIME_WAIT) fixed IPv4 only.

This part is for the IPv6 side, adding a tclass param to ip6_xmit()

We alias tw_tclass and tw_tos, if socket family is INET6.

[ if sockets is ipv4-mapped, only IP_TOS socket option is used to fill
TOS field, TCLASS is not taken into account ]

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9e903e08 18-Oct-2011 Eric Dumazet <eric.dumazet@gmail.com>

net: add skb frag size accessors

To ease skb->truesize sanitization, its better to be able to localize
all references to skb frags size.

Define accessors : skb_frag_size() to fetch frag size, and
skb_frag_size_{set|add|sub}() to manipulate it.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 299b0767 10-Oct-2011 Steffen Klassert <steffen.klassert@secunet.com>

ipv6: Fix IPsec slowpath fragmentation problem

ip6_append_data() builds packets based on the mtu from dst_mtu(rt->dst.path).
On IPsec the effective mtu is lower because we need to add the protocol
headers and trailers later when we do the IPsec transformations. So after
the IPsec transformations the packet might be too big, which leads to a
slowpath fragmentation then. This patch fixes this by building the packets
based on the lower IPsec mtu from dst_mtu(&rt->dst) and adapts the exthdr
handling to this.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 408dadf0 22-Aug-2011 Ian Campbell <Ian.Campbell@citrix.com>

net: ipv6: convert to SKB frag APIs

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>


# f2c31e32 29-Jul-2011 Eric Dumazet <eric.dumazet@gmail.com>

net: fix NULL dereferences in check_peer_redir()

Gergely Kalman reported crashes in check_peer_redir().

It appears commit f39925dbde778 (ipv4: Cache learned redirect
information in inetpeer.) added a race, leading to possible NULL ptr
dereference.

Since we can now change dst neighbour, we should make sure a reader can
safely use a neighbour.

Add RCU protection to dst neighbour, and make sure check_peer_redir()
can be called safely by different cpus in parallel.

As neighbours are already freed after one RCU grace period, this patch
should not add typical RCU penalty (cache cold effects)

Many thanks to Gergely for providing a pretty report pointing to the
bug.

Reported-by: Gergely Kalman <synapse@hippy.csoma.elte.hu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 87c48fa3 21-Jul-2011 Eric Dumazet <eric.dumazet@gmail.com>

ipv6: make fragment identifications less predictable

IPv6 fragment identification generation is way beyond what we use for
IPv4 : It uses a single generator. Its not scalable and allows DOS
attacks.

Now inetpeer is IPv6 aware, we can use it to provide a more secure and
scalable frag ident generator (per destination, instead of system wide)

This patch :
1) defines a new secure_ipv6_id() helper
2) extends inet_getid() to provide 32bit results
3) extends ipv6_select_ident() with a new dest parameter

Reported-by: Fernando Gont <fernando@gont.com.ar>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 69cce1d1 18-Jul-2011 David S. Miller <davem@davemloft.net>

net: Abstract dst->neighbour accesses behind helpers.

dst_{get,set}_neighbour()

Signed-off-by: David S. Miller <davem@davemloft.net>


# 05e3aa09 16-Jul-2011 David S. Miller <davem@davemloft.net>

net: Create and use new helper, neigh_output().

Signed-off-by: David S. Miller <davem@davemloft.net>


# a2928297 16-Jul-2011 David S. Miller <davem@davemloft.net>

ipv6: Use calculated 'neigh' instead of re-evaluating dst->neighbour

Signed-off-by: David S. Miller <davem@davemloft.net>


# f6b72b62 14-Jul-2011 David S. Miller <davem@davemloft.net>

net: Embed hh_cache inside of struct neighbour.

Now that there is a one-to-one correspondance between neighbour
and hh_cache entries, we no longer need:

1) dynamic allocation
2) attachment to dst->hh
3) refcounting

Initialization of the hh_cache entry is indicated by hh_len
being non-zero, and such initialization is always done with
the neighbour's lock held as a writer.

Signed-off-by: David S. Miller <davem@davemloft.net>


# bdc712b4 06-May-2011 David S. Miller <davem@davemloft.net>

inet: Decrease overhead of on-stack inet_cork.

When we fast path datagram sends to avoid locking by putting
the inet_cork on the stack we use up lots of space that isn't
necessary.

This is because inet_cork contains a "struct flowi" which isn't
used in these code paths.

Split inet_cork to two parts, "inet_cork" and "inet_cork_full".
Only the latter of which has the "struct flowi" and is what is
stored in inet_sock.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>


# b71d1d42 21-Apr-2011 Eric Dumazet <eric.dumazet@gmail.com>

inet: constify ip headers and in6_addr

Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers
where possible, to make code intention more obvious.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c3968a85 13-Apr-2011 Daniel Walter <sahne@0x90.at>

ipv6: RTA_PREFSRC support for ipv6 route source address selection

[ipv6] Add support for RTA_PREFSRC

This patch allows a user to select the preferred source address
for a specific IPv6-Route. It can be set via a netlink message
setting RTA_PREFSRC to a valid IPv6 address which must be
up on the device the route will be bound to.

Signed-off-by: Daniel Walter <dwalter@barracuda.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 25985edc 30-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi>

Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>


# 4c9483b2 12-Mar-2011 David S. Miller <davem@davemloft.net>

ipv6: Convert to use flowi6 where applicable.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 1d28f42c 11-Mar-2011 David S. Miller <davem@davemloft.net>

net: Put flowi_* prefix on AF independent members of struct flowi

I intend to turn struct flowi into a union of AF specific flowi
structs. There will be a common structure that each variant includes
first, much like struct sock_common.

This is the first step to move in that direction.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 452edd59 02-Mar-2011 David S. Miller <davem@davemloft.net>

xfrm: Return dst directly from xfrm_lookup()

Instead of on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 2774c131 01-Mar-2011 David S. Miller <davem@davemloft.net>

xfrm: Handle blackhole route creation via afinfo.

That way we don't have to potentially do this in every xfrm_lookup()
caller.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 69ead7af 01-Mar-2011 David S. Miller <davem@davemloft.net>

ipv6: Normalize arguments to ip6_dst_blackhole().

Return a dst pointer which is potentitally error encoded.

Don't pass original dst pointer by reference, pass a struct net
instead of a socket, and elide the flow argument since it is
unnecessary.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 80c0bc9e 01-Mar-2011 David S. Miller <davem@davemloft.net>

xfrm: Kill XFRM_LOOKUP_WAIT flag.

This can be determined from the flow flags instead.

Signed-off-by: David S. Miller <davem@davemloft.net>


# a1414715 01-Mar-2011 David S. Miller <davem@davemloft.net>

ipv6: Change final dst lookup arg name to "can_sleep"

Since it indicates whether we are invoked from a sleepable
context or not.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 5df65e55 01-Mar-2011 David S. Miller <davem@davemloft.net>

net: Add FLOWI_FLAG_CAN_SLEEP.

And set is in contexts where the route resolution can sleep.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 68d0c6d3 01-Mar-2011 David S. Miller <davem@davemloft.net>

ipv6: Consolidate route lookup sequences.

Route lookups follow a general pattern in the ipv6 code wherein
we first find the non-IPSEC route, potentially override the
flow destination address due to ipv6 options settings, and then
finally make an IPSEC search using either xfrm_lookup() or
__xfrm_lookup().

__xfrm_lookup() is used when we want to generate a blackhole route
if the key manager needs to resolve the IPSEC rules (in this case
-EREMOTE is returned and the original 'dst' is left unchanged).

Otherwise plain xfrm_lookup() is used and when asynchronous IPSEC
resolution is necessary, we simply fail the lookup completely.

All of these cases are encapsulated into two routines,
ip6_dst_lookup_flow and ip6_sk_dst_lookup_flow. The latter of which
handles unconnected UDP datagram sockets.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 5a2ef920 28-Feb-2011 Herbert Xu <herbert@gondor.apana.org.au>

inet: Remove unused sk_sndmsg_* from UFO

UFO doesn't really use the sk_sndmsg_* parameters so touching
them is pointless. It can't use them anyway since the whole
point of UFO is to use the original pages without copying.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a693e698 28-Feb-2011 Anders Berggren <anders@halon.se>

net: TX timestamps for IPv6 UDP packets

Enabling TX timestamps (SO_TIMESTAMPING) for IPv6 UDP packets, in
the same fashion as for IPv4. Necessary in order for NICs such as
Intel 82580 to timestamp IPv6 packets.

Signed-off-by: Anders Berggren <anders@halon.se>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a5f5e368 24-Feb-2011 Hagen Paul Pfeifer <hagen@jauu.net>

ipv6: totlen is declared and assigned but not used

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 92d86829 04-Feb-2011 David S. Miller <davem@davemloft.net>

inetpeer: Move ICMP rate limiting state into inet_peer entries.

Like metrics, the ICMP rate limiting bits are cached state about
a destination. So move it into the inet_peer entries.

If an inet_peer cannot be bound (the reason is memory allocation
failure or similar), the policy is to allow.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 72b43d08 12-Jan-2011 Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>

inet6: prevent network storms caused by linux IPv6 routers

Linux IPv6 forwards unicast packets, which are link layer multicasts...
The hole was present since day one. I was 100% this check is there, but it is not.

The problem shows itself, f.e. when Microsoft Network Load Balancer runs on a network.
This software resolves IPv6 unicast addresses to multicast MAC addresses.

Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ad0081e4 17-Dec-2010 David Stevens <dlstevens@us.ibm.com>

ipv6: Fragment locally generated tunnel-mode IPSec6 packets as needed.

This patch modifies IPsec6 to fragment IPv6 packets that are
locally generated as needed.

This version of the patch only fragments in tunnel mode, so that fragment
headers will not be obscured by ESP in transport mode.

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a02cec21 22-Sep-2010 Eric Dumazet <eric.dumazet@gmail.com>

net: return operator cleanup

Change "return (EXPR);" to "return EXPR;"

return is not a function, parentheses are not required.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3d13008e 21-Sep-2010 Eric Dumazet <eric.dumazet@gmail.com>

ip: fix truesize mismatch in ip fragmentation

Special care should be taken when slow path is hit in ip_fragment() :

When walking through frags, we transfert truesize ownership from skb to
frags. Then if we hit a slow_path condition, we must undo this or risk
uncharging frags->truesize twice, and in the end, having negative socket
sk_wmem_alloc counter, or even freeing socket sooner than expected.

Many thanks to Nick Bowler, who provided a very clean bug report and
test program.

Thanks to Jarek for reviewing my first patch and providing a V2

While Nick bisection pointed to commit 2b85a34e911 (net: No more
expensive sock_hold()/sock_put() on each tx), underlying bug is older
(2.6.12-rc5)

A side effect is to extend work done in commit b2722b1c3a893e
(ip_fragment: also adjust skb->truesize for packets not owned by a
socket) to ipv6 as well.

Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com>
Tested-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 21dc3301 23-Aug-2010 David S. Miller <davem@davemloft.net>

net: Rename skb_has_frags to skb_has_frag_list

SKBs can be "fragmented" in two ways, via a page array (called
skb_shinfo(skb)->frags[]) and via a list of SKBs (called
skb_shinfo(skb)->frag_list).

Since skb_has_frags() tests the latter, it's name is confusing
since it sounds more like it's testing the former.

Signed-off-by: David S. Miller <davem@davemloft.net>


# d8d1f30b 11-Jun-2010 Changli Gao <xiaosuo@gmail.com>

net-next: remove useless union keyword

remove useless union keyword in rtable, rt6_info and dn_route.

Since there is only one member in a union, the union keyword isn't useful.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0aa68271 27-May-2010 Herbert Xu <herbert@gondor.apana.org.au>

ipv6: Add GSO support on forwarding path

Currently we disallow GSO packets on the IPv6 forward path.
This patch fixes this.

Note that I discovered that our existing GSO MTU checks (e.g.,
IPv4 forwarding) are buggy in that they skip the check altogether,
when they really should be checking gso_size + header instead.

I have also been lazy here in that I haven't bothered to segment
the GSO packet by hand before generating an ICMP message. Someone
should add that to be 100% correct.

Reported-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d1db275d 11-May-2010 Patrick McHardy <kaber@trash.net>

ipv6: ip6mr: support multiple tables

This patch adds support for multiple independant multicast routing instances,
named "tables".

Userspace multicast routing daemons can bind to a specific table instance by
issuing a setsockopt call using a new option MRT6_TABLE. The table number is
stored in the raw socket data and affects all following ip6mr setsockopt(),
getsockopt() and ioctl() calls. By default, a single table (RT6_TABLE_DFLT)
is created with a default routing rule pointing to it. Newly created pim6reg
devices have the table number appended ("pim6regX"), with the exception of
devices created in the default table, which are named just "pim6reg" for
compatibility reasons.

Packets are directed to a specific table instance using routing rules,
similar to how regular routing rules work. Currently iif, oif and mark
are supported as keys, source and destination addresses could be supported
additionally.

Example usage:

- bind pimd/xorp/... to a specific table:

uint32_t table = 123;
setsockopt(fd, SOL_IPV6, MRT6_TABLE, &table, sizeof(table));

- create routing rules directing packets to the new table:

# ip -6 mrule add iif eth0 lookup 123
# ip -6 mrule add oif eth0 lookup 123

Signed-off-by: Patrick McHardy <kaber@trash.net>


# 83d7eb29 30-Apr-2010 Dan Carpenter <error27@gmail.com>

ipv6: cleanup: remove unneeded null check

We dereference "sk" unconditionally elsewhere in the function.

This was left over from: b30bd282 "ip6_xmit: remove unnecessary NULL
ptr check". According to that commit message, "the sk argument to
ip6_xmit is never NULL nowadays since the skb->priority assigment
expects a valid socket."

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4b340ae2 23-Apr-2010 Brian Haley <brian.haley@hp.com>

IPv6: Complete IPV6_DONTFRAG support

Finally add support to detect a local IPV6_DONTFRAG event
and return the relevant data to the user if they've enabled
IPV6_RECVPATHMTU on the socket. The next recvmsg() will
return no data, but have an IPV6_PATHMTU as ancillary data.

Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 13b52cd4 23-Apr-2010 Brian Haley <brian.haley@hp.com>

IPv6: Add dontfrag argument to relevant functions

Add dontfrag argument to relevant functions for
IPV6_DONTFRAG support, as well as allowing the value
to be passed-in via ancillary cmsg data.

Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f2228f78 18-Apr-2010 Shan Wei <shanwei@cn.fujitsu.com>

ipv6: allow to send packet after receiving ICMPv6 Too Big message with MTU field less than IPV6_MIN_MTU

According to RFC2460, PMTU is set to the IPv6 Minimum Link
MTU (1280) and a fragment header should always be included
after a node receiving Too Big message reporting PMTU is
less than the IPv6 Minimum Link MTU.

After receiving a ICMPv6 Too Big message reporting PMTU is
less than the IPv6 Minimum Link MTU, sctp *can't* send any
data/control chunk that total length including IPv6 head
and IPv6 extend head is less than IPV6_MIN_MTU(1280 bytes).

The failure occured in p6_fragment(), about reason
see following(take SHUTDOWN chunk for example):
sctp_packet_transmit (SHUTDOWN chunk, len=16 byte)
|------sctp_v6_xmit (local_df=0)
|------ip6_xmit
|------ip6_output (dst_allfrag is ture)
|------ip6_fragment

In ip6_fragment(), for local_df=0, drops the the packet
and returns EMSGSIZE.

The patch fixes it with adding check length of skb->len.
In this case, Ipv6 not to fragment upper protocol data,
just only add a fragment header before it.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cd58bcd9 19-Apr-2010 Jan Engelhardt <jengelh@medozas.de>

netfilter: xt_TEE: have cloned packet travel through Xtables too

Since Xtables is now reentrant/nestable, the cloned packet can also go
through Xtables and be subject to rules itself.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# e281b198 19-Apr-2010 Jan Engelhardt <jengelh@medozas.de>

netfilter: xtables: inclusion of xt_TEE

xt_TEE can be used to clone and reroute a packet. This can for
example be used to copy traffic at a router for logging purposes
to another dedicated machine.

References: http://www.gossamer-threads.com/lists/iptables/devel/68781
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# b5d43998 15-Apr-2010 Shan Wei <shanwei@cn.fujitsu.com>

ipv6: fix the comment of ip6_xmit()

ip6_xmit() is used by upper transport protocol.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4e15ed4d 15-Apr-2010 Shan Wei <shanwei@cn.fujitsu.com>

net: replace ipfragok with skb->local_df

As Herbert Xu said: we should be able to simply replace ipfragok
with skb->local_df. commit f88037(sctp: Drop ipfargok in sctp_xmit function)
has droped ipfragok and set local_df value properly.

The patch kills the ipfragok parameter of .queue_xmit().

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0eecb784 15-Apr-2010 Shan Wei <shanwei@cn.fujitsu.com>

ipv6: cancel to setting local_df in ip6_xmit()

commit f88037(sctp: Drop ipfargok in sctp_xmit function)
has droped ipfragok and set local_df value properly.

So the change of commit 77e2f1(ipv6: Fix ip6_xmit to
send fragments if ipfragok is true) is not needed.
So the patch remove them.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e30b38c2 15-Apr-2010 Eric Dumazet <eric.dumazet@gmail.com>

ip: Fix ip_dev_loopback_xmit()

Eric Paris got following trace with a linux-next kernel

[ 14.203970] BUG: using smp_processor_id() in preemptible [00000000]
code: avahi-daemon/2093
[ 14.204025] caller is netif_rx+0xfa/0x110
[ 14.204035] Call Trace:
[ 14.204064] [<ffffffff81278fe5>] debug_smp_processor_id+0x105/0x110
[ 14.204070] [<ffffffff8142163a>] netif_rx+0xfa/0x110
[ 14.204090] [<ffffffff8145b631>] ip_dev_loopback_xmit+0x71/0xa0
[ 14.204095] [<ffffffff8145b892>] ip_mc_output+0x192/0x2c0
[ 14.204099] [<ffffffff8145d610>] ip_local_out+0x20/0x30
[ 14.204105] [<ffffffff8145d8ad>] ip_push_pending_frames+0x28d/0x3d0
[ 14.204119] [<ffffffff8147f1cc>] udp_push_pending_frames+0x14c/0x400
[ 14.204125] [<ffffffff814803fc>] udp_sendmsg+0x39c/0x790
[ 14.204137] [<ffffffff814891d5>] inet_sendmsg+0x45/0x80
[ 14.204149] [<ffffffff8140af91>] sock_sendmsg+0xf1/0x110
[ 14.204189] [<ffffffff8140dc6c>] sys_sendmsg+0x20c/0x380
[ 14.204233] [<ffffffff8100ad82>] system_call_fastpath+0x16/0x1b

While current linux-2.6 kernel doesnt emit this warning, bug is latent
and might cause unexpected failures.

ip_dev_loopback_xmit() runs in process context, preemption enabled, so
must call netif_rx_ni() instead of netif_rx(), to make sure that we
process pending software interrupt.

Same change for ip6_dev_loopback_xmit()

Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9c6eb28a 13-Apr-2010 Jan Engelhardt <jengelh@medozas.de>

netfilter: ipv6: add IPSKB_REROUTED exclusion to NF_HOOK/POSTROUTING invocation

Similar to how IPv4's ip_output.c works, have ip6_output also check
the IPSKB_REROUTED flag. It will be set from xt_TEE for cloned packets
since Xtables can currently only deal with a single packet in flight
at a time.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Acked-by: David S. Miller <davem@davemloft.net>
[Patrick: changed to use an IP6SKB value instead of IPSKB]
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 9e508490 13-Apr-2010 Jan Engelhardt <jengelh@medozas.de>

netfilter: ipv6: move POSTROUTING invocation before fragmentation

Patrick McHardy notes: "We used to invoke IPv4 POST_ROUTING after
fragmentation as well just to defragment the packets in conntrack
immediately afterwards, but that got changed during the
netfilter-ipsec integration. Ideally IPv6 would behave like IPv4."

This patch makes it so. Sending an oversized frame (e.g. `ping6
-s64000 -c1 ::1`) will now show up in POSTROUTING as a single skb
rather than multiple ones.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# b2e0b385 22-Mar-2010 Jan Engelhardt <jengelh@medozas.de>

netfilter: ipv6: use NFPROTO values for NF_HOOK invocation

The semantic patch that was used:
// <smpl>
@@
@@
(NF_HOOK
|NF_HOOK_THRESH
|nf_hook
)(
-PF_INET6,
+NFPROTO_IPV6,
...)
// </smpl>

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>


# 14f3ad6f 26-Feb-2010 Ulrich Weber <uweber@astaro.com>

ipv6: Use 1280 as min MTU for ipv6 forwarding

Clients will set their MTU to 1280 if they receive a
ICMPV6_PKT_TOOBIG message with an MTU less than 1280.

To allow encapsulating of packets over a 1280 link
we should always accept packets with a size of 1280
for forwarding even if the path has a lower MTU and
fragment the encapsulated packets afterwards.

In case a forwarded packet is not going to be encapsulated
a ICMPV6_PKT_TOOBIG msg will still be send by ip6_fragment()
with the correct MTU.

Signed-off-by: Ulrich Weber <uweber@astaro.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3ffe533c 18-Feb-2010 Alexey Dobriyan <adobriyan@gmail.com>

ipv6: drop unused "dev" arg of icmpv6_send()

Dunno, what was the idea, it wasn't used for a long time.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7ad6848c 06-Jan-2010 Octavian Purdila <opurdila@ixiacom.com>

ip: fix mc_loop checks for tunnels with multicast outer addresses

When we have L3 tunnels with different inner/outer families
(i.e. IPV4/IPV6) which use a multicast address as the outer tunnel
destination address, multicast packets will be loopbacked back to the
sending socket even if IP*_MULTICAST_LOOP is set to disabled.

The mc_loop flag is present in the family specific part of the socket
(e.g. the IPv4 or IPv4 specific part). setsockopt sets the inner
family mc_loop flag. When the packet is pushed through the L3 tunnel
it will eventually be processed by the outer family which if different
will check the flag in a different part of the socket then it was set.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6ce9e7b5 02-Sep-2009 Eric Dumazet <eric.dumazet@gmail.com>

ip: Report qdisc packet drops

Christoph Lameter pointed out that packet drops at qdisc level where not
accounted in SNMP counters. Only if application sets IP_RECVERR, drops
are reported to user (-ENOBUFS errors) and SNMP counters updated.

IP_RECVERR is used to enable extended reliable error message passing,
but these are not needed to update system wide SNMP stats.

This patch changes things a bit to allow SNMP counters to be updated,
regardless of IP_RECVERR being set or not on the socket.

Example after an UDP tx flood
# netstat -s
...
IP:
1487048 outgoing packets dropped
...
Udp:
...
SndbufErrors: 1487048


send() syscalls, do however still return an OK status, to not
break applications.

Note : send() manual page explicitly says for -ENOBUFS error :

"The output queue for a network interface was full.
This generally indicates that the interface has stopped sending,
but may be caused by transient congestion.
(Normally, this does not occur in Linux. Packets are just silently
dropped when a device queue overflows.) "

This is not true for IP_RECVERR enabled sockets : a send() syscall
that hit a qdisc drop returns an ENOBUFS error.

Many thanks to Christoph, David, and last but not least, Alexey !

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 06254914 01-Sep-2009 Eric Dumazet <eric.dumazet@gmail.com>

ipv6: ip6_push_pending_frames() should increment IPSTATS_MIB_OUTDISCARDS

qdisc drops should be notified to IP_RECVERR enabled sockets, as done in IPV4.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e651f03a 09-Aug-2009 Gerrit Renker <gerrit@erg.abdn.ac.uk>

inet6: Conversion from u8 to int

This replaces assignments of the type "int on LHS" = "u8 on RHS" with
simpler code. The LHS can express all of the unsigned right hand side
values, hence the assigned value can not be negative.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7ea2f2c5 09-Jul-2009 Sridhar Samudrala <sri@us.ibm.com>

udpv6: Remove unused skb argument of ipv6_select_ident()

- move ipv6_select_ident() inline function to ipv6.h and remove the unused
skb argument

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c31d5326 09-Jul-2009 Sridhar Samudrala <sri@us.ibm.com>

udpv6: Fix gso_size setting in ip6_ufo_append_data

- fix gso_size setting for ipv6 fragment to be a multiple of 8 bytes.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e51a67a9 08-Jul-2009 Eric Dumazet <eric.dumazet@gmail.com>

net: ip_push_pending_frames() fix

After commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
(net: No more expensive sock_hold()/sock_put() on each tx)
we do not take any more references on sk->sk_refcnt on outgoing packets.

I forgot to delete two __sock_put() from ip_push_pending_frames()
and ip6_push_pending_frames().

Reported-by: Emil S Tantilov <emils.tantilov@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Emil S Tantilov <emils.tantilov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2b85a34e 11-Jun-2009 Eric Dumazet <eric.dumazet@gmail.com>

net: No more expensive sock_hold()/sock_put() on each tx

One of the problem with sock memory accounting is it uses
a pair of sock_hold()/sock_put() for each transmitted packet.

This slows down bidirectional flows because the receive path
also needs to take a refcount on socket and might use a different
cpu than transmit path or transmit completion path. So these
two atomic operations also trigger cache line bounces.

We can see this in tx or tx/rx workloads (media gateways for example),
where sock_wfree() can be in top five functions in profiles.

We use this sock_hold()/sock_put() so that sock freeing
is delayed until all tx packets are completed.

As we also update sk_wmem_alloc, we could offset sk_wmem_alloc
by one unit at init time, until sk_free() is called.
Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
to decrement initial offset and atomicaly check if any packets
are in flight.

skb_set_owner_w() doesnt call sock_hold() anymore

sock_wfree() doesnt call sock_put() anymore, but check if sk_wmem_alloc
reached 0 to perform the final freeing.

Drawback is that a skb->truesize error could lead to unfreeable sockets, or
even worse, prematurely calling __sk_free() on a live socket.

Nice speedups on SMP. tbench for example, going from 2691 MB/s to 2711 MB/s
on my 8 cpu dev machine, even if tbench was not really hitting sk_refcnt
contention point. 5 % speedup on a UDP transmit workload (depends
on number of flows), lowering TX completion cpu usage.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4d9092bb 09-Jun-2009 David S. Miller <davem@davemloft.net>

ipv6: Use frag list abstraction interfaces.

Signed-off-by: David S. Miller <davem@davemloft.net>


# adf30907 01-Jun-2009 Eric Dumazet <eric.dumazet@gmail.com>

net: skb->dst accessors

Define three accessors to get/set dst attached to a skb

struct dst_entry *skb_dst(const struct sk_buff *skb)

void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

void skb_dst_drop(struct sk_buff *skb)
This one should replace occurrences of :
dst_release(skb->dst)
skb->dst = NULL;

Delete skb->dst field

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# edf391ff 27-Apr-2009 Neil Horman <nhorman@tuxdriver.com>

snmp: add missing counters for RFC 4293

The IP MIB (RFC 4293) defines stats for InOctets, OutOctets, InMcastOctets and
OutMcastOctets:
http://tools.ietf.org/html/rfc4293
But it seems we don't track those in any way that easy to separate from other
protocols. This patch adds those missing counters to the stats file. Tested
successfully by me

With help from Eric Dumazet.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0178b695 05-Feb-2009 Herbert Xu <herbert@gondor.apana.org.au>

ipv6: Copy cork options in ip6_append_data

As the options passed to ip6_append_data may be ephemeral, we need
to duplicate it for corking. This patch applies the simplest fix
which is to memdup all the relevant bits.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bd91b8bf 10-Dec-2008 Benjamin Thery <benjamin.thery@bull.net>

netns: ip6mr: allocate mroute6_socket per-namespace.

Preliminary work to make IPv6 multicast forwarding netns-aware.

Make IPv6 multicast forwarding mroute6_socket per-namespace,
moves it into struct netns_ipv6.

At the moment, mroute6_socket is only referenced in init_net.

Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# def8b4fa 28-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com>

net: reduce structures when XFRM=n

ifdef out
* struct sk_buff::sp (pointer)
* struct dst_entry::xfrm (pointer)
* struct sock::sk_policy (2 pointers)

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5a57d4c7 08-Oct-2008 Denis V. Lunev <den@openvz.org>

ipv6: added net argument to ICMP6MSGOUT_INC_STATS_BH

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e41b5368 08-Oct-2008 Denis V. Lunev <den@openvz.org>

ipv6: added net argument to ICMP6_INC_STATS_BH

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 483a47d2 08-Oct-2008 Denis V. Lunev <den@openvz.org>

ipv6: added net argument to IP6_INC_STATS_BH

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3bd653c8 08-Oct-2008 Denis V. Lunev <den@openvz.org>

netns: add net parameter to IP6_INC_STATS

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0b0588d4 08-Oct-2008 Denis V. Lunev <den@openvz.org>

ipv6: local dev is actually unused in ip6_fragment

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e550dfb0 09-Sep-2008 Neil Horman <nhorman@tuxdriver.com>

ipv6: Fix OOPS in ip6_dst_lookup_tail().

This fixes kernel bugzilla 11469: "TUN with 1024 neighbours:
ip6_dst_lookup_tail NULL crash"

dst->neighbour is not necessarily hooked up at this point
in the processing path, so blindly dereferencing it is
the wrong thing to do. This NULL check exists in other
similar paths and this case was just an oversight.

Also fix the completely wrong and confusing indentation
here while we're at it.

Based upon a patch by Evgeniy Polyakov.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 191cd582 14-Aug-2008 Brian Haley <brian.haley@hp.com>

netns: Add network namespace argument to rt6_fill_node() and ipv6_dev_get_saddr()

ipv6_dev_get_saddr() blindly de-references dst_dev to get the network
namespace, but some callers might pass NULL. Change callers to pass a
namespace pointer instead.

Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 283d07ac 03-Aug-2008 Wei Yongjun <yjwei@cn.fujitsu.com>

ipv6: Do not drop packet if skb->local_df is set to true

The old code will drop IPv6 packet if ipfragok is not set, since
ipfragok is obsoleted, will be instead by used skb->local_df, so this
check must be changed to skb->local_df.

This patch fix this problem and not drop packet if skb->local_df is
set to true.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 77e2f14f 31-Jul-2008 Wei Yongjun <yjwei@cn.fujitsu.com>

ipv6: Fix ip6_xmit to send fragments if ipfragok is true

SCTP used ip6_xmit() to send fragments after received ICMP packet too
big message. But while send packet used ip6_xmit, the skb->local_df is
not initialized. So when skb if enter ip6_fragment(), the following
code will discard the skb.

ip6_fragment(...)
{
if (!skb->local_df) {
...
return -EMSGSIZE;
}
...
}

SCTP do the following step:
1. send packet ip6_xmit(skb, ipfragok=0)
2. received ICMP packet too big message
3. if PMTUD_ENABLE: ip6_xmit(skb, ipfragok=1)

This patch fixed the problem by set local_df if ipfragok is true.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 547b792c 25-Jul-2008 Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

net: convert BUG_TRAP to generic WARN_ON

Removes legacy reinvent-the-wheel type thing. The generic
machinery integrates much better to automated debugging aids
such as kerneloops.org (and others), and is unambiguous due to
better naming. Non-intuively BUG_TRAP() is actually equal to
WARN_ON() rather than BUG_ON() though some might actually be
promoted to BUG_ON() but I left that to future.

I could make at least one BUILD_BUG_ON conversion.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 53b7997f 19-Jul-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

ipv6 netns: Make several "global" sysctl variables namespace aware.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 778d80be 27-Jun-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

ipv6: Add disable_ipv6 sysctl to disable IPv6 operaion on specific interface.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# f81b2e7d 25-Jun-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

ipv6: Do not forward packets with the unspecified source address.

RFC4291 2.5.2.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# 4497b076 19-Jun-2008 Ben Hutchings <bhutchings@solarflare.com>

net: Discard and warn about LRO'd skbs received for forwarding

Add skb_warn_if_lro() to test whether an skb was received with LRO and
warn if so.

Change br_forward(), ip_forward() and ip6_forward() to call it) and
discard the skb if it returns true.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0b040829 10-Jun-2008 Adrian Bunk <bunk@kernel.org>

net: remove CVS keywords

This patch removes CVS keywords that weren't updated for a long time
from comments.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f5184d26 12-May-2008 Johannes Berg <johannes@sipsolutions.net>

net: Allow netdevices to specify needed head/tailroom

This patch adds needed_headroom/needed_tailroom members to struct
net_device and updates many places that allocate sbks to use them. Not
all of them can be converted though, and I'm sure I missed some (I
mostly grepped for LL_RESERVED_SPACE)

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9acd9f3a 10-Apr-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6]: Make address arguments const.

- net/ipv6/addrconf.c:
ipv6_get_ifaddr(), ipv6_dev_get_saddr()
- net/ipv6/mcast.c:
ipv6_sock_mc_join(), ipv6_sock_mc_drop(),
inet6_mc_check(),
ipv6_dev_mc_inc(), __ipv6_dev_mc_dec(), ipv6_dev_mc_dec(),
ipv6_chk_mcast_addr()
- net/ipv6/route.c:
rt6_lookup(), icmp6_dst_alloc()
- net/ipv6/ip6_output.c:
ip6_nd_hdr()
- net/ipv6/ndisc.c:
ndisc_send_ns(), ndisc_send_rs(), ndisc_send_redirect(),
ndisc_get_neigh(), __ndisc_send()

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# 7bc570c8 02-Apr-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6] MROUTE: Support multicast forwarding.

Based on ancient patch by Mickael Hoerdt
<hoerdt@clarinet.u-strasbg.fr>, which is available at
<http://www-r2.u-strasbg.fr/~hoerdt/dev/linux_ipv6_mforwarding/patch-linux-ipv6-mforwarding-0.1a>.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# 3b1e0a65 25-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[NET] NETNS: Omit sock->sk_net without CONFIG_NET_NS.

Introduce per-sock inlines: sock_net(), sock_net_set()
and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# c346dca1 25-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[NET] NETNS: Omit net_device->nd_net without CONFIG_NET_NS.

Introduce per-net_device inlines: dev_net(), dev_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# 7cbca67c 24-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6]: Support Source Address Selection API (RFC5014).

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# 6b75d090 10-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6]: Optimize hop-limit determination.

Last part of hop-limit determination is always:
hoplimit = dst_metric(dst, RTAX_HOPLIMIT);
if (hoplimit < 0)
hoplimit = ipv6_get_hoplimit(dst->dev).

Let's consolidate it as ip6_dst_hoplimit(dst).

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# c8cdaf99 10-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV4,IPV6]: Share cork.rt between IPv4 and IPv6.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# 8a3edd80 07-Mar-2008 Daniel Lezcano <dlezcano@fr.ibm.com>

[NETNS][IPV6] fix some missing namespace

This patch adds some missing namespace

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c20121ae 05-Mar-2008 Daniel Lezcano <dlezcano@fr.ibm.com>

[NETNS][IPV6] route6 - pass always a valid socket to ip6_dst_lookup

The ip6_dst_lookup receive a socket as parameter. In some part of the code
it is called with a NULL socket parameter. We want to rely on the socket
to retrieve the network namespace, so we always pass a valid socket in all
cases.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4591db4f 05-Mar-2008 Daniel Lezcano <dlezcano@fr.ibm.com>

[NETNS][IPV6] route6 - add netns parameter to ip6_route_output

Add an netns parameter to ip6_route_output. That will allow to access
to the right routing table for outgoing traffic.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5e5f3f0f 03-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6] ADDRCONF: Convert ipv6_get_saddr() to ipv6_dev_get_saddr().

Since most users of ipv6_get_saddr() pass non-NULL as
dst argument, use ipv6_dev_get_saddr() directly.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# 1e04d530 28-Feb-2008 Adrian Bunk <bunk@kernel.org>

[IPV6]: Unexport ip6_find_1stfragopt

This patch removes the no longer used
EXPORT_SYMBOL_GPL(ip6_find_1stfragopt).

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b5c15fc0 15-Feb-2008 Herbert Xu <herbert@gondor.apana.org.au>

[IPV6]: Fix reversed local_df test in ip6_fragment

I managed to reverse the local_df test when forward-porting this
patch so it actually makes things worse by never fragmenting at
all.

Thanks to David Stevens for testing and reporting this bug.

Bill Fink pointed out that the local_df setting is also the wrong
way around.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 28a89453 12-Feb-2008 Herbert Xu <herbert@gondor.apana.org.au>

[IPV6]: Fix IPsec datagram fragmentation

This is a long-standing bug in the IPsec IPv6 code that breaks
when we emit a IPsec tunnel-mode datagram packet. The problem
is that the code the emits the packet assumes the IPv6 stack
will fragment it later, but the IPv6 stack assumes that whoever
is emitting the packet is going to pre-fragment the packet.

In the long term we need to fix both sides, e.g., to get the
datagram code to pre-fragment as well as to get the IPv6 stack
to fragment locally generated tunnel-mode packet.

For now this patch does the second part which should make it
work for the IPsec host case.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4a19ec58 30-Jan-2008 Laszlo Attila Toth <panther@balabit.hu>

[NET]: Introducing socket mark socket option.

A userspace program may wish to set the mark for each packets its send
without using the netfilter MARK target. Changing the mark can be used
for mark based routing without netfilter or for packet filtering.

It requires CAP_NET_ADMIN capability.

Signed-off-by: Laszlo Attila Toth <panther@balabit.hu>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 29ffe1a5 28-Jan-2008 Herbert Xu <herbert@gondor.apana.org.au>

[INET]: Prevent out-of-sync truesize on ip_fragment slow path

When ip_fragment has to hit the slow path the value of skb->truesize
may go out of sync because we would have updated it without changing
the packet length. This violates the constraints on truesize.

This patch postpones the update of skb->truesize to prevent this.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1cab3da6 10-Jan-2008 Daniel Lezcano <dlezcano@fr.ibm.com>

[NETNS][IPV6]: inet6_addr - ipv6_get_ifaddr namespace aware

The inet6_addr_lst is browsed taking into account the network
namespace specified as parameter. If an address does not belong
to the specified namespace, it is ignored.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 426b5303 24-Jan-2008 Eric W. Biederman <ebiederm@xmission.com>

[NETNS]: Modify the neighbour table code so it handles multiple network namespaces

I'm actually surprised at how much was involved. At first glance it
appears that the neighbour table data structures are already split by
network device so all that should be needed is to modify the user
interface commands to filter the set of neighbours by the network
namespace of their devices.

However a couple things turned up while I was reading through the
code. The proxy neighbour table allows entries with no network
device, and the neighbour parms are per network device (except for the
defaults) so they now need a per network namespace default.

So I updated the two structures (which surprised me) with their very
own network namespace parameter. Updated the relevant lookup and
destroy routines with a network namespace parameter and modified the
code that interacts with users to filter out neighbour table entries
for devices of other namespaces.

I'm a little concerned that we can modify and display the global table
configuration and from all network namespaces. But this appears good
enough for now.

I keep thinking modifying the neighbour table to have per network
namespace instances of each table type would should be cleaner. The
hash table is already dynamically sized so there are it is not a
limiter. The default parameter would be straight forward to take care
of. However when I look at the how the network table is built and
used I still find some assumptions that there is only a single
neighbour table for each type of table in the kernel. The netlink
operations, neigh_seq_start, the non-core network users that call
neigh_lookup. So while it might be doable it would require more
refactoring than my current approach of just doing a little extra
filtering in the code.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a1b05140 20-Dec-2007 Masahide NAKAMURA <nakam@linux-ipv6.org>

[XFRM] IPv6: Fix dst/routing check at transformation.

IPv6 specific thing is wrongly removed from transformation at net-2.6.25.
This patch recovers it with current design.

o Update "path" of xfrm_dst since IPv6 transformation should
care about routing changes. It is required by MIPv6 and
off-link destined IPsec.
o Rename nfheader_len which is for non-fragment transformation used by
MIPv6 to rt6i_nfheader_len as IPv6 name space.

Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6e23ae2a 19-Nov-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: Introduce NF_INET_ hook values

The IPv4 and IPv6 hook values are identical, yet some code tries to figure
out the "correct" value by looking at the address family. Introduce NF_INET_*
values for both IPv4 and IPv6. The old values are kept in a #ifndef __KERNEL__
section for userspace compatibility.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ef76bc23 11-Jan-2008 Herbert Xu <herbert@gondor.apana.org.au>

[IPV6]: Add ip6_local_out

Most callers of the LOCAL_OUT chain will set the IP packet length
before doing so. They also share the same output function dst_output.

This patch creates a new function called ip6_local_out which does all
of that and converts the appropriate users over to it.

Apart from removing duplicate code, it will also help in merging the
IPsec output path.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b4ce9277 13-Nov-2007 Herbert Xu <herbert@gondor.apana.org.au>

[IPV6]: Move nfheader_len into rt6_info

The dst member nfheader_len is only used by IPv6. It's also currently
creating a rather ugly alignment hole in struct dst. Therefore this patch
moves it from there into struct rt6_info.

It also reorders the fields in rt6_info to minimize holes.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 01488942 13-Nov-2007 Herbert Xu <herbert@gondor.apana.org.au>

[IPV6]: Only set nfheader_len for top xfrm dst

We only need to set nfheader_len in the top xfrm dst. This is because
we only ever read the nfheader_len from the top xfrm dst.

It is also easier to count nfheader_len as part of header_len which
then lets us remove the ugly wrapper functions for incrementing and
decrementing header lengths in xfrm6_policy.c.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f945fa7a 22-Jan-2008 Herbert Xu <herbert@gondor.apana.org.au>

[INET]: Fix truesize setting in ip_append_data

As it is ip_append_data only counts page fragments to the skb that
allocated it. As such it means that the first skb gets hit with a
4K charge even though it might have only used a fraction of it while
all subsequent skb's that use the same page gets away with no charge
at all.

This bug was exposed by the UDP accounting patch.

[ The wmem_alloc bumping needs to be moved with the truesize,
noticed by Takahiro Yasui. -DaveM ]

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ca46f9c8 05-Dec-2007 Mitsuru Chinen <mitch@linux.vnet.ibm.com>

[IPv6] SNMP: Increment OutNoRoutes when connecting to unreachable network

IPv6 stack doesn't increment OutNoRoutes counter when IP datagrams
is being discarded because no route could be found to transmit them
to their destination. IPv6 stack should increment the counter.
Incidentally, IPv4 stack increments that counter in such situation.

Signed-off-by: Mitsuru Chinen <mitch@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bf138862 05-Nov-2007 Pavel Emelyanov <xemul@openvz.org>

[IPV6]: Consolidate the ip cork destruction in ip6_output.c

The ip6_push_pending_frames and ip6_flush_pending_frames do the
same things to flush the sock's cork. Move this into a separate
function and save ~100 bytes from the .text

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c2636b4d 23-Oct-2007 Chuck Lever <chuck.lever@oracle.com>

[NET]: Treat the sign of the result of skb_headroom() consistently

In some places, the result of skb_headroom() is compared to an unsigned
integer, and in others, the result is compared to a signed integer. Make
the comparisons consistent and correct.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ad643a79 15-Oct-2007 Patrick McHardy <kaber@trash.net>

[IPV6]: Uninline netfilter okfns

Uninline netfilter okfns for those cases where gcc can generate tail-calls.

Before:
text data bss dec hex filename
8994153 1016524 524652 10535329 a0c1a1 vmlinux

After:
text data bss dec hex filename
8992761 1016524 524652 10533937 a0bc31 vmlinux
-------------------------------------------------------
-1392

All cases have been verified to generate tail-calls with and without netfilter.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 14878f75 16-Sep-2007 David L Stevens <dlstevens@us.ibm.com>

[IPV6]: Add ICMPMsgStats MIB (RFC 4293) [rev 2]

Background: RFC 4293 deprecates existing individual, named ICMP
type counters to be replaced with the ICMPMsgStatsTable. This table
includes entries for both IPv4 and IPv6, and requires counting of all
ICMP types, whether or not the machine implements the type.

These patches "remove" (but not really) the existing counters, and
replace them with the ICMPMsgStats tables for v4 and v6.
It includes the named counters in the /proc places they were, but gets the
values for them from the new tables. It also counts packets generated
from raw socket output (e.g., OutEchoes, MLD queries, RA's from
radvd, etc).

Changes:
1) create icmpmsg_statistics mib
2) create icmpv6msg_statistics mib
3) modify existing counters to use these
4) modify /proc/net/snmp to add "IcmpMsg" with all ICMP types
listed by number for easy SNMP parsing
5) modify /proc/net/snmp printing for "Icmp" to get the named data
from new counters.
[new to 2nd revision]
6) support per-interface ICMP stats
7) use common macro for per-device stat macros

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1e5dc146 24-Aug-2007 Masahide NAKAMURA <nakam@linux-ipv6.org>

[IPV6] IPSEC: Omit redirect for tunnelled packet.

IPv6 IPsec tunnel gateway incorrectly sends redirect to
router or sender when network device the IPsec tunnelled packet
is arrived is the same as the one the decapsulated packet
is sent.

With this patch, it omits to send the redirect when the forwarding
skbuff carries secpath, since such skbuff should be assumed as
a decapsulated packet from IPsec tunnel by own.

It may be a rare case for an IPsec security gateway, however
it is not rare when the gateway is MIPv6 Home Agent since
the another tunnel end-point is Mobile Node and it changes
the attached network.

Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e1f52208 11-Sep-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPv6]: Fix NULL pointer dereference in ip6_flush_pending_frames

Some of skbs in sk->write_queue do not have skb->dst because
we do not fill skb->dst when we allocate new skb in append_data().

BTW, I think we may not need to (or we should not) increment some stats
when using corking; if 100 sendmsg() (with MSG_MORE) result in 2 packets,
how many should we increment?

If 100, we should set skb->dst for every queued skbs.

If 1 (or 2 (*)), we increment the stats for the first queued skb and
we should just skip incrementing OutDiscards for the rest of queued skbs,
adn we should also impelement this semantics in other places;
e.g., we should increment other stats just once, not 100 times.

*: depends on the place we are discarding the datagram.

I guess should just increment by 1 (or 2).

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8984e41d 21-Aug-2007 Wei Yongjun <yjwei@cn.fujitsu.com>

[IPV6]: Fix kernel panic while send SCTP data with IP fragments

If ICMP6 message with "Packet Too Big" is received after send SCTP DATA,
kernel panic will occur when SCTP DATA is send again.

This is because of a bad dest address when call to skb_copy_bits().

The messages sequence is like this:

Endpoint A Endpoint B
<------- SCTP DATA (size=1432)
ICMP6 message ------->
(Packet Too Big pmtu=1280)
<------- Resend SCTP DATA (size=1432)
------------kernel panic---------------

printing eip:
c05be62a
*pde = 00000000
Oops: 0002 [#1]
SMP
Modules linked in: scomm l2cap bluetooth ipv6 dm_mirror dm_mod video output sbs battery lp floppy sg i2c_piix4 i2c_core pcnet32 mii button ac parport_pc parport ide_cd cdrom serio_raw mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
CPU: 0
EIP: 0060:[<c05be62a>] Not tainted VLI
EFLAGS: 00010282 (2.6.23-rc2 #1)
EIP is at skb_copy_bits+0x4f/0x1ef
eax: 000004d0 ebx: ce12a980 ecx: 00000134 edx: cfd5a880
esi: c8246858 edi: 00000000 ebp: c0759b14 esp: c0759adc
ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068
Process swapper (pid: 0, ti=c0759000 task=c06d0340 task.ti=c0713000)
Stack: c0759b88 c0405867 ce12a980 c8bff838 c789c084 00000000 00000028 cfd5a880
d09f1890 000005dc 0000007b ce12a980 cfd5a880 c8bff838 c0759b88 d09bc521
000004d0 fffff96c 00000200 00000100 c0759b50 cfd5a880 00000246 c0759bd4
Call Trace:
[<c0405e1d>] show_trace_log_lvl+0x1a/0x2f
[<c0405ecd>] show_stack_log_lvl+0x9b/0xa3
[<c040608d>] show_registers+0x1b8/0x289
[<c0406271>] die+0x113/0x246
[<c0625dbc>] do_page_fault+0x4ad/0x57e
[<c0624642>] error_code+0x72/0x78
[<d09bc521>] ip6_output+0x8e5/0xab2 [ipv6]
[<d09bcec1>] ip6_xmit+0x2ea/0x3a3 [ipv6]
[<d0a3f2ca>] sctp_v6_xmit+0x248/0x253 [sctp]
[<d0a3c934>] sctp_packet_transmit+0x53f/0x5ae [sctp]
[<d0a34bf8>] sctp_outq_flush+0x555/0x587 [sctp]
[<d0a34d3c>] sctp_retransmit+0xf8/0x10f [sctp]
[<d0a3d183>] sctp_icmp_frag_needed+0x57/0x5b [sctp]
[<d0a3ece2>] sctp_v6_err+0xcd/0x148 [sctp]
[<d09cf1ce>] icmpv6_notify+0xe6/0x167 [ipv6]
[<d09d009a>] icmpv6_rcv+0x7d7/0x849 [ipv6]
[<d09be240>] ip6_input+0x1dc/0x310 [ipv6]
[<d09be965>] ipv6_rcv+0x294/0x2df [ipv6]
[<c05c3789>] netif_receive_skb+0x2d2/0x335
[<c05c5733>] process_backlog+0x7f/0xd0
[<c05c58f6>] net_rx_action+0x96/0x17e
[<c042e722>] __do_softirq+0x64/0xcd
[<c0406f37>] do_softirq+0x5c/0xac
=======================
Code: 00 00 29 ca 89 d0 2b 45 e0 89 55 ec 85 c0 7e 35 39 45 08 8b 55 e4 0f 4e 45 08 8b 75 e0 8b 7d dc 89 c1 c1 e9 02 03 b2 a0 00 00 00 <f3> a5 89 c1 83 e1 03 74 02 f3 a4 29 45 08 0f 84 7b 01 00 00 01
EIP: [<c05be62a>] skb_copy_bits+0x4f/0x1ef SS:ESP 0068:c0759adc
Kernel panic - not syncing: Fatal exception in interrupt

Arnaldo says:
====================
Thanks! I'm to blame for this one, problem was introduced in:

b0e380b1d8a8e0aca215df97702f99815f05c094

@@ -761,7 +762,7 @@ slow_path:
/*
* Copy a block of the IP datagram.
*/
- if (skb_copy_bits(skb, ptr, frag->h.raw, len))
+ if (skb_copy_bits(skb, ptr, skb_transport_header(skb),
len))
BUG();
left -= len;
====================

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ba9dda3a 07-Jul-2007 Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

[NETFILTER]: x_tables: add TRACE target

The TRACE target can be used to follow IP and IPv6 packets through
the ruleset.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick NcHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 59fbb3a6 27-Jun-2007 Masahide NAKAMURA <nakam@linux-ipv6.org>

[IPV6] MIP6: Loadable module support for MIPv6.

This patch makes MIPv6 loadable module named "mip6".

Here is a modprobe.conf(5) example to load it automatically
when user application uses XFRM state for MIPv6:

alias xfrm-type-10-43 mip6
alias xfrm-type-10-60 mip6

Some MIPv6 feature is not included by this modular, however,
it should not be affected to other features like either IPsec
or IPv6 with and without the patch.
We may discuss XFRM, MH (RAW socket) and ancillary data/sockopt
separately for future work.

Loadable features:
* MH receiving check (to send ICMP error back)
* RO header parsing and building (i.e. RH2 and HAO in DSTOPTS)
* XFRM policy/state database handling for RO

These are NOT covered as loadable:
* Home Address flags and its rule on source address selection
* XFRM sub policy (depends on its own kernel option)
* XFRM functions to receive RO as IPv6 extension header
* MH sending/receiving through raw socket if user application
opens it (since raw socket allows to do so)
* RH2 sending as ancillary data
* RH2 operation with setsockopt(2)

Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5bb1ab09 09-May-2007 David L Stevens <dlstevens@us.ibm.com>

[IPV6]: Send ICMPv6 error on scope violations.

When an IPv6 router is forwarding a packet with a link-local scope source
address off-link, RFC 4007 requires it to send an ICMPv6 destination
unreachable with code 2 ("not neighbor"), but Linux doesn't. Fix below.

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 628a5c56 20-Apr-2007 John Heffner <jheffner@psc.edu>

[INET]: Add IP(V6)_PMTUDISC_RPOBE

Add IP(V6)_PMTUDISC_PROBE value for IP(V6)_MTU_DISCOVER. This option forces
us not to fragment, but does not make use of the kernel path MTU discovery.
That is, it allows for user-mode MTU probing (or, packetization-layer path
MTU discovery). This is particularly useful for diagnostic utilities, like
traceroute/tracepath.

Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b881ef76 20-Apr-2007 John Heffner <jheffner@psc.edu>

[IPV6]: MTU discovery check in ip6_fragment()

Adds a check in ip6_fragment() mirroring ip_fragment() for packets
that we can't fragment, and sends an ICMP Packet Too Big message
in response.

Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d626f62b 27-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce skb_copy_from_linear_data{_offset}

To clearly state the intent of copying from linear sk_buffs, _offset being a
overly long variant but interesting for the sake of saving some bytes.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>


# 35fc92a9 27-Mar-2007 Herbert Xu <herbert@gondor.apana.org.au>

[NET]: Allow forwarding of ip_summed except CHECKSUM_COMPLETE

Right now Xen has a horrible hack that lets it forward packets with
partial checksums. One of the reasons that CHECKSUM_PARTIAL and
CHECKSUM_COMPLETE were added is so that we can get rid of this hack
(where it creates two extra bits in the skbuff to essentially mirror
ip_summed without being destroyed by the forwarding code).

I had forgotten that I've already gone through all the deivce drivers
last time around to make sure that they're looking at ip_summed ==
CHECKSUM_PARTIAL rather than ip_summed != 0 on transmit. In any case,
I've now done that again so it should definitely be safe.

Unfortunately nobody has yet added any code to update CHECKSUM_COMPLETE
values on forward so we I'm setting that to CHECKSUM_NONE. This should
be safe to remove for bridging but I'd like to check that code path
first.

So here is the patch that lets us get rid of the hack by preserving
ip_summed (mostly) on forwarded packets.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 27a884dc 19-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Convert skb->tail to sk_buff_data_t

So that it is also an offset from skb->head, reduces its size from 8 to 4 bytes
on 64bit architectures, allowing us to combine the 4 bytes hole left by the
layer headers conversion, reducing struct sk_buff size to 256 bytes, i.e. 4
64byte cachelines, and since the sk_buff slab cache is SLAB_HWCACHE_ALIGN...
:-)

Many calculations that previously required that skb->{transport,network,
mac}_header be first converted to a pointer now can be done directly, being
meaningful as offsets or pointers.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b0e380b1 10-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: unions of just one member don't get anything done, kill them

Renaming skb->h to skb->transport_header, skb->nh to skb->network_header and
skb->mac to skb->mac_header, to match the names of the associated helpers
(skb[_[re]set]_{transport,network,mac}_header).

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cfe1fc77 16-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce skb_network_header_len

For the common sequence "skb->h.raw - skb->nh.raw", similar to skb->mac_len,
that is precalculated tho, don't think we need to bloat skb with one more
member, so just use this new helper, reducing the number of non-skbuff.h
references to the layer headers even more.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 55f79cc0 14-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[IPV6]: Reset the network header in ip6_nd_hdr

ip6_nd_hdr is always called immediately after a alloc_skb + skb_reserve
sequence, i.e. when skb->tail is equal to skb->data, making it correct to use
skb_reset_network_header().

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e7ac05f3 14-Mar-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>

[NETFILTER]: nf_conntrack: add nf_copy() to safely copy members in skb

This unifies the codes to copy netfilter related datas. Before copying,
nf_copy() puts original members in destination skb.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# badff6d0 13-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce skb_reset_transport_header(skb)

For the common, open coded 'skb->h.raw = skb->data' operation, so that we can
later turn skb->h.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.

This one touches just the most simple cases:

skb->h.raw = skb->data;
skb->h.raw = {skb_push|[__]skb_pull}()

The next ones will handle the slightly more "complex" cases.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0660e03f 25-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce ipv6_hdr(), remove skb->nh.ipv6h

Now the skb->nh union has just one member, .raw, i.e. it is just like the
skb->mac union, strange, no? I'm just leaving it like that till the transport
layer is done with, when we'll rename skb->mac.raw to skb->mac_header (or
->mac_header_offset?), ditto for ->{h,nh}.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c14d2450 11-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce skb_set_network_header

For the cases where the network header is being set to a offset from skb->data.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d56f90a7 10-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce skb_network_header()

For the places where we need a pointer to the network header, it is still legal
to touch skb->nh.raw directly if just adding to, subtracting from or setting it
to another layer header.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bbe735e4 10-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce skb_network_offset()

For the quite common 'skb->nh.raw - skb->data' sequence.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e2d1bca7 10-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Use skb_reset_network_header in skb_push cases

skb_push updates and returns skb->data, so we can just call
skb_reset_network_header after the call to skb_push.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c1d2bbe1 10-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce skb_reset_network_header(skb)

For the common, open coded 'skb->nh.raw = skb->data' operation, so that we can
later turn skb->nh.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.

This one touches just the most simple case, next will handle the slightly more
"complex" cases.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 459a98ed 19-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce skb_reset_mac_header(skb)

For the common, open coded 'skb->mac.raw = skb->data' operation, so that we can
later turn skb->mac.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.

This one touches just the most simple case, next will handle the slightly more
"complex" cases.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 95c385b4 25-Apr-2007 Neil Horman <nhorman@tuxdriver.com>

[IPV6] ADDRCONF: Optimistic Duplicate Address Detection (RFC 4429) Support.

Nominally an autoconfigured IPv6 address is added to an interface in the
Tentative state (as per RFC 2462). Addresses in this state remain in this
state while the Duplicate Address Detection process operates on them to
determine their uniqueness on the network. During this period, these
tentative addresses may not be used for communication, increasing the time
before a node may be able to communicate on a network. Using Optimistic
Duplicate Address Detection, autoconfigured addresses may be used
immediately for communication on the network, as long as certain rules are
followed to avoid conflicts with other nodes during the Duplicate Address
Detection process.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7159039a 22-Feb-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6]: Decentralize EXPORT_SYMBOLs.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# 1ab1457c 09-Feb-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[NET] IPV6: Fix whitespace errors.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3644f0ce 07-Dec-2006 Stephen Hemminger <shemminger@osdl.org>

[NET]: Convert hh_lock to seqlock.

The hard header cache is in the main output path, so using
seqlock instead of reader/writer lock should reduce overhead.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9a217a1c 05-Dec-2006 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6]: Repair IPv6 Fragments

The commit "[IPV6]: Use kmemdup" (commit-id:
af879cc704372ef762584e916129d19ffb39e844) broke IPv6 fragments.

Bug was spotted by Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# af879cc7 16-Nov-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[IPV6]: Use kmemdup

Code diff stats:

[acme@newtoy net-2.6.20]$ codiff /tmp/ipv6.ko.before /tmp/ipv6.ko.after
/pub/scm/linux/kernel/git/acme/net-2.6.20/net/ipv6/ip6_output.c:
ip6_output | -52
ip6_append_data | +2
2 functions changed, 2 bytes added, 52 bytes removed

/pub/scm/linux/kernel/git/acme/net-2.6.20/net/ipv6/addrconf.c:
addrconf_sysctl_register | -27
1 function changed, 27 bytes removed

/pub/scm/linux/kernel/git/acme/net-2.6.20/net/ipv6/tcp_ipv6.c:
tcp_v6_syn_recv_sock | -32
tcp_v6_parse_md5_keys | -24
2 functions changed, 56 bytes removed

/tmp/ipv6.ko.after:
5 functions changed, 2 bytes added, 135 bytes removed
[acme@newtoy net-2.6.20]$

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# a11d206d 04-Nov-2006 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6]: Per-interface statistics support.

For IP MIB (RFC4293).

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# 82e91ffe 09-Nov-2006 Thomas Graf <tgraf@suug.ch>

[NET]: Turn nfmark into generic mark

nfmark is being used in various subsystems and has become
the defacto mark field for all kinds of packets. Therefore
it makes sense to rename it to `mark' and remove the
dependency on CONFIG_NETFILTER.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ae08e1f0 08-Nov-2006 Al Viro <viro@zeniv.linux.org.uk>

[IPV6]: ip6_output annotations

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 90bcaf7b 08-Nov-2006 Al Viro <viro@zeniv.linux.org.uk>

[IPV6]: flowlabels are net-endian

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fbea49e1 22-Sep-2006 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6] NDISC: Add proxy_ndp sysctl.

We do not always need proxy NDP functionality even we
enable forwarding.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 74553b09 22-Sep-2006 Ville Nuorvala <vnuorval@tcs.hut.fi>

[IPV6]: Don't forward packets to proxied link-local address.

Proxying router can't forward traffic sent to link-local address, so signal
the sender and discard the packet. This behavior is clarified by Mobile IPv6
specification (RFC3775) but might be required for all proxying router.
Based on MIPL2 kernel patch.

Signed-off-by: Ville Nuorvala <vnuorval@tcs.hut.fi>
Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# e21e0b5f 22-Sep-2006 Ville Nuorvala <vnuorval@tcs.hut.fi>

[IPV6] NDISC: Handle NDP messages to proxied addresses.

It is required to respond to NDP messages sent directly to the "target"
unicast address. Proxying node (router) is required to handle such
messages. To achieve this, check if the packet in forwarding patch is
NDP message.

With this patch, the proxy neighbor entries are always looked up in
forwarding path. We may want to optimize further.

Based on MIPL2 kernel patch.

Signed-off-by: Ville Nuorvala <vnuorval@tcs.hut.fi>
Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# 27637df9 23-Aug-2006 Masahide NAKAMURA <nakam@linux-ipv6.org>

[IPV6] IPSEC: Support sending with Mobile IPv6 extension headers.

Mobile IPv6 defines home address option as an option of destination
options header. It is placed before fragment header then
ip6_find_1stfragopt() is fixed to know about it.

Home address option also carries final source address of the flow,
then outbound AH calculation should take care of it like routing
header case. Based on MIPL2 kernel patch.

Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1b5c2299 23-Aug-2006 Masahide NAKAMURA <nakam@linux-ipv6.org>

[XFRM] STATE: Support non-fragment outbound transformation headers.

For originated outbound IPv6 packets which will fragment, ip6_append_data()
should know length of extension headers before sending them and
the length is carried by dst_entry.
IPv6 IPsec headers fragment then transformation was
designed to place all headers after fragment header.
OTOH Mobile IPv6 extension headers do not fragment then
it is a good idea to make dst_entry have non-fragment length to tell it
to ip6_append_data().

Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8e1ef0a9 29-Aug-2006 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6]: Cache source address as well in ipv6_pinfo{}.

Based on MIPL2 kernel patch.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Ville Nuorvala <vnuorval@tcs.hut.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cf6b1982 23-Aug-2006 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6] ROUTE: Introduce a helper to check route validity.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: Ville Nuorvala <vnuorval@tcs.hut.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 84fa7933 29-Aug-2006 Patrick McHardy <kaber@trash.net>

[NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE

Replace CHECKSUM_HW by CHECKSUM_PARTIAL (for outgoing packets, whose
checksum still needs to be completed) and CHECKSUM_COMPLETE (for
incoming packets, device supplied full checksum).

Patch originally from Herbert Xu, updated by myself for 2.6.18-rc3.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e9fa4f7b 13-Aug-2006 Herbert Xu <herbert@gondor.apana.org.au>

[INET]: Use pskb_trim_unique when trimming paged unique skbs

The IPv4/IPv6 datagram output path was using skb_trim to trim paged
packets because they know that the packet has not been cloned yet
(since the packet hasn't been given to anything else in the system).

This broke because skb_trim no longer allows paged packets to be
trimmed. Paged packets must be given to one of the pskb_trim functions
instead.

This patch adds a new pskb_trim_unique function to cover the IPv4/IPv6
datagram output path scenario and replaces the corresponding skb_trim
calls with it.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# dafee490 02-Aug-2006 Wei Dong <weid@nanjing-fnst.com>

[IPV6]: SNMPv2 "ipv6IfStatsOutFragCreates" counter error

When I tested linux kernel 2.6.71.7 about statistics
"ipv6IfStatsOutFragCreates", and found that it couldn't increase
correctly. The criteria is RFC 2465:

ipv6IfStatsOutFragCreates OBJECT-TYPE
SYNTAX Counter32
MAX-ACCESS read-only
STATUS current
DESCRIPTION
"The number of output datagram fragments that have
been generated as a result of fragmentation at
this output interface."
::= { ipv6IfStatsEntry 15 }

I think there are two issues in Linux kernel.
1st:
RFC2465 specifies the counter is "The number of output datagram
fragments...". I think increasing this counter after output a fragment
successfully is better. And it should not be increased even though a
fragment is created but failed to output.

2nd:
If we send a big ICMP/ICMPv6 echo request to a host, and receive
ICMP/ICMPv6 echo reply consisted of some fragments. As we know that in
Linux kernel first fragmentation occurs in ICMP layer(maybe saying
transport layer is better), but this is not the "real"
fragmentation,just do some "pre-fragment" -- allocate space for date,
and form a frag_list, etc. The "real" fragmentation happens in IP layer
-- set offset and MF flag and so on. So I think in "fast path" for
ip_fragment/ip6_fragment, if we send a fragment which "pre-fragment" by
upper layer we should also increase "ipv6IfStatsOutFragCreates".

Signed-off-by: Wei Dong <weid@nanjing-fnst.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 32c524d1 02-Aug-2006 Wei Dong <weid@nanjing-fnst.com>

[IPV6]: SNMPv2 "ipv6IfStatsInHdrErrors" counter error

When I tested Linux kernel 2.6.17.7 about statistics
"ipv6IfStatsInHdrErrors", found that this counter couldn't increase
correctly. The criteria is RFC2465:
ipv6IfStatsInHdrErrors OBJECT-TYPE
SYNTAX Counter3
MAX-ACCESS read-only
STATUS current
DESCRIPTION
"The number of input datagrams discarded due to
errors in their IPv6 headers, including version
number mismatch, other format errors, hop count
exceeded, errors discovered in processing their
IPv6 options, etc."
::= { ipv6IfStatsEntry 2 }

When I send TTL=0 and TTL=1 a packet to a router which need to be
forwarded, router just sends an ICMPv6 message to tell the sender that
TIME_EXCEED and HOPLIMITS, but no increments for this counter(in the
function ip6_forward).

Signed-off-by: Wei Dong <weid@nanjing-fnst.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 497c615a 30-Jul-2006 Herbert Xu <herbert@gondor.apana.org.au>

[IPV6]: Audit all ip6_dst_lookup/ip6_dst_store calls

The current users of ip6_dst_lookup can be divided into two classes:

1) The caller holds no locks and is in user-context (UDP).
2) The caller does not want to lookup the dst cache at all.

The second class covers everyone except UDP because most people do
the cache lookup directly before calling ip6_dst_lookup. This patch
adds ip6_sk_dst_lookup for the first class.

Similarly ip6_dst_store users can be divded into those that need to
take the socket dst lock and those that don't. This patch adds
__ip6_dst_store for those (everyone except UDP/datagram) that don't
need an extra lock.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 89114afd 08-Jul-2006 Herbert Xu <herbert@gondor.apana.org.au>

[NET] gso: Add skb_is_gso

This patch adds the wrapper function skb_is_gso which can be used instead
of directly testing skb_shinfo(skb)->gso_size. This makes things a little
nicer and allows us to change the primary key for indicating whether an skb
is GSO (if we ever want to do that).

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f83ef8c0 30-Jun-2006 Herbert Xu <herbert@gondor.apana.org.au>

[IPV6]: Added GSO support for TCPv6

This patch adds GSO support for IPv6 and TCPv6. This is based on a patch
by Ananda Raju <Ananda.Raju@neterion.com>. His original description is:

This patch enables TSO over IPv6. Currently Linux network stacks
restricts TSO over IPv6 by clearing of the NETIF_F_TSO bit from
"dev->features". This patch will remove this restriction.

This patch will introduce a new flag NETIF_F_TSO6 which will be used
to check whether device supports TSO over IPv6. If device support TSO
over IPv6 then we don't clear of NETIF_F_TSO and which will make the
TCP layer to create TSO packets. Any device supporting TSO over IPv6
will set NETIF_F_TSO6 flag in "dev->features" along with NETIF_F_TSO.

In case when user disables TSO using ethtool, NETIF_F_TSO will get
cleared from "dev->features". So even if we have NETIF_F_TSO6 we don't
get TSO packets created by TCP layer.

SKB_GSO_TCPV4 renamed to SKB_GSO_TCP to make it generic GSO packet.
SKB_GSO_UDPV4 renamed to SKB_GSO_UDP as UFO is not a IPv4 feature.
UFO is supported over IPv6 also

The following table shows there is significant improvement in
throughput with normal frames and CPU usage for both normal and jumbo.

--------------------------------------------------
| | 1500 | 9600 |
| ------------------|-------------------|
| | thru CPU | thru CPU |
--------------------------------------------------
| TSO OFF | 2.00 5.5% id | 5.66 20.0% id |
--------------------------------------------------
| TSO ON | 2.63 78.0 id | 5.67 39.0% id |
--------------------------------------------------

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6ab3d562 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de>

Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# 7967168c 22-Jun-2006 Herbert Xu <herbert@gondor.apana.org.au>

[NET]: Merge TSO/UFO fields in sk_buff

Having separate fields in sk_buff for TSO/UFO (tso_size/ufo_size) is not
going to scale if we add any more segmentation methods (e.g., DCCP). So
let's merge them.

They were used to tell the protocol of a packet. This function has been
subsumed by the new gso_type field. This is essentially a set of netdev
feature bits (shifted by 16 bits) that are required to process a specific
skb. As such it's easy to tell whether a given device can process a GSO
skb: you just have to and the gso_type field and the netdev's features
field.

I've made gso_type a conjunction. The idea is that you have a base type
(e.g., SKB_GSO_TCPV4) that can be modified further to support new features.
For example, if we add a hardware TSO type that supports ECN, they would
declare NETIF_F_TSO | NETIF_F_TSO_ECN. All TSO packets with CWR set would
have a gso_type of SKB_GSO_TCPV4 | SKB_GSO_TCPV4_ECN while all other TSO
packets would be SKB_GSO_TCPV4. This means that only the CWR packets need
to be emulated in software.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 984bc16c 09-Jun-2006 James Morris <jmorris@namei.org>

[SECMARK]: Add secmark support to core networking.

Add a secmark field to the skbuff structure, to allow security subsystems to
place security markings on network packets. This is similar to the nfmark
field, except is intended for implementing security policy, rather than than
networking policy.

This patch was already acked in principle by Dave Miller.

Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b59f45d0 28-May-2006 Herbert Xu <herbert@gondor.apana.org.au>

[IPSEC] xfrm: Abstract out encapsulation modes

This patch adds the structure xfrm_mode. It is meant to represent
the operations carried out by transport/tunnel modes.

By doing this we allow additional encapsulation modes to be added
without clogging up the xfrm_input/xfrm_output paths.

Candidate modes include 4-to-6 tunnel mode, 6-to-4 tunnel mode, and
BEET modes.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b30bd282 23-Mar-2006 Patrick McHardy <kaber@trash.net>

[IPV6]: ip6_xmit: remove unnecessary NULL ptr check

The sk argument to ip6_xmit is never NULL nowadays since the skb->priority
assigment expects a valid socket.

Coverity #354

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c7503609 20-Mar-2006 Dave Jones <davej@redhat.com>

[IPV6]: remove useless test in ip6_append_data

We've already dereferenced 'np' a dozen
times at this point, so it's safe to say it's not null.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d76e60a5 20-Mar-2006 David S. Miller <davem@davemloft.net>

[IPV6]: Fix some code/comment formatting in ip6_dst_output().

Signed-off-by: David S. Miller <davem@davemloft.net>


# baa829d8 12-Mar-2006 Patrick McHardy <kaber@trash.net>

[IPV4/6]: Fix UFO error propagation

When ufo_append_data fails err is uninitialized, but returned back.
Strangely gcc doesn't notice it.

Coverity #901 and #902

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d91675f9 24-Feb-2006 YOSHIFUJI Hideaki <yosufuji@linux-ipv6.org>

[IPV6]: Do not ignore IPV6_MTU socket option.

Based on patch by Hoerdt Mickael <hoerdt@clarinet.u-strasbg.fr>.

Signed-off-by: YOSHIFUJI Hideaki <yosufuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a2c2064f 08-Jan-2006 Patrick McHardy <kaber@trash.net>

[IPV6]: Set skb->priority in ip6_output.c

Set skb->priority = sk->sk_priority as in raw.c and IPv4.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3cf3dc6c 14-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[IPV6]: Export some symbols for DCCPv6

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 34a0b3cd 29-Nov-2005 Adrian Bunk <bunk@stusta.de>

[IPV6]: make two functions static

This patch makes two needlessly global functions static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9fb9cbb1 09-Nov-2005 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>

[NETFILTER]: Add nf_conntrack subsystem.

The existing connection tracking subsystem in netfilter can only
handle ipv4. There were basically two choices present to add
connection tracking support for ipv6. We could either duplicate all
of the ipv4 connection tracking code into an ipv6 counterpart, or (the
choice taken by these patches) we could design a generic layer that
could handle both ipv4 and ipv6 and thus requiring only one sub-protocol
(TCP, UDP, etc.) connection tracking helper module to be written.

In fact nf_conntrack is capable of working with any layer 3
protocol.

The existing ipv4 specific conntrack code could also not deal
with the pecularities of doing connection tracking on ipv6,
which is also cured here. For example, these issues include:

1) ICMPv6 handling, which is used for neighbour discovery in
ipv6 thus some messages such as these should not participate
in connection tracking since effectively they are like ARP
messages

2) fragmentation must be handled differently in ipv6, because
the simplistic "defrag, connection track and NAT, refrag"
(which the existing ipv4 connection tracking does) approach simply
isn't feasible in ipv6

3) ipv6 extension header parsing must occur at the correct spots
before and after connection tracking decisions, and there were
no provisions for this in the existing connection tracking
design

4) ipv6 has no need for stateful NAT

The ipv4 specific conntrack layer is kept around, until all of
the ipv4 specific conntrack helpers are ported over to nf_conntrack
and it is feature complete. Once that occurs, the old conntrack
stuff will get placed into the feature-removal-schedule and we will
fully kill it off 6 months later.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# a51482bd 08-Nov-2005 Jesper Juhl <jesper.juhl@gmail.com>

[NET]: kfree cleanup

From: Jesper Juhl <jesper.juhl@gmail.com>

This is the net/ part of the big kfree cleanup patch.

Remove pointless checks for NULL prior to calling kfree() in net/.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Arnaldo Carvalho de Melo <acme@conectiva.com.br>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>


# e89e9cf5 18-Oct-2005 Ananda Raju <ananda.raju@neterion.com>

[IPv4/IPv6]: UFO Scatter-gather approach

Attached is kernel patch for UDP Fragmentation Offload (UFO) feature.

1. This patch incorporate the review comments by Jeff Garzik.
2. Renamed USO as UFO (UDP Fragmentation Offload)
3. udp sendfile support with UFO

This patches uses scatter-gather feature of skb to generate large UDP
datagram. Below is a "how-to" on changes required in network device
driver to use the UFO interface.

UDP Fragmentation Offload (UFO) Interface:
-------------------------------------------
UFO is a feature wherein the Linux kernel network stack will offload the
IP fragmentation functionality of large UDP datagram to hardware. This
will reduce the overhead of stack in fragmenting the large UDP datagram to
MTU sized packets

1) Drivers indicate their capability of UFO using
dev->features |= NETIF_F_UFO | NETIF_F_HW_CSUM | NETIF_F_SG

NETIF_F_HW_CSUM is required for UFO over ipv6.

2) UFO packet will be submitted for transmission using driver xmit routine.
UFO packet will have a non-zero value for

"skb_shinfo(skb)->ufo_size"

skb_shinfo(skb)->ufo_size will indicate the length of data part in each IP
fragment going out of the adapter after IP fragmentation by hardware.

skb->data will contain MAC/IP/UDP header and skb_shinfo(skb)->frags[]
contains the data payload. The skb->ip_summed will be set to CHECKSUM_HW
indicating that hardware has to do checksum calculation. Hardware should
compute the UDP checksum of complete datagram and also ip header checksum of
each fragmented IP packet.

For IPV6 the UFO provides the fragment identification-id in
skb_shinfo(skb)->ip6_frag_id. The adapter should use this ID for generating
IPv6 fragments.

Signed-off-by: Ananda Raju <ananda.raju@neterion.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (forwarded)
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# f36d6ab1 03-Oct-2005 Yan Zheng <yanzheng@21cn.com>

[IPV6]: Fix ipv6 fragment ID selection at slow path

Signed-Off-By: Yan Zheng <yanzheng@21cn.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 41a1f8ea 07-Sep-2005 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6]: Support IPV6_{RECV,}TCLASS socket options / ancillary data.

Based on patch from David L Stevens <dlstevens@us.ibm.com>

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# 64ce2073 09-Aug-2005 Patrick McHardy <kaber@trash.net>

[NET]: Make NETDEBUG pure printk wrappers

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0bd1b59b 09-Aug-2005 Andrew McDonald <andrew@mcdonald.org.uk>

[IPV6]: Check interface bindings on IPv6 raw socket reception

Take account of whether a socket is bound to a particular device when
selecting an IPv6 raw socket to receive a packet. Also perform this
check when receiving IPv6 packets with router alert options.

Signed-off-by: Andrew McDonald <andrew@mcdonald.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 020b4c12 09-Aug-2005 Harald Welte <laforge@netfilter.org>

[NETFILTER]: Move ipv4 specific code from net/core/netfilter.c to net/ipv4/netfilter.c

Netfilter cleanup
- Move ipv4 code from net/core/netfilter.c to net/ipv4/netfilter.c
- Move ipv6 netfilter code from net/ipv6/ip6_output.c to net/ipv6/netfilter.c

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6869c4d8 09-Aug-2005 Harald Welte <laforge@netfilter.org>

[NETFILTER]: reduce netfilter sk_buff enlargement

As discussed at netconf'05, we're trying to save every bit in sk_buff.
The patch below makes sk_buff 8 bytes smaller. I did some basic
testing on my notebook and it seems to work.

The only real in-tree user of nfcache was IPVS, who only needs a
single bit. Unfortunately I couldn't find some other free bit in
sk_buff to stuff that bit into, so I introduced a separate field for
them. Maybe the IPVS guys can resolve that to further save space.

Initially I wanted to shrink pkt_type to three bits (PACKET_HOST and
alike are only 6 values defined), but unfortunately the bluetooth code
overloads pkt_type :(

The conntrack-event-api (out-of-tree) uses nfcache, but Rusty just
came up with a way how to do it without any skb fields, so it's safe
to remove it.

- remove all never-implemented 'nfcache' code
- don't have ipvs code abuse 'nfcache' field. currently get's their own
compile-conditional skb->ipvs_property field. IPVS maintainers can
decide to move this bit elswhere, but nfcache needs to die.
- remove skb->nfcache field to save 4 bytes
- move skb->nfctinfo into three unused bits to save further 4 bytes

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 44456d37 27-Jul-2005 Olaf Hering <olh@suse.de>

[PATCH] turn many #if $undefined_string into #ifdef $undefined_string

turn many #if $undefined_string into #ifdef $undefined_string to fix some
warnings after -Wno-def was added to global CFLAGS

Signed-off-by: Olaf Hering <olh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e176fe89 05-Jul-2005 Thomas Graf <tgraf@suug.ch>

[NET]: Remove unused security member in sk_buff

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 18b8afc7 21-Jun-2005 Patrick McHardy <kaber@trash.net>

[NETFILTER]: Kill nf_debug

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2fdba6b0 18-May-2005 Herbert Xu <herbert@gondor.apana.org.au>

[IPV4/IPV6] Ensure all frag_list members have NULL sk

Having frag_list members which holds wmem of an sk leads to nightmares
with partially cloned frag skb's. The reason is that once you unleash
a skb with a frag_list that has individual sk ownerships into the stack
you can never undo those ownerships safely as they may have been cloned
by things like netfilter. Since we have to undo them in order to make
skb_linearize happy this approach leads to a dead-end.

So let's go the other way and make this an invariant:

For any skb on a frag_list, skb->sk must be NULL.

That is, the socket ownership always belongs to the head skb.
It turns out that the implementation is actually pretty simple.

The above invariant is actually violated in the following patch
for a short duration inside ip_fragment. This is OK because the
offending frag_list member is either destroyed at the end of the
slow path without being sent anywhere, or it is detached from
the frag_list before being sent.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3320da89 19-Apr-2005 Herbert Xu <herbert@gondor.apana.org.au>

[IPV6]: Replace bogus instances of inet->recverr

While looking at this problem I noticed that IPv6 was sometimes
looking at inet->recverr which is bogus. Here is a patch to
correct that and use np->recverr.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1da177e4 16-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org>

Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!