History log of /linux-master/net/ipv6/mcast.c
Revision Date Author Comments
# 2f0ff05a 28-Feb-2024 Eric Dumazet <edumazet@google.com>

ipv6/addrconf: annotate data-races around devconf fields (II)

Final (?) round of this series.

Annotate lockless reads on following devconf fields,
because they be changed concurrently from /proc/net/ipv6/conf.

- accept_dad
- optimistic_dad
- use_optimistic
- use_oif_addrs_only
- ra_honor_pio_life
- keep_addr_on_down
- ndisc_notify
- ndisc_evict_nocarrier
- suppress_frag_ndisc
- addr_gen_mode
- seg6_enabled
- ioam6_enabled
- ioam6_id
- ioam6_id_wide
- drop_unicast_in_l2_multicast
- mldv[12]_unsolicited_report_interval
- force_mld_version
- force_tllao
- accept_untracked_na
- drop_unsolicited_na
- accept_source_route

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 17ef8efc 09-Feb-2024 Eric Dumazet <edumazet@google.com>

ipv6: mcast: remove one synchronize_net() barrier in ipv6_mc_down()

As discussed in the past (commit 2d3916f31891 ("ipv6: fix skb drops
in igmp6_event_query() and igmp6_event_report()")) I think the
synchronize_net() call in ipv6_mc_down() is not needed.

Under load, synchronize_net() can last between 200 usec and 5 ms.

KASAN seems to agree as well.

Fixes: f185de28d9ae ("mld: add new workqueues for process mld events")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Taehee Yoo <ap420073@gmail.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2e7ef287 17-Jan-2024 Nikita Zhandarovich <n.zhandarovich@fintech.ru>

ipv6: mcast: fix data-race in ipv6_mc_down / mld_ifc_work

idev->mc_ifc_count can be written over without proper locking.

Originally found by syzbot [1], fix this issue by encapsulating calls
to mld_ifc_stop_work() (and mld_gq_stop_work() for good measure) with
mutex_lock() and mutex_unlock() accordingly as these functions
should only be called with mc_lock per their declarations.

[1]
BUG: KCSAN: data-race in ipv6_mc_down / mld_ifc_work

write to 0xffff88813a80c832 of 1 bytes by task 3771 on cpu 0:
mld_ifc_stop_work net/ipv6/mcast.c:1080 [inline]
ipv6_mc_down+0x10a/0x280 net/ipv6/mcast.c:2725
addrconf_ifdown+0xe32/0xf10 net/ipv6/addrconf.c:3949
addrconf_notify+0x310/0x980
notifier_call_chain kernel/notifier.c:93 [inline]
raw_notifier_call_chain+0x6b/0x1c0 kernel/notifier.c:461
__dev_notify_flags+0x205/0x3d0
dev_change_flags+0xab/0xd0 net/core/dev.c:8685
do_setlink+0x9f6/0x2430 net/core/rtnetlink.c:2916
rtnl_group_changelink net/core/rtnetlink.c:3458 [inline]
__rtnl_newlink net/core/rtnetlink.c:3717 [inline]
rtnl_newlink+0xbb3/0x1670 net/core/rtnetlink.c:3754
rtnetlink_rcv_msg+0x807/0x8c0 net/core/rtnetlink.c:6558
netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2545
rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6576
netlink_unicast_kernel net/netlink/af_netlink.c:1342 [inline]
netlink_unicast+0x589/0x650 net/netlink/af_netlink.c:1368
netlink_sendmsg+0x66e/0x770 net/netlink/af_netlink.c:1910
...

write to 0xffff88813a80c832 of 1 bytes by task 22 on cpu 1:
mld_ifc_work+0x54c/0x7b0 net/ipv6/mcast.c:2653
process_one_work kernel/workqueue.c:2627 [inline]
process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2700
worker_thread+0x525/0x730 kernel/workqueue.c:2781
...

Fixes: 2d9a93b4902b ("mld: convert from timer to delayed work")
Reported-by: syzbot+a9400cabb1d784e49abf@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/000000000000994e09060ebcdffb@google.com/
Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru>
Acked-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20240117172102.12001-1-n.zhandarovich@fintech.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# b4a11b20 18-Oct-2023 Heng Guo <heng.guo@windriver.com>

net: fix IPSTATS_MIB_OUTPKGS increment in OutForwDatagrams.

Reproduce environment:
network with 3 VM linuxs is connected as below:
VM1<---->VM2(latest kernel 6.5.0-rc7)<---->VM3
VM1: eth0 ip: 192.168.122.207 MTU 1500
VM2: eth0 ip: 192.168.122.208, eth1 ip: 192.168.123.224 MTU 1500
VM3: eth0 ip: 192.168.123.240 MTU 1500

Reproduce:
VM1 send 1400 bytes UDP data to VM3 using tools scapy with flags=0.
scapy command:
send(IP(dst="192.168.123.240",flags=0)/UDP()/str('0'*1400),count=1,
inter=1.000000)

Result:
Before IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails
FragOKs FragFails FragCreates
Ip: 1 64 11 0 3 4 0 0 4 7 0 0 0 0 0 0 0 0 0
......
----------------------------------------------------------------------
After IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails
FragOKs FragFails FragCreates
Ip: 1 64 12 0 3 5 0 0 4 8 0 0 0 0 0 0 0 0 0
......
----------------------------------------------------------------------
"ForwDatagrams" increase from 4 to 5 and "OutRequests" also increase
from 7 to 8.

Issue description and patch:
IPSTATS_MIB_OUTPKTS("OutRequests") is counted with IPSTATS_MIB_OUTOCTETS
("OutOctets") in ip_finish_output2().
According to RFC 4293, it is "OutOctets" counted with "OutTransmits" but
not "OutRequests". "OutRequests" does not include any datagrams counted
in "ForwDatagrams".
ipSystemStatsOutOctets OBJECT-TYPE
DESCRIPTION
"The total number of octets in IP datagrams delivered to the
lower layers for transmission. Octets from datagrams
counted in ipIfStatsOutTransmits MUST be counted here.
ipSystemStatsOutRequests OBJECT-TYPE
DESCRIPTION
"The total number of IP datagrams that local IP user-
protocols (including ICMP) supplied to IP in requests for
transmission. Note that this counter does not include any
datagrams counted in ipSystemStatsOutForwDatagrams.
So do patch to define IPSTATS_MIB_OUTPKTS to "OutTransmits" and add
IPSTATS_MIB_OUTREQUESTS for "OutRequests".
Add IPSTATS_MIB_OUTREQUESTS counter in __ip_local_out() for ipv4 and add
IPSTATS_MIB_OUT counter in ip6_finish_output2() for ipv6.

Test result with patch:
Before IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails
FragOKs FragFails FragCreates OutTransmits
Ip: 1 64 9 0 5 1 0 0 3 3 0 0 0 0 0 0 0 0 0 4
......
root@qemux86-64:~# cat /proc/net/netstat
......
IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts
OutBcastPkts InOctets OutOctets InMcastOctets OutMcastOctets
InBcastOctets OutBcastOctets InCsumErrors InNoECTPkts InECT1Pkts
InECT0Pkts InCEPkts ReasmOverlaps
IpExt: 0 0 0 0 0 0 2976 1896 0 0 0 0 0 9 0 0 0 0
----------------------------------------------------------------------
After IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails
FragOKs FragFails FragCreates OutTransmits
Ip: 1 64 10 0 5 2 0 0 3 3 0 0 0 0 0 0 0 0 0 5
......
root@qemux86-64:~# cat /proc/net/netstat
......
IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts
OutBcastPkts InOctets OutOctets InMcastOctets OutMcastOctets
InBcastOctets OutBcastOctets InCsumErrors InNoECTPkts InECT1Pkts
InECT0Pkts InCEPkts ReasmOverlaps
IpExt: 0 0 0 0 0 0 4404 3324 0 0 0 0 0 10 0 0 0 0
----------------------------------------------------------------------
"ForwDatagrams" increase from 1 to 2 and "OutRequests" is keeping 3.
"OutTransmits" increase from 4 to 5 and "OutOctets" increase 1428.

Signed-off-by: Heng Guo <heng.guo@windriver.com>
Reviewed-by: Kun Song <Kun.Song@windriver.com>
Reviewed-by: Filip Pudak <filip.pudak@windriver.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6559c0ff 12-Sep-2023 Eric Dumazet <edumazet@google.com>

ipv6: lockless IPV6_MULTICAST_ALL implementation

Move np->mc_all to an atomic flags to fix data-races.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b0adfba7 12-Sep-2023 Eric Dumazet <edumazet@google.com>

ipv6: lockless IPV6_UNICAST_HOPS implementation

Some np->hop_limit accesses are racy, when socket lock is not held.

Add missing annotations and switch to full lockless implementation.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 59bb1d69 12-Sep-2023 Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>

ipv6: mcast: Remove redundant comparison in igmp6_mcf_get_next()

The 'state->im' value will always be non-zero after
the 'while' statement, so the check can be removed.

Found by InfoTeCS on behalf of Linux Verification Center
(linuxtesting.org) with SVACE.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230912084100.1502379-1-Ilia.Gavrilov@infotecs.ru
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# 5bc67a85 11-Jul-2023 Guillaume Nault <gnault@redhat.com>

ipv6: Constify the sk parameter of several helper functions.

icmpv6_flow_init(), ip6_datagram_flow_key_init() and ip6_mc_hdr() don't
need to modify their sk argument. Make that explicit using const.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 66eb554c 16-Mar-2023 Eric Dumazet <edumazet@google.com>

ipv6: constify inet6_mc_check()

inet6_mc_check() is essentially a read-only function.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8032bf12 09-Oct-2022 Jason A. Donenfeld <Jason@zx2c4.com>

treewide: use get_random_u32_below() instead of deprecated function

This is a simple mechanical transformation done by:

@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
(E)

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>


# 81895a65 05-Oct-2022 Jason A. Donenfeld <Jason@zx2c4.com>

treewide: use prandom_u32_max() when possible, part 1

Rather than incurring a division or requesting too many random bytes for
the given range, use the prandom_u32_max() function, which only takes
the minimum required bytes from the RNG and avoids divisions. This was
done mechanically with this coccinelle script:

@basic@
expression E;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u64;
@@
(
- ((T)get_random_u32() % (E))
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ((E) - 1))
+ prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
|
- ((u64)(E) * get_random_u32() >> 32)
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ~PAGE_MASK)
+ prandom_u32_max(PAGE_SIZE)
)

@multi_line@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
identifier RAND;
expression E;
@@

- RAND = get_random_u32();
... when != RAND
- RAND %= (E);
+ RAND = prandom_u32_max(E);

// Find a potential literal
@literal_mask@
expression LITERAL;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
position p;
@@

((T)get_random_u32()@p & (LITERAL))

// Add one to the literal.
@script:python add_one@
literal << literal_mask.LITERAL;
RESULT;
@@

value = None
if literal.startswith('0x'):
value = int(literal, 16)
elif literal[0] in '123456789':
value = int(literal, 10)
if value is None:
print("I don't know how to handle %s" % (literal))
cocci.include_match(False)
elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
print("Skipping 0x%x for cleanup elsewhere" % (value))
cocci.include_match(False)
elif value & (value + 1) != 0:
print("Skipping 0x%x because it's not a power of two minus one" % (value))
cocci.include_match(False)
elif literal.startswith('0x'):
coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
else:
coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))

// Replace the literal mask with the calculated result.
@plus_one@
expression literal_mask.LITERAL;
position literal_mask.p;
expression add_one.RESULT;
identifier FUNC;
@@

- (FUNC()@p & (LITERAL))
+ prandom_u32_max(RESULT)

@collapse_ret@
type T;
identifier VAR;
expression E;
@@

{
- T VAR;
- VAR = (E);
- return VAR;
+ return E;
}

@drop_var@
type T;
identifier VAR;
@@

{
- T VAR;
... when != VAR
}

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>


# 6dadbe4b 01-Sep-2022 Martin KaFai Lau <martin.lau@kernel.org>

bpf: net: Change do_ipv6_getsockopt() to take the sockptr_t argument

Similar to the earlier patch that changes sk_getsockopt() to
take the sockptr_t argument . This patch also changes
do_ipv6_getsockopt() to take the sockptr_t argument such that
a latter patch can make bpf_getsockopt(SOL_IPV6) to reuse
do_ipv6_getsockopt().

Note on the change in ip6_mc_msfget(). This function is to
return an array of sockaddr_storage in optval. This function
is shared between ipv6_get_msfilter() and compat_ipv6_get_msfilter().
However, the sockaddr_storage is stored at different offset of the
optval because of the difference between group_filter and
compat_group_filter. Thus, a new 'ss_offset' argument is
added to ip6_mc_msfget().

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002853.2892532-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 3e7d18b9 22-Jul-2022 Taehee Yoo <ap420073@gmail.com>

net: mld: fix reference count leak in mld_{query | report}_work()

mld_{query | report}_work() processes queued events.
If there are too many events in the queue, it re-queue a work.
And then, it returns without in6_dev_put().
But if queuing is failed, it should call in6_dev_put(), but it doesn't.
So, a reference count leak would occur.

THREAD0 THREAD1
mld_report_work()
spin_lock_bh()
if (!mod_delayed_work())
in6_dev_hold();
spin_unlock_bh()
spin_lock_bh()
schedule_delayed_work()
spin_unlock_bh()

Script to reproduce(by Hangbin Liu):
ip netns add ns1
ip netns add ns2
ip netns exec ns1 sysctl -w net.ipv6.conf.all.force_mld_version=1
ip netns exec ns2 sysctl -w net.ipv6.conf.all.force_mld_version=1

ip -n ns1 link add veth0 type veth peer name veth0 netns ns2
ip -n ns1 link set veth0 up
ip -n ns2 link set veth0 up

for i in `seq 50`; do
for j in `seq 100`; do
ip -n ns1 addr add 2021:${i}::${j}/64 dev veth0
ip -n ns2 addr add 2022:${i}::${j}/64 dev veth0
done
done
modprobe -r veth
ip -a netns del

splat looks like:
unregister_netdevice: waiting for veth0 to become free. Usage count = 2
leaked reference.
ipv6_add_dev+0x324/0xec0
addrconf_notify+0x481/0xd10
raw_notifier_call_chain+0xe3/0x120
call_netdevice_notifiers+0x106/0x160
register_netdevice+0x114c/0x16b0
veth_newlink+0x48b/0xa50 [veth]
rtnl_newlink+0x11a2/0x1a40
rtnetlink_rcv_msg+0x63f/0xc00
netlink_rcv_skb+0x1df/0x3e0
netlink_unicast+0x5de/0x850
netlink_sendmsg+0x6c9/0xa90
____sys_sendmsg+0x76a/0x780
__sys_sendmsg+0x27c/0x340
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Tested-by: Hangbin Liu <liuhangbin@gmail.com>
Fixes: f185de28d9ae ("mld: add new workqueues for process mld events")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a9384a4c 29-Apr-2022 Eric Dumazet <edumazet@google.com>

mld: respect RCU rules in ip6_mc_source() and ip6_mc_msfilter()

Whenever RCU protected list replaces an object,
the pointer to the new object needs to be updated
_before_ the call to kfree_rcu() or call_rcu()

Also ip6_mc_msfilter() needs to update the pointer
before releasing the mc_lock mutex.

Note that linux-5.13 was supporting kfree_rcu(NULL, rcu),
so this fix does not need the conditional test I was
forced to use in the equivalent patch for IPv4.

Fixes: 882ba1f73c06 ("mld: convert ipv6_mc_socklist->sflist to RCU")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2d3916f3 03-Mar-2022 Eric Dumazet <edumazet@google.com>

ipv6: fix skb drops in igmp6_event_query() and igmp6_event_report()

While investigating on why a synchronize_net() has been added recently
in ipv6_mc_down(), I found that igmp6_event_query() and igmp6_event_report()
might drop skbs in some cases.

Discussion about removing synchronize_net() from ipv6_mc_down()
will happen in a different thread.

Fixes: f185de28d9ae ("mld: add new workqueues for process mld events")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Taehee Yoo <ap420073@gmail.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220303173728.937869-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 26394fc1 11-Feb-2022 Ignat Korchagin <ignat@cloudflare.com>

ipv6: mcast: use rcu-safe version of ipv6_get_lladdr()

Some time ago 8965779d2c0e ("ipv6,mcast: always hold idev->lock before mca_lock")
switched ipv6_get_lladdr() to __ipv6_get_lladdr(), which is rcu-unsafe
version. That was OK, because idev->lock was held for these codepaths.

In 88e2ca308094 ("mld: convert ifmcaddr6 to RCU") these external locks were
removed, so we probably need to restore the original rcu-safe call.

Otherwise, we occasionally get a machine crashed/stalled with the following
in dmesg:

[ 3405.966610][T230589] general protection fault, probably for non-canonical address 0xdead00000000008c: 0000 [#1] SMP NOPTI
[ 3405.982083][T230589] CPU: 44 PID: 230589 Comm: kworker/44:3 Tainted: G O 5.15.19-cloudflare-2022.2.1 #1
[ 3405.998061][T230589] Hardware name: SUPA-COOL-SERV
[ 3406.009552][T230589] Workqueue: mld mld_ifc_work
[ 3406.017224][T230589] RIP: 0010:__ipv6_get_lladdr+0x34/0x60
[ 3406.025780][T230589] Code: 57 10 48 83 c7 08 48 89 e5 48 39 d7 74 3e 48 8d 82 38 ff ff ff eb 13 48 8b 90 d0 00 00 00 48 8d 82 38 ff ff ff 48 39 d7 74 22 <66> 83 78 32 20 77 1b 75 e4 89 ca 23 50 2c 75 dd 48 8b 50 08 48 8b
[ 3406.055748][T230589] RSP: 0018:ffff94e4b3fc3d10 EFLAGS: 00010202
[ 3406.065617][T230589] RAX: dead00000000005a RBX: ffff94e4b3fc3d30 RCX: 0000000000000040
[ 3406.077477][T230589] RDX: dead000000000122 RSI: ffff94e4b3fc3d30 RDI: ffff8c3a31431008
[ 3406.089389][T230589] RBP: ffff94e4b3fc3d10 R08: 0000000000000000 R09: 0000000000000000
[ 3406.101445][T230589] R10: ffff8c3a31430000 R11: 000000000000000b R12: ffff8c2c37887100
[ 3406.113553][T230589] R13: ffff8c3a39537000 R14: 00000000000005dc R15: ffff8c3a31431000
[ 3406.125730][T230589] FS: 0000000000000000(0000) GS:ffff8c3b9fc80000(0000) knlGS:0000000000000000
[ 3406.138992][T230589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3406.149895][T230589] CR2: 00007f0dfea1db60 CR3: 000000387b5f2000 CR4: 0000000000350ee0
[ 3406.162421][T230589] Call Trace:
[ 3406.170235][T230589] <TASK>
[ 3406.177736][T230589] mld_newpack+0xfe/0x1a0
[ 3406.186686][T230589] add_grhead+0x87/0xa0
[ 3406.195498][T230589] add_grec+0x485/0x4e0
[ 3406.204310][T230589] ? newidle_balance+0x126/0x3f0
[ 3406.214024][T230589] mld_ifc_work+0x15d/0x450
[ 3406.223279][T230589] process_one_work+0x1e6/0x380
[ 3406.232982][T230589] worker_thread+0x50/0x3a0
[ 3406.242371][T230589] ? rescuer_thread+0x360/0x360
[ 3406.252175][T230589] kthread+0x127/0x150
[ 3406.261197][T230589] ? set_kthread_struct+0x40/0x40
[ 3406.271287][T230589] ret_from_fork+0x22/0x30
[ 3406.280812][T230589] </TASK>
[ 3406.288937][T230589] Modules linked in: ... [last unloaded: kheaders]
[ 3406.476714][T230589] ---[ end trace 3525a7655f2f3b9e ]---

Fixes: 88e2ca308094 ("mld: convert ifmcaddr6 to RCU")
Reported-by: David Pinilla Caparros <dpini@cloudflare.com>
Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3f22bb13 01-Sep-2021 Jiwon Kim <jiwonaid0@gmail.com>

ipv6: change return type from int to void for mld_process_v2

The mld_process_v2 only returned 0.

So, the return type is changed to void.

Signed-off-by: Jiwon Kim <jiwonaid0@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e11c0e25 04-Aug-2021 Gustavo A. R. Silva <gustavoars@kernel.org>

net/ipv6/mcast: Use struct_size() helper

Replace IP6_SFLSIZE() with struct_size() helper in order to avoid any
potential type mistakes or integer overflows that, in the worst
scenario, could lead to heap overflows.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ffa85b73 13-Jun-2021 Taehee Yoo <ap420073@gmail.com>

mld: avoid unnecessary high order page allocation in mld_newpack()

If link mtu is too big, mld_newpack() allocates high-order page.
But most mld packets don't need high-order page.
So, it might waste unnecessary pages.
To avoid this, it makes mld_newpack() try to allocate order-0 page.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 020ef930 16-May-2021 Taehee Yoo <ap420073@gmail.com>

mld: fix panic in mld_newpack()

mld_newpack() doesn't allow to allocate high order page,
only order-0 allocation is allowed.
If headroom size is too large, a kernel panic could occur in skb_put().

Test commands:
ip netns del A
ip netns del B
ip netns add A
ip netns add B
ip link add veth0 type veth peer name veth1
ip link set veth0 netns A
ip link set veth1 netns B

ip netns exec A ip link set lo up
ip netns exec A ip link set veth0 up
ip netns exec A ip -6 a a 2001:db8:0::1/64 dev veth0
ip netns exec B ip link set lo up
ip netns exec B ip link set veth1 up
ip netns exec B ip -6 a a 2001:db8:0::2/64 dev veth1
for i in {1..99}
do
let A=$i-1
ip netns exec A ip link add ip6gre$i type ip6gre \
local 2001:db8:$A::1 remote 2001:db8:$A::2 encaplimit 100
ip netns exec A ip -6 a a 2001:db8:$i::1/64 dev ip6gre$i
ip netns exec A ip link set ip6gre$i up

ip netns exec B ip link add ip6gre$i type ip6gre \
local 2001:db8:$A::2 remote 2001:db8:$A::1 encaplimit 100
ip netns exec B ip -6 a a 2001:db8:$i::2/64 dev ip6gre$i
ip netns exec B ip link set ip6gre$i up
done

Splat looks like:
kernel BUG at net/core/skbuff.c:110!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
CPU: 0 PID: 7 Comm: kworker/0:1 Not tainted 5.12.0+ #891
Workqueue: ipv6_addrconf addrconf_dad_work
RIP: 0010:skb_panic+0x15d/0x15f
Code: 92 fe 4c 8b 4c 24 10 53 8b 4d 70 45 89 e0 48 c7 c7 00 ae 79 83
41 57 41 56 41 55 48 8b 54 24 a6 26 f9 ff <0f> 0b 48 8b 6c 24 20 89
34 24 e8 4a 4e 92 fe 8b 34 24 48 c7 c1 20
RSP: 0018:ffff88810091f820 EFLAGS: 00010282
RAX: 0000000000000089 RBX: ffff8881086e9000 RCX: 0000000000000000
RDX: 0000000000000089 RSI: 0000000000000008 RDI: ffffed1020123efb
RBP: ffff888005f6eac0 R08: ffffed1022fc0031 R09: ffffed1022fc0031
R10: ffff888117e00187 R11: ffffed1022fc0030 R12: 0000000000000028
R13: ffff888008284eb0 R14: 0000000000000ed8 R15: 0000000000000ec0
FS: 0000000000000000(0000) GS:ffff888117c00000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8b801c5640 CR3: 0000000033c2c006 CR4: 00000000003706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
? ip6_mc_hdr.isra.26.constprop.46+0x12a/0x600
? ip6_mc_hdr.isra.26.constprop.46+0x12a/0x600
skb_put.cold.104+0x22/0x22
ip6_mc_hdr.isra.26.constprop.46+0x12a/0x600
? rcu_read_lock_sched_held+0x91/0xc0
mld_newpack+0x398/0x8f0
? ip6_mc_hdr.isra.26.constprop.46+0x600/0x600
? lock_contended+0xc40/0xc40
add_grhead.isra.33+0x280/0x380
add_grec+0x5ca/0xff0
? mld_sendpack+0xf40/0xf40
? lock_downgrade+0x690/0x690
mld_send_initial_cr.part.34+0xb9/0x180
ipv6_mc_dad_complete+0x15d/0x1b0
addrconf_dad_completed+0x8d2/0xbb0
? lock_downgrade+0x690/0x690
? addrconf_rs_timer+0x660/0x660
? addrconf_dad_work+0x73c/0x10e0
addrconf_dad_work+0x73c/0x10e0

Allowing high order page allocation could fix this problem.

Fixes: 72e09ad107e7 ("ipv6: avoid high order allocations")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 83c1ca25 16-Apr-2021 Taehee Yoo <ap420073@gmail.com>

mld: remove unnecessary prototypes

Some prototypes are unnecessary, so delete it.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4b4b8446 04-Apr-2021 Taehee Yoo <ap420073@gmail.com>

mld: change lockdep annotation for ip6_sf_socklist and ipv6_mc_socklist

struct ip6_sf_socklist and ipv6_mc_socklist are per-socket MLD data.
These data are protected by rtnl lock, socket lock, and RCU.
So, when these are used, it verifies whether rtnl lock is acquired or not.

ip6_mc_msfget() is called by do_ipv6_getsockopt().
But caller doesn't acquire rtnl lock.
So, when these data are used in the ip6_mc_msfget() lockdep warns about it.
But accessing these is actually safe because socket lock was acquired by
do_ipv6_getsockopt().

So, it changes lockdep annotation from rtnl lock to socket lock.
(rtnl_dereference -> sock_dereference)

Locking graph for mld data is like below:

When writing mld data:
do_ipv6_setsockopt()
rtnl_lock
lock_sock
(mld functions)
idev->mc_lock(if per-interface mld data is modified)

When reading mld data:
do_ipv6_getsockopt()
lock_sock
ip6_mc_msfget()

Splat looks like:
=============================
WARNING: suspicious RCU usage
5.12.0-rc4+ #503 Not tainted
-----------------------------
net/ipv6/mcast.c:610 suspicious rcu_dereference_protected() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
1 lock held by mcast-listener-/923:
#0: ffff888007958a70 (sk_lock-AF_INET6){+.+.}-{0:0}, at:
ipv6_get_msfilter+0xaf/0x190

stack backtrace:
CPU: 1 PID: 923 Comm: mcast-listener- Not tainted 5.12.0-rc4+ #503
Call Trace:
dump_stack+0xa4/0xe5
ip6_mc_msfget+0x553/0x6c0
? ipv6_sock_mc_join_ssm+0x10/0x10
? lockdep_hardirqs_on_prepare+0x3e0/0x3e0
? mark_held_locks+0xb7/0x120
? lockdep_hardirqs_on_prepare+0x27c/0x3e0
? __local_bh_enable_ip+0xa5/0xf0
? lock_sock_nested+0x82/0xf0
ipv6_get_msfilter+0xc3/0x190
? compat_ipv6_get_msfilter+0x300/0x300
? lock_downgrade+0x690/0x690
do_ipv6_getsockopt.isra.6.constprop.13+0x1809/0x29e0
? do_ipv6_mcast_group_source+0x150/0x150
? register_lock_class+0x1750/0x1750
? kvm_sched_clock_read+0x14/0x30
? sched_clock+0x5/0x10
? sched_clock_cpu+0x18/0x170
? find_held_lock+0x3a/0x1c0
? lock_downgrade+0x690/0x690
? ipv6_getsockopt+0xdb/0x1b0
ipv6_getsockopt+0xdb/0x1b0
[ ... ]

Fixes: 88e2ca308094 ("mld: convert ifmcaddr6 to RCU")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 63ed8de4 25-Mar-2021 Taehee Yoo <ap420073@gmail.com>

mld: add mc_lock for protecting per-interface mld data

The purpose of this lock is to avoid a bottleneck in the query/report
event handler logic.

By previous patches, almost all mld data is protected by RTNL.
So, the query and report event handler, which is data path logic
acquires RTNL too. Therefore if a lot of query and report events
are received, it uses RTNL for a long time.
So it makes the control-plane bottleneck because of using RTNL.
In order to avoid this bottleneck, mc_lock is added.

mc_lock protect only per-interface mld data and per-interface mld
data is used in the query/report event handler logic.
So, no longer rtnl_lock is needed in the query/report event handler logic.
Therefore bottleneck will be disappeared by mc_lock.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f185de28 25-Mar-2021 Taehee Yoo <ap420073@gmail.com>

mld: add new workqueues for process mld events

When query/report packets are received, mld module processes them.
But they are processed under BH context so it couldn't use sleepable
functions. So, in order to switch context, the two workqueues are
added which processes query and report event.

In the struct inet6_dev, mc_{query | report}_queue are added so it
is per-interface queue.
And mc_{query | report}_work are workqueue structure.

When the query or report event is received, skb is queued to proper
queue and worker function is scheduled immediately.
Workqueues and queues are protected by spinlock, which is
mc_{query | report}_lock, and worker functions are protected by RTNL.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 88e2ca30 25-Mar-2021 Taehee Yoo <ap420073@gmail.com>

mld: convert ifmcaddr6 to RCU

The ifmcaddr6 has been protected by inet6_dev->lock(rwlock) so that
the critical section is atomic context. In order to switch this context,
changing locking is needed. The ifmcaddr6 actually already protected by
RTNL So if it's converted to use RCU, its control path context can be
switched to sleepable.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4b200e39 25-Mar-2021 Taehee Yoo <ap420073@gmail.com>

mld: convert ip6_sf_list to RCU

The ip6_sf_list has been protected by mca_lock(spin_lock) so that the
critical section is atomic context. In order to switch this context,
changing locking is needed. The ip6_sf_list actually already protected
by RTNL So if it's converted to use RCU, its control path context can
be switched to sleepable.
But It doesn't remove mca_lock yet because ifmcaddr6 isn't converted
to RCU yet. So, It's not fully converted to the sleepable context.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 882ba1f7 25-Mar-2021 Taehee Yoo <ap420073@gmail.com>

mld: convert ipv6_mc_socklist->sflist to RCU

The sflist has been protected by rwlock so that the critical section
is atomic context.
In order to switch this context, changing locking is needed.
The sflist actually already protected by RTNL So if it's converted
to use RCU, its control path context can be switched to sleepable.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cf2ce339 25-Mar-2021 Taehee Yoo <ap420073@gmail.com>

mld: get rid of inet6_dev->mc_lock

The purpose of mc_lock is to protect inet6_dev->mc_tomb.
But mc_tomb is already protected by RTNL and all functions,
which manipulate mc_tomb are called under RTNL.
So, mc_lock is not needed.
Furthermore, it is spinlock so the critical section is atomic.
In order to reduce atomic context, it should be removed.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2d9a93b4 25-Mar-2021 Taehee Yoo <ap420073@gmail.com>

mld: convert from timer to delayed work

mcast.c has several timers for delaying works.
Timer's expire handler is working under atomic context so it can't use
sleepable things such as GFP_KERNEL, mutex, etc.
In order to use sleepable APIs, it converts from timers to delayed work.
But there are some critical sections, which is used by both process
and BH context. So that it still uses spin_lock_bh() and rwlock.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 400490ac 27-Oct-2020 Lukas Bulwahn <lukas.bulwahn@gmail.com>

ipv6: mcast: make annotations for ip6_mc_msfget() consistent

Commit 931ca7ab7fe8 ("ip*_mc_gsfget(): lift copyout of struct group_filter
into callers") adjusted the type annotations for ip6_mc_msfget() at its
declaration, but missed the type annotations at its definition.

Hence, sparse complains on ./net/ipv6/mcast.c:

mcast.c:550:5: error: symbol 'ip6_mc_msfget' redeclared with different type \
(incompatible argument 3 (different address spaces))

Make ip6_mc_msfget() annotations consistent, which also resolves this
warning from sparse:

mcast.c:607:34: warning: incorrect type in argument 1 (different address spaces)
mcast.c:607:34: expected void [noderef] __user *to
mcast.c:607:34: got struct __kernel_sockaddr_storage *p

No functional change. No change in object code.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Link: https://lore.kernel.org/r/20201028115349.6855-1-lukas.bulwahn@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# ea2fce88 11-Jun-2020 Wang Hai <wanghai38@huawei.com>

mld: fix memory leak in ipv6_mc_destroy_dev()

Commit a84d01647989 ("mld: fix memory leak in mld_del_delrec()") fixed
the memory leak of MLD, but missing the ipv6_mc_destroy_dev() path, in
which mca_sources are leaked after ma_put().

Using ip6_mc_clear_src() to take care of the missing free.

BUG: memory leak
unreferenced object 0xffff8881113d3180 (size 64):
comm "syz-executor071", pid 389, jiffies 4294887985 (age 17.943s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 ff 02 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 ................
backtrace:
[<000000002cbc483c>] kmalloc include/linux/slab.h:555 [inline]
[<000000002cbc483c>] kzalloc include/linux/slab.h:669 [inline]
[<000000002cbc483c>] ip6_mc_add1_src net/ipv6/mcast.c:2237 [inline]
[<000000002cbc483c>] ip6_mc_add_src+0x7f5/0xbb0 net/ipv6/mcast.c:2357
[<0000000058b8b1ff>] ip6_mc_source+0xe0c/0x1530 net/ipv6/mcast.c:449
[<000000000bfc4fb5>] do_ipv6_setsockopt.isra.12+0x1b2c/0x3b30 net/ipv6/ipv6_sockglue.c:754
[<00000000e4e7a722>] ipv6_setsockopt+0xda/0x150 net/ipv6/ipv6_sockglue.c:950
[<0000000029260d9a>] rawv6_setsockopt+0x45/0x100 net/ipv6/raw.c:1081
[<000000005c1b46f9>] __sys_setsockopt+0x131/0x210 net/socket.c:2132
[<000000008491f7db>] __do_sys_setsockopt net/socket.c:2148 [inline]
[<000000008491f7db>] __se_sys_setsockopt net/socket.c:2145 [inline]
[<000000008491f7db>] __x64_sys_setsockopt+0xba/0x150 net/socket.c:2145
[<00000000c7bc11c5>] do_syscall_64+0xa1/0x530 arch/x86/entry/common.c:295
[<000000005fb7a3f3>] entry_SYSCALL_64_after_hwframe+0x49/0xb3

Fixes: 1666d49e1d41 ("mld: do not remove mld souce list info when set link down")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Acked-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d59eb177 30-Mar-2020 Al Viro <viro@zeniv.linux.org.uk>

ip6_mc_msfilter(): pass the address list separately

that way we'll be able to reuse it for compat case

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 931ca7ab 29-Mar-2020 Al Viro <viro@zeniv.linux.org.uk>

ip*_mc_gsfget(): lift copyout of struct group_filter into callers

pass the userland pointer to the array in its tail, so that part
gets copied out by our functions; copyout of everything else is
done in the callers. Rationale: reuse for compat; the array
is the same in native and compat, the layout of parts before it
is different for compat.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a84d0164 27-Aug-2019 Eric Dumazet <edumazet@google.com>

mld: fix memory leak in mld_del_delrec()

Similar to the fix done for IPv4 in commit e5b1c6c6277d
("igmp: fix memory leak in igmpv3_del_delrec()"), we need to
make sure mca_tomb and mca_sources are not blindly overwritten.

Using swap() then a call to ip6_mc_clear_src() will take care
of the missing free.

BUG: memory leak
unreferenced object 0xffff888117d9db00 (size 64):
comm "syz-executor247", pid 6918, jiffies 4294943989 (age 25.350s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 fe 88 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<000000005b463030>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
[<000000005b463030>] slab_post_alloc_hook mm/slab.h:522 [inline]
[<000000005b463030>] slab_alloc mm/slab.c:3319 [inline]
[<000000005b463030>] kmem_cache_alloc_trace+0x145/0x2c0 mm/slab.c:3548
[<00000000939cbf94>] kmalloc include/linux/slab.h:552 [inline]
[<00000000939cbf94>] kzalloc include/linux/slab.h:748 [inline]
[<00000000939cbf94>] ip6_mc_add1_src net/ipv6/mcast.c:2236 [inline]
[<00000000939cbf94>] ip6_mc_add_src+0x31f/0x420 net/ipv6/mcast.c:2356
[<00000000d8972221>] ip6_mc_source+0x4a8/0x600 net/ipv6/mcast.c:449
[<000000002b203d0d>] do_ipv6_setsockopt.isra.0+0x1b92/0x1dd0 net/ipv6/ipv6_sockglue.c:748
[<000000001f1e2d54>] ipv6_setsockopt+0x89/0xd0 net/ipv6/ipv6_sockglue.c:944
[<00000000c8f7bdf9>] udpv6_setsockopt+0x4e/0x90 net/ipv6/udp.c:1558
[<000000005a9a0c5e>] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3139
[<00000000910b37b2>] __sys_setsockopt+0x10f/0x220 net/socket.c:2084
[<00000000e9108023>] __do_sys_setsockopt net/socket.c:2100 [inline]
[<00000000e9108023>] __se_sys_setsockopt net/socket.c:2097 [inline]
[<00000000e9108023>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2097
[<00000000f4818160>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:296
[<000000008d367e8f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 1666d49e1d41 ("mld: do not remove mld souce list info when set link down")
Fixes: 9c8bb163ae78 ("igmp, mld: Fix memory leak in igmpv3/mld_del_delrec()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2874c5fd 27-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 4effd28c 20-Jan-2019 Linus Lüssing <linus.luessing@c0d3.blue>

bridge: join all-snoopers multicast address

Next to snooping IGMP/MLD queries RFC4541, section 2.1.1.a) recommends
to snoop multicast router advertisements to detect multicast routers.

Multicast router advertisements are sent to an "all-snoopers"
multicast address. To be able to receive them reliably, we need to
join this group.

Otherwise other snooping switches might refrain from forwarding these
advertisements to us.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>


# dc012f36 12-Oct-2018 Eric Dumazet <edumazet@google.com>

ipv6: mcast: fix a use-after-free in inet6_mc_check

syzbot found a use-after-free in inet6_mc_check [1]

The problem here is that inet6_mc_check() uses rcu
and read_lock(&iml->sflock)

So the fact that ip6_mc_leave_src() is called under RTNL
and the socket lock does not help us, we need to acquire
iml->sflock in write mode.

In the future, we should convert all this stuff to RCU.

[1]
BUG: KASAN: use-after-free in ipv6_addr_equal include/net/ipv6.h:521 [inline]
BUG: KASAN: use-after-free in inet6_mc_check+0xae7/0xb40 net/ipv6/mcast.c:649
Read of size 8 at addr ffff8801ce7f2510 by task syz-executor0/22432

CPU: 1 PID: 22432 Comm: syz-executor0 Not tainted 4.19.0-rc7+ #280
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
kasan_report_error mm/kasan/report.c:354 [inline]
kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
__asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
ipv6_addr_equal include/net/ipv6.h:521 [inline]
inet6_mc_check+0xae7/0xb40 net/ipv6/mcast.c:649
__raw_v6_lookup+0x320/0x3f0 net/ipv6/raw.c:98
ipv6_raw_deliver net/ipv6/raw.c:183 [inline]
raw6_local_deliver+0x3d3/0xcb0 net/ipv6/raw.c:240
ip6_input_finish+0x467/0x1aa0 net/ipv6/ip6_input.c:345
NF_HOOK include/linux/netfilter.h:289 [inline]
ip6_input+0xe9/0x600 net/ipv6/ip6_input.c:426
ip6_mc_input+0x48a/0xd20 net/ipv6/ip6_input.c:503
dst_input include/net/dst.h:450 [inline]
ip6_rcv_finish+0x17a/0x330 net/ipv6/ip6_input.c:76
NF_HOOK include/linux/netfilter.h:289 [inline]
ipv6_rcv+0x120/0x640 net/ipv6/ip6_input.c:271
__netif_receive_skb_one_core+0x14d/0x200 net/core/dev.c:4913
__netif_receive_skb+0x2c/0x1e0 net/core/dev.c:5023
netif_receive_skb_internal+0x12c/0x620 net/core/dev.c:5126
napi_frags_finish net/core/dev.c:5664 [inline]
napi_gro_frags+0x75a/0xc90 net/core/dev.c:5737
tun_get_user+0x3189/0x4250 drivers/net/tun.c:1923
tun_chr_write_iter+0xb9/0x154 drivers/net/tun.c:1968
call_write_iter include/linux/fs.h:1808 [inline]
do_iter_readv_writev+0x8b0/0xa80 fs/read_write.c:680
do_iter_write+0x185/0x5f0 fs/read_write.c:959
vfs_writev+0x1f1/0x360 fs/read_write.c:1004
do_writev+0x11a/0x310 fs/read_write.c:1039
__do_sys_writev fs/read_write.c:1112 [inline]
__se_sys_writev fs/read_write.c:1109 [inline]
__x64_sys_writev+0x75/0xb0 fs/read_write.c:1109
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457421
Code: 75 14 b8 14 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 34 b5 fb ff c3 48 83 ec 08 e8 1a 2d 00 00 48 89 04 24 b8 14 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 63 2d 00 00 48 89 d0 48 83 c4 08 48 3d 01
RSP: 002b:00007f2d30ecaba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
RAX: ffffffffffffffda RBX: 000000000000003e RCX: 0000000000457421
RDX: 0000000000000001 RSI: 00007f2d30ecabf0 RDI: 00000000000000f0
RBP: 0000000020000500 R08: 00000000000000f0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 00007f2d30ecb6d4
R13: 00000000004c4890 R14: 00000000004d7b90 R15: 00000000ffffffff

Allocated by task 22437:
save_stack+0x43/0xd0 mm/kasan/kasan.c:448
set_track mm/kasan/kasan.c:460 [inline]
kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
__do_kmalloc mm/slab.c:3718 [inline]
__kmalloc+0x14e/0x760 mm/slab.c:3727
kmalloc include/linux/slab.h:518 [inline]
sock_kmalloc+0x15a/0x1f0 net/core/sock.c:1983
ip6_mc_source+0x14dd/0x1960 net/ipv6/mcast.c:427
do_ipv6_setsockopt.isra.9+0x3afb/0x45d0 net/ipv6/ipv6_sockglue.c:743
ipv6_setsockopt+0xbd/0x170 net/ipv6/ipv6_sockglue.c:933
rawv6_setsockopt+0x59/0x140 net/ipv6/raw.c:1069
sock_common_setsockopt+0x9a/0xe0 net/core/sock.c:3038
__sys_setsockopt+0x1ba/0x3c0 net/socket.c:1902
__do_sys_setsockopt net/socket.c:1913 [inline]
__se_sys_setsockopt net/socket.c:1910 [inline]
__x64_sys_setsockopt+0xbe/0x150 net/socket.c:1910
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 22430:
save_stack+0x43/0xd0 mm/kasan/kasan.c:448
set_track mm/kasan/kasan.c:460 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
__cache_free mm/slab.c:3498 [inline]
kfree+0xcf/0x230 mm/slab.c:3813
__sock_kfree_s net/core/sock.c:2004 [inline]
sock_kfree_s+0x29/0x60 net/core/sock.c:2010
ip6_mc_leave_src+0x11a/0x1d0 net/ipv6/mcast.c:2448
__ipv6_sock_mc_close+0x20b/0x4e0 net/ipv6/mcast.c:310
ipv6_sock_mc_close+0x158/0x1d0 net/ipv6/mcast.c:328
inet6_release+0x40/0x70 net/ipv6/af_inet6.c:452
__sock_release+0xd7/0x250 net/socket.c:579
sock_close+0x19/0x20 net/socket.c:1141
__fput+0x385/0xa30 fs/file_table.c:278
____fput+0x15/0x20 fs/file_table.c:309
task_work_run+0x1e8/0x2a0 kernel/task_work.c:113
tracehook_notify_resume include/linux/tracehook.h:193 [inline]
exit_to_usermode_loop+0x318/0x380 arch/x86/entry/common.c:166
prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
do_syscall_64+0x6be/0x820 arch/x86/entry/common.c:293
entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff8801ce7f2500
which belongs to the cache kmalloc-192 of size 192
The buggy address is located 16 bytes inside of
192-byte region [ffff8801ce7f2500, ffff8801ce7f25c0)
The buggy address belongs to the page:
page:ffffea000739fc80 count:1 mapcount:0 mapping:ffff8801da800040 index:0x0
flags: 0x2fffc0000000100(slab)
raw: 02fffc0000000100 ffffea0006f6e548 ffffea000737b948 ffff8801da800040
raw: 0000000000000000 ffff8801ce7f2000 0000000100000010 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff8801ce7f2400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8801ce7f2480: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff8801ce7f2500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8801ce7f2580: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
ffff8801ce7f2600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 15033f04 10-Sep-2018 Andre Naujoks <nautsch2@gmail.com>

ipv6: Add sockopt IPV6_MULTICAST_ALL analogue to IP_MULTICAST_ALL

The socket option will be enabled by default to ensure current behaviour
is not changed. This is the same for the IPv4 version.

A socket bound to in6addr_any and a specific port will receive all traffic
on that port. Analogue to IP_MULTICAST_ALL, disable this behaviour, if
one or more multicast groups were joined (using said socket) and only
pass on multicast traffic from groups, which were explicitly joined via
this socket.

Without this option disabled a socket (system even) joined to multiple
multicast groups is very hard to get right. Filtering by destination
address has to take place in user space to avoid receiving multicast
traffic from other multicast groups, which might have traffic on the same
port.

The extension of the IP_MULTICAST_ALL socketoption to just apply to ipv6,
too, is not done to avoid changing the behaviour of current applications.

Signed-off-by: Andre Naujoks <nautsch2@gmail.com>
Acked-By: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 08d3ffcc 20-Jul-2018 Hangbin Liu <liuhangbin@gmail.com>

multicast: do not restore deleted record source filter mode to new one

There are two scenarios that we will restore deleted records. The first is
when device down and up(or unmap/remap). In this scenario the new filter
mode is same with previous one. Because we get it from in_dev->mc_list and
we do not touch it during device down and up.

The other scenario is when a new socket join a group which was just delete
and not finish sending status reports. In this scenario, we should use the
current filter mode instead of restore old one. Here are 4 cases in total.

old_socket new_socket before_fix after_fix
IN(A) IN(A) ALLOW(A) ALLOW(A)
IN(A) EX( ) TO_IN( ) TO_EX( )
EX( ) IN(A) TO_EX( ) ALLOW(A)
EX( ) EX( ) TO_EX( ) TO_EX( )

Fixes: 24803f38a5c0b (igmp: do not remove igmp souce list info when set link down)
Fixes: 1666d49e1d416 (mld: do not remove mld souce list info when set link down)
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0ae0d60a 20-Jul-2018 Hangbin Liu <liuhangbin@gmail.com>

multicast: remove useless parameter for group add

Remove the mode parameter for igmp/igmp6_group_added as we can get it
from first parameter.

Fixes: 6e2059b53f988 (ipv4/igmp: init group mode as INCLUDE when join source group)
Fixes: c7ea20c9da5b9 (ipv6/mcast: init as INCLUDE when join SSM INCLUDE group)
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c7ea20c9 10-Jul-2018 Hangbin Liu <liuhangbin@gmail.com>

ipv6/mcast: init as INCLUDE when join SSM INCLUDE group

This an IPv6 version patch of "ipv4/igmp: init group mode as INCLUDE when
join source group". From RFC3810, part 6.1:

If no per-interface state existed for that
multicast address before the change (i.e., the change consisted of
creating a new per-interface record), or if no state exists after the
change (i.e., the change consisted of deleting a per-interface
record), then the "non-existent" state is considered to have an
INCLUDE filter mode and an empty source list.

Which means a new multicast group should start with state IN(). Currently,
for MLDv2 SSM JOIN_SOURCE_GROUP mode, we first call ipv6_sock_mc_join(),
then ip6_mc_source(), which will trigger a TO_IN() message instead of
ALLOW().

The issue was exposed by commit a052517a8ff65 ("net/multicast: should not
send source list records when have filter mode change"). Before this change,
we sent both ALLOW(A) and TO_IN(A). Now, we only send TO_IN(A).

Fix it by adding a new parameter to init group mode. Also add some wrapper
functions to avoid changing too much code.

v1 -> v2:
In the first version I only cleared the group change record. But this is not
enough. Because when a new group join, it will init as EXCLUDE and trigger
a filter mode change in ip/ip6_mc_add_src(), which will clear all source
addresses sf_crcount. This will prevent early joined address sending state
change records if multi source addressed joined at the same time.

In v2 patch, I fixed it by directly initializing the mode to INCLUDE for SSM
JOIN_SOURCE_GROUP. I also split the original patch into two separated patches
for IPv4 and IPv6.

There is also a difference between v4 and v6 version. For IPv6, when the
interface goes down and up, we will send correct state change record with
unspecified IPv6 address (::) with function ipv6_mc_up(). But after DAD is
completed, we resend the change record TO_IN() in mld_send_initial_cr().
Fix it by sending ALLOW() for INCLUDE mode in mld_send_initial_cr().

Fixes: a052517a8ff65 ("net/multicast: should not send source list records when have filter mode change")
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6c6da928 21-Jun-2018 Hangbin Liu <liuhangbin@gmail.com>

ipv6: mcast: fix unsolicited report interval after receiving querys

After recieving MLD querys, we update idev->mc_maxdelay with max_delay
from query header. This make the later unsolicited reports have the same
interval with mc_maxdelay, which means we may send unsolicited reports with
long interval time instead of default configured interval time.

Also as we will not call ipv6_mc_reset() after device up. This issue will
be there even after leave the group and join other groups.

Fixes: fc4eba58b4c14 ("ipv6: make unsolicited report intervals configurable for mld")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c3506372 10-Apr-2018 Christoph Hellwig <hch@lst.de>

proc: introduce proc_create_net{,_data}

Variants of proc_create{,_data} that directly take a struct seq_operations
and deal with network namespaces in ->open and ->release. All callers of
proc_create + seq_open_net converted over, and seq_{open,release}_net are
removed entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>


# 2f635cee 27-Mar-2018 Kirill Tkhai <ktkhai@virtuozzo.com>

net: Drop pernet_operations::async

Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d6444062 23-Mar-2018 Joe Perches <joe@perches.com>

net: Use octal not symbolic permissions

Prefer the direct use of octal for permissions.

Done with checkpatch -f --types=SYMBOLIC_PERMS --fix-inplace
and some typing.

Miscellanea:

o Whitespace neatening around these conversions.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b75cc8f9 02-Mar-2018 David Ahern <dsahern@gmail.com>

net/ipv6: Pass skb to route lookup

IPv6 does path selection for multipath routes deep in the lookup
functions. The next patch adds L4 hash option and needs the skb
for the forward path. To get the skb to the relevant FIB lookup
functions it needs to go through the fib rules layer, so add a
lookup_data argument to the fib_lookup_arg struct.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1a2e9332 19-Feb-2018 Kirill Tkhai <ktkhai@virtuozzo.com>

net: Convert icmpv6_sk_ops, ndisc_net_ops and igmp6_net_ops

These pernet_operations create and destroy net::ipv6.icmp_sk
socket, used to send ICMP or error reply.

Nobody can dereference the socket to handle a packet before
net is initialized, as there is no routing; nobody can do
that in parallel with exit, as all of devices are moved
to init_net or destroyed and there are no packets it-flight.
So, it's possible to mark these pernet_operations as async.

The same for ndisc_net_ops and for igmp6_net_ops. The last
one also creates and destroys /proc entries.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 32b395a1 06-Feb-2018 Masahiro Yamada <yamada.masahiro@socionext.com>

build_bug.h: remove BUILD_BUG_ON_NULL()

This macro is only used by net/ipv6/mcast.c, but there is no reason
why it must be BUILD_BUG_ON_NULL().

Replace it with BUILD_BUG_ON_ZERO(), and remove BUILD_BUG_ON_NULL()
definition from <linux/build_bug.h>.

Link: http://lkml.kernel.org/r/1515121833-3174-3-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Ian Abbott <abbotti@mev.co.uk>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5a75114a 16-Jan-2018 Eric Dumazet <edumazet@google.com>

ipv6: mcast: remove dead code

Since commit 41033f029e39 ("snmp: Remove duplicate OUTMCAST stat
increment") one line of code became unneeded.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 96890d62 15-Jan-2018 Alexey Dobriyan <adobriyan@gmail.com>

net: delete /proc THIS_MODULE references

/proc has been ignoring struct file_operations::owner field for 10 years.
Specifically, it started with commit 786d7e1612f0b0adb6046f19b906609e4fe8b1ba
("Fix rmmod/read/write races in /proc entries"). Notice the chunk where
inode->i_fop is initialized with proxy struct file_operations for
regular files:

- if (de->proc_fops)
- inode->i_fop = de->proc_fops;
+ if (de->proc_fops) {
+ if (S_ISREG(inode->i_mode))
+ inode->i_fop = &proc_reg_file_ops;
+ else
+ inode->i_fop = de->proc_fops;
+ }

VFS stopped pinning module at this point.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b9b312a7 11-Dec-2017 Eric Dumazet <edumazet@google.com>

ipv6: mcast: better catch silly mtu values

syzkaller reported crashes in IPv6 stack [1]

Xin Long found that lo MTU was set to silly values.

IPv6 stack reacts to changes to small MTU, by disabling itself under
RTNL.

But there is a window where threads not using RTNL can see a wrong
device mtu. This can lead to surprises, in mld code where it is assumed
the mtu is suitable.

Fix this by reading device mtu once and checking IPv6 minimal MTU.

[1]
skbuff: skb_over_panic: text:0000000010b86b8d len:196 put:20
head:000000003b477e60 data:000000000e85441e tail:0xd4 end:0xc0 dev:lo
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:104!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-rc2-mm1+ #39
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:skb_panic+0x15c/0x1f0 net/core/skbuff.c:100
RSP: 0018:ffff8801db307508 EFLAGS: 00010286
RAX: 0000000000000082 RBX: ffff8801c517e840 RCX: 0000000000000000
RDX: 0000000000000082 RSI: 1ffff1003b660e61 RDI: ffffed003b660e95
RBP: ffff8801db307570 R08: 1ffff1003b660e23 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff85bd4020
R13: ffffffff84754ed2 R14: 0000000000000014 R15: ffff8801c4e26540
FS: 0000000000000000(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000463610 CR3: 00000001c6698000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
skb_over_panic net/core/skbuff.c:109 [inline]
skb_put+0x181/0x1c0 net/core/skbuff.c:1694
add_grhead.isra.24+0x42/0x3b0 net/ipv6/mcast.c:1695
add_grec+0xa55/0x1060 net/ipv6/mcast.c:1817
mld_send_cr net/ipv6/mcast.c:1903 [inline]
mld_ifc_timer_expire+0x4d2/0x770 net/ipv6/mcast.c:2448
call_timer_fn+0x23b/0x840 kernel/time/timer.c:1320
expire_timers kernel/time/timer.c:1357 [inline]
__run_timers+0x7e1/0xb60 kernel/time/timer.c:1660
run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
__do_softirq+0x29d/0xbb2 kernel/softirq.c:285
invoke_softirq kernel/softirq.c:365 [inline]
irq_exit+0x1d3/0x210 kernel/softirq.c:405
exiting_irq arch/x86/include/asm/apic.h:540 [inline]
smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:920

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e99e88a9 16-Oct-2017 Kees Cook <keescook@chromium.org>

treewide: setup_timer() -> timer_setup()

This converts all remaining cases of the old setup_timer() API into using
timer_setup(), where the callback argument is the structure already
holding the struct timer_list. These should have no behavioral changes,
since they just change which pointer is passed into the callback with
the same available pointers after conversion. It handles the following
examples, in addition to some other variations.

Casting from unsigned long:

void my_callback(unsigned long data)
{
struct something *ptr = (struct something *)data;
...
}
...
setup_timer(&ptr->my_timer, my_callback, ptr);

and forced object casts:

void my_callback(struct something *ptr)
{
...
}
...
setup_timer(&ptr->my_timer, my_callback, (unsigned long)ptr);

become:

void my_callback(struct timer_list *t)
{
struct something *ptr = from_timer(ptr, t, my_timer);
...
}
...
timer_setup(&ptr->my_timer, my_callback, 0);

Direct function assignments:

void my_callback(unsigned long data)
{
struct something *ptr = (struct something *)data;
...
}
...
ptr->my_timer.function = my_callback;

have a temporary cast added, along with converting the args:

void my_callback(struct timer_list *t)
{
struct something *ptr = from_timer(ptr, t, my_timer);
...
}
...
ptr->my_timer.function = (TIMER_FUNC_TYPE)my_callback;

And finally, callbacks without a data assignment:

void my_callback(unsigned long data)
{
...
}
...
setup_timer(&ptr->my_timer, my_callback, 0);

have their argument renamed to verify they're unused during conversion:

void my_callback(struct timer_list *unused)
{
...
}
...
timer_setup(&ptr->my_timer, my_callback, 0);

The conversion is done with the following Coccinelle script:

spatch --very-quiet --all-includes --include-headers \
-I ./arch/x86/include -I ./arch/x86/include/generated \
-I ./include -I ./arch/x86/include/uapi \
-I ./arch/x86/include/generated/uapi -I ./include/uapi \
-I ./include/generated/uapi --include ./include/linux/kconfig.h \
--dir . \
--cocci-file ~/src/data/timer_setup.cocci

@fix_address_of@
expression e;
@@

setup_timer(
-&(e)
+&e
, ...)

// Update any raw setup_timer() usages that have a NULL callback, but
// would otherwise match change_timer_function_usage, since the latter
// will update all function assignments done in the face of a NULL
// function initialization in setup_timer().
@change_timer_function_usage_NULL@
expression _E;
identifier _timer;
type _cast_data;
@@

(
-setup_timer(&_E->_timer, NULL, _E);
+timer_setup(&_E->_timer, NULL, 0);
|
-setup_timer(&_E->_timer, NULL, (_cast_data)_E);
+timer_setup(&_E->_timer, NULL, 0);
|
-setup_timer(&_E._timer, NULL, &_E);
+timer_setup(&_E._timer, NULL, 0);
|
-setup_timer(&_E._timer, NULL, (_cast_data)&_E);
+timer_setup(&_E._timer, NULL, 0);
)

@change_timer_function_usage@
expression _E;
identifier _timer;
struct timer_list _stl;
identifier _callback;
type _cast_func, _cast_data;
@@

(
-setup_timer(&_E->_timer, _callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, &_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, &_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)&_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, &_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, &_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
_E->_timer@_stl.function = _callback;
|
_E->_timer@_stl.function = &_callback;
|
_E->_timer@_stl.function = (_cast_func)_callback;
|
_E->_timer@_stl.function = (_cast_func)&_callback;
|
_E._timer@_stl.function = _callback;
|
_E._timer@_stl.function = &_callback;
|
_E._timer@_stl.function = (_cast_func)_callback;
|
_E._timer@_stl.function = (_cast_func)&_callback;
)

// callback(unsigned long arg)
@change_callback_handle_cast
depends on change_timer_function_usage@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
identifier _handle;
@@

void _callback(
-_origtype _origarg
+struct timer_list *t
)
{
(
... when != _origarg
_handletype *_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _timer);
... when != _origarg
|
... when != _origarg
_handletype *_handle =
-(void *)_origarg;
+from_timer(_handle, t, _timer);
... when != _origarg
|
... when != _origarg
_handletype *_handle;
... when != _handle
_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _timer);
... when != _origarg
|
... when != _origarg
_handletype *_handle;
... when != _handle
_handle =
-(void *)_origarg;
+from_timer(_handle, t, _timer);
... when != _origarg
)
}

// callback(unsigned long arg) without existing variable
@change_callback_handle_cast_no_arg
depends on change_timer_function_usage &&
!change_callback_handle_cast@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
@@

void _callback(
-_origtype _origarg
+struct timer_list *t
)
{
+ _handletype *_origarg = from_timer(_origarg, t, _timer);
+
... when != _origarg
- (_handletype *)_origarg
+ _origarg
... when != _origarg
}

// Avoid already converted callbacks.
@match_callback_converted
depends on change_timer_function_usage &&
!change_callback_handle_cast &&
!change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier t;
@@

void _callback(struct timer_list *t)
{ ... }

// callback(struct something *handle)
@change_callback_handle_arg
depends on change_timer_function_usage &&
!match_callback_converted &&
!change_callback_handle_cast &&
!change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
@@

void _callback(
-_handletype *_handle
+struct timer_list *t
)
{
+ _handletype *_handle = from_timer(_handle, t, _timer);
...
}

// If change_callback_handle_arg ran on an empty function, remove
// the added handler.
@unchange_callback_handle_arg
depends on change_timer_function_usage &&
change_callback_handle_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
identifier t;
@@

void _callback(struct timer_list *t)
{
- _handletype *_handle = from_timer(_handle, t, _timer);
}

// We only want to refactor the setup_timer() data argument if we've found
// the matching callback. This undoes changes in change_timer_function_usage.
@unchange_timer_function_usage
depends on change_timer_function_usage &&
!change_callback_handle_cast &&
!change_callback_handle_cast_no_arg &&
!change_callback_handle_arg@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type change_timer_function_usage._cast_data;
@@

(
-timer_setup(&_E->_timer, _callback, 0);
+setup_timer(&_E->_timer, _callback, (_cast_data)_E);
|
-timer_setup(&_E._timer, _callback, 0);
+setup_timer(&_E._timer, _callback, (_cast_data)&_E);
)

// If we fixed a callback from a .function assignment, fix the
// assignment cast now.
@change_timer_function_assignment
depends on change_timer_function_usage &&
(change_callback_handle_cast ||
change_callback_handle_cast_no_arg ||
change_callback_handle_arg)@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_func;
typedef TIMER_FUNC_TYPE;
@@

(
_E->_timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
;
|
_E->_timer.function =
-&_callback
+(TIMER_FUNC_TYPE)_callback
;
|
_E->_timer.function =
-(_cast_func)_callback;
+(TIMER_FUNC_TYPE)_callback
;
|
_E->_timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
;
|
_E._timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
;
|
_E._timer.function =
-&_callback;
+(TIMER_FUNC_TYPE)_callback
;
|
_E._timer.function =
-(_cast_func)_callback
+(TIMER_FUNC_TYPE)_callback
;
|
_E._timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
;
)

// Sometimes timer functions are called directly. Replace matched args.
@change_timer_function_calls
depends on change_timer_function_usage &&
(change_callback_handle_cast ||
change_callback_handle_cast_no_arg ||
change_callback_handle_arg)@
expression _E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_data;
@@

_callback(
(
-(_cast_data)_E
+&_E->_timer
|
-(_cast_data)&_E
+&_E._timer
|
-_E
+&_E->_timer
)
)

// If a timer has been configured without a data argument, it can be
// converted without regard to the callback argument, since it is unused.
@match_timer_function_unused_data@
expression _E;
identifier _timer;
identifier _callback;
@@

(
-setup_timer(&_E->_timer, _callback, 0);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, 0L);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, 0UL);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0L);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0UL);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0L);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0UL);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0);
+timer_setup(_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0L);
+timer_setup(_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0UL);
+timer_setup(_timer, _callback, 0);
)

@change_callback_unused_data
depends on match_timer_function_unused_data@
identifier match_timer_function_unused_data._callback;
type _origtype;
identifier _origarg;
@@

void _callback(
-_origtype _origarg
+struct timer_list *unused
)
{
... when != _origarg
}

Signed-off-by: Kees Cook <keescook@chromium.org>


# d3981bc6 04-Jul-2017 Reshetova, Elena <elena.reshetova@intel.com>

net, ipv6: convert ifmcaddr6.mca_refcnt from atomic_t to refcount_t

refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4df864c1 16-Jun-2017 Johannes Berg <johannes.berg@intel.com>

networking: make skb_put & friends return void pointers

It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.

Make these functions (skb_put, __skb_put and pskb_put) return void *
and remove all the casts across the tree, adding a (u8 *) cast only
where the unsigned char pointer was used directly, all done with the
following spatch:

@@
expression SKB, LEN;
typedef u8;
identifier fn = { skb_put, __skb_put };
@@
- *(fn(SKB, LEN))
+ *(u8 *)fn(SKB, LEN)

@@
expression E, SKB, LEN;
identifier fn = { skb_put, __skb_put };
type T;
@@
- E = ((T *)(fn(SKB, LEN)))
+ E = fn(SKB, LEN)

which actually doesn't cover pskb_put since there are only three
users overall.

A handful of stragglers were converted manually, notably a macro in
drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many
instances in net/bluetooth/hci_sock.c. In the former file, I also
had to fix one whitespace problem spatch introduced.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 59ae1d12 16-Jun-2017 Johannes Berg <johannes.berg@intel.com>

networking: introduce and use skb_put_data()

A common pattern with skb_put() is to just want to memcpy()
some data into the new space, introduce skb_put_data() for
this.

An spatch similar to the one for skb_put_zero() converts many
of the places using it:

@@
identifier p, p2;
expression len, skb, data;
type t, t2;
@@
(
-p = skb_put(skb, len);
+p = skb_put_data(skb, data, len);
|
-p = (t)skb_put(skb, len);
+p = skb_put_data(skb, data, len);
)
(
p2 = (t2)p;
-memcpy(p2, data, len);
|
-memcpy(p, data, len);
)

@@
type t, t2;
identifier p, p2;
expression skb, data;
@@
t *p;
...
(
-p = skb_put(skb, sizeof(t));
+p = skb_put_data(skb, data, sizeof(t));
|
-p = (t *)skb_put(skb, sizeof(t));
+p = skb_put_data(skb, data, sizeof(t));
)
(
p2 = (t2)p;
-memcpy(p2, data, sizeof(*p));
|
-memcpy(p, data, sizeof(*p));
)

@@
expression skb, len, data;
@@
-memcpy(skb_put(skb, len), data, len);
+skb_put_data(skb, data, len);

(again, manually post-processed to retain some comments)

Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b080db58 16-Jun-2017 Johannes Berg <johannes.berg@intel.com>

networking: convert many more places to skb_put_zero()

There were many places that my previous spatch didn't find,
as pointed out by yuan linyu in various patches.

The following spatch found many more and also removes the
now unnecessary casts:

@@
identifier p, p2;
expression len;
expression skb;
type t, t2;
@@
(
-p = skb_put(skb, len);
+p = skb_put_zero(skb, len);
|
-p = (t)skb_put(skb, len);
+p = skb_put_zero(skb, len);
)
... when != p
(
p2 = (t2)p;
-memset(p2, 0, len);
|
-memset(p, 0, len);
)

@@
type t, t2;
identifier p, p2;
expression skb;
@@
t *p;
...
(
-p = skb_put(skb, sizeof(t));
+p = skb_put_zero(skb, sizeof(t));
|
-p = (t *)skb_put(skb, sizeof(t));
+p = skb_put_zero(skb, sizeof(t));
)
... when != p
(
p2 = (t2)p;
-memset(p2, 0, sizeof(*p));
|
-memset(p, 0, sizeof(*p));
)

@@
expression skb, len;
@@
-memset(skb_put(skb, len), 0, len);
+skb_put_zero(skb, len);

Apply it to the tree (with one manual fixup to keep the
comment in vxlan.c, which spatch removed.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 382ed724 28-Mar-2017 Vlad Yasevich <vyasevich@gmail.com>

ipv6: add support for NETDEV_RESEND_IGMP event

This patch adds support for NETDEV_RESEND_IGMP event similar
to how it works for IPv4.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9c8bb163 08-Feb-2017 Hangbin Liu <liuhangbin@gmail.com>

igmp, mld: Fix memory leak in igmpv3/mld_del_delrec()

In function igmpv3/mld_add_delrec() we allocate pmc and put it in
idev->mc_tomb, so we should free it when we don't need it in del_delrec().
But I removed kfree(pmc) incorrectly in latest two patches. Now fix it.

Fixes: 24803f38a5c0 ("igmp: do not remove igmp souce list info when ...")
Fixes: 1666d49e1d41 ("mld: do not remove mld souce list info when ...")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1666d49e 12-Jan-2017 Hangbin Liu <liuhangbin@gmail.com>

mld: do not remove mld souce list info when set link down

This is an IPv6 version of commit 24803f38a5c0 ("igmp: do not remove igmp
souce list..."). In mld_del_delrec(), we will restore back all source filter
info instead of flush them.

Move mld_clear_delrec() from ipv6_mc_down() to ipv6_mc_destroy_dev() since
we should not remove source list info when set link down. Remove
igmp6_group_dropped() in ipv6_mc_destroy_dev() since we have called it in
ipv6_mc_down().

Also clear all source info after igmp6_group_dropped() instead of in it
because ipv6_mc_down() will call igmp6_group_dropped().

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8651be8f 20-Oct-2016 WANG Cong <xiyou.wangcong@gmail.com>

ipv6: fix a potential deadlock in do_ipv6_setsockopt()

Baozeng reported this deadlock case:

CPU0 CPU1
---- ----
lock([ 165.136033] sk_lock-AF_INET6);
lock([ 165.136033] rtnl_mutex);
lock([ 165.136033] sk_lock-AF_INET6);
lock([ 165.136033] rtnl_mutex);

Similar to commit 87e9f0315952
("ipv4: fix a potential deadlock in mcast getsockopt() path")
this is due to we still have a case, ipv6_sock_mc_close(),
where we acquire sk_lock before rtnl_lock. Close this deadlock
with the similar solution, that is always acquire rtnl lock first.

Fixes: baf606d9c9b1 ("ipv4,ipv6: grab rtnl before locking the socket")
Reported-by: Baozeng Ding <sploving1@gmail.com>
Tested-by: Baozeng Ding <sploving1@gmail.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a052517a 02-Aug-2016 Hangbin Liu <liuhangbin@gmail.com>

net/multicast: should not send source list records when have filter mode change

Based on RFC3376 5.1 and RFC3810 6.1

If the per-interface listening change that triggers the new report is
a filter mode change, then the next [Robustness Variable] State
Change Reports will include a Filter Mode Change Record. This
applies even if any number of source list changes occur in that
period.

Old State New State State Change Record Sent
--------- --------- ------------------------
INCLUDE (A) EXCLUDE (B) TO_EX (B)
EXCLUDE (A) INCLUDE (B) TO_IN (B)

So we should not send source-list change if there is a filter-mode change.

Here are two scenarios:
1. Group deleted and filter mode is EXCLUDE, which means we need send a
TO_IN { }.
2. Not group deleted, but has pcm->crcount, which means we need send a
normal filter-mode-change.

At the same time, if the type is ALLOW or BLOCK, and have psf->sf_crcount,
we stop add records and decrease sf_crcount directly

Reference: https://www.ietf.org/mail-archive/web/magma/current/msg01274.html

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1837b2e2 29-Feb-2016 Benjamin Poirier <bpoirier@suse.com>

mld, igmp: Fix reserved tailroom calculation

The current reserved_tailroom calculation fails to take hlen and tlen into
account.

skb:
[__hlen__|__data____________|__tlen___|__extra__]
^ ^
head skb_end_offset

In this representation, hlen + data + tlen is the size passed to alloc_skb.
"extra" is the extra space made available in __alloc_skb because of
rounding up by kmalloc. We can reorder the representation like so:

[__hlen__|__data____________|__extra__|__tlen___]
^ ^
head skb_end_offset

The maximum space available for ip headers and payload without
fragmentation is min(mtu, data + extra). Therefore,
reserved_tailroom
= data + extra + tlen - min(mtu, data + extra)
= skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
= skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)

Compare the second line to the current expression:
reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
and we can see that hlen and tlen are not taken into account.

The min() in the third line can be expanded into:
if mtu < skb_tailroom - tlen:
reserved_tailroom = skb_tailroom - mtu
else:
reserved_tailroom = tlen

Depending on hlen, tlen, mtu and the number of multicast address records,
the current code may output skbs that have less tailroom than
dev->needed_tailroom or it may output more skbs than needed because not all
space available is used.

Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 41033f02 16-Nov-2015 Neil Horman <nhorman@tuxdriver.com>

snmp: Remove duplicate OUTMCAST stat increment

the OUTMCAST stat is double incremented, getting bumped once in the mcast code
itself, and again in the common ip output path. Remove the mcast bump, as its
not needed

Validated by the reporter, with good results

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Claus Jensen <claus.jensen@microsemi.com>
CC: Claus Jensen <claus.jensen@microsemi.com>
CC: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 13206b6b 07-Oct-2015 Eric W. Biederman <ebiederm@xmission.com>

net: Pass net into dst_output and remove dst_output_okfn

Replace dst_output_okfn with dst_output

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0c4b51f0 15-Sep-2015 Eric W. Biederman <ebiederm@xmission.com>

netfilter: Pass net into okfn

This is immediately motivated by the bridge code that chains functions that
call into netfilter. Without passing net into the okfns the bridge code would
need to guess about the best expression for the network namespace to process
packets in.

As net is frequently one of the first things computed in continuation functions
after netfilter has done it's job passing in the desired network namespace is in
many cases a code simplification.

To support this change the function dst_output_okfn is introduced to
simplify passing dst_output as an okfn. For the moment dst_output_okfn
just silently drops the struct net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 29a26a56 15-Sep-2015 Eric W. Biederman <ebiederm@xmission.com>

netfilter: Pass struct net into the netfilter hooks

Pass a network namespace parameter into the netfilter hooks. At the
call site of the netfilter hooks the path a packet is taking through
the network stack is well known which allows the network namespace to
be easily and reliabily.

This allows the replacement of magic code like
"dev_net(state->in?:state->out)" that appears at the start of most
netfilter hooks with "state->net".

In almost all cases the network namespace passed in is derived
from the first network device passed in, guaranteeing those
paths will not see any changes in practice.

The exceptions are:
xfrm/xfrm_output.c:xfrm_output_resume() xs_net(skb_dst(skb)->xfrm)
ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont() ip_vs_conn_net(cp)
ipvs/ip_vs_xmit.c:ip_vs_send_or_cont() ip_vs_conn_net(cp)
ipv4/raw.c:raw_send_hdrinc() sock_net(sk)
ipv6/ip6_output.c:ip6_xmit() sock_net(sk)
ipv6/ndisc.c:ndisc_send_skb() dev_net(skb->dev) not dev_net(dst->dev)
ipv6/raw.c:raw6_send_hdrinc() sock_net(sk)
br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev

In all cases these exceptions seem to be a better expression for the
network namespace the packet is being processed in then the historic
"dev_net(in?in:out)". I am documenting them in case something odd
pops up and someone starts trying to track down what happened.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5a70649e 15-Sep-2015 Eric W. Biederman <ebiederm@xmission.com>

net: Merge dst_output and dst_output_sk

Add a sock paramter to dst_output making dst_output_sk superfluous.
Add a skb->sk parameter to all of the callers of dst_output
Have the callers of dst_output_sk call dst_output.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7026b1dd 05-Apr-2015 David Miller <davem@davemloft.net>

netfilter: Pass socket pointer down through okfn().

On the output paths in particular, we have to sometimes deal with two
socket contexts. First, and usually skb->sk, is the local socket that
generated the frame.

And second, is potentially the socket used to control a tunneling
socket, such as one the encapsulates using UDP.

We do not want to disassociate skb->sk when encapsulating in order
to fix this, because that would break socket memory accounting.

The most extreme case where this can cause huge problems is an
AF_PACKET socket transmitting over a vxlan device. We hit code
paths doing checks that assume they are dealing with an ipv4
socket, but are actually operating upon the AF_PACKET one.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 53b24b8f 29-Mar-2015 Ian Morris <ipm@chirality.org.uk>

ipv6: coding style: comparison for inequality with NULL

The ipv6 code uses a mixture of coding styles. In some instances check for NULL
pointer is done as x != NULL and sometimes as x. x is preferred according to
checkpatch and this patch makes the code consistent by adopting the latter
form.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 63159f29 29-Mar-2015 Ian Morris <ipm@chirality.org.uk>

ipv6: coding style: comparison for equality with NULL

The ipv6 code uses a mixture of coding styles. In some instances check for NULL
pointer is done as x == NULL and sometimes as !x. !x is preferred according to
checkpatch and this patch makes the code consistent by adopting the latter
form.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 54ff9ef3 18-Mar-2015 Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

ipv4, ipv6: kill ip_mc_{join, leave}_group and ipv6_sock_mc_{join, drop}

in favor of their inner __ ones, which doesn't grab rtnl.

As these functions need to operate on a locked socket, we can't be
grabbing rtnl by then. It's too late and doing so causes reversed
locking.

So this patch:
- move rtnl handling to callers instead while already fixing some
reversed locking situations, like on vxlan and ipvs code.
- renames __ ones to not have the __ mark:
__ip_mc_{join,leave}_group -> ip_mc_{join,leave}_group
__ipv6_sock_mc_{join,drop} -> ipv6_sock_mc_{join,drop}

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 93a714d6 25-Feb-2015 Madhu Challa <challa@noironetworks.com>

multicast: Extend ip address command to enable multicast group join/leave on

Joining multicast group on ethernet level via "ip maddr" command would
not work if we have an Ethernet switch that does igmp snooping since
the switch would not replicate multicast packets on ports that did not
have IGMP reports for the multicast addresses.

Linux vxlan interfaces created via "ip link add vxlan" have the group option
that enables then to do the required join.

By extending ip address command with option "autojoin" we can get similar
functionality for openvswitch vxlan interfaces as well as other tunneling
mechanisms that need to receive multicast traffic. The kernel code is
structured similar to how the vxlan driver does a group join / leave.

example:
ip address add 224.1.1.10/24 dev eth5 autojoin
ip address del 224.1.1.10/24 dev eth5

Signed-off-by: Madhu Challa <challa@noironetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 46a4dee0 25-Feb-2015 Madhu Challa <challa@noironetworks.com>

igmp v6: add __ipv6_sock_mc_join and __ipv6_sock_mc_drop

Based on the igmp v4 changes from Eric Dumazet.
959d10f6bbf6("igmp: add __ip_mc_{join|leave}_group()")

These changes are needed to perform igmp v6 join/leave while
RTNL is held.

Make ipv6_sock_mc_join and ipv6_sock_mc_drop wrappers around
__ipv6_sock_mc_join and __ipv6_sock_mc_drop to avoid
proliferation of work queues.

Signed-off-by: Madhu Challa <challa@noironetworks.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# feb91a02 05-Nov-2014 Daniel Borkmann <daniel@iogearbox.net>

ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs

It has been reported that generating an MLD listener report on
devices with large MTUs (e.g. 9000) and a high number of IPv6
addresses can trigger a skb_over_panic():

skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20
head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0
dev:port1
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:100!
invalid opcode: 0000 [#1] SMP
Modules linked in: ixgbe(O)
CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4
[...]
Call Trace:
<IRQ>
[<ffffffff80578226>] ? skb_put+0x3a/0x3b
[<ffffffff80612a5d>] ? add_grhead+0x45/0x8e
[<ffffffff80612e3a>] ? add_grec+0x394/0x3d4
[<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d
[<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45
[<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68
[<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182
[<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d
[<ffffffff8025112b>] ? irq_exit+0x4e/0xd3
[<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46
[<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70

mld_newpack() skb allocations are usually requested with dev->mtu
in size, since commit 72e09ad107e7 ("ipv6: avoid high order allocations")
we have changed the limit in order to be less likely to fail.

However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb)
macros, which determine if we may end up doing an skb_put() for
adding another record. To avoid possible fragmentation, we check
the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong
assumption as the actual max allocation size can be much smaller.

The IGMP case doesn't have this issue as commit 57e1ab6eaddc
("igmp: refine skb allocations") stores the allocation size in
the cb[].

Set a reserved_tailroom to make it fit into the MTU and use
skb_availroom() helper instead. This also allows to get rid of
igmp_skb_size().

Reported-by: Wei Liu <lw1a2.jing@gmail.com>
Fixes: 72e09ad107e7 ("ipv6: avoid high order allocations")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: David L Stevens <david.stevens@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4c672e4b 05-Nov-2014 Daniel Borkmann <daniel@iogearbox.net>

ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs

It has been reported that generating an MLD listener report on
devices with large MTUs (e.g. 9000) and a high number of IPv6
addresses can trigger a skb_over_panic():

skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20
head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0
dev:port1
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:100!
invalid opcode: 0000 [#1] SMP
Modules linked in: ixgbe(O)
CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4
[...]
Call Trace:
<IRQ>
[<ffffffff80578226>] ? skb_put+0x3a/0x3b
[<ffffffff80612a5d>] ? add_grhead+0x45/0x8e
[<ffffffff80612e3a>] ? add_grec+0x394/0x3d4
[<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d
[<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45
[<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68
[<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182
[<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d
[<ffffffff8025112b>] ? irq_exit+0x4e/0xd3
[<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46
[<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70

mld_newpack() skb allocations are usually requested with dev->mtu
in size, since commit 72e09ad107e7 ("ipv6: avoid high order allocations")
we have changed the limit in order to be less likely to fail.

However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb)
macros, which determine if we may end up doing an skb_put() for
adding another record. To avoid possible fragmentation, we check
the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong
assumption as the actual max allocation size can be much smaller.

The IGMP case doesn't have this issue as commit 57e1ab6eaddc
("igmp: refine skb allocations") stores the allocation size in
the cb[].

Set a reserved_tailroom to make it fit into the MTU and use
skb_availroom() helper instead. This also allows to get rid of
igmp_skb_size().

Reported-by: Wei Liu <lw1a2.jing@gmail.com>
Fixes: 72e09ad107e7 ("ipv6: avoid high order allocations")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: David L Stevens <david.stevens@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1744bea1 04-Nov-2014 Joe Perches <joe@perches.com>

net: Convert SEQ_START_TOKEN/seq_printf to seq_puts

Using a single fixed string is smaller code size than using
a format and many string arguments.

Reduces overall code size a little.

$ size net/ipv4/igmp.o* net/ipv6/mcast.o* net/ipv6/ip6_flowlabel.o*
text data bss dec hex filename
34269 7012 14824 56105 db29 net/ipv4/igmp.o.new
34315 7012 14824 56151 db57 net/ipv4/igmp.o.old
30078 7869 13200 51147 c7cb net/ipv6/mcast.o.new
30105 7869 13200 51174 c7e6 net/ipv6/mcast.o.old
11434 3748 8580 23762 5cd2 net/ipv6/ip6_flowlabel.o.new
11491 3748 8580 23819 5d0b net/ipv6/ip6_flowlabel.o.old

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 35f7aa53 20-Sep-2014 Daniel Borkmann <daniel@iogearbox.net>

ipv6: mld: answer mldv2 queries with mldv1 reports in mldv1 fallback

RFC2710 (MLDv1), section 3.7. says:

The length of a received MLD message is computed by taking the
IPv6 Payload Length value and subtracting the length of any IPv6
extension headers present between the IPv6 header and the MLD
message. If that length is greater than 24 octets, that indicates
that there are other fields present *beyond* the fields described
above, perhaps belonging to a *future backwards-compatible* version
of MLD. An implementation of the version of MLD specified in this
document *MUST NOT* send an MLD message longer than 24 octets and
MUST ignore anything past the first 24 octets of a received MLD
message.

RFC3810 (MLDv2), section 8.2.1. states for *listeners* regarding
presence of MLDv1 routers:

In order to be compatible with MLDv1 routers, MLDv2 hosts MUST
operate in version 1 compatibility mode. [...] When Host
Compatibility Mode is MLDv2, a host acts using the MLDv2 protocol
on that interface. When Host Compatibility Mode is MLDv1, a host
acts in MLDv1 compatibility mode, using *only* the MLDv1 protocol,
on that interface. [...]

While section 8.3.1. specifies *router* behaviour regarding presence
of MLDv1 routers:

MLDv2 routers may be placed on a network where there is at least
one MLDv1 router. The following requirements apply:

If an MLDv1 router is present on the link, the Querier MUST use
the *lowest* version of MLD present on the network. This must be
administratively assured. Routers that desire to be compatible
with MLDv1 MUST have a configuration option to act in MLDv1 mode;
if an MLDv1 router is present on the link, the system administrator
must explicitly configure all MLDv2 routers to act in MLDv1 mode.
When in MLDv1 mode, the Querier MUST send periodic General Queries
truncated at the Multicast Address field (i.e., 24 bytes long),
and SHOULD also warn about receiving an MLDv2 Query (such warnings
must be rate-limited). The Querier MUST also fill in the Maximum
Response Delay in the Maximum Response Code field, i.e., the
exponential algorithm described in section 5.1.3. is not used. [...]

That means that we should not get queries from different versions of
MLD. When there's a MLDv1 router present, MLDv2 enforces truncation
and MRC == MRD (both fields are overlapping within the 24 octet range).

Section 8.3.2. specifies behaviour in the presence of MLDv1 multicast
address *listeners*:

MLDv2 routers may be placed on a network where there are hosts
that have not yet been upgraded to MLDv2. In order to be compatible
with MLDv1 hosts, MLDv2 routers MUST operate in version 1 compatibility
mode. MLDv2 routers keep a compatibility mode per multicast address
record. The compatibility mode of a multicast address is determined
from the Multicast Address Compatibility Mode variable, which can be
in one of the two following states: MLDv1 or MLDv2.

The Multicast Address Compatibility Mode of a multicast address
record is set to MLDv1 whenever an MLDv1 Multicast Listener Report is
*received* for that multicast address. At the same time, the Older
Version Host Present timer for the multicast address is set to Older
Version Host Present Timeout seconds. The timer is re-set whenever a
new MLDv1 Report is received for that multicast address. If the Older
Version Host Present timer expires, the router switches back to
Multicast Address Compatibility Mode of MLDv2 for that multicast
address. [...]

That means, what can happen is the following scenario, that hosts can
act in MLDv1 compatibility mode when they previously have received an
MLDv1 query (or, simply operate in MLDv1 mode-only); and at the same
time, an MLDv2 router could start up and transmits MLDv2 startup query
messages while being unaware of the current operational mode.

Given RFC2710, section 3.7 we would need to answer to that with an MLDv1
listener report, so that the router according to RFC3810, section 8.3.2.
would receive that and internally switch to MLDv1 compatibility as well.

Right now, I believe since the initial implementation of MLDv2, Linux
hosts would just silently drop such MLDv2 queries instead of replying
with an MLDv1 listener report, which would prevent a MLDv2 router going
into fallback mode (until it receives other MLDv1 queries).

Since the mapping of MRC to MRD in exactly such cases can make use of
the exponential algorithm from 5.1.3, we cannot [strictly speaking] be
aware in MLDv1 of the encoding in MRC, it seems also not mentioned by
the RFC. Since encodings are the same up to 32767, assume in such a
situation this value as a hard upper limit we would clamp. We have asked
one of the RFC authors on that regard, and he mentioned that there seem
not to be any implementations that make use of that exponential algorithm
on startup messages. In any case, this patch fixes this MLD
interoperability issue.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1691c63e 11-Sep-2014 WANG Cong <xiyou.wangcong@gmail.com>

ipv6: refactor ipv6_dev_mc_inc()

Refactor out allocation and initialization and make
the refcount code more readable.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f7ed925c 11-Sep-2014 WANG Cong <xiyou.wangcong@gmail.com>

ipv6: update the comment in mcast.c

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 414b6c94 11-Sep-2014 WANG Cong <xiyou.wangcong@gmail.com>

ipv6: drop some rcu_read_lock in mcast

Similarly the code is already protected by rtnl lock.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b5350916 11-Sep-2014 WANG Cong <xiyou.wangcong@gmail.com>

ipv6: drop ipv6_sk_mc_lock in mcast

Similarly the code is already protected by rtnl lock.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cbeddd5d 09-Sep-2014 Daniel Borkmann <daniel@iogearbox.net>

ipv6: mcast: remove dead debugging defines

It's not used anywhere, so just remove these.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a9ed4a29 02-Sep-2014 Sabrina Dubroca <sd@queasysnail.net>

ipv6: fix rtnl locking in setsockopt for anycast and multicast

Calling setsockopt with IPV6_JOIN_ANYCAST or IPV6_LEAVE_ANYCAST
triggers the assertion in addrconf_join_solict()/addrconf_leave_solict()

ipv6_sock_ac_join(), ipv6_sock_ac_drop(), ipv6_sock_ac_close() need to
take RTNL before calling ipv6_dev_ac_inc/dec. Same thing with
ipv6_sock_mc_join(), ipv6_sock_mc_drop(), ipv6_sock_mc_close() before
calling ipv6_dev_mc_inc/dec.

This patch moves ASSERT_RTNL() up a level in the call stack.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reported-by: Tommi Rantala <tt.rantala@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2f711939 02-Sep-2014 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: add sysctl_mld_qrv to configure query robustness variable

This patch adds a new sysctl_mld_qrv knob to configure the mldv1/v2 query
robustness variable. It specifies how many retransmit of unsolicited mld
retransmit should happen. Admins might want to tune this on lossy links.

Also reset mld state on interface down/up, so we pick up new sysctl
settings during interface up event.

IPv6 certification requests this knob to be available.

I didn't make this knob netns specific, as it is mostly a setting in a
physical environment and should be per host.

Cc: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 67ba4152 24-Aug-2014 Ian Morris <ipm@chirality.org.uk>

ipv6: White-space cleansing : Line Layouts

This patch makes no changes to the logic of the code but simply addresses
coding style issues as detected by checkpatch.

Both objdump and diff -w show no differences.

A number of items are addressed in this patch:
* Multiple spaces converted to tabs
* Spaces before tabs removed.
* Spaces in pointer typing cleansed (char *)foo etc.
* Remove space after sizeof
* Ensure spacing around comparators such as if statements.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e940f5d6 26-Jun-2014 Hangbin Liu <liuhangbin@gmail.com>

ipv6: Fix MLD Query message check

Based on RFC3810 6.2, we also need to check the hop limit and router alert
option besides source address.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 43a43b60 31-Mar-2014 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: some ipv6 statistic counters failed to disable bh

After commit c15b1ccadb323ea ("ipv6: move DAD and addrconf_verify
processing to workqueue") some counters are now updated in process context
and thus need to disable bh before doing so, otherwise deadlocks can
happen on 32-bit archs. Fabio Estevam noticed this while while mounting
a NFS volume on an ARM board.

As a compensation for missing this I looked after the other *_STATS_BH
and found three other calls which need updating:

1) icmp6_send: ip6_fragment -> icmpv6_send -> icmp6_send (error handling)
2) ip6_push_pending_frames: rawv6_sendmsg -> rawv6_push_pending_frames -> ...
(only in case of icmp protocol with raw sockets in error handling)
3) ping6_v6_sendmsg (error handling)

Fixes: c15b1ccadb323ea ("ipv6: move DAD and addrconf_verify processing to workqueue")
Reported-by: Fabio Estevam <festevam@gmail.com>
Tested-by: Fabio Estevam <fabio.estevam@freescale.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6a7cc418 16-Jan-2014 Flavio Leitner <fbl@redhat.com>

ipv6: send Change Status Report after DAD is completed

The RFC 3810 defines two type of messages for multicast
listeners. The "Current State Report" message, as the name
implies, refreshes the *current* state to the querier.
Since the querier sends Query messages periodically, there
is no need to retransmit the report.

On the other hand, any change should be reported immediately
using "State Change Report" messages. Since it's an event
triggered by a change and that it can be affected by packet
loss, the rfc states it should be retransmitted [RobVar] times
to make sure routers will receive timely.

Currently, we are sending "Current State Reports" after
DAD is completed. Before that, we send messages using
unspecified address (::) which should be silently discarded
by routers.

This patch changes to send "State Change Report" messages
after DAD is completed fixing the behavior to be RFC compliant
and also to pass TAHI IPv6 testsuite.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 63862b5b 11-Jan-2014 Aruna-Hewapathirane <aruna.hewapathirane@gmail.com>

net: replace macros net_random and net_srandom with direct calls to prandom

This patch removes the net_random and net_srandom macros and replaces
them with direct calls to the prandom ones. As new commits only seem to
use prandom_u32 there is no use to keep them around.
This change makes it easier to grep for users of prandom_u32.

Signed-off-by: Aruna-Hewapathirane <aruna.hewapathirane@gmail.com>
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9260d3e1 29-Sep-2013 Salam Noureddine <noureddine@aristanetworks.com>

ipv6 mcast: use in6_dev_put in timer handlers instead of __in6_dev_put

It is possible for the timer handlers to run after the call to
ipv6_mc_down so use in6_dev_put instead of __in6_dev_put in the
handler function in order to do proper cleanup when the refcnt
reaches 0. Otherwise, the refcnt can reach zero without the
inet6_dev being destroyed and we end up leaking a reference to
the net_device and see messages like the following,

unregister_netdevice: waiting for eth0 to become free. Usage count = 1

Tested on linux-3.4.43.

Signed-off-by: Salam Noureddine <noureddine@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b4af8def 03-Sep-2013 Daniel Borkmann <daniel@iogearbox.net>

net: ipv6: mld: introduce mld_{gq, ifc, dad}_stop_timer functions

We already have mld_{gq,ifc,dad}_start_timer() functions, so introduce
mld_{gq,ifc,dad}_stop_timer() functions to reduce code size and make it
more readable.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2b7c121f 03-Sep-2013 Daniel Borkmann <daniel@iogearbox.net>

net: ipv6: mld: refactor query processing into v1/v2 functions

Make igmp6_event_query() a bit easier to read by refactoring code
parts into mld_process_v1() and mld_process_v2().

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cc7f7ab7 03-Sep-2013 Daniel Borkmann <daniel@iogearbox.net>

net: ipv6: mld: similarly to MLDv2 have min max_delay of 1

Similarly as we do in MLDv2 queries, set a forged MLDv1 query with
0 ms mld_maxdelay to minimum timer shot time of 1 jiffies. This is
eventually done in igmp6_group_queried() anyway, so we can simplify
a check there.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 58c0ecfd 03-Sep-2013 Daniel Borkmann <daniel@iogearbox.net>

net: ipv6: mld: implement RFC3810 MLDv2 mode only

RFC3810, 10. Security Considerations says under subsection 10.1.
Query Message:

A forged Version 1 Query message will put MLDv2 listeners on that
link in MLDv1 Host Compatibility Mode. This scenario can be avoided
by providing MLDv2 hosts with a configuration option to ignore
Version 1 messages completely.

Hence, implement a MLDv2-only mode that will ignore MLDv1 traffic:

echo 2 > /proc/sys/net/ipv6/conf/ethX/force_mld_version or
echo 2 > /proc/sys/net/ipv6/conf/all/force_mld_version

Note that <all> device has a higher precedence as it was previously
also the case in the macro MLD_V1_SEEN() that would "short-circuit"
if condition on <all> case.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e3f5b170 03-Sep-2013 Daniel Borkmann <daniel@iogearbox.net>

net: ipv6: mld: get rid of MLDV2_MRC and simplify calculation

Get rid of MLDV2_MRC and use our new macros for mantisse and
exponent to calculate Maximum Response Delay out of the Maximum
Response Code.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6c567b78 03-Sep-2013 Daniel Borkmann <daniel@iogearbox.net>

net: ipv6: mld: clean up MLD_V1_SEEN macro

Replace the macro with a function to make it more readable. GCC will
eventually decide whether to inline this or not (also, that's not
fast-path anyway).

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 89225d1c 03-Sep-2013 Daniel Borkmann <daniel@iogearbox.net>

net: ipv6: mld: fix v1/v2 switchback timeout to rfc3810, 9.12.

i) RFC3810, 9.2. Query Interval [QI] says:

The Query Interval variable denotes the interval between General
Queries sent by the Querier. Default value: 125 seconds. [...]

ii) RFC3810, 9.3. Query Response Interval [QRI] says:

The Maximum Response Delay used to calculate the Maximum Response
Code inserted into the periodic General Queries. Default value:
10000 (10 seconds) [...] The number of seconds represented by the
[Query Response Interval] must be less than the [Query Interval].

iii) RFC3810, 9.12. Older Version Querier Present Timeout [OVQPT] says:

The Older Version Querier Present Timeout is the time-out for
transitioning a host back to MLDv2 Host Compatibility Mode. When an
MLDv1 query is received, MLDv2 hosts set their Older Version Querier
Present Timer to [Older Version Querier Present Timeout].

This value MUST be ([Robustness Variable] times (the [Query Interval]
in the last Query received)) plus ([Query Response Interval]).

Hence, on *default* the timeout results in:

[RV] = 2, [QI] = 125sec, [QRI] = 10sec
[OVQPT] = [RV] * [QI] + [QRI] = 260sec

Having that said, we currently calculate [OVQPT] (here given as 'switchback'
variable) as ...

switchback = (idev->mc_qrv + 1) * max_delay

RFC3810, 9.12. says "the [Query Interval] in the last Query received". In
section "9.14. Configuring timers", it is said:

This section is meant to provide advice to network administrators on
how to tune these settings to their network. Ambitious router
implementations might tune these settings dynamically based upon
changing characteristics of the network. [...]

iv) RFC38010, 9.14.2. Query Interval:

The overall level of periodic MLD traffic is inversely proportional
to the Query Interval. A longer Query Interval results in a lower
overall level of MLD traffic. The value of the Query Interval MUST
be equal to or greater than the Maximum Response Delay used to
calculate the Maximum Response Code inserted in General Query
messages.

I assume that was why switchback is calculated as is (3 * max_delay), although
this setting seems to be meant for routers only to configure their [QI]
interval for non-default intervals. So usage here like this is clearly wrong.

Concluding, the current behaviour in IPv6's multicast code is not conform
to the RFC as switch back is calculated wrongly. That is, it has a too small
value, so MLDv2 hosts switch back again to MLDv2 way too early, i.e. ~30secs
instead of ~260secs on default.

Hence, introduce necessary helper functions and fix this up properly as it
should be.

Introduced in 06da92283 ("[IPV6]: Add MLDv2 support."). Credits to Hannes
Frederic Sowa who also had a hand in this as well. Also thanks to Hangbin Liu
who did initial testing.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: David Stevens <dlstevens@us.ibm.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9fd07841 19-Aug-2013 Daniel Borkmann <daniel@iogearbox.net>

net: ipv6: mcast: minor: use defines for rfc3810/8.1 lengths

Instead of hard-coding length values, use a define to make it clear
where those lengths come from.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c2cef4e8 19-Aug-2013 Daniel Borkmann <daniel@iogearbox.net>

net: ipv6: minor: *_start_timer: rather use unsigned long

For the functions mld_gq_start_timer(), mld_ifc_start_timer(),
and mld_dad_start_timer(), rather use unsigned long than int
as we operate only on unsigned values anyway. This seems more
appropriate as there is no good reason to do type conversions
to int, that could lead to future errors.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 84698963 19-Aug-2013 Daniel Borkmann <daniel@iogearbox.net>

net: ipv6: igmp6_event_query: use msecs_to_jiffies

Use proper API functions to calculate jiffies from milliseconds and
not the crude method of dividing HZ by a value. This ensures more
accurate values even in the case of strange HZ values. While at it,
also simplify code in the mlh2 case by using max().

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fc4eba58 13-Aug-2013 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: make unsolicited report intervals configurable for mld

Commit cab70040dfd95ee32144f02fade64f0cb94f31a0 ("net: igmp:
Reduce Unsolicited report interval to 1s when using IGMPv3") and
2690048c01f32bf45d1c1e1ab3079bc10ad2aea7 ("net: igmp: Allow user-space
configuration of igmp unsolicited report interval") by William Manley made
igmp unsolicited report intervals configurable per interface and corrected
the interval of unsolicited igmpv3 report messages resendings to 1s.

Same needs to be done for IPv6:

MLDv1 (RFC2710 7.10.): 10 seconds
MLDv2 (RFC3810 9.11.): 1 second

Both intervals are configurable via new procfs knobs
mldv1_unsolicited_report_interval and mldv2_unsolicited_report_interval.

(also added .force_mld_version to ipv6_devconf_dflt to bring structs in
line without semantic changes)

v2:
a) Joined documentation update for IPv4 and IPv6 MLD/IGMP
unsolicited_report_interval procfs knobs.
b) incorporate stylistic feedback from William Manley

v3:
a) add new DEVCONF_* values to the end of the enum (thanks to David
Miller)

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: William Manley <william.manley@youview.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9d4a0314 26-Jul-2013 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv4, ipv6: send igmpv3/mld packets with TC_PRIO_CONTROL

v2:
a) Also send ipv4 igmp messages with TC_PRIO_CONTROL

Cc: William Manley <william.manley@youview.com>
Cc: Lukas Tribus <luky-37@hotmail.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8965779d 29-Jun-2013 Amerigo Wang <amwang@redhat.com>

ipv6,mcast: always hold idev->lock before mca_lock

dingtianhong reported the following deadlock detected by lockdep:

======================================================
[ INFO: possible circular locking dependency detected ]
3.4.24.05-0.1-default #1 Not tainted
-------------------------------------------------------
ksoftirqd/0/3 is trying to acquire lock:
(&ndev->lock){+.+...}, at: [<ffffffff8147f804>] ipv6_get_lladdr+0x74/0x120

but task is already holding lock:
(&mc->mca_lock){+.+...}, at: [<ffffffff8149d130>] mld_send_report+0x40/0x150

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&mc->mca_lock){+.+...}:
[<ffffffff810a8027>] validate_chain+0x637/0x730
[<ffffffff810a8417>] __lock_acquire+0x2f7/0x500
[<ffffffff810a8734>] lock_acquire+0x114/0x150
[<ffffffff814f691a>] rt_spin_lock+0x4a/0x60
[<ffffffff8149e4bb>] igmp6_group_added+0x3b/0x120
[<ffffffff8149e5d8>] ipv6_mc_up+0x38/0x60
[<ffffffff81480a4d>] ipv6_find_idev+0x3d/0x80
[<ffffffff81483175>] addrconf_notify+0x3d5/0x4b0
[<ffffffff814fae3f>] notifier_call_chain+0x3f/0x80
[<ffffffff81073471>] raw_notifier_call_chain+0x11/0x20
[<ffffffff813d8722>] call_netdevice_notifiers+0x32/0x60
[<ffffffff813d92d4>] __dev_notify_flags+0x34/0x80
[<ffffffff813d9360>] dev_change_flags+0x40/0x70
[<ffffffff813ea627>] do_setlink+0x237/0x8a0
[<ffffffff813ebb6c>] rtnl_newlink+0x3ec/0x600
[<ffffffff813eb4d0>] rtnetlink_rcv_msg+0x160/0x310
[<ffffffff814040b9>] netlink_rcv_skb+0x89/0xb0
[<ffffffff813eb357>] rtnetlink_rcv+0x27/0x40
[<ffffffff81403e20>] netlink_unicast+0x140/0x180
[<ffffffff81404a9e>] netlink_sendmsg+0x33e/0x380
[<ffffffff813c4252>] sock_sendmsg+0x112/0x130
[<ffffffff813c537e>] __sys_sendmsg+0x44e/0x460
[<ffffffff813c5544>] sys_sendmsg+0x44/0x70
[<ffffffff814feab9>] system_call_fastpath+0x16/0x1b

-> #0 (&ndev->lock){+.+...}:
[<ffffffff810a798e>] check_prev_add+0x3de/0x440
[<ffffffff810a8027>] validate_chain+0x637/0x730
[<ffffffff810a8417>] __lock_acquire+0x2f7/0x500
[<ffffffff810a8734>] lock_acquire+0x114/0x150
[<ffffffff814f6c82>] rt_read_lock+0x42/0x60
[<ffffffff8147f804>] ipv6_get_lladdr+0x74/0x120
[<ffffffff8149b036>] mld_newpack+0xb6/0x160
[<ffffffff8149b18b>] add_grhead+0xab/0xc0
[<ffffffff8149d03b>] add_grec+0x3ab/0x460
[<ffffffff8149d14a>] mld_send_report+0x5a/0x150
[<ffffffff8149f99e>] igmp6_timer_handler+0x4e/0xb0
[<ffffffff8105705a>] call_timer_fn+0xca/0x1d0
[<ffffffff81057b9f>] run_timer_softirq+0x1df/0x2e0
[<ffffffff8104e8c7>] handle_pending_softirqs+0xf7/0x1f0
[<ffffffff8104ea3b>] __do_softirq_common+0x7b/0xf0
[<ffffffff8104f07f>] __thread_do_softirq+0x1af/0x210
[<ffffffff8104f1c1>] run_ksoftirqd+0xe1/0x1f0
[<ffffffff8106c7de>] kthread+0xae/0xc0
[<ffffffff814fff74>] kernel_thread_helper+0x4/0x10

actually we can just hold idev->lock before taking pmc->mca_lock,
and avoid taking idev->lock again when iterating idev->addr_list,
since the upper callers of mld_newpack() already take
read_lock_bh(&idev->lock).

Reported-by: dingtianhong <dingtianhong@huawei.com>
Cc: dingtianhong <dingtianhong@huawei.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Tested-by: Ding Tianhong <dingtianhong@huawei.com>
Tested-by: Chen Weilong <chenweilong@huawei.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b173ee48 26-Jun-2013 Hannes Frederic Sowa <hannes@stressinduktion.org>

ipv6: resend MLD report if a link-local address completes DAD

RFC3590/RFC3810 specifies we should resend MLD reports as soon as a
valid link-local address is available.

We now use the valid_ll_addr_cnt to check if it is necessary to resend
a new report.

Changes since Flavio Leitner's version:
a) adapt for valid_ll_addr_cnt
b) resend first reports directly in the path and just arm the timer for
mc_qrv-1 resends.

Reported-by: Flavio Leitner <fleitner@redhat.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: David Stevens <dlstevens@us.ibm.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 29a3cad5 28-May-2013 Simon Horman <horms@verge.net.au>

ipv6: Correct comparisons and calculations using skb->tail and skb-transport_header

This corrects an regression introduced by "net: Use 16bits for *_headers
fields of struct skbuff" when NET_SKBUFF_DATA_USES_OFFSET is not set. In
that case skb->tail will be a pointer whereas skb->transport_header
will be an offset from head. This is corrected by using wrappers that
ensure that comparisons and calculations are always made using pointers.

Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ece31ffd 17-Feb-2013 Gao feng <gaofeng@cn.fujitsu.com>

net: proc: change proc_net_remove to remove_proc_entry

proc_net_remove is only used to remove proc entries
that under /proc/net,it's not a general function for
removing proc entries of netns. if we want to remove
some proc entries which under /proc/net/stat/, we still
need to call remove_proc_entry.

this patch use remove_proc_entry to replace proc_net_remove.
we can remove proc_net_remove after this patch.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d4beaa66 17-Feb-2013 Gao feng <gaofeng@cn.fujitsu.com>

net: proc: change proc_net_fops_create to proc_create

Right now, some modules such as bonding use proc_create
to create proc entries under /proc/net/, and other modules
such as ipv4 use proc_net_fops_create.

It looks a little chaos.this patch changes all of
proc_net_fops_create to proc_create. we can remove
proc_net_fops_create after this patch.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ec16ef22 08-Feb-2013 YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org>

ipv6 mcast: Do not join device multicast for interface-local multicasts.

RFC4291 (IPv6 addressing architecture) says that interface-Local scope
spans only a single interface on a node. We should not join L2 device
multicast list for addresses in interface-local (or smaller) scope.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 56db1c5f 03-Feb-2013 Jean Sacren <sakiwit@gmail.com>

mcast: do not check 'rv' twice in a row

With the loop, don't check 'rv' twice in a row. Without the loop, 'rv'
doesn't even need to be checked.

Make the comment more grammar-friendly.

Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 07c2fecc 28-Jan-2013 YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org>

ipv6 mcast: Use ipv6_addr_equal() in ip6_mc_source().

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2576f17d 20-Jan-2013 YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org>

ipv6: Unshare ip6_nd_hdr() and change return type to void.

- move ip6_nd_hdr() to its users' source files.
In net/ipv6/mcast.c, it will be called ip6_mc_hdr().
- make return type to void since this function never fails.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 12fd84f4 17-Jan-2013 YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org>

ipv6: Remove unused neigh argument for icmp6_dst_alloc() and its callers.

Because of rt->n removal, we do not need neigh argument any more.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# daad1512 12-Jan-2013 YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org>

ipv6: Make ipv6_is_mld() inline and use it from ip6_mc_input().

Move generalized version of ipv6_is_mld() to header,
and use it from ip6_mc_input().

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0e1efe9d 05-Dec-2012 Eric Dumazet <edumazet@google.com>

ipv6: avoid taking locks at socket dismantle

ipv6_sock_mc_close() is called for ipv6 sockets at close time, and most
of them don't use multicast.

Add a test to avoid contention on a shared spinlock.

Same heuristic applies for ipv6_sock_ac_close(), to avoid contention
on a shared rwlock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 94e187c0 28-Oct-2012 Amerigo Wang <amwang@redhat.com>

ipv6: introduce ip6_rt_put()

As suggested by Eric, we could introduce a helper function
for ipv6 too, to avoid checking if rt is NULL before
dst_release().

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a858d64b 17-Jul-2012 Li Wei <lw@cn.fujitsu.com>

ipv6: fix unappropriate errno returned for non-multicast address

We need to check the passed in multicast address and return
appropriate errno(EINVAL) if it is not valid. And it's no need
to walk through the ipv6_mc_list in this situation.

Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a50feda5 18-May-2012 Eric Dumazet <edumazet@google.com>

ipv6: bool/const conversions phase2

Mostly bool conversions, some inline removals and const additions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f3213831 15-May-2012 Joe Perches <joe@perches.com>

net: ipv6: Standardize prefixes for message logging

Add #define pr_fmt(fmt) as appropriate.

Add "IPv6: " to appropriate files.

Convert printk(KERN_<LEVEL> to pr_<level> (but not KERN_DEBUG).
Standardize on "%s: " not "%s(): " when emitting __func__.
Use "%s: ", __func__ instead of embedding function name.
Coalesce formats, align arguments.

ADDRCONF output is now prefixed with "IPv6: "

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ce713ee5 05-Apr-2012 RongQing.Li <roy.qing.li@gmail.com>

net: replace continue with break to reduce unnecessary loop in xxx_xmarksources

The conditional which decides to skip inactive filters does not
change with the change of loop index, so it is unnecessary to
check them many times.

Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 78d50217 04-Apr-2012 RongQing.Li <roy.qing.li@gmail.com>

ipv6: fix array index in ip6_mc_add_src()

Convert array index from the loop bound to the loop index.

And remove the void type conversion to ip6_mc_del1_src() return
code, seem it is unnecessary, since ip6_mc_del1_src() does not
use __must_check similar attribute, no compiler will report the
warning when it is removed.

v2: enrich the commit header

Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c5779237 15-Mar-2012 RongQing.Li <roy.qing.li@gmail.com>

ipv6: Don't dev_hold(dev) in ip6_mc_find_dev_rcu.

ip6_mc_find_dev_rcu() is called with rcu_read_lock(), so don't
need to dev_hold().
With dev_hold(), not corresponding dev_put(), will lead to leak.

[ bug introduced in 96b52e61be1 (ipv6: mcast: RCU conversions) ]

Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d1918542 28-Dec-2011 David S. Miller <davem@davemloft.net>

ipv6: Kill rt6i_dev and rt6i_expires defines.

It just obscures that the netdevice pointer and the expires value are
implemented in the dst_entry sub-object of the ipv6 route.

And it makes grepping for dst_entry member uses much harder too.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 87a11578 06-Dec-2011 David S. Miller <davem@davemloft.net>

ipv6: Move xfrm_lookup() call down into icmp6_dst_alloc().

And return error pointers.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 99d2f47a 29-Nov-2011 Jun Zhao <mypopydev@gmail.com>

ipv6 : mcast : Delete useless parameter in ip6_mc_add1_src()

Need not to used 'delta' flag when add single-source to interface
filter source list.

Signed-off-by: Jun Zhao <mypopydev@gmail.com>
Signed-off-by: David S. Miller <davem@drr.davemloft.net>


# 4e3fd7a0 20-Nov-2011 Alexey Dobriyan <adobriyan@gmail.com>

net: remove ipv6_addr_copy()

C assignment can handle struct in6_addr copying.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a7ae1992 17-Nov-2011 Herbert Xu <herbert@gondor.apana.org.au>

ipv6: Remove all uses of LL_ALLOCATED_SPACE

ipv6: Remove all uses of LL_ALLOCATED_SPACE

The macro LL_ALLOCATED_SPACE was ill-conceived. It applies the
alignment to the sum of needed_headroom and needed_tailroom. As
the amount that is then reserved for head room is needed_headroom
with alignment, this means that the tail room left may be too small.

This patch replaces all uses of LL_ALLOCATED_SPACE in net/ipv6
with the macro LL_RESERVED_SPACE and direct reference to
needed_tailroom.

This also fixes the problem with needed_headroom changing between
allocating the skb and reserving the head room.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e05c4ad3 23-Aug-2011 Yan, Zheng <zheng.z.yan@intel.com>

mcast: Fix source address selection for multicast listener report

Should check use count of include mode filter instead of total number
of include mode filters.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e3cbf28f 17-Mar-2011 Lai Jiangshan <laijs@cn.fujitsu.com>

net,rcu: convert call_rcu(ipv6_mc_socklist_reclaim) to kfree_rcu()

The rcu callback ipv6_mc_socklist_reclaim() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(ipv6_mc_socklist_reclaim).

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# b71d1d42 21-Apr-2011 Eric Dumazet <eric.dumazet@gmail.com>

inet: constify ip headers and in6_addr

Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers
where possible, to make code intention more obvious.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4c9483b2 12-Mar-2011 David S. Miller <davem@davemloft.net>

ipv6: Convert to use flowi6 where applicable.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 4b66fef9 04-Mar-2011 Hagen Paul Pfeifer <hagen@jauu.net>

mcast: net_device dev not used

ip6_mc_source(), ip6_mc_msfilter() as well as ip6_mc_msfget() declare
and assign dev but do not use the variable afterwards.

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 452edd59 02-Mar-2011 David S. Miller <davem@davemloft.net>

xfrm: Return dst directly from xfrm_lookup()

Instead of on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 456b61bc 23-Nov-2010 Eric Dumazet <eric.dumazet@gmail.com>

ipv6: mcast: RCU conversion

ipv6_sk_mc_lock rwlock becomes a spinlock.

readers (inet6_mc_check()) now takes rcu_read_lock() instead of read
lock. Writers dont need to disable BH anymore.

struct ipv6_mc_socklist objects are reclaimed after one RCU grace
period.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8a22c99a 14-Nov-2010 Joe Perches <joe@perches.com>

net/ipv6/mcast.c: Remove unnecessary semicolons

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d8d1f30b 11-Jun-2010 Changli Gao <xiaosuo@gmail.com>

net-next: remove useless union keyword

remove useless union keyword in rtable, rt6_info and dn_route.

Since there is only one member in a union, the union keyword isn't useful.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 96b52e61 07-Jun-2010 Eric Dumazet <eric.dumazet@gmail.com>

ipv6: mcast: RCU conversions

- ipv6_sock_mc_join() : doesnt touch dev refcount

- ipv6_sock_mc_drop() : doesnt touch dev/idev refcounts

- ip6_mc_find_dev() becomes ip6_mc_find_dev_rcu() (called from rcu),
and doesnt touch dev/idev refcounts

- ipv6_sock_mc_close() : doesnt touch dev/idev refcounts

- ip6_mc_source() uses ip6_mc_find_dev_rcu()

- ip6_mc_msfilter() uses ip6_mc_find_dev_rcu()

- ip6_mc_msfget() uses ip6_mc_find_dev_rcu()

- ipv6_dev_mc_dec(), ipv6_chk_mcast_addr(),
igmp6_event_query(), igmp6_event_report(),
mld_sendpack(), igmp6_send() dont touch idev refcount

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 72e09ad1 05-Jun-2010 Eric Dumazet <eric.dumazet@gmail.com>

ipv6: avoid high order allocations

With mtu=9000, mld_newpack() use order-2 GFP_ATOMIC allocations, that
are very unreliable, on machines where PAGE_SIZE=4K

Limit allocated skbs to be at most one page. (order-0 allocations)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5d55354f 31-May-2010 Joe Perches <joe@perches.com>

net/ipv6/mcast.c: Remove unnecessary kmalloc casts

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6e7cb837 17-Apr-2010 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

ipv6 mcast: Introduce include/net/mld.h for MLD definitions.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# 22bedad3 01-Apr-2010 Jiri Pirko <jpirko@redhat.com>

net: convert multicast list to list_head

Converts the list and the core manipulating with it to be the same as uc_list.

+uses two functions for adding/removing mc address (normal and "global"
variant) instead of a function parameter.
+removes dev_mcast.c completely.
+exposes netdev_hw_addr_list_* macros along with __hw_addr_* functions for
manipulation with lists on a sandbox (used in bonding and 80211 drivers)

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# b2e0b385 22-Mar-2010 Jan Engelhardt <jengelh@medozas.de>

netfilter: ipv6: use NFPROTO values for NF_HOOK invocation

The semantic patch that was used:
// <smpl>
@@
@@
(NF_HOOK
|NF_HOOK_THRESH
|nf_hook
)(
-PF_INET6,
+NFPROTO_IPV6,
...)
// </smpl>

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>


# 6457d26b 17-Feb-2010 Stephen Hemminger <shemminger@vyatta.com>

IPv6: convert mc_lock to spinlock

Only used for writing, so convert to spinlock

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2c8c1e72 16-Jan-2010 Alexey Dobriyan <adobriyan@gmail.com>

net: spread __net_init, __net_exit

__net_init/__net_exit are apparently not going away, so use them
to full extent.

In some cases __net_init was removed, because it was called from
__net_exit code.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ce81b76a 11-Nov-2009 Eric Dumazet <eric.dumazet@gmail.com>

ipv6: use RCU to walk list of network devices

No longer need read_lock(&dev_base_lock), use RCU instead.
We also can avoid taking references on inet6_dev structs.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 75c78500 15-Sep-2009 Moni Shoua <monis@voltaire.com>

bonding: remap muticast addresses without using dev_close() and dev_open()

This patch fixes commit e36b9d16c6a6d0f59803b3ef04ff3c22c3844c10. The approach
there is to call dev_close()/dev_open() whenever the device type is changed in
order to remap the device IP multicast addresses to HW multicast addresses.
This approach suffers from 2 drawbacks:

*. It assumes tha the device is UP when calling dev_close(), or otherwise
dev_close() has no affect. It is worth to mention that initscripts (Redhat)
and sysconfig (Suse) doesn't act the same in this matter.
*. dev_close() has other side affects, like deleting entries from the routing
table, which might be unnecessary.

The fix here is to directly remap the IP multicast addresses to HW multicast
addresses for a bonding device that changes its type, and nothing else.

Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Moni Shoua <monis@voltaire.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3c2b8d18 21-Jul-2009 Gerrit Renker <gerrit@erg.abdn.ac.uk>

mcastv6: Local variable shadows function argument

The local variable 'idev' shadows the function argument 'idev' to
ip6_mc_add_src(). Fixed by removing the local declaration, as pmc->idev
should be identical with 'idev' passed as argument.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# adf30907 01-Jun-2009 Eric Dumazet <eric.dumazet@gmail.com>

net: skb->dst accessors

Define three accessors to get/set dst attached to a skb

struct dst_entry *skb_dst(const struct sk_buff *skb)

void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

void skb_dst_drop(struct sk_buff *skb)
This one should replace occurrences of :
dst_release(skb->dst)
skb->dst = NULL;

Delete skb->dst field

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# edf391ff 27-Apr-2009 Neil Horman <nhorman@tuxdriver.com>

snmp: add missing counters for RFC 4293

The IP MIB (RFC 4293) defines stats for InOctets, OutOctets, InMcastOctets and
OutMcastOctets:
http://tools.ietf.org/html/rfc4293
But it seems we don't track those in any way that easy to separate from other
protocols. This patch adds those missing counters to the stats file. Tested
successfully by me

With help from Eric Dumazet.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 448eb71f 15-Dec-2008 Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

ipv6/mcast: join error paths using goto

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 52479b62 25-Nov-2008 Alexey Dobriyan <adobriyan@gmail.com>

netns xfrm: lookup in netns

Pass netns to xfrm_lookup()/__xfrm_lookup(). For that pass netns
to flow_cache_lookup() and resolver callback.

Take it from socket or netdevice. Stub DECnet to init_net.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 07f0757a 19-Nov-2008 Joe Perches <joe@perches.com>

include/net net/ - csum_partial - remove unnecessary casts

The first argument to csum_partial is const void *
casts to char/u8 * are not necessary

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4b7a4274 29-Oct-2008 Harvey Harrison <harvey.harrison@gmail.com>

net: replace %#p6 format specifier with %pi6

gcc warns when using the # modifier with the %p format specifier,
so we can't use this to omit the colons when needed, introduces
%pi6 instead.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b071195d 28-Oct-2008 Harvey Harrison <harvey.harrison@gmail.com>

net: replace all current users of NIP6_SEQFMT with %#p6

The define in kernel.h can be done away with at a later time.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5a57d4c7 08-Oct-2008 Denis V. Lunev <den@openvz.org>

ipv6: added net argument to ICMP6MSGOUT_INC_STATS_BH

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5c5d244b 08-Oct-2008 Denis V. Lunev <den@openvz.org>

ipv6: added net argument to ICMP6MSGOUT_INC_STATS

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e41b5368 08-Oct-2008 Denis V. Lunev <den@openvz.org>

ipv6: added net argument to ICMP6_INC_STATS_BH

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a862f6a6 08-Oct-2008 Denis V. Lunev <den@openvz.org>

ipv6: added net argument to ICMP6_INC_STATS

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 483a47d2 08-Oct-2008 Denis V. Lunev <den@openvz.org>

ipv6: added net argument to IP6_INC_STATS_BH

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3bd653c8 08-Oct-2008 Denis V. Lunev <den@openvz.org>

netns: add net parameter to IP6_INC_STATS

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a6ffb404 19-Jul-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

ipv6 mcast: Omit redundant address family checks in ip6_mc_source().

The caller has alredy checked for them.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 53b7997f 19-Jul-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

ipv6 netns: Make several "global" sysctl variables namespace aware.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0b040829 10-Jun-2008 Adrian Bunk <bunk@kernel.org>

net: remove CVS keywords

This patch removes CVS keywords that weren't updated for a long time
from comments.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9cba632e 23-Apr-2008 Rami Rosen <ramirose@gmail.com>

ipv6 mcast: Remove unused macro (MLDV2_QQIC) from mcast.c.

This patch removes MLDV2_QQIC macro from mcast.c
as it is unused.

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# f5184d26 12-May-2008 Johannes Berg <johannes@sipsolutions.net>

net: Allow netdevices to specify needed head/tailroom

This patch adds needed_headroom/needed_tailroom members to struct
net_device and updates many places that allocate sbks to use them. Not
all of them can be converted though, and I'm sure I missed some (I
mostly grepped for LL_RESERVED_SPACE)

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d7aabf22 10-Apr-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6]: Use in6addr_any where appropriate.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# f3ee4010 10-Apr-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6]: Define constants for link-local multicast addresses.

- Define link-local all-node / all-router multicast addresses.
- Remove ipv6_addr_all_nodes() and ipv6_addr_all_routers().

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# 9acd9f3a 10-Apr-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6]: Make address arguments const.

- net/ipv6/addrconf.c:
ipv6_get_ifaddr(), ipv6_dev_get_saddr()
- net/ipv6/mcast.c:
ipv6_sock_mc_join(), ipv6_sock_mc_drop(),
inet6_mc_check(),
ipv6_dev_mc_inc(), __ipv6_dev_mc_dec(), ipv6_dev_mc_dec(),
ipv6_chk_mcast_addr()
- net/ipv6/route.c:
rt6_lookup(), icmp6_dst_alloc()
- net/ipv6/ip6_output.c:
ip6_nd_hdr()
- net/ipv6/ndisc.c:
ndisc_send_ns(), ndisc_send_rs(), ndisc_send_redirect(),
ndisc_get_neigh(), __ndisc_send()

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# 1ed8516f 03-Apr-2008 Denis V. Lunev <den@openvz.org>

[IPV6]: Simplify IPv6 control sockets creation.

Do this by replacing sock_create_kern with inet_ctl_sock_create.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1218854a 25-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[NET] NETNS: Omit seq_net_private->net without CONFIG_NET_NS.

Without CONFIG_NET_NS, no namespace other than &init_net exists,
no need to store net in seq_net_private.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# 3b1e0a65 25-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[NET] NETNS: Omit sock->sk_net without CONFIG_NET_NS.

Introduce per-sock inlines: sock_net(), sock_net_set()
and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# c346dca1 25-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[NET] NETNS: Omit net_device->nd_net without CONFIG_NET_NS.

Introduce per-net_device inlines: dev_net(), dev_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# ea82edf7 21-Mar-2008 Daniel Lezcano <dlezcano@fr.ibm.com>

[NETNS][IPV6] mcast - fix compilation warning when procfs is not compiled in

When CONFIG_PROC_FS=no, the out_sock_create label is not used because
the code using it is disabled and that leads to a warning at compile
time.

This patch fix that by making a specific function to initialize proc
for igmp6, and remove the annoying CONFIG_PROC_FS sections in
init/exit function.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b8ad0cbc 07-Mar-2008 Daniel Lezcano <dlezcano@fr.ibm.com>

[NETNS][IPV6] mcast - handle several network namespace

This patch make use of the network namespace information at the right
places to handle the multicast for several network namespaces. It
makes the socket control to be per namespace too.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 606a2b48 04-Mar-2008 Daniel Lezcano <dlezcano@fr.ibm.com>

[NETNS][IPV6] route6 - Pass the network namespace parameter to rt6_lookup

Add a network namespace parameter to rt6_lookup().

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 41927178 06-Dec-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6] MCAST: Use standard path for sending MLD/MLDv2 messages.

This is changing the paths for sending MLD/MLDv2 messages
from dev_queue_xmit() to standard dst_output().

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# 9b0f976f 29-Feb-2008 Denis V. Lunev <den@openvz.org>

[INET]: Remove struct net_proto_family* from _init calls.

struct net_proto_family* is not used in icmp[v6]_init, ndisc_init,
igmp_init and tcp_v4_init. Remove it.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9a429c49 01-Jan-2008 Eric Dumazet <dada1@cosmosbay.com>

[NET]: Add some acquires/releases sparse annotations.

Add __acquires() and __releases() annotations to suppress some sparse
warnings.

example of warnings :

net/ipv4/udp.c:1555:14: warning: context imbalance in 'udp_seq_start' - wrong
count at exit
net/ipv4/udp.c:1571:13: warning: context imbalance in 'udp_seq_stop' -
unexpected unlock

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6e23ae2a 19-Nov-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: Introduce NF_INET_ hook values

The IPv4 and IPv6 hook values are identical, yet some code tries to figure
out the "correct" value by looking at the address family. Introduce NF_INET_*
values for both IPv4 and IPv6. The old values are kept in a #ifndef __KERNEL__
section for userspace compatibility.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b24b8a24 23-Jan-2008 Pavel Emelyanov <xemul@openvz.org>

[NET]: Convert init_timer into setup_timer

Many-many code in the kernel initialized the timer->function
and timer->data together with calling init_timer(timer). There
is already a helper for this. Use it for networking code.

The patch is HUGE, but makes the code 130 lines shorter
(98 insertions(+), 228 deletions(-)).

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cf7732e4 10-Oct-2007 Pavel Emelyanov <xemul@openvz.org>

[NET]: Make core networking code use seq_open_private

This concerns the ipv4 and ipv6 code mostly, but also the netlink
and unix sockets.

The netlink code is an example of how to use the __seq_open_private()
call - it saves the net namespace on this private.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cfcabdcc 09-Oct-2007 Stephen Hemminger <shemminger@linux-foundation.org>

[NET]: sparse warning fixes

Fix a bunch of sparse warnings. Mostly about 0 used as
NULL pointer, and shadowed variable declarations.
One notable case was that hash size should have been unsigned.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0c4e8581 09-Oct-2007 Stephen Hemminger <shemminger@linux-foundation.org>

[NET]: Wrap netdevice hardware header creation.

Add inline for common usage of hardware header creation, and
fix bug in IPV6 mcast where the assumption about negative return is
an errno. Negative return from hard_header means not enough space
was available,(ie -N bytes).

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 14878f75 16-Sep-2007 David L Stevens <dlstevens@us.ibm.com>

[IPV6]: Add ICMPMsgStats MIB (RFC 4293) [rev 2]

Background: RFC 4293 deprecates existing individual, named ICMP
type counters to be replaced with the ICMPMsgStatsTable. This table
includes entries for both IPv4 and IPv6, and requires counting of all
ICMP types, whether or not the machine implements the type.

These patches "remove" (but not really) the existing counters, and
replace them with the ICMPMsgStats tables for v4 and v6.
It includes the named counters in the /proc places they were, but gets the
values for them from the new tables. It also counts packets generated
from raw socket output (e.g., OutEchoes, MLD queries, RA's from
radvd, etc).

Changes:
1) create icmpmsg_statistics mib
2) create icmpv6msg_statistics mib
3) modify existing counters to use these
4) modify /proc/net/snmp to add "IcmpMsg" with all ICMP types
listed by number for easy SNMP parsing
5) modify /proc/net/snmp printing for "Icmp" to get the named data
from new counters.
[new to 2nd revision]
6) support per-interface ICMP stats
7) use common macro for per-device stat macros

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 881d966b 17-Sep-2007 Eric W. Biederman <ebiederm@xmission.com>

[NET]: Make the device list and device lookups per namespace.

This patch makes most of the generic device layer network
namespace safe. This patch makes dev_base_head a
network namespace variable, and then it picks up
a few associated variables. The functions:
dev_getbyhwaddr
dev_getfirsthwbytype
dev_get_by_flags
dev_get_by_name
__dev_get_by_name
dev_get_by_index
__dev_get_by_index
dev_ioctl
dev_ethtool
dev_load
wireless_process_ioctl

were modified to take a network namespace argument, and
deal with it.

vlan_ioctl_set and brioctl_set were modified so their
hooks will receive a network namespace argument.

So basically anthing in the core of the network stack that was
affected to by the change of dev_base was modified to handle
multiple network namespaces. The rest of the network stack was
simply modified to explicitly use &init_net the initial network
namespace. This can be fixed when those components of the network
stack are modified to handle multiple network namespaces.

For now the ifindex generator is left global.

Fundametally ifindex numbers are per namespace, or else
we will have corner case problems with migration when
we get that far.

At the same time there are assumptions in the network stack
that the ifindex of a network device won't change. Making
the ifindex number global seems a good compromise until
the network stack can cope with ifindex changes when
you change namespaces, and the like.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 457c4cbc 11-Sep-2007 Eric W. Biederman <ebiederm@xmission.com>

[NET]: Make /proc/net per network namespace

This patch makes /proc/net per network namespace. It modifies the global
variables proc_net and proc_net_stat to be per network namespace.
The proc_net file helpers are modified to take a network namespace argument,
and all of their callers are fixed to pass &init_net for that argument.
This ensures that all of the /proc/net files are only visible and
usable in the initial network namespace until the code behind them
has been updated to be handle multiple network namespaces.

Making /proc/net per namespace is necessary as at least some files
in /proc/net depend upon the set of network devices which is per
network namespace, and even more files in /proc/net have contents
that are relevant to a single network namespace.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 56b3d975 11-Jul-2007 Philippe De Muyter <phdm@macqel.be>

[NET]: Make all initialized struct seq_operations const.

Make all initialized struct seq_operations in net/ const

Signed-off-by: Philippe De Muyter <phdm@macqel.be>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7562f876 03-May-2007 Pavel Emelianov <xemul@openvz.org>

[NET]: Rework dev_base via list_head (v3)

Cleanup of dev_base list use, with the aim to simplify making device
list per-namespace. In almost every occasion, use of dev_base variable
and dev->next pointer could be easily replaced by for_each_netdev
loop. A few most complicated places were converted to using
first_netdev()/next_netdev().

Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Acked-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 27a884dc 19-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Convert skb->tail to sk_buff_data_t

So that it is also an offset from skb->head, reduces its size from 8 to 4 bytes
on 64bit architectures, allowing us to combine the 4 bytes hole left by the
layer headers conversion, reducing struct sk_buff size to 256 bytes, i.e. 4
64byte cachelines, and since the sk_buff slab cache is SLAB_HWCACHE_ALIGN...
:-)

Many calculations that previously required that skb->{transport,network,
mac}_header be first converted to a pointer now can be done directly, being
meaningful as offsets or pointers.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cfe1fc77 16-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce skb_network_header_len

For the common sequence "skb->h.raw - skb->nh.raw", similar to skb->mac_len,
that is precalculated tho, don't think we need to bloat skb with one more
member, so just use this new helper, reducing the number of non-skbuff.h
references to the layer headers even more.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d10ba34b 14-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: More skb_put related skb_reset_transport_header

This time we have to set it to skb->tail that is not anymore equal to
skb->data, so we either add a new helper or just add the skb->tail - skb->data
offset, for now do the later.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9c70220b 25-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce skb_transport_header(skb)

For the places where we need a pointer to the transport header, it is
still legal to touch skb->h.raw directly if just adding to,
subtracting from or setting it to another layer header.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cc70ab26 13-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[ICMP6]: Introduce icmp6_hdr()

For consistency with all the other skb->h.raw accessors.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0660e03f 25-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce ipv6_hdr(), remove skb->nh.ipv6h

Now the skb->nh union has just one member, .raw, i.e. it is just like the
skb->mac union, strange, no? I'm just leaving it like that till the transport
layer is done with, when we'll rename skb->mac.raw to skb->mac_header (or
->mac_header_offset?), ditto for ->{h,nh}.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 95c385b4 25-Apr-2007 Neil Horman <nhorman@tuxdriver.com>

[IPV6] ADDRCONF: Optimistic Duplicate Address Detection (RFC 4429) Support.

Nominally an autoconfigured IPv6 address is added to an interface in the
Tentative state (as per RFC 2462). Addresses in this state remain in this
state while the Duplicate Address Detection process operates on them to
determine their uniqueness on the network. During this period, these
tentative addresses may not be used for communication, increasing the time
before a node may be able to communicate on a network. Using Optimistic
Duplicate Address Detection, autoconfigured addresses may be used
immediately for communication on the network, as long as certain rules are
followed to avoid conflicts with other nodes during the Duplicate Address
Detection process.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9a32144e 12-Feb-2007 Arjan van de Ven <arjan@linux.intel.com>

[PATCH] mark struct file_operations const 7

Many struct file_operations in the kernel can be "const". Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data. In addition it'll catch accidental writes at compile time to
these shared resources.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1ab1457c 09-Feb-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[NET] IPV6: Fix whitespace errors.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cc63f70b 06-Feb-2007 Alexey Dobriyan <adobriyan@openvz.org>

[IPV4/IPV6] multicast: Check add_grhead() return value

add_grhead() allocates memory with GFP_ATOMIC and in at least two places skb
from it passed to skb_put() without checking.

Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d88ae4cc 14-Jan-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6] MCAST: Fix joining all-node multicast group on device initialization.

Join all-node multicast group after assignment of dev->ip6_ptr
because it must be assigned when ipv6_dev_mc_inc() is called.
This fixes Bug#7817, reported by <gernoth@informatik.uni-erlangen.de>.

Closes: 7817
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 868c86bc 14-Nov-2006 Al Viro <viro@zeniv.linux.org.uk>

[NET]: annotate csum_ipv6_magic() callers in net/*

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a11d206d 04-Nov-2006 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6]: Per-interface statistics support.

For IP MIB (RFC4293).

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>


# 8a74ff77 08-Nov-2006 Al Viro <viro@zeniv.linux.org.uk>

[IPV6]: annotate ipv6 mcast

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ab32ea5d 22-Sep-2006 Brian Haley <brian.haley@hp.com>

[NET/IPV4/IPV6]: Change some sysctl variables to __read_mostly

Change net/core, ipv4 and ipv6 sysctl variables to __read_mostly.

Couldn't actually measure any performance increase while testing (.3%
I consider noise), but seems like the right thing to do.

Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# acd6e00b 17-Aug-2006 David L Stevens <dlstevens@us.ibm.com>

[MCAST]: Fix filter leak on device removal.

This fixes source filter leakage when a device is removed and a
process leaves the group thereafter.

This also includes corresponding fixes for IPv6 multicast source
filters on device removal.

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6ab3d562 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de>

Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# 0c600eda 21-Mar-2006 Ingo Oeser <ioe-lkml@rameria.de>

[IPV6]: Nearly complete kzalloc cleanup for net/ipv6

Stupidly use kzalloc() instead of kmalloc()/memset()
everywhere where this is possible in net/ipv6/*.c .

Signed-off-by: Ingo Oeser <ioe-lkml@rameria.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e80e28b6 03-Feb-2006 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] net/ipv6/mcast.c NULL noise removal

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7add2a43 24-Jan-2006 David L Stevens <dlstevens@us.ibm.com>

[IPV6] MLDv2: fix change records when transitioning to/from inactive

The following patch fixes these problems in MLDv2:

1) Add/remove "delete" records for sending change reports when
addition of a filter results in that filter transitioning to/from
inactive. [same as recent IPv4 IGMPv3 fix]
2) Remove 2 redundant "group_type" checks (can't be IPV6_ADDR_ANY
within that loop, so checks are always true)
3) change an is_in() "return 0" to "return type == MLD2_MODE_IS_INCLUDE".
It should always be "0" to get here, but it improves code locality
to not assume it, and if some race allowed otherwise, doing
the check would return the correct result.

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9343e79a 17-Jan-2006 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6]: Preserve procfs IPV6 address output format

Procfs always output IPV6 addresses without the colon
characters, and we cannot change that.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 46b86a2d 13-Jan-2006 Joe Perches <joe@perches.com>

[NET]: Use NIP6_FMT in kernel.h

There are errors and inconsistency in the display of NIP6 strings.
ie: net/ipv6/ip6_flowlabel.c

There are errors and inconsistency in the display of NIPQUAD strings too.
ie: net/netfilter/nf_conntrack_ftp.c

This patch:
adds NIP6_FMT to kernel.h
changes all code to use NIP6_FMT
fixes net/ipv6/ip6_flowlabel.c
adds NIPQUAD_FMT to kernel.h
fixes net/netfilter/nf_conntrack_ftp.c
changes a few uses of "%u.%u.%u.%u" to NIPQUAD_FMT for symmetry to NIP6_FMT

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8b3a7005 11-Jan-2006 Kris Katterjohn <kjak@users.sourceforge.net>

[NET]: Remove more unneeded typecasts on *malloc()

This removes more unneeded casts on the return value for kmalloc(),
sock_kmalloc(), and vmalloc().

Signed-off-by: Kris Katterjohn <kjak@users.sourceforge.net>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 196433c5 04-Jan-2006 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[IPV6]: Use macro for rwlock_t initialization.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5ab4a6c8 27-Dec-2005 David L Stevens <dlstevens@us.ibm.com>

[IPV6] mcast: Fix multiple issues in MLDv2 reports.

The below "jumbo" patch fixes the following problems in MLDv2.

1) Add necessary "ntohs" to recent "pskb_may_pull" check [breaks
all nonzero source queries on little-endian (!)]

2) Add locking to source filter list [resend of prior patch]

3) fix "mld_marksources()" to
a) send nothing when all queried sources are excluded
b) send full exclude report when source queried sources are
not excluded
c) don't schedule a timer when there's nothing to report

NOTE: RFC 3810 specifies the source list should be saved and each
source reported individually as an IS_IN. This is an obvious DOS
path, requiring the host to store and then multicast as many sources
as are queried (e.g., millions...). This alternative sends a full,
relevant report that's limited to number of sources present on the
machine.

4) fix "add_grec()" to send empty-source records when it should
The original check doesn't account for a non-empty source
list with all sources inactive; the new code keeps that
short-circuit case, and also generates the group header
with an empty list if needed.

5) fix mca_crcount decrement to be after add_grec(), which needs
its original value

These issues (other than item #1 ;-) ) were all found by Yan Zheng,
much thanks!

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6f4353d8 26-Dec-2005 David L Stevens <dlstevens@us.ibm.com>

[IPV6]: Increase default MLD_MAX_MSF to 64.

The existing default of 10 is just way too low.

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 24c69275 02-Dec-2005 David Stevens <dlstevens@us.ibm.com>

[IGMP]: workaround for IGMP v1/v2 bug

From: David Stevens <dlstevens@us.ibm.com>

As explained at:

http://www.cs.ucsb.edu/~krishna/igmp_dos/

With IGMP version 1 and 2 it is possible to inject a unicast
report to a client which will make it ignore multicast
reports sent later by the router.

The fix is to only accept the report if is was sent to a
multicast or unicast address.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 8713dbf0 27-Oct-2005 Yan Zheng <yanzheng@21cn.com>

[MCAST]: ip[6]_mc_add_src should be called when number of sources is zero

And filter mode is exclude.

Further explanation by David Stevens:

Multicast source filters aren't widely used yet, and that's really the only
feature that's affected if an application actually exercises this bug, as far
as I can tell. An ordinary filter-less multicast join should still work, and
only forwarded multicast traffic making use of filters and doing empty-source
filters with the MSFILTER ioctl would be at risk of not getting multicast
traffic forwarded to them because the reports generated would not be based on
the correct counts.

Signed-off-by: Yan Zheng <yanzheng@21cn.com
Acked-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# 97300b5f 31-Oct-2005 Yan Zheng <yanzheng@21cn.com>

[MCAST] IPv6: Check packet size when process Multicast

Signed-off-by: Yan Zheng <yanzheng@21cn.com
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# f12baeab 28-Oct-2005 Yan Zheng <yanzheng@21cn.com>

[MCAST] IPv6: Fix algorithm to compute Querier's Query Interval

5.1.3. Maximum Response Code

The Maximum Response Code field specifies the maximum time allowed
before sending a responding Report. The actual time allowed, called
the Maximum Response Delay, is represented in units of milliseconds,
and is derived from the Maximum Response Code as follows:

If Maximum Response Code < 32768,
Maximum Response Delay = Maximum Response Code

If Maximum Response Code >=32768, Maximum Response Code represents a
floating-point value as follows:

0 1 2 3 4 5 6 7 8 9 A B C D E F
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|1| exp | mant |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Maximum Response Delay = (mant | 0x1000) << (exp+3)


5.1.9. QQIC (Querier's Query Interval Code)

The Querier's Query Interval Code field specifies the [Query
Interval] used by the Querier. The actual interval, called the
Querier's Query Interval (QQI), is represented in units of seconds,
and is derived from the Querier's Query Interval Code as follows:

If QQIC < 128, QQI = QQIC

If QQIC >= 128, QQIC represents a floating-point value as follows:

0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
|1| exp | mant |
+-+-+-+-+-+-+-+-+

QQI = (mant | 0x10) << (exp + 3)

-- rfc3810

#define MLDV2_QQIC(value) MLDV2_EXP(0x80, 4, 3, value)
#define MLDV2_MRC(value) MLDV2_EXP(0x8000, 12, 3, value)

Above macro are defined in mcast.c. but 1 << 4 == 0x10 and 1 << 12 == 0x1000.
So the result computed by original Macro is larger.

Signed-off-by: Yan Zheng <yanzheng@21cn.com>
Acked-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>


# fab10fe3 05-Oct-2005 Yan Zheng <yanzheng@21cn.com>

[MCAST] ipv6: Fix address size in grec_size

Signed-Off-By: Yan Zheng <yanzheng@21cn.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# de9daad9 14-Sep-2005 Denis Lukianov <denis@voxelsoft.com>

[MCAST]: Fix MCAST_EXCLUDE line dupes

This patch fixes line dupes at /ipv4/igmp.c and /ipv6/mcast.c in the
2.6 kernel, where MCAST_EXCLUDE is mistakenly used instead of
MCAST_INCLUDE.

Signed-off-by: Denis Lukianov <denis@voxelsoft.com>
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9c05989b 08-Jul-2005 David S. Miller <davem@davemloft.net>

[IPV6]: Fix warning in ip6_mc_msfilter.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 9951f036 08-Jul-2005 David L Stevens <dlstevens@us.ibm.com>

[IPV4]: (INCLUDE,empty)/leave-group equivalence for full-state MSF APIs & errno fix

1) Adds (INCLUDE, empty)/leave-group equivalence to the full-state
multicast source filter APIs (IPv4 and IPv6)

2) Fixes an incorrect errno in the IPv6 leave-group (ENOENT should be
EADDRNOTAVAIL)

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 917f2f10 08-Jul-2005 David L Stevens <dlstevens@us.ibm.com>

[IPV4]: multicast API "join" issues

1) In the full-state API when imsf_numsrc == 0
errno should be "0", but returns EADDRNOTAVAIL

2) An illegal filter mode change
errno should be EINVAL, but returns EADDRNOTAVAIL

3) Trying to do an any-source option without IP_ADD_MEMBERSHIP
errno should be EINVAL, but returns EADDRNOTAVAIL

4) Adds comments for the less obvious error return values

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e9823185 21-Jun-2005 Patrick McHardy <kaber@trash.net>

[NETFILTER]: Restore netfilter assumptions in IPv6 multicast

Netfilter assumes that skb->data == skb->nh.ipv6h

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c9e3e8b6 21-Jun-2005 David L Stevens <dlstevens@us.ibm.com>

[IPV6]: multicast join and misc

Here is a simplified version of the patch to fix a bug in IPv6
multicasting. It:

1) adds existence check & EADDRINUSE error for regular joins
2) adds an exception for EADDRINUSE in the source-specific multicast
join (where a prior join is ok)
3) adds a missing/needed read_lock on sock_mc_list; would've raced
with destroying the socket on interface down without
4) adds a "leave group" in the (INCLUDE, empty) source filter case.
This frees unneeded socket buffer memory, but also prevents
an inappropriate interaction among the 8 socket options that
mess with this. Some would fail as if in the group when you
aren't really.

Item #4 had a locking bug in the last version of this patch; rather than
removing the idev->lock read lock only, I've simplified it to remove
all lock state in the path and treat it as a direct "leave group" call for
the (INCLUDE,empty) case it covers. Tested on an MP machine. :-)

Much thanks to HoerdtMickael <hoerdt@clarinet.u-strasbg.fr> who
reported the original bug.

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1da177e4 16-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org>

Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!