History log of /linux-master/drivers/net/ipvlan/ipvlan_main.c
Revision Date Author Comments
# e353ea9c 22-Feb-2024 Eric Dumazet <edumazet@google.com>

rtnetlink: prepare nla_put_iflink() to run under RCU

We want to be able to run rtnl_fill_ifinfo() under RCU protection
instead of RTNL in the future.

This patch prepares dev_get_iflink() and nla_put_iflink()
to run either with RTNL or RCU held.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3ee29a44 05-Jan-2024 Christophe JAILLET <christophe.jaillet@wanadoo.fr>

ipvlan: Remove usage of the deprecated ida_simple_xx() API

ida_alloc() and ida_free() should be preferred to the deprecated
ida_simple_get() and ida_simple_remove().

This is less verbose.

Note that the upper bound of ida_alloc_range() is inclusive while the one
of ida_simple_get() was exclusive. So calls have been updated accordingly.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e900274f 05-Jan-2024 Christophe JAILLET <christophe.jaillet@wanadoo.fr>

ipvlan: Fix a typo in a comment

s/diffentiate/differentiate/

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fb70136d 02-Dec-2023 Zhengchao Shao <shaozhengchao@huawei.com>

ipvlan: implement .parse_protocol hook function in ipvlan_header_ops

The .parse_protocol hook function in the ipvlan_header_ops structure is
not implemented. As a result, when the AF_PACKET family is used to send
packets, skb->protocol will be set to 0.
Ipvlan is a device of type ARPHRD_ETHER (ether_setup). Therefore, use
eth_header_parse_protocol function to obtain the protocol.

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20231202130438.2266343-1-shaozhengchao@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# ff672b9f 26-Oct-2023 Eric Dumazet <edumazet@google.com>

ipvlan: properly track tx_errors

Both ipvlan_process_v4_outbound() and ipvlan_process_v6_outbound()
increment dev->stats.tx_errors in case of errors.

Unfortunately there are two issues :

1) ipvlan_get_stats64() does not propagate dev->stats.tx_errors to user.

2) Increments are not atomic. KCSAN would complain eventually.

Use DEV_STATS_INC() to not miss an update, and change ipvlan_get_stats64()
to copy the value back to user.

Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Link: https://lore.kernel.org/r/20231026131446.3933175-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 043d5f68 17-Aug-2023 Lu Wei <luwei32@huawei.com>

ipvlan: Fix a reference count leak warning in ipvlan_ns_exit()

There are two network devices(veth1 and veth3) in ns1, and ipvlan1 with
L3S mode and ipvlan2 with L2 mode are created based on them as
figure (1). In this case, ipvlan_register_nf_hook() will be called to
register nf hook which is needed by ipvlans in L3S mode in ns1 and value
of ipvl_nf_hook_refcnt is set to 1.

(1)
ns1 ns2
------------ ------------

veth1--ipvlan1 (L3S)

veth3--ipvlan2 (L2)

(2)
ns1 ns2
------------ ------------

veth1--ipvlan1 (L3S)

ipvlan2 (L2) veth3
| |
|------->-------->--------->--------
migrate

When veth3 migrates from ns1 to ns2 as figure (2), veth3 will register in
ns2 and calls call_netdevice_notifiers with NETDEV_REGISTER event:

dev_change_net_namespace
call_netdevice_notifiers
ipvlan_device_event
ipvlan_migrate_l3s_hook
ipvlan_register_nf_hook(newnet) (I)
ipvlan_unregister_nf_hook(oldnet) (II)

In function ipvlan_migrate_l3s_hook(), ipvl_nf_hook_refcnt in ns1 is not 0
since veth1 with ipvlan1 still in ns1, (I) and (II) will be called to
register nf_hook in ns2 and unregister nf_hook in ns1. As a result,
ipvl_nf_hook_refcnt in ns1 is decreased incorrectly and this in ns2
is increased incorrectly. When the second net namespace is removed, a
reference count leak warning in ipvlan_ns_exit() will be triggered.

This patch add a check before ipvlan_migrate_l3s_hook() is called. The
warning can be triggered as follows:

$ ip netns add ns1
$ ip netns add ns2
$ ip netns exec ns1 ip link add veth1 type veth peer name veth2
$ ip netns exec ns1 ip link add veth3 type veth peer name veth4
$ ip netns exec ns1 ip link add ipv1 link veth1 type ipvlan mode l3s
$ ip netns exec ns1 ip link add ipv2 link veth3 type ipvlan mode l2
$ ip netns exec ns1 ip link set veth3 netns ns2
$ ip net del ns2

Fixes: 3133822f5ac1 ("ipvlan: use pernet operations and restrict l3s hooks to master netns")
Signed-off-by: Lu Wei <luwei32@huawei.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230817145449.141827-1-luwei32@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 068c38ad 26-Oct-2022 Thomas Gleixner <tglx@linutronix.de>

net: Remove the obsolte u64_stats_fetch_*_irq() users (drivers).

Now that the 32bit UP oddity is gone and 32bit uses always a sequence
count, there is no need for the fetch_irq() variants anymore.

Convert to the regular interface.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 40b9d1ab 15-Nov-2022 Mahesh Bandewar <maheshb@google.com>

ipvlan: hold lower dev to avoid possible use-after-free

Recently syzkaller discovered the issue of disappearing lower
device (NETDEV_UNREGISTER) while the virtual device (like
macvlan) is still having it as a lower device. So it's just
a matter of time similar discovery will be made for IPvlan
device setup. So fixing it preemptively. Also while at it,
add a refcount tracker.

Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.")
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fb3ceec1 30-Aug-2022 Wolfram Sang <wsa+renesas@sang-engineering.com>

net: move from strlcpy with unused retval to strscpy

Follow the advice of the below link and prefer 'strscpy' in this
subsystem. Conversion is 1:1 because the return value is not used.
Generated by a coccinelle script.

Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for CAN
Link: https://lore.kernel.org/r/20220830201457.7984-1-wsa+renesas@sang-engineering.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 5665f48e 08-Jun-2022 Eric Dumazet <edumazet@google.com>

ipvlan: adopt u64_stats_t

As explained in commit 316580b69d0a ("u64_stats: provide u64_stats_t type")
we should use u64_stats_t and related accessors to avoid load/store tearing.

Add READ_ONCE() when reading rx_errs & tx_drps.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 6df6398f 05-May-2022 Jakub Kicinski <kuba@kernel.org>

net: add netif_inherit_tso_max()

To make later patches smaller create a helper for inheriting
the TSO limitations of a lower device. The TSO in the name
is not an accident, subsequent patches will replace GSO
with TSO in more names.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0c478946 02-Dec-2021 Xu Wang <vulab@iscas.ac.cn>

ipvlan: Remove redundant if statements

The 'if (dev)' statement already move into dev_{put , hold}, so remove
redundant if statements.

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6d872df3 19-Nov-2021 Eric Dumazet <edumazet@google.com>

net: annotate accesses to dev->gso_max_segs

dev->gso_max_segs is written under RTNL protection, or when the device is
not yet visible, but is read locklessly.

Add netif_set_gso_max_segs() helper.

Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_segs()
where we can to better document what is going on.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4b66d216 19-Nov-2021 Eric Dumazet <edumazet@google.com>

net: annotate accesses to dev->gso_max_size

dev->gso_max_size is written under RTNL protection, or when the device is
not yet visible, but is read locklessly.

Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_size()
where we can to better document what is going on.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e35b8d7d 01-Oct-2021 Jakub Kicinski <kuba@kernel.org>

net: use eth_hw_addr_set() instead of ether_addr_copy()

Convert from ether_addr_copy() to eth_hw_addr_set():

@@
expression dev, np;
@@
- ether_addr_copy(dev->dev_addr, np)
+ eth_hw_addr_set(dev, np)

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2f23e5ce 01-Oct-2021 Jakub Kicinski <kuba@kernel.org>

net: use eth_hw_addr_set()

Convert sw drivers from memcpy(... ETH_ADDR) to eth_hw_addr_set():

@@
expression dev, np;
@@
- memcpy(dev->dev_addr, np, ETH_ALEN)
+ eth_hw_addr_set(dev, np)

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 57fb346c 29-Jul-2021 Di Zhu <zhudi21@huawei.com>

ipvlan: Add handling of NETDEV_UP events

When an ipvlan device is created on a bond device, the link state
of the ipvlan device may be abnormal. This is because bonding device
allows to add physical network card device in the down state and so
NETDEV_CHANGE event will not be notified to other listeners, so ipvlan
has no chance to update its link status.

The following steps can cause such problems:
1) bond0 is down
2) ip link add link bond0 name ipvlan type ipvlan mode l2
3) echo +enp2s7 >/sys/class/net/bond0/bonding/slaves
4) ip link set bond0 up

After these steps, use ip link command, we found ipvlan has NO-CARRIER:
ipvlan@bond0: <NO-CARRIER, BROADCAST,MULTICAST,UP,M-DOWN> mtu ...>

We can deal with this problem like VLAN: Add handling of NETDEV_UP
events. If we receive NETDEV_UP event, we will update the link status
of the ipvlan.

Signed-off-by: Di Zhu <zhudi21@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cc69837f 20-Nov-2020 Jakub Kicinski <kuba@kernel.org>

net: don't include ethtool.h from netdevice.h

linux/netdevice.h is included in very many places, touching any
of its dependecies causes large incremental builds.

Drop the linux/ethtool.h include, linux/netdevice.h just needs
a forward declaration of struct ethtool_ops.

Fix all the places which made use of this implicit include.

Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Shannon Nelson <snelson@pensando.io>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Link: https://lore.kernel.org/r/20201120225052.1427503-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 0bad834c 21-Aug-2020 Taehee Yoo <ap420073@gmail.com>

ipvlan: advertise link netns via netlink

Assign rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is
added to rtnetlink messages.

Test commands:
ip netns add nst
ip link add dummy0 type dummy
ip link add ipvlan0 link dummy0 type ipvlan
ip link set ipvlan0 netns nst
ip netns exec nst ip link show ipvlan0

Result:
---Before---
6: ipvlan0@if5: <BROADCAST,MULTICAST> ...
link/ether 82:3a:78:ab:60:50 brd ff:ff:ff:ff:ff:ff

---After---
12: ipvlan0@if11: <BROADCAST,MULTICAST> ...
link/ether 42:b1:ad:57:4e:27 brd ff:ff:ff:ff:ff:ff link-netnsid 0
~~~~~~~~~~~~~~

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d0f5c707 14-Aug-2020 Mahesh Bandewar <maheshb@google.com>

ipvlan: fix device features

Processing NETDEV_FEAT_CHANGE causes IPvlan links to lose
NETIF_F_LLTX feature because of the incorrect handling of
features in ipvlan_fix_features().

--before--
lpaa10:~# ethtool -k ipvl0 | grep tx-lockless
tx-lockless: on [fixed]
lpaa10:~# ethtool -K ipvl0 tso off
Cannot change tcp-segmentation-offload
Actual changes:
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
lpaa10:~# ethtool -k ipvl0 | grep tx-lockless
tx-lockless: off [fixed]
lpaa10:~#

--after--
lpaa10:~# ethtool -k ipvl0 | grep tx-lockless
tx-lockless: on [fixed]
lpaa10:~# ethtool -K ipvl0 tso off
Cannot change tcp-segmentation-offload
Could not change any device features
lpaa10:~# ethtool -k ipvl0 | grep tx-lockless
tx-lockless: on [fixed]
lpaa10:~#

Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.")
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1a33e10e 02-May-2020 Cong Wang <xiyou.wangcong@gmail.com>

net: partially revert dynamic lockdep key changes

This patch reverts the folowing commits:

commit 064ff66e2bef84f1153087612032b5b9eab005bd
"bonding: add missing netdev_update_lockdep_key()"

commit 53d374979ef147ab51f5d632dfe20b14aebeccd0
"net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()"

commit 1f26c0d3d24125992ab0026b0dab16c08df947c7
"net: fix kernel-doc warning in <linux/netdevice.h>"

commit ab92d68fc22f9afab480153bd82a20f6e2533769
"net: core: add generic lockdep keys"

but keeps the addr_list_lock_key because we still lock
addr_list_lock nestedly on stack devices, unlikely xmit_lock
this is safe because we don't take addr_list_lock on any fast
path.

Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 63aae7b1 07-Mar-2020 Jiri Wiesner <jwiesner@suse.com>

ipvlan: do not add hardware address of master to its unicast filter list

There is a problem when ipvlan slaves are created on a master device that
is a vmxnet3 device (ipvlan in VMware guests). The vmxnet3 driver does not
support unicast address filtering. When an ipvlan device is brought up in
ipvlan_open(), the ipvlan driver calls dev_uc_add() to add the hardware
address of the vmxnet3 master device to the unicast address list of the
master device, phy_dev->uc. This inevitably leads to the vmxnet3 master
device being forced into promiscuous mode by __dev_set_rx_mode().

Promiscuous mode is switched on the master despite the fact that there is
still only one hardware address that the master device should use for
filtering in order for the ipvlan device to be able to receive packets.
The comment above struct net_device describes the uc_promisc member as a
"counter, that indicates, that promiscuous mode has been enabled due to
the need to listen to additional unicast addresses in a device that does
not implement ndo_set_rx_mode()". Moreover, the design of ipvlan
guarantees that only the hardware address of a master device,
phy_dev->dev_addr, will be used to transmit and receive all packets from
its ipvlan slaves. Thus, the unicast address list of the master device
should not be modified by ipvlan_open() and ipvlan_stop() in order to make
ipvlan a workable option on masters that do not support unicast address
filtering.

Fixes: 2ad7bf3638411 ("ipvlan: Initial check-in of the IPVLAN driver")
Reported-by: Per Sundstrom <per.sundstrom@redqube.se>
Signed-off-by: Jiri Wiesner <jwiesner@suse.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ab92d68f 21-Oct-2019 Taehee Yoo <ap420073@gmail.com>

net: core: add generic lockdep keys

Some interface types could be nested.
(VLAN, BONDING, TEAM, MACSEC, MACVLAN, IPVLAN, VIRT_WIFI, VXLAN, etc..)
These interface types should set lockdep class because, without lockdep
class key, lockdep always warn about unexisting circular locking.

In the current code, these interfaces have their own lockdep class keys and
these manage itself. So that there are so many duplicate code around the
/driver/net and /net/.
This patch adds new generic lockdep keys and some helper functions for it.

This patch does below changes.
a) Add lockdep class keys in struct net_device
- qdisc_running, xmit, addr_list, qdisc_busylock
- these keys are used as dynamic lockdep key.
b) When net_device is being allocated, lockdep keys are registered.
- alloc_netdev_mqs()
c) When net_device is being free'd llockdep keys are unregistered.
- free_netdev()
d) Add generic lockdep key helper function
- netdev_register_lockdep_key()
- netdev_unregister_lockdep_key()
- netdev_update_lockdep_key()
e) Remove unnecessary generic lockdep macro and functions
f) Remove unnecessary lockdep code of each interfaces.

After this patch, each interface modules don't need to maintain
their lockdep keys.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 41441d85 09-Oct-2019 Mahesh Bandewar <maheshb@google.com>

ipvlan: consolidate TSO flags using NETIF_F_ALL_TSO

This will ensure that any new TSO related flags added (which
would be part of ALL_TSO mask and IPvlan driver doesn't need
to update every time new flag gets added.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>


# a4d2113e 14-Aug-2019 Bill Sommerfeld <wsommerfeld@google.com>

ipvlan: set hw_enc_features like macvlan

Allow encapsulated packets sent to tunnels layered over ipvlan to use
offloads rather than forcing SW fallbacks.

Since commit f21e5077010acda73a60 ("macvlan: add offload features for
encapsulation"), macvlan has set dev->hw_enc_features to include
everything in dev->features; do likewise in ipvlan.

Signed-off-by: Bill Sommerfeld <wsommerfeld@google.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ceae266b 04-Jun-2019 Miaohe Lin <linmiaohe@huawei.com>

net: ipvlan: Fix ipvlan device tso disabled while NETIF_F_IP_CSUM is set

There's some NICs, such as hinic, with NETIF_F_IP_CSUM and NETIF_F_TSO
on but NETIF_F_HW_CSUM off. And ipvlan device features will be
NETIF_F_TSO on with NETIF_F_IP_CSUM and NETIF_F_IP_CSUM both off as
IPVLAN_FEATURES only care about NETIF_F_HW_CSUM. So TSO will be
disabled in netdev_fix_features.
For example:
Features for enp129s0f0:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on

Fixes: a188222b6ed2 ("net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2874c5fd 27-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 7cc9f700 19-Feb-2019 Daniel Borkmann <daniel@iogearbox.net>

ipvlan: disallow userns cap_net_admin to change global mode/flags

When running Docker with userns isolation e.g. --userns-remap="default"
and spawning up some containers with CAP_NET_ADMIN under this realm, I
noticed that link changes on ipvlan slave device inside that container
can affect all devices from this ipvlan group which are in other net
namespaces where the container should have no permission to make changes
to, such as the init netns, for example.

This effectively allows to undo ipvlan private mode and switch globally to
bridge mode where slaves can communicate directly without going through
hostns, or it allows to switch between global operation mode (l2/l3/l3s)
for everyone bound to the given ipvlan master device. libnetwork plugin
here is creating an ipvlan master and ipvlan slave in hostns and a slave
each that is moved into the container's netns upon creation event.

* In hostns:

# ip -d a
[...]
8: cilium_host@bond0: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
ipvlan mode l3 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
inet 10.41.0.1/32 scope link cilium_host
valid_lft forever preferred_lft forever
[...]

* Spawn container & change ipvlan mode setting inside of it:

# docker run -dt --cap-add=NET_ADMIN --network cilium-net --name client -l app=test cilium/netperf
9fff485d69dcb5ce37c9e33ca20a11ccafc236d690105aadbfb77e4f4170879c

# docker exec -ti client ip -d a
[...]
10: cilium0@if4: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
ipvlan mode l3 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0
valid_lft forever preferred_lft forever

# docker exec -ti client ip link change link cilium0 name cilium0 type ipvlan mode l2

# docker exec -ti client ip -d a
[...]
10: cilium0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
ipvlan mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0
valid_lft forever preferred_lft forever

* In hostns (mode switched to l2):

# ip -d a
[...]
8: cilium_host@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
ipvlan mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
inet 10.41.0.1/32 scope link cilium_host
valid_lft forever preferred_lft forever
[...]

Same l3 -> l2 switch would also happen by creating another slave inside
the container's network namespace when specifying the existing cilium0
link to derive the actual (bond0) master:

# docker exec -ti client ip link add link cilium0 name cilium1 type ipvlan mode l2

# docker exec -ti client ip -d a
[...]
2: cilium1@if4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
ipvlan mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
10: cilium0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
ipvlan mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0
valid_lft forever preferred_lft forever

* In hostns:

# ip -d a
[...]
8: cilium_host@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
ipvlan mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
inet 10.41.0.1/32 scope link cilium_host
valid_lft forever preferred_lft forever
[...]

One way to mitigate it is to check CAP_NET_ADMIN permissions of
the ipvlan master device's ns, and only then allow to change
mode or flags for all devices bound to it. Above two cases are
then disallowed after the patch.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c675e06a 08-Feb-2019 Daniel Borkmann <daniel@iogearbox.net>

ipvlan: decouple l3s mode dependencies from other modes

Right now ipvlan has a hard dependency on CONFIG_NETFILTER and
otherwise it cannot be built. However, the only ipvlan operation
mode that actually depends on netfilter is l3s, everything else
is independent of it. Break this hard dependency such that users
are able to use ipvlan l3 mode on systems where netfilter is not
compiled in.

Therefore, this adds a hidden CONFIG_IPVLAN_L3S bool which is
defaulting to y when CONFIG_NETFILTER is set in order to retain
existing behavior for l3s. All l3s related code is refactored
into ipvlan_l3s.c that is compiled in when enabled.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Martynas Pumputis <m@lambda.lt>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d5256083 29-Jan-2019 Daniel Borkmann <daniel@iogearbox.net>

ipvlan, l3mdev: fix broken l3s mode wrt local routes

While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
I ran into the issue that while l3 mode is working fine, l3s mode
does not have any connectivity to kube-apiserver and hence all pods
end up in Error state as well. The ipvlan master device sits on
top of a bond device and hostns traffic to kube-apiserver (also running
in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
where the latter is the address of the bond0. While in l3 mode, a
curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
works fine from hostns, neither of them do in case of l3s. In the
latter only a curl to https://127.0.0.1:37573 appeared to work where
for local addresses of bond0 I saw kernel suddenly starting to emit
ARP requests to query HW address of bond0 which remained unanswered
and neighbor entries in INCOMPLETE state. These ARP requests only
happen while in l3s.

Debugging this further, I found the issue is that l3s mode is piggy-
backing on l3 master device, and in this case local routes are using
l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev
if relevant") and 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be
a loopback"). I found that reverting them back into using the
net->loopback_dev fixed ipvlan l3s connectivity and got everything
working for the CNI.

Now judging from 4fbae7d83c98 ("ipvlan: Introduce l3s mode") and the
l3mdev paper in [0] the only sole reason why ipvlan l3s is relying
on l3 master device is to get the l3mdev_ip_rcv() receive hook for
setting the dst entry of the input route without adding its own
ipvlan specific hacks into the receive path, however, any l3 domain
semantics beyond just that are breaking l3s operation. Note that
ipvlan also has the ability to dynamically switch its internal
operation from l3 to l3s for all ports via ipvlan_set_port_mode()
at runtime. In any case, l3 vs l3s soley distinguishes itself by
'de-confusing' netfilter through switching skb->dev to ipvlan slave
device late in NF_INET_LOCAL_IN before handing the skb to L4.

Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which,
if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
without any additional l3mdev semantics on top. This should also have
minimal impact since dev->priv_flags is already hot in cache. With
this set, l3s mode is working fine and I also get things like
masquerading pod traffic on the ipvlan master properly working.

[0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf

Fixes: f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
Fixes: 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be a loopback")
Fixes: 4fbae7d83c98 ("ipvlan: Introduce l3s mode")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: David Ahern <dsa@cumulusnetworks.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Martynas Pumputis <m@lambda.lt>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 61345fab 13-Dec-2018 Petr Machata <petrm@mellanox.com>

net: ipvlan: Issue NETDEV_PRE_CHANGEADDR

A NETDEV_CHANGEADDR event implies a change of address of each of the
IPVLANs of this IPVLAN device. Therefore propagate NETDEV_PRE_CHANGEADDR
to all the IPVLANs.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b1dd054d 10-Dec-2018 YueHaibing <yuehaibing@huawei.com>

ipvlan: Remove a useless comparison

Fix following gcc warning:

drivers/net/ipvlan/ipvlan_main.c:543:12: warning:
comparison is always false due to limited range of data type [-Wtype-limits]

'mode' is a u16 variable, IPVLAN_MODE_L2 is zero,
the comparison is always false

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 567c5e13 06-Dec-2018 Petr Machata <petrm@mellanox.com>

net: core: dev: Add extack argument to dev_change_flags()

In order to pass extack together with NETDEV_PRE_UP notifications, it's
necessary to route the extack to __dev_open() from diverse (possibly
indirect) callers. One prominent API through which the notification is
invoked is dev_change_flags().

Therefore extend dev_change_flags() with and extra extack argument and
update all users. Most of the calls end up just encoding NULL, but
several sites (VLAN, ipvlan, VRF, rtnetlink) do have extack available.

Since the function declaration line is changed anyway, name the other
function arguments to placate checkpatch.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cf7686a0 06-Dec-2018 Petr Machata <petrm@mellanox.com>

net: ipvlan: ipvlan_set_port_mode(): Add an extack argument

A follow-up patch will extend dev_change_flags() with an extack
argument. Extend ipvlan_set_port_mode() to have that argument available
for the conversion.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5dc2d399 01-Jul-2018 Hangbin Liu <liuhangbin@gmail.com>

ipvlan: call dev_change_flags when ipvlan mode is reset

After we change the ipvlan mode from l3 to l2, or vice versa, we only
reset IFF_NOARP flag, but don't flush the ARP table cache, which will
cause eth->h_dest to be equal to eth->h_source in ipvlan_xmit_mode_l2().
Then the message will not come out of host.

Here is the reproducer on local host:

ip link set eth1 up
ip addr add 192.168.1.1/24 dev eth1
ip link add link eth1 ipvlan1 type ipvlan mode l3

ip netns add net1
ip link set ipvlan1 netns net1
ip netns exec net1 ip link set ipvlan1 up
ip netns exec net1 ip addr add 192.168.2.1/24 dev ipvlan1

ip route add 192.168.2.0/24 via 192.168.1.2
ping 192.168.2.2 -c 2

ip netns exec net1 ip link set ipvlan1 type ipvlan mode l2
ping 192.168.2.2 -c 2

Add the same configuration on remote host. After we set the mode to l2,
we could find that the src/dst MAC addresses are the same on eth1:

21:26:06.648565 00:b7:13:ad:d3:05 > 00:b7:13:ad:d3:05, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 58356, offset 0, flags [DF], proto ICMP (1), length 84)
192.168.2.1 > 192.168.2.2: ICMP echo request, id 22686, seq 1, length 64

Fix this by calling dev_change_flags(), which will call netdevice notifier
with flag change info.

v2:
a) As pointed out by Wang Cong, check return value for dev_change_flags() when
change dev flags.
b) As suggested by Stefano and Sabrina, move flags setting before l3mdev_ops.
So we don't need to redo ipvlan_{, un}register_nf_hook() again in err path.

Reported-by: Jianlin Shi <jishi@redhat.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Fixes: 2ad7bf3638411 ("ipvlan: Initial check-in of the IPVLAN driver.")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 30877961 20-Jun-2018 Xin Long <lucien.xin@gmail.com>

ipvlan: fix IFLA_MTU ignored on NEWLINK

Commit 296d48568042 ("ipvlan: inherit MTU from master device") adjusted
the mtu from the master device when creating a ipvlan device, but it
would also override the mtu value set in rtnl_create_link. It causes
IFLA_MTU param not to take effect.

So this patch is to not adjust the mtu if IFLA_MTU param is set when
creating a ipvlan device.

Fixes: 296d48568042 ("ipvlan: inherit MTU from master device")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 548feb33 18-Jun-2018 Xin Long <lucien.xin@gmail.com>

ipvlan: use ETH_MAX_MTU as max mtu

Similar to the fixes on team and bonding, this restores the ability
to set an ipvlan device's mtu to anything higher than 1500.

Fixes: 91572088e3fd ("net: use core MTU range checking in core net infra")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ab452c3c 14-May-2018 Keefe Liu <liuqifa@huawei.com>

ipvlan: call netdevice notifier when master mac address changed

When master device's mac has been changed, the commit
32c10bbfe914 ("ipvlan: always use the current L2 addr of the
master") makes the IPVlan devices's mac changed also, but it
doesn't do related works such as flush the IPVlan devices's
arp table.

Signed-off-by: Keefe Liu <liuqifa@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2f635cee 27-Mar-2018 Kirill Tkhai <ktkhai@virtuozzo.com>

net: Drop pernet_operations::async

Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f5426250 09-Mar-2018 Paolo Abeni <pabeni@redhat.com>

net: introduce IFF_NO_RX_HANDLER

Some network devices - notably ipvlan slave - are not compatible with
any kind of rx_handler. Currently the hook can be installed but any
configuration (bridge, bond, macsec, ...) is nonfunctional.

This change allocates a priv_flag bit to mark such devices and explicitly
forbid installing a rx_handler if such bit is set. The new bit is used
by ipvlan slave device.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1ec54cb4 06-Mar-2018 Paolo Abeni <pabeni@redhat.com>

net: unpollute priv_flags space

the ipvlan device driver defines and uses 2 bits inside the priv_flags
net_device field. Such bits and the related helper are used only
inside the ipvlan device driver, and the core networking does not
need to be aware of them.

This change moves netif_is_ipvlan* helper in the ipvlan driver and
re-implement them looking for ipvlan specific symbols instead of
using priv_flags.

Overall this frees two bits inside priv_flags - and move the following
ones to avoid gaps - without any intended functional change.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3518e40b 02-Mar-2018 Paolo Abeni <pabeni@redhat.com>

ipvlan: forbid vlan devices on top of ipvlan

Currently we allow the creation of 8021q devices on top of
ipvlan, but such devices are nonfunctional, as the underlying
ipvlan rx_hanlder hook can't match the relevant traffic.

Be explicit and forbid the creation of such nonfunctional devices.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 82308194 28-Feb-2018 Paolo Abeni <pabeni@redhat.com>

ipvlan: use per device spinlock to protect addrs list updates

This changeset moves ipvlan address under RCU protection, using
a per ipvlan device spinlock to protect list mutation and RCU
read access to protect list traversal.

Also explicitly use RCU read lock to traverse the per port
ipvlans list, so that we can now perform a full address lookup
without asserting the RTNL lock.

Overall this allows the ipvlan driver to check fully for duplicate
addresses - before this commit ipv6 addresses assigned by autoconf
via prefix delegation where accepted without any check - and avoid
the following rntl assertion failure still in the same code path:

RTNL: assertion failed at drivers/net/ipvlan/ipvlan_core.c (124)
WARNING: CPU: 15 PID: 0 at drivers/net/ipvlan/ipvlan_core.c:124 ipvlan_addr_busy+0x97/0xa0 [ipvlan]
Modules linked in: ipvlan(E) ixgbe
CPU: 15 PID: 0 Comm: swapper/15 Tainted: G E 4.16.0-rc2.ipvlan+ #1782
Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
RIP: 0010:ipvlan_addr_busy+0x97/0xa0 [ipvlan]
RSP: 0018:ffff881ff9e03768 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff881fdf2a9000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 00000000000000f6 RDI: 0000000000000300
RBP: ffff881fdf2a8000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: ffff881ff9e034c0 R12: ffff881fe07bcc00
R13: 0000000000000001 R14: ffffffffa02002b0 R15: 0000000000000001
FS: 0000000000000000(0000) GS:ffff881ff9e00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc5c1a4f248 CR3: 000000207e012005 CR4: 00000000001606e0
Call Trace:
<IRQ>
ipvlan_addr6_event+0x6c/0xd0 [ipvlan]
notifier_call_chain+0x49/0x90
atomic_notifier_call_chain+0x6a/0x100
ipv6_add_addr+0x5f9/0x720
addrconf_prefix_rcv_add_addr+0x244/0x3c0
addrconf_prefix_rcv+0x2f3/0x790
ndisc_router_discovery+0x633/0xb70
ndisc_rcv+0x155/0x180
icmpv6_rcv+0x4ac/0x5f0
ip6_input_finish+0x138/0x6a0
ip6_input+0x41/0x1f0
ipv6_rcv+0x4db/0x8d0
__netif_receive_skb_core+0x3d5/0xe40
netif_receive_skb_internal+0x89/0x370
napi_gro_receive+0x14f/0x1e0
ixgbe_clean_rx_irq+0x4ce/0x1020 [ixgbe]
ixgbe_poll+0x31a/0x7a0 [ixgbe]
net_rx_action+0x296/0x4f0
__do_softirq+0xcf/0x4f5
irq_exit+0xf5/0x110
do_IRQ+0x62/0x110
common_interrupt+0x91/0x91
</IRQ>

v1 -> v2: drop unneeded in_softirq check in ipvlan_addr6_validator_event()

Fixes: e9997c2938b2 ("ipvlan: fix check for IP addresses in control path")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 68eabe8b 26-Feb-2018 Kirill Tkhai <ktkhai@virtuozzo.com>

net: Convert ipvlan_net_ops

These pernet_operations unregister ipvlan net hooks.
nf_unregister_net_hooks() removes hooks one-by-one,
and then frees the memory via rcu. This looks similar
to that happens, when a new hooks is added: allocation
of bigger memory region, copy of old content, and rcu
freeing the old memory. So, all of net code should be
well with this behavior. Also at the time of hook
unregistering, there are no packets, and foreign net
pernet_operations are not interested in others hooks.
So, we mark them as async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 94333fac 20-Feb-2018 Matteo Croce <mcroce@redhat.com>

ipvlan: drop ipv6 dependency

IPVlan has an hard dependency on IPv6, refactor the ipvlan code to allow
compiling it with IPv6 disabled, move duplicate code into addr_equal()
and refactor series of if-else into a switch.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5e51fe6f 01-Dec-2017 Gao Feng <gfree.wind@vip.163.com>

ipvlan: Add new func ipvlan_is_valid_dev instead of duplicated codes

There are multiple duplicated condition checks in the current codes, so
I add the new func ipvlan_is_valid_dev instead of the duplicated codes to
check if the netdev is real ipvlan dev.

Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fe18da60 17-Nov-2017 Girish Moodalbail <girish.moodalbail@oracle.com>

ipvlan: NULL pointer dereference panic in ipvlan_port_destroy

When call to register_netdevice() (called from ipvlan_link_new()) fails,
we call ipvlan_uninit() (through ndo_uninit()) to destroy the ipvlan
port. After returning unsuccessfully from register_netdevice() we go
ahead and call ipvlan_port_destroy() again which causes NULL pointer
dereference panic. Fix the issue by making ipvlan_init() and
ipvlan_uninit() call symmetric.

The ipvlan port will now be created inside ipvlan_init() and will be
destroyed in ipvlan_uninit().

Fixes: 2ad7bf363841 (ipvlan: Initial check-in of the IPVLAN driver)
Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fe89aa6b 26-Oct-2017 Mahesh Bandewar <maheshb@google.com>

ipvlan: implement VEPA mode

This is very similar to the Macvlan VEPA mode, however, there is some
difference. IPvlan uses the mac-address of the lower device, so the VEPA
mode has implications of ICMP-redirects for packets destined for its
immediate neighbors sharing same master since the packets will have same
source and dest mac. The external switch/router will send redirect msg.

Having said that, this will be useful tool in terms of debugging
since IPvlan will not switch packets within its slaves and rely completely
on the external entity as intended in 802.1Qbg.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a190d04d 26-Oct-2017 Mahesh Bandewar <maheshb@google.com>

ipvlan: introduce 'private' attribute for all existing modes.

IPvlan has always operated in bridge mode. However there are scenarios
where each slave should be able to talk through the master device but
not necessarily across each other. Think of an environment where each
of a namespace is a private and independant customer. In this scenario
the machine which is hosting these namespaces neither want to tell who
their neighbor is nor the individual namespaces care to talk to neighbor
on short-circuited network path.

This patch implements the mode that is very similar to the 'private' mode
in macvlan where individual slaves can send and receive traffic through
the master device, just that they can not talk among slave devices.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# de95e047 18-Oct-2017 David Ahern <dsahern@gmail.com>

net: Add extack to validator_info structs used for address notifier

Add extack to in_validator_info and in6_validator_info. Update the one
user of each, ipvlan, to return an error message for failures.

Only manual configuration of an address is plumbed in the IPv6 code path.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ff7883ea 18-Oct-2017 David Ahern <dsahern@gmail.com>

net: ipv6: Make inet6addr_validator a blocking notifier

inet6addr_validator chain was added by commit 3ad7d2468f79f ("Ipvlan
should return an error when an address is already in use") to allow
address validation before changes are committed and to be able to
fail the address change with an error back to the user. The address
validation is not done for addresses received from router
advertisements.

Handling RAs in softirq context is the only reason for the notifier
chain to be atomic versus blocking. Since the only current user, ipvlan,
of the validator chain ignores softirq context, the notifier can be made
blocking and simply not invoked for softirq path.

The blocking option is needed by spectrum for example to validate
resources for an adding an address to an interface.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 32c10bbf 11-Oct-2017 Mahesh Bandewar <maheshb@google.com>

ipvlan: always use the current L2 addr of the master

If the underlying master ever changes its L2 (e.g. bonding device),
then make sure that the IPvlan slaves always emit packets with the
current L2 of the master instead of the stale mac addr which was
copied during the device creation. The problem can be seen with
following script -

#!/bin/bash
# Create a vEth pair
ip link add dev veth0 type veth peer name veth1
ip link set veth0 up
ip link set veth1 up
ip link show veth0
ip link show veth1
# Create an IPvlan device on one end of this vEth pair.
ip link add link veth0 dev ipvl0 type ipvlan mode l2
ip link show ipvl0
# Change the mac-address of the vEth master.
ip link set veth0 address 02:11:22:33:44:55

Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.")
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 42ab19ee 04-Oct-2017 David Ahern <dsahern@gmail.com>

net: Add extack to upper device linking

Add extack arg to netdev_upper_dev_link and netdev_master_upper_dev_link

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 87173cd6 01-Aug-2017 Florian Fainelli <f.fainelli@gmail.com>

ipvlan: Fix 64-bit statistics seqcount initialization

On 32-bit hosts and with CONFIG_DEBUG_LOCK_ALLOC we should be seeing a
lockdep splat indicating this seqcount is not correctly initialized, fix
that by using the proper helper function: netdev_alloc_pcpu_stats().

Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 591bb278 26-Jul-2017 Florian Westphal <fw@strlen.de>

netfilter: nf_hook_ops structs can be const

We no longer place these on a list so they can be const.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 182e0b6b 03-Jul-2017 David S. Miller <davem@davemloft.net>

ipvlan: Stop advertising NETIF_F_UFO support.

It is going away.

Signed-off-by: David S. Miller <davem@davemloft.net>


# a8b8a889 25-Jun-2017 Matthias Schiffer <mschiffer@universe-factory.net>

net: add netlink_ext_ack argument to rtnl_link_ops.validate

Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ad744b22 25-Jun-2017 Matthias Schiffer <mschiffer@universe-factory.net>

net: add netlink_ext_ack argument to rtnl_link_ops.changelink

Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7a3f4a18 25-Jun-2017 Matthias Schiffer <mschiffer@universe-factory.net>

net: add netlink_ext_ack argument to rtnl_link_ops.newlink

Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3ad7d246 08-Jun-2017 Krister Johansen <kjlx@templeofstupid.com>

Ipvlan should return an error when an address is already in use.

The ipvlan code already knows how to detect when a duplicate address is
about to be assigned to an ipvlan device. However, that failure is not
propogated outward and leads to a silent failure.

Introduce a validation step at ip address creation time and allow device
drivers to register to validate the incoming ip addresses. The ipvlan
code is the first consumer. If it detects an address in use, we can
return an error to the user before beginning to commit the new ifa in
the networking code.

This can be especially useful if it is necessary to provision many
ipvlans in containers. The provisioning software (or operator) can use
this to detect situations where an ip address is unexpectedly in use.

Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cf124db5 07-May-2017 David S. Miller <davem@davemloft.net>

net: Fix inconsistent teardown and release of private netdev state.

Network devices can allocate reasources and private memory using
netdev_ops->ndo_init(). However, the release of these resources
can occur in one of two different places.

Either netdev_ops->ndo_uninit() or netdev->destructor().

The decision of which operation frees the resources depends upon
whether it is necessary for all netdev refs to be released before it
is safe to perform the freeing.

netdev_ops->ndo_uninit() presumably can occur right after the
NETDEV_UNREGISTER notifier completes and the unicast and multicast
address lists are flushed.

netdev->destructor(), on the other hand, does not run until the
netdev references all go away.

Further complicating the situation is that netdev->destructor()
almost universally does also a free_netdev().

This creates a problem for the logic in register_netdevice().
Because all callers of register_netdevice() manage the freeing
of the netdev, and invoke free_netdev(dev) if register_netdevice()
fails.

If netdev_ops->ndo_init() succeeds, but something else fails inside
of register_netdevice(), it does call ndo_ops->ndo_uninit(). But
it is not able to invoke netdev->destructor().

This is because netdev->destructor() will do a free_netdev() and
then the caller of register_netdevice() will do the same.

However, this means that the resources that would normally be released
by netdev->destructor() will not be.

Over the years drivers have added local hacks to deal with this, by
invoking their destructor parts by hand when register_netdevice()
fails.

Many drivers do not try to deal with this, and instead we have leaks.

Let's close this hole by formalizing the distinction between what
private things need to be freed up by netdev->destructor() and whether
the driver needs unregister_netdevice() to perform the free_netdev().

netdev->priv_destructor() performs all actions to free up the private
resources that used to be freed by netdev->destructor(), except for
free_netdev().

netdev->needs_free_netdev is a boolean that indicates whether
free_netdev() should be done at the end of unregister_netdevice().

Now, register_netdevice() can sanely release all resources after
ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
and netdev->priv_destructor().

And at the end of unregister_netdevice(), we invoke
netdev->priv_destructor() and optionally call free_netdev().

Signed-off-by: David S. Miller <davem@davemloft.net>


# 3133822f 20-Apr-2017 Florian Westphal <fw@strlen.de>

ipvlan: use pernet operations and restrict l3s hooks to master netns

commit 4fbae7d83c98c30efc ("ipvlan: Introduce l3s mode") added
registration of netfilter hooks via nf_register_hooks().

This API provides the illusion of 'global' netfilter hooks by placing the
hooks in all current and future network namespaces.

In case of ipvlan the hook appears to be only needed in the namespace
that contains the ipvlan master device (i.e., usually init_net), so
placing them in all namespaces is not needed.

This switches ipvlan driver to pernet operations, and then only registers
hooks in namespaces where a ipvlan master device is set to l3s mode.

Extra care has to be taken when the master device is moved to another
namespace, as we might have to 'move' the netfilter hooks too.

This is done by storing the namespace the ipvlan port was created in.
On REGISTER event, do (un)register operations in the old/new namespaces.

This will also allow removal of the nf_register_hooks() in a future patch.

Cc: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 235a9d89 10-Feb-2017 Sainath Grandhi <sainath.grandhi@intel.com>

ipvtap: IP-VLAN based tap driver

This patch adds a tap character device driver that is based on the
IP-VLAN network interface, called ipvtap. An ipvtap device can be created
in the same way as an ipvlan device, using 'type ipvtap', and then accessed
using the tap user space interface.

Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c3262d9d 18-Jan-2017 Mahesh Bandewar <maheshb@google.com>

ipvlan: use netdev_is_rx_handler_busy instead of checking specific type

IPvlan checks if the master device is already used by checking a
specific device (here it's macvlan device). This is technically not
sufficient and it should just ensure the rx_handler is busy or not.
This would be a super check that includes macvlan and any other that
has already registered rx-handler.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 019ec003 13-Jan-2017 Mahesh Bandewar <maheshb@google.com>

ipvlan: fix dev_id creation corner case.

In the last patch da36e13cf65 ("ipvlan: improvise dev_id generation
logic in IPvlan") I missed some part of Dave's suggestion and because
of that the dev_id creation could fail in a corner case scenario. This
would happen when more or less 64k devices have been already created and
several have been deleted. If the devices that are still sticking around
are the last n bits from the bitmap. So in this scenario even if lower
bits are available, the dev_id search is so narrow that it always fails.

Fixes: da36e13cf65 ("ipvlan: improvise dev_id generation logic in IPvlan")
CC: David Miller <davem@davemloft.org>
CC: Eric Dumazet <edumazet@google.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# da36e13c 09-Jan-2017 Mahesh Bandewar <maheshb@google.com>

ipvlan: improvise dev_id generation logic in IPvlan

The patch 009146d117b ("ipvlan: assign unique dev-id for each slave
device.") used ida_simple_get() to generate dev_ids assigned to the
slave devices. However (Eric has pointed out that) there is a shortcoming
with that approach as it always uses the first available ID. This
becomes a problem when a slave gets deleted and a new slave gets added.
The ID gets reassigned causing the new slave to get the same link-local
address. This side-effect is undesirable.

This patch adds a per-port variable that keeps track of the IDs
assigned and used as the stat-base for the IDR api. This base will be
wrapped around when it reaches the MAX (0xFFFE) value possibly on a
busy system where slaves are added and deleted routinely.

Fixes: 009146d117b ("ipvlan: assign unique dev-id for each slave device.")
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: Eric Dumazet <edumazet@google.com>
CC: David Miller <davem@davemloft.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bc1f4470 06-Jan-2017 stephen hemminger <stephen@networkplumber.org>

net: make ndo_get_stats64 a void function

The network device operation for reading statistics is only called
in one place, and it ignores the return value. Having a structure
return value is potentially confusing because some future driver could
incorrectly assume that the return value was used.

Fix all drivers with ndo_get_stats64 to have a void function.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 009146d1 03-Jan-2017 Mahesh Bandewar <maheshb@google.com>

ipvlan: assign unique dev-id for each slave device.

IPvlan setup uses one mac-address (of master). The IPv6 link-local
addresses are derived using the mac-address on the link. Lack of
dev-ids makes these link-local addresses same for all slaves including
that of master device. dev-ids are necessary to add differentiation
when L2 address is shared.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 86673982 28-Dec-2016 Gao Feng <fgao@ikuai8.com>

driver: ipvlan: Define common functions to decrease duplicated codes used to add or del IP address

There are some duplicated codes in ipvlan_add_addr6/4 and
ipvlan_del_addr6/4. Now define two common functions ipvlan_add_addr
and ipvlan_del_addr to decrease the duplicated codes.
It could be helful to maintain the codes.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b1227d01 21-Dec-2016 Eric Dumazet <edumazet@google.com>

ipvlan: fix various issues in ipvlan_process_multicast()

1) netif_rx() / dev_forward_skb() should not be called from process
context.

2) ipvlan_count_rx() should be called with preemption disabled.

3) We should check if ipvlan->dev is up before feeding packets
to netif_rx()

4) We need to prevent device from disappearing if some packets
are in the multicast backlog.

5) One kfree_skb() should be a consume_skb() eventually

Fixes: ba35f8588f47 ("ipvlan: Defer multicast / broadcast processing to
a work-queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1a31cc86 07-Dec-2016 Gao Feng <fgao@ikuai8.com>

driver: ipvlan: Unlink the upper dev when ipvlan_link_new failed

When netdev_upper_dev_unlink failed in ipvlan_link_new, need to
unlink the ipvlan dev with upper dev.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 48140a21 06-Dec-2016 Gao Feng <fgao@ikuai8.com>

driver: ipvlan: Free ipvl_port directly with kfree instead of kfree_rcu

There are two functions which would free the ipvl_port now. The first
is ipvlan_port_create. It frees the ipvl_port in the error handler,
so it could kfree it directly. The second is ipvlan_port_destroy. It
invokes netdev_rx_handler_unregister which enforces one grace period
by synchronize_net firstly, so it also could kfree the ipvl_port
directly and safely.

So it is unnecessary to use kfree_rcu to free ipvl_port.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8f679ed8 29-Nov-2016 Gao Feng <fgao@ikuai8.com>

driver: ipvlan: Remove useless member mtu_adj of struct ipvl_dev

The mtu_adj is initialized to zero when alloc mem, there is no any
assignment to mtu_adj. It is only used in ipvlan_adjust_mtu as one
right value.
So it is useless member of struct ipvl_dev, then remove it.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 147fd287 24-Nov-2016 Gao Feng <fgao@ikuai8.com>

driver: ipvlan: Fix one possible memleak in ipvlan_link_new

When ipvlan_link_new fails and creates one ipvlan port, it does not
destroy the ipvlan port created. It causes mem leak and the physical
device contains invalid ipvlan data.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ab530f63 15-Oct-2016 Julia Lawall <julia.lawall@lip6.fr>

ipvlan: constify l3mdev_ops structure

This l3mdev_ops structure is only stored in the l3mdev_ops field of a
net_device structure. This field is declared const, so the l3mdev_ops
structure can be declared as const also. Additionally drop the
__read_mostly annotation.

The semantic patch that adds const is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@r disable optional_qualifier@
identifier i;
position p;
@@
static struct l3mdev_ops i@p = { ... };

@ok@
identifier r.i;
struct net_device *e;
position p;
@@
e->l3mdev_ops = &i@p;

@bad@
position p != {r.p,ok.p};
identifier r.i;
struct l3mdev_ops e;
@@
e@i@p

@depends on !bad disable optional_qualifier@
identifier r.i;
@@
static
+const
struct l3mdev_ops i = { ... };
// </smpl>

The effect on the layout of the .o file is shown by the following output
of the size command, first before then after the transformation:

text data bss dec hex filename
7364 466 52 7882 1eca drivers/net/ipvlan/ipvlan_main.o
7412 434 52 7898 1eda drivers/net/ipvlan/ipvlan_main.o

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4fbae7d8 16-Sep-2016 Mahesh Bandewar <maheshb@google.com>

ipvlan: Introduce l3s mode

In a typical IPvlan L3 setup where master is in default-ns and
each slave is into different (slave) ns. In this setup egress
packet processing for traffic originating from slave-ns will
hit all NF_HOOKs in slave-ns as well as default-ns. However same
is not true for ingress processing. All these NF_HOOKs are
hit only in the slave-ns skipping them in the default-ns.
IPvlan in L3 mode is restrictive and if admins want to deploy
iptables rules in default-ns, this asymmetric data path makes it
impossible to do so.

This patch makes use of the l3_rcv() (added as part of l3mdev
enhancements) to perform input route lookup on RX packets without
changing the skb->dev and then uses nf_hook at NF_INET_LOCAL_IN
to change the skb->dev just before handing over skb to L4.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0d7dd798 09-Jun-2016 Eric Dumazet <edumazet@google.com>

net: ipvlan: call netdev_lockdep_set_classes()

In case a qdisc is used on a ipvlan device, we need to use different
lockdep classes to avoid false positives.

Use the new netdev_lockdep_set_classes() generic helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 494e8489 27-Apr-2016 Mahesh Bandewar <maheshb@google.com>

ipvlan: Fix failure path in dev registration during link creation

When newlink creation fails at device-registration, the port->count
is decremented twice. Francesco Ruggeri (fruggeri@arista.com) found
this issue in Macvlan and the same exists in IPvlan driver too.

While fixing this issue I noticed another issue of missing unregister
in case of failure, so adding it to the fix which is similar to the
macvlan fix by Francesco in commit 308379607548 ("macvlan: fix failure
during registration v3")

Reported-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: Eric Dumazet <edumazet@google.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f6773c5e 16-Mar-2016 Eric Dumazet <edumazet@google.com>

vlan: propagate gso_max_segs

vlan drivers lack proper propagation of gso_max_segs from
lower device.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 314d10d7 24-Feb-2016 David Decotigny <decot@googlers.com>

net: ipvlan: use __ethtool_get_ksettings

Signed-off-by: David Decotigny <decot@googlers.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ab5b7013 20-Feb-2016 Mahesh Bandewar <maheshb@google.com>

ipvlan: misc changes

1. scope correction for few functions that are used in single file.
2. Adjust variables that are used in fast-path to fit into single cacheline
3. Update rcv_frame() to skip shared check for frames coming over wire

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e93fbc5a 20-Feb-2016 Mahesh Bandewar <maheshb@google.com>

ipvlan: mode is u16

The mode argument was erronusly defined as u32 but it has always
been u16. Also use ipvlan_set_mode() helper to set the mode instead
of assigning directly. This should avoid future erronus assignments /
updates.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 296d4856 28-Jan-2016 Mahesh Bandewar <maheshb@google.com>

ipvlan: inherit MTU from master device

When we create IPvlan slave; we use ether_setup() and that
sets up default MTU to 1500 while the master device may have
lower / different MTU. Any subsequent changes to the masters'
MTU are reflected into the slaves' MTU setting. However if those
don't happen (most likely scenario), the slaves' MTU stays at
1500 which could be bad.

This change adds code to inherit MTU from the master device
instead of using the default value during the link initialization
phase.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tim Hockins <thockins@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a188222b 14-Dec-2015 Tom Herbert <tom@herbertland.com>

net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK

The name NETIF_F_ALL_CSUM is a misnomer. This does not correspond to the
set of features for offloading all checksums. This is a mask of the
checksum offload related features bits. It is incorrect to set both
NETIF_F_HW_CSUM and NETIF_F_IP_CSUM or NETIF_F_IPV6 at the same time for
features of a device.

This patch:
- Changes instances of NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK (where
NETIF_F_ALL_CSUM is being used as a mask).
- Changes bonding, sfc/efx, ipvlan, macvlan, vlan, and team drivers to
use NEITF_F_HW_CSUM in features list instead of NETIF_F_ALL_CSUM.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bf485bcf 18-Aug-2015 Phil Sutter <phil@nwl.cc>

net: ipvlan: convert to using IFF_NO_QUEUE

Signed-off-by: Phil Sutter <phil@nwl.cc>
Cc: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 23a5a49c 14-Jul-2015 Konstantin Khlebnikov <koct9i@gmail.com>

ipvlan: ignore addresses from ipv6 autoconfiguration

Inet6addr notifier is atomic and runs in bh context without RTNL when
ipv6 receives router advertisement packet and performs autoconfiguration.

Proper fix still in discussion. Let's at least plug the bug.
v1: http://lkml.kernel.org/r/20150514135618.14062.1969.stgit@buzz
v2: http://lkml.kernel.org/r/20150703125840.24121.91556.stgit@buzz

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6640e673 14-Jul-2015 Konstantin Khlebnikov <koct9i@gmail.com>

ipvlan: unhash addresses without synchronize_rcu

All structures used in traffic forwarding are rcu-protected:
ipvl_addr, ipvl_dev and ipvl_port. Thus we can unhash addresses
without synchronization. We'll anyway hash it back into the same
bucket: in worst case lockless lookup will scan hash once again.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6a725497 14-Jul-2015 Konstantin Khlebnikov <koct9i@gmail.com>

ipvlan: plug memory leak in ipvlan_link_delete

Add missing kfree_rcu(addr, rcu);

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 515866f8 14-Jul-2015 Konstantin Khlebnikov <koct9i@gmail.com>

ipvlan: remove counters of ipv4 and ipv6 addresses

They are unused after commit f631c44bbe15 ("ipvlan: Always set broadcast bit in
multicast filter").

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f631c44b 04-May-2015 Mahesh Bandewar <maheshb@google.com>

ipvlan: Always set broadcast bit in multicast filter

Earlier tricks of setting broadcast bit only when IPv4 address is added
onto interface are not good enough especially when autoconf comes in play.
Setting them on always is performance drag but now that multicast /
broadcast is not processed in fast-path; enabling broadcast will let
autoconf work correctly without affecting performance characteristics of
the device.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ba35f858 04-May-2015 Mahesh Bandewar <maheshb@google.com>

ipvlan: Defer multicast / broadcast processing to a work-queue

Processing multicast / broadcast in fast path is performance draining
and having more links means more cloning and bringing performance
down further.

Broadcast; in particular, need to be given to all the virtual links.
Earlier tricks of enabling broadcast bit for IPv4 only interfaces are not
really working since it fails autoconf. Which means enabling broadcast
for all the links if protocol specific hacks do not have to be added into
the driver.

This patch defers all (incoming as well as outgoing) multicast traffic to
a work-queue leaving only the unicast traffic in the fast-path. Now if we
need to apply any additional tricks to further reduce the impact of this
(multicast / broadcast) type of traffic, it can be implemented while
processing this work without affecting the fast-path.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7c411658 02-Apr-2015 Nicolas Dichtel <nicolas.dichtel@6wind.com>

ipvlan: implement ndo_get_iflink

Don't use dev->iflink anymore.

CC: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e9997c29 28-Mar-2015 Jiri Benc <jbenc@redhat.com>

ipvlan: fix check for IP addresses in control path

When an ipvlan interface is down, its addresses are not on the hash list.
Fix checks for existence of addresses not to depend on the hash list, walk
through all interface addresses instead.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 40891e8a 28-Mar-2015 Jiri Benc <jbenc@redhat.com>

ipvlan: do not use rcu operations for address list

All accesses to ipvlan->addrs are under rtnl.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 27705f70 28-Mar-2015 Jiri Benc <jbenc@redhat.com>

ipvlan: fix addr hash list corruption

When ipvlan interface with IP addresses attached is brought down and then
deleted, the assigned addresses are deleted twice from the address hash
list, first on the interface down and second on the link deletion.
Similarly, when an address is added while the interface is down, it is added
second time once the interface is brought up.

When the interface is down, the addresses should be kept off the hash list
for performance reasons. Ensure this is true, which also fixes the double add
problem. To fix the double free, check whether the address is hashed before
removing it.

Reported-by: Dan Williams <dcbw@redhat.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d476059e 01-Mar-2015 Eric W. Biederman <ebiederm@xmission.com>

net: Kill dev_rebuild_header

Now that there are no more users kill dev_rebuild_header and all of it's
implementations.

This is long overdue.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5933fea7 06-Dec-2014 Mahesh Bandewar <maheshb@google.com>

ipvlan: move the device check function into netdevice.h

Move the port check [ipvlan_dev_master()] and device check
[ipvlan_dev_slave()] functions to netdevice.h and rename them
netif_is_ipvlan_port() and netif_is_ipvlan() resp. to be
consistent with macvlan api naming.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 764e433b 06-Dec-2014 Mahesh Bandewar <maheshb@google.com>

ipvlan: play well with macvlan device

If a device is already a macvlan port then refuse to use it as
an ipvlan port in the early stage of port creation.

thost1:~# ip link add link eth0 mvl0 type macvlan
thost1:~# echo $?
0
thost1:~# ip link add link eth0 ipvl0 type ipvlan
RTNETLINK answers: Device or resource busy
thost1:~# echo $?
2
thost1:~#

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 04901cea 29-Nov-2014 Markus Elfring <elfring@users.sourceforge.net>

net-ipvlan: Deletion of an unnecessary check before the function call "free_percpu"

The free_percpu() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 92c7b0de 25-Nov-2014 Mahesh Bandewar <maheshb@google.com>

ipvlan: fix sparse warnings

Fix sparse warnings reported by kbuild robot

drivers/net/ipvlan/ipvlan_main.c:172:13: warning: symbol 'ipvlan_start_xmit' was not declared. Should it be static?
drivers/net/ipvlan/ipvlan_main.c:256:33: warning: incorrect type in initializer (different address spaces)
drivers/net/ipvlan/ipvlan_main.c:256:33: expected void const [noderef] <asn:3>*__vpp_verify
drivers/net/ipvlan/ipvlan_main.c:256:33: got struct ipvl_pcpu_stats *<noident>
drivers/net/ipvlan/ipvlan_main.c:544:5: warning: symbol 'ipvlan_link_register' was not declared. Should it be static

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2ad7bf36 24-Nov-2014 Mahesh Bandewar <maheshb@google.com>

ipvlan: Initial check-in of the IPVLAN driver.

This driver is very similar to the macvlan driver except that it
uses L3 on the frame to determine the logical interface while
functioning as packet dispatcher. It inherits L2 of the master
device hence the packets on wire will have the same L2 for all
the packets originating from all virtual devices off of the same
master device.

This driver was developed keeping the namespace use-case in
mind. Hence most of the examples given here take that as the
base setup where main-device belongs to the default-ns and
virtual devices are assigned to the additional namespaces.

The device operates in two different modes and the difference
in these two modes in primarily in the TX side.

(a) L2 mode : In this mode, the device behaves as a L2 device.
TX processing upto L2 happens on the stack of the virtual device
associated with (namespace). Packets are switched after that
into the main device (default-ns) and queued for xmit.

RX processing is simple and all multicast, broadcast (if
applicable), and unicast belonging to the address(es) are
delivered to the virtual devices.

(b) L3 mode : In this mode, the device behaves like a L3 device.
TX processing upto L3 happens on the stack of the virtual device
associated with (namespace). Packets are switched to the
main-device (default-ns) for the L2 processing. Hence the routing
table of the default-ns will be used in this mode.

RX processins is somewhat similar to the L2 mode except that in
this mode only Unicast packets are delivered to the virtual device
while main-dev will handle all other packets.

The devices can be added using the "ip" command from the iproute2
package -

ip link add link <master> <virtual> type ipvlan mode [ l2 | l3 ]

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Laurent Chavey <chavey@google.com>
Cc: Tim Hockin <thockin@google.com>
Cc: Brandon Philips <brandon.philips@coreos.com>
Cc: Pavel Emelianov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>