History log of /linux-master/drivers/net/macvtap.c
Revision Date Author Comments
# 55c90047 27-Oct-2023 Jakub Kicinski <kuba@kernel.org>

net: fill in MODULE_DESCRIPTION()s under drivers/net/

W=1 builds now warn if module is built without a MODULE_DESCRIPTION().

Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 10a03c36 13-Mar-2023 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

drivers: remove struct module * setting from struct class

There is no need to manually set the owner of a struct class, as the
registering function does it automatically, so remove all of the
explicit settings from various drivers that did so as it is unneeded.

This allows us to remove this pointer entirely from this structure going
forward.

Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Link: https://lore.kernel.org/r/20230313181843.1207845-2-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# fa627348 01-Oct-2022 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

driver core: class: make namespace and get_ownership take const *

The callbacks in struct class namespace() and get_ownership() do not
modify the struct device passed to them, so mark the pointer as constant
and fix up all callbacks in the kernel to have the correct function
signature.

This helps make it more obvious what calls and callbacks do, and do not,
modify structures passed to them.

Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Link: https://lore.kernel.org/r/20221001165426.2690912-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# d57aae2e 17-Sep-2022 Xiu Jianfeng <xiujianfeng@huawei.com>

net: macvtap: add __init/__exit annotations to module init/exit funcs

Add missing __init/__exit annotations to module init/exit funcs.

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Link: https://lore.kernel.org/r/20220917083535.20040-1-xiujianfeng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# a0219215 27-Feb-2022 Sven Eckelmann <sven@narfation.org>

macvtap: advertise link netns via netlink

Assign rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is
added to rtnetlink messages. This fixes iproute2 which otherwise resolved
the link interface to an interface in the wrong namespace.

Test commands:

ip netns add nst
ip link add dummy0 type dummy
ip link add link macvtap0 link dummy0 type macvtap
ip link set macvtap0 netns nst
ip -netns nst link show macvtap0

Before:

10: macvtap0@gre0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 500
link/ether 5e:8f:ae:1d:60:50 brd ff:ff:ff:ff:ff:ff

After:

10: macvtap0@if2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 500
link/ether 5e:8f:ae:1d:60:50 brd ff:ff:ff:ff:ff:ff link-netnsid 0

Reported-by: Leonardo Mörlein <freifunk@irrelefant.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Link: https://lore.kernel.org/r/20220228003240.1337426-1-sven@narfation.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 1c5b5b3f 15-Oct-2021 Jean Sacren <sakiwit@gmail.com>

net: macvtap: fix template string argument of device_create() call

The last argument of device_create() call should be a template string.
The tap_name variable should be the argument to the string, but not the
argument of the call itself. We should add the template string and turn
tap_name into its argument.

Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 09c434b8 19-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Add SPDX license identifier for more missed files

Add SPDX license identifiers to all files which:

- Have no license information of any form

- Have MODULE_LICENCE("GPL*") inside which was used in the initial
scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# dea6e19f 27-Oct-2017 Girish Moodalbail <girish.moodalbail@oracle.com>

tap: reference to KVA of an unloaded module causes kernel panic

The commit 9a393b5d5988 ("tap: tap as an independent module") created a
separate tap module that implements tap functionality and exports
interfaces that will be used by macvtap and ipvtap modules to create
create respective tap devices.

However, that patch introduced a regression wherein the modules macvtap
and ipvtap can be removed (through modprobe -r) while there are
applications using the respective /dev/tapX devices. These applications
cause kernel to hold reference to /dev/tapX through 'struct cdev
macvtap_cdev' and 'struct cdev ipvtap_dev' defined in macvtap and ipvtap
modules respectively. So, when the application is later closed the
kernel panics because we are referencing KVA that is present in the
unloaded modules.

----------8<------- Example ----------8<----------
$ sudo ip li add name mv0 link enp7s0 type macvtap
$ sudo ip li show mv0 |grep mv0| awk -e '{print $1 $2}'
14:mv0@enp7s0:
$ cat /dev/tap14 &
$ lsmod |egrep -i 'tap|vlan'
macvtap 16384 0
macvlan 24576 1 macvtap
tap 24576 3 macvtap
$ sudo modprobe -r macvtap
$ fg
cat /dev/tap14
^C

<...system panics...>
BUG: unable to handle kernel paging request at ffffffffa038c500
IP: cdev_put+0xf/0x30
----------8<-----------------8<----------

The fix is to set cdev.owner to the module that creates the tap device
(either macvtap or ipvtap). With this set, the operations (in
fs/char_dev.c) on char device holds and releases the module through
cdev_get() and cdev_put() and will not allow the module to unload
prematurely.

Fixes: 9a393b5d5988ea4e (tap: tap as an independent module)
Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 42ab19ee 04-Oct-2017 David Ahern <dsahern@gmail.com>

net: Add extack to upper device linking

Add extack arg to netdev_upper_dev_link and netdev_master_upper_dev_link

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fb652fdf 03-Jul-2017 David S. Miller <davem@davemloft.net>

macvlan/macvtap: Remove NETIF_F_UFO advertisement.

It is going away.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 7a3f4a18 25-Jun-2017 Matthias Schiffer <mschiffer@universe-factory.net>

net: add netlink_ext_ack argument to rtnl_link_ops.newlink

Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 174cd4b1 02-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>

Fix up affected files that include this signal functionality via sched.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 9a393b5d 10-Feb-2017 Sainath Grandhi <sainath.grandhi@intel.com>

tap: tap as an independent module

This patch makes tap a separate module for other types of virtual interfaces, for example,
ipvlan to use.

Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 837585a5 03-Feb-2017 Willem de Bruijn <willemb@google.com>

macvtap: read vnet_hdr_size once

When IFF_VNET_HDR is enabled, a virtio_net header must precede data.
Data length is verified to be greater than or equal to expected header
length tun->vnet_hdr_sz before copying.

Macvtap functions read the value once, but unless READ_ONCE is used,
the compiler may ignore this and read multiple times. Enforce a single
read and locally cached value to avoid updates between test and use.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6391a448 19-Jan-2017 Jason Wang <jasowang@redhat.com>

virtio-net: restore VIRTIO_HDR_F_DATA_VALID on receiving

Commit 501db511397f ("virtio: don't set VIRTIO_NET_HDR_F_DATA_VALID on
xmit") in fact disables VIRTIO_HDR_F_DATA_VALID on receiving path too,
fixing this by adding a hint (has_data_valid) and set it only on the
receiving path.

Cc: Rolf Neugebauer <rolf.neugebauer@docker.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Rolf Neugebauer <rolf.neugebauer@docker.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cbbd26b8 01-Nov-2016 Al Viro <viro@zeniv.linux.org.uk>

[iov_iter] new primitives - copy_from_iter_full() and friends

copy_from_iter_full(), copy_from_iter_full_nocache() and
csum_and_copy_from_iter_full() - counterparts of copy_from_iter()
et.al., advancing iterator only in case of successful full copy
and returning whether it had been successful or not.

Convert some obvious users. *NOTE* - do not blindly assume that
something is a good candidate for those unless you are sure that
not advancing iov_iter in failure case is the right thing in
this case. Anything that does short read/short write kind of
stuff (or is in a loop, etc.) is unlikely to be a good one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# aa196eed 29-Nov-2016 Jason Wang <jasowang@redhat.com>

macvtap: handle ubuf refcount correctly when meet errors

We trigger uarg->callback() immediately after we decide do datacopy
even if caller want to do zerocopy. This will cause the callback
(vhost_net_zerocopy_callback) decrease the refcount. But when we meet
an error afterwards, the error handling in vhost handle_tx() will try
to decrease it again. This is wrong and fix this by delay the
uarg->callback() until we're sure there's no errors.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 763dfa27 28-Nov-2016 Zhang Shengju <zhangshengju@cmss.chinamobile.com>

macvtap: replace printk with netdev_err

This patch replaces printk() with netdev_err() for macvtap device.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e824265d 24-Nov-2016 Gao Feng <fgao@ikuai8.com>

driver: macvtap: Unregister netdev rx_handler if macvtap_newlink fails

The macvtap_newlink registers the netdev rx_handler firstly, but it
does not unregister the handler if macvlan_common_newlink failed.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3e9e40e7 18-Nov-2016 Jarno Rajahalme <jarno@ovn.org>

virtio_net: Simplify call sites for virtio_net_hdr_{from, to}_skb().

No point storing the return value of virtio_net_hdr_to_skb() or
virtio_net_hdr_from_skb() to a variable when the value is used only
once as a boolean in an immediately following if statement.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 104a4933 11-Aug-2016 Jason Wang <jasowang@redhat.com>

macvtap: fix use after free for skb_array during release

We've clean skb_array in macvtap_put_queue() but still try to pop from
it during macvtap_sock_destruct(). Fix this use after free by moving
the skb array cleanup to macvtap_sock_destruct() instead.

Fixes: 362899b8725b ("macvtap: switch to use skb array")
Reported-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Tested-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0d7eacbe 18-Jul-2016 Jason Wang <jasowang@redhat.com>

macvtap: correctly free skb during socket destruction

We should use kfree_skb() instead of kfree() to free an skb.

Fixes: 362899b8725b ("macvtap: switch to use skb array")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 362899b8 15-Jul-2016 Jason Wang <jasowang@redhat.com>

macvtap: switch to use skb array

This patch switch to use skb array instead of sk_receive_queue to
avoid spinlock contentions. Tests shows about 21% improvements for
guest rx pps:

Before: 1472731 pkts/s
After: 1786289 pkts/s

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1b16bf42 15-Jul-2016 Jason Wang <jasowang@redhat.com>

macvtap: avoid hash calculating for single queue

We decide the rxq through calculating its hash which is not necessary
if we only have one rx queue. So this patch skip this and just return
queue 0. Test shows 22% improving on guest rx pps.

Before: 1201504 pkts/s
After: 1472731 pkts/s

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fd88d68b 08-Jun-2016 Mike Rapoport <rppt@linux.vnet.ibm.com>

macvtap: use common code for virtio_net_hdr and skb GSO conversion

Replace open coded conversion between virtio_net_hdr to skb GSO info with
virtio_net_hdr_{from,to}_skb

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# be0bd316 06-May-2016 Eric Dumazet <edumazet@google.com>

macvtap: segmented packet is consumed

If GSO packet is segmented and its segments are properly queued,
we call consume_skb() instead of kfree_skb() to be drop monitor
friendly.

Fixes: 3e4f8b7873709 ("macvtap: Perform GSO on forwarding path.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Yasevich <vyasevic@redhat.com>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 17af2bce 04-May-2016 Marc Angel <marc@arista.com>

macvtap: add namespace support to the sysfs device class

When creating macvtaps that are expected to have the same ifindex
in different network namespaces, only the first one will succeed.
The others will fail with a sysfs_warn_dup warning due to them trying
to create the following sysfs link (with 'NN' the ifindex of macvtapX):

/sys/class/macvtap/tapNN -> /sys/devices/virtual/net/macvtapX/tapNN

This is reproducible by running the following commands:

ip netns add ns1
ip netns add ns2
ip link add veth0 type veth peer name veth1
ip link set veth0 netns ns1
ip link set veth1 netns ns2
ip netns exec ns1 ip l add link veth0 macvtap0 type macvtap
ip netns exec ns2 ip l add link veth1 macvtap1 type macvtap

The last command will fail with "RTNETLINK answers: File exists" (along
with the kernel warning) but retrying it will work because the ifindex
was incremented.

The 'net' device class is isolated between network namespaces so each
one has its own hierarchy of net devices.
This isn't the case for the 'macvtap' device class.
The problem occurs half-way through the netdev registration, when
`macvtap_device_event` is called-back to create the 'tapNN' macvtap
class device under the 'macvtapX' net class device.

This patch adds namespace support to the 'macvtap' device class so
that /sys/class/macvtap is no longer shared between net namespaces.

However, making the macvtap sysfs class namespace-aware has the side
effect of changing /sys/devices/virtual/net/macvtapX/tapNN into
/sys/devices/virtual/net/macvtapX/macvtap/tapNN.

This is due to Commit 24b1442 ("Driver-core: Always create class
directories for classses that support namespaces") and the fact that
class devices supporting namespaces are really not supposed to be placed
directly under other class devices.

To avoid breaking userland, a tapNN symlink pointing to macvtap/tapNN is
created inside the macvtapX directory.

Signed-off-by: Marc Angel <marc@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e96c37f1 23-Apr-2016 Francesco Ruggeri <fruggeri@arista.com>

macvtap: check minor when unregistering

macvtap_device_event(NETDEV_UNREGISTER) should check vlan->minor to
determine if it is being invoked in the context of a macvtap_newlink
that failed, for example in this code sequence:

macvtap_newlink
macvlan_common_newlink
register_netdevice
call_netdevice_notifiers(NETDEV_REGISTER, dev)
macvtap_device_event(NETDEV_REGISTER)
<fail here, vlan->minor = 0>
rollback_registered(dev);
rollback_registered_many
call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
macvtap_device_event(NETDEV_UNREGISTER)
<nothing to clean up here>

Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8e2ad411 08-Mar-2016 Willem de Bruijn <willemb@google.com>

macvtap: always pass ethernet header in linear

The stack expects link layer headers in the skb linear section.
Macvtap can create skbs with llheader in frags in edge cases:
when (IFF_VNET_HDR is off or vnet_hdr.hdr_len < ETH_HLEN) and
prepad + len > PAGE_SIZE and vnet_hdr.flags has no or bad csum.

Add checks to ensure linear is always at least ETH_HLEN.
At this point, len is already ensured to be >= ETH_HLEN.

For backwards compatiblity, rounds up short vnet_hdr.hdr_len.
This differs from tap and packet, which return an error.

Fixes b9fb9ee07e67 ("macvtap: add GSO/csum offload support")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a188222b 14-Dec-2015 Tom Herbert <tom@herbertland.com>

net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK

The name NETIF_F_ALL_CSUM is a misnomer. This does not correspond to the
set of features for offloading all checksums. This is a mask of the
checksum offload related features bits. It is incorrect to set both
NETIF_F_HW_CSUM and NETIF_F_IP_CSUM or NETIF_F_IPV6 at the same time for
features of a device.

This patch:
- Changes instances of NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK (where
NETIF_F_ALL_CSUM is being used as a mask).
- Changes bonding, sfc/efx, ipvlan, macvlan, vlan, and team drivers to
use NEITF_F_HW_CSUM in features list instead of NETIF_F_ALL_CSUM.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9cd3e072 29-Nov-2015 Eric Dumazet <edumazet@google.com>

net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA

This patch is a cleanup to make following patch easier to
review.

Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
from (struct socket)->flags to a (struct socket_wq)->flags
to benefit from RCU protection in sock_wake_async()

To ease backports, we rename both constants.

Two new helpers, sk_set_bit(int nr, struct sock *sk)
and sk_clear_bit(int net, struct sock *sk) are added so that
following patch can change their implementation.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a499a2e9 09-Nov-2015 Vlad Yasevich <vyasevich@gmail.com>

macvtap: Resolve possible __might_sleep warning in macvtap_do_read()

macvtap_do_read code calls macvtap_put_user while it might be set up
to wait for the user. This results in the following warning:

Jun 23 16:25:26 galen kernel: ------------[ cut here ]------------
Jun 23 16:25:26 galen kernel: WARNING: CPU: 0 PID: 30433 at kernel/sched/core.c:
7286 __might_sleep+0x7f/0x90()
Jun 23 16:25:26 galen kernel: do not call blocking ops when !TASK_RUNNING; state
=1 set at [<ffffffff810f1c1f>] prepare_to_wait+0x2f/0x90
Jun 23 16:25:26 galen kernel: CPU: 0 PID: 30433 Comm: cat Not tainted 4.1.0-rc6+
#11
Jun 23 16:25:26 galen kernel: Call Trace:
Jun 23 16:25:26 galen kernel: [<ffffffff817f76ba>] dump_stack+0x4c/0x65
Jun 23 16:25:26 galen kernel: [<ffffffff810a07ca>] warn_slowpath_common+0x8a/0xc
0
Jun 23 16:25:26 galen kernel: [<ffffffff810a0846>] warn_slowpath_fmt+0x46/0x50
Jun 23 16:25:26 galen kernel: [<ffffffff810f1c1f>] ? prepare_to_wait+0x2f/0x90
Jun 23 16:25:26 galen kernel: [<ffffffff810f1c1f>] ? prepare_to_wait+0x2f/0x90
Jun 23 16:25:26 galen kernel: [<ffffffff810cdc1f>] __might_sleep+0x7f/0x90
Jun 23 16:25:26 galen kernel: [<ffffffff811f8e15>] might_fault+0x55/0xb0
Jun 23 16:25:26 galen kernel: [<ffffffff810fab9d>] ? trace_hardirqs_on_caller+0x fd/0x1c0
Jun 23 16:25:26 galen kernel: [<ffffffff813f639c>] copy_to_iter+0x7c/0x360
Jun 23 16:25:26 galen kernel: [<ffffffffa052da86>] macvtap_do_read+0x256/0x3d0 [macvtap]
Jun 23 16:25:26 galen kernel: [<ffffffff810f20e0>] ? prepare_to_wait_event+0x110/0x110
Jun 23 16:25:26 galen kernel: [<ffffffffa052dcab>] macvtap_read_iter+0x2b/0x50 [macvtap]
Jun 23 16:25:26 galen kernel: [<ffffffff81247f2e>] __vfs_read+0xae/0xe0
Jun 23 16:25:26 galen kernel: [<ffffffff81248526>] vfs_read+0x86/0x140
Jun 23 16:25:26 galen kernel: [<ffffffff812493b9>] SyS_read+0x49/0xb0
Jun 23 16:25:26 galen kernel: [<ffffffff8180182e>] system_call_fastpath+0x12/0x76
Jun 23 16:25:26 galen kernel: ---[ end trace 22e33f67e70c0c2a ]---

Make sure thet we call finish_wait() if we have the skb to process
before trying to actually process it.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f23d538b 22-Oct-2015 Jason Wang <jasowang@redhat.com>

macvtap: unbreak receiving of gro skb with frag list

We don't have fraglist support in TAP_FEATURES. This will lead
software segmentation of gro skb with frag list. Fixes by having
frag list support in TAP_FEATURES.

With this patch single session of netperf receiving were restored from
about 5Gb/s to about 12Gb/s on mlx4.

Fixes a567dd6252 ("macvtap: simplify usage of tap_features")
Cc: Vlad Yasevich <vyasevic@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3ea79249 18-Sep-2015 Michael S. Tsirkin <mst@redhat.com>

macvtap: fix TUNSETSNDBUF values > 64k

Upon TUNSETSNDBUF, macvtap reads the requested sndbuf size into
a local variable u.
commit 39ec7de7092b ("macvtap: fix uninitialized access on
TUNSETIFF") changed its type to u16 (which is the right thing to
do for all other macvtap ioctls), breaking all values > 64k.

The value of TUNSETSNDBUF is actually a signed 32 bit integer, so
the right thing to do is to read it into an int.

Cc: David S. Miller <davem@davemloft.net>
Fixes: 39ec7de7092b ("macvtap: fix uninitialized access on TUNSETIFF")
Reported-by: Mark A. Peloquin
Bisected-by: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c5c62f1b 23-Jul-2015 Ivan Vecera <ivecera@redhat.com>

macvtap: fix network header pointer for VLAN tagged pkts

Network header is set with offset ETH_HLEN but it is not true for VLAN
(multiple-)tagged and results in checksum issues in lower devices.

v2: leave skb->protocol untouched (thx Vlad), comment added
v3: moved after skb_probe_transport_header() call (thx Toshiaki)

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d5de1987 08-Jul-2015 Johannes Thumshirn <jthumshirn@suse.de>

macvtap: Destroy minor_idr on module_exit

Destroy minor_idr on module_exit, reclaiming the allocated memory.

This was detected by the following semantic patch (written by Luis Rodriguez
<mcgrof@suse.com>)
<SmPL>
@ defines_module_init @
declarer name module_init, module_exit;
declarer name DEFINE_IDR;
identifier init;
@@

module_init(init);

@ defines_module_exit @
identifier exit;
@@

module_exit(exit);

@ declares_idr depends on defines_module_init && defines_module_exit @
identifier idr;
@@

DEFINE_IDR(idr);

@ on_exit_calls_destroy depends on declares_idr && defines_module_exit @
identifier declares_idr.idr, defines_module_exit.exit;
@@

exit(void)
{
...
idr_destroy(&idr);
...
}

@ missing_module_idr_destroy depends on declares_idr && defines_module_exit && !on_exit_calls_destroy @
identifier declares_idr.idr, defines_module_exit.exit;
@@

exit(void)
{
...
+idr_destroy(&idr);
}
</SmPL>

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# dfe816c5 19-Jun-2015 Pankaj Gupta <pagupta@redhat.com>

macvtap: Increase limit of macvtap queues

Macvtap should be compatible with tuntap for
maximum number of queues.

commit 'baf71c5c1f80d82e92924050a60b5baaf97e3094 (tuntap:
Increase the number of queues in tun.)' removes
the limitations and increases number of queues in tuntap.
Now, Its safe to increase number of queues in Macvtap as well.

This patch also modifies 'macvtap_del_queues' function
to avoid extra memory allocation in stack.

Changes from v1->v2 :
Michael S. Tsirkin, Jason Wang :
Better way to use linked list to
avoid use of extra memory in stack.
Sergei Shtylyov : Specify dependent commit's summary.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8b8e658b 24-Apr-2015 Greg Kurz <groug@kaod.org>

macvtap/tun: cross-endian support for little-endian hosts

The VNET_LE flag was introduced to fix accesses to virtio 1.0 headers
that are always little-endian. It can also be used to handle the special
case of a legacy little-endian device implemented by a big-endian host.

Let's add a flag and ioctls for big-endian devices as well. If both flags
are set, little-endian wins.

Since this is isn't a common usecase, the feature is controlled by a kernel
config option (not set by default).

Both macvtap and tun are covered by this patch since they share the same
API with userland.

Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>


# 7d824109 24-Apr-2015 Greg Kurz <groug@kaod.org>

virtio: add explicit big-endian support to memory accessors

The current memory accessors logic is:
- little endian if little_endian
- native endian (i.e. no byteswap) if !little_endian

If we want to fully support cross-endian vhost, we also need to be
able to convert to big endian.

Instead of changing the little_endian argument to some 3-value enum, this
patch changes the logic to:
- little endian if little_endian
- big endian if !little_endian

The native endian case is handled by all users with a trivial helper. This
patch doesn't change any functionality, nor it does add overhead.

Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>


# 5b11e15f 24-Apr-2015 Greg Kurz <groug@kaod.org>

macvtap: introduce macvtap_is_little_endian() helper

Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>


# 7f460d30 13-May-2015 Justin Cormack <justin@myriabit.com>

fix missing copy_from_user in macvtap

Fix missing copy_from_user in macvtap SIOCSIFHWADDR ioctl.

Signed-off-by: Justin Cormack <justin@netbsd.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b5082083 11-May-2015 Justin Cormack <justin@myriabit.com>

macvtap add missing ioctls - fix wrapping

The macvtap driver tries to emulate all the ioctls supported by a normal
tun/tap driver, however it was missing the generic SIOCGIFHWADDR and
SIOCSIFHWADDR ioctls to get and set the mac address that are supported
by tun/tap. This patch adds these.

Signed-off-by: Justin Cormack <justin@netbsd.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 11aa9c28 08-May-2015 Eric W. Biederman <ebiederm@xmission.com>

net: Pass kern from net_proto_family.create to sk_alloc

In preparation for changing how struct net is refcounted
on kernel sockets pass the knowledge that we are creating
a kernel socket from sock_create_kern through to sk_alloc.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8b86a61d 17-Apr-2015 Johannes Berg <johannes.berg@intel.com>

net: remove unused 'dev' argument from netif_needs_gso()

In commit 04ffcb255f22 ("net: Add ndo_gso_check") Tom originally
added the 'dev' argument to be able to call ndo_gso_check().

Then later, when generalizing this in commit 5f35227ea34b
("net: Generalize ndo_gso_check to ndo_features_check")
Jesse removed the call to ndo_gso_check() in netif_needs_gso()
by calling the new ndo_features_check() in a different place.
This made the 'dev' argument unused.

Remove the unused argument and go back to the code as before.

Cc: Tom Herbert <therbert@google.com>
Cc: Jesse Gross <jesse@nicira.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5d5d5689 03-Apr-2015 Al Viro <viro@zeniv.linux.org.uk>

make new_sync_{read,write}() static

All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1b784140 02-Mar-2015 Ying Xue <ying.xue@windriver.com>

net: Remove iocb argument from sendmsg and recvmsg

After TIPC doesn't depend on iocb argument in its internal
implementations of sendmsg() and recvmsg() hooks defined in proto
structure, no any user is using iocb argument in them at all now.
Then we can drop the redundant iocb argument completely from kinds of
implementations of both sendmsg() and recvmsg() in the entire
networking stack.

Cc: Christoph Hellwig <hch@lst.de>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2f1d8b9e 27-Feb-2015 Eric Dumazet <edumazet@google.com>

macvtap: make sure neighbour code can push ethernet header

Brian reported crashes using IPv6 traffic with macvtap/veth combo.

I tracked the crashes in neigh_hh_output()

-> memcpy(skb->data - HH_DATA_MOD, hh->hh_data, HH_DATA_MOD);

Neighbour code assumes headroom to push Ethernet header is
at least 16 bytes.

It appears macvtap has only 14 bytes available on arches
where NET_IP_ALIGN is 0 (like x86)

Effect is a corruption of 2 bytes right before skb->head,
and possible crashes if accessing non existing memory.

This fix should also increase IPv4 performance, as paranoid code
in ip_finish_output2() wont have to call skb_realloc_headroom()

Reported-by: Brian Rak <brak@vultr.com>
Tested-by: Brian Rak <brak@vultr.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e3e3c423 03-Feb-2015 Vlad Yasevich <vyasevich@gmail.com>

Revert "drivers/net: Disable UFO through virtio"

This reverts commit 3d0ad09412ffe00c9afa201d01effdb6023d09b4.

Now that GSO functionality can correctly track if the fragment
id has been selected and select a fragment id if necessary,
we can re-enable UFO on tap/macvap and virtio devices.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 72f65107 03-Feb-2015 Vlad Yasevich <vyasevich@gmail.com>

Revert "drivers/net, ipv6: Select IPv6 fragment idents for virtio UFO packets"

This reverts commit 5188cd44c55db3e92cd9e77a40b5baa7ed4340f7.

Now that GSO layer can track if fragment id has been selected
and can allocate one if necessary, we don't need to do this in
tap and macvtap. This reverts most of the code and only keeps
the new ipv6 fragment id generation function that is still needed.

Fixes: 3d0ad09412ff (drivers/net: Disable UFO through virtio)
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# df8a39de 13-Jan-2015 Jiri Pirko <jiri@resnulli.us>

net: rename vlan_tx_* helpers since "tx" is misleading there

The same macros are used for rx as well. So rename it.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 01b07fb3 16-Dec-2014 Michael S. Tsirkin <mst@redhat.com>

macvtap: drop broken IFF_VNET_LE

Use TUNSETVNETLE/TUNGETVNETLE instead.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 39ec7de7 16-Dec-2014 Michael S. Tsirkin <mst@redhat.com>

macvtap: fix uninitialized access on TUNSETIFF

flags field in ifreq is only 16 bit wide, but
we read it as a 32 bit value.
If userspace doesn't zero-initialize unused fields,
this will lead to failures.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c0371da6 24-Nov-2014 Al Viro <viro@zeniv.linux.org.uk>

put iov_iter into msghdr

Note that the code _using_ ->msg_iter at that point will be very
unhappy with anything other than unshifted iovec-backed iov_iter.
We still need to convert users to proper primitives.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6ae7feb3 23-Nov-2014 Michael S. Tsirkin <mst@redhat.com>

macvtap: TUN_VNET_LE support

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>


# f51a5e82 01-Dec-2014 Jason Wang <jasowang@redhat.com>

tun/macvtap: use consume_skb() instead of kfree_skb() when needed

To be more friendly with drop monitor, we should only call kfree_skb() when
the packets were dropped and use consume_skb() in other cases.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f5ff53b4 19-Jun-2014 Al Viro <viro@zeniv.linux.org.uk>

{macvtap,tun}_get_user(): switch to iov_iter

allows to switch macvtap and tun from ->aio_write() to ->write_iter()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 3af0bfe5 07-Nov-2014 Al Viro <viro@zeniv.linux.org.uk>

switch macvtap to ->read_iter()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7cc76f51 20-Nov-2014 Jason Wang <jasowang@redhat.com>

macvtap: advance iov iterator when needed in macvtap_put_user()

When mergeable buffer is used, vnet_hdr_sz is greater than sizeof struct
virtio_net_hdr. So we need advance the iov iterators in this case.

Fixes 6c36d2e26cda1ad3e2c4b90dd843825fc62fe5b4 ("macvtap: Use iovec iterators")
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6c36d2e2 07-Nov-2014 Herbert Xu <herbert@gondor.apana.org.au>

macvtap: Use iovec iterators

This patch removes the use of skb_copy_datagram_const_iovec in
favour of the iovec iterator-based skb_copy_datagram_iter.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3ce9b20f 02-Nov-2014 Herbert Xu <herbert@gondor.apana.org.au>

macvtap: Fix csum_start when VLAN tags are present

When VLAN is in use in macvtap_put_user, we end up setting
csum_start to the wrong place. The result is that the whoever
ends up doing the checksum setting will corrupt the packet instead
of writing the checksum to the expected location, usually this
means writing the checksum with an offset of -4.

This patch fixes this by adjusting csum_start when VLAN tags are
detected.

Fixes: f09e2249c4f5 ("macvtap: restore vlan header on user read")
Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Cheers,
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5188cd44 30-Oct-2014 Ben Hutchings <ben@decadent.org.uk>

drivers/net, ipv6: Select IPv6 fragment idents for virtio UFO packets

UFO is now disabled on all drivers that work with virtio net headers,
but userland may try to send UFO/IPv6 packets anyway. Instead of
sending with ID=0, we should select identifiers on their behalf (as we
used to).

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Fixes: 916e4cf46d02 ("ipv6: reuse ip6_frag_id from ip6_ufo_append_data")
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3d0ad094 30-Oct-2014 Ben Hutchings <ben@decadent.org.uk>

drivers/net: Disable UFO through virtio

IPv6 does not allow fragmentation by routers, so there is no
fragmentation ID in the fixed header. UFO for IPv6 requires the ID to
be passed separately, but there is no provision for this in the virtio
net protocol.

Until recently our software implementation of UFO/IPv6 generated a new
ID, but this was a bug. Now we will use ID=0 for any UFO/IPv6 packet
passed through a tap, which is even worse.

Unfortunately there is no distinction between UFO/IPv4 and v6
features, so disable UFO on taps and virtio_net completely until we
have a proper solution.

We cannot depend on VM managers respecting the tap feature flags, so
keep accepting UFO packets but log a warning the first time we do
this.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Fixes: 916e4cf46d02 ("ipv6: reuse ip6_frag_id from ip6_ufo_append_data")
Signed-off-by: David S. Miller <davem@davemloft.net>


# 04ffcb25 14-Oct-2014 Tom Herbert <therbert@google.com>

net: Add ndo_gso_check

Add ndo_gso_check which a device can define to indicate whether is
is capable of doing GSO on a packet. This funciton would be called from
the stack to determine whether software GSO is needed to be done. A
driver should populate this function if it advertises GSO types for
which there are combinations that it wouldn't be able to handle. For
instance a device that performs UDP tunneling might only implement
support for transparent Ethernet bridging type of inner packets
or might have limitations on lengths of inner headers.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 40b8fe45 22-Sep-2014 Vlad Yasevich <vyasevich@gmail.com>

macvtap: Fix race between device delete and open.

In macvtap device delete and open calls can race and
this causes a list curruption of the vlan queue_list.

The race intself is triggered by the idr accessors
that located the vlan device. The device is stored
into and removed from the idr under both an rtnl and
a mutex. However, when attempting to locate the device
in idr, only a mutex is taken. As a result, once cpu
perfoming a delete may take an rtnl and wait for the mutex,
while another cput doing an open() will take the idr
mutex first to fetch the device pointer and later take
an rtnl to add a queue for the device which may have
just gotten deleted.

With this patch, we now hold the rtnl for the duration
of the macvtap_open() call thus making sure that
open will not race with delete.

CC: Michael S. Tsirkin <mst@redhat.com>
CC: Jason Wang <jasowang@redhat.com>
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cbdb0427 29-Apr-2014 Vlad Yasevich <vyasevic@redhat.com>

mactap: Fix checksum errors for non-gso packets in bridge mode

The following is a problematic configuration:

VM1: virtio-net device connected to macvtap0@eth0
VM2: e1000 device connect to macvtap1@eth0

The problem is is that virtio-net supports checksum offloading
and thus sends the packets to the host with CHECKSUM_PARTIAL set.
On the other hand, e1000 does not support any acceleration.

For small TCP packets (and this includes the 3-way handshake),
e1000 ends up receiving packets that only have a partial checksum
set. This causes TCP to fail checksum validation and to drop
packets. As a result tcp connections can not be established.

Commit 3e4f8b787370978733ca6cae452720a4f0c296b8
macvtap: Perform GSO on forwarding path.
fixes this issue for large packets wthat will end up undergoing GSO.
This commit adds a check for the non-GSO case and attempts to
compute the checksum for partially checksummed packets in the
non-GSO case.

CC: Daniel Lezcano <daniel.lezcano@free.fr>
CC: Patrick McHardy <kaber@trash.net>
CC: Andrian Nord <nightnord@gmail.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Michael S. Tsirkin <mst@redhat.com>
CC: Jason Wang <jasowang@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a81ab36b 08-Jan-2014 Paul Gortmaker <paul.gortmaker@windriver.com>

drivers/net: delete non-required instances of include <linux/init.h>

None of these files are actually using any __init type directives
and hence don't need to include <linux/init.h>. Most are just a
left over from __devinit and __cpuinit removal, or simply due to
code getting copied from one driver to the next.

This covers everything under drivers/net except for wireless, which
has been submitted separately.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3958afa1b 15-Dec-2013 Tom Herbert <therbert@google.com>

net: Change skb_get_rxhash to skb_get_hash

Changing name of function as part of making the hash in skbuff to be
generic property, not just for receive path.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2f6a1b66 11-Dec-2013 Vlad Yasevich <vyasevic@redhat.com>

macvlan: Remove custom recieve and forward handlers

Since now macvlan and macvtap use the same receive and
forward handlers, we can remove them completely and use
netif_rx and dev_forward_skb() directly.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6acf54f1 11-Dec-2013 Vlad Yasevich <vyasevic@redhat.com>

macvtap: Add support of packet capture on macvtap device.

Macvtap device currently doesn not allow a user to capture
traffic on due to the fact that it steals the packets
from the network stack before the skb->dev is set correctly
on the receive side, and that use uses macvlan transmit
path directly on the send side. As a result, we never
get a change to give traffic to the taps while the correct
device is set in the skb.

This patch makes macvtap device behave almost exaclty like
macvlan. On the send side, we switch to using dev_queue_xmit().
On the receive side, to deliver packets to macvtap, we now
use netif_rx and dev_forward_skb just like macvlan. The only
differnce now is that macvtap has its own rx_handler which is
attached to the macvtap netdev. It is here that we now steal
the packet and provide it to the socket.

As a result, we can now capture traffic on the macvtap device:
tcpdump -i macvtap0

It also gives us the abilit to add tc actions to the macvtap
device and actually utilize different bandwidth management
queues on output.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ce232ce0 10-Dec-2013 Jason Wang <jasowang@redhat.com>

macvtap: signal truncated packets

macvtap_put_user() never return a value grater than iov length, this in fact
bypasses the truncated checking in macvtap_recvmsg(). Fix this by always
returning the size of packet plus the possible vlan header to let the trunca
checking work.

Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bbd37626 10-Dec-2013 David S. Miller <davem@davemloft.net>

net: Revert macvtap/tun truncation signalling changes.

Jason Wang and Michael S. Tsirkin are still discussing how
to properly fix this.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 730054da 09-Dec-2013 Jason Wang <jasowang@redhat.com>

macvtap: signal truncated packets

macvtap_put_user() never return a value grater than iov length, this in fact
bypasses the truncated checking in macvtap_recvmsg(). Fix this by always
returning the size of packet plus the possible vlan header to let the truncated
checking work.

Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# de2aa476 10-Dec-2013 David S. Miller <davem@davemloft.net>

Revert "macvtap: remove useless codes in macvtap_aio_read() and macvtap_recvmsg()"

This reverts commit 41e4af69a5984a3193ba3108fb4e067b0e34dc73.

MSG_TRUNC handling was broken and is going to be fixed in the
'net' tree, so revert this.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 41e4af69 06-Dec-2013 Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

macvtap: remove useless codes in macvtap_aio_read() and macvtap_recvmsg()

By checking related codes, it is impossible that ret > len or total_len,
so we should remove some useless coeds in both above functions.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 55ec8e25 06-Dec-2013 Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

macvtap: remove unused parameter in macvtap_do_read()

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 359d44d7 06-Dec-2013 Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

macvtap: remove the dead branch

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e6ebc7f1 05-Dec-2013 Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

macvtap: update file current position

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 006da7b0 25-Nov-2013 Vlad Yasevich <vyasevic@redhat.com>

macvtap: Do not double-count received packets

Currently macvlan will count received packets after calling each
vlans receive handler. Macvtap attempts to count the packet
yet again when the user reads the packet from the tap socket.
This code doesn't do this consistently either. Remove the
counting from macvtap and let only macvlan count received
packets.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cd3e22b7 25-Nov-2013 Jason Wang <jasowang@redhat.com>

macvtap: fix tx_dropped counting error

After commit 8ffab51b3dfc54876f145f15b351c41f3f703195
(macvlan: lockless tx path), tx stat counter were converted to percpu stat
structure. So we need use to this also for tx_dropped in macvtap. Otherwise, the
management won't notice the dropping packet in macvtap tx path.

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Vlad Yasevich <vyasevic@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 16a3fa28 12-Nov-2013 Jason Wang <jasowang@redhat.com>

macvtap: limit head length of skb allocated

We currently use hdr_len as a hint of head length which is advertised by
guest. But when guest advertise a very big value, it can lead to an 64K+
allocating of kmalloc() which has a very high possibility of failure when host
memory is fragmented or under heavy stress. The huge hdr_len also reduce the
effect of zerocopy or even disable if a gso skb is linearized in guest.

To solves those issues, this patch introduces an upper limit (PAGE_SIZE) of the
head, which guarantees an order 0 allocation each time.

Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e5733321 16-Aug-2013 Vlad Yasevich <vyasevic@redhat.com>

macvtap: Ignore tap features when VNET_HDR is off

When the user turns off VNET_HDR support on the
macvtap device, there is no way to provide any
offload information to the user. So, it's safer
to ignore offload setting then depend on the user
setting them correctly.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e558b018 16-Aug-2013 Vlad Yasevich <vyasevic@redhat.com>

macvtap: Correctly set tap features when IFF_VNET_HDR is disabled.

When the user turns off IFF_VNET_HDR flag, attempts to change
offload features via TUNSETOFFLOAD do not work. This could cause
GSO packets to be delivered to the user when the user is
not prepared to handle them.

To solve, allow processing of TUNSETOFFLOAD when IFF_VNET_HDR is
disabled.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a567dd62 16-Aug-2013 Vlad Yasevich <vyasevic@redhat.com>

macvtap: simplify usage of tap_features

In macvtap, tap_features specific the features of that the user
has specified via ioctl(). If we treat macvtap as a macvlan+tap
then we could all the tap a pseudo-device and give it other features
like SG and GSO. Then we can stop using the features of lower
device (macvlan) when forwarding the traffic the tap.

This solves the issue of possible checksum offload mismatch between
tap feature and macvlan features.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 29d79196 08-Aug-2013 Eric Dumazet <edumazet@google.com>

macvtap: fix two races

Since commit ac4e4af1e59e1 ("macvtap: Consistently use rcu functions"),
Thomas gets two different warnings :

BUG: using smp_processor_id() in preemptible [00000000] code: vhost-45891/45892
caller is macvtap_do_read+0x45c/0x600 [macvtap]
CPU: 1 PID: 45892 Comm: vhost-45891 Not tainted 3.11.0-bisecttest #13
Call Trace:
([<00000000001126ee>] show_trace+0x126/0x144)
[<00000000001127d2>] show_stack+0xc6/0xd4
[<000000000068bcec>] dump_stack+0x74/0xd8
[<0000000000481066>] debug_smp_processor_id+0xf6/0x114
[<000003ff802e9a18>] macvtap_do_read+0x45c/0x600 [macvtap]
[<000003ff802e9c1c>] macvtap_recvmsg+0x60/0x88 [macvtap]
[<000003ff80318c5e>] handle_rx+0x5b2/0x800 [vhost_net]
[<000003ff8028f77c>] vhost_worker+0x15c/0x1c4 [vhost]
[<000000000015f3ac>] kthread+0xd8/0xe4
[<00000000006934a6>] kernel_thread_starter+0x6/0xc
[<00000000006934a0>] kernel_thread_starter+0x0/0xc

And

BUG: using smp_processor_id() in preemptible [00000000] code: vhost-45897/45898
caller is macvlan_start_xmit+0x10a/0x1b4 [macvlan]
CPU: 1 PID: 45898 Comm: vhost-45897 Not tainted 3.11.0-bisecttest #16
Call Trace:
([<00000000001126ee>] show_trace+0x126/0x144)
[<00000000001127d2>] show_stack+0xc6/0xd4
[<000000000068bdb8>] dump_stack+0x74/0xd4
[<0000000000481132>] debug_smp_processor_id+0xf6/0x114
[<000003ff802b72ca>] macvlan_start_xmit+0x10a/0x1b4 [macvlan]
[<000003ff802ea69a>] macvtap_get_user+0x982/0xbc4 [macvtap]
[<000003ff802ea92a>] macvtap_sendmsg+0x4e/0x60 [macvtap]
[<000003ff8031947c>] handle_tx+0x494/0x5ec [vhost_net]
[<000003ff8028f77c>] vhost_worker+0x15c/0x1c4 [vhost]
[<000000000015f3ac>] kthread+0xd8/0xe4
[<000000000069356e>] kernel_thread_starter+0x6/0xc
[<0000000000693568>] kernel_thread_starter+0x0/0xc
2 locks held by vhost-45897/45898:
#0: (&vq->mutex){+.+.+.}, at: [<000003ff8031903c>] handle_tx+0x54/0x5ec [vhost_net]
#1: (rcu_read_lock){.+.+..}, at: [<000003ff802ea53c>] macvtap_get_user+0x824/0xbc4 [macvtap]

In the first case, macvtap_put_user() calls macvlan_count_rx()
in a preempt-able context, and this is not allowed.

In the second case, macvtap_get_user() calls
macvlan_start_xmit() with BH enabled, and this is not allowed.

Reported-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Bisected-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Cc: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 28d64271 08-Aug-2013 Eric Dumazet <edumazet@google.com>

net: attempt high order allocations in sock_alloc_send_pskb()

Adding paged frags skbs to af_unix sockets introduced a performance
regression on large sends because of additional page allocations, even
if each skb could carry at least 100% more payload than before.

We can instruct sock_alloc_send_pskb() to attempt high order
allocations.

Most of the time, it does a single page allocation instead of 8.

I added an additional parameter to sock_alloc_send_pskb() to
let other users to opt-in for this new feature on followup patches.

Tested:

Before patch :

$ netperf -t STREAM_STREAM
STREAM STREAM TEST
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

2304 212992 212992 10.00 46861.15

After patch :

$ netperf -t STREAM_STREAM
STREAM STREAM TEST
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

2304 212992 212992 10.00 57981.11

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c3bdeb5c 06-Aug-2013 Jason Wang <jasowang@redhat.com>

net: move zerocopy_sg_from_iovec() to net/core/datagram.c

To let it be reused and reduce code duplication. Also document this function.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b4bf0777 06-Aug-2013 Jason Wang <jasowang@redhat.com>

net: move iov_pages() to net/core/iovec.c

To let it be reused and reduce code duplication.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ece793fc 17-Jul-2013 Jason Wang <jasowang@redhat.com>

macvtap: do not zerocopy if iov needs more pages than MAX_SKB_FRAGS

We try to linearize part of the skb when the number of iov is greater than
MAX_SKB_FRAGS. This is not enough since each single vector may occupy more than
one pages, so zerocopy_sg_fromiovec() may still fail and may break the guest
network.

Solve this problem by calculate the pages needed for iov before trying to do
zerocopy and switch to use copy instead of zerocopy if it needs more than
MAX_SKB_FRAGS.

This is done through introducing a new helper to count the pages for iov, and
call uarg->callback() manually when switching from zerocopy to copy to notify
vhost.

We can do further optimization on top.

This bug were introduced from b92946e2919134ebe2a4083e4302236295ea2a73
(macvtap: zerocopy: validate vectors before building skb).

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0fbe0d47 15-Jul-2013 Jason Wang <jasowang@redhat.com>

macvtap: do not assume 802.1Q when send vlan packets

The hard-coded 8021.q proto will break 802.1ad traffic. So switch to use
vlan->proto.

Cc: Basil Gor <basil.gor@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 82a19eb8 15-Jul-2013 Jason Wang <jasowang@redhat.com>

macvtap: fix the missing ret value of TUNSETQUEUE

Commit 441ac0fcaadc76ad09771812382345001dd2b813
(macvtap: Convert to using rtnl lock) forget to return what
macvtap_ioctl_set_queue() returns to its caller. This may break multiqueue API
by always falling through to TUNGETFEATURES.

Cc: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 61d46bf9 09-Jul-2013 Jason Wang <jasowang@redhat.com>

macvtap: correctly linearize skb when zerocopy is used

Userspace may produce vectors greater than MAX_SKB_FRAGS. When we try to
linearize parts of the skb to let the rest of iov to be fit in
the frags, we need count copylen into linear when calling macvtap_alloc_skb()
instead of partly counting it into data_len. Since this breaks
zerocopy_sg_from_iovec() since its inner counter assumes nr_frags should
be zero at beginning. This cause nr_frags to be increased wrongly without
setting the correct frags.

This bug were introduced from b92946e2919134ebe2a4083e4302236295ea2a73
(macvtap: zerocopy: validate vectors before building skb).

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3e4f8b78 25-Jun-2013 Vlad Yasevich <vyasevic@redhat.com>

macvtap: Perform GSO on forwarding path.

When macvtap forwards skb to its tap, it needs to check
if GSO needs to be performed. This is sometimes necessary
when the HW device performed GRO, but the guest reading
from the tap does not support it (ex: Windows 7).

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2be5c767 25-Jun-2013 Vlad Yasevich <vyasevic@redhat.com>

macvtap: Let TUNSETOFFLOAD actually controll offload features.

When the user issues TUNSETOFFLOAD ioctl, macvtap does not do
anything other then to verify arguments. This patch adds
functionality to allow users to actually control offload features.
NETIF_F_GSO and NETIF_F_GRO are always on, but the rest of the
features can be controlled.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ac4e4af1 25-Jun-2013 Vlad Yasevich <vyasevic@redhat.com>

macvtap: Consistently use rcu functions

Currently macvtap uses rcu_bh functions in its
user facing fuction macvtap_get_user() and macvtap_put_user().
However, its packet handlers use normal rcu as the rcu_read_lock()
is taken in netif_receive_skb(). We can safely discontinue
the usage or rcu with bh disabled.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 441ac0fc 25-Jun-2013 Vlad Yasevich <vyasevic@redhat.com>

macvtap: Convert to using rtnl lock

Macvtap uses a private lock to protect the relationship between
macvtap_queue and macvlan_dev. The private lock is not needed
since the relationship is managed by user via open(), release(),
and dellink() calls. dellink() already happens under rtnl, so
we can safely convert open() and release(), and use it in ioctl()
as well.

Suggested by Eric Dumazet.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4c7ab054 23-Jun-2013 Michael S. Tsirkin <mst@redhat.com>

macvtap: fix recovery from gup errors

get user pages might fail partially in macvtap zero copy
mode. To recover we need to put all pages that we got,
but code used a wrong index resulting in double-free
errors.

Reported-by: Brad Hubbard <bhubbard@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f57855a5 13-Jun-2013 Jason Wang <jasowang@redhat.com>

macvtap: fix uninitialized return value macvtap_ioctl_set_queue()

Return -EINVAL on illegal flag instead of uninitialized value. This fixes the
kbuild test warning.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d9a90a31 13-Jun-2013 Jason Wang <jasowang@redhat.com>

macvtap: slient sparse warnings

This patch silents the following sparse warnings:

drivers/net/macvtap.c:98:9: warning: incorrect type in assignment (different
address spaces)
drivers/net/macvtap.c:98:9: expected struct macvtap_queue *<noident>
drivers/net/macvtap.c:98:9: got struct macvtap_queue [noderef]
<asn:4>*<noident>
drivers/net/macvtap.c:120:9: warning: incorrect type in assignment (different
address spaces)
drivers/net/macvtap.c:120:9: expected struct macvtap_queue *<noident>
drivers/net/macvtap.c:120:9: got struct macvtap_queue [noderef]
<asn:4>*<noident>
drivers/net/macvtap.c:151:22: error: incompatible types in comparison expression
(different address spaces)
drivers/net/macvtap.c:233:23: error: incompatible types in comparison expression
(different address spaces)
drivers/net/macvtap.c:243:23: error: incompatible types in comparison expression
(different address spaces)
drivers/net/macvtap.c:247:15: error: incompatible types in comparison expression
(different address spaces)
CC [M] drivers/net/macvtap.o
drivers/net/macvlan.c:232:24: error: incompatible types in comparison expression
(different address spaces)

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# df09b36f 05-Jun-2013 Jason Wang <jasowang@redhat.com>

macvtap: enable multiqueue flag

To notify the userspace about our capability of multiqueue.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 815f236d 05-Jun-2013 Jason Wang <jasowang@redhat.com>

macvtap: add TUNSETQUEUE ioctl

This patch adds TUNSETQUEUE ioctl to let userspace can temporarily disable or
enable a queue of macvtap. This is used to be compatible at API layer of tuntap
to simplify the userspace to manage the queues. This is done through introducing
a linked list to track all taps while using vlan->taps array to only track
active taps.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 376b1aab 05-Jun-2013 Jason Wang <jasowang@redhat.com>

macvtap: eliminate linear search

Linear search were used in both get_slot() and macvtap_get_queue(), this is
because:

- macvtap didn't reshuffle the array of taps when create or destroy a queue, so
when adding a new queue, macvtap must do linear search to find a location for
the new queue. This will also complicate the TUNSETQUEUE implementation for
multiqueue API.
- the queue itself didn't track the queue index, so the we must do a linear
search in the array to find the location of a existed queue.

The solution is straightforward: reshuffle the array and introduce a queue_index
to macvtap_queue.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8f475a31 05-Jun-2013 Jason Wang <jasowang@redhat.com>

macvtap: introduce macvtap_get_vlan()

Factor out the device holding logic to a macvtap_get_vlan(), this will be also
used by multiqueue API.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 89cee917 05-Jun-2013 Jason Wang <jasowang@redhat.com>

macvtap: do not add self to waitqueue if doing a nonblock read

There's no need to add self to waitqueue if doing a nonblock read. This could
help to avoid the spinlock contention.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ed0483fa 05-Jun-2013 Jason Wang <jasowang@redhat.com>

macvtap: fix a possible race between queue selection and changing queues

Complier may generate codes that re-read the vlan->numvtaps during
macvtap_get_queue(). This may lead a race if vlan->numvtaps were changed in the
same time and which can lead unexpected result (e.g. very huge value).

We need prevent the compiler from generating such codes by adding an
ACCESS_ONCE() to make sure vlan->numvtaps were only read once.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 351638e7 27-May-2013 Jiri Pirko <jiri@resnulli.us>

net: pass info struct via netdevice notifier

So far, only net_device * could be passed along with netdevice notifier
event. This patch provides a possibility to pass custom structure
able to provide info that event listener needs to know.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>

v2->v3: fix typo on simeth
shortened dev_getter
shortened notifier_info struct name
v1->v2: fix notifier_call parameter in call_netdevice_notifier()
Signed-off-by: David S. Miller <davem@davemloft.net>


# 40893fd0 26-Mar-2013 Jason Wang <jasowang@redhat.com>

net: switch to use skb_probe_transport_header()

Switch to use the new help skb_probe_transport_header() to do the l4 header
probing for untrusted sources. For packets with partial csum, the header should
already been set by skb_partial_csum_set().

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9b4d669b 25-Mar-2013 Jason Wang <jasowang@redhat.com>

macvtap: set transport header before passing skb to lower device

Set the transport header for 1) some drivers (e.g ixgbe) needs l4 header 2)
precise packet length estimation (introduced in 1def9238) needs l4 header to
compute header length.

For the packets with partial checksum, the patch just set the transport header
to csum_start. Otherwise tries to use skb_flow_dissect() to get l4 offset, if it
fails, just pretend no l4 header.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ec09ebc1 27-Feb-2013 Tejun Heo <tj@kernel.org>

macvtap: convert to idr_alloc()

Convert to the much saner new idr interface.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c9af6db4 11-Feb-2013 Pravin B Shelar <pshelar@nicira.com>

net: Fix possible wrong checksum generation.

Patch cef401de7be8c4e (net: fix possible wrong checksum
generation) fixed wrong checksum calculation but it broke TSO by
defining new GSO type but not a netdev feature for that type.
net_gso_ok() would not allow hardware checksum/segmentation
offload of such packets without the feature.

Following patch fixes TSO and wrong checksum. This patch uses
same logic that Eric Dumazet used. Patch introduces new flag
SKBTX_SHARED_FRAG if at least one frag can be modified by
the user. but SKBTX_SHARED_FRAG flag is kept in skb shared
info tx_flags rather than gso_type.

tx_flags is better compared to gso_type since we can have skb with
shared frag without gso packet. It does not link SHARED_FRAG to
GSO, So there is no need to define netdev feature for this.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cef401de 25-Jan-2013 Eric Dumazet <edumazet@google.com>

net: fix possible wrong checksum generation

Pravin Shelar mentioned that GSO could potentially generate
wrong TX checksum if skb has fragments that are overwritten
by the user between the checksum computation and transmit.

He suggested to linearize skbs but this extra copy can be
avoided for normal tcp skbs cooked by tcp_sendmsg().

This patch introduces a new SKB_GSO_SHARED_FRAG flag, set
in skb_shinfo(skb)->gso_type if at least one frag can be
modified by the user.

Typical sources of such possible overwrites are {vm}splice(),
sendfile(), and macvtap/tun/virtio_net drivers.

Tested:

$ netperf -H 7.7.8.84
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
7.7.8.84 () port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

87380 16384 16384 10.00 3959.52

$ netperf -H 7.7.8.84 -t TCP_SENDFILE
TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 ()
port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

87380 16384 16384 10.00 3216.80

Performance of the SENDFILE is impacted by the extra allocation and
copy, and because we use order-0 pages, while the TCP_STREAM uses
bigger pages.

Reported-by: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3a7f8c34 11-Aug-2012 Denis Efremov <yefremov.denis@gmail.com>

macvtap: rcu_dereference outside read-lock section

rcu_dereference occurs in update section. Replacement by
rcu_dereference_protected in order to prevent lockdep
complaint.

Found by Linux Driver Verification project (linuxtesting.org)

Signed-off-by: Denis Efremov <yefremov.denis@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ccf7e72b 06-Jun-2012 Hong zhi guo <honkiko@gmail.com>

macvtap: use prepare_to_wait/finish_wait to ensure mb

instead of raw assignment to current->state

Signed-off-by: Hong Zhiguo <honkiko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f09e2249 03-May-2012 Basil Gor <basil.gor@gmail.com>

macvtap: restore vlan header on user read

Ethernet vlan header is not on the packet and kept in the skb->vlan_tci
when it comes from lower dev. This patch inserts vlan header in user
buffer during skb copy on user read.

Signed-off-by: Basil Gor <basil.gor@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b92946e2 01-May-2012 Jason Wang <jasowang@redhat.com>

macvtap: zerocopy: validate vectors before building skb

There're several reasons that the vectors need to be validated:

- Return error when caller provides vectors whose num is greater than UIO_MAXIOV.
- Linearize part of skb when userspace provides vectors grater than MAX_SKB_FRAGS.
- Return error when userspace provides vectors whose total length may exceed
- MAX_SKB_FRAGS * PAGE_SIZE.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 01d6657b 01-May-2012 Jason Wang <jasowang@redhat.com>

macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully

Current the SKBTX_DEV_ZEROCOPY is set unconditionally after
zerocopy_sg_from_iovec(), this would lead NULL pointer when macvtap
fails to build zerocopy skb because destructor_arg was not
initialized. Solve this by set this flag after the skb were built
successfully.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 02ce04bb 01-May-2012 Jason Wang <jasowang@redhat.com>

macvtap: zerocopy: put page when fail to get all requested user pages

When get_user_pages_fast() fails to get all requested pages, we could not use
kfree_skb() to free it as it has not been put in the skb fragments. So we need
to call put_page() instead.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 4ef67ebe 01-May-2012 Jason Wang <jasowang@redhat.com>

macvtap: zerocopy: fix truesize underestimation

As the skb fragment were pinned/built from user pages, we should
account the page instead of length for truesize.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 3afc9621 01-May-2012 Jason Wang <jasowang@redhat.com>

macvtap: zerocopy: fix offset calculation when building skb

This patch fixes the offset calculation when building skb:

- offset1 were used as skb data offset not vector offset
- reset offset to zero only when we advance to next vector

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 40401530 12-Feb-2012 Al Viro <viro@ftp.linux.org.uk>

security: trim security.h

Trim security.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: James Morris <jmorris@namei.org>


# ef0002b5 23-Nov-2011 Krishna Kumar <krkumar2@in.ibm.com>

macvtap: Fix macvtap_get_queue to use rxhash first

It was reported that the macvtap device selects a
Acked-by: Michael S. Tsirkin <mst@redhat.com>

Signed-off-by: David S. Miller <davem@davemloft.net>


# 2cfa5a04 23-Nov-2011 Eric Dumazet <eric.dumazet@gmail.com>

net: treewide use of RCU_INIT_POINTER

rcu_assign_pointer(ptr, NULL) can be safely replaced by
RCU_INIT_POINTER(ptr, NULL)

(old rcu_assign_pointer() macro was testing the NULL value and could
omit the smp_wmb(), but this had to be removed because of compiler
warnings)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e09eff7f 19-Oct-2011 Eric W. Biederman <ebiederm@xmission.com>

macvtap: Fix the minor device number allocation

On systems that create and delete lots of dynamic devices the
31bit linux ifindex fails to fit in the 16bit macvtap minor,
resulting in unusable macvtap devices. I have systems running
automated tests that that hit this condition in just a few days.

Use a linux idr allocator to track which mavtap minor numbers
are available and and to track the association between macvtap
minor numbers and macvtap network devices.

Remove the unnecessary unneccessary check to see if the network
device we have found is indeed a macvtap device. With macvtap
specific data structures it is impossible to find any other
kind of networking device.

Increase the macvtap minor range from 65536 to the full 20 bits
that is supported by linux device numbers. It doesn't solve the
original problem but there is no penalty for a larger minor
device range.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9bf1907f 19-Oct-2011 Eric W. Biederman <ebiederm@xmission.com>

macvtap: Rewrite macvtap_newlink so the error handling works.

Place macvlan_common_newlink at the end of macvtap_newlink because
failing in newlink after registering your network device is not
supported.

Move device_create into a netdevice creation notifier. The network device
notifier is the only hook that is called after the network device has been
registered with the device layer and before register_network_device returns
success.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2259fef0 19-Oct-2011 Eric W. Biederman <ebiederm@xmission.com>

macvtap: Don't leak unreceived packets when we delete a macvtap device.

To avoid leaking packets in the receive queue. Add a socket destructor
that will run whenever destroy a macvtap socket.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 047af9cf 19-Oct-2011 Eric W. Biederman <ebiederm@xmission.com>

macvtap: Fix macvtap_open races in the zero copy enable code.

To see if it is appropriate to enable the macvtap zero copy feature
don't test the lowerdev network device flags. Instead test the
macvtap network device flags which are a direct copy of the lowerdev
flags. This is important because nothing holds a reference to lowerdev
and on a very bad day we lowerdev could be a pointer to stale memory.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 99f34b38 19-Oct-2011 Eric W. Biederman <ebiederm@xmission.com>

macvtap: Close a race between macvtap_open and macvtap_dellink.

There is a small window in macvtap_open between looking up a
networking device and calling macvtap_set_queue in which
macvtap_del_queues called from macvtap_dellink. After
calling macvtap_del_queues it is totally incorrect to
allow macvtap_set_queue to proceed so prevent success by
reporting that all of the available queues are in use.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 653fc915 18-Sep-2011 Jason Wang <jasowang@redhat.com>

macvtap: fix the uninitialized var using in macvtap_alloc_skb()

Commit d1b08284 use new frag API but would leave f to be used
uninitialized, this patch fix it.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d1b08284 30-Aug-2011 Ian Campbell <Ian.Campbell@citrix.com>

macvtap: convert to SKB paged frag API.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>


# 97bc3633 05-Jul-2011 Shirley Ma <mashirle@us.ibm.com>

macvtap: macvtapTX zero-copy support

Only 128 bytes is copied, the rest of data is DMA mapped directly from
userspace.

Signed-off-by: Shirley Ma <xma@...ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 10a8d94a 09-Jun-2011 Jason Wang <jasowang@redhat.com>

virtio_net: introduce VIRTIO_NET_HDR_F_DATA_VALID

There's no need for the guest to validate the checksum if it have been
validated by host nics. So this patch introduces a new flag -
VIRTIO_NET_HDR_F_DATA_VALID which is used to bypass the checksum
examing in guest. The backend (tap/macvtap) may set this flag when
met skbs with CHECKSUM_UNNECESSARY to save cpu utilization.

No feature negotiation is needed as old driver just ignore this flag.

Iperf shows 12%-30% performance improvement for UDP traffic. For TCP,
when gro is on no difference as it produces skb with partial
checksum. But when gro is disabled, 20% or even higher improvement
could be measured by netperf.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ce3c8692 04-Mar-2011 Nicolas Kaiser <nikai@nikai.net>

drivers/net/macvtap: fix error check

'len' is unsigned of type size_t and can't be negative.

Signed-off-by: Nicolas Kaiser <nikai@nikai.net>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 13707f9e 26-Jan-2011 Eric Dumazet <eric.dumazet@gmail.com>

drivers/net: remove some rcu sparse warnings

Add missing __rcu annotations and helpers.
minor : Fix some rcu_dereference() calls in macvtap

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1ac9ad13 11-Jan-2011 Eric Dumazet <eric.dumazet@gmail.com>

net: remove dev_txq_stats_fold()

After recent changes, (percpu stats on vlan/tunnels...), we dont need
anymore per struct netdev_queue tx_bytes/tx_packets/tx_dropped counters.

Only remaining users are ixgbe, sch_teql, gianfar & macvlan :

1) ixgbe can be converted to use existing tx_ring counters.

2) macvlan incremented txq->tx_dropped, it can use the
dev->stats.tx_dropped counter.

3) sch_teql : almost revert ab35cd4b8f42 (Use net_device internal stats)
Now we have ndo_get_stats64(), use it, even for "unsigned long"
fields (No need to bring back a struct net_device_stats)

4) gianfar adds a stats structure per tx queue to hold
tx_bytes/tx_packets

This removes a lockdep warning (and possible lockup) in rndis gadget,
calling dev_get_stats() from hard IRQ context.

Ref: http://www.spinics.net/lists/netdev/msg149202.html

Reported-by: Neil Jones <neiljay@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Sandeep Gopalpet <sandeep.kumar@freescale.com>
CC: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 55508d60 14-Dec-2010 Michał Mirosław <mirq-linux@rere.qmqm.pl>

net: Use skb_checksum_start_offset()

Replace skb->csum_start - skb_headroom(skb) with skb_checksum_start_offset().

Note for usb/smsc95xx: skb->data - skb->head == skb_headroom(skb).

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1565c7c1 04-Aug-2010 Krishna Kumar <krkumar2@in.ibm.com>

macvtap: Implement multiqueue for macvtap driver

Implement multiqueue facility for macvtap driver. The idea is that
a macvtap device can be opened multiple times and the fd's can be
used to register eg, as backend for vhost.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8a35747a 21-Jul-2010 Herbert Xu <herbert@gondor.apana.org.au>

macvtap: Limit packet queue length

Mark Wagner reported OOM symptoms when sending UDP traffic over
a macvtap link to a kvm receiver.

This appears to be caused by the fact that macvtap packet queues
are unlimited in length. This means that if the receiver can't
keep up with the rate of flow, then we will hit OOM. Of course
it gets worse if the OOM killer then decides to kill the receiver.

This patch imposes a cap on the packet queue length, in the same
way as the tuntap driver, using the device TX queue length.

Please note that macvtap currently has no way of giving congestion
notification, that means the software device TX queue cannot be
used and packets will always be dropped once the macvtap driver
queue fills up.

This shouldn't be a great problem for the scenario where macvtap
is used to feed a kvm receiver, as the traffic is most likely
external in origin so congestion notification can't be applied
anyway.

Of course, if anybody decides to complain about guest-to-guest
UDP packet loss down the track, then we may have to revisit this.

Incidentally, this patch also fixes a real memory leak when
macvtap_get_queue fails.

Chris Wright noticed that for this patch to work, we need a
non-zero TX queue length. This patch includes his work to change
the default macvtap TX queue length to 500.

Reported-by: Mark Wagner <mwagner@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1ebed71a 10-Jul-2010 David S. Miller <davem@davemloft.net>

macvtap: Use dev_t for macvtap_major.

Reported-by: "Robert P. J. Day" <rpjday@crashcourse.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 55afbd08 29-Apr-2010 Michael S. Tsirkin <mst@redhat.com>

macvtap: add ioctl to modify vnet header size

This adds TUNSETVNETHDRSZ/TUNGETVNETHDRSZ support
to macvtap.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David S. Miller <davem@davemloft.net>


# 43815482 29-Apr-2010 Eric Dumazet <eric.dumazet@gmail.com>

net: sock_def_readable() and friends RCU conversion

sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we
need two atomic operations (and associated dirtying) per incoming
packet.

RCU conversion is pretty much needed :

1) Add a new structure, called "struct socket_wq" to hold all fields
that will need rcu_read_lock() protection (currently: a
wait_queue_head_t and a struct fasync_struct pointer).

[Future patch will add a list anchor for wakeup coalescing]

2) Attach one of such structure to each "struct socket" created in
sock_alloc_inode().

3) Respect RCU grace period when freeing a "struct socket_wq"

4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct
socket_wq"

5) Change sk_sleep() function to use new sk->sk_wq instead of
sk->sk_sleep

6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside
a rcu_read_lock() section.

7) Change all sk_has_sleeper() callers to :
- Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock)
- Use wq_has_sleeper() to eventually wakeup tasks.
- Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock)

8) sock_wake_async() is modified to use rcu protection as well.

9) Exceptions :
macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq"
instead of dynamically allocated ones. They dont need rcu freeing.

Some cleanups or followups are probably needed, (possible
sk_callback_lock conversion to a spinlock for example...).

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4a4771a5 25-Apr-2010 Eric Dumazet <eric.dumazet@gmail.com>

net: use sk_sleep()

Commit aa395145 (net: sk_sleep() helper) missed three files in the
conversion.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# aa395145 20-Apr-2010 Eric Dumazet <eric.dumazet@gmail.com>

net: sk_sleep() helper

Define a new function to return the waitqueue of a "struct sock".

static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
return sk->sk_sleep;
}

Change all read occurrences of sk_sleep by a call to this function.

Needed for a future RCU conversion. sk_sleep wont be a field directly
available.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# b9fb9ee0 17-Feb-2010 Arnd Bergmann <arnd@arndb.de>

macvtap: add GSO/csum offload support

Added flags field to macvtap_queue to enable/disable processing of
virtio_net_hdr via IFF_VNET_HDR. This flag is checked to prepend virtio_net_hdr
in the receive path and process/skip virtio_net_hdr in the send path.

Original patch by Sridhar, further changes by Arnd.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 501c774c 17-Feb-2010 Arnd Bergmann <arnd@arndb.de>

net/macvtap: add vhost support

This adds support for passing a macvtap file descriptor into
vhost-net, much like we already do for tun/tap.

Most of the new code is taken from the respective patch
in the tun driver and may get consolidated in the future.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 02df55d2 17-Feb-2010 Arnd Bergmann <arnd@arndb.de>

macvtap: rework object lifetime rules

This reworks the change done by the previous patch
in a more complete way.

The original macvtap code has a number of problems
resulting from the use of RCU for protecting the
access to struct macvtap_queue from open files.

This includes
- need for GFP_ATOMIC allocations for skbs
- potential deadlocks when copy_*_user sleeps
- inability to work with vhost-net

Changing the lifetime of macvtap_queue to always
depend on the open file solves all these. The
RCU reference simply moves one step down to
the reference on the macvlan_dev, which we
only need for nonblocking operations.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 564517e8 10-Feb-2010 Arnd Bergmann <arnd@arndb.de>

net/macvtap: fix reference counting

The RCU usage in the original code was broken because
there are cases where we possibly sleep with rcu_read_lock
held. As a fix, change the macvtap_file_get_queue to
get a reference on the socket and the netdev instead of
taking the full rcu_read_lock.

Also, change macvtap_file_get_queue failure case to
not require a subsequent macvtap_file_put_queue, as
pointed out by Ed Swierk.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Ed Swierk <eswierk@aristanetworks.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Ed Swierk <eswierk@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 20d29d7a 29-Jan-2010 Arnd Bergmann <arnd@arndb.de>

net: macvtap driver

In order to use macvlan with qemu and other tools that require
a tap file descriptor, the macvtap driver adds a small backend
with a character device with the same interface as the tun
driver, with a minimum set of features.

Macvtap interfaces are created in the same way as macvlan
interfaces using ip link, but the netif is just used as a
handle for configuration and accounting, while the data
goes through the chardev. Each macvtap interface has its
own character device, simplifying permission management
significantly over the generic tun/tap driver.

Cc: Patrick McHardy <kaber@trash.net>
Cc: Stephen Hemminger <shemminger@linux-foundation.org>
Cc: David S. Miller" <davem@davemloft.net>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Or Gerlitz <ogerlitz@voltaire.com>
Cc: netdev@vger.kernel.org
Cc: bridge@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>