History log of /linux-master/drivers/net/virtio_net.c
Revision Date Author Comments
# 059a49aa 03-Apr-2024 Breno Leitao <leitao@debian.org>

virtio_net: Do not send RSS key if it is not supported

There is a bug when setting the RSS options in virtio_net that can break
the whole machine, getting the kernel into an infinite loop.

Running the following command in any QEMU virtual machine with virtionet
will reproduce this problem:

# ethtool -X eth0 hfunc toeplitz

This is how the problem happens:

1) ethtool_set_rxfh() calls virtnet_set_rxfh()

2) virtnet_set_rxfh() calls virtnet_commit_rss_command()

3) virtnet_commit_rss_command() populates 4 entries for the rss
scatter-gather

4) Since the command above does not have a key, then the last
scatter-gatter entry will be zeroed, since rss_key_size == 0.
sg_buf_size = vi->rss_key_size;

5) This buffer is passed to qemu, but qemu is not happy with a buffer
with zero length, and do the following in virtqueue_map_desc() (QEMU
function):

if (!sz) {
virtio_error(vdev, "virtio: zero sized buffers are not allowed");

6) virtio_error() (also QEMU function) set the device as broken

vdev->broken = true;

7) Qemu bails out, and do not repond this crazy kernel.

8) The kernel is waiting for the response to come back (function
virtnet_send_command())

9) The kernel is waiting doing the following :

while (!virtqueue_get_buf(vi->cvq, &tmp) &&
!virtqueue_is_broken(vi->cvq))
cpu_relax();

10) None of the following functions above is true, thus, the kernel
loops here forever. Keeping in mind that virtqueue_is_broken() does
not look at the qemu `vdev->broken`, so, it never realizes that the
vitio is broken at QEMU side.

Fix it by not sending RSS commands if the feature is not available in
the device.

Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
Cc: stable@vger.kernel.org
Cc: qemu-devel@nongnu.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5da7137d 29-Feb-2024 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: rename free_old_xmit_skbs to free_old_xmit

Since free_old_xmit_skbs not only deals with skb, but also xdp frame and
subsequent added xsk, so change the name of this function to
free_old_xmit.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20240229072044.77388-19-xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# b1dc24ab 29-Feb-2024 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: unify the code for recycling the xmit ptr

There are two completely similar and independent implementations. This
is inconvenient for the subsequent addition of new types. So extract a
function from this piece of code and call this function uniformly to
recover old xmit ptr.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20240229072044.77388-18-xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 0d197a14 20-Jul-2023 Jason Wang <jasowang@redhat.com>

virtio-net: add cond_resched() to the command waiting loop

Adding cond_resched() to the command waiting loop for a better
co-operation with the scheduler. This allows to give CPU a breath to
run other task(workqueue) instead of busy looping when preemption is
not allowed on a device whose CVQ might be slow.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230720083839.481487-3-jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>


# b9f74252 20-Jul-2023 Jason Wang <jasowang@redhat.com>

virtio-net: convert rx mode setting to use workqueue

This patch convert rx mode setting to be done in a workqueue, this is
a must for allow to sleep when waiting for the cvq command to
response since current code is executed under addr spin lock.

Note that we need to disable and flush the workqueue during freeze,
this means the rx mode setting is lost after resuming. This is not the
bug of this patch as we never try to restore rx mode setting during
resume.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230720083839.481487-2-jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>


# e3fe8d28 03-Jan-2024 Zhu Yanjun <yanjun.zhu@linux.dev>

virtio_net: Fix "‘%d’ directive writing between 1 and 11 bytes into a region of size 10" warnings

Fix the warnings when building virtio_net driver.

"
drivers/net/virtio_net.c: In function ‘init_vqs’:
drivers/net/virtio_net.c:4551:48: warning: ‘%d’ directive writing between 1 and 11 bytes into a region of size 10 [-Wformat-overflow=]
4551 | sprintf(vi->rq[i].name, "input.%d", i);
| ^~
In function ‘virtnet_find_vqs’,
inlined from ‘init_vqs’ at drivers/net/virtio_net.c:4645:8:
drivers/net/virtio_net.c:4551:41: note: directive argument in the range [-2147483643, 65534]
4551 | sprintf(vi->rq[i].name, "input.%d", i);
| ^~~~~~~~~~
drivers/net/virtio_net.c:4551:17: note: ‘sprintf’ output between 8 and 18 bytes into a destination of size 16
4551 | sprintf(vi->rq[i].name, "input.%d", i);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/virtio_net.c: In function ‘init_vqs’:
drivers/net/virtio_net.c:4552:49: warning: ‘%d’ directive writing between 1 and 11 bytes into a region of size 9 [-Wformat-overflow=]
4552 | sprintf(vi->sq[i].name, "output.%d", i);
| ^~
In function ‘virtnet_find_vqs’,
inlined from ‘init_vqs’ at drivers/net/virtio_net.c:4645:8:
drivers/net/virtio_net.c:4552:41: note: directive argument in the range [-2147483643, 65534]
4552 | sprintf(vi->sq[i].name, "output.%d", i);
| ^~~~~~~~~~~
drivers/net/virtio_net.c:4552:17: note: ‘sprintf’ output between 9 and 19 bytes into a destination of size 16
4552 | sprintf(vi->sq[i].name, "output.%d", i);

"

Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://lore.kernel.org/r/20240104020902.2753599-1-yanjun.zhu@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# d2c4f192 26-Dec-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: fix missing dma unmap for resize

For rq, we have three cases getting buffers from virtio core:

1. virtqueue_get_buf{,_ctx}
2. virtqueue_detach_unused_buf
3. callback for virtqueue_resize

But in commit 295525e29a5b("virtio_net: merge dma operations when
filling mergeable buffers"), I missed the dma unmap for the #3 case.

That will leak some memory, because I did not release the pages referred
by the unused buffers.

If we do such script, we will make the system OOM.

while true
do
ethtool -G ens4 rx 128
ethtool -G ens4 rx 256
free -m
done

Fixes: 295525e29a5b ("virtio_net: merge dma operations when filling mergeable buffers")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Message-Id: <20231226094333.47740-1-xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# fb6e30a7 12-Dec-2023 Ahmed Zaki <ahmed.zaki@intel.com>

net: ethtool: pass a pointer to parameters to get/set_rxfh ethtool ops

The get/set_rxfh ethtool ops currently takes the rxfh (RSS) parameters
as direct function arguments. This will force us to change the API (and
all drivers' functions) every time some new parameters are added.

This is part 1/2 of the fix, as suggested in [1]:

- First simplify the code by always providing a pointer to all params
(indir, key and func); the fact that some of them may be NULL seems
like a weird historic thing or a premature optimization.
It will simplify the drivers if all pointers are always present.

- Then make the functions take a dev pointer, and a pointer to a
single struct wrapping all arguments. The set_* should also take
an extack.

Link: https://lore.kernel.org/netdev/20231121152906.2dd5f487@kernel.org/ [1]
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Suggested-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://lore.kernel.org/r/20231213003321.605376-2-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 62087995 11-Dec-2023 Heng Qi <hengqi@linux.alibaba.com>

virtio-net: support rx netdim

By comparing the traffic information in the complete napi processes,
let the virtio-net driver automatically adjust the coalescing
moderation parameters of each receive queue.

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1db43c08 11-Dec-2023 Heng Qi <hengqi@linux.alibaba.com>

virtio-net: extract virtqueue coalescig cmd for reuse

Extract commands to set virtqueue coalescing parameters for reuse
by ethtool -Q, vq resize and netdim.

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d7180080 11-Dec-2023 Heng Qi <hengqi@linux.alibaba.com>

virtio-net: separate rx/tx coalescing moderation cmds

This patch separates the rx and tx global coalescing moderation
commands to support netdim switches in subsequent patches.

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7949c06a 11-Dec-2023 Heng Qi <hengqi@linux.alibaba.com>

virtio-net: returns whether napi is complete

rx netdim needs to count the traffic during a complete napi process,
and start updating and comparing samples to make decisions after
the napi ends. Let virtqueue_napi_complete() return true if napi is done,
otherwise vice versa.

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2311e06b 26-Dec-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: fix missing dma unmap for resize

For rq, we have three cases getting buffers from virtio core:

1. virtqueue_get_buf{,_ctx}
2. virtqueue_detach_unused_buf
3. callback for virtqueue_resize

But in commit 295525e29a5b("virtio_net: merge dma operations when
filling mergeable buffers"), I missed the dma unmap for the #3 case.

That will leak some memory, because I did not release the pages referred
by the unused buffers.

If we do such script, we will make the system OOM.

while true
do
ethtool -G ens4 rx 128
ethtool -G ens4 rx 256
free -m
done

Fixes: 295525e29a5b ("virtio_net: merge dma operations when filling mergeable buffers")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20231226094333.47740-1-xuanzhuo@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 61217d8f 26-Oct-2023 Eric Dumazet <edumazet@google.com>

virtio_net: use u64_stats_t infra to avoid data-races

syzbot reported a data-race in virtnet_poll / virtnet_stats [1]

u64_stats_t infra has very nice accessors that must be used
to avoid potential load-store tearing.

[1]
BUG: KCSAN: data-race in virtnet_poll / virtnet_stats

read-write to 0xffff88810271b1a0 of 8 bytes by interrupt on cpu 0:
virtnet_receive drivers/net/virtio_net.c:2102 [inline]
virtnet_poll+0x6c8/0xb40 drivers/net/virtio_net.c:2148
__napi_poll+0x60/0x3b0 net/core/dev.c:6527
napi_poll net/core/dev.c:6594 [inline]
net_rx_action+0x32b/0x750 net/core/dev.c:6727
__do_softirq+0xc1/0x265 kernel/softirq.c:553
invoke_softirq kernel/softirq.c:427 [inline]
__irq_exit_rcu kernel/softirq.c:632 [inline]
irq_exit_rcu+0x3b/0x90 kernel/softirq.c:644
common_interrupt+0x7f/0x90 arch/x86/kernel/irq.c:247
asm_common_interrupt+0x26/0x40 arch/x86/include/asm/idtentry.h:636
__sanitizer_cov_trace_const_cmp8+0x0/0x80 kernel/kcov.c:306
jbd2_write_access_granted fs/jbd2/transaction.c:1174 [inline]
jbd2_journal_get_write_access+0x94/0x1c0 fs/jbd2/transaction.c:1239
__ext4_journal_get_write_access+0x154/0x3f0 fs/ext4/ext4_jbd2.c:241
ext4_reserve_inode_write+0x14e/0x200 fs/ext4/inode.c:5745
__ext4_mark_inode_dirty+0x8e/0x440 fs/ext4/inode.c:5919
ext4_evict_inode+0xaf0/0xdc0 fs/ext4/inode.c:299
evict+0x1aa/0x410 fs/inode.c:664
iput_final fs/inode.c:1775 [inline]
iput+0x42c/0x5b0 fs/inode.c:1801
do_unlinkat+0x2b9/0x4f0 fs/namei.c:4405
__do_sys_unlink fs/namei.c:4446 [inline]
__se_sys_unlink fs/namei.c:4444 [inline]
__x64_sys_unlink+0x30/0x40 fs/namei.c:4444
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

read to 0xffff88810271b1a0 of 8 bytes by task 2814 on cpu 1:
virtnet_stats+0x1b3/0x340 drivers/net/virtio_net.c:2564
dev_get_stats+0x6d/0x860 net/core/dev.c:10511
rtnl_fill_stats+0x45/0x320 net/core/rtnetlink.c:1261
rtnl_fill_ifinfo+0xd0e/0x1120 net/core/rtnetlink.c:1867
rtnl_dump_ifinfo+0x7f9/0xc20 net/core/rtnetlink.c:2240
netlink_dump+0x390/0x720 net/netlink/af_netlink.c:2266
netlink_recvmsg+0x425/0x780 net/netlink/af_netlink.c:1992
sock_recvmsg_nosec net/socket.c:1027 [inline]
sock_recvmsg net/socket.c:1049 [inline]
____sys_recvmsg+0x156/0x310 net/socket.c:2760
___sys_recvmsg net/socket.c:2802 [inline]
__sys_recvmsg+0x1ea/0x270 net/socket.c:2832
__do_sys_recvmsg net/socket.c:2842 [inline]
__se_sys_recvmsg net/socket.c:2839 [inline]
__x64_sys_recvmsg+0x46/0x50 net/socket.c:2839
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

value changed: 0x000000000045c334 -> 0x000000000045c376

Fixes: 3fa2a1df9094 ("virtio-net: per cpu 64 bit stats (v2)")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c4e33cf2 08-Oct-2023 Heng Qi <hengqi@linux.alibaba.com>

virtio-net: a tiny comment update

Update a comment because virtio-net now supports both
VIRTIO_NET_F_NOTF_COAL and VIRTIO_NET_F_VQ_NOTF_COAL.

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f61fe5f0 08-Oct-2023 Heng Qi <hengqi@linux.alibaba.com>

virtio-net: fix the vq coalescing setting for vq resize

According to the definition of virtqueue coalescing spec[1]:

Upon disabling and re-enabling a transmit virtqueue, the device MUST set
the coalescing parameters of the virtqueue to those configured through the
VIRTIO_NET_CTRL_NOTF_COAL_TX_SET command, or, if the driver did not set
any TX coalescing parameters, to 0.

Upon disabling and re-enabling a receive virtqueue, the device MUST set
the coalescing parameters of the virtqueue to those configured through the
VIRTIO_NET_CTRL_NOTF_COAL_RX_SET command, or, if the driver did not set
any RX coalescing parameters, to 0.

We need to add this setting for vq resize (ethtool -G) where vq_reset happens.

[1] https://lists.oasis-open.org/archives/virtio-dev/202303/msg00415.html

Fixes: 394bd87764b6 ("virtio_net: support per queue interrupt coalesce command")
Cc: Gavin Li <gavinl@nvidia.com>
Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bfb2b360 08-Oct-2023 Heng Qi <hengqi@linux.alibaba.com>

virtio-net: fix per queue coalescing parameter setting

When the user sets a non-zero coalescing parameter to 0 for a specific
virtqueue, it does not work as expected, so let's fix this.

Fixes: 394bd87764b6 ("virtio_net: support per queue interrupt coalesce command")
Reported-by: Xiaoming Zhao <zxm377917@alibaba-inc.com>
Cc: Gavin Li <gavinl@nvidia.com>
Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e9420838 08-Oct-2023 Heng Qi <hengqi@linux.alibaba.com>

virtio-net: consistently save parameters for per-queue

When using .set_coalesce interface to set all queue coalescing
parameters, we need to update both per-queue and global save values.

Fixes: 394bd87764b6 ("virtio_net: support per queue interrupt coalesce command")
Cc: Gavin Li <gavinl@nvidia.com>
Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 134674c1 08-Oct-2023 Heng Qi <hengqi@linux.alibaba.com>

virtio-net: fix mismatch of getting tx-frames

Since virtio-net allows switching napi_tx for per txq, we have to
get the specific txq's result now.

Fixes: 394bd87764b6 ("virtio_net: support per queue interrupt coalesce command")
Cc: Gavin Li <gavinl@nvidia.com>
Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3014a0d5 08-Oct-2023 Heng Qi <hengqi@linux.alibaba.com>

virtio-net: initially change the value of tx-frames

Background:
1. Commit 0c465be183c7 ("virtio_net: ethtool tx napi configuration") uses
tx-frames to toggle napi_tx (0 off and 1 on) if notification coalescing
is not supported.
2. Commit 31c03aef9bc2 ("virtio_net: enable napi_tx by default") enables
napi_tx for all txqs by default.

Status:
When virtio-net supports notification coalescing, after initialization,
tx-frames is 0 and napi_tx is true.

Problem:
When the user only wants to set rx coalescing params using
ethtool -C eth0 rx-usecs 10, or
ethtool -Q eth0 queue_mask 0x1 -C rx-usecs 10,
these cmds will carry tx-frames as 0, causing the napi_tx switching condition
is satisfied. Then the user gets:
netlink error: Device or resource busy.

The same happens when trying to set rx-frames, adaptive_rx, adaptive_tx...

How to fix:
When notification coalescing feature is negotiated, initially make the
value of tx-frames to be consistent with napi_tx.

For compatibility with the past, it is still supported to use tx-frames
to toggle napi_tx.

Reported-by: Xiaoming Zhao <zxm377917@alibaba-inc.com>
Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d12a26b7 21-Sep-2023 Eric Dumazet <edumazet@google.com>

virtio_net: avoid data-races on dev->stats fields

Use DEV_STATS_INC() and DEV_STATS_READ() which provide
atomicity on paths that can be used concurrently.

Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5720c43d 26-Sep-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: fix the missing of the dma cpu sync

Commit 295525e29a5b ("virtio_net: merge dma operations when filling
mergeable buffers") unmaps the buffer with DMA_ATTR_SKIP_CPU_SYNC when
the dma->ref is zero. We do that with DMA_ATTR_SKIP_CPU_SYNC, because we
do not want to do the sync for the entire page_frag. But that misses the
sync for the current area.

This patch does cpu sync regardless of whether the ref is zero or not.

Fixes: 295525e29a5b ("virtio_net: merge dma operations when filling mergeable buffers")
Reported-by: Michael Roth <michael.roth@amd.com>
Closes: http://lore.kernel.org/all/20230926130451.axgodaa6tvwqs3ut@amd.com
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 295525e2 10-Aug-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: merge dma operations when filling mergeable buffers

Currently, the virtio core will perform a dma operation for each
buffer. Although, the same page may be operated multiple times.

This patch, the driver does the dma operation and manages the dma
address based the feature premapped of virtio core.

This way, we can perform only one dma operation for the pages of the
alloc frag. This is beneficial for the iommu device.

kernel command line: intel_iommu=on iommu.passthrough=0

| strict=0 | strict=1
Before | 775496pps | 428614pps
After | 1109316pps | 742853pps

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Message-Id: <20230810123057.43407-13-xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# dae64749 21-Aug-2023 Feng Liu <feliu@nvidia.com>

virtio_net: Introduce skb_vnet_common_hdr to avoid typecasting

The virtio_net driver currently deals with different versions and types
of virtio net headers, such as virtio_net_hdr_mrg_rxbuf,
virtio_net_hdr_v1_hash, etc. Due to these variations, the code relies
on multiple type casts to convert memory between different structures,
potentially leading to bugs when there are changes in these structures.

Introduces the "struct skb_vnet_common_hdr" as a unifying header
structure using a union. With this approach, various virtio net header
structures can be converted by accessing different members of this
structure, thus eliminating the need for type casting and reducing the
risk of potential bugs.

For example following code:
static struct sk_buff *page_to_skb(struct virtnet_info *vi,
struct receive_queue *rq,
struct page *page, unsigned int offset,
unsigned int len, unsigned int truesize,
unsigned int headroom)
{
[...]
struct virtio_net_hdr_mrg_rxbuf *hdr;
[...]
hdr_len = vi->hdr_len;
[...]
ok:
hdr = skb_vnet_hdr(skb);
memcpy(hdr, hdr_p, hdr_len);
[...]
}

When VIRTIO_NET_F_HASH_REPORT feature is enabled, hdr_len = 20
But the sizeof(*hdr) is 12,
memcpy(hdr, hdr_p, hdr_len); will copy 20 bytes to the hdr,
which make a potential risk of bug. And this risk can be avoided by
introducing struct skb_vnet_common_hdr.

Change log
v1->v2
feedback from Willem de Bruijn <willemdebruijn.kernel@gmail.com>
feedback from Simon Horman <horms@kernel.org>
1. change to use net-next tree.
2. move skb_vnet_common_hdr inside kernel file instead of the UAPI header.

v2->v3
feedback from Willem de Bruijn <willemdebruijn.kernel@gmail.com>
1. fix typo in commit message.
2. add original struct virtio_net_hdr into union
3. remove virtio_net_hdr_mrg_rxbuf variable in receive_buf;

Signed-off-by: Feng Liu <feliu@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 49e47a5b 02-Aug-2023 Jakub Kicinski <kuba@kernel.org>

net: move struct netdev_rx_queue out of netdevice.h

struct netdev_rx_queue is touched in only a few places
and having it defined in netdevice.h brings in the dependency
on xdp.h, because struct xdp_rxq_info gets embedded in
struct netdev_rx_queue.

In prep for removal of xdp.h from netdevice.h move all
the netdev_rx_queue stuff to a new header.

We could technically break the new header up to avoid
the sysfs.h include but it's so rarely included it
doesn't seem to be worth it at this point.

Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230803010230.1755386-3-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>


# 8af3bf66 31-Jul-2023 Gavin Li <gavinl@nvidia.com>

virtio_net: enable per queue interrupt coalesce feature

Enable per queue interrupt coalesce feature bit in driver and validate its
dependency with control queue.

Signed-off-by: Gavin Li <gavinl@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Heng Qi <hengqi@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20230731070656.96411-4-gavinl@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 394bd877 31-Jul-2023 Gavin Li <gavinl@nvidia.com>

virtio_net: support per queue interrupt coalesce command

Add interrupt_coalesce config in send_queue and receive_queue to cache user
config.

Send per virtqueue interrupt moderation config to underlying device in
order to have more efficient interrupt moderation and cpu utilization of
guest VM.

Additionally, address all the VQs when updating the global configuration,
as now the individual VQs configuration can diverge from the global
configuration.

Signed-off-by: Gavin Li <gavinl@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Heng Qi <hengqi@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20230731070656.96411-3-gavinl@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 308d7982 31-Jul-2023 Gavin Li <gavinl@nvidia.com>

virtio_net: extract interrupt coalescing settings to a structure

Extract interrupt coalescing settings to a structure so that it could be
reused in other data structures.

Signed-off-by: Gavin Li <gavinl@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Heng Qi <hengqi@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230731070656.96411-2-gavinl@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 51b81317 09-Aug-2023 Jason Wang <jasowang@redhat.com>

virtio-net: set queues after driver_ok

Commit 25266128fe16 ("virtio-net: fix race between set queues and
probe") tries to fix the race between set queues and probe by calling
_virtnet_set_queues() before DRIVER_OK is set. This violates virtio
spec. Fixing this by setting queues after virtio_device_ready().

Note that rtnl needs to be held for userspace requests to change the
number of queues. So we are serialized in this way.

Fixes: 25266128fe16 ("virtio-net: fix race between set queues and probe")
Reported-by: Dragos Tatulea <dtatulea@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2c507ce9 10-Aug-2023 Hawkins Jiawei <yin31149@gmail.com>

virtio-net: Zero max_tx_vq field for VIRTIO_NET_CTRL_MQ_HASH_CONFIG case

Kernel uses `struct virtio_net_ctrl_rss` to save command-specific-data
for both the VIRTIO_NET_CTRL_MQ_HASH_CONFIG and
VIRTIO_NET_CTRL_MQ_RSS_CONFIG commands.

According to the VirtIO standard, "Field reserved MUST contain zeroes.
It is defined to make the structure to match the layout of
virtio_net_rss_config structure, defined in 5.1.6.5.7.".

Yet for the VIRTIO_NET_CTRL_MQ_HASH_CONFIG command case, the `max_tx_vq`
field in struct virtio_net_ctrl_rss, which corresponds to the
`reserved` field in struct virtio_net_hash_config, is not zeroed,
thereby violating the VirtIO standard.

This patch solves this problem by zeroing this field in
virtnet_init_default_rss().

Cc: Andrew Melnychenko <andrew@daynix.com>
Cc: stable@vger.kernel.org
Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20230810110405.25558-1-yin31149@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>


# 25266128 25-Jul-2023 Jason Wang <jasowang@redhat.com>

virtio-net: fix race between set queues and probe

A race were found where set_channels could be called after registering
but before virtnet_set_queues() in virtnet_probe(). Fixing this by
moving the virtnet_set_queues() before netdevice registering. While at
it, use _virtnet_set_queues() to avoid holding rtnl as the device is
not even registered at that time.

Cc: stable@vger.kernel.org
Fixes: a220871be66f ("virtio-net: correctly enable multiqueue")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230725072049.617289-1-jasowang@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# accc1bf2 05-Jun-2023 Brett Creeley <brett.creeley@amd.com>

virtio_net: use control_buf for coalesce params

Commit 699b045a8e43 ("net: virtio_net: notifications coalescing
support") added coalescing command support for virtio_net. However,
the coalesce commands are using buffers on the stack, which is causing
the device to see DMA errors. There should also be a complaint from
check_for_stack() in debug_dma_map_xyz(). Fix this by adding and using
coalesce params from the control_buf struct, which aligns with other
commands.

Cc: stable@vger.kernel.org
Fixes: 699b045a8e43 ("net: virtio_net: notifications coalescing support")
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>
Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20230605195925.51625-1-brett.creeley@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 5306623a 12-May-2023 Feng Liu <feliu@nvidia.com>

virtio_net: Fix error unwinding of XDP initialization

When initializing XDP in virtnet_open(), some rq xdp initialization
may hit an error causing net device open failed. However, previous
rqs have already initialized XDP and enabled NAPI, which is not the
expected behavior. Need to roll back the previous rq initialization
to avoid leaks in error unwinding of init code.

Also extract helper functions of disable and enable queue pairs.
Use newly introduced disable helper function in error unwinding and
virtnet_close. Use enable helper function in virtnet_open.

Fixes: 754b8a21a96d ("virtio_net: setup xdp_rxq_info")
Signed-off-by: Feng Liu <feliu@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b51f4113 10-May-2023 Yunsheng Lin <linyunsheng@huawei.com>

net: introduce and use skb_frag_fill_page_desc()

Most users use __skb_frag_set_page()/skb_frag_off_set()/
skb_frag_size_set() to fill the page desc for a skb frag.

Introduce skb_frag_fill_page_desc() to do that.

net/bpf/test_run.c does not call skb_frag_off_set() to
set the offset, "copy_from_user(page_address(page), ...)"
and 'shinfo' being part of the 'data' kzalloced in
bpf_test_init() suggest that it is assuming offset to be
initialized as zero, so call skb_frag_fill_page_desc()
with offset being zero for this case.

Also, skb_frag_set_page() is not used anymore, so remove
it.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 21e26a71 08-May-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: introduce virtnet_build_skb()

This logic is used in multiple places, now we separate it into
a helper.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 19e8c85e 08-May-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: introduce receive_small_build_xdp

Simplifying receive_small() function. Bringing the logic relating to
build_skb together.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# aef76506 08-May-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: small: remove skip_xdp

Because the skb build code is not shared between xdp and non-xdp, and
the xdp code in receive_small() is simpler, so "skip_xdp" is not needed.
We can remove it.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 7af70fc1 08-May-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: small: avoid code duplication in xdp scenarios

Avoid the problem that some variables(headroom and so on) will repeat
the calculation when process xdp.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# fc8ce84b 08-May-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: small: remove the delta

In the case of XDP-PASS, skb_reserve uses the "delta" to compatible
non-XDP, now that is not shared between xdp and non-xdp, so we can
remove this logic.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# c5f3e72f 08-May-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: introduce receive_small_xdp()

The purpose of this patch is to simplify the receive_small().
Separate all the logic of XDP of small into a function.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 59ba3b1a 08-May-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: merge: remove skip_xdp

Now, the logic of merge xdp process is simple, we can remove the
skip_xdp.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# d8f2835a 08-May-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: introduce receive_mergeable_xdp()

The purpose of this patch is to simplify the receive_mergeable().
Separate all the logic of XDP into a function.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 4cb00b13 08-May-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: virtnet_build_xdp_buff_mrg() auto release xdp shinfo

virtnet_build_xdp_buff_mrg() auto release xdp shinfo then the caller no
need to careful the xdp shinfo.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 80f50f91 08-May-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: separate the logic of freeing the rest mergeable buf

This patch introduce a new function that frees the rest mergeable buf.
The subsequent patch will reuse this function.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# bb2c1e9e 08-May-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: separate the logic of freeing xdp shinfo

This patch introduce a new function that releases the
xdp shinfo. The subsequent patch will reuse this function.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 00765f8e 08-May-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: introduce virtnet_xdp_handler() to seprate the logic of run xdp

At present, we have two similar logic to perform the XDP prog.

Therefore, this patch separates the code of executing XDP, which is
conducive to later maintenance.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# dbe4fec2 08-May-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: optimize mergeable_xdp_get_buf()

The previous patch, in order to facilitate review, I do not do any
modification. This patch has made some optimization on the top.

* remove some repeated logics in this function.
* add fast check for passing without any alloc.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# ad4858be 08-May-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: introduce mergeable_xdp_get_buf()

Separating the logic of preparation for xdp from receive_mergeable.

The purpose of this is to simplify the logic of execution of XDP.

The main logic here is that when headroom is insufficient, we need to
allocate a new page and calculate offset. It should be noted that if
there is new page, the variable page will refer to the new page.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 363d8ce4 08-May-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: mergeable xdp: put old page immediately

In the xdp implementation of virtio-net mergeable, it always checks
whether two page is used and a page is selected to release. This is
complicated for the processing of action, and be careful.

In the entire process, we have such principles:
* If xdp_page is used (PASS, TX, Redirect), then we release the old
page.
* If it is a drop case, we will release two. The old page obtained from
buf is release inside err_xdp, and xdp_page needs be relased by us.

But in fact, when we allocate a new page, we can release the old page
immediately. Then just one is using, we just need to release the new
page for drop case. On the drop path, err_xdp will release the variable
"page", so we only need to let "page" point to the new xdp_page in
advance.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# f8bb5104 03-May-2023 Wenliang Wang <wangwenliang.1995@bytedance.com>

virtio_net: suppress cpu stall when free_unused_bufs

For multi-queue and large ring-size use case, the following error
occurred when free_unused_bufs:
rcu: INFO: rcu_sched self-detected stall on CPU.

Fixes: 986a4f4d452d ("virtio_net: multiqueue support")
Signed-off-by: Wenliang Wang <wangwenliang.1995@bytedance.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# be50da3e 09-Mar-2023 Jiri Pirko <jiri@resnulli.us>

net: virtio_net: implement exact header length guest feature

Virtio spec introduced a feature VIRTIO_NET_F_GUEST_HDRLEN which when
set implicates that device benefits from knowing the exact size
of the header. For compatibility, to signal to the device that
the header is reliable driver also needs to set this feature.
Without this feature set by driver, device has to figure
out the header size itself.

Quoting the original virtio spec:
"hdr_len is a hint to the device as to how much of the header needs to
be kept to copy into each packet"

"a hint" might not be clear for the reader what does it mean, if it is
"maybe like that" of "exactly like that". This feature just makes it
crystal clear and let the device count on the hdr_len being filled up
by the exact length of header.

Also note the spec already has following note about hdr_len:
"Due to various bugs in implementations, this field is not useful
as a guarantee of the transport header size."

Without this feature the device needs to parse the header in core
data path handling. Accurate information helps the device to eliminate
such header parsing and directly use the hardware accelerators
for GSO operation.

virtio_net_hdr_from_skb() fills up hdr_len to skb_headlen(skb).
The driver already complies to fill the correct value. Introduce the
feature and advertise it.

Note that virtio spec also includes following note for device
implementation:
"Caution should be taken by the implementation so as to prevent
a malicious driver from attacking the device by setting
an incorrect hdr_len."

There is a plan to support this feature in our emulated device.
A device of SolidRun offers this feature bit. They claim this feature
will save the device a few cycles for every GSO packet.

Link: https://docs.oasis-open.org/virtio/virtio/v1.2/cs01/virtio-v1.2-cs01.html#x1-230006x3
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Alvaro Karsz <alvaro.karsz@solid-run.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20230309094559.917857-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 853618d5 14-Apr-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: bugfix overflow inside xdp_linearize_page()

Here we copy the data from the original buf to the new page. But we
not check that it may be overflow.

As long as the size received(including vnethdr) is greater than 3840
(PAGE_SIZE -VIRTIO_XDP_HEADROOM). Then the memcpy will overflow.

And this is completely possible, as long as the MTU is large, such
as 4096. In our test environment, this will cause crash. Since crash is
caused by the written memory, it is meaningless, so I do not include it.

Fixes: 72979a6c3590 ("virtio_net: xdp, add slowpath case for non contiguous buffers")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1a3bd6ea 14-Mar-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: free xdp shinfo frags when build_skb_from_xdp_buff() fails

build_skb_from_xdp_buff() may return NULL, in this case
we need to free the frags of xdp shinfo.

Fixes: fab89bafa95b ("virtio-net: support multi-buffer xdp")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fa0f1ba7 14-Mar-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: fix page_to_skb() miss headroom

Because headroom is not passed to page_to_skb(), this causes the shinfo
exceeds the range. Then the frags of shinfo are changed by other process.

[ 157.724634] stack segment: 0000 [#1] PREEMPT SMP NOPTI
[ 157.725358] CPU: 3 PID: 679 Comm: xdp_pass_user_f Tainted: G E 6.2.0+ #150
[ 157.726401] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/4
[ 157.727820] RIP: 0010:skb_release_data+0x11b/0x180
[ 157.728449] Code: 44 24 02 48 83 c3 01 39 d8 7e be 48 89 d8 48 c1 e0 04 41 80 7d 7e 00 49 8b 6c 04 30 79 0c 48 89 ef e8 89 b
[ 157.730751] RSP: 0018:ffffc90000178b48 EFLAGS: 00010202
[ 157.731383] RAX: 0000000000000010 RBX: 0000000000000001 RCX: 0000000000000000
[ 157.732270] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff888100dd0b00
[ 157.733117] RBP: 5d5d76010f6e2408 R08: ffff888100dd0b2c R09: 0000000000000000
[ 157.734013] R10: ffffffff82effd30 R11: 000000000000a14e R12: ffff88810981ffc0
[ 157.734904] R13: ffff888100dd0b00 R14: 0000000000000002 R15: 0000000000002310
[ 157.735793] FS: 00007f06121d9740(0000) GS:ffff88842fcc0000(0000) knlGS:0000000000000000
[ 157.736794] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 157.737522] CR2: 00007ffd9a56c084 CR3: 0000000104bda001 CR4: 0000000000770ee0
[ 157.738420] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 157.739283] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 157.740146] PKRU: 55555554
[ 157.740502] Call Trace:
[ 157.740843] <IRQ>
[ 157.741117] kfree_skb_reason+0x50/0x120
[ 157.741613] __udp4_lib_rcv+0x52b/0x5e0
[ 157.742132] ip_protocol_deliver_rcu+0xaf/0x190
[ 157.742715] ip_local_deliver_finish+0x77/0xa0
[ 157.743280] ip_sublist_rcv_finish+0x80/0x90
[ 157.743834] ip_list_rcv_finish.constprop.0+0x16f/0x190
[ 157.744493] ip_list_rcv+0x126/0x140
[ 157.744952] __netif_receive_skb_list_core+0x29b/0x2c0
[ 157.745602] __netif_receive_skb_list+0xed/0x160
[ 157.746190] ? udp4_gro_receive+0x275/0x350
[ 157.746732] netif_receive_skb_list_internal+0xf2/0x1b0
[ 157.747398] napi_gro_receive+0xd1/0x210
[ 157.747911] virtnet_receive+0x75/0x1c0
[ 157.748422] virtnet_poll+0x48/0x1b0
[ 157.748878] __napi_poll+0x29/0x1b0
[ 157.749330] net_rx_action+0x27a/0x340
[ 157.749812] __do_softirq+0xf3/0x2fb
[ 157.750298] do_softirq+0xa2/0xd0
[ 157.750745] </IRQ>
[ 157.751563] <TASK>
[ 157.752329] __local_bh_enable_ip+0x6d/0x80
[ 157.753178] virtnet_xdp_set+0x482/0x860
[ 157.754159] ? __pfx_virtnet_xdp+0x10/0x10
[ 157.755129] dev_xdp_install+0xa4/0xe0
[ 157.756033] dev_xdp_attach+0x20b/0x5e0
[ 157.756933] do_setlink+0x82e/0xc90
[ 157.757777] ? __nla_validate_parse+0x12b/0x1e0
[ 157.758744] rtnl_setlink+0xd8/0x170
[ 157.759549] ? mod_objcg_state+0xcb/0x320
[ 157.760328] ? security_capable+0x37/0x60
[ 157.761209] ? security_capable+0x37/0x60
[ 157.762072] rtnetlink_rcv_msg+0x145/0x3d0
[ 157.762929] ? ___slab_alloc+0x327/0x610
[ 157.763754] ? __alloc_skb+0x141/0x170
[ 157.764533] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 157.765422] netlink_rcv_skb+0x58/0x110
[ 157.766229] netlink_unicast+0x21f/0x330
[ 157.766951] netlink_sendmsg+0x240/0x4a0
[ 157.767654] sock_sendmsg+0x93/0xa0
[ 157.768434] ? sockfd_lookup_light+0x12/0x70
[ 157.769245] __sys_sendto+0xfe/0x170
[ 157.770079] ? handle_mm_fault+0xe9/0x2d0
[ 157.770859] ? preempt_count_add+0x51/0xa0
[ 157.771645] ? up_read+0x3c/0x80
[ 157.772340] ? do_user_addr_fault+0x1e9/0x710
[ 157.773166] ? kvm_read_and_reset_apf_flags+0x49/0x60
[ 157.774087] __x64_sys_sendto+0x29/0x30
[ 157.774856] do_syscall_64+0x3c/0x90
[ 157.775518] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 157.776382] RIP: 0033:0x7f06122def70

Fixes: 18117a842ab0 ("virtio-net: remove xdp related info from page_to_skb()")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cd1c604a 07-Mar-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: add checking sq is full inside xdp xmit

If the queue of xdp xmit is not an independent queue, then when the xdp
xmit used all the desc, the xmit from the __dev_queue_xmit() may encounter
the following error.

net ens4: Unexpected TXQ (0) queue failure: -28

This patch adds a check whether sq is full in xdp xmit.

Fixes: 56434a01b12e ("virtio_net: add XDP_TX support")
Reported-by: Yichun Zhang <yichun@openresty.com>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# b8ef4809 07-Mar-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: separate the logic of checking whether sq is full

Separate the logic of checking whether sq is full. The subsequent patch
will reuse this func.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 25074a44 07-Mar-2023 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: reorder some funcs

The purpose of this is to facilitate the subsequent addition of new
functions without introducing a separate declaration.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 30bbf891 07-Feb-2023 Lorenzo Bianconi <lorenzo@kernel.org>

virtio_net: Update xdp_features with xdp multi-buff

Now virtio-net supports xdp multi-buffer so add it to xdp_features.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/bpf/60c76cd63a0246db785606e8891b925fd5c9bf06.1675763384.git.lorenzo@kernel.org


# 27369c9c 03-Feb-2023 Parav Pandit <parav@nvidia.com>

virtio-net: Maintain reverse cleanup order

To easily audit the code, better to keep the device stop()
sequence to be mirror of the device open() sequence.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 66c0e13a 01-Feb-2023 Marek Majtyka <alardam@gmail.com>

drivers: net: turn on XDP features

A summary of the flags being set for various drivers is given below.
Note that XDP_F_REDIRECT_TARGET and XDP_F_FRAG_TARGET are features
that can be turned off and on at runtime. This means that these flags
may be set and unset under RTNL lock protection by the driver. Hence,
READ_ONCE must be used by code loading the flag value.

Also, these flags are not used for synchronization against the availability
of XDP resources on a device. It is merely a hint, and hence the read
may race with the actual teardown of XDP resources on the device. This
may change in the future, e.g. operations taking a reference on the XDP
resources of the driver, and in turn inhibiting turning off this flag.
However, for now, it can only be used as a hint to check whether device
supports becoming a redirection target.

Turn 'hw-offload' feature flag on for:
- netronome (nfp)
- netdevsim.

Turn 'native' and 'zerocopy' features flags on for:
- intel (i40e, ice, ixgbe, igc)
- mellanox (mlx5).
- stmmac
- netronome (nfp)

Turn 'native' features flags on for:
- amazon (ena)
- broadcom (bnxt)
- freescale (dpaa, dpaa2, enetc)
- funeth
- intel (igb)
- marvell (mvneta, mvpp2, octeontx2)
- mellanox (mlx4)
- mtk_eth_soc
- qlogic (qede)
- sfc
- socionext (netsec)
- ti (cpsw)
- tap
- tsnep
- veth
- xen
- virtio_net.

Turn 'basic' (tx, pass, aborted and drop) features flags on for:
- netronome (nfp)
- cavium (thunder)
- hyperv.

Turn 'redirect_target' feature flag on for:
- amanzon (ena)
- broadcom (bnxt)
- freescale (dpaa, dpaa2)
- intel (i40e, ice, igb, ixgbe)
- ti (cpsw)
- marvell (mvneta, mvpp2)
- sfc
- socionext (netsec)
- qlogic (qede)
- mellanox (mlx5)
- tap
- veth
- virtio_net
- xen

Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Marek Majtyka <alardam@gmail.com>
Link: https://lore.kernel.org/r/3eca9fafb308462f7edb1f58e451d59209aa07eb.1675245258.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 981f14d4 31-Jan-2023 Heng Qi <hengqi@linux.alibaba.com>

virtio-net: fix possible unsigned integer overflow

When the single-buffer xdp is loaded and after xdp_linearize_page()
is called, *num_buf becomes 0 and (*num_buf - 1) may overflow into
a large integer in virtnet_build_xdp_buff_mrg(), resulting in
unexpected packet dropping.

Fixes: ef75cb51f139 ("virtio-net: build xdp_buff with multi buffers")
Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20230131085004.98687-1-hengqi@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 9f62d221 27-Jan-2023 Laurent Vivier <lvivier@redhat.com>

virtio_net: notify MAC address change on device initialization

In virtnet_probe(), if the device doesn't provide a MAC address the
driver assigns a random one.
As we modify the MAC address we need to notify the device to allow it
to update all the related information.

The problem can be seen with vDPA and mlx5_vdpa driver as it doesn't
assign a MAC address by default. The virtio_net device uses a random
MAC address (we can see it with "ip link"), but we can't ping a net
namespace from another one using the virtio-vdpa device because the
new MAC address has not been provided to the hardware:
RX packets are dropped since they don't go through the receive filters,
TX packets go through unaffected.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 7c06458c 27-Jan-2023 Laurent Vivier <lvivier@redhat.com>

virtio_net: disable VIRTIO_NET_F_STANDBY if VIRTIO_NET_F_MAC is not set

failover relies on the MAC address to pair the primary and the standby
devices:

"[...] the hypervisor needs to enable VIRTIO_NET_F_STANDBY
feature on the virtio-net interface and assign the same MAC address
to both virtio-net and VF interfaces."

Documentation/networking/net_failover.rst

This patch disables the STANDBY feature if the MAC address is not
provided by the hypervisor.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# d0671115 22-Jan-2023 Parav Pandit <parav@nvidia.com>

virtio-net: Reduce debug name field size to 16 bytes

virtio queue index can be maximum of 65535. 16 bytes are enough to store
the vq name with the existing string prefix.

With this change, send queue struct saves 24 bytes and receive
queue saves whole cache line worth 64 bytes per structure
due to saving in alignment bytes.

Pahole results before:

pahole -s drivers/net/virtio_net.o | \
grep -e "send_queue" -e "receive_queue"
send_queue 1112 0
receive_queue 1280 1

Pahole results after:
pahole -s drivers/net/virtio_net.o | \
grep -e "send_queue" -e "receive_queue"
send_queue 1088 0
receive_queue 1216 1

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# eb1d929f 16-Jan-2023 Parav Pandit <parav@nvidia.com>

virtio_net: Reuse buffer free function

virtnet_rq_free_unused_buf() helper function to free the buffer
already exists. Avoid code duplication by reusing existing function.

Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fab89baf 14-Jan-2023 Heng Qi <hengqi@linux.alibaba.com>

virtio-net: support multi-buffer xdp

Driver can pass the skb to stack by build_skb_from_xdp_buff().

Driver forwards multi-buffer packets using the send queue
when XDP_TX and XDP_REDIRECT, and clears the reference of multi
pages when XDP_DROP.

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 18117a84 14-Jan-2023 Heng Qi <hengqi@linux.alibaba.com>

virtio-net: remove xdp related info from page_to_skb()

For the clear construction of xdp_buff, we remove the xdp processing
interleaved with page_to_skb(). Now, the logic of xdp and building
skb from xdp are separate and independent.

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b26aa481 14-Jan-2023 Heng Qi <hengqi@linux.alibaba.com>

virtio-net: build skb from multi-buffer xdp

This converts the xdp_buff directly to a skb, including
multi-buffer and single buffer xdp. We'll isolate the
construction of skb based on xdp from page_to_skb().

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 97717e8d 14-Jan-2023 Heng Qi <hengqi@linux.alibaba.com>

virtio-net: transmit the multi-buffer xdp

This serves as the basis for XDP_TX and XDP_REDIRECT
to send a multi-buffer xdp_frame.

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 22174f79 14-Jan-2023 Heng Qi <hengqi@linux.alibaba.com>

virtio-net: construct multi-buffer xdp in mergeable

Build multi-buffer xdp using virtnet_build_xdp_buff_mrg().

For the prefilled buffer before xdp is set, we will probably use
vq reset in the future. At the same time, virtio net currently
uses comp pages, and bpf_xdp_frags_increase_tail() needs to calculate
the tailroom of the last frag, which will involve the offset of the
corresponding page and cause a negative value, so we disable tail
increase by not setting xdp_rxq->frag_size.

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ef75cb51 14-Jan-2023 Heng Qi <hengqi@linux.alibaba.com>

virtio-net: build xdp_buff with multi buffers

Support xdp for multi buffer packets in mergeable mode.

Putting the first buffer as the linear part for xdp_buff,
and the rest of the buffers as non-linear fragments to struct
skb_shared_info in the tailroom belonging to xdp_buff.

Let 'truesize' return to its literal meaning, that is, when
xdp is set, it includes the length of headroom and tailroom.

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 50bd14bc 14-Jan-2023 Heng Qi <hengqi@linux.alibaba.com>

virtio-net: update bytes calculation for xdp_frame

Update relative record value for xdp_frame as basis
for multi-buffer xdp transmission.

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8d9bc36d 14-Jan-2023 Heng Qi <hengqi@linux.alibaba.com>

virtio-net: set up xdp for multi buffer packets

When the xdp program sets xdp.frags, which means it can process
multi-buffer packets over larger MTU, so we continue to support xdp.

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e814b958 14-Jan-2023 Heng Qi <hengqi@linux.alibaba.com>

virtio-net: fix calculation of MTU for single-buffer xdp

When single-buffer xdp is loaded, the size of the buffer filled each time
is 'sz = (PAGE_SIZE - headroom - tailroom)', which is the maximum packet
length that the driver allows the device to pass in. Otherwise, the packet
with a length greater than sz will come in, so num_buf will be greater than
or equal to 2, and xdp_linearize_page() will be performed and the packet
will be dropped because the total length is greater than PAGE_SIZE. So the
maximum value of MTU for single-buffer xdp is 'max_sz = sz - ETH_HLEN'.

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 484beac2 14-Jan-2023 Heng Qi <hengqi@linux.alibaba.com>

virtio-net: disable the hole mechanism for xdp

XDP core assumes that the frame_size of xdp_buff and the length of
the frag are PAGE_SIZE. The hole may cause the processing of xdp to
fail, so we disable the hole mechanism when xdp is set.

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 63b11404 02-Feb-2023 Parav Pandit <parav@nvidia.com>

virtio-net: Keep stop() to follow mirror sequence of open()

Cited commit in fixes tag frees rxq xdp info while RQ NAPI is
still enabled and packet processing may be ongoing.

Follow the mirror sequence of open() in the stop() callback.
This ensures that when rxq info is unregistered, no rx
packet processing is ongoing.

Fixes: 754b8a21a96d ("virtio_net: setup xdp_rxq_info")
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://lore.kernel.org/r/20230202163516.12559-1-parav@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# ad7e615f 25-Jan-2023 Magnus Karlsson <magnus.karlsson@intel.com>

virtio-net: execute xdp_do_flush() before napi_complete_done()

Make sure that xdp_do_flush() is always executed before
napi_complete_done(). This is important for two reasons. First, a
redirect to an XSKMAP assumes that a call to xdp_do_redirect() from
napi context X on CPU Y will be followed by a xdp_do_flush() from the
same napi context and CPU. This is not guaranteed if the
napi_complete_done() is executed before xdp_do_flush(), as it tells
the napi logic that it is fine to schedule napi context X on another
CPU. Details from a production system triggering this bug using the
veth driver can be found following the first link below.

The second reason is that the XDP_REDIRECT logic in itself relies on
being inside a single NAPI instance through to the xdp_do_flush() call
for RCU protection of all in-kernel data structures. Details can be
found in the second link below.

Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20221220185903.1105011-1-sbohrer@cloudflare.com
Link: https://lore.kernel.org/all/20210624160609.292325-1-toke@redhat.com/
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# d71ebe81 16-Jan-2023 Jason Wang <jasowang@redhat.com>

virtio-net: correctly enable callback during start_xmit

Commit a7766ef18b33("virtio_net: disable cb aggressively") enables
virtqueue callback via the following statement:

do {
if (use_napi)
virtqueue_disable_cb(sq->vq);

free_old_xmit_skbs(sq, false);

} while (use_napi && kick &&
unlikely(!virtqueue_enable_cb_delayed(sq->vq)));

When NAPI is used and kick is false, the callback won't be enabled
here. And when the virtqueue is about to be full, the tx will be
disabled, but we still don't enable tx interrupt which will cause a TX
hang. This could be observed when using pktgen with burst enabled.

TO be consistent with the logic that tries to disable cb only for
NAPI, fixing this by trying to enable delayed callback only when NAPI
is enabled when the queue is about to be full.

Fixes: a7766ef18b33 ("virtio_net: disable cb aggressively")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 418044e1 07-Dec-2022 Andrew Melnychenko <andrew@daynix.com>

drivers/net/virtio_net.c: Added USO support.

Now, it possible to enable GSO_UDP_L4("tx-udp-segmentation") for VirtioNet.

Signed-off-by: Andrew Melnychenko <andrew@daynix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 068c38ad 26-Oct-2022 Thomas Gleixner <tglx@linutronix.de>

net: Remove the obsolte u64_stats_fetch_*_irq() users (drivers).

Now that the 32bit UP oddity is gone and 32bit uses always a sequence
count, there is no need for the fetch_irq() variants anymore.

Convert to the regular interface.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# b0686565 22-Nov-2022 Li Zetao <lizetao1@huawei.com>

virtio_net: Fix probe failed when modprobe virtio_net

When doing the following test steps, an error was found:
step 1: modprobe virtio_net succeeded
# modprobe virtio_net <-- OK

step 2: fault injection in register_netdevice()
# modprobe -r virtio_net <-- OK
# ...
FAULT_INJECTION: forcing a failure.
name failslab, interval 1, probability 0, space 0, times 0
CPU: 0 PID: 3521 Comm: modprobe
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
Call Trace:
<TASK>
...
should_failslab+0xa/0x20
...
dev_set_name+0xc0/0x100
netdev_register_kobject+0xc2/0x340
register_netdevice+0xbb9/0x1320
virtnet_probe+0x1d72/0x2658 [virtio_net]
...
</TASK>
virtio_net: probe of virtio0 failed with error -22

step 3: modprobe virtio_net failed
# modprobe virtio_net <-- failed
virtio_net: probe of virtio0 failed with error -2

The root cause of the problem is that the queues are not
disable on the error handling path when register_netdevice()
fails in virtnet_probe(), resulting in an error "-ENOENT"
returned in the next modprobe call in setup_vq().

virtio_pci_modern_device uses virtqueues to send or
receive message, and "queue_enable" records whether the
queues are available. In vp_modern_find_vqs(), all queues
will be selected and activated, but once queues are enabled
there is no way to go back except reset.

Fix it by reset virtio device on error handling path. This
makes error handling follow the same order as normal device
cleanup in virtnet_remove() which does: unregister, destroy
failover, then reset. And that flow is better tested than
error handling so we can be reasonably sure it works well.

Fixes: 024655555021 ("virtio_net: fix use after free on allocation failure")
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20221122150046.3910638-1-lizetao1@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# 4959aebb 14-Sep-2022 Gavin Li <gavinl@nvidia.com>

virtio-net: use mtu size as buffer length for big packets

Currently add_recvbuf_big() allocates MAX_SKB_FRAGS segments for big
packets even when GUEST_* offloads are not present on the device.
However, if guest GSO is not supported, it would be sufficient to
allocate segments to cover just up the MTU size and no further.
Allocating the maximum amount of segments results in a large waste of
buffer space in the queue, which limits the number of packets that can
be buffered and can result in reduced performance.

Therefore, if guest GSO is not supported, use the MTU to calculate the
optimal amount of segments required.

Below is the iperf TCP test results over a Mellanox NIC, using vDPA for
1 VQ, queue size 1024, before and after the change, with the iperf
server running over the virtio-net interface.

MTU(Bytes)/Bandwidth (Gbit/s)
Before After
1500 22.5 22.4
9000 12.8 25.9

And result of queue size 256.
MTU(Bytes)/Bandwidth (Gbit/s)
Before After
9000 2.15 11.9

With this patch no degradation is observed with multiple below tests and
feature bit combinations. Results are summarized below for q depth of
1024. Interface MTU is 1500 if MTU feature is disabled. MTU is set to 9000
in other tests.

Features/ Bandwidth (Gbit/s)
Before After
mtu off 20.1 20.2
mtu/indirect on 17.4 17.3
mtu/indirect/packed on 17.2 17.2

Signed-off-by: Gavin Li <gavinl@nvidia.com>
Reviewed-by: Gavi Teitz <gavi@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Message-Id: <20220914144911.56422-3-gavinl@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>


# 46cd26f4 14-Sep-2022 Gavin Li <gavinl@nvidia.com>

virtio-net: introduce and use helper function for guest gso support checks

Probe routine is already several hundred lines.
Use helper function for guest gso support check.

Signed-off-by: Gavin Li <gavinl@nvidia.com>
Reviewed-by: Gavi Teitz <gavi@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Message-Id: <20220914144911.56422-2-gavinl@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# fb3ceec1 30-Aug-2022 Wolfram Sang <wsa+renesas@sang-engineering.com>

net: move from strlcpy with unused retval to strscpy

Follow the advice of the below link and prefer 'strscpy' in this
subsystem. Conversion is 1:1 because the return value is not used.
Generated by a coccinelle script.

Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for CAN
Link: https://lore.kernel.org/r/20220830201457.7984-1-wsa+renesas@sang-engineering.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 95bb6330 11-Aug-2022 Michael S. Tsirkin <mst@redhat.com>

virtio_net: fix endian-ness for RSS

Using native endian-ness for device supplied fields is wrong
on BE platforms. Sparse warns about this.

Fixes: 91f41f01d219 ("drivers/net/virtio_net: Added RSS hash report.")
Cc: "Andrew Melnychenko" <andrew@daynix.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2e9ca760 15-Aug-2022 Michael S. Tsirkin <mst@redhat.com>

virtio_net: Revert "virtio_net: set the default max ring size by find_vqs()"

This reverts commit 762faee5a2678559d3dc09d95f8f2c54cd0466a7.

This has been reported to trip up guests on GCP (Google Cloud).
The reason is that virtio_find_vqs_ctx_size is broken on legacy
devices. We can in theory fix virtio_find_vqs_ctx_size but
in fact the patch itself has several other issues:

- It treats unknown speed as < 10G
- It leaves userspace no way to find out the ring size set by hypervisor
- It tests speed when link is down
- It ignores the virtio spec advice:
Both \field{speed} and \field{duplex} can change, thus the driver
is expected to re-read these values after receiving a
configuration change notification.
- It is not clear the performance impact has been tested properly

Revert the patch for now.

Reported-by: Andres Freund <andres@anarazel.de>
Link: https://lore.kernel.org/r/20220814212610.GA3690074%40roeck-us.net
Link: https://lore.kernel.org/r/20220815070203.plwjx7b3cyugpdt7%40awork3.anarazel.de
Link: https://lore.kernel.org/r/3df6bb82-1951-455d-a768-e9e1513eb667%40www.fastmail.com
Link: https://lore.kernel.org/r/FCDC5DDE-3CDD-4B8A-916F-CA7D87B547CE%40anarazel.de
Fixes: 762faee5a267 ("virtio_net: set the default max ring size by find_vqs()")
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Andres Freund <andres@anarazel.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Message-Id: <20220816053602.173815-2-mst@redhat.com>


# 699b045a 17-Jul-2022 Alvaro Karsz <alvaro.karsz@solid-run.com>

net: virtio_net: notifications coalescing support

New VirtIO network feature: VIRTIO_NET_F_NOTF_COAL.

Control a Virtio network device notifications coalescing parameters
using the control virtqueue.

A device that supports this fetature can receive
VIRTIO_NET_CTRL_NOTF_COAL control commands.

- VIRTIO_NET_CTRL_NOTF_COAL_TX_SET:
Ask the network device to change the following parameters:
- tx_usecs: Maximum number of usecs to delay a TX notification.
- tx_max_packets: Maximum number of packets to send before a
TX notification.

- VIRTIO_NET_CTRL_NOTF_COAL_RX_SET:
Ask the network device to change the following parameters:
- rx_usecs: Maximum number of usecs to delay a RX notification.
- rx_max_packets: Maximum number of packets to receive before a
RX notification.

VirtIO spec. patch:
https://lists.oasis-open.org/archives/virtio-comment/202206/msg00100.html

Signed-off-by: Alvaro Karsz <alvaro.karsz@solid-run.com>
Message-Id: <20220718091102.498774-1-alvaro.karsz@solid-run.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jason Wang <jasowang@redhat.com>


# a335b33f 01-Aug-2022 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: support set_ringparam

Support set_ringparam based on virtio queue reset.

Users can use ethtool -G eth0 <ring_num> to modify the ring size of
virtio-net.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20220801063902.129329-43-xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# ebcce492 01-Aug-2022 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: support tx queue resize

This patch implements the resize function of the tx queues.
Based on this function, it is possible to modify the ring num of the
queue.

Inludes fixup:

virtio_net: fix for stuck when change tx ring size with dev down

When dev is set to DOWN state, napi has been disabled, if we modify the
ring size at this time, we should not call napi_disable() again, which
will cause stuck.

And all operations are under the protection of rtnl_lock, so there is no
need to consider concurrency issues.

Message-Id: <20220801063902.129329-42-xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20220811080258.79398-3-xuanzhuo@linux.alibaba.com>
Reported-by: Kangjie Xu <kangjie.xu@linux.alibaba.com>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 6a4763e2 01-Aug-2022 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: support rx queue resize

This patch implements the resize function of the rx queues.
Based on this function, it is possible to modify the ring num of the
queue.

Includes fixup:

virtio_net: fix for stuck when change rx ring size with dev down

When dev is set to DOWN state, napi has been disabled, if we modify the
ring size at this time, we should not call napi_disable() again, which
will cause stuck.

And all operations are under the protection of rtnl_lock, so there is no
need to consider concurrency issues.

Message-Id: <20220801063902.129329-41-xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20220811080258.79398-2-xuanzhuo@linux.alibaba.com>
Reported-by: Kangjie Xu <kangjie.xu@linux.alibaba.com>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 6e345f8c 01-Aug-2022 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: split free_unused_bufs()

This patch separates two functions for freeing sq buf and rq buf from
free_unused_bufs().

When supporting the enable/disable tx/rq queue in the future, it is
necessary to support separate recovery of a sq buf or a rq buf.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20220801063902.129329-40-xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 8597b5dd 01-Aug-2022 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: get ringparam by virtqueue_get_vring_max_size()

Use virtqueue_get_vring_max_size() in virtnet_get_ringparam() to set
tx,rx_max_pending.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20220801063902.129329-39-xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 762faee5 01-Aug-2022 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: set the default max ring size by find_vqs()

Use virtio_find_vqs_ctx_size() to specify the maximum ring size of tx,
rx at the same time.

| rx/tx ring size
-------------------------------------------
speed == UNKNOWN or < 10G| 1024
speed < 40G | 4096
speed >= 40G | 8192

Call virtnet_update_settings() once before calling init_vqs() to update
speed.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20220801063902.129329-38-xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 7a542bee 04-Aug-2022 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: fix memory leak inside XPD_TX with mergeable

When we call xdp_convert_buff_to_frame() to get xdpf, if it returns
NULL, we should check if xdp_page was allocated by xdp_linearize_page().
If it is newly allocated, it should be freed here alone. Just like any
other "goto err_xdp".

Fixes: 44fa2dbd4759 ("xdp: transition into using xdp_frame for ndo_xdp_xmit")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5a159128 25-Jul-2022 Jason Wang <jasowang@redhat.com>

virtio-net: fix the race between refill work and close

We try using cancel_delayed_work_sync() to prevent the work from
enabling NAPI. This is insufficient since we don't disable the source
of the refill work scheduling. This means an NAPI poll callback after
cancel_delayed_work_sync() can schedule the refill work then can
re-enable the NAPI that leads to use-after-free [1].

Since the work can enable NAPI, we can't simply disable NAPI before
calling cancel_delayed_work_sync(). So fix this by introducing a
dedicated boolean to control whether or not the work could be
scheduled from NAPI.

[1]
==================================================================
BUG: KASAN: use-after-free in refill_work+0x43/0xd4
Read of size 2 at addr ffff88810562c92e by task kworker/2:1/42

CPU: 2 PID: 42 Comm: kworker/2:1 Not tainted 5.19.0-rc1+ #480
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: events refill_work
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x44
print_report.cold+0xbb/0x6ac
? _printk+0xad/0xde
? refill_work+0x43/0xd4
kasan_report+0xa8/0x130
? refill_work+0x43/0xd4
refill_work+0x43/0xd4
process_one_work+0x43d/0x780
worker_thread+0x2a0/0x6f0
? process_one_work+0x780/0x780
kthread+0x167/0x1a0
? kthread_exit+0x50/0x50
ret_from_fork+0x22/0x30
</TASK>
...

Fixes: b2baed69e605c ("virtio_net: set/cancel work on ndo_open/ndo_stop")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 50c0ada6 17-Jun-2022 Jason Wang <jasowang@redhat.com>

virtio-net: fix race between ndo_open() and virtio_device_ready()

We currently call virtio_device_ready() after netdev
registration. Since ndo_open() can be called immediately
after register_netdev, this means there exists a race between
ndo_open() and virtio_device_ready(): the driver may start to use the
device before DRIVER_OK which violates the spec.

Fix this by switching to use register_netdevice() and protect the
virtio_device_ready() with rtnl_lock() to make sure ndo_open() can
only be called after virtio_device_ready().

Fixes: 4baf1e33d0842 ("virtio_net: enable VQs early")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20220617072949.30734-1-jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 8af52fe9 21-Jun-2022 Stephan Gerhold <stephan.gerhold@kernkonzept.com>

virtio_net: fix xdp_rxq_info bug after suspend/resume

The following sequence currently causes a driver bug warning
when using virtio_net:

# ip link set eth0 up
# echo mem > /sys/power/state (or e.g. # rtcwake -s 10 -m mem)
<resume>
# ip link set eth0 down

Missing register, driver bug
WARNING: CPU: 0 PID: 375 at net/core/xdp.c:138 xdp_rxq_info_unreg+0x58/0x60
Call trace:
xdp_rxq_info_unreg+0x58/0x60
virtnet_close+0x58/0xac
__dev_close_many+0xac/0x140
__dev_change_flags+0xd8/0x210
dev_change_flags+0x24/0x64
do_setlink+0x230/0xdd0
...

This happens because virtnet_freeze() frees the receive_queue
completely (including struct xdp_rxq_info) but does not call
xdp_rxq_info_unreg(). Similarly, virtnet_restore() sets up the
receive_queue again but does not call xdp_rxq_info_reg().

Actually, parts of virtnet_freeze_down() and virtnet_restore_up()
are almost identical to virtnet_close() and virtnet_open(): only
the calls to xdp_rxq_info_(un)reg() are missing. This means that
we can fix this easily and avoid such problems in the future by
just calling virtnet_close()/open() from the freeze/restore handlers.

Aside from adding the missing xdp_rxq_info calls the only difference
is that the refill work is only cancelled if netif_running(). However,
this should not make any functional difference since the refill work
should only be active if the network interface is actually up.

Fixes: 754b8a21a96d ("virtio_net: setup xdp_rxq_info")
Signed-off-by: Stephan Gerhold <stephan.gerhold@kernkonzept.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20220621114845.3650258-1-stephan.gerhold@kernkonzept.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# d484735d 06-May-2022 Jakub Kicinski <kuba@kernel.org>

net: virtio: switch to netif_napi_add_weight()

virtio netdev driver uses a custom napi weight, switch to the new
API for setting custom weight.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8d602e1a 04-May-2022 Jakub Kicinski <kuba@kernel.org>

net: move snowflake callers to netif_napi_add_tx_weight()

Make the drivers with custom tx napi weight call netif_napi_add_tx_weight().

Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220504163725.550782-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# acb16b39 25-Apr-2022 Nikolay Aleksandrov <razor@blackwall.org>

virtio_net: fix wrong buf address calculation when using xdp

We received a report[1] of kernel crashes when Cilium is used in XDP
mode with virtio_net after updating to newer kernels. After
investigating the reason it turned out that when using mergeable bufs
with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
calculates the build_skb address wrong because the offset can become less
than the headroom so it gets the address of the previous page (-X bytes
depending on how lower offset is):
page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256

This is a pr_err() I added in the beginning of page_to_skb which clearly
shows offset that is less than headroom by adding 4 bytes of metadata
via an xdp prog. The calculations done are:
receive_mergeable():
headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
offset = xdp.data - page_address(xdp_page) -
vi->hdr_len - metasize;

page_to_skb():
p = page_address(page) + offset;
...
buf = p - headroom;

Now buf goes -4 bytes from the page's starting address as can be seen
above which is set as skb->head and skb->data by build_skb later. Depending
on what's done with the skb (when it's freed most often) we get all kinds
of corruptions and BUG_ON() triggers in mm[2]. We have to recalculate
the new headroom after the xdp program has run, similar to how offset
and len are recalculated. Headroom is directly related to
data_hard_start, data and data_meta, so we use them to get the new size.
The result is correct (similar pr_err() in page_to_skb, one case of
xdp_page and one case of virtnet buf):
a) Case with 4 bytes of metadata
[ 115.949641] page_to_skb: page addr ffff8b4dcfad2000 offset 252 headroom 252
[ 121.084105] page_to_skb: page addr ffff8b4dcf018000 offset 20732 headroom 252
b) Case of pushing data +32 bytes
[ 153.181401] page_to_skb: page addr ffff8b4dd0c4d000 offset 288 headroom 288
[ 158.480421] page_to_skb: page addr ffff8b4dd00b0000 offset 24864 headroom 288
c) Case of pushing data -33 bytes
[ 835.906830] page_to_skb: page addr ffff8b4dd3270000 offset 223 headroom 223
[ 840.839910] page_to_skb: page addr ffff8b4dcdd68000 offset 12511 headroom 223

Offset and headroom are equal because offset points to the start of
reserved bytes for the virtio_net header which are at buf start +
headroom, while data points at buf start + vnet hdr size + headroom so
when data or data_meta are adjusted by the xdp prog both the headroom size
and the offset change equally. We can use data_hard_start to compute the
new headroom after the xdp prog (linearized / page start case, the
virtnet buf case is similar just with bigger base offset):
xdp.data_hard_start = page_address + vnet_hdr
xdp.data = page_address + vnet_hdr + headroom
new headroom after xdp prog = xdp.data - xdp.data_hard_start - metasize

An example reproducer xdp prog[3] is below.

[1] https://github.com/cilium/cilium/issues/19453

[2] Two of the many traces:
[ 40.437400] BUG: Bad page state in process swapper/0 pfn:14940
[ 40.916726] BUG: Bad page state in process systemd-resolve pfn:053b7
[ 41.300891] kernel BUG at include/linux/mm.h:720!
[ 41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G B W 5.18.0-rc1+ #37
[ 41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
[ 41.306018] RIP: 0010:page_frag_free+0x79/0xe0
[ 41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
[ 41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
[ 41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
[ 41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
[ 41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
[ 41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
[ 41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
[ 41.317700] FS: 00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
[ 41.319150] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
[ 41.321387] Call Trace:
[ 41.321819] <TASK>
[ 41.322193] skb_release_data+0x13f/0x1c0
[ 41.322902] __kfree_skb+0x20/0x30
[ 41.343870] tcp_recvmsg_locked+0x671/0x880
[ 41.363764] tcp_recvmsg+0x5e/0x1c0
[ 41.384102] inet_recvmsg+0x42/0x100
[ 41.406783] ? sock_recvmsg+0x1d/0x70
[ 41.428201] sock_read_iter+0x84/0xd0
[ 41.445592] ? 0xffffffffa3000000
[ 41.462442] new_sync_read+0x148/0x160
[ 41.479314] ? 0xffffffffa3000000
[ 41.496937] vfs_read+0x138/0x190
[ 41.517198] ksys_read+0x87/0xc0
[ 41.535336] do_syscall_64+0x3b/0x90
[ 41.551637] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 41.568050] RIP: 0033:0x48765b
[ 41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
[ 41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
[ 41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
[ 41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
[ 41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
[ 41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
[ 41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
[ 41.744254] </TASK>
[ 41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net

and

[ 33.524802] BUG: Bad page state in process systemd-network pfn:11e60
[ 33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
[ 33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
[ 33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
[ 33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
[ 33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
[ 33.532471] page dumped because: nonzero mapcount
[ 33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
[ 33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
[ 33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
[ 33.532484] Call Trace:
[ 33.532496] <TASK>
[ 33.532500] dump_stack_lvl+0x45/0x5a
[ 33.532506] bad_page.cold+0x63/0x94
[ 33.532510] free_pcp_prepare+0x290/0x420
[ 33.532515] free_unref_page+0x1b/0x100
[ 33.532518] skb_release_data+0x13f/0x1c0
[ 33.532524] kfree_skb_reason+0x3e/0xc0
[ 33.532527] ip6_mc_input+0x23c/0x2b0
[ 33.532531] ip6_sublist_rcv_finish+0x83/0x90
[ 33.532534] ip6_sublist_rcv+0x22b/0x2b0

[3] XDP program to reproduce(xdp_pass.c):
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp_pass")
int xdp_pkt_pass(struct xdp_md *ctx)
{
bpf_xdp_adjust_head(ctx, -(int)32);
return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass

CC: stable@vger.kernel.org
CC: Jason Wang <jasowang@redhat.com>
CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
CC: Daniel Borkmann <daniel@iogearbox.net>
CC: "Michael S. Tsirkin" <mst@redhat.com>
CC: virtualization@lists.linux-foundation.org
Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20220425103703.3067292-1-razor@blackwall.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# c1170820 28-Mar-2022 Andrew Melnychenko <andrew@daynix.com>

drivers/net/virtio_net: Added RSS hash report control.

Now it's possible to control supported hashflows.
Added hashflow set/get callbacks.
Also, disabling RXH_IP_SRC/DST for TCP would disable then for UDP.
TCP and UDP supports only:
ethtool -U eth0 rx-flow-hash tcp4 sd
RXH_IP_SRC + RXH_IP_DST
ethtool -U eth0 rx-flow-hash tcp4 sdfn
RXH_IP_SRC + RXH_IP_DST + RXH_L4_B_0_1 + RXH_L4_B_2_3
Disabling happens because VirtioNET hashtype for IP doesn't check L4 proto,
it works for all IP packets(TCP, UDP, ICMP, etc.).
For TCP and UDP, it's possible to set IP+PORT hashes.
But disabling IP hashes will disable them for TCP and UDP simultaneously.
It's possible to set IP+PORT for TCP/UDP and disable/enable IP
for everything else(UDP, ICMP, etc.).

Signed-off-by: Andrew Melnychenko <andrew@daynix.com>
Link: https://lore.kernel.org/r/20220328175336.10802-5-andrew@daynix.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 91f41f01 28-Mar-2022 Andrew Melnychenko <andrew@daynix.com>

drivers/net/virtio_net: Added RSS hash report.

Added features for RSS hash report.
If hash is provided - it sets to skb.
Added checks if rss and/or hash are enabled together.

Signed-off-by: Andrew Melnychenko <andrew@daynix.com>
Link: https://lore.kernel.org/r/20220328175336.10802-4-andrew@daynix.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# c7114b12 28-Mar-2022 Andrew Melnychenko <andrew@daynix.com>

drivers/net/virtio_net: Added basic RSS support.

Added features for RSS.
Added initialization, RXHASH feature and ethtool ops.
By default RSS/RXHASH is disabled.
Virtio RSS "IPv6 extensions" hashes disabled.
Added ethtools ops to set key and indirection table.

Signed-off-by: Andrew Melnychenko <andrew@daynix.com>
Link: https://lore.kernel.org/r/20220328175336.10802-3-andrew@daynix.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# c1ddc42d 28-Mar-2022 Andrew Melnychenko <andrew@daynix.com>

drivers/net/virtio_net: Fixed padded vheader to use v1 with hash.

The header v1 provides additional info about RSS.
Added changes to computing proper header length.
In the next patches, the header may contain RSS hash info
for the hash population.

Signed-off-by: Andrew Melnychenko <andrew@daynix.com>
Link: https://lore.kernel.org/r/20220328175336.10802-2-andrew@daynix.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 4f50ef15 13-Feb-2022 Michael Catanzaro <mcatanzaro.kernel@gmail.com>

virtio_net: Fix code indent error

This patch fixes the checkpatch.pl warning:

ERROR: code indent should use tabs where possible #3453: FILE: drivers/net/virtio_net.c:3453: ret = register_virtio_driver(&virtio_net_driver);$

Uneccessary newline was also removed making line 3453 now 3452.

Signed-off-by: Michael Catanzaro <mcatanzaro.kernel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9b51d9d8 14-Aug-2021 Yury Norov <yury.norov@gmail.com>

cpumask: replace cpumask_next_* with cpumask_first_* where appropriate

cpumask_first() is a more effective analogue of 'next' version if n == -1
(which means start == 0). This patch replaces 'next' with 'first' where
things look trivial.

There's no cpumask_first_zero() function, so create it.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>


# d9679d00 13-Oct-2021 Michael S. Tsirkin <mst@redhat.com>

virtio: wrap config->reset calls

This will enable cleanups down the road.
The idea is to disable cbs, then add "flush_queued_cbs" callback
as a parameter, this way drivers can flush any work
queued after callbacks have been disabled.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20211013105226.20225-1-mst@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# c8064e5b 30-Nov-2021 Paolo Abeni <pabeni@redhat.com>

bpf: Let bpf_warn_invalid_xdp_action() report more info

In non trivial scenarios, the action id alone is not sufficient to
identify the program causing the warning. Before the previous patch,
the generated stack-trace pointed out at least the involved device
driver.

Let's additionally include the program name and id, and the relevant
device name.

If the user needs additional infos, he can fetch them via a kernel
probe, leveraging the arguments added here.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/ddb96bb975cbfddb1546cf5da60e77d5100b533c.1638189075.git.pabeni@redhat.com


# 74624944 18-Nov-2021 Hao Chen <chenhao288@hisilicon.com>

ethtool: extend ringparam setting/getting API with rx_buf_len

Add two new parameters kernel_ringparam and extack for
.get_ringparam and .set_ringparam to extend more ring params
through netlink.

Signed-off-by: Hao Chen <chenhao288@hisilicon.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5337824f 16-Nov-2021 Eric Dumazet <edumazet@google.com>

net: annotate accesses to queue->trans_start

In following patches, dev_watchdog() will no longer stop all queues.
It will read queue->trans_start locklessly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 053c9e18 15-Dec-2021 Wenliang Wang <wangwenliang.1995@bytedance.com>

virtio_net: fix rx_drops stat for small pkts

We found the stat of rx drops for small pkts does not increment when
build_skb fail, it's not coherent with other mode's rx drops stat.

Signed-off-by: Wenliang Wang <wangwenliang.1995@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fcfb65f8 24-Nov-2021 Michael S. Tsirkin <mst@redhat.com>

Revert "virtio-net: don't let virtio core to validate used length"

This reverts commit 816625c13652cef5b2c49082d652875da6f2ad7a.

Attempts to validate length in the core did not work out.
We'll drop them, so revert the dependent changes in drivers.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 816625c1 26-Oct-2021 Jason Wang <jasowang@redhat.com>

virtio-net: don't let virtio core to validate used length

For RX virtuqueue, the used length is validated in all the three paths
(big, small and mergeable). For control vq, we never tries to use used
length. So this patch forbids the core to validate the used length.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20211027022107.14357-3-jasowang@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# fc02e8cb 09-Oct-2021 Michael S. Tsirkin <mst@redhat.com>

virtio_net: clarify tailroom logic

Make tailroom math follow same logic as everything else, subtracing
values in the order in which things are laid out in the buffer.

Tested-by: Corentin Noël <corentin.noel@collabora.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# f2edaa4a 27-Oct-2021 Jakub Kicinski <kuba@kernel.org>

net: virtio: use eth_hw_addr_set()

Commit 406f42fa0d3c ("net-next: When a bond have a massive amount
of VLANs...") introduced a rbtree for faster Ethernet address look
up. To maintain netdev->dev_addr in this tree we need to make all
the writes to it go through appropriate helpers.

Even though the current code uses dev->addr_len the we can switch
to eth_hw_addr_set() instead of dev_addr_set(). The netdev is
always allocated by alloc_etherdev_mq() and there are at least two
places which assume Ethernet address:
- the line below calling eth_hw_addr_random()
- virtnet_set_mac_address() -> eth_commit_mac_addr_change()

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20211027152012.3393077-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 6213f07c 09-Oct-2021 Li RongQing <lirongqing@baidu.com>

virtio_net: skip RCU read lock by checking xdp_enabled of vi

networking benchmark shows that __rcu_read_lock and
__rcu_read_unlock takes some cpu cycles, and we can avoid
calling them partially in virtio rx path by check xdp_enabled
of vi, and xdp is disabled most of time

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a520794b 17-Sep-2021 Tony Lu <tony.ly@linux.alibaba.com>

virtio_net: introduce TX timeout watchdog

This implements ndo_tx_timeout handler and put this into stats. When
there is something wrong to send out packets, we could notice tx timeout
events and total timeout counter.

We have suffered send timeout issues due to the backends hung. With this,
we can find the details, and collect the counters by monitor systems.

Signed-off-by: Tony Lu <tony.ly@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9ce4e3d6 18-Sep-2021 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: use netdev_warn_once to output warn when without enough queues

This warning is output when virtnet does not have enough queues, but it
only needs to be printed once to inform the user of this situation. It
is not necessary to print it every time. If the user loads xdp
frequently, this log appears too much.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 732b74d6 09-Oct-2021 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio-net: fix for skb_over_panic inside big mode

commit 126285651b7f ("Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net")
accidentally reverted the effect of
commit 1a8024239da ("virtio-net: fix for skb_over_panic inside big mode")
on drivers/net/virtio_net.c

As a result, users of crosvm (which is using large packet mode)
are experiencing crashes with 5.14-rc1 and above that do not
occur with 5.13.

Crash trace:

[ 61.346677] skbuff: skb_over_panic: text:ffffffff881ae2c7 len:3762 put:3762 head:ffff8a5ec8c22000 data:ffff8a5ec8c22010 tail:0xec2 end:0xec0 dev:<NULL>
[ 61.369192] kernel BUG at net/core/skbuff.c:111!
[ 61.372840] invalid opcode: 0000 [#1] SMP PTI
[ 61.374892] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.14.0-rc1 linux-v5.14-rc1-for-mesa-ci.tar.bz2 #1
[ 61.376450] Hardware name: ChromiumOS crosvm, BIOS 0

..

[ 61.393635] Call Trace:
[ 61.394127] <IRQ>
[ 61.394488] skb_put.cold+0x10/0x10
[ 61.395095] page_to_skb+0xf7/0x410
[ 61.395689] receive_buf+0x81/0x1660
[ 61.396228] ? netif_receive_skb_list_internal+0x1ad/0x2b0
[ 61.397180] ? napi_gro_flush+0x97/0xe0
[ 61.397896] ? detach_buf_split+0x67/0x120
[ 61.398573] virtnet_poll+0x2cf/0x420
[ 61.399197] __napi_poll+0x25/0x150
[ 61.399764] net_rx_action+0x22f/0x280
[ 61.400394] __do_softirq+0xba/0x257
[ 61.401012] irq_exit_rcu+0x8e/0xb0
[ 61.401618] common_interrupt+0x7b/0xa0
[ 61.402270] </IRQ>

See
https://lore.kernel.org/r/5edaa2b7c2fe4abd0347b8454b2ac032b6694e2c.camel%40collabora.com
for the report.

Apply the original 1a8024239da ("virtio-net: fix for skb_over_panic inside big mode")
again, the original logic still holds:

In virtio-net's large packet mode, there is a hole in the space behind
buf.

hdr_padded_len - hdr_len

We must take this into account when calculating tailroom.

Cc: Greg KH <gregkh@linuxfoundation.org>
Fixes: fb32856b16ad ("virtio-net: page_to_skb() use build_skb when there's sufficient tailroom")
Fixes: 126285651b7f ("Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reported-by: Corentin Noël <corentin.noel@collabora.com>
Tested-by: Corentin Noël <corentin.noel@collabora.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# afd92d82 17-Sep-2021 Jason Wang <jasowang@redhat.com>

virtio-net: fix pages leaking when building skb in big mode

We try to use build_skb() if we had sufficient tailroom. But we forget
to release the unused pages chained via private in big mode which will
leak pages. Fixing this by release the pages after building the skb in
big mode.

Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Fixes: fb32856b16ad ("virtio-net: page_to_skb() use build_skb when there's sufficient tailroom")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3dcc1edc 26-Aug-2021 Li RongQing <lirongqing@baidu.com>

virtio_net: reduce raw_smp_processor_id() calling in virtnet_xdp_get_sq

smp_processor_id()/raw* will be called once each when not
more queues in virtnet_xdp_get_sq() which is called in
non-preemptible context, so it's safe to call the function
smp_processor_id() once.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f3ccfda1 20-Aug-2021 Yufeng Mo <moyufeng@huawei.com>

ethtool: extend coalesce setting uAPI with CQE mode

In order to support more coalesce parameters through netlink,
add two new parameter kernel_coal and extack for .set_coalesce
and .get_coalesce, then some extra info can return to user with
the netlink API.

Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# a0d1d0f4 03-Aug-2021 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

virtio_net: Replace deprecated CPU-hotplug functions.

The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().

Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# c32325b8 02-Aug-2021 Jakub Kicinski <kuba@kernel.org>

virtio-net: realign page_to_skb() after merges

We ended up merging two versions of the same patch set:

commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
commit 5c37711d9f27 ("virtio-net: fix for unable to handle page fault for address")

into net, and

commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
commit 6c66c147b9a4 ("virtio-net: fix for unable to handle page fault for address")

into net-next. Redo the merge from commit 126285651b7f ("Merge
ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net"), so that
the most recent code remains.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# dbcf24d1 17-Aug-2021 Jason Wang <jasowang@redhat.com>

virtio-net: use NETIF_F_GRO_HW instead of NETIF_F_LRO

Commit a02e8964eaf92 ("virtio-net: ethtool configurable LRO")
maps LRO to virtio guest offloading features and allows the
administrator to enable and disable those features via ethtool.

This leads to several issues:

- For a device that doesn't support control guest offloads, the "LRO"
can't be disabled triggering WARN in dev_disable_lro() when turning
off LRO or when enabling forwarding bridging etc.

- For a device that supports control guest offloads, the guest
offloads are disabled in cases of bridging, forwarding etc slowing
down the traffic.

Fix this by using NETIF_F_GRO_HW instead. Though the spec does not
guarantee packets to be re-segmented as the original ones,
we can add that to the spec, possibly with a flag for devices to
differentiate between GRO and LRO.

Further, we never advertised LRO historically before a02e8964eaf92
("virtio-net: ethtool configurable LRO") and so bridged/forwarded
configs effectively always relied on virtio receive offloads behaving
like GRO - thus even if this breaks any configs it is at least not
a regression.

Fixes: a02e8964eaf92 ("virtio-net: ethtool configurable LRO")
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reported-by: Ivan <ivan@prestigetransportation.com>
Tested-by: Ivan <ivan@prestigetransportation.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 222722bc 09-Jul-2021 Yunjian Wang <wangyunjian@huawei.com>

virtio_net: check virtqueue_add_sgs() return value

As virtqueue_add_sgs() can fail, we should check the return value.

Addresses-Coverity-ID: 1464439 ("Unchecked return value")
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a7766ef1 12-Apr-2021 Michael S. Tsirkin <mst@redhat.com>

virtio_net: disable cb aggressively

There are currently two cases where we poll TX vq not in response to a
callback: start xmit and rx napi. We currently do this with callbacks
enabled which can cause extra interrupts from the card. Used not to be
a big issue as we run with interrupts disabled but that is no longer the
case, and in some cases the rate of spurious interrupts is so high
linux detects this and actually kills the interrupt.

Fix up by disabling the callbacks before polling the tx vq.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 22bc63c5 12-Apr-2021 Michael S. Tsirkin <mst@redhat.com>

virtio_net: move txq wakeups under tx q lock

We currently check num_free outside tx q lock
which is unsafe: new packets can arrive meanwhile
and there won't be space in the queue.
Thus a spurious queue wakeup causing overhead
and even packet drops.

Move the check under the lock to fix that.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 5a2f966d 12-Apr-2021 Michael S. Tsirkin <mst@redhat.com>

virtio_net: move tx vq operation under tx queue lock

It's unsafe to operate a vq from multiple threads.
Unfortunately this is exactly what we do when invoking
clean tx poll from rx napi.
Same happens with napi-tx even without the
opportunistic cleaning from the receive interrupt: that races
with processing the vq in start_xmit.

As a fix move everything that deals with the vq to under tx lock.

Fixes: b92f1e6751a6 ("virtio-net: transmit napi")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 3f2869ca 17-May-2021 Xie Yongji <xieyongji@bytedance.com>

virtio_net: Fix error handling in virtnet_restore()

Do some cleanups in virtnet_restore() when virtnet_cpu_notif_add() failed.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Link: https://lore.kernel.org/r/20210517084516.332-1-xieyongji@bytedance.com
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# a2f7dc00 23-Jun-2021 Xianting Tian <xianting_tian@126.com>

virtio_net: Use virtio_find_vqs_ctx() helper

virtio_find_vqs_ctx() is defined but never be called currently,
it is the right place to use it.

Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 85eb1389 05-Jun-2021 Xianting Tian <xianting.tian@linux.alibaba.com>

virtio_net: Remove BUG() to avoid machine dead

We should not directly BUG() when there is hdr error, it is
better to output a print when such error happens. Currently,
the caller of xmit_skb() already did it.

Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ad993a95 31-May-2021 Xie Yongji <xieyongji@bytedance.com>

virtio-net: Add validation for used length

This adds validation for used length (might come
from an untrusted device) to avoid data corruption
or loss.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20210531135852.113-1-xieyongji@bytedance.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 7bf64460 13-May-2021 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio-net: get build_skb() buf by data ptr

In the case of merge, the page passed into page_to_skb() may be a head
page, not the page where the current data is located. So when trying to
get the buf where the data is located, you should directly use the
pointer(p) to get the address corresponding to the page.

At the same time, the offset of the data in the page should also be
obtained using offset_in_page().

This patch solves this problem. But if you don’t use this patch, the
original code can also run, because if the page is not the page of the
current data, the calculated tailroom will be less than 0, and will not
enter the logic of build_skb() . The significance of this patch is to
modify this logical problem, allowing more situations to use
build_skb().

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6c66c147 13-May-2021 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio-net: fix for unable to handle page fault for address

In merge mode, when xdp is enabled, if the headroom of buf is smaller
than virtnet_get_headroom(), xdp_linearize_page() will be called but the
variable of "headroom" is still 0, which leads to wrong logic after
entering page_to_skb().

[ 16.600944] BUG: unable to handle page fault for address: ffffecbfff7b43c8[ 16.602175] #PF: supervisor read access in kernel mode
[ 16.603350] #PF: error_code(0x0000) - not-present page
[ 16.604200] PGD 0 P4D 0
[ 16.604686] Oops: 0000 [#1] SMP PTI
[ 16.605306] CPU: 4 PID: 715 Comm: sh Tainted: G B 5.12.0+ #312
[ 16.606429] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/04
[ 16.608217] RIP: 0010:unmap_page_range+0x947/0xde0
[ 16.609014] Code: 00 00 08 00 48 83 f8 01 45 19 e4 41 f7 d4 41 83 e4 03 e9 a4 fd ff ff e8 b7 63 ed ff 4c 89 e0 48 c1 e0 065
[ 16.611863] RSP: 0018:ffffc90002503c58 EFLAGS: 00010286
[ 16.612720] RAX: ffffecbfff7b43c0 RBX: 00007f19f7203000 RCX: ffffffff812ff359
[ 16.613853] RDX: ffff888107778000 RSI: 0000000000000000 RDI: 0000000000000005
[ 16.614976] RBP: ffffea000425e000 R08: 0000000000000000 R09: 3030303030303030
[ 16.616124] R10: ffffffff82ed7d94 R11: 6637303030302052 R12: 7c00000afffded0f
[ 16.617276] R13: 0000000000000001 R14: ffff888119ee7010 R15: 00007f19f7202000
[ 16.618423] FS: 0000000000000000(0000) GS:ffff88842fd00000(0000) knlGS:0000000000000000
[ 16.619738] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 16.620670] CR2: ffffecbfff7b43c8 CR3: 0000000103220005 CR4: 0000000000370ee0
[ 16.621792] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 16.622920] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 16.624047] Call Trace:
[ 16.624525] ? release_pages+0x24d/0x730
[ 16.625209] unmap_single_vma+0xa9/0x130
[ 16.625885] unmap_vmas+0x76/0xf0
[ 16.626480] exit_mmap+0xa0/0x210
[ 16.627129] mmput+0x67/0x180
[ 16.627673] do_exit+0x3d1/0xf10
[ 16.628259] ? do_user_addr_fault+0x231/0x840
[ 16.629000] do_group_exit+0x53/0xd0
[ 16.629631] __x64_sys_exit_group+0x1d/0x20
[ 16.630354] do_syscall_64+0x3c/0x80
[ 16.630988] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 16.631828] RIP: 0033:0x7f1a043d0191
[ 16.632464] Code: Unable to access opcode bytes at RIP 0x7f1a043d0167.
[ 16.633502] RSP: 002b:00007ffe3d993308 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[ 16.634737] RAX: ffffffffffffffda RBX: 00007f1a044c9490 RCX: 00007f1a043d0191
[ 16.635857] RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
[ 16.636986] RBP: 0000000000000000 R08: ffffffffffffff88 R09: 0000000000000001
[ 16.638120] R10: 0000000000000008 R11: 0000000000000246 R12: 00007f1a044c9490
[ 16.639245] R13: 0000000000000001 R14: 00007f1a044c9968 R15: 0000000000000000
[ 16.640408] Modules linked in:
[ 16.640958] CR2: ffffecbfff7b43c8
[ 16.641557] ---[ end trace bc4891c6ce46354c ]---
[ 16.642335] RIP: 0010:unmap_page_range+0x947/0xde0
[ 16.643135] Code: 00 00 08 00 48 83 f8 01 45 19 e4 41 f7 d4 41 83 e4 03 e9 a4 fd ff ff e8 b7 63 ed ff 4c 89 e0 48 c1 e0 065
[ 16.645983] RSP: 0018:ffffc90002503c58 EFLAGS: 00010286
[ 16.646845] RAX: ffffecbfff7b43c0 RBX: 00007f19f7203000 RCX: ffffffff812ff359
[ 16.647970] RDX: ffff888107778000 RSI: 0000000000000000 RDI: 0000000000000005
[ 16.649091] RBP: ffffea000425e000 R08: 0000000000000000 R09: 3030303030303030
[ 16.650250] R10: ffffffff82ed7d94 R11: 6637303030302052 R12: 7c00000afffded0f
[ 16.651394] R13: 0000000000000001 R14: ffff888119ee7010 R15: 00007f19f7202000
[ 16.652529] FS: 0000000000000000(0000) GS:ffff88842fd00000(0000) knlGS:0000000000000000
[ 16.653887] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 16.654841] CR2: ffffecbfff7b43c8 CR3: 0000000103220005 CR4: 0000000000370ee0
[ 16.655992] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 16.657150] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 16.658290] Kernel panic - not syncing: Fatal exception
[ 16.659613] Kernel Offset: disabled
[ 16.660234] ---[ end Kernel panic - not syncing: Fatal exception ]---

Fixes: fb32856b16ad ("virtio-net: page_to_skb() use build_skb when there's sufficient tailroom")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1a802423 03-Jun-2021 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio-net: fix for skb_over_panic inside big mode

In virtio-net's large packet mode, there is a hole in the space behind
buf.

hdr_padded_len - hdr_len

We must take this into account when calculating tailroom.

[ 44.544385] skb_put.cold (net/core/skbuff.c:5254 (discriminator 1) net/core/skbuff.c:5252 (discriminator 1))
[ 44.544864] page_to_skb (drivers/net/virtio_net.c:485) [ 44.545361] receive_buf (drivers/net/virtio_net.c:849 drivers/net/virtio_net.c:1131)
[ 44.545870] ? netif_receive_skb_list_internal (net/core/dev.c:5714)
[ 44.546628] ? dev_gro_receive (net/core/dev.c:6103)
[ 44.547135] ? napi_complete_done (./include/linux/list.h:35 net/core/dev.c:5867 net/core/dev.c:5862 net/core/dev.c:6565)
[ 44.547672] virtnet_poll (drivers/net/virtio_net.c:1427 drivers/net/virtio_net.c:1525)
[ 44.548251] __napi_poll (net/core/dev.c:6985)
[ 44.548744] net_rx_action (net/core/dev.c:7054 net/core/dev.c:7139)
[ 44.549264] __do_softirq (./arch/x86/include/asm/jump_label.h:19 ./include/linux/jump_label.h:200 ./include/trace/events/irq.h:142 kernel/softirq.c:560)
[ 44.549762] irq_exit_rcu (kernel/softirq.c:433 kernel/softirq.c:637 kernel/softirq.c:649)
[ 44.551384] common_interrupt (arch/x86/kernel/irq.c:240 (discriminator 13))
[ 44.551991] ? asm_common_interrupt (./arch/x86/include/asm/idtentry.h:638)
[ 44.552654] asm_common_interrupt (./arch/x86/include/asm/idtentry.h:638)

Fixes: fb32856b16ad ("virtio-net: page_to_skb() use build_skb when there's sufficient tailroom")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reported-by: Corentin Noël <corentin.noel@collabora.com>
Tested-by: Corentin Noël <corentin.noel@collabora.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8fb7da9e 01-Jun-2021 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio_net: get build_skb() buf by data ptr

In the case of merge, the page passed into page_to_skb() may be a head
page, not the page where the current data is located. So when trying to
get the buf where the data is located, we should get buf based on
headroom instead of offset.

This patch solves this problem. But if you don't use this patch, the
original code can also run, because if the page is not the page of the
current data, the calculated tailroom will be less than 0, and will not
enter the logic of build_skb() . The significance of this patch is to
modify this logical problem, allowing more situations to use
build_skb().

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5c37711d 01-Jun-2021 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio-net: fix for unable to handle page fault for address

In merge mode, when xdp is enabled, if the headroom of buf is smaller
than virtnet_get_headroom(), xdp_linearize_page() will be called but the
variable of "headroom" is still 0, which leads to wrong logic after
entering page_to_skb().

[ 16.600944] BUG: unable to handle page fault for address: ffffecbfff7b43c8[ 16.602175] #PF: supervisor read access in kernel mode
[ 16.603350] #PF: error_code(0x0000) - not-present page
[ 16.604200] PGD 0 P4D 0
[ 16.604686] Oops: 0000 [#1] SMP PTI
[ 16.605306] CPU: 4 PID: 715 Comm: sh Tainted: G B 5.12.0+ #312
[ 16.606429] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/04
[ 16.608217] RIP: 0010:unmap_page_range+0x947/0xde0
[ 16.609014] Code: 00 00 08 00 48 83 f8 01 45 19 e4 41 f7 d4 41 83 e4 03 e9 a4 fd ff ff e8 b7 63 ed ff 4c 89 e0 48 c1 e0 065
[ 16.611863] RSP: 0018:ffffc90002503c58 EFLAGS: 00010286
[ 16.612720] RAX: ffffecbfff7b43c0 RBX: 00007f19f7203000 RCX: ffffffff812ff359
[ 16.613853] RDX: ffff888107778000 RSI: 0000000000000000 RDI: 0000000000000005
[ 16.614976] RBP: ffffea000425e000 R08: 0000000000000000 R09: 3030303030303030
[ 16.616124] R10: ffffffff82ed7d94 R11: 6637303030302052 R12: 7c00000afffded0f
[ 16.617276] R13: 0000000000000001 R14: ffff888119ee7010 R15: 00007f19f7202000
[ 16.618423] FS: 0000000000000000(0000) GS:ffff88842fd00000(0000) knlGS:0000000000000000
[ 16.619738] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 16.620670] CR2: ffffecbfff7b43c8 CR3: 0000000103220005 CR4: 0000000000370ee0
[ 16.621792] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 16.622920] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 16.624047] Call Trace:
[ 16.624525] ? release_pages+0x24d/0x730
[ 16.625209] unmap_single_vma+0xa9/0x130
[ 16.625885] unmap_vmas+0x76/0xf0
[ 16.626480] exit_mmap+0xa0/0x210
[ 16.627129] mmput+0x67/0x180
[ 16.627673] do_exit+0x3d1/0xf10
[ 16.628259] ? do_user_addr_fault+0x231/0x840
[ 16.629000] do_group_exit+0x53/0xd0
[ 16.629631] __x64_sys_exit_group+0x1d/0x20
[ 16.630354] do_syscall_64+0x3c/0x80
[ 16.630988] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 16.631828] RIP: 0033:0x7f1a043d0191
[ 16.632464] Code: Unable to access opcode bytes at RIP 0x7f1a043d0167.
[ 16.633502] RSP: 002b:00007ffe3d993308 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[ 16.634737] RAX: ffffffffffffffda RBX: 00007f1a044c9490 RCX: 00007f1a043d0191
[ 16.635857] RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
[ 16.636986] RBP: 0000000000000000 R08: ffffffffffffff88 R09: 0000000000000001
[ 16.638120] R10: 0000000000000008 R11: 0000000000000246 R12: 00007f1a044c9490
[ 16.639245] R13: 0000000000000001 R14: 00007f1a044c9968 R15: 0000000000000000
[ 16.640408] Modules linked in:
[ 16.640958] CR2: ffffecbfff7b43c8
[ 16.641557] ---[ end trace bc4891c6ce46354c ]---
[ 16.642335] RIP: 0010:unmap_page_range+0x947/0xde0
[ 16.643135] Code: 00 00 08 00 48 83 f8 01 45 19 e4 41 f7 d4 41 83 e4 03 e9 a4 fd ff ff e8 b7 63 ed ff 4c 89 e0 48 c1 e0 065
[ 16.645983] RSP: 0018:ffffc90002503c58 EFLAGS: 00010286
[ 16.646845] RAX: ffffecbfff7b43c0 RBX: 00007f19f7203000 RCX: ffffffff812ff359
[ 16.647970] RDX: ffff888107778000 RSI: 0000000000000000 RDI: 0000000000000005
[ 16.649091] RBP: ffffea000425e000 R08: 0000000000000000 R09: 3030303030303030
[ 16.650250] R10: ffffffff82ed7d94 R11: 6637303030302052 R12: 7c00000afffded0f
[ 16.651394] R13: 0000000000000001 R14: ffff888119ee7010 R15: 00007f19f7202000
[ 16.652529] FS: 0000000000000000(0000) GS:ffff88842fd00000(0000) knlGS:0000000000000000
[ 16.653887] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 16.654841] CR2: ffffecbfff7b43c8 CR3: 0000000103220005 CR4: 0000000000370ee0
[ 16.655992] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 16.657150] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 16.658290] Kernel panic - not syncing: Fatal exception
[ 16.659613] Kernel Offset: disabled
[ 16.660234] ---[ end Kernel panic - not syncing: Fatal exception ]---

Fixes: fb32856b16ad ("virtio-net: page_to_skb() use build_skb when there's sufficient tailroom")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 122b84a1 01-May-2021 Max Gurtovoy <mgurtovoy@nvidia.com>

virtio-net: don't allocate control_buf if not supported

Not all virtio_net devices support the ctrl queue feature. Thus, there
is no need to allocate unused resources.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Link: https://lore.kernel.org/r/20210502093319.61313-1-mgurtovoy@nvidia.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# f80bd740 22-Apr-2021 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio-net: fix use-after-free in skb_gro_receive

When "headroom" > 0, the actual allocated memory space is the entire
page, so the address of the page should be used when passing it to
build_skb().

BUG: KASAN: use-after-free in skb_gro_receive (net/core/skbuff.c:4260)
Write of size 16 at addr ffff88811619fffc by task kworker/u9:0/534
CPU: 2 PID: 534 Comm: kworker/u9:0 Not tainted 5.12.0-rc7-custom-16372-gb150be05b806 #3382
Hardware name: QEMU MSN2700, BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Workqueue: xprtiod xs_stream_data_receive_workfn [sunrpc]
Call Trace:
<IRQ>
dump_stack (lib/dump_stack.c:122)
print_address_description.constprop.0 (mm/kasan/report.c:233)
kasan_report.cold (mm/kasan/report.c:400 mm/kasan/report.c:416)
skb_gro_receive (net/core/skbuff.c:4260)
tcp_gro_receive (net/ipv4/tcp_offload.c:266 (discriminator 1))
tcp4_gro_receive (net/ipv4/tcp_offload.c:316)
inet_gro_receive (net/ipv4/af_inet.c:1545 (discriminator 2))
dev_gro_receive (net/core/dev.c:6075)
napi_gro_receive (net/core/dev.c:6168 net/core/dev.c:6198)
receive_buf (drivers/net/virtio_net.c:1151) virtio_net
virtnet_poll (drivers/net/virtio_net.c:1415 drivers/net/virtio_net.c:1519) virtio_net
__napi_poll (net/core/dev.c:6964)
net_rx_action (net/core/dev.c:7033 net/core/dev.c:7118)
__do_softirq (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/irq.h:142 kernel/softirq.c:346)
irq_exit_rcu (kernel/softirq.c:221 kernel/softirq.c:422 kernel/softirq.c:434)
common_interrupt (arch/x86/kernel/irq.c:240 (discriminator 14))
</IRQ>

Fixes: fb32856b16ad ("virtio-net: page_to_skb() use build_skb when there's sufficient tailroom")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reported-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# af39c8f7 20-Apr-2021 Eric Dumazet <edumazet@google.com>

virtio-net: fix use-after-free in page_to_skb()

KASAN/syzbot had 4 reports, one of them being:

BUG: KASAN: slab-out-of-bounds in memcpy include/linux/fortify-string.h:191 [inline]
BUG: KASAN: slab-out-of-bounds in page_to_skb+0x5cf/0xb70 drivers/net/virtio_net.c:480
Read of size 12 at addr ffff888014a5f800 by task systemd-udevd/8445

CPU: 0 PID: 8445 Comm: systemd-udevd Not tainted 5.12.0-rc8-next-20210419-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x141/0x1d7 lib/dump_stack.c:120
print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:233
__kasan_report mm/kasan/report.c:419 [inline]
kasan_report.cold+0x7c/0xd8 mm/kasan/report.c:436
check_region_inline mm/kasan/generic.c:180 [inline]
kasan_check_range+0x13d/0x180 mm/kasan/generic.c:186
memcpy+0x20/0x60 mm/kasan/shadow.c:65
memcpy include/linux/fortify-string.h:191 [inline]
page_to_skb+0x5cf/0xb70 drivers/net/virtio_net.c:480
receive_mergeable drivers/net/virtio_net.c:1009 [inline]
receive_buf+0x2bc0/0x6250 drivers/net/virtio_net.c:1119
virtnet_receive drivers/net/virtio_net.c:1411 [inline]
virtnet_poll+0x568/0x10b0 drivers/net/virtio_net.c:1516
__napi_poll+0xaf/0x440 net/core/dev.c:6962
napi_poll net/core/dev.c:7029 [inline]
net_rx_action+0x801/0xb40 net/core/dev.c:7116
__do_softirq+0x29b/0x9fe kernel/softirq.c:559
invoke_softirq kernel/softirq.c:433 [inline]
__irq_exit_rcu+0x136/0x200 kernel/softirq.c:637
irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
common_interrupt+0xa4/0xd0 arch/x86/kernel/irq.c:240

Fixes: fb32856b16ad ("virtio-net: page_to_skb() use build_skb when there's sufficient tailroom")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f5d7872a 20-Apr-2021 Eric Dumazet <edumazet@google.com>

virtio-net: restrict build_skb() use to some arches

build_skb() is supposed to be followed by
skb_reserve(skb, NET_IP_ALIGN), so that IP headers are word-aligned.
(Best practice is to reserve NET_IP_ALIGN+NET_SKB_PAD, but the NET_SKB_PAD
part is only a performance optimization if tunnel encaps are added.)

Unfortunately virtio_net has not provisioned this reserve.
We can only use build_skb() for arches where NET_IP_ALIGN == 0

We might refine this later, with enough testing.

Fixes: fb32856b16ad ("virtio-net: page_to_skb() use build_skb when there's sufficient tailroom")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: David S. Miller <davem@davemloft.net>


# fb32856b 16-Apr-2021 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio-net: page_to_skb() use build_skb when there's sufficient tailroom

In page_to_skb(), if we have enough tailroom to save skb_shared_info, we
can use build_skb to create skb directly. No need to alloc for
additional space. And it can save a 'frags slot', which is very friendly
to GRO.

Here, if the payload of the received package is too small (less than
GOOD_COPY_LEN), we still choose to copy it directly to the space got by
napi_alloc_skb. So we can reuse these pages.

Testing Machine:
The four queues of the network card are bound to the cpu1.

Test command:
for ((i=0;i<5;++i)); do sockperf tp --ip 192.168.122.64 -m 1000 -t 150& done

The size of the udp package is 1000, so in the case of this patch, there
will always be enough tailroom to use build_skb. The sent udp packet
will be discarded because there is no port to receive it. The irqsoftd
of the machine is 100%, we observe the received quantity displayed by
sar -n DEV 1:

no build_skb: 956864.00 rxpck/s
build_skb: 1158465.00 rxpck/s

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 044ab86d 18-Mar-2021 Antoine Tenart <atenart@kernel.org>

net: move the xps maps to an array

Move the xps maps (xps_cpus_map and xps_rxqs_map) to an array in
net_device. That will simplify a lot the code removing the need for lots
of if/else conditionals as the correct map will be available using its
offset in the array.

This should not modify the xps maps behaviour in any way.

Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fdc13979 07-Mar-2021 Lorenzo Bianconi <lorenzo@kernel.org>

bpf, devmap: Move drop error path to devmap for XDP_REDIRECT

We want to change the current ndo_xdp_xmit drop semantics because it will
allow us to implement better queue overflow handling. This is working
towards the larger goal of a XDP TX queue-hook. Move XDP_REDIRECT error
path handling from each XDP ethernet driver to devmap code. According to
the new APIs, the driver running the ndo_xdp_xmit pointer, will break tx
loop whenever the hw reports a tx error and it will just return to devmap
caller the number of successfully transmitted frames. It will be devmap
responsibility to free dropped frames.

Move each XDP ndo_xdp_xmit capable driver to the new APIs:

- veth
- virtio-net
- mvneta
- mvpp2
- socionext
- amazon ena
- bnxt
- freescale (dpaa2, dpaa)
- xen-frontend
- qede
- ice
- igb
- ixgbe
- i40e
- mlx5
- ti (cpsw, cpsw-new)
- tun
- sfc

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: Camelia Groza <camelia.groza@nxp.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Shay Agroskin <shayagr@amazon.com>
Link: https://lore.kernel.org/bpf/ed670de24f951cfd77590decf0229a0ad7fd12f6.1615201152.git.lorenzo@kernel.org


# d7a9a01b 16-Mar-2021 Alexander Duyck <alexanderduyck@fb.com>

virtio_net: Update driver to use ethtool_sprintf

Update the code to replace instances of snprintf and a pointer update with
just calling ethtool_sprintf.

Also replace the char pointer with a u8 pointer to avoid having to recast
the pointer type.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 97c2c69e 09-Mar-2021 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio-net: support XDP when not more queues

The number of queues implemented by many virtio backends is limited,
especially some machines have a large number of CPUs. In this case, it
is often impossible to allocate a separate queue for
XDP_TX/XDP_REDIRECT, then xdp cannot be loaded to work, even xdp does
not use the XDP_TX/XDP_REDIRECT.

This patch allows XDP_TX/XDP_REDIRECT to run by reuse the existing SQ
with __netif_tx_lock() hold when there are not enough queues.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ab5bd583 18-Feb-2021 Xuan Zhuo <xuanzhuo@linux.alibaba.com>

virtio-net: Support IFF_TX_SKB_NO_LINEAR flag

Virtio net supports the case where the skb linear space is empty, so
add priv_flags.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210218204908.5455-4-alobakin@pm.me


# 0f6925b3 02-Apr-2021 Eric Dumazet <edumazet@google.com>

virtio_net: Do not pull payload in skb->head

Xuan Zhuo reported that commit 3226b158e67c ("net: avoid 32 x truesize
under-estimation for tiny skbs") brought a ~10% performance drop.

The reason for the performance drop was that GRO was forced
to chain sk_buff (using skb_shinfo(skb)->frag_list), which
uses more memory but also cause packet consumers to go over
a lot of overhead handling all the tiny skbs.

It turns out that virtio_net page_to_skb() has a wrong strategy :
It allocates skbs with GOOD_COPY_LEN (128) bytes in skb->head, then
copies 128 bytes from the page, before feeding the packet to GRO stack.

This was suboptimal before commit 3226b158e67c ("net: avoid 32 x truesize
under-estimation for tiny skbs") because GRO was using 2 frags per MSS,
meaning we were not packing MSS with 100% efficiency.

Fix is to pull only the ethernet header in page_to_skb()

Then, we change virtio_net_hdr_to_skb() to pull the missing
headers, instead of assuming they were already pulled by callers.

This fixes the performance regression, but could also allow virtio_net
to accept packets with more than 128bytes of headers.

Many thanks to Xuan Zhuo for his report, and his tests/help.

Fixes: 3226b158e67c ("net: avoid 32 x truesize under-estimation for tiny skbs")
Reported-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://www.spinics.net/lists/netdev/msg731397.html
Co-Developed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 95efabf0 19-Nov-2020 Gustavo A. R. Silva <gustavoars@kernel.org>

virtio_net: Fix fall-through warnings for Clang

In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a goto statement instead of letting the code fall
through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/cb9b9534572bc476f4fb7b49a73dc8646b780c84.1605896060.git.gustavoars@kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# be9df4af 22-Dec-2020 Lorenzo Bianconi <lorenzo@kernel.org>

net, xdp: Introduce xdp_prepare_buff utility routine

Introduce xdp_prepare_buff utility routine to initialize per-descriptor
xdp_buff fields (e.g. xdp_buff pointers). Rely on xdp_prepare_buff() in
all XDP capable drivers.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Shay Agroskin <shayagr@amazon.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Acked-by: Marcin Wojtas <mw@semihalf.com>
Link: https://lore.kernel.org/bpf/45f46f12295972a97da8ca01990b3e71501e9d89.1608670965.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 43b5169d 22-Dec-2020 Lorenzo Bianconi <lorenzo@kernel.org>

net, xdp: Introduce xdp_init_buff utility routine

Introduce xdp_init_buff utility routine to initialize xdp_buff fields
const over NAPI iterations (e.g. frame_sz or rxq pointer). Rely on
xdp_init_buff in all XDP capable drivers.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Shay Agroskin <shayagr@amazon.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Acked-by: Marcin Wojtas <mw@semihalf.com>
Link: https://lore.kernel.org/bpf/7f8329b6da1434dc2b05a77f2e800b29628a8913.1608670965.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# de33212f 22-Dec-2020 Jeff Dike <jdike@akamai.com>

virtio_net: Fix recursive call to cpus_read_lock()

virtnet_set_channels can recursively call cpus_read_lock if CONFIG_XPS
and CONFIG_HOTPLUG are enabled.

The path is:
virtnet_set_channels - calls get_online_cpus(), which is a trivial
wrapper around cpus_read_lock()
netif_set_real_num_tx_queues
netif_reset_xps_queues_gt
netif_reset_xps_queues - calls cpus_read_lock()

This call chain and potential deadlock happens when the number of TX
queues is reduced.

This commit the removes netif_set_real_num_[tr]x_queues calls from
inside the get/put_online_cpus section, as they don't require that it
be held.

Fixes: 47be24796c13 ("virtio-net: fix the set affinity bug when CPU IDs are not consecutive")
Signed-off-by: Jeff Dike <jdike@akamai.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20201223025421.671-1-jdike@akamai.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 411ea23a 04-Dec-2020 Dan Carpenter <dan.carpenter@oracle.com>

virtio_net: Fix error code in probe()

Set a negative error code intead of returning success if the MTU has
been changed to something invalid.

Fixes: fe36cbe0671e ("virtio_net: clear MTU when out of range")
Reported-by: Robert Buhren <robert.buhren@sect.tu-berlin.de>
Reported-by: Felicitas Hetzelt <file@sect.tu-berlin.de>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/X8pGVJSeeCdII1Ys@mwanda
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>


# b02e5a0e 30-Nov-2020 Björn Töpel <bjorn@kernel.org>

xsk: Propagate napi_id to XDP socket Rx path

Add napi_id to the xdp_rxq_info structure, and make sure the XDP
socket pick up the napi_id in the Rx path. The napi_id is used to find
the corresponding NAPI structure for socket busy polling.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/bpf/20201130185205.196029-7-bjorn.topel@gmail.com


# cf8691cb 21-Oct-2020 Michael S. Tsirkin <mst@redhat.com>

Revert "virtio-net: ethtool configurable RXCSUM"

This reverts commit 3618ad2a7c0e78e4258386394d5d5f92a3dbccf8.

When control vq is not negotiated, that commit causes a crash:

[ 72.229171] kernel BUG at drivers/net/virtio_net.c:1667!
[ 72.230266] invalid opcode: 0000 [#1] PREEMPT SMP
[ 72.231172] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc8-02934-g3618ad2a7c0e7 #1
[ 72.231172] EIP: virtnet_send_command+0x120/0x140
[ 72.231172] Code: 00 0f 94 c0 8b 7d f0 65 33 3d 14 00 00 00 75 1c 8d 65 f4 5b 5e 5f 5d c3 66 90 be 01 00 00 00 e9 6e ff ff ff 8d b6 00
+00 00 00 <0f> 0b e8 d9 bb 82 00 eb 17 8d b4 26 00 00 00 00 8d b4 26 00 00 00
[ 72.231172] EAX: 0000000d EBX: f72895c0 ECX: 00000017 EDX: 00000011
[ 72.231172] ESI: f7197800 EDI: ed69bd00 EBP: ed69bcf4 ESP: ed69bc98
[ 72.231172] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246
[ 72.231172] CR0: 80050033 CR2: 00000000 CR3: 02c84000 CR4: 000406f0
[ 72.231172] Call Trace:
[ 72.231172] ? __virt_addr_valid+0x45/0x60
[ 72.231172] ? ___cache_free+0x51f/0x760
[ 72.231172] ? kobject_uevent_env+0xf4/0x560
[ 72.231172] virtnet_set_guest_offloads+0x4d/0x80
[ 72.231172] virtnet_set_features+0x85/0x120
[ 72.231172] ? virtnet_set_guest_offloads+0x80/0x80
[ 72.231172] __netdev_update_features+0x27a/0x8e0
[ 72.231172] ? kobject_uevent+0xa/0x20
[ 72.231172] ? netdev_register_kobject+0x12c/0x160
[ 72.231172] register_netdevice+0x4fe/0x740
[ 72.231172] register_netdev+0x1c/0x40
[ 72.231172] virtnet_probe+0x728/0xb60
[ 72.231172] ? _raw_spin_unlock+0x1d/0x40
[ 72.231172] ? virtio_vdpa_get_status+0x1c/0x20
[ 72.231172] virtio_dev_probe+0x1c6/0x271
[ 72.231172] really_probe+0x195/0x2e0
[ 72.231172] driver_probe_device+0x26/0x60
[ 72.231172] device_driver_attach+0x49/0x60
[ 72.231172] __driver_attach+0x46/0xc0
[ 72.231172] ? device_driver_attach+0x60/0x60
[ 72.231172] bus_add_driver+0x197/0x1c0
[ 72.231172] driver_register+0x66/0xc0
[ 72.231172] register_virtio_driver+0x1b/0x40
[ 72.231172] virtio_net_driver_init+0x61/0x86
[ 72.231172] ? veth_init+0x14/0x14
[ 72.231172] do_one_initcall+0x76/0x2e4
[ 72.231172] ? rdinit_setup+0x2a/0x2a
[ 72.231172] do_initcalls+0xb2/0xd5
[ 72.231172] kernel_init_freeable+0x14f/0x179
[ 72.231172] ? rest_init+0x100/0x100
[ 72.231172] kernel_init+0xd/0xe0
[ 72.231172] ret_from_fork+0x1c/0x30
[ 72.231172] Modules linked in:
[ 72.269563] ---[ end trace a6ebc4afea0e6cb1 ]---

The reason is that virtnet_set_features now calls virtnet_set_guest_offloads
unconditionally, it used to only call it when there is something
to configure.

If device does not have a control vq, everything breaks.

Revert the original commit for now.

Cc: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Fixes: 3618ad2a7c0e7 ("virtio-net: ethtool configurable RXCSUM")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20201021142944.13615-1-mst@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 3618ad2a 11-Oct-2020 Tonghao Zhang <xiangxia.m.yue@gmail.com>

virtio-net: ethtool configurable RXCSUM

Allow user configuring RXCSUM separately with ethtool -K,
reusing the existing virtnet_set_guest_offloads helper
that configures RXCSUM for XDP. This is conditional on
VIRTIO_NET_F_CTRL_GUEST_OFFLOADS.

If Rx checksum is disabled, LRO should also be disabled.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20201012015820.62042-1-xiangxia.m.yue@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 1a03b8a3 28-Sep-2020 Tonghao Zhang <xiangxia.m.yue@gmail.com>

virtio-net: don't disable guest csum when disable LRO

Open vSwitch and Linux bridge will disable LRO of the interface
when this interface added to them. Now when disable the LRO, the
virtio-net csum is disable too. That drops the forwarding performance.

Fixes: a02e8964eaf9 ("virtio-net: ethtool configurable LRO")
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5198d545 09-Sep-2020 Jakub Kicinski <kuba@kernel.org>

net: remove napi_hash_del() from driver-facing API

We allow drivers to call napi_hash_del() before calling
netif_napi_del() to batch RCU grace periods. This makes
the API asymmetric and leaks internal implementation details.
Soon we will want the grace period to protect more than just
the NAPI hash table.

Restructure the API and have drivers call a new function -
__netif_napi_del() if they want to take care of RCU waits.

Note that only core was checking the return status from
napi_hash_del() so the new helper does not report if the
NAPI was actually deleted.

Some notes on driver oddness:
- veth observed the grace period before calling netif_napi_del()
but that should not matter
- myri10ge observed normal RCU flavor
- bnx2x and enic did not actually observe the grace period
(unless they did so implicitly)
- virtio_net and enic only unhashed Rx NAPIs

The last two points seem to indicate that the calls to
napi_hash_del() were a left over rather than an optimization.
Regardless, it's easy enough to correct them.

This patch may introduce extra synchronize_net() calls for
interfaces which set NAPI_STATE_NO_BUSY_POLL and depend on
free_netdev() to call netif_napi_del(). This seems inevitable
since we want to use RCU for netpoll dev->napi_list traversal,
and almost no drivers set IFF_DISABLE_NETPOLL.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# df561f66 23-Aug-2020 Gustavo A. R. Silva <gustavoars@kernel.org>

treewide: Use fallthrough pseudo-keyword

Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>


# 64ffa39d 05-Aug-2020 Michael S. Tsirkin <mst@redhat.com>

virtio_net: use LE accessors for speed/duplex

Speed and duplex config fields depend on VIRTIO_NET_F_SPEED_DUPLEX
which being 63>31 depends on VIRTIO_F_VERSION_1.

Accordingly, use LE accessors for these fields.

Reported-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# e8407fde 22-Jul-2020 Andrii Nakryiko <andriin@fb.com>

bpf, xdp: Remove XDP_QUERY_PROG and XDP_QUERY_PROG_HW XDP commands

Now that BPF program/link management is centralized in generic net_device
code, kernel code never queries program id from drivers, so
XDP_QUERY_PROG/XDP_QUERY_PROG_HW commands are unnecessary.

This patch removes all the implementations of those commands in kernel, along
the xdp_attachment_query().

This patch was compile-tested on allyesconfig.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722064603.3350758-10-andriin@fb.com


# 1b698fa5 28-May-2020 Lorenzo Bianconi <lorenzo@kernel.org>

xdp: Rename convert_to_xdp_frame in xdp_convert_buff_to_frame

In order to use standard 'xdp' prefix, rename convert_to_xdp_frame
utility routine in xdp_convert_buff_to_frame and replace all the
occurrences

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/6344f739be0d1a08ab2b9607584c4d5478c8c083.1590698295.git.lorenzo@kernel.org


# 9ce6146e 13-May-2020 Jesper Dangaard Brouer <brouer@redhat.com>

virtio_net: Add XDP frame size in two code paths

The virtio_net driver is running inside the guest-OS. There are two
XDP receive code-paths in virtio_net, namely receive_small() and
receive_mergeable(). The receive_big() function does not support XDP.

In receive_small() the frame size is available in buflen. The buffer
backing these frames are allocated in add_recvbuf_small() with same
size, except for the headroom, but tailroom have reserved room for
skb_shared_info. The headroom is encoded in ctx pointer as a value.

In receive_mergeable() the frame size is more dynamic. There are two
basic cases: (1) buffer size is based on a exponentially weighted
moving average (see DECLARE_EWMA) of packet length. Or (2) in case
virtnet_get_headroom() have any headroom then buffer size is
PAGE_SIZE. The ctx pointer is this time used for encoding two values;
the buffer len "truesize" and headroom. In case (1) if the rx buffer
size is underestimated, the packet will have been split over more
buffers (num_buf info in virtio_net_hdr_mrg_rxbuf placed in top of
buffer area). If that happens the XDP path does a xdp_linearize_page
operation.

V3: Adjust frame_sz in receive_mergeable() case, spotted by Jason Wang.

The code is really hard to follow, so some hints to reviewers.
The receive_mergeable() case gets frames that were allocated in
add_recvbuf_mergeable() which uses headroom=virtnet_get_headroom(),
and 'buf' ptr is advanced this headroom. The headroom can only
be 0 or VIRTIO_XDP_HEADROOM, as virtnet_get_headroom is really
simple:

static unsigned int virtnet_get_headroom(struct virtnet_info *vi)
{
return vi->xdp_queue_pairs ? VIRTIO_XDP_HEADROOM : 0;
}

As frame_sz is an offset size from xdp.data_hard_start, reviewers
should notice how this is calculated in receive_mergeable():

int offset = buf - page_address(page);
[...]
data = page_address(xdp_page) + offset;
xdp.data_hard_start = data - VIRTIO_XDP_HEADROOM + vi->hdr_len;

The calculated offset will always be VIRTIO_XDP_HEADROOM when
reaching this code. Thus, xdp.data_hard_start will be page-start
address plus vi->hdr_len. Given this xdp.frame_sz need to be
reduced with vi->hdr_len size.

IMHO a followup patch should cleanup this code to make it easier
to maintain and understand, but it is outside the scope of this
patchset.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/bpf/158945344436.97035.9445115070189151680.stgit@firesoul


# 01c32598 07-May-2020 Michael S. Tsirkin <mst@redhat.com>

virtio_net: fix lockdep warning on 32 bit

When we fill up a receive VQ, try_fill_recv currently tries to count
kicks using a 64 bit stats counter. Turns out, on a 32 bit kernel that
uses a seqcount. sequence counts are "lock" constructs where you need to
make sure that writers are serialized.

In turn, this means that we mustn't run two try_fill_recv concurrently.
Which of course we don't. We do run try_fill_recv sometimes from a
softirq napi context, and sometimes from a fully preemptible context,
but the later always runs with napi disabled.

However, when it comes to the seqcount, lockdep is trying to enforce the
rule that the same lock isn't accessed from preemptible and softirq
context - it doesn't know about napi being enabled/disabled. This causes
a false-positive warning:

WARNING: inconsistent lock state
...
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.

As a work around, shut down the warning by switching
to u64_stats_update_begin_irqsave - that works by disabling
interrupts on 32 bit only, is a NOP on 64 bit.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a51e5206 04-Mar-2020 Jakub Kicinski <kuba@kernel.org>

virtio_net: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver correctly rejects all unsupported parameters.
As a side effect of these changes the error code for
unsupported params changes from EINVAL to EOPNOTSUPP.

v2: correctly handle rx-frames (and adjust the commit msg)
v3: adjust commit message for new error code and member name

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9aedc6e2 28-Feb-2020 Cris Forno <cforno12@linux.vnet.ibm.com>

net/ethtool: Introduce link_ksettings API for virtual network devices

With the ethtool_virtdev_set_link_ksettings function in core/ethtool.c,
ibmveth, netvsc, and virtio now use the core's helper function.

Funtionality changes that pertain to ibmveth driver include:

1. Changed the initial hardcoded link speed to 1GB.

2. Added support for allowing a user to change the reported link
speed via ethtool.

Functionality changes to the netvsc driver include:

1. When netvsc_get_link_ksettings is called, it will defer to the VF
device if it exists to pull accelerated networking values, otherwise
pull default or user-defined values.

2. Similarly, if netvsc_set_link_ksettings called and a VF device
exists, the real values of speed and duplex are changed.

Signed-off-by: Cris Forno <cforno12@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 503d539a 24-Feb-2020 Yuya Kusakabe <yuya.kusakabe@gmail.com>

virtio_net: Add XDP meta data support

Implement support for transferring XDP meta data into skb for
virtio_net driver; before calling into the program, xdp.data_meta points
to xdp.data, where on program return with pass verdict, we call
into skb_metadata_set().

Tested with the script at
https://github.com/higebu/virtio_net-xdp-metadata-test.

Signed-off-by: Yuya Kusakabe <yuya.kusakabe@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/bpf/20200225033212.437563-2-yuya.kusakabe@gmail.com


# f1d4884d 24-Feb-2020 Yuya Kusakabe <yuya.kusakabe@gmail.com>

virtio_net: Keep vnet header zeroed if XDP is loaded for small buffer

We do not want to care about the vnet header in receive_small() if XDP
is loaded, since we can not know whether or not the packet is modified
by XDP.

Fixes: f6b10209b90d ("virtio-net: switch to use build_skb() for small buffer")
Signed-off-by: Yuya Kusakabe <yuya.kusakabe@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/bpf/20200225033212.437563-1-yuya.kusakabe@gmail.com


# 9719c6b9 26-Jan-2020 John Fastabend <john.fastabend@gmail.com>

bpf, xdp: virtio_net use access ptr macro for xdp enable check

virtio_net currently relies on rcu critical section to access the xdp
program in its xdp_xmit handler. However, the pointer to the xdp program
is only used to do a NULL pointer comparison to determine if xdp is
enabled or not.

Use rcu_access_pointer() instead of rcu_dereference() to reflect this.
Then later when we drop rcu_read critical section virtio_net will not
need in special handling.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/1580084042-11598-3-git-send-email-john.fastabend@gmail.com


# 1d233886 16-Jan-2020 Toke Høiland-Jørgensen <toke@redhat.com>

xdp: Use bulking for non-map XDP_REDIRECT and consolidate code paths

Since the bulk queue used by XDP_REDIRECT now lives in struct net_device,
we can re-use the bulking for the non-map version of the bpf_redirect()
helper. This is a simple matter of having xdp_do_redirect_slow() queue the
frame on the bulk queue instead of sending it out with __bpf_tx_xdp().

Unfortunately we can't make the bpf_redirect() helper return an error if
the ifindex doesn't exit (as bpf_redirect_map() does), because we don't
have a reference to the network namespace of the ingress device at the time
the helper is called. So we have to leave it as-is and keep the device
lookup in xdp_do_redirect_slow().

Since this leaves less reason to have the non-map redirect code in a
separate function, so we get rid of the xdp_do_redirect_slow() function
entirely. This does lose us the tracepoint disambiguation, but fortunately
the xdp_redirect and xdp_redirect_map tracepoints use the same tracepoint
entry structures. This means both can contain a map index, so we can just
amend the tracepoint definitions so we always emit the xdp_redirect(_err)
tracepoints, but with the map ID only populated if a map is present. This
means we retire the xdp_redirect_map(_err) tracepoints entirely, but keep
the definitions around in case someone is still listening for them.

With this change, the performance of the xdp_redirect sample program goes
from 5Mpps to 8.4Mpps (a 68% increase).

Since the flush functions are no longer map-specific, rename the flush()
functions to drop _map from their names. One of the renamed functions is
the xdp_do_flush_map() callback used in all the xdp-enabled drivers. To
keep from having to update all drivers, use a #define to keep the old name
working, and only update the virtual drivers in this patch.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/157918768505.1458396.17518057312953572912.stgit@toke.dk


# 85192dbf 17-Nov-2019 Andrii Nakryiko <andriin@fb.com>

bpf: Convert bpf_prog refcnt to atomic64_t

Similarly to bpf_map's refcnt/usercnt, convert bpf_prog's refcnt to atomic64
and remove artificial 32k limit. This allows to make bpf_prog's refcounting
non-failing, simplifying logic of users of bpf_prog_add/bpf_prog_inc.

Validated compilation by running allyesconfig kernel build.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191117172806.2195367-3-andriin@fb.com


# 895b5c9f 29-Sep-2019 Florian Westphal <fw@strlen.de>

netfilter: drop bridge nf reset from nf_reset

commit 174e23810cd31
("sk_buff: drop all skb extensions on free and skb scrubbing") made napi
recycle always drop skb extensions. The additional skb_ext_del() that is
performed via nf_reset on napi skb recycle is not needed anymore.

Most nf_reset() calls in the stack are there so queued skb won't block
'rmmod nf_conntrack' indefinitely.

This removes the skb_ext_del from nf_reset, and renames it to a more
fitting nf_reset_ct().

In a few selected places, add a call to skb_ext_reset to make sure that
no active extensions remain.

I am submitting this for "net", because we're still early in the release
cycle. The patch applies to net-next too, but I think the rename causes
needless divergence between those trees.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 718be6ba 19-Aug-2019 ? jiang <jiangkidd@hotmail.com>

virtio-net: lower min ring num_free for efficiency

This change lowers ring buffer reclaim threshold from 1/2*queue to budget
for better performance. According to our test with qemu + dpdk, packet
dropping happens when the guest is not able to provide free buffer in
avail ring timely with default 1/2*queue. The value in the patch has been
tested and does show better performance.

Test setup: iperf3 to generate packets to guest (total 30mins, pps 400k, UDP)
avg packets drop before: 2842
avg packets drop after: 360(-87.3%)

Further, current code suffers from a starvation problem: the amount of
work done by try_fill_recv is not bounded by the budget parameter, thus
(with large queues) once in a while userspace gets blocked for a long
time while queue is being refilled. Trigger refills earlier to make sure
the amount of work to do is limited.

Signed-off-by: jiangkidd <jiangkidd@hotmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 31c03aef 12-Jun-2019 Willem de Bruijn <willemb@google.com>

virtio_net: enable napi_tx by default

NAPI tx mode improves TCP behavior by enabling TCP small queues (TSQ).
TSQ reduces queuing ("bufferbloat") and burstiness.

Previous measurements have shown significant improvement for
TCP_STREAM style workloads. Such as those in commit 86a5df1495cc
("Merge branch 'virtio-net-tx-napi'").

There has been uncertainty about smaller possible regressions in
latency due to increased reliance on tx interrupts.

The above results did not show that, nor did I observe this when
rerunning TCP_RR on Linux 5.1 this week on a pair of guests in the
same rack. This may be subject to other settings, notably interrupt
coalescing.

In the unlikely case of regression, we have landed a credible runtime
solution. Ethtool can configure it with -C tx-frames [0|1] as of
commit 0c465be183c7 ("virtio_net: ethtool tx napi configuration").

NAPI tx mode has been the default in Google Container-Optimized OS
(COS) for over half a year, as of release M70 in October 2018,
without any negative reports.

Link: https://marc.info/?l=linux-netdev&m=149305618416472
Link: https://lwn.net/Articles/507065/
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1ccea77e 19-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 13

Based on 2 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details you
should have received a copy of the gnu general public license along
with this program if not see http www gnu org licenses

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details [based]
[from] [clk] [highbank] [c] you should have received a copy of the
gnu general public license along with this program if not see http
www gnu org licenses

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 355 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154041.837383322@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 7934b481 02-Apr-2019 Yuval Shaia <yuval.shaia@oracle.com>

virtio-net: Fix some minor formatting errors

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6221333a 03-Apr-2019 Yuval Shaia <yuval.shaia@oracle.com>

virtio-net: Remove inclusion of pci.h

This header is not in use - remove it.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6b16f9ee 01-Apr-2019 Florian Westphal <fw@strlen.de>

net: move skb->xmit_more hint to softnet data

There are two reasons for this.

First, the xmit_more flag conceptually doesn't fit into the skb, as
xmit_more is not a property related to the skb.
Its only a hint to the driver that the stack is about to transmit another
packet immediately.

Second, it was only done this way to not have to pass another argument
to ndo_start_xmit().

We can place xmit_more in the softnet data, next to the device recursion.
The recursion counter is already written to on each transmit. The "more"
indicator is placed right next to it.

Drivers can use the netdev_xmit_more() helper instead of skb->xmit_more
to check the "more packets coming" hint.

skb->xmit_more is retained (but always 0) to not cause build breakage.

This change takes care of the simple s/skb->xmit_more/netdev_xmit_more()/
conversions. Remaining drivers are converted in the next patches.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 310974fa 18-Mar-2019 Peter Xu <peterx@redhat.com>

virtio_net: remove hcpu from virtnet_clean_affinity

The variable is never used.

CC: Michael S. Tsirkin <mst@redhat.com>
CC: Jason Wang <jasowang@redhat.com>
CC: virtualization@lists.linux-foundation.org
CC: netdev@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 546f2897 31-Jan-2019 Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

virtio_net: Account for tx bytes and packets on sending xdp_frames

Previously virtnet_xdp_xmit() did not account for device tx counters,
which caused confusions.
To be consistent with SKBs, account them on freeing xdp_frames.

Reported-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5050471d 28-Jan-2019 Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

virtio_net: Differentiate sk_buff and xdp_frame on freeing

We do not reset or free up unused buffers when enabling/disabling XDP,
so it can happen that xdp_frames are freed after disabling XDP or
sk_buffs are freed after enabling XDP on xdp tx queues.
Thus we need to handle both forms (xdp_frames and sk_buffs) regardless
of XDP setting.
One way to trigger this problem is to disable XDP when napi_tx is
enabled. In that case, virtnet_xdp_set() calls virtnet_napi_enable()
which kicks NAPI. The NAPI handler will call virtnet_poll_cleantx()
which invokes free_old_xmit_skbs() for queues which have been used by
XDP.

Note that even with this change we need to keep skipping
free_old_xmit_skbs() from NAPI handlers when XDP is enabled, because XDP
tx queues do not aquire queue locks.

- v2: Use napi_consume_skb() instead of dev_consume_skb_any()

Fixes: 4941d472bf95 ("virtio-net: do not reset during XDP set")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 07b344f4 28-Jan-2019 Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

virtio_net: Use xdp_return_frame to free xdp_frames on destroying vqs

put_page() can work as a fallback for freeing xdp_frames, but the
appropriate way is to use xdp_return_frame().

Fixes: cac320c850ef ("virtio_net: convert to use generic xdp_frame and xdp_return_frame API")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 03aa6d34 28-Jan-2019 Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

virtio_net: Don't process redirected XDP frames when XDP is disabled

Commit 8dcc5b0ab0ec ("virtio_net: fix ndo_xdp_xmit crash towards dev not
ready for XDP") tried to avoid access to unexpected sq while XDP is
disabled, but was not complete.

There was a small window which causes out of bounds sq access in
virtnet_xdp_xmit() while disabling XDP.

An example case of
- curr_queue_pairs = 6 (2 for SKB and 4 for XDP)
- online_cpu_num = xdp_queue_paris = 4
when XDP is enabled:

CPU 0 CPU 1
(Disabling XDP) (Processing redirected XDP frames)

virtnet_xdp_xmit()
virtnet_xdp_set()
_virtnet_set_queues()
set curr_queue_pairs (2)
check if rq->xdp_prog is not NULL
virtnet_xdp_sq(vi)
qp = curr_queue_pairs -
xdp_queue_pairs +
smp_processor_id()
= 2 - 4 + 1 = -1
sq = &vi->sq[qp] // out of bounds access
set xdp_queue_pairs (0)
rq->xdp_prog = NULL

Basically we should not change curr_queue_pairs and xdp_queue_pairs
while someone can read the values. Thus, when disabling XDP, assign NULL
to rq->xdp_prog first, and wait for RCU grace period, then change
xxx_queue_pairs.
Note that we need to keep the current order when enabling XDP though.

- v2: Make rcu_assign_pointer/synchronize_net conditional instead of
_virtnet_set_queues.

Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1667c08a 28-Jan-2019 Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

virtio_net: Fix out of bounds access of sq

When XDP is disabled, curr_queue_pairs + smp_processor_id() can be
larger than max_queue_pairs.
There is no guarantee that we have enough XDP send queues dedicated for
each cpu when XDP is disabled, so do not count drops on sq in that case.

Fixes: 5b8f3c8d30a6 ("virtio_net: Add XDP related stats")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 188313c1 28-Jan-2019 Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

virtio_net: Fix not restoring real_num_rx_queues

When _virtnet_set_queues() failed we did not restore real_num_rx_queues.
Fix this by placing the change of real_num_rx_queues after
_virtnet_set_queues().
This order is also in line with virtnet_set_channels().

Fixes: 4941d472bf95 ("virtio-net: do not reset during XDP set")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 534da5e8 28-Jan-2019 Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

virtio_net: Don't call free_old_xmit_skbs for xdp_frames

When napi_tx is enabled, virtnet_poll_cleantx() called
free_old_xmit_skbs() even for xdp send queue.
This is bogus since the queue has xdp_frames, not sk_buffs, thus mangled
device tx bytes counters because skb->len is meaningless value, and even
triggered oops due to general protection fault on freeing them.

Since xdp send queues do not aquire locks, old xdp_frames should be
freed only in virtnet_xdp_xmit(), so just skip free_old_xmit_skbs() for
xdp send queues.

Similarly virtnet_poll_tx() called free_old_xmit_skbs(). This NAPI
handler is called even without calling start_xmit() because cb for tx is
by default enabled. Once the handler is called, it enabled the cb again,
and then the handler would be called again. We don't need this handler
for XDP, so don't enable cb as well as not calling free_old_xmit_skbs().

Also, we need to disable tx NAPI when disabling XDP, so
virtnet_poll_tx() can safely access curr_queue_pairs and
xdp_queue_pairs, which are not atomically updated while disabling XDP.

Fixes: b92f1e6751a6 ("virtio-net: transmit napi")
Fixes: 7b0411ef4aa6 ("virtio-net: clean tx descriptors from rx napi")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8be4d9a4 28-Jan-2019 Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

virtio_net: Don't enable NAPI when interface is down

Commit 4e09ff536284 ("virtio-net: disable NAPI only when enabled during
XDP set") tried to fix inappropriate NAPI enabling/disabling when
!netif_running(), but was not complete.

On error path virtio_net could enable NAPI even when !netif_running().
This can cause enabling NAPI twice on virtnet_open(), which would
trigger BUG_ON() in napi_enable().

Fixes: 4941d472bf95b ("virtio-net: do not reset during XDP set")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# df133f3f 17-Jan-2019 Michael S. Tsirkin <mst@redhat.com>

virtio_net: bulk free tx skbs

Use napi_consume_skb() to get bulk free. Note that napi_consume_skb is
safe to call in a non-napi context as long as the napi_budget flag is
correct.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 133bbb18 17-Jan-2019 Willem de Bruijn <willemb@google.com>

virtio-net: per-queue RPS config

On multiqueue network devices, RPS maps are configured independently
for each receive queue through /sys/class/net/$DEV/queues/rx-*.

On virtio-net currently all packets use the map from rx-0, because the
real rx queue is not known at time of map lookup by get_rps_cpu.

Call skb_record_rx_queue in the driver rx path to make lookup work.

Recording the receive queue has ramifications beyond RPS, such as in
sticky load balancing decisions for sockets (skb_tx_hash) and XPS.

Reported-by: Mark Hlady <mhlady@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a02e8964 20-Dec-2018 Willem de Bruijn <willemb@google.com>

virtio-net: ethtool configurable LRO

Virtio-net devices negotiate LRO support with the host.
Display the initially negotiated state with ethtool -k.

Also allow configuring it with ethtool -K, reusing the existing
virtnet_set_guest_offloads helper that configures LRO for XDP.
This is conditional on VIRTIO_NET_F_CTRL_GUEST_OFFLOADS.

Virtio-net negotiates TSO4 and TSO6 separately, but ethtool does not
distinguish between the two. Display LRO as on only if any offload
is active.

RTNL is held while calling virtnet_set_features, same as on the path
from virtnet_xdp_set.

Changes v1 -> v2
- allow ethtool config (-K) only if VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
- show LRO as enabled if any LRO variant is enabled
- do not allow configuration while XDP is active
- differentiate current features from the capable set, to restore
on XDP down only those features that were active on XDP up
- move test out of VIRTIO_NET_F_CSUM/TSO branch, which is tx only

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 436c9453 28-Nov-2018 Jason Wang <jasowang@redhat.com>

virtio-net: keep vnet header zeroed after processing XDP

We copy vnet header unconditionally in page_to_skb() this is wrong
since XDP may modify the packet data. So let's keep a zeroed vnet
header for not confusing the conversion between vnet header and skb
metadata.

In the future, we should able to detect whether or not the packet was
modified and keep using the vnet header when packet was not touched.

Fixes: f600b6905015 ("virtio_net: Add XDP support")
Reported-by: Pavel Popa <pashinho1990@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 18ba58e1 21-Nov-2018 Jason Wang <jasowang@redhat.com>

virtio-net: fail XDP set if guest csum is negotiated

We don't support partial csumed packet since its metadata will be lost
or incorrect during XDP processing. So fail the XDP set if guest_csum
feature is negotiated.

Fixes: f600b6905015 ("virtio_net: Add XDP support")
Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Pavel Popa <pashinho1990@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e59ff2c4 21-Nov-2018 Jason Wang <jasowang@redhat.com>

virtio-net: disable guest csum during XDP set

We don't disable VIRTIO_NET_F_GUEST_CSUM if XDP was set. This means we
can receive partial csumed packets with metadata kept in the
vnet_hdr. This may have several side effects:

- It could be overridden by header adjustment, thus is might be not
correct after XDP processing.
- There's no way to pass such metadata information through
XDP_REDIRECT to another driver.
- XDP does not support checksum offload right now.

So simply disable guest csum if possible in this the case of XDP.

Fixes: 3f93522ffab2d ("virtio-net: switch off offloads on demand if possible on XDP set")
Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Pavel Popa <pashinho1990@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 05c998b7 17-Oct-2018 Ake Koomsin <ake@igel.co.jp>

virtio_net: avoid using netif_tx_disable() for serializing tx routine

Commit 713a98d90c5e ("virtio-net: serialize tx routine during reset")
introduces netif_tx_disable() after netif_device_detach() in order to
avoid use-after-free of tx queues. However, there are two issues.

1) Its operation is redundant with netif_device_detach() in case the
interface is running.
2) In case of the interface is not running before suspending and
resuming, the tx does not get resumed by netif_device_attach().
This results in losing network connectivity.

It is better to use netif_tx_lock_bh()/netif_tx_unlock_bh() instead for
serializing tx routine during reset. This also preserves the symmetry
of netif_device_detach() and netif_device_attach().

Fixes commit 713a98d90c5e ("virtio-net: serialize tx routine during reset")
Signed-off-by: Ake Koomsin <ake@igel.co.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0c465be1 08-Oct-2018 Jason Wang <jasowang@redhat.com>

virtio_net: ethtool tx napi configuration

Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
Interrupt moderation is currently not supported, so these accept and
display the default settings of 0 usec and 1 frame.

Toggle tx napi through setting tx-frames. So as to not interfere
with possible future interrupt moderation, value 1 means tx napi while
value 0 means not.

Only allow the switching when device is down for simplicity.

Link: https://patchwork.ozlabs.org/patch/948149/
Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 260dd2c3 27-Sep-2018 Eric Dumazet <edumazet@google.com>

virtio_net: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

virto_net uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1150827b 13-Aug-2018 YueHaibing <yuehaibing@huawei.com>

virtio_net: remove duplicated include from virtio_net.c

Remove duplicated include linux/netdevice.h

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2ca653d6 09-Aug-2018 Caleb Raitto <caraitto@google.com>

virtio_net: Stripe queue affinities across cores.

Always set the affinity hint, even if #cpu != #vq.

Handle the case where #cpu > #vq (including when #cpu % #vq != 0) and
when #vq > #cpu (including when #vq % #cpu != 0).

Signed-off-by: Caleb Raitto <caraitto@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jon Olson <jonolson@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 19e226e8 09-Aug-2018 Caleb Raitto <caraitto@google.com>

virtio: Make vp_set_vq_affinity() take a mask.

Make vp_set_vq_affinity() take a cpumask instead of taking a single CPU.

If there are fewer queues than cores, queue affinity should be able to
map to multiple cores.

Link: https://patchwork.ozlabs.org/patch/948149/
Suggested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Caleb Raitto <caraitto@google.com>
Acked-by: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4d99f660 08-Aug-2018 Andrei Vagin <avagin@gmail.com>

net: allow to call netif_reset_xps_queues() under cpus_read_lock

The definition of static_key_slow_inc() has cpus_read_lock in place. In the
virtio_net driver, XPS queues are initialized after setting the queue:cpu
affinity in virtnet_set_affinity() which is already protected within
cpus_read_lock. Lockdep prints a warning when we are trying to acquire
cpus_read_lock when it is already held.

This patch adds an ability to call __netif_set_xps_queue under
cpus_read_lock().
Acked-by: Jason Wang <jasowang@redhat.com>

============================================
WARNING: possible recursive locking detected
4.18.0-rc3-next-20180703+ #1 Not tainted
--------------------------------------------
swapper/0/1 is trying to acquire lock:
00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: static_key_slow_inc+0xe/0x20

but task is already holding lock:
00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0

other info that might help us debug this:
Possible unsafe locking scenario:

CPU0
----
lock(cpu_hotplug_lock.rw_sem);
lock(cpu_hotplug_lock.rw_sem);

*** DEADLOCK ***

May be due to missing lock nesting notation

3 locks held by swapper/0/1:
#0: 00000000244bc7da (&dev->mutex){....}, at: __driver_attach+0x5a/0x110
#1: 00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0
#2: 000000005cd8463f (xps_map_mutex){+.+.}, at: __netif_set_xps_queue+0x8d/0xc60

v2: move cpus_read_lock() out of __netif_set_xps_queue()

Cc: "Nambiar, Amritha" <amritha.nambiar@intel.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")

Signed-off-by: Andrei Vagin <avagin@gmail.com>

Signed-off-by: David S. Miller <davem@davemloft.net>


# b633d440 04-Aug-2018 Gustavo A. R. Silva <gustavo@embeddedor.com>

virtio-net: mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 1402059 ("Missing break in switch")
Addresses-Coverity-ID: 1402060 ("Missing break in switch")
Addresses-Coverity-ID: 1402061 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d46eeeaf 31-Jul-2018 Jason Wang <jasowang@redhat.com>

virtio-net: get rid of unnecessary container of rq stats

We don't maintain tx counters in rx stats any more. There's no need
for an extra container of rq stats.

Cc: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ca9e83b4 31-Jul-2018 Jason Wang <jasowang@redhat.com>

virtio-net: correctly update XDP_TX counters

Commit 5b8f3c8d30a6 ("virtio_net: Add XDP related stats") tries to
count TX XDP stats in virtnet_receive(). This will cause several
issues:

- virtnet_xdp_sq() was called without checking whether or not XDP is
set. This may cause out of bound access when there's no enough txq
for XDP.
- Stats were updated even if there's no XDP/XDP_TX.

Fixing this by reusing virtnet_xdp_xmit() for XDP_TX which can counts
TX XDP counter itself and remove the unnecessary tx stats embedded in
rx stats.

Reported-by: syzbot+604f8271211546f5b3c7@syzkaller.appspotmail.com
Fixes: 5b8f3c8d30a6 ("virtio_net: Add XDP related stats")
Cc: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ecbc42ca 23-Jul-2018 Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

virtio_net: Fix incosistent received bytes counter

When received packets are dropped in virtio_net driver, received packets
counter is incremented but bytes counter is not.
As a result, for instance if we drop all packets by XDP, only received
is counted and bytes stays 0, which looks inconsistent.
IMHO received packets/bytes should be counted if packets are produced by
the hypervisor, like what common NICs on physical machines are doing.
So fix the bytes counter.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 461f03dc 23-Jul-2018 Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

virtio_net: Add kick stats

So we can infer the number of VM-Exits.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5b8f3c8d 23-Jul-2018 Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

virtio_net: Add XDP related stats

Add counters below:
* Tx
- xdp_tx: frames sent by ndo_xdp_xmit or XDP_TX.
- xdp_tx_drops: dropped frames out of xdp_tx ones.
* Rx
- xdp_packets: frames went through xdp program.
- xdp_tx: XDP_TX frames.
- xdp_redirects: XDP_REDIRECT frames.
- xdp_drops: any dropped frames out of xdp_packets ones.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2a43565c 23-Jul-2018 Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

virtio_net: Factor out the logic to determine xdp sq

Make sure to use the same logic in all places to determine xdp sq. This
is useful for xdp counters which the following commit will introduce as
well.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2c4a2f7d 23-Jul-2018 Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

virtio_net: Make drop counter per-queue

Since when XDP was introduced, drop counter has been able to be updated
much more frequently than before, as XDP_DROP increments the counter.
Thus for performance analysis per-queue drop counter would be useful.

Also this avoids cache contention and race on updating the counter. It
is currently racy because napi handlers read-modify-write it without any
locks.

There are more counters in dev->stats that are racy, but I left them
per-device, because they are rarely updated and does not worth being
per-queue counters IMHO. To fix them we need atomic ops or some kind of
locks.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a0929a44 23-Jul-2018 Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

virtio_net: Use temporary storage for accounting rx stats

The purpose is to keep receive_buf arguments simple when more per-queue
counter items are added later.
Also XDP_TX related sq counters will be updated in the following changes
so create a container struct virtnet_rx_stats which will includes both
rq and sq statistics. For now it only covers rq stats.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7d9d60fd 23-Jul-2018 Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

virtio_net: Fix incosistent received bytes counter

When received packets are dropped in virtio_net driver, received packets
counter is incremented but bytes counter is not.
As a result, for instance if we drop all packets by XDP, only received
is counted and bytes stays 0, which looks inconsistent.
IMHO received packets/bytes should be counted if packets are produced by
the hypervisor, like what common NICs on physical machines are doing.
So fix the bytes counter.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6b867589 11-Jul-2018 Jakub Kicinski <kuba@kernel.org>

xdp: don't make drivers report attachment mode

prog_attached of struct netdev_bpf should have been superseded
by simply setting prog_id long time ago, but we kept it around
to allow offloading drivers to communicate attachment mode (drv
vs hw). Subsequently drivers were also allowed to report back
attachment flags (prog_flags), and since nowadays only programs
attached will XDP_FLAGS_HW_MODE can get offloaded, we can tell
the attachment mode from the flags driver reports. Remove
prog_attached member.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>


# 2471c75e 26-Jun-2018 Jesper Dangaard Brouer <brouer@redhat.com>

virtio_net: split XDP_TX kick and XDP_REDIRECT map flushing

The driver was combining XDP_TX virtqueue_kick and XDP_REDIRECT
map flushing (xdp_do_flush_map). This is suboptimal, these two
flush operations should be kept separate.

The suboptimal behavior was introduced in commit 9267c430c6b6
("virtio-net: add missing virtqueue kick when flushing packets").

Fixes: 9267c430c6b6 ("virtio-net: add missing virtqueue kick when flushing packets")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6396bb22 12-Jun-2018 Kees Cook <keescook@chromium.org>

treewide: kzalloc() -> kcalloc()

The kzalloc() function has a 2-factor argument form, kcalloc(). This
patch replaces cases of:

kzalloc(a * b, gfp)

with:
kcalloc(a * b, gfp)

as well as handling cases of:

kzalloc(a * b * c, gfp)

with:

kzalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

kzalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

kzalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
kzalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kzalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
kzalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kzalloc
+ kcalloc
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
kzalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
kzalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
kzalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
kzalloc(sizeof(THING) * C2, ...)
|
kzalloc(sizeof(TYPE) * C2, ...)
|
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(C1 * C2, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * E2
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- E1 * E2
+ E1, E2
, ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>


# 6da2ec56 12-Jun-2018 Kees Cook <keescook@chromium.org>

treewide: kmalloc() -> kmalloc_array()

The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
patch replaces cases of:

kmalloc(a * b, gfp)

with:
kmalloc_array(a * b, gfp)

as well as handling cases of:

kmalloc(a * b * c, gfp)

with:

kmalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

kmalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

kmalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The tools/ directory was manually excluded, since it has its own
implementation of kmalloc().

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
kmalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kmalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
kmalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kmalloc
+ kmalloc_array
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
kmalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
kmalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kmalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
kmalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
kmalloc(C1 * C2 * C3, ...)
|
kmalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
kmalloc(sizeof(THING) * C2, ...)
|
kmalloc(sizeof(TYPE) * C2, ...)
|
kmalloc(C1 * C2 * C3, ...)
|
kmalloc(C1 * C2, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- (E1) * E2
+ E1, E2
, ...)
|
- kmalloc
+ kmalloc_array
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kmalloc
+ kmalloc_array
(
- E1 * E2
+ E1, E2
, ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>


# fd3a8862 06-Jun-2018 Willem de Bruijn <willemb@google.com>

net: in virtio_net_hdr only add VLAN_HLEN to csum_start if payload holds vlan

Tun, tap, virtio, packet and uml vector all use struct virtio_net_hdr
to communicate packet metadata to userspace.

For skbuffs with vlan, the first two return the packet as it may have
existed on the wire, inserting the VLAN tag in the user buffer. Then
virtio_net_hdr.csum_start needs to be adjusted by VLAN_HLEN bytes.

Commit f09e2249c4f5 ("macvtap: restore vlan header on user read")
added this feature to macvtap. Commit 3ce9b20f1971 ("macvtap: Fix
csum_start when VLAN tags are present") then fixed up csum_start.

Virtio, packet and uml do not insert the vlan header in the user
buffer.

When introducing virtio_net_hdr_from_skb to deduplicate filling in
the virtio_net_hdr, the variant from macvtap which adds VLAN_HLEN was
applied uniformly, breaking csum offset for packets with vlan on
virtio and packet.

Make insertion of VLAN_HLEN optional. Convert the callers to pass it
when needed.

Fixes: e858fae2b0b8f4 ("virtio_net: use common code for virtio_net_hdr and skb GSO conversion")
Fixes: 1276f24eeef2 ("packet: use common code for virtio_net_hdr and skb GSO conversion")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 16ec025a 05-Jun-2018 Jesper Dangaard Brouer <brouer@redhat.com>

virtio_net: remove ndo_xdp_flush call virtnet_xdp_flush

Remove the ndo_xdp_flush call implementation virtnet_xdp_flush
as no callers of ndo_xdp_flush are left.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>


# 2fa3c8a8 31-May-2018 Tonghao Zhang <xiangxia.m.yue@gmail.com>

net: virtio: simplify the virtnet_find_vqs

Use the common free functions while return successfully.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5d274cb4 31-May-2018 Jesper Dangaard Brouer <brouer@redhat.com>

virtio_net: implement flush flag for ndo_xdp_xmit

When passed the XDP_XMIT_FLUSH flag virtnet_xdp_xmit now performs the
same virtqueue_kick as virtnet_xdp_flush.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 42b33468 31-May-2018 Jesper Dangaard Brouer <brouer@redhat.com>

xdp: add flags argument to ndo_xdp_xmit API

This patch only change the API and reject any use of flags. This is an
intermediate step that allows us to implement the flush flag operation
later, for each individual driver in a separate patch.

The plan is to implement flush operation via XDP_XMIT_FLUSH flag
and then remove XDP_XMIT_FLAGS_NONE when done.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 4b8e6ac4 30-May-2018 Wei Yongjun <weiyongjun1@huawei.com>

virtio_net: fix error return code in virtnet_probe()

Fix to return a negative error code from the failover create fail error
handling case instead of 0, as done elsewhere in this function.

Fixes: ba5e4426e80e ("virtio_net: Extend virtio to use VF datapath when available")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ba5e4426 24-May-2018 Sridhar Samudrala <sridhar.samudrala@intel.com>

virtio_net: Extend virtio to use VF datapath when available

This patch enables virtio_net to switch over to a VF datapath when STANDBY
feature is enabled and a VF netdev is present with the same MAC address.
It allows live migration of a VM with a direct attached VF without the need
to setup a bond/team between a VF and virtio net device in the guest.

It uses the API that is exported by the net_failover driver to create and
and destroy a master failover netdev. When STANDBY feature is enabled, an
additional netdev(failover netdev) is created that acts as a master device
and tracks the state of the 2 lower netdevs. The original virtio_net netdev
is marked as 'standby' netdev and a passthru device with the same MAC is
registered as 'primary' netdev.

The hypervisor needs to unplug the VF device from the guest on the source
host and reset the MAC filter of the VF to initiate failover of datapath
to virtio before starting the migration. After the migration is completed,
the destination hypervisor sets the MAC filter on the VF and plugs it back
to the guest to switch over to VF datapath.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9805069d 24-May-2018 Sridhar Samudrala <sridhar.samudrala@intel.com>

virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit

This feature bit can be used by hypervisor to indicate virtio_net device to
act as a standby for another device with the same MAC address.

VIRTIO_NET_F_STANDBY is defined as bit 62 as it is a device feature bit.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 735fc405 24-May-2018 Jesper Dangaard Brouer <brouer@redhat.com>

xdp: change ndo_xdp_xmit API to support bulking

This patch change the API for ndo_xdp_xmit to support bulking
xdp_frames.

When kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge slowdown.
Most of the slowdown is caused by DMA API indirect function calls, but
also the net_device->ndo_xdp_xmit() call.

Benchmarked patch with CONFIG_RETPOLINE, using xdp_redirect_map with
single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed
performance improved:
for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps
for driver i40e : 6,187,169 pps -> 6,724,519 pps = +537,350 pps

With frames avail as a bulk inside the driver ndo_xdp_xmit call,
further optimizations are possible, like bulk DMA-mapping for TX.

Testing without CONFIG_RETPOLINE show the same performance for
physical NIC drivers.

The virtual NIC driver tun sees a huge performance boost, as it can
avoid doing per frame producer locking, but instead amortize the
locking cost over the bulk.

V2: Fix compile errors reported by kbuild test robot <lkp@intel.com>
V4: Isolated ndo, driver changes and callers.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 3d62b2a0 21-May-2018 Jason Wang <jasowang@redhat.com>

virtio-net: fix leaking page for gso packet during mergeable XDP

We need to drop refcnt to xdp_page if we see a gso packet. Otherwise
it will be leaked. Fixing this by moving the check of gso packet above
the linearizing logic. While at it, remove useless comment as well.

Cc: John Fastabend <john.fastabend@gmail.com>
Fixes: 72979a6c3590 ("virtio_net: xdp, add slowpath case for non contiguous buffers")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 850e088d 21-May-2018 Jason Wang <jasowang@redhat.com>

virtio-net: correctly check num_buf during err path

If we successfully linearize the packet, num_buf will be set to zero
which may confuse error handling path which assumes num_buf is at
least 1 and this can lead the code tries to pop the descriptor of next
buffer. Fixing this by checking num_buf against 1 before decreasing.

Fixes: 4941d472bf95 ("virtio-net: do not reset during XDP set")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5d458a13 21-May-2018 Jason Wang <jasowang@redhat.com>

virtio-net: correctly transmit XDP buff after linearizing

We should not go for the error path after successfully transmitting a
XDP buffer after linearizing. Since the error path may try to pop and
drop next packet and increase the drop counters. Fixing this by simply
drop the refcnt of original page and go for xmit path.

Fixes: 72979a6c3590 ("virtio_net: xdp, add slowpath case for non contiguous buffers")
Cc: John Fastabend <john.fastabend@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6890418b 21-May-2018 Jason Wang <jasowang@redhat.com>

virtio-net: correctly redirect linearized packet

After a linearized packet was redirected by XDP, we should not go for
the err path which will try to pop buffers for the next packet and
increase the drop counter. Fixing this by just drop the page refcnt
for the original page.

Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT")
Reported-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# aaa64527 22-Apr-2018 Nikita V. Shirokov <tehnerd@tehnerd.com>

bpf: fix virtio-net's length calc for XDP_PASS

In commit 6870de435b90 ("bpf: make virtio compatible w/
bpf_xdp_adjust_tail") i didn't account for vi->hdr_len during new
packet's length calculation after bpf_prog_run in receive_mergeable.
because of this all packets, if they were passed to the kernel,
were truncated by 12 bytes.

Fixes:6870de435b90 ("bpf: make virtio compatible w/ bpf_xdp_adjust_tail")
Reported-by: David Ahern <dsahern@gmail.com>

Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>


# f4ee703a 18-Apr-2018 Michael S. Tsirkin <mst@redhat.com>

virtio_net: sparse annotation fix

offloads is a buffer in virtio format, should use
the __virtio64 tag.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d7fad4c8 18-Apr-2018 Michael S. Tsirkin <mst@redhat.com>

virtio_net: fix adding vids on big-endian

Programming vids (adding or removing them) still passes
guest-endian values in the DMA buffer. That's wrong
if guest is big-endian and when virtio 1 is enabled.

Note: this is on top of a previous patch:
virtio_net: split out ctrl buffer

Fixes: 9465a7a6f ("virtio_net: enable v1.0 support")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 12e57169 18-Apr-2018 Michael S. Tsirkin <mst@redhat.com>

virtio_net: split out ctrl buffer

When sending control commands, virtio net sets up several buffers for
DMA. The buffers are all part of the net device which means it's
actually allocated by kvmalloc so it's in theory (on extreme memory
pressure) possible to get a vmalloc'ed buffer which on some platforms
means we can't DMA there.

Fix up by moving the DMA buffers into a separate structure.

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6870de43 17-Apr-2018 Nikita V. Shirokov <tehnerd@tehnerd.com>

bpf: make virtio compatible w/ bpf_xdp_adjust_tail

w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for virtio driver we need to adjust XDP_PASS handling by recalculating
length of the packet if it was passed to the TCP/IP stack

Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>


# 44fa2dbd 17-Apr-2018 Jesper Dangaard Brouer <brouer@redhat.com>

xdp: transition into using xdp_frame for ndo_xdp_xmit

Changing API ndo_xdp_xmit to take a struct xdp_frame instead of struct
xdp_buff. This brings xdp_return_frame and ndp_xdp_xmit in sync.

This builds towards changing the API further to become a bulk API,
because xdp_buff is not a queue-able object while xdp_frame is.

V4: Adjust for commit 59655a5b6c83 ("tuntap: XDP_TX can use native XDP")
V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT")

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 03993094 17-Apr-2018 Jesper Dangaard Brouer <brouer@redhat.com>

xdp: transition into using xdp_frame for return API

Changing API xdp_return_frame() to take struct xdp_frame as argument,
seems like a natural choice. But there are some subtle performance
details here that needs extra care, which is a deliberate choice.

When de-referencing xdp_frame on a remote CPU during DMA-TX
completion, result in the cache-line is change to "Shared"
state. Later when the page is reused for RX, then this xdp_frame
cache-line is written, which change the state to "Modified".

This situation already happens (naturally) for, virtio_net, tun and
cpumap as the xdp_frame pointer is the queued object. In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs. Thus, the only option is to
de-referencing xdp_frame.

It is only the ixgbe driver that had an optimization, in which it can
avoid doing the de-reference of xdp_frame. The driver already have
TX-ring queue, which (in case of remote DMA-TX completion) have to be
transferred between CPUs anyhow. In this data area, we stored a
struct xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing xdp_frame.

To compensate for this, a prefetchw is used for telling the cache
coherency protocol about our access pattern. My benchmarks show that
this prefetchw is enough to compensate the ixgbe driver.

V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT")
V8: Adjust for commit bd658dda4237 ("net/mlx5e: Separate dma base address
and offset in dma_sync call")

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8d5d8852 17-Apr-2018 Jesper Dangaard Brouer <brouer@redhat.com>

xdp: rhashtable with allocator ID to pointer mapping

Use the IDA infrastructure for getting a cyclic increasing ID number,
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info. Instead of using the IDR infrastructure, which
uses a radix tree, use a dynamic rhashtable, for creating ID to
pointer lookup table, because this is faster.

The problem that is being solved here is that, the xdp_rxq_info
pointer (stored in xdp_buff) cannot be used directly, as the
guaranteed lifetime is too short. The info is needed on a
(potentially) remote CPU during DMA-TX completion time . In an
xdp_frame the xdp_mem_info is stored, when it got converted from an
xdp_buff, which is sufficient for the simple page refcnt based recycle
schemes.

For more advanced allocators there is a need to store a pointer to the
registered allocator. Thus, there is a need to guard the lifetime or
validity of the allocator pointer, which is done through this
rhashtable ID map to pointer. The removal and validity of of the
allocator and helper struct xdp_mem_allocator is guarded by RCU. The
allocator will be created by the driver, and registered with
xdp_rxq_info_reg_mem_model().

It is up-to debate who is responsible for freeing the allocator
pointer or invoking the allocator destructor function. In any case,
this must happen via RCU freeing.

Use the IDA infrastructure for getting a cyclic increasing ID number,
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info.

V4: Per req of Jason Wang
- Use xdp_rxq_info_reg_mem_model() in all drivers implementing
XDP_REDIRECT, even-though it's not strictly necessary when
allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).

V6: Per req of Alex Duyck
- Introduce rhashtable_lookup() call in later patch

V8: Address sparse should be static warnings (from kbuild test robot)

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cac320c8 17-Apr-2018 Jesper Dangaard Brouer <brouer@redhat.com>

virtio_net: convert to use generic xdp_frame and xdp_return_frame API

The virtio_net driver assumes XDP frames are always released based on
page refcnt (via put_page). Thus, is only queues the XDP data pointer
address and uses virt_to_head_page() to retrieve struct page.

Use the XDP return API to get away from such assumptions. Instead
queue an xdp_frame, which allow us to use the xdp_return_frame API,
when releasing the frame.

V8: Avoid endianness issues (found by kbuild test robot)
V9: Change __virtnet_xdp_xmit from bool to int return value (found by Dan Carpenter)

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9267c430 13-Apr-2018 Jason Wang <jasowang@redhat.com>

virtio-net: add missing virtqueue kick when flushing packets

We tends to batch submitting packets during XDP_TX. This requires to
kick virtqueue after a batch, we tried to do it through
xdp_do_flush_map() which only makes sense for devmap not XDP_TX. So
explicitly kick the virtqueue in this case.

Reported-by: Kimitoshi Takahashi <ktaka@nii.ac.jp>
Tested-by: Kimitoshi Takahashi <ktaka@nii.ac.jp>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bda7fab5 22-Mar-2018 Jay Vosburgh <jay.vosburgh@canonical.com>

virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS

The operstate update logic will leave an interface in the
default UNKNOWN operstate if the interface carrier state never changes
from the default carrier up state set at creation. This includes the
case of an explicit call to netif_carrier_on, as the carrier on to on
transition has no effect on operstate.

This affects virtio-net for the case that the virtio peer does
not support VIRTIO_NET_F_STATUS (the feature that provides carrier state
updates). Without this feature, the virtio specification states that
"the link should be assumed active," so, logically, the operstate should
be UP instead of UNKNOWN. This has impact on user space applications
that use the operstate to make availability decisions for the interface.

Resolve this by changing the virtio probe logic slightly to call
netif_carrier_off for both the "with" and "without" VIRTIO_NET_F_STATUS
cases, and then the existing call to netif_carrier_on for the "without"
case will cause an operstate transition.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3cc81a9a 02-Mar-2018 Jason Wang <jasowang@redhat.com>

virtio-net: re enable XDP_REDIRECT for mergeable buffer

XDP_REDIRECT support for mergeable buffer was removed since commit
7324f5399b06 ("virtio_net: disable XDP_REDIRECT in receive_mergeable()
case"). This is because we don't reserve enough tailroom for struct
skb_shared_info which breaks XDP assumption. So this patch fixes this
by reserving enough tailroom and using fixed size of rx buffer.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 51568d69 02-Mar-2018 Jason Wang <jasowang@redhat.com>

virtio-net: re enable XDP_REDIRECT for mergeable buffer

XDP_REDIRECT support for mergeable buffer was removed since commit
7324f5399b06 ("virtio_net: disable XDP_REDIRECT in receive_mergeable()
case"). This is because we don't reserve enough tailroom for struct
skb_shared_info which breaks XDP assumption. So this patch fixes this
by reserving enough tailroom and using fixed size of rx buffer.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4e09ff53 28-Feb-2018 Jason Wang <jasowang@redhat.com>

virtio-net: disable NAPI only when enabled during XDP set

We try to disable NAPI to prevent a single XDP TX queue being used by
multiple cpus. But we don't check if device is up (NAPI is enabled),
this could result stall because of infinite wait in
napi_disable(). Fixing this by checking device state through
netif_running() before.

Fixes: 4941d472bf95b ("virtio-net: do not reset during XDP set")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8dcc5b0a 20-Feb-2018 Jesper Dangaard Brouer <brouer@redhat.com>

virtio_net: fix ndo_xdp_xmit crash towards dev not ready for XDP

When a driver implements the ndo_xdp_xmit() function, there is
(currently) no generic way to determine whether it is safe to call.

It is e.g. unsafe to call the drivers ndo_xdp_xmit, if it have not
allocated the needed XDP TX queues yet. This is the case for
virtio_net, which first allocates the XDP TX queues once an XDP/bpf
prog is attached (in virtnet_xdp_set()).

Thus, a crash will occur for virtio_net when redirecting to another
virtio_net device's ndo_xdp_xmit, which have not attached a XDP prog.
The sample xdp_redirect_map tries to attach a dummy XDP prog to take
this into account, but it can also easily fail if the virtio_net (or
actually underlying vhost driver) have not allocated enough extra
queues for the device.

Allocating more queue this is currently a manual config.
Hint for libvirt XML add:

<driver name='vhost' queues='16'>
<host mrg_rxbuf='off'/>
<guest tso4='off' tso6='off' ecn='off' ufo='off'/>
</driver>

The solution in this patch is to check that the device have loaded an
XDP/bpf prog before proceeding. This is similar to the check
performed in driver ixgbe.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 11b7d897 20-Feb-2018 Jesper Dangaard Brouer <brouer@redhat.com>

virtio_net: fix memory leak in XDP_REDIRECT

XDP_REDIRECT calling xdp_do_redirect() can fail for multiple reasons
(which can be inspected by tracepoints). The current semantics is that
on failure the driver calling xdp_do_redirect() must handle freeing or
recycling the page associated with this frame. This can be seen as an
optimization, as drivers usually have an optimized XDP_DROP code path
for frame recycling in place already.

The virtio_net driver didn't handle when xdp_do_redirect() failed.
This caused a memory leak as the page refcnt wasn't decremented on
failures.

The function __virtnet_xdp_xmit() did handle one type of failure,
when the xmit queue virtqueue_add_outbuf() is full, which "hides"
releasing a refcnt on the page. Instead the function __virtnet_xdp_xmit()
must follow API of xdp_do_redirect(), which on errors leave it up to
the caller to free the page, of the failed send operation.

Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 95dbe9e7 20-Feb-2018 Jesper Dangaard Brouer <brouer@redhat.com>

virtio_net: fix XDP code path in receive_small()

When configuring virtio_net to use the code path 'receive_small()',
in-order to get correct XDP_REDIRECT support, I discovered TCP packets
would get silently dropped when loading an XDP program action XDP_PASS.

The bug seems to be that receive_small() when XDP is loaded check that
hdr->hdr.flags is zero, which seems wrong as hdr.flags contains the
flags VIRTIO_NET_HDR_F_* :
#define VIRTIO_NET_HDR_F_NEEDS_CSUM 1 /* Use csum_start, csum_offset */
#define VIRTIO_NET_HDR_F_DATA_VALID 2 /* Csum is valid */

TCP got dropped as it had the VIRTIO_NET_HDR_F_DATA_VALID flag set.

The flags that are relevant here are the VIRTIO_NET_HDR_GSO_* flags
stored in hdr->hdr.gso_type. Thus, the fix is just check that none of
the gso_type flags have been set.

Fixes: bb91accf2733 ("virtio-net: XDP support for small buffers")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7324f539 20-Feb-2018 Jesper Dangaard Brouer <brouer@redhat.com>

virtio_net: disable XDP_REDIRECT in receive_mergeable() case

The virtio_net code have three different RX code-paths in receive_buf().
Two of these code paths can handle XDP, but one of them is broken for
at least XDP_REDIRECT.

Function(1): receive_big() does not support XDP.
Function(2): receive_small() support XDP fully and uses build_skb().
Function(3): receive_mergeable() broken XDP_REDIRECT uses napi_alloc_skb().

The simple explanation is that receive_mergeable() is broken because
it uses napi_alloc_skb(), which violates XDP given XDP assumes packet
header+data in single page and enough tail room for skb_shared_info.

The longer explaination is that receive_mergeable() tries to
work-around and satisfy these XDP requiresments e.g. by having a
function xdp_linearize_page() that allocates and memcpy RX buffers
around (in case packet is scattered across multiple rx buffers). This
does currently satisfy XDP_PASS, XDP_DROP and XDP_TX (but only because
we have not implemented bpf_xdp_adjust_tail yet).

The XDP_REDIRECT action combined with cpumap is broken, and cause hard
to debug crashes. The main issue is that the RX packet does not have
the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing
skb_shared_info to overlap the next packets head-room (in which cpumap
stores info).

Reproducing depend on the packet payload length and if RX-buffer size
happened to have tail-room for skb_shared_info or not. But to make
this even harder to troubleshoot, the RX-buffer size is runtime
dynamically change based on an Exponentially Weighted Moving Average
(EWMA) over the packet length, when refilling RX rings.

This patch only disable XDP_REDIRECT support in receive_mergeable()
case, because it can cause a real crash.

IMHO we should consider NOT supporting XDP in receive_mergeable() at
all, because the principles behind XDP are to gain speed by (1) code
simplicity, (2) sacrificing memory and (3) where possible moving
runtime checks to setup time. These principles are clearly being
violated in receive_mergeable(), that e.g. runtime track average
buffer size to save memory consumption.

In the longer run, we should consider introducing a separate receive
function when attaching an XDP program, and also change the memory
model to be compatible with XDP when attaching an XDP prog.

Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d7dfc5cf 16-Jan-2018 Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

virtio_net: Add ethtool stats

The main purpose of this patch is adding a way of checking per-queue stats.
It's useful to debug performance problems on multiqueue environment.

$ ethtool -S ens10
NIC statistics:
rx_queue_0_packets: 2090408
rx_queue_0_bytes: 3164825094
rx_queue_1_packets: 2082531
rx_queue_1_bytes: 3152932314
tx_queue_0_packets: 2770841
tx_queue_0_bytes: 4194955474
tx_queue_1_packets: 3084697
tx_queue_1_bytes: 4670196372

This change converts existing per-cpu stats structure into per-queue one.
This should not impact on performance since each queue counter is not
updated concurrently by multiple cpus.

Performance numbers:
- Guest has 2 vcpus and 2 queues
- Guest runs netserver
- Host runs 100-flow super_netperf

Before After Diff
UDP_STREAM 18byte 86.22 87.00 +0.90%
UDP_STREAM 1472byte 4055.27 4042.18 -0.32%
TCP_STREAM 16956.32 16890.63 -0.39%
UDP_RR 178667.11 185862.70 +4.03%
TCP_RR 128473.04 124985.81 -2.71%

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>


# faa9b39f 05-Jan-2018 Jason Baron <jbaron@akamai.com>

virtio_net: propagate linkspeed/duplex settings from the hypervisor

The ability to set speed and duplex for virtio_net is useful in various
scenarios as described here:

16032be virtio_net: add ethtool support for set and get of settings

However, it would be nice to be able to set this from the hypervisor,
such that virtio_net doesn't require custom guest ethtool commands.

Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
the hypervisor to export a linkspeed and duplex setting. The user can
subsequently overwrite it later if desired via: 'ethtool -s'.

Note that VIRTIO_NET_F_SPEED_DUPLEX is defined as bit 63, the intention
is that device feature bits are to grow down from bit 63, since the
transports are starting from bit 24 and growing up.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: virtio-dev@lists.oasis-open.org
Signed-off-by: David S. Miller <davem@davemloft.net>


# 754b8a21 03-Jan-2018 Jesper Dangaard Brouer <brouer@redhat.com>

virtio_net: setup xdp_rxq_info

The virtio_net driver doesn't dynamically change the RX-ring queue
layout and backing pages, but instead reject XDP setup if all the
conditions for XDP is not meet. Thus, the xdp_rxq_info also remains
fairly static. This allow us to simply add the reg/unreg to
net_device open/close functions.

Driver hook points for xdp_rxq_info:
* reg : virtnet_open
* unreg: virtnet_close

V3:
- bugfix, also setup xdp.rxq in receive_mergeable()
- Tested bpf-sample prog inside guest on a virtio_net device

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# fdaa767a 06-Dec-2017 Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

virtio_net: Disable interrupts if napi_complete_done rescheduled napi

Since commit 39e6c8208d7b ("net: solve a NAPI race") napi has been able
to be rescheduled within napi_complete_done() even in non-busypoll case,
but virtnet_poll() always enabled interrupts before complete, and when
napi was rescheduled within napi_complete_done() it did not disable
interrupts.
This caused more interrupts when event idx is disabled.

According to commit cbdadbbf0c79 ("virtio_net: fix race in RX VQ
processing") we cannot place virtqueue_enable_cb_prepare() after
NAPI_STATE_SCHED is cleared, so disable interrupts again if
napi_complete_done() returned false.

Tested with vhost-user of OVS 2.7 on host, which does not have the event
idx feature.

* Before patch:

$ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.150.253 () port 0 AF_INET
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec

212992 1472 60.00 32763206 0 6430.32
212992 60.00 23384299 4589.56

Interrupts on guest: 9872369
Packets/interrupt: 2.37

* After patch

$ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.150.253 () port 0 AF_INET
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec

212992 1472 60.00 32794646 0 6436.49
212992 60.00 32793501 6436.27

Interrupts on guest: 4941299
Packets/interrupt: 6.64

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 03e9f8a0 03-Dec-2017 Yunjian Wang <wangyunjian@huawei.com>

virtio_net: fix return value check in receive_mergeable()

The function virtqueue_get_buf_ctx() could return NULL, the return
value 'buf' need to be checked with NULL, not value 'ctx'.

Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 453f85d4 15-Nov-2017 Mel Gorman <mgorman@techsingularity.net>

mm: remove __GFP_COLD

As the page free path makes no distinction between cache hot and cold
pages, there is no real useful ordering of pages in the free list that
allocation requests can take advantage of. Juding from the users of
__GFP_COLD, it is likely that a number of them are the result of copying
other sites instead of actually measuring the impact. Remove the
__GFP_COLD parameter which simplifies a number of paths in the page
allocator.

This is potentially controversial but bear in mind that the size of the
per-cpu pagelists versus modern cache sizes means that the whole per-cpu
list can often fit in the L3 cache. Hence, there is only a potential
benefit for microbenchmarks that alloc/free pages in a tight loop. It's
even worse when THP is taken into account which has little or no chance
of getting a cache-hot page as the per-cpu list is bypassed and the
zeroing of multiple pages will thrash the cache anyway.

The truncate microbenchmarks are not shown as this patch affects the
allocation path and not the free path. A page fault microbenchmark was
tested but it showed no sigificant difference which is not surprising
given that the __GFP_COLD branches are a miniscule percentage of the
fault path.

Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f4e63525 03-Nov-2017 Jakub Kicinski <kuba@kernel.org>

net: bpf: rename ndo_xdp to ndo_bpf

ndo_xdp is a control path callback for setting up XDP in the
driver. We can reuse it for other forms of communication
between the eBPF stack and the drivers. Rename the callback
and associated structures and definitions.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# de8f3a83 24-Sep-2017 Daniel Borkmann <daniel@iogearbox.net>

bpf: add meta pointer for direct access

This work enables generic transfer of metadata from XDP into skb. The
basic idea is that we can make use of the fact that the resulting skb
must be linear and already comes with a larger headroom for supporting
bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
for adjusting a new pointer called xdp->data_meta. Thus, the packet has
a flexible and programmable room for meta data, followed by the actual
packet data. struct xdp_buff is therefore laid out that we first point
to data_hard_start, then data_meta directly prepended to data followed
by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
account whether we have meta data already prepended and if so, memmove()s
this along with the given offset provided there's enough room.

xdp->data_meta is optional and programs are not required to use it. The
rationale is that when we process the packet in XDP (e.g. as DoS filter),
we can push further meta data along with it for the XDP_PASS case, and
give the guarantee that a clsact ingress BPF program on the same device
can pick this up for further post-processing. Since we work with skb
there, we can also set skb->mark, skb->priority or other skb meta data
out of BPF, thus having this scratch space generic and programmable
allows for more flexibility than defining a direct 1:1 transfer of
potentially new XDP members into skb (it's also more efficient as we
don't need to initialize/handle each of such new members). The facility
also works together with GRO aggregation. The scratch space at the head
of the packet can be multiple of 4 byte up to 32 byte large. Drivers not
yet supporting xdp->data_meta can simply be set up with xdp->data_meta
as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
such that the subsequent match against xdp->data for later access is
guaranteed to fail.

The verifier treats xdp->data_meta/xdp->data the same way as we treat
xdp->data/xdp->data_end pointer comparisons. The requirement for doing
the compare against xdp->data is that it hasn't been modified from it's
original address we got from ctx access. It may have a range marking
already from prior successful xdp->data/xdp->data_end pointer comparisons
though.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# dd543797 22-Sep-2017 Jason Wang <jasowang@redhat.com>

virtio-net: correctly set xdp_xmit for mergeable buffer

We should set xdp_xmit only when xdp_do_redirect() succeed.

Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 186b3c99 19-Sep-2017 Jason Wang <jasowang@redhat.com>

virtio-net: support XDP_REDIRECT

This patch tries to add XDP_REDIRECT for virtio-net. The changes are
not complex as we could use exist XDP_TX helpers for most of the
work. The rest is passing the XDP_TX to NAPI handler for implementing
batching.

Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 31240345 19-Sep-2017 Jason Wang <jasowang@redhat.com>

virtio-net: add packet len average only when needed during XDP

There's no need to add packet len average in the case of XDP_PASS
since it will be done soon after skb is created.

Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9457642a 19-Sep-2017 Jason Wang <jasowang@redhat.com>

virtio-net: remove unnecessary parameter of virtnet_xdp_xmit()

CC: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# dadc0736 24-Aug-2017 Eric Dumazet <edumazet@google.com>

virtio_net: be drop monitor friendly

This change is needed to not fool drop monitor.
(perf record ... -e skb:kfree_skb )

Packets were properly sent and are consumed after TX completion.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 718ad681 18-Aug-2017 stephen hemminger <stephen@networkplumber.org>

net: drop unused attribute argument from sysfs queue funcs

The show and store functions don't need/use the attribute.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a4a76503 15-Aug-2017 stephen hemminger <stephen@networkplumber.org>

virtio: put paren around sizeof

Kernel coding style is to put paren around operand of sizeof.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7acd4329 12-Aug-2017 Colin Ian King <colin.king@canonical.com>

virtio-net: make array guest_offloads static

The array guest_offloads is local to the source and does not need to
be in global scope, so make it static. Also tweak formatting.

Cleans up sparse warnings:
symbol 'guest_offloads' was not declared. Should it be static?

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1daa8790 31-Jul-2017 Michael S. Tsirkin <mst@redhat.com>

virtio_net: fix truesize for mergeable buffers

Seth Forshee noticed a performance degradation with some workloads.
This turns out to be due to packet drops. Euan Kemp noticed that this
is because we drop all packets where length exceeds the truesize, but
for some packets we add in extra memory without updating the truesize.
This in turn was kept around unchanged from ab7db91705e95 ("virtio-net:
auto-tune mergeable rx buffer size for improved performance"). That
commit had an internal reason not to account for the extra space: not
enough bits to do it. No longer true so let's account for the allocated
length exactly.

Many thanks to Seth Forshee for the report and bisecting and Euan Kemp
for debugging the issue.

Fixes: 680557cf79f8 ("virtio_net: rework mergeable buffer handling")
Reported-by: Euan Kemp <euan.kemp@coreos.com>
Tested-by: Euan Kemp <euan.kemp@coreos.com>
Reported-by: Seth Forshee <seth.forshee@canonical.com>
Tested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 67a75194 25-Jul-2017 Arnd Bergmann <arnd@arndb.de>

virtio-net: mark PM functions as __maybe_unused

After removing the reset function, the freeze and restore functions
are now unused when CONFIG_PM_SLEEP is disabled:

drivers/net/virtio_net.c:1881:12: error: 'virtnet_restore_up' defined but not used [-Werror=unused-function]
static int virtnet_restore_up(struct virtio_device *vdev)
drivers/net/virtio_net.c:1859:13: error: 'virtnet_freeze_down' defined but not used [-Werror=unused-function]
static void virtnet_freeze_down(struct virtio_device *vdev)

A more robust way to do this is to remove the #ifdef around the callers
and instead mark them as __maybe_unused. The compiler will now just
silently drop the unused code.

Fixes: 4941d472bf95 ("virtio-net: do not reset during XDP set")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cfa0ebc9 24-Jul-2017 Andrew Jones <drjones@redhat.com>

virtio-net: fix module unloading

Unregister the driver before removing multi-instance hotplug
callbacks. This order avoids the warning issued from
__cpuhp_remove_state_cpuslocked when the number of remaining
instances isn't yet zero.

Fixes: 8017c279196a ("net/virtio-net: Convert to hotplug state machine")
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 3f93522f 19-Jul-2017 Jason Wang <jasowang@redhat.com>

virtio-net: switch off offloads on demand if possible on XDP set

Current XDP implementation wants guest offloads feature to be disabled
on device. This is inconvenient and means guest can't benefit from
offloads if XDP is not used. This patch tries to address this
limitation by disabling the offloads on demand through control guest
offloads. Guest offloads will be disabled and enabled on demand on XDP
set.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4941d472 19-Jul-2017 Jason Wang <jasowang@redhat.com>

virtio-net: do not reset during XDP set

We currently reset the device during XDP set, the main reason is
that we allocate more headroom with XDP (for header adjustment).

This works but causes network downtime for users.

Previous patches encoded the headroom in the buffer context,
this makes it possible to detect the case where a buffer
with headroom insufficient for XDP is added to the queue and
XDP is enabled afterwards.

Upon detection, we handle this case by copying the packet
(slow, but it's a temporary condition).

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 192f68cf 19-Jul-2017 Jason Wang <jasowang@redhat.com>

virtio-net: switch to use new ctx API for small buffer

Use ctx API to store headroom for small buffers.
Following patches will retrieve this info and use it for XDP.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 28b39bc7 19-Jul-2017 Jason Wang <jasowang@redhat.com>

virtio-net: pack headroom into ctx for mergeable buffers

Pack headroom into ctx - this way when we get a buffer we can figure out
the actual headroom that was allocated for the buffer. Will be helpful
to optimize switching between XDP and non-XDP modes which have different
headroom requirements.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e078de03 03-Jul-2017 David S. Miller <davem@davemloft.net>

virtio_net: Remove references to NETIF_F_UFO.

It is going away.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 55281621 07-Jul-2017 Jason Wang <jasowang@redhat.com>

virtio-net: fix leaking of ctx array

Fixes: commit d45b897b11ea ("virtio_net: allow specifying context for rx")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 713a98d9 27-Jun-2017 Jason Wang <jasowang@redhat.com>

virtio-net: serialize tx routine during reset

We don't hold any tx lock when trying to disable TX during reset, this
would lead a use after free since ndo_start_xmit() tries to access
the virtqueue which has already been freed. Fix this by using
netif_tx_disable() before freeing the vqs, this could make sure no tx
after vq freeing.

Reported-by: Jean-Philippe Menil <jpmenil@gmail.com>
Tested-by: Jean-Philippe Menil <jpmenil@gmail.com>
Fixes commit f600b6905015 ("virtio_net: Add XDP support")
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Robert McCabe <robert.mccabe@rockwellcollins.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5b0e6629 15-Jun-2017 Martin KaFai Lau <kafai@fb.com>

bpf: virtio_net: Report bpf_prog ID during XDP_QUERY_PROG

Add support to virtio_net to report bpf_prog ID during XDP_QUERY_PROG.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 59ae1d12 16-Jun-2017 Johannes Berg <johannes.berg@intel.com>

networking: introduce and use skb_put_data()

A common pattern with skb_put() is to just want to memcpy()
some data into the new space, introduce skb_put_data() for
this.

An spatch similar to the one for skb_put_zero() converts many
of the places using it:

@@
identifier p, p2;
expression len, skb, data;
type t, t2;
@@
(
-p = skb_put(skb, len);
+p = skb_put_data(skb, data, len);
|
-p = (t)skb_put(skb, len);
+p = skb_put_data(skb, data, len);
)
(
p2 = (t2)p;
-memcpy(p2, data, len);
|
-memcpy(p, data, len);
)

@@
type t, t2;
identifier p, p2;
expression skb, data;
@@
t *p;
...
(
-p = skb_put(skb, sizeof(t));
+p = skb_put_data(skb, data, sizeof(t));
|
-p = (t *)skb_put(skb, sizeof(t));
+p = skb_put_data(skb, data, sizeof(t));
)
(
p2 = (t2)p;
-memcpy(p2, data, sizeof(*p));
|
-memcpy(p, data, sizeof(*p));
)

@@
expression skb, len, data;
@@
-memcpy(skb_put(skb, len), data, len);
+skb_put_data(skb, data, len);

(again, manually post-processed to retain some comments)

Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e2fcad58 03-Jun-2017 Jason A. Donenfeld <Jason@zx2c4.com>

virtio_net: check return value of skb_to_sgvec always

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f0c3192c 02-Jun-2017 Michael S. Tsirkin <mst@redhat.com>

virtio_net: lower limit on buffer size

commit d85b758f72b0 ("virtio_net: fix support for small rings")
was supposed to increase the buffer size for small rings but had an
unintentional side effect of decreasing it for large rings. This seems
to break some setups - it's not yet clear why, but increasing buffer
size back to what it was before helps.

Fixes: d85b758f72b0 ("virtio_net: fix support for small rings")
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: "J. Bruce Fields" <bfields@fieldses.org>
Tested-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2836b4f2 23-May-2017 Vlad Yasevich <vyasevich@gmail.com>

virtio-net: enable TSO/checksum offloads for Q-in-Q vlans

Since virtio does not provide it's own ndo_features_check handler,
TSO, and now checksum offload, are disabled for stacked vlans.
Re-enable the support and let the host take care of it. This
restores/improves Guest-to-Guest performance over Q-in-Q vlans.

Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 56da5fd0 05-Apr-2017 Dan Carpenter <dan.carpenter@oracle.com>

virtio_net: tidy a couple debug statements

We are printing a decimal value for truesize so we shouldn't use an "0x"
prefix.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 5f24df09 29-Mar-2017 Michael S. Tsirkin <mst@redhat.com>

virtio_net: don't reset twice on XDP on/off

We already do a reset once in remove_vq_common -
there appears to be no point in doing another one
when we add/remove XDP.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# d85b758f7 08-Mar-2017 Michael S. Tsirkin <mst@redhat.com>

virtio_net: fix support for small rings

When ring size is small (<32 entries) making buffers smaller means a
full ring might not be able to hold enough buffers to fit a single large
packet.

Make sure a ring full of buffers is large enough to allow at least one
packet of max size.

Fixes: 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page frag allocators")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# e377fcc8 06-Mar-2017 Michael S. Tsirkin <mst@redhat.com>

virtio_net: reduce alignment for buffers

We don't need to align length to any particular
value anymore. Aligning to L1 cache size probably
sill makes sense to reduce false sharing.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 680557cf 06-Mar-2017 Michael S. Tsirkin <mst@redhat.com>

virtio_net: rework mergeable buffer handling

Use the new _ctx virtio API to maintain true length for each buffer.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# d45b897b 06-Mar-2017 Michael S. Tsirkin <mst@redhat.com>

virtio_net: allow specifying context for rx

With mergeable buffers we never use s/g for rx,
so allow specifying context in that case.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 4d463c4d 02-May-2017 Daniel Borkmann <daniel@iogearbox.net>

xdp: use common helper for netlink extended ack reporting

Small follow-up to d74a32acd59a ("xdp: use netlink extended ACK reporting")
in order to let drivers all use the same NL_SET_ERR_MSG_MOD() helper macro
for reporting. This also ensures that we consistently add the driver's
prefix for dumping the report in user space to indicate that the error
message is driver specific and not coming from core code. Furthermore,
NL_SET_ERR_MSG_MOD() now reuses NL_SET_ERR_MSG() and thus makes all macros
check the pointer as suggested.

References: https://www.spinics.net/lists/netdev/msg433267.html
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9b2bbdb2 06-Mar-2017 Michael S. Tsirkin <mst@redhat.com>

virtio: wrap find_vqs

We are going to add more parameters to find_vqs, let's wrap the call so
we don't need to tweak all drivers every time.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 9861ce03 30-Apr-2017 Jakub Kicinski <kuba@kernel.org>

virtio_net: make use of extended ack message reporting

Try to carry error messages to the user via the netlink extended
ack message attribute.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1d11e732 27-Apr-2017 Willem de Bruijn <willemb@google.com>

virtio-net: use netif_tx_napi_add for tx napi

Avoid hashing the tx napi struct into napi_hash[], which is used for
busy polling receive queues.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 78a57b48 25-Apr-2017 Willem de Bruijn <willemb@google.com>

virtio-net: on tx, only call napi_disable if tx napi is on

As of tx napi, device down (`ip link set dev $dev down`) hangs unless
tx napi is enabled. Else napi_enable is not called, so napi_disable
will spin on test_and_set_bit NAPI_STATE_SCHED.

Only call napi_disable if tx napi is enabled.

Fixes: 5a719c2552ca ("virtio-net: transmit napi")
Reported-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bdb12e0d 24-Apr-2017 Willem de Bruijn <willemb@google.com>

virtio-net: keep tx interrupts disabled unless kick

Tx napi mode increases the rate of transmit interrupts. Suppress some
by masking interrupts while more packets are expected. The interrupts
will be reenabled before the last packet is sent.

This optimization reduces the througput drop with tx napi for
unidirectional flows such as UDP_STREAM that do not benefit from
cleaning tx completions in the the receive napi handler.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7b0411ef 24-Apr-2017 Willem de Bruijn <willemb@google.com>

virtio-net: clean tx descriptors from rx napi

Amortize the cost of virtual interrupts by doing both rx and tx work
on reception of a receive interrupt if tx napi is enabled. With
VIRTIO_F_EVENT_IDX, this suppresses most explicit tx completion
interrupts for bidirectional workloads.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ea7735d9 24-Apr-2017 Willem de Bruijn <willemb@google.com>

virtio-net: move free_old_xmit_skbs

An upcoming patch will call free_old_xmit_skbs indirectly from
virtnet_poll. Move the function above this to avoid having to
introduce a forward declaration.

This is a pure move: no code changes.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b92f1e67 24-Apr-2017 Willem de Bruijn <willemb@google.com>

virtio-net: transmit napi

Convert virtio-net to a standard napi tx completion path. This enables
better TCP pacing using TCP small queues and increases single stream
throughput.

The virtio-net driver currently cleans tx descriptors on transmission
of new packets in ndo_start_xmit. Latency depends on new traffic, so
is unbounded. To avoid deadlock when a socket reaches its snd limit,
packets are orphaned on tranmission. This breaks socket backpressure,
including TSQ.

Napi increases the number of interrupts generated compared to the
current model, which keeps interrupts disabled as long as the ring
has enough free descriptors. Keep tx napi optional and disabled for
now. Follow-on patches will reduce the interrupt cost.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e4e8452a 24-Apr-2017 Willem de Bruijn <willemb@google.com>

virtio-net: napi helper functions

Prepare virtio-net for tx napi by converting existing napi code to
use helper functions. This also deduplicates some logic.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fe36cbe0 29-Mar-2017 Michael S. Tsirkin <mst@redhat.com>

virtio_net: clear MTU when out of range

virtio attempts to clear the MTU feature bit if the value is out of the
supported range, but this has no real effect since FEATURES_OK has
already been set.

Fix this up by checking the MTU in the new validate callback.

Fixes: 14de9d114a82 ("virtio-net: Add initial MTU advice feature")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 2e123b44 07-Mar-2017 Michael S. Tsirkin <mst@redhat.com>

virtio_net: enable big packets for large MTU values

If one enables e.g. jumbo frames without mergeable
buffers, packets won't fit in 1500 byte buffers
we use. Switch to big packet mode instead.
TODO: make sizing more exact, possibly extend small
packet mode to use larger pages.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# ebb6b4b1 21-Mar-2017 Philippe Reynes <tremyfr@gmail.com>

net: virtio_net: use new api ethtool_{get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# eb1e011a 15-Feb-2017 Johannes Berg <johannes.berg@intel.com>

average: change to declare precision, not factor

Declaring the factor is counter-intuitive, and people are prone
to using small(-ish) values even when that makes no sense.

Change the DECLARE_EWMA() macro to take the fractional precision,
in bits, rather than a factor, and update all users.

While at it, add some more documentation.

Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>


# fb5e31d9 05-Feb-2017 Christoph Hellwig <hch@lst.de>

virtio: allow drivers to request IRQ affinity when creating VQs

Add a struct irq_affinity pointer to the find_vqs methods, which if set
is used to tell the PCI layer to create the MSI-X vectors for our I/O
virtqueues with the proper affinity from the start. Compared to after
the fact affinity hints this gives us an instantly working setup and
allows to allocate the irq descritors node-local and avoid interconnect
traffic. Last but not least this will allow blk-mq queues are created
based on the interrupt affinity for storage drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# f6b10209 21-Feb-2017 Jason Wang <jasowang@redhat.com>

virtio-net: switch to use build_skb() for small buffer

This patch switch to use build_skb() for small buffer which can have
better performance for both TCP and XDP (since we can work at page
before skb creation). It also remove lots of XDP codes since both
mergeable and small buffer use page frag during refill now.

Before | After
XDP_DROP(xdp1) 64B : 11.1Mpps | 14.4Mpps

Tested with xdp1/xdp2/xdp_ip_tx_tunnel and netperf.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 017b29c3 19-Feb-2017 Jason Wang <jasowang@redhat.com>

virito-net: set queues after reset during xdp_set

We set queues before reset which will cause a crash[1]. This is
because is_xdp_raw_buffer_queue() depends on the old xdp queue pairs
number to do the correct detection. So fix this by

- passing xdp queue pairs and current queue pairs to virtnet_reset()
- change vi->xdp_qp after reset but before refill, to make sure both
free_unused_bufs() and refill can make correct detection of XDP.
- remove the duplicated queue pairs setting before virtnet_reset()
since we will do it inside virtnet_reset()

[1]

[ 74.328168] general protection fault: 0000 [#1] SMP
[ 74.328625] Modules linked in: nfsd xfs libcrc32c virtio_net virtio_pci
[ 74.329117] CPU: 0 PID: 2849 Comm: xdp2 Not tainted 4.10.0-rc7+ #499
[ 74.329577] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
[ 74.330424] task: ffff88007a894000 task.stack: ffffc90004388000
[ 74.330844] RIP: 0010:skb_release_head_state+0x28/0x80
[ 74.331298] RSP: 0018:ffffc9000438b8d0 EFLAGS: 00010206
[ 74.331676] RAX: 0000000000000000 RBX: ffff88007ad96300 RCX: 0000000000000000
[ 74.332217] RDX: ffff88007fc137a8 RSI: ffff88007fc0db28 RDI: 0001bf00000001be
[ 74.332758] RBP: ffffc9000438b8d8 R08: 000000000005008f R09: 00000000000005f9
[ 74.333274] R10: ffff88007d001700 R11: ffffffff820a8a4d R12: ffff88007ad96300
[ 74.333787] R13: 0000000000000002 R14: ffff880036604000 R15: 000077ff80000000
[ 74.334308] FS: 00007fc70d8a7b40(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[ 74.334891] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 74.335314] CR2: 00007fff4144a710 CR3: 000000007ab56000 CR4: 00000000003406f0
[ 74.335830] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 74.336373] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 74.336895] Call Trace:
[ 74.337086] skb_release_all+0xd/0x30
[ 74.337356] consume_skb+0x2c/0x90
[ 74.337607] free_unused_bufs+0x1ff/0x270 [virtio_net]
[ 74.337988] ? vp_synchronize_vectors+0x3b/0x60 [virtio_pci]
[ 74.338398] virtnet_xdp+0x21e/0x440 [virtio_net]
[ 74.338741] dev_change_xdp_fd+0x101/0x140
[ 74.339048] do_setlink+0xcf4/0xd20
[ 74.339304] ? symcmp+0xf/0x20
[ 74.339529] ? mls_level_isvalid+0x52/0x60
[ 74.339828] ? mls_range_isvalid+0x43/0x50
[ 74.340135] ? nla_parse+0xa0/0x100
[ 74.340400] rtnl_setlink+0xd4/0x120
[ 74.340664] ? cpumask_next_and+0x30/0x50
[ 74.340966] rtnetlink_rcv_msg+0x7f/0x1f0
[ 74.341259] ? sock_has_perm+0x59/0x60
[ 74.341586] ? napi_consume_skb+0xe2/0x100
[ 74.342010] ? rtnl_newlink+0x890/0x890
[ 74.342435] netlink_rcv_skb+0x92/0xb0
[ 74.342846] rtnetlink_rcv+0x23/0x30
[ 74.343277] netlink_unicast+0x162/0x210
[ 74.343677] netlink_sendmsg+0x2db/0x390
[ 74.343968] sock_sendmsg+0x33/0x40
[ 74.344233] SYSC_sendto+0xee/0x160
[ 74.344482] ? SYSC_bind+0xb0/0xe0
[ 74.344806] ? sock_alloc_file+0x92/0x110
[ 74.345106] ? fd_install+0x20/0x30
[ 74.345360] ? sock_map_fd+0x3f/0x60
[ 74.345586] SyS_sendto+0x9/0x10
[ 74.345790] entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 74.346086] RIP: 0033:0x7fc70d1b8f6d
[ 74.346312] RSP: 002b:00007fff4144a708 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 74.346785] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007fc70d1b8f6d
[ 74.347244] RDX: 000000000000002c RSI: 00007fff4144a720 RDI: 0000000000000003
[ 74.347683] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
[ 74.348544] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff4144bd90
[ 74.349082] R13: 0000000000000002 R14: 0000000000000002 R15: 00007fff4144cda0
[ 74.349607] Code: 00 00 00 55 48 89 e5 53 48 89 fb 48 8b 7f 58 48 85 ff 74 0e 40 f6 c7 01 74 3d 48 c7 43 58 00 00 00 00 48 8b 7b 68 48 85 ff 74 05 <f0> ff 0f 74 20 48 8b 43 60 48 85 c0 74 14 65 8b 15 f3 ab 8d 7e
[ 74.351008] RIP: skb_release_head_state+0x28/0x80 RSP: ffffc9000438b8d0
[ 74.351625] ---[ end trace fe6e19fd11cfc80b ]---

Fixes: 2de2f7f40ef9 ("virtio_net: XDP support for adjust_head")
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 61845d20 16-Feb-2017 Jason Wang <jasowang@redhat.com>

virtio-net: batch stats updating

We already have counters for sent/recv packets and sent/recv bytes.
Doing a batched update to reduce the number of
u64_stats_update_begin/end().

Take care not to bother with stats update when called
speculatively.

Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2de2f7f4 02-Feb-2017 John Fastabend <john.fastabend@gmail.com>

virtio_net: XDP support for adjust_head

Add support for XDP adjust head by allocating a 256B header region
that XDP programs can grow into. This is only enabled when a XDP
program is loaded.

In order to ensure that we do not have to unwind queue headroom push
queue setup below bpf_prog_add. It reads better to do a prog ref
unwind vs another queue setup call.

At the moment this code must do a full reset to ensure old buffers
without headroom on program add or with headroom on program removal
are not used incorrectly in the datapath. Ideally we would only
have to disable/enable the RX queues being updated but there is no
API to do this at the moment in virtio so use the big hammer. In
practice it is likely not that big of a problem as this will only
happen when XDP is enabled/disabled changing programs does not
require the reset. There is some risk that the driver may either
have an allocation failure or for some reason fail to correctly
negotiate with the underlying backend in this case the driver will
be left uninitialized. I have not seen this ever happen on my test
systems and for what its worth this same failure case can occur
from probe and other contexts in virtio framework.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9fe7bfce 02-Feb-2017 John Fastabend <john.fastabend@gmail.com>

virtio_net: refactor freeze/restore logic into virtnet reset logic

For XDP we will need to reset the queues to allow for buffer headroom
to be configured. In order to do this we need to essentially run the
freeze()/restore() code path. Unfortunately the locking requirements
between the freeze/restore and reset paths are different however so
we can not simply reuse the code.

This patch refactors the code path and adds a reset helper routine.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 722d8283 02-Feb-2017 John Fastabend <john.fastabend@gmail.com>

virtio_net: remove duplicate queue pair binding in XDP

Factor out qp assignment.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0354e4d1 02-Feb-2017 John Fastabend <john.fastabend@gmail.com>

virtio_net: factor out xdp handler for readability

At this point the do_xdp_prog is mostly if/else branches handling
the different modes of virtio_net. So remove it and handle running
the program in the per mode handlers.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 47315329 02-Feb-2017 John Fastabend <john.fastabend@gmail.com>

virtio_net: wrap rtnl_lock in test for calling with lock already held

For XDP use case and to allow ethtool reset tests it is useful to be
able to use reset paths from contexts where rtnl lock is already
held.

This requries updating virtnet_set_queues and free_receive_bufs the
two places where rtnl_lock is taken in virtio_net. To do this we
use the following pattern,

_foo(...) { do stuff }
foo(...) { rtnl_lock(); _foo(...); rtnl_unlock()};

this allows us to use freeze()/restore() flow from both contexts.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4d6308aa 04-Feb-2017 Eric Dumazet <edumazet@google.com>

virtio_net: exploit napi_complete_done() return value

Since commit 364b6055738b ("net: busy-poll: return busypolling status to
drivers"), napi_complete_done() returns a boolean that can be used
by drivers to conditionally rearm interrupts.

This patch changes virtio_net to use this boolean to avoid a bit of
overhead for busy-poll users.

Jason reports about 1.1% improvement for 1 byte TCP_RR (burst 100).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ceef438d 02-Feb-2017 Eric Dumazet <edumazet@google.com>

virtio_net: remove custom busy_poll

Generic NAPI busy polling allows us to remove custom implementations
found in drivers.

It is possible further optimization could be done by testing
napi_complete_done() return value.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 529ec6ac 25-Jan-2017 Jakub Kicinski <kuba@kernel.org>

virtio_net: reject XDP programs using header adjustment

commit 17bedab27231 ("bpf: xdp: Allow head adjustment in XDP prog")
added a new XDP helper to prepend and remove data from a frame.
Make virtio_net reject programs making use of this helper until
proper support is added.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b68df015 25-Jan-2017 John Fastabend <john.fastabend@gmail.com>

virtio_net: use dev_kfree_skb for small buffer XDP receive

In the small buffer case during driver unload we currently use
put_page instead of dev_kfree_skb. Resolve this by adding a check
for virtnet mode when checking XDP queue type. Also name the
function so that the code reads correctly to match the additional
check.

Fixes: bb91accf2733 ("virtio-net: XDP support for small buffers")
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a67edbf4 24-Jan-2017 Daniel Borkmann <daniel@iogearbox.net>

bpf: add initial bpf tracepoints

This work adds a number of tracepoints to paths that are either
considered slow-path or exception-like states, where monitoring or
inspecting them would be desirable.

For bpf(2) syscall, tracepoints have been placed for main commands
when they succeed. In XDP case, tracepoint is for exceptions, that
is, f.e. on abnormal BPF program exit such as unknown or XDP_ABORTED
return code, or when error occurs during XDP_TX action and the packet
could not be forwarded.

Both have been split into separate event headers, and can be further
extended. Worst case, if they unexpectedly should get into our way in
future, they can also removed [1]. Of course, these tracepoints (like
any other) can be analyzed by eBPF itself, etc. Example output:

# ./perf record -a -e bpf:* sleep 10
# ./perf script
sock_example 6197 [005] 283.980322: bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0
sock_example 6197 [005] 283.980721: bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5
sock_example 6197 [005] 283.988423: bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
sock_example 6197 [005] 283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00]
[...]
sock_example 6197 [005] 288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00]
swapper 0 [005] 289.338243: bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER

[1] https://lwn.net/Articles/705270/

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d0fa28f0 23-Jan-2017 Michael S. Tsirkin <mst@redhat.com>

virtio_net: fix PAGE_SIZE > 64k

I don't have any guests with PAGE_SIZE > 64k but the
code seems to be clearly broken in that case
as PAGE_SIZE / MERGEABLE_BUFFER_ALIGN will need
more than 8 bit and so the code in mergeable_ctx_to_buf_address
does not give us the actual true size.

Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6391a448 19-Jan-2017 Jason Wang <jasowang@redhat.com>

virtio-net: restore VIRTIO_HDR_F_DATA_VALID on receiving

Commit 501db511397f ("virtio: don't set VIRTIO_NET_HDR_F_DATA_VALID on
xmit") in fact disables VIRTIO_HDR_F_DATA_VALID on receiving path too,
fixing this by adding a hint (has_data_valid) and set it only on the
receiving path.

Cc: Rolf Neugebauer <rolf.neugebauer@docker.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Rolf Neugebauer <rolf.neugebauer@docker.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bc1f4470 06-Jan-2017 stephen hemminger <stephen@networkplumber.org>

net: make ndo_get_stats64 a void function

The network device operation for reading statistics is only called
in one place, and it ignores the return value. Having a structure
return value is potentially confusing because some future driver could
incorrectly assume that the return value was used.

Fix all drivers with ndo_get_stats64 to have a void function.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 801822d1 23-Dec-2016 Shyam Saini <mayhs11saini@gmail.com>

net: Use kmemdup instead of kmalloc and memcpy

when some other buffer is immediately copied into allocated region.
Replace calls to kmalloc followed by a memcpy with a direct
call to kmemdup.

Signed-off-by: Shyam Saini <mayhs11saini@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 73c1b41e 21-Dec-2016 Thomas Gleixner <tglx@linutronix.de>

cpu/hotplug: Cleanup state names

When the state names got added a script was used to add the extra argument
to the calls. The script basically converted the state constant to a
string, but the cleanup to convert these strings into meaningful ones did
not happen.

Replace all the useless strings with 'subsys/xxx/yyy:state' strings which
are used in all the other places already.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20161221192112.085444152@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# bb91accf 23-Dec-2016 Jason Wang <jasowang@redhat.com>

virtio-net: XDP support for small buffers

Commit f600b6905015 ("virtio_net: Add XDP support") leaves the case of
small receive buffer untouched. This will confuse the user who want to
set XDP but use small buffers. Other than forbid XDP in small buffer
mode, let's make it work. XDP then can only work at skb->data since
virtio-net create skbs during refill, this is sub optimal which could
be optimized in the future.

Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c47a43d3 23-Dec-2016 Jason Wang <jasowang@redhat.com>

virtio-net: remove big packet XDP codes

Now we in fact don't allow XDP for big packets, remove its codes.

Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 92502fe8 23-Dec-2016 Jason Wang <jasowang@redhat.com>

virtio-net: forbid XDP when VIRTIO_NET_F_GUEST_UFO is support

When VIRTIO_NET_F_GUEST_UFO is negotiated, host could still send UFO
packet that exceeds a single page which could not be handled
correctly by XDP. So this patch forbids setting XDP when GUEST_UFO is
supported. While at it, forbid XDP for ECN (which comes only from GRO)
too to prevent user from misconfiguration.

Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5c33474d 23-Dec-2016 Jason Wang <jasowang@redhat.com>

virtio-net: make rx buf size estimation works for XDP

We don't update ewma rx buf size in the case of XDP. This will lead
underestimation of rx buf size which causes host to produce more than
one buffers. This will greatly increase the possibility of XDP page
linearization.

Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b00f70b0 23-Dec-2016 Jason Wang <jasowang@redhat.com>

virtio-net: unbreak csumed packets for XDP_PASS

We drop csumed packet when do XDP for packets. This breaks
XDP_PASS when GUEST_CSUM is supported. Fix this by allowing csum flag
to be set. With this patch, simple TCP works for XDP_PASS.

Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1830f893 23-Dec-2016 Jason Wang <jasowang@redhat.com>

virtio-net: correctly handle XDP_PASS for linearized packets

When XDP_PASS were determined for linearized packets, we try to get
new buffers in the virtqueue and build skbs from them. This is wrong,
we should create skbs based on existed buffers instead. Fixing them by
creating skb based on xdp_page.

With this patch "ping 192.168.100.4 -s 3900 -M do" works for XDP_PASS.

Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 56a86f84 23-Dec-2016 Jason Wang <jasowang@redhat.com>

virtio-net: fix page miscount during XDP linearizing

We don't put page during linearizing, the would cause leaking when
xmit through XDP_TX or the packet exceeds PAGE_SIZE. Fix them by
put page accordingly. Also decrease the number of buffers during
linearizing to make sure caller can free buffers correctly when packet
exceeds PAGE_SIZE. With this patch, we won't get OOM after linearize
huge number of packets.

Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 275be061 23-Dec-2016 Jason Wang <jasowang@redhat.com>

virtio-net: correctly xmit linearized page on XDP_TX

After we linearize page, we should xmit this page instead of the page
of first buffer which may lead unexpected result. With this patch, we
can see correct packet during XDP_TX.

Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 73b62bd0 23-Dec-2016 Jason Wang <jasowang@redhat.com>

virtio-net: remove the warning before XDP linearizing

Since we use EWMA to estimate the size of rx buffer. When rx buffer
size is underestimated, it's usual to have a packet with more than one
buffers. Consider this is not a bug, remove the warning and correct
the comment before XDP linearizing.

Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 72979a6c 15-Dec-2016 John Fastabend <john.fastabend@gmail.com>

virtio_net: xdp, add slowpath case for non contiguous buffers

virtio_net XDP support expects receive buffers to be contiguous.
If this is not the case we enable a slowpath to allow connectivity
to continue but at a significan performance overhead associated with
linearizing data. To make it painfully aware to users that XDP is
running in a degraded mode we throw an xdp buffer error.

To linearize packets we allocate a page and copy the segments of
the data, including the header, into it. After this the page can be
handled by XDP code flow as normal.

Then depending on the return code the page is either freed or sent
to the XDP xmit path. There is no attempt to optimize this path.

This case is being handled simple as a precaution in case some
unknown backend were to generate packets in this form. To test this
I had to hack qemu and force it to generate these packets. I do not
expect this case to be generated by "real" backends.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 56434a01 15-Dec-2016 John Fastabend <john.fastabend@gmail.com>

virtio_net: add XDP_TX support

This adds support for the XDP_TX action to virtio_net. When an XDP
program is run and returns the XDP_TX action the virtio_net XDP
implementation will transmit the packet on a TX queue that aligns
with the current CPU that the XDP packet was processed on.

Before sending the packet the header is zeroed. Also XDP is expected
to handle checksum correctly so no checksum offload support is
provided.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 672aafd5 15-Dec-2016 John Fastabend <john.fastabend@gmail.com>

virtio_net: add dedicated XDP transmit queues

XDP requires using isolated transmit queues to avoid interference
with normal networking stack (BQL, NETDEV_TX_BUSY, etc). This patch
adds a XDP queue per cpu when a XDP program is loaded and does not
expose the queues to the OS via the normal API call to
netif_set_real_num_tx_queues(). This way the stack will never push
an skb to these queues.

However virtio/vhost/qemu implementation only allows for creating
TX/RX queue pairs at this time so creating only TX queues was not
possible. And because the associated RX queues are being created I
went ahead and exposed these to the stack and let the backend use
them. This creates more RX queues visible to the network stack than
TX queues which is worth mentioning but does not cause any issues as
far as I can tell.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f600b690 15-Dec-2016 John Fastabend <john.fastabend@gmail.com>

virtio_net: Add XDP support

This adds XDP support to virtio_net. Some requirements must be
met for XDP to be enabled depending on the mode. First it will
only be supported with LRO disabled so that data is not pushed
across multiple buffers. Second the MTU must be less than a page
size to avoid having to handle XDP across multiple pages.

If mergeable receive is enabled this patch only supports the case
where header and data are in the same buf which we can check when
a packet is received by looking at num_buf. If the num_buf is
greater than 1 and a XDP program is loaded the packet is dropped
and a warning is thrown. When any_header_sg is set this does not
happen and both header and data is put in a single buffer as expected
so we check this when XDP programs are loaded. Subsequent patches
will process the packet in a degraded mode to ensure connectivity
and correctness is not lost even if backend pushes packets into
multiple buffers.

If big packets mode is enabled and MTU/LRO conditions above are
met then XDP is allowed.

This patch was tested with qemu with vhost=on and vhost=off where
mergeable and big_packet modes were forced via hard coding feature
negotiation. Multiple buffers per packet was forced via a small
test patch to vhost.c in the vhost=on qemu mode.

Suggested-by: Shrijeet Mukherjee <shrijeet@gmail.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a220871b 12-Dec-2016 Jason Wang <jasowang@redhat.com>

virtio-net: correctly enable multiqueue

Commit 4490001029012539937ff02778fe6180613fa949 ("virtio-net: enable
multiqueue by default") blindly set the affinity instead of queues
during probe which can cause a mismatch of #queues between guest and
host. This patch fixes it by setting queues.

Reported-by: Theodore Ts'o <tytso@mit.edu>
Tested-by: Theodore Ts'o <tytso@mit.edu>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Fixes: 49000102901 ("virtio-net: enable multiqueue by default")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e37e2ff3 05-Dec-2016 Andy Lutomirski <luto@kernel.org>

virtio-net: Fix DMA-from-the-stack in virtnet_set_mac_address()

With CONFIG_VMAP_STACK=y, virtnet_set_mac_address() can be passed a
pointer to the stack and it will OOPS. Copy the address to the heap
to prevent the crash.

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Laura Abbott <labbott@redhat.com>
Reported-by: zbyszek@in.waw.pl
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 44900010 24-Nov-2016 Jason Wang <jasowang@redhat.com>

virtio-net: enable multiqueue by default

We use single queue even if multiqueue is enabled and let admin to
enable it through ethtool later. This is used to avoid possible
regression (small packet TCP stream transmission). But looks like an
overkill since:

- single queue user can disable multiqueue when launching qemu
- brings extra troubles for the management since it needs extra admin
tool in guest to enable multiqueue
- multiqueue performs much better than single queue in most of the
cases

So this patch enables multiqueue by default: if #queues is less than or
equal to #vcpu, enable as much as queue pairs; if #queues is greater
than #vcpu, enable #vcpu queue pairs.

Cc: Hannes Frederic Sowa <hannes@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Neil Horman <nhorman@redhat.com>
Cc: Jeremy Eder <jeder@redhat.com>
Cc: Marko Myllynen <myllynen@redhat.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 963abe5c 15-Nov-2016 Eric Dumazet <edumazet@google.com>

virtio-net: add a missing synchronize_net()

It seems many drivers do not respect napi_hash_del() contract.

When napi_hash_del() is used before netif_napi_del(), an RCU grace
period is needed before freeing NAPI object.

Fixes: 91815639d880 ("virtio-net: rx busy polling support")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f3358507 03-Nov-2016 Michael S. Tsirkin <mst@redhat.com>

virtio-net: drop legacy features in virtio 1 mode

Virtio 1.0 spec says VIRTIO_F_ANY_LAYOUT and VIRTIO_NET_F_GSO are
legacy-only feature bits. Do not negotiate them in virtio 1 mode. Note
this is a spec violation so we need to backport it to stable/downstream
kernels.

Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 93a205ee 25-Oct-2016 Aaron Conole <aconole@redhat.com>

virtio-net: Update the mtu code to match virtio spec

The virtio committee recently ratified a change, VIRTIO-152, which
defines the mtu field to be 'max' MTU, not simply desired MTU.

This commit brings the virtio-net device in compliance with VIRTIO-152.

Additionally, drop the max_mtu branch - it cannot be taken since the u16
returned by virtio_cread16 will never exceed the initial value of
max_mtu.

Signed-off-by: Aaron Conole <aconole@redhat.com>
Acked-by: "Michael S. Tsirkin" <mst@redhat.com>
Acked-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d0c2c997 20-Oct-2016 Jarod Wilson <jarod@redhat.com>

net: use core MTU range checking in virt drivers

hyperv_net:
- set min/max_mtu, per Haiyang, after rndis_filter_device_add

virtio_net:
- set min/max_mtu
- remove virtnet_change_mtu

vmxnet3:
- set min/max_mtu

xen-netback:
- min_mtu = 0, max_mtu = 65517

xen-netfront:
- min_mtu = 0, max_mtu = 65535

unisys/visor:
- clean up defines a little to not clash with network core or add
redundat definitions

CC: netdev@vger.kernel.org
CC: virtualization@lists.linux-foundation.org
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: "Michael S. Tsirkin" <mst@redhat.com>
CC: Shrikrishna Khare <skhare@vmware.com>
CC: "VMware, Inc." <pv-drivers@vmware.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Paul Durrant <paul.durrant@citrix.com>
CC: David Kershner <david.kershner@unisys.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8017c279 12-Aug-2016 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

net/virtio-net: Convert to hotplug state machine

Install the callbacks via the state machine.

The driver supports multiple instances and therefore the new
cpuhp_state_add_instance_nocalls() infrastrucure is used. The driver
currently uses get_online_cpus() to avoid missing a CPU hotplug event while
invoking virtnet_set_affinity(). This could be avoided by using
cpuhp_state_add_instance() variant which holds the hotplug lock and invokes
callback during registration. This is more or less a 1:1 conversion of the
current code.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: netdev@vger.kernel.org
Cc: Will Deacon <will.deacon@arm.com>
Cc: virtualization@lists.linux-foundation.org
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/1471024183-12666-7-git-send-email-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# a725ee3e 18-Jul-2016 Andy Lutomirski <luto@kernel.org>

virtio-net: Remove more stack DMA

VLAN and MQ control was doing DMA from the stack. Fix it.

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: "netdev@vger.kernel.org" <netdev@vger.kernel.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d1dc06dc 13-Jun-2016 Mike Rapoport <rppt@linux.vnet.ibm.com>

virtio_net: fix csum generation for virtio-net devices

The commit e858fae2b0b8 ("virtio_net: use common code for virtio_net_hdr
and skb GSO conversion") replaced the tun code for header manipulation
with the generic helpers. While doing so, it implictly moved the
skb_partial_csum_set() invocation after eth_type_trans(), which
invalidate the current gso start/offset values.
Fix it by moving the helper invocation before the mac pulling.

Fixes: e858fae2b0b8 ("virtio_net: use common code for virtio_net_hdr and
skb GSO conversion")

Reported-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e858fae2 08-Jun-2016 Mike Rapoport <rppt@linux.vnet.ibm.com>

virtio_net: use common code for virtio_net_hdr and skb GSO conversion

Replace open coded conversion between virtio_net_hdr to skb GSO info with
virtio_net_hdr_{from,to}_skb

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 14de9d11 03-Jun-2016 Aaron Conole <aconole@redhat.com>

virtio-net: Add initial MTU advice feature

This commit adds the feature bit and associated mtu device entry for the
virtio network device. When a virtio device comes up, it checks the
feature bit for the VIRTIO_NET_F_MTU feature. If such feature bit is
enabled, the driver will read the advised MTU and use it as the initial
value.

Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f00e35e2 30-May-2016 wangyunjian <wangyunjian@huawei.com>

virtio_net: fix virtnet_open and virtnet_probe competing for try_fill_recv

In function virtnet_open() and virtnet_probe(), func try_fill_recv() may
be executed at the same time. VQ in virtqueue_add() has not been protected
well and BUG_ON will be triggered when virito_net.ko being removed.

Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c67f5db8 17-Mar-2016 Paolo Abeni <pabeni@redhat.com>

virtio_net: replace netdev_alloc_skb_ip_align() with napi_alloc_skb()

This gives small but noticeable rx performance improvement (2-3%)
and will allow exploiting future napi improvement.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 0cf3ace9 07-Feb-2016 Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

virtio_net: validate ethtool port setting and explain the user validation

We should validate the port setting that we got from the user and check
if it's what we've set it to (PORT_OTHER), also add explanation that
ignoring advertising is good as long as we don't have autonegotiation.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 16032be5 02-Feb-2016 Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

virtio_net: add ethtool support for set and get of settings

This patch allows the user to set and retrieve speed and duplex of the
virtio_net device via ethtool. Having this functionality is very helpful
for simulating different environments and also enables the virtio_net
device to participate in operations where proper speed and duplex are
required (e.g. currently bonding lacp mode requires full duplex). Custom
speed and duplex are not allowed, the user-supplied settings are validated
before applying.

Example:
$ ethtool eth1
Settings for eth1:
...
Speed: Unknown!
Duplex: Unknown! (255)
$ ethtool -s eth1 speed 1000 duplex full
$ ethtool eth1
Settings for eth1:
...
Speed: 1000Mb/s
Duplex: Full

Based on a patch by Roopa Prabhu.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2ac46030 15-Nov-2015 Michael S. Tsirkin <mst@redhat.com>

virtio-net: Stop doing DMA from the stack

Once virtio starts using the DMA API, we won't be able to safely DMA
from the stack. virtio-net does a couple of config DMA requests
from small stack buffers -- switch to using dynamically-allocated
memory.

This should have no effect on any performance-critical code paths.

Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Andy Lutomirski <luto@kernel.org>


# 93d05d4a 18-Nov-2015 Eric Dumazet <edumazet@google.com>

net: provide generic busy polling to all NAPI drivers

NAPI drivers no longer need to observe a particular protocol
to benefit from busy polling (CONFIG_NET_RX_BUSY_POLL=y)

napi_hash_add() and napi_hash_del() are automatically called
from core networking stack, respectively from
netif_napi_add() and netif_napi_del()

This patch depends on free_netdev() and netif_napi_del() being
called from process context, which seems to be the norm.

Drivers might still prefer to call napi_hash_del() on their
own, since they might combine all the rcu grace periods into
a single one, knowing their NAPI structures lifetime, while
core networking stack has no idea of a possible combining.

Once this patch proves to not bring serious regressions,
we will cleanup drivers to either remove napi_hash_del()
or provide appropriate rcu grace periods combining.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 93f93a44 18-Nov-2015 Eric Dumazet <edumazet@google.com>

net: move skb_mark_napi_id() into core networking stack

We would like to automatically provide busy polling support
to all NAPI drivers, without them having to implement anything.

skb_mark_napi_id() can be called from napi_gro_receive() and
napi_get_frags().

Few drivers are still calling skb_mark_napi_id() because
they use netif_receive_skb(). They should eventually call
napi_gro_receive() instead. I will leave this to drivers
maintainers.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 547c890c 27-Aug-2015 Jason Wang <jasowang@redhat.com>

virtio-net: avoid unnecessary sg initialzation

Usually an skb does not have up to MAX_SKB_FRAGS frags. So no need to
initialize the unuse part of sg. This patch initialize the sg based on
the real number it will used:

- during xmit, it could be inferred from nr_frags and can_push.
- for small receive buffer, it will also be 2.

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5377d758 19-Aug-2015 Johannes Berg <johannes@sipsolutions.net>

virtio_net: use DECLARE_EWMA

Instead of using the out-of-line EWMA calculation, use DECLARE_EWMA()
to create static inlines. On x86/64 this results in no change in code
size for me, but reduces the struct receive_queue size by the two
unsigned long values that store the parameters.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 48900cb6 04-Aug-2015 Jason Wang <jasowang@redhat.com>

virtio-net: drop NETIF_F_FRAGLIST

virtio declares support for NETIF_F_FRAGLIST, but assumes
that there are at most MAX_SKB_FRAGS + 2 fragments which isn't
always true with a fraglist.

A longer fraglist in the skb will make the call to skb_to_sgvec overflow
the sg array, leading to memory corruption.

Drop NETIF_F_FRAGLIST so we only get what we can handle.

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0fbd050a 31-Jul-2015 Eric Dumazet <edumazet@google.com>

virtio_net: add gro capability

Straightforward patch to add GRO processing to virtio_net.

napi_complete_done() usage allows more aggressive aggregation,
opted-in by setting /sys/class/net/xxx/gro_flush_timeout

Tested:

Setting /sys/class/net/xxx/gro_flush_timeout to 1000 nsec,
Rick Jones reported following results.

One VM of each on a pair of OpenStack compute nodes with E5-2650Lv3 CPUs
and Intel 82599ES-based NICs. So, two "before" and two "after" VMs.
The OpenStack compute nodes were running OpenStack Kilo, with VxLAN
encapsulation being used through OVS so no GRO coming-up the host
stack. The compute nodes themselves were running a 3.14-based kernel.

Single-stream netperf, CPU utilizations and thus service demands are
based on intra-guest reported CPU.

Throughput Mbit/s, bigger is better
Min Median Average Max
4.2.0-rc3+ 1364 1686 1678 1938
4.2.0-rc3+flush1k 1824 2269 2275 2647

Send Service Demand, smaller is better
Min Median Average Max
4.2.0-rc3+ 0.236 0.558 0.524 0.802
4.2.0-rc3+flush1k 0.176 0.503 0.471 0.738

Receive Service Demand, smaller is better.
Min Median Average Max
4.2.0-rc3+ 1.906 2.188 2.191 2.531
4.2.0-rc3+flush1k 0.448 0.529 0.533 0.692

Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Rick Jones <rick.jones2@hp.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 75993300 15-Jul-2015 Michael S. Tsirkin <mst@redhat.com>

virtio_net: don't require ANY_LAYOUT with VERSION_1

ANY_LAYOUT is a compatibility feature. It's implied
for VERSION_1 devices, and non-transitional devices
might not offer it. Change code to behave accordingly.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 60302ff6 02-Apr-2015 Michael S. Tsirkin <mst@redhat.com>

virtio: document queue state logic

commit d631b94e7a15277858ec5f88d674d93080506999
virtio: change comment in transmit

started clarifying the logic behind queue state management,
but introduced an inaccuracy: TX_BUSY does not cause
a BUG message.

Clean this up some more, explaining the tradeoffs in detail.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# faadb05f 26-Mar-2015 Li RongQing <roy.qing.li@gmail.com>

virtio: simplify the using of received in virtnet_poll

received is 0, no need to minus it and use "+=" to reassign it

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d631b94e 24-Mar-2015 stephen hemminger <stephen@networkplumber.org>

virtio: change comment in transmit

The original comment was not really informative or funny
as well as sexist. Replace it with a better explanation of
why the driver does stop and what the impacts are.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ab3971b1 11-Mar-2015 Jason Wang <jasowang@redhat.com>

virtio-net: correctly delete napi hash

We don't delete napi from hash list during module exit. This will
cause the following panic when doing module load and unload:

BUG: unable to handle kernel paging request at 0000004e00000075
IP: [<ffffffff816bd01b>] napi_hash_add+0x6b/0xf0
PGD 3c5d5067 PUD 0
Oops: 0000 [#1] SMP
...
Call Trace:
[<ffffffffa0a5bfb7>] init_vqs+0x107/0x490 [virtio_net]
[<ffffffffa0a5c9f2>] virtnet_probe+0x562/0x791815639d880be [virtio_net]
[<ffffffff8139e667>] virtio_dev_probe+0x137/0x200
[<ffffffff814c7f2a>] driver_probe_device+0x7a/0x250
[<ffffffff814c81d3>] __driver_attach+0x93/0xa0
[<ffffffff814c8140>] ? __device_attach+0x40/0x40
[<ffffffff814c6053>] bus_for_each_dev+0x63/0xa0
[<ffffffff814c7a79>] driver_attach+0x19/0x20
[<ffffffff814c76f0>] bus_add_driver+0x170/0x220
[<ffffffffa0a60000>] ? 0xffffffffa0a60000
[<ffffffff814c894f>] driver_register+0x5f/0xf0
[<ffffffff8139e41b>] register_virtio_driver+0x1b/0x30
[<ffffffffa0a60010>] virtio_net_driver_init+0x10/0x12 [virtio_net]

This patch fixes this by doing this in virtnet_free_queues(). And also
don't delete napi in virtnet_freeze() since it will call
virtnet_free_queues() which has already did this.

Fixes 91815639d880 ("virtio-net: rx busy polling support")
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e3e3c423 03-Feb-2015 Vlad Yasevich <vyasevich@gmail.com>

Revert "drivers/net: Disable UFO through virtio"

This reverts commit 3d0ad09412ffe00c9afa201d01effdb6023d09b4.

Now that GSO functionality can correctly track if the fragment
id has been selected and select a fragment id if necessary,
we can re-enable UFO on tap/macvap and virtio devices.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 074c3582 24-Jun-2014 Jacob Keller <jacob.e.keller@intel.com>

virtio_net: add software timestamp support

This patch enables the use of software timestamping via the virtio_net
driver.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>


# 6ba42248 12-Jan-2015 Michael S. Tsirkin <mst@redhat.com>

virtio/net: verify device has config space

Some devices might not implement config space access
(e.g. remoteproc used not to - before 3.9).
virtio/net needs config space access so make it
fail gracefully if not there.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 41f2f127 23-Dec-2014 Jason Wang <jasowang@redhat.com>

virtio-net: don't do header check for dodgy gso packets

There's no need to do header check for virtio-net since:

- Host sets dodgy for all gso packets from guest and check the header.
- Host should be prepared for all kinds of evil packets from guest, since
malicious guest can send any kinds of packet.

So this patch sets NETIF_F_GSO_ROBUST for virtio-net to skip the check.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8acdf999 19-Dec-2014 Herbert Xu <herbert@gondor.apana.org.au>

virtio_net: Fix napi poll list corruption

The commit d75b1ade567ffab085e8adbbdacf0092d10cd09c (net: less
interrupt masking in NAPI) breaks virtio_net in an insidious way.

It is now required that if the entire budget is consumed when poll
returns, the napi poll_list must remain empty. However, like some
other drivers virtio_net tries to do a last-ditch check and if
there is more work it will call napi_schedule and then immediately
process some of this new work. Should the entire budget be consumed
while processing such new work then we will violate the new caller
contract.

This patch fixes this by not touching any work when we reschedule
in virtio_net.

The worst part of this bug is that the list corruption causes other
napi users to be moved off-list. In my case I was chasing a stall
in IPsec (IPsec uses netif_rx) and I only belatedly realised that it
was virtio_net which caused the stall even though the virtio_net
poll was still functioning perfectly after IPsec stalled.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 51cdc381 01-Dec-2014 Michael S. Tsirkin <mst@redhat.com>

virtio: drop VIRTIO_F_VERSION_1 from drivers

Core activates this bit automatically now,
drop it from drivers that set it explicitly.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 9465a7a6 23-Oct-2014 Michael S. Tsirkin <mst@redhat.com>

virtio_net: enable v1.0 support

Now that we have completed 1.0 support, enable it in our driver.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 7e93a02f 26-Nov-2014 Michael S. Tsirkin <mst@redhat.com>

virtio_net: disable mac write for virtio 1.0

The spec states that mac in config space is only driver-writable in the
legacy case. Fence writing it in virtnet_set_mac_address() in the
virtio 1.0 case.

Suggested-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>


# d04302b3 23-Oct-2014 Michael S. Tsirkin <mst@redhat.com>

virtio_net: bigger header when VERSION_1 is set

With VERSION_1 virtio_net uses same header size
whether mergeable buffers are enabled or not.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>


# bcff3162 23-Oct-2014 Michael S. Tsirkin <mst@redhat.com>

virtio_net: stricter short buffer length checks

Our buffer length check is not strict enough for mergeable
buffers: buffer can still be shorter that header + address
by 2 bytes.

Fix that up.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>


# 012873d0 24-Oct-2014 Michael S. Tsirkin <mst@redhat.com>

virtio_net: get rid of virtio_net_hdr/skb_vnet_hdr

virtio 1.0 doesn't use virtio_net_hdr anymore, and in fact, it's not
really useful since virtio_net_hdr_mrg_rxbuf includes that as the first
field anyway.

Let's drop it, precalculate header len and store within vi instead.

This way we can also remove struct skb_vnet_hdr.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>


# 946fa564 23-Oct-2014 Michael S. Tsirkin <mst@redhat.com>

virtio_net: pass vi around

Too many places poke at [rs]q->vq->vdev->priv just to get
the vi structure. Let's just pass the pointer around: seems
cleaner, and might even be faster.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>


# fdd819b2 07-Oct-2014 Michael S. Tsirkin <mst@redhat.com>

virtio_net: v1.0 endianness

Based on patches by Rusty Russell, Cornelia Huck.
Note: more code changes are needed for 1.0 support
(due to different header size).
So we don't advertize support for 1.0 yet.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


# 892d6eb1 20-Nov-2014 Jason Wang <jasowang@redhat.com>

virtio-net: validate features during probe

We currently trigger BUG when VIRTIO_NET_F_CTRL_VQ
is not set but one of features depending on it is.
That's not a friendly way to report errors to
hypervisors.
Let's check, and fail probe instead.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3d0ad094 30-Oct-2014 Ben Hutchings <ben@decadent.org.uk>

drivers/net: Disable UFO through virtio

IPv6 does not allow fragmentation by routers, so there is no
fragmentation ID in the fixed header. UFO for IPv6 requires the ID to
be passed separately, but there is no provision for this in the virtio
net protocol.

Until recently our software implementation of UFO/IPv6 generated a new
ID, but this was a bug. Now we will use ID=0 for any UFO/IPv6 packet
passed through a tap, which is even worse.

Unfortunately there is no distinction between UFO/IPv4 and v6
features, so disable UFO on taps and virtio_net completely until we
have a proper solution.

We cannot depend on VM managers respecting the tap feature flags, so
keep accepting UFO packets but log a warning the first time we do
this.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Fixes: 916e4cf46d02 ("ipv6: reuse ip6_frag_id from ip6_ufo_append_data")
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4b7fd2e6 15-Oct-2014 Michael S. Tsirkin <mst@redhat.com>

virtio_net: fix use after free

commit 0b725a2ca61bedc33a2a63d0451d528b268cf975
net: Remove ndo_xmit_flush netdev operation, use signalling instead.

added code that looks at skb->xmit_more after the skb has
been put in TX VQ. Since some paths process the ring and free the skb
immediately, this can cause use after free.

Fix by storing xmit_more in a local variable.

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e53fbd11 14-Oct-2014 Michael S. Tsirkin <mst@redhat.com>

virtio_net: enable VQs early on restore

virtio spec requires drivers to set DRIVER_OK before using VQs.
This is set automatically after restore returns, virtio net violated this
rule by using receive VQs within restore.

To fix, call virtio_device_ready before using VQs.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 02465555 14-Oct-2014 Michael S. Tsirkin <mst@redhat.com>

virtio_net: fix use after free on allocation failure

In the extremely unlikely event that driver initialization fails after
RX buffers are added, virtio net frees RX buffers while VQs are
still active, potentially causing device to use a freed buffer.

To fix, reset device first - same as we do on device removal.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 4baf1e33 14-Oct-2014 Michael S. Tsirkin <mst@redhat.com>

virtio_net: enable VQs early

virtio spec requires drivers to set DRIVER_OK before using VQs.
This is set automatically after probe returns, virtio net violated this
rule by using receive VQs within probe.

To fix, call virtio_device_ready before using VQs.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 507613bf 14-Oct-2014 Michael S. Tsirkin <mst@redhat.com>

virtio_net: minor cleanup

goto done;
done:
return;
is ugly, it was put there to make diff review easier.
replace by open-coded return.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 080c6373 14-Oct-2014 Michael S. Tsirkin <mst@redhat.com>

virtio-net: drop config_mutex

config_mutex served two purposes: prevent multiple concurrent config
change handlers, and synchronize access to config_enable flag.

Since commit dbf2576e37da0fcc7aacbfbb9fd5d3de7888a3c1
workqueue: make all workqueues non-reentrant
all workqueues are non-reentrant, and config_enable
is now gone.

Get rid of the unnecessary lock.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 102a2786 14-Oct-2014 Michael S. Tsirkin <mst@redhat.com>

virtio_net: drop config_enable

Now that virtio core ensures config changes don't arrive during probing,
drop config_enable flag in virtio net.
On removal, flush is now sufficient to guarantee that no change work is
queued.

This help simplify the driver, and will allow setting DRIVER_OK earlier
without losing config change notifications.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# a5835440 10-Sep-2014 Rusty Russell <rusty@rustcorp.com.au>

virtio_net: pass well-formed sgs to virtqueue_add_*()

This is the only driver which doesn't hand virtqueue_add_inbuf and
virtqueue_add_outbuf a well-formed, well-terminated sg. Fix it,
so we can make virtio_add_* simpler.

pktgen results:
modprobe pktgen
echo 'add_device eth0' > /proc/net/pktgen/kpktgend_0
echo nowait 1 > /proc/net/pktgen/eth0
echo count 1000000 > /proc/net/pktgen/eth0
echo clone_skb 100000 > /proc/net/pktgen/eth0
echo dst_mac 4e:14:25:a9:30:ac > /proc/net/pktgen/eth0
echo dst 192.168.1.2 > /proc/net/pktgen/eth0
for i in `seq 20`; do echo start > /proc/net/pktgen/pgctrl; tail -n1 /proc/net/pktgen/eth0; done

Before:
746547-793084(786421+/-9.6e+03)pps 346-367(364.4+/-4.4)Mb/sec (346397808-367990976(3.649e+08+/-4.5e+06)bps) errors: 0

After:
767390-792966(785159+/-6.5e+03)pps 356-367(363.75+/-2.9)Mb/sec (356068960-367936224(3.64314e+08+/-3e+06)bps) errors: 0

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c89fcfd4 28-Aug-2014 David S. Miller <davem@davemloft.net>

virtio_net: flush when in xmit_more mode and under descriptor pressure

Mirror the changes made to ixgbe in commit 2367a17390138f68b3aa28f2f220b8d7ff8d91f4
("ixgbe: flush when in xmit_more mode and under descriptor pressure")

Signed-off-by: David S. Miller <davem@davemloft.net>


# 0b725a2c 25-Aug-2014 David S. Miller <davem@davemloft.net>

net: Remove ndo_xmit_flush netdev operation, use signalling instead.

As reported by Jesper Dangaard Brouer, for high packet rates the
overhead of having another indirect call in the TX path is
non-trivial.

There is the indirect call itself, and then there is all of the
reloading of the state to refetch the tail pointer value and
then write the device register.

Move to a more passive scheme, which requires very light modifications
to the device drivers.

The signal is a new skb->xmit_more value, if it is non-zero it means
that more SKBs are pending to be transmitted on the same queue as the
current SKB. And therefore, the driver may elide the tail pointer
update.

Right now skb->xmit_more is always zero.

Signed-off-by: David S. Miller <davem@davemloft.net>


# c223a078 23-Aug-2014 David S. Miller <davem@davemloft.net>

virtio_net: Support netdev_ops->ndo_xmit_flush()

Signed-off-by: David S. Miller <davem@davemloft.net>


# 91815639 23-Jul-2014 Jason Wang <jasowang@redhat.com>

virtio-net: rx busy polling support

Add basic support for rx busy polling. Instead of introducing new
states and spinlock to synchronize between NAPI and polling method,
this patch just reuse NAPI state to avoid extra overhead for fast path
and simplified the codes.

Test was done between a kvm guest and an external host. Two hosts were
connected through 40gb mlx4 cards. With both busy_poll and busy_read
are set to 50 in guest, 1 byte netperf tcp_rr shows 127% improvement:
transaction rate was increased from 8353.33 to 18966.87.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Vlad Yasevich <vyasevic@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2ffa7598 23-Jul-2014 Jason Wang <jasowang@redhat.com>

virtio-net: introduce virtnet_receive()

Move common receive logic to a new helper virtnet_receive(). It will
also be used by rx busy polling method.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Vlad Yasevich <vyasevic@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7ad24ea4 10-May-2014 Wilfried Klaebe <w-lkml@lebenslange-mailadresse.de>

net: get rid of SET_ETHTOOL_OPS

net: get rid of SET_ETHTOOL_OPS

Dave Miller mentioned he'd like to see SET_ETHTOOL_OPS gone.
This does that.

Mostly done via coccinelle script:
@@
struct ethtool_ops *ops;
struct net_device *dev;
@@
- SET_ETHTOOL_OPS(dev, ops);
+ dev->ethtool_ops = ops;

Compile tested only, but I'd seriously wonder if this broke anything.

Suggested-by: Dave Miller <davem@davemloft.net>
Signed-off-by: Wilfried Klaebe <w-lkml@lebenslange-mailadresse.de>
Acked-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6ebbc1a6 29-Apr-2014 Zhangjie \(HZ\) <zhangjie14@huawei.com>

virtio-net: Set needed_headroom for virtio-net when VIRTIO_F_ANY_LAYOUT is true

This is a small supplement for commit e7428e95a06fb516fac1308bd0e176e27c0b9287
("virtio-net: put virtio-net header inline with data"). TCP packages have
enough room to put virtio-net header in, but UDP packages do not. By
setting dev->needed_headroom for virtio-net device, UDP packages could have
enough room.

For UDP packages, sk_buff is alloced in fun __ip_append_data. The size is
"alloclen + hh_len + 15", and "hh_len = LL_RESERVED_SPACE(rt-dst.dev);".
The Macro is defined as follows:
#define LL_RESERVED_SPACE(dev) \
((((dev)->hard_header_len+(dev)->needed_headroom)\
&~(HH_DATA_MOD - 1)) + HH_DATA_MOD)
By default, for UDP packages, after skb is allocated, only 16 bytes
reserved. And 2 bytes remained after mac header is set. That is not enough
to put virtio-net header in. If we set dev->needed_headroom to 12 or 10
(according to mergeable_rx_bufs is on or off ), more room can be reserved.
Then there is enough room for UDP packages to put the header in.

test result list as below:
guest and host: suse11sp3, netperf, intel 2.4GHz
+-------+---------+---------+---------+---------+
| | old | new |
+-------+---------+---------+---------+---------+
| UDP | Gbit/s | pps | Gbit/s | pps |
| 64 | 0.57 | 692232 | 0.61 | 742420 |
| 256 | 1.60 | 686860 | 1.71 | 733331 |
| 512 | 2.92 | 674576 | 3.07 | 710446 |
| 1024 | 4.99 | 598977 | 5.17 | 620821 |
| 1460 | 5.68 | 483757 | 7.16 | 610519 |
| 4096 | 6.98 | 637468 | 7.21 | 658471 |
+-------+---------+---------+---------+---------+

Signed-off-by: Zhang Jie <zhangjie14@huawei.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c18e9cd6 17-Apr-2014 Amos Kong <akong@redhat.com>

virtio_net: zero is an invald queue_pairs number

Execute "ethtool -L eth0 combined 0" in guest, if multiqueue
is enabled, virtnet_send_command() will return -EINVAL error,
there is a validation in QEMU.

But if multiqueue is disabled, virtnet_set_queues() will just
return zero (success). We should return error for this situation.

Signed-off-by: Amos Kong <akong@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 681daee2 25-Mar-2014 Jason Wang <jasowang@redhat.com>

virtio-net: correct error handling of virtqueue_kick()

Current error handling of virtqueue_kick() was wrong in two places:
- The skb were freed immediately when virtqueue_kick() fail during
xmit. This may lead double free since the skb was not detached from
the virtqueue.
- try_fill_recv() returns false when virtqueue_kick() fail. This will
lead unnecessary rescheduling of refill work.

Actually, it's safe to just ignore the kick failure in those two
places. So this patch fixes this by partially revert commit
67975901183799af8e93ec60e322f9e2a1940b9b.

Fixes 67975901183799af8e93ec60e322f9e2a1940b9b
(virtio_net: verify if virtqueue_kick() succeeded).

Cc: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 85e94525 15-Mar-2014 Eric W. Biederman <ebiederm@xmission.com>

virtio_net: Call dev_kfree_skb_any instead of dev_kfree_skb.

Replace dev_kfree_skb with dev_kfree_skb_any in start_xmit which can
be called in hard irq and other contexts.

start_xmit only frees skbs that it is dropping.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 57a7744e 13-Mar-2014 Eric W. Biederman <ebiederm@xmission.com>

net: Replace u64_stats_fetch_begin_bh to u64_stats_fetch_begin_irq

Replace the bh safe variant with the hard irq safe variant.

We need a hard irq safe variant to deal with netpoll transmitting
packets from hard irq context, and we need it in most if not all of
the places using the bh safe variant.

Except on 32bit uni-processor the code is exactly the same so don't
bother with a bh variant, just have a hard irq safe variant that
everyone can use.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a7c58146 12-Mar-2014 Rusty Russell <rusty@rustcorp.com.au>

virtio_net: don't crash if virtqueue is broken.

A bad implementation of virtio might cause us to mark the virtqueue
broken: we'll dev_err() in that case, and the device is useless, but
let's not BUG_ON().

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 0e7ede80 20-Feb-2014 Jason Wang <jasowang@redhat.com>

virtio-net: alloc big buffers also when guest can receive UFO

We should alloc big buffers also when guest can receive UFO
packets to let the big packets fit into guest rx buffer.

Fixes 5c5167515d80f78f6bb538492c423adcae31ad65
(virtio-net: Allow UFO feature to be set and advertised.)

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fbf28d78 16-Jan-2014 Michael Dalton <mwdalton@google.com>

virtio-net: initial rx sysfs support, export mergeable rx buffer size

Add initial support for per-rx queue sysfs attributes to virtio-net. If
mergeable packet buffers are enabled, adds a read-only mergeable packet
buffer size sysfs attribute for each RX queue.

Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael Dalton <mwdalton@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ab7db917 16-Jan-2014 Michael Dalton <mwdalton@google.com>

virtio-net: auto-tune mergeable rx buffer size for improved performance

Commit 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page frag
allocators") changed the mergeable receive buffer size from PAGE_SIZE to
MTU-size, introducing a single-stream regression for benchmarks with large
average packet size. There is no single optimal buffer size for all
workloads. For workloads with packet size <= MTU bytes, MTU + virtio-net
header-sized buffers are preferred as larger buffers reduce the TCP window
due to SKB truesize. However, single-stream workloads with large average
packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
are used.

This commit auto-tunes the mergeable receiver buffer packet size by
choosing the packet buffer size based on an EWMA of the recent packet
sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
virtio-net header len to PAGE_SIZE. This improves throughput for
large packet workloads, as any workload with average packet size >=
PAGE_SIZE will use PAGE_SIZE buffers.

These optimizations interact positively with recent commit
ba275241030c ("virtio-net: coalesce rx frags when possible during rx"),
which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
optimizations benefit buffers of any size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs
with all offloads & vhost enabled. All VMs and vhost threads run in a
single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
in the system will not be scheduled on the benchmark CPUs. Trunk includes
SKB rx frag coalescing.

net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
net-next (MTU-size bufs): 13170.01Gb/s
net-next + auto-tune: 14555.94Gb/s

Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
using MTU-sized buffers to about 26Gb/s using auto-tuning.

Signed-off-by: Michael Dalton <mwdalton@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fb51879d 16-Jan-2014 Michael Dalton <mwdalton@google.com>

virtio-net: use per-receive queue page frag alloc for mergeable bufs

The virtio-net driver currently uses netdev_alloc_frag() for GFP_ATOMIC
mergeable rx buffer allocations. This commit migrates virtio-net to use
per-receive queue page frags for GFP_ATOMIC allocation. This change unifies
mergeable rx buffer memory allocation, which now will use skb_refill_frag()
for both atomic and GFP-WAIT buffer allocations.

To address fragmentation concerns, if after buffer allocation there
is too little space left in the page frag to allocate a subsequent
buffer, the remaining space is added to the current allocated buffer
so that the remaining space can be used to store packet data.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael Dalton <mwdalton@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# be121f46 15-Jan-2014 Jason Wang <jasowang@redhat.com>

virtio-net: drop rq->max and rq->num

It looks like there's no need for those two fields:

- Unless there's a failure for the first refill try, rq->max should be always
equal to the vring size.
- rq->num is only used to determine the condition that we need to do the refill,
we could check vq->num_free instead.
- rq->num was required to be increased or decreased explicitly after each
get/put which results a bad API.

So this patch removes them both to make the code simpler.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6cd4ce00 29-Dec-2013 Jason Wang <jasowang@redhat.com>

virtio-net: fix refill races during restore

During restoring, try_fill_recv() was called with neither napi lock nor napi
disabled. This can lead two try_fill_recv() was called in the same time. Fix
this by refilling before trying to enable napi.

Fixes 0741bcb5584f9e2390ae6261573c4de8314999f2
(virtio: net: Add freeze, restore handlers to support S4).

Cc: Amit Shah <amit.shah@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 788a8b6d 09-Dec-2013 stephen hemminger <stephen@networkplumber.org>

virtio_net: spelling fixes

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d24bae32 09-Dec-2013 stephen hemminger <stephen@networkplumber.org>

virtio_net: remove unused parameter to send_command

All the code passes NULL for the last sg list (in).
Simplify by just removing it.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 98bfd23c 05-Dec-2013 Michael Dalton <mwdalton@google.com>

virtio-net: free bufs correctly on invalid packet length

When a packet with invalid length arrives, ensure that the packet
is freed correctly if mergeable packet buffers and big packets
(GUEST_TSO4) are both enabled.

Signed-off-by: Michael Dalton <mwdalton@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d4fb84ee 05-Dec-2013 Andrey Vagin <avagin@openvz.org>

virtio: delete napi structures from netdev before releasing memory

free_netdev calls netif_napi_del too, but it's too late, because napi
structures are placed on vi->rq. netif_napi_add() is called from
virtnet_alloc_queues.

general protection fault: 0000 [#1] SMP
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables virtio_balloon pcspkr virtio_net(-) i2c_pii
CPU: 1 PID: 347 Comm: rmmod Not tainted 3.13.0-rc2+ #171
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800b779c420 ti: ffff8800379e0000 task.ti: ffff8800379e0000
RIP: 0010:[<ffffffff81322e19>] [<ffffffff81322e19>] __list_del_entry+0x29/0xd0
RSP: 0018:ffff8800379e1dd0 EFLAGS: 00010a83
RAX: 6b6b6b6b6b6b6b6b RBX: ffff8800379c2fd0 RCX: dead000000200200
RDX: 6b6b6b6b6b6b6b6b RSI: 0000000000000001 RDI: ffff8800379c2fd0
RBP: ffff8800379e1dd0 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff8800379c2f90
R13: ffff880037839160 R14: 0000000000000000 R15: 00000000013352f0
FS: 00007f1400e34740(0000) GS:ffff8800bfb00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f464124c763 CR3: 00000000b68cf000 CR4: 00000000000006e0
Stack:
ffff8800379e1df0 ffffffff8155beab 6b6b6b6b6b6b6b2b ffff8800378391c0
ffff8800379e1e18 ffffffff8156499b ffff880037839be0 ffff880037839d20
ffff88003779d3f0 ffff8800379e1e38 ffffffffa003477c ffff88003779d388
Call Trace:
[<ffffffff8155beab>] netif_napi_del+0x1b/0x80
[<ffffffff8156499b>] free_netdev+0x8b/0x110
[<ffffffffa003477c>] virtnet_remove+0x7c/0x90 [virtio_net]
[<ffffffff813ae323>] virtio_dev_remove+0x23/0x80
[<ffffffff813f62ef>] __device_release_driver+0x7f/0xf0
[<ffffffff813f6ca0>] driver_detach+0xc0/0xd0
[<ffffffff813f5f28>] bus_remove_driver+0x58/0xd0
[<ffffffff813f72ec>] driver_unregister+0x2c/0x50
[<ffffffff813ae65e>] unregister_virtio_driver+0xe/0x10
[<ffffffffa0036942>] virtio_net_driver_exit+0x10/0x6ce [virtio_net]
[<ffffffff810d7cf2>] SyS_delete_module+0x172/0x220
[<ffffffff810a732d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff810f5d4c>] ? __audit_syscall_entry+0x9c/0xf0
[<ffffffff81677f69>] system_call_fastpath+0x16/0x1b
Code: 00 00 55 48 8b 17 48 b9 00 01 10 00 00 00 ad de 48 8b 47 08 48 89 e5 48 39 ca 74 29 48 b9 00 02 20 00 00 00
RIP [<ffffffff81322e19>] __list_del_entry+0x29/0xd0
RSP <ffff8800379e1dd0>
---[ end trace d5931cd3f87c9763 ]---

Fixes: 986a4f4d452d (virtio_net: multiqueue support)
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fa9fac17 05-Dec-2013 Andrey Vagin <avagin@openvz.org>

virtio-net: determine type of bufs correctly

free_unused_bufs must check vi->mergeable_rx_bufs before
vi->big_packets, because we use this sequence in other places.
Otherwise we allocate buffer of one type, then free it as another
type.

general protection fault: 0000 [#1] SMP
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables pcspkr virtio_balloon virtio_net(-) i2c_pii
CPU: 0 PID: 400 Comm: rmmod Not tainted 3.13.0-rc2+ #170
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800b6d2a210 ti: ffff8800aed32000 task.ti: ffff8800aed32000
RIP: 0010:[<ffffffffa00345f3>] [<ffffffffa00345f3>] free_unused_bufs+0xc3/0x190 [virtio_net]
RSP: 0018:ffff8800aed33dd8 EFLAGS: 00010202
RAX: ffff8800b1fe2c00 RBX: ffff8800b66a7240 RCX: 6b6b6b6b6b6b6b6b
RDX: 6b6b6b6b6b6b6b6b RSI: ffff8800b8419a68 RDI: ffff8800b66a1148
RBP: ffff8800aed33e00 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
R13: ffff8800b66a1148 R14: 0000000000000000 R15: 000077ff80000000
FS: 00007fc4f9c4e740(0000) GS:ffff8800bfa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f63f432f000 CR3: 00000000b6538000 CR4: 00000000000006f0
Stack:
ffff8800b66a7240 ffff8800b66a7380 ffff8800377bd3f0 0000000000000000
00000000023302f0 ffff8800aed33e18 ffffffffa00346e2 ffff8800b66a7240
ffff8800aed33e38 ffffffffa003474d ffff8800377bd388 ffff8800377bd390
Call Trace:
[<ffffffffa00346e2>] remove_vq_common+0x22/0x40 [virtio_net]
[<ffffffffa003474d>] virtnet_remove+0x4d/0x90 [virtio_net]
[<ffffffff813ae303>] virtio_dev_remove+0x23/0x80
[<ffffffff813f62cf>] __device_release_driver+0x7f/0xf0
[<ffffffff813f6c80>] driver_detach+0xc0/0xd0
[<ffffffff813f5f08>] bus_remove_driver+0x58/0xd0
[<ffffffff813f72cc>] driver_unregister+0x2c/0x50
[<ffffffff813ae63e>] unregister_virtio_driver+0xe/0x10
[<ffffffffa0036852>] virtio_net_driver_exit+0x10/0x7be [virtio_net]
[<ffffffff810d7cf2>] SyS_delete_module+0x172/0x220
[<ffffffff810a732d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff810f5d4c>] ? __audit_syscall_entry+0x9c/0xf0
[<ffffffff81677f69>] system_call_fastpath+0x16/0x1b
Code: c0 74 55 0f 1f 44 00 00 80 7b 30 00 74 7a 48 8b 50 30 4c 89 e6 48 03 73 20 48 85 d2 0f 84 bb 00 00 00 66 0f
RIP [<ffffffffa00345f3>] free_unused_bufs+0xc3/0x190 [virtio_net]
RSP <ffff8800aed33dd8>
---[ end trace edb570ea923cce9c ]---

Fixes: 2613af0ed18a (virtio_net: migrate mergeable rx buffers to page frag allocators)
Cc: Michael Dalton <mwdalton@google.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# adf8d3ff 06-Dec-2013 Jeff Kirsher <jeffrey.t.kirsher@intel.com>

drivers/net/*: Fix FSF address in file headers

Several files refer to an old address for the Free Software Foundation
in the file header comment. Resolve by replacing the address with
the URL <http://www.gnu.org/licenses/> so that we do not have to keep
updating the header comments anytime the address changes.

CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Veaceslav Falico <vfalico@redhat.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Paul Mackerras <paulus@samba.org>
CC: Ian Campbell <ian.campbell@citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f121159d 28-Nov-2013 Michael S. Tsirkin <mst@redhat.com>

virtio_net: make all RX paths handle erors consistently

receive mergeable now handles errors internally.
Do same for big and small packet paths, otherwise
the logic is too hard to follow.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8fc3b9e9 28-Nov-2013 Michael S. Tsirkin <mst@redhat.com>

virtio_net: fix error handling for mergeable buffers

Eric Dumazet noticed that if we encounter an error
when processing a mergeable buffer, we don't
dequeue all of the buffers from this packet,
the result is almost sure to be loss of networking.

Jason Wang noticed that we also leak a page and that we don't decrement
the rq buf count, so we won't repost buffers (a resource leak).

Fix both issues.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael Dalton <mwdalton@google.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 99e872ae 29-Nov-2013 Thomas Huth <thuth@linux.vnet.ibm.com>

virtio_net: Fixed a trivial typo (fitler --> filter)

"MAC filter" sounds more reasonable than "MAC fitler".

Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0f13b66b 18-Nov-2013 Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

net, virtio_net: replace the magic value

It is more appropriate to use # of queue pairs currently used by
the driver instead of a magic value.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5061de36 14-Nov-2013 Michael Dalton <mwdalton@google.com>

virtio-net: mergeable buffer size should include virtio-net header

Commit 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page
frag allocators") changed the mergeable receive buffer size from PAGE_SIZE
to MTU-size. However, the merge buffer size does not take into account the
size of the virtio-net header. Consequently, packets that are MTU-size
will take two buffers intead of one (to store the virtio-net header),
substantially decreasing the throughput of MTU-size traffic due to TCP
window / SKB truesize effects.

This commit changes the mergeable buffer size to include the virtio-net
header. The buffer size is cacheline-aligned because skb_page_frag_refill
will not automatically align the requested size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs and
vhost enabled. All VMs and vhost threads run in a single 4 CPU cgroup
cpuset, using cgroups to ensure that other processes in the system will not
be scheduled on the benchmark CPUs. Transmit offloads and mergeable receive
buffers are enabled, but guest_tso4 / guest_csum are explicitly disabled to
force MTU-sized packets on the receiver.

next-net trunk before 2613af0ed18a (PAGE_SIZE buf): 3861.08Gb/s
net-next trunk (MTU 1500- packet uses two buf due to size bug): 4076.62Gb/s
net-next trunk (MTU 1480- packet fits in one buf): 6301.34Gb/s
net-next trunk w/ size fix (MTU 1500 - packet fits in one buf): 6445.44Gb/s

Suggested-by: Eric Northup <digitaleric@google.com>
Signed-off-by: Michael Dalton <mwdalton@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 827da44c 07-Oct-2013 John Stultz <john.stultz@linaro.org>

net: Explicitly initialize u64_stats_sync structures for lockdep

In order to enable lockdep on seqcount/seqlock structures, we
must explicitly initialize any locks.

The u64_stats_sync structure, uses a seqcount, and thus we need
to introduce a u64_stats_init() function and use it to initialize
the structure.

This unfortunately adds a lot of fairly trivial initialization code
to a number of drivers. But the benefit of ensuring correctness makes
this worth while.

Because these changes are required for lockdep to be enabled, and the
changes are quite trivial, I've not yet split this patch out into 30-some
separate patches, as I figured it would be better to get the various
maintainers thoughts on how to best merge this change along with
the seqcount lockdep enablement.

Feedback would be appreciated!

Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Jesse Gross <jesse@nicira.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mirko Lindner <mlindner@marvell.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Roger Luethi <rl@hellgate.ch>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Simon Horman <horms@verge.net.au>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Wensong Zhang <wensong@linux-vs.org>
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/1381186321-4906-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 9bb8ca86 05-Nov-2013 Jason Wang <jasowang@redhat.com>

virtio-net: switch to use XPS to choose txq

We used to use a percpu structure vq_index to record the cpu to queue
mapping, this is suboptimal since it duplicates the work of XPS and
loses all other XPS functionality such as allowing user to configure
their own transmission steering strategy.

So this patch switches to use XPS and suggest a default mapping when
the number of cpus is equal to the number of queues. With XPS support,
there's no need for keeping per-cpu vq_index and .ndo_select_queue(),
so they were removed also.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ba275241 01-Nov-2013 Jason Wang <jasowang@redhat.com>

virtio-net: coalesce rx frags when possible during rx

Commit 2613af0ed18a11d5c566a81f9a6510b73180660a (virtio_net: migrate mergeable
rx buffers to page frag allocators) try to increase the payload/truesize for
MTU-sized traffic. But this will introduce the extra overhead for GSO packets
received because of the frag list. This commit tries to reduce this issue by
coalesce the possible rx frags when possible during rx. Test result shows the
about 15% improvement on full size GSO packet receiving (and even better than
before commit 2613af0ed18a11d5c566a81f9a6510b73180660a).

Before this commit:
./netperf -H 192.168.100.4
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.4
() port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

87380 16384 16384 10.00 20303.87

After this commit:
./netperf -H 192.168.100.4
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.4
() port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

87380 16384 16384 10.00 23841.26

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michael Dalton <mwdalton@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ec9debbd 29-Oct-2013 Jason Wang <jasowang@redhat.com>

virtio-net: correctly handle cpu hotplug notifier during resuming

commit 3ab098df35f8b98b6553edc2e40234af512ba877 (virtio-net: don't respond to
cpu hotplug notifier if we're not ready) tries to bypass the cpu hotplug
notifier by checking the config_enable and does nothing is it was false. So it
need to try to hold the config_lock mutex which may happen in atomic
environment which leads the following warnings:

[ 622.944441] CPU0 attaching NULL sched-domain.
[ 622.944446] CPU1 attaching NULL sched-domain.
[ 622.944485] CPU0 attaching NULL sched-domain.
[ 622.950795] BUG: sleeping function called from invalid context at kernel/mutex.c:616
[ 622.950796] in_atomic(): 1, irqs_disabled(): 1, pid: 10, name: migration/1
[ 622.950796] no locks held by migration/1/10.
[ 622.950798] CPU: 1 PID: 10 Comm: migration/1 Not tainted 3.12.0-rc5-wl-01249-gb91e82d #317
[ 622.950799] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 622.950802] 0000000000000000 ffff88001d42dba0 ffffffff81a32f22 ffff88001bfb9c70
[ 622.950803] ffff88001d42dbb0 ffffffff810edb02 ffff88001d42dc38 ffffffff81a396ed
[ 622.950805] 0000000000000046 ffff88001d42dbe8 ffffffff810e861d 0000000000000000
[ 622.950805] Call Trace:
[ 622.950810] [<ffffffff81a32f22>] dump_stack+0x54/0x74
[ 622.950815] [<ffffffff810edb02>] __might_sleep+0x112/0x114
[ 622.950817] [<ffffffff81a396ed>] mutex_lock_nested+0x3c/0x3c6
[ 622.950818] [<ffffffff810e861d>] ? up+0x39/0x3e
[ 622.950821] [<ffffffff8153ea7c>] ? acpi_os_signal_semaphore+0x21/0x2d
[ 622.950824] [<ffffffff81565ed1>] ? acpi_ut_release_mutex+0x5e/0x62
[ 622.950828] [<ffffffff816d04ec>] virtnet_cpu_callback+0x33/0x87
[ 622.950830] [<ffffffff81a42576>] notifier_call_chain+0x3c/0x5e
[ 622.950832] [<ffffffff810e86a8>] __raw_notifier_call_chain+0xe/0x10
[ 622.950835] [<ffffffff810c5556>] __cpu_notify+0x20/0x37
[ 622.950836] [<ffffffff810c5580>] cpu_notify+0x13/0x15
[ 622.950838] [<ffffffff81a237cd>] take_cpu_down+0x27/0x3a
[ 622.950841] [<ffffffff81136289>] stop_machine_cpu_stop+0x93/0xf1
[ 622.950842] [<ffffffff81136167>] cpu_stopper_thread+0xa0/0x12f
[ 622.950844] [<ffffffff811361f6>] ? cpu_stopper_thread+0x12f/0x12f
[ 622.950847] [<ffffffff81119710>] ? lock_release_holdtime.part.7+0xa3/0xa8
[ 622.950848] [<ffffffff81135e4b>] ? cpu_stop_should_run+0x3f/0x47
[ 622.950850] [<ffffffff810ea9b0>] smpboot_thread_fn+0x1c5/0x1e3
[ 622.950852] [<ffffffff810ea7eb>] ? lg_global_unlock+0x67/0x67
[ 622.950854] [<ffffffff810e36b7>] kthread+0xd8/0xe0
[ 622.950857] [<ffffffff81a3bfad>] ? wait_for_common+0x12f/0x164
[ 622.950859] [<ffffffff810e35df>] ? kthread_create_on_node+0x124/0x124
[ 622.950861] [<ffffffff81a45ffc>] ret_from_fork+0x7c/0xb0
[ 622.950862] [<ffffffff810e35df>] ? kthread_create_on_node+0x124/0x124
[ 622.950876] smpboot: CPU 1 is now offline
[ 623.194556] SMP alternatives: lockdep: fixing up alternatives
[ 623.194559] smpboot: Booting Node 0 Processor 1 APIC 0x1
...

A correct fix is to unregister the hotcpu notifier during restore and register a
new one in resume.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2613af0e 28-Oct-2013 Michael Dalton <mwdalton@google.com>

virtio_net: migrate mergeable rx buffers to page frag allocators

The virtio_net driver's mergeable receive buffer allocator
uses 4KB packet buffers. For MTU-sized traffic, SKB truesize
is > 4KB but only ~1500 bytes of the buffer is used to store
packet data, reducing the effective TCP window size
substantially. This patch addresses the performance concerns
with mergeable receive buffers by allocating MTU-sized packet
buffers using page frag allocators. If more than MAX_SKB_FRAGS
buffers are needed, the SKB frag_list is used.

Signed-off-by: Michael Dalton <mwdalton@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 047b9b94 28-Oct-2013 Heinz Graalfs <graalfs@linux.vnet.ibm.com>

virtio_net: verify if queue is broken after virtqueue_get_buf()

If a virtqueue_get_buf() call returns a NULL pointer a possibly endless while
loop should be avoided by checking for a broken virtqueue.

Signed-off-by: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 67975901 28-Oct-2013 Heinz Graalfs <graalfs@linux.vnet.ibm.com>

virtio_net: verify if virtqueue_kick() succeeded

Verify if a host kick succeeded by checking return value of virtqueue_kick().

Signed-off-by: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 35ed159b 14-Oct-2013 Jason Wang <jasowang@redhat.com>

virtio-net: refill only when device is up during setting queues

We used to schedule the refill work unconditionally after changing the
number of queues. This may lead an issue if the device is not
up. Since we only try to cancel the work in ndo_stop(), this may cause
the refill work still work after removing the device. Fix this by only
schedule the work when device is up.

The bug were introduce by commit 9b9cd8024a2882e896c65222aa421d461354e3f2.
(virtio-net: fix the race between channels setting and refill)

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3ab098df 14-Oct-2013 Jason Wang <jasowang@redhat.com>

virtio-net: don't respond to cpu hotplug notifier if we're not ready

We're trying to re-configure the affinity unconditionally in cpu hotplug
callback. This may lead the issue during resuming from s3/s4 since

- virt queues haven't been allocated at that time.
- it's unnecessary since thaw method will re-configure the affinity.

Fix this issue by checking the config_enable and do nothing is we're not ready.

The bug were introduced by commit 8de4b2f3ae90c8fc0f17eeaab87d5a951b66ee17
(virtio-net: reset virtqueue affinity when doing cpu hotplug).

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 855e0c52 14-Oct-2013 Rusty Russell <rusty@rustcorp.com.au>

virtio: use size-based config accessors.

This lets the transport do endian conversion if necessary, and insulates
the drivers from the difference.

Most drivers can use the simple helpers virtio_cread() and virtio_cwrite().

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 89107000 16-Sep-2013 Aaron Lu <aaron.lu@intel.com>

virtio: pm: use CONFIG_PM_SLEEP instead of CONFIG_PM

The freeze and restore functions defined in virtio drivers are used
for suspend and hibernate, so CONFIG_PM_SLEEP is more appropriate than
CONFIG_PM. This patch replace all CONFIG_PM with CONFIG_PM_SLEEP for
virtio drivers that implement freeze and restore callbacks.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Reviewed-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 4f49129b 27-Aug-2013 Thomas Huth <thuth@linux.vnet.ibm.com>

virtio-net: Set RXCSUM feature if GUEST_CSUM is available

If the VIRTIO_NET_F_GUEST_CSUM virtio feature is available, the guest
does not have to calculate the checksums on all received packets. This
is pretty much the same feature as RX checksum offloading on real
network cards, so the virtio-net driver should report this by setting
the NETIF_F_RXCSUM flag. When the user now runs "ethtool -k", he or she
can see whether the virtio-net interface has to calculate RX checksums
or not.

Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e7428e95 24-Jul-2013 Michael S. Tsirkin <mst@redhat.com>

virtio-net: put virtio net header inline with data

For small packets we can simplify xmit processing
by linearizing buffers with the header:
most packets seem to have enough head room
we can use for this purpose.
Since existing hypervisors require that header
is the first s/g element, we need a feature bit
for this.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cbdadbbf 08-Jul-2013 Michael S. Tsirkin <mst@redhat.com>

virtio_net: fix race in RX VQ processing

virtio net called virtqueue_enable_cq on RX path after napi_complete, so
with NAPI_STATE_SCHED clear - outside the implicit napi lock.
This violates the requirement to synchronize virtqueue_enable_cq wrt
virtqueue_add_buf. In particular, used event can move backwards,
causing us to lose interrupts.
In a debug build, this can trigger panic within START_USE.

Jason Wang reports that he can trigger the races artificially,
by adding udelay() in virtqueue_enable_cb() after virtio_mb().

However, we must call napi_complete to clear NAPI_STATE_SCHED before
polling the virtqueue for used buffers, otherwise napi_schedule_prep in
a callback will fail, causing us to lose RX events.

To fix, call virtqueue_enable_cb_prepare with NAPI_STATE_SCHED
set (under napi lock), later call virtqueue_poll with
NAPI_STATE_SCHED clear (outside the lock).

Reported-by: Jason Wang <jasowang@redhat.com>
Tested-by: Jason Wang <jasowang@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9b9cd802 03-Jul-2013 Jason Wang <jasowang@redhat.com>

virtio-net: fix the race between channels setting and refill

Commit 55257d72bd1c51f25106350f4983ec19f62ed1fa (virtio-net: fill only rx queues
which are being used) tries to refill on demand when changing the number of
channels by call try_refill_recv() directly, this may race:

- the refill work who may do the refill in the same time
- the try_refill_recv() called in bh since napi was not disabled

Which may led guest complain during setting channels:

virtio_net virtio0: input.1:id 0 is not a head!

Solve this issue by scheduling a refill work which can guarantee the
serialization of refill.

Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# e4166625 21-May-2013 Jason Wang <jasowang@redhat.com>

virtio_net: enable napi for all possible queues during open

Commit 55257d72bd1c51f25106350f4983ec19f62ed1fa (virtio-net: fill only rx
queues which are being used) only does the napi enabling during open for
curr_queue_pairs. This will break multiqueue receiving since napi of new queues
were still disabled after changing the number of queues.

This patch fixes this by enabling napi for all possible queues during open.

Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d34710e3 09-May-2013 Amerigo Wang <amwang@redhat.com>

virtio_net: use default napi weight by default

Since commit 82dc3c63c692b1e1d5937 ("net: introduce NAPI_POLL_WEIGHT")
we warn drivers when they use napi weight higher than NAPI_POLL_WEIGHT,
but virtio_net still uses 128 by default. This patch makes its default
value to NAPI_POLL_WEIGHT.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 55257d72 28-Apr-2013 Sasha Levin <sasha.levin@oracle.com>

virtio-net: fill only rx queues which are being used

Due to MQ support we may allocate a whole bunch of rx queues but
never use them. With this patch we'll safe the space used by
the receive buffers until they are actually in use:

sh-4.2# free -h
total used free shared buffers cached
Mem: 490M 35M 455M 0B 0B 4.1M
-/+ buffers/cache: 31M 459M
Swap: 0B 0B 0B
sh-4.2# ethtool -L eth0 combined 8
sh-4.2# free -h
total used free shared buffers cached
Mem: 490M 162M 327M 0B 0B 4.1M
-/+ buffers/cache: 158M 331M
Swap: 0B 0B 0B

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 80d5c368 18-Apr-2013 Patrick McHardy <kaber@trash.net>

net: vlan: prepare for 802.1ad VLAN filtering offload

Change the rx_{add,kill}_vid callbacks to take a protocol argument in
preparation of 802.1ad support. The protocol argument used so far is
always htons(ETH_P_8021Q).

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f646968f 18-Apr-2013 Patrick McHardy <kaber@trash.net>

net: vlan: rename NETIF_F_HW_VLAN_* feature flags to NETIF_F_HW_VLAN_CTAG_*

Rename the hardware VLAN acceleration features to include "CTAG" to indicate
that they only support CTAGs. Follow up patches will introduce 802.1ad
server provider tagging (STAGs) and require the distinction for hardware not
supporting acclerating both.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4fda8302 10-Apr-2013 Jason Wang <jasowang@redhat.com>

virtio-net: initialize vlan_features

There's nothing that prevent passing the device features of virtio_net to its
vlan device. So this patch simply passes those to vlan device to benefit from
advanced features.

Netperf shows better sending performance for vlan device since TSO can work on
vlan now.

before:
netperf -H 192.168.5.2
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.5.2 ()
port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

87380 16384 16384 10.00 4162.35

after:
netperf -H 192.168.5.2
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.5.2 ()
port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

87380 16384 16384 10.00 9365.42

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9d0ca6ed 21-Mar-2013 Rusty Russell <rusty@rustcorp.com.au>

virtio: remove obsolete virtqueue_get_queue_index()

You can access it directly now, since 3.8: v3.7-rc1-13-g06ca287
'virtio: move queue_index and num_free fields into core struct
virtqueue.'

Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9dc7b9e4 19-Mar-2013 Rusty Russell <rusty@rustcorp.com.au>

virtio_net: use simplified virtqueue accessors.

We never add buffers with input and output parts, so use the new accessors.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Reviewed-by: Asias He <asias@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# f7bc9594 19-Mar-2013 Rusty Russell <rusty@rustcorp.com.au>

virtio_net: use virtqueue_add_sgs[] for command buffers.

It's a bit cleaner to hand multiple sgs, rather than one big one.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Tested-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# c9af6db4 11-Feb-2013 Pravin B Shelar <pshelar@nicira.com>

net: Fix possible wrong checksum generation.

Patch cef401de7be8c4e (net: fix possible wrong checksum
generation) fixed wrong checksum calculation but it broke TSO by
defining new GSO type but not a netdev feature for that type.
net_gso_ok() would not allow hardware checksum/segmentation
offload of such packets without the feature.

Following patch fixes TSO and wrong checksum. This patch uses
same logic that Eric Dumazet used. Patch introduces new flag
SKBTX_SHARED_FRAG if at least one frag can be modified by
the user. but SKBTX_SHARED_FRAG flag is kept in skb shared
info tx_flags rather than gso_type.

tx_flags is better compared to gso_type since we can have skb with
shared frag without gso packet. It does not link SHARED_FRAG to
GSO, So there is no need to define netdev feature for this.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b2a17029 12-Feb-2013 Rusty Russell <rusty@rustcorp.com.au>

virtio: use module_virtio_driver.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# e68ed8f0 03-Feb-2013 Joe Perches <joe@perches.com>

drivers:net:misc: Remove unnecessary alloc/OOM messages

alloc failures already get standardized OOM
messages and a dump_stack.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cef401de 25-Jan-2013 Eric Dumazet <edumazet@google.com>

net: fix possible wrong checksum generation

Pravin Shelar mentioned that GSO could potentially generate
wrong TX checksum if skb has fragments that are overwritten
by the user between the checksum computation and transmit.

He suggested to linearize skbs but this extra copy can be
avoided for normal tcp skbs cooked by tcp_sendmsg().

This patch introduces a new SKB_GSO_SHARED_FRAG flag, set
in skb_shinfo(skb)->gso_type if at least one frag can be
modified by the user.

Typical sources of such possible overwrites are {vm}splice(),
sendfile(), and macvtap/tun/virtio_net drivers.

Tested:

$ netperf -H 7.7.8.84
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
7.7.8.84 () port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

87380 16384 16384 10.00 3959.52

$ netperf -H 7.7.8.84 -t TCP_SENDFILE
TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 ()
port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

87380 16384 16384 10.00 3216.80

Performance of the SENDFILE is impacted by the extra allocation and
copy, and because we use order-0 pages, while the TCP_STREAM uses
bigger pages.

Reported-by: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8de4b2f3 24-Jan-2013 Wanlong Gao <gaowanlong@cn.fujitsu.com>

virtio-net: reset virtqueue affinity when doing cpu hotplug

Add a cpu notifier to virtio-net, so that we can reset the
virtqueue affinity if the cpu hotplug happens. It improve
the performance through enabling or disabling the virtqueue
affinity after doing cpu hotplug.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Eric Dumazet <erdnetdev@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: virtualization@lists.linux-foundation.org
Cc: netdev@vger.kernel.org
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8898c21c 24-Jan-2013 Wanlong Gao <gaowanlong@cn.fujitsu.com>

virtio-net: split out clean affinity function

Split out the clean affinity function to virtnet_clean_affinity().

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Eric Dumazet <erdnetdev@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: virtualization@lists.linux-foundation.org
Cc: netdev@vger.kernel.org
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 47be2479 24-Jan-2013 Wanlong Gao <gaowanlong@cn.fujitsu.com>

virtio-net: fix the set affinity bug when CPU IDs are not consecutive

As Michael mentioned, set affinity and select queue will not work very
well when CPU IDs are not consecutive, this can happen with hot unplug.
Fix this bug by traversal the online CPUs, and create a per cpu variable
to find the mapping from CPU to the preferable virtual-queue.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Eric Dumazet <erdnetdev@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: virtualization@lists.linux-foundation.org
Cc: netdev@vger.kernel.org
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7e58d5ae 20-Jan-2013 Amos Kong <akong@redhat.com>

virtio-net: introduce a new control to set macaddr

Currently we write MAC address to pci config space byte by byte,
this means that we have an intermediate step where mac is wrong.
This patch introduced a new control command to set MAC address,
it's atomic.

VIRTIO_NET_F_CTRL_MAC_ADDR is a new feature bit for compatibility.

Signed-off-by: Amos Kong <akong@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 40cbfc37 20-Jan-2013 Amos Kong <akong@redhat.com>

move virtnet_send_command() above virtnet_set_mac_address()

We want to send vq command to set mac address in
virtnet_set_mac_address(), so do this function moving.
Fixed a little issue of coding style.

Signed-off-by: Amos Kong <akong@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0e3daa64 16-Oct-2012 Rusty Russell <rusty@rustcorp.com.au>

virtio: net: make it clear that virtqueue_add_buf() no longer returns > 0

We simplified virtqueue_add_buf(), make it clear in the callers.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 9ed4cb07 16-Oct-2012 Rusty Russell <rusty@rustcorp.com.au>

virtio_net: don't rely on virtqueue_add_buf() returning capacity.

Now we can easily use vq->num_free to determine if there are descriptors
left in the queue, we're about to change virtqueue_add_buf() to return 0
on success. The virtio_net driver is the only one which actually uses
the return value, so change that.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Michael S. Tsirkin <mst@redhat.com>


# 7bedc7dc 16-Oct-2012 Michael S. Tsirkin <mst@redhat.com>

virtio-net: remove unused skb_vnet_hdr->num_sg field

[Split from "correct capacity math on ring full" -- Rusty]

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Michael S. Tsirkin <mst@redhat.com>


# 6ee57bcc 16-Oct-2012 Michael S. Tsirkin <mst@redhat.com>

virtio-net: correct capacity math on ring full

Capacity math on ring full is wrong: we are
looking at num_sg but that might be optimistic
because of indirect buffer use.

The implementation also penalizes fast path
with extra memory accesses for the benefit of
ring full condition handling which is slow path.

It's easy to query ring capacity so let's do just that.

This change also makes it easier to move vnet header
for tx around as follow-up patch does.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Michael S. Tsirkin <mst@redhat.com>


# 008d4278 09-Dec-2012 Amerigo Wang <amwang@redhat.com>

virtio_net: fix a typo in virtnet_alloc_queues()

Obviously it should check !vi->rq.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d73bcd2c 07-Dec-2012 Jason Wang <jasowang@redhat.com>

virtio-net: support changing the number of queue pairs through ethtool

This patch implements the ethtool_{set|get}_channels method of virtio-net to
allow user to change the number of queues when the device is running on demand.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 986a4f4d 07-Dec-2012 Jason Wang <jasowang@redhat.com>

virtio_net: multiqueue support

This patch adds the multiqueue (VIRTIO_NET_F_MQ) support to virtio_net
driver. VIRTIO_NET_F_MQ capable device could allow the driver to do packet
transmission and reception through multiple queue pairs and does the packet
steering to get better performance. By default, one one queue pair is used, user
could change the number of queue pairs by ethtool in the next patch.

When multiple queue pairs is used and the number of queue pairs is equal to the
number of vcpus. Driver does the following optimizations to implement per-cpu
virt queue pairs:

- select the txq based on the smp processor id.
- smp affinity hint to the cpu that owns the queue pairs.

This could be used with the flow steering support of the device to guarantee the
packets of a single flow is handled by the same cpu.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e9d7417b 07-Dec-2012 Jason Wang <jasowang@redhat.com>

virtio-net: separate fields of sending/receiving queue from virtnet_info

To support multiqueue transmitq/receiveq, the first step is to separate queue
related structure from virtnet_info. This patch introduce send_queue and
receive_queue structure and use the pointer to them as the parameter in
functions handling sending/receiving.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8cc085d6 03-Dec-2012 Bill Pemberton <wfp5p@virginia.edu>

virtio_net: remove __dev* attributes

CONFIG_HOTPLUG is going away as an option. As result the __dev*
markings will be going away.

Remove use of __devinit, __devexit_p, __devinitdata, __devinitconst,
and __devexit.

Signed-off-by: Bill Pemberton <wfp5p@virginia.edu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# be443899 08-Nov-2012 Amerigo Wang <amwang@redhat.com>

virtio_net: use net_*_ratelimited() helpers

These can be converted to net_*_ratelimited().

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3b07e9ca 20-Aug-2012 Tejun Heo <tj@kernel.org>

workqueue: deprecate system_nrt[_freezable]_wq

system_nrt[_freezable]_wq are now spurious. Mark them deprecated and
convert all users to system[_freezable]_wq.

If you're cc'd and wondering what's going on: Now all workqueues are
non-reentrant, so there's no reason to use system_nrt[_freezable]_wq.
Please use system[_freezable]_wq instead.

This patch doesn't make any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-By: Lai Jiangshan <laijs@cn.fujitsu.com>

Cc: Jens Axboe <axboe@kernel.dk>
Cc: David Airlie <airlied@linux.ie>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>


# ee89bab1 09-Aug-2012 Amerigo Wang <amwang@redhat.com>

net: move and rename netif_notify_peers()

I believe net/core/dev.c is a better place for netif_notify_peers(),
because other net event notify functions also stay in this file.

And rename it to netdev_notify_peers().

Cc: David S. Miller <davem@davemloft.net>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e3906486 21-Jul-2012 Kevin Groeneveld <kgroeneveld@gmail.com>

net: fix race condition in several drivers when reading stats

Fix race condition in several network drivers when reading stats on 32bit
UP architectures. These drivers update their stats in a BH context and
therefore should use u64_stats_fetch_begin_bh/u64_stats_fetch_retry_bh
instead of u64_stats_fetch_begin/u64_stats_fetch_retry when reading the
stats.

Signed-off-by: Kevin Groeneveld <kgroeneveld@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f2f2c8b4 28-Jun-2012 Jiri Pirko <jpirko@redhat.com>

virtio_net: use IFF_LIVE_ADDR_CHANGE priv_flag

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d4fc6918 26-Jun-2012 Jiri Pirko <jpirko@redhat.com>

virtio_net: allow to change mac when iface is running

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 83a27052 05-Jun-2012 Eric Dumazet <edumazet@google.com>

virtio-net: fix a race on 32bit arches

commit 3fa2a1df909 (virtio-net: per cpu 64 bit stats (v2)) added a race
on 32bit arches.

We must use separate syncp for rx and tx path as they can be run at the
same time on different cpus. Thus one sequence increment can be lost and
readers spin forever.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3bbf372c 30-May-2012 Michael S. Tsirkin <mst@redhat.com>

virtio-net: remove useless disable on freeze

disable_cb is just an optimization: it
can not guarantee that there are no callbacks.
In particular it doesn't have any effect when
event index is on.

Instead, detach, napi disable and reset on freeze ensure we don't run
concurrently with a callback.

Remove the useless calls so we get same behaviour
with and without event index.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ec13ee80 16-May-2012 Michael S. Tsirkin <mst@redhat.com>

virtio_net: invoke softirqs after __napi_schedule

__napi_schedule might raise softirq but nothing
causes do_softirq to trigger, so it does not in fact
run. As a result,
the error message "NOHZ: local_softirq_pending 08"
sometimes occurs during boot of a KVM guest when the network service is
started and we are oom:

...
Bringing up loopback interface: [ OK ]
Bringing up interface eth0:
Determining IP information for eth0...NOHZ: local_softirq_pending 08
done.
[ OK ]
...

Further, receive queue processing might get delayed
indefinitely until some interrupt triggers:
virtio_net expected napi to be run immediately.

One way to cause do_softirq to be executed is by
invoking local_bh_enable(). As __napi_schedule is
normally called from bh or irq context, this
seems to make sense: disable bh before __napi_schedule
and enable afterwards.

In fact it's a very complicated way of calling do_softirq(),
and works since this function is only used when we are not
in interrupt context. It's not hot at all, in any ideal scenario.

Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Tested-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>


# 586d17c5 11-Apr-2012 Jason Wang <jasowang@redhat.com>

virtio-net: send gratuitous packets when needed

As hypervior does not have the knowledge of guest network configuration, it's
better to ask guest to send gratuitous packets when needed.

This patch implements VIRTIO_NET_F_GUEST_ANNOUNCE feature: hypervisor would
notice the guest when it thinks it's time for guest to announce the link
presnece. Guest tests VIRTIO_NET_S_ANNOUNCE bit during config change interrupt
and woule send gratuitous packets through netif_notify_peers() and ack the
notification through ctrl vq.

We need to make sure the atomicy of read and ack in guest otherwise we may ack
more times than being notified. This is done through handling the whole config
change interrupt in an non-reentrant workqueue.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 31304165 08-Apr-2012 Torsten Kaiser <just.for.lkml@googlemail.com>

net: Fix misplaced parenthesis in virtio_net.c

Commit 2e57b79ccef1ff1422fdf45a9b28fe60f8f084f7 misplaced its
parenthesis and now tx_fifo_errors will only be incremented if an
ENOMEM error is not written to the syslog.

Correct the parenthesis and indentation to the original goal of
counting all non ENOMEM errors and ratelimiting only the messages.

Signed-of-by: Torsten Kaiser <just.for.lkml@googlemail.com>

Signed-off-by: David S. Miller <davem@davemloft.net>


# 2e57b79c 27-Mar-2012 Rick Jones <rick.jones2@hp.com>

virtio_net: do not rate limit counter increments

While it is desirable to rate limit certain messages, it is not
desirable to rate limit the incrementing of counters associated
with those messages.

Signed-off-by: Rick Jones <rick.jones2@hp.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f2cedb63 14-Feb-2012 Danny Kukawka <danny.kukawka@bisect.de>

net: replace random_ether_addr() with eth_hw_addr_random()

Replace usage of random_ether_addr() with eth_hw_addr_random()
to set addr_assign_type correctly to NET_ADDR_RANDOM.

Change the trivial cases.

v2: adapt to renamed eth_hw_addr_random()

Signed-off-by: Danny Kukawka <danny.kukawka@bisect.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 58472a76 12-Feb-2012 Eric Dumazet <eric.dumazet@gmail.com>

virtio: net: remove sparse errors

commit 3fa2a1df909 (virtio-net: per cpu 64 bit stats (v2)) added extra
__percpu qualifiers and sparse errors.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0741bcb5 22-Dec-2011 Amit Shah <amit.shah@redhat.com>

virtio: net: Add freeze, restore handlers to support S4

Remove all the vqs, disable napi and detach from the netdev on
hibernation.

Re-create vqs after restoring from a hibernated image, re-enable napi
and re-attach the netdev. This keeps networking working across
hibernation.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 04486ed0 22-Dec-2011 Amit Shah <amit.shah@redhat.com>

virtio: net: Move vq and vq buf removal into separate function

The remove and PM freeze functions will share this code.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 3f9c10b0 22-Dec-2011 Amit Shah <amit.shah@redhat.com>

virtio: net: Move vq initialization into separate function

The probe and PM restore functions will share this code.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# f96fde41 11-Jan-2012 Rusty Russell <rusty@rustcorp.com.au>

virtio: rename virtqueue_add_buf_gfp to virtqueue_add_buf

Remove wrapper functions. This makes the allocation type explicit in
all callers; I used GPF_KERNEL where it seemed obvious, left it at
GFP_ATOMIC otherwise.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 3464645a 03-Jan-2012 Mike Waychison <mikew@google.com>

virtio_net: Pass gfp flags when allocating rx buffers.

Currently, the refill path for RX buffers will always allocate the
buffers as GFP_ATOMIC, even if we are in process context. This will
fail to apply memory pressure as the worker thread will not contribute
to the freeing of memory.

Fix this by changing add_recvbuf_small to use the gfp variant allocator,
__netdev_alloc_skb_ip_align().

Signed-off-by: Mike Waychison <mikew@google.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f1776dad 28-Dec-2011 Rusty Russell <rusty@rustcorp.com.au>

virtio_net: use non-reentrant workqueue.

Michael S. Tsirkin also noticed that we could run the refill work
multiple CPUs: if we kick off a refill on one CPU and then on another,
they would both manipulate the queue at the same time (they use
napi_disable to avoid racing against the receive handler itself).

Tejun points out that this is what the WQ_NON_REENTRANT flag is for,
and that there is a convenient system kthread we can use.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b2baed69 28-Dec-2011 Rusty Russell <rusty@rustcorp.com.au>

virtio_net: set/cancel work on ndo_open/ndo_stop

Michael S. Tsirkin noticed that we could run the refill work after
ndo_close, which can re-enable napi - we don't disable it until
virtnet_remove. This is clearly wrong, so move the workqueue control
to ndo_open and ndo_stop (aka. virtnet_open and virtnet_close).

One subtle point: virtnet_probe() could simply fail if it couldn't
allocate a receive buffer, but that's less polite in virtnet_open() so
we schedule a refill as we do in the normal receive path if we run out
of memory.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# eb939922 19-Dec-2011 Rusty Russell <rusty@rustcorp.com.au>

module_param: make bool parameters really bool (net & drivers/net)

module_param(bool) used to counter-intuitively take an int. In
fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy
trick.

It's time to remove the int/unsigned int option. For this version
it'll simply give a warning, but it'll break next kernel version.

(Thanks to Joe Perches for suggesting coccinelle for 0/1 -> true/false).

Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8e586137 08-Dec-2011 Jiri Pirko <jpirko@redhat.com>

net: make vlan ndo_vlan_rx_[add/kill]_vid return error value

Let caller know the result of adding/removing vlan id to/from vlan
filter.

In some drivers I make those functions to just return 0. But in those
where there is able to see if hw setup went correctly, return value is
set appropriately.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 021ac8d3 21-Nov-2011 Rick Jones <rick.jones2@hp.com>

virtio_net: return already tracked tx_fifo_errors via virtnet_getstats()

Tx_fifo_errors are tracked in start_xmit_ for virtio_net, but not
reported in the tallies returned by virtnet_stats(). Return them
as the rx "sub-stats" rx_length_errors and rx_frame_errors are.

Signed-off-by: Rick Jones <rick.jones2@hp.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 66846048 14-Nov-2011 Rick Jones <rick.jones2@hp.com>

enable virtio_net to return bus_info in ethtool -i consistent with emulated NICs

Add a new .bus_name to virtio_config_ops then modify virtio_net to
call through to it in an ethtool .get_drvinfo routine to report
bus_info in ethtool -i output which is consistent with other
emulated NICs and the output of lspci.

Signed-off-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 77dd7693 14-Aug-2011 Sasha Levin <levinsasha928@gmail.com>

virtio-net: Use virtio_config_val() for retrieving config

This patch modifies virtio-net to use virtio_config_val() instead
of a 'if(virtio_has_feature()) vdev->config->get()' construct to retrieve
optional values from the config space.

Cc: Amit Shah <amit.shah@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 8f9f4668 19-Oct-2011 Rick Jones <rick.jones2@hp.com>

Add ethtool -g support to virtio_net

Add support for reporting ring sizes via ethtool -g to the virtio_net
driver.

Signed-off-by: Rick Jones <rick.jones2@hp.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4b727361 19-Oct-2011 Eric Dumazet <eric.dumazet@gmail.com>

virtio_net: fix truesize underestimation

We must account in skb->truesize, the size of the fragments, not the
used part of them.

Doing this work is important to avoid unexpected OOM situations.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: "Michael S. Tsirkin" <mst@redhat.com>
CC: virtualization@lists.linux-foundation.org
CC: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8a59a7b9 19-Oct-2011 Krishna Kumar <krkumar2@in.ibm.com>

virtio_net: Clean up set_skb_frag()

Remove manual initialization in set_skb_frag, and instead
use __skb_fill_page_desc() to do the same. Patch tested
on net-next.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9e903e08 18-Oct-2011 Eric Dumazet <eric.dumazet@gmail.com>

net: add skb frag size accessors

To ease skb->truesize sanitization, its better to be able to localize
all references to skb frags size.

Define accessors : skb_frag_size() to fetch frag size, and
skb_frag_size_{set|add|sub}() to manipulate it.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e878d78b 27-Sep-2011 Sasha Levin <levinsasha928@gmail.com>

virtio-net: Verify page list size before fitting into skb

This patch verifies that the length of a buffer stored in a linked list
of pages is small enough to fit into a skb.

If the size is larger than a max size of a skb, it means that we shouldn't
go ahead building skbs anyway since we won't be able to send the buffer as
the user requested.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Cc: netdev@vger.kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 86ee8130 21-Sep-2011 Ian Campbell <Ian.Campbell@citrix.com>

virtionet: convert to SKB paged frag API.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>


# 01789349 16-Aug-2011 Jiri Pirko <jpirko@redhat.com>

net: introduce IFF_UNICAST_FLT private flag

Use IFF_UNICAST_FTL to find out if driver handles unicast address
filtering. In case it does not, promisc mode is entered.

Patch also fixes following drivers:
stmmac, niu: support uc filtering and yet it propagated
ndo_set_multicast_list
bna, benet, pxa168_eth, ks8851, ks8851_mll, ksz884x : has set
ndo_set_rx_mode but do not support uc filtering

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2e66f55b 19-Jul-2011 Krishna Kumar <krkumar2@in.ibm.com>

virtio_net: Fix panic in virtnet_remove

Fix a panic in virtnet_remove. unregister_netdev has already
freed up the netdev (and virtnet_info) due to dev->destructor
being set, while virtnet_info is still required. Remove
virtnet_free altogether, and move the freeing of the per-cpu
statistics from virtnet_free to virtnet_remove.

Tested patch below.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3fa2a1df 15-Jun-2011 stephen hemminger <shemminger@vyatta.com>

virtio-net: per cpu 64 bit stats (v2)

Use per-cpu variables to maintain 64 bit statistics.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@conan.davemloft.net>


# 10a8d94a 09-Jun-2011 Jason Wang <jasowang@redhat.com>

virtio_net: introduce VIRTIO_NET_HDR_F_DATA_VALID

There's no need for the guest to validate the checksum if it have been
validated by host nics. So this patch introduces a new flag -
VIRTIO_NET_HDR_F_DATA_VALID which is used to bypass the checksum
examing in guest. The backend (tap/macvtap) may set this flag when
met skbs with CHECKSUM_UNNECESSARY to save cpu utilization.

No feature negotiation is needed as old driver just ignore this flag.

Iperf shows 12%-30% performance improvement for UDP traffic. For TCP,
when gro is on no difference as it produces skb with partial
checksum. But when gro is disabled, 20% or even higher improvement
could be measured by netperf.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7a66f784 19-May-2011 Michael S. Tsirkin <mst@redhat.com>

virtio_net: delay TX callbacks

Ask for delayed callbacks on TX ring full, to give the
other side more of a chance to make progress.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 98e778c9 30-Mar-2011 Michał Mirosław <mirq-linux@rere.qmqm.pl>

virtio_net: convert to hw_features

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3e9d08ec 10-Feb-2011 Bruce Rogers <brogers@novell.com>

virtio_net: Add schedule check to napi_enable call

Under harsh testing conditions, including low memory, the guest would
stop receiving packets. With this patch applied we no longer see any
problems in the driver while performing these tests for extended periods
of time.

Make sure napi is scheduled subsequent to each napi_enable.

Signed-off-by: Bruce Rogers <brogers@novell.com>
Signed-off-by: Olaf Kirch <okir@suse.de>
Cc: stable@kernel.org
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 55508d60 14-Dec-2010 Michał Mirosław <mirq-linux@rere.qmqm.pl>

net: Use skb_checksum_start_offset()

Replace skb->csum_start - skb_headroom(skb) with skb_checksum_start_offset().

Note for usb/smsc95xx: skb->data - skb->head == skb_headroom(skb).

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 167c25e4 10-Nov-2010 Jason Wang <jasowang@redhat.com>

virtio-net: init link state correctly

For device that supports VIRTIO_NET_F_STATUS, there's no need to
assume the link is up and we need to call nerif_carrier_off() before
querying device status, otherwise we may get wrong operstate after
diver was loaded because the link watch event was not fired as
expected.

For device that does not support VIRITO_NET_F_STATUS, we could not get
its status through virtnet_update_status() and what we can only do is
always assuming the link is up.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 01414802 17-Aug-2010 Ben Hutchings <bhutchings@solarflare.com>

ethtool: Provide a default implementation of ethtool_ops::get_drvinfo

The driver name and bus address for a net_device can normally be found
through the driver model now. Instead of requiring drivers to provide
this information redundantly through the ethtool_ops::get_drvinfo
operation, use the driver model to do so if the driver does not define
the operation. Since ETHTOOL_GDRVINFO no longer requires the driver
to implement any operations, do not require net_device::ethtool_ops to
be set either.

Remove implementations of get_drvinfo and ethtool_ops that provide
only this information.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a767bde4 04-Aug-2010 Rusty Russell <rusty@rustcorp.com.au>

virtio_net: implements ethtool_ops.get_drvinfo

I often use "ethtool -i" command to check what driver controls the
ehternet device. But because current virtio_net driver doesn't
support "ethtool -i", it becomes the following:

# ethtool -i eth3
Cannot get driver information: Operation not supported

This patch simply adds the "ethtool -i" support. The following is the
result when using the virtio_net driver with my patch applied to.

# ethtool -i eth3
driver: virtio_net
version: N/A
firmware-version: N/A
bus-info: virtio0

Personally, "-i" is one of the most frequently-used option, and most
network drivers support "ethtool -i", so I think virtio_net also
should do.

Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (use ARRAY_SIZE)
Signed-off-by: David S. Miller <davem@davemloft.net>


# 58eba97d 02-Jul-2010 Rusty Russell <rusty@rustcorp.com.au>

virtio_net: fix oom handling on tx

virtio net will never try to overflow the TX ring, so the only reason
add_buf may fail is out of memory. Thus, we can not stop the
device until some request completes - there's no guarantee anything
at all is outstanding.

Make the error message clearer as well: error here does not
indicate queue full.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (...and avoid TX_BUSY)
Cc: stable@kernel.org # .34.x (s/virtqueue_/vi->svq->vq_ops->/)
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1788f495 02-Jul-2010 Michael S. Tsirkin <mst@redhat.com>

virtio_net: do not reschedule rx refill forever

We currently fill all of RX ring, then add_buf
returns ENOSPC, which gets mis-detected as an out of
memory condition and causes us to reschedule the work,
and so on forever. Fix this by oom = err == -ENOMEM;

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org # .34.x
Signed-off-by: David S. Miller <davem@davemloft.net>


# aa989f5e 30-May-2010 Michael S. Tsirkin <mst@redhat.com>

virtio-net: pass gfp to add_buf

virtio-net bounces buffer allocations off to
a thread if it can't allocate buffers from the atomic
pool. However, if posting buffers still requires atomic
buffers, this is unlikely to succeed.
Fix by passing in the proper gfp_t parameter.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1915a712 12-Apr-2010 Michael S. Tsirkin <mst@redhat.com>

virtio_net: use virtqueue_xxx wrappers

Switch virtio_net to new virtqueue_xxx wrappers.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# b4bf665c 14-Apr-2010 David S. Miller <davem@davemloft.net>

virtio_net: Fix mis-merge.

Pointed out by Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 0e413f22 29-Mar-2010 Shirley Ma <mashirle@us.ibm.com>

virtio_net: missing sg_init_table

Add missing sg_init_table for sg_set_buf in virtio_net which
induced in defer skb patch.

Reported-by: Thomas Müller <thomas@mathtm.de>
Tested-by: Thomas Müller <thomas@mathtm.de>
Signed-off-by: Shirley Ma <xma@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5e01d2f9 07-Apr-2010 Michael S. Tsirkin <mst@redhat.com>

virtio-net: move sg off stack

Move sg structure off stack and into virtnet_info structure.
This helps remove extra sg_init_table calls as well as reduce
stack usage.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 22bedad3 01-Apr-2010 Jiri Pirko <jpirko@redhat.com>

net: convert multicast list to list_head

Converts the list and the core manipulating with it to be the same as uc_list.

+uses two functions for adding/removing mc address (normal and "global"
variant) instead of a function parameter.
+removes dev_mcast.c completely.
+exposes netdev_hw_addr_list_* macros along with __hw_addr_* functions for
manipulation with lists on a sandbox (used in bonding and 80211 drivers)

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2c45cd43 29-Mar-2010 Shirley Ma <mashirle@us.ibm.com>

virtio_net: missing sg_init_table

Add missing sg_init_table for sg_set_buf in virtio_net which
induced in defer skb patch.

Reported-by: Thomas Müller <thomas@mathtm.de>
Tested-by: Thomas Müller <thomas@mathtm.de>
Signed-off-by: Shirley Ma <xma@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# 2507c05f 02-Mar-2010 Jiri Pirko <jpirko@redhat.com>

virtio_net: remove forgotten assignment

This is no longer needed. I missed to remove this in
567ec874d15b478c8eda7e9a5d2dcb05f13f1fb5 ("net: convert multiple
drivers to use netdev_for_each_mc_addr, part6")

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 567ec874 23-Feb-2010 Jiri Pirko <jpirko@redhat.com>

net: convert multiple drivers to use netdev_for_each_mc_addr, part6

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 830a8a97 08-Feb-2010 Shirley Ma <mashirle@us.ibm.com>

virtio_net: remove send queue

Now we have a virtio detach API (in commit
f9bfbebf34eab707b065116cdc9699d25ba4252a), we don't need to track xmit
skbs in the virio_net driver, which improves transmission performance.

Signed-off-by: Shirley Ma <xma@us.ibm.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4cd24eaf 07-Feb-2010 Jiri Pirko <jpirko@redhat.com>

net: use netdev_mc_count and netdev_mc_empty when appropriate

This patch replaces dev->mc_count in all drivers (hopefully I didn't miss
anything). Used spatch and did small tweaks and conding style changes when
it was suitable.

Jirka

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9ab86bbc 28-Jan-2010 Shirley Ma <mashirle@us.ibm.com>

virtio_net: Defer skb allocation in receive path Date: Wed, 13 Jan 2010 12:53:38 -0800

virtio_net receives packets from its pre-allocated vring buffers, then it
delivers these packets to upper layer protocols as skb buffs. So it's not
necessary to pre-allocate skb for each mergable buffer, then frees extra
skbs when buffers are merged into a large packet. This patch has deferred
skb allocation in receiving packets for both big packets and mergeable buffers
to reduce skb pre-allocations and skb frees. It frees unused buffers by calling
detach_unused_buf in vring, so recv skb queue is not needed.

Signed-off-by: Shirley Ma <xma@us.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 39d32157 25-Jan-2010 Herbert Xu <herbert@gondor.apana.org.au>

virtio_net: Make delayed refill more reliable

I have seen RX stalls on a machine that experienced a suspected
OOM. After the stall, the RX buffer is empty on the guest side
and there are exactly 16 entries available on the host side. As
the number of entries is less than that required by a maximal
skb, the host cannot proceed.

The guest did not have a refill job scheduled.

My diagnosis is that an OOM had occured, with the delayed refill
job scheduled. The job was able to allocate at least one skb, but
not enough to overcome the minimum required by the host to proceed.

As the refill job would only reschedule itself if it failed completely
to allocate any skbs, this would lead to an RX stall.

The following patch removes this stall possibility by always
rescheduling the refill job until the ring is totally refilled.

Testing has shown that the RX stall no longer occurs whereas
previously it would occur within a day.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 32e7bfc4 25-Jan-2010 Jiri Pirko <jpirko@redhat.com>

net: use helpers to access uc list V2

This patch introduces three macros to work with uc list from net drivers.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8e95a202 03-Dec-2009 Joe Perches <joe@perches.com>

drivers/net: Move && and || to end of previous line

Only files where David Miller is the primary git-signer.
wireless, wimax, ixgbe, etc are not modified.

Compile tested x86 allyesconfig only
Not all files compiled (not x86 compatible)

Added a few > 80 column lines, which I ignored.
Existing checkpatch complaints ignored.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 22402529 05-Nov-2009 Uwe Kleine-König <u.kleine-koenig@pengutronix.de>

virtio_net: rename driver struct to please modpost

Commit

3d1285b (move virtnet_remove to .devexit.text)

introduced the first reference to __devexit in struct virtio_driver
virtio_net which upset modpost ("Section mismatch in reference from the
variable virtio_net to the function .devexit.text:virtnet_remove()").

Fix this by renaming virtio_net to virtio_net_driver.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reported-by: Michael S. Tsirkin <mst@redhat.com>
Blame-taken-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 03f191ba 28-Oct-2009 Michael S. Tsirkin <mst@redhat.com>

virtio-net: fix data corruption with OOM

virtio net used to unlink skbs from send queues on error,
but ever since 48925e372f04f5e35fec6269127c62b2c71ab794
we do not do this. This causes guest data corruption and crashes
with vhost since net core can requeue the skb or free it without
it being taken off the list.

This patch fixes this by queueing the skb after successful
transmit.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e95646c3 30-Sep-2009 Christian Borntraeger <borntraeger@de.ibm.com>

virtio: let header files include virtio_ids.h

Rusty,

commit 3ca4f5ca73057a617f9444a91022d7127041970a
virtio: add virtio IDs file
moved all device IDs into a single file. While the change itself is
a very good one, it can break userspace applications. For example
if a userspace tool wanted to get the ID of virtio_net it used to
include virtio_net.h. This does no longer work, since virtio_net.h
does not include virtio_ids.h.
This patch moves all "#include <linux/virtio_ids.h>" from the C
files into the header files, making the header files compatible with
the old ones.

In addition, this patch exports virtio_ids.h to userspace.

CC: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# ed79bab8 14-Oct-2009 Eric Dumazet <eric.dumazet@gmail.com>

virtio_net: use dev_kfree_skb_any() in free_old_xmit_skbs()

Because netpoll can call netdevice start_xmit() method with
irqs disabled, drivers should not call kfree_skb() from
their start_xmit(), but use dev_kfree_skb_any() instead.

Oct 8 11:16:52 172.30.1.31 [113074.791813] ------------[ cut here ]------------
Oct 8 11:16:52 172.30.1.31 [113074.791813] WARNING: at net/core/skbuff.c:398 \
skb_release_head_state+0x64/0xc8()
Oct 8 11:16:52 172.30.1.31 [113074.791813] Hardware name:
Oct 8 11:16:52 172.30.1.31 [113074.791813] Modules linked in: netconsole ocfs2 jbd2 quota_tree \
ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs crc32c drbd cn loop \
serio_raw psmouse snd_pcm snd_timer snd soundcore snd_page_alloc virtio_net pcspkr parport_pc parport \
i2c_piix4 i2c_core button processor evdev ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot \
dm_mod ide_cd_mod cdrom ata_generic ata_piix virtio_blk libata scsi_mod piix ide_pci_generic ide_core \
virtio_pci virtio_ring virtio floppy thermal fan thermal_sys [last unloaded: netconsole]
Oct 8 11:16:52 172.30.1.31 [113074.791813] Pid: 11132, comm: php5-cgi Tainted: G W \
2.6.31.2-vserver #1
Oct 8 11:16:52 172.30.1.31 [113074.791813] Call Trace:
Oct 8 11:16:52 172.30.1.31 [113074.791813] <IRQ> [<ffffffff81253cd5>] ? \
skb_release_head_state+0x64/0xc8
Oct 8 11:16:52 172.30.1.31 [113074.791813] [<ffffffff81253cd5>] ? skb_release_head_state+0x64/0xc8
Oct 8 11:16:52 172.30.1.31 [113074.791813] [<ffffffff81049ae1>] ? warn_slowpath_common+0x77/0xa3
Oct 8 11:16:52 172.30.1.31 [113074.791813] [<ffffffff81253cd5>] ? skb_release_head_state+0x64/0xc8
Oct 8 11:16:52 172.30.1.31 [113074.791813] [<ffffffff81253a1a>] ? __kfree_skb+0x9/0x7d
Oct 8 11:16:52 172.30.1.31 [113074.791813] [<ffffffffa01cb139>] ? free_old_xmit_skbs+0x51/0x6e \
[virtio_net]
Oct 8 11:16:52 172.30.1.31 [113074.791813] [<ffffffffa01cbc85>] ? start_xmit+0x26/0xf2 [virtio_net]
Oct 8 11:16:52 172.30.1.31 [113074.791813] [<ffffffff8126934f>] ? netpoll_send_skb+0xd2/0x205
Oct 8 11:16:52 172.30.1.31 [113074.791813] [<ffffffffa0429216>] ? write_msg+0x90/0xeb [netconsole]
Oct 8 11:16:52 172.30.1.31 [113074.791813] [<ffffffff81049f06>] ? __call_console_drivers+0x5e/0x6f
Oct 8 11:16:52 172.30.1.31 [113074.791813] [<ffffffff8102b49d>] ? kvm_clock_read+0x4d/0x52
Oct 8 11:16:52 172.30.1.31 [113074.791813] [<ffffffff8104a082>] ? release_console_sem+0x115/0x1ba
Oct 8 11:16:52 172.30.1.31 [113074.791813] [<ffffffff8104a632>] ? vprintk+0x2f2/0x34b
Oct 8 11:16:52 172.30.1.31 [113074.791813] [<ffffffff8106b142>] ? vx_update_load+0x18/0x13e
Oct 8 11:16:52 172.30.1.31 [113074.791813] [<ffffffff81308309>] ? printk+0x4e/0x5d
Oct 8 11:16:52 172.30.1.31 [113074.791813] [<ffffffff8102b49d>] ? kvm_clock_read+0x4d/0x52
Oct 8 11:16:52 172.30.1.31 [113074.791813] [<ffffffff81070b62>] ? getnstimeofday+0x55/0xaf
Oct 8 11:16:52 172.30.1.31 [113074.791813] [<ffffffff81062683>] ? ktime_get_ts+0x21/0x49
Oct 8 11:16:52 172.30.1.31 [113074.791813] [<ffffffff810626b7>] ? ktime_get+0xc/0x41
Oct 8 11:16:52 172.30.1.31 [113074.791813] [<ffffffff81062788>] ? hrtimer_interrupt+0x9c/0x146
Oct 8 11:16:52 172.30.1.31 [113074.791813] [<ffffffff81024a4b>] ? smp_apic_timer_interrupt+0x80/0x93
Oct 8 11:16:52 172.30.1.31 [113074.791813] [<ffffffff81011663>] ? apic_timer_interrupt+0x13/0x20
Oct 8 11:16:52 172.30.1.31 [113074.791813] <EOI> [<ffffffff8130a9eb>] ? _spin_unlock_irq+0xd/0x31

Reported-and-tested-by: Massimo Cetra <mcetra@navynet.it>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Bug-Entry: http://bugzilla.kernel.org/show_bug.cgi?id=14378
Signed-off-by: David S. Miller <davem@davemloft.net>


# 89d71a66 12-Oct-2009 Eric Dumazet <eric.dumazet@gmail.com>

net: Use netdev_alloc_skb_ip_align()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3d1285be 30-Sep-2009 Uwe Kleine-König <u.kleine-koenig@pengutronix.de>

move virtnet_remove to .devexit.text

The function virtnet_remove is used only wrapped by __devexit_p so define
it using __devexit.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Alex Williamson <alex.williamson@hp.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0aea51c3 26-Aug-2009 Amit Shah <amit.shah@redhat.com>

virtio_net: Check for room in the vq before adding buffer

Saves us one cycle of alloc-add-free if the queue was full.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (modified)


# 48925e37 24-Sep-2009 Rusty Russell <rusty@rustcorp.com.au>

virtio_net: avoid (most) NETDEV_TX_BUSY by stopping queue early.

Now we can tell the theoretical capacity remaining in the output
queue, virtio_net can waste entries by stopping the queue early.

It doesn't work in the case of indirect buffers and kmalloc failure,
but that's rare (we could drop the packet in that case, but other
drivers return TX_BUSY for similar reasons).

For the record, I think this patch reflects poorly on the linux
network API.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Dinesh Subhraveti <dineshs@us.ibm.com>


# b3f24698 24-Sep-2009 Rusty Russell <rusty@rustcorp.com.au>

virtio_net: formalize skb_vnet_hdr

We put the virtio_net_hdr into the skb's cb region; turn this into a
union to clean up the code slightly and allow future expansion.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Mark McLoughlin <markmc@redhat.com>
Cc: Dinesh Subhraveti <dineshs@us.ibm.com>


# b0c39dbd 24-Sep-2009 Rusty Russell <rusty@rustcorp.com.au>

virtio_net: don't free buffers in xmit ring

The virtio_net driver is complicated by the two methods of freeing old
xmit buffers (in addition to freeing old ones at the start of the xmit
path).

The original code used a 1/10 second timer attached to xmit_free(),
reset on every xmit. Before we orphaned skbs on xmit, the
transmitting userspace could block with a full socket until the timer
fired, the skb destructor was called, and they were re-woken.

So we added the VIRTIO_F_NOTIFY_ON_EMPTY feature: supporting devices
send an interrupt (even if normally suppressed) on an empty xmit ring
which makes us schedule xmit_tasklet(). This was a benchmark win.

Unfortunately, VIRTIO_F_NOTIFY_ON_EMPTY makes quite a lot of work: a
host which is faster than the guest will fire the interrupt every xmit
packet (slowing the guest down further). Attempting mitigation in the
host adds overhead of userspace timers (possibly with the additional
pain of signals), and risks increasing latency anyway if you get it
wrong.

In practice, this effect was masked by benchmarks which take advantage
of GSO (with its inherent transmit batching), but it's still there.

Now we orphan xmitted skbs, the pressure is off: remove both paths and
no longer request VIRTIO_F_NOTIFY_ON_EMPTY. Note that the current
QEMU will notify us even if we don't negotiate this feature (legal,
but suboptimal); a patch is outstanding to improve that.

Move the skb_orphan/nf_reset to after we've done the send and notified
the other end, for a slight optimization.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Mark McLoughlin <markmc@redhat.com>


# 8958f574 24-Sep-2009 Rusty Russell <rusty@rustcorp.com.au>

virtio_net: return NETDEV_TX_BUSY instead of queueing an extra skb.

This effectively reverts 99ffc696d10b28580fe93441d627cf290ac4484c
"virtio: wean net driver off NETDEV_TX_BUSY".

The complexity of queuing an skb (setting a tasklet to re-xmit) is
questionable, especially once we get rid of the other reason for the
tasklet in the next patch.

If the skb won't fit in the tx queue, just return NETDEV_TX_BUSY.
This is frowned upon, so a followup patch uses a more complex solution.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Herbert Xu <herbert@gondor.apana.org.au>


# 2b5bbe3b 24-Sep-2009 Rusty Russell <rusty@rustcorp.com.au>

virtio_net: skb_orphan() and nf_reset() in xmit path.

The complex transmit free logic was introduced to avoid hangs on
removing the ip_conntrack module and also because drivers aren't
generally supposed to keep stale skbs for unbounded times.

After some debate, it was decided that while doing skb_orphan()
generally is a rat's nest, we can do it in this driver. Following
patches take advantage of this.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 3ca4f5ca 31-Jul-2009 Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>

virtio: add virtio IDs file

Virtio IDs are spread all over the tree which makes assigning new IDs
bothersome. Putting them together should make the process less error-prone.

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 3c1b27d5 23-Sep-2009 Rusty Russell <rusty@rustcorp.com.au>

virtio: make add_buf return capacity remaining

This API change means that virtio_net can tell how much capacity
remains for buffers. It's necessarily fuzzy, since
VIRTIO_RING_F_INDIRECT_DESC means we can fit any number of descriptors
in one, *if* we can kmalloc.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Dinesh Subhraveti <dineshs@us.ibm.com>


# 0fc0b732 02-Sep-2009 Stephen Hemminger <shemminger@vyatta.com>

netdev: drivers should make ethtool_ops const

No need to put ethtool_ops in data, they should be const.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 424efe9c 31-Aug-2009 Stephen Hemminger <shemminger@vyatta.com>

netdev: convert pseudo drivers to netdev_tx_t

These are all drivers that don't touch real hardware.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3161e453 26-Aug-2009 Rusty Russell <rusty@rustcorp.com.au>

virtio: net refill on out-of-memory

If we run out of memory, use keventd to fill the buffer. There's a
report of this happening: "Page allocation failures in guest",
Message-ID: <20090713115158.0a4892b0@mjolnir.ossman.eu>

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5c516751 14-Jul-2009 Sridhar Samudrala <sri@us.ibm.com>

virtio-net: Allow UFO feature to be set and advertised.

- Allow setting UFO on virtio-net and advertise to host.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 31278e71 16-Jun-2009 Jiri Pirko <jpirko@redhat.com>

net: group address list and its count

This patch is inspired by patch recently posted by Johannes Berg. Basically what
my patch does is to group list and a count of addresses into newly introduced
structure netdev_hw_addr_list. This brings us two benefits:
1) struct net_device becames a bit nicer.
2) in the future there will be a possibility to operate with lists independently
on netdevices (with exporting right functions).
I wanted to introduce this patch before I'll post a multicast lists conversion.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>

drivers/net/bnx2.c | 4 +-
drivers/net/e1000/e1000_main.c | 4 +-
drivers/net/ixgbe/ixgbe_main.c | 6 +-
drivers/net/mv643xx_eth.c | 2 +-
drivers/net/niu.c | 4 +-
drivers/net/virtio_net.c | 10 ++--
drivers/s390/net/qeth_l2_main.c | 2 +-
include/linux/netdevice.h | 17 +++--
net/core/dev.c | 130 ++++++++++++++++++--------------------
9 files changed, 89 insertions(+), 90 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>


# d2a7ddda 12-Jun-2009 Michael S. Tsirkin <mst@redhat.com>

virtio: find_vqs/del_vqs virtio operations

This replaces find_vq/del_vq with find_vqs/del_vqs virtio operations,
and updates all drivers. This is needed for MSI support, because MSI
needs to know the total number of vectors upfront.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (+ lguest/9p compile fixes)


# 9499f5e7 12-Jun-2009 Rusty Russell <rusty@rustcorp.com.au>

virtio: add names to virtqueue struct, mapping from devices to queues.

Add a linked list of all virtqueues for a virtio device: this helps for
debugging and is also needed for upcoming interface change.

Also, add a "name" field for clearer debug messages.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 8981f010 11-Jun-2009 Herbert Xu <herbert@gondor.apana.org.au>

virtio_net: Fix IP alignment on non-mergeable RX path

We need to enforce the IP alignment on the non-mergeable RX path just
like the other RX path. Not doing so results in misaligned IP
headers.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b82f08ea 03-Jun-2009 Herbert Xu <herbert@gondor.apana.org.au>

virtio_net: Set correct gso->hdr_len

Through a bug in the tun driver, I noticed that virtio_net is
producing bogus hdr_len values. In particular, it only includes
the IP header in the linear area, and excludes the entire TCP
header. This causes the TCP header to be copied twice for each
packet. (The bug omitted the second copy :)

This patch corrects this.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ccffad25 22-May-2009 Jiri Pirko <jpirko@redhat.com>

net: convert unicast addr list

This patch converts unicast address list to standard list_head using
previously introduced struct netdev_hw_addr. It also relaxes the
locking. Original spinlock (still used for multicast addresses) is not
needed and is no longer used for a protection of this list. All
reading and writing takes place under rtnl (with no changes).

I also removed a possibility to specify the length of the address
while adding or deleting unicast address. It's always dev->addr_len.

The convertion touched especially e1000 and ixgbe codes when the
change is not so trivial.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>

drivers/net/bnx2.c | 13 +--
drivers/net/e1000/e1000_main.c | 24 +++--
drivers/net/ixgbe/ixgbe_common.c | 14 ++--
drivers/net/ixgbe/ixgbe_common.h | 4 +-
drivers/net/ixgbe/ixgbe_main.c | 6 +-
drivers/net/ixgbe/ixgbe_type.h | 4 +-
drivers/net/macvlan.c | 11 +-
drivers/net/mv643xx_eth.c | 11 +-
drivers/net/niu.c | 7 +-
drivers/net/virtio_net.c | 7 +-
drivers/s390/net/qeth_l2_main.c | 6 +-
drivers/scsi/fcoe/fcoe.c | 16 ++--
include/linux/netdevice.h | 18 ++--
net/8021q/vlan.c | 4 +-
net/8021q/vlan_dev.c | 10 +-
net/core/dev.c | 195 +++++++++++++++++++++++++++-----------
net/dsa/slave.c | 10 +-
net/packet/af_packet.c | 4 +-
18 files changed, 227 insertions(+), 137 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1824a989 01-May-2009 Alex Williamson <alex.williamson@hp.com>

virtio_net: Fix function name typo

Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 23e258e1 01-May-2009 Alex Williamson <alex.williamson@hp.com>

virtio_net: Cleanup command queue scatterlist usage

We were avoiding calling sg_init* on scatterlists passed
into virtnet_send_command to prevent extraneous end markers.
This caused build warnings for uninitialized variables.
Cleanup the code to create proper scatterlists.

Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0ee904c3 11-Apr-2009 Alexander Beregalov <a.beregalov@gmail.com>

drivers/net: replace BUG() with BUG_ON() if possible

Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 62994b2d 04-Apr-2009 Alex Williamson <alex.williamson@hp.com>

virtio_net: Set the mac config only when VIRITO_NET_F_MAC

VIRTIO_NET_F_MAC indicates the presence of the mac field in config
space, not the validity of the value it contains. Allow the mac to be
changed at runtime, but only push the change into config space with the
VIRTIO_NET_F_MAC feature present.

Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4783256e 18-Mar-2009 Pantelis Koukousoulas <pktoss@gmail.com>

virtio_net: Make virtio_net support carrier detection

Impact: Make NetworkManager work with virtio_net

For now the semantics are simple: There is always carrier.

This allows a seamless experience with e.g., qemu/kvm
where NetworkManager just configures and sets up
everything automagically.

If/when a generally agreed-upon way to control
carrier on/off in the emulator/hypervisor level
emerges, it will be trivial to extend the driver
to support that too, but for now even this 2-liner
makes user experience that much better.

Signed-off-by: Pantelis Koukousoulas <pktoss@gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9c46f6d4 04-Feb-2009 Alex Williamson <alex.williamson@hp.com>

virtio_net: Allow setting the MAC address of the NIC

Many physical NICs let the OS re-program the "hardware" MAC
address. Virtual NICs should allow this too.

Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Acked-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0bde9569 04-Feb-2009 Alex Williamson <alex.williamson@hp.com>

virtio_net: Add support for VLAN filtering in the hypervisor

VLAN filtering allows the hypervisor to drop packets from VLANs
that we're not a part of, further reducing the number of extraneous
packets recieved. This makes use of the VLAN virtqueue command class.
The CTRL_VLAN feature bit tells us whether the backend supports VLAN
filtering.

Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f565a7c2 04-Feb-2009 Alex Williamson <alex.williamson@hp.com>

virtio_net: Add a MAC filter table

Make use of the MAC control virtqueue class to support a MAC
filter table. The filter table is managed by the hypervisor.
We consider the table to be available if the CTRL_RX feature
bit is set. We leave it to the hypervisor to manage the table
and enable promiscuous or all-multi mode as necessary depending
on the resources available to it.

Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2af7698e 04-Feb-2009 Alex Williamson <alex.williamson@hp.com>

virtio_net: Add a set_rx_mode interface

Make use of the RX_MODE control virtqueue class to enable the
set_rx_mode netdev interface. This allows us to selectively
enable/disable promiscuous and allmulti mode so we don't see
packets we don't want. For now, we automatically enable these
as needed if additional unicast or multicast addresses are
requested.

Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2a41f71d 04-Feb-2009 Alex Williamson <alex.williamson@hp.com>

virtio_net: Add a virtqueue for outbound control commands

This will be used for RX mode, MAC filter table, VLAN filtering, etc...

The control transaction consists of one or more "out" sg entries and
one or more "in" sg entries. The first out entry contains a header
defining the class and command. Additional out entries may provide
data for the command. The last in entry provides a status response
back from the command.

Virtqueues typically run asynchronous, running a callback function
when there's data in the channel. We can't readily make use of this
in the command paths where we need to use this. Instead, we kick
the virtqueue and spin. The kick causes an I/O write, triggering an
immediate trap into the hypervisor.

Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8527bec5 26-Jan-2009 Ira W. Snyder <iws@ovro.caltech.edu>

virtio_net: use correct accessors for scatterlists

Without this fix, virtio_net makes incorrect usage of scatterlists. It sets
the end of the scatterlist chain after the first element, despite the fact
that more entries come after it.

If you try to run dma_map_sg() on one of the scatterlists given to you by
add_buf(), you will get a null pointer oops.

Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e918085a 25-Jan-2009 Alex Williamson <alex.williamson@hp.com>

virtio_net: Fix MAX_PACKET_LEN to support 802.1Q VLANs

802.1Q expanded the maximum ethernet frame size by 4 bytes for the
VLAN tag. We're not taking this into account in virtio_net, which
means the buffers we provide to the backend in the virtqueue RX ring
aren't big enough to hold a full MTU VLAN packet. For QEMU/KVM,
this results in the backend exiting with a packet truncation error.

Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Acked-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9f4d26d0 19-Jan-2009 Mark McLoughlin <markmc@redhat.com>

virtio_net: add link status handling

Allow the host to inform us that the link is down by adding
a VIRTIO_NET_F_STATUS which indicates that device status is
available in virtio_net config.

This is currently useful for simulating link down conditions
(e.g. using proposed qemu 'set_link' monitor command) but
would also be needed if we were to support device assignment
via virtio.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (added future masking)
Signed-off-by: David S. Miller <davem@davemloft.net>


# 288379f0 19-Jan-2009 Ben Hutchings <bhutchings@solarflare.com>

net: Remove redundant NAPI functions

Following the removal of the unused struct net_device * parameter from
the NAPI functions named *netif_rx_* in commit 908a7a1, they are
exactly equivalent to the corresponding *napi_* functions and are
therefore redundant.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 76288b4e 06-Jan-2009 Stephen Hemminger <shemminger@vyatta.com>

virtio: convert to net_device_ops

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 908a7a16 22-Dec-2008 Neil Horman <nhorman@tuxdriver.com>

net: Remove unused netdev arg from some NAPI interfaces.

When the napi api was changed to separate its 1:1 binding to the net_device
struct, the netif_rx_[prep|schedule|complete] api failed to remove the now
vestigual net_device structure parameter. This patch cleans up that api by
properly removing it..

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 39da5814 26-Nov-2008 Mark McLoughlin <markmc@redhat.com>

virtio_net: large tx MTU support

We don't really have a max tx packet size limit, so allow configuring
the device with up to 64k tx MTU.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 3f2c31d9 16-Nov-2008 Mark McLoughlin <markmc@redhat.com>

virtio_net: VIRTIO_NET_F_MSG_RXBUF (imprive rcv buffer allocation)

If segmentation offload is enabled by the host, we currently allocate
maximum sized packet buffers and pass them to the host. This uses up
20 ring entries, allowing us to supply only 20 packet buffers to the
host with a 256 entry ring. This is a huge overhead when receiving
small packets, and is most keenly felt when receiving MTU sized
packets from off-host.

The VIRTIO_NET_F_MRG_RXBUF feature flag is set by hosts which support
using receive buffers which are smaller than the maximum packet size.
In order to transfer large packets to the guest, the host merges
together multiple receive buffers to form a larger logical buffer.
The number of merged buffers is returned to the guest via a field in
the virtio_net_hdr.

Make use of this support by supplying single page receive buffers to
the host. On receive, we extract the virtio_net_hdr, copy 128 bytes of
the payload to the skb's linear data buffer and adjust the fragment
offset to point to the remaining data. This ensures proper alignment
and allows us to not use any paged data for small packets. If the
payload occupies multiple pages, we simply append those pages as
fragments and free the associated skbs.

This scheme allows us to be efficient in our use of ring entries
while still supporting large packets. Benchmarking using netperf from
an external machine to a guest over a 10Gb/s network shows a 100%
improvement from ~1Gb/s to ~2Gb/s. With a local host->guest benchmark
with GSO disabled on the host side, throughput was seen to increase
from 700Mb/s to 1.7Gb/s.

Based on a patch from Herbert Xu.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (use netdev_priv)
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0276b497 16-Nov-2008 Mark McLoughlin <markmc@redhat.com>

virtio_net: hook up the set-tso ethtool op

Seems like an oversight that we have set-tx-csum and set-sg hooked
up, but not set-tso.

Also leads to the strange situation that if you e.g. disable tx-csum,
then tso doesn't get disabled.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0a888fd1 16-Nov-2008 Mark McLoughlin <markmc@redhat.com>

virtio_net: Recycle some more rx buffer pages

Each time we re-fill the recv queue with buffers, we allocate
one too many skbs and free it again when adding fails. We should
recycle the pages allocated in this case.

A previous version of this patch made trim_pages() trim trailing
unused pages from skbs with some paged data, but this actually
caused a barely measurable slowdown.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (use netdev_priv)
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8f15ea42 13-Nov-2008 Wang Chen <wangchen@cn.fujitsu.com>

netdevice: safe convert to netdev_priv() #part-3

We have some reasons to kill netdev->priv:
1. netdev->priv is equal to netdev_priv().
2. netdev_priv() wraps the calculation of netdev->priv's offset, obviously
netdev_priv() is more flexible than netdev->priv.
But we cann't kill netdev->priv, because so many drivers reference to it
directly.

This patch is a safe convert for netdev->priv to netdev_priv(netdev).
Since all of the netdev->priv is only for read.
But it is too big to be sent in one mail.
I split it to 4 parts and make every part smaller than 100,000 bytes,
which is max size allowed by vger.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e174961c 27-Oct-2008 Johannes Berg <johannes@sipsolutions.net>

net: convert print_mac to %pM

This converts pretty much everything to print_mac. There were
a few things that had conflicts which I have just dropped for
now, no harm done.

I've built an allyesconfig with this and looked at the files
that weren't built very carefully, but it's a huge patch.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fb6813f4 24-Jul-2008 Rusty Russell <rusty@rustcorp.com.au>

virtio: Recycle unused recv buffer pages for large skbs in net driver

If we hack the virtio_net driver to always allocate full-sized (64k+)
skbuffs, the driver slows down (lguest numbers):

Time to receive 1GB (small buffers): 10.85 seconds
Time to receive 1GB (64k+ buffers): 24.75 seconds

Of course, large buffers use up more space in the ring, so we increase
that from 128 to 2048:

Time to receive 1GB (64k+ buffers, 2k ring): 16.61 seconds

If we recycle pages rather than using alloc_page/free_page:

Time to receive 1GB (64k+ buffers, 2k ring, recycle pages): 10.81 seconds

This demonstrates that with efficient allocation, we don't need to
have a separate "small buffer" queue.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 97402b96 17-Apr-2008 Herbert Xu <herbert@gondor.apana.org.au>

virtio net: Allow receiving SG packets

Finally this patch lets virtio_net receive GSO packets in addition
to sending them. This can definitely be optimised for the non-GSO
case. For comparison the Xen approach stores one page in each skb
and uses subsequent skb's pages to construct an SG skb instead of
preallocating the maximum amount of pages per skb.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (added feature bits)


# a9ea3fc6 17-Apr-2008 Herbert Xu <herbert@gondor.apana.org.au>

virtio net: Add ethtool ops for SG/GSO

This patch adds some basic ethtool operations to virtio_net so
I could test SG without GSO (which was really useful because TSO
turned out to be buggy :)

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (remove MTU setting)


# 9953ca6c 26-May-2008 Mark McLoughlin <markmc@redhat.com>

virtio: fix virtio_net xmit of freed skb bug

On Mon, 2008-05-26 at 17:42 +1000, Rusty Russell wrote:
> If we fail to transmit a packet, we assume the queue is full and put
> the skb into last_xmit_skb. However, if more space frees up before we
> xmit it, we loop, and the result can be transmitting the same skb twice.
>
> Fix is simple: set skb to NULL if we've used it in some way, and check
> before sending.
...
> diff -r 564237b31993 drivers/net/virtio_net.c
> --- a/drivers/net/virtio_net.c Mon May 19 12:22:00 2008 +1000
> +++ b/drivers/net/virtio_net.c Mon May 19 12:24:58 2008 +1000
> @@ -287,21 +287,25 @@ again:
> free_old_xmit_skbs(vi);
>
> /* If we has a buffer left over from last time, send it now. */
> - if (vi->last_xmit_skb) {
> + if (unlikely(vi->last_xmit_skb)) {
> if (xmit_skb(vi, vi->last_xmit_skb) != 0) {
> /* Drop this skb: we only queue one. */
> vi->dev->stats.tx_dropped++;
> kfree_skb(skb);
> + skb = NULL;
> goto stop_queue;
> }
> vi->last_xmit_skb = NULL;

With this, may drop an skb and then later in the function discover that
we could have sent it after all. Poor wee skb :)

How about the incremental patch below?

Cheers,
Mark.

Subject: [PATCH] virtio_net: Delay dropping tx skbs

Currently we drop the skb in start_xmit() if we have a
queued buffer and fail to transmit it.

However, if we delay dropping it until we've stopped the
queue and enabled the tx notification callback, then there
is a chance space might become available for it.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 5e4fe5c4 08-Jul-2008 Mark McLoughlin <markmc@redhat.com>

virtio_net: Set VIRTIO_NET_F_GUEST_CSUM feature

We can handle receiving partial csums, so set the
appropriate feature bit.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>


# 363f1514 08-Jun-2008 Rusty Russell <rusty@rustcorp.com.au>

virtio: use callback on empty in virtio_net

virtio_net uses a timer to free old transmitted packets, rather than
leaving callbacks enabled all the time. If the host promises to
always notify us when the transmit ring is empty, we can free packets
at that point and avoid the timer.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>


# 14c998f0 08-Jun-2008 Mark McLoughlin <markmc@redhat.com>

virtio: virtio_net free transmit skbs in a timer

virtio_net currently only frees old transmit skbs just
before queueing new ones. If the queue is full, it then
enables interrupts and waits for notification that more
work has been performed.

However, a side-effect of this scheme is that there are
always xmit skbs left dangling when no new packets are
sent, against the Documentation/networking/driver.txt
guideline:

"... it is not allowed for your TX mitigation scheme
to let TX packets "hang out" in the TX ring unreclaimed
forever if no new TX packets are sent."

Add a timer to ensure that any time we queue new TX
skbs, we will shortly free them again.

This fixes an easily reproduced hang at shutdown where
iptables attempts to unload nf_conntrack and nf_conntrack
waits for an skb it is tracking to be freed, but virtio_net
never frees it.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>


# 23cde76d 08-Jun-2008 Mark McLoughlin <markmc@redhat.com>

virtio_net: Fix skb->csum_start computation

hdr->csum_start is the offset from the start of the ethernet
header to the transport layer checksum field. skb->csum_start
is the offset from skb->head.

skb_partial_csum_set() assumes that skb->data points to the
ethernet header - i.e. it computes skb->csum_start by adding
the headroom to hdr->csum_start.

Since eth_type_trans() skb_pull()s the ethernet header,
skb_partial_csum_set() should be called before
eth_type_trans().

(Without this patch, GSO packets from a guest to the world outside the
host are corrupted).

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>


# 11a3a154 26-May-2008 Rusty Russell <rusty@rustcorp.com.au>

virtio: fix delayed xmit of packet and freeing of old packets.

Because we cache the last failed-to-xmit packet, if there are no
packets queued behind that one we may never send it (reproduced here
as TCP stalls, "cured" by an outgoing ping).

Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>


# 7eb2e251 26-May-2008 Rusty Russell <rusty@rustcorp.com.au>

virtio: fix virtio_net xmit of freed skb bug

If we fail to transmit a packet, we assume the queue is full and put
the skb into last_xmit_skb. However, if more space frees up before we
xmit it, we loop, and the result can be transmitting the same skb twice.

Fix is simple: set skb to NULL if we've used it in some way, and check
before sending.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>


# 288369cc 22-May-2008 Wang Chen <wangchen@cn.fujitsu.com>

VIRTIO: Use __skb_queue_purge()

Use standard routine for queue purging.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>


# c45a6816 02-May-2008 Rusty Russell <rusty@rustcorp.com.au>

virtio: explicit advertisement of driver features

A recent proposed feature addition to the virtio block driver revealed
some flaws in the API: in particular, we assume that feature
negotiation is complete once a driver's probe function returns.

There is nothing in the API to require this, however, and even I
didn't notice when it was violated.

So instead, we require the driver to specify what features it supports
in a table, we can then move the feature negotiation into the virtio
core. The intersection of device and driver features are presented in
a new 'features' bitmap in the struct virtio_device.

Note that this highlights the difference between Linux unsigned-long
bitmaps where each unsigned long is in native endian, and a
straight-forward little-endian array of bytes.

Drivers can still remove feature bits in their probe routine if they
really have to.

API changes:
- dev->config->feature() no longer gets and acks a feature.
- drivers should advertise their features in the 'feature_table' field
- use virtio_has_feature() for extra sanity when checking feature bits

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 5539ae96 02-May-2008 Rusty Russell <rusty@rustcorp.com.au>

virtio: finer-grained features for virtio_net

So, we previously had a 'VIRTIO_NET_F_GSO' bit which meant that 'the
host can handle csum offload, and any TSO (v4&v6 incl ECN) or UFO
packets you might want to send. I thought this was good enough for
Linux, but it actually isn't, since we don't do UFO in software.

So, add separate feature bits for what the host can handle. Add
equivalent ones for the guest to say what it can handle, because LRO
is coming too (thanks Herbert!).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 99ffc696 02-May-2008 Rusty Russell <rusty@rustcorp.com.au>

virtio: wean net driver off NETDEV_TX_BUSY

Herbert tells me that returning NETDEV_TX_BUSY from hard_start_xmit is
seen as a poor thing to do; we should cache the packet and stop the queue.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>


# 05271685 02-May-2008 Rusty Russell <rusty@rustcorp.com.au>

virtio: fix scatterlist sizing in net driver.

Herbert Xu points out (within another patch) that my scatterlists are
too short: one entry for the gso header, one for the skb->data, and
MAX_SKB_FRAGS for all the fragments.

Fix both xmit and recv sides (recv currently unused, coming in later
patch).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 655aa31f 02-May-2008 Rusty Russell <rusty@rustcorp.com.au>

virtio: fix tx_ stats in virtio_net

get_buf() gives the length written by the other side, which will be
zero. We want to add the skb length.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 21f644f3 08-Apr-2008 David S. Miller <davem@davemloft.net>

[NET]: Undo code bloat in hot paths due to print_mac().

If print_mac() is used inside of a pr_debug() the compiler
can't see that the call is redundant so still performs it
even of pr_debug() ends up being a nop.

So don't use print_mac() in such cases in hot code paths,
use MAC_FMT et al. instead.

As noted by Joe Perches, pr_debug() could be modified to
handle this better, but that is a change to an interface
used by the entire kernel and thus needs to be validated
carefully. This here is thus the less risky fix for
2.6.25

Signed-off-by: David S. Miller <davem@davemloft.net>


# 6ea0a467 07-Apr-2008 Anthony Liguori <aliguori@us.ibm.com>

virtio_net: remove overzealous printk

The 'disable_cb' is really just a hint and as such, it's possible for more
work to get queued up while callbacks are disabled. Under stress with an
SMP guest, this printk triggers very frequently. There is no race here, this
is how things are designed to work so let's just remove the printk.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4265f161 14-Mar-2008 Christian Borntraeger <borntraeger@de.ibm.com>

virtio: fix race in enable_cb

There is a race in virtio_net, dealing with disabling/enabling the callback.
I saw the following oops:

kernel BUG at /space/kvm/drivers/virtio/virtio_ring.c:218!
illegal operation: 0001 [#1] SMP
Modules linked in: sunrpc dm_mod
CPU: 2 Not tainted 2.6.25-rc1zlive-host-10623-gd358142-dirty #99
Process swapper (pid: 0, task: 000000000f85a610, ksp: 000000000f873c60)
Krnl PSW : 0404300180000000 00000000002b81a6 (vring_disable_cb+0x16/0x20)
R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0 EA:3
Krnl GPRS: 0000000000000001 0000000000000001 0000000010005800 0000000000000001
000000000f3a0900 000000000f85a610 0000000000000000 0000000000000000
0000000000000000 000000000f870000 0000000000000000 0000000000001237
000000000f3a0920 000000000010ff74 00000000002846f6 000000000fa0bcd8
Krnl Code: 00000000002b819a: a7110001 tmll %r1,1
00000000002b819e: a7840004 brc 8,2b81a6
00000000002b81a2: a7f40001 brc 15,2b81a4
>00000000002b81a6: a51b0001 oill %r1,1
00000000002b81aa: 40102000 sth %r1,0(%r2)
00000000002b81ae: 07fe bcr 15,%r14
00000000002b81b0: eb7ff0380024 stmg %r7,%r15,56(%r15)
00000000002b81b6: a7f13e00 tmll %r15,15872
Call Trace:
([<000000000fa0bcd0>] 0xfa0bcd0)
[<00000000002b8350>] vring_interrupt+0x5c/0x6c
[<000000000010ab08>] do_extint+0xb8/0xf0
[<0000000000110716>] ext_no_vtime+0x16/0x1a
[<0000000000107e72>] cpu_idle+0x1c2/0x1e0

The problem can be triggered with a high amount of host->guest traffic.
I think its the following race:

poll says netif_rx_complete
poll calls enable_cb
enable_cb opens the interrupt mask
a new packet comes, an interrupt is triggered----\
enable_cb sees that there is more work |
enable_cb disables the interrupt |
. V
. interrupt is delivered
. skb_recv_done does atomic napi test, ok
some waiting disable_cb is called->check fails->bang!
.
poll would do napi check
poll would do disable_cb

The fix is to let enable_cb not disable the interrupt again, but expect the
caller to do the cleanup if it returns false. In that case, the interrupt is
only disabled, if the napi test_set_bit was successful.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (cleaned up doco)


# da74e89d 29-Feb-2008 Amit Shah <amitshah@gmx.net>

virtio: Enable netpoll interface for netconsole logging

Add a new poll_controller handler that the netpoll interface needs.

This enables netconsole logging from a kvm guest over the virtio
net interface.

Signed-off-by: Amit Shah <amitshah@gmx.net>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# d9d5dcc8 18-Feb-2008 Christian Borntraeger <borntraeger@de.ibm.com>

virtio_net: Fix oops on early interrupts - introduced by virtio reset code

Signed-off-by: Jeff Garzik <jeff@garzik.org>


# 370076d9 06-Feb-2008 Christian Borntraeger <borntraeger@de.ibm.com>

virtio net: fix oops on interface-up

I got the following oops during interface ifup. Unfortunately its not
easily reproducable so I cant say for sure that my fix fixes this
problem, but I am confident and I think its correct anyway:

<2>kernel BUG at /space/kvm/drivers/virtio/virtio_ring.c:234!
<4>illegal operation: 0001 [#1] PREEMPT SMP
<4>Modules linked in:
<4>CPU: 0 Not tainted 2.6.24zlive-guest-07293-gf1ca151-dirty #91
<4>Process swapper (pid: 0, task: 0000000000800938, ksp: 000000000084ddb8)
<4>Krnl PSW : 0404300180000000 0000000000466374 (vring_disable_cb+0x30/0x34)
<4> R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0 EA:3
<4>Krnl GPRS: 0000000000000001 0000000000000001 0000000010003800 0000000000466344
<4> 000000000e980900 00000000008848b0 000000000084e748 0000000000000000
<4> 000000000087b300 0000000000001237 0000000000001237 000000000f85bdd8
<4> 000000000e980920 00000000001137c0 0000000000464754 000000000f85bdd8
<4>Krnl Code: 0000000000466368: e3b0b0700004 lg %r11,112(%r11)
<4> 000000000046636e: 07fe bcr 15,%r14
<4> 0000000000466370: a7f40001 brc 15,466372
<4> >0000000000466374: a7f4fff6 brc 15,466360
<4> 0000000000466378: eb7ff0500024 stmg %r7,%r15,80(%r15)
<4> 000000000046637e: a7f13e00 tmll %r15,15872
<4> 0000000000466382: b90400ef lgr %r14,%r15
<4> 0000000000466386: a7840001 brc 8,466388
<4>Call Trace:
<4>([<000201500f85c000>] 0x201500f85c000)
<4> [<0000000000466556>] vring_interrupt+0x72/0x88
<4> [<00000000004801a0>] kvm_extint_handler+0x34/0x44
<4> [<000000000010d22c>] do_extint+0xbc/0xf8
<4> [<0000000000113f98>] ext_no_vtime+0x16/0x1a
<4> [<000000000010a182>] cpu_idle+0x216/0x238
<4>([<000000000010a162>] cpu_idle+0x1f6/0x238)
<4> [<0000000000568656>] rest_init+0xaa/0xb8
<4> [<000000000084ee2c>] start_kernel+0x3fc/0x490
<4> [<0000000000100020>] _stext+0x20/0x80
<4>
<4> <0>Kernel panic - not syncing: Fatal exception in interrupt
<4>

After looking at the code and the dump I think the following scenario
happened: Ifup was running on cpu2 and the interrupt arrived on cpu0.
Now virtnet_open on cpu 2 managed to execute napi_enable and disable_cb
but did not execute rx_schedule. Meanwhile on cpu 0 skb_recv_done was
called by vring_interrupt, executed netif_rx_schedule_prep, which
succeeded and therefore called disable_cb. This triggered the BUG_ON,
as interrupts were already disabled by cpu 2.

I think the proper solution is to make the call to disable_cb depend on
the atomic update of NAPI_STATE_SCHED by using netif_rx_schedule_prep
in the same way as skb_recv_done.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeff Garzik <jeff@garzik.org>


# 6c0cd7c0 16-Dec-2007 Dor Laor <dor.laor@qumranet.com>

virtio_net: parametrize the napi_weight for virtio receive queue.

It is done in order to improve performance.

Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 2cb9c6ba 04-Feb-2008 Rusty Russell <rusty@rustcorp.com.au>

virtio: free transmit skbs when notified, not on next xmit.

This fixes a potential dangling xmit problem.

We also suppress refill interrupts until we need them.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# a48bd8f6 04-Feb-2008 Rusty Russell <rusty@rustcorp.com.au>

virtio: flush buffers on open

Fix bug found by Christian Borntraeger: if the other side fills all
the registered network buffers before we enable NAPI, we will never
get an interrupt. The simplest fix is to process the input queue once
on open.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# e70f2f1b 06-Dec-2007 Christian Borntraeger <borntraeger@de.ibm.com>

virtnet: remove double ether_setup

Hello Rusty,

virtnet_probe already calls alloc_etherdev, which calls ether_setup.
There is no need to do that again.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 6e5aa7ef 04-Feb-2008 Rusty Russell <rusty@rustcorp.com.au>

virtio: reset function

A reset function solves three problems:

1) It allows us to renegotiate features, eg. if we want to upgrade a
guest driver without rebooting the guest.

2) It gives us a clean way of shutting down virtqueues: after a reset,
we know that the buffers won't be used by the host, and

3) It helps the guest recover from messed-up drivers.

So we remove the ->shutdown hook, and the only way we now remove
feature bits is via reset.

We leave it to the driver to do the reset before it deletes queues:
the balloon driver, for example, needs to chat to the host in its
remove function.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# b3369c1f 04-Feb-2008 Rusty Russell <rusty@rustcorp.com.au>

virtio: populate network rings in the probe routine, not open

Since we want to reset the device to remove them, this is simpler
(device is reset for us on driver remove).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 34a48579 04-Feb-2008 Rusty Russell <rusty@rustcorp.com.au>

virtio: Tweak virtio_net defines

1) Turn GSO on virtio net into an all-or-nothing (keep checksumming
separate). Having multiple bits is a pain: if you can't support something
you should handle it in software, which is still a performance win.

2) Make VIRTIO_NET_HDR_GSO_ECN a flag in the header, so it can apply to
IPv6 or v4.

3) Rename VIRTIO_NET_F_NO_CSUM to VIRTIO_NET_F_CSUM (ie. means we do
checksumming).

4) Add csum and gso params to virtio_net to allow more testing.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 50c8ea80 04-Feb-2008 Rusty Russell <rusty@rustcorp.com.au>

virtio: Net header needs hdr_len

It's far easier to deal with packets if we don't have to parse the
packet to figure out the header length to know how much to pull into
the skb data. Add the field to the virtio_net_hdr struct (and fix the
spaces that somehow crept in there).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 18445c4d 04-Feb-2008 Rusty Russell <rusty@rustcorp.com.au>

virtio: explicit enable_cb/disable_cb rather than callback return.

It seems that virtio_net wants to disable callbacks (interrupts) before
calling netif_rx_schedule(), so we can't use the return value to do so.

Rename "restart" to "cb_enable" and introduce "cb_disable" hook: callback
now returns void, rather than a boolean.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# a586d4f6 04-Feb-2008 Rusty Russell <rusty@rustcorp.com.au>

virtio: simplify config mechanism.

Previously we used a type/len pair within the config space, but this
seems overkill. We now simply define a structure which represents the
layout in the config space: the config space can now only be extended
at the end.

The main driver-visible changes:
1) We indicate what fields are present with an explicit feature bit.
2) Virtqueues are explicitly numbered, and not in the config space.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# f35d9d8a 04-Feb-2008 Rusty Russell <rusty@rustcorp.com.au>

virtio: Implement skb_partial_csum_set, for setting partial csums on untrusted packets.

Use it in virtio_net (replacing buggy version there), it's also going
to be used by TAP for partial csum support.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: David S. Miller <davem@davemloft.net>


# 8329d98e 19-Nov-2007 Rusty Russell <rusty@rustcorp.com.au>

virtio: fix net driver loop case where we fail to restart

skb is only NULL the first time around: it's more correct to test for
being under-budget.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 74b2553f 19-Nov-2007 Rusty Russell <rusty@rustcorp.com.au>

virtio: fix module/device unloading

The virtio code never hooked through the ->remove callback. Although
noone supports device removal at the moment, this code is already
needed for module unloading.

This of course also revealed bugs in virtio_blk, virtio_net and lguest
unloading paths.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 4d125de3 06-Nov-2007 Rusty Russell <rusty@rustcorp.com.au>

virtio: more fallout from scatterlist changes.

This fixes OOPS in network driver when CONFIG_DEBUG_SG=y.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 296f96fc 21-Oct-2007 Rusty Russell <rusty@rustcorp.com.au>

Net driver using virtio

The network driver uses two virtqueues: one for input packets and one
for output packets. This has nice locking properties (ie. we don't do
any for recv vs send).

TODO:
1) Big packets.
2) Multi-client devices (maybe separate driver?).
3) Resolve freeing of old xmit skbs (Christian Borntraeger)

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: netdev@vger.kernel.org