#
76f40853 |
|
11-Mar-2024 |
Xianting Tian <xianting.tian@linux.alibaba.com> |
vhost: correct misleading printing information The guest moved the avail index, not the used index, so the message logged when '(vq->avail_idx - last_avail_idx) > vq->num' should say "avail" rather than "used". Fix it. Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com> Message-Id: <20240311082109.46773-1-xianting.tian@linux.alibaba.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
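A minimal sketch of the corrected check, assuming the usual vhost_get_vq_desc() context in drivers/vhost/vhost.c; this is illustrative, not the verbatim hunk:

```c
/* The sanity check compares the guest-visible avail index against the
 * last index vhost consumed. If the distance exceeds the ring size, it
 * is the avail index the guest moved illegally, so say so.
 */
if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
    vq_err(vq, "Guest moved avail index from %u to %u",
           last_avail_idx, vq->avail_idx);
    return -EFAULT;
}
```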
|
#
df9ace76 |
|
27-Mar-2024 |
Gavin Shan <gshan@redhat.com> |
vhost: Add smp_rmb() in vhost_enable_notify() A smp_rmb() has been missed in vhost_enable_notify(), inspired by Will. Without it, there is no guarantee that the available ring entries pushed by the guest are observed by vhost in time, leading to stale available ring entries being fetched by vhost in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's grace-hopper (ARM64) platform.

/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
  -accel kvm -machine virt,gic-version=host -cpu host \
  -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
  -m 4096M,slots=16,maxmem=64G \
  -object memory-backend-ram,id=mem0,size=4096M \
  : \
  -netdev tap,id=vnet0,vhost=true \
  -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 :

guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
virtio_net virtio0: output.0:id 100 is not a head!

Add the missed smp_rmb() in vhost_enable_notify(). When it returns true, there are still pending tx buffers. Since the function reads the indices itself, the caller can bypass the smp_rmb() in vhost_get_vq_desc(). Note that this path was safe before vq->avail_idx handling was changed by commit d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()"). Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()") Cc: <stable@kernel.org> # v5.18+ Reported-by: Yihuang Yu <yihyu@redhat.com> Suggested-by: Will Deacon <will@kernel.org> Signed-off-by: Gavin Shan <gshan@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Message-Id: <20240328002149.1141302-3-gshan@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
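A hedged sketch of the fix: after re-reading the guest's avail index, a read barrier must order that read before any later reads of the ring entries done without going through vhost_get_vq_desc(). Names follow drivers/vhost/vhost.c, but treat this as the shape of the change, not the exact hunk:

```c
/* Tail of vhost_enable_notify(), after avail_idx has just been
 * re-read from the guest (sketch).
 */
vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
if (vq->avail_idx != vq->last_avail_idx) {
    /* Pairs with the guest's smp_wmb(): callers seeing "true" may
     * read available ring entries directly, so make sure those
     * reads happen after the avail_idx read.
     */
    smp_rmb();
    return true;
}
return false;
```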
|
#
22e1992c |
|
27-Mar-2024 |
Gavin Shan <gshan@redhat.com> |
vhost: Add smp_rmb() in vhost_vq_avail_empty() A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by Will. Without it, there is no guarantee that the available ring entries pushed by the guest are observed by vhost in time, leading to stale available ring entries being fetched by vhost in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's grace-hopper (ARM64) platform.

/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
  -accel kvm -machine virt,gic-version=host -cpu host \
  -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
  -m 4096M,slots=16,maxmem=64G \
  -object memory-backend-ram,id=mem0,size=4096M \
  : \
  -netdev tap,id=vnet0,vhost=true \
  -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 :

guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
virtio_net virtio0: output.0:id 100 is not a head!

Add the missed smp_rmb() in vhost_vq_avail_empty(). When tx_can_batch() returns true, there are still pending tx buffers. Since it might read the indices, it can still bypass the smp_rmb() in vhost_get_vq_desc(). Note that this path was safe before vq->avail_idx handling was changed by commit 275bf960ac69 ("vhost: better detection of available buffers"). Fixes: 275bf960ac69 ("vhost: better detection of available buffers") Cc: <stable@kernel.org> # v4.11+ Reported-by: Yihuang Yu <yihyu@redhat.com> Suggested-by: Will Deacon <will@kernel.org> Signed-off-by: Gavin Shan <gshan@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Message-Id: <20240328002149.1141302-2-gshan@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
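The same ordering pattern applied to the empty check; a sketch of the fixed function under the same naming assumptions as the entry above:

```c
bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
{
    __virtio16 avail_idx;

    if (vq->avail_idx != vq->last_avail_idx)
        return false;

    if (unlikely(vhost_get_avail_idx(vq, &avail_idx)))
        return false;

    vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
    if (vq->avail_idx != vq->last_avail_idx) {
        /* Order the avail_idx read before later reads of ring
         * entries that may bypass vhost_get_vq_desc().
         */
        smp_rmb();
        return false;
    }
    return true;
}
```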
|
#
3652117f |
|
22-Nov-2023 |
Christian Brauner <brauner@kernel.org> |
eventfd: simplify eventfd_signal() Ever since the eventfd type was introduced back in 2007 in commit e1ad7468c77d ("signal/timer/event: eventfd core"), callers of eventfd_signal() have only ever passed 1 as the value for @n. There's no point in keeping that additional argument. Link: https://lore.kernel.org/r/20231122-vfs-eventfd-signal-v2-2-bd549b14ce0c@kernel.org Acked-by: Xu Yilun <yilun.xu@intel.com> Acked-by: Andrew Donnellan <ajd@linux.ibm.com> # ocxl Acked-by: Eric Farman <farman@linux.ibm.com> # s390 Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
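What the simplification looks like at a call site (the counter is still incremented by one; only the redundant argument goes away):

```c
/* Before: every caller passed n == 1. */
eventfd_signal(ctx, 1);

/* After: */
eventfd_signal(ctx);
```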
|
#
ca50ec37 |
|
27-Sep-2023 |
Eric Auger <eric.auger@redhat.com> |
vhost: Allow null msg.size on VHOST_IOTLB_INVALIDATE Commit e2ae38cf3d91 ("vhost: fix hung thread due to erroneous iotlb entries") forbade vhost iotlb msgs with null size, to prevent entries with size = start = 0 and last = ULONG_MAX from ending up in the iotlb. Then commit 95932ab2ea07 ("vhost: allow batching hint without size") only applied the check for the VHOST_IOTLB_UPDATE and VHOST_IOTLB_INVALIDATE message types, to fix a regression observed with the batching hint. Still, the introduction of that check caused a regression for some users attempting to invalidate the whole ULONG_MAX range by setting the size to 0. This is the case with the qemu/smmuv3/vhost integration, which does not work anymore. It looks safe to partially revert the original commit and allow VHOST_IOTLB_INVALIDATE messages with null size. vhost_iotlb_del_range() will compute a correct end iova. Same for vhost_vdpa_iotlb_unmap(). Signed-off-by: Eric Auger <eric.auger@redhat.com> Fixes: e2ae38cf3d91 ("vhost: fix hung thread due to erroneous iotlb entries") Cc: stable@vger.kernel.org # v5.17+ Acked-by: Jason Wang <jasowang@redhat.com> Message-Id: <20230927140544.205088-1-eric.auger@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
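A sketch of the relaxed validation, using field spellings from the v2 IOTLB message (illustrative, not the verbatim hunk):

```c
/* Only UPDATE keeps the non-zero-size requirement. An INVALIDATE
 * with size == 0 again means "the whole range";
 * vhost_iotlb_del_range() computes the end iova itself.
 */
if (msg.iotlb.type == VHOST_IOTLB_UPDATE && msg.iotlb.size == 0) {
    ret = -EINVAL;
    goto done;
}
```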
|
#
228a27cf |
|
26-Jun-2023 |
Mike Christie <michael.christie@oracle.com> |
vhost: Allow worker switching while work is queueing This patch drops the requirement that we can only switch workers if work has not been queued by using RCU for the vq based queueing paths and a mutex for the device wide flush. We can also use this to support SIGKILL properly in the future where we should exit almost immediately after getting that signal. With this patch, when get_signal returns true, we can set the vq->worker to NULL and do a synchronize_rcu to prevent new work from being queued to the vhost_task that has been killed. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20230626232307.97930-18-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
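The RCU shape this describes, roughly as it appears in later kernels; names such as vq->worker and vhost_worker_queue() follow the source, but treat the snippet as a sketch:

```c
static bool vhost_vq_work_queue(struct vhost_virtqueue *vq,
                                struct vhost_work *work)
{
    struct vhost_worker *worker;
    bool queued = false;

    /* Readers dereference the worker under RCU... */
    rcu_read_lock();
    worker = rcu_dereference(vq->worker);
    if (worker) {
        queued = true;
        vhost_worker_queue(worker, work);
    }
    rcu_read_unlock();
    return queued;
}

/* ...so the switching side can publish a new worker, then wait out
 * in-flight queuers before touching the old one:
 */
rcu_assign_pointer(vq->worker, new_worker);
synchronize_rcu();
```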
|
#
c1ecd8e9 |
|
26-Jun-2023 |
Mike Christie <michael.christie@oracle.com> |
vhost: allow userspace to create workers For vhost-scsi with 3 vqs or more and a workload that tries to use them in parallel like:

fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k \
    --ioengine=libaio --iodepth=128 --numjobs=3

the single vhost worker thread becomes a bottleneck and we are stuck at around 500K IOPS no matter how many jobs, virtqueues, and CPUs are used. To better utilize virtqueues and available CPUs, this patch allows userspace to create workers and bind them to vqs. You can have N workers per dev and also share N workers with M vqs on that dev. This patch adds the interface related code and the next patch will hook vhost-scsi into it. The patches do not try to hook net and vsock into the interface because: 1. multiple workers don't seem to help vsock. The problem is that with only 2 virtqueues we never fully use the existing worker when doing bidirectional tests. This seems to match vhost-scsi where we don't see the worker as a bottleneck until 3 virtqueues are used. 2. net already has a way to use multiple workers. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20230626232307.97930-16-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
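A userspace sketch of the resulting interface; the ioctl and struct names below are taken from the UAPI eventually merged (linux/vhost.h, linux/vhost_types.h), but treat the snippet as illustrative rather than a definitive usage guide:

```c
/* Create a worker, then bind virtqueue 2 to it. */
struct vhost_worker_state w = {};
struct vhost_vring_worker vw = {};

if (ioctl(vhost_fd, VHOST_NEW_WORKER, &w) < 0)  /* fills w.worker_id */
    err(1, "VHOST_NEW_WORKER");

vw.index = 2;                 /* virtqueue index */
vw.worker_id = w.worker_id;   /* worker that should service it */
if (ioctl(vhost_fd, VHOST_ATTACH_VRING_WORKER, &vw) < 0)
    err(1, "VHOST_ATTACH_VRING_WORKER");
```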
|
#
1cdaafa1 |
|
26-Jun-2023 |
Mike Christie <michael.christie@oracle.com> |
vhost: replace single worker pointer with xarray The next patch allows userspace to create multiple workers per device, so this patch replaces the vhost_worker pointer with an xarray so we can store multiple workers and look them up. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20230626232307.97930-15-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
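A sketch of the xarray-backed storage and lookup, assuming a dev->worker_xa field (names illustrative):

```c
struct vhost_worker *worker;
u32 id;
int ret;

/* Store: allocate a free id for the new worker. */
ret = xa_alloc(&dev->worker_xa, &id, worker, xa_limit_32b, GFP_KERNEL);
if (ret < 0)
    return ret;
worker->id = id;

/* Lookup on the ioctl path: */
worker = xa_load(&dev->worker_xa, id);
if (!worker)
    return -ENODEV;
```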
|
#
cef25866 |
|
26-Jun-2023 |
Mike Christie <michael.christie@oracle.com> |
vhost: add helper to parse userspace vring state/file The next patches add new vhost worker ioctls which will need to get a vhost_virtqueue from a userspace struct which specifies the vq's index. This moves that code from vhost_vring_ioctl() into a helper so it can be shared. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20230626232307.97930-14-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
27eca189 |
|
26-Jun-2023 |
Mike Christie <michael.christie@oracle.com> |
vhost: remove vhost_work_queue vhost_work_queue is no longer used. Each driver is using the poll or vq based queueing, so remove vhost_work_queue. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20230626232307.97930-13-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
493b94bf |
|
26-Jun-2023 |
Mike Christie <michael.christie@oracle.com> |
vhost: convert poll work to be vq based This has the drivers pass in their poll to vq mapping and then converts the core poll code to use the vq based helpers. In the next patches we will allow vqs to be handled by different workers, so to allow drivers to execute operations like queue, stop, flush, etc on specific polls/vqs we need to know the mappings. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20230626232307.97930-8-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
a6fc0473 |
|
26-Jun-2023 |
Mike Christie <michael.christie@oracle.com> |
vhost: take worker or vq for flushing This patch has the core work flush function take a worker. When we support multiple workers we can then flush each worker during device removal, stoppage, etc. It also adds a helper to flush specific virtqueues, so vhost-scsi can flush IO vqs from its ctl vq. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20230626232307.97930-7-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
0921dddc |
|
26-Jun-2023 |
Mike Christie <michael.christie@oracle.com> |
vhost: take worker or vq instead of dev for queueing This patch has the core work queueing function take a worker for when we support multiple workers. It also adds a helper that takes a vq during queueing so modules can control which vq/worker to queue work on. This temporarily leaves vhost_work_queue in place. It will be removed when the drivers are converted in the next patches. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20230626232307.97930-6-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
9784df15 |
|
26-Jun-2023 |
Mike Christie <michael.christie@oracle.com> |
vhost, vhost_net: add helper to check if vq has work In the next patches each vq might have different workers so one could have work but others do not. For net, we only want to check specific vqs, so this adds a helper to check if a vq has work pending and converts vhost-net to use it. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Jason Wang <jasowang@redhat.com> Message-Id: <20230626232307.97930-5-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
737bdb64 |
|
26-Jun-2023 |
Mike Christie <michael.christie@oracle.com> |
vhost: add vhost_worker pointer to vhost_virtqueue This patchset allows userspace to map vqs to different workers. This patch adds a worker pointer to the vq so in later patches in this set we can queue/flush specific vqs and their workers. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20230626232307.97930-4-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
c011bb66 |
|
26-Jun-2023 |
Mike Christie <michael.christie@oracle.com> |
vhost: dynamically allocate vhost_worker This patchset allows us to allocate multiple workers, so this has us move from the vhost_worker that's embedded in the vhost_dev to dynamically allocating it. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20230626232307.97930-3-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
3e11c6eb |
|
26-Jun-2023 |
Mike Christie <michael.christie@oracle.com> |
vhost: create worker at end of vhost_dev_set_owner vsock can start queueing work after VHOST_VSOCK_SET_GUEST_CID, so after we have called vhost_worker_create it can call vhost_work_queue and try to access the vhost worker/task. If vhost_dev_alloc_iovecs fails, then vhost_worker_free could free the worker/task from under vsock. This moves vhost_worker_create to the end of vhost_dev_set_owner, where we know we can no longer fail in that path. If it fails after VHOST_SET_OWNER and userspace closes the device, then the normal vsock release handling will do the right thing. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20230626232307.97930-2-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
55d8122f |
|
24-Apr-2023 |
Shannon Nelson <shannon.nelson@amd.com> |
vhost: support PACKED when setting-getting vring_base Use the right structs for PACKED or split vqs when setting and getting the vring base. Fixes: 4c8cf31885f6 ("vhost: introduce vDPA-based backend") Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Message-Id: <20230424225031.18947-3-shannon.nelson@amd.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
|
#
4b13cbef |
|
07-Jun-2023 |
Mike Christie <michael.christie@oracle.com> |
vhost: Fix worker hangs due to missed wake up calls We can race where we have added work to the work_list, but vhost_task_fn has passed that check without yet being set to TASK_INTERRUPTIBLE. wake_up_process will see us in TASK_RUNNING and just return. This bug was introduced in commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression") when I moved the setting of TASK_INTERRUPTIBLE to simplify the code and avoid get_signal from logging warnings about being in the wrong state. This moves the setting of TASK_INTERRUPTIBLE back to before we test if we need to stop the task, to avoid a possible race there as well. We then have vhost_worker set TASK_RUNNING if it finds work, similar to before. Fixes: f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression") Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20230607192338.6041-3-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
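The classic pattern the fix restores: publish the sleeping state before checking the condition, so a concurrent queue-plus-wake cannot slip between the check and the sleep. The helper names below (vhost_task_should_stop() and friends) are hypothetical placeholders for the sketch:

```c
for (;;) {
    /* Set INTERRUPTIBLE *before* checking for work: a concurrent
     * vhost_work_queue() + wake_up_process() then either sees us
     * RUNNING (we will notice the work below) or wakes us up.
     */
    set_current_state(TASK_INTERRUPTIBLE);

    if (vhost_task_should_stop(vtsk)) {        /* hypothetical */
        __set_current_state(TASK_RUNNING);
        break;
    }

    if (!vhost_worker_has_work(worker)) {      /* hypothetical */
        schedule();
        continue;
    }

    __set_current_state(TASK_RUNNING);
    vhost_worker_run(worker);                  /* hypothetical */
}
```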
|
#
a284f09e |
|
07-Jun-2023 |
Mike Christie <michael.christie@oracle.com> |
vhost: Fix crash during early vhost_transport_send_pkt calls If userspace does VHOST_VSOCK_SET_GUEST_CID before VHOST_SET_OWNER we can race where: 1. thread0 calls vhost_transport_send_pkt -> vhost_work_queue 2. thread1 does VHOST_SET_OWNER which calls vhost_worker_create. 3. vhost_worker_create will set the dev->worker pointer before setting the worker->vtsk pointer. 4. thread0's vhost_work_queue will see the dev->worker pointer is set and try to call vhost_task_wake using the not-yet-set worker->vtsk pointer. 5. We then crash since vtsk is NULL. Before commit 6e890c5d5021 ("vhost: use vhost_tasks for worker threads"), we only had the worker pointer so we could just check it to see if VHOST_SET_OWNER has been done. After that commit we have the vhost_worker and vhost_task pointers, so we can now hit the bug above. This patch embeds the vhost_worker in the vhost_dev and moves the work list initialization back to vhost_dev_init, so we can just check the worker.vtsk pointer to check if VHOST_SET_OWNER has been done like before. Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads") Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20230607192338.6041-2-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reported-by: syzbot+d0d442c22fa8db45ff0e@syzkaller.appspotmail.com Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
|
#
4d8df0f5 |
|
22-May-2023 |
Prathu Baronia <prathubaronia2011@gmail.com> |
vhost: use kzalloc() instead of kmalloc() followed by memset() Use kzalloc() to allocate new zeroed out msg node instead of memsetting a node allocated with kmalloc(). Signed-off-by: Prathu Baronia <prathubaronia2011@gmail.com> Message-Id: <20230522085019.42914-1-prathubaronia2011@gmail.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
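A sketch of the change in vhost_new_msg() (illustrative):

```c
/* Before: allocate, then zero by hand. */
struct vhost_msg_node *node = kmalloc(sizeof(*node), GFP_KERNEL);
if (!node)
    return NULL;
memset(&node->msg, 0, sizeof(node->msg));

/* After: one zeroed allocation, no memset() needed. */
struct vhost_msg_node *node = kzalloc(sizeof(*node), GFP_KERNEL);
if (!node)
    return NULL;
```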
|
#
f9010dbd |
|
01-Jun-2023 |
Mike Christie <michael.christie@oracle.com> |
fork, vhost: Use CLONE_THREAD to fix freezer/ps regression When switching from kthreads to vhost_tasks two bugs were added: 1. The vhost worker tasks now show up as processes, so scripts doing ps or ps a would incorrectly detect the vhost task as another process. 2. kthreads disabled freezing by setting PF_NOFREEZE, but vhost tasks didn't disable or add support for it. To fix both bugs, this switches the vhost task to be a thread in the process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that SIGKILL/STOP support is required because CLONE_THREAD requires CLONE_SIGHAND, which requires those 2 signals to be supported. This is a modified version of the patch written by Mike Christie <michael.christie@oracle.com>, which was a modified version of a patch originally written by Linus. Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER, including ignoring signals, setting up the register state, and having get_signal return instead of calling do_group_exit. Tidied up the vhost_task abstraction so that the definition of vhost_task only needs to be visible inside of vhost_task.c, making it easier to review the code and tell what needs to be done where. As part of this the main loop has been moved from vhost_worker into vhost_task_fn. vhost_worker now returns true if work was done. The main loop has been updated to call get_signal, which handles SIGSTOP, freezing, and collects the message that tells the thread to exit as part of process exit. This collection clears __fatal_signal_pending. This collection is not guaranteed to clear signal_pending(), so clear that explicitly so the schedule() sleeps. For now the vhost thread continues to exist and run work until the last file descriptor is closed and the release function is called as part of freeing struct file, to avoid hangs in the coredump rendezvous and when killing threads in a multi-threaded exec. The coredump code and de_thread have been modified to ignore vhost threads. Removing the special case for exec appears to require teaching vhost_dev_flush how to directly complete transactions in case the vhost thread is no longer running. Removing the special case for coredump rendezvous requires either the above fix needed for exec or moving the coredump rendezvous into get_signal. Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads") Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Co-developed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e4be66e5 |
|
27-Feb-2023 |
Jacob Keller <jacob.e.keller@intel.com> |
vhost: use struct_size and size_add to compute flex array sizes The vhost_get_avail_size and vhost_get_used_size functions compute the size of structures with flexible array members with an additional 2 bytes if the VIRTIO_RING_F_EVENT_IDX feature flag is set. Convert these functions to use struct_size() and size_add() instead of coding the calculation by hand. This ensures that the calculations will saturate at SIZE_MAX rather than overflowing. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: virtualization@lists.linux-foundation.org Cc: kvm@vger.kernel.org Message-Id: <20230227214127.3678392-1-jacob.e.keller@intel.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
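A sketch of one converted helper, following the commit description (the event-idx feature appends a trailing __virtio16, and the helpers saturate at SIZE_MAX rather than wrapping):

```c
static size_t vhost_get_used_size(struct vhost_virtqueue *vq, int num)
{
    size_t event = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ?
                   2 : 0;   /* trailing event index, a __virtio16 */

    return size_add(struct_size(vq->used, ring, num), event);
}
```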
|
#
05bfb338 |
|
24-Feb-2023 |
Josh Poimboeuf <jpoimboe@kernel.org> |
vhost: Fix livepatch timeouts in vhost_worker() Livepatch timeouts were reported due to busy vhost_worker() kthreads. Now that cond_resched() can do livepatch task switching, use cond_resched() in vhost_worker(). That's the better way to conditionally call schedule() anyway. Reported-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Link: https://lore.kernel.org/r/509f6ea6fe6505f0a75a66026ba531c765ef922f.1677257135.git.jpoimboe@kernel.org
|
#
ff61f079 |
|
14-Mar-2023 |
Jonathan Corbet <corbet@lwn.net> |
docs: move x86 documentation into Documentation/arch/ Move the x86 documentation under Documentation/arch/ as a way of cleaning up the top-level directory and making the structure of our docs more closely match the structure of the source directories it describes. All in-kernel references to the old paths have been updated. Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20230315211523.108836-1-corbet@lwn.net/ Signed-off-by: Jonathan Corbet <corbet@lwn.net>
|
#
6e890c5d |
|
10-Mar-2023 |
Mike Christie <michael.christie@oracle.com> |
vhost: use vhost_tasks for worker threads For vhost workers we use the kthread API, which inherits its values from and checks against the kthreadd thread. This results in the wrong RLIMITs being checked, so while tools like libvirt try to control the number of threads based on the nproc rlimit setting, we can end up creating more threads than the user wanted. This patch has us use the vhost_task helpers, which will inherit their values/checks from the thread that owns the device, similar to if we did a clone in userspace. The vhost threads will now be counted in the nproc rlimits. And we get features like cgroups and mm sharing automatically, so we can remove those calls. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
1a5f8090 |
|
10-Mar-2023 |
Mike Christie <michael.christie@oracle.com> |
vhost: move worker thread fields to new struct This is just a prep patch. It moves the worker related fields to a new vhost_worker struct and moves the code around to create some helpers that will be used in the next patch. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
759aba1e |
|
09-Jan-2023 |
Liming Wu <liming.wu@jaguarmicro.com> |
vhost: remove unused parameter "enabled" is defined in vhost_init_device_iotlb, but it is never used. Let's remove it. Signed-off-by: Liming Wu <liming.wu@jaguarmicro.com> Message-Id: <20230110024445.303-1-liming.wu@jaguarmicro.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
|
#
9526f9a2 |
|
17-Jan-2023 |
Eric Auger <eric.auger@redhat.com> |
vhost/net: Clear the pending messages when the backend is removed When the vhost iotlb is used along with a guest virtual iommu and the guest gets rebooted, some MISS messages may have been recorded just before the reboot and spuriously executed by the virtual iommu after the reboot. As vhost does not have any explicit reset user API, VHOST_NET_SET_BACKEND looks like a reasonable point at which to clear the pending messages when the backend is removed. Export vhost_clear_msg() and call it in vhost_net_set_backend() when fd == -1. Signed-off-by: Eric Auger <eric.auger@redhat.com> Suggested-by: Jason Wang <jasowang@redhat.com> Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API") Message-Id: <20230117151518.44725-3-eric.auger@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
98047313 |
|
09-Nov-2022 |
Stefano Garzarella <sgarzare@redhat.com> |
vhost: fix range used in translate_desc() vhost_iotlb_itree_first() requires `start` and `last` parameters to search for a mapping that overlaps the range. In translate_desc() we cyclically call vhost_iotlb_itree_first(), incrementing `addr` by the amount already translated, so rightly we move the `start` parameter passed to vhost_iotlb_itree_first(), but we should hold the `last` parameter constant. Let's fix it by saving the `last` parameter value before incrementing `addr` in the loop. Fixes: a9709d6874d5 ("vhost: convert pre sorted vhost memory array to interval tree") Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Message-Id: <20221109102503.18816-3-sgarzare@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
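A sketch of the fixed loop in translate_desc(), matching the description (variable names follow the message; chunk bookkeeping elided):

```c
u64 s = 0;
u64 last = addr + len - 1;   /* save the range end once, up front */

while ((u64)len > s) {
    u64 size;
    const struct vhost_iotlb_map *map;

    map = vhost_iotlb_itree_first(umem, addr, last);
    /* ... validate map, compute this chunk's size, fill the iovec ... */
    s += size;
    addr += size;            /* advances; `last` stays constant */
}
```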
|
#
de4eda9d |
|
15-Sep-2022 |
Al Viro <viro@zeniv.linux.org.uk> |
use less confusing names for iov_iter direction initializers READ/WRITE proved to be actively confusing - the meanings are "data destination, as used with read(2)" and "data source, as used with write(2)", but people keep interpreting those as "we read data from it" and "we write data to it", i.e. exactly the wrong way. Call them ITER_DEST and ITER_SOURCE - at least that is harder to misinterpret... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
e3bf3df8 |
|
15-Sep-2022 |
Al Viro <viro@zeniv.linux.org.uk> |
[vhost] fix 'direction' argument of iov_iter_{init,bvec}() READ means "data destination", WRITE - "data source". Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b2ffa407 |
|
17-May-2022 |
Mike Christie <michael.christie@oracle.com> |
vhost: rename vhost_work_dev_flush This patch renames vhost_work_dev_flush to just vhost_dev_flush to reflect that it flushes everything on the device and that drivers don't know/care that polls are based on vhost_works. Drivers just flush the entire device, and the polls and works (for vhost-scsi management TMFs, IO and net virtqueues, etc.) are all flushed. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Message-Id: <20220517180850.198915-9-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
6ca84326 |
|
17-May-2022 |
Mike Christie <michael.christie@oracle.com> |
vhost: flush dev once during vhost_dev_stop When vhost_work_dev_flush returns all work queued at that time will have completed. There is then no need to flush after every vhost_poll_stop call, and we can move the flush call to after the loop that stops the pollers. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Message-Id: <20220517180850.198915-3-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
6fcf224c |
|
17-May-2022 |
Andrey Ryabinin <arbn@yandex-team.com> |
vhost: get rid of vhost_poll_flush() wrapper vhost_poll_flush() is a simple wrapper around vhost_work_dev_flush(). It gives the wrong impression that we are doing some work over vhost_poll, while in fact it flushes vhost_poll->dev. It only complicates understanding of the code and leads to mistakes like flushing the same vhost_dev several times in a row. Just remove vhost_poll_flush() and call vhost_work_dev_flush() directly. Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com> [merge vhost_poll_flush removal from Stefano Garzarella] Signed-off-by: Mike Christie <michael.christie@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Message-Id: <20220517180850.198915-2-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
aaca8373 |
|
30-Mar-2022 |
Gautam Dawar <gautam.dawar@xilinx.com> |
vhost-vdpa: support ASID based IOTLB API This patch extends vhost-vdpa to support the ASID based IOTLB API. The vhost-vdpa device will allocate multiple IOTLBs for vDPA devices that support multiple address spaces. The IOTLBs and vDPA device memory mappings are determined and maintained through the ASID. Note that we still don't support vDPA devices with more than one address space that depend on the platform IOMMU. This work will be done by moving the IOMMU logic from vhost-vDPA to the vDPA device driver. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Gautam Dawar <gdawar@xilinx.com> Message-Id: <20220330180436.24644-16-gdawar@xilinx.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Includes fixup: vhost-vdpa: Fix some error handling path in vhost_vdpa_process_iotlb_msg() In the error paths introduced by the original patch, a mutex may be left locked. Add the correct goto instead of a direct return. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Message-Id: <89ef0ae4c26ac3cfa440c71e97e392dcb328ac1b.1653227924.git.christophe.jaillet@wanadoo.fr> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
91233ad7 |
|
30-Mar-2022 |
Gautam Dawar <gautam.dawar@xilinx.com> |
vhost: support ASID in IOTLB API This patch allows userspace to send ASID based IOTLB messages to vhost. The idea is to use the reserved u32 field in the existing V2 IOTLB message. A vhost device should advertise this capability via the VHOST_BACKEND_F_IOTLB_ASID backend feature. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Gautam Dawar <gdawar@xilinx.com> Message-Id: <20220330180436.24644-10-gdawar@xilinx.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
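The reserved field in question, as it appears in the v2 message in the UAPI (include/uapi/linux/vhost_types.h):

```c
struct vhost_msg_v2 {
    __u32 type;
    __u32 asid;   /* previously "reserved"; now carries the ASID */
    union {
        struct vhost_iotlb_msg iotlb;
        __u8 padding[64];
    };
};
```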
|
#
d3bb267b |
|
21-Jan-2022 |
Stefano Garzarella <sgarzare@redhat.com> |
vhost: cache avail index in vhost_enable_notify() In vhost_enable_notify() we enable the notifications and we read the avail index to check if new buffers have become available in the meantime. We are not caching the avail index, so when the device calls vhost_get_vq_desc(), it will find the old value in the cache and will read the avail index again. It would be better to refresh the cache every time we read the avail index, so let's change vhost_enable_notify() to cache the value in `avail_idx` and compare it with `last_avail_idx` to check if there are new buffers available. We don't expect a significant performance boost because the above path is not very common; indeed, vhost_enable_notify() is often called with unlikely(), expecting that the avail index has not been updated. We ran virtio-test/vhost-test and noticed minimal improvement as expected. To stress the patch more, we modified vhost_test.ko to call vhost_enable_notify()/vhost_disable_notify() on every cycle when calling vhost_get_vq_desc(); in this case we observed a more evident improvement, with a reduction of the test execution time of about 3.7%. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lore.kernel.org/r/20220121153108.187291-1-sgarzare@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
|
#
95932ab2 |
|
10-Mar-2022 |
Jason Wang <jasowang@redhat.com> |
vhost: allow batching hint without size Commit e2ae38cf3d91 ("vhost: fix hung thread due to erroneous iotlb entries") tries to reject IOTLB messages whose size is zero. But the size is not necessarily meaningful; one example is the batching hint, so the commit breaks that. Fix this by rejecting zero-size messages only if the message is used to update/invalidate the IOTLB. Fixes: e2ae38cf3d91 ("vhost: fix hung thread due to erroneous iotlb entries") Reported-by: Eli Cohen <elic@nvidia.com> Cc: Anirudh Rayabharam <mail@anirudhrb.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20220310075211.4801-1-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Eli Cohen <elic@nvidia.com>
|
#
4c809363 |
|
13-Jan-2022 |
Stefano Garzarella <sgarzare@redhat.com> |
vhost: remove avail_event arg from vhost_update_avail_event() In vhost_update_avail_event() we never used the `avail_event` argument, since its introduction in commit 2723feaa8ec6 ("vhost: set log when updating used flags or avail event"). Let's remove it to clean up the code. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lore.kernel.org/r/20220113141134.186773-1-sgarzare@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
e2ae38cf |
|
05-Mar-2022 |
Anirudh Rayabharam <mail@anirudhrb.com> |
vhost: fix hung thread due to erroneous iotlb entries In vhost_iotlb_add_range_ctx(), range size can overflow to 0 when start is 0 and last is ULONG_MAX. One instance where it can happen is when userspace sends an IOTLB message with iova=size=uaddr=0 (vhost_process_iotlb_msg). So, an entry with size = 0, start = 0, last = ULONG_MAX ends up in the iotlb. Next time a packet is sent, iotlb_access_ok() loops indefinitely due to that erroneous entry.

Call Trace:
<TASK>
iotlb_access_ok+0x21b/0x3e0 drivers/vhost/vhost.c:1340
vq_meta_prefetch+0xbc/0x280 drivers/vhost/vhost.c:1366
vhost_transport_do_send_pkt+0xe0/0xfd0 drivers/vhost/vsock.c:104
vhost_worker+0x23d/0x3d0 drivers/vhost/vhost.c:372
kthread+0x2e9/0x3a0 kernel/kthread.c:377
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
</TASK>

Reported by syzbot at: https://syzkaller.appspot.com/bug?extid=0abd373e2e50d704db87 To fix this, do two things: 1. Return -EINVAL in vhost_chr_write_iter() when userspace asks to map a range with size 0. 2. Fix vhost_iotlb_add_range_ctx() to handle the range [0, ULONG_MAX] by splitting it into two entries. Fixes: 0bbe30668d89e ("vhost: factor out IOTLB") Reported-by: syzbot+0abd373e2e50d704db87@syzkaller.appspotmail.com Tested-by: syzbot+0abd373e2e50d704db87@syzkaller.appspotmail.com Signed-off-by: Anirudh Rayabharam <mail@anirudhrb.com> Link: https://lore.kernel.org/r/20220305095525.5145-1-mail@anirudhrb.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
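A sketch of fix (2), essentially the hunk the message describes: the size of [0, ULONG_MAX] overflows to 0, so the range is split into two entries before insertion (names from the tree; illustrative):

```c
/* In vhost_iotlb_add_range_ctx(): */
if (start == 0 && last == ULONG_MAX) {
    u64 mid = last / 2;
    int err = vhost_iotlb_add_range_ctx(iotlb, start, mid, addr,
                                        perm, opaque);

    if (err)
        return err;
    addr += mid + 1;
    start = mid + 1;
}
```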
|
#
f7ad318e |
|
28-Jul-2021 |
Xie Yongji <xieyongji@bytedance.com> |
vhost: Fix the calculation in vhost_overflow() This fixes the incorrect calculation for integer overflow when the last address of iova range is 0xffffffff. Fixes: ec33d031a14b ("vhost: detect 32 bit integer wrap around") Reported-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20210728130756.97-2-xieyongji@bytedance.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
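A sketch of the fixed helper: with a range ending at the last address 0xffffffff (32-bit), `uaddr + size` wraps to 0, so the comparison is rearranged to avoid the wrap:

```c
static bool vhost_overflow(u64 uaddr, u64 size)
{
    if (uaddr > ULONG_MAX || size > ULONG_MAX)
        return true;

    /* Make sure 64 bit math will not overflow. */
    return uaddr > ULONG_MAX - size + 1;
}
```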
|
#
1465cb61 |
|
24-May-2021 |
Mike Christie <michael.christie@oracle.com> |
vhost: remove work arg from vhost_work_flush vhost_work_flush doesn't do anything with the work arg. This patch drops it and then renames vhost_work_flush to vhost_work_dev_flush to reflect that the function flushes all the works in the dev and not just a specific queue or work item. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Link: https://lore.kernel.org/r/20210525174733.6212-2-michael.christie@oracle.com Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
beb691e6 |
|
12-Mar-2021 |
Laurent Vivier <lvivier@redhat.com> |
vhost: Fix vhost_vq_reset() vhost_reset_is_le() is vhost_init_is_le(), and in the case of cross-endian legacy, vhost_init_is_le() depends on vq->user_be. vq->user_be is set by vhost_disable_cross_endian(). But in vhost_vq_reset(), we have: vhost_reset_is_le(vq); vhost_disable_cross_endian(vq); And so user_be is used before being set. To fix that, reverse the order of the two lines, as there is no other dependency between them. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Link: https://lore.kernel.org/r/20210312140913.788592-1-lvivier@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
6bcf3422 |
|
09-Nov-2020 |
Mike Christie <michael.christie@oracle.com> |
vhost: add helper to check if a vq has been setup This adds a helper to check if a vq has been set up. The next patches will use this when we move the vhost scsi cmd preallocation from per session to per vq. In the per vq case, we only want to allocate cmds for vqs that have actually been set up and not for all the possible vqs. Signed-off-by: Mike Christie <michael.christie@oracle.com> Link: https://lore.kernel.org/r/1604986403-4931-2-git-send-email-michael.christie@oracle.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
|
#
86e182fe |
|
09-Sep-2020 |
Zhu Lingshan <lingshan.zhu@intel.com> |
vhost_vdpa: remove unnecessary spin_lock in vhost_vring_call This commit removes unnecessary spin_locks in vhost_vring_call and related operations, because we manipulate the irq offloading contents in the vhost_vdpa ioctl code path, which is already protected by the dev mutex and vq mutex. Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com> Link: https://lore.kernel.org/r/20200909065234.3313-1-lingshan.zhu@intel.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
|
#
5e5e8736 |
|
14-Sep-2020 |
Li Wang <li.wang@windriver.com> |
vhost: reduce stack usage in log_used Fix the warning: [-Werror=-Wframe-larger-than=]

drivers/vhost/vhost.c: In function log_used:
drivers/vhost/vhost.c:1906:1: warning: the frame size of 1040 bytes is larger than 1024 bytes

Signed-off-by: Li Wang <li.wang@windriver.com> Link: https://lore.kernel.org/r/1600106889-25013-1-git-send-email-li.wang@windriver.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
|
#
ab512251 |
|
02-Oct-2020 |
Greg Kurz <groug@kaod.org> |
vhost: Don't call log_access_ok() when using IOTLB When the IOTLB device is enabled, the log_guest_addr that is passed by userspace to the VHOST_SET_VRING_ADDR ioctl, and which is then written to vq->log_addr, is a GIOVA. All writes to this address are translated by log_user() to writes to an HVA, and then ultimately logged through the corresponding GPAs in log_write_hva(). No logging will ever occur with vq->log_addr in this case. It is thus wrong to pass vq->log_addr and log_guest_addr to log_access_vq() which assumes they are actual GPAs. Introduce a new vq_log_used_access_ok() helper that only checks accesses to the log for the used structure when there isn't an IOTLB device around. Signed-off-by: Greg Kurz <groug@kaod.org> Link: https://lore.kernel.org/r/160171933385.284610.10189082586063280867.stgit@bahia.lan Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
71878fa4 |
|
02-Oct-2020 |
Greg Kurz <groug@kaod.org> |
vhost: Use vhost_get_used_size() in vhost_vring_set_addr() The open-coded computation of the used size doesn't take the event into account when the VIRTIO_RING_F_EVENT_IDX feature is present. Fix that by using vhost_get_used_size(). Fixes: 8ea8cf89e19a ("vhost: support event index") Cc: stable@vger.kernel.org Signed-off-by: Greg Kurz <groug@kaod.org> Link: https://lore.kernel.org/r/160171932300.284610.11846106312938909461.stgit@bahia.lan Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
0210a8db |
|
02-Oct-2020 |
Greg Kurz <groug@kaod.org> |
vhost: Don't call access_ok() when using IOTLB When the IOTLB device is enabled, the vring addresses we get from userspace are GIOVAs. It is thus wrong to pass them down to access_ok() which only takes HVAs. Access validation is done at prefetch time with IOTLB. Teach vq_access_ok() about that by moving the (vq->iotlb) check from vhost_vq_access_ok() to vq_access_ok(). This prevents vhost_vring_set_addr() from failing when verifying the accesses. No behavior change for vhost_vq_access_ok(). BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1883084 Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API") Cc: jasowang@redhat.com CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Greg Kurz <groug@kaod.org> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/160171931213.284610.2052489816407219136.stgit@bahia.lan Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
ae6961de |
|
31-Aug-2020 |
Yunsheng Lin <linyunsheng@huawei.com> |
vhost: fix typo in error message "enable" should be "disable" when the function name is vhost_disable_notify(), which does the disabling work. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
460f7ce1 |
|
04-Aug-2020 |
Jason Wang <jasowang@redhat.com> |
vhost: generalize backend features setting/getting Move the backend features setting/getting from net.c to vhost.c to be reused by vhost-vdpa. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20200804162048.22587-3-eli@mellanox.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
265a0ad8 |
|
31-Jul-2020 |
Zhu Lingshan <lingshan.zhu@intel.com> |
vhost: introduce vhost_vring_call This commit introduces struct vhost_vring_call, which replaces the raw struct eventfd_ctx *call_ctx in struct vhost_virtqueue. Besides the eventfd_ctx, it contains a spin lock and an irq_bypass_producer in its structure. Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com> Suggested-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20200731065533.4144-2-lingshan.zhu@intel.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
bf11d71a |
|
31-Jul-2020 |
Gustavo A. R. Silva <gustavoars@kernel.org> |
vhost: Use flex_array_size() helper in copy_from_user() Make use of the flex_array_size() helper to calculate the size of a flexible array member within an enclosing structure. This helper offers defense-in-depth against potential integer overflows, while at the same time makes it explicitly clear that we are dealing with a flexible array member. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://lore.kernel.org/r/20200731130956.GA30525@embeddedor Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
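The before/after shape of the conversion (the copy_from_user() call that reads the region table; illustrative):

```c
/* Before: open-coded multiplication can overflow. */
if (copy_from_user(newmem->regions, m->regions,
                   mem.nregions * sizeof(*m->regions)))
    goto err;

/* After: flex_array_size() saturates on overflow and makes the
 * flexible-array relationship explicit.
 */
if (copy_from_user(newmem->regions, m->regions,
                   flex_array_size(newmem, regions, mem.nregions)))
    goto err;
```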
|
#
71c0b9a6 |
|
30-Oct-2019 |
Will Deacon <will@kernel.org> |
vhost: Remove redundant use of read_barrier_depends() barrier Since commit 76ebbe78f739 ("locking/barriers: Add implicit smp_read_barrier_depends() to READ_ONCE()"), there is no need to use smp_read_barrier_depends() outside of the Alpha architecture code. Unfortunately, there is precisely _one_ user in the vhost code, and there isn't an obvious READ_ONCE() access making the barrier redundant. However, on closer inspection (thanks, Jason), it appears that vring synchronisation between the producer and consumer occurs via the 'avail_idx' field, which is followed up by an rmb() in vhost_get_vq_desc(), making the read_barrier_depends() redundant on Alpha. Jason says:

| I'm also confused about the barrier here, basically in driver side
| we did:
|
| 1) allocate pages
| 2) store pages in indirect->addr
| 3) smp_wmb()
| 4) increase the avail idx (somehow a tail pointer of vring)
|
| in vhost we did:
|
| 1) read avail idx
| 2) smp_rmb()
| 3) read indirect->addr
| 4) read from indirect->addr
|
| It looks to me even the data dependency barrier is not necessary
| since we have rmb() which is sufficient for us to the correct
| indirect->addr and driver are not expected to do any writing to
| indirect->addr after avail idx is increased

Remove the redundant barrier invocation. Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Suggested-by: Jason Wang <jasowang@redhat.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
|
#
37c54f9b |
|
10-Jun-2020 |
Christoph Hellwig <hch@lst.de> |
kernel: set USER_DS in kthread_use_mm Some architectures like arm64 and s390 require USER_DS to be set for kernel threads to access user address space, which is the whole purpose of kthread_use_mm, but others like x86 don't. That has led to a huge mess where some callers are fixed up once they are tested on said architectures, while others linger around, and yet others like io_uring try to do "clever" optimizations for what usually is just a trivial assignment to a member in the thread_struct for most architectures. Make kthread_use_mm set USER_DS, and kthread_unuse_mm restore to the previous value instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jens Axboe <axboe@kernel.dk> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Felipe Balbi <balbi@kernel.org> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Cc: Zhi Wang <zhi.a.wang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: http://lkml.kernel.org/r/20200404094101.672954-7-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
f5678e7f |
|
10-Jun-2020 |
Christoph Hellwig <hch@lst.de> |
kernel: better document the use_mm/unuse_mm API contract Switch the function documentation to kerneldoc comments, and add WARN_ON_ONCE asserts that the calling thread is a kernel thread and does not have ->mm set (or has ->mm set in the case of unuse_mm). Also give the functions a kthread_ prefix to better document the use case. [hch@lst.de: fix a comment typo, cover the newly merged use_mm/unuse_mm caller in vfio] Link: http://lkml.kernel.org/r/20200416053158.586887-3-hch@lst.de [sfr@canb.auug.org.au: powerpc/vas: fix up for {un}use_mm() rename] Link: http://lkml.kernel.org/r/20200422163935.5aa93ba5@canb.auug.org.au Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jens Axboe <axboe@kernel.dk> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [usb] Acked-by: Haren Myneni <haren@linux.ibm.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Felipe Balbi <balbi@kernel.org> Cc: Jason Wang <jasowang@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Cc: Zhi Wang <zhi.a.wang@intel.com> Link: http://lkml.kernel.org/r/20200404094101.672954-6-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9bf5b9eb |
|
10-Jun-2020 |
Christoph Hellwig <hch@lst.de> |
kernel: move use_mm/unuse_mm to kthread.c Patch series "improve use_mm / unuse_mm", v2. This series improves the use_mm / unuse_mm interface by better documenting the assumptions, and by taking the set_fs manipulations spread over the callers into the core API. This patch (of 3): Use the proper API instead. Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de These helpers are only for use with kernel threads, and I will tie them more into the kthread infrastructure going forward. Also move the prototypes to kthread.h - mmu_context.h was a little weird to start with as it otherwise contains very low-level MM bits. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jens Axboe <axboe@kernel.dk> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Felipe Balbi <balbi@kernel.org> Cc: Jason Wang <jasowang@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Cc: Zhi Wang <zhi.a.wang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de Link: http://lkml.kernel.org/r/20200416053158.586887-1-hch@lst.de Link: http://lkml.kernel.org/r/20200404094101.672954-5-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
690623e1 |
|
07-Jun-2020 |
John Hubbard <jhubbard@nvidia.com> |
vhost: convert get_user_pages() --> pin_user_pages() This code was using get_user_pages*(), in approximately a "Case 5" scenario (accessing the data within a page), using the categorization from [1]. That means that it's time to convert the get_user_pages*() + put_page() calls to pin_user_pages*() + unpin_user_pages() calls. There is some helpful background in [2]: basically, this is a small part of fixing a long-standing disconnect between pinning pages, and file systems' use of those pages. [1] Documentation/core-api/pin_user_pages.rst [2] "Explicit pinning of user-space pages": https://lwn.net/Articles/807108/ Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200529234309.484480-3-jhubbard@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
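The shape of the conversion in vhost's bit-logging path (set_bit_to_user(); a sketch with the bit-setting details elided):

```c
/* Before: */
r = get_user_pages_fast(log, 1, FOLL_WRITE, &page);
/* ... set the bit via a kmap of the page ... */
set_page_dirty_lock(page);
put_page(page);

/* After: pin/unpin make the "access the data within the page" case
 * explicit and mark the page dirty on release.
 */
r = pin_user_pages_fast(log, 1, FOLL_WRITE, &page);
/* ... set the bit via a kmap of the page ... */
unpin_user_pages_dirty_lock(&page, 1, true);
```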
|
#
e0136c16 |
|
05-Jun-2020 |
Zhu Lingshan <lingshan.zhu@intel.com> |
vhost: replace -1 with VHOST_FILE_UNBIND in ioctls This commit replaces -1 with VHOST_FILE_UNBIND in ioctls since we have added such a macro in the uapi header for vdpa_host. Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/1591352835-22441-5-git-send-email-lingshan.zhu@intel.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
002ef18e |
|
27-May-2020 |
Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com> |
vhost: (cosmetic) remove a superfluous variable initialisation Even the compiler is able to figure out that in this case the initialisation is superfluous. Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com> Link: https://lore.kernel.org/r/20200527180541.5570-3-guennadi.liakhovetski@linux.intel.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
5ce995f3 |
|
29-May-2020 |
Jason Wang <jasowang@redhat.com> |
vhost: use mmgrab() instead of mmget() for non worker device For devices that don't use a vhost worker and use_mm(), mmget() is too heavyweight, and it may bring trouble for implementing mmap() support for vDPA devices. This is because a reference to the address space was held via mmget() in vhost_dev_set_owner() and a reference to the file was held in mmap(). This means that when the process exits, the mm cannot be released, and thus we cannot release the file. This patch uses mmgrab() instead of mmget(), which allows the address space to be destroyed at process exit without releasing the mm structure itself. This is sufficient for vDPA devices, which pin user pages and do not depend on the address space to work. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20200529080303.15449-3-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
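A sketch of the attach/detach pairing, assuming the use_worker flag from the companion patch below (illustrative, close to the merged helpers):

```c
static void vhost_attach_mm(struct vhost_dev *dev)
{
    if (dev->use_worker) {
        dev->mm = get_task_mm(current); /* pins the address space */
    } else {
        /* vDPA pins user pages explicitly and does not need the
         * address space, so only pin the mm_struct itself.
         */
        dev->mm = current->mm;
        mmgrab(dev->mm);
    }
}

static void vhost_detach_mm(struct vhost_dev *dev)
{
    if (!dev->mm)
        return;
    if (dev->use_worker)
        mmput(dev->mm);
    else
        mmdrop(dev->mm);
    dev->mm = NULL;
}
```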
|
#
01fcb1cb |
|
29-May-2020 |
Jason Wang <jasowang@redhat.com> |
vhost: allow device that does not depend on vhost worker vDPA devices currently relay the eventfd via the vhost worker. This is inefficient due to the latency of wakeup and scheduling, so this patch introduces a use_worker attribute for the vhost device. When use_worker is not set with vhost_dev_init(), vhost won't try to allocate a worker thread, and the vhost_poll will be processed directly in the wakeup function. This helps vDPA since it reduces the latency caused by the vhost worker. In my testing, it saves 0.2 ms in pings between VMs on the same host. Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20200529080303.15449-2-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
a865e420 |
|
06-Apr-2020 |
Michael S. Tsirkin <mst@redhat.com> |
virtio: force spec specified alignment on types The ring element addresses are passed between components with different alignment assumptions. Thus, if guest/userspace selects a pointer and host then gets and dereferences it, we might need to decrease the compiler-selected alignment to prevent the compiler on the host from assuming the pointer is aligned. This actually triggers on ARM with -mabi=apcs-gnu - which is a deprecated configuration, but it seems safer to handle this generally. Note that userspace that allocates the memory is actually OK and does not need to be fixed, but userspace that gets it from the guest or another process does need to be fixed. The latter doesn't generally talk to the kernel, so while it might be buggy it's not talking to the kernel in the buggy way - it's just using the header in the buggy way - so fixing the header and asking userspace to recompile is the best we can do. I verified that the produced kernel binary on x86 is exactly identical before and after the change. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
|
#
1b0be99f |
|
15-May-2020 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: missing __user tags sparse warns about converting void * to void __user *. This is not new but only got noticed now that vhost is built on more systems. This is just a question of __user tags missing in a couple of places, so fix it up. Fixes: f88949138058 ("vhost: introduce O(1) vq metadata cache") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
0bbe3066 |
|
26-Mar-2020 |
Jason Wang <jasowang@redhat.com> |
vhost: factor out IOTLB This patch factors out IOTLB into a dedicated module in order to be reused by other modules like vringh. User may choose to enable the automatic retiring by specifying VHOST_IOTLB_FLAG_RETIRE flag to fit for the case of vhost device IOTLB implementation. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20200326140125.19794-4-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
792a4f2e |
|
26-Mar-2020 |
Jason Wang <jasowang@redhat.com> |
vhost: allow per device message handler This patch allows a device to register its own message handler during vhost_dev_init(). The vDPA device will use it to implement its own DMA mapping logic. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20200326140125.19794-3-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
8f6a7f96 |
|
04-Dec-2019 |
Andrey Konovalov <andreyknvl@google.com> |
vhost, kcov: collect coverage from vhost_worker Add kcov_remote_start()/kcov_remote_stop() annotations to the vhost_worker() function, which is responsible for processing vhost works. Since vhost_worker() threads are spawned per vhost device instance the common kcov handle is used for kcov_remote_start()/stop() annotations (see Documentation/dev-tools/kcov.rst for details). As the result kcov can now be used to collect coverage from vhost worker threads. Link: http://lkml.kernel.org/r/e49d5d154e5da6c9ada521d2b7ce10a49ce9f98b.1572366574.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Alexander Potapenko <glider@google.com> Cc: Anders Roxell <anders.roxell@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Windsor <dwindsor@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Elena Reshetova <elena.reshetova@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jason Wang <jasowang@redhat.com> Cc: Marco Elver <elver@google.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0d4a3f2a |
|
14-Sep-2019 |
Michael S. Tsirkin <mst@redhat.com> |
Revert "vhost: block speculation of translated descriptors" This reverts commit a89db445fbd7f1f8457b03759aa7343fa530ef6b. I was hasty to include this patch, and it breaks the build on 32 bit. Defence in depth is good but let's do it properly. Cc: stable@vger.kernel.org Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
060423bf |
|
11-Sep-2019 |
yongduan <yongduan@tencent.com> |
vhost: make sure log_num < in_num The code assumes log_num < in_num everywhere, and that is true as long as in_num is incremented by the descriptor iov count, and log_num by 1. However, this breaks if there's a zero-sized descriptor. As a result, if a malicious guest creates a vring desc with desc.len = 0, it may cause the host kernel to crash by overflowing the log array. This bug can be triggered during VM migration. There's no need to log when desc.len = 0, so just don't increment log_num in this case. Fixes: 3a4d5c94e959 ("vhost_net: a kernel-level virtio server") Cc: stable@vger.kernel.org Reviewed-by: Lidong Chen <lidongchen@tencent.com> Signed-off-by: ruippan <ruippan@tencent.com> Signed-off-by: yongduan <yongduan@tencent.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
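A sketch of the shape of the fix in the write-descriptor path of vhost.c; the vhost64_to_cpu()/vhost32_to_cpu() helpers are the in-tree accessors, but the surrounding context is simplified:

    /* Only record a dirty-log entry when the descriptor actually added
     * iovecs (ret != 0); a zero-length descriptor must not advance
     * log_num, or it can outrun in_num and overflow the log array. */
    if (unlikely(log != NULL && ret)) {       /* was: if (unlikely(log)) */
            log[*log_num].addr = vhost64_to_cpu(vq, desc.addr);
            log[*log_num].len = vhost32_to_cpu(vq, desc.len);
            ++*log_num;
    }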
|
#
a89db445 |
|
08-Sep-2019 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: block speculation of translated descriptors iovec addresses coming from vhost are assumed to be pre-validated, but in fact can be speculated to a value out of range. Userspace addresses are later validated with array_index_nospec so we can be sure kernel info does not leak through these addresses, but vhost must also not leak userspace info outside the allowed memory table to guests. Following the defence in depth principle, make sure the address is not validated out of the node range. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Cc: stable@vger.kernel.org Acked-by: Jason Wang <jasowang@redhat.com> Tested-by: Jason Wang <jasowang@redhat.com>
|
#
3d2c7d37 |
|
10-Aug-2019 |
Michael S. Tsirkin <mst@redhat.com> |
Revert "vhost: access vq metadata through kernel virtual address" This reverts commit 7f466032dc ("vhost: access vq metadata through kernel virtual address"). The commit caused a bunch of issues, and while commit 73f628ec9e ("vhost: disable metadata prefetch optimization") disabled the optimization it's not nice to keep lots of dead code around. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
896fc242 |
|
20-Aug-2019 |
Yunsheng Lin <linyunsheng@huawei.com> |
vhost: Remove unnecessary variable It is unnecessary to use the ret variable to return the error code; just return the error code directly. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
7a338472 |
|
04-Jun-2019 |
Thomas Gleixner <tglx@linutronix.de> |
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 482 Based on 1 normalized pattern(s): this work is licensed under the terms of the gnu gpl version 2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 48 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Enrico Weigelt <info@metux.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081204.624030236@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
cb1aaebe |
|
07-Jun-2019 |
Mauro Carvalho Chehab <mchehab+samsung@kernel.org> |
docs: fix broken documentation links Mostly due to x86 and acpi conversion, several documentation links are still pointing to the old file. Fix them. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Reviewed-by: Wolfram Sang <wsa@the-dreams.de> Reviewed-by: Sven Van Asbroeck <TheSven73@gmail.com> Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com> Acked-by: Mark Brown <broonie@kernel.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
|
#
7f466032 |
|
24-May-2019 |
Jason Wang <jasowang@redhat.com> |
vhost: access vq metadata through kernel virtual address It was noticed that the copy_to/from_user() friends that were used to access virtqueue metadata tend to be very expensive for a dataplane implementation like vhost, since they involve lots of software checks, speculation barriers, and hardware feature toggling (e.g. SMAP). The extra cost is more obvious when transferring small packets, since the time spent on metadata access becomes more significant. This patch tries to eliminate those overheads by accessing the metadata through a direct mapping of those pages. Invalidation callbacks are implemented for co-operation with general VM management (swap, KSM, THP or NUMA balancing). We will try to get the direct mapping of vq metadata before each round of packet processing if it doesn't exist. If we fail, we simply fall back to the copy_to/from_user() friends. The invalidation and direct mapping access are synchronized through a spinlock and RCU: all metadata accesses through the direct map are protected by RCU, and the setup or invalidation is done under the spinlock. This method might not work for highmem pages, which require temporary mappings, so we just fall back to normal copy_to/from_user() there; nor does it work for architectures with virtually tagged caches, since extra cache flushing would be needed to eliminate aliases, which would result in complex logic and bad performance. For those architectures, this patch simply goes for the copy_to/from_user() friends. This is done by ruling out the kernel mapping code through ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE. Note that this is only done when device IOTLB is not enabled. We could use a similar method to optimize the IOTLB in the future. Tests show at most about a 23% improvement on TX PPS when using virtio-user + vhost_net + xdp1 + TAP on a 2.6GHz Broadwell: SMAP on | SMAP off Before: 5.2Mpps | 7.1Mpps After: 6.4Mpps | 8.2Mpps Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Bottomley <James.Bottomley@hansenpartnership.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Miller <davem@davemloft.net> Cc: Jerome Glisse <jglisse@redhat.com> Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-parisc@vger.kernel.org Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
feebcaea |
|
24-May-2019 |
Jason Wang <jasowang@redhat.com> |
vhost: factor out setting vring addr and num Factor out vring address and num setting, which need special care for accelerating vq metadata access. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
4942e825 |
|
24-May-2019 |
Jason Wang <jasowang@redhat.com> |
vhost: introduce helpers to get the size of metadata area This avoids code duplication, since the helpers will be used by kernel VA prefetching. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
9b5e830b |
|
24-May-2019 |
Jason Wang <jasowang@redhat.com> |
vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch() Rename the function to be more accurate, since it actually tries to prefetch vq metadata addresses in the IOTLB. And this will be used by a following patch to prefetch metadata virtual addresses. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
7b5d753e |
|
24-May-2019 |
Jason Wang <jasowang@redhat.com> |
vhost: fine grain userspace memory accessors These are used to hide the metadata address from the virtqueue helpers. This will allow implementing vmap-based fast access to metadata. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
1ab5d138 |
|
24-May-2019 |
Jason Wang <jasowang@redhat.com> |
vhost: generalize adding used elem Use one generic vhost_copy_to_user() instead of two dedicated accessors. This will simplify the conversion to fine-grain accessors. About a 2% improvement in PPS was seen during a virtio-user txonly test. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
e82b9b07 |
|
16-May-2019 |
Jason Wang <jasowang@redhat.com> |
vhost: introduce vhost_exceeds_weight() We used to have vhost_exceeds_weight() for vhost-net to: - prevent the vhost kthread from hogging the cpu - balance the time spent between TX and RX This function could be useful for vsock and scsi as well. So move it to vhost.c. A device must specify a weight which counts the number of requests, or it can also specify a byte_weight which counts the number of bytes that have been processed. Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
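A sketch consistent with that description; the weight/byte_weight fields and the requeue-on-exceed behaviour match the in-tree helper, but consider the exact body illustrative:

    bool vhost_exceeds_weight(struct vhost_virtqueue *vq, int pkts, int total_len)
    {
            struct vhost_dev *dev = vq->dev;

            if (unlikely(total_len >= dev->byte_weight) ||
                unlikely(pkts >= dev->weight)) {
                    /* yield: requeue the work instead of hogging the cpu */
                    vhost_poll_queue(&vq->poll);
                    return true;
            }
            return false;
    }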
|
#
73b0140b |
|
13-May-2019 |
Ira Weiny <ira.weiny@intel.com> |
mm/gup: change GUP fast to use flags rather than a write 'bool' To facilitate additional options to get_user_pages_fast() change the singular write parameter to be gup_flags. This patch does not change any functionality. New functionality will follow in subsequent patches. Some of the get_user_pages_fast() call sites were unchanged because they already passed FOLL_WRITE or 0 for the write parameter. NOTE: It was suggested to change the ordering of the get_user_pages_fast() arguments to ensure that callers were converted. This breaks the current GUP call site convention of having the returned pages be the final parameter. So the suggestion was rejected. Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Mike Marshall <hubcap@omnibond.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
813dbeb6 |
|
08-Apr-2019 |
Jason Wang <jasowang@redhat.com> |
vhost: reject zero size iova range We used to accept a zero size iova range, which would lead to an infinite loop in translate_desc(). Fix this by failing the request in this case. Reported-by: syzbot+d21e6e297322a900c128@syzkaller.appspotmail.com Fixes: 6b1e6cc7 ("vhost: new device IOTLB API") Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
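A minimal sketch of the described fix, assuming it lands in the IOTLB-update handling in vhost.c; the surrounding context is simplified:

    /* Reject a zero-sized iova range up front: translate_desc() makes
     * progress by walking mapped ranges, and an empty range would never
     * advance, looping forever. */
    if (unlikely(!msg->size))
            return -EINVAL;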
|
#
81bf7bbe |
|
05-Mar-2019 |
Arnd Bergmann <arnd@arndb.de> |
vhost: silence an unused-variable warning On some architectures, the MMU can be disabled, leading to access_ok() becoming an empty macro that does not evaluate its size argument, which in turn produces an unused-variable warning: drivers/vhost/vhost.c:1191:9: error: unused variable 's' [-Werror,-Wunused-variable] size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0; Mark the variable as __maybe_unused to shut up that warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cfdbb4ed |
|
05-Mar-2019 |
Arnd Bergmann <arnd@arndb.de> |
vhost: silence an unused-variable warning On some architectures, the MMU can be disabled, leading to access_ok() becoming an empty macro that does not evaluate its size argument, which in turn produces an unused-variable warning: drivers/vhost/vhost.c:1191:9: error: unused variable 's' [-Werror,-Wunused-variable] size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0; Mark the variable as __maybe_unused to shut up that warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
816db766 |
|
18-Feb-2019 |
Jason Wang <jasowang@redhat.com> |
vhost: correctly check the return value of translate_desc() in log_used() On failure, translate_desc() returns a negative value; otherwise it returns the number of iovs. So we should fail when the return value is negative instead of blindly checking against zero. Detected by CoverityScan, CID# 1442593: Control flow issues (DEADCODE) Fixes: cc5e71075947 ("vhost: log dirty page correctly") Acked-by: Michael S. Tsirkin <mst@redhat.com> Reported-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b46a0bf7 |
|
28-Jan-2019 |
Jason Wang <jasowang@redhat.com> |
vhost: fix OOB in get_rx_bufs() After batched used ring updating was introduced in commit e2b3b35eb989 ("vhost_net: batch used ring update in rx"), we tend to batch heads in vq->heads for more than one packet. But the quota passed to get_rx_bufs() was not correctly limited, which can result in an OOB write in vq->heads. headcount = get_rx_bufs(vq, vq->heads + nvq->done_idx, vhost_len, &in, vq_log, &log, likely(mergeable) ? UIO_MAXIOV : 1); UIO_MAXIOV was still used, which is wrong since we could have batched used entries in vq->heads; this will cause an OOB write if the next buffer needs more than 960 (1024 (UIO_MAXIOV) - 64 (VHOST_NET_BATCH)) heads after we've batched 64 (VHOST_NET_BATCH) heads: ============================================================================= BUG kmalloc-8k (Tainted: G B ): Redzone overwritten ----------------------------------------------------------------------------- INFO: 0x00000000fd93b7a2-0x00000000f0713384. First byte 0xa9 instead of 0xcc INFO: Allocated in alloc_pd+0x22/0x60 age=3933677 cpu=2 pid=2674 kmem_cache_alloc_trace+0xbb/0x140 alloc_pd+0x22/0x60 gen8_ppgtt_create+0x11d/0x5f0 i915_ppgtt_create+0x16/0x80 i915_gem_create_context+0x248/0x390 i915_gem_context_create_ioctl+0x4b/0xe0 drm_ioctl_kernel+0xa5/0xf0 drm_ioctl+0x2ed/0x3a0 do_vfs_ioctl+0x9f/0x620 ksys_ioctl+0x6b/0x80 __x64_sys_ioctl+0x11/0x20 do_syscall_64+0x43/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 INFO: Slab 0x00000000d13e87af objects=3 used=3 fp=0x (null) flags=0x200000000010201 INFO: Object 0x0000000003278802 @offset=17064 fp=0x00000000e2e6652b Fix this by allocating UIO_MAXIOV + VHOST_NET_BATCH iovs for vhost-net. This is done by setting the limit through vhost_dev_init(), so that set_owner can allocate the number of iovs in a per-device manner. This fixes CVE-2018-16880. Fixes: e2b3b35eb989 ("vhost_net: batch used ring update in rx") Acked-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
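A sketch of the per-device limit, assuming the reworked vhost_dev_init() signature described above; the extra parameter and the vhost-net caller below illustrate the fix rather than quote it verbatim:

    /* vhost-net reserves room for both a worst-case descriptor chain and
     * the batched heads, so get_rx_bufs() can never overrun vq->heads. */
    vhost_dev_init(&n->dev, vqs, VHOST_NET_VQ_MAX,
                   UIO_MAXIOV + VHOST_NET_BATCH /* per-device iov limit */);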
|
#
cc5e7107 |
|
16-Jan-2019 |
Jason Wang <jasowang@redhat.com> |
vhost: log dirty page correctly The vhost dirty page logging API is designed to sync through GPA, but we try to log GIOVA when device IOTLB is enabled. This is wrong and may lead to missing data after migration. To solve this issue, when logging with device IOTLB enabled, we will: 1) reuse the device IOTLB translation result of the GIOVA->HVA mapping to get the HVA; for a writable descriptor, get the HVA through the iovec, and for the used ring update, translate its GIOVA to an HVA; 2) traverse the GPA->HVA mapping to get the possible GPAs and log through GPA. Note that this reverse mapping is not guaranteed to be unique, so we should log each possible GPA in this case. This fixes the failure of scp to the guest during migration. In -next, we will probably support passing GIOVA->GPA instead of GIOVA->HVA. Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API") Reported-by: Jintack Lim <jintack@cs.columbia.edu> Cc: Jintack Lim <jintack@cs.columbia.edu> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
74ad7419 |
|
13-Dec-2018 |
Pavel Tikhomirov <ptikhomirov@virtuozzo.com> |
vhost: return EINVAL if iovecs size does not match the message size We've failed to copy and process a vhost_iotlb_msg, so let userspace at least know about it. For instance, before this patch the code below runs without any error: int main() { struct vhost_msg msg; struct iovec iov; int fd; fd = open("/dev/vhost-net", O_RDWR); if (fd == -1) { perror("open"); return 1; } iov.iov_base = &msg; iov.iov_len = sizeof(msg)-4; if (writev(fd, &iov,1) == -1) { perror("writev"); return 1; } return 0; } Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
96d4f267 |
|
03-Jan-2019 |
Linus Torvalds <torvalds@linux-foundation.org> |
Remove 'type' argument from access_ok() function Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument of the user address range verification function since we got rid of the old racy i386-only code to walk page tables by hand. It existed because the original 80386 would not honor the write protect bit when in kernel mode, so you had to do COW by hand before doing any user access. But we haven't supported that in a long time, and these days the 'type' argument is a purely historical artifact. A discussion about extending 'user_access_begin()' to do the range checking resulted in this patch, because there is no way we're going to move the old VERIFY_xyz interface to that model. And it's best done at the end of the merge window when I've done most of my merges, so let's just get this done once and for all. This patch was mostly done with a sed-script, with manual fix-ups for the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form. There were a couple of notable cases: - csky still had the old "verify_area()" name as an alias. - the iter_iov code had magical hardcoded knowledge of the actual values of VERIFY_{READ,WRITE} (not that they mattered, since nothing really used it) - microblaze used the type argument for a debug printout but other than those oddities this should be a total no-op patch. I tried to fix up all architectures, did fairly extensive grepping for access_ok() uses, and the changes are trivial, but I may have missed something. Any missed conversion should be trivially fixable, though. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
86a07da3 |
|
12-Dec-2018 |
Jason Wang <jasowang@redhat.com> |
Revert "net: vhost: lock the vqs one by one" This reverts commit 78139c94dc8c96a478e67dab3bee84dc6eccb5fd. We don't protect device IOTLB with vq mutex, which will lead e.g use after free for device IOTLB entries. And since we've switched to use mutex_trylock() in previous patch, it's safe to revert it without having deadlock. Fixes: commit 78139c94dc8c ("net: vhost: lock the vqs one by one") Cc: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
841df922 |
|
12-Dec-2018 |
Jason Wang <jasowang@redhat.com> |
vhost: make sure used idx is seen before log in vhost_add_used_n() We miss a write barrier that guarantees the used idx is updated and seen before the log. This lets userspace sync and copy the used ring before the used idx is updated. Fix this by adding a barrier before log_write(). Fixes: 8dd014adfea6f ("vhost-net: mergeable buffers support") Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
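A minimal sketch of the ordering fix; vhost_put_user() and log_write() are the in-tree helpers, but the snippet compresses the real vhost_add_used_n() flow:

    /* Make the used idx update visible before the dirty-log write, so
     * userspace that syncs on the log never copies a used ring whose
     * index has not reached guest memory yet. */
    vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx), &vq->used->idx);
    smp_wmb();      /* used idx before log_write() */
    if (unlikely(vq->log_used))
            log_write(vq->log_base,
                      vq->log_addr + offsetof(struct vring_used, idx),
                      sizeof(vq->used->idx));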
|
#
e3f78718 |
|
30-Nov-2018 |
Jean-Philippe Brucker <jean-philippe@linaro.org> |
vhost: fix IOTLB locking Commit 78139c94dc8c ("net: vhost: lock the vqs one by one") moved the vq lock to improve scalability, but introduced a possible deadlock in vhost-iotlb. vhost_iotlb_notify_vq() now takes vq->mutex while holding the device's IOTLB spinlock. And on the vhost_iotlb_miss() path, the spinlock is taken while holding vq->mutex. Since calling vhost_poll_queue() doesn't require any lock, avoid the deadlock by not taking vq->mutex. Fixes: 78139c94dc8c ("net: vhost: lock the vqs one by one") Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ff002269 |
|
30-Oct-2018 |
Jason Wang <jasowang@redhat.com> |
vhost: Fix Spectre V1 vulnerability The idx in vhost_vring_ioctl() was controlled by userspace, hence a potential exploitation of the Spectre variant 1 vulnerability. Fix this by sanitizing idx before using it to index d->vqs. Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
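A sketch of the sanitization, using array_index_nospec() from <linux/nospec.h>; the bounds check mirrors the pattern in vhost_vring_ioctl():

    /* Clamp idx under speculation so a mispredicted branch cannot index
     * d->vqs out of bounds. */
    if (idx >= d->nvqs)
            return -ENOBUFS;
    idx = array_index_nospec(idx, d->nvqs);
    vq = d->vqs[idx];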
|
#
78139c94 |
|
25-Sep-2018 |
Tonghao Zhang <xiangxia.m.yue@gmail.com> |
net: vhost: lock the vqs one by one This patch changes the locking from taking all vq mutexes at the same time to taking them one by one. It will be used by the next patch to avoid a deadlock. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2d66f997 |
|
24-Aug-2018 |
Jason Wang <jasowang@redhat.com> |
vhost: correctly check the iova range when waking virtqueue We don't wake up the virtqueue if the first byte of the pending iova range is the last byte of the range we just updated. This can cause a virtqueue to wait for an IOTLB update forever. Fix this by correcting the check and waking up the virtqueue in this case. Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API") Reported-by: Peter Xu <peterx@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Tested-by: Peter Xu <peterx@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b13f9c63 |
|
07-Aug-2018 |
Jason Wang <jasowang@redhat.com> |
vhost: reset metadata cache when initializing new IOTLB We need to reset the metadata cache during new IOTLB initialization; otherwise stale pointers to the previous IOTLB may still be accessed, which will lead to a use-after-free. Reported-by: syzbot+c51e6736a1bf614b3272@syzkaller.appspotmail.com Fixes: f88949138058 ("vhost: introduce O(1) vq metadata cache") Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
429711ae |
|
05-Aug-2018 |
Jason Wang <jasowang@redhat.com> |
vhost: switch to use new message format We used to have a message format like: struct vhost_msg { int type; union { struct vhost_iotlb_msg iotlb; __u8 padding[64]; }; }; Unfortunately, there is a 32-bit hole after type on 64-bit machines because of the union's alignment. This leads to different formats between the 32-bit API and the 64-bit API. What's more, it breaks 32-bit programs running on a 64-bit kernel. Fix this by introducing a new message type with an explicit 32-bit reserved field after type: struct vhost_msg_v2 { __u32 type; __u32 reserved; union { struct vhost_iotlb_msg iotlb; __u8 padding[64]; }; }; We will have a consistent ABI after switching to this. To enable this capability, introduce a new ioctl (VHOST_SET_BACKEND_FEATURES) for userspace to enable this feature (VHOST_BACKEND_F_IOTLB_MSG_V2). Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API") Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
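The layout problem is easiest to see with offsets annotated; this sketch assumes the struct vhost_iotlb_msg definition from <linux/vhost_types.h>, whose __u64 members drive the union's alignment (offsets shown for x86; other 32-bit ABIs may differ):

    struct vhost_msg {              /* legacy format */
            int type;               /* offset 0, size 4 */
                                    /* 64-bit: 4-byte hole (union is
                                     * 8-aligned); i386: no hole */
            union {
                    struct vhost_iotlb_msg iotlb;
                    __u8 padding[64];
            };                      /* offset 8 on 64-bit, 4 on i386 */
    };

    struct vhost_msg_v2 {           /* fixed format */
            __u32 type;
            __u32 reserved;         /* explicit, same layout everywhere */
            union {
                    struct vhost_iotlb_msg iotlb;
                    __u8 padding[64];
            };                      /* offset 8 on all ABIs */
    };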
|
#
6da2ec56 |
|
12-Jun-2018 |
Kees Cook <keescook@chromium.org> |
treewide: kmalloc() -> kmalloc_array() The kmalloc() function has a 2-factor argument form, kmalloc_array(). This patch replaces cases of: kmalloc(a * b, gfp) with: kmalloc_array(a * b, gfp) as well as handling cases of: kmalloc(a * b * c, gfp) with: kmalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kmalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kmalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The tools/ directory was manually excluded, since it has its own implementation of kmalloc(). The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kmalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kmalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kmalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(char) * COUNT + COUNT , ...) | kmalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kmalloc + kmalloc_array ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kmalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. 
@@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kmalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kmalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kmalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kmalloc(C1 * C2 * C3, ...) | kmalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kmalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kmalloc(sizeof(THING) * C2, ...) | kmalloc(sizeof(TYPE) * C2, ...) | kmalloc(C1 * C2 * C3, ...) | kmalloc(C1 * C2, ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - (E1) * E2 + E1, E2 , ...) | - kmalloc + kmalloc_array ( - (E1) * (E2) + E1, E2 , ...) | - kmalloc + kmalloc_array ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
|
#
b2303d7b |
|
07-Jun-2018 |
Matthew Wilcox <willy@infradead.org> |
Convert vhost to struct_size Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Kees Cook <keescook@chromium.org>
|
#
670ae9ca |
|
11-May-2018 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: fix info leak due to uninitialized memory struct vhost_msg within struct vhost_msg_node is copied to userspace. Unfortunately it turns out on 64 bit systems vhost_msg has padding after type which gcc doesn't initialize, leaking 4 uninitialized bytes to userspace. This padding also unfortunately means 32 bit users of this interface are broken on a 64 bit kernel which will need to be fixed separately. Fixes: CVE-2018-1118 Cc: stable@vger.kernel.org Reported-by: Kevin Easton <kevin@guarana.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reported-by: syzbot+87cfa083e727a224754b@syzkaller.appspotmail.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
9965ed17 |
|
05-Mar-2018 |
Christoph Hellwig <hch@lst.de> |
fs: add new vfs_poll and file_can_poll helpers These abstract out calls to the poll method in preparation for changes in how we poll. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
1b15ad68 |
|
22-May-2018 |
Jason Wang <jasowang@redhat.com> |
vhost: synchronize IOTLB message with dev cleanup DaeRyong Jeong reports a race between vhost_dev_cleanup() and vhost_process_iotlb_msg(): Thread interleaving: CPU0 (vhost_process_iotlb_msg) CPU1 (vhost_dev_cleanup) (In the case of both VHOST_IOTLB_UPDATE and VHOST_IOTLB_INVALIDATE) ===== ===== vhost_umem_clean(dev->iotlb); if (!dev->iotlb) { ret = -EFAULT; break; } dev->iotlb = NULL; The reason is that we don't synchronize between them; fix this by protecting vhost_process_iotlb_msg() with the dev mutex. Reported-by: DaeRyong Jeong <threeearcat@gmail.com> Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API") Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ddd3d408 |
|
10-Apr-2018 |
Stefan Hajnoczi <stefanha@redhat.com> |
vhost: return bool from *_access_ok() functions Currently vhost *_access_ok() functions return int. This is error-prone because there are two popular conventions: 1. 0 means failure, 1 means success 2. -errno means failure, 0 means success Although vhost mostly uses #1, it does not do so consistently. umem_access_ok() uses #2. This patch changes the return type from int to bool so that false means failure and true means success. This eliminates a potential source of errors. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d14d2b78 |
|
10-Apr-2018 |
Stefan Hajnoczi <stefanha@redhat.com> |
vhost: fix vhost_vq_access_ok() log check Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log when IOTLB is enabled") introduced a regression. The logic was originally: if (vq->iotlb) return 1; return A && B; After the patch the short-circuit logic for A was inverted: if (A || vq->iotlb) return A; return B; This patch fixes the regression by rewriting the checks in the obvious way, no longer returning A when vq->iotlb is non-NULL (which is hard to understand). Reported-by: syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com Cc: Jason Wang <jasowang@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
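The two shapes of the check, written out as a sketch; A and B stand for the ring-access and log-access checks named in the message above, not real identifiers:

    /* The logic was originally: */
    if (vq->iotlb)
            return 1;       /* ring validation happens at prefetch time */
    return A && B;

    /* After the buggy patch, the short-circuit for A was inverted: */
    if (A || vq->iotlb)
            return A;       /* wrongly skips B when A holds */
    return B;

    /* The fix rewrites the checks the obvious way: */
    if (!B)
            return 0;
    if (vq->iotlb)
            return 1;
    return A;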
|
#
7ced6c98 |
|
11-Apr-2018 |
Eric Auger <eric.auger@redhat.com> |
vhost: Fix vhost_copy_to_user() vhost_copy_to_user is used to copy vring used elements to userspace. We should use VHOST_ADDR_USED instead of VHOST_ADDR_DESC. Fixes: f88949138058 ("vhost: introduce O(1) vq metadata cache") Signed-off-by: Eric Auger <eric.auger@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d65026c6 |
|
29-Mar-2018 |
Jason Wang <jasowang@redhat.com> |
vhost: validate log when IOTLB is enabled Vq log_base is the userspace address of a bitmap and has nothing to do with IOTLB. So it needs to be validated unconditionally; otherwise we may try to use 0 as log_base, which may pin pages and lead to unexpected results (e.g. triggering the BUG_ON() in set_bit_to_user()). Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API") Reported-by: syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dc6455a7 |
|
27-Mar-2018 |
Jason Wang <jasowang@redhat.com> |
vhost: correctly remove wait queue during poll failure We tried to remove the vq poll from the wait queue, but did not check whether it was actually on a list. This can lead to a double free. Fix this by switching to vhost_poll_stop(), which zeroes poll->wqh after removing the poll from the waitqueue, making sure it won't be freed twice. Cc: Darren Kenny <darren.kenny@oracle.com> Reported-by: syzbot+c0272972b01b872e604a@syzkaller.appspotmail.com Fixes: 2b8b328b61c79 ("vhost_net: handle polling errors when setting backend") Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
26b36604 |
|
14-Mar-2018 |
Sonny Rao <sonnyrao@chromium.org> |
vhost: fix vhost ioctl signature to build with clang Clang is particularly anal about signed vs unsigned comparisons and doesn't like the fact that some ioctl numbers set the MSB, so we get this error when trying to build vhost on aarch64: drivers/vhost/vhost.c:1400:7: error: overflow converting case value to switch condition type (3221794578 to 18446744072636378898) [-Werror, -Wswitch] case VHOST_GET_VRING_BASE: 3221794578 is 0xC008AF12 in hex 18446744072636378898 is 0xFFFFFFFFC008AF12 in hex Fix this by using unsigned ints in the function signature for vhost_vring_ioctl(). Signed-off-by: Sonny Rao <sonnyrao@chromium.org> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
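A sketch of the change; the prototype matches the in-tree function, with the overflowing constant shown for context:

    /* before: int ioctl — case VHOST_GET_VRING_BASE (0xC008AF12) overflows
     * a signed 32-bit switch condition, which clang rejects */
    long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl,
                           void __user *argp);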
|
#
a9a08845 |
|
11-Feb-2018 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: do bulk POLL* -> EPOLL* replacement This is the mindless scripted replacement of kernel use of POLL* variables as described by Al, done by this script: for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'` for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done done with de-mangling cleanups yet to come. NOTE! On almost all architectures, the EPOLL* constants have the same values as the POLL* constants do. But the keyword here is "almost". For various bad reasons they aren't the same, and epoll() doesn't actually work quite correctly in some cases due to this on Sparc et al. The next patch from Al will sort out the final differences, and we should be all done. Scripted-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d25cc43c |
|
06-Jan-2018 |
Eric Biggers <ebiggers@google.com> |
vhost: don't hold onto file pointer for VHOST_SET_LOG_FD We already hold a reference to the eventfd_ctx, which is sufficient; there's no need to hold a reference to the struct file as well. So get rid of vhost_dev->log_file. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Jason Wang <jasowang@redhat.com>
|
#
09f332a5 |
|
06-Jan-2018 |
Eric Biggers <ebiggers@google.com> |
vhost: don't hold onto file pointer for VHOST_SET_VRING_ERR We already hold a reference to the eventfd_ctx, which is sufficient; there's no need to hold a reference to the struct file as well. So get rid of vhost_virtqueue->error. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Jason Wang <jasowang@redhat.com>
|
#
e050c7d9 |
|
06-Jan-2018 |
Eric Biggers <ebiggers@google.com> |
vhost: don't hold onto file pointer for VHOST_SET_VRING_CALL We already hold a reference to the eventfd_ctx, which is sufficient; there's no need to hold a reference to the struct file as well. So get rid of vhost_virtqueue->call. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Jason Wang <jasowang@redhat.com>
|
#
f6f93f75 |
|
24-Dec-2017 |
夷则(Caspar) <jinli.zjl@alibaba-inc.com> |
vhost: remove unused lock check flag in vhost_dev_cleanup() In commit ea5d404655ba ("vhost: fix release path lockdep checks"), Michael added a flag to check whether we should hold a lock in vhost_dev_cleanup(); however, in commit 47283bef7ed3 ("vhost: move memory pointer to VQs"), RCU operations were replaced by a mutex, so we can remove the no-longer-used `locked' parameter now. Signed-off-by: Caspar Zhang <jinli.zjl@alibaba-inc.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
ac964d7a |
|
09-Jan-2018 |
Tonghao Zhang <xiangxia.m.yue@gmail.com> |
vhost: Remove the unused variable. Commit 7235acdb1 ("vhost: simplify work flushing") changed the way work is flushed, so the queued seq, done seq, and flushing counter are not used anymore. Remove them now. Fixes: 7235acdb1 ("vhost: simplify work flushing") Cc: Jason Wang <jasowang@redhat.com> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
6f3180af |
|
23-Jan-2018 |
Jason Wang <jasowang@redhat.com> |
vhost: do not try to access device IOTLB when not initialized The code will try to access dev->iotlb when processing VHOST_IOTLB_INVALIDATE even if it was not initialized, which may lead to a NULL pointer dereference. Fix this by checking dev->iotlb first. Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API") Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e9cb4239 |
|
23-Jan-2018 |
Jason Wang <jasowang@redhat.com> |
vhost: use mutex_lock_nested() in vhost_dev_lock_vqs() We used to call mutex_lock() in vhost_dev_lock_vqs(), which tries to hold the mutexes of all virtqueues. This may confuse lockdep into reporting a possible deadlock because of trying to hold locks belonging to the same class. Switch to mutex_lock_nested() to avoid the false positive. Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API") Reported-by: syzbot+dbb7c1161485e61b0241@syzkaller.appspotmail.com Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
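A sketch of the lockdep-friendly locking, giving each vq mutex its own subclass so lockdep can tell them apart:

    static void vhost_dev_lock_vqs(struct vhost_dev *d)
    {
            int i;

            /* subclass i distinguishes otherwise same-class vq mutexes */
            for (i = 0; i < d->nvqs; ++i)
                    mutex_lock_nested(&d->vqs[i]->mutex, i);
    }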
|
#
3a5db0b1 |
|
27-Nov-2017 |
Paul E. McKenney <paulmck@kernel.org> |
drivers/vhost: Remove now-redundant read_barrier_depends() Because READ_ONCE() now implies read_barrier_depends(), the read_barrier_depends() in next_desc() is now redundant. This commit therefore removes it and the related comments. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: <kvm@vger.kernel.org> Cc: <virtualization@lists.linux-foundation.org> Cc: <netdev@vger.kernel.org>
|
#
afc9a42b |
|
03-Jul-2017 |
Al Viro <viro@zeniv.linux.org.uk> |
the rest of drivers/*: annotate ->poll() instances Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
58e3b602 |
|
03-Jul-2017 |
Al Viro <viro@zeniv.linux.org.uk> |
vhost: annotate vhost_poll its ->mask is POLL... bitmap Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
3ad6f93e |
|
03-Jul-2017 |
Al Viro <viro@zeniv.linux.org.uk> |
annotate poll-related wait keys __poll_t is also used as wait key in some waitqueues. Verify that wait_..._poll() gets __poll_t as key and provide a helper for wakeup functions to get back to that __poll_t value. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
e6c8adca |
|
03-Jul-2017 |
Al Viro <viro@zeniv.linux.org.uk> |
annotate the places where ->poll() return values go Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
ca2c5b33 |
|
21-Aug-2017 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: fix end of range for access_ok During access_ok checks, addr increases as we iterate over the data structure, thus addr + len - 1 will point beyond the end of the region we are translating. Harmless, since we then verify that the region covers addr, but let's not waste cpu cycles. Reported-by: Koichiro Den <den@klaipeden.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Koichiro Den <den@klaipeden.com>
|
#
f808c13f |
|
08-Sep-2017 |
Davidlohr Bueso <dave@stgolabs.net> |
lib/interval_tree: fast overlap detection Allow interval trees to quickly check for overlaps to avoid unnecessary tree lookups in interval_tree_iter_first(). As of this patch, all interval tree flavors will require using a 'rb_root_cached' such that we can have the leftmost node easily available. While most users will make use of this feature, those with special functions (in addition to the generic insert, delete, search calls) will avoid using the cached option as they can do funky things with insertions -- for example, vma_interval_tree_insert_after(). [jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()] Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Doug Ledford <dledford@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: David Airlie <airlied@linux.ie> Cc: Jason Wang <jasowang@redhat.com> Cc: Christian Benvenuti <benve@cisco.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8d65843c |
|
26-Jul-2017 |
Jason Wang <jasowang@redhat.com> |
Revert "vhost: cache used event for better performance" This reverts commit 809ecb9bca6a9424ccd392d67e368160f8b76c92. Since it was reported to break vhost_net. We want to cache used event and use it to check for notification. The assumption was that guest won't move the event idx back, but this could happen in fact when 16 bit index wraps around after 64K entries. Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ac6424b9 |
|
19-Jun-2017 |
Ingo Molnar <mingo@kernel.org> |
sched/wait: Rename wait_queue_t => wait_queue_entry_t Rename: wait_queue_t => wait_queue_entry_t 'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue", but in reality it's a queue *entry*. The 'real' queue is the wait queue head, which had to carry the name. Start sorting this out by renaming it to 'wait_queue_entry_t'. This also allows the real structure name 'struct __wait_queue' to lose its double underscore and become 'struct wait_queue_entry', which is the more canonical nomenclature for such data types. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
6c5ab651 |
|
08-May-2017 |
Michal Hocko <mhocko@suse.com> |
mm: support __GFP_REPEAT in kvmalloc_node for >32kB vhost code uses __GFP_REPEAT when allocating vhost_virtqueue resp. vhost_vsock because it would really like to prefer kmalloc to the vmalloc fallback - see 23cc5a991c7a ("vhost-net: extend device allocation to vmalloc") for more context. Michael Tsirkin has also noted: "__GFP_REPEAT overhead is during allocation time. Using vmalloc means all accesses are slowed down. Allocation is not on data path, accesses are." The same applies to other vhost_kvzalloc users. Let's teach kvmalloc_node to handle __GFP_REPEAT properly. There are two things to be careful about. First, we should prevent invoking the OOM killer and so have to involve __GFP_NORETRY by default; secondly, override __GFP_REPEAT for !costly order requests, as __GFP_REPEAT is ignored for !costly orders. Supporting __GFP_REPEAT-like semantics for !costly requests is possible but would require changes in the page allocator. This is out of the scope of this patch. This patch shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/20170306103032.2540-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
174cd4b1 |
|
02-Feb-2017 |
Ingo Molnar <mingo@kernel.org> |
sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> Fix up affected files that include this signal functionality via sched.h. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
6e84f315 |
|
08-Feb-2017 |
Ingo Molnar <mingo@kernel.org> |
sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h> We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/mm.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. The APIs that are going to be moved first are: mm_alloc() __mmdrop() mmdrop() mmdrop_async_fn() mmdrop_async() mmget_not_zero() mmput() mmput_async() get_task_mm() mm_access() mm_release() Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
f8894913 |
|
28-Feb-2017 |
Jason Wang <jasowang@redhat.com> |
vhost: introduce O(1) vq metadata cache When device IOTLB is enabled, all address translations are stored in an interval tree. The O(lg N) search time could be slow for virtqueue metadata (avail, used and descriptors), since it is accessed much more often than other addresses. So this patch introduces an O(1) array which points to the interval tree nodes that store the translations of the vq metadata. The array is updated during vq IOTLB prefetching and reset on each invalidation and TLB update. Each time we want to access vq metadata, this small array is queried before the interval tree. This is sufficient for static mappings but not dynamic mappings; we could do optimizations on top. Tests were done with l2fwd in the guest (2M hugepages): noiommu | before | after tx 1.32Mpps | 1.06Mpps(82%) | 1.30Mpps(98%) rx 2.33Mpps | 1.46Mpps(63%) | 2.29Mpps(98%) We can almost reach the same performance as noiommu mode. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
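A simplified sketch of the O(1) lookup idea; the enum and the per-vq meta_iotlb array exist in vhost.h, while the fallback shape here is condensed from the real accessors rather than quoted verbatim:

    /* Metadata types cached per vq (from vhost.h). */
    enum vhost_uaddr_type {
            VHOST_ADDR_DESC = 0,
            VHOST_ADDR_AVAIL = 1,
            VHOST_ADDR_USED = 2,
            VHOST_NUM_ADDRS = 3,
    };

    /* Consult the O(1) per-vq cache first; fall back to the interval
     * tree only on a miss (sketch, not the in-tree code). */
    static const struct vhost_umem_node *
    meta_fetch(struct vhost_virtqueue *vq, int type, u64 addr, u64 len)
    {
            const struct vhost_umem_node *node = vq->meta_iotlb[type];

            if (node && addr >= node->start && addr + len - 1 <= node->last)
                    return node;    /* O(1) hit */

            return vhost_umem_interval_tree_iter_first(&vq->umem->umem_tree,
                                                       addr, addr + len - 1);
    }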
|
#
e3b56cdd |
|
07-Feb-2017 |
Jason Wang <jasowang@redhat.com> |
vhost: try avoiding avail index access when getting descriptor If the last avail idx is not equal to the cached avail idx, we're sure there are still available buffers in the virtqueue, so there's no need to re-read the avail idx. So let's skip this to avoid an unnecessary userspace memory access and memory barrier. Pktgen tests show about a 3% improvement on rx pps. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
cda8bba0 |
|
30-Jan-2017 |
Halil Pasic <pasic@linux.vnet.ibm.com> |
vhost: fix initialization for vq->is_le Currently, under certain circumstances vhost_init_is_le does just a part of the initialization job, and depends on vhost_reset_is_le being called too. For this reason vhost_vq_init_access used to call vhost_reset_is_le when vq->private_data is NULL. This is not only counter-intuitive, but also a real problem because it breaks vhost_net. The bug was introduced to vhost_net with commit 2751c9882b94 ("vhost: cross-endian support for legacy devices"). The symptom is corruption of the vq's used.idx field (virtio) after VHOST_NET_SET_BACKEND was issued as a part of the vhost shutdown on a vq with pending descriptors. Let us make sure the outcome of vhost_init_is_le never depends on the state it is actually supposed to initialize, and fix vhost_net by removing the reset from vhost_vq_init_access. With the above, there is no reason for vhost_reset_is_le to do just half of the job. Let us make vhost_reset_is_le reinitialize is_le. Signed-off-by: Halil Pasic <pasic@linux.vnet.ibm.com> Reported-by: Michael A. Tebolt <miket@us.ibm.com> Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Fixes: commit 2751c9882b94 ("vhost: cross-endian support for legacy devices") Cc: <stable@vger.kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Greg Kurz <groug@kaod.org> Tested-by: Michael A. Tebolt <miket@us.ibm.com>
|
#
275bf960 |
|
18-Jan-2017 |
Jason Wang <jasowang@redhat.com> |
vhost: better detection of available buffers This patch does several tweaks on vhost_vq_avail_empty() for better performance: - check the cached avail index first, which can avoid a userspace memory access; - use unlikely() for the failure of userspace access; - check vq->last_avail_idx instead of the cached avail index as the last step. This patch is needed for batching support, which needs to peek whether there are still available buffers in the ring. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
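A simplified sketch of the resulting check; the helper names follow later kernels' vhost accessors (vhost_get_avail() etc.), so treat the exact calls as illustrative:

    bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
    {
            __virtio16 avail_idx;
            int r;

            /* Cached index differs: buffers are definitely pending, no
             * need to touch userspace memory at all. */
            if (vq->avail_idx != vq->last_avail_idx)
                    return false;

            r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
            if (unlikely(r))
                    return false;   /* access failed: report not empty */
            vq->avail_idx = vhost16_to_cpu(vq, avail_idx);

            return vq->avail_idx == vq->last_avail_idx;
    }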
|
#
809ecb9b |
|
11-Dec-2016 |
Jason Wang <jasowang@redhat.com> |
vhost: cache used event for better performance When event index is enabled, we need to fetch the used event from userspace memory each time. This userspace fetch (with its memory barrier) can sometimes be avoided by 1) caching the used event and 2) noting that if the used event is ahead of both new and old, and the old-to-new update does not cross it, there is no need to notify the guest. This is useful for heavy tx load; e.g. a guest pktgen test with the Linux driver shows ~3.5% improvement. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
72952cc0 |
|
05-Dec-2016 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: add missing __user annotations Several vhost functions were missing __user annotations on pointers, causing sparse warnings. Fix this up. sparse also warns about vhost_process_iotlb_msg which is local and should be static. Fix that up as well. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
2f952c01 |
|
05-Dec-2016 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: make interval tree static inline vhost_umem_interval_tree is only used locally within vhost.c, so mark it static. As some of the generated functions go unused, this triggers warnings unless we also mark it inline. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
635abf01 |
|
07-Dec-2016 |
Peng Tao <bergwolf@gmail.com> |
vhost: remove unnecessary smp_mb from vhost_work_queue test_and_set_bit() already implies a memory barrier. Signed-off-by: Peng Tao <bergwolf@gmail.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cbbd26b8 |
|
01-Nov-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
[iov_iter] new primitives - copy_from_iter_full() and friends copy_from_iter_full(), copy_from_iter_full_nocache() and csum_and_copy_from_iter_full() - counterparts of copy_from_iter() et.al., advancing iterator only in case of successful full copy and returning whether it had been successful or not. Convert some obvious users. *NOTE* - do not blindly assume that something is a good candidate for those unless you are sure that not advancing iov_iter in failure case is the right thing in this case. Anything that does short read/short write kind of stuff (or is in a loop, etc.) is unlikely to be a good one. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
ec33d031 |
|
01-Aug-2016 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: detect 32 bit integer wrap around Detect and fail early if long wrap around is triggered. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
6b1e6cc7 |
|
23-Jun-2016 |
Jason Wang <jasowang@redhat.com> |
vhost: new device IOTLB API This patch tries to implement a device IOTLB for vhost. This could be used with a userspace (qemu) implementation of DMA remapping to emulate an IOMMU for the guest. The idea is simple: cache the translation in a software device IOTLB (implemented as an interval tree) in vhost and use the vhost_net file descriptor for reporting IOTLB misses and IOTLB updates/invalidations. When vhost meets an IOTLB miss, the fault address, size and access can be read from the file. After userspace finishes the translation, it writes the translated address to the vhost_net file to update the device IOTLB. When the device IOTLB is enabled by setting VIRTIO_F_IOMMU_PLATFORM, all vq addresses set by ioctl are treated as iova instead of virtual addresses, and accesses can only be done through the IOTLB instead of direct userspace memory access. Before each round of vq processing, all vq metadata is prefetched into the device IOTLB to make sure no translation fault happens during vq processing. In most cases, virtqueues are contiguous even in virtual address space, so the IOTLB translation for the virtqueue itself may make it a little slower. We might add a fast path cache on top of this patch. Signed-off-by: Jason Wang <jasowang@redhat.com> [mst: use virtio feature bit: VHOST_F_DEVICE_IOTLB -> VIRTIO_F_IOMMU_PLATFORM ] [mst: fix build warnings ] Signed-off-by: Michael S. Tsirkin <mst@redhat.com> [ weiyj.lk: missing unlock on error ] Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
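A hedged sketch of the userspace side of the miss/update protocol; the message type constants and struct vhost_iotlb_msg fields come from the uapi headers, while translate() and the fd handling are hypothetical:

    /* On an IOTLB miss vhost makes the fault readable on the vhost fd;
     * userspace translates and writes back an update. */
    struct vhost_msg msg;

    read(vhost_fd, &msg, sizeof(msg));   /* msg.iotlb.type == VHOST_IOTLB_MISS */
    msg.iotlb.type = VHOST_IOTLB_UPDATE;
    msg.iotlb.uaddr = translate(msg.iotlb.iova);  /* hypothetical helper */
    msg.iotlb.perm = VHOST_ACCESS_RW;
    write(vhost_fd, &msg, sizeof(msg));  /* fills the device IOTLB */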
|
#
a9709d68 |
|
23-Jun-2016 |
Jason Wang <jasowang@redhat.com> |
vhost: convert pre sorted vhost memory array to interval tree The current pre-sorted memory region array has some limitations for the future device IOTLB conversion: 1) it needs extra work for adding and removing a single region, and is expected to be slow because of sorting or memory re-allocation; 2) it needs extra work for removing a large range which may intersect several regions of different sizes; 3) it needs tricks for a replacement policy like LRU. To overcome these shortcomings, this patch converts it to an interval tree, which can easily address the above issues with almost no extra work. The patch could be used to: - extend the current API and only let userspace send diffs of the memory table; - simplify the device IOTLB implementation. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
bfe2bc51 |
|
23-Jun-2016 |
Jason Wang <jasowang@redhat.com> |
vhost: introduce vhost memory accessors This patch introduces vhost memory accessors which are just wrappers around the userspace address access helpers. This is a requirement for the vhost device IOTLB implementation, which will add IOTLB translations in those accessors. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
04b96e55 |
|
25-Apr-2016 |
Jason Wang <jasowang@redhat.com> |
vhost: lockless enqueuing We currently use a spinlock to synchronize the work list, which may cause unnecessary contention. So this patch switches to llist to remove this contention. Pktgen tests show about a 5% improvement: Before: ~1300000 pps After: ~1370000 pps Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
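A sketch of the resulting scheme under illustrative names: producers enqueue lock-free with llist_add(), and the worker drains the whole batch with a single atomic exchange, reversing the chain because llist is LIFO.

    static void vhost_worker_drain(struct vhost_dev *dev)
    {
            struct vhost_work *work, *tmp;
            struct llist_node *node;

            /* Detach the entire pending list in one atomic xchg. */
            node = llist_del_all(&dev->work_list);
            /* llist_add() builds a LIFO chain; restore FIFO order. */
            node = llist_reverse_order(node);
            llist_for_each_entry_safe(work, tmp, node, node)
                    work->fn(work);
    }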
|
#
7235acdb |
|
25-Apr-2016 |
Jason Wang <jasowang@redhat.com> |
vhost: simplify work flushing We used to implement work flushing by tracking the queued seq, the done seq, and the number of flushes in flight. This patch simplifies this by implementing work flushing through another kind of vhost work with a completion. This will be used by the lockless enqueuing patch. Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
03088137 |
|
04-Mar-2016 |
Jason Wang <jasowang@redhat.com> |
vhost_net: basic polling support This patch polls for newly added tx buffers or the socket receive queue for a while at the end of tx/rx processing. The maximum time spent on polling is specified through a new kind of vring ioctl. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
d4a60603 |
|
04-Mar-2016 |
Jason Wang <jasowang@redhat.com> |
vhost: introduce vhost_vq_avail_empty() This patch introduces a helper which will return true if we're sure that the available ring is empty for a specific vq. When we're not sure, e.g. on vq access failure, it returns false instead. This could be used by busy polling code to exit the busy loop. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
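The helper is small enough to sketch in full (this approximates the shape of the code, not necessarily its exact form): a failed read of the avail index means "not sure", which is reported as "not empty" so the busy loop exits.

    bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
    {
            __virtio16 avail_idx;

            /* Cannot read the ring: report not-empty to stop polling. */
            if (__get_user(avail_idx, &vq->avail->idx))
                    return false;

            return vhost16_to_cpu(vq, avail_idx) == vq->avail_idx;
    }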
|
#
526d3e7f |
|
04-Mar-2016 |
Jason Wang <jasowang@redhat.com> |
vhost: introduce vhost_has_work() This patch introduces a helper which can give a hint about whether there's work queued in the work list. This could be used by busy polling code to exit the busy loop. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
80f7d030 |
|
16-Feb-2016 |
Greg Kurz <groug@kaod.org> |
vhost: rename vhost_init_used() Looking at how callers use this, maybe we should just rename init_used to vhost_vq_init_access. The _used suffix was a hint that we access the vq used ring. But maybe what callers care about is that it must be called after access_ok. Also, this function manipulates the vq->is_le field which isn't related to the vq used ring. This patch simply renames vhost_init_used() to vhost_vq_init_access() as suggested by Michael. No behaviour change. Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
c5072037 |
|
16-Feb-2016 |
Greg Kurz <groug@kaod.org> |
vhost: rename cross-endian helpers The default use case for vhost is when the host and the vring have the same endianness (default native endianness). But there are cases where they differ and vhost should byteswap when accessing the vring. The first case is when the host is big endian and the vring belongs to a virtio 1.0 device, which is always little endian. This is covered by the vq->is_le field. This field is initialized when userspace calls the VHOST_SET_FEATURES ioctl. It is reset when the device stops. We already have a vhost_init_is_le() helper, but the reset operation is opencoded as follows: vq->is_le = virtio_legacy_is_little_endian(); It isn't clear that we are resetting vq->is_le here. This patch moves the code to a helper with a more explicit name. The other case where we may have to byteswap is when the architecture can switch endianness at runtime (bi-endian). If endianness differs in the host and in the guest, then legacy devices need to be used in cross-endian mode. This mode is available with CONFIG_VHOST_CROSS_ENDIAN_LEGACY=y, which introduces a vq->user_be field. Userspace may enable cross-endian mode by calling the SET_VRING_ENDIAN ioctl before the device is started. The cross-endian mode is disabled when the device is stopped. The current names of the helpers that manipulate vq->user_be are unclear. This patch renames those helpers to clearly show that this is cross-endian stuff and with explicit enable/disable semantics. No behaviour change. Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
e1f33be9 |
|
16-Feb-2016 |
Greg Kurz <groug@kaod.org> |
vhost: fix error path in vhost_init_used() We don't want side effects. If something fails, we roll back vq->is_le to its previous value. Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
5fba13b5 |
|
29-Nov-2015 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: replace % with & on data path We know vring num is a power of 2, so use & to mask the high bits. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
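The transformation in two lines (index names are illustrative):

    head = idx % vq->num;          /* before: integer division on the hot path */
    head = idx & (vq->num - 1);    /* after: valid because num is a power of 2 */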
|
#
d5424838 |
|
16-Nov-2015 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: relax log address alignment commit 5d9a07b0de512b77bf28d2401e5fe3351f00a240 ("vhost: relax used address alignment") fixed the alignment for the used virtual address, but not for the physical address used for logging. That's a mistake: alignment should clearly be the same for virtual and physical addresses. Cc: stable@vger.kernel.org Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
1e099473 |
|
15-Jul-2015 |
Igor Mammedov <imammedo@redhat.com> |
vhost: fix error handling for memory region alloc Callers of vhost_kvzalloc() expect the same behaviour on allocation error as from kmalloc/vmalloc, i.e. a NULL return value. So just return the vzalloc() return value instead of returning ERR_PTR(-ENOMEM). Fixes: 4de7255f7d2be5 ("vhost: extend memory regions allocation to vmalloc") Spotted-by: Dan Carpenter <dan.carpenter@oracle.com> Suggested-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Igor Mammedov <imammedo@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
7932c0bd |
|
17-Jul-2015 |
Marc-André Lureau <marcandre.lureau@redhat.com> |
vhost: actually track log eventfd file While reviewing vhost log code, I found out that log_file is never set. Note: I haven't tested the change (QEMU doesn't use LOG_FD yet). Cc: stable@vger.kernel.org Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
c9ce42f7 |
|
02-Jul-2015 |
Igor Mammedov <imammedo@redhat.com> |
vhost: add max_mem_regions module parameter It became possible to use a bigger number of memory slots, which is used by memory hotplug for registering hotplugged memory. However, QEMU crashes if it's used with more than ~60 pc-dimm devices and vhost-net enabled, since the host kernel module vhost-net refuses to accept more than 64 memory regions. Allow tweaking the limit via the max_mem_regions module parameter, with the default value set to 64 slots. Signed-off-by: Igor Mammedov <imammedo@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
4de7255f |
|
01-Jul-2015 |
Igor Mammedov <imammedo@redhat.com> |
vhost: extend memory regions allocation to vmalloc With a large number of memory regions we could end up with high-order allocations, and kmalloc could fail if the host is under memory pressure. Considering that the memory regions array is used on the hot path, try harder to allocate using kmalloc, and if it fails resort to vmalloc. It's still better than just failing vhost_set_memory() and causing a guest crash when new memory is hotplugged to the guest. I'll still look at a QEMU-side solution to reduce the number of memory regions it feeds to vhost to make things even better, but it doesn't hurt for the kernel to behave smarter and not crash older QEMUs which could use a large number of memory regions. Signed-off-by: Igor Mammedov <imammedo@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
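A sketch of the fallback pattern (close to the vhost_kvzalloc() shape discussed in the surrounding commits, though the exact gfp flags may differ): try the cheap physically contiguous allocation first and fall back to vmalloc space under pressure.

    static void *vhost_kvzalloc(unsigned long size)
    {
            /* __GFP_NOWARN: a kmalloc failure here is expected and handled. */
            void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);

            if (!n)
                    n = vzalloc(size);  /* virtually contiguous fallback */
            return n;
    }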
|
#
bcfeacab |
|
16-Jun-2015 |
Igor Mammedov <imammedo@redhat.com> |
vhost: use binary search instead of linear in find_region() For default region layouts performance stays the same as linear search, i.e. it takes around 210ns on average for translate_desc(), which inlines find_region(). But it scales better with a larger number of regions: 235ns for binary search vs 300ns for linear search with 55 memory regions, and the values will be about the same when the allowed number of slots is increased to 509 as has been done in kvm. Signed-off-by: Igor Mammedov <imammedo@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
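A sketch of a binary search over the sorted region array (assuming ascending guest_phys_addr order; the patch's exact sort order and bounds handling may differ):

    static const struct vhost_memory_region *
    find_region(const struct vhost_memory *mem, __u64 addr)
    {
            int lo = 0, hi = mem->nregions - 1;

            while (lo <= hi) {
                    int mid = lo + (hi - lo) / 2;
                    const struct vhost_memory_region *reg = &mem->regions[mid];

                    if (addr < reg->guest_phys_addr)
                            hi = mid - 1;
                    else if (addr >= reg->guest_phys_addr + reg->memory_size)
                            lo = mid + 1;
                    else
                            return reg;     /* addr falls inside this region */
            }
            return NULL;
    }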
|
#
2751c988 |
|
24-Apr-2015 |
Greg Kurz <groug@kaod.org> |
vhost: cross-endian support for legacy devices This patch brings cross-endian support to vhost when used to implement legacy virtio devices. Since it is a relatively rare situation, the feature availability is controlled by a kernel config option (not set by default). The vq->is_le boolean field is added to cache the endianness to be used for ring accesses. It defaults to native endian, as expected by legacy virtio devices. When the ring gets active, we force little endian if the device is modern. When the ring is deactivated, we revert to the native endian default. If cross-endian was compiled in, a vq->user_be boolean field is added so that userspace may request a specific endianness. This field is used to override the default when activating the ring of a legacy device. It has no effect on modern devices. Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
|
#
aad9a1ce |
|
10-Dec-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
vhost: switch vhost get_indirect() to iov_iter, kill memcpy_fromiovec() Cc: Michael S. Tsirkin <mst@redhat.com> Cc: kvm@vger.kernel.org Cc: virtualization@lists.linux-foundation.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
5d9a07b0 |
|
20-Dec-2014 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: relax used address alignment virtio 1.0 only requires used address to be 4 byte aligned, vhost required 8 bytes (size of vring_used_elem). Fix up vhost to match that. Additionally, while vhost correctly requires 8 byte alignment for log, it's unconnected to used ring: it's a consequence that log has u64 entries. Tweak code to make that clearer. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
3b1bbe89 |
|
24-Oct-2014 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: virtio 1.0 endian-ness support Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
64f7f051 |
|
01-Dec-2014 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: switch to __get/__put_user exclusively Most places in vhost can use __get/__put_user rather than get/put_user since addresses are pre-validated. This should be good for performance, but this also will help make code sparse-clean: get/put_user macros don't play well with __virtioXX bitwise tags. Switch get/put_user to the __ variants everywhere in vhost. There's one exception - for consistency switch that as well, and add an explicit access_ok check. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
47283bef |
|
05-Jun-2014 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: move memory pointer to VQs commit 2ae76693b8bcabf370b981cd00c36cd41d33fabc vhost: replace rcu with mutex replaced rcu sync for memory accesses with VQ mutex lock/unlock. This is correct since all accesses are under VQ mutex, but incomplete: we still do useless rcu lock/unlock operations, and someone might copy this code into some other context where this won't be right. This use of RCU is also non-standard and hard to understand. Let's copy the pointer to each VQ structure; this way the access rules become straightforward, and there's no need for RCU anymore. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
ea16c514 |
|
05-Jun-2014 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: move acked_features to VQs Refactor code to make sure features are only accessed under VQ mutex. This makes everything simpler, no need for RCU here anymore. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
98f9ca0a |
|
28-May-2014 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: replace rcu with mutex All memory accesses are done under some VQ mutex. So locking/unlocking all VQs is a faster equivalent of synchronize_rcu() for memory access changes. Some guests cause a lot of these changes, so it's helpful to make them faster. Reported-by: "Gonglei (Arei)" <arei.gonglei@huawei.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
59566b6e |
|
06-Dec-2013 |
Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> |
vhost: remove the dead branch Since vhost_dev_init() always returns 0, some branches are never run and therefore need to be removed. Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ac9fde24 |
|
07-Jun-2013 |
Qin Chuanyu <qinchuanyu@huawei.com> |
vhost: wake up worker outside spin_lock The wake_up_process() call is enclosed by spin_lock/unlock in vhost_work_queue, but it could be done outside the spin_lock. I have tested it with kernel 3.0.27 and a suse11-sp2 guest using iperf; the numbers are below.

                        original           |  modified
    thread_num   tp(Gbps)   vhost(%)   |  tp(Gbps)   vhost(%)
    1            9.59       28.82      |  9.59       27.49
    8            9.61       32.92      |  9.62       26.77
    64           9.58       46.48      |  9.55       38.99
    256          9.6        63.7       |  9.6        52.59

Signed-off-by: Chuanyu Qin <qinchuanyu@huawei.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
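A sketch of the change (pre-llist era, so a spinlock-protected list; names are illustrative): record under the lock whether a wakeup is needed, and issue it after the lock is dropped to shorten the critical section.

    void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
    {
            unsigned long flags;
            bool wake = false;

            spin_lock_irqsave(&dev->work_lock, flags);
            if (list_empty(&work->node)) {
                    list_add_tail(&work->node, &dev->work_list);
                    work->queue_seq++;
                    wake = true;
            }
            spin_unlock_irqrestore(&dev->work_lock, flags);

            /* wake_up_process() takes runqueue locks; keep it outside. */
            if (wake)
                    wake_up_process(dev->worker);
    }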
|
#
c49e4e57 |
|
02-Sep-2013 |
Jason Wang <jasowang@redhat.com> |
vhost: switch to use vhost_add_used_n() Let vhost_add_used() use vhost_add_used_n() to reduce code duplication. To avoid the overhead brought by __copy_to_user(), we will use put_user() when a single used element needs to be added. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
35596b27 |
|
18-Aug-2013 |
Asias He <asias@redhat.com> |
vhost: Include linux/uio.h instead of linux/socket.h memcpy_fromiovec is moved from net/core/iovec.c to lib/iovec.c. linux/uio.h provides the declaration for memcpy_fromiovec. Include linux/uio.h instead of linux/socket.h for it. Signed-off-by: Asias He <asias@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6ac1afbf |
|
06-May-2013 |
Asias He <asias@redhat.com> |
vhost: Make vhost a separate module Currently, vhost-net and vhost-scsi are sharing the vhost core code. However, vhost-scsi shares the code by including the vhost.c file directly. Making vhost a separate module makes it easier to share code with other vhost devices. Signed-off-by: Asias He <asias@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
6d5e6aa8 |
|
06-May-2013 |
Asias He <asias@redhat.com> |
vhost: Simplify dev->vqs[i] access Signed-off-by: Asias He <asias@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
05c05351 |
|
06-Jun-2013 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: check owner before we overwrite ubuf_info If device has an owner, we shouldn't touch ubuf_info since it might be in use. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7542a6b0 |
|
06-May-2013 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: drop virtio_net.h dependency There's no net specific code in vhost.c anymore, don't include the virtio_net.h header. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
54db63c2 |
|
05-May-2013 |
Asias He <asias@redhat.com> |
vhost: Export vhost_dev_set_owner Signed-off-by: Asias He <asias@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
150b9e51 |
|
28-Apr-2013 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: fix error handling in RESET_OWNER ioctl RESET_OWNER ioctl would leave the fd in a bad state if memory allocation failed: device is stopped but owner is not reset. Make state changes after allocating memory, such that a failed ioctl has no effect. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
81f95a55 |
|
28-Apr-2013 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: move per-vq net specific fields out to net This will remove the need for vhost scsi to pull in virtio-net.h. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
2839400f |
|
27-Apr-2013 |
Asias He <asias@redhat.com> |
vhost: move vhost-net zerocopy fields to net.c On top of 'vhost: Allow device specific fields per vq', we can move device-specific fields from the vhost virtqueue to the device virtqueue. Signed-off-by: Asias He <asias@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
3ab2e420 |
|
26-Apr-2013 |
Asias He <asias@redhat.com> |
vhost: Allow device specific fields per vq This is useful for any device that wants device-specific fields per vq. For example, tcm_vhost wants a per vq field to track requests which are in flight on the vq. Also, on top of this we can add patches to move things like ubufs from vhost.h out to net.c. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Asias He <asias@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
70181d51 |
|
10-Apr-2013 |
Jason Wang <jasowang@redhat.com> |
vhost_net: remove tx polling state After commit 2b8b328b61c799957a456a5a8dab8cc7dea68575 (vhost_net: handle polling errors when setting backend), we in fact track the polling state through poll->wqh, so there's no need to duplicate the work with an extra vhost_net_polling_state. So this patch removes this and makes the code simpler. This patch also removes all the tx starting/stopping code in the tx path, according to Michael's suggestion. Netperf tests show almost the same result in the stream test, but get improvements on TCP_RR tests (both zerocopy and copy), especially in low load cases. Tested between a multiqueue kvm guest and an external host with two directly connected 82599s.

zerocopy disabled:
    sessions | transaction rates (before/after/+improvement) | normalize (before/after/+improvement)
    1        | 9510.24/11727.29/+23.3%                       | 693.54/887.68/+28.0%
    25       | 192931.50/241729.87/+25.3%                    | 2376.80/2771.70/+16.6%
    50       | 277634.64/291905.76/+5%                       | 3118.36/3230.11/+3.6%

zerocopy enabled:
    sessions | transaction rates (before/after/+improvement) | normalize (before/after/+improvement)
    1        | 7318.33/11929.76/+63.0%                       | 521.86/843.30/+61.6%
    25       | 167264.88/242422.15/+44.9%                    | 2181.60/2788.16/+27.8%
    50       | 272181.02/294347.04/+8.1%                     | 3071.56/3257.85/+6.1%

Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2b8b328b |
|
27-Jan-2013 |
Jason Wang <jasowang@redhat.com> |
vhost_net: handle polling errors when setting backend Currently, polling errors are ignored, which can lead to the following issues: - vhost removes itself unconditionally from the waitqueue when stopping the poll; this may crash the kernel since the previous attempt at starting may have failed to add itself to the waitqueue - userspace may think the backend was successfully set even when the polling failed. Solve this by: - checking poll->wqh before trying to remove from the waitqueue - reporting polling errors in vhost_poll_start() and tx_poll_start(); the return value will be checked and returned when userspace wants to set the backend. After this fix, there could still be a polling failure after the backend is set; it will be addressed by the next patch. Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
935cdee7 |
|
06-Dec-2012 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: avoid backend flush on vring ops vring changes already do a flush internally where appropriate, so we do not need a second flush. It's currently not very expensive but a follow-up patch makes flush more heavy-weight, so remove the extra flush here to avoid regressing performance if call or kick fds are changed on data path. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
bd97120f |
|
25-Nov-2012 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: fix length for cross region descriptor If a single descriptor crosses a region, the second chunk length should be decremented by the size translated so far; instead it included the full descriptor length. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b211616d |
|
01-Nov-2012 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: move -net specific code out Zerocopy handling code is vhost-net specific. Move it from vhost.c/vhost.h out to net.c Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c4fcb586 |
|
01-Nov-2012 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: track zero copy failures using DMA length This will be used to disable zerocopy when error rate is high. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
70e4cb9a |
|
01-Nov-2012 |
Michael S. Tsirkin <mst@redhat.com> |
vhost-net: cleanup macros for DMA status tracking Better document macros for DMA tracking. Add an explicit one for DMA in progress instead of relying on user supplying len != 1. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e19d6763 |
|
01-Nov-2012 |
Michael S. Tsirkin <mst@redhat.com> |
skb: report completion status for zero copy skbs Even if skb is marked for zero copy, net core might still decide to copy it later which is somewhat slower than a copy in user context: besides copying the data we need to pin/unpin the pages. Add a parameter reporting such cases through zero copy callback: if this happens a lot, device can take this into account and switch to copying in user context. This patch updates all users but ignores the passed value for now: it will be used by follow-up patches. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cecb46f1 |
|
27-Aug-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
vhost_set_vring(): turn pollstart/pollstop into bool Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
163049ae |
|
21-Jul-2012 |
Stefan Hajnoczi <stefanha@gmail.com> |
vhost: make vhost work queue visible The vhost work queue allows processing to be done in vhost worker thread context, which uses the owner process mm. Access to the vring and guest memory is typically only possible from vhost worker context so it is useful to allow work to be queued directly by users. Currently vhost_net only uses the poll wrappers which do not expose the work queue functions. However, for tcm_vhost (vhost_scsi) it will be necessary to queue custom work. Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com> Cc: Zhi Yong Wu <wuzhy@cn.ibm.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
d7ffde35 |
|
25-Jun-2012 |
Jens Freimann <jfrei@linux.vnet.ibm.com> |
vhost: use USER_DS in vhost_worker thread On some architectures, address spaces are set up in a way that makes this unnecessary, but on others (like s390) it is needed. Make sure we operate on the user address space to allow copy_xxx_user() from the vhost_worker() thread by setting it explicitly before calling use_mm() and reverting it after unuse_mm(). Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c70aa540 |
|
01-May-2012 |
Jason Wang <jasowang@redhat.com> |
vhost: zerocopy: poll vq in zerocopy callback We add used entries and signal the guest in the worker thread, but did not poll the virtqueue during the zero copy callback. This may lead to missed adds and signals during zerocopy. Solve this by polling the virtqueue and letting it wake up the worker during the callback. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
ca8f4fb2 |
|
08-Apr-2012 |
Michael S. Tsirkin <mst@redhat.com> |
skbuff: struct ubuf_info callback type safety The skb struct ubuf_info callback gets passed struct ubuf_info itself, not the arg value as the field name and the function signature seem to imply. Rename the arg field to ctx to match usage, add documentation and change the callback argument type to make usage clear and to have compiler check correctness. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c6daa7ff |
|
25-Nov-2011 |
Cong Wang <amwang@redhat.com> |
vhost: remove the second argument of k[un]map_atomic() Signed-off-by: Cong Wang <amwang@redhat.com>
|
#
ea5d4046 |
|
27-Nov-2011 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: fix release path lockdep checks We shouldn't hold any locks on release path. Pass a flag to vhost_dev_cleanup to use the lockdep info correctly. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Sasha Levin <levinsasha928@gmail.com>
|
#
d550dda1 |
|
27-Feb-2012 |
Nadav Har'El <nyh@math.technion.ac.il> |
vhost: don't forget to schedule() This is a tiny, but important, patch to vhost. Vhost's worker thread only called schedule() when it had no work to do, and it wanted to go to sleep. But if there's always work to do, e.g., the guest is running a network-intensive program like netperf with small message sizes, schedule() was *never* called. This had several negative implications (on non-preemptive kernels): 1. Passing time was not properly accounted to the "vhost" process (ps and top would wrongly show it using zero CPU time). 2. Sometimes error messages about RCU timeouts would be printed, if the core running the vhost thread didn't schedule() for a very long time. 3. Worst of all, a vhost thread would "hog" the core. If several vhost threads need to share the same core, typically one would get most of the CPU time (and its associated guest most of the performance), while the others hardly get any work done. The trivial solution is to add: if (need_resched()) schedule(); after doing every piece of work. This will not do the heavy schedule() all the time, just when the timer interrupt decided a reschedule is warranted (so need_resched returns true). Thanks to Abel Gordon for this patch. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
b834226b |
|
19-Jul-2011 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: optimize interrupt enable/disable As we now only update used ring after enabling the backend, we can write flags with __put_user: as that's done on data path, it matters. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
75fd9edc |
|
19-Jul-2011 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: fix zcopy reference counting Fix get/put refcount imbalance with zero copy, which caused qemu to hang forever on guest driver unload. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
2723feaa |
|
21-Jun-2011 |
Jason Wang <jasowang@redhat.com> |
vhost: set log when updating used flags or avail event We need to log writes when updating used flags and avail event fields. Otherwise the guest may see a stale value after migration and miss notifying the host. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
f59281da |
|
21-Jun-2011 |
Jason Wang <jasowang@redhat.com> |
vhost: init used ring after backend was set Move the used ring initialization after backend was set. This makes it possible to disable the backend and tweak the used ring, then restart. This will also make it possible to log the used ring write correctly. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
bab632d6 |
|
17-Jul-2011 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: vhost TX zero-copy support From: Shirley Ma <mashirle@us.ibm.com> This adds experimental zero copy support in vhost-net, disabled by default. To enable, set the experimental_zcopytx module option to 1. This patch maintains the outstanding userspace buffers in the sequence they are delivered to vhost. The outstanding userspace buffers will be marked as done once the lower device's buffer DMA has finished. This is monitored through the last reference of the kfree_skb callback. Two buffer indices are used for this purpose. The vhost-net device passes the userspace buffer info to the lower device skb through message control. DMA done status check and guest notification are handled by handle_tx: the worst case is that all buffers in the vq are in pending/done status, so we need to notify the guest to release DMA done buffers first before we get any new buffers from the vq. One known problem is that if the guest stops submitting buffers, buffers might never get used until some further action, e.g. device reset. This does not seem to affect linux guests. Signed-off-by: Shirley <xma@us.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8ea8cf89 |
|
19-May-2011 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: support event index Support the new event index feature. When acked, utilize it to reduce the # of interrupts sent to the guest. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
#
61516587 |
|
06-May-2011 |
Rob Landley <rob@landley.net> |
Correct occurrences of
- Documentation/kvm/ to Documentation/virtual/kvm
- Documentation/uml/ to Documentation/virtual/uml
- Documentation/lguest/ to Documentation/virtual/lguest
throughout the kernel source tree. Signed-off-by: Rob Landley <rob@landley.net> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
|
#
fcc042a2 |
|
06-Mar-2011 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: copy_from_user -> __copy_from_user copy_from_user is pretty high on perf top profile, replacing it with __copy_from_user helps. It's also safe because we do access_ok checks during setup. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
d47effe1 |
|
01-Mar-2011 |
Krishna Kumar <krkumar2@in.ibm.com> |
vhost: Cleanup vhost.c and net.c Minor cleanup of vhost.c and net.c to match coding style. Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
0174b0c3 |
|
10-Jan-2011 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: fix signed/unsigned comparison To detect that a sequence number is done, we do math on unsigned integers, so the result is unsigned too; that is not what was intended for the <= comparison. The result is userspace stuck forever in the flush call. Convert to int to fix this. Further, get rid of ({}) to make the code clearer. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
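The bug class in miniature (the values and names are illustrative): with both operands unsigned, the difference can never be negative, so a <= 0 test only ever matches equality.

    unsigned int seq = 5, done_seq = 7;

    bool buggy = (seq - done_seq <= 0);        /* false: 5u - 7u wraps to 0xfffffffe */
    bool fixed = ((int)(seq - done_seq) <= 0); /* true: the cast restores -2 */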
|
#
28831ee6 |
|
29-Nov-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: better variable name in logging We really store a page offset in write_address, so rename it write_page to avoid confusion. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
3bf9be40 |
|
29-Nov-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: correctly set bits of dirty pages Fix two bugs in dirty page logging: When counting pages we should increase address by 1 instead of VHOST_PAGE_SIZE. Make log_write() correctly process requests that cross pages with write_address not starting at page boundary. Reported-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
bf5e0bd2 |
|
14-Nov-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: remove unused include vhost.c does not need to know about sockets, don't include sock.h Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
e4dde731 |
|
29-Nov-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: correctly set bits of dirty pages Fix two bugs in dirty page logging: When counting pages we should increase address by 1 instead of VHOST_PAGE_SIZE. Make log_write() correctly process requests that cross pages with write_address not starting at page boundary. Reported-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
8b7347aa |
|
19-Sep-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: get/put_user -> __get/__put_user We do access_ok checks on all ring values on an ioctl, so we don't need to redo them on each access. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
dfe5ac5b |
|
21-Sep-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: copy_to_user -> __copy_to_user We do access_ok checks at setup time, so we don't need to redo them on each access. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
64e1c807 |
|
06-Oct-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost-net: batch use/unuse mm Move use/unuse mm to vhost.c which makes it possible to batch these operations. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
533a19b4 |
|
06-Oct-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: put mm after thread stop makes it possible to batch use/unuse mm Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
3fcedec7 |
|
25-Oct-2010 |
Julia Lawall <julia@diku.dk> |
drivers/vhost/vhost.c: delete double assignment Delete successive assignments to the same location. A simplified version of the semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression i;
@@
*i = ...;
 i = ...;
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
6d97e55f |
|
11-Oct-2010 |
Dan Carpenter <error27@gmail.com> |
vhost: fix return code for log_access_ok() access_ok() returns 1 if it's OK otherwise it should return 0. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
e0e9b406 |
|
14-Sep-2010 |
Jason Wang <jasowang@redhat.com> |
vhost: max s/g to match qemu Qemu supports up to UIO_MAXIOV s/g, so we have to match that because guest drivers may rely on this. Allocate indirect and log arrays dynamically to avoid using too much contiguous memory, and make the length of the hdr array match the header length, since each iovec entry has at least one byte. Tested by copying large files w/ and w/o migration in both linux and windows guests. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
5786aee8 |
|
21-Sep-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: fix log ctx signalling The log eventfd signalling got put in dead code. We didn't notice because qemu currently does polling instead of eventfd select. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
615cc221 |
|
02-Sep-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: error handling fix vhost should set worker to NULL on cgroups attach failure, so that we won't try to destroy the worker again on close. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
87d6a412 |
|
02-Sep-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: fix attach to cgroups regression Since 2.6.36-rc1, non-root users of vhost-net fail to attach if they are in any cgroups. The reason is that when qemu uses vhost, vhost wants to attach its thread to all cgroups that qemu has. But we got the API backwards, so a non-privileged process (Qemu) tried to control the privileged one (vhost), which fails. Fix this by switching to the new cgroup_attach_task_all, and running it from the vhost thread. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
78b620ce |
|
30-Aug-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
vhost: stop worker only if created It's currently illegal to call kthread_stop(NULL). Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
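The guard in full (a sketch; in vhost the worker thread is only created once an owner is attached, so cleanup must tolerate a device that never got one):

    if (dev->worker) {
            kthread_stop(dev->worker);
            dev->worker = NULL;
    }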
|
#
28457ee6 |
|
09-Mar-2010 |
Arnd Bergmann <arnd@relay.de.ibm.com> |
vhost: add __rcu annotations Also add rcu_dereference_protected() for code paths where locks are held. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: "Michael S. Tsirkin" <mst@redhat.com>
|
#
8dd014ad |
|
27-Jul-2010 |
David Stevens <dlstevens@us.ibm.com> |
vhost-net: mergeable buffers support This adds support for mergeable buffers in vhost-net: this is needed for older guests without indirect buffer support, as well as for zero copy with some devices. Includes changes by Michael S. Tsirkin to make the patch as low risk as possible (i.e., close to no changes when feature is disabled). Signed-off-by: David Stevens <dlstevens@us.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
9e3d1957 |
|
27-Jul-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: apply cgroup to vhost workers Apply the cgroup of the owner task to the created vhost worker. Based on patches from Sridhar Samudrala and Tejun Heo. Later we'll need to also apply the cpumask and probably the priority of the owner process. Discussion on the best way to do this is still ongoing. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Sridhar Samudrala <samudrala.sridhar@gmail.com> Cc: Li Zefan <lizf@cn.fujitsu.com>
|
#
c23f3445 |
|
02-Jun-2010 |
Tejun Heo <tj@kernel.org> |
vhost: replace vhost_workqueue with per-vhost kthread Replace vhost_workqueue with a per-vhost kthread. Other than the callback argument change from struct work_struct * to struct vhost_work *, there's no visible change to the vhost_poll_*() interface. This conversion is to make each vhost use a dedicated kthread so that resource control via cgroup can be applied. Partially based on Sridhar Samudrala's patch.

* Updated to use the sub-structure vhost_work instead of directly using vhost_poll at Michael's suggestion.
* Added flusher wake_up() optimization at Michael's suggestion.

Changes by MST:
* Converted atomics/barrier use to a spinlock.
* Create thread on SET_OWNER
* Fix flushing

Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Cc: Sridhar Samudrala <samudrala.sridhar@gmail.com>
|
#
7b3384fc |
|
01-Jul-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: add unlikely annotations to error path patch 'break out of polling loop on error' caused a minor performance regression on my machine: recover that performance by adding a bunch of unlikely annotations in the error handling. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
d5675bd2 |
|
24-Jun-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: break out of polling loop on error When ring parsing fails, we currently handle this as a ring empty condition. This means that we enable kicks and recheck for ring empty: if it is not empty, we restart polling, which of course will fail again. Instead, let's return a negative error code and stop polling. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
a02c3789 |
|
27-May-2010 |
Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> |
vhost: fix the memory leak which will happen when memory_access_ok fails We need to free newmem when vhost_set_memory() fails to complete. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
7ad9c9d2 |
|
27-May-2010 |
Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> |
vhost: fix to check the return value of copy_to/from_user() correctly copy_to/from_user() returns the number of bytes that could not be copied. So we need to check if it is not zero, and in that case, we should return the error number -EFAULT rather than directly return the return value from copy_to/from_user(). Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
f8322fbe |
|
26-May-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: whitespace fix Fix up whitespace in vq_memory_access_ok. Reported-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
0f3d9a17 |
|
24-May-2010 |
Krishna Kumar <krkumar2@in.ibm.com> |
vhost: Fix host panic if ioctl called with wrong index Missed a boundary value check in vhost_set_vring. The host panics if idx == nvqs is used in ioctl commands in vhost_virtqueue_init. Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
4be929be |
|
24-May-2010 |
Alexey Dobriyan <adobriyan@gmail.com> |
kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN - C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not USHORT_MAX/SHORT_MAX/SHORT_MIN. - Make SHRT_MIN of type s16, not int, for consistency. [akpm@linux-foundation.org: fix drivers/dma/timb_dma.c] [akpm@linux-foundation.org: fix security/keys/keyring.c] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0d499356 |
|
11-May-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: fix barrier pairing According to memory-barriers.txt, an smp memory barrier in guest should always be paired with an smp memory barrier in host, and I quote "a lack of appropriate pairing is almost certainly an error". In case of vhost, failure to flush out used index update before looking at the interrupt disable flag could result in missed interrupts, resulting in networking hang under stress. This might happen when flags read bypasses used index write. So we see interrupts disabled and do not interrupt, at the same time guest writes flags value to enable interrupt, reads an old used index value, thinks that used ring is empty and waits for interrupt. Note: the barrier we pair with here is in drivers/virtio/virtio_ring.c, function vring_enable_cb. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Juan Quintela <quintela@redhat.com>
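A sketch of the host-side ordering (era-appropriate names, illustrative; the paired guest barrier sits in vring_enable_cb()): the used-index store must be globally visible before the interrupt-disable flag is read, or a notification can be lost.

    u16 flags;

    /* Publish the new used index to the guest... */
    __put_user(vq->last_used_idx, &vq->used->idx);
    /* ...and only then look at the guest's flags. */
    smp_mb();
    if (__get_user(flags, &vq->avail->flags))
            return;
    if (!(flags & VRING_AVAIL_F_NO_INTERRUPT))
            eventfd_signal(vq->call_ctx, 1);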
|
#
a8d3782f |
|
13-Apr-2010 |
Christoph Hellwig <hch@infradead.org> |
vhost: fix sparse warnings Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
179b284e |
|
07-Apr-2010 |
Jeff Dike <jdike@addtoit.com> |
vhost-net: fix vq_memory_access_ok error checking vq_memory_access_ok needs to check whether mem == NULL Signed-off-by: Jeff Dike <jdike@linux.intel.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
5a0e3ad6 |
|
24-Mar-2010 |
Tejun Heo <tj@kernel.org> |
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h, making everything defined by the two files universally available and complicating inclusion dependencies. The percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities to include those headers directly instead of assuming availability. As this conversion needs to touch a large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the following.

* Scan files for gfp and slab usages and update includes such that only the necessary includes are there, i.e. if only gfp is used, gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include blocks and tries to put the new include such that its order conforms to its surroundings. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered: alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly because the file doesn't have a fitting include block), it prints out an error message indicating which .h file needs to be added to the file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files.
2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition, while adding it to the implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files.
3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed, e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs, requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically editing them, as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored, as stuff from gfp.h was usually widely available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq).
   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as a bisection point.

Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
|
#
535297a6 |
|
17-Mar-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: fix error handling in vring ioctls Stanse found a locking problem in vhost_set_vring: several returns from VHOST_SET_VRING_KICK, VHOST_SET_VRING_CALL, VHOST_SET_VRING_ERR with the vq->mutex held. Fix these up. Reported-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: Laurent Chavey <chavey@google.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
d6db3f5c |
|
23-Feb-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: fix get_user_pages_fast error handling get_user_pages_fast returns number of pages on success, negative value on failure, but never 0. Fix vhost code to match this logic. Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
73a99f08 |
|
23-Feb-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: initialize log eventfd context pointer vq log eventfd context pointer needs to be initialized, otherwise operation may fail or oops if log is enabled but log eventfd not set by userspace. When log_ctx for device is created, it is copied to the vq. This reset was missing. Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
86e9424d |
|
17-Feb-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost: logging thinko fix vhost was doing some complex math to get the offset to log at, and got it wrong by a couple of bytes, while in fact it's simple: take the address where we write, subtract the start of the buffer, add the log base. Do it this way. Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
#
5659338c |
|
01-Feb-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost-net: switch to smp barriers vhost-net only uses memory barriers to control SMP effects (communication with userspace potentially running on a different CPU), so it should use SMP barriers and not mandatory barriers for memory access ordering, as suggested by Documentation/memory-barriers.txt Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3a4d5c94 |
|
13-Jan-2010 |
Michael S. Tsirkin <mst@redhat.com> |
vhost_net: a kernel-level virtio server What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Existing virtio net code is used in the guest without modification. There's similarity with vringfd, with some differences and reduced scope:
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- supports a memory table and not just an offset (needed for kvm)
Common virtio related code has been put in a separate file, vhost.c, and can be made into a separate module if/when more backends appear. I used Rusty's lguest.c as the source for developing this part; this supplied me with witty comments I wouldn't be able to write myself. What it is not: vhost net is not a bus, and not a generic new system call. No assumptions are made on how the guest performs hypercalls. Userspace hypervisors are supported as well as kvm. How it works: Basically, we connect the virtio frontend (configured by userspace) to a backend. The backend could be a network device, or a tap device. The backend is also configured by userspace, including vlan/mac etc. Status: This works for me, and I haven't seen any crashes. Compared to userspace, people reported improved latency (as I save up to 4 system calls per packet), as well as better bandwidth and CPU utilization. Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use
Note on RCU usage (this is also documented in vhost.h, near private_pointer which is the value protected by this variant of RCU): what is happening is that the rcu_dereference() is being used in a workqueue item. The role of rcu_read_lock() is taken on by the start of execution of the workqueue item, of rcu_read_unlock() by the end of execution of the workqueue item, and of synchronize_rcu() by flush_workqueue()/flush_work(). In the future we might need to apply some gcc attribute or sparse annotation to the function passed to INIT_WORK(). Paul's ack below is for this RCU usage. (Includes fixes by Alan Cox <alan@linux.intel.com>, David L Stevens <dlstevens@us.ibm.com>, Chris Wright <chrisw@redhat.com>) Acked-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|