History log of /linux-master/virt/kvm/eventfd.c
Revision Date Author Comments
# 8ed26ab8 17-Oct-2023 Paolo Bonzini <pbonzini@redhat.com>

KVM: clean up directives to compile out irqfds

Keep all #ifdef CONFIG_HAVE_KVM_IRQCHIP parts of eventfd.c together, and
compile out the irqfds field of struct kvm if the symbol is not defined.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# c5b31cc2 17-Oct-2023 Paolo Bonzini <pbonzini@redhat.com>

KVM: remove CONFIG_HAVE_KVM_IRQFD

All platforms with a kernel irqchip have support for irqfd. Unify the
two configuration items so that userspace can expect to use irqfd to
inject interrupts into the irqchip.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 3652117f 22-Nov-2023 Christian Brauner <brauner@kernel.org>

eventfd: simplify eventfd_signal()

Ever since the eventfd type was introduced back in 2007 in commit
e1ad7468c77d ("signal/timer/event: eventfd core") the eventfd_signal()
function only ever passed 1 as a value for @n. There's no point in
keeping that additional argument.

Link: https://lore.kernel.org/r/20231122-vfs-eventfd-signal-v2-2-bd549b14ce0c@kernel.org
Acked-by: Xu Yilun <yilun.xu@intel.com>
Acked-by: Andrew Donnellan <ajd@linux.ibm.com> # ocxl
Acked-by: Eric Farman <farman@linux.ibm.com> # s390
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# cc77b95a 07-Feb-2023 Wei Wang <wei.w.wang@intel.com>

kvm/eventfd: use list_for_each_entry when deassign ioeventfd

Simpify kvm_deassign_ioeventfd_idx to use list_for_each_entry as the
loop just ends at the entry that's found and deleted.

Note, coalesced_mmio_ops and ioeventfd_ops are the only instances of
kvm_io_device_ops that implement a destructor, all other callers of
kvm_io_bus_unregister_dev() are unaffected by this change.

Suggested-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230207123713.3905-3-wei.w.wang@intel.com
[sean: call out that only select users implement a destructor]
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 5ea5ca3c 07-Feb-2023 Wei Wang <wei.w.wang@intel.com>

KVM: destruct kvm_io_device while unregistering it from kvm_io_bus

Current usage of kvm_io_device requires users to destruct it with an extra
call of kvm_iodevice_destructor after the device gets unregistered from
kvm_io_bus. This is not necessary and can cause errors if a user forgot
to make the extra call.

Simplify the usage by combining kvm_iodevice_destructor into
kvm_io_bus_unregister_dev. This reduces LOCs a bit for users and can
avoid the leakage of destructing the device explicitly.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230207123713.3905-2-wei.w.wang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 70b0bc4c 27-Mar-2023 Michal Luczaj <mhal@rbox.co>

KVM: Don't kfree(NULL) on kzalloc() failure in kvm_assign_ioeventfd_idx()

On kzalloc() failure, taking the `goto fail` path leads to kfree(NULL).
Such no-op has no use. Move it out.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/r/20230327175457.735903-1-mhal@rbox.co
Signed-off-by: Sean Christopherson <seanjc@google.com>


# fef8f2b9 22-Mar-2023 Dmytro Maluka <dmy@semihalf.com>

KVM: x86/ioapic: Resample the pending state of an IRQ when unmasking

KVM irqfd based emulation of level-triggered interrupts doesn't work
quite correctly in some cases, particularly in the case of interrupts
that are handled in a Linux guest as oneshot interrupts (IRQF_ONESHOT).
Such an interrupt is acked to the device in its threaded irq handler,
i.e. later than it is acked to the interrupt controller (EOI at the end
of hardirq), not earlier.

Linux keeps such interrupt masked until its threaded handler finishes,
to prevent the EOI from re-asserting an unacknowledged interrupt.
However, with KVM + vfio (or whatever is listening on the resamplefd)
we always notify resamplefd at the EOI, so vfio prematurely unmasks the
host physical IRQ, thus a new physical interrupt is fired in the host.
This extra interrupt in the host is not a problem per se. The problem is
that it is unconditionally queued for injection into the guest, so the
guest sees an extra bogus interrupt. [*]

There are observed at least 2 user-visible issues caused by those
extra erroneous interrupts for a oneshot irq in the guest:

1. System suspend aborted due to a pending wakeup interrupt from
ChromeOS EC (drivers/platform/chrome/cros_ec.c).
2. Annoying "invalid report id data" errors from ELAN0000 touchpad
(drivers/input/mouse/elan_i2c_core.c), flooding the guest dmesg
every time the touchpad is touched.

The core issue here is that by the time when the guest unmasks the IRQ,
the physical IRQ line is no longer asserted (since the guest has
acked the interrupt to the device in the meantime), yet we
unconditionally inject the interrupt queued into the guest by the
previous resampling. So to fix the issue, we need a way to detect that
the IRQ is no longer pending, and cancel the queued interrupt in this
case.

With IOAPIC we are not able to probe the physical IRQ line state
directly (at least not if the underlying physical interrupt controller
is an IOAPIC too), so in this patch we use irqfd resampler for that.
Namely, instead of injecting the queued interrupt, we just notify the
resampler that this interrupt is done. If the IRQ line is actually
already deasserted, we are done. If it is still asserted, a new
interrupt will be shortly triggered through irqfd and injected into the
guest.

In the case if there is no irqfd resampler registered for this IRQ, we
cannot fix the issue, so we keep the existing behavior: immediately
unconditionally inject the queued interrupt.

This patch fixes the issue for x86 IOAPIC only. In the long run, we can
fix it for other irqchips and other architectures too, possibly taking
advantage of reading the physical state of the IRQ line, which is
possible with some other irqchips (e.g. with arm64 GIC, maybe even with
the legacy x86 PIC).

[*] In this description we assume that the interrupt is a physical host
interrupt forwarded to the guest e.g. by vfio. Potentially the same
issue may occur also with a purely virtual interrupt from an
emulated device, e.g. if the guest handles this interrupt, again, as
a oneshot interrupt.

Signed-off-by: Dmytro Maluka <dmy@semihalf.com>
Link: https://lore.kernel.org/kvm/31420943-8c5f-125c-a5ee-d2fde2700083@semihalf.com/
Link: https://lore.kernel.org/lkml/87o7wrug0w.wl-maz@kernel.org/
Message-Id: <20230322204344.50138-3-dmy@semihalf.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# d583fbd7 22-Mar-2023 Dmytro Maluka <dmy@semihalf.com>

KVM: irqfd: Make resampler_list an RCU list

It is useful to be able to do read-only traversal of the list of all the
registered irqfd resamplers without locking the resampler_lock mutex.
In particular, we are going to traverse it to search for a resampler
registered for the given irq of an irqchip, and that will be done with
an irqchip spinlock (ioapic->lock) held, so it is undesirable to lock a
mutex in this context. So turn this list into an RCU list.

For protecting the read side, reuse kvm->irq_srcu which is already used
for protecting a number of irq related things (kvm->irq_routing,
irqfd->resampler->list, kvm->irq_ack_notifier_list,
kvm->arch.mask_notifier_list).

Signed-off-by: Dmytro Maluka <dmy@semihalf.com>
Message-Id: <20230322204344.50138-2-dmy@semihalf.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# e332b55f 19-May-2022 Wanpeng Li <wanpengli@tencent.com>

KVM: eventfd: Fix false positive RCU usage warning

The splat below can be seen when running kvm-unit-test:

=============================
WARNING: suspicious RCU usage
5.18.0-rc7 #5 Tainted: G IOE
-----------------------------
/home/kernel/linux/arch/x86/kvm/../../../virt/kvm/eventfd.c:80 RCU-list traversed in non-reader section!!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
4 locks held by qemu-system-x86/35124:
#0: ffff9725391d80b8 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x77/0x710 [kvm]
#1: ffffbd25cfb2a0b8 (&kvm->srcu){....}-{0:0}, at: vcpu_enter_guest+0xdeb/0x1900 [kvm]
#2: ffffbd25cfb2b920 (&kvm->irq_srcu){....}-{0:0}, at: kvm_hv_notify_acked_sint+0x79/0x1e0 [kvm]
#3: ffffbd25cfb2b920 (&kvm->irq_srcu){....}-{0:0}, at: irqfd_resampler_ack+0x5/0x110 [kvm]

stack backtrace:
CPU: 2 PID: 35124 Comm: qemu-system-x86 Tainted: G IOE 5.18.0-rc7 #5
Call Trace:
<TASK>
dump_stack_lvl+0x6c/0x9b
irqfd_resampler_ack+0xfd/0x110 [kvm]
kvm_notify_acked_gsi+0x32/0x90 [kvm]
kvm_hv_notify_acked_sint+0xc5/0x1e0 [kvm]
kvm_hv_set_msr_common+0xec1/0x1160 [kvm]
kvm_set_msr_common+0x7c3/0xf60 [kvm]
vmx_set_msr+0x394/0x1240 [kvm_intel]
kvm_set_msr_ignored_check+0x86/0x200 [kvm]
kvm_emulate_wrmsr+0x4f/0x1f0 [kvm]
vmx_handle_exit+0x6fb/0x7e0 [kvm_intel]
vcpu_enter_guest+0xe5a/0x1900 [kvm]
kvm_arch_vcpu_ioctl_run+0x16e/0xac0 [kvm]
kvm_vcpu_ioctl+0x279/0x710 [kvm]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

resampler-list is protected by irq_srcu (see kvm_irqfd_assign), so fix
the false positive by using list_for_each_entry_srcu().

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1652950153-12489-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 6a0c6170 26-Jan-2022 Hou Wenlong <houwenlong93@linux.alibaba.com>

KVM: eventfd: Fix false positive RCU usage warning

Fix the following false positive warning:
=============================
WARNING: suspicious RCU usage
5.16.0-rc4+ #57 Not tainted
-----------------------------
arch/x86/kvm/../../../virt/kvm/eventfd.c:484 RCU-list traversed in non-reader section!!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
3 locks held by fc_vcpu 0/330:
#0: ffff8884835fc0b0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x88/0x6f0 [kvm]
#1: ffffc90004c0bb68 (&kvm->srcu){....}-{0:0}, at: vcpu_enter_guest+0x600/0x1860 [kvm]
#2: ffffc90004c0c1d0 (&kvm->irq_srcu){....}-{0:0}, at: kvm_notify_acked_irq+0x36/0x180 [kvm]

stack backtrace:
CPU: 26 PID: 330 Comm: fc_vcpu 0 Not tainted 5.16.0-rc4+
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x44/0x57
kvm_notify_acked_gsi+0x6b/0x70 [kvm]
kvm_notify_acked_irq+0x8d/0x180 [kvm]
kvm_ioapic_update_eoi+0x92/0x240 [kvm]
kvm_apic_set_eoi_accelerated+0x2a/0xe0 [kvm]
handle_apic_eoi_induced+0x3d/0x60 [kvm_intel]
vmx_handle_exit+0x19c/0x6a0 [kvm_intel]
vcpu_enter_guest+0x66e/0x1860 [kvm]
kvm_arch_vcpu_ioctl_run+0x438/0x7f0 [kvm]
kvm_vcpu_ioctl+0x38a/0x6f0 [kvm]
__x64_sys_ioctl+0x89/0xc0
do_syscall_64+0x3a/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

Since kvm_unregister_irq_ack_notifier() does synchronize_srcu(&kvm->irq_srcu),
kvm->irq_ack_notifier_list is protected by kvm->irq_srcu. In fact,
kvm->irq_srcu SRCU read lock is held in kvm_notify_acked_irq(), making it
a false positive warning. So use hlist_for_each_entry_srcu() instead of
hlist_for_each_entry_rcu().

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
Message-Id: <f98bac4f5052bad2c26df9ad50f7019e40434512.1643265976.git.houwenlong.hwl@antgroup.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 515a0c79 27-Aug-2021 Longpeng(Mike) <longpeng2@huawei.com>

kvm: irqfd: avoid update unmodified entries of the routing

All of the irqfds would to be updated when update the irq
routing, it's too expensive if there're too many irqfds.

However we can reduce the cost by avoid some unnecessary
updates. For irqs of MSI type on X86, the update can be
saved if the msi values are not change.

The vfio migration could receives benefit from this optimi-
zaiton. The test VM has 128 vcpus and 8 VF (with 65 vectors
enabled), so the VM has more than 520 irqfds. We mesure the
cost of the vfio_msix_enable (in QEMU, it would set routing
for each irqfd) for each VF, and we can see the total cost
can be significantly reduced.

Origin Apply this Patch
1st 8 4
2nd 15 5
3rd 22 6
4th 24 6
5th 36 7
6th 44 7
7th 51 8
8th 58 8
Total 258ms 51ms

We're also tring to optimize the QEMU part [1], but it's still
worth to optimize the KVM to gain more benefits.

[1] https://lists.gnu.org/archive/html/qemu-devel/2021-08/msg04215.html

Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Message-Id: <20210827080003.2689-1-longpeng2@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# b59e00dd 27-Oct-2020 David Woodhouse <dwmw@amazon.co.uk>

kvm/eventfd: Drain events from eventfd in irqfd_wakeup()

Don't allow the events to accumulate in the eventfd counter, drain them
as they are handled.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20201027135523.646811-4-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# e8dbf195 26-Oct-2020 David Woodhouse <dwmw@amazon.co.uk>

kvm/eventfd: Use priority waitqueue to catch events before userspace

As far as I can tell, when we use posted interrupts we silently cut off
the events from userspace, if it's listening on the same eventfd that
feeds the irqfd.

I like that behaviour. Let's do it all the time, even without posted
interrupts. It makes it much easier to handle IRQ remapping invalidation
without having to constantly add/remove the fd from the userspace poll
set. We can just leave userspace polling on it, and the bypass will...
well... bypass it.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20201026175325.585623-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 2fc4f15d 10-Sep-2020 Yi Li <yili@winhong.com>

kvm/eventfd: move wildcard calculation outside loop

There is no need to calculate wildcard in each iteration
since wildcard is not changed.

Signed-off-by: Yi Li <yili@winhong.com>
Message-Id: <20200911055652.3041762-1-yili@winhong.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 5c73b9a2 20-Jul-2020 Ahmed S. Darwish <a.darwish@linutronix.de>

kvm/eventfd: Use sequence counter with associated spinlock

A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_spinlock_t data type, which allows to associate a
spinlock with the sequence counter. This enables lockdep to verify that
the spinlock used for writer serialization is held when the write side
critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lkml.kernel.org/r/20200720155530.1173732-24-a.darwish@linutronix.de


# 656012c7 01-Apr-2020 Fuad Tabba <tabba@google.com>

KVM: Fix spelling in code comments

Fix spelling and typos (e.g., repeated words) in comments.

Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200401140310.29701-1-tabba@google.com


# c4e115f0 20-Apr-2020 Jason Yan <yanaijie@huawei.com>

kvm/eventfd: remove unneeded conversion to bool

The '==' expression itself is bool, no need to convert it to bool again.
This fixes the following coccicheck warning:

virt/kvm/eventfd.c:724:38-43: WARNING: conversion to bool not needed
here

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Message-Id: <20200420123805.4494-1-yanaijie@huawei.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 775c8a3d 04-Jun-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 504

Based on 1 normalized pattern(s):

this file is free software you can redistribute it and or modify it
under the terms of version 2 of the gnu general public license as
published by the free software foundation this program is
distributed in the hope that it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details you should have received a copy of the gnu general
public license along with this program if not write to the free
software foundation inc 51 franklin st fifth floor boston ma 02110
1301 usa

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 8 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081207.443595178@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 654f1f13 05-May-2019 Peter Xu <peterx@redhat.com>

kvm: Check irqchip mode before assign irqfd

When assigning kvm irqfd we didn't check the irqchip mode but we allow
KVM_IRQFD to succeed with all the irqchip modes. However it does not
make much sense to create irqfd even without the kernel chips. Let's
provide a arch-dependent helper to check whether a specific irqfd is
allowed by the arch. At least for x86, it should make sense to check:

- when irqchip mode is NONE, all irqfds should be disallowed, and,

- when irqchip mode is SPLIT, irqfds that are with resamplefd should
be disallowed.

For either of the case, previously we'll silently ignore the irq or
the irq ack event if the irqchip mode is incorrect. However that can
cause misterious guest behaviors and it can be hard to triage. Let's
fail KVM_IRQFD even earlier to detect these incorrect configurations.

CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Radim Krčmář <rkrcmar@redhat.com>
CC: Alex Williamson <alex.williamson@redhat.com>
CC: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# ca0488aa 15-Mar-2019 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

kvm: don't redefine flags as something else

The function irqfd_wakeup() has flags defined as __poll_t and then it
has additional flags which is used for irqflags.

Redefine the inner flags variable as iflags so it does not shadow the
outer flags.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# b12ce36a 11-Feb-2019 Ben Gardon <bgardon@google.com>

kvm: Add memcg accounting to KVM allocations

There are many KVM kernel memory allocations which are tied to the life of
the VM process and should be charged to the VM process's cgroup. If the
allocations aren't tied to the process, the OOM killer will not know
that killing the process will free the associated kernel memory.
Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
charged to the VM process's cgroup.

Tested:
Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
introduced no new failures.
Ran a kernel memory accounting test which creates a VM to touch
memory and then checks that the kernel memory allocated for the
process is within certain bounds.
With this patch we account for much more of the vmalloc and slab memory
allocated for the VM.

There remain a few allocations which should be charged to the VM's
cgroup but are not. In they include:
vcpu->run
kvm->coalesced_mmio_ring
There allocations are unaccounted in this patch because they are mapped
to userspace, and accounting them to a cgroup causes problems. This
should be addressed in a future patch.

Signed-off-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 9432a317 28-May-2018 Paolo Bonzini <pbonzini@redhat.com>

KVM: irqfd: fix race between EPOLLHUP and irq_bypass_register_consumer

A comment warning against this bug is there, but the code is not doing what
the comment says. Therefore it is possible that an EPOLLHUP races against
irq_bypass_register_consumer. The EPOLLHUP handler schedules irqfd_shutdown,
and if that runs soon enough, you get a use-after-free.

Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>


# b5020a8e 21-Dec-2017 Lan Tianyu <tianyu.lan@intel.com>

KVM/Eventfd: Avoid crash when assign and deassign specific eventfd in parallel.

Syzbot reports crashes in kvm_irqfd_assign(), caused by use-after-free
when kvm_irqfd_assign() and kvm_irqfd_deassign() run in parallel
for one specific eventfd. When the assign path hasn't finished but irqfd
has been added to kvm->irqfds.items list, another thead may deassign the
eventfd and free struct kvm_kernel_irqfd(). The assign path then uses
the struct kvm_kernel_irqfd that has been freed by deassign path. To avoid
such issue, keep irqfd under kvm->irq_srcu protection after the irqfd
has been added to kvm->irqfds.items list, and call synchronize_srcu()
in irq_shutdown() to make sure that irqfd has been fully initialized in
the assign path.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Tianyu Lan <tianyu.lan@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 9965ed17 05-Mar-2018 Christoph Hellwig <hch@lst.de>

fs: add new vfs_poll and file_can_poll helpers

These abstract out calls to the poll method in preparation for changes
in how we poll.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>


# a9a08845 11-Feb-2018 Linus Torvalds <torvalds@linux-foundation.org>

vfs: do bulk POLL* -> EPOLL* replacement

This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:

for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do. But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we
should be all done.

Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3ad6f93e 03-Jul-2017 Al Viro <viro@zeniv.linux.org.uk>

annotate poll-related wait keys

__poll_t is also used as wait key in some waitqueues.
Verify that wait_..._poll() gets __poll_t as key and
provide a helper for wakeup functions to get back to
that __poll_t value.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e6c8adca 03-Jul-2017 Al Viro <viro@zeniv.linux.org.uk>

anntotate the places where ->poll() return values go

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 95e2a3b3 16-Sep-2017 Jan H. Schönherr <jschoenh@amazon.de>

Revert "KVM: Don't accept obviously wrong gsi values via KVM_IRQFD"

This reverts commit 36ae3c0a36b7456432fedce38ae2f7bd3e01a563.

The commit broke compilation on !CONFIG_HAVE_KVM_IRQ_ROUTING. Also,
there may be cases with CONFIG_HAVE_KVM_IRQ_ROUTING, where larger
gsi values make sense.

As the commit was meant as an early indicator to user space that
something is wrong, reverting just restores the previous behavior
where overly large values are ignored when encountered (without
any direct feedback).

Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>


# 36ae3c0a 07-Sep-2017 Jan H. Schönherr <jschoenh@amazon.de>

KVM: Don't accept obviously wrong gsi values via KVM_IRQFD

We cannot add routes for gsi values >= KVM_MAX_IRQ_ROUTES -- see
kvm_set_irq_routing(). Hence, there is no sense in accepting them
via KVM_IRQFD. Prevent them from entering the system in the first
place.

Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 4a12f951 07-Jul-2017 Christian Borntraeger <borntraeger@de.ibm.com>

KVM: mark kvm->busses as rcu protected

mark kvm->busses as rcu protected and use the correct access
function everywhere.

found by sparse
virt/kvm/kvm_main.c:3490:15: error: incompatible types in comparison expression (different address spaces)
virt/kvm/kvm_main.c:3509:15: error: incompatible types in comparison expression (different address spaces)
virt/kvm/kvm_main.c:3561:15: error: incompatible types in comparison expression (different address spaces)
virt/kvm/kvm_main.c:3644:15: error: incompatible types in comparison expression (different address spaces)

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# ac6424b9 19-Jun-2017 Ingo Molnar <mingo@kernel.org>

sched/wait: Rename wait_queue_t => wait_queue_entry_t

Rename:

wait_queue_t => wait_queue_entry_t

'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.

Start sorting this out by renaming it to 'wait_queue_entry_t'.

This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 993225ad 07-Apr-2017 David Hildenbrand <david@redhat.com>

KVM: x86: rename kvm_vcpu_request_scan_ioapic()

Let's rename it into a proper arch specific callback.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>


# 90db1043 23-Mar-2017 David Hildenbrand <david@redhat.com>

KVM: kvm_io_bus_unregister_dev() should never fail

No caller currently checks the return value of
kvm_io_bus_unregister_dev(). This is evil, as all callers silently go on
freeing their device. A stale reference will remain in the io_bus,
getting at least used again, when the iobus gets teared down on
kvm_destroy_vm() - leading to use after free errors.

There is nothing the callers could do, except retrying over and over
again.

So let's simply remove the bus altogether, print an error and make
sure no one can access this broken bus again (returning -ENOMEM on any
attempt to access it).

Fixes: e93f8a0f821e ("KVM: convert io_bus to SRCU")
Cc: stable@vger.kernel.org # 3.4+
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 36343f6e 26-Oct-2016 Paolo Bonzini <pbonzini@redhat.com>

KVM: fix OOPS on flush_work

The conversion done by commit 3706feacd007 ("KVM: Remove deprecated
create_singlethread_workqueue") is broken. It flushes a single work
item &irqfd->shutdown instead of all of them, and even worse if there
is no irqfd on the list then you get a NULL pointer dereference.
Revert the virt/kvm/eventfd.c part of that patch; to avoid the
deprecated function, just allocate our own workqueue---it does
not even have to be unbound---with alloc_workqueue.

Fixes: 3706feacd007
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 3706feac 30-Aug-2016 Bhaktipriya Shridhar <bhaktipriya96@gmail.com>

KVM: Remove deprecated create_singlethread_workqueue

The workqueue "irqfd_cleanup_wq" queues a single work item
&irqfd->shutdown and hence doesn't require ordering. It is a host-wide
workqueue for issuing deferred shutdown requests aggregated from all
vm* instances. It is not being used on a memory reclaim path.
Hence, it has been converted to use system_wq.
The work item has been flushed in kvm_irqfd_release().

The workqueue "wqueue" queues a single work item &timer->expired
and hence doesn't require ordering. Also, it is not being used on
a memory reclaim path. Hence, it has been converted to use system_wq.

System workqueues have been able to handle high level of concurrency
for a long time now and hence it's not required to have a singlethreaded
workqueue just to gain concurrency. Unlike a dedicated per-cpu workqueue
created with create_singlethread_workqueue(), system_wq allows multiple
work items to overlap executions even on the same CPU; however, a
per-cpu workqueue doesn't have any CPU locality or global ordering
guarantee unless the target CPU is explicitly specified and thus the
increase of local concurrency shouldn't make any difference.

Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 14717e20 05-May-2016 Alex Williamson <alex.williamson@redhat.com>

kvm: Conditionally register IRQ bypass consumer

If we don't support a mechanism for bypassing IRQs, don't register as
a consumer. This eliminates meaningless dev_info()s when the connect
fails between producer and consumer, such as on AMD systems where
kvm_x86_ops->update_pi_irte is not implemented

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# b97e6de9 28-Oct-2015 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: merge kvm_arch_set_irq with kvm_set_msi_inatomic

We do not want to do too much work in atomic context, in particular
not walking all the VCPUs of the virtual machine. So we want
to distinguish the architecture-specific injection function for irqfd
from kvm_set_msi. Since it's still empty, reuse the newly added
kvm_arch_set_irq and rename it to kvm_arch_set_irq_inatomic.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# c9a5ecca 16-Oct-2015 Andrey Smetanin <asmetanin@virtuozzo.com>

kvm/eventfd: add arch-specific set_irq

Allow for arch-specific interrupt types to be set. For that, add
kvm_arch_set_irq() which takes interrupt type-specific action if it
recognizes the interrupt type given, and -EWOULDBLOCK otherwise.

The default implementation always returns -EWOULDBLOCK.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# ba1aefcd 16-Oct-2015 Andrey Smetanin <asmetanin@virtuozzo.com>

kvm/eventfd: factor out kvm_notify_acked_gsi()

Factor out kvm_notify_acked_gsi() helper to iterate over EOI listeners
and notify those matching the given gsi.

It will be reused in the upcoming Hyper-V SynIC implementation.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 351dc647 16-Oct-2015 Andrey Smetanin <asmetanin@virtuozzo.com>

kvm/eventfd: avoid loop inside irqfd_update()

The loop(for) inside irqfd_update() is unnecessary
because any other value for irq_entry.type will just trigger
schedule_work(&irqfd->inject) in irqfd_wakeup.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# f70c20aa 18-Sep-2015 Feng Wu <feng.wu@intel.com>

KVM: Add an arch specific hooks in 'struct kvm_kernel_irqfd'

This patch adds an arch specific hooks 'arch_update' in
'struct kvm_kernel_irqfd'. On Intel side, it is used to
update the IRTE when VT-d posted-interrupts is used.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 9016cfb5 18-Sep-2015 Eric Auger <eric.auger@linaro.org>

KVM: eventfd: add irq bypass consumer management

This patch adds the registration/unregistration of an
irq_bypass_consumer on irqfd assignment/deassignment.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 1a02b270 18-Sep-2015 Eric Auger <eric.auger@linaro.org>

KVM: introduce kvm_arch functions for IRQ bypass

This patch introduces
- kvm_arch_irq_bypass_add_producer
- kvm_arch_irq_bypass_del_producer
- kvm_arch_irq_bypass_stop
- kvm_arch_irq_bypass_start

They make possible to specialize the KVM IRQ bypass consumer in
case CONFIG_KVM_HAVE_IRQ_BYPASS is set.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
[Add weak implementations of the callbacks. - Feng]
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 166c9775 18-Sep-2015 Eric Auger <eric.auger@linaro.org>

KVM: create kvm_irqfd.h

Move _irqfd_resampler and _irqfd struct declarations in a new
public header: kvm_irqfd.h. They are respectively renamed into
kvm_kernel_irqfd_resampler and kvm_kernel_irqfd. Those datatypes
will be used by architecture specific code, in the context of
IRQ bypass manager integration.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# e9ea5069 15-Sep-2015 Jason Wang <jasowang@redhat.com>

kvm: add capability for any-length ioeventfds

Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# eefd6b06 15-Sep-2015 Jason Wang <jasowang@redhat.com>

kvm: fix double free for fast mmio eventfd

We register wildcard mmio eventfd on two buses, once for KVM_MMIO_BUS
and once on KVM_FAST_MMIO_BUS but with a single iodev
instance. This will lead to an issue: kvm_io_bus_destroy() knows
nothing about the devices on two buses pointing to a single dev. Which
will lead to double free[1] during exit. Fix this by allocating two
instances of iodevs then registering one on KVM_MMIO_BUS and another
on KVM_FAST_MMIO_BUS.

CPU: 1 PID: 2894 Comm: qemu-system-x86 Not tainted 3.19.0-26-generic #28-Ubuntu
Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET96WW (2.56 ) 09/12/2013
task: ffff88009ae0c4b0 ti: ffff88020e7f0000 task.ti: ffff88020e7f0000
RIP: 0010:[<ffffffffc07e25d8>] [<ffffffffc07e25d8>] ioeventfd_release+0x28/0x60 [kvm]
RSP: 0018:ffff88020e7f3bc8 EFLAGS: 00010292
RAX: dead000000200200 RBX: ffff8801ec19c900 RCX: 000000018200016d
RDX: ffff8801ec19cf80 RSI: ffffea0008bf1d40 RDI: ffff8801ec19c900
RBP: ffff88020e7f3bd8 R08: 000000002fc75a01 R09: 000000018200016d
R10: ffffffffc07df6ae R11: ffff88022fc75a98 R12: ffff88021e7cc000
R13: ffff88021e7cca48 R14: ffff88021e7cca50 R15: ffff8801ec19c880
FS: 00007fc1ee3e6700(0000) GS:ffff88023e240000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8f389d8000 CR3: 000000023dc13000 CR4: 00000000001427e0
Stack:
ffff88021e7cc000 0000000000000000 ffff88020e7f3be8 ffffffffc07e2622
ffff88020e7f3c38 ffffffffc07df69a ffff880232524160 ffff88020e792d80
0000000000000000 ffff880219b78c00 0000000000000008 ffff8802321686a8
Call Trace:
[<ffffffffc07e2622>] ioeventfd_destructor+0x12/0x20 [kvm]
[<ffffffffc07df69a>] kvm_put_kvm+0xca/0x210 [kvm]
[<ffffffffc07df818>] kvm_vcpu_release+0x18/0x20 [kvm]
[<ffffffff811f69f7>] __fput+0xe7/0x250
[<ffffffff811f6bae>] ____fput+0xe/0x10
[<ffffffff81093f04>] task_work_run+0xd4/0xf0
[<ffffffff81079358>] do_exit+0x368/0xa50
[<ffffffff81082c8f>] ? recalc_sigpending+0x1f/0x60
[<ffffffff81079ad5>] do_group_exit+0x45/0xb0
[<ffffffff81085c71>] get_signal+0x291/0x750
[<ffffffff810144d8>] do_signal+0x28/0xab0
[<ffffffff810f3a3b>] ? do_futex+0xdb/0x5d0
[<ffffffff810b7028>] ? __wake_up_locked_key+0x18/0x20
[<ffffffff810f3fa6>] ? SyS_futex+0x76/0x170
[<ffffffff81014fc9>] do_notify_resume+0x69/0xb0
[<ffffffff817cb9af>] int_signal+0x12/0x17
Code: 5d c3 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 7f 20 e8 06 d6 a5 c0 48 8b 43 08 48 8b 13 48 89 df 48 89 42 08 <48> 89 10 48 b8 00 01 10 00 00
RIP [<ffffffffc07e25d8>] ioeventfd_release+0x28/0x60 [kvm]
RSP <ffff88020e7f3bc8>

Cc: stable@vger.kernel.org
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 85da11ca 15-Sep-2015 Jason Wang <jasowang@redhat.com>

kvm: factor out core eventfd assign/deassign logic

This patch factors out core eventfd assign/deassign logic and leaves
the argument checking and bus index selection to callers.

Cc: stable@vger.kernel.org
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 8453fecb 15-Sep-2015 Jason Wang <jasowang@redhat.com>

kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfd

We only want zero length mmio eventfd to be registered on
KVM_FAST_MMIO_BUS. So check this explicitly when arg->len is zero to
make sure this.

Cc: stable@vger.kernel.org
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# af669ac6 26-Mar-2015 Andre Przywara <andre.przywara@arm.com>

KVM: move iodev.h from virt/kvm/ to include/kvm

iodev.h contains definitions for the kvm_io_bus framework. This is
needed both by the generic KVM code in virt/kvm as well as by
architecture specific code under arch/. Putting the header file in
virt/kvm and using local includes in the architecture part seems at
least dodgy to me, so let's move the file into include/kvm, so that a
more natural "#include <kvm/iodev.h>" can be used by all of the code.
This also solves a problem later when using struct kvm_io_device
in arm_vgic.h.
Fixing up the FSF address in the GPL header and a wrong include path
on the way.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>


# e32edf4f 26-Mar-2015 Nikolay Nikolaev <n.nikolaev@virtualopensystems.com>

KVM: Redesign kvm_io_bus_ API to pass VCPU structure to the callbacks.

This is needed in e.g. ARM vGIC emulation, where the MMIO handling
depends on the VCPU that does the access.

Signed-off-by: Nikolay Nikolaev <n.nikolaev@virtualopensystems.com>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>


# 01c94e64 04-Mar-2015 Eric Auger <eric.auger@linaro.org>

KVM: introduce kvm_arch_intc_initialized and use it in irqfd

Introduce __KVM_HAVE_ARCH_INTC_INITIALIZED define and
associated kvm_arch_intc_initialized function. This latter
allows to test whether the virtual interrupt controller is initialized
and ready to accept virtual IRQ injection. On some architectures,
the virtual interrupt controller is dynamically instantiated, justifying
that kind of check.

The new function can now be used by irqfd to check whether the
virtual interrupt controller is ready on KVM_IRQFD request. If not,
KVM_IRQFD returns -EAGAIN.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>


# 6ef768fa 20-Nov-2014 Paolo Bonzini <pbonzini@redhat.com>

kvm: x86: move ioapic.c and irq_comm.c back to arch/x86/

ia64 does not need them anymore. Ack notifiers become x86-specific
too.

Suggested-by: Gleb Natapov <gleb@kernel.org>
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 29f1b65b 22-Sep-2014 Christoffer Dall <christoffer.dall@linaro.org>

KVM: EVENTFD: Remove inclusion of irq.h

Commit c77dcac (KVM: Move more code under CONFIG_HAVE_KVM_IRQFD) added
functionality that depends on definitions in ioapic.h when
__KVM_HAVE_IOAPIC is defined.

At the same time, kvm-arm commit 0ba0951 (KVM: EVENTFD: remove inclusion
of irq.h) removed the inclusion of irq.h, an architecture-specific header
that is not present on ARM but which happened to include ioapic.h on x86.

Include ioapic.h directly in eventfd.c if __KVM_HAVE_IOAPIC is defined.
This fixes x86 and lets ARM use eventfd.c.

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 0ba09511 01-Sep-2014 Eric Auger <eric.auger@linaro.org>

KVM: EVENTFD: remove inclusion of irq.h

No more needed. irq.h would be void on ARM.

Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>


# c77dcacb 06-Aug-2014 Paolo Bonzini <pbonzini@redhat.com>

KVM: Move more code under CONFIG_HAVE_KVM_IRQFD

Commits e4d57e1ee1ab (KVM: Move irq notifier implementation into
eventfd.c, 2014-06-30) included the irq notifier code unconditionally
in eventfd.c, while it was under CONFIG_HAVE_KVM_IRQCHIP before.

Similarly, commit 297e21053a52 (KVM: Give IRQFD its own separate enabling
Kconfig option, 2014-06-30) moved code from CONFIG_HAVE_IRQ_ROUTING
to CONFIG_HAVE_KVM_IRQFD but forgot to move the pieces that used to be
under CONFIG_HAVE_KVM_IRQCHIP.

Together, this broke compilation without CONFIG_KVM_XICS. Fix by adding
or changing the #ifdefs so that they point at CONFIG_HAVE_KVM_IRQFD.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 297e2105 30-Jun-2014 Paul Mackerras <paulus@samba.org>

KVM: Give IRQFD its own separate enabling Kconfig option

Currently, the IRQFD code is conditional on CONFIG_HAVE_KVM_IRQ_ROUTING.
So that we can have the IRQFD code compiled in without having the
IRQ routing code, this creates a new CONFIG_HAVE_KVM_IRQFD, makes
the IRQFD code conditional on it instead of CONFIG_HAVE_KVM_IRQ_ROUTING,
and makes all the platforms that currently select HAVE_KVM_IRQ_ROUTING
also select HAVE_KVM_IRQFD.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# e4d57e1e 30-Jun-2014 Paul Mackerras <paulus@samba.org>

KVM: Move irq notifier implementation into eventfd.c

This moves the functions kvm_irq_has_notifier(), kvm_notify_acked_irq(),
kvm_register_irq_ack_notifier() and kvm_unregister_irq_ack_notifier()
from irqchip.c to eventfd.c. The reason for doing this is that those
functions are used in connection with IRQFDs, which are implemented in
eventfd.c. In future we will want to use IRQFDs on platforms that
don't implement the GSI routing implemented in irqchip.c, so we won't
be compiling in irqchip.c, but we still need the irq notifiers. The
implementation is unchanged.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 9957c86d 30-Jun-2014 Paul Mackerras <paulus@samba.org>

KVM: Move all accesses to kvm::irq_routing into irqchip.c

Now that struct _irqfd does not keep a reference to storage pointed
to by the irq_routing field of struct kvm, we can move the statement
that updates it out from under the irqfds.lock and put it in
kvm_set_irq_routing() instead. That means we then have to take a
srcu_read_lock on kvm->irq_srcu around the irqfd_update call in
kvm_irqfd_assign(), since holding the kvm->irqfds.lock no longer
ensures that that the routing can't change.

Combined with changing kvm_irq_map_gsi() and kvm_irq_map_chip_pin()
to take a struct kvm * argument instead of the pointer to the routing
table, this allows us to to move all references to kvm->irq_routing
into irqchip.c. That in turn allows us to move the definition of the
kvm_irq_routing_table struct into irqchip.c as well.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 8ba918d4 30-Jun-2014 Paul Mackerras <paulus@samba.org>

KVM: irqchip: Provide and use accessors for irq routing table

This provides accessor functions for the KVM interrupt mappings, in
order to reduce the amount of code that accesses the fields of the
kvm_irq_routing_table struct, and restrict that code to one file,
virt/kvm/irqchip.c. The new functions are kvm_irq_map_gsi(), which
maps from a global interrupt number to a set of IRQ routing entries,
and kvm_irq_map_chip_pin, which maps from IRQ chip and pin numbers to
a global interrupt number.

This also moves the update of kvm_irq_routing_table::chip[][]
into irqchip.c, out of the various kvm_set_routing_entry
implementations. That means that none of the kvm_set_routing_entry
implementations need the kvm_irq_routing_table argument anymore,
so this removes it.

This does not change any locking or data lifetime rules.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 56f89f36 30-Jun-2014 Paul Mackerras <paulus@samba.org>

KVM: Don't keep reference to irq routing table in irqfd struct

This makes the irqfd code keep a copy of the irq routing table entry
for each irqfd, rather than a reference to the copy in the actual
irq routing table maintained in kvm/virt/irqchip.c. This will enable
us to change the routing table structure in future, or even not have a
routing table at all on some platforms.

The synchronization that was previously achieved using srcu_dereference
on the read side is now achieved using a seqcount_t structure. That
ensures that we don't get a halfway-updated copy of the structure if
we read it while another thread is updating it.

We still use srcu_read_lock/unlock around the read side so that when
changing the routing table we can be sure that after calling
synchronize_srcu, nothing will be using the old routing.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 719d93cd 16-Jan-2014 Christian Borntraeger <borntraeger@de.ibm.com>

kvm/irqchip: Speed up KVM_SET_GSI_ROUTING

When starting lots of dataplane devices the bootup takes very long on
Christian's s390 with irqfd patches. With larger setups he is even
able to trigger some timeouts in some components. Turns out that the
KVM_SET_GSI_ROUTING ioctl takes very long (strace claims up to 0.1 sec)
when having multiple CPUs. This is caused by the synchronize_rcu and
the HZ=100 of s390. By changing the code to use a private srcu we can
speed things up. This patch reduces the boot time till mounting root
from 8 to 2 seconds on my s390 guest with 100 disks.

Uses of hlist_for_each_entry_rcu, hlist_add_head_rcu, hlist_del_init_rcu
are fine because they do not have lockdep checks (hlist_for_each_entry_rcu
uses rcu_dereference_raw rather than rcu_dereference, and write-sides
do not do rcu lockdep at all).

Note that we're hardly relying on the "sleepable" part of srcu. We just
want SRCU's faster detection of grace periods.

Testing was done by Andrew Theurer using netperf tests STREAM, MAERTS
and RR. The difference between results "before" and "after" the patch
has mean -0.2% and standard deviation 0.6%. Using a paired t-test on the
data points says that there is a 2.5% probability that the patch is the
cause of the performance difference (rather than a random fluctuation).

(Restricting the t-test to RR, which is the most likely to be affected,
changes the numbers to respectively -0.3% mean, 0.7% stdev, and 8%
probability that the numbers actually say something about the patch.
The probability increases mostly because there are fewer data points).

Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> # s390
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 68c3b4d1 31-Mar-2014 Michael S. Tsirkin <mst@redhat.com>

KVM: VMX: speed up wildcard MMIO EVENTFD

With KVM, MMIO is much slower than PIO, due to the need to
do page walk and emulation. But with EPT, it does not have to be: we
know the address from the VMCS so if the address is unique, we can look
up the eventfd directly, bypassing emulation.

Unfortunately, this only works if userspace does not need to match on
access length and data. The implementation adds a separate FAST_MMIO
bus internally. This serves two purposes:
- minimize overhead for old userspace that does not use eventfd with lengtth = 0
- minimize disruption in other code (since we don't know the length,
devices on the MMIO bus only get a valid address in write, this
way we don't need to touch all devices to teach them to handle
an invalid length)

At the moment, this optimization only has effect for EPT on x86.

It will be possible to speed up MMIO for NPT and MMU using the same
idea in the future.

With this patch applied, on VMX MMIO EVENTFD is essentially as fast as PIO.
I was unable to detect any measureable slowdown to non-eventfd MMIO.

Making MMIO faster is important for the upcoming virtio 1.0 which
includes an MMIO signalling capability.

The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
pre-review and suggestions.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# f848a5a8 31-Mar-2014 Michael S. Tsirkin <mst@redhat.com>

KVM: support any-length wildcard ioeventfd

It is sometimes benefitial to ignore IO size, and only match on address.
In hindsight this would have been a better default than matching length
when KVM_IOEVENTFD_FLAG_DATAMATCH is not set, In particular, this kind
of access can be optimized on VMX: there no need to do page lookups.
This can currently be done with many ioeventfds but in a suboptimal way.

However we can't change kernel/userspace ABI without risk of breaking
some applications.
Use len = 0 to mean "ignore length for matching" in a more optimal way.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 684a0b71 17-Mar-2014 Cornelia Huck <cornelia.huck@de.ibm.com>

KVM: eventfd: Fix lock order inversion.

When registering a new irqfd, we call its ->poll method to collect any
event that might have previously been pending so that we can trigger it.
This is done under the kvm->irqfds.lock, which means the eventfd's ctx
lock is taken under it.

However, if we get a POLLHUP in irqfd_wakeup, we will be called with the
ctx lock held before getting the irqfds.lock to deactivate the irqfd,
causing lockdep to complain.

Calling the ->poll method does not really need the irqfds.lock, so let's
just move it after we've given up the irqfds.lock in kvm_irqfd_assign().

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# cffe78d9 30-Aug-2013 Al Viro <viro@zeniv.linux.org.uk>

kvm eventfd: switch to fdget

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6ea34c9b 24-May-2013 Amos Kong <akong@redhat.com>

kvm: exclude ioeventfd from counting kvm_io_range limit

We can easily reach the 1000 limit by start VM with a couple
hundred I/O devices (multifunction=on). The hardcode limit
already been adjusted 3 times (6 ~ 200 ~ 300 ~ 1000).

In userspace, we already have maximum file descriptor to
limit ioeventfd count. But kvm_io_bus devices also are used
for pit, pic, ioapic, coalesced_mmio. They couldn't be limited
by maximum file descriptor.

Currently only ioeventfds take too much kvm_io_bus devices,
so just exclude it from counting kvm_io_range limit.

Also fixed one indent issue in kvm_host.h

Signed-off-by: Amos Kong <akong@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>


# a725d56a 17-Apr-2013 Alexander Graf <agraf@suse.de>

KVM: Introduce CONFIG_HAVE_KVM_IRQ_ROUTING

Quite a bit of code in KVM has been conditionalized on availability of
IOAPIC emulation. However, most of it is generically applicable to
platforms that don't have an IOPIC, but a different type of irq chip.

Make code that only relies on IRQ routing, not an APIC itself, on
CONFIG_HAVE_KVM_IRQ_ROUTING, so that we can reuse it later.

Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>


# aa2fbe6d 11-Apr-2013 Yang Zhang <yang.z.zhang@Intel.com>

KVM: Let ioapic know the irq line status

Userspace may deliver RTC interrupt without query the status. So we
want to track RTC EOI for this case.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 05e07f9b 04-Apr-2013 Michael S. Tsirkin <mst@redhat.com>

kvm: fix MMIO/PIO collision misdetection

PIO and MMIO are separate address spaces, but
ioeventfd registration code mistakenly detected
two eventfds as duplicate if they use the same address,
even if one is PIO and another one MMIO.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>


# 2b83451b 27-Feb-2013 Cornelia Huck <cornelia.huck@de.ibm.com>

KVM: ioeventfd for virtio-ccw devices.

Enhance KVM_IOEVENTFD with a new flag that allows to attach to virtio-ccw
devices on s390 via the KVM_VIRTIO_CCW_NOTIFY_BUS.

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# a0f155e9 27-Feb-2013 Cornelia Huck <cornelia.huck@de.ibm.com>

KVM: Initialize irqfd from kvm_init().

Currently, eventfd introduces module_init/module_exit functions
to initialize/cleanup the irqfd workqueue. This only works, however,
if no other module_init/module_exit functions are built into the
same module.

Let's just move the initialization and cleanup to kvm_init and kvm_exit.
This way, it is also clearer where kvm startup may fail.

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# b67bfe0d 27-Feb-2013 Sasha Levin <sasha.levin@oracle.com>

hlist: drop the node parameter from iterators

I'm not sure why, but the hlist for each entry iterators were conceived

list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 49f8a1a5 06-Dec-2012 Alex Williamson <alex.williamson@redhat.com>

kvm: Fix irqfd resampler list walk

Typo for the next pointer means we're walking random data here.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 914daba8 08-Oct-2012 Alexander Graf <agraf@suse.de>

KVM: Distangle eventfd code from irqchip

The current eventfd code assumes that when we have eventfd, we also have
irqfd for in-kernel interrupt delivery. This is not necessarily true. On
PPC we don't have an in-kernel irqchip yet, but we can still support easily
support eventfd.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 7a84428a 21-Sep-2012 Alex Williamson <alex.williamson@redhat.com>

KVM: Add resampling irqfds for level triggered interrupts

To emulate level triggered interrupts, add a resample option to
KVM_IRQFD. When specified, a new resamplefd is provided that notifies
the user when the irqchip has been resampled by the VM. This may, for
instance, indicate an EOI. Also in this mode, posting of an interrupt
through an irqfd only asserts the interrupt. On resampling, the
interrupt is automatically de-asserted prior to user notification.
This enables level triggered interrupts to be posted and re-enabled
from vfio with no userspace intervention.

All resampling irqfds can make use of a single irq source ID, so we
reserve a new one for this interface.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 43829731 20-Aug-2012 Tejun Heo <tj@kernel.org>

workqueue: deprecate flush[_delayed]_work_sync()

flush[_delayed]_work_sync() are now spurious. Mark them deprecated
and convert all users to flush[_delayed]_work().

If you're cc'd and wondering what's going on: Now all workqueues are
non-reentrant and the regular flushes guarantee that the work item is
not pending or running on any CPU on return, so there's no reason to
use the sync flushes at all and they're going away.

This patch doesn't make any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mattia Dongili <malattia@linux.it>
Cc: Kent Yoder <key@linux.vnet.ibm.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: Bryan Wu <bryan.wu@canonical.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-wireless@vger.kernel.org
Cc: Anton Vorontsov <cbou@mail.ru>
Cc: Sangbeom Kim <sbkim73@samsung.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Petr Vandrovec <petr@vandrovec.name>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Avi Kivity <avi@redhat.com>


# 326cf033 29-Jun-2012 Alex Williamson <alex.williamson@redhat.com>

KVM: Sanitize KVM_IRQFD flags

We only know of one so far.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# d4db2935 29-Jun-2012 Alex Williamson <alex.williamson@redhat.com>

KVM: Pass kvm_irqfd to functions

Prune this down to just the struct kvm_irqfd so we can avoid
changing function definition for every flag or field we use.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 743eeb0b 27-Jul-2011 Sasha Levin <levinsasha928@gmail.com>

KVM: Intelligent device lookup on I/O bus

Currently the method of dealing with an IO operation on a bus (PIO/MMIO)
is to call the read or write callback for each device registered
on the bus until we find a device which handles it.

Since the number of devices on a bus can be significant due to ioeventfds
and coalesced MMIO zones, this leads to a lot of overhead on each IO
operation.

Instead of registering devices, we now register ranges which points to
a device. Lookup is done using an efficient bsearch instead of a linear
search.

Performance test was conducted by comparing exit count per second with
200 ioeventfds created on one byte and the guest is trying to access a
different byte continuously (triggering usermode exits).
Before the patch the guest has achieved 259k exits per second, after the
patch the guest does 274k exits per second.

Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 9e02fb96 17-Mar-2011 Michael S. Tsirkin <mst@redhat.com>

KVM: fix crash on irqfd deassign

irqfd in kvm used flush_work incorrectly: it assumed that work scheduled
previously can't run after flush_work, but since kvm uses a non-reentrant
workqueue (by means of schedule_work) we need flush_work_sync to get that
guarantee.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reported-by: Jean-Philippe Menil <jean-philippe.menil@univ-nantes.fr>
Tested-by: Jean-Philippe Menil <jean-philippe.menil@univ-nantes.fr>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 25985edc 30-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi>

Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>


# c8ce057e 06-Mar-2011 Michael S. Tsirkin <mst@redhat.com>

KVM: improve comment on rcu use in irqfd_deassign

The RCU use in kvm_irqfd_deassign is tricky: we have rcu_assign_pointer
but no synchronize_rcu: synchronize_rcu is done by kvm_irq_routing_update
which we share a spinlock with.

Fix up a comment in an attempt to make this clearer.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# bd2b53b2 18-Nov-2010 Michael S. Tsirkin <mst@redhat.com>

KVM: fast-path msi injection with irqfd

Store irq routing table pointer in the irqfd object,
and use that to inject MSI directly without bouncing out to
a kernel thread.

While we touch this structure, rearrange irqfd fields to make fastpath
better packed for better cache utilization.

This also adds some comments about locking rules and rcu usage in code.

Some notes on the design:
- Use pointer into the rt instead of copying an entry,
to make it possible to use rcu, thus side-stepping
locking complexities. We also save some memory this way.
- Old workqueue code is still used for level irqs.
I don't think we DTRT with level anyway, however,
it seems easier to keep the code around as
it has been thought through and debugged, and fix level later than
rip out and re-instate it later.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 6bbfb265 19-Sep-2010 Michael S. Tsirkin <mst@redhat.com>

KVM: fix irqfd assign/deassign race

I think I see the following (theoretical) race:

During irqfd assign, we drop irqfds lock before we
schedule inject work. Therefore, deassign running
on another CPU could cause shutdown and flush to run
before inject, causing user after free in inject.

A simple fix it to schedule inject under the lock.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 221d059d 23-May-2010 Avi Kivity <avi@redhat.com>

KVM: Update Red Hat copyrights

Signed-off-by: Avi Kivity <avi@redhat.com>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# 8b97fb0f 13-Jan-2010 Michael S. Tsirkin <mst@redhat.com>

KVM: do not store wqh in irqfd

wqh is unused, so we do not need to store it in irqfd anymore

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 79fac95e 23-Dec-2009 Marcelo Tosatti <mtosatti@redhat.com>

KVM: convert slots_lock to a mutex

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# e93f8a0f 23-Dec-2009 Marcelo Tosatti <mtosatti@redhat.com>

KVM: convert io_bus to SRCU

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# b6a114d2 13-Jan-2010 Michael S. Tsirkin <mst@redhat.com>

KVM: fix spurious interrupt with irqfd

kvm didn't clear irqfd counter on deassign, as a result we could get a
spurious interrupt when irqfd is assigned back. this leads to poor
performance and, in theory, guest crash.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# f1d1c309 13-Jan-2010 Michael S. Tsirkin <mst@redhat.com>

KVM: only allow one gsi per fd

Looks like repeatedly binding same fd to multiple gsi's with irqfd can
use up a ton of kernel memory for irqfd structures.

A simple fix is to allow each fd to only trigger one gsi: triggering a
storm of interrupts in guest is likely useless anyway, and we can do it
by binding a single gsi to many interrupts if we really want to.

Cc: stable@kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Acked-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 680b3648 24-Aug-2009 Gleb Natapov <gleb@redhat.com>

KVM: Drop kvm->irq_lock lock from irq injection path

The only thing it protects now is interrupt injection into lapic and
this can work lockless. Even now with kvm->irq_lock in place access
to lapic is not entirely serialized since vcpu access doesn't take
kvm->irq_lock.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 6223011f 28-Jul-2009 Julia Lawall <julia@diku.dk>

KVM: correct error-handling code

This code is not executed before file has been initialized to the result of
calling eventfd_fget. This function returns an ERR_PTR value in an error
case instead of NULL. Thus the test that file is not NULL is always true.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@match exists@
expression x, E;
statement S1, S2;
@@

x = eventfd_fget(...)
... when != x = E
(
* if (x == NULL || ...) S1 else S2
|
* if (x == NULL && ...) S1 else S2
)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Avi Kivity <avi@redhat.com>


# d34e6b17 07-Jul-2009 Gregory Haskins <ghaskins@novell.com>

KVM: add ioeventfd support

ioeventfd is a mechanism to register PIO/MMIO regions to trigger an eventfd
signal when written to by a guest. Host userspace can register any
arbitrary IO address with a corresponding eventfd and then pass the eventfd
to a specific end-point of interest for handling.

Normal IO requires a blocking round-trip since the operation may cause
side-effects in the emulated model or may return data to the caller.
Therefore, an IO in KVM traps from the guest to the host, causes a VMX/SVM
"heavy-weight" exit back to userspace, and is ultimately serviced by qemu's
device model synchronously before returning control back to the vcpu.

However, there is a subclass of IO which acts purely as a trigger for
other IO (such as to kick off an out-of-band DMA request, etc). For these
patterns, the synchronous call is particularly expensive since we really
only want to simply get our notification transmitted asychronously and
return as quickly as possible. All the sychronous infrastructure to ensure
proper data-dependencies are met in the normal IO case are just unecessary
overhead for signalling. This adds additional computational load on the
system, as well as latency to the signalling path.

Therefore, we provide a mechanism for registration of an in-kernel trigger
point that allows the VCPU to only require a very brief, lightweight
exit just long enough to signal an eventfd. This also means that any
clients compatible with the eventfd interface (which includes userspace
and kernelspace equally well) can now register to be notified. The end
result should be a more flexible and higher performance notification API
for the backend KVM hypervisor and perhipheral components.

To test this theory, we built a test-harness called "doorbell". This
module has a function called "doorbell_ring()" which simply increments a
counter for each time the doorbell is signaled. It supports signalling
from either an eventfd, or an ioctl().

We then wired up two paths to the doorbell: One via QEMU via a registered
io region and through the doorbell ioctl(). The other is direct via
ioeventfd.

You can download this test harness here:

ftp://ftp.novell.com/dev/ghaskins/doorbell.tar.bz2

The measured results are as follows:

qemu-mmio: 110000 iops, 9.09us rtt
ioeventfd-mmio: 200100 iops, 5.00us rtt
ioeventfd-pio: 367300 iops, 2.72us rtt

I didn't measure qemu-pio, because I have to figure out how to register a
PIO region with qemu's device model, and I got lazy. However, for now we
can extrapolate based on the data from the NULLIO runs of +2.56us for MMIO,
and -350ns for HC, we get:

qemu-pio: 153139 iops, 6.53us rtt
ioeventfd-hc: 412585 iops, 2.37us rtt

these are just for fun, for now, until I can gather more data.

Here is a graph for your convenience:

http://developer.novell.com/wiki/images/7/76/Iofd-chart.png

The conclusion to draw is that we save about 4us by skipping the userspace
hop.

--------------------

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# fa40a821 04-Jun-2009 Marcelo Tosatti <mtosatti@redhat.com>

KVM: switch irq injection/acking data structures to irq_lock

Protect irq injection/acking data structures with a separate irq_lock
mutex. This fixes the following deadlock:

CPU A CPU B
kvm_vm_ioctl_deassign_dev_irq()
mutex_lock(&kvm->lock); worker_thread()
-> kvm_deassign_irq() -> kvm_assigned_dev_interrupt_work_handler()
-> deassign_host_irq() mutex_lock(&kvm->lock);
-> cancel_work_sync() [blocked]

[gleb: fix ia64 path]

Reported-by: Alex Williamson <alex.williamson@hp.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 721eecbf 20-May-2009 Gregory Haskins <ghaskins@novell.com>

KVM: irqfd

KVM provides a complete virtual system environment for guests, including
support for injecting interrupts modeled after the real exception/interrupt
facilities present on the native platform (such as the IDT on x86).
Virtual interrupts can come from a variety of sources (emulated devices,
pass-through devices, etc) but all must be injected to the guest via
the KVM infrastructure. This patch adds a new mechanism to inject a specific
interrupt to a guest using a decoupled eventfd mechnanism: Any legal signal
on the irqfd (using eventfd semantics from either userspace or kernel) will
translate into an injected interrupt in the guest at the next available
interrupt window.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Avi Kivity <avi@redhat.com>