History log of /linux-master/drivers/vfio/pci/vfio_pci_config.c
Revision Date Author Comments
# 30e920e1 20-Feb-2024 Ankit Agrawal <ankita@nvidia.com>

vfio/pci: rename and export range_intersect_range

range_intersect_range determines an overlap between two ranges. If an
overlap, the helper function returns the overlapping offset and size.

The VFIO PCI variant driver emulates the PCI config space BAR offset
registers. These offset may be accessed for read/write with a variety
of lengths including sub-word sizes from sub-word offsets. The driver
makes use of this helper function to read/write the targeted part of
the emulated register.

Make this a vfio_pci_core function, rename and export as GPL. Also
update references in virtio driver.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20240220115055.23546-3-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# d9824f70 23-May-2023 Alex Williamson <alex.williamson@redhat.com>

vfio/pci: Also demote hiding standard cap messages

Apply the same logic as commit 912b625b4dcf ("vfio/pci: demote hiding
ecap messages to debug level") for the less common case of hiding
standard capabilities.

Reviewed-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20230523225250.1215911-1-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 912b625b 04-May-2023 Oleksandr Natalenko <oleksandr@natalenko.name>

vfio/pci: demote hiding ecap messages to debug level

Seeing a burst of messages like this:

vfio-pci 0000:98:00.0: vfio_ecap_init: hiding ecap 0x19@0x1d0
vfio-pci 0000:98:00.0: vfio_ecap_init: hiding ecap 0x25@0x200
vfio-pci 0000:98:00.0: vfio_ecap_init: hiding ecap 0x26@0x210
vfio-pci 0000:98:00.0: vfio_ecap_init: hiding ecap 0x27@0x250
vfio-pci 0000:98:00.1: vfio_ecap_init: hiding ecap 0x25@0x200
vfio-pci 0000:b1:00.0: vfio_ecap_init: hiding ecap 0x19@0x1d0
vfio-pci 0000:b1:00.0: vfio_ecap_init: hiding ecap 0x25@0x200
vfio-pci 0000:b1:00.0: vfio_ecap_init: hiding ecap 0x26@0x210
vfio-pci 0000:b1:00.0: vfio_ecap_init: hiding ecap 0x27@0x250
vfio-pci 0000:b1:00.1: vfio_ecap_init: hiding ecap 0x25@0x200

is of little to no value for an ordinary user.

Hence, use pci_dbg() instead of pci_info().

Signed-off-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Cédric Le Goater <clg@redhat.com>
Tested-by: YangHang Liu <yanghliu@redhat.com>
Link: https://lore.kernel.org/r/20230504131654.24922-1-oleksandr@natalenko.name
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 6467d074 17-Mar-2023 K V P, Satyanarayana <satyanarayana.k.v.p@intel.com>

vfio/pci: Add DVSEC PCI Extended Config Capability to user visible list.

The Designated Vendor-Specific Extended Capability (DVSEC Capability) is an
optional Extended Capability that is permitted to be implemented by any PCI
Express Function. This allows PCI Express component vendors to use
the Extended Capability mechanism to expose vendor-specific registers that can
be present in components by a variety of vendors. A DVSEC Capability structure
can tell vendor-specific software which features a particular component
supports.

An example usage of DVSEC is Intel Platform Monitoring Technology (PMT) for
enumerating and accessing hardware monitoring capabilities on a device.
PMT encompasses three device monitoring features, Telemetry (device metrics),
Watcher (sampling/tracing), and Crashlog. The DVSEC is used to discover these
features and provide a BAR offset to their registers with the Intel vendor code.

The current VFIO driver does not pass DVSEC capabilities to Virtual Machine (VM)
which makes PMT not to work inside the virtual machine. This series adds DVSEC
capability to user visible list to allow its use with VFIO. VFIO supports
passing of Vendor Specific Extended Capability (VSEC) and raw write access to
device. DVSEC also passed to VM in the same way as of VSEC.

Signed-off-by: K V P Satyanarayana <satyanarayana.k.v.p@intel.com>
Link: https://lore.kernel.org/r/20230317082222.3355912-1-satyanarayana.k.v.p@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 0886196c 08-Jan-2023 Jason Gunthorpe <jgg@ziepe.ca>

vfio: Use GFP_KERNEL_ACCOUNT for userspace persistent allocations

Use GFP_KERNEL_ACCOUNT for userspace persistent allocations.

The GFP_KERNEL_ACCOUNT option lets the memory allocator know that this
is untrusted allocation triggered from userspace and should be a subject
of kmem accounting, and as such it is controlled by the cgroup
mechanism.

The way to find the relevant allocations was for example to look at the
close_device function and trace back all the kfrees to their
allocations.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230108154427.32609-4-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# c462a8c5 26-Aug-2022 Jason Gunthorpe <jgg@ziepe.ca>

vfio/pci: Simplify the is_intx/msi/msix/etc defines

Only three of these are actually used, simplify to three inline functions,
and open code the if statement in vfio_pci_config.c.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/r/3-v2-1bd95d72f298+e0e-vfio_pci_priv_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# e34a0425 26-Aug-2022 Jason Gunthorpe <jgg@ziepe.ca>

vfio/pci: Split linux/vfio_pci_core.h

The header in include/linux should have only the exported interface for
other vfio_pci modules to use. Internal definitions for vfio_pci.ko
should be in a "priv" header along side the .c files.

Move the internal declarations out of vfio_pci_core.h. They either move to
vfio_pci_priv.h or to the C file that is the only user.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/r/1-v2-1bd95d72f298+e0e-vfio_pci_priv_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 099fd2c2 31-Jul-2022 Bo Liu <liubo03@inspur.com>

vfio/pci: fix the wrong word

This patch fixes a wrong word in comment.

Signed-off-by: Bo Liu <liubo03@inspur.com>
Link: https://lore.kernel.org/r/20220801013918.2520-1-liubo03@inspur.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 6577067d 03-Jul-2022 Bo Liu <liubo03@inspur.com>

vfio/pci: fix the wrong word

This patch fixes a wrong word in comment.

Signed-off-by: Bo Liu <liubo03@inspur.com>
Link: https://lore.kernel.org/r/20220704023649.3913-1-liubo03@inspur.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 54918c28 18-May-2022 Abhishek Sahu <abhsahu@nvidia.com>

vfio/pci: Virtualize PME related registers bits and initialize to zero

If any PME event will be generated by PCI, then it will be mostly
handled in the host by the root port PME code. For example, in the case
of PCIe, the PME event will be sent to the root port and then the PME
interrupt will be generated. This will be handled in
drivers/pci/pcie/pme.c at the host side. Inside this, the
pci_check_pme_status() will be called where PME_Status and PME_En bits
will be cleared. So, the guest OS which is using vfio-pci device will
not come to know about this PME event.

To handle these PME events inside guests, we need some framework so
that if any PME events will happen, then it needs to be forwarded to
virtual machine monitor. We can virtualize PME related registers bits
and initialize these bits to zero so vfio-pci device user will assume
that it is not capable of asserting the PME# signal from any power state.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220518111612.16985-4-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 2b2c651b 18-May-2022 Abhishek Sahu <abhsahu@nvidia.com>

vfio/pci: Invalidate mmaps and block the access in D3hot power state

According to [PCIe v5 5.3.1.4.1] for D3hot state

"Configuration and Message requests are the only TLPs accepted by a
Function in the D3Hot state. All other received Requests must be
handled as Unsupported Requests, and all received Completions may
optionally be handled as Unexpected Completions."

Currently, if the vfio PCI device has been put into D3hot state and if
user makes non-config related read/write request in D3hot state, these
requests will be forwarded to the host and this access may cause
issues on a few systems.

This patch leverages the memory-disable support added in commit
'abafbc551fdd ("vfio-pci: Invalidate mmaps and block MMIO access on
disabled memory")' to generate page fault on mmap access and
return error for the direct read/write. If the device is D3hot state,
then the error will be returned for MMIO access. The IO access generally
does not make the system unresponsive so the IO access can still happen
in D3hot state. The default value should be returned in this case
without bringing down the complete system.

Also, the power related structure fields need to be protected so
we can use the same 'memory_lock' to protect these fields also.
This protection is mainly needed when user changes the PCI
power state by writing into PCI_PM_CTRL register.
vfio_lock_and_set_power_state() wrapper function will take the
required locks and then it will invoke the vfio_pci_set_power_state().

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220518111612.16985-2-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 7fa005ca 26-Aug-2021 Max Gurtovoy <mgurtovoy@nvidia.com>

vfio/pci: Introduce vfio_pci_core.ko

Now that vfio_pci has been split into two source modules, one focusing on
the "struct pci_driver" (vfio_pci.c) and a toolbox library of code
(vfio_pci_core.c), complete the split and move them into two different
kernel modules.

As before vfio_pci.ko continues to present the same interface under sysfs
and this change will have no functional impact.

Splitting into another module and adding exports allows creating new HW
specific VFIO PCI drivers that can implement device specific
functionality, such as VFIO migration interfaces or specialized device
requirements.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20210826103912.128972-14-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 53647510 26-Aug-2021 Max Gurtovoy <mgurtovoy@nvidia.com>

vfio/pci: Rename vfio_pci_device to vfio_pci_core_device

This is a preparation patch for separating the vfio_pci driver to a
subsystem driver and a generic pci driver. This patch doesn't change any
logic.

The new vfio_pci_core_device structure will be the main structure of the
core driver and later on vfio_pci_device structure will be the main
structure of the generic vfio_pci driver.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20210826103912.128972-4-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 9a389938 26-Aug-2021 Max Gurtovoy <mgurtovoy@nvidia.com>

vfio/pci: Rename vfio_pci_private.h to vfio_pci_core.h

This is a preparation patch for separating the vfio_pci driver to a
subsystem driver and a generic pci driver. This patch doesn't change any
logic.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20210826103912.128972-3-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# d1ce2c79 14-May-2021 Zhen Lei <thunder.leizhen@huawei.com>

vfio/pci: Fix error return code in vfio_ecap_init()

The error code returned from vfio_ext_cap_len() is stored in 'len', not
in 'ret'.

Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Message-Id: <20210515020458.6771-1-thunder.leizhen@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# d0915b32 30-Mar-2021 Zhen Lei <thunder.leizhen@huawei.com>

vfio/pci: fix a couple of spelling mistakes

There are several spelling mistakes, as follows:
thru ==> through
presense ==> presence

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Message-Id: <20210326083528.1329-4-thunder.leizhen@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 515ecd53 10-Sep-2020 Matthew Rosato <mjrosato@linux.ibm.com>

vfio/pci: Decouple PCI_COMMAND_MEMORY bit checks from is_virtfn

While it is true that devices with is_virtfn=1 will have a Memory Space
Enable bit that is hard-wired to 0, this is not the only case where we
see this behavior -- For example some bare-metal hypervisors lack
Memory Space Enable bit emulation for devices not setting is_virtfn
(s390). Fix this by instead checking for the newly-added
no_command_memory bit which directly denotes the need for
PCI_COMMAND_MEMORY emulation in vfio.

Fixes: abafbc551fdd ("vfio-pci: Invalidate mmaps and block MMIO access on disabled memory")
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 1c0f6825 20-Sep-2020 Zenghui Yu <yuzenghui@huawei.com>

vfio/pci: Don't regenerate vconfig for all BARs if !bardirty

Now we regenerate vconfig for all the BARs via vfio_bar_fixup(), every
time any offset of any of them are read. Though BARs aren't re-read
regularly, the regeneration can be avoided if no BARs had been written
since they were last read, in which case vdev->bardirty is false.

Let's return immediately in vfio_bar_fixup() if bardirty is false.

Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# ebfa440c 25-Jun-2020 Alex Williamson <alex.williamson@redhat.com>

vfio/pci: Fix SR-IOV VF handling with MMIO blocking

SR-IOV VFs do not implement the memory enable bit of the command
register, therefore this bit is not set in config space after
pci_enable_device(). This leads to an unintended difference
between PF and VF in hand-off state to the user. We can correct
this by setting the initial value of the memory enable bit in our
virtualized config space. There's really no need however to
ever fault a user on a VF though as this would only indicate an
error in the user's management of the enable bit, versus a PF
where the same access could trigger hardware faults.

Fixes: abafbc551fdd ("vfio-pci: Invalidate mmaps and block MMIO access on disabled memory")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 3e63b94b 09-May-2020 Qian Cai <cai@lca.pw>

vfio/pci: fix memory leaks in alloc_perm_bits()

vfio_pci_disable() calls vfio_config_free() but forgets to call
free_perm_bits() resulting in memory leaks,

unreferenced object 0xc000000c4db2dee0 (size 16):
comm "qemu-kvm", pid 4305, jiffies 4295020272 (age 3463.780s)
hex dump (first 16 bytes):
00 00 ff 00 ff ff ff ff ff ff ff ff ff ff 00 00 ................
backtrace:
[<00000000a6a4552d>] alloc_perm_bits+0x58/0xe0 [vfio_pci]
[<00000000ac990549>] vfio_config_init+0xdf0/0x11b0 [vfio_pci]
init_pci_cap_msi_perm at drivers/vfio/pci/vfio_pci_config.c:1125
(inlined by) vfio_msi_cap_len at drivers/vfio/pci/vfio_pci_config.c:1180
(inlined by) vfio_cap_len at drivers/vfio/pci/vfio_pci_config.c:1241
(inlined by) vfio_cap_init at drivers/vfio/pci/vfio_pci_config.c:1468
(inlined by) vfio_config_init at drivers/vfio/pci/vfio_pci_config.c:1707
[<000000006db873a1>] vfio_pci_open+0x234/0x700 [vfio_pci]
[<00000000630e1906>] vfio_group_fops_unl_ioctl+0x8e0/0xb84 [vfio]
[<000000009e34c54f>] ksys_ioctl+0xd8/0x130
[<000000006577923d>] sys_ioctl+0x28/0x40
[<000000006d7b1cf2>] system_call_exception+0x114/0x1e0
[<0000000008ea7dd5>] system_call_common+0xf0/0x278
unreferenced object 0xc000000c4db2e330 (size 16):
comm "qemu-kvm", pid 4305, jiffies 4295020272 (age 3463.780s)
hex dump (first 16 bytes):
00 ff ff 00 ff ff ff ff ff ff ff ff ff ff 00 00 ................
backtrace:
[<000000004c71914f>] alloc_perm_bits+0x44/0xe0 [vfio_pci]
[<00000000ac990549>] vfio_config_init+0xdf0/0x11b0 [vfio_pci]
[<000000006db873a1>] vfio_pci_open+0x234/0x700 [vfio_pci]
[<00000000630e1906>] vfio_group_fops_unl_ioctl+0x8e0/0xb84 [vfio]
[<000000009e34c54f>] ksys_ioctl+0xd8/0x130
[<000000006577923d>] sys_ioctl+0x28/0x40
[<000000006d7b1cf2>] system_call_exception+0x114/0x1e0
[<0000000008ea7dd5>] system_call_common+0xf0/0x278

Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
Signed-off-by: Qian Cai <cai@lca.pw>
[aw: rolled in follow-up patch]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# bc138db1 08-Apr-2020 Alex Williamson <alex.williamson@redhat.com>

vfio-pci: Mask cap zero

The PCI Code and ID Assignment Specification changed capability ID 0
from reserved to a NULL capability in the v1.1 revision. The NULL
capability is defined to include only the 16-bit capability header,
ie. only the ID and next pointer. Unfortunately vfio-pci creates a
map of config space, where ID 0 is used to reserve the standard type
0 header. Finding an actual capability with this ID therefore results
in a bogus range marked in that map and conflicts with subsequent
capabilities. As this seems to be a dummy capability anyway and we
already support dropping capabilities, let's hide this one rather than
delving into the potentially subtle dependencies within our map.

Seen on an NVIDIA Tesla T4.

Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# abafbc55 22-Apr-2020 Alex Williamson <alex.williamson@redhat.com>

vfio-pci: Invalidate mmaps and block MMIO access on disabled memory

Accessing the disabled memory space of a PCI device would typically
result in a master abort response on conventional PCI, or an
unsupported request on PCI express. The user would generally see
these as a -1 response for the read return data and the write would be
silently discarded, possibly with an uncorrected, non-fatal AER error
triggered on the host. Some systems however take it upon themselves
to bring down the entire system when they see something that might
indicate a loss of data, such as this discarded write to a disabled
memory space.

To avoid this, we want to try to block the user from accessing memory
spaces while they're disabled. We start with a semaphore around the
memory enable bit, where writers modify the memory enable state and
must be serialized, while readers make use of the memory region and
can access in parallel. Writers include both direct manipulation via
the command register, as well as any reset path where the internal
mechanics of the reset may both explicitly and implicitly disable
memory access, and manipulation of the MSI-X configuration, where the
MSI-X vector table resides in MMIO space of the device. Readers
include the read and write file ops to access the vfio device fd
offsets as well as memory mapped access. In the latter case, we make
use of our new vma list support to zap, or invalidate, those memory
mappings in order to force them to be faulted back in on access.

Our semaphore usage will stall user access to MMIO spaces across
internal operations like reset, but the user might experience new
behavior when trying to access the MMIO space while disabled via the
PCI command register. Access via read or write while disabled will
return -EIO and access via memory maps will result in a SIGBUS. This
is expected to be compatible with known use cases and potentially
provides better error handling capabilities than present in the
hardware, while avoiding the more readily accessible and severe
platform error responses that might otherwise occur.

Fixes: CVE-2020-12888
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# c9c13ba4 27-Sep-2019 Denis Efremov <efremov@linux.com>

PCI: Add PCI_STD_NUM_BARS for the number of standard BARs

Code that iterates over all standard PCI BARs typically uses
PCI_STD_RESOURCE_END. However, that requires the unusual test
"i <= PCI_STD_RESOURCE_END" rather than something the typical
"i < PCI_STD_NUM_BARS".

Add a definition for PCI_STD_NUM_BARS and change loops to use the more
idiomatic C style to help avoid fencepost errors.

Link: https://lore.kernel.org/r/20190927234026.23342-1-efremov@linux.com
Link: https://lore.kernel.org/r/20190927234308.23935-1-efremov@linux.com
Link: https://lore.kernel.org/r/20190916204158.6889-3-efremov@linux.com
Signed-off-by: Denis Efremov <efremov@linux.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Sebastian Ott <sebott@linux.ibm.com> # arch/s390/
Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> # video/fbdev/
Acked-by: Gustavo Pimentel <gustavo.pimentel@synopsys.com> # pci/controller/dwc/
Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com> # scsi/pm8001/
Acked-by: Martin K. Petersen <martin.petersen@oracle.com> # scsi/pm8001/
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # memstick/


# d2912cb1 04-Jun-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500

Based on 2 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation #

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 4122 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# a88a7b3e 30-Mar-2019 Bjorn Helgaas <bhelgaas@google.com>

vfio: Use dev_printk() when possible

Use dev_printk() when possible to make messages consistent with other
device-related messages.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 51ef3a00 09-Feb-2019 Alex Williamson <alex.williamson@redhat.com>

vfio/pci: Restore device state on PM transition

PCI core handles save and restore of device state around reset, but
when using pci_set_power_state() we can unintentionally trigger a soft
reset of the device, where PCI core only restores the BAR state. If
we're using vfio-pci's idle D3 support to try to put devices into low
power when unused, this might trigger a reset when the device is woken
for use. Also power state management by the user, or within a guest,
can put the device into D3 power state with potentially limited
ability to restore the device if it should undergo a reset. The PCI
spec does not define the extent of a soft reset and many devices
reporting soft reset on D3->D0 transition do not undergo a PCI config
space reset. It's therefore assumed safe to unconditionally restore
the remainder of the state if the device indicates soft reset
support, even on a user initiated wakeup.

Implement a wrapper in vfio-pci to tag devices reporting PM reset
support, save their state on transitions into D3 and restore on
transitions back to D0.

Reported-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# db04264f 25-Sep-2018 Alex Williamson <alex.williamson@redhat.com>

vfio/pci: Mask buggy SR-IOV VF INTx support

The SR-IOV spec requires that VFs must report zero for the INTx pin
register as VFs are precluded from INTx support. It's much easier for
the host kernel to understand whether a device is a VF and therefore
whether a non-zero pin register value is bogus than it is to do the
same in userspace. Override the INTx count for such devices and
virtualize the pin register to provide a consistent view of the device
to the user.

As this is clearly a spec violation, warn about it to support hardware
validation, but also provide a known whitelist as it doesn't do much
good to continue complaining if the hardware vendor doesn't plan to
fix it.

Known devices with this issue: 8086:270c

Tested-by: Gage Eads <gage.eads@intel.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 30ea32ab 25-Sep-2018 Li Qiang <liq3ea@gmail.com>

vfio/pci: Fix potential memory leak in vfio_msi_cap_len

Free allocated vdev->msi_perm in error path.

Signed-off-by: Li Qiang <liq3ea@gmail.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# cf0d53ba 02-Oct-2017 Alex Williamson <alex.williamson@redhat.com>

vfio/pci: Virtualize Maximum Read Request Size

MRRS defines the maximum read request size a device is allowed to
make. Drivers will often increase this to allow more data transfer
with a single request. Completions to this request are bound by the
MPS setting for the bus. Aside from device quirks (none known), it
doesn't seem to make sense to set an MRRS value less than MPS, yet
this is a likely scenario given that user drivers do not have a
system-wide view of the PCI topology. Virtualize MRRS such that the
user can set MRRS >= MPS, but use MPS as the floor value that we'll
write to hardware.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 52318497 02-Oct-2017 Alex Williamson <alex.williamson@redhat.com>

vfio/pci: Virtualize Maximum Payload Size

With virtual PCI-Express chipsets, we now see userspace/guest drivers
trying to match the physical MPS setting to a virtual downstream port.
Of course a lone physical device surrounded by virtual interconnects
cannot make a correct decision for a proper MPS setting. Instead,
let's virtualize the MPS control register so that writes through to
hardware are disallowed. Userspace drivers like QEMU assume they can
write anything to the device and we'll filter out anything dangerous.
Since mismatched MPS can lead to AER and other faults, let's add it
to the kernel side rather than relying on userspace virtualization to
handle it.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>


# 796b7550 27-Jul-2017 Alex Williamson <alex.williamson@redhat.com>

vfio/pci: Fix handling of RC integrated endpoint PCIe capability size

Root complex integrated endpoints do not have a link and therefore may
use a smaller PCIe capability in config space than we expect when
building our config map. Add a case for these to avoid reporting an
erroneous overlap.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# cc10385b 21-Sep-2016 Wang Sheng-Hui <shhuiw@foxmail.com>

PCI: Move config space size macros to pci_regs.h

Move PCI configuration space size macros (PCI_CFG_SPACE_SIZE and
PCI_CFG_SPACE_EXP_SIZE) from drivers/pci/pci.h to
include/uapi/linux/pci_regs.h so they can be used by more drivers and
eliminate duplicate definitions.

[bhelgaas: Expand comment to include PCI-X details]
Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>


# f4cb4100 18-Nov-2016 Cao jin <caoj.fnst@cn.fujitsu.com>

vfio/pci: Drop unnecessary pcibios_err_to_errno()

As of commit d97ffe236894 ("PCI: Fix return value from
pci_user_{read,write}_config_*()") it's unnecessary to call
pcibios_err_to_errno() to fixup the return value from these functions.

pcibios_err_to_errno() already does simple passthrough of -errno values,
therefore no functional change is expected.

[aw: changelog]
Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# ddf9dc0e 26-Sep-2016 Alex Williamson <alex.williamson@redhat.com>

vfio-pci: Virtualize PCIe & AF FLR

We use a BAR restore trick to try to detect when a user has performed
a device reset, possibly through FLR or other backdoors, to put things
back into a working state. This is important for backdoor resets, but
we can actually just virtualize the "front door" resets provided via
PCIe and AF FLR. Set these bits as virtualized + writable, allowing
the default write to set them in vconfig, then we can simply check the
bit, perform an FLR of our own, and clear the bit. We don't actually
have the granularity in PCI to specify the type of reset we want to
do, but generally devices don't implement both PCIe and AF FLR and
we'll favor these over other types of reset, so we should generally
lineup. We do test whether the device provides the requested FLR type
to stay consistent with hardware capabilities though.

This seems to fix several instance of devices getting into bad states
with userspace drivers, like dpdk, running inside a VM.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Greg Rose <grose@lightfleet.com>


# 8138dabb 17-Aug-2016 Wei Jiangang <weijg.fnst@cn.fujitsu.com>

vfio/pci: Fix typos in comments

Signed-off-by: Wei Jiangang <weijg.fnst@cn.fujitsu.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# ce7585f3 31-May-2016 Alex Williamson <alex.williamson@redhat.com>

vfio/pci: Allow VPD short read

The size of the VPD area is not necessarily 4-byte aligned, so a
pci_vpd_read() might return less than 4 bytes. Zero our buffer and
accept anything other than an error. Intel X710 NICs exercise this.

Fixes: 4e1a635552d3 ("vfio/pci: Use kernel VPD access functions")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# f7055280 28-Apr-2016 Alexey Kardashevskiy <aik@ozlabs.ru>

vfio_pci: Test for extended capabilities if config space > 256 bytes

PCI-Express spec says that reading 4 bytes at offset 100h should return
zero if there is no extended capability so VFIO reads this dword to
know if there are extended capabilities.

However it is not always possible to access the extended space so
generic PCI code in pci_cfg_space_size_ext() checks if
pci_read_config_dword() can read beyond 100h and if the check fails,
it sets the config space size to 100h.

VFIO does its own extended capabilities check by reading at offset 100h
which may produce 0xffffffff which VFIO treats as the extended config
space presense and calls vfio_ecap_init() which fails to parse
capabilities (which is expected) but right before the exit, it writes
zero at offset 100h which is beyond the buffer allocated for
vdev->vconfig (which is 256 bytes) which leads to random memory
corruption.

This makes VFIO only check for the extended capabilities if
the discovered config size is more than 256 bytes.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# dc928109 24-Mar-2016 Alex Williamson <alex.williamson@redhat.com>

vfio/pci: Add test for BAR restore

If a device is reset without the memory or i/o bits enabled in the
command register we may not detect it, potentially leaving the device
without valid BAR programming. Add an additional test to check the
BARs on each write to the command register.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 45074405 24-Mar-2016 Alex Williamson <alex.williamson@redhat.com>

vfio/pci: Hide broken INTx support from user

INTx masking has two components, the first is that we need the ability
to prevent the device from continuing to assert INTx. This is
provided via the DisINTx bit in the command register and is the only
thing we can really probe for when testing if INTx masking is
supported. The second component is that the device needs to indicate
if INTx is asserted via the interrupt status bit in the device status
register. With these two features we can generically determine if one
of the devices we own is asserting INTx, signal the user, and mask the
interrupt while the user services the device.

Generally if one or both of these components is broken we resort to
APIC level interrupt masking, which requires an exclusive interrupt
since we have no way to determine the source of the interrupt in a
shared configuration. This often makes it difficult or impossible to
configure the system for userspace use of the device, for an interrupt
mode that the user may not need.

One possible configuration of broken INTx masking is that the DisINTx
support is fully functional, but the interrupt status bit never
signals interrupt assertion. In this case we do have the ability to
prevent the device from asserting INTx, but lack the ability to
identify the interrupt source. For this case we can simply pretend
that the device lacks INTx support entirely, keeping DisINTx set on
the physical device, virtualizing this bit for the user, and
virtualizing the interrupt pin register to indicate no INTx support.
We already support virtualization of the DisINTx bit and already
virtualize the interrupt pin for platforms without INTx support. By
tying these components together, setting DisINTx on open and reset,
and identifying devices broken in this particular way, we can provide
support for them w/o the handicap of APIC level INTx masking.

Intel i40e (XL710/X710) 10/20/40GbE NICs have been identified as being
broken in this specific way. We leave the vfio-pci.nointxmask option
as a mechanism to bypass this support, enabling INTx on the device
with all the requirements of APIC level masking.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: John Ronciak <john.ronciak@intel.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>


# a13b6459 22-Feb-2016 Alex Williamson <alex.williamson@redhat.com>

vfio/pci: Expose shadow ROM as PCI option ROM

Integrated graphics may have their ROM shadowed at 0xc0000 rather than
implement a PCI option ROM. Make this ROM appear to the user using
the ROM BAR.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 345d7104 22-Feb-2016 Alex Williamson <alex.williamson@redhat.com>

vfio/pci: Enable virtual register in PCI config space

Typically config space for a device is mapped out into capability
specific handlers and unassigned space. The latter allows direct
read/write access to config space. Sometimes we know about registers
living in this void space and would like an easy way to virtualize
them, similar to how BAR registers are managed. To do this, create
one more pseudo (fake) PCI capability to be handled as purely virtual
space. Reads and writes are serviced entirely from virtual config
space.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 222e684c 09-Nov-2015 Dan Carpenter <dan.carpenter@oracle.com>

vfio/pci: make an array larger

Smatch complains about a possible out of bounds error:

drivers/vfio/pci/vfio_pci_config.c:1241 vfio_cap_init()
error: buffer overflow 'pci_cap_length' 20 <= 20

The problem is that pci_cap_length[] was defined as large enough to
hold "PCI_CAP_ID_AF + 1" elements. The code in vfio_cap_init() assumes
it has PCI_CAP_ID_MAX + 1 elements. Originally, PCI_CAP_ID_AF and
PCI_CAP_ID_MAX were the same but then we introduced PCI_CAP_ID_EA in
commit f80b0ba95964 ("PCI: Add Enhanced Allocation register entries")
so now the array is too small.

Let's fix this by making the array size PCI_CAP_ID_MAX + 1. And let's
make a similar change to pci_ext_cap_length[] for consistency. Also
both these arrays can be made const.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 4e1a6355 27-Oct-2015 Alex Williamson <alex.williamson@redhat.com>

vfio/pci: Use kernel VPD access functions

The PCI VPD capability operates on a set of window registers in PCI
config space. Writing to the address register triggers either a read
or write, depending on the setting of the PCI_VPD_ADDR_F bit within
the address register. The data register provides either the source
for writes or the target for reads.

This model is susceptible to being broken by concurrent access, for
which the kernel has adopted a set of access functions to serialize
these registers. Additionally, commits like 932c435caba8 ("PCI: Add
dev_flags bit to access VPD through function 0") and 7aa6ca4d39ed
("PCI: Add VPD function 0 quirk for Intel Ethernet devices") indicate
that VPD registers can be shared between functions on multifunction
devices creating dependencies between otherwise independent devices.

Fortunately it's quite easy to emulate the VPD registers, simply
storing copies of the address and data registers in memory and
triggering a VPD read or write on writes to the address register.
This allows vfio users to avoid seeing spurious register changes from
accesses on other devices and enables the use of shared quirks in the
host kernel. We can theoretically still race with access through
sysfs, but the window of opportunity is much smaller.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Mark Rustad <mark.d.rustad@intel.com>


# 1d53a3a7 07-Nov-2014 Frank Blaschka <frank.blaschka@de.ibm.com>

vfio: make vfio run on s390

add Kconfig switch to hide INTx
add Kconfig switch to let vfio announce PCI BARs are not mapable

Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 846fc709 13-Aug-2014 Chen, Gong <gong.chen@linux.intel.com>

PCI/AER: Rename PCI_ERR_UNC_TRAIN to PCI_ERR_UNC_UND

In PCIe r1.0, sec 5.10.2, bit 0 of the Uncorrectable Error Status, Mask,
and Severity Registers was for "Training Error." In PCIe r1.1, sec 7.10.2,
bit 0 was redefined to be "Undefined."

Rename PCI_ERR_UNC_TRAIN to PCI_ERR_UNC_UND to reflect this change.

No functional change.

[bhelgaas: changelog]
Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>


# afa63252 30-May-2014 Alex Williamson <alex.williamson@redhat.com>

vfio/pci: Fix sizing of DPA and THP express capabilities

When sizing the TPH capability we store the register containing the
table size into the 'dword' variable, but then use the uninitialized
'byte' variable to analyze the size. The table size is also actually
reported as an N-1 value, so correct sizing to account for this.

The round_up() for both TPH and DPA is unnecessary, remove it.

Detected by Coverity: CID 714665 & 715156

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 274127a1 17-Dec-2013 Alex Williamson <alex.williamson@redhat.com>

PCI: Rename PCI_VC_PORT_REG1/2 to PCI_VC_PORT_CAP1/2

These are set of two capability registers, it's pretty much given that
they're registers, so reflect their purpose in the name.

Suggested-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>


# 17638db1 04-Sep-2013 Alex Williamson <alex.williamson@redhat.com>

vfio-pci: Test for extended config space

Having PCIe/PCI-X capability isn't enough to assume that there are
extended capabilities. Both specs define that the first capability
header is all zero if there are no extended capabilities. Testing
for this avoids an erroneous message about hiding capability 0x0 at
offset 0x100.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# aa2cba51 15-Apr-2013 Yijing Wang <wangyijing@huawei.com>

PCI/VFIO: use pcie_flags_reg instead of access PCI-E Capabilities Register

Currently, we use pcie_flags_reg to cache PCI-E Capabilities Register,
because PCI-E Capabilities Register bits are almost read-only. This patch
use pcie_caps_reg() instead of another access PCI-E Capabilities Register.

Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# a7d1ea1c 01-Apr-2013 Alex Williamson <alex.williamson@redhat.com>

vfio-pci: Enable raw access to unassigned config space

Devices like be2net hide registers between the gaps in capabilities
and architected regions of PCI config space. Our choices to support
such devices is to either build an ever growing and unmanageable white
list or rely on hardware isolation to protect us. These registers are
really no different than MMIO or I/O port space registers, which we
don't attempt to regulate, so treat PCI config space in the same way.

Reported-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Gavin Shan <shangw@linux.vnet.ibm.com>


# 180b1381 01-Apr-2013 Alex Williamson <alex.williamson@redhat.com>

vfio-pci: Use byte granularity in config map

The config map previously used a byte per dword to map regions of
config space to capabilities. Modulo a bug where we round the length
of capabilities down instead of up, this theoretically works well and
saves space so long as devices don't try to hide registers in the gaps
between capabilities. Unfortunately they do exactly that so we need
byte granularity on our config space map. Increase the allocation of
the config map and split accesses at capability region boundaries.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Gavin Shan <shangw@linux.vnet.ibm.com>


# 25e9789d 15-Mar-2013 Arnd Bergmann <arnd@arndb.de>

vfio: include <linux/slab.h> for kmalloc

The vfio drivers call kmalloc or kzalloc, but do not
include <linux/slab.h>, which causes build errors on
ARM.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: kvm@vger.kernel.org


# 2dd11948 18-Feb-2013 Alex Williamson <alex.williamson@redhat.com>

vfio-pci: Manage user power state transitions

We give the user access to change the power state of the device but
certain transitions result in an uninitialized state which the user
cannot resolve. To fix this we need to mark the PowerState field of
the PMCSR register read-only and effect the requested change on behalf
of the user. This has the added benefit that pdev->current_state
remains accurate while controlled by the user.

The primary example of this bug is a QEMU guest doing a reboot where
the device it put into D3 on shutdown and becomes unusable on the next
boot because the device did a soft reset on D3->D0 (NoSoftRst-).

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 906ee99d 14-Feb-2013 Alex Williamson <alex.williamson@redhat.com>

vfio-pci: Cleanup BAR access

We can actually handle MMIO and I/O port from the same access function
since PCI already does abstraction of this. The ROM BAR only requires
a minor difference, so it gets included too. vfio_pci_config_readwrite
gets renamed for consistency.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 5641ade4 14-Feb-2013 Alex Williamson <alex.williamson@redhat.com>

vfio-pci: Enable PCIe extended capabilities on v1

Even PCIe 1.x had extended config space.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 89e1f7d4 31-Jul-2012 Alex Williamson <alex.williamson@redhat.com>

vfio: Add PCI device driver

Add PCI device support for VFIO. PCI devices expose regions
for accessing config space, I/O port space, and MMIO areas
of the device. PCI config access is virtualized in the kernel,
allowing us to ensure the integrity of the system, by preventing
various accesses while reducing duplicate support across various
userspace drivers. I/O port supports read/write access while
MMIO also supports mmap of sufficiently sized regions. Support
for INTx, MSI, and MSI-X interrupts are provided using eventfds to
userspace.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>