History log of /linux-master/drivers/iommu/iommufd/device.c
Revision Date Author Comments
# bd7a2826 12-Nov-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Add iommufd_ctx to iommufd_put_object()

Will be used in the next patch.

Link: https://lore.kernel.org/r/1-v2-ca9e00171c5b+123-iommufd_syz4_jgg@nvidia.com/
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# bd529dbb 25-Oct-2023 Nicolin Chen <nicolinc@nvidia.com>

iommufd: Add a nested HW pagetable object

IOMMU_HWPT_ALLOC already supports iommu_domain allocation for userspace.
But it can only allocate a hw_pagetable that is associated with a given
IOAS, i.e. only a kernel-managed hw_pagetable of IOMMUFD_OBJ_HWPT_PAGING
type.

IOMMU drivers can now support user-managed hw_pagetables, for two-stage
translation use cases that require user data input from user space.

Add a new IOMMUFD_OBJ_HWPT_NESTED type with its abort/destroy(). Pair it
with a new iommufd_hwpt_nested structure and its to_hwpt_nested() helper.
Update the to_hwpt_paging() helper, so a NESTED-type hw_pagetable can be
handled in the callers, for example iommufd_hw_pagetable_enforce_rr().

Screen the inputs, including the parent PAGING-type hw_pagetable, which
requires a new nest_parent flag in the iommufd_hwpt_paging structure.

Extend the IOMMU_HWPT_ALLOC ioctl to accept IOMMU driver specific data
input, tagged by the enum iommu_hwpt_data_type. Also update @pt_id to
accept a hwpt_id in addition to an ioas_id. Then use them to allocate a
hw_pagetable of IOMMUFD_OBJ_HWPT_NESTED type via the
iommufd_hw_pagetable_alloc_nested() allocator.

Link: https://lore.kernel.org/r/20231026043938.63898-8-yi.l.liu@intel.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Co-developed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
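
A minimal userspace sketch of the extended IOMMU_HWPT_ALLOC flow described
above, allocating a NESTED hw_pagetable on top of a PAGING parent; the
struct iommu_hwpt_alloc field names and the IOMMU_HWPT_DATA_* values are
assumed from this series rather than quoted verbatim from the uAPI header:

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/iommufd.h>

  /* Allocate an IOMMUFD_OBJ_HWPT_NESTED on top of a PAGING parent hwpt */
  static int alloc_nested_hwpt(int iommufd, __u32 dev_id, __u32 parent_hwpt_id,
                               __u32 data_type, void *driver_data,
                               __u32 data_len, __u32 *out_hwpt_id)
  {
          struct iommu_hwpt_alloc cmd = {
                  .size = sizeof(cmd),
                  .dev_id = dev_id,
                  /* @pt_id now accepts a hwpt_id as well as an ioas_id */
                  .pt_id = parent_hwpt_id,
                  .data_type = data_type,  /* e.g. IOMMU_HWPT_DATA_VTD_S1 */
                  .data_len = data_len,
                  .data_uptr = (uintptr_t)driver_data,
          };

          if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &cmd))
                  return -1;
          *out_hwpt_id = cmd.out_hwpt_id;
          return 0;
  }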


# 89db3163 25-Oct-2023 Nicolin Chen <nicolinc@nvidia.com>

iommufd: Derive iommufd_hwpt_paging from iommufd_hw_pagetable

To prepare for IOMMUFD_OBJ_HWPT_NESTED, derive struct iommufd_hwpt_paging
from struct iommufd_hw_pagetable, by leaving the common members in struct
iommufd_hw_pagetable. Add __iommufd_object_alloc() and to_hwpt_paging()
helpers for the new structure.

Then, update "hwpt" to "hwpt_paging" throughout the files, accordingly.

Link: https://lore.kernel.org/r/20231026043938.63898-5-yi.l.liu@intel.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 58d84f43 25-Oct-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd/device: Wrap IOMMUFD_OBJ_HWPT_PAGING-only configurations

Some of the configurations during the attach/replace() should only apply
to IOMMUFD_OBJ_HWPT_PAGING. Once IOMMUFD_OBJ_HWPT_NESTED gets introduced
in a following patch, keeping them unconditionally in the common routine
will not work.

Wrap all of those PAGING-only configurations together into helpers. Do a
hwpt_is_paging check whenever calling them or their fallback routines.

Link: https://lore.kernel.org/r/20231026043938.63898-4-yi.l.liu@intel.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 9744a7ab 25-Oct-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Rename IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING

To add a new IOMMUFD_OBJ_HWPT_NESTED, rename the HWPT object to confine
it to PAGING hwpts/domains. The following patch will separate the hwpt
structure as well.

Link: https://lore.kernel.org/r/20231026043938.63898-3-yi.l.liu@intel.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 2ccabf81 23-Oct-2023 Nicolin Chen <nicolinc@nvidia.com>

iommufd: Only enforce cache coherency in iommufd_hw_pagetable_alloc

According to the conversation in the following link:
https://lore.kernel.org/linux-iommu/20231020135501.GG3952@nvidia.com/

The enforce_cache_coherency should be set/enforced in the hwpt allocation
routine. The iommu driver, in its attach_dev() op, should decide whether
or not to reject a device that doesn't match the cache coherency
configuration. Drop the enforce_cache_coherency piece from attach/replace()
and move the remaining "num_devices" piece closer to the refcount that is
using it.

Accordingly drop its function prototype in the header and mark it static.
Also add some extra comments to clarify the expected behaviors.

Link: https://lore.kernel.org/r/20231024012958.30842-1-nicolinc@nvidia.com
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 76236838 24-Oct-2023 Joao Martins <joao.m.martins@oracle.com>

iommufd: Add capabilities to IOMMU_GET_HW_INFO

Extend IOMMUFD_CMD_GET_HW_INFO op to query generic iommu capabilities for a
given device.

Capabilities are IOMMU agnostic and use the device_iommu_capable() API,
passing one of the IOMMU_CAP_* values. For now, enumerate
IOMMU_CAP_DIRTY_TRACKING in the out_capabilities field returned back to
userspace.

Link: https://lore.kernel.org/r/20231024135109.73787-9-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
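
A hedged sketch of consuming the new field from userspace;
IOMMU_HW_CAP_DIRTY_TRACKING is assumed to be the uAPI bit that mirrors
IOMMU_CAP_DIRTY_TRACKING:

  #include <stdbool.h>
  #include <linux/iommufd.h>

  /* Test a generic capability bit reported by IOMMU_GET_HW_INFO */
  static bool supports_dirty_tracking(const struct iommu_hw_info *info)
  {
          /* Assumed uAPI name for the dirty-tracking capability bit */
          return info->out_capabilities & IOMMU_HW_CAP_DIRTY_TRACKING;
  }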


# 89d63875 28-Sep-2023 Yi Liu <yi.l.liu@intel.com>

iommufd: Flow user flags for domain allocation to domain_alloc_user()

Extend iommufd_hw_pagetable_alloc() to accept user flags; the uAPI will
provide the flags.

Link: https://lore.kernel.org/r/20230928071528.26258-4-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 55dd4023 18-Aug-2023 Yi Liu <yi.l.liu@intel.com>

iommufd: Add IOMMU_GET_HW_INFO

Under nested IOMMU translation, userspace owns the stage-1 translation
table (e.g. the stage-1 page table of Intel VT-d or the context table of
ARM SMMUv3, etc.). Stage-1 translation tables are vendor specific and need
to be compatible with the underlying IOMMU hardware. Hence, userspace
should know the IOMMU hardware capabilities before creating the stage-1
translation table and configuring it in the kernel.

This adds an IOMMU_GET_HW_INFO ioctl to query the IOMMU hardware
information (a.k.a. capability) for a given device. The returned data is
vendor specific; userspace needs to decode it using the structure
indicated by the output @out_data_type field.

As only physical devices have IOMMU hardware, this will return an error
if the given device is not a physical device.

Link: https://lore.kernel.org/r/20230818101033.4100-4-yi.l.liu@intel.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
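
A hedged userspace sketch of the query described above; the struct
iommu_hw_info field names follow this series' uAPI but should be treated
as an assumption rather than a verbatim reference:

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/iommufd.h>

  /* Query vendor-specific IOMMU hardware info for a bound device */
  static int get_hw_info(int iommufd, __u32 dev_id, void *data, __u32 data_len,
                         __u32 *out_data_type)
  {
          struct iommu_hw_info cmd = {
                  .size = sizeof(cmd),
                  .dev_id = dev_id,
                  .data_len = data_len,
                  .data_uptr = (uintptr_t)data,
          };

          if (ioctl(iommufd, IOMMU_GET_HW_INFO, &cmd))
                  return -1;

          /* Decode @data according to the reported vendor-specific type */
          *out_data_type = cmd.out_data_type;
          return 0;
  }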


# 70c16123 28-Jul-2023 Nicolin Chen <nicolinc@nvidia.com>

iommufd: Add iommufd_access_replace() API

Taking advantage of the new iommufd_access_change_ioas_id helper, add an
iommufd_access_replace() API for the VFIO emulated pathway to use.

Link: https://lore.kernel.org/r/a3267b924fd5f45e0d3a1dd13a9237e923563862.1690523699.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 6129b59f 28-Jul-2023 Nicolin Chen <nicolinc@nvidia.com>

iommufd: Use iommufd_access_change_ioas in iommufd_access_destroy_object

Update iommufd_access_destroy_object() to call the new
iommufd_access_change_ioas() helper.

It is impossible to legitimately race iommufd_access_destroy_object() with
iommufd_access_change_ioas() as iommufd_access_destroy_object() is only
called once the refcount reaches zero, so any concurrent
iommufd_access_change_ioas() is already UAFing the memory.

Link: https://lore.kernel.org/r/f9fbeca2cde7f8515da18d689b3e02a6a40a5e14.1690523699.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 9227da78 28-Jul-2023 Nicolin Chen <nicolinc@nvidia.com>

iommufd: Add iommufd_access_change_ioas(_id) helpers

The complication of the mutex and refcount will be amplified after we
introduce replace support for accesses. So, as a preparatory change, add a
constitutive helper iommufd_access_change_ioas() and its wrapper
iommufd_access_change_ioas_id(). They can simply take care of the existing
iommufd_access_attach() and iommufd_access_detach(), properly sequencing
the refcount puts so that they are truly at the end of the sequence, after
we know the IOAS pointer is no longer required.

Link: https://lore.kernel.org/r/da0c462532193b447329c4eb975a596f47e49b70.1690523699.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 5d5c85ff 28-Jul-2023 Nicolin Chen <nicolinc@nvidia.com>

iommufd: Allow passing in iopt_access_list_id to iopt_remove_access()

This is a preparatory change for ioas replacement support for accesses.
The replacement routine does an iopt_add_access() for a new IOAS first and
then iopt_remove_access() for the old IOAS upon the success of the first
call. However, the first call overrides the iopt_access_list_id in the
access struct, resulting in iopt_remove_access() being unable to work on
the old IOAS.

Add an iopt_access_list_id parameter to iopt_remove_access(), so the
replacement routine can save the id before it gets overwritten and pass
the saved id to iopt_remove_access() for a proper cleanup.

The existing callers should just pass in access->iopt_access_list_id.

Link: https://lore.kernel.org/r/7bb939b9e0102da0c099572bb3de78ab7622221e.1690523699.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# e88d4ec1 17-Jul-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Add iommufd_device_replace()

Replace allows all the devices in a group to move in one step to a new
HWPT. Further, the HWPT move is done without going through a blocking
domain, so that the IOMMU driver can implement some level of
non-disruption to ongoing DMA if that has meaning for it (e.g. for future
special driver domains).

Replace uses a lot of the same logic as normal attach, except the actual
domain changeover has different restrictions, and we are careful to
sequence things so that a failure leaves everything the way it was and
does not get trapped in a blocking domain or similar if there is ENOMEM.

Link: https://lore.kernel.org/r/14-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
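
A minimal kernel-side sketch of what replace means for an external driver,
assuming iommufd_device_replace() mirrors the in/out pt_id convention of
iommufd_device_attach():

  #include <linux/iommufd.h>

  /* Move an already-attached device (and its whole group) to a new HWPT/IOAS */
  static int switch_page_table(struct iommufd_device *idev, u32 new_pt_id)
  {
          u32 pt_id = new_pt_id;

          /* On failure the device stays attached to its previous HWPT */
          return iommufd_device_replace(idev, &pt_id);
  }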


# ea2d6124 17-Jul-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Reorganize iommufd_device_attach into iommufd_device_change_pt

The code flow for first time attaching a PT and replacing a PT is very
similar except for the lowest do_attach step.

Reorganize this so that the do_attach step is a function pointer.

Replace requires destroying the old HWPT once it is replaced. This
destruction cannot be done under all the locks that are held in the
function pointer, so the signature allows returning a HWPT which will be
destroyed by the caller after everything is unlocked.

Link: https://lore.kernel.org/r/12-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 31422dff 17-Jul-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Fix locking around hwpt allocation

Due to the auto_domains mechanism the ioas->mutex must be held until the
hwpt is completely set up by iommufd_object_abort_and_destroy() or
iommufd_object_finalize().

This prevents a concurrent iommufd_device_auto_get_domain() from seeing
an incompletely initialized object through the ioas->hwpt_list.

To make this more consistent, move the unlock to after finalize.

Fixes: e8d57210035b ("iommufd: Add kAPI toward external drivers for physical devices")
Link: https://lore.kernel.org/r/11-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 17bad527 17-Jul-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Add enforced_cache_coherency to iommufd_hw_pagetable_alloc()

Logically the HWPT should have the coherency set properly for the device
that it is being created for when it is created.

This was happening implicitly if the immediate_attach was set because
iommufd_hw_pagetable_attach() does it as the first thing.

Do it unconditionally so !immediate_attach works properly.

Link: https://lore.kernel.org/r/9-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# d03f1336 17-Jul-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Move putting a hwpt to a helper function

Next patch will need to call this from two places.

Link: https://lore.kernel.org/r/8-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 1d149ab2 17-Jul-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Make sw_msi_start a group global

The sw_msi_start is only set by the ARM drivers and it is always
constant. Due to the way vfio/iommufd allow domains to be re-used between
devices, we have a built-in assumption that there is only one value for
sw_msi_start and it is global to the system.

To make replace simpler, where we may not reparse
iommu_get_resv_regions(), move the sw_msi_start to the iommufd_group so it
is always available once any HWPT has been attached.

Link: https://lore.kernel.org/r/7-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 269c5238 17-Jul-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Use the iommufd_group to avoid duplicate MSI setup

This only needs to be done once per group, not once per device. The
once-per-device approach was a way to make the device list work. Since we
are abandoning this, we can optimize things a bit.

Link: https://lore.kernel.org/r/6-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 34f327a9 17-Jul-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Keep track of each device's reserved regions instead of groups

The driver facing API in the iommu core makes the reserved regions
per-device. An algorithm in the core code consolidates the regions of all
the devices in a group to return the group view.

To allow for devices to be hotplugged into the group, iommufd would
re-load the entire group's reserved regions for each device, just in case
they changed.

Further, iommufd already has to deal with duplicated/overlapping reserved
regions as it must union all the groups together.

Thus simplify all of this to just use the device reserved regions
interface directly from the iommu driver.

Link: https://lore.kernel.org/r/5-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 91a2e17e 17-Jul-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Replace the hwpt->devices list with iommufd_group

The devices list was used as a simple way to avoid having per-group
information. Now that this seems to be unavoidable, just commit to
per-group information fully and remove the devices list from the HWPT.

The iommufd_group stores the currently assigned HWPT for the entire group
and we can manage the per-device attach/detach with a list in the
iommufd_group.

For destruction, the flow is organized to make the following patches
easier: the actual call to iommufd_object_destroy_user() is done at the
top of the call chain without holding any locks. The HWPT to be destroyed
is returned out from the locked region to make this possible. Later
patches create locking that requires this.

Link: https://lore.kernel.org/r/3-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 3a3329a7 17-Jul-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Add iommufd_group

When the hwpt-to-device attachment is fairly static we could get away
with the simple approach of keeping track of the groups via a device list.
But with replace this is infeasible.

Add an automatically managed struct that is 1:1 with the iommu_group
per-ictx so we can store the necessary tracking information there.

Link: https://lore.kernel.org/r/2-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# d525a5b8 17-Jul-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Move isolated msi enforcement to iommufd_device_bind()

With the recent rework this no longer needs to be done at domain
attachment time; we know whether the device is usable by iommufd when we
bind it.

The value of msi_device_has_isolated_msi() is not allowed to change while
a driver is bound.

Link: https://lore.kernel.org/r/1-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# e23a6217 18-Jul-2023 Nicolin Chen <nicolinc@nvidia.com>

iommufd/device: Add iommufd_access_detach() API

Previously, the detach routine was only done by the destroy(). It was
called by vfio_iommufd_emulated_unbind() when the device ran close(), so
all the mappings in the iopt had already been cleaned up by the time the
call trace reached this detach() routine.

Now there is a need for a detach uAPI, which not only requires a new
iommufd_access_detach() API but also an access->ops->unmap() call as a
cleanup. So add one.

However, leaving that unprotected can introduce a potential race
condition during the pin_/unpin_pages() calls, where access->ioas->iopt is
being referenced. So add an ioas_lock to protect the iopt references.

Also, to allow the iommufd_access_unpin_pages() callback to happen via
this unmap() call, add an ioas_unpin pointer, so the unpin routine won't
be affected by the "access->ioas = NULL" trick.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20230718135551.6592-15-yi.l.liu@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 78d3df45 18-Jul-2023 Yi Liu <yi.l.liu@intel.com>

iommufd: Add helper to retrieve iommufd_ctx and devid

This is needed by the vfio-pci driver to report affected devices in the
hot-reset for a given device.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20230718105542.4138-6-yi.l.liu@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 86b0a96c 18-Jul-2023 Yi Liu <yi.l.liu@intel.com>

iommufd: Add iommufd_ctx_has_group()

This adds a helper to check if any device within the given iommu_group
has been bound to the iommufd_ctx. This is helpful for checking device
ownership of devices which have not been bound themselves but cannot be
bound to any other iommufd_ctx, since their iommu_group has already been
bound.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20230718105542.4138-5-yi.l.liu@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 99f98a7c 25-Jul-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: IOMMUFD_DESTROY should not increase the refcount

syzkaller found a race where IOMMUFD_DESTROY increments the refcount:

  obj = iommufd_get_object(ucmd->ictx, cmd->id, IOMMUFD_OBJ_ANY);
  if (IS_ERR(obj))
          return PTR_ERR(obj);
  iommufd_ref_to_users(obj);
  /* See iommufd_ref_to_users() */
  if (!iommufd_object_destroy_user(ucmd->ictx, obj))

As part of the sequence to join the two existing primitives together.

Allowing the refcount to be elevated without holding the destroy_rwsem
violates the assumption that all temporary refcount elevations are
protected by destroy_rwsem. Racing IOMMUFD_DESTROY with
iommufd_object_destroy_user() will cause spurious failures:

WARNING: CPU: 0 PID: 3076 at drivers/iommu/iommufd/device.c:477 iommufd_access_destroy+0x18/0x20 drivers/iommu/iommufd/device.c:478
Modules linked in:
CPU: 0 PID: 3076 Comm: syz-executor.0 Not tainted 6.3.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/03/2023
RIP: 0010:iommufd_access_destroy+0x18/0x20 drivers/iommu/iommufd/device.c:477
Code: e8 3d 4e 00 00 84 c0 74 01 c3 0f 0b c3 0f 1f 44 00 00 f3 0f 1e fa 48 89 fe 48 8b bf a8 00 00 00 e8 1d 4e 00 00 84 c0 74 01 c3 <0f> 0b c3 0f 1f 44 00 00 41 57 41 56 41 55 4c 8d ae d0 00 00 00 41
RSP: 0018:ffffc90003067e08 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff888109ea0300 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000000ffffffff
RBP: 0000000000000004 R08: 0000000000000000 R09: ffff88810bbb3500
R10: ffff88810bbb3e48 R11: 0000000000000000 R12: ffffc90003067e88
R13: ffffc90003067ea8 R14: ffff888101249800 R15: 00000000fffffffe
FS: 00007ff7254fe6c0(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000555557262da8 CR3: 000000010a6fd000 CR4: 0000000000350ef0
Call Trace:
<TASK>
iommufd_test_create_access drivers/iommu/iommufd/selftest.c:596 [inline]
iommufd_test+0x71c/0xcf0 drivers/iommu/iommufd/selftest.c:813
iommufd_fops_ioctl+0x10f/0x1b0 drivers/iommu/iommufd/main.c:337
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl fs/ioctl.c:856 [inline]
__x64_sys_ioctl+0x84/0xc0 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x38/0x80 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

The solution is to not increment the refcount on the IOMMUFD_DESTROY path
at all. Instead use the xa_lock to serialize everything. The refcount
check == 1 and xa_erase can be done under a single critical region. This
avoids the need for any refcount incrementing.

It has the downside that if userspace races destroy with other operations
it will get an EBUSY instead of waiting, but this kind of racing is
already dangerous.

Fixes: 2ff4bed7fee7 ("iommufd: File descriptor, context, kconfig and makefiles")
Link: https://lore.kernel.org/r/2-v1-85aacb2af554+bc-iommufd_syz3_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reported-by: syzbot+7574ebfe589049630608@syzkaller.appspotmail.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# dbe245cd 20-Jun-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Call iopt_area_contig_done() under the lock

The iter internally holds a pointer to the area and
iopt_area_contig_done() will dereference it. The pointer is not valid
outside the iova_rwsem.

syzkaller reports:

BUG: KASAN: slab-use-after-free in iommufd_access_unpin_pages+0x363/0x370
Read of size 8 at addr ffff888022286e20 by task syz-executor669/5771

CPU: 0 PID: 5771 Comm: syz-executor669 Not tainted 6.4.0-rc5-syzkaller-00313-g4c605260bc60 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/25/2023
Call Trace:
<TASK>
dump_stack_lvl+0xd9/0x150
print_address_description.constprop.0+0x2c/0x3c0
kasan_report+0x11c/0x130
iommufd_access_unpin_pages+0x363/0x370
iommufd_test_access_unmap+0x24b/0x390
iommufd_access_notify_unmap+0x24c/0x3a0
iopt_unmap_iova_range+0x4c4/0x5f0
iopt_unmap_all+0x27/0x50
iommufd_ioas_unmap+0x3d0/0x490
iommufd_fops_ioctl+0x317/0x4b0
__x64_sys_ioctl+0x197/0x210
do_syscall_64+0x39/0xb0
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fec1dae3b19
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fec1da74308 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fec1db6b438 RCX: 00007fec1dae3b19
RDX: 0000000020000100 RSI: 0000000000003b86 RDI: 0000000000000003
RBP: 00007fec1db6b430 R08: 00007fec1da74700 R09: 0000000000000000
R10: 00007fec1da74700 R11: 0000000000000246 R12: 00007fec1db6b43c
R13: 00007fec1db39074 R14: 6d6f692f7665642f R15: 0000000000022000
</TASK>

Allocated by task 5770:
kasan_save_stack+0x22/0x40
kasan_set_track+0x25/0x30
__kasan_kmalloc+0xa2/0xb0
iopt_alloc_area_pages+0x94/0x560
iopt_map_user_pages+0x205/0x4e0
iommufd_ioas_map+0x329/0x5f0
iommufd_fops_ioctl+0x317/0x4b0
__x64_sys_ioctl+0x197/0x210
do_syscall_64+0x39/0xb0
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Freed by task 5770:
kasan_save_stack+0x22/0x40
kasan_set_track+0x25/0x30
kasan_save_free_info+0x2e/0x40
____kasan_slab_free+0x160/0x1c0
slab_free_freelist_hook+0x8b/0x1c0
__kmem_cache_free+0xaf/0x2d0
iopt_unmap_iova_range+0x288/0x5f0
iopt_unmap_all+0x27/0x50
iommufd_ioas_unmap+0x3d0/0x490
iommufd_fops_ioctl+0x317/0x4b0
__x64_sys_ioctl+0x197/0x210
do_syscall_64+0x39/0xb0
entry_SYSCALL_64_after_hwframe+0x63/0xcd

The parallel unmap freed iter->area the instant the lock was released.

Fixes: 51fe6141f0f6 ("iommufd: Data structure to provide IOVA to PFN mapping")
Link: https://lore.kernel.org/r/2-v2-9a03761d445d+54-iommufd_syz2_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reported-by: syzbot+6c8d756f238a75fc3eb8@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/000000000000905eba05fe38e9f2@google.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 632fda7f 27-Mar-2023 Yi Liu <yi.l.liu@intel.com>

vfio-iommufd: Make vfio_iommufd_emulated_bind() return iommufd_access ID

The vfio device cdev needs to return the iommufd_access ID to userspace
if bind_iommufd succeeds.

Link: https://lore.kernel.org/r/20230327093351.44505-5-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 54b47585 27-Mar-2023 Nicolin Chen <nicolinc@nvidia.com>

iommufd: Create access in vfio_iommufd_emulated_bind()

There is a need to create the iommufd_access prior to having an IOAS and
to set the IOAS later, e.g. the vfio device cdev needs an iommufd object
to represent the bond (iommufd_access) and to support IOAS replacement.

Move the iommufd_access_create() call into vfio_iommufd_emulated_bind(),
making it symmetric with the __vfio_iommufd_access_destroy() call in
vfio_iommufd_emulated_unbind(). This means an access is created/destroyed
by bind()/unbind(), and vfio_iommufd_emulated_attach_ioas() only updates
the access->ioas pointer.

Since vfio_iommufd_emulated_bind() does not provide ioas_id, drop it from
the argument list of iommufd_access_create(). Instead, add a new access
API iommufd_access_attach() to set the access->ioas pointer. Also, set
vdev->iommufd_attached accordingly, similar to the physical pathway.

Link: https://lore.kernel.org/r/20230327093351.44505-3-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
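
A hedged sketch of the split described above, where the access is created
at bind time and only pointed at an IOAS later; the signatures are assumed
from this series' description of the reworked kAPI:

  #include <linux/iommufd.h>

  /* unmap callback: must stop using any pinned pages in [iova, iova + length) */
  static void emulated_unmap(void *data, unsigned long iova, unsigned long length)
  {
  }

  static const struct iommufd_access_ops emulated_ops = {
          .unmap = emulated_unmap,
  };

  /* bind(): create only the bond object, no IOAS yet */
  static struct iommufd_access *emulated_bind(struct iommufd_ctx *ictx,
                                              void *data, u32 *out_access_id)
  {
          return iommufd_access_create(ictx, &emulated_ops, data, out_access_id);
  }

  /* attach_ioas(): later, just set the access->ioas pointer via the new API */
  static int emulated_attach(struct iommufd_access *access, u32 ioas_id)
  {
          return iommufd_access_attach(access, ioas_id);
  }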


# 65c619ae 01-Mar-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd/selftest: Make selftest create a more complete mock device

iommufd wants to use more infrastructure, like the iommu_group, that the
mock device does not support. Create a more complete mock device that can
go through the whole cycle of ownership and blocking domain, and that has
an iommu_group.

This requires creating a real struct device on a real bus to be able to
connect it to an iommu_group. Unfortunately we cannot formally attach the
mock iommu driver as an actual driver as the iommu core does not allow
more than one driver or provide a general way for busses to link to
iommus. This can be solved with a little hack to open code the dev_iommus
struct.

With this infrastructure things work exactly the same as the normal domain
path, including the auto domains mechanism and direct attach of hwpts. As
the created hwpt is now an autodomain it is no longer required to destroy
it and trying to do so will trigger a failure.

Link: https://lore.kernel.org/r/11-v3-ae9c2975a131+2e1e8-iommufd_hwpt_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 339fbf3a 01-Mar-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Make iommufd_hw_pagetable_alloc() do iopt_table_add_domain()

The HWPT is always linked to an IOAS, and once a HWPT exists its domain
should be fully mapped. This ended up being split up into device.c during
a two-phase creation that was a bit confusing.

Move the iopt_table_add_domain() into iommufd_hw_pagetable_alloc() by
having it call back to device.c to complete the domain attach in the
required order.

Calling iommufd_hw_pagetable_alloc() with immediate_attach = false will
work on most drivers, but notably the SMMU drivers will fail because they
can't decide what kind of domain to create until they are attached. This
will be fixed when the domain_alloc function can take in a struct device.

Link: https://lore.kernel.org/r/6-v3-ae9c2975a131+2e1e8-iommufd_hwpt_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 7e7ec8a5 01-Mar-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Move iommufd_device to iommufd_private.h

hw_pagetable.c will need this in the next patches.

Link: https://lore.kernel.org/r/5-v3-ae9c2975a131+2e1e8-iommufd_hwpt_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 25cde97d 01-Mar-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Move ioas related HWPT destruction into iommufd_hw_pagetable_destroy()

A HWPT is permanently associated with an IOAS when it is created. Remove
the strange situation where a refcount != 0 HWPT can have been
disconnected from the IOAS, by putting all the IOAS-related destruction in
the object destroy function.

Initializing a HWPT has two stages: we have to allocate it, attach it to
a device, and then populate the domain. Once the domain is populated it is
fully linked to the IOAS.

Arrange things so that all the error unwinds flow through the
iommufd_hw_pagetable_destroy() and allow it to handle all cases.

Link: https://lore.kernel.org/r/4-v3-ae9c2975a131+2e1e8-iommufd_hwpt_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 342b9cab 01-Mar-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Consistently manage hwpt_item

This should be added immediately after every iopt_table_add_domain(), and
deleted after every iopt_table_remove_domain() under the ioas->mutex.

Tidy things to be consistent.

Link: https://lore.kernel.org/r/3-v3-ae9c2975a131+2e1e8-iommufd_hwpt_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 7214c1c8 01-Mar-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Add iommufd_lock_obj() around the auto-domains hwpts

A later patch will require this locking; currently, under the ioas mutex,
the hwpt cannot have a zero refcount and be on the list.

Link: https://lore.kernel.org/r/2-v3-ae9c2975a131+2e1e8-iommufd_hwpt_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 085fcc7e 01-Mar-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Assert devices_lock for iommufd_hw_pagetable_has_group()

The hwpt->devices list is locked by this; make it clearer.

Link: https://lore.kernel.org/r/1-v3-ae9c2975a131+2e1e8-iommufd_hwpt_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# b4ff830e 13-Feb-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Do not add the same hwpt to the ioas->hwpt_list twice

The hwpt is added to the hwpt_list only during its creation; it is never
added again. This hunk is a missed leftover from a rework. Adding it twice
will corrupt the linked list in some cases.

It affects HWPT-specific attachment, which is something the test suite
cannot cover until we can create a legitimate struct device with a
non-system iommu "driver" (i.e. we need the bus removed from the iommu
code).

Cc: stable@vger.kernel.org
Fixes: e8d57210035b ("iommufd: Add kAPI toward external drivers for physical devices")
Link: https://lore.kernel.org/r/1-v1-4336b5cb2fe4+1d7-iommufd_hwpt_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reported-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 25fc417f 08-Dec-2022 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Convert to msi_device_has_isolated_msi()

Trivially use the new API.

Link: https://lore.kernel.org/r/4-v3-3313bb5dd3a3+10f11-secure_msi_jgg@nvidia.com
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# d6c55c0a 07-Dec-2022 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Change the order of MSI setup

Eric points out this is wrong for the rare case of someone using
allow_unsafe_interrupts on ARM. We always have to set up the MSI window in
the domain if the iommu driver asks for it.

Move the iommu_get_msi_cookie() setup to the top of the function and
always do it, regardless of the security mode. Add checks to
iommufd_device_setup_msi() to ensure the driver is not doing something
incomprehensible. No current driver will set both a HW and SW MSI window,
or have more than one SW MSI window.

Fixes: e8d57210035b ("iommufd: Add kAPI toward external drivers for physical devices")
Link: https://lore.kernel.org/r/3-v1-0362a1a1c034+98-iommufd_fixes1_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reported-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 52f52858 29-Nov-2022 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Add additional invariant assertions

These are on performance paths, so we protect them with
CONFIG_IOMMUFD_TEST to avoid taking a hit during normal operation.

These are useful when running the test suite and syzkaller to find data
structure inconsistencies early.

Link: https://lore.kernel.org/r/18-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> # s390
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# f4b20bb3 29-Nov-2022 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Add kernel support for testing iommufd

Provide a mock kernel module for the iommu_domain that allows it to run
without any HW; the mocking provides a way to directly validate that the
PFNs loaded into the iommu_domain are correct. This exposes the access
kAPI toward userspace to allow userspace to explore the functionality of
pages.c and io_pagetable.c.

The mock also simulates the rare case of PAGE_SIZE > iommu page size as
the mock will operate at a 2K iommu page size. This allows exercising all
of the calculations to support this mismatch.

This is also intended to support syzkaller exploring the same space.

However, it is an unusually invasive config option to enable all of
this. The config option should not be enabled in a production kernel.

Link: https://lore.kernel.org/r/16-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> # s390
Tested-by: Eric Auger <eric.auger@redhat.com> # aarch64
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 8d40205f 29-Nov-2022 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Add kAPI toward external drivers for kernel access

Kernel access is the mode that VFIO "mdevs" use. In this case there is no
struct device and no IOMMU connection. iommufd acts as a record keeper for
accesses and returns the actual struct pages back to the caller to use
however they need, e.g. with kmap or the DMA API.

Each caller must create a struct iommufd_access with
iommufd_access_create(), similar to how iommufd_device_bind() works. Using
this struct the caller can access blocks of IOVA using
iommufd_access_pin_pages() or iommufd_access_rw().

Callers must provide a callback that immediately unpins any IOVA being
used within a range. This happens if userspace unmaps the IOVA under the
pin.

The implementation forwards the access requests directly to the iopt
infrastructure that manages the iopt_pages_access.

Link: https://lore.kernel.org/r/14-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
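
A hedged sketch of the pin/use/unpin flow described above for an
already-created struct iommufd_access; the signatures are assumed from
this commit's description:

  #include <linux/iommufd.h>
  #include <linux/mm.h>

  /* Pin an IOVA range, receive struct pages back, then release them */
  static int touch_iova_range(struct iommufd_access *access, unsigned long iova,
                              unsigned long npages, struct page **pages)
  {
          int rc;

          rc = iommufd_access_pin_pages(access, iova, npages * PAGE_SIZE,
                                        pages, 0);
          if (rc)
                  return rc;

          /* ... kmap or DMA-map the pages as needed ... */

          iommufd_access_unpin_pages(access, iova, npages * PAGE_SIZE);
          return 0;
  }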


# e8d57210 29-Nov-2022 Jason Gunthorpe <jgg@ziepe.ca>

iommufd: Add kAPI toward external drivers for physical devices

Add the four functions external drivers need to connect physical DMA to
the IOMMUFD:

iommufd_device_bind() / iommufd_device_unbind()
Register the device with iommufd and establish security isolation.

iommufd_device_attach() / iommufd_device_detach()
Connect a bound device to a page table

Binding a device creates a device object ID in the uAPI; however, the
generic API does not yet provide any IOCTLs to manipulate them.

Link: https://lore.kernel.org/r/13-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
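
A minimal sketch of the external-driver flow listed above, assuming the
signatures described in this commit (bind returns the iommufd_device,
attach takes an in/out page table ID):

  #include <linux/device.h>
  #include <linux/err.h>
  #include <linux/iommufd.h>

  /* Bind a physical device to iommufd and attach it to an IOAS/HWPT */
  static int bind_and_attach(struct iommufd_ctx *ictx, struct device *dev,
                             u32 ioas_id)
  {
          struct iommufd_device *idev;
          u32 dev_id, pt_id = ioas_id;
          int rc;

          /* Establish security isolation and create the device object ID */
          idev = iommufd_device_bind(ictx, dev, &dev_id);
          if (IS_ERR(idev))
                  return PTR_ERR(idev);

          /* Connect to a page table; an auto-domain HWPT ID may be written back */
          rc = iommufd_device_attach(idev, &pt_id);
          if (rc) {
                  iommufd_device_unbind(idev);
                  return rc;
          }
          return 0;
  }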