History log of /linux-master/drivers/vfio/vfio_iommu_type1.c
Revision Date Author Comments
# 5b992412 29-Feb-2024 Alex Williamson <alex.williamson@redhat.com>

Revert "vfio/type1: Unpin zero pages"

This reverts commit 873aefb376bbc0ed1dd2381ea1d6ec88106fdbd4.

This was a heinous workaround and it turns out it's been fixed in mm
twice since it was introduced. Most recently, commit c8070b787519
("mm: Don't pin ZERO_PAGE in pin_user_pages()") would have prevented
running up the zeropage refcount, but even before that commit
84209e87c696 ("mm/gup: reliable R/O long-term pinning in COW mappings")
avoids the vfio use case from pinning the zeropage at all, instead
replacing it with exclusive anonymous pages.

Remove this now useless overhead.

Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240229223544.257207-1-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 160912fc 30-Nov-2023 Pasha Tatashin <pasha.tatashin@soleen.com>

vfio/type1: account iommu allocations

iommu allocations should be accounted in order to allow admins to
monitor and limit the amount of iommu memory.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20231130200900.2320829-1-pasha.tatashin@soleen.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# a881b496 09-Aug-2023 Stefan Hajnoczi <stefanha@redhat.com>

vfio: align capability structures

The VFIO_DEVICE_GET_INFO, VFIO_DEVICE_GET_REGION_INFO, and
VFIO_IOMMU_GET_INFO ioctls fill in an info struct followed by capability
structs:

+------+---------+---------+-----+
| info | caps[0] | caps[1] | ... |
+------+---------+---------+-----+

Both the info and capability struct sizes are not always multiples of
sizeof(u64), leaving u64 fields in later capability structs misaligned.

Userspace applications currently need to handle misalignment manually in
order to support CPU architectures and programming languages with strict
alignment requirements.

Make life easier for userspace by ensuring alignment in the kernel. This
is done by padding info struct definitions and by copying out zeroes
after capability structs that are not aligned.

The new layout is as follows:

+------+---------+---+---------+-----+
| info | caps[0] | 0 | caps[1] | ... |
+------+---------+---+---------+-----+

In this example caps[0] has a size that is not multiples of sizeof(u64),
so zero padding is added to align the subsequent structure.

Adding zero padding between structs does not break the uapi. The memory
layout is specified by the info.cap_offset and caps[i].next fields
filled in by the kernel. Applications use these field values to locate
structs and are therefore unaffected by the addition of zero padding.

Note that code that copies out info structs with padding is updated to
always zero the struct and copy out as many bytes as userspace
requested. This makes the code shorter and avoids potential information
leaks by ensuring padding is initialized.

Originally-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230809203144.2880050-1-stefanha@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# cd24e2a6 01-Aug-2023 Stefan Hajnoczi <stefanha@redhat.com>

vfio/type1: fix cap_migration information leak

Fix an information leak where an uninitialized hole in struct
vfio_iommu_type1_info_cap_migration on the stack is exposed to userspace.

The definition of struct vfio_iommu_type1_info_cap_migration contains a hole as
shown in this pahole(1) output:

struct vfio_iommu_type1_info_cap_migration {
struct vfio_info_cap_header header; /* 0 8 */
__u32 flags; /* 8 4 */

/* XXX 4 bytes hole, try to pack */

__u64 pgsize_bitmap; /* 16 8 */
__u64 max_dirty_bitmap_size; /* 24 8 */

/* size: 32, cachelines: 1, members: 4 */
/* sum members: 28, holes: 1, sum holes: 4 */
/* last cacheline: 32 bytes */
};

The cap_mig variable is filled in without initializing the hole:

static int vfio_iommu_migration_build_caps(struct vfio_iommu *iommu,
struct vfio_info_cap *caps)
{
struct vfio_iommu_type1_info_cap_migration cap_mig;

cap_mig.header.id = VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION;
cap_mig.header.version = 1;

cap_mig.flags = 0;
/* support minimum pgsize */
cap_mig.pgsize_bitmap = (size_t)1 << __ffs(iommu->pgsize_bitmap);
cap_mig.max_dirty_bitmap_size = DIRTY_BITMAP_SIZE_MAX;

return vfio_info_add_capability(caps, &cap_mig.header, sizeof(cap_mig));
}

The structure is then copied to a temporary location on the heap. At this point
it's already too late and ioctl(VFIO_IOMMU_GET_INFO) copies it to userspace
later:

int vfio_info_add_capability(struct vfio_info_cap *caps,
struct vfio_info_cap_header *cap, size_t size)
{
struct vfio_info_cap_header *header;

header = vfio_info_cap_add(caps, size, cap->id, cap->version);
if (IS_ERR(header))
return PTR_ERR(header);

memcpy(header + 1, cap + 1, size - sizeof(*header));

return 0;
}

This issue was found by code inspection.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Fixes: ad721705d09c ("vfio iommu: Add migration capability to report supported features")
Link: https://lore.kernel.org/r/20230801155352.1391945-1-stefanha@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# c33c7948 12-Jun-2023 Ryan Roberts <ryan.roberts@arm.com>

mm: ptep_get() conversion

Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper. This means that by default, the accesses change from a
C dereference to a READ_ONCE(). This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.

But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte. Arch code
is deliberately not converted, as the arch code knows best. It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.

Conversion was done using Coccinelle:

----

// $ make coccicheck \
// COCCI=ptepget.cocci \
// SPFLAGS="--include-headers" \
// MODE=patch

virtual patch

@ depends on patch @
pte_t *v;
@@

- *v
+ ptep_get(v)

----

Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so. This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.

Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot. The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get(). HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined. Fix by continuing to do a direct dereference
when MMU=n. This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.

Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0b295316 17-May-2023 Lorenzo Stoakes <lstoakes@gmail.com>

mm/gup: remove unused vmas parameter from pin_user_pages_remote()

No invocation of pin_user_pages_remote() uses the vmas parameter, so
remove it. This forms part of a larger patch set eliminating the use of
the vmas parameters altogether.

Link: https://lkml.kernel.org/r/28f000beb81e45bf538a2aaa77c90f5482b67a32.1684350871.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 4752354a 19-May-2023 Yan Zhao <yan.y.zhao@intel.com>

vfio/type1: check pfn valid before converting to struct page

Check physical PFN is valid before converting the PFN to a struct page
pointer to be returned to caller of vfio_pin_pages().

vfio_pin_pages() pins user pages with contiguous IOVA.
If the IOVA of a user page to be pinned belongs to vma of vm_flags
VM_PFNMAP, pin_user_pages_remote() will return -EFAULT without returning
struct page address for this PFN. This is because usually this kind of PFN
(e.g. MMIO PFN) has no valid struct page address associated.
Upon this error, vaddr_get_pfns() will obtain the physical PFN directly.

While previously vfio_pin_pages() returns to caller PFN arrays directly,
after commit
34a255e67615 ("vfio: Replace phys_pfn with pages for vfio_pin_pages()"),
PFNs will be converted to "struct page *" unconditionally and therefore
the returned "struct page *" array may contain invalid struct page
addresses.

Given current in-tree users of vfio_pin_pages() only expect "struct page *
returned, check PFN validity and return -EINVAL to let the caller be
aware of IOVAs to be pinned containing PFN not able to be returned in
"struct page *" array. So that, the caller will not consume the returned
pointer (e.g. test PageReserved()) and avoid error like "supervisor read
access in kernel mode".

Fixes: 34a255e67615 ("vfio: Replace phys_pfn with pages for vfio_pin_pages()")
Cc: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230519065843.10653-1-yan.y.zhao@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 428e106a 12-Mar-2023 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: Introduce untagged_addr_remote()

untagged_addr() removes tags/metadata from the address and brings it to
the canonical form. The helper is implemented on arm64 and sparc. Both of
them do untagging based on global rules.

However, Linear Address Masking (LAM) on x86 introduces per-process
settings for untagging. As a result, untagged_addr() is now only
suitable for untagging addresses for the current proccess.

The new helper untagged_addr_remote() has to be used when the address
targets remote process. It requires the mmap lock for target mm to be
taken.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Alexander Potapenko <glider@google.com>
Link: https://lore.kernel.org/all/20230312112612.31869-6-kirill.shutemov%40linux.intel.com


# a5ac1f81 31-Jan-2023 Steve Sistare <steven.sistare@oracle.com>

vfio/type1: revert "implement notify callback"

This is dead code. Revert it.
commit 487ace134053 ("vfio/type1: implement notify callback")

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1675184289-267876-7-git-send-email-steven.sistare@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# da4f1c2e 31-Jan-2023 Steve Sistare <steven.sistare@oracle.com>

vfio/type1: revert "block on invalid vaddr"

Revert this dead code:
commit 898b9eaeb3fe ("vfio/type1: block on invalid vaddr")

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1675184289-267876-6-git-send-email-steven.sistare@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 90fdd158 31-Jan-2023 Steve Sistare <steven.sistare@oracle.com>

vfio/type1: restore locked_vm

When a vfio container is preserved across exec or fork-exec, the new
task's mm has a locked_vm count of 0. After a dma vaddr is updated using
VFIO_DMA_MAP_FLAG_VADDR, locked_vm remains 0, and the pinned memory does
not count against the task's RLIMIT_MEMLOCK.

To restore the correct locked_vm count, when VFIO_DMA_MAP_FLAG_VADDR is
used and the dma's mm has changed, add the dma's locked_vm count to
the new mm->locked_vm, subject to the rlimit, and subtract it from the
old mm->locked_vm.

Fixes: c3cbab24db38 ("vfio/type1: implement interfaces to update vaddr")
Cc: stable@vger.kernel.org
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1675184289-267876-5-git-send-email-steven.sistare@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 18e29270 31-Jan-2023 Steve Sistare <steven.sistare@oracle.com>

vfio/type1: track locked_vm per dma

Track locked_vm per dma struct, and create a new subroutine, both for use
in a subsequent patch. No functional change.

Fixes: c3cbab24db38 ("vfio/type1: implement interfaces to update vaddr")
Cc: stable@vger.kernel.org
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1675184289-267876-4-git-send-email-steven.sistare@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 046eca50 31-Jan-2023 Steve Sistare <steven.sistare@oracle.com>

vfio/type1: prevent underflow of locked_vm via exec()

When a vfio container is preserved across exec, the task does not change,
but it gets a new mm with locked_vm=0, and loses the count from existing
dma mappings. If the user later unmaps a dma mapping, locked_vm underflows
to a large unsigned value, and a subsequent dma map request fails with
ENOMEM in __account_locked_vm.

To avoid underflow, grab and save the mm at the time a dma is mapped.
Use that mm when adjusting locked_vm, rather than re-acquiring the saved
task's mm, which may have changed. If the saved mm is dead, do nothing.

locked_vm is incremented for existing mappings in a subsequent patch.

Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
Cc: stable@vger.kernel.org
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1675184289-267876-3-git-send-email-steven.sistare@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# ef3a3f6a 31-Jan-2023 Steve Sistare <steven.sistare@oracle.com>

vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR

Disable the VFIO_UPDATE_VADDR capability if mediated devices are present.
Their kernel threads could be blocked indefinitely by a misbehaving
userland while trying to pin/unpin pages while vaddrs are being updated.

Do not allow groups to be added to the container while vaddr's are invalid,
so we never need to block user threads from pinning, and can delete the
vaddr-waiting code in a subsequent patch.

Fixes: c3cbab24db38 ("vfio/type1: implement interfaces to update vaddr")
Cc: stable@vger.kernel.org
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1675184289-267876-2-git-send-email-steven.sistare@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 1369459b 23-Jan-2023 Jason Gunthorpe <jgg@ziepe.ca>

iommu: Add a gfp parameter to iommu_map()

The internal mechanisms support this, but instead of exposting the gfp to
the caller it wrappers it into iommu_map() and iommu_map_atomic()

Fix this instead of adding more variants for GFP_KERNEL_ACCOUNT.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Link: https://lore.kernel.org/r/1-v3-76b587fe28df+6e3-iommu_map_gfp_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>


# 6b1a7a00 08-Dec-2022 Jason Gunthorpe <jgg@ziepe.ca>

vfio/type1: Convert to iommu_group_has_isolated_msi()

Trivially use the new API.

Link: https://lore.kernel.org/r/3-v3-3313bb5dd3a3+10f11-secure_msi_jgg@nvidia.com
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 895c0747 10-Jan-2023 Niklas Schnelle <schnelle@linux.ibm.com>

vfio/type1: Respect IOMMU reserved regions in vfio_test_domain_fgsp()

Since commit cbf7827bc5dc ("iommu/s390: Fix potential s390_domain
aperture shrinking") the s390 IOMMU driver uses reserved regions for the
system provided DMA ranges of PCI devices. Previously it reduced the
size of the IOMMU aperture and checked it on each mapping operation.
On current machines the system denies use of DMA addresses below 2^32 for
all PCI devices.

Usually mapping IOVAs in a reserved regions is harmless until a DMA
actually tries to utilize the mapping. However on s390 there is
a virtual PCI device called ISM which is implemented in firmware and
used for cross LPAR communication. Unlike real PCI devices this device
does not use the hardware IOMMU but inspects IOMMU translation tables
directly on IOTLB flush (s390 RPCIT instruction). If it detects IOVA
mappings outside the allowed ranges it goes into an error state. This
error state then causes the device to be unavailable to the KVM guest.

Analysing this we found that vfio_test_domain_fgsp() maps 2 pages at DMA
address 0 irrespective of the IOMMUs reserved regions. Even if usually
harmless this seems wrong in the general case so instead go through the
freshly updated IOVA list and try to find a range that isn't reserved,
and fits 2 pages, is PAGE_SIZE * 2 aligned. If found use that for
testing for fine grained super pages.

Fixes: af029169b8fd ("vfio/type1: Check reserved region conflict and update iova list")
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230110164427.4051938-2-schnelle@linux.ibm.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# fa49364c 16-Aug-2022 Robin Murphy <robin.murphy@arm.com>

iommu/dma: Move public interfaces to linux/iommu.h

The iommu-dma layer is now mostly encapsulated by iommu_dma_ops, with
only a couple more public interfaces left pertaining to MSI integration.
Since these depend on the main IOMMU API header anyway, move their
declarations there, taking the opportunity to update the half-baked
comments to proper kerneldoc along the way.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/9cd99738f52094e6bed44bfee03fa4f288d20695.1660668998.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>


# 873aefb3 29-Aug-2022 Alex Williamson <alex.williamson@redhat.com>

vfio/type1: Unpin zero pages

There's currently a reference count leak on the zero page. We increment
the reference via pin_user_pages_remote(), but the page is later handled
as an invalid/reserved page, therefore it's not accounted against the
user and not unpinned by our put_pfn().

Introducing special zero page handling in put_pfn() would resolve the
leak, but without accounting of the zero page, a single user could
still create enough mappings to generate a reference count overflow.

The zero page is always resident, so for our purposes there's no reason
to keep it pinned. Therefore, add a loop to walk pages returned from
pin_user_pages_remote() and unpin any zero pages.

Cc: stable@vger.kernel.org
Reported-by: Luboslav Pivarc <lpivarc@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/166182871735.3518559.8884121293045337358.stgit@omen
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 34a255e6 22-Jul-2022 Nicolin Chen <nicolinc@nvidia.com>

vfio: Replace phys_pfn with pages for vfio_pin_pages()

Most of the callers of vfio_pin_pages() want "struct page *" and the
low-level mm code to pin pages returns a list of "struct page *" too.
So there's no gain in converting "struct page *" to PFN in between.

Replace the output parameter "phys_pfn" list with a "pages" list, to
simplify callers. This also allows us to replace the vfio_iommu_type1
implementation with a more efficient one.

And drop the pfn_valid check in the gvt code, as there is no need to
do such a check at a page-backed struct page pointer.

For now, also update vfio_iommu_type1 to fit this new parameter too.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Eric Farman <farman@linux.ibm.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Eric Farman <farman@linux.ibm.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/20220723020256.30081-11-nicolinc@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 44abdd16 22-Jul-2022 Nicolin Chen <nicolinc@nvidia.com>

vfio: Pass in starting IOVA to vfio_pin/unpin_pages API

The vfio_pin/unpin_pages() so far accepted arrays of PFNs of user IOVA.
Among all three callers, there was only one caller possibly passing in
a non-contiguous PFN list, which is now ensured to have contiguous PFN
inputs too.

Pass in the starting address with "iova" alone to simplify things, so
callers no longer need to maintain a PFN list or to pin/unpin one page
at a time. This also allows VFIO to use more efficient implementations
of pin/unpin_pages.

For now, also update vfio_iommu_type1 to fit this new parameter too,
while keeping its input intact (being user_iova) since we don't want
to spend too much effort swapping its parameters and local variables
at that level.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com>
Acked-by: Eric Farman <farman@linux.ibm.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Eric Farman <farman@linux.ibm.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/20220723020256.30081-6-nicolinc@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# e8f90717 22-Jul-2022 Nicolin Chen <nicolinc@nvidia.com>

vfio: Make vfio_unpin_pages() return void

There's only one caller that checks its return value with a WARN_ON_ONCE,
while all other callers don't check the return value at all. Above that,
an undo function should not fail. So, simplify the API to return void by
embedding similar WARN_ONs.

Also for users to pinpoint which condition fails, separate WARN_ON lines,
yet remove the "driver->ops->unpin_pages" check, since it's unreasonable
for callers to unpin on something totally random that wasn't even pinned.
And remove NULL pointer checks for they would trigger oops vs. warnings.
Note that npage is already validated in the vfio core, thus drop the same
check in the type1 code.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/20220723020256.30081-2-nicolinc@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 8cfc5b60 19-Jul-2022 Jason Gunthorpe <jgg@ziepe.ca>

vfio: Replace the iommu notifier with a device list

Instead of bouncing the function call to the driver op through a blocking
notifier just have the iommu layer call it directly.

Register each device that is being attached to the iommu with the lower
driver which then threads them on a linked list and calls the appropriate
driver op at the right time.

Currently the only use is if dma_unmap() is defined.

Also, fully lock all the debugging tests on the pinning path that a
dma_unmap is registered.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v4-681e038e30fd+78-vfio_unmap_notif_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# ffed0518 26-Jun-2022 Li Zhe <lizhe.67@bytedance.com>

vfio: remove useless judgement

In function vfio_dma_do_unmap(), we currently prevent process to unmap
vfio dma region whose mm_struct is different from the vfio_dma->task.
In our virtual machine scenario which is using kvm and qemu, this
judgement stops us from liveupgrading our qemu, which uses fork() &&
exec() to load the new binary but the new process cannot do the
VFIO_IOMMU_UNMAP_DMA action during vm exit because of this judgement.

This judgement is added in commit 8f0d5bb95f76 ("vfio iommu type1: Add
task structure to vfio_dma") for the security reason. But it seems that
no other task who has no family relationship with old and new process
can get the same vfio_dma struct here for the reason of resource
isolation. So this patch delete it.

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
Link: https://lore.kernel.org/r/20220627035109.73745-1-lizhe.67@bytedance.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# eed20c78 24-Jun-2022 Robin Murphy <robin.murphy@arm.com>

vfio/type1: Simplify bus_type determination

Since IOMMU groups are mandatory for drivers to support, it stands to
reason that any device which has been successfully added to a group
must be on a bus supported by that IOMMU driver, and therefore a domain
viable for any device in the group must be viable for all devices in
the group. This already has to be the case for the IOMMU API's internal
default domain, for instance. Thus even if the group contains devices on
different buses, that can only mean that the IOMMU driver actually
supports such an odd topology, and so without loss of generality we can
expect the bus type of any device in a group to be suitable for IOMMU
API calls.

Furthermore, scrutiny reveals a lack of protection for the bus being
removed while vfio_iommu_type1_attach_group() is using it; the reference
that VFIO holds on the iommu_group ensures that data remains valid, but
does not prevent the group's membership changing underfoot.

We can address both concerns by recycling vfio_bus_type() into some
superficially similar logic to indirect the IOMMU API calls themselves.
Each call is thus protected from races by the IOMMU group's own locking,
and we no longer need to hold group-derived pointers beyond that scope.
It also gives us an easy path for the IOMMU API's migration of bus-based
interfaces to device-based, of which we can already take the first step
with device_iommu_capable(). As with domains, any capability must in
practice be consistent for devices in a given group - and after all it's
still the same capability which was expected to be consistent across an
entire bus! - so there's no need for any complicated validation.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/194a12d3434d7b38f84fa96503c7664451c8c395.1656092606.git.robin.murphy@arm.com
[aw: add comment to vfio_iommu_device_capable()]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 71cfafda 10-Apr-2022 Jason Gunthorpe <jgg@ziepe.ca>

vfio: Move the Intel no-snoop control off of IOMMU_CACHE

IOMMU_CACHE means "normal DMA to this iommu_domain's IOVA should be cache
coherent" and is used by the DMA API. The definition allows for special
non-coherent DMA to exist - ie processing of the no-snoop flag in PCIe
TLPs - so long as this behavior is opt-in by the device driver.

The flag is mainly used by the DMA API to synchronize the IOMMU setting
with the expected cache behavior of the DMA master. eg based on
dev_is_dma_coherent() in some case.

For Intel IOMMU IOMMU_CACHE was redefined to mean 'force all DMA to be
cache coherent' which has the practical effect of causing the IOMMU to
ignore the no-snoop bit in a PCIe TLP.

x86 platforms are always IOMMU_CACHE, so Intel should ignore this flag.

Instead use the new domain op enforce_cache_coherency() which causes every
IOPTE created in the domain to have the no-snoop blocking behavior.

Reconfigure VFIO to always use IOMMU_CACHE and call
enforce_cache_coherency() to operate the special Intel behavior.

Remove the IOMMU_CACHE test from Intel IOMMU.

Ultimately VFIO plumbs the result of enforce_cache_coherency() back into
the x86 platform code through kvm_arch_register_noncoherent_dma() which
controls if the WBINVD instruction is available in the guest. No other
archs implement kvm_arch_register_noncoherent_dma() nor are there any
other known consumers of VFIO_DMA_CC_IOMMU that might be affected by the
user visible result change on non-x86 archs.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v3-2cf356649677+a32-intel_no_snoop_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>


# 2bed2ced 12-Dec-2021 Jiacheng Shi <billsjc@sjtu.edu.cn>

vfio/iommu_type1: replace kfree with kvfree

Variables allocated by kvzalloc should not be freed by kfree.
Because they may be allocated by vmalloc.
So we replace kfree with kvfree here.

Fixes: d6a4c185660c ("vfio iommu: Implementation of ioctl for dirty pages tracking")
Signed-off-by: Jiacheng Shi <billsjc@sjtu.edu.cn>
Link: https://lore.kernel.org/r/20211212091600.2560-1-billsjc@sjtu.edu.cn
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 3f901389 24-Sep-2021 Christoph Hellwig <hch@lst.de>

vfio/iommu_type1: remove IS_IOMMU_CAP_DOMAIN_IN_CONTAINER

IS_IOMMU_CAP_DOMAIN_IN_CONTAINER just obsfucated the checks being
performed, so open code it in the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20210924155705.4258-16-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 296e505b 24-Sep-2021 Christoph Hellwig <hch@lst.de>

vfio/iommu_type1: remove the "external" domain

The external_domain concept rather misleading and not actually needed.
Replace it with a list of emulated groups in struct vfio_iommu.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20210924155705.4258-15-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 65cdbf10 24-Sep-2021 Christoph Hellwig <hch@lst.de>

vfio/iommu_type1: initialize pgsize_bitmap in ->open

Ensure pgsize_bitmap is always valid by initializing it to PAGE_MASK
in vfio_iommu_type1_open and remove the now pointless update for
the external domain case in vfio_iommu_type1_attach_group, which was
just setting pgsize_bitmap to PAGE_MASK when only external domains
were attached.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210924155705.4258-14-hch@lst.de
[aw: s/ULONG_MAX/PAGE_MASK/ per discussion in link]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# c3c0fa9d 24-Sep-2021 Christoph Hellwig <hch@lst.de>

vfio: clean up the check for mediated device in vfio_iommu_type1

Pass the group flags to ->attach_group and remove the messy check for
the bus type.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20210924155705.4258-12-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# fda49d97 24-Sep-2021 Christoph Hellwig <hch@lst.de>

vfio: remove the unused mdev iommu hook

The iommu_device field in struct mdev_device has never been used
since it was added more than 2 years ago.

This is a manual revert of commit 7bd50f0cd2
("vfio/type1: Add domain at(de)taching group helpers").

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20210924155705.4258-11-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 8cc02d22 24-Sep-2021 Christoph Hellwig <hch@lst.de>

vfio: move the vfio_iommu_driver_ops interface out of <linux/vfio.h>

Create a new private drivers/vfio/vfio.h header for the interface between
the VFIO core and the iommu drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20210924155705.4258-10-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# ffc95d1b 23-Aug-2021 Anthony Yznaga <anthony.yznaga@oracle.com>

vfio/type1: Fix vfio_find_dma_valid return

vfio_find_dma_valid is defined to return WAITED on success if it was
necessary to wait. However, the loop forgets the WAITED value returned
by vfio_wait() and returns 0 in a later iteration. Fix it.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Link: https://lore.kernel.org/r/1629736550-2388-1-git-send-email-anthony.yznaga@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# c7396f2e 08-Jun-2021 Max Gurtovoy <mgurtovoy@nvidia.com>

vfio/iommu_type1: rename vfio_group struck to vfio_iommu_group

The vfio_group structure is already defined in vfio module so in order
to improve code readability and for simplicity, rename the vfio_group
structure in vfio_iommu_type1 module to vfio_iommu_group.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Link: https://lore.kernel.org/r/20210608112841.51897-1-mgurtovoy@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 85715d68 28-Jun-2021 Liam Howlett <liam.howlett@oracle.com>

vfio: use vma_lookup() instead of find_vma_intersection()

vma_lookup() finds the vma of a specific address with a cleaner interface
and is more readable.

Link: https://lkml.kernel.org/r/20210521174745.2219620-12-Liam.Howlett@Oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 78b23814 13-May-2021 Gustavo A. R. Silva <gustavoars@kernel.org>

vfio/iommu_type1: Use struct_size() for kzalloc()

Make use of the struct_size() helper instead of an open-coded version,
in order to avoid any potential type mistakes or integer overflows
that, in the worst scenario, could lead to heap overflows.

This code was detected with the help of Coccinelle and, audited and
fixed manually.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Message-Id: <20210513230155.GA217517@embeddedor>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 7e147547 01-Apr-2021 Christoph Hellwig <hch@lst.de>

iommu: remove DOMAIN_ATTR_NESTING

Use an explicit enable_nesting method instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Li Yang <leoyang.li@nxp.com>
Link: https://lore.kernel.org/r/20210401155256.298656-17-hch@lst.de
Signed-off-by: Joerg Roedel <jroedel@suse.de>


# bc9a05ee 01-Apr-2021 Christoph Hellwig <hch@lst.de>

iommu: remove DOMAIN_ATTR_GEOMETRY

The geometry information can be trivially queried from the iommu_domain
struture.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Li Yang <leoyang.li@nxp.com>
Link: https://lore.kernel.org/r/20210401155256.298656-16-hch@lst.de
Signed-off-by: Joerg Roedel <jroedel@suse.de>


# 43dcf6cc 11-Apr-2021 Keqian Zhu <zhukeqian1@huawei.com>

vfio/iommu_type1: Remove unused pinned_page_dirty_scope in vfio_iommu

pinned_page_dirty_scope is optimized out by commit 010321565a7d
("vfio/iommu_type1: Mantain a counter for non_pinned_groups"),
but appears again due to some issues during merging branches.
We can safely remove it here.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Message-Id: <20210412024415.30676-1-zhukeqian1@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 2a3d15f2 06-Apr-2021 Jason Gunthorpe <jgg@ziepe.ca>

vfio/mdev: Add missing typesafety around mdev_device

The mdev API should accept and pass a 'struct mdev_device *' in all
places, not pass a 'struct device *' and cast it internally with
to_mdev_device(). Particularly in its struct mdev_driver functions, the
whole point of a bus's struct device_driver wrapper is to provide type
safety compared to the default struct device_driver.

Further, the driver core standard is for bus drivers to expose their
device structure in their public headers that can be used with
container_of() inlines and '&foo->dev' to go between the class levels, and
'&foo->dev' to be used with dev_err/etc driver core helper functions. Move
'struct mdev_device' to mdev.h

Once done this allows moving some one instruction exported functions to
static inlines, which in turns allows removing one of the two grotesque
symbol_get()'s related to mdev in the core code.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Message-Id: <3-v2-d36939638fc6+d54-vfio2_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# a536019d 06-Apr-2021 Shenming Lu <lushenming@huawei.com>

vfio/type1: Remove the almost unused check in vfio_iommu_type1_unpin_pages

The check i > npage at the end of vfio_iommu_type1_unpin_pages is unused
unless npage < 0, but if npage < 0, this function will return npage, which
should return -EINVAL instead. So let's just check the parameter npage at
the start of the function. By the way, replace unpin_exit with break.

Signed-off-by: Shenming Lu <lushenming@huawei.com>
Message-Id: <20210406135009.1707-1-lushenming@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 06d738c8 30-Mar-2021 Zhen Lei <thunder.leizhen@huawei.com>

vfio/type1: fix a couple of spelling mistakes

There are several spelling mistakes, as follows:
userpsace ==> userspace
Accouting ==> Accounting
exlude ==> exclude

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Message-Id: <20210326083528.1329-2-thunder.leizhen@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 60c988bc 24-Mar-2021 Daniel Jordan <daniel.m.jordan@oracle.com>

vfio/type1: Empty batch for pfnmap pages

When vfio_pin_pages_remote() returns with a partial batch consisting of
a single VM_PFNMAP pfn, a subsequent call will unfortunately try
restoring it from batch->pages, resulting in vfio mapping the wrong page
and unbalancing the page refcount.

Prevent the function from returning with this kind of partial batch to
avoid the issue. There's no explicit check for a VM_PFNMAP pfn because
it's awkward to do so, so infer it from characteristics of the batch
instead. This may result in occasional false positives but keeps the
code simpler.

Fixes: 4d83de6da265 ("vfio/type1: Batch page pinning")
Link: https://lkml.kernel.org/r/20210323133254.33ed9161@omen.home.shazbot.org/
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Message-Id: <20210325010552.185481-1-daniel.m.jordan@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 4ab4fcfc 07-Mar-2021 Daniel Jordan <daniel.m.jordan@oracle.com>

vfio/type1: fix vaddr_get_pfns() return in vfio_pin_page_external()

vaddr_get_pfns() now returns the positive number of pfns successfully
gotten instead of zero. vfio_pin_page_external() might return 1 to
vfio_iommu_type1_pin_pages(), which will treat it as an error, if
vaddr_get_pfns() is successful but vfio_pin_page_external() doesn't
reach vfio_lock_acct().

Fix it up in vfio_pin_page_external(). Found by inspection.

Fixes: be16c1fd99f4 ("vfio/type1: Change success value of vaddr_get_pfn()")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Message-Id: <20210308172452.38864-1-daniel.m.jordan@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 7dc4b2fd 25-Feb-2021 Steve Sistare <steven.sistare@oracle.com>

vfio/type1: fix unmap all on ILP32

Some ILP32 architectures support mapping a 32-bit vaddr within a 64-bit
iova space. The unmap-all code uses 32-bit SIZE_MAX as an upper bound on
the extent of the mappings within iova space, so mappings above 4G cannot
be found and unmapped. Use U64_MAX instead, and use u64 for size variables.
This also fixes a static analysis bug found by the kernel test robot running
smatch for ILP32.

Fixes: 0f53afa12bae ("vfio/type1: unmap cleanup")
Fixes: c19650995374 ("vfio/type1: implement unmap all")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Message-Id: <1614281102-230747-1-git-send-email-steven.sistare@oracle.com>
Link: https://lore.kernel.org/linux-mm/20210222141043.GW2222@kadam
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 4d83de6d 19-Feb-2021 Daniel Jordan <daniel.m.jordan@oracle.com>

vfio/type1: Batch page pinning

Pinning one 4K page at a time is inefficient, so do it in batches of 512
instead. This is just an optimization with no functional change
intended, and in particular the driver still calls iommu_map() with the
largest physically contiguous range possible.

Add two fields in vfio_batch to remember where to start between calls to
vfio_pin_pages_remote(), and use vfio_batch_unpin() to handle remaining
pages in the batch in case of error.

qemu pins pages for guests around 8% faster on my test system, a
two-node Broadwell server with 128G memory per node. The qemu process
was bound to one node with its allocations constrained there as well.

base test
guest ---------------- ----------------
mem (GB) speedup avg sec (std) avg sec (std)
1 7.4% 0.61 (0.00) 0.56 (0.00)
2 8.3% 0.93 (0.00) 0.85 (0.00)
4 8.4% 1.46 (0.00) 1.34 (0.00)
8 8.6% 2.54 (0.01) 2.32 (0.00)
16 8.3% 4.66 (0.00) 4.27 (0.01)
32 8.3% 8.94 (0.01) 8.20 (0.01)
64 8.2% 17.47 (0.01) 16.04 (0.03)
120 8.5% 32.45 (0.13) 29.69 (0.01)

perf diff confirms less time spent in pup. Here are the top ten
functions:

Baseline Delta Abs Symbol

78.63% +6.64% clear_page_erms
1.50% -1.50% __gup_longterm_locked
1.27% -0.78% __get_user_pages
+0.76% kvm_zap_rmapp.constprop.0
0.54% -0.53% vmacache_find
0.55% -0.51% get_pfnblock_flags_mask
0.48% -0.48% __get_user_pages_remote
+0.39% slot_rmap_walk_next
+0.32% vfio_pin_map_dma
+0.26% kvm_handle_hva_range
...

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 4b6c33b3 19-Feb-2021 Daniel Jordan <daniel.m.jordan@oracle.com>

vfio/type1: Prepare for batched pinning with struct vfio_batch

Get ready to pin more pages at once with struct vfio_batch, which
represents a batch of pinned pages.

The struct has a fallback page pointer to avoid two unlikely scenarios:
pointlessly allocating a page if disable_hugepages is enabled or failing
the whole pinning operation if the kernel can't allocate memory.

vaddr_get_pfn() becomes vaddr_get_pfns() to prepare for handling
multiple pages, though for now only one page is stored in the pages
array.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# be16c1fd 19-Feb-2021 Daniel Jordan <daniel.m.jordan@oracle.com>

vfio/type1: Change success value of vaddr_get_pfn()

vaddr_get_pfn() simply returns 0 on success. Have it report the number
of pfns successfully gotten instead, whether from page pinning or
follow_fault_pfn(), which will be used later when batching pinning.

Change the last check in vfio_pin_pages_remote() for consistency with
the other two.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 07956b62 16-Feb-2021 Alex Williamson <alex.williamson@redhat.com>

vfio/type1: Use follow_pte()

follow_pfn() doesn't make sure that we're using the correct page
protections, get the pte with follow_pte() so that we can test
protections and get the pfn from the pte.

Fixes: 5cbf3264bc71 ("vfio/type1: Fix VA->PA translation for PFNMAP VMAs in vaddr_get_pfn()")
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 35ac5991 17-Feb-2021 Tian Tao <tiantao6@hisilicon.com>

vfio/iommu_type1: Fix duplicate included kthread.h

linux/kthread.h is included more than once, remove the one that isn't
necessary.

Fixes: 898b9eaeb3fe ("vfio/type1: block on invalid vaddr")
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 01032156 24-Jan-2021 Keqian Zhu <zhukeqian1@huawei.com>

vfio/iommu_type1: Mantain a counter for non_pinned_groups

With this counter, we never need to traverse all groups to update
pinned_scope of vfio_iommu.

Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 4a19f37a 22-Jan-2021 Keqian Zhu <zhukeqian1@huawei.com>

vfio/iommu_type1: Fix some sanity checks in detach group

vfio_sanity_check_pfn_list() is used to check whether pfn_list and
notifier are empty when remove the external domain, so it makes a
wrong assumption that only external domain will use the pinning
interface.

Now we apply the pfn_list check when a vfio_dma is removed and apply
the notifier check when all domains are removed.

Fixes: a54eb55045ae ("vfio iommu type1: Add support for mediated devices")
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# d0a78f91 22-Jan-2021 Keqian Zhu <zhukeqian1@huawei.com>

vfio/iommu_type1: Populate full dirty when detach non-pinned group

If a group with non-pinned-page dirty scope is detached with dirty
logging enabled, we should fully populate the dirty bitmaps at the
time it's removed since we don't know the extent of its previous DMA,
nor will the group be present to trigger the full bitmap when the user
retrieves the dirty bitmap.

Fixes: d6a4c185660c ("vfio iommu: Implementation of ioctl for dirty pages tracking")
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 898b9eae 29-Jan-2021 Steve Sistare <steven.sistare@oracle.com>

vfio/type1: block on invalid vaddr

Block translation of host virtual address while an iova range has an
invalid vaddr.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 487ace13 29-Jan-2021 Steve Sistare <steven.sistare@oracle.com>

vfio/type1: implement notify callback

Implement a notify callback that remembers if the container's file
descriptor has been closed.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# c3cbab24 29-Jan-2021 Steve Sistare <steven.sistare@oracle.com>

vfio/type1: implement interfaces to update vaddr

Implement VFIO_DMA_UNMAP_FLAG_VADDR, VFIO_DMA_MAP_FLAG_VADDR, and
VFIO_UPDATE_VADDR. This is a partial implementation. Blocking is
added in a subsequent patch.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 40ae9b80 29-Jan-2021 Steve Sistare <steven.sistare@oracle.com>

vfio/type1: massage unmap iteration

Modify the iteration in vfio_dma_do_unmap so it does not depend on deletion
of each dma entry. Add a variant of vfio_find_dma that returns the entry
with the lowest iova in the search range to initialize the iteration. No
externally visible change, but this behavior is needed in the subsequent
update-vaddr patch.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# c1965099 29-Jan-2021 Steve Sistare <steven.sistare@oracle.com>

vfio/type1: implement unmap all

Implement VFIO_DMA_UNMAP_FLAG_ALL.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 0f53afa1 29-Jan-2021 Steve Sistare <steven.sistare@oracle.com>

vfio/type1: unmap cleanup

Minor changes in vfio_dma_do_unmap to improve readability, which also
simplify the subsequent unmap-all patch. No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# bdfae1c9 08-Dec-2020 Lu Baolu <baolu.lu@linux.intel.com>

vfio/type1: Add vfio_group_iommu_domain()

Add the API for getting the domain from a vfio group. This could be used
by the physical device drivers which rely on the vfio/mdev framework for
mediated device user level access. The typical use case like below:

unsigned int pasid;
struct vfio_group *vfio_group;
struct iommu_domain *iommu_domain;
struct device *dev = mdev_dev(mdev);
struct device *iommu_device = mdev_get_iommu_device(dev);

if (!iommu_device ||
!iommu_dev_feature_enabled(iommu_device, IOMMU_DEV_FEAT_AUX))
return -EINVAL;

vfio_group = vfio_group_get_external_user_from_dev(dev);
if (IS_ERR_OR_NULL(vfio_group))
return -EFAULT;

iommu_domain = vfio_group_iommu_domain(vfio_group);
if (IS_ERR_OR_NULL(iommu_domain)) {
vfio_group_put_external_user(vfio_group);
return -EFAULT;
}

pasid = iommu_aux_get_pasid(iommu_domain, iommu_device);
if (pasid < 0) {
vfio_group_put_external_user(vfio_group);
return -EFAULT;
}

/* Program device context with pasid value. */
...

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 572f64c7 22-Oct-2020 Zenghui Yu <yuzenghui@huawei.com>

vfio/type1: Use the new helper to find vfio_group

When attaching a new group to the container, let's use the new helper
vfio_iommu_find_iommu_group() to check if it's already attached. There
is no functional change.

Also take this chance to add a missing blank line.

Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 2e6cfd49 16-Oct-2020 Xiaoyang Xu <xuxiaoyang2@huawei.com>

vfio iommu type1: Fix memory leak in vfio_iommu_type1_pin_pages

pfn is not added to pfn_list when vfio_add_to_pfn_list fails.
vfio_unpin_page_external will exit directly without calling
vfio_iova_put_vfio_pfn. This will lead to a memory leak.

Fixes: a54eb55045ae ("vfio iommu type1: Add support for mediated devices")
Signed-off-by: Xiaoyang Xu <xuxiaoyang2@huawei.com>
[aw: simplified logic, add Fixes]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 2c5af985 15-Sep-2020 Yan Zhao <yan.y.zhao@intel.com>

vfio/type1: fix dirty bitmap calculation in vfio_dma_rw

The count of dirtied pages is not only determined by count of copied
pages, but also by the start offset.

e.g. if offset = PAGE_SIZE - 1, and *copied=2, the dirty pages count
is 2, instead of 1 or 0.

Fixes: d6a4c185660c ("vfio iommu: Implementation of ioctl for dirty pages tracking")
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 7d6e1329 15-Sep-2020 Matthew Rosato <mjrosato@linux.ibm.com>

vfio iommu: Add dma available capability

Commit 492855939bdb ("vfio/type1: Limit DMA mappings per container")
added the ability to limit the number of memory backed DMA mappings.
However on s390x, when lazy mapping is in use, we use a very large
number of concurrent mappings. Let's provide the current allowable
number of DMA mappings to userspace via the IOMMU info chain so that
userspace can take appropriate mitigation.

Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# aae4c8e2 17-Aug-2020 Tom Murphy <murphyt7@tcd.ie>

iommu: Rename iommu_tlb_* functions to iommu_iotlb_*

To keep naming consistent we should stick with *iotlb*. This patch
renames a few remaining functions.

Signed-off-by: Tom Murphy <murphyt7@tcd.ie>
Link: https://lore.kernel.org/r/20200817210051.13546-1-murphyt7@tcd.ie
Signed-off-by: Joerg Roedel <jroedel@suse.de>


# df561f66 23-Aug-2020 Gustavo A. R. Silva <gustavoars@kernel.org>

treewide: Use fallthrough pseudo-keyword

Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>


# aae7a75a 17-Aug-2020 Alex Williamson <alex.williamson@redhat.com>

vfio/type1: Add proper error unwind for vfio_iommu_replay()

The vfio_iommu_replay() function does not currently unwind on error,
yet it does pin pages, perform IOMMU mapping, and modify the vfio_dma
structure to indicate IOMMU mapping. The IOMMU mappings are torn down
when the domain is destroyed, but the other actions go on to cause
trouble later. For example, the iommu->domain_list can be empty if we
only have a non-IOMMU backed mdev attached. We don't currently check
if the list is empty before getting the first entry in the list, which
leads to a bogus domain pointer. If a vfio_dma entry is erroneously
marked as iommu_mapped, we'll attempt to use that bogus pointer to
retrieve the existing physical page addresses.

This is the scenario that uncovered this issue, attempting to hot-add
a vfio-pci device to a container with an existing mdev device and DMA
mappings, one of which could not be pinned, causing a failure adding
the new group to the existing container and setting the conditions
for a subsequent attempt to explode.

To resolve this, we can first check if the domain_list is empty so
that we can reject replay of a bogus domain, should we ever encounter
this inconsistent state again in the future. The real fix though is
to add the necessary unwind support, which means cleaning up the
current pinning if an IOMMU mapping fails, then walking back through
the r-b tree of DMA entries, reading from the IOMMU which ranges are
mapped, and unmapping and unpinning those ranges. To be able to do
this, we also defer marking the DMA entry as IOMMU mapped until all
entries are processed, in order to allow the unwind to know the
disposition of each entry.

Fixes: a54eb55045ae ("vfio iommu type1: Add support for mediated devices")
Reported-by: Zhiyi Guo <zhguo@redhat.com>
Tested-by: Zhiyi Guo <zhguo@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 64019a2e 11-Aug-2020 Peter Xu <peterx@redhat.com>

mm/gup: remove task_struct pointer for all gup code

After the cleanup of page fault accounting, gup does not need to pass
task_struct around any more. Remove that parameter in the whole gup
stack.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ccd59dce 12-Jul-2020 Liu Yi L <yi.l.liu@intel.com>

vfio/type1: Refactor vfio_iommu_type1_ioctl()

This patch refactors the vfio_iommu_type1_ioctl() to use switch instead of
if-else, and each command got a helper function.

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# e1907d67 27-Jul-2020 Xiang Zheng <zhengxiang9@huawei.com>

vfio/type1: Add conditional rescheduling after iommu map failed

Commit c5e6688752c2 ("vfio/type1: Add conditional rescheduling")
missed a "cond_resched()" in vfio_iommu_map if iommu map failed.

This is a very tiny optimization and the case can hardly happen.

Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# f5678e7f 10-Jun-2020 Christoph Hellwig <hch@lst.de>

kernel: better document the use_mm/unuse_mm API contract

Switch the function documentation to kerneldoc comments, and add
WARN_ON_ONCE asserts that the calling thread is a kernel thread and does
not have ->mm set (or has ->mm set in the case of unuse_mm).

Also give the functions a kthread_ prefix to better document the use case.

[hch@lst.de: fix a comment typo, cover the newly merged use_mm/unuse_mm caller in vfio]
Link: http://lkml.kernel.org/r/20200416053158.586887-3-hch@lst.de
[sfr@canb.auug.org.au: powerpc/vas: fix up for {un}use_mm() rename]
Link: http://lkml.kernel.org/r/20200422163935.5aa93ba5@canb.auug.org.au

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [usb]
Acked-by: Haren Myneni <haren@linux.ibm.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Link: http://lkml.kernel.org/r/20200404094101.672954-6-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4dbe59a6 10-Jun-2020 Christoph Hellwig <hch@lst.de>

kernel: move use_mm/unuse_mm to kthread.c

cover the newly merged use_mm/unuse_mm caller in vfio

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Link: http://lkml.kernel.org/r/20200416053158.586887-2-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d8ed45c5 08-Jun-2020 Michel Lespinasse <walken@google.com>

mmap locking API: use coccinelle to convert mmap_sem rwsem call sites

This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c8e9df47 02-Jun-2020 Kirti Wankhede <kwankhede@nvidia.com>

vfio iommu: typecast corrections

Fixes sparse warnings by adding '__user' in typecast for
copy_[from,to]_user()

Fixes: d6a4c185660c ("vfio iommu: Implementation of ioctl for dirty pages tracking")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# cd0bb41e 02-Jun-2020 Kirti Wankhede <kwankhede@nvidia.com>

vfio iommu: Use shift operation for 64-bit integer division

Fixes compilation error with ARCH=i386.

Error fixed by this commit:
ld: drivers/vfio/vfio_iommu_type1.o: in function `vfio_dma_populate_bitmap':
>> vfio_iommu_type1.c:(.text+0x666): undefined reference to `__udivdi3'

Fixes: d6a4c185660c ("vfio iommu: Implementation of ioctl for dirty pages tracking")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 95fc87b4 28-May-2020 Kirti Wankhede <kwankhede@nvidia.com>

vfio: Selective dirty page tracking if IOMMU backed device pins pages

Added a check such that only singleton IOMMU groups can pin pages.
>From the point when vendor driver pins any pages, consider IOMMU group
dirty page scope to be limited to pinned pages.

To optimize to avoid walking list often, added flag
pinned_page_dirty_scope to indicate if all of the vfio_groups for each
vfio_domain in the domain_list dirty page scope is limited to pinned
pages. This flag is updated on first pinned pages request for that IOMMU
group and on attaching/detaching group.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# ad721705 28-May-2020 Kirti Wankhede <kwankhede@nvidia.com>

vfio iommu: Add migration capability to report supported features

Added migration capability in IOMMU info chain.
User application should check IOMMU info chain for migration capability
to use dirty page tracking feature provided by kernel module.
User application must check page sizes supported and maximum dirty
bitmap size returned by this capability structure for ioctls used to get
dirty bitmap.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 331e33d2 28-May-2020 Kirti Wankhede <kwankhede@nvidia.com>

vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap

DMA mapped pages, including those pinned by mdev vendor drivers, might
get unpinned and unmapped while migration is active and device is still
running. For example, in pre-copy phase while guest driver could access
those pages, host device or vendor driver can dirty these mapped pages.
Such pages should be marked dirty so as to maintain memory consistency
for a user making use of dirty page tracking.

To get bitmap during unmap, user should allocate memory for bitmap, set
it all zeros, set size of allocated memory, set page size to be
considered for bitmap and set flag VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# d6a4c185 28-May-2020 Kirti Wankhede <kwankhede@nvidia.com>

vfio iommu: Implementation of ioctl for dirty pages tracking

VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
- Start dirty pages tracking while migration is active
- Stop dirty pages tracking.
- Get dirty pages bitmap. Its user space application's responsibility to
copy content of dirty pages from source to destination during migration.

To prevent DoS attack, memory for bitmap is allocated per vfio_dma
structure. Bitmap size is calculated considering smallest supported page
size. Bitmap is allocated for all vfio_dmas when dirty logging is enabled

Bitmap is populated for already pinned pages when bitmap is allocated for
a vfio_dma with the smallest supported page size. Update bitmap from
pinning functions when tracking is enabled. When user application queries
bitmap, check if requested page size is same as page size used to
populated bitmap. If it is equal, copy bitmap, but if not equal, return
error.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>

Fixed error reported by build bot by changing pgsize type from uint64_t
to size_t.
Reported-by: kbuild test robot <lkp@intel.com>

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# cade075f 28-May-2020 Kirti Wankhede <kwankhede@nvidia.com>

vfio iommu: Cache pgsize_bitmap in struct vfio_iommu

Calculate and cache pgsize_bitmap when iommu->domain_list is updated
and iommu->external_domain is set for mdev device.
Add iommu->lock protection when cached pgsize_bitmap is accessed.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 65817085 28-May-2020 Kirti Wankhede <kwankhede@nvidia.com>

vfio iommu: Remove atomicity of ref_count of pinned pages

vfio_pfn.ref_count is always updated while holding iommu->lock, using
atomic variable is overkill.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 41311242 28-Apr-2020 Alex Williamson <alex.williamson@redhat.com>

vfio/type1: Support faulting PFNMAP vmas

With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
the range being faulted into the vma. Add support to manually provide
that, in the same way as done on KVM with hva_to_pfn_remapped().

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 5cbf3264 16-Apr-2020 Sean Christopherson <seanjc@google.com>

vfio/type1: Fix VA->PA translation for PFNMAP VMAs in vaddr_get_pfn()

Use follow_pfn() to get the PFN of a PFNMAP VMA instead of assuming that
vma->vm_pgoff holds the base PFN of the VMA. This fixes a bug where
attempting to do VFIO_IOMMU_MAP_DMA on an arbitrary PFNMAP'd region of
memory calculates garbage for the PFN.

Hilariously, this only got detected because the first "PFN" calculated
by vaddr_get_pfn() is PFN 0 (vma->vm_pgoff==0), and iommu_iova_to_phys()
uses PA==0 as an error, which triggers a WARN in vfio_unmap_unpin()
because the translation "failed". PFN 0 is now unconditionally reserved
on x86 in order to mitigate L1TF, which causes is_invalid_reserved_pfn()
to return true and in turns results in vaddr_get_pfn() returning success
for PFN 0. Eventually the bogus calculation runs into PFNs that aren't
reserved and leads to failure in vfio_pin_map_dma(). The subsequent
call to vfio_remove_dma() attempts to unmap PFN 0 and WARNs.

WARNING: CPU: 8 PID: 5130 at drivers/vfio/vfio_iommu_type1.c:750 vfio_unmap_unpin+0x2e1/0x310 [vfio_iommu_type1]
Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio ...
CPU: 8 PID: 5130 Comm: sgx Tainted: G W 5.6.0-rc5-705d787c7fee-vfio+ #3
Hardware name: Intel Corporation Mehlow UP Server Platform/Moss Beach Server, BIOS CNLSE2R1.D00.X119.B49.1803010910 03/01/2018
RIP: 0010:vfio_unmap_unpin+0x2e1/0x310 [vfio_iommu_type1]
Code: <0f> 0b 49 81 c5 00 10 00 00 e9 c5 fe ff ff bb 00 10 00 00 e9 3d fe
RSP: 0018:ffffbeb5039ebda8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff9a55cbf8d480 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a52b771c200
RBP: 0000000000000000 R08: 0000000000000040 R09: 00000000fffffff2
R10: 0000000000000001 R11: ffff9a51fa896000 R12: 0000000184010000
R13: 0000000184000000 R14: 0000000000010000 R15: ffff9a55cb66ea08
FS: 00007f15d3830b40(0000) GS:ffff9a55d5600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561cf39429e0 CR3: 000000084f75f005 CR4: 00000000003626e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
vfio_remove_dma+0x17/0x70 [vfio_iommu_type1]
vfio_iommu_type1_ioctl+0x9e3/0xa7b [vfio_iommu_type1]
ksys_ioctl+0x92/0xb0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x4c/0x180
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15d04c75d7
Code: <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48

Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 0ea971f8 08-Apr-2020 Yan Zhao <yan.y.zhao@intel.com>

vfio: avoid possible overflow in vfio_iommu_type1_pin_pages

add parentheses to avoid possible vaddr overflow.

Fixes: a54eb55045ae ("vfio iommu type1: Add support for mediated devices")
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 205323b8 08-Apr-2020 Yan Zhao <yan.y.zhao@intel.com>

vfio: checking of validity of user vaddr in vfio_dma_rw

instead of calling __copy_to/from_user(), use copy_to_from_user() to
ensure vaddr range is a valid user address range before accessing them.

Fixes: 8d46c0cca5f4 ("vfio: introduce vfio_dma_rw to read/write a range of IOVAs")
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reported-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# f44efca0 01-Apr-2020 Andre Przywara <andre.przywara@arm.com>

vfio: Ignore -ENODEV when getting MSI cookie

When we try to get an MSI cookie for a VFIO device, that can fail if
CONFIG_IOMMU_DMA is not set. In this case iommu_get_msi_cookie() returns
-ENODEV, and that should not be fatal.

Ignore that case and proceed with the initialisation.

This fixes VFIO with a platform device on the Calxeda Midway (no MSIs).

Fixes: f6810c15cf973f ("iommu/arm-smmu: Clean up early-probing workarounds")
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 8d46c0cc 24-Mar-2020 Yan Zhao <yan.y.zhao@intel.com>

vfio: introduce vfio_dma_rw to read/write a range of IOVAs

vfio_dma_rw will read/write a range of user space memory pointed to by
IOVA into/from a kernel buffer without enforcing pinning the user space
memory.

TODO: mark the IOVAs to user space memory dirty if they are written in
vfio_dma_rw().

Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# f1f6a7dd 30-Jan-2020 John Hubbard <jhubbard@nvidia.com>

mm, tree-wide: rename put_user_page*() to unpin_user_page*()

In order to provide a clearer, more symmetric API for pinning and
unpinning DMA pages. This way, pin_user_pages*() calls match up with
unpin_user_pages*() calls, and the API is a lot closer to being
self-explanatory.

Link: http://lkml.kernel.org/r/20200107224558.2362728-23-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 19fed0da 30-Jan-2020 John Hubbard <jhubbard@nvidia.com>

vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion

1. Change vfio from get_user_pages_remote(), to
pin_user_pages_remote().

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().

Note that this effectively changes the code's behavior in
vfio_iommu_type1.c: put_pfn(): it now ultimately calls
set_page_dirty_lock(), instead of set_page_dirty(). This is probably
more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

[1] https://lore.kernel.org/r/20190723153640.GB720@lst.de

Link: http://lkml.kernel.org/r/20200107224558.2362728-20-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3567813e 30-Jan-2020 John Hubbard <jhubbard@nvidia.com>

vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call

Update VFIO to take advantage of the recently loosened restriction on
FOLL_LONGTERM with get_user_pages_remote(). Also, now it is possible to
fix a bug: the VFIO caller is logically a FOLL_LONGTERM user, but it
wasn't setting FOLL_LONGTERM.

Also, remove an unnessary pair of calls that were releasing and
reacquiring the mmap_sem. There is no need to avoid holding mmap_sem
just in order to call page_to_pfn().

Also, now that the the DAX check ("if a VMA is DAX, don't allow long
term pinning") is in the internals of get_user_pages_remote() and
__gup_longterm_locked(), there's no need for it at the VFIO call site. So
remove it.

Link: http://lkml.kernel.org/r/20200107224558.2362728-8-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 026948f0 02-Oct-2019 Ben Luo <luoben@linux.alibaba.com>

vfio/type1: remove hugepage checks in is_invalid_reserved_pfn()

Currently, no hugepage split code can transfer the reserved bit
from head to tail during the split, so checking the head can't make
a difference in a racing condition with hugepage spliting.

The buddy wouldn't allow a driver to allocate an hugepage if any
subpage is reserved in the e820 map at boot, if any driver sets the
reserved bit of head page before mapping the hugepage in userland,
it needs to set the reserved bit in all subpages to be safe.

Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 95f89e09 15-Oct-2019 Joerg Roedel <jroedel@suse.de>

vfio/type1: Initialize resv_msi_base

After enabling CONFIG_IOMMU_DMA on X86 a new warning appears when
compiling vfio:

drivers/vfio/vfio_iommu_type1.c: In function ‘vfio_iommu_type1_attach_group’:
drivers/vfio/vfio_iommu_type1.c:1827:7: warning: ‘resv_msi_base’ may be used uninitialized in this function [-Wmaybe-uninitialized]
ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The warning is a false positive, because the call to iommu_get_msi_cookie()
only happens when vfio_iommu_has_sw_msi() returned true. And that only
happens when it also set resv_msi_base.

But initialize the variable anyway to get rid of the warning.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 6cf5354c 25-Sep-2019 Andrey Konovalov <andreyknvl@google.com>

vfio/type1: untag user pointers in vaddr_get_pfn

This patch is a part of a series that extends kernel ABI to allow to pass
tagged user pointers (with the top byte set to something else other than
0x00) as syscall arguments.

vaddr_get_pfn() uses provided user pointers for vma lookups, which can
only by done with untagged pointers.

Untag user pointers in this function.

Link: http://lkml.kernel.org/r/87422b4d72116a975896f2b19b00f38acbd28f33.1563904656.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jens Wiklander <jens.wiklander@linaro.org>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b09d6e47 23-Jul-2019 Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>

vfio/type1: remove duplicate retrieval of reserved regions

As we now already have the reserved regions list, just pass that into
vfio_iommu_has_sw_msi() fn.

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# a7170720 23-Jul-2019 Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>

vfio/type1: Add IOVA range capability support

This allows the user-space to retrieve the supported IOVA
range(s), excluding any non-relaxable reserved regions. The
implementation is based on capability chains, added to
VFIO_IOMMU_GET_INFO ioctl.

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 9b77e5c7 23-Jul-2019 Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>

vfio/type1: check dma map request is within a valid iova range

This checks and rejects any dma map request outside valid iova
range.

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# f45daadf 23-Jul-2019 Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>

vfio/type1: Update iova list on detach

Get a copy of iova list on _group_detach and try to update the list.
On success replace the current one with the copy. Leave the list as
it is if update fails.

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# af029169 23-Jul-2019 Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>

vfio/type1: Check reserved region conflict and update iova list

This retrieves the reserved regions associated with dev group and
checks for conflicts with any existing dma mappings. Also update
the iova list excluding the reserved regions.

Reserved regions with type IOMMU_RESV_DIRECT_RELAXABLE are
excluded from above checks as they are considered as directly
mapped regions which are known to be relaxable.

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 1108696a 23-Jul-2019 Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>

vfio/type1: Introduce iova list and add iommu aperture validity check

This introduces an iova list that is valid for dma mappings. Make
sure the new iommu aperture window doesn't conflict with the current
one or with any existing dma mappings during attach.

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# a7d20dc1 02-Jul-2019 Will Deacon <will@kernel.org>

iommu: Introduce struct iommu_iotlb_gather for batching TLB flushes

To permit batching of TLB flushes across multiple calls to the IOMMU
driver's ->unmap() implementation, introduce a new structure for
tracking the address range to be flushed and the granularity at which
the flushing is required.

This is hooked into the IOMMU API and its caller are updated to make use
of the new structure. Subsequent patches will plumb this into the IOMMU
drivers as well, but for now the gathering information is ignored.

Signed-off-by: Will Deacon <will@kernel.org>


# 6d1bcb95 02-Jul-2019 Will Deacon <will@kernel.org>

iommu: Remove empty iommu_tlb_range_add() callback from iommu_ops

Commit add02cfdc9bc ("iommu: Introduce Interface for IOMMU TLB Flushing")
added three new TLB flushing operations to the IOMMU API so that the
underlying driver operations can be batched when unmapping large regions
of IO virtual address space.

However, the ->iotlb_range_add() callback has not been implemented by
any IOMMU drivers (amd_iommu.c implements it as an empty function, which
incurs the overhead of an indirect branch). Instead, drivers either flush
the entire IOTLB in the ->iotlb_sync() callback or perform the necessary
invalidation during ->unmap().

Attempting to implement ->iotlb_range_add() for arm-smmu-v3.c revealed
two major issues:

1. The page size used to map the region in the page-table is not known,
and so it is not generally possible to issue TLB flushes in the most
efficient manner.

2. The only mutable state passed to the callback is a pointer to the
iommu_domain, which can be accessed concurrently and therefore
requires expensive synchronisation to keep track of the outstanding
flushes.

Remove the callback entirely in preparation for extending ->unmap() and
->iotlb_sync() to update a token on the caller's stack.

Signed-off-by: Will Deacon <will@kernel.org>


# 79eb597c 16-Jul-2019 Daniel Jordan <daniel.m.jordan@oracle.com>

mm: add account_locked_vm utility function

locked_vm accounting is done roughly the same way in five places, so
unify them in a helper.

Include the helper's caller in the debug print to distinguish between
callsites.

Error codes stay the same, so user-visible behavior does too. The one
exception is that the -EPERM case in tce_account_locked_vm is removed
because Alexey has never seen it triggered.

[daniel.m.jordan@oracle.com: v3]
Link: http://lkml.kernel.org/r/20190529205019.20927-1-daniel.m.jordan@oracle.com
[sfr@canb.auug.org.au: fix mm/util.c]
Link: http://lkml.kernel.org/r/20190524175045.26897-1-daniel.m.jordan@oracle.com
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Alan Tull <atull@kernel.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Moritz Fischer <mdf@kernel.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Steve Sistare <steven.sistare@oracle.com>
Cc: Wu Hao <hao.wu@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d2912cb1 04-Jun-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500

Based on 2 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation #

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 4122 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 932f4a63 13-May-2019 Ira Weiny <ira.weiny@intel.com>

mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM

Pach series "Add FOLL_LONGTERM to GUP fast and use it".

HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
advantages. These pages can be held for a significant time. But
get_user_pages_fast() does not protect against mapping FS DAX pages.

Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
retains the performance while also adding the FS DAX checks. XDP has also
shown interest in using this functionality.[1]

In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
and remove the specialized get_user_pages_longterm call.

[1] https://lkml.org/lkml/2019/3/19/939

"longterm" is a relative thing and at this point is probably a misnomer.
This is really flagging a pin which is going to be given to hardware and
can't move. I've thought of a couple of alternative names but I think we
have to settle on if we are going to use FL_LAYOUT or something else to
solve the "longterm" problem. Then I think we can change the flag to a
better name.

Secondly, it depends on how often you are registering memory. I have
spoken with some RDMA users who consider MR in the performance path...
For the overall application performance. I don't have the numbers as the
tests for HFI1 were done a long time ago. But there was a significant
advantage. Some of which is probably due to the fact that you don't have
to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use
*_fast. There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well. Also
to this point others are looking to use *_fast.

As an aside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same. I agree and I think further cleanup
will be coming. But I'm focused on getting the final solution for DAX at
the moment.

This patch (of 7):

This patch starts a series which aims to support FOLL_LONGTERM in
get_user_pages_fast(). Some callers who would like to do a longterm (user
controlled pin) of pages with the fast variant of GUP for performance
purposes.

Rather than have a separate get_user_pages_longterm() call, introduce
FOLL_LONGTERM and change the longterm callers to use it.

This patch does not change any functionality. In the short term
"longterm" or user controlled pins are unsafe for Filesystems and FS DAX
in particular has been blocked. However, callers of get_user_pages_fast()
were not "protected".

FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
requires vmas to determine if DAX is in use.

NOTE: In merging with the CMA changes we opt to change the
get_user_pages() call in check_and_migrate_cma_pages() to a call of
__get_user_pages_locked() on the newly migrated pages. This makes the
code read better in that we are calling __get_user_pages_locked() on the
pages before and after a potential migration.

As a side affect some of the interfaces are cleaned up but this is not the
primary purpose of the series.

In review[1] it was asked:

<quote>
> This I don't get - if you do lock down long term mappings performance
> of the actual get_user_pages call shouldn't matter to start with.
>
> What do I miss?

A couple of points.

First "longterm" is a relative thing and at this point is probably a
misnomer. This is really flagging a pin which is going to be given to
hardware and can't move. I've thought of a couple of alternative names
but I think we have to settle on if we are going to use FL_LAYOUT or
something else to solve the "longterm" problem. Then I think we can
change the flag to a better name.

Second, It depends on how often you are registering memory. I have spoken
with some RDMA users who consider MR in the performance path... For the
overall application performance. I don't have the numbers as the tests
for HFI1 were done a long time ago. But there was a significant
advantage. Some of which is probably due to the fact that you don't have
to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use
*_fast. There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well. Also
to this point others are looking to use *_fast.

As an asside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same. I agree and I think further cleanup
will be coming. But I'm focused on getting the final solution for DAX at
the moment.

</quote>

[1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965

[ira.weiny@intel.com: v3]
Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# be068fa2 11-Apr-2019 Lu Baolu <baolu.lu@linux.intel.com>

vfio/type1: Handle different mdev isolation type

This adds the support to determine the isolation type
of a mediated device group by checking whether it has
an iommu device. If an iommu device exists, an iommu
domain will be allocated and then attached to the iommu
device. Otherwise, keep the same behavior as it is.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Sanjay Kumar <sanjay.k.kumar@intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>


# 7bd50f0c 11-Apr-2019 Lu Baolu <baolu.lu@linux.intel.com>

vfio/type1: Add domain at(de)taching group helpers

This adds helpers to attach or detach a domain to a
group. This will replace iommu_attach_group() which
only works for non-mdev devices.

If a domain is attaching to a group which includes the
mediated devices, it should attach to the iommu device
(a pci device which represents the mdev in iommu scope)
instead. The added helper supports attaching domain to
groups for both pci and mdev devices.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Sanjay Kumar <sanjay.k.kumar@intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>


# 49285593 03-Apr-2019 Alex Williamson <alex.williamson@redhat.com>

vfio/type1: Limit DMA mappings per container

Memory backed DMA mappings are accounted against a user's locked
memory limit, including multiple mappings of the same memory. This
accounting bounds the number of such mappings that a user can create.
However, DMA mappings that are not backed by memory, such as DMA
mappings of device MMIO via mmaps, do not make use of page pinning
and therefore do not count against the user's locked memory limit.
These mappings still consume memory, but the memory is not well
associated to the process for the purpose of oom killing a task.

To add bounding on this use case, we introduce a limit to the total
number of concurrent DMA mappings that a user is allowed to create.
This limit is exposed as a tunable module option where the default
value of 64K is expected to be well in excess of any reasonable use
case (a large virtual machine configuration would typically only make
use of tens of concurrent mappings).

This fixes CVE-2019-3882.

Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 58fec830 07-Jan-2019 Alex Williamson <alex.williamson@redhat.com>

vfio/type1: Fix unmap overflow off-by-one

The below referenced commit adds a test for integer overflow, but in
doing so prevents the unmap ioctl from ever including the last page of
the address space. Subtract one to compare to the last address of the
unmap to avoid the overflow and wrap-around.

Fixes: 71a7d3d78e3c ("vfio/type1: silence integer overflow warning")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1662291
Cc: stable@vger.kernel.org # v4.15+
Reported-by: Pei Zhang <pezhang@redhat.com>
Debugged-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Tested-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 7a30423a 08-Nov-2018 Joerg Roedel <jroedel@suse.de>

vfio/type1: Remove map_try_harder() code path

The AMD IOMMU driver can now map a huge-page where smaller
mappings existed before, so this code-path is no longer
triggered.

Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>


# 544c05a6 09-Jul-2018 Gustavo A. R. Silva <gustavo@embeddedor.com>

vfio: Mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# bb94b55a 29-Jun-2018 Jason Gunthorpe <jgg@ziepe.ca>

vfio: Use get_user_pages_longterm correctly

The patch noted in the fixes below converted get_user_pages_fast() to
get_user_pages_longterm(), however the two calls differ in a few ways.

First _fast() is documented to not require the mmap_sem, while _longterm()
is documented to need it. Hold the mmap sem as required.

Second, _fast accepts an 'int write' while _longterm uses 'unsigned int
gup_flags', so the expression '!!(prot & IOMMU_WRITE)' is only working by
luck as FOLL_WRITE is currently == 0x1. Use the expected FOLL_WRITE
constant instead.

Fixes: 94db151dc892 ("vfio: disable filesystem-dax page pinning")
Cc: <stable@vger.kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 48d8476b 11-May-2018 Alex Williamson <alex.williamson@redhat.com>

vfio/type1: Fix task tracking for QEMU vCPU hotplug

MAP_DMA ioctls might be called from various threads within a process,
for example when using QEMU, the vCPU threads are often generating
these calls and we therefore take a reference to that vCPU task.
However, QEMU also supports vCPU hotplug on some machines and the task
that called MAP_DMA may have exited by the time UNMAP_DMA is called,
resulting in the mm_struct pointer being NULL and thus a failure to
match against the existing mapping.

To resolve this, we instead take a reference to the thread
group_leader, which has the same mm_struct and resource limits, but
is less likely exit, at least in the QEMU case. A difficulty here is
guaranteeing that the capabilities of the group_leader match that of
the calling thread, which we resolve by tracking CAP_IPC_LOCK at the
time of calling rather than at an indeterminate time in the future.
Potentially this also results in better efficiency as this is now
recorded once per MAP_DMA ioctl.

Reported-by: Xu Yandong <xuyandong2@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 89c29def 02-Jun-2018 Alex Williamson <alex.williamson@redhat.com>

Revert "vfio/type1: Improve memory pinning process for raw PFN mapping"

Bisection by Amadeusz Sławiński implicates this commit leading to bad
page state issues after VM shutdown, likely due to unbalanced page
references. The original commit was intended only as a performance
improvement, therefore revert for offline rework.

Link: https://lkml.org/lkml/2018/6/2/97
Fixes: 356e88ebe447 ("vfio/type1: Improve memory pinning process for raw PFN mapping")
Cc: Jason Cai (Xiang Feng) <jason.cai@linux.alibaba.com>
Reported-by: Amadeusz Sławiński <amade@asmblr.net>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 356e88eb 21-Mar-2018 Jason Cai (Xiang Feng) <jason.cai@linux.alibaba.com>

vfio/type1: Improve memory pinning process for raw PFN mapping

When using vfio to pass through a PCIe device (e.g. a GPU card) that
has a huge BAR (e.g. 16GB), a lot of cycles are wasted on memory
pinning because PFNs of PCI BAR are not backed by struct page, and
the corresponding VMA has flag VM_PFNMAP.

With this change, when pinning a region which is a raw PFN mapping,
it can skip unnecessary user memory pinning process, and thus, can
significantly improve VM's boot up time when passing through devices
via VFIO. In my test on a Xeon E5 2.6GHz, the time mapping a 16GB
BAR was reduced from about 0.4s to 1.5us.

Signed-off-by: Jason Cai (Xiang Feng) <jason.cai@linux.alibaba.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 6bd06f5a 21-Mar-2018 Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>

vfio/type1: Adopt fast IOTLB flush interface when unmap IOVAs

VFIO IOMMU type1 currently upmaps IOVA pages synchronously, which requires
IOTLB flushing for every unmapping. This results in large IOTLB flushing
overhead when handling pass-through devices has a large number of mapped
IOVAs. This can be avoided by using the new IOTLB flushing interface.

Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
[aw - use LIST_HEAD]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 94db151d 04-Feb-2018 Dan Williams <dan.j.williams@intel.com>

vfio: disable filesystem-dax page pinning

Filesystem-DAX is incompatible with 'longterm' page pinning. Without
page cache indirection a DAX mapping maps filesystem blocks directly.
This means that the filesystem must not modify a file's block map while
any page in a mapping is pinned. In order to prevent the situation of
userspace holding of filesystem operations indefinitely, disallow
'longterm' Filesystem-DAX mappings.

RDMA has the same conflict and the plan there is to add a 'with lease'
mechanism to allow the kernel to notify userspace that the mapping is
being torn down for block-map maintenance. Perhaps something similar can
be put in place for vfio.

Note that xfs and ext4 still report:

"DAX enabled. Warning: EXPERIMENTAL, use at your own risk"

...at mount time, and resolving the dax-dma-vs-truncate problem is one
of the last hurdles to remove that designation.

Acked-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: kvm@vger.kernel.org
Cc: <stable@vger.kernel.org>
Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
Tested-by: Haozhong Zhang <haozhong.zhang@intel.com>
Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 71a7d3d7 20-Oct-2017 Dan Carpenter <dan.carpenter@oracle.com>

vfio/type1: silence integer overflow warning

I get a static checker warning about the potential integer overflow if
we add "unmap->iova + unmap->size". The integer overflow isn't really
harmful, but we may as well fix it. Also unmap->size gets truncated to
size_t when we pass it to vfio_find_dma() so we could check for too high
values of that as well.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# f203f7f1 10-Aug-2017 Robin Murphy <robin.murphy@arm.com>

vfio/type1: Give hardware MSI regions precedence

If the IOMMU driver advertises 'real' reserved regions for MSIs, but
still includes the software-managed region as well, we are currently
blind to the former and will configure the IOMMU domain to map MSIs into
the latter, which is unlikely to work as expected.

Since it would take a ridiculous hardware topology for both regions to
be valid (which would be rather difficult to support in general), we
should be safe to assume that the presence of any hardware regions makes
the software region irrelevant. However, the IOMMU driver might still
advertise the software region by default, particularly if the hardware
regions are filled in elsewhere by generic code, so it might not be fair
for VFIO to be super-strict about not mixing them. To that end, make
vfio_iommu_has_sw_msi() robust against the presence of both region types
at once, so that we end up doing what is almost certainly right, rather
than what is almost certainly wrong.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# db406cc0 10-Aug-2017 Robin Murphy <robin.murphy@arm.com>

vfio/type1: Cope with hardware MSI reserved regions

For ARM-based systems with a GICv3 ITS to provide interrupt isolation,
but hardware limitations which are worked around by having MSIs bypass
SMMU translation (e.g. HiSilicon Hip06/Hip07), VFIO neglects to check
for the IRQ_DOMAIN_FLAG_MSI_REMAP capability, (and thus erroneously
demands unsafe_interrupts) if a software-managed MSI region is absent.

Fix this by always checking for isolation capability at both the IRQ
domain and IOMMU domain levels, rather than predicating that on whether
MSIs require an IOMMU mapping (which was always slightly tenuous logic).

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 7cb671e7 17-Apr-2017 Alex Williamson <alex.williamson@redhat.com>

vfio/type1: Reduce repetitive calls in vfio_pin_pages_remote()

vfio_pin_pages_remote() is typically called to iterate over a range
of memory. Testing CAP_IPC_LOCK is relatively expensive, so it makes
sense to push it up to the caller, which can then repeatedly call
vfio_pin_pages_remote() using that value. This can show nearly a 20%
improvement on the worst case path through VFIO_IOMMU_MAP_DMA with
contiguous page mapping disabled. Testing RLIMIT_MEMLOCK is much more
lightweight, but we bring it along on the same principle and it does
seem to show a marginal improvement.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 80dbe1fb 16-Apr-2017 Alex Williamson <alex.williamson@redhat.com>

vfio/type1: Prune vfio_pin_page_external()

With vfio_lock_acct() testing the locked memory limit under mmap_sem,
it's redundant to do it here for a single page. We can also reorder
our tests such that we can avoid testing for reserved pages if we're
not doing accounting and let vfio_lock_acct() test the process
CAP_IPC_LOCK. Finally, this function oddly returns 1 on success.
Update to return zero on success, -errno on error. Since the function
only pins a single page, there's no need to return the number of pages
pinned.

N.B. vfio_pin_pages_remote() can pin a large contiguous range of pages
before calling vfio_lock_acct(). If we were to similarly remove the
extra test there, a user could temporarily pin far more pages than
they're allowed.

Suggested-by: Kirti Wankhede <kwankhede@nvidia.com>
Suggested-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 0cfef2b7 13-Apr-2017 Alex Williamson <alex.williamson@redhat.com>

vfio/type1: Remove locked page accounting workqueue

If the mmap_sem is contented then the vfio type1 IOMMU backend will
defer locked page accounting updates to a workqueue task. This has a
few problems and depending on which side the user tries to play, they
might be over-penalized for unmaps that haven't yet been accounted or
race the workqueue to enter more mappings than they're allowed. The
original intent of this workqueue mechanism seems to be focused on
reducing latency through the ioctl, but we cannot do so at the cost
of correctness. Remove this workqueue mechanism and update the
callers to allow for failure. We can also now recheck the limit under
write lock to make sure we don't exceed it.

vfio_pin_pages_remote() also now necessarily includes an unwind path
which we can jump to directly if the consecutive page pinning finds
that we're exceeding the user's memory limits. This avoids the
current lazy approach which does accounting and mapping up to the
fault, only to return an error on the next iteration to unwind the
entire vfio_dma.

Cc: stable@vger.kernel.org
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 9d3a4de4 16-Mar-2017 Robin Murphy <robin.murphy@arm.com>

iommu: Disambiguate MSI region types

The introduction of reserved regions has left a couple of rough edges
which we could do with sorting out sooner rather than later. Since we
are not yet addressing the potential dynamic aspect of software-managed
reservations and presenting them at arbitrary fixed addresses, it is
incongruous that we end up displaying hardware vs. software-managed MSI
regions to userspace differently, especially since ARM-based systems may
actually require one or the other, or even potentially both at once,
(which iommu-dma currently has no hope of dealing with at all). Let's
resolve the former user-visible inconsistency ASAP before the ABI has
been baked into a kernel release, in a way that also lays the groundwork
for the latter shortcoming to be addressed by follow-up patches.

For clarity, rename the software-managed type to IOMMU_RESV_SW_MSI, use
IOMMU_RESV_MSI to describe the hardware type, and document everything a
little bit. Since the x86 MSI remapping hardware falls squarely under
this meaning of IOMMU_RESV_MSI, apply that type to their regions as well,
so that we tell the same story to userspace across all platforms.

Secondly, as the various region types require quite different handling,
and it really makes little sense to ever try combining them, convert the
bitfield-esque #defines to a plain enum in the process before anyone
gets the wrong impression.

Fixes: d30ddcaa7b02 ("iommu: Add a new type field in iommu_resv_region")
Reviewed-by: Eric Auger <eric.auger@redhat.com>
CC: Alex Williamson <alex.williamson@redhat.com>
CC: David Woodhouse <dwmw2@infradead.org>
CC: kvm@vger.kernel.org
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>


# 3f07c014 08-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h>

We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 6e84f315 08-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h>

We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/mm.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

The APIs that are going to be moved first are:

mm_alloc()
__mmdrop()
mmdrop()
mmdrop_async_fn()
mmdrop_async()
mmget_not_zero()
mmput()
mmput_async()
get_task_mm()
mm_access()
mm_release()

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 2c9f1af5 09-Feb-2017 Wei Yongjun <weiyongjun1@huawei.com>

vfio/type1: Fix error return code in vfio_iommu_type1_attach_group()

Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: 5d704992189f ("vfio/type1: Allow transparent MSI IOVA allocation")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>


# 9d72f87b 19-Jan-2017 Eric Auger <eric.auger@redhat.com>

vfio/type1: Check MSI remapping at irq domain level

In case the IOMMU translates MSI transactions (typical case
on ARM), we check MSI remapping capability at IRQ domain
level. Otherwise it is checked at IOMMU level.

At this stage the arm-smmu-(v3) still advertise the
IOMMU_CAP_INTR_REMAP capability at IOMMU level. This will be
removed in subsequent patches.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Bharat Bhushan <bharat.bhushan@nxp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>


# 5d704992 19-Jan-2017 Eric Auger <eric.auger@redhat.com>

vfio/type1: Allow transparent MSI IOVA allocation

When attaching a group to the container, check the group's
reserved regions and test whether the IOMMU translates MSI
transactions. If yes, we initialize an IOVA allocator through
the iommu_get_msi_cookie API. This will allow the MSI IOVAs
to be transparently allocated on MSI controller's compose().

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Bharat Bhushan <bharat.bhushan@nxp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>


# 94a6fa89 12-Jan-2017 Alex Williamson <alex.williamson@redhat.com>

vfio/type1: Remove pid_namespace.h include

Using has_capability() rather than ns_capable(), we're no longer using
this header.

Cc: Jike Song <jike.song@intel.com>
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# d1b333d1 12-Jan-2017 Jike Song <jike.song@intel.com>

vfio iommu type1: fix the testing of capability for remote task

Before the mdev enhancement type1 iommu used capable() to test the
capability of current task; in the course of mdev development a
new requirement, testing for another task other than current, was
raised. ns_capable() was used for this purpose, however it still
tests current, the only difference is, in a specified namespace.

Fix it by using has_capability() instead, which tests the cap for
specified task in init_user_ns, the same namespace as capable().

Cc: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Jike Song <jike.song@intel.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 6c38c055 30-Dec-2016 Alex Williamson <alex.williamson@redhat.com>

vfio/type1: Restore mapping performance with mdev support

As part of the mdev support, type1 now gets a task reference per
vfio_dma and uses that to get an mm reference for the task while
working on accounting. That's correct, but it's not fast. For some
paths, like vfio_pin_pages_remote(), we know we're only called from
user context, so we can restore the lighter weight calls. In other
cases, we're effectively already testing whether we're in the stored
task context elsewhere, extend this vfio_lock_acct() as well.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>


# 5b56d49f 14-Dec-2016 Lorenzo Stoakes <lstoakes@gmail.com>

mm: add locked parameter to get_user_pages_remote()

Patch series "mm: unexport __get_user_pages_unlocked()".

This patch series continues the cleanup of get_user_pages*() functions
taking advantage of the fact we can now pass gup_flags as we please.

It firstly adds an additional 'locked' parameter to
get_user_pages_remote() to allow for its callers to utilise
VM_FAULT_RETRY functionality. This is necessary as the invocation of
__get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of
this and no other existing higher level function would allow it to do
so.

Secondly existing callers of __get_user_pages_unlocked() are replaced
with the appropriate higher-level replacement -
get_user_pages_unlocked() if the current task and memory descriptor are
referenced, or get_user_pages_remote() if other task/memory descriptors
are referenced (having acquiring mmap_sem.)

This patch (of 2):

Add a int *locked parameter to get_user_pages_remote() to allow
VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked().

Taking into account the previous adjustments to get_user_pages*()
functions allowing for the passing of gup_flags, we are now in a
position where __get_user_pages_unlocked() need only be exported for his
ability to allow VM_FAULT_RETRY behaviour, this adjustment allows us to
subsequently unexport __get_user_pages_unlocked() as well as allowing
for future flexibility in the use of get_user_pages_remote().

[sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change]
Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au
Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2b8bb1d7 06-Dec-2016 Kirti Wankhede <kwankhede@nvidia.com>

vfio iommu type1: Fix size argument to vfio_find_dma() in pin_pages/unpin_pages

Passing zero for the size to vfio_find_dma() isn't compatible with
matching the start address of an existing vfio_dma. Doing so triggers a
corner case. In vfio_find_dma(), when the start address is equal to
dma->iova and size is 0, check for the end of search range makes it to
take wrong side of RB-tree. That fails the search even though the address
is present in mapped dma ranges.
In functions pin_pages and unpin_pages, the iova which is being searched
is base address of page to be pinned or unpinned. So here size should be
set to PAGE_SIZE, as argument to vfio_find_dma().

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 7c03f428 06-Dec-2016 Kirti Wankhede <kwankhede@nvidia.com>

vfio iommu type1: Fix size argument to vfio_find_dma() during DMA UNMAP.

Passing zero for the size to vfio_find_dma() isn't compatible with
matching the start address of an existing vfio_dma. Doing so triggers a
corner case. In vfio_find_dma(), when the start address is equal to
dma->iova and size is 0, check for the end of search range makes it to
take wrong side of RB-tree. That fails the search even though the address
is present in mapped dma ranges. Due to this, in vfio_dma_do_unmap(),
while checking boundary conditions, size should be set to 1 for verifying
start address of unmap range.
vfio_find_dma() is also used to verify last address in unmap range with
size = 0, but in that case address to be searched is calculated with
start + size - 1 and so it works correctly.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
[aw: changelog tweak]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 3cedd7d7 05-Dec-2016 Kirti Wankhede <kwankhede@nvidia.com>

vfio iommu type1: WARN_ON if notifier block is not unregistered

mdev vendor driver should unregister the iommu notifier since the vfio
iommu can persist beyond the attachment of the mdev group. WARN_ON will
show warning if vendor driver doesn't unregister the notifier and is
forced to follow the implementations steps.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 22195cbd 30-Nov-2016 Jike Song <jike.song@intel.com>

vfio: vfio_register_notifier: classify iommu notifier

Currently vfio_register_notifier assumes that there is only one
notifier chain, which is in vfio_iommu. However, the user might
also be interested in events other than vfio_iommu, for example,
vfio_group. Refactor vfio_{un}register_notifier implementation
to make it feasible.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Jike Song <jike.song@intel.com>
[aw: merge with commit 816ca69ea9c7 ("vfio: Fix handling of error returned by 'vfio_group_get_from_dev()'"), remove typedef]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# c086de81 16-Nov-2016 Kirti Wankhede <kwankhede@nvidia.com>

vfio iommu: Add blocking notifier to notify DMA_UNMAP

Added blocking notifier to IOMMU TYPE1 driver to notify vendor drivers
about DMA_UNMAP.
Exported two APIs vfio_register_notifier() and vfio_unregister_notifier().
Notifier should be registered, if external user wants to use
vfio_pin_pages()/vfio_unpin_pages() APIs to pin/unpin pages.
Vendor driver should use VFIO_IOMMU_NOTIFY_DMA_UNMAP action to invalidate
mappings.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# a54eb550 16-Nov-2016 Kirti Wankhede <kwankhede@nvidia.com>

vfio iommu type1: Add support for mediated devices

VFIO IOMMU drivers are designed for the devices which are IOMMU capable.
Mediated device only uses IOMMU APIs, the underlying hardware can be
managed by an IOMMU domain.

Aim of this change is:
- To use most of the code of TYPE1 IOMMU driver for mediated devices
- To support direct assigned device and mediated device in single module

This change adds pin and unpin support for mediated device to TYPE1 IOMMU
backend module. More details:
- Domain for external user is tracked separately in vfio_iommu structure.
It is allocated when group for first mdev device is attached.
- Pages pinned for external domain are tracked in each vfio_dma structure
for that iova range.
- Page tracking rb-tree in vfio_dma keeps <iova, pfn, ref_count>. Key of
rb-tree is iova, but it actually aims to track pfns.
- On external pin request for an iova, page is pinned once, if iova is
already pinned and tracked, ref_count is incremented.
- External unpin request unpins pages only when ref_count is 0.
- Pinned pages list is used to find pfn from iova and then unpin it.
WARN_ON is added if there are entires in pfn_list while detaching the
group and releasing the domain.
- Page accounting is updated to account in its address space where the
pages are pinned/unpinned, i.e dma->task
- Accouting for mdev device is only done if there is no iommu capable
domain in the container. When there is a direct device assigned to the
container and that domain is iommu capable, all pages are already pinned
during DMA_MAP.
- Page accouting is updated on hot plug and unplug mdev device and pass
through device.

Tested by assigning below combinations of devices to a single VM:
- GPU pass through only
- vGPU device only
- One GPU pass through and one vGPU device
- Linux VM hot plug and unplug vGPU device while GPU pass through device
exist
- Linux VM hot plug and unplug GPU pass through device while vGPU device
exist

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 8f0d5bb9 16-Nov-2016 Kirti Wankhede <kwankhede@nvidia.com>

vfio iommu type1: Add task structure to vfio_dma

Add task structure to vfio_dma structure. Task structure is used for:
- During DMA_UNMAP, same task who mapped it or other task who shares same
address space is allowed to unmap, otherwise unmap fails.
QEMU maps few iova ranges initially, then fork threads and from the child
thread calls DMA_UNMAP on previously mapped iova. Since child shares same
address space, DMA_UNMAP is successful.
- Avoid accessing struct mm while process is exiting by acquiring
reference of task's mm during page accounting.
- It is also used to get task mlock capability and rlimit for mlock.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 7896c998 16-Nov-2016 Kirti Wankhede <kwankhede@nvidia.com>

vfio iommu type1: Add find_iommu_group() function

Add find_iommu_group()

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# ea85cf35 16-Nov-2016 Kirti Wankhede <kwankhede@nvidia.com>

vfio iommu type1: Update argument of vaddr_get_pfn()

Update arguments of vaddr_get_pfn() to take struct mm_struct *mm as input
argument.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 3624a248 16-Nov-2016 Kirti Wankhede <kwankhede@nvidia.com>

vfio iommu type1: Update arguments of vfio_lock_acct

Added task structure as input argument to vfio_lock_acct() function.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 2169037d 16-Nov-2016 Kirti Wankhede <kwankhede@nvidia.com>

vfio iommu: Added pin and unpin callback functions to vfio_iommu_driver_ops

Added APIs for pining and unpining set of pages. These call back into
backend iommu module to actually pin and unpin pages.
Added two new callback functions to struct vfio_iommu_driver_ops. Backend
IOMMU module that supports pining and unpinning pages for mdev devices
should provide these functions.

Renamed static functions in vfio_type1_iommu.c to resolve conflicts

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 089f1c6b 30-May-2016 Alex Williamson <alex.williamson@redhat.com>

vfio/type1: Fix build warning

This function cannot actually be called with npage = 0, so in practice
this doesn't return an uninitialized value.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# d16e0faa 07-Apr-2016 Robin Murphy <robin.murphy@arm.com>

iommu: Allow selecting page sizes per domain

Many IOMMUs support multiple page table formats, meaning that any given
domain may only support a subset of the hardware page sizes presented in
iommu_ops->pgsize_bitmap. There are also certain use-cases where the
creator of a domain may want to control which page sizes are used, for
example to force the use of hugepage mappings to reduce pagetable walk
depth.

To this end, add a per-domain pgsize_bitmap to represent the subset of
page sizes actually in use, to make it possible for domains with
different requirements to coexist.

Signed-off-by: Will Deacon <will.deacon@arm.com>
[rm: hijacked and rebased original patch with new commit message]
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>


# 8160c4e4 28-Feb-2016 Michael S. Tsirkin <mst@redhat.com>

vfio: fix ioctl error handling

Calling return copy_to_user(...) in an ioctl will not
do the right thing if there's a pagefault:
copy_to_user returns the number of bytes not copied
in this case.

Fix up vfio to do
return copy_to_user(...)) ?
-EFAULT : 0;

everywhere.

Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# d4f50ee2 23-Dec-2015 Pierre Morel <pmorel@linux.vnet.ibm.com>

vfio/iommu_type1: make use of info.flags

The flags entry is there to tell the user that some
optional information is available.

Since we report the iova_pgsizes signal it to the user
by setting the flags to VFIO_IOMMU_INFO_PGSIZES.

Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 4644321f 29-Oct-2015 Eric Auger <eric.auger@linaro.org>

vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

Current vfio_pgsize_bitmap code hides the supported IOMMU page
sizes smaller than PAGE_SIZE. As a result, in case the IOMMU
does not support PAGE_SIZE page, the alignment check on map/unmap
is done with larger page sizes, if any. This can fail although
mapping could be done with pages smaller than PAGE_SIZE.

This patch modifies vfio_pgsize_bitmap implementation so that,
in case the IOMMU supports page sizes smaller than PAGE_SIZE
we pretend PAGE_SIZE is supported and hide sub-PAGE_SIZE sizes.
That way the user will be able to map/unmap buffers whose size/
start address is aligned with PAGE_SIZE. Pinning code uses that
granularity while iommu driver can use the sub-PAGE_SIZE size
to map the buffer.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# c5e66887 06-Feb-2015 Alex Williamson <alex.williamson@redhat.com>

vfio/type1: Add conditional rescheduling

IOMMU operations can be expensive and it's not very difficult for a
user to give us a lot of work to do for a map or unmap operation.
Killing a large VM will vfio assigned devices can result in soft
lockups and IOMMU tracing shows that we can easily spend 80% of our
time with need-resched set. A sprinkling of conf_resched() calls
after map and unmap calls has a very tiny affect on performance
while resulting in traces with <1% of calls overflowing into needs-
resched.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# babbf176 06-Feb-2015 Alex Williamson <alex.williamson@redhat.com>

vfio/type1: Chunk contiguous reserved/invalid page mappings

We currently map invalid and reserved pages, such as often occur from
mapping MMIO regions of a VM through the IOMMU, using single pages.
There's really no reason we can't instead follow the methodology we
use for normal pages and find the largest possible physically
contiguous chunk for mapping. The only difference is that we don't
do locked memory accounting for these since they're not back by RAM.

In most applications this will be a very minor improvement, but when
graphics and GPGPU devices are in play, MMIO BARs become non-trivial.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 6fe1010d 06-Feb-2015 Alex Williamson <alex.williamson@redhat.com>

vfio/type1: DMA unmap chunking

When unmapping DMA entries we try to rely on the IOMMU API behavior
that allows the IOMMU to unmap a larger area than requested, up to
the size of the original mapping. This works great when the IOMMU
supports superpages *and* they're in use. Otherwise, each PAGE_SIZE
increment is unmapped separately, resulting in poor performance.

Instead we can use the IOVA-to-physical-address translation provided
by the IOMMU API and unmap using the largest contiguous physical
memory chunk available, which is also how vfio/type1 would have
mapped the region. For a synthetic 1TB guest VM mapping and shutdown
test on Intel VT-d (2M IOMMU pagesize support), this achieves about
a 30% overall improvement mapping standard 4K pages, regardless of
IOMMU superpage enabling, and about a 40% improvement mapping 2M
hugetlbfs pages when IOMMU superpages are not available. Hugetlbfs
with IOMMU superpages enabled is effectively unchanged.

Unfortunately the same algorithm does not work well on IOMMUs with
fine-grained superpages, like AMD-Vi, costing about 25% extra since
the IOMMU will automatically unmap any power-of-two contiguous
mapping we've provided it. We add a routine and a domain flag to
detect this feature, leaving AMD-Vi unaffected by this unmap
optimization.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# f5c9eceb 29-Sep-2014 Will Deacon <will@kernel.org>

vfio/iommu_type1: add new VFIO_TYPE1_NESTING_IOMMU IOMMU type

VFIO allows devices to be safely handed off to userspace by putting
them behind an IOMMU configured to ensure DMA and interrupt isolation.
This enables userspace KVM clients, such as kvmtool and qemu, to further
map the device into a virtual machine.

With IOMMUs such as the ARM SMMU, it is then possible to provide SMMU
translation services to the guest operating system, which are nested
with the existing translation installed by VFIO. However, enabling this
feature means that the IOMMU driver must be informed that the VFIO domain
is being created for the purposes of nested translation.

This patch adds a new IOMMU type (VFIO_TYPE1_NESTING_IOMMU) to the VFIO
type-1 driver. The new IOMMU type acts identically to the
VFIO_TYPE1v2_IOMMU type, but additionally sets the DOMAIN_ATTR_NESTING
attribute on its IOMMU domains.

Cc: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# eb165f05 05-Sep-2014 Joerg Roedel <jroedel@suse.de>

vfio: Convert to use new iommu_capable() API function

Cc: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>


# c8dbca16 30-May-2014 Alex Williamson <alex.williamson@redhat.com>

vfio/iommu_type1: Avoid overflow

Coverity reports use of a tained scalar used as a loop boundary.
For the most part, any values passed from userspace for a DMA mapping
size, IOVA, or virtual address are valid, with some alignment
constraints. The size is ultimately bound by how many pages the user
is able to lock, IOVA is tested by the IOMMU driver when doing a map,
and the virtual address needs to pass get_user_pages. The only
problem I can find is that we do expect the __u64 user values to fit
within our variables, which might not happen on 32bit platforms. Add
a test for this and return error on overflow. Also propagate use of
the type-correct local variables throughout the function.

The above also points to the 'end' variable, which can be zero if
we're operating at the very top of the address space. We try to
account for this, but our loop botches it. Rework the loop to use
the remaining size as our loop condition rather than the IOVA vs end.

Detected by Coverity: CID 714659

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 668f9abb 03-Mar-2014 David Rientjes <rientjes@google.com>

mm: close PageTail race

Commit bf6bddf1924e ("mm: introduce compaction and migration for
ballooned pages") introduces page_count(page) into memory compaction
which dereferences page->first_page if PageTail(page).

This results in a very rare NULL pointer dereference on the
aforementioned page_count(page). Indeed, anything that does
compound_head(), including page_count() is susceptible to racing with
prep_compound_page() and seeing a NULL or dangling page->first_page
pointer.

This patch uses Andrea's implementation of compound_trans_head() that
deals with such a race and makes it the default compound_head()
implementation. This includes a read memory barrier that ensures that
if PageTail(head) is true that we return a head page that is neither
NULL nor dangling. The patch then adds a store memory barrier to
prep_compound_page() to ensure page->first_page is set.

This is the safest way to ensure we see the head page that we are
expecting, PageTail(page) is already in the unlikely() path and the
memory barriers are unfortunately required.

Hugetlbfs is the exception, we don't enforce a store memory barrier
during init since no race is possible.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# aa429318 26-Feb-2014 Alex Williamson <alex.williamson@redhat.com>

vfio/type1: Add extension to test DMA cache coherence of IOMMU

Now that the type1 IOMMU backend can support IOMMU_CACHE, we need to
be able to test whether coherency is currently enforced. Add an
extension for this.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 1ef3e2bc 26-Feb-2014 Alex Williamson <alex.williamson@redhat.com>

vfio/iommu_type1: Multi-IOMMU domain support

We currently have a problem that we cannot support advanced features
of an IOMMU domain (ex. IOMMU_CACHE), because we have no guarantee
that those features will be supported by all of the hardware units
involved with the domain over its lifetime. For instance, the Intel
VT-d architecture does not require that all DRHDs support snoop
control. If we create a domain based on a device behind a DRHD that
does support snoop control and enable SNP support via the IOMMU_CACHE
mapping option, we cannot then add a device behind a DRHD which does
not support snoop control or we'll get reserved bit faults from the
SNP bit in the pagetables. To add to the complexity, we can't know
the properties of a domain until a device is attached.

We could pass this problem off to userspace and require that a
separate vfio container be used, but we don't know how to handle page
accounting in that case. How do we know that a page pinned in one
container is the same page as a different container and avoid double
billing the user for the page.

The solution is therefore to support multiple IOMMU domains per
container. In the majority of cases, only one domain will be required
since hardware is typically consistent within a system. However, this
provides us the ability to validate compatibility of domains and
support mixed environments where page table flags can be different
between domains.

To do this, our DMA tracking needs to change. We currently try to
coalesce user mappings into as few tracking entries as possible. The
problem then becomes that we lose granularity of user mappings. We've
never guaranteed that a user is able to unmap at a finer granularity
than the original mapping, but we must honor the granularity of the
original mapping. This coalescing code is therefore removed, allowing
only unmaps covering complete maps. The change in accounting is
fairly small here, a typical QEMU VM will start out with roughly a
dozen entries, so it's arguable if this coalescing was ever needed.

We also move IOMMU domain creation to the point where a group is
attached to the container. An interesting side-effect of this is that
we now have access to the device at the time of domain creation and
can probe the devices within the group to determine the bus_type.
This finally makes vfio_iommu_type1 completely device/bus agnostic.
In fact, each IOMMU domain can host devices on different buses managed
by different physical IOMMUs, and present a single DMA mapping
interface to the user. When a new domain is created, mappings are
replayed to bring the IOMMU pagetables up to the state of the current
container. And of course, DMA mapping and unmapping automatically
traverse all of the configured IOMMU domains.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Varun Sethi <Varun.Sethi@freescale.com>


# d93b3ac0 11-Oct-2013 Antonios Motakis <a.motakis@virtualopensystems.com>

VFIO: vfio_iommu_type1: fix bug caused by break in nested loop

In vfio_iommu_type1.c there is a bug in vfio_dma_do_map, when checking
that pages are not already mapped. Since the check is being done in a
for loop nested within the main loop, breaking out of it does not create
the intended behavior. If the underlying IOMMU driver returns a non-NULL
value, this will be ignored and mapping the DMA range will be attempted
anyway, leading to unpredictable behavior.

This interracts badly with the ARM SMMU driver issue fixed in the patch
that was submitted with the title:
"[PATCH 2/2] ARM: SMMU: return NULL on error in arm_smmu_iova_to_phys"
Both fixes are required in order to use the vfio_iommu_type1 driver
with an ARM SMMU.

This patch refactors the function slightly, in order to also make this
kind of bug less likely.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 8d38ef19 01-Jul-2013 Alex Williamson <alex.williamson@redhat.com>

vfio/type1: Fix leak on error path

We also don't handle unpinning zero pages as an error on other exits
so we can fix that inconsistency by rolling in the next conditional
return.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# f5bfdbf2 25-Jun-2013 Alex Williamson <alex.williamson@redhat.com>

vfio/type1: Fix missed frees and zero sized removes

With hugepage support we can only properly aligned and sized ranges.
We only guarantee that we can unmap the same ranges mapped and not
arbitrary sub-ranges. This means we might not free anything or might
free more than requested. The vfio unmap interface started storing
the unmapped size to return to userspace to handle this. This patch
fixes a few places where we don't properly handle those cases, moves
a memory allocation to a place where failure is an option and checks
our loops to make sure we don't get into an infinite loop trying to
remove an overlap.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 5c6c2b21 21-Jun-2013 Alex Williamson <alex.williamson@redhat.com>

vfio: Provide module option to disable vfio_iommu_type1 hugepage support

Add a module option to vfio_iommu_type1 to disable IOMMU hugepage
support. This causes iommu_map to only be called with single page
mappings, disabling the IOMMU driver's ability to use hugepages.
This option can be enabled by loading vfio_iommu_type1 with
disable_hugepages=1 or dynamically through sysfs. If enabled
dynamically, only new mappings are restricted.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 166fd7d9 21-Jun-2013 Alex Williamson <alex.williamson@redhat.com>

vfio: hugepage support for vfio_iommu_type1

We currently send all mappings to the iommu in PAGE_SIZE chunks,
which prevents the iommu from enabling support for larger page sizes.
We still need to pin pages, which means we step through them in
PAGE_SIZE chunks, but we can batch up contiguous physical memory
chunks to allow the iommu the opportunity to use larger pages. The
approach here is a bit different that the one currently used for
legacy KVM device assignment. Rather than looking at the vma page
size and using that as the maximum size to pass to the iommu, we
instead simply look at whether the next page is physically
contiguous. This means we might ask the iommu to map a 4MB region,
while legacy KVM might limit itself to a maximum of 2MB.

Splitting our mapping path also allows us to be smarter about locked
memory because we can more easily unwind if the user attempts to
exceed the limit. Therefore, rather than assuming that a mapping
will result in locked memory, we test each page as it is pinned to
determine whether it locks RAM vs an mmap'd MMIO region. This should
result in better locking granularity and less locked page fudge
factors in userspace.

The unmap path uses the same algorithm as legacy KVM. We don't want
to track the pfn for each mapping ourselves, but we need the pfn in
order to unpin pages. We therefore ask the iommu for the iova to
physical address translation, ask it to unpin a page, and see how many
pages were actually unpinned. iommus supporting large pages will
often return something bigger than a page here, which we know will be
physically contiguous and we can unpin a batch of pfns. iommus not
supporting large mappings won't see an improvement in batching here as
they only unmap a page at a time.

With this change, we also make a clarification to the API for mapping
and unmapping DMA. We can only guarantee unmaps at the same
granularity as used for the original mapping. In other words,
unmapping a subregion of a previous mapping is not guaranteed and may
result in a larger or smaller unmapping than requested. The size
field in the unmapping structure is updated to reflect this.
Previously this was unmodified on mapping, always returning the the
requested unmap size. This is now updated to return the actual unmap
size on success, allowing userspace to appropriately track mappings.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# cd9b2268 21-Jun-2013 Alex Williamson <alex.williamson@redhat.com>

vfio: Convert type1 iommu to use rbtree

We need to keep track of all the DMA mappings of an iommu container so
that it can be automatically unmapped when the user releases the file
descriptor. We currently do this using a simple list, where we merge
entries with contiguous iovas and virtual addresses. Using a tree for
this is a bit more efficient and allows us to use common code instead
of inventing our own.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>


# 73fa0d10 31-Jul-2012 Alex Williamson <alex.williamson@redhat.com>

vfio: Type1 IOMMU implementation

This VFIO IOMMU backend is designed primarily for AMD-Vi and Intel
VT-d hardware, but is potentially usable by anything supporting
similar mapping functionality. We arbitrarily call this a Type1
backend for lack of a better name. This backend has no IOVA
or host memory mapping restrictions for the user and is optimized
for relatively static mappings. Mapped areas are pinned into system
memory.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>